
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [3]:
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# Define the parameter grid for alpha
param_grid = {'nb__alpha': [0.01, 0.05, 0.1, 0.5, 1.0]}

# Perform Grid Search Cross-Validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search.fit(Xtrain, ytrain)

print("Grid Search - Best alpha:", grid_search.best_params_)

# Predict on the test set with the best model from Grid Search
y_pred_grid = grid_search.predict(Xtest)

# Evaluate the model with f1_macro
f1_macro_grid = classification_report(ytest, y_pred_grid, output_dict=True)['macro avg']['f1-score']
print("Grid Search - f1_macro score on test set:", f1_macro_grid)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Grid Search - Best alpha: {'nb__alpha': 0.01}
Grid Search - f1_macro score on test set: 0.7725871193182322


### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in SciPy [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
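As a quick illustration of the parametrization: SciPy's `uniform(loc, scale)` is supported on `[loc, loc + scale]`, so `uniform(loc=0, scale=1)` draws `alpha` between 0 and 1. The sketch below samples a few values and checks how likely a very small draw is. The seed, the sample size, and the shifted example are arbitrary choices for illustration, not part of the lab.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale]
alpha_dist = uniform(loc=0, scale=1)

# Draw 10 candidate values (seed chosen arbitrarily for reproducibility)
samples = alpha_dist.rvs(size=10, random_state=0)
print("min:", samples.min(), "max:", samples.max())

# Probability that a single draw is at most 0.01
p_small = alpha_dist.cdf(0.01)
print("P(alpha <= 0.01):", p_small)

# A shifted and stretched example: support is [0.5, 2.5]
shifted = uniform(loc=0.5, scale=2)
print("lower:", shifted.ppf(0.0), "upper:", shifted.ppf(1.0))
```

Since a single draw falls below 0.01 with probability 0.01, a short random search over this distribution will rarely try values that close to zero.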

In [4]:
pipeline_rs = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# Define the parameter distribution for alpha using uniform distribution
param_distributions = {'nb__alpha': uniform(loc=0, scale=1)}

# Perform Randomized Search Cross-Validation
random_search = RandomizedSearchCV(pipeline_rs, param_distributions, n_iter=10, cv=5, scoring='f1_macro', verbose=1, n_jobs=-1, random_state=42)
random_search.fit(Xtrain, ytrain)

print("Random Search - Best alpha:", random_search.best_params_)

# Predict on the test set with the best model from Random Search
y_pred_random = random_search.predict(Xtest)

# Evaluate the model with f1_macro
f1_macro_random = classification_report(ytest, y_pred_random, output_dict=True)['macro avg']['f1-score']
print("Random Search - f1_macro score on test set:", f1_macro_random)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random Search - Best alpha: {'nb__alpha': np.float64(0.05808361216819946)}
Random Search - f1_macro score on test set: 0.7309479579937921


#### Exercise

1. For the Naive Bayes model, use grid search 5-fold cross-validation across different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** 5-fold cross-validation across different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
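For reference, the `f1_macro` score asked for in the exercises is the unweighted mean of the per-class F1 scores. The sketch below, using made-up labels for a 3-class problem, shows three ways to obtain it with scikit-learn that should agree: reading it from `classification_report`, calling `f1_score` with `average='macro'`, and averaging the per-class scores by hand.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Made-up labels for a 3-class problem (illustration only)
y_true = [0, 1, 2, 0, 1, 2, 0, 2]
y_pred = [0, 2, 2, 0, 1, 1, 1, 2]

# Route 1: read the macro average from the classification report
report = classification_report(y_true, y_pred, output_dict=True)
f1_from_report = report['macro avg']['f1-score']

# Route 2: call f1_score directly
f1_direct = f1_score(y_true, y_pred, average='macro')

# Route 3: unweighted mean of the per-class F1 scores
f1_manual = np.mean(f1_score(y_true, y_pred, average=None))

print(f1_from_report, f1_direct, f1_manual)
```

Calling `f1_score` directly is shorter when only the single number is needed; the report is convenient when the per-class breakdown is also of interest.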

In [5]:
print("\nComparison of Results:\n")
print(f"Grid Search Best Alpha: {grid_search.best_params_['nb__alpha']:.4f}")
print(f"Grid Search f1_macro: {f1_macro_grid:.4f}")
print(f"Random Search Best Alpha: {random_search.best_params_['nb__alpha']:.4f}")
print(f"Random Search f1_macro: {f1_macro_random:.4f}")

if f1_macro_random > f1_macro_grid:
    print("Random Search yielded a better f1_macro score.")
elif f1_macro_random < f1_macro_grid:
    print("Grid Search yielded a better f1_macro score.")
else:
    print("Both Grid Search and Random Search yielded the same f1_macro score.")


Comparison of Results:

Grid Search Best Alpha: 0.0100
Grid Search f1_macro: 0.7726
Random Search Best Alpha: 0.0581
Random Search f1_macro: 0.7309
Grid Search yielded a better f1_macro score.


2. Grid search
* What value of `alpha` did you obtain? The best `alpha` found by grid search was 0.01.
* What is the model's `f1_macro` score? The best model found by grid search scored an `f1_macro` of 0.7726 on the test set.


3. Random search
* What value of `alpha` did you obtain? The best `alpha` found by random search was approximately 0.0581.
* Did you get a better `f1_macro` score compared to grid search in Exercise 2? No. Random search scored 0.7309 on the test set, lower than the 0.7726 from grid search. The two searches did not cover the same values, though: the grid included `alpha = 0.01` explicitly, while 10 draws from `uniform(loc=0, scale=1)` are unlikely to land that close to zero. The gap therefore reflects the search space as much as the search method.
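Because the best values of `alpha` found above are small, one possible follow-up is to sample `alpha` on a logarithmic scale, so that small values are tried as often as large ones. The sketch below uses `scipy.stats.loguniform` for this. It is only an illustration under assumptions: the toy corpus is made up so that the example runs without downloading the newsgroups data, and the range `[1e-3, 1]`, `cv=2`, and the seed are arbitrary choices rather than settings from the lab.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the newsgroup posts (made up for illustration)
docs = ["ball game team score win"] * 10 + ["cpu disk memory code bug"] * 10
labels = [0] * 10 + [1] * 10

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# Log-uniform over [1e-3, 1]: each decade is equally likely to be sampled
# (the bounds are an assumption, not values from the lab)
param_distributions = {'nb__alpha': loguniform(1e-3, 1)}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=2,
                            scoring='f1_macro', random_state=42)
search.fit(docs, labels)

# Collect the alpha values that were actually tried
sampled = [params['nb__alpha'] for params in search.cv_results_['params']]
print("Smallest sampled alpha:", min(sampled))
print("Best alpha:", search.best_params_['nb__alpha'])
```

To try this on the lab data, the same `param_distributions` could be passed to the `RandomizedSearchCV` call in the random search cell, keeping `cv=5` and `Xtrain`, `ytrain` as they are. Whether it improves the test score would need to be checked by running it.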