
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [4]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])
pipeline.fit(Xtrain, ytrain)
ypred = pipeline.predict(Xtest)
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.67      0.38      0.48        21
           1       0.79      0.52      0.63        21
           2       0.58      0.69      0.63        26
           3       0.74      0.68      0.71        34
           4       0.72      0.85      0.78        34
           5       0.88      0.81      0.84        26
           6       1.00      0.73      0.84        22
           7       0.70      1.00      0.82        28
           8       0.90      0.82      0.86        33
           9       0.88      0.84      0.86        25
          10       0.82      1.00      0.90        27
          11       0.79      0.95      0.86        20
          12       0.59      0.54      0.57        24
          13       0.75      0.78      0.77        23
          14       0.87      0.71      0.78        28
          15       0.53      0.90      0.67        29
          16       0.50      0.95      0.66        21
          17       0.94    



### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `Scipy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
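`scipy.stats.uniform(loc, scale)` is the uniform distribution on the interval `[loc, loc + scale]`, so `uniform(loc=0, scale=10)` draws `alpha` between 0 and 10. The following is a minimal sketch of what such a distribution produces; the sample size and the fixed seed are arbitrary choices made for illustration.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale], here [0, 10].
alpha_dist = uniform(loc=0, scale=10)

# Draw 1000 candidate values; random_state is fixed only so the
# sketch is reproducible (an assumption, not part of the lab code).
samples = alpha_dist.rvs(size=1000, random_state=0)

print("smallest draw:", samples.min())
print("largest draw: ", samples.max())
print("theoretical mean:", alpha_dist.mean())
```

Every draw is equally likely anywhere in the interval, which matters when the useful values of a hyperparameter sit in a narrow part of the range.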

In [5]:
parameter = {'nb__alpha': uniform(loc=0, scale=10)}
clf = RandomizedSearchCV(pipeline, parameter, n_iter=10)  # try 10 randomly sampled alpha values
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.73      0.38      0.50        21
           1       0.91      0.48      0.62        21
           2       0.56      0.73      0.63        26
           3       0.72      0.62      0.67        34
           4       0.66      0.85      0.74        34
           5       0.88      0.85      0.86        26
           6       1.00      0.64      0.78        22
           7       0.61      1.00      0.76        28
           8       0.90      0.82      0.86        33
           9       0.88      0.84      0.86        25
          10       0.82      1.00      0.90        27
          11       0.86      0.95      0.90        20
          12       0.57      0.54      0.55        24
          13       0.74      0.74      0.74        23
          14       0.90      0.68      0.78        28
          15       0.51      0.90      0.65        29
          16       0.49      0.90      0.63        21
          17       0.93    



#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?

In [10]:
param_grid = {'nb__alpha': [0.01, 0.1, 1, 10, 100]}

# Instantiate GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')

# Fit GridSearchCV to the training data
grid_search.fit(Xtrain, ytrain)

print("Grid Search completed.")

Grid Search completed.


In [7]:
from sklearn.metrics import f1_score

# Report the alpha selected by the 5-fold grid search
print(f"Best alpha value: {grid_search.best_params_['nb__alpha']}")

# best_estimator_ is the pipeline refit on the full training set (refit=True by default)
best_grid_model = grid_search.best_estimator_
ypred_grid = best_grid_model.predict(Xtest)

# Macro-averaged F1 on the held-out test set
f1_macro_grid = f1_score(ytest, ypred_grid, average='macro')

print(f"Model's f1_macro score on test set: {f1_macro_grid}")

Best alpha value: 0.01
Model's f1_macro score on test set: 0.7708784080844321
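The selected `alpha` of 0.01 is the smallest value in the grid, so it can help to look at the mean cross-validated score for every grid point before deciding whether to extend the grid. The sketch below shows one way to read those scores from `cv_results_`. It uses a hypothetical two-class toy corpus (not the 20 Newsgroups data) so that it runs without a download; the names `docs`, `labels`, `toy_pipeline`, and `toy_grid` are illustrative only, and the scores on such trivial data are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: ten documents per class, standing in for
# the newsgroup posts so the sketch needs no network access.
docs = (["rocket orbit launch mission"] * 10
        + ["hockey goal season team"] * 10)
labels = [0] * 10 + [1] * 10

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Same grid and scoring as the lab cell above.
toy_grid = GridSearchCV(
    toy_pipeline,
    {"nb__alpha": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="f1_macro",
)
toy_grid.fit(docs, labels)

# cv_results_ holds one entry per grid point, in grid order.
results = toy_grid.cv_results_
for alpha, mean, std in zip(results["param_nb__alpha"],
                            results["mean_test_score"],
                            results["std_test_score"]):
    print(f"alpha={alpha}: mean f1_macro={mean:.3f} (std {std:.3f})")
```

Applied to `grid_search` from the lab, the same loop would show whether the score is still rising as `alpha` shrinks toward 0.01.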


In [8]:
parameter_dist = {'nb__alpha': uniform(loc=0, scale=10)}

# Instantiate RandomizedSearchCV.
# No random_state is set, so the sampled alphas (and the result below) vary from run to run.
random_search = RandomizedSearchCV(pipeline, parameter_dist, n_iter=10, cv=5, scoring='f1_macro')

# Fit RandomizedSearchCV to the training data
random_search.fit(Xtrain, ytrain)

print("Random Search completed.")

Random Search completed.


In [9]:
print(f"Best alpha value (Random Search): {random_search.best_params_['nb__alpha']}")

best_random_model = random_search.best_estimator_
ypred_random = best_random_model.predict(Xtest)

f1_macro_random = f1_score(ytest, ypred_random, average='macro')

print(f"Model's f1_macro score on test set (Random Search): {f1_macro_random}")
print(f"f1_macro score from Grid Search: {f1_macro_grid}")

if f1_macro_random > f1_macro_grid:
    print("Random search yielded a better f1_macro score.")
elif f1_macro_random < f1_macro_grid:
    print("Grid search yielded a better f1_macro score.")
else:
    print("Both grid search and random search yielded the same f1_macro score.")

Best alpha value (Random Search): 1.5544266668627293
Model's f1_macro score on test set (Random Search): 0.6801380191485173
f1_macro score from Grid Search: 0.7708784080844321
Grid search yielded a better f1_macro score.
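In this run, grid search picked `alpha = 0.01` while random search settled near 1.55. One plausible reason is that `uniform(loc=0, scale=10)` places only about 1% of its probability below 0.1, so ten draws rarely land near the small values the grid covers. The sketch below compares it with `scipy.stats.loguniform`, which is flat on a log scale. The bounds `1e-3` and `1e1` and the seed are assumptions chosen for illustration, not values from the lab. `ParameterSampler` is used here only to list candidate values without fitting any model.

```python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# Distribution used in the lab: flat on [0, 10].
uniform_dist = {"nb__alpha": uniform(loc=0, scale=10)}

# Assumed alternative: flat in log(alpha) on [1e-3, 1e1].
log_dist = {"nb__alpha": loguniform(1e-3, 1e1)}

# Draw 10 candidates from each distribution (seed fixed for reproducibility).
uniform_draws = [p["nb__alpha"]
                 for p in ParameterSampler(uniform_dist, n_iter=10, random_state=0)]
log_draws = [p["nb__alpha"]
             for p in ParameterSampler(log_dist, n_iter=10, random_state=0)]

print("uniform draws:   ", sorted(round(a, 3) for a in uniform_draws))
print("loguniform draws:", sorted(round(a, 3) for a in log_draws))

# Probability that a single draw falls below 0.1 under each distribution.
print("P(alpha < 0.1), uniform:   ", uniform(loc=0, scale=10).cdf(0.1))
print("P(alpha < 0.1), loguniform:", loguniform(1e-3, 1e1).cdf(0.1))
```

Passing `log_dist` as the parameter distribution to `RandomizedSearchCV` would let the random search explore small `alpha` values about as often as large ones, though whether that improves the test score here would need to be checked by running it.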
