<a href="https://colab.research.google.com/github/SoyMilkQwQ/229352-StatisticalLearning-or-Statistical-Learning-Labs-qxq./blob/main/Lab04_Naive_Bayes_Grid_and_Random_Search_660510751.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [3]:
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

model.fit(Xtrain, ytrain)


y_pred_nb = model.predict(Xtest)
print(classification_report(ytest, y_pred_nb, target_names=test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.83      0.24      0.37        21
           comp.graphics       0.83      0.24      0.37        21
 comp.os.ms-windows.misc       0.71      0.65      0.68        26
comp.sys.ibm.pc.hardware       0.73      0.56      0.63        34
   comp.sys.mac.hardware       0.73      0.79      0.76        34
          comp.windows.x       0.95      0.81      0.88        26
            misc.forsale       1.00      0.50      0.67        22
               rec.autos       0.67      1.00      0.80        28
         rec.motorcycles       0.96      0.73      0.83        33
      rec.sport.baseball       0.91      0.84      0.88        25
        rec.sport.hockey       0.85      0.85      0.85        27
               sci.crypt       0.58      0.95      0.72        20
         sci.electronics       0.56      0.38      0.45        24
                 sci.med       0.81      0.57      0.67        23
         



### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `Scipy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
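A point that is easy to misread in the SciPy documentation: `uniform(loc, scale)` covers the interval `[loc, loc + scale]`, not `[loc, scale]`. The short sketch below illustrates this; the variable names `dist` and `shifted` are only illustrative. `RandomizedSearchCV` draws candidate values by calling the distribution's `rvs` method, which is the same call used here.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale], not on [loc, scale].
dist = uniform(loc=0, scale=2)

# Draw five candidate values, as a randomized search would do for alpha.
samples = dist.rvs(size=5, random_state=42)
print(samples)

# A shifted interval: loc=1, scale=3 covers [1, 4].
shifted = uniform(loc=1, scale=3)
print(shifted.ppf(0.0), shifted.ppf(1.0))  # lower and upper end of the interval
```

With `loc=0` the distinction does not matter, but it does as soon as the lower bound is moved away from zero.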

In [4]:
param_dist = {
    'nb__alpha': uniform(loc=0, scale=2)  # alpha sampled uniformly from [0, 2]
}

random_search = RandomizedSearchCV(
    model,
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='f1_macro',
    random_state=42,
    n_jobs=-1
)


random_search.fit(Xtrain, ytrain)
y_pred_random = random_search.predict(Xtest)
print(classification_report(ytest, y_pred_random, target_names=test.target_names))
print("Best alpha (random search):", random_search.best_params_['nb__alpha'])

                          precision    recall  f1-score   support

             alt.atheism       0.79      0.52      0.63        21
           comp.graphics       0.67      0.57      0.62        21
 comp.os.ms-windows.misc       0.60      0.58      0.59        26
comp.sys.ibm.pc.hardware       0.71      0.71      0.71        34
   comp.sys.mac.hardware       0.88      0.88      0.88        34
          comp.windows.x       0.89      0.65      0.76        26
            misc.forsale       0.94      0.77      0.85        22
               rec.autos       0.80      1.00      0.89        28
         rec.motorcycles       0.97      0.88      0.92        33
      rec.sport.baseball       0.92      0.88      0.90        25
        rec.sport.hockey       0.87      1.00      0.93        27
               sci.crypt       0.79      0.95      0.86        20
         sci.electronics       0.65      0.71      0.68        24
                 sci.med       0.82      0.78      0.80        23
         

#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
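For reference, `f1_macro` is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its support. The following is a minimal pure-Python sketch of that definition, not scikit-learn's implementation; the helper `macro_f1` and the toy labels are hypothetical. A class that is never predicted gets an F1 of 0 here, which mirrors scikit-learn's default behavior.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: class 2 is never predicted, so its F1 is 0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1]

# Per-class F1: 0.8, 0.4 and 0.0, so the macro average is 0.4.
print(round(macro_f1(y_true, y_pred), 3))
```

Because the mean is unweighted, a single poorly predicted small class can pull the macro score down noticeably.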

In [5]:
param_grid = {
    'nb__alpha': [0.01, 0.1, 0.5, 1.0, 1.5, 2.0]
}
grid_search = GridSearchCV(
    model,
    param_grid=param_grid,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1
)
grid_search.fit(Xtrain, ytrain)

best_alpha_grid = grid_search.best_params_['nb__alpha']
print("Best alpha (grid search):", best_alpha_grid)

y_pred_grid = grid_search.predict(Xtest)
f1_grid = f1_score(ytest, y_pred_grid, average='macro')
print("f1_macro score (grid search):", f1_grid)


best_alpha_random = random_search.best_params_['nb__alpha']
print("Best alpha (random search):", best_alpha_random)

y_pred_random = random_search.predict(Xtest)
f1_random = f1_score(ytest, y_pred_random, average='macro')
print("f1_macro score (random search):", f1_random)

print('---------------------------------------------------------------------')
if f1_random > f1_grid:
    print("Random search achieved a better f1_macro score.")
elif f1_random < f1_grid:
    print("Grid search achieved a better f1_macro score.")
else:
    print("Both searches achieved the same f1_macro score.")

Best alpha (grid search): 0.01
f1_macro score (grid search): 0.7725871193182322
Best alpha (random search): 0.041168988591604894
f1_macro score (random search): 0.7436412921130966
---------------------------------------------------------------------
Grid search achieved a better f1_macro score.
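The comparison above rests on a single test set of 500 documents. The cross-validated scores stored on a fitted search object give a second view of how the candidates compare. The sketch below shows one way to read them, assuming a hypothetical toy corpus (`sports`, `tech`) in place of the newsgroup data so that it runs quickly; the attributes `best_params_`, `best_score_`, and `cv_results_` are the same ones available on `grid_search` and `random_search`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 short documents per class.
sports = [f"the team won game {i} with a late goal" for i in range(10)]
tech = [f"the new laptop model {i} has a fast processor" for i in range(10)]
docs = sports + tech
labels = [0] * 10 + [1] * 10

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
search = GridSearchCV(
    pipe,
    param_grid={"nb__alpha": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="f1_macro",
)
search.fit(docs, labels)

# Mean and standard deviation of the validation score for each candidate.
results = search.cv_results_
for alpha, mean, std in zip(
    results["param_nb__alpha"],
    results["mean_test_score"],
    results["std_test_score"],
):
    print(f"alpha={alpha}: mean f1_macro={mean:.3f} (std {std:.3f})")

print("Best alpha:", search.best_params_["nb__alpha"])
print("Best mean CV f1_macro:", search.best_score_)
```

If two candidates differ by less than the fold-to-fold standard deviation, a small gap between their test scores may not reflect a real difference.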
