
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
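
The `alpha` tuned in this lab is the additive smoothing parameter of `MultinomialNB`: each per-class feature probability is estimated as (count + alpha) / (total + alpha * n_features). The sketch below works through that formula on made-up counts, outside scikit-learn. The helper name `smoothed_probs` and the counts are illustrative assumptions, not part of the lab.

```python
def smoothed_probs(counts, alpha):
    """Additively smoothed probability estimates for one class.

    counts: occurrences of each feature within the class (toy values here).
    alpha:  smoothing strength; larger values pull estimates toward uniform.
    """
    total = sum(counts)
    n_features = len(counts)
    return [(c + alpha) / (total + alpha * n_features) for c in counts]


# Toy word counts for a single class; the second word never occurs.
counts = [3, 0, 1]

for alpha in (0.001, 1.0, 5.0):
    probs = smoothed_probs(counts, alpha)
    print(alpha, [round(p, 4) for p in probs])
```

With a very small `alpha` the unseen word keeps a probability close to zero, while a large `alpha` flattens all three estimates toward 1/3.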

### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `SciPy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
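
One detail that is easy to miss when using `scipy.stats.uniform` for random search: its two arguments are `loc` and `scale`, so the distribution covers `[loc, loc + scale]` rather than `[low, high]`. A minimal sketch to check the range follows; the sample size and seed are arbitrary choices made for illustration.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale].
dist = uniform(loc=0.001, scale=5)

# support() returns the lower and upper end of the distribution.
low, high = dist.support()
print("support:", low, high)

# Draw a reproducible sample and confirm that every value lies in the support.
samples = dist.rvs(size=1000, random_state=0)
print("all in range:", bool(((samples >= low) & (samples <= high)).all()))
```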

#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
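
For reference, `f1_macro` is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The sketch below computes it by hand on toy labels. The function name `macro_f1` and the labels are illustrative assumptions; on the same inputs the result should agree with `sklearn.metrics.f1_score(y_true, y_pred, average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples gets an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for three classes (not taken from the 20 newsgroups data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

score = macro_f1(y_true, y_pred)
print("macro F1:", round(score, 4))
```

In `classification_report`, the same quantity appears in the `macro avg` row under the `f1-score` column.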

In [3]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

param_grid = {
    'nb__alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0]
}

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1
)

grid.fit(Xtrain, ytrain)

print("\n===== GRID SEARCH =====")
print("Best alpha:", grid.best_params_['nb__alpha'])
print("Best CV f1_macro:", grid.best_score_)

# Evaluate on test set
best_grid_model = grid.best_estimator_
y_pred_grid = best_grid_model.predict(Xtest)

print("\nGrid Search Test Performance:")
print(classification_report(ytest, y_pred_grid))

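# scipy's uniform(loc, scale) samples from [loc, loc + scale],
# so alpha is drawn from [0.001, 5.001] here.
param_dist = {
    'nb__alpha': uniform(0.001, 5)
}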

# Named random_search to avoid shadowing Python's built-in `random` module
random_search = RandomizedSearchCV(
    pipeline,
    param_dist,
    n_iter=20,
    cv=5,
    scoring='f1_macro',
    random_state=42,
    n_jobs=-1
)

random_search.fit(Xtrain, ytrain)

print("\n===== RANDOM SEARCH =====")
print("Best alpha:", random_search.best_params_['nb__alpha'])
print("Best CV f1_macro:", random_search.best_score_)

# Evaluate on test set
best_random_model = random_search.best_estimator_
y_pred_random = best_random_model.predict(Xtest)

print("\nRandom Search Test Performance:")
print(classification_report(ytest, y_pred_random))


===== GRID SEARCH =====
Best alpha: 0.001
Best CV f1_macro: 0.8316349455351428

Grid Search Test Performance:
              precision    recall  f1-score   support

           0       0.81      0.62      0.70        21
           1       0.52      0.52      0.52        21
           2       0.65      0.58      0.61        26
           3       0.64      0.68      0.66        34
           4       0.83      0.71      0.76        34
           5       0.83      0.58      0.68        26
           6       0.82      0.82      0.82        22
           7       0.82      0.96      0.89        28
           8       0.93      0.85      0.89        33
           9       0.88      0.88      0.88        25
          10       0.89      0.93      0.91        27
          11       0.83      0.95      0.88        20
          12       0.64      0.58      0.61        24
          13       0.87      0.87      0.87        23
          14       0.68      0.93      0.79        28
          15       0.77 

