### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [5]:
tfidf = TfidfVectorizer(stop_words='english')
Xtrain_tfidf = tfidf.fit_transform(Xtrain)
Xtest_tfidf = tfidf.transform(Xtest)


clf = MultinomialNB()
clf.fit(Xtrain_tfidf, ytrain)


# Xtest_tfidf[2:3] is a slice that contains only test document 2
print("Prediction for test document 2:", clf.predict(Xtest_tfidf[2:3]))
print("Actual Label:", ytest[2:3])

Prediction for test document 2: [0]
Actual Label: [0]
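
The `alpha` parameter tuned in the rest of this lab is the additive (Laplace/Lidstone) smoothing constant of `MultinomialNB`. The sketch below, on a small made-up count matrix (the data and variable names are assumptions for illustration, not part of the lab), checks that the fitted feature probabilities match the smoothed estimate (N_yi + alpha) / (N_y + alpha * n_features).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary terms, 2 classes.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 1],
                  [0, 3, 1],
                  [0, 1, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 0.5

nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class term counts N_yi, then the smoothed estimate computed by hand.
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in nb.classes_])
manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X_toy.shape[1])

# feature_log_prob_ stores the log of the same smoothed probabilities.
matches = np.allclose(np.exp(nb.feature_log_prob_), manual)
print("Manual estimate matches sklearn:", matches)
```

A larger `alpha` pulls every term probability toward the uniform value 1 / n_features, which is why it acts as a regularization strength worth tuning.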


### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `Scipy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)

In [6]:
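# Chain TF-IDF and the classifier so that each CV fold refits the vectorizer
# on its own training split only
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])

# Sample alpha uniformly from [0, 4]
distributions = dict(
    nb__alpha=uniform(loc=0, scale=4)
)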

clf_rs = RandomizedSearchCV(
    pipeline,
    distributions,
    n_iter=10,
    random_state=0,
    cv=3
)

search = clf_rs.fit(Xtrain, ytrain)

print("Best Params:", search.best_params_)
print("Best Score:", search.best_score_)

Best Params: {'nb__alpha': np.float64(1.5337660753031108)}
Best Score: 0.7683333333333334
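
To see which `alpha` values a random search actually tried, and not only the winner, one option is to read `cv_results_`. The sketch below runs on a tiny made-up corpus (the documents, labels, and the names `toy_pipeline` and `toy_search` are assumptions for illustration, not the 20 newsgroups data), so the scores themselves carry no meaning.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 2 classes with 4 documents each.
docs = [
    "the team won the hockey game", "a great baseball season opener",
    "the pitcher threw a fast ball", "hockey playoffs start next week",
    "the new graphics card is fast", "install the driver for the monitor",
    "my computer crashed on boot", "which disk drive should I buy",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

toy_search = RandomizedSearchCV(
    toy_pipeline,
    {"nb__alpha": uniform(loc=0, scale=4)},  # alpha drawn from [0, 4]
    n_iter=5,
    cv=2,
    random_state=0,
)
toy_search.fit(docs, labels)

# One entry per sampled candidate: the drawn alpha and its mean CV score.
tried = list(zip(toy_search.cv_results_["params"],
                 toy_search.cv_results_["mean_test_score"]))
for params, score in tried:
    print(f"alpha={params['nb__alpha']:.3f}  mean CV score={score:.3f}")
```

The number of rows equals `n_iter`, which makes the contrast with grid search explicit: a random search evaluates a fixed budget of sampled candidates instead of every value in a predefined list.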


In [8]:
alpha_distribution = uniform(loc=0, scale=4)
alpha_distribution.rvs(5, random_state=0)

array([2.19525402, 2.86075747, 2.4110535 , 2.17953273, 1.6946192 ])
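
A common point of confusion is that `uniform(loc, scale)` covers the interval [loc, loc + scale], not [loc, scale]. The short check below illustrates this; the second distribution and its bounds are chosen only as an example.

```python
from scipy.stats import uniform

# uniform(loc=0, scale=4) is supported on [0, 0 + 4] = [0, 4].
alpha_dist = uniform(loc=0, scale=4)
low, high = alpha_dist.support()

# To sample from [0.5, 2.0], set loc=0.5 and scale=2.0 - 0.5 = 1.5.
shifted = uniform(loc=0.5, scale=1.5)
s_low, s_high = shifted.support()

samples = shifted.rvs(1000, random_state=0)
print("alpha_dist support:", low, high)
print("shifted support:", s_low, s_high)
print("all samples in range:", samples.min() >= 0.5 and samples.max() <= 2.0)
```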

#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over several values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
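
For the `f1_macro` questions, the score can be read from the "macro avg" row of `classification_report`, or computed directly with `f1_score(..., average="macro")`. A minimal sketch on made-up labels (the `y_true` and `y_pred` lists are assumptions for illustration) shows that both routes give the same number:

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels for 3 classes with 2 examples each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Per-class F1: class 0 -> 1.0, class 1 -> 2/3, class 2 -> 0.8.
# The macro average is their unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# The same value appears in the "macro avg" row of the report.
report = classification_report(y_true, y_pred, output_dict=True)
report_macro_f1 = report["macro avg"]["f1-score"]

print(f"f1_score macro:      {macro_f1:.4f}")
print(f"report macro avg f1: {report_macro_f1:.4f}")
```

Because the macro average weights every class equally, a class with few test examples influences the score as much as a large one.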

In [9]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])

grid_params = {'nb__alpha': [0.1, 0.5, 1.0, 2.0, 5.0]}

grid_search = GridSearchCV(pipeline, grid_params, cv=5, scoring='f1_macro')
grid_search.fit(Xtrain, ytrain)

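print("--- Grid Search Result ---")
print(f"Best Alpha: {grid_search.best_params_['nb__alpha']}")

# The "macro avg" row of the report gives the f1_macro score on the test set
ypred_grid = grid_search.predict(Xtest)
print(classification_report(ytest, ypred_grid))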

--- Grid Search Result ---
Best Alpha: 0.1
              precision    recall  f1-score   support

           0       0.73      0.52      0.61        21
           1       0.68      0.62      0.65        21
           2       0.58      0.58      0.58        26
           3       0.70      0.68      0.69        34
           4       0.88      0.82      0.85        34
           5       0.89      0.62      0.73        26
           6       0.89      0.77      0.83        22
           7       0.79      0.96      0.87        28
           8       0.97      0.85      0.90        33
           9       0.92      0.88      0.90        25
          10       0.87      1.00      0.93        27
          11       0.77      1.00      0.87        20
          12       0.61      0.71      0.65        24
          13       0.83      0.83      0.83        23
          14       0.83      0.89      0.86        28
          15       0.60      0.93      0.73        29
          16       0.50      0.95     

In [11]:
random_params = {'nb__alpha': uniform(loc=0, scale=4)}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=random_params,
    n_iter=10,
    cv=5,
    scoring='f1_macro',
    random_state=0
)
random_search.fit(Xtrain, ytrain)

print(f"Best Alpha: {random_search.best_params_['nb__alpha']}")

# The "macro avg" row of the report gives the f1_macro score on the test set
ypred_rand = random_search.predict(Xtest)
print(classification_report(ytest, ypred_rand))

Best Alpha: 1.5337660753031108
              precision    recall  f1-score   support

           0       0.73      0.38      0.50        21
           1       0.91      0.48      0.62        21
           2       0.56      0.73      0.63        26
           3       0.72      0.62      0.67        34
           4       0.66      0.85      0.74        34
           5       0.88      0.85      0.86        26
           6       1.00      0.64      0.78        22
           7       0.62      1.00      0.77        28
           8       0.90      0.82      0.86        33
           9       0.88      0.84      0.86        25
          10       0.79      1.00      0.89        27
          11       0.86      0.95      0.90        20
          12       0.57      0.54      0.55        24
          13       0.74      0.74      0.74        23
          14       0.90      0.68      0.78        28
          15       0.51      0.90      0.65        29
          16       0.49      0.90      0.63       

