### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [12]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [13]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [14]:
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

model = make_pipeline(TfidfVectorizer(), MultinomialNB())


model.fit(Xtrain, ytrain)


labels = model.predict(Xtest)

print(f"Accuracy: {accuracy_score(ytest, labels):.2f}")
print("\nClassification Report:\n")
print(classification_report(ytest, labels, target_names=train.target_names))

Accuracy: 0.64

Classification Report:

                          precision    recall  f1-score   support

             alt.atheism       0.83      0.24      0.37        21
           comp.graphics       0.83      0.24      0.37        21
 comp.os.ms-windows.misc       0.71      0.65      0.68        26
comp.sys.ibm.pc.hardware       0.73      0.56      0.63        34
   comp.sys.mac.hardware       0.73      0.79      0.76        34
          comp.windows.x       0.95      0.81      0.88        26
            misc.forsale       1.00      0.50      0.67        22
               rec.autos       0.67      1.00      0.80        28
         rec.motorcycles       0.96      0.73      0.83        33
      rec.sport.baseball       0.91      0.84      0.88        25
        rec.sport.hockey       0.85      0.85      0.85        27
               sci.crypt       0.58      0.95      0.72        20
         sci.electronics       0.56      0.38      0.45        24
                 sci.med       0.81



### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `SciPy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
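A point that is easy to miss: `scipy.stats.uniform(loc, scale)` is parameterized by a starting point and a width, so it covers `[loc, loc + scale]` rather than `[loc, scale]`. The short sketch below checks this for the two parameterizations used in this lab; the variable names are illustrative only.

```python
from scipy.stats import uniform

# uniform(loc, scale) is uniform on [loc, loc + scale].
alpha_dist = uniform(loc=0.001, scale=0.999)   # covers [0.001, 1.0]
wide_dist = uniform(loc=0.001, scale=1.0)      # covers [0.001, 1.001]

# ppf(0) and ppf(1) give the lower and upper ends of the support.
low, high = alpha_dist.ppf(0.0), alpha_dist.ppf(1.0)
wide_high = wide_dist.ppf(1.0)

# Draw a reproducible sample and confirm it stays inside the support.
samples = alpha_dist.rvs(size=1000, random_state=0)

print(f"alpha_dist support: [{low:.3f}, {high:.3f}]")
print(f"wide_dist upper end: {wide_high:.3f}")
print(f"sample min/max: {samples.min():.4f} / {samples.max():.4f}")
```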

In [15]:
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])


param_dist = {
    # uniform(loc, scale) covers [loc, loc + scale], i.e. alpha in [0.001, 1.0]
    'classifier__alpha': uniform(0.001, 0.999),
    'vectorizer__max_df': [0.5, 0.75, 1.0],
    'vectorizer__ngram_range': [(1, 1), (1, 2)]
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=3,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(Xtrain, ytrain)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
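The cell above fits the search but does not inspect the outcome. As a minimal sketch, the same kind of search can be run on a small made-up corpus (the documents and labels below are hypothetical and are not the newsgroups data) to show where the selected parameters and cross-validation scores live on the fitted object. The default scoring for a classifier is used here, which is accuracy.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two topics, six short documents each.
toy_docs = [
    "the engine of the car is loud", "new car tires and brakes",
    "car engine oil change", "fast car on the highway",
    "car dealer sold the engine", "brakes and tires for the car",
    "the hockey team won the game", "hockey players on the ice",
    "game score after overtime", "the team traded a player",
    "ice hockey season starts", "players scored in the game",
]
toy_labels = [0] * 6 + [1] * 6

toy_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB()),
])

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions={
        'classifier__alpha': uniform(0.001, 0.999),
        'vectorizer__max_df': [0.5, 0.75, 1.0],
        'vectorizer__ngram_range': [(1, 1), (1, 2)],
    },
    n_iter=5,
    cv=3,
    random_state=42,
)
toy_search.fit(toy_docs, toy_labels)

# Best sampled configuration and its mean cross-validation score.
print(toy_search.best_params_)
print(f"Best CV score: {toy_search.best_score_:.3f}")

# Every sampled candidate with its mean score across the folds.
results = toy_search.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(f"{score:.3f}  {params}")
```

The same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on `random_search` after it is fitted on the newsgroups data.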


#### Exercise

1. For the Naive Bayes model, use grid search 5-fold cross-validation across different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** 5-fold cross-validation across different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
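The `f1_macro` score used in these exercises is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many test documents it has. The sketch below computes it in plain Python on a small made-up example; the function name and the toy labels are illustrative, and in practice `f1_score(..., average='macro')` from scikit-learn would be used.

```python
def f1_macro_manual(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when undefined.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Made-up labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

# Per-class F1: class 0 -> 0.5, class 1 -> 0.8, class 2 -> 2/3.
toy_score = f1_macro_manual(toy_true, toy_pred)
print(f"Macro F1: {toy_score:.4f}")
```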

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score


pipeline = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())

param_grid = {'multinomialnb__alpha': [0.001, 0.01, 0.1, 0.5, 1.0]}


grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')
grid_search.fit(Xtrain, ytrain)


best_alpha_grid = grid_search.best_params_['multinomialnb__alpha']
y_pred_grid = grid_search.predict(Xtest)
f1_grid = f1_score(ytest, y_pred_grid, average='macro')

print(f"Grid Search Best Alpha: {best_alpha_grid}")
print(f"Grid Search Test F1 Macro: {f1_grid:.4f}")

Grid Search Best Alpha: 0.01
Grid Search Test F1 Macro: 0.7709
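Note that the grid above addresses the smoothing parameter as `multinomialnb__alpha`, whereas the earlier `Pipeline` example used `classifier__alpha`. The prefix is the step name: `make_pipeline` names each step after its lowercased class name, while `Pipeline` uses whatever names are passed in. A short sketch to confirm which names are valid (the variable names here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline

# make_pipeline derives step names from the lowercased class names.
auto_named = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Pipeline keeps the names supplied by the caller.
explicit = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

auto_names = [name for name, _ in auto_named.steps]
explicit_names = [name for name, _ in explicit.steps]
print(auto_names)
print(explicit_names)

# get_params() lists every '<step>__<parameter>' key a search can tune.
print('multinomialnb__alpha' in auto_named.get_params())
print('classifier__alpha' in explicit.get_params())
```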


In [17]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform


# uniform(loc, scale) samples from [loc, loc + scale], so alpha is drawn from [0.001, 1.001]
param_dist = {'multinomialnb__alpha': uniform(0.001, 1.0)}

random_search = RandomizedSearchCV(pipeline, param_dist, n_iter=10, cv=5,
                                   scoring='f1_macro', random_state=42)
random_search.fit(Xtrain, ytrain)


best_alpha_rand = random_search.best_params_['multinomialnb__alpha']
y_pred_rand = random_search.predict(Xtest)
f1_rand = f1_score(ytest, y_pred_rand, average='macro')

print(f"Random Search Best Alpha: {best_alpha_rand}")
print(f"Random Search Test F1 Macro: {f1_rand:.4f}")

Random Search Best Alpha: 0.05908361216819946
Random Search Test F1 Macro: 0.7373




#### Answers

Part 1: Grid Search Cross-Validation (Exercises 1-2)

* Value of `alpha` obtained: 0.01
* Model's `f1_macro` score on the test set: 0.7709

Part 2: Random Search Cross-Validation (Exercise 3)

* Value of `alpha` obtained: 0.05908361216819946
* Model's `f1_macro` score on the test set: 0.7373

No, random search did not do better. In this experiment, grid search achieved a higher test `f1_macro` score (0.7709) than random search (0.7373).
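One plausible reason for the gap is that random search only evaluates the `n_iter=10` values of `alpha` it happens to draw from the range, so it may never try a value as small as 0.01. The sketch below, which assumes scikit-learn's `ParameterSampler` (not used in the original notebook, but to our understanding the sampler behind `RandomizedSearchCV`), lists the candidates that a search with the same distribution, `n_iter`, and `random_state` would draw.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Same distribution and seed as the random search in this lab.
param_dist = {'multinomialnb__alpha': uniform(0.001, 1.0)}
sampler = ParameterSampler(param_dist, n_iter=10, random_state=42)

# Collect the sampled alpha values and sort them for easy reading.
alphas = sorted(float(p['multinomialnb__alpha']) for p in sampler)

for a in alphas:
    print(f"{a:.4f}")
print(f"Smallest sampled alpha: {alphas[0]:.4f}")
```

If none of the sampled values falls near the grid's best value, increasing `n_iter` or sampling `alpha` on a log scale are common ways to give random search a fairer chance.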