# CPSC 330 Lecture 5

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.size']=16

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# load data
imdb_df = pd.read_csv('./data/imdb_master.csv', index_col=0, encoding='ISO-8859-1')
imdb_df= imdb_df[imdb_df["label"].str.startswith(('pos','neg'))]
imdb_df = imdb_df.sample(frac=0.2, random_state=999)
imdb_df.head(20)

| | type | review | label | file |
|---|---|---|---|---|
| 12438 | test | As Jennifer Denuccio used to say on Square Peg... | neg | 9946_2.txt |
| 5705 | test | With Knightly and O'Tool as the leads, this fi... | neg | 3886_3.txt |
| 11675 | test | Take a bad script, some lousy acting and throw... | neg | 9259_1.txt |
| 9824 | test | Strange things happen to Americans Will (Greg ... | neg | 7593_3.txt |
| 22581 | test | Sometimes, you're up late at night flipping th... | pos | 7824_7.txt |
| 10164 | test | I like to like movies, but I found nothing to ... | neg | 789_1.txt |
| 38437 | train | This reminded me of Spinal Tap, on a more seri... | pos | 10844_9.txt |
| 37278 | train | I remember this show being on the television w... | neg | 9801_1.txt |
| 49829 | train | I looked at this movie with my child eyes, and... | pos | 9848_7.txt |
| 24753 | test | I just had to add my comment to raise the aver... | pos | 977_9.txt |


In [3]:
X= imdb_df["review"]
y= imdb_df["label"]
X_train_raw, X_test_raw, y_train, y_test= train_test_split(X, y, random_state=123)

In [4]:
cv = CountVectorizer(min_df=50, binary=True)
lr= LogisticRegression(max_iter=1000)

In [5]:
X_train= cv.fit_transform(X_train_raw)
X_test= cv.transform(X_test_raw)

In [6]:
lr.fit(X_train, y_train)
lr.score(X_train, y_train)

0.9834666666666667

In [7]:
lr.score(X_test, y_test)

0.8256

In [8]:
cross_val_score( lr, X_train, y_train)

array([0.82866667, 0.836     , 0.838     , 0.83266667, 0.834     ])

- The code runs.
- But we have a problem... our good friend the Golden Rule.
- It is actually the exact same problem as before: we fit/transformed the `CountVectorizer` on all of `X_train_raw` before cross-validation split it into folds.
- Remember, cross-validation involves splitting!!!
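
To make the issue concrete, here is a minimal sketch of what a Golden-Rule-respecting cross-validation loop has to do: re-fit the vectorizer inside every fold, on that fold's training part only. The toy corpus and the variable names are made up for illustration (this is not the IMDB data), and `StratifiedKFold` with 5 splits is used because that is what `cross_val_score` uses by default for classifiers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy corpus, invented for illustration only.
texts = np.array([
    "great movie", "loved it", "great acting", "wonderful film", "loved the film",
    "terrible movie", "hated it", "awful acting", "boring film", "hated the film",
])
labels = np.array(["pos"] * 5 + ["neg"] * 5)

fold_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(texts, labels):
    # Fit the vectorizer on the training part of this fold only.
    vec = CountVectorizer(binary=True)
    X_tr = vec.fit_transform(texts[train_idx])
    # The validation part is only transformed, never used for fitting.
    X_va = vec.transform(texts[valid_idx])

    clf = LogisticRegression(max_iter=1000).fit(X_tr, labels[train_idx])
    fold_scores.append(clf.score(X_va, labels[valid_idx]))

print(fold_scores)
```

Writing this loop by hand every time is tedious and error-prone, which is the motivation for the next section.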

## Pipelines

In [9]:
cv= CountVectorizer(min_df=50)
lr= LogisticRegression(max_iter=1000)

In [10]:
pipe = Pipeline([
    ('countvec', cv),
    ('logreg', lr)
])

In [11]:
pipe.fit(X_train_raw,y_train)

In [12]:
pipe.predict(X_test_raw)

array(['pos', 'pos', 'pos', ..., 'pos', 'pos', 'pos'], dtype=object)

The pipeline is doing the following steps:

1. Fitting `CountVectorizer`.
2. Transforming the data using the fit `CountVectorizer`.
3. Fitting the `LogisticRegression` on the transformed data.

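When we call `predict` (or `score`), we also feed in the raw data; the pipeline transforms it with the already-fitted `CountVectorizer` (no re-fitting) and then passes the result to `LogisticRegression`.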

Here is a schematic assuming you have two transformers:

<img src="./img/pipeline.png" width="400">

[Source](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#18)
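
As a rough sketch of what happens under the hood, the snippet below does the `fit` and `predict` steps by hand and compares the result with a `Pipeline`. The tiny train/test texts are invented for illustration; the step names mirror the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
train_texts = ["great movie", "loved it", "wonderful film",
               "terrible movie", "hated it", "boring film"]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["great film", "boring movie"]

# Manual version of the three steps.
vec = CountVectorizer()
X_tr = vec.fit_transform(train_texts)                  # steps 1 and 2
clf = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)  # step 3
manual_pred = clf.predict(vec.transform(test_texts))   # transform only, no re-fit

# Pipeline version: the same steps, driven by a single fit/predict call.
pipe = Pipeline([
    ("countvec", CountVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

print(manual_pred, pipe_pred)
```

Both versions should give the same predictions; the pipeline just removes the opportunity to call `fit_transform` on the wrong data.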

- One thing that is awesome here is that we can't make the mistakes we showed last time:
  - We call `fit` on the train split and `score` on the test split; it's clean.
  - We can't accidentally re-fit the preprocessor on the test data like we did last time.
  - It automatically makes sure the same transformations are applied to train and test.

And now, the moment of truth:

In [13]:
cross_val_score(pipe, X_train_raw, y_train)

array([0.82533333, 0.83466667, 0.824     , 0.83466667, 0.83266667])

In [14]:
# cross validation without using pipeline
cross_val_score( lr, X_train, y_train)

array([0.82866667, 0.836     , 0.838     , 0.83266667, 0.834     ])

- BTW, the scores here aren't that different.
- I don't suspect it matters all that much here.
- But there could be cases where the effect is large.
- In this course I want you to build good habits that will serve you well going forward.

## Hyperparameter optimization: grid search and random search (30 min)

#### Manual hyperparameter optimization

- We tried this a bit.
- Advantage: we may have some intuition about what might work.
  - E.g. if I'm massively overfitting, try decreasing `max_depth` or `C`.
- Disadvantage: it takes a lot of work.
- Disadvantage: in very complicated cases, our intuition might be worse than a data-driven approach.

#### Automated hyperparameter optimization

- Advantage: reduce human effort
- Advantage: less prone to error and improves reproducibility
- Advantage: data-driven approaches may be effective
- Disadvantage: may be hard to incorporate intuition
- Disadvantage: risk of overfitting on the validation set, so be careful.



There are two automated hyperparameter search methods in scikit-learn:

  - Exhaustive grid search: [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
  - Randomized hyperparameter optimization: [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  
The "CV" stands for cross-validation; these searchers have cross-validation built right in.

#### Exhaustive grid search

- A user specifies a set of values for each hyperparameter. 
- The method considers the Cartesian "product" of the sets and then evaluates each combination one by one.

Let's start the automated hyperparameter optimization.

In [15]:
countvec= CountVectorizer(binary=True)
lr= LogisticRegression(max_iter=1000)

In [16]:
pipe= Pipeline([
    ('countvec',countvec),
    ('lr',lr)
])

In [17]:
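params={
    # Note: min_df=0 behaves like the default min_df=1. Recent scikit-learn
    # releases may reject an integer min_df of 0; use 1 in that case.
    "countvec__min_df": [0,10,100],
    "lr__C":[0.001, 1, 10, 100]
}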

In [18]:
grid_search = GridSearchCV(pipe, params, verbose=4, n_jobs=-1)

In [19]:
grid_search.fit(X_train_raw, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


- Note the `n_jobs=-1` above.
- Hyperparameter optimization can be done _in parallel_ for each of the configurations.
- This is very useful when scaling up to large numbers of machines in the cloud.
- But even on my laptop there are 8 cores it can use, so that makes it a lot faster.

In [20]:
grid_search.best_params_

{'countvec__min_df': 0, 'lr__C': 1}

Heh, here we get back the defaults again (`C=1` is the default, and `min_df=0` is equivalent to the default `min_df=1`). This happens surprisingly often - the defaults are well chosen!

- Note the number of candidates comes from the **product** of the number of options for each hyperparameter.
- And then the whole thing multiplied by the number of folds (default is 5).
- So, this number can get big really fast.
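
The counting above can be checked with a small sketch using `sklearn.model_selection.ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates. The `params` dictionary is copied from the cell above; `n_folds = 5` is the default number of folds.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "countvec__min_df": [0, 10, 100],
    "lr__C": [0.001, 1, 10, 100],
}
n_folds = 5  # GridSearchCV's default

# ParameterGrid only enumerates the combinations; it does not fit anything.
candidates = list(ParameterGrid(params))
n_candidates = len(candidates)        # 3 * 4 = 12
total_fits = n_candidates * n_folds   # 12 * 5 = 60

print(n_candidates, total_fits)  # 12 60
```

This matches the "Fitting 5 folds for each of 12 candidates, totalling 60 fits" message. With the default `refit=True`, `GridSearchCV` additionally re-fits the best configuration once on the whole training set.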

But note that we're searching more possibilities than if we just sweep one hyperparameter at a time:

![](img/gridsearch.png)

In that case we'd only get the ones in red, but here we get the entire grid.

(Img source: see credit below)

#### Problems with exhaustive grid search 

- The required number of models to evaluate grows exponentially with the dimensionality of the configuration space.
- Exhaustive search may become infeasible fairly quickly. 
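
A quick back-of-the-envelope sketch of that growth, assuming (hypothetically) 5 candidate values per hyperparameter and 5-fold cross-validation:

```python
values_per_hyperparameter = 5  # hypothetical choice for illustration
n_folds = 5

# Map: number of hyperparameters -> total number of model fits.
fits_needed = {}
for n_hyperparameters in range(1, 7):
    n_candidates = values_per_hyperparameter ** n_hyperparameters
    fits_needed[n_hyperparameters] = n_candidates * n_folds
    print(n_hyperparameters, n_candidates, fits_needed[n_hyperparameters])
```

With 6 hyperparameters this is already 15,625 candidates and 78,125 fits; at roughly a second per fit (as in the timings later in this lecture) that would be close to a day of compute on one core.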

#### Randomized hyperparameter search

- Randomized hyperparameter optimization: [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
- Samples configurations at random until a certain budget (e.g., time or a number of iterations) is exhausted.
- Advantage: you can choose how many runs you'll do.
- Advantage: you can restrict yourself less on what values you might try.
- Advantage: Adding parameters that do not influence the performance does not affect efficiency.
- Advantage: research shows this is generally a better idea than grid search, see image for intuition:

<img src="img/randomsearch_bergstra.png" width="400">

Source: [Bergstra and Bengio, Random Search for Hyper-Parameter Optimization, JMLR 2012](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf).

- You don't know in advance which hyperparameters are important for your problem.
- But some of them might be unimportant.
- In the left figure, 6 of the 9 searches are useless because they are only varying the unimportant parameter.
- In the right figure, all 9 searches are useful.
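
The intuition in the figure can be reproduced numerically. The sketch below is an illustration only: two hypothetical hyperparameters on the range [0, 1], where the first is the "important" one. It counts how many distinct values of the important hyperparameter get tried with 9 configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: 3 values per hyperparameter -> 3 * 3 = 9 configurations.
grid_values = np.linspace(0.0, 1.0, 3)
grid = np.array([(a, b) for a in grid_values for b in grid_values])

# Random search: 9 configurations sampled uniformly at random.
random_configs = rng.uniform(0.0, 1.0, size=(9, 2))

# Column 0 plays the role of the important hyperparameter.
n_distinct_grid = len(np.unique(grid[:, 0]))
n_distinct_random = len(np.unique(random_configs[:, 0]))

print(n_distinct_grid, n_distinct_random)
```

The grid only ever tries 3 distinct values of the important hyperparameter, while the random configurations try 9 distinct ones for the same budget.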

In [37]:
countvec= CountVectorizer(binary=True)
lr= LogisticRegression(max_iter=1000)

In [43]:
pipe= Pipeline([
    ('countvec',countvec),
    ('lr',lr)
])

In [44]:
params= {
    "countvec__min_df":np.arange(0,100),
    "lr__C": 2.0**np.arange(-5,5)
}
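
As an aside, the values do not have to be given as arrays: `RandomizedSearchCV` also accepts distributions with an `rvs` method, such as those in `scipy.stats`. The sketch below is one possible alternative to the `params` above, not what this lecture runs; it uses `ParameterSampler` just to show what would be sampled. The distribution choices (and starting `min_df` at 1 rather than 0) are assumptions made for illustration.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    # Integers from 1 to 99 (the upper bound of randint is exclusive).
    "countvec__min_df": randint(1, 100),
    # Continuous values between 2^-5 and 2^4, uniform on a log scale.
    "lr__C": loguniform(2.0**-5, 2.0**4),
}

# Draw 5 configurations, the same way RandomizedSearchCV would with n_iter=5.
sampled = list(ParameterSampler(param_distributions, n_iter=5, random_state=123))
for config in sampled:
    print(config)
```

The same `param_distributions` dictionary could be passed to `RandomizedSearchCV` in place of `params`.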

In [51]:
random_search = RandomizedSearchCV(pipe, params,
                                   n_iter = 12, 
                                   verbose = 1,
                                   random_state = 123,
                                   return_train_score=True)

In [52]:
random_search.fit(X_train_raw, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


In [53]:
random_search.best_params_

{'lr__C': 0.0625, 'countvec__min_df': 13}

In [54]:
random_search.best_score_

0.8605333333333333

- So, the results from grid search and random search are very slightly different.
- Is that difference important?
- Do we BELIEVE that difference? How to figure this out?

- Some strategies:
  - We can try it out on the test set.
  - We can look at the sub-scores of the folds.
  - Try cross-validation with more folds.
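
Here is a minimal sketch of the last two strategies: look at the per-fold scores and use more folds, then compare the gap between two models with the fold-to-fold spread. It uses synthetic data from `make_classification` and two arbitrary values of `C`, purely for illustration; it is not the IMDB experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

model_a = LogisticRegression(C=1.0, max_iter=1000)
model_b = LogisticRegression(C=0.0625, max_iter=1000)

# 10 folds instead of the default 5 gives more sub-scores to look at.
scores_a = cross_val_score(model_a, X, y, cv=10)
scores_b = cross_val_score(model_b, X, y, cv=10)

print(f"A: mean={scores_a.mean():.3f}, std={scores_a.std():.3f}")
print(f"B: mean={scores_b.mean():.3f}, std={scores_b.std():.3f}")
print(f"difference in means: {scores_a.mean() - scores_b.mean():.3f}")
```

If the difference in means is small compared with the standard deviation across folds, there is little reason to believe one configuration is really better. For a fitted search object, the same information is available in `cv_results_` under the `split*_test_score` and `std_test_score` keys.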


In [55]:
random_search.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_lr__C', 'param_countvec__min_df', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])

In [57]:
pd.DataFrame(random_search.cv_results_)[['mean_test_score', 'mean_train_score', 'param_countvec__min_df', 'param_lr__C', 'mean_fit_time', 'rank_test_score']].set_index("rank_test_score").sort_index()

| rank_test_score | mean_test_score | mean_train_score | param_countvec__min_df | param_lr__C | mean_fit_time |
|---|---|---|---|---|---|
| 1 | 0.860533 | 0.962767 | 13 | 0.0625 | 1.303155 |
| 2 | 0.860267 | 0.950733 | 5 | 0.03125 | 1.460388 |
| 3 | 0.858533 | 0.954767 | 22 | 0.0625 | 1.084777 |
| 4 | 0.8532 | 0.986267 | 20 | 0.25 | 1.246333 |
| 5 | 0.853067 | 0.938067 | 46 | 0.0625 | 1.159929 |
| 6 | 0.8492 | 0.942333 | 63 | 0.125 | 1.077099 |
| 7 | 0.8456 | 0.998767 | 19 | 1.0 | 1.506531 |
| 8 | 0.8404 | 1.0 | 13 | 8.0 | 1.763975 |
| 9 | 0.832 | 0.979 | 58 | 1.0 | 1.349338 |
| 10 | 0.8296 | 0.935 | 92 | 0.5 | 1.24058 |


In [58]:
grid_search.score(X_test_raw, y_test)

0.8556

In [59]:
random_search.score(X_test_raw, y_test)

0.8544