# CPSC 330 Lecture 5

#### Lecture plan

- 👋
- **Turn on recording**
- Announcements
- True/False questions from last time (10 min)
- Pipelines motivation (5 min)
- Pipelines (15 min)
- Break (5-10 min)
- Random forest classifiers (5 min)
- Hyperparameter optimization: grid search and random search (20 min)
- Bayesian hyperparameter optimization (10 min)
- Pipelines and hyperparameter tuning (5 min)

In [8]:
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## Announcements

- hw2 solutions posted.
- hw3 posted, due Monday at 11:59pm.
  - You can work with a partner, see instructions [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#partners).

## True/False questions from last time (10 min)

https://piazza.com/class/kb2e6nwu3uj23?cid=188

## Pipelines motivation (5 min)

Returning to our dataset of the week, which is IMDB movie reviews:

In [3]:
imdb_df = pd.read_csv('data/imdb_master.csv', index_col=0, encoding="ISO-8859-1")
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]
imdb_df = imdb_df.sample(frac=0.2, random_state=999) # Take a subsample of the dataset for speed

In [4]:
imdb_train, imdb_test = train_test_split(imdb_df, random_state=123)

As a reminder, here is what we did last time:

In [9]:
X_train_imdb_raw = imdb_train['review']
y_train_imdb = imdb_train['label']

X_test_imdb_raw = imdb_test['review']
y_test_imdb = imdb_test['label']

In [10]:
vec = CountVectorizer(min_df=50, binary=True)

In [12]:
X_train_imdb = vec.fit_transform(X_train_imdb_raw)
X_test_imdb = vec.transform(X_test_imdb_raw);

In [15]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_imdb, y_train_imdb);

In [16]:
lr.score(X_train_imdb, y_train_imdb)

0.9833333333333333

In [17]:
lr.score(X_test_imdb, y_test_imdb)

0.8256

Last time, we avoided cross-validation. Why?

In [18]:
cross_val_score(lr, X_train_imdb, y_train_imdb)

array([0.82866667, 0.836     , 0.83733333, 0.83266667, 0.834     ])

- The code runs.
- But we have a problem... our good friend the Golden Rule.
- It is actually the exact same problem as last time: we fit/transformed the `CountVectorizer` before splitting.
- Remember, cross-validation involves splitting!!!

In [22]:
n_fold = X_train_imdb.shape[0] // 5
X_valid_fold_1 = X_train_imdb[:n_fold]  # first fifth: validation part of fold 1
X_train_fold_1 = X_train_imdb[n_fold:]  # remaining four fifths: training part of fold 1

- But wait, both parts were transformed using a `CountVectorizer` that was fit on the whole training set, which includes the validation part.
- Just like last time, this is a Golden Rule violation.
- For example, the vocabulary (and the `min_df=50` cutoff) was chosen using reviews from the validation split, so the features we train on "get to be aware of" the validation data.
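To make the leak concrete, here is a minimal sketch on a made-up four-review corpus. The reviews and the fold boundary are assumptions for illustration, not the IMDB data. It compares the vocabulary learned from all rows with the vocabulary learned from the training fold only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the last review plays the role of the validation fold.
reviews = [
    "good movie",
    "bad movie",
    "great film",
    "awful film",
]
train_fold, valid_fold = reviews[:3], reviews[3:]

# What we did above: fit the vectorizer on everything, then split.
vocab_all = set(CountVectorizer().fit(reviews).vocabulary_)

# What the Golden Rule asks for: fit on the training fold only.
vocab_train_only = set(CountVectorizer().fit(train_fold).vocabulary_)

# Words the features "know about" only because of the validation fold.
leaked_words = vocab_all - vocab_train_only
print(leaked_words)  # {'awful'}
```

On a large dataset the leaked information is usually small, but it is still information from the validation fold.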

So what do we do here?

![](img/hmm.png)

Enter pipelines to the rescue!!

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Pipelines (20 min)

- scikit-learn `Pipeline` can help us with this.

In [23]:
from sklearn.pipeline import Pipeline

- This time we'll combine **the preprocessing and the model** with a `Pipeline`.

In [32]:
countvec = CountVectorizer(min_df=50, binary=True)
lr = LogisticRegression(max_iter=1000)

pipe = Pipeline([
    ('countvec', countvec),
    ('lr', lr)])

- Syntax: pass in a list of **steps**.
- The last step should be a model/classifier.
- All the earlier steps should be transformers.
  - Later in the course we'll see use cases for multiple rounds of transformers, here we only have one.

In [33]:
pipe.fit(X_train_imdb_raw, y_train_imdb);

- What is this doing?
- Note that I passed in the **raw** text data, not the vectorized word counts:

In [34]:
X_train_imdb_raw

34838    I guess that "Gunslinger" wasn't quite as god-...
345      Oh boy! Oh boy! On the cover of worn out VHS h...
48840    After you see Vertigo, then watch Bell, Book a...
4458     If this film is an accurate display of J. Smit...
23815    Only the Brits could make a film like this and...
                               ...                        
18386    I have probably watched the movie 4 or 5 times...
38425    This movie maked me cry at the end! I watch at...
46973    This is a weird movie about an archaeologist s...
14614    Finally a gangster Movie worth watching!<br />...
22094    A few things to touch on as a response to the ...
Name: review, Length: 7500, dtype: object

The pipeline is doing the following steps:

1. Fitting `CountVectorizer`.
2. Transforming the data using the fit `CountVectorizer`.
3. Fitting the `LogisticRegression` on the transformed data.
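As a sanity check, the three steps can be written out by hand and compared with the pipeline. This is a sketch on a tiny invented dataset (the variable names and reviews are assumptions, not the IMDB data).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the IMDB reviews.
X_train_raw = ["good fun movie", "great movie", "bad boring movie", "awful movie"]
y_train = ["pos", "pos", "neg", "neg"]
X_test_raw = ["fun great film", "boring awful film"]

# Version 1: the pipeline does everything in one call.
pipe = Pipeline([
    ("countvec", CountVectorizer(binary=True)),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train_raw, y_train)
pipe_pred = pipe.predict(X_test_raw)

# Version 2: the same three steps written out by hand.
vec = CountVectorizer(binary=True)
X_train_counts = vec.fit_transform(X_train_raw)  # steps 1 and 2
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_counts, y_train)                  # step 3
# On new data we only transform, we never fit.
manual_pred = lr.predict(vec.transform(X_test_raw))

print(np.array_equal(pipe_pred, manual_pred))
```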

When we call `predict` (or `score`), we also feed in the raw data:

In [35]:
pipe.predict(X_test_imdb_raw)

array(['pos', 'pos', 'pos', ..., 'pos', 'pos', 'pos'], dtype=object)

Here is a schematic assuming you have two transformers:

<img src="img/pipeline.png" width="700">

[Source](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#18)

- One thing that is awesome here is that we can't make the mistakes we showed last time:
  - We call `fit` on the train split and `score` on the test split, it's clean.
  - We can't accidentally re-fit the preprocessor on the test data like we did last time.
  - It automatically makes sure the same transformations are applied to train and test.

And now, the moment of truth:

In [36]:
cross_val_score(pipe, X_train_imdb_raw, y_train_imdb)

array([0.82666667, 0.824     , 0.83133333, 0.83066667, 0.83533333])

- Remember what cross-validation does - it calls `fit` and `score`.
- Now we're calling `fit` on the pipeline, not just the logistic regression.
  - So **both the vectorizer and the logistic regression are refit again on each fold**.
  - This is what we want to avoid the Golden Rule violation!
  - Every validation score is unseen data with respect to the pipeline.
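Roughly, this is what `cross_val_score` does with a pipeline. The sketch below writes the loop by hand on invented toy reviews (the word lists and data generation are assumptions for illustration) and checks that it matches the library call.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews: two class-specific words plus two neutral filler words.
rng = np.random.default_rng(0)
pos_words = ["great", "good", "fun", "love"]
neg_words = ["bad", "boring", "awful", "hate"]
filler = ["movie", "film", "plot", "actor"]

texts, labels = [], []
for i in range(60):
    label = "pos" if i % 2 == 0 else "neg"
    class_words = pos_words if label == "pos" else neg_words
    words = list(rng.choice(class_words, size=2)) + list(rng.choice(filler, size=2))
    texts.append(" ".join(words))
    labels.append(label)
X_raw = np.array(texts, dtype=object)
y = np.array(labels)

pipe = Pipeline([
    ("countvec", CountVectorizer(binary=True)),
    ("lr", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, valid_idx in cv.split(X_raw, y):
    fold_pipe = clone(pipe)  # fresh, unfitted copy of vectorizer AND classifier
    # The vocabulary is learned from this fold's training part only.
    fold_pipe.fit(X_raw[train_idx], y[train_idx])
    manual_scores.append(fold_pipe.score(X_raw[valid_idx], y[valid_idx]))

library_scores = cross_val_score(pipe, X_raw, y, cv=cv)
print(np.allclose(manual_scores, library_scores))
```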

![](img/yay.png)

- BTW, the scores here aren't that different.
- I don't suspect it matters all that much here.
- But there could be cases where the effect is large.
- In this course I want you to build good habits that will serve you well going forward.

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Pipelines and the Golden Rule (20 min)

In [26]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=100)

In [27]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)])

The next line fits the transformer **and** the model; there was no need to call `preprocessor.fit()` before this.

In [28]:
pipeline.fit(df_train, y_train_log);

Note how I passed in `df_train`, not `df_train_enc`.

In [29]:
pipeline.predict(df_train)

array([11.97330353, 11.69417912, 11.9050535 , ..., 12.20638141,
       11.99713426, 11.81988184])

In [30]:
pipeline.score(df_train, y_train_log)

0.8963159928298279

In [31]:
pipeline.score(df_valid, y_valid_log)

0.8913218700485418

We'll use this `Pipeline` on the combined train and validation sets:

In [33]:
cross_val_score(pipeline, df_trainvalid, y_trainvalid_log, cv=10)

array([0.91329227, 0.86442022, 0.84424495, 0.87895509, 0.90585488,
       0.89379337, 0.86963715, 0.90347339, 0.58863687, 0.91024683])

- Does this solve the problem?
  - Discuss for 2-3 minutes.
  - Don't look ahead!
  
<br><br><br><br><br><br>

Yes! Why does this work?

- Because `cross_val_score` calls `fit` for each fold.
- And this includes fitting the preprocessor.
- Thus, there is actually no difference between `df_train` and `df_valid` - nothing has actually been done to them yet!
  - `df_trainvalid` is just the part that's not the test set, that's all.

(optional note) Yet another idea could be to do the cross-validation using only the validation split. I believe this does not technically violate the Golden Rule, but it's very wasteful in terms of data and is also not really representative of how your model will eventually be trained. The `Pipeline` approach is much better.

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Break (5-10 min)

Please fill out the mid-course survey at https://ubc.ca1.qualtrics.com/jfe/form/SV_6tevNhMjZxRiQEl

<br><br><br><br>

## Random forest classifiers (5 min)



## Hyperparameter optimization: grid search and random search (20 min)

#### Manual hyperparameter optimization

- We tried this a bit.
- Advantage: we may have some intuition about what might work.
  - E.g. if I'm massively overfitting, try decreasing `max_depth` or `C`.
- Disadvantage: it takes a lot of work.
- Disadvantage: in very complicated cases, our intuition might be worse than a data-driven approach.

#### Automated hyperparameter optimization

- Advantage: reduces human effort.
- Advantage: less prone to error, and improves reproducibility.
- Advantage: data-driven approaches may be effective.
- Disadvantage: may be hard to incorporate intuition.
- Disadvantage: you can overfit on the validation set if you search too hard, so be careful.



There are two automated hyperparameter search methods in scikit-learn:

  - Exhaustive grid search: [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
  - Randomized hyperparameter optimization: [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  
The "CV" stands for cross-validation; these searchers have cross-validation built right in.

In [196]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

#### Exhaustive grid search

- A user specifies a set of values for each hyperparameter. 
- The method considers the Cartesian product of these sets and evaluates each combination one by one.
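To see what "product of the sets" means, scikit-learn's `ParameterGrid` can enumerate the combinations for us. The grid below is a made-up example, not one used in this lecture.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: two hyperparameters with two values each.
toy_grid = {
    "max_depth": [3, None],
    "n_estimators": [10, 100],
}

# Every combination of one value per hyperparameter.
combos = list(ParameterGrid(toy_grid))
for combo in combos:
    print(combo)
print(len(combos))  # 4
```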

Let's start the automated hyperparameter optimization.

In [37]:
param_grid = {
    "C" : [0.01, 1, 10, 100]
}

In [38]:
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, verbose=1)

- Note that we can fix some hyperparameters and make others variable.
- `verbose=1` tells `GridSearchCV` to print some output while it's working.
  - This can be useful as this step sometimes takes a long time.

In [40]:
grid_search.fit(X_train_imdb, y_train_imdb);

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    4.9s finished


Going back to Lecture 3, this is what it's doing:

```
for C in [0.01, 1, 10, 100]:
    for fold in folds:
        fit on training portion with the given C
        score on validation portion
    compute average score
pick hypers with best score
```
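The pseudocode above can be turned into a runnable sketch. The data here is synthetic (`make_classification`), an assumption made so that the example is self-contained; it is not the IMDB data. The hand-written loop and `GridSearchCV` should agree on the best mean cross-validation score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

C_values = [0.01, 1, 10, 100]
best_C, best_score = None, -np.inf
for C in C_values:
    # cross_val_score fits on each training portion and scores on each validation portion.
    fold_scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    mean_score = fold_scores.mean()
    if mean_score > best_score:
        best_C, best_score = C, mean_score

# The same search with GridSearchCV, using the same number of folds.
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": C_values}, cv=5)
grid_search.fit(X, y)

print(best_C, best_score)
print(grid_search.best_params_, grid_search.best_score_)
```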

From here, we can extract the best hyperparameter values:

In [41]:
grid_search.best_params_

{'C': 0.01}

In [42]:
grid_search.best_score_

0.8441333333333333

We can extract the classifier inside like this:

In [43]:
grid_search.best_estimator_

LogisticRegression(C=0.01, max_iter=1000)

In [44]:
grid_search.best_estimator_.predict(X_test_imdb)

array(['pos', 'pos', 'pos', ..., 'pos', 'pos', 'pos'], dtype=object)

They also provide some "syntactic sugar" and allow you to call `predict` or `score` directly on the `GridSearchCV` object:

In [45]:
grid_search.predict(X_test_imdb) ## Does the same thing

array(['pos', 'pos', 'pos', ..., 'pos', 'pos', 'pos'], dtype=object)

- Ok, so this is all the syntax, but now we know we've been violating the Golden Rule because of the cross-validation.
- So let's do it again properly this time.

In [209]:
param_grid = {
              "n_estimators"     : [10,100],
              "max_depth"        : [3, None],
              "max_features"     : [3, None]
             }
param_grid

{'n_estimators': [10, 100], 'max_depth': [3, None], 'max_features': [3, None]}

- How many combinations in total? 
- $2\times 2\times 2=8$

In [210]:
np.prod(list(map(len, param_grid.values())))

8

In [211]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=321)
grid_search = GridSearchCV(rf, param_grid, cv=3, verbose=1)

In [212]:
grid_search.fit(X_train_transformed, y_train);

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.3min finished


In [213]:
grid_search.best_params_

{'max_depth': None, 'max_features': None, 'n_estimators': 100}

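- lol... `max_depth=None` and `n_estimators=100` are the default values.
- `max_features=None` is not, though: the default in this version is `max_features='auto'`.
- I guess they picked mostly good defaults!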

In [214]:
grid_search.best_score_

0.8549216369281329

In [215]:
pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'param_max_depth', 'param_max_features', 'param_n_estimators', 'mean_fit_time', 'rank_test_score']].set_index("rank_test_score").sort_index()

| rank_test_score | mean_test_score | param_max_depth | param_max_features | param_n_estimators | mean_fit_time |
|---|---|---|---|---|---|
| 1 | 0.854922 | None | None | 100 | 14.519656 |
| 2 | 0.848587 | None | None | 10 | 1.399292 |
| 3 | 0.848127 | None | 3 | 100 | 1.986127 |
| 4 | 0.844748 | 3 | None | 10 | 0.51757 |
| 4 | 0.844748 | 3 | None | 100 | 5.280527 |
| 6 | 0.839527 | None | 3 | 10 | 0.226066 |
| 7 | 0.761057 | 3 | 3 | 10 | 0.18932 |
| 8 | 0.760519 | 3 | 3 | 100 | 0.907236 |


- Note that the grid search object acts like a scikit-learn model.
- It was actually refit on the _whole_ training set, as discussed earlier in the course!
- With the default `refit=True`, calling `predict` or `score` on it is the same as calling them on `grid_search.best_estimator_`.
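A quick way to convince yourself that the search object and its `best_estimator_` behave the same is to compare their predictions. This sketch uses synthetic data and a deliberately tiny grid (both assumptions, chosen to keep it fast).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=321),
    {"max_depth": [3, None]},
    cv=3,
)
gs.fit(X_tr, y_tr)

# refit=True is the default: the best configuration is refit on all of X_tr.
same_predictions = np.array_equal(gs.predict(X_te), gs.best_estimator_.predict(X_te))
print(gs.refit, same_predictions)  # True True
```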

In [216]:
grid_search.predict(X_test_transformed)

array(['<=50K', '>50K', '<=50K', ..., '>50K', '<=50K', '<=50K'],
      dtype=object)

#### Problems with exhaustive grid search

- Required number of models to evaluate grows exponentially with the dimensionality of the configuration space.
- Exhaustive search may become infeasible fairly quickly. 
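A little arithmetic shows how quickly this blows up. The helper names below are made up for illustration; the only assumption is that every hyperparameter gets the same number of candidate values.

```python
def grid_size(values_per_hyperparameter, n_hyperparameters):
    """Number of combinations in a full grid."""
    return values_per_hyperparameter ** n_hyperparameters


def total_fits(values_per_hyperparameter, n_hyperparameters, n_folds):
    """Number of model fits: one per combination per cross-validation fold."""
    return grid_size(values_per_hyperparameter, n_hyperparameters) * n_folds


# The grid above: 2 values for each of 3 hyperparameters, 3 folds.
print(total_fits(2, 3, 3))  # 24

# 5 values per hyperparameter, with more and more hyperparameters.
for d in [1, 2, 4, 8]:
    print(d, grid_size(5, d))
```

With 5 values for each of 8 hyperparameters there are already 390625 combinations, before multiplying by the number of folds.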

#### Randomized hyperparameter search

- Randomized hyperparameter optimization: [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
- Samples configurations at random until a certain budget (e.g., time or number of trials) is exhausted.
- Advantage: you can choose how many runs you'll do.
- Advantage: you can restrict yourself less on what values you might try.
- Advantage: Adding parameters that do not influence the performance does not affect efficiency.
- Advantage: research shows this is generally a better idea than grid search, see image for intuition:

![](img/randomsearch_bergstra.png)

Source: [Bergstra and Bengio, Random Search for Hyper-Parameter Optimization, JMLR 2012](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf).

- You don't know in advance which hyperparameters are important for your problem.
- But some of them might be unimportant.
- In the left figure, 6 of the 9 searches are useless because they are only varying the unimportant parameter.
- In the right figure, all 9 searches are useful.
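The same intuition can be checked numerically. In this sketch (the grid values and the seed are arbitrary assumptions), both searches get 9 trials over two hyperparameters, and we count how many distinct values of the first, "important" hyperparameter each one tries.

```python
import itertools
import numpy as np

# Grid: 3 values for each of two hyperparameters, so 9 trials.
grid_values = [0.1, 0.5, 0.9]
grid_trials = list(itertools.product(grid_values, grid_values))

# Random: 9 trials, each drawing both hyperparameters independently.
rng = np.random.default_rng(0)
random_trials = rng.uniform(0, 1, size=(9, 2))

# Suppose only the first hyperparameter matters.
distinct_grid = len({important for important, _ in grid_trials})
distinct_random = len(set(random_trials[:, 0]))
print(distinct_grid, distinct_random)  # 3 9
```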

Back to syntax. We can have the parameters chosen from a list:

In [217]:
param_choices = {
              "n_estimators"     : [10, 30, 100, 300],
              "max_depth"        : [3, 10, None],
              "max_features"     : [3, 10, None]
             }

You can also give it distributions, instead of lists.

In [218]:
import scipy.stats

In [219]:
param_dist = {
              "n_estimators"     : scipy.stats.randint(low=10, high=300),
              "max_depth"        : scipy.stats.randint(low=10, high=30),
              "max_features"     : scipy.stats.randint(low=10, high=30)
             }
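Before running a search, it can help to peek at what such distributions generate. The sketch below uses scikit-learn's `ParameterSampler` to draw a few configurations. Note that for `scipy.stats.randint` the `high` bound is exclusive. The number of draws and the seed are arbitrary choices.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

param_dist = {
    "n_estimators": scipy.stats.randint(low=10, high=300),  # integers 10..299
    "max_depth": scipy.stats.randint(low=10, high=30),      # integers 10..29
}

# Draw 5 random configurations from the distributions.
sampled = list(ParameterSampler(param_dist, n_iter=5, random_state=123))
for params in sampled:
    print(params)
```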

In [220]:
rf = RandomForestClassifier(random_state=321) # Note: you can set other hyperparameters here

In [None]:
random_search = RandomizedSearchCV(rf, param_distributions = param_dist, 
                                   n_iter = 10, 
                                   cv=3,
                                   verbose=1, random_state=123)

In [None]:
random_search.fit(X_train_transformed, y_train);

- Note: some hyperparameters significantly affect the training time!
- For example, setting `n_estimators=1000` is going to be very slow.

In [221]:
random_search.best_params_

{'max_depth': 16, 'max_features': 27, 'n_estimators': 93}

- Now we get something different! 
- What's the score?

In [222]:
random_search.best_score_

0.863175750441226

- So, we had 85.5% and now we have 86.3%.
- Is that difference important?
- Do we BELIEVE that difference?
  - We can try it out on the test set.
- But first:  

In [225]:
pd.DataFrame(random_search.cv_results_)[['mean_test_score', 'param_max_depth', 'param_max_features', 'param_n_estimators', 'mean_fit_time', 'rank_test_score']].set_index("rank_test_score").sort_index()

| rank_test_score | mean_test_score | param_max_depth | param_max_features | param_n_estimators | mean_fit_time |
|---|---|---|---|---|---|
| 1 | 0.863176 | 16 | 27 | 93 | 4.662635 |
| 2 | 0.862561 | 20 | 11 | 106 | 2.776636 |
| 3 | 0.86164 | 23 | 12 | 108 | 3.506636 |
| 4 | 0.861602 | 17 | 12 | 94 | 2.629149 |
| 5 | 0.861333 | 14 | 10 | 218 | 5.281985 |
| 6 | 0.860527 | 27 | 25 | 83 | 3.811981 |
| 7 | 0.86045 | 25 | 29 | 263 | 15.884677 |
| 8 | 0.860412 | 10 | 24 | 234 | 8.109985 |
| 9 | 0.859951 | 25 | 26 | 145 | 7.407122 |
| 10 | 0.859375 | 14 | 27 | 12 | 0.542452 |


- Look at the timings, they are quite interesting.
- And now, the test set:

In [223]:
grid_search.score(X_test_transformed, y_test)

0.8527560264087211

In [224]:
random_search.score(X_test_transformed, y_test)

0.8622754491017964

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Bayesian hyperparameter optimization (10 min)

- Both `GridSearchCV` and `RandomizedSearchCV` do each trial independently.
- What if you could learn from your experience, e.g. learn that `max_depth=3` is bad?
  - That could save time because you wouldn't try combinations involving `max_depth=3` in the future.
- We can do this with `scikit-optimize`, which is a separate package from `scikit-learn` and needs to be installed separately.
- It uses a technique called "model-based optimization" and we'll specifically use "Bayesian optimization".
  - In short, it uses machine learning to predict what hyperparameters will be good.
  - Machine learning on machine learning!
- As it happens I did my PhD thesis on this topic.
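To give a feel for "learning from experience", here is a toy model-based search loop. It is only a sketch under strong assumptions: a made-up one-dimensional `objective` stands in for "validation score as a function of one hyperparameter", the surrogate model is a Gaussian process from scikit-learn, and the next trial is picked with a simple upper-confidence-bound rule. It is not meant to show how `scikit-optimize` is implemented.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # Hypothetical stand-in for a validation score; the best value is at x = 0.7.
    return -(x - 0.7) ** 2


rng = np.random.default_rng(0)
candidates = np.linspace(0, 1, 101).reshape(-1, 1)

# Start with a few random trials.
X_tried = rng.uniform(0, 1, size=(3, 1))
y_tried = objective(X_tried).ravel()

for _ in range(7):
    # Fit a model that predicts the score from the hyperparameter value.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_tried, y_tried)
    mean, std = gp.predict(candidates, return_std=True)

    # Prefer candidates that look good (high mean) or are still uncertain (high std).
    ucb = mean + 2.0 * std
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)

    # Run the "expensive" trial and add it to our experience.
    X_tried = np.vstack([X_tried, x_next])
    y_tried = np.append(y_tried, objective(x_next).ravel())

best_x = X_tried[np.argmax(y_tried), 0]
print(best_x, y_tried.max())
```

The key difference from grid and random search is inside the loop: each new trial is chosen using the results of all the previous ones.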

In [311]:
from skopt import BayesSearchCV

- `BayesSearchCV` uses the same interface as `GridSearchCV` and `RandomizedSearchCV`.
- However, the way we specify the parameter distributions is slightly different.
- Here, we can just give the bounds as tuples.

In [312]:
bayes_opt = BayesSearchCV(
    RandomForestClassifier(random_state=321),
    {
        'n_estimators': (10, 300),  
        'max_depth': (3, 30),
        'max_features': (3, 30)
    },
    n_iter=10,
    cv=3,
    random_state=123,
    verbose=0,
    refit=True
)

In [313]:
%%time
bayes_opt.fit(X_train_transformed, y_train);

CPU times: user 2min 41s, sys: 3.35 s, total: 2min 44s
Wall time: 3min 3s


BayesSearchCV(cv=3, error_score='raise',
              estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                               class_weight=None,
                                               criterion='gini', max_depth=None,
                                               max_features='auto',
                                               max_leaf_nodes=None,
                                               max_samples=None,
                                               min_impurity_decrease=0.0,
                                               min_impurity_split=None,
                                               min_samples_leaf=1,
                                               min_samples_split=2,
                                               min_weight_fraction_leaf=0.0,
                                               n_estimators=100, n_jobs=None,
                                               oob_score=False,
                                 

- It took a similar amount of time to the other methods.
- In reality there is some extra computation to do the "meta-ML".
- However, the overall time is dominated by the time of calling `fit` on the random forests.

In [314]:
bayes_opt.best_params_

{'max_depth': 18, 'max_features': 12, 'n_estimators': 130}

In [315]:
bayes_opt.best_score_

0.8625998157248157

- The score looks promising.
- In theory, it should get even better as we increase `n_iter` (because it has more data to learn from).
- Checking the test score:

In [316]:
bayes_opt.score(X_test_transformed, y_test)

0.8621219100261016

And reproducing the previous test scores for comparison:

In [317]:
random_search.score(X_test_transformed, y_test)

0.8622754491017964

In [318]:
grid_search.score(X_test_transformed, y_test)

0.8527560264087211

- In this case, it seems we weren't overfitting on the validation set and this exercise was actually useful.
- Should I always use this? Not necessarily.
- Disadvantage: requires installation.
- Disadvantage: when number of trials is large (e.g. hundreds), the meta-ML can actually get too slow.
- Disadvantage: harder to parallelize the search because each trial depends on the previous ones.
  - Note `n_jobs` parameter for `GridSearchCV` and `RandomizedSearchCV`.  
  - `BayesSearchCV` also has this parameter.
  - It can definitely parallelize the folds.
  - The search will be less effective if it parallelizes further.

- Can I generalize this to say `BayesSearchCV` > `RandomizedSearchCV` > `GridSearchCV`?
- Not quite. I'd say `RandomizedSearchCV` > `GridSearchCV` is pretty reasonable.
- But we should think a bit more carefully about `BayesSearchCV` for the above reasons.
- `RandomizedSearchCV` is often a reasonable choice.

## Pipelines and hyperparameter optimization (5 min)

- The same problems arise when doing hyperparameter optimization, simply because these methods do cross-validation.
- We can avoid them with a `Pipeline` in the same way.
- I'll optimize `alpha`, so I'll create a new pipeline where `alpha` is not specified:

In [34]:
pipeline = Pipeline([('preprocessor', preprocessor),
                      ('model', Ridge())])

param_grid = {
    'preprocessor__numeric__imputer__strategy': ['mean', 'median'],
    'model__alpha': [1.0, 10, 100],
}

- Above: we have a nesting of transformers. 
- We can access the parameters of the "inner" objects by using `__` to go "deeper":
  - `model__alpha`: "the `alpha` of the model (of the pipeline)"
  - `preprocessor__numeric__imputer__strategy`: "the strategy of the imputer of the numeric transformer of the preprocessor (of the pipeline)"
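Here is a self-contained sketch of how such nested names arise. The `preprocessor` below is a guess at a structure that would produce these parameter names (a `ColumnTransformer` with a step called `numeric`, which is itself a `Pipeline` with a step called `imputer`), and the toy DataFrame and its column names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with some missing values in a numeric column.
rng = np.random.default_rng(0)
n = 60
toy_df = pd.DataFrame({
    "area": rng.normal(100, 20, size=n),
    "age": rng.normal(30, 10, size=n),
    "zone": rng.choice(["A", "B", "C"], size=n),
})
toy_y = 0.05 * toy_df["area"] - 0.02 * toy_df["age"] + rng.normal(0, 0.5, size=n)
toy_df.loc[toy_df.index % 7 == 0, "age"] = np.nan

# Step names chosen to match the parameter names used above.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("numeric", numeric_transformer, ["area", "age"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])
pipeline = Pipeline([("preprocessor", preprocessor), ("model", Ridge())])

param_grid = {
    "preprocessor__numeric__imputer__strategy": ["mean", "median"],
    "model__alpha": [1.0, 10, 100],
}

# get_params() lists every name that can appear in a param_grid.
print("preprocessor__numeric__imputer__strategy" in pipeline.get_params())

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(toy_df, toy_y)
print(grid_search.best_params_)
```

Calling `pipeline.get_params().keys()` is a handy way to find the right name when a `__` path is hard to guess.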

In [35]:
grid_search = GridSearchCV(pipeline, param_grid, cv=10)
grid_search.fit(df_trainvalid, y_trainvalid_log);

In [36]:
grid_search.best_params_

{'model__alpha': 10, 'preprocessor__numeric__imputer__strategy': 'median'}

This is particularly useful when there are serious hyperparameters in the preprocessing pipeline, e.g. if you're using `CountVectorizer`.
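For the text case, a sketch could look like the following. The twelve reviews are invented and the grid values are arbitrary; the point is only that `CountVectorizer` settings and the classifier's `C` are searched together, with the vectorizer refit inside every fold.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews (not the IMDB data).
pos_reviews = [
    "great fun movie", "good movie great plot", "love this fun film",
    "great film good actor", "fun plot love it", "good fun movie",
]
neg_reviews = [
    "bad boring movie", "awful movie bad plot", "hate this boring film",
    "bad film awful actor", "boring plot hate it", "awful boring movie",
]
X_raw = np.array(pos_reviews + neg_reviews, dtype=object)
y = np.array(["pos"] * len(pos_reviews) + ["neg"] * len(neg_reviews))

text_pipe = Pipeline([
    ("countvec", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Hyperparameters of both steps, addressed with the step name and "__".
text_param_grid = {
    "countvec__min_df": [1, 2],
    "countvec__binary": [True, False],
    "lr__C": [0.01, 1, 100],
}

text_search = GridSearchCV(text_pipe, text_param_grid, cv=3)
text_search.fit(X_raw, y)  # raw text goes in; each fold fits its own vectorizer
print(text_search.best_params_)
```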