# CPSC 330 Lecture 7

### Lecture plan

- 👋
- **Turn on recording**
- Announcements
- Missing data (15 min)
- Feature scaling (25 min)
- Break (5 min)
- Putting it all together with `ColumnTransformer` (20 min)
- Trying a bunch of classifiers (10 min)
- Summary (5 min)

## Learning objectives

TODO

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16

from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV 
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer

## Dealing with missing data (15 min)

Today we'll continue with the census data:

In [2]:
census = pd.read_csv('data/adult.csv')

As discussed last time, we'll drop the `education` column because it's already been ordinally encoded in `education.num`.

In [3]:
census = census.drop(columns=["education"])

In [4]:
census_train, census_test = train_test_split(census, test_size=0.2, random_state=123)

In [5]:
census_train.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


- We can see we have a bunch of missing values, where presumably the person did not answer that question on the census.
- Interestingly, these were not picked up: 

In [6]:
census_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26048 entries, 17064 to 19966
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             26048 non-null  int64 
 1   workclass       26048 non-null  object
 2   fnlwgt          26048 non-null  int64 
 3   education.num   26048 non-null  int64 
 4   marital.status  26048 non-null  object
 5   occupation      26048 non-null  object
 6   relationship    26048 non-null  object
 7   race            26048 non-null  object
 8   sex             26048 non-null  object
 9   capital.gain    26048 non-null  int64 
 10  capital.loss    26048 non-null  int64 
 11  hours.per.week  26048 non-null  int64 
 12  native.country  26048 non-null  object
 13  income          26048 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.0+ MB


- Everything is non-null because the missing values were encoded as the string "?" instead of an actual NaN in Python.
- We saw those last class, where "?" was a category generated by OHE.
- Let's change them to actual nulls:

In [7]:
df_train_nan = census_train.replace('?', np.nan)
df_test_nan  = census_test.replace( '?', np.nan)

df_train_nan.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,,77053,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,,186061,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [8]:
df_train_nan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26048 entries, 17064 to 19966
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             26048 non-null  int64 
 1   workclass       24600 non-null  object
 2   fnlwgt          26048 non-null  int64 
 3   education.num   26048 non-null  int64 
 4   marital.status  26048 non-null  object
 5   occupation      24595 non-null  object
 6   relationship    26048 non-null  object
 7   race            26048 non-null  object
 8   sex             26048 non-null  object
 9   capital.gain    26048 non-null  int64 
 10  capital.loss    26048 non-null  int64 
 11  hours.per.week  26048 non-null  int64 
 12  native.country  25573 non-null  object
 13  income          26048 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.0+ MB


- Now we can see the null values, and a profiling tool such as pandas-profiling would likely flag them.
  - Note: we'll address null values in the features, not in the targets.
- So, how should we address these?
- Disclaimer: we will only cover this in a super simplistic way.
- See STAT courses for a proper treatment of this topic!
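
As a side note, the `'?'` markers could also be converted while the file is being read. The snippet below is a minimal sketch using a tiny made-up CSV (not the real census file); `na_values` is a standard `pd.read_csv` argument, while the data and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Tiny made-up CSV in the same style as the census file ("?" marks a missing answer).
csv_text = """age,workclass,occupation
90,?,?
82,Private,Exec-managerial
66,?,?
54,Private,Machine-op-inspct
"""

# na_values tells pandas which strings to treat as missing while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Count the missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Here `isna().sum()` gives a quick per-column count, which is often easier to scan than the full `info()` output.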

#### Gotta drop 'em all

In [9]:
X_train_nan = df_train_nan.drop(columns=['income'])
X_test_nan  = df_test_nan.drop(columns=['income'])
y_train = df_train_nan['income']
y_test = df_test_nan['income']

In [10]:
X_train_nan.shape

(26048, 13)

In [11]:
X_train_nan.dropna(axis=0).shape

(24144, 13)

- So, we dropped about 2000 rows.
- We'd need to do the same in our test set.
- But what if we get a missing value in deployment?
- And furthermore, what if the missing values don't occur at random and we're systematically dropping certain data?
- This is not a great solution, especially if there are a lot of missing values.

In [12]:
X_train_nan.dropna(axis=1).shape

(26048, 10)

- One can also drop all _columns_ with missing values using `axis=1`. 
- This generally throws away a lot of information, because you lose a whole column just for 1 missing value.
- But I might drop a column if it's 99.9% missing values, for example.
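
If we only want to drop columns that are mostly missing, `dropna` accepts a `thresh` argument (the minimum number of non-missing values required to keep a column). The following is a sketch on a made-up DataFrame; the column names and the 50% cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one column is complete, one has a single gap, one is mostly empty.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "workclass": ["Private", np.nan, "Private", "State-gov"],  # 1 of 4 missing
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],             # 3 of 4 missing
})

# Keep a column only if at least half of its values are present.
min_non_missing = int(np.ceil(0.5 * len(toy)))
kept = toy.dropna(axis=1, thresh=min_non_missing)
print(list(kept.columns))
```

With this cutoff, `workclass` survives despite its single gap, whereas plain `dropna(axis=1)` would have removed it.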

#### Imputation

- Imputation means inventing values for the missing data.
- The strategies are different for numeric vs. categorical.
- In this dataset it turns out we only have missing values in the categorical features.

In [13]:
from sklearn.impute import SimpleImputer

In [14]:
imp = SimpleImputer(strategy='most_frequent')

- This imputer is another transformer, like the other ones we've seen (`CountVectorizer`, `OrdinalEncoder`, `OneHotEncoder`).
- The "most_frequent" strategy puts in the most frequent value seen in that column.
- There are also strategies for numeric variables, like taking the mean or median value.
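
To make the strategies concrete, here is a small sketch on made-up data (the column names and values are invented): `most_frequent` for a categorical column and `median` for a numeric one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", np.nan, "State-gov"],
    "hours": [40.0, np.nan, 20.0, 60.0],
})

cat_imp = SimpleImputer(strategy="most_frequent")  # fills with the mode
num_imp = SimpleImputer(strategy="median")         # fills with the median

workclass_filled = cat_imp.fit_transform(toy[["workclass"]])
hours_filled = num_imp.fit_transform(toy[["hours"]])

print(workclass_filled.ravel())
print(hours_filled.ravel())
```

The missing `workclass` entry becomes `"Private"` (the most common value) and the missing `hours` entry becomes the median of the observed values.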

In [15]:
numeric_features = ['age', 'fnlwgt', 'education.num', 'capital.gain', 
                    'capital.loss', 'hours.per.week']
categorical_features = ['workclass', 'marital.status', 'occupation', 
                        'relationship', 'race', 'sex', 'native.country']
target_column = 'income'

In [16]:
imp.fit(X_train_nan[categorical_features]);

In [17]:
X_train_imp_cat = pd.DataFrame(imp.transform(X_train_nan[categorical_features]),
                           columns=categorical_features, index=X_train_nan.index)
X_test_imp_cat = pd.DataFrame(imp.transform(X_test_nan[categorical_features]),
                           columns=categorical_features, index=X_test_nan.index)

X_train_imp = X_train_nan.copy()
X_train_imp.update(X_train_imp_cat)

X_test_imp = X_test_nan.copy()
X_test_imp.update(X_test_imp_cat)

We can see the missing values filled in. Before:

In [18]:
X_train_nan.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,90,,77053,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States
2,66,,186061,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States


After:

In [19]:
X_train_imp.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,90,Private,77053,9,Widowed,Prof-specialty,Not-in-family,White,Female,0,4356,40,United-States
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States
2,66,Private,186061,10,Widowed,Prof-specialty,Unmarried,Black,Female,0,4356,40,United-States
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States


We won't go into any detail about methods of imputation, but you can consider the different approaches as hyperparameters.
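
As a rough illustration of treating the imputation approach as a hyperparameter, the sketch below searches over the imputer's `strategy` inside a Pipeline. The data is made up and far too small to say anything meaningful; the point is only the `step__parameter` naming.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical feature with some missing entries, plus made-up labels.
X_toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov", "Private", np.nan,
                  "State-gov", "Private", "State-gov", "Private", np.nan, "State-gov"],
})
y_toy = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K",
         ">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]

impute_pipe = Pipeline([("imputation", SimpleImputer()),
                        ("ohe", OneHotEncoder(handle_unknown="ignore")),
                        ("lr", LogisticRegression(max_iter=1000))])

# "constant" fills missing entries with a placeholder category instead of the mode.
param_grid = {"imputation__strategy": ["most_frequent", "constant"]}

search = GridSearchCV(impute_pipe, param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```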

#### Pipeline

Let's build a Pipeline with what we have so far for categorical features only.

In [20]:
pipe = Pipeline([('imputation', SimpleImputer(strategy='most_frequent')),
                 ('ohe', OneHotEncoder(handle_unknown='ignore')),
                 ('lr', LogisticRegression(max_iter=1000))])

- Now we have a Pipeline with 3 stages: 2 transformers followed by a classifier.
- Now we can go back to that image from Lecture 5 and it's more appropriate:

<img src="img/pipeline.png" width="700">

[Source](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#18)

We can run the pipeline:

In [21]:
pd.DataFrame(cross_validate(pipe, X_train_nan[categorical_features], y_train))

Unnamed: 0,fit_time,score_time,test_score
0,0.880267,0.015519,0.813244
1,0.888231,0.014595,0.808829
2,0.895017,0.016058,0.80595
3,0.909701,0.015221,0.819159
4,0.883047,0.015392,0.815512


- Great, so this all works, but we only used the categorical features.
- Later today we'll see how to combine everything nicely with `ColumnTransformer`.
- But first, one more thing: preprocessing of the numeric variables!

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Feature scaling (25 min)

Here are the numeric features:

In [36]:
X_train_imp[numeric_features]

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,20,110998,10,0,0,30
18434,22,263670,9,0,0,80
3294,51,335997,9,4386,0,55
31317,53,111939,13,0,0,35
4770,52,51048,13,0,0,55
...,...,...,...,...,...,...
28636,48,70668,9,0,0,50
17730,35,340018,6,0,0,38
28030,26,373553,10,0,0,42
15725,28,155621,3,0,0,40


Let's train a model using only these features:

In [38]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.114391
score_time     0.011969
test_score     0.799217
train_score    0.799505
dtype: float64

Ok, so `DummyClassifier` gets

In [24]:
DummyClassifier(strategy='prior').fit(None, y_train).score(None, y_train)

0.7605190417690417

- And here we do a few percent better.
- But let's look at the coefficients:

In [39]:
lr.fit(X_train_imp[numeric_features], y_train);

In [26]:
pd.DataFrame(data=lr.coef_[0], index=numeric_features, columns=['Coefficient'])

Unnamed: 0,Coefficient
age,-0.007233
fnlwgt,-4e-06
education.num,-0.001697
capital.gain,0.000337
capital.loss,0.000785
hours.per.week,-0.007883


- What we see here is a very small coefficient for `fnlwgt` (description of this feature [here](https://www.kaggle.com/uciml/adult-census-income), I couldn't quite decipher it).
- Why is this coefficient so small?

In [27]:
X_train_nan.describe()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0
mean,38.586686,189229.5,10.070485,1075.695754,87.629991,40.433239
std,13.619181,105000.5,2.572231,7334.297499,404.192112,12.346313
min,17.0,13769.0,1.0,0.0,0.0,1.0
25%,28.0,117583.0,9.0,0.0,0.0,40.0
50%,37.0,177785.0,10.0,0.0,0.0,40.0
75%,48.0,236885.2,12.0,0.0,0.0,45.0
max,90.0,1366120.0,16.0,99999.0,4356.0,99.0


- Answer: because the values are so big (the mean is about 190,000), even a tiny coefficient produces a sizeable contribution to the prediction.
- And what if these values happened to be even larger? Or what if capital gain/loss was measured in thousands of dollars?

In [40]:
X_train_mod = X_train_imp[numeric_features].copy()
X_train_mod["capital.gain"] /= 1000
X_train_mod["capital.loss"] /= 1000
X_train_mod["fnlwgt"] *= 1000

In [41]:
X_train_mod.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,20,110998000,10,0.0,0.0,30
18434,22,263670000,9,0.0,0.0,80
3294,51,335997000,9,4.386,0.0,55
31317,53,111939000,13,0.0,0.0,35
4770,52,51048000,13,0.0,0.0,55


In [42]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_mod, y_train, return_train_score=True)).mean()

fit_time       0.056197
score_time     0.012094
test_score     0.760519
train_score    0.760519
dtype: float64

- Now our train & test scores went down to basically `DummyClassifier` level!
- But what is up with that, these units are arbitrary to begin with!!
- BTW, decision trees don't have this problem because they're only about thresholds, rather than crunching the actual number.
  - [Great post on Piazza](https://piazza.com/class/kb2e6nwu3uj23?cid=256) pointing out something similar with the spacing in ordinal encodings!

In [48]:
dt = DecisionTreeClassifier(random_state=1)
cross_val_score(dt, X_train_imp[numeric_features], y_train).mean()

0.7707694308794502

In [49]:
dt = DecisionTreeClassifier(random_state=1)
cross_val_score(dt, X_train_mod, y_train).mean()

0.7707694308794502

- But this problem affects plenty of ML methods.
- So it would be nice to just take care of this issue.
- The general approach is to rescale the features.
- Two specific approaches we'll cover are standardization and normalization.
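
One way to see why the units matter, sketched under the assumption that `LogisticRegression` is used with its default L2 penalty: the score is linear in the features, so multiplying feature $j$ by a constant $c_j$ can be undone by dividing its weight by $c_j$.

$$
z = w_0 + \sum_j w_j x_j, \qquad x_j' = c_j\,x_j \;\Longrightarrow\; w_j' = \frac{w_j}{c_j} \text{ reproduces the same } z.
$$

The penalty, however, is computed on the weights themselves, so the equivalent solution is no longer penalized the same way:

$$
\frac{1}{2}\sum_j w_j^2 \quad\longrightarrow\quad \frac{1}{2}\sum_j \frac{w_j^2}{c_j^2}.
$$

With $c_j = 1/1000$ (capital gain in thousands of dollars) the equivalent weight is $1000$ times larger and its penalty $10^6$ times larger. This is a rough argument rather than a full account; the optimizer may also struggle numerically when features differ by many orders of magnitude.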

## Q&A

(Pause for Q&A)

<br><br><br><br>

| Approach | What it does | How to update $X$ (but see below!) | sklearn implementation |
|---------|------------|-----------------------|----------------|
| normalization | sets range to $[0,1]$ | `X -= np.min(X,axis=0)`<br>`X /= np.max(X,axis=0)` | [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) |
| standardization | sets sample mean to $0$, s.d. to $1$ | `X -= np.mean(X,axis=0)`<br>`X /=  np.std(X,axis=0)` | [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) |

There are all sorts of articles on this; see, e.g. [here](http://www.dataminingblog.com/standardization-vs-normalization/) and [here](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc).
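
The formulas in the table can be checked against the sklearn transformers on a small array. This is a sketch with made-up numbers, applied to a single data set only (so it ignores the train/test distinction).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales.
X = np.array([[20.0, 110998.0],
              [22.0, 263670.0],
              [51.0, 335997.0],
              [53.0, 111939.0]])

# Normalization by hand: subtract the column min, divide by the column range.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization by hand: subtract the column mean, divide by the column s.d.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

norm_matches = np.allclose(X_norm, MinMaxScaler().fit_transform(X))
std_matches = np.allclose(X_std, StandardScaler().fit_transform(X))
print(norm_matches, std_matches)
```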

Let's use these scaling methods:

In [35]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [51]:
scaler = StandardScaler()
scaler.fit(X_train_imp[numeric_features]);

In [52]:
scaler.transform(X_train_imp[numeric_features])

array([[-1.36476947, -0.74507317, -0.02740291, -0.14666932, -0.21680698,
        -0.84506515],
       [-1.21791495,  0.70896709, -0.41617802, -0.14666932, -0.21680698,
         3.20480459],
       [ 0.91147565,  1.39780572, -0.41617802,  0.45135445, -0.21680698,
         1.17986972],
       ...,
       [-0.9242059 ,  1.75548713, -0.02740291, -0.14666932, -0.21680698,
         0.12690359],
       [-0.77735138, -0.32008602, -2.74882863, -0.14666932, -0.21680698,
        -0.0350912 ],
       [ 0.10377577, -0.36129614, -0.41617802, -0.14666932, -0.21680698,
         0.61288796]])

In [58]:
scaled_train_df = pd.DataFrame(scaler.transform(X_train_imp[numeric_features]),
                           columns=numeric_features, index=X_train_imp.index)
scaled_train_df.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.17987
31317,1.05833,-0.736111,1.138922,-0.146669,-0.216807,-0.440078
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.17987


- Note the same Golden Rule issue we talked about before.
  - We fit the transformer on the training data, and then transform both data sets.
  - We need to use a Pipeline for cross-validation because the transformation of each row depends on the other rows, so the scaler must be fit only on the training portion of each fold.

In [59]:
scaled_test_df = pd.DataFrame(scaler.transform(X_test_imp[numeric_features]),
                           columns=numeric_features, index=X_test_imp.index)

Let's check that it did what we expected:

In [60]:
scaled_train_df.mean(axis=0)

age               3.634832e-16
fnlwgt           -4.961863e-17
education.num     1.371028e-15
capital.gain     -3.863724e-16
capital.loss      7.671122e-16
hours.per.week    9.381657e-16
dtype: float64

These are basically all zero ($10^{-16}$ is zero up to floating-point precision).

In [61]:
scaled_train_df.std(axis=0)

age               1.000019
fnlwgt            1.000019
education.num     1.000019
capital.gain      1.000019
capital.loss      1.000019
hours.per.week    1.000019
dtype: float64
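
The standard deviations above are 1.000019 rather than exactly 1. A likely explanation is that pandas' `.std()` divides by $n-1$ by default (`ddof=1`), whereas `StandardScaler` divides by $n$. The sketch below checks this on made-up random data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data: 100 random values in one column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=100)})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

std_population = scaled["x"].std(ddof=0)  # divides by n, like StandardScaler
std_sample = scaled["x"].std()            # pandas default divides by n - 1
n = len(scaled)

print(std_population, std_sample, np.sqrt(n / (n - 1)))

# The same ratio for the 26048 training rows used above.
print(np.sqrt(26048 / 26047))
```

The last line evaluates to roughly 1.000019, which matches the values shown above.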

Note that for test we get something different - that is OK!!

In [63]:
scaled_test_df.mean(axis=0)

age              -0.001850
fnlwgt            0.026132
education.num     0.019814
capital.gain      0.001331
capital.loss     -0.004034
hours.per.week    0.001708
dtype: float64

In [64]:
scaled_test_df.std(axis=0)

age               1.007872
fnlwgt            1.025728
education.num     1.000891
capital.gain      1.034391
capital.loss      0.984757
hours.per.week    1.000546
dtype: float64

Let's re-run our experiments now.

1. Without scaling

In [70]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.111493
score_time     0.011850
test_score     0.799217
train_score    0.799505
dtype: float64

2. With scaling

In [71]:
pipe = Pipeline([('scaling', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])

In [72]:
pd.DataFrame(cross_validate(pipe, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.056782
score_time     0.011920
test_score     0.814727
train_score    0.815024
dtype: float64

Here we actually do a little better! Cool.

3. After messing with the data by rescaling some features

In [73]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_mod[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.054141
score_time     0.012030
test_score     0.760519
train_score    0.760519
dtype: float64

These are the same bad results we saw earlier.

4. After messing with the data, but using feature scaling

In [71]:
pipe = Pipeline([('scaling', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])

In [74]:
pd.DataFrame(cross_validate(pipe, X_train_mod[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.057424
score_time     0.012416
test_score     0.814727
train_score    0.815024
dtype: float64

BAM! The scaling always sets the variance to 1, so the fact that we scaled up/down by 1000 is irrelevant!
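
The same point can be verified directly on a small made-up array: standardizing a feature gives the same result whether or not it was first divided by 1000.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column plays the role of capital gain in dollars.
X = np.array([[20.0, 0.0],
              [22.0, 4386.0],
              [51.0, 0.0],
              [53.0, 1000.0]])

# Same data, but with the second column expressed in thousands of dollars.
X_rescaled = X.copy()
X_rescaled[:, 1] /= 1000

same = np.allclose(StandardScaler().fit_transform(X),
                   StandardScaler().fit_transform(X_rescaled))
print(same)
```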

## Q&A

(Pause for Q&A)

<br><br><br><br>

We can redo the same experiments but with min/max scaling:

In [75]:
pipe = Pipeline([('scaling', MinMaxScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])

In [76]:
pd.DataFrame(cross_validate(pipe, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.077138
score_time     0.011946
test_score     0.810427
train_score    0.810888
dtype: float64

- Here, we get similar results. 
- We can also check that it does what it's supposed to do.

In [77]:
minmax = MinMaxScaler()
minmax.fit(X_train_imp[numeric_features])
normalized_train_df = minmax.transform(X_train_imp[numeric_features])
normalized_test_df = minmax.transform(X_test_imp[numeric_features])

Let's again check the results:

In [78]:
normalized_train_df.min(axis=0)

array([0., 0., 0., 0., 0., 0.])

In [79]:
normalized_train_df.max(axis=0)

array([1., 1., 1., 1., 1., 1.])

And again for test:

In [80]:
normalized_test_df.min(axis=0)

array([ 0.        , -0.00109735,  0.        ,  0.        ,  0.        ,
        0.        ])

In [81]:
normalized_test_df.max(axis=0)

array([1.        , 1.08768803, 1.        , 1.        , 0.84550046,
       1.        ])
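
The test values fall slightly outside $[0,1]$ because the min and max were learned from the training set only. The sketch below reproduces this on made-up numbers; it also shows the `clip` option of `MinMaxScaler`, which is available in recent scikit-learn versions and forces transformed values back into the range (whether that is desirable depends on the application).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the test set contains values outside the training range.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[5.0], [40.0]])

plain = MinMaxScaler().fit(X_train_toy)
clipped = MinMaxScaler(clip=True).fit(X_train_toy)

test_plain = plain.transform(X_test_toy)      # can leave [0, 1]
test_clipped = clipped.transform(X_test_toy)  # clipped to [0, 1]

print(test_plain.ravel())
print(test_clipped.ravel())
```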

#### Preprocessing the targets?

- We'll discuss this when we get to numeric targets (regression) in a couple weeks

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Break (5 min)

<br><br><br><br>

## Putting it all together with `ColumnTransformer` (20 min)


![](img/column-transformer.png)

Adapted from [here](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#37).

- Can we do better? What about our discussion of scaling from earlier?
- Our features are not on the same scale and our encodings are getting "drowned out" by `age` and `weight`.
- We should preprocess both numeric features (e.g., scaling) and categorical features (e.g., OHE).
- sklearn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) makes this more manageable.
  - A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data. 
  - Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.
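
As a self-contained illustration of the idea, here is a sketch that combines imputation, scaling and OHE in one `ColumnTransformer` and puts it in a Pipeline with a classifier. The data is made up (census-like column names, invented values), and the names `num_cols`, `cat_cols` and `full_pipe` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with two numeric features and one categorical feature with gaps.
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 23, 61, 35, 44, 29, 58, 41, 33],
    "hours.per.week": [40, 50, 45, 60, 20, 40, 38, 55, 30, 50, 45, 40],
    "workclass": ["Private", np.nan, "Private", "Self-emp", "Private", "State-gov",
                  np.nan, "Self-emp", "Private", "State-gov", "Private", "Private"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K",
               "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"],
})
X_toy = toy.drop(columns=["income"])
y_toy = toy["income"]

num_cols = ["age", "hours.per.week"]
cat_cols = ["workclass"]

# Categorical columns: impute first, then one-hot encode.
categorical_pipe = Pipeline([("imputation", SimpleImputer(strategy="most_frequent")),
                             ("ohe", OneHotEncoder(handle_unknown="ignore"))])

# Each transformer is applied only to its own list of columns.
preprocessor = ColumnTransformer([("scale", StandardScaler(), num_cols),
                                  ("cat", categorical_pipe, cat_cols)])

full_pipe = Pipeline([("preprocessor", preprocessor),
                      ("lr", LogisticRegression(max_iter=1000))])

# All preprocessing is re-fit inside each cross-validation fold.
scores = pd.DataFrame(cross_validate(full_pipe, X_toy, y_toy, cv=3,
                                     return_train_score=True))
print(scores[["test_score", "train_score"]])
```

Because the preprocessing lives inside the Pipeline, cross-validation never fits the scaler or the imputer on validation rows.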

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
# Identify the categorical and numeric columns
numeric_features = ['age', 'weight']
categorical_features = ['treatment']

In [None]:
transformers=[
    ('scale', StandardScaler(), numeric_features),
    ('ohe', OneHotEncoder(drop='first'), categorical_features)]

In [None]:
# Create the transformer
preprocessor = ColumnTransformer(transformers=transformers)

When we fit the preprocessor, it fits _all_ the transformers.

In [None]:
preprocessor.fit(X);

We can get the new names of the columns that were generated by the one-hot encoding:

In [None]:
preprocessor.named_transformers_['ohe'].get_feature_names_out(categorical_features)

Combining this with the numeric feature names gives us all the column names:

In [None]:
columns = numeric_features + list(preprocessor.named_transformers_['ohe']
                                     .get_feature_names_out(categorical_features))
columns

Like fit, when we transform with the preprocessor, it calls `transform` on _all_ the transformers.

In [None]:
# Apply data transformations and convert back to dataframe
X_ohe_scale = pd.DataFrame(preprocessor.transform(X),
                       index=X.index,
                       columns=columns)

In [None]:
X_ohe_scale.head()

- Side note: the `ColumnTransformer` will automatically remove columns that are not being transformed:

In [None]:
preprocessor = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), categorical_features)])

preprocessor.fit_transform(X)

Using `remainder='passthrough'` keeps the other columns intact:

In [None]:
preprocessor = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), categorical_features)], 
                                 remainder='passthrough')

preprocessor.fit_transform(X)

Hyperopt code from lecture 5:

In [None]:
param_grid = {
              "n_estimators"     : [10,100],
              "max_depth"        : [3, None],
              "max_features"     : [3, None]
             }
param_grid

- How many combinations in total? 
- $2\times 2\times 2=8$

In [None]:
np.prod(list(map(len, param_grid.values())))

In [None]:
rf = RandomForestClassifier(random_state=321)
grid_search = GridSearchCV(rf, param_grid, cv=3, verbose=1)

In [None]:
grid_search.fit(X_train_transformed, y_train);

In [None]:
grid_search.best_params_

- lol... these are the default values.
- I guess they picked good defaults!

In [None]:
grid_search.best_score_

In [None]:
pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'param_max_depth', 'param_max_features', 'param_n_estimators', 'mean_fit_time', 'rank_test_score']].set_index("rank_test_score").sort_index()

- Note that the grid search object acts like a scikit-learn model.
- It was actually refit on the _whole_ training set, as discussed earlier in the course!
- With the default `refit=True`, calling `predict` on the grid search object uses `grid_search.best_estimator_`.
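
This can be checked on a small synthetic problem. The sketch below uses `make_classification` data and a reduced grid (both are assumptions for illustration, not the lecture's data) and compares the two sets of predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, only for demonstrating the API.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_grid = {"n_estimators": [5, 10], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=321), demo_grid, cv=3)
grid.fit(X_demo, y_demo)  # refit=True by default: best model is refit on all of X_demo

# Predicting through the search object vs. through the refit best estimator.
same_predictions = np.array_equal(grid.predict(X_demo),
                                  grid.best_estimator_.predict(X_demo))
print(same_predictions)
```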

In [None]:
grid_search.predict(X_test_transformed)

## TODO

using `ColumnTransformer` to combine multiple OHEs where you only specify the categories in some

note: actually you can use the `categories=[cats_1, cats_2]` -> too bad a dict is not allowed here...

## TODO

look at old hyperopt lecture, get the hyperopt code that goes multiple levels deep with `__` because we now have many transformers 
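
Until that code is added, here is a minimal sketch of how nested parameter names might look when a Pipeline contains a `ColumnTransformer` that itself contains a Pipeline. The data, step names and grid values are all made up for illustration; each `__` moves one level deeper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: a numeric column with gaps and a categorical column.
X_toy = pd.DataFrame({
    "age": [25, np.nan, 47, 52, 23, 61, 35, np.nan, 29, 58, 41, 33],
    "workclass": ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov",
                  "Private", "Self-emp", "Private", "State-gov", "Private", "Private"],
})
y_toy = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K",
         "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"]

numeric_pipe = Pipeline([("imputer", SimpleImputer()),
                         ("scaler", StandardScaler())])
preprocessor = ColumnTransformer([("num", numeric_pipe, ["age"]),
                                  ("cat", OneHotEncoder(handle_unknown="ignore"),
                                   ["workclass"])])
nested_pipe = Pipeline([("preprocessor", preprocessor),
                        ("lr", LogisticRegression(max_iter=1000))])

# Pipeline step -> ColumnTransformer entry -> inner Pipeline step -> parameter.
nested_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "lr__C": [0.1, 1.0, 10.0],
}

nested_search = GridSearchCV(nested_pipe, nested_grid, cv=3)
nested_search.fit(X_toy, y_toy)
print(nested_search.best_params_)
```

Calling `nested_pipe.get_params().keys()` lists every valid name, which is a handy way to find the right `__` path.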

## Try different classifiers (10 min)

- Let's use cross-validation with our training set.
- This is implemented in `cross_val_score`.

In [None]:
from sklearn.model_selection import cross_val_score

Example:

In [None]:
lr = LogisticRegression(max_iter=300)

In [None]:
scores = cross_val_score(lr, X_train_scale_ohe, y_train, cv=5)

In [None]:
scores

In [None]:
1.0 - np.mean(scores)

There is also the slightly more sophisticated `cross_validate`, which gives us some extra information:

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
lr = LogisticRegression(max_iter=300)

In [None]:
scores = cross_validate(lr, X_train_scale_ohe, y_train, cv=5, return_train_score=True)

In [None]:
scores

In [None]:
scores_df = pd.DataFrame(scores).drop(columns=['score_time'])
scores_df

In [None]:
scores_df.mean()

Now let's try some different classifiers:

In [None]:
models = {'decision tree'      : DecisionTreeClassifier(),
          'logistic regression': LogisticRegression(max_iter=300),
          'RBF SVM'            : SVC(max_iter=1500), 
          'random forest'      : RandomForestClassifier(), 
         }
avg_scores = dict()

for model_name, model in models.items():
    print(model_name)
    scores = cross_validate(model, X_train_scale_ohe, y_train, cv=5, return_train_score=True)
    avg_scores[model_name] = pd.DataFrame(scores).drop(columns=['score_time']).mean()

In [None]:
pd.DataFrame(avg_scores).T

Let's discuss these results:

- Which methods are overfitting?
- Which methods are underfitting?
- Which methods are fast/slow?
- What is the best method so far?
- What hyperparameters should we tune?

In [None]:
models = {'decision tree'      : DecisionTreeClassifier(),
          'logistic regression orig': LogisticRegression(max_iter=300),
          'logistic regression': LogisticRegression(max_iter=300, C=100),
          'random forest orig'      : RandomForestClassifier(), 
          'random forest simpler'   : RandomForestClassifier(n_estimators=10)
         }
avg_scores = dict()

for model_name, model in models.items():
    print(model_name)
    scores = cross_validate(model, X_train_scale_ohe, y_train, cv=5, return_train_score=True)
    avg_scores[model_name] = pd.DataFrame(scores).drop(columns=['score_time']).mean()

In [None]:
pd.DataFrame(avg_scores).T

## Summary (5 min)

TODO