# CPSC 330 Lecture 6

#### Lecture outline

- 👋
- **Turn on recording**
- Announcements 
- Optimization bias (20 min)
- True/False (10 min)
- Break (5 min)
- Dataset of the week (5 min)
- Categorical variables (40 min)

## Learning outcomes

- Explain why and when optimization bias occurs.
- Use a test set appropriately to detect optimization bias.
- Appropriately select encodings for categorical variables (e.g. ordinal vs. OHE).
- Appropriately set up one-hot encoding given a problem.
- Use the `handle_unknown` hyperparameter of scikit-learn's `OneHotEncoder`.

## Announcements

- hw2 grades posted
- hw3 deadline passed and solutions posted
- no hw this week, hw4 will be a week later on the usual schedule
- thanks for your responses on the survey; see https://piazza.com/class/kb2e6nwu3uj23?cid=209
- Extra OH have been added on Thursdays and Fridays

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16

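from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (train_test_split, cross_validate,
                                     GridSearchCV, RandomizedSearchCV)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler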
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
from sklearn.compose import ColumnTransformer

from pandas_profiling import ProfileReport

## Optimization bias (20 min)

- So, why do we keep setting aside a test set all the time?
- Why not just use cross-validation on our whole dataset? After all, cross-validation measures our performance on unseen data. 

# TODO

small dataset, very fine grid of `C`
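
A minimal sketch of that setup, assuming synthetic data with coin-flip labels (this is not the lecture's own demo, and all names below are made up for illustration). Because the labels are random, no model can truly do better than about 50% accuracy, yet picking the best of many `C` values by cross-validation will tend to report something more optimistic than the typical candidate. The test set is what lets us check that number.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption for illustration): 100 examples, 10 features,
# and labels that are pure coin flips, so there is nothing real to learn.
rng = np.random.RandomState(0)
X_toy = rng.randn(100, 10)
y_toy = rng.randint(0, 2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)

# A very fine grid of C: 100 candidates, each scored by 5-fold cross-validation.
param_grid = {"C": np.logspace(-4, 2, 100)}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_tr, y_tr)

cv_scores = search.cv_results_["mean_test_score"]
print(f"Best CV accuracy:    {search.best_score_:.3f}")
print(f"Typical CV accuracy: {np.median(cv_scores):.3f}")
print(f"Test accuracy:       {search.score(X_te, y_te):.3f}")
```

The best CV score is the maximum over 100 noisy estimates, so it is at least as high as the typical one by construction. The more candidates we try and the less data we have, the more that maximum is inflated by luck.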

#### The size of your dataset (5 min)

- A very important factor here is **how much data you have**.
- With infinite amounts of training data, **overfitting would not be a problem** and you could have $E_\textrm{train}=E_\textrm{test}=E_\textrm{best}$
  - This is a subtle point that takes time to digest.
  - But think of it this way: overfitting happens because you only see a bit of data and you learn patterns that are overly specific to your sample.
  - If you saw "all" the data, then the notion of "overly specific" would not apply.
- So, more data will make $E_\textrm{test}$ better.
- But furthermore, it will make your _error estimates_ less noisy.
- Remember, `train_test_split` is a random split.
- What if you split differently?

In [None]:
make_traintest_plot(X_train, y_train, X_test, y_test);

In [None]:
df_train_2, df_test_2 = train_test_split(df_nodup, random_state=1000)

X_train_2 = df_train_2[["meat", "grade"]]
y_train_2 = df_train_2["cilantro"]
X_test_2  = df_test_2[["meat", "grade"]]
y_test_2  = df_test_2["cilantro"]
make_traintest_plot(X_train_2, y_train_2, X_test_2, y_test_2);

- All the scores are different just from a different random split.
- This effect is stronger when you have less data.
- Your level of trust in everything you're doing **goes up with more data**.
- After the break we'll talk about a strategy for mitigating (not solving!) this.
- But first...

## This week's dataset: income prediction from adult census data (5 min)


- We will use a classic dataset from the UCI dataset repository.
  - This one is about predicting income from adult census data.
  - You can download the dataset [here](https://www.kaggle.com/uciml/adult-census-income).
  - To run the code, you will need to download it and put it in the `data` folder.

In [5]:
census = pd.read_csv('data/adult.csv')
census_train, census_test = train_test_split(census, test_size=0.2, random_state=123)

## EDA and pandas profiling (15 min)

Let's start with some exploratory data analysis (EDA):

In [9]:
census_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
17064,20,Private,110998,Some-college,10,Never-married,Adm-clerical,Own-child,Asian-Pac-Islander,Female,0,0,30,United-States,<=50K
18434,22,Private,263670,HS-grad,9,Never-married,Other-service,Own-child,Black,Male,0,0,80,United-States,<=50K
3294,51,Private,335997,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,4386,0,55,United-States,>50K
31317,53,Private,111939,Bachelors,13,Married-civ-spouse,Other-service,Husband,White,Male,0,0,35,United-States,<=50K
4770,52,Self-emp-inc,51048,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,0,55,United-States,<=50K


In [6]:
census_train.shape

(26048, 15)

In [7]:
census_train.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [8]:
census_train.describe()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0
mean,38.586686,189229.5,10.070485,1075.695754,87.629991,40.433239
std,13.619181,105000.5,2.572231,7334.297499,404.192112,12.346313
min,17.0,13769.0,1.0,0.0,0.0,1.0
25%,28.0,117583.0,9.0,0.0,0.0,40.0
50%,37.0,177785.0,10.0,0.0,0.0,40.0
75%,48.0,236885.2,12.0,0.0,0.0,45.0
max,90.0,1366120.0,16.0,99999.0,4356.0,99.0


Let's explore the target value - this is often useful!

In [10]:
census_train[["income"]].describe()

Unnamed: 0,income
count,26048
unique,2
top,<=50K
freq,19810
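
The counts above already tell us the class balance. As a small sketch, the snippet below rebuilds a Series with the same counts (so it runs without the CSV) and computes the class proportions; on the real data, `census_train["income"].value_counts(normalize=True)` should give the same information.

```python
import pandas as pd

# Rebuild the class counts reported by describe() above:
# 26048 training rows, of which 19810 are "<=50K".
income = pd.Series(["<=50K"] * 19810 + [">50K"] * (26048 - 19810), name="income")

proportions = income.value_counts(normalize=True)
print(proportions)

# Accuracy of always predicting the most frequent class.
baseline_accuracy = proportions.max()
print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
```

This majority-class accuracy is a useful reference point: any model we train should beat it.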


- `pandas_profiling` is a fancy tool for EDA
- The term "profiling" is a bit unfortunate - this has nothing to do with timing code execution.

In [11]:
from pandas_profiling import ProfileReport

In [13]:
profile = ProfileReport(census_train, title='Pandas Profiling Report') #, minimal=True)

In [14]:
# profile.to_file('profile_report.html')

This next line can take several minutes...

In [None]:
profile.to_notebook_iframe()


## Break (5 min)

## Encoding of categorical variables (25 mins)

From GH issue:

- first ask yourself "could you get new unseen values in deployment"
- if you make an "unknown" designation then in deployment, 3 different unknown categories would be treated the same. good to think about those types of nuances.
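
A small sketch of what happens with unseen values, using a hypothetical toy column (the values `D` and `E` are made up). By default `OneHotEncoder` raises an error on categories it did not see during `fit`; with `handle_unknown='ignore'` every unseen value becomes a row of zeros, so different unknown categories are indeed treated identically.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the training set only ever contains A, B and C.
train = pd.DataFrame({"treatment": ["A", "B", "C", "A"]})
deploy = pd.DataFrame({"treatment": ["B", "D", "E"]})  # D and E are new

# Default behaviour (handle_unknown='error'): transform refuses unseen values.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(deploy)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# handle_unknown='ignore': unseen values become a row of all zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(deploy).toarray()
print(encoded)
```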

From GH issue:

- If you have a lot of unique values, a possible alternative would be to OHE the more frequently occurring manufacturers and then have an "other" category for the rest. These categories can be passed into `OneHotEncoder` with the `categories` parameter. Should talk about this - it seems useful! Yes, maybe even the default. Also we should get used to using different OHE objects for different columns. Currently, e.g. on the final exam, I get ridiculous stuff like 2 columns for a binary feature.
- Also, following up on the above, we should more often pass in the categories regardless, so that things work even without an "other" category. Often we do know all the possible values, e.g. the provinces of Canada, and I think it's OK to put them in?
- Need to do a very thorough treatment of drop='first' and handle_unknown='ignore'.
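
A rough sketch of the "other" idea, assuming a made-up `manufacturer` column and an arbitrary frequency cutoff of 2 (both are illustrative choices, not from the course data). Rare values are lumped together with pandas first, and the resulting category list is then passed explicitly through the `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with two frequent values and two rare ones.
cars = pd.DataFrame(
    {"manufacturer": ["Ford", "Ford", "Toyota", "Toyota", "Ford", "Lada", "Saab"]})

# Keep values that occur at least twice; lump everything else into "other".
counts = cars["manufacturer"].value_counts()
frequent = sorted(counts[counts >= 2].index)
cars["manufacturer"] = cars["manufacturer"].where(
    cars["manufacturer"].isin(frequent), "other")

# Pass the category list explicitly so the output columns are fixed up front.
categories = frequent + ["other"]
ohe_other = OneHotEncoder(categories=[categories])
encoded = ohe_other.fit_transform(cars[["manufacturer"]]).toarray()
print(categories)
print(encoded)
```

In deployment the same mapping to "other" would have to be applied before calling `transform`, otherwise a new manufacturer would still count as an unknown category.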

From Piazza:

When you have a discrete, ordered feature (such as the [cold, warm, hot] example), the numerical value associated with a given feature example (i.e. cold=0, warm=1, hot=2) has meaning beyond just being a label. If this was a feature being used to classify drinks as iced tea or hot chocolate, the model would learn that the higher the value of this feature is for a given example, the more likely it is that the drink is a hot chocolate. The weights it learns would reflect this; the weight associated with this feature would be set such that larger values for the feature would result in a larger probability or score for the label hot chocolate. Note that here I am assuming we are using a model that learns weights (such as linear regression or SVMs) rather than a decision tree, but the logic holds regardless.

In another feature where there is no ordering (say container = [paper cup, glass, mug]), if we gave values to the categories in the same way (paper cup = 0, glass = 1, mug = 2) the model would try to learn weights in the same way, looking for a correlation between the magnitude of the value of the feature and the correct label. Except here there isn't this type of correlation; values of 0 or 2 are more likely for hot chocolate and 1 is more likely for iced tea. By assigning numerical values to the categories we have imposed an ordering where there isn't one, and the model's only options are to conclude that the feature isn't useful and assign it a very small weight, or to try and find a correlation where there isn't one. Either case is bad; if the model concludes the feature isn't useful we are wasting information since the feature really does contain relevant information, and if it tries to find a pattern it is likely to overfit since it is assigning meaning to the numerical values where a meaning doesn't exist. Using a one hot encoding fixes these issues since we don't impose any false ordering on the data but all of the information is still provided to the model.
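
The container example can be tried out directly. This is a toy sketch with made-up data (10 drinks per container) in which paper cups and mugs always hold hot chocolate and glasses always hold iced tea. Scores are computed on the training data itself, just to show what each encoding allows a linear model to represent.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data following the example: paper cups and mugs hold hot chocolate,
# glasses hold iced tea (10 drinks of each container, made up for illustration).
containers = ["paper cup", "glass", "mug"]
drinks = pd.DataFrame({"container": containers * 10})
label = drinks["container"].map(
    {"paper cup": "hot chocolate", "glass": "iced tea", "mug": "hot chocolate"})

# Ordinal encoding with the arbitrary order paper cup=0, glass=1, mug=2.
X_ordinal = OrdinalEncoder(categories=[containers]).fit_transform(drinks)
acc_ordinal = LogisticRegression().fit(X_ordinal, label).score(X_ordinal, label)

# One-hot encoding: one binary column per container, no implied order.
X_onehot = OneHotEncoder().fit_transform(drinks)
acc_onehot = LogisticRegression().fit(X_onehot, label).score(X_onehot, label)

print(f"Ordinal encoding accuracy: {acc_ordinal:.2f}")
print(f"One-hot encoding accuracy: {acc_onehot:.2f}")
```

With a single ordinal feature, a linear model can only place one threshold along 0, 1, 2, so it cannot separate the middle value from both ends. With one-hot columns, each container gets its own weight.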

- In scikit-learn, most algorithms require numeric inputs.
- That means we have to convert categorical features to numeric values.
- We will do this twice: with a toy dataset for illustration, and then with a bigger dataset to compute scores.

- First dataset: we are looking at the `effect` of different treatments for depression.
- We would like to use `age`, `weight`, and `treatment` to predict `effect`.

In [90]:
df = pd.read_csv('data/depression_data.csv')
df.head()

Unnamed: 0,age,weight,treatment,effect
0,21,57.8,A,56
1,23,60.3,B,41
2,30,60.5,B,40
3,19,38.7,C,28
4,28,96.5,A,55


In [91]:
X = df[['age', 'weight', 'treatment']]
y = df['effect'] 

- Let's try to build a `LogisticRegression` classifier on this data.
- This will fail because we have non-numeric data.

In [92]:
lr = LogisticRegression()
lr.fit(X, y)



ValueError: could not convert string to float: 'C'

#### No encoding (not recommended!)

- For starters, let's just delete the categorical feature.

In [93]:
X_drop = X.drop(columns=["treatment"])
X_drop.head()

Unnamed: 0,age,weight
0,21,57.8
1,23,60.3
2,30,60.5
3,19,38.7
4,28,96.5


#### Ordinal encoding (occasionally recommended)

- Here we simply assign an integer to each of our unique categorical labels
- We can use sklearn's [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).

In [95]:
from sklearn.preprocessing import OrdinalEncoder

In [108]:
oe = OrdinalEncoder(dtype='int')
oe.fit(X[['treatment']])

X_ord = X.copy()
X_ord['treatment_encoded'] = oe.transform(X[['treatment']])
X_ord.head()

Unnamed: 0,age,weight,treatment,treatment_encoded
0,21,57.8,A,0
1,23,60.3,B,1
2,30,60.5,B,1
3,19,38.7,C,2
4,28,96.5,A,0


In [109]:
X_ord = X_ord.drop(columns='treatment')
X_ord.head()

Unnamed: 0,age,weight,treatment_encoded
0,21,57.8,0
1,23,60.3,1
2,30,60.5,1
3,19,38.7,2
4,28,96.5,0


- Can you see a big problem with this approach?
  - We have imposed ordinality on the categorical data.
  - For example, Treatment C (2) is "closer" to Treatment B (1) than to Treatment A (0).
  - In general, ordinal encoding is useful if there is ordinality in your data, e.g., `[cold, warm, hot]`.
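
When the data really is ordered, the order has to be given explicitly. A small sketch with a hypothetical `temperature` column: by default `OrdinalEncoder` sorts the categories, which for strings means alphabetical order, so the intended order is passed through the `categories` argument.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

temps = pd.DataFrame({"temperature": ["cold", "warm", "hot", "warm"]})

# Default: categories are sorted, so the codes follow alphabetical order
# (cold=0, hot=1, warm=2), which is not the order we mean.
default_codes = OrdinalEncoder().fit_transform(temps).ravel().astype(int)

# Explicit order via the `categories` argument: cold=0, warm=1, hot=2.
ordered = OrdinalEncoder(categories=[["cold", "warm", "hot"]])
ordered_codes = ordered.fit_transform(temps).ravel().astype(int)

print(default_codes)
print(ordered_codes)
```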

#### One-hot encoding (OHE)
- Rather than assigning integer labels to our data, we create new binary columns to represent our categories.
- If we have $c$ categories in our column, we create $c$ new binary columns to represent those categories.
- We can use sklearn's [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

In [110]:
from sklearn.preprocessing import OneHotEncoder

In [111]:
ohe = OneHotEncoder(sparse=False, dtype='int')

In [112]:
ohe.fit(X[['treatment']])

X_ohe = pd.concat((X, pd.DataFrame(ohe.transform(X[['treatment']]),
                           columns=ohe.categories_[0], index=X.index)), axis=1)
X_ohe.head()

Unnamed: 0,age,weight,treatment,A,B,C
0,21,57.8,A,1,0,0
1,23,60.3,B,0,1,0
2,30,60.5,B,0,1,0
3,19,38.7,C,0,0,1
4,28,96.5,A,1,0,0


- We put a "1" for the actual value and "0" otherwise.
- But in fact, do we need all 3 columns?
  - After all, if `B` is 0 and `C` is 0, then `A` must be 1, right?
  - So, we drop one of the columns (usually the first one by convention).
- We can specify this in `OneHotEncoder()` by using the argument `drop='first'`

In [117]:
ohe = OneHotEncoder(drop='first', sparse=False, dtype='int')

ohe.fit(X[['treatment']])

X_ohe = pd.concat((X, pd.DataFrame(ohe.transform(X[['treatment']]),
                           columns=ohe.categories_[0][1:], index=X.index)), axis=1)
X_ohe.head()

Unnamed: 0,age,weight,treatment,B,C
0,21,57.8,A,0,0
1,23,60.3,B,1,0
2,30,60.5,B,1,0
3,19,38.7,C,0,1
4,28,96.5,A,0,0


In [118]:
X_ohe = X_ohe.drop(columns='treatment')
X_ohe.head()

Unnamed: 0,age,weight,B,C
0,21,57.8,0,0
1,23,60.3,1,0
2,30,60.5,1,0
3,19,38.7,0,1
4,28,96.5,0,0


- Can we do better? What about our discussion of scaling from earlier?
- Our features are not on the same scale and our encodings are getting "drowned out" by `age` and `weight`.
- We should preprocess both numeric features (e.g., scaling) and categorical features (e.g., OHE).
- sklearn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) makes this more manageable.
  - A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data. 
  - Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.

In [119]:
from sklearn.compose import ColumnTransformer

In [120]:
# Identify the categorical and numeric columns
numeric_features = ['age', 'weight']
categorical_features = ['treatment']

In [122]:
transformers=[
    ('scale', StandardScaler(), numeric_features),
    ('ohe', OneHotEncoder(drop='first'), categorical_features)]

In [123]:
# Create the transformer
preprocessor = ColumnTransformer(transformers=transformers)

When we fit the preprocessor, it fits _all_ the transformers.

In [125]:
preprocessor.fit(X);

We can get the new names of the columns that were generated by the one-hot encoding:

In [126]:
preprocessor.named_transformers_['ohe'].get_feature_names(categorical_features)

array(['treatment_B', 'treatment_C'], dtype=object)

Combining this with the numeric feature names gives us all the column names:

In [127]:
columns = numeric_features + list(preprocessor.named_transformers_['ohe']
                                     .get_feature_names(categorical_features))
columns

['age', 'weight', 'treatment_B', 'treatment_C']

Like fit, when we transform with the preprocessor, it calls `transform` on _all_ the transformers.

In [129]:
# Apply data transformations and convert back to dataframe
X_ohe_scale = pd.DataFrame(preprocessor.transform(X),
                       index=X.index,
                       columns=columns)

In [130]:
X_ohe_scale.head()

Unnamed: 0,age,weight,treatment_B,treatment_C
0,-1.602301,-1.29436,0.0,0.0
1,-1.46364,-1.204307,1.0,0.0
2,-0.978328,-1.197103,1.0,0.0
3,-1.740961,-1.982364,0.0,1.0
4,-1.116989,0.099659,0.0,0.0
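
To make the "same operations on all splits" point concrete, here is a sketch on synthetic stand-in data (the values are made up, and `sparse_threshold=0` is used only to force a dense array for easy inspection). The preprocessor is fit on the training split and then reused, without refitting, on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the depression data (values are made up).
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "weight": rng.normal(80, 20, size=n),
    "treatment": rng.choice(["A", "B", "C"], size=n),
})
toy_train, toy_test = train_test_split(toy, random_state=0)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["age", "weight"]),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["treatment"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn means, standard deviations and categories from the training split only...
toy_train_t = toy_preprocessor.fit_transform(toy_train)
# ...and apply exactly those to the test split (transform, not fit_transform).
toy_test_t = toy_preprocessor.transform(toy_test)

print(toy_train_t.shape, toy_test_t.shape)
print(toy_train_t[:, :2].mean(axis=0).round(3))  # scaled training columns
```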


- Side note: by default (`remainder='drop'`), the `ColumnTransformer` removes columns that are not listed in any transformer:

In [132]:
preprocessor = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), categorical_features)])

preprocessor.fit_transform(X)

array([[0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.],
       [0., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [0., 0.],
       [0., 1.]])

Using `remainder='passthrough'` keeps the other columns intact:

In [133]:
preprocessor = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), categorical_features)], 
                                 remainder='passthrough')

preprocessor.fit_transform(X)

array([[  0. ,   0. ,  21. ,  57.8],
       [  1. ,   0. ,  23. ,  60.3],
       [  1. ,   0. ,  30. ,  60.5],
       [  0. ,   1. ,  19. ,  38.7],
       [  0. ,   0. ,  28. ,  96.5],
       [  0. ,   1. ,  23. ,  41.8],
       [  1. ,   0. ,  33. , 111. ],
       [  0. ,   1. ,  67. , 158. ],
       [  1. ,   0. ,  42. , 109.4],
       [  0. ,   0. ,  33. ,  66.5],
       [  0. ,   0. ,  33. , 111.4],
       [  0. ,   1. ,  56. , 114.7],
       [  0. ,   1. ,  45. ,  91.6],
       [  1. ,   0. ,  43. ,  89.8],
       [  0. ,   0. ,  38. , 102.4],
       [  0. ,   1. ,  37. ,  75.7],
       [  1. ,   0. ,  43. ,  87.4],
       [  0. ,   1. ,  27. ,  49.9],
       [  0. ,   0. ,  43. ,  94.2],
       [  1. ,   0. ,  45. , 132. ],
       [  1. ,   0. ,  48. , 113.5],
       [  0. ,   1. ,  47. , 136. ],
       [  0. ,   0. ,  48. ,  95.3],
       [  0. ,   0. ,  53. , 121.6],
       [  1. ,   0. ,  58. , 121.1],
       [  0. ,   1. ,  29. ,  89.2],
       [  0. ,   0. ,  53. , 107.3],
 

#### Categorical labels

- For now sklearn is fine with categorical labels ($y$-values), so we're not preprocessing them.


#### Side note: 

- Theoretically, decision trees should have no problem with categorical inputs
  - e.g., "is the treatment equal to C?"
  - However, the sklearn implementation does not support this, so we need to convert the features to numerical.
- On the other hand, logistic regression fundamentally operates on numerical data
  - So any implementation would require these transformations.

## Break (5 min)

## Encoding categorical variables, revisited (15 min)

In [92]:
census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


Note: we'll ignore missing values for now, and return to them shortly.

In [93]:
# Let's identify numeric and categorical features

numeric_features = ['age', 'fnlwgt', 'education.num', 
                    'capital.gain', 'capital.loss', 
                    'hours.per.week']

categorical_features = ['workclass', 'education', 'marital.status', 
                        'occupation', 'relationship', 
                        'race', 'sex', 'native.country']

In [94]:
census.shape

(32561, 15)

Create `X` and `y` from the DataFrame:

In [95]:
X_train = census_train.drop(columns=['income'])
y_train = census_train['income']

X_test = census_test.drop(columns=['income'])
y_test = census_test['income']

- We will skip `ColumnTransformer` for now and just scale/encode separately for simplicity.
- Let's first scale the numeric features.

In [96]:
sc = StandardScaler()
sc.fit(X_train[numeric_features]);

In [97]:
X_train_scale_numeric = pd.DataFrame(sc.transform(X_train[numeric_features]),
                           columns=numeric_features, index=X_train.index)
X_test_scale_numeric = pd.DataFrame(sc.transform(X_test[numeric_features]),
                           columns=numeric_features, index=X_test.index)

X_train_scale_numeric.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.17987
31317,1.05833,-0.736111,1.138922,-0.146669,-0.216807,-0.440078
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.17987


Let's re-combine this with the categorical features:

In [98]:
X_train_scale = X_train.copy()
X_train_scale.update(X_train_scale_numeric)

X_test_scale = X_test.copy()
X_test_scale.update(X_test_scale_numeric)

X_train_scale.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
17064,-1.364769,Private,-0.745073,Some-college,-0.027403,Never-married,Adm-clerical,Own-child,Asian-Pac-Islander,Female,-0.146669,-0.216807,-0.845065,United-States
18434,-1.217915,Private,0.708967,HS-grad,-0.416178,Never-married,Other-service,Own-child,Black,Male,-0.146669,-0.216807,3.204805,United-States
3294,0.911476,Private,1.397806,HS-grad,-0.416178,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.451354,-0.216807,1.17987,United-States
31317,1.05833,Private,-0.736111,Bachelors,1.138922,Married-civ-spouse,Other-service,Husband,White,Male,-0.146669,-0.216807,-0.440078,United-States
4770,0.984903,Self-emp-inc,-1.316034,Bachelors,1.138922,Married-civ-spouse,Sales,Husband,White,Male,-0.146669,-0.216807,1.17987,United-States


#### Drop categorical features

In [99]:
X_train_scale_drop = X_train_scale.drop(columns=categorical_features)
X_test_scale_drop = X_test_scale.drop(columns=categorical_features)

X_train_scale_drop.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.17987
31317,1.05833,-0.736111,1.138922,-0.146669,-0.216807,-0.440078
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.17987


In [100]:
lr_drop = LogisticRegression()
lr_drop.fit(X_train_scale_drop, y_train);

In [101]:
show_scores(lr_drop, X_train_scale_drop, y_train, X_test_scale_drop, y_test)

Training error: 0.185
Test     error: 0.185
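
`show_scores` is a helper that is not defined in these notes. Judging from its output, it presumably prints the training and test error (1 minus accuracy). A hypothetical reconstruction, demonstrated on synthetic data rather than the census data, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def show_scores(model, X_train, y_train, X_test, y_test):
    """Print training and test error, where error = 1 - accuracy.

    Hypothetical reconstruction: the real helper is defined elsewhere
    in the course materials and may differ.
    """
    print("Training error: %.3f" % (1 - model.score(X_train, y_train)))
    print("Test     error: %.3f" % (1 - model.score(X_test, y_test)))

# Usage on synthetic data (not the census data).
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
show_scores(demo_model, X_tr, y_tr, X_te, y_te)
```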


#### Ordinal encoding

In [102]:
oe = OrdinalEncoder()
oe.fit(X_train_scale[categorical_features]);

In [103]:
X_train_ord_cat = pd.DataFrame(oe.transform(X_train_scale[categorical_features]),
                           columns=categorical_features, index=X_train_scale.index)
X_test_ord_cat = pd.DataFrame(oe.transform(X_test_scale[categorical_features]),
                           columns=categorical_features, index=X_test_scale.index)
X_train_ord_cat.head()

Unnamed: 0,workclass,education,marital.status,occupation,relationship,race,sex,native.country
17064,4.0,15.0,4.0,1.0,3.0,1.0,0.0,39.0
18434,4.0,11.0,4.0,8.0,3.0,2.0,1.0,39.0
3294,4.0,11.0,2.0,4.0,0.0,4.0,1.0,39.0
31317,4.0,9.0,2.0,8.0,0.0,4.0,1.0,39.0
4770,5.0,9.0,2.0,12.0,0.0,4.0,1.0,39.0


In [104]:
X_train_scale_ord = X_train_scale.copy()
X_train_scale_ord.update(X_train_ord_cat)
X_test_scale_ord = X_test_scale.copy()
X_test_scale_ord.update(X_test_ord_cat)

X_train_scale_ord.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
17064,-1.364769,4,-0.745073,15,-0.027403,4,1,3,1,0,-0.146669,-0.216807,-0.845065,39
18434,-1.217915,4,0.708967,11,-0.416178,4,8,3,2,1,-0.146669,-0.216807,3.204805,39
3294,0.911476,4,1.397806,11,-0.416178,2,4,0,4,1,0.451354,-0.216807,1.17987,39
31317,1.05833,4,-0.736111,9,1.138922,2,8,0,4,1,-0.146669,-0.216807,-0.440078,39
4770,0.984903,5,-1.316034,9,1.138922,2,12,0,4,1,-0.146669,-0.216807,1.17987,39


In [105]:
lr_ord = LogisticRegression()
lr_ord.fit(X_train_scale_ord, y_train);

In [106]:
show_scores(lr_ord, X_train_scale_ord, y_train, X_test_scale_ord, y_test)

Training error: 0.176
Test     error: 0.174


Let's compare with the case where we dropped categorical variables:

In [107]:
show_scores(lr_drop, X_train_scale_drop, y_train, X_test_scale_drop, y_test)

Training error: 0.185
Test     error: 0.185


Ok, we're doing a bit better now.

- Returning to the problem with this approach:
  - We have imposed ordinality on the categorical data.
  - For example:

In [108]:
c = oe.categories_[3]
{i : c_i for i, c_i in enumerate(c)}

{0: '?',
 1: 'Adm-clerical',
 2: 'Armed-Forces',
 3: 'Craft-repair',
 4: 'Exec-managerial',
 5: 'Farming-fishing',
 6: 'Handlers-cleaners',
 7: 'Machine-op-inspct',
 8: 'Other-service',
 9: 'Priv-house-serv',
 10: 'Prof-specialty',
 11: 'Protective-serv',
 12: 'Sales',
 13: 'Tech-support',
 14: 'Transport-moving'}

- Is it reasonable to say "occupation > 3" in the above? There is no pattern to these occupations. 
- In this case, it may be reasonable for the education feature:

In [109]:
X_train["education"].unique()

array(['Some-college', 'HS-grad', 'Bachelors', 'Doctorate', '12th',
       '10th', 'Masters', '11th', 'Prof-school', 'Assoc-voc',
       'Assoc-acdm', '7th-8th', '9th', '1st-4th', 'Preschool', '5th-6th'],
      dtype=object)

- But we would want to set the order manually rather than let `OrdinalEncoder` choose it: by default it sorts the categories, which for strings means alphabetical order!
- Interestingly, this has already been done for us in this dataset:

In [110]:
X_train[['education', 'education.num']]

Unnamed: 0,education,education.num
17064,Some-college,10
18434,HS-grad,9
3294,HS-grad,9
31317,Bachelors,13
4770,Bachelors,13
...,...,...
28636,HS-grad,9
17730,10th,6
28030,Some-college,10
15725,5th-6th,3


So, in reality, we might just drop the `education` column and only use `education.num`. Here we'll just ignore this issue and keep both.
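
A quick way to check that the two columns carry the same information, sketched on a few rows copied from the output above (on the real data the same calls would be applied to `X_train`):

```python
import pandas as pd

# A few rows copied from the output above.
edu = pd.DataFrame({
    "education": ["Some-college", "HS-grad", "HS-grad", "Bachelors",
                  "Bachelors", "10th", "Some-college", "5th-6th"],
    "education.num": [10, 9, 9, 13, 13, 6, 10, 3],
})

# If each education level maps to exactly one number, the two columns
# carry the same information and one of them is redundant.
codes_per_level = edu.groupby("education")["education.num"].nunique()
is_one_to_one = bool((codes_per_level == 1).all())

# Levels sorted by their numeric code give the intended order.
order = edu.drop_duplicates().sort_values("education.num")["education"].tolist()
print(is_one_to_one)
print(order)
```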

#### OHE

In [111]:
ohe = OneHotEncoder(drop='first', sparse=False, dtype=int)
ohe.fit(X_train_scale[categorical_features]);

In [112]:
X_train_scale_ohe = pd.concat((X_train_scale, pd.DataFrame(ohe.transform(X_train_scale[categorical_features]),
                           columns=ohe.get_feature_names(categorical_features), index=X_train_scale.index)), axis=1).drop(columns=categorical_features)
X_test_scale_ohe = pd.concat((X_test_scale, pd.DataFrame(ohe.transform(X_test_scale[categorical_features]),
                           columns=ohe.get_feature_names(categorical_features), index=X_test_scale.index)), axis=1).drop(columns=categorical_features)

X_train_scale_ohe.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native.country_Portugal,native.country_Puerto-Rico,native.country_Scotland,native.country_South,native.country_Taiwan,native.country_Thailand,native.country_Trinadad&Tobago,native.country_United-States,native.country_Vietnam,native.country_Yugoslavia
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.17987,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
31317,1.05833,-0.736111,1.138922,-0.146669,-0.216807,-0.440078,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.17987,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [113]:
X_train_scale_ohe.shape

(26048, 100)

In [114]:
lr_ohe = LogisticRegression()
lr_ohe.fit(X_train_scale_ohe, y_train);

In [115]:
show_scores(lr_ohe, X_train_scale_ohe, y_train, X_test_scale_ohe, y_test)

Training error: 0.147
Test     error: 0.149


Again, going back to our previous scores:

In [116]:
show_scores(lr_ord, X_train_scale_ord, y_train, X_test_scale_ord, y_test)

Training error: 0.176
Test     error: 0.174


In [117]:
show_scores(lr_drop, X_train_scale_drop, y_train, X_test_scale_drop, y_test)

Training error: 0.185
Test     error: 0.185


We see that we've indeed done better!

Hyperopt code from lecture 5:

In [209]:
param_grid = {
              "n_estimators"     : [10,100],
              "max_depth"        : [3, None],
              "max_features"     : [3, None]
             }
param_grid

{'n_estimators': [10, 100], 'max_depth': [3, None], 'max_features': [3, None]}

- How many combinations in total? 
- $2\times 2\times 2=8$

In [210]:
np.prod(list(map(len, param_grid.values())))

8

In [211]:
rf = RandomForestClassifier(random_state=321)
grid_search = GridSearchCV(rf, param_grid, cv=3, verbose=1)

In [212]:
grid_search.fit(X_train_transformed, y_train);

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.3min finished


In [213]:
grid_search.best_params_

{'max_depth': None, 'max_features': None, 'n_estimators': 100}

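- lol... these are almost the default values: `max_depth=None` and `n_estimators=100` are the defaults, while `max_features` defaults to the square root of the number of features rather than `None`.
- I guess they picked good defaults!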

In [214]:
grid_search.best_score_

0.8549216369281329

In [215]:
pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'param_max_depth', 'param_max_features', 'param_n_estimators', 'mean_fit_time', 'rank_test_score']].set_index("rank_test_score").sort_index()

rank_test_score,mean_test_score,param_max_depth,param_max_features,param_n_estimators,mean_fit_time
1,0.854922,,,100,14.519656
2,0.848587,,,10,1.399292
3,0.848127,,3.0,100,1.986127
4,0.844748,3.0,,10,0.51757
4,0.844748,3.0,,100,5.280527
6,0.839527,,3.0,10,0.226066
7,0.761057,3.0,3.0,10,0.18932
8,0.760519,3.0,3.0,100,0.907236


- Note that the grid search object acts like a scikit-learn model.
- It was actually refit on the _whole_ training set, as discussed earlier in the course!
- With the default `refit=True`, calling `predict` or `score` on the grid search object uses `grid_search.best_estimator_`.
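
This can be checked on a small example. The sketch below uses synthetic data and a tiny grid (both chosen only to keep it fast) and compares predictions from the search object with predictions from its `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only to keep the example fast.
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=321),
    {"n_estimators": [10, 30], "max_depth": [3, None]},
    cv=3,
)
demo_search.fit(X_tr, y_tr)  # refit=True by default

# predict() on the search object delegates to the refit best estimator.
same = np.array_equal(demo_search.predict(X_te),
                      demo_search.best_estimator_.predict(X_te))
print(same)
```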

In [216]:
grid_search.predict(X_test_transformed)

array(['<=50K', '>50K', '<=50K', ..., '>50K', '<=50K', '<=50K'],
      dtype=object)

In [217]:
param_choices = {
              "n_estimators"     : [10, 30, 100, 300],
              "max_depth"        : [3, 10, None],
              "max_features"     : [3, 10, None]
             }

You can also give it distributions, instead of lists.

In [218]:
import scipy.stats

In [219]:
param_dist = {
              "n_estimators"     : scipy.stats.randint(low=10, high=300),
              "max_depth"        : scipy.stats.randint(low=10, high=30),
              "max_features"     : scipy.stats.randint(low=10, high=30)
             }
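
To get a feel for what such a distribution produces, we can sample from it directly. This is a small sketch; note that for `scipy.stats.randint` the `high` bound is exclusive.

```python
import scipy.stats

# Same distribution as "n_estimators" above.
dist = scipy.stats.randint(low=10, high=300)

# Draw samples the way a randomized search would; `high` is exclusive,
# so the values are integers from 10 to 299.
samples = dist.rvs(size=1000, random_state=123)
print(samples[:10], samples.min(), samples.max())
```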

In [220]:
rf = RandomForestClassifier(random_state=321) # Note: you can set other hyperparameters here

In [None]:
random_search = RandomizedSearchCV(rf, param_distributions = param_dist, 
                                   n_iter = 10, 
                                   cv=3,
                                   verbose=1, random_state=123)

In [None]:
random_search.fit(X_train_transformed, y_train);

- Note: some hyperparameters significantly affect the training time!
- For example, setting `n_estimators=1000` is going to be very slow.

In [221]:
random_search.best_params_

{'max_depth': 16, 'max_features': 27, 'n_estimators': 93}

- Now we get something different! 
- What's the score?

In [222]:
random_search.best_score_

0.863175750441226

- So, we had 85.5% and now we have 86.3%.
- Is that difference important?
- Do we BELIEVE that difference?
  - We can try it out on the test set.
- But first:  

In [225]:
pd.DataFrame(random_search.cv_results_)[['mean_test_score', 'param_max_depth', 'param_max_features', 'param_n_estimators', 'mean_fit_time', 'rank_test_score']].set_index("rank_test_score").sort_index()

rank_test_score,mean_test_score,param_max_depth,param_max_features,param_n_estimators,mean_fit_time
1,0.863176,16,27,93,4.662635
2,0.862561,20,11,106,2.776636
3,0.86164,23,12,108,3.506636
4,0.861602,17,12,94,2.629149
5,0.861333,14,10,218,5.281985
6,0.860527,27,25,83,3.811981
7,0.86045,25,29,263,15.884677
8,0.860412,10,24,234,8.109985
9,0.859951,25,26,145,7.407122
10,0.859375,14,27,12,0.542452


- Look at the timings, they are quite interesting.
- And now, the test set:

In [223]:
grid_search.score(X_test_transformed, y_test)

0.8527560264087211

In [224]:
random_search.score(X_test_transformed, y_test)

0.8622754491017964
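
Putting the numbers above side by side (the scores are copied from the outputs; nothing is recomputed here):

```python
# Scores copied from the outputs above.
results = {
    "grid search":   {"cv": 0.8549216369281329, "test": 0.8527560264087211},
    "random search": {"cv": 0.863175750441226,  "test": 0.8622754491017964},
}

gaps = {}
for name, r in results.items():
    gaps[name] = r["cv"] - r["test"]
    print(f"{name:13s}  cv={r['cv']:.4f}  test={r['test']:.4f}  gap={gaps[name]:+.4f}")
```

In both cases the cross-validation score is slightly higher than the test score, which is the direction optimization bias would predict, although gaps this small could also just be noise. The random search result still comes out ahead on the test set, so here we have some reason to believe the improvement.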

## Summary

TODO