# Cross Validation

Cross validation is a way to further split up our training data so that we can compare models, or tune the hyper parameters for a single model, without touching the test split.

!!!danger "Only use test data once."
    A common mistake is to use the test split of your data with multiple models. Doing so can lead to overfitting to the test split. The test data set should only be used once, to give an idea of how well your *best* model will generalize.

    You should **not** use the test data set to compare different models or to find the best hyper parameters for a model. Instead, use cross validation methods.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

import env

For this lesson, we'll be using a data set that contains information on used cars and their sale prices.

We'll create a target variable that indicates whether a car sold for more than the average sale price, and try to predict it.

In [2]:
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/used_cars'
cars = pd.read_sql('SELECT * FROM cars', url)
cars.columns = [c.lower() for c in cars]
cars.set_index('id', inplace=True)

print('{} rows x {} cols'.format(*cars.shape))
cars.head()

297899 rows x 8 cols


| id | price | year | mileage | city | state | vin | make | model |
|---|---|---|---|---|---|---|---|---|
| 1 | 16472 | 2015 | 18681 | Jefferson City | MO | KL4CJBSBXFB267643 | Buick | EncoreConvenience |
| 2 | 15749 | 2015 | 27592 | Highland | IN | KL4CJASB5FB245057 | Buick | EncoreFWD |
| 3 | 16998 | 2015 | 13650 | Boone | NC | KL4CJCSB0FB264921 | Buick | EncoreLeather |
| 4 | 15777 | 2015 | 25195 | New Orleans | LA | KL4CJASB4FB217542 | Buick | EncoreFWD |
| 5 | 16784 | 2015 | 22800 | Las Vegas | NV | KL4CJBSB3FB166881 | Buick | EncoreConvenience |


## Data Prep

We'll construct a target column that says whether the car sold for more than the average price for the car's make, model, and year:

In [3]:
cars['avg_saleprice'] = cars.groupby(['year', 'make', 'model']).price.transform('mean')
cars['gt_avg'] = (cars.price > cars.avg_saleprice).astype(int)

In [4]:
cars.head()

| id | price | year | mileage | city | state | vin | make | model | avg_saleprice | gt_avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16472 | 2015 | 18681 | Jefferson City | MO | KL4CJBSBXFB267643 | Buick | EncoreConvenience | 17291.768786 | 0 |
| 2 | 15749 | 2015 | 27592 | Highland | IN | KL4CJASB5FB245057 | Buick | EncoreFWD | 16721.350598 | 0 |
| 3 | 16998 | 2015 | 13650 | Boone | NC | KL4CJCSB0FB264921 | Buick | EncoreLeather | 19080.632911 | 0 |
| 4 | 15777 | 2015 | 25195 | New Orleans | LA | KL4CJASB4FB217542 | Buick | EncoreFWD | 16721.350598 | 0 |
| 5 | 16784 | 2015 | 22800 | Las Vegas | NV | KL4CJBSB3FB166881 | Buick | EncoreConvenience | 17291.768786 | 0 |
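
To make the `groupby(...).transform('mean')` step more concrete, here is a minimal sketch on a tiny made-up data frame. The rows and prices below are hypothetical and are not taken from the cars table; the point is only that `transform` returns one value per original row, so the result lines up with the data frame we started with.

```python
import pandas as pd

# Toy data: hypothetical values, not rows from the cars table
toy = pd.DataFrame({
    "make":  ["Buick", "Buick", "Buick", "Ford", "Ford"],
    "year":  [2015, 2015, 2015, 2014, 2014],
    "price": [16000, 18000, 20000, 9000, 11000],
})

# transform('mean') gives every row the mean price of that row's group
toy["avg_price"] = toy.groupby(["make", "year"]).price.transform("mean")

# 1 when the row's price is strictly above its group mean, otherwise 0
toy["gt_avg"] = (toy.price > toy.avg_price).astype(int)

print(toy)
```

Note that a car priced exactly at its group average gets a 0, because the comparison is strictly greater than.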


We'll remove the columns we aren't going to use. `price` and `avg_saleprice` have to go because together they determine the target exactly:

In [5]:
cars.drop(columns=['price', 'city', 'vin', 'avg_saleprice'], inplace=True)

Let's encode the categorical columns:

In [6]:
from sklearn.preprocessing import LabelEncoder

for col in ['state', 'make', 'model', 'year']:
    le = LabelEncoder().fit(cars[col])
    cars[col] = le.transform(cars[col])

In [7]:
cars.head()

| id | year | mileage | state | make | model | gt_avg |
|---|---|---|---|---|---|---|
| 1 | 18 | 18681 | 28 | 7 | 523 | 0 |
| 2 | 18 | 27592 | 19 | 7 | 525 | 0 |
| 3 | 18 | 13650 | 32 | 7 | 526 | 0 |
| 4 | 18 | 25195 | 22 | 7 | 525 | 0 |
| 5 | 18 | 22800 | 38 | 7 | 523 | 0 |
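
As a small illustration of what `LabelEncoder` is doing to each column, the sketch below encodes a short, made-up list of state abbreviations. The list is an assumption for demonstration only. The encoder assigns integer codes to the sorted unique values, and `inverse_transform` maps the codes back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of state abbreviations
states = ["MO", "IN", "NC", "LA", "NV", "IN"]

le = LabelEncoder().fit(states)
codes = le.transform(states)

print(le.classes_)                   # the sorted unique values
print(codes)                         # position of each value in classes_
print(le.inverse_transform(codes))   # back to the original strings
```

The integer codes carry an ordering that the original categories don't have. Tree-based models can usually work around this, but distance-based or linear models may treat the codes as meaningful magnitudes, so it is worth keeping in mind when comparing model types.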


Now that our data is prepped, we can split it into training and test.

In [8]:
X, y = cars.drop(columns='gt_avg'), cars.gt_avg

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
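
The split above is random, so the rows that land in each split change from run to run. As a hedged sketch (the lesson itself doesn't use these arguments), `train_test_split` also accepts `random_state` to make the split repeatable and `stratify` to keep the class balance the same in both splits. The arrays and the seed `123` below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 60% class 0 and 40% class 1 (hypothetical)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify=y preserves the 60/40 balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=123, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # proportion of class 1 in each split
```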

## By Hand

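Now we can further split our training data into train and validate data sets. Taking a third of the remaining 80% leaves roughly 53% of the rows for training, 27% for validation, and 20% for testing: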

In [9]:
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, test_size=.3333)

And we could now use the validation data set to compare model performance, or select hyper parameters.
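
A minimal sketch of what that comparison might look like is below. Since the cars database isn't available outside the lesson environment, this uses a synthetic data set from `make_classification` as a stand-in, and the particular models, hyper parameters, and random seeds are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepped cars data
X, y = make_classification(n_samples=1500, n_features=5, random_state=0)

# Same two-step split as in the lesson: test first, then validate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=0)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_train, y_train, test_size=.3333, random_state=0)

models = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logit": LogisticRegression(max_iter=1000),
}

# Fit on train, score (accuracy) on validate; the test split is never touched
validate_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    validate_scores[name] = model.score(X_validate, y_validate)

best_name = max(validate_scores, key=validate_scores.get)
print(validate_scores)
print("best on validate:", best_name)
```

Only after settling on a single model this way would we score it on `X_test`, once.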

## Basic Cross Validation

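The `cross_val_score` function can be used to automate the splitting process. We can specify the number of splits we want from the training data with the `cv` argument, and the function fits the model once per split, scoring it each time on the portion of the data that was held out.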

In [10]:
import sklearn.metrics as m
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(max_depth=2)

cross_val_score(tree, X_train, y_train, cv=3)

array([0.59156392, 0.59231147, 0.59353876])
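
To see what is happening behind that one call, here is a sketch that reproduces the scores by hand. It relies on synthetic data from `make_classification` (an assumption, standing in for the cars data) and on scikit-learn's behavior that, for a classifier and an integer `cv`, the folds are stratified and not shuffled. A fixed `random_state` on the tree keeps both approaches deterministic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=600, n_features=5, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)

# By hand: each fold takes a turn as the held-out set
manual_scores = []
for train_idx, holdout_idx in StratifiedKFold(n_splits=3).split(X, y):
    model = clone(tree).fit(X[train_idx], y[train_idx])  # fresh, unfitted copy
    manual_scores.append(model.score(X[holdout_idx], y[holdout_idx]))

# Automated: the same thing in one call
auto_scores = cross_val_score(tree, X, y, cv=3)

print(np.array(manual_scores))
print(auto_scores)
```

Each row of the data is used for scoring exactly once and for fitting in the other two rounds, which is why the scores are less dependent on one lucky or unlucky split than a single validate set would be.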

By default, classifiers are scored on accuracy, but we can pass the `scoring` argument to look at other metrics as well:

In [11]:
cross_val_score(tree, X_train, y_train, cv=3, scoring='precision')

array([0.58878505, 0.59654678, 0.59260508])
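
If we want several metrics from the same set of splits, one option is the related `cross_validate` function, which accepts a list of scoring names and returns a dictionary instead of a single array. The sketch below uses synthetic data as a stand-in for the cars data; the choice of metrics is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=600, n_features=5, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)

# One fit per split, several metrics computed on each held-out portion
results = cross_validate(
    tree, X, y, cv=3, scoring=["accuracy", "precision", "recall"]
)

for key in ["test_accuracy", "test_precision", "test_recall"]:
    print(key, results[key])
```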

## Grid Search CV

Sklearn's grid search cross validation (`GridSearchCV`) class lets us quickly try out many different combinations of hyper parameters.

For our example, we'll try out different values for `max_depth` and `max_features` with a decision tree classifier.

We'll specify the values we wish to try as a dictionary, then pass that dictionary in when we create the `GridSearchCV` object.

In [12]:
from sklearn.model_selection import GridSearchCV

params = {'max_depth': [2, 3, 4],
          'max_features': [None, 1, 3]}

tree = DecisionTreeClassifier()

grid = GridSearchCV(tree, params, cv=3)

grid.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [2, 3, 4], 'max_features': [None, 1, 3]})
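
It can help to count how much work that `fit` call does. As a rough sketch, scikit-learn's `ParameterGrid` expands the same kind of dictionary into the individual combinations, so we can see that 3 values for `max_depth` times 3 values for `max_features` gives 9 candidate models, each fit once per split.

```python
from sklearn.model_selection import ParameterGrid

params = {'max_depth': [2, 3, 4],
          'max_features': [None, 1, 3]}

# Every combination of the listed values
combos = list(ParameterGrid(params))

n_splits = 3  # matches cv=3
n_fits = len(combos) * n_splits

for combo in combos:
    print(combo)
print(len(combos), "combinations x", n_splits, "splits =", n_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` then fits the best combination one more time on all of the data passed to `fit`. The total grows quickly as we add parameters or values, which is worth remembering with a data set of this size.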

We can now see the cross validation results in the `cv_results_` attribute of the object we created.

In [13]:
results = grid.cv_results_
results.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_max_features', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

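There are a lot of keys here, but we will focus on two:

- `mean_test_score`: the score for each parameter combination, averaged over the cross validation splits ("test" here means the held-out portion of each split, not our test data set)
- `params`: a list of dictionaries, one per model, containing the parameters used to train that model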

In [14]:
test_scores = results['mean_test_score']
test_scores

array([0.59247138, 0.5336245 , 0.57126759, 0.62837741, 0.55348778,
       0.60133915, 0.63756002, 0.55784278, 0.61290109])

In [15]:
params = results['params']
params

[{'max_depth': 2, 'max_features': None},
 {'max_depth': 2, 'max_features': 1},
 {'max_depth': 2, 'max_features': 3},
 {'max_depth': 3, 'max_features': None},
 {'max_depth': 3, 'max_features': 1},
 {'max_depth': 3, 'max_features': 3},
 {'max_depth': 4, 'max_features': None},
 {'max_depth': 4, 'max_features': 1},
 {'max_depth': 4, 'max_features': 3}]

We can combine these two results into a data frame to see how our different models perform:

In [16]:
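# Build the data frame from the parameter dictionaries and attach the scores
# as a new column, leaving the dictionaries inside cv_results_ unmodified
pd.DataFrame(params).assign(score=test_scores).sort_values(by='score')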

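|  | max_depth | max_features | score |
|---|---|---|---|
| 1 | 2 | 1.0 | 0.533624 |
| 4 | 3 | 1.0 | 0.553488 |
| 7 | 4 | 1.0 | 0.557843 |
| 2 | 2 | 3.0 | 0.571268 |
| 0 | 2 | NaN | 0.592471 |
| 5 | 3 | 3.0 | 0.601339 |
| 8 | 4 | 3.0 | 0.612901 |
| 3 | 3 | NaN | 0.628377 |
| 6 | 4 | NaN | 0.637560 |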
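
The fitted grid search object also exposes the winner directly through `best_params_`, `best_score_`, and `best_estimator_`. The sketch below shows these on synthetic data (an assumption, standing in for the cars data, with arbitrary random seeds) and ends with the single use of the test split that the warning at the top of this lesson allows.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepped cars data
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=0)

params = {'max_depth': [2, 3, 4],
          'max_features': [None, 1, 3]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=3)
grid.fit(X_train, y_train)

# cv_results_ is a dictionary of equal-length arrays, so it converts directly
columns = ['param_max_depth', 'param_max_features',
           'mean_test_score', 'rank_test_score']
results_df = pd.DataFrame(grid.cv_results_)[columns].sort_values('rank_test_score')
print(results_df)

print(grid.best_params_)  # the parameter combination ranked first
print(grid.best_score_)   # its mean cross validation score

# best_estimator_ has been refit on all of X_train; score it on test ONCE
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```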


## Exercises

Within your `codeup-data-science` directory, create a new repo named `advanced-topics`. This will be where you do your work for this module. Create a repository on GitHub with the same name, and link your local repository to GitHub.

Save this work in your `advanced-topics` repo. Then add, commit, and push your changes.

Do your work for this exercise in a Jupyter notebook or Python script named `cross_validation`.

Use the cross validation techniques discussed in the lesson to figure out what kind of model works best with the cars dataset used in the lesson.