# Topic: Cross Validation in predictive modeling


## Problem

When you run a train-test split (for example, with scikit-learn's `train_test_split`), you randomly sample rows to decide which go into the training dataset and which go into the test dataset.

![img](train-test-examp.png)

But what about **sampling error**? What if you get a **BAD SAMPLE**?!

![bad](https://media.giphy.com/media/cJjQJWU70DSuHzx4oR/giphy.gif)

## Our Task
![map](map.png) 

Build a multivariate Ordinary Least Squares regression model to predict "TARGET_deathRate" using a more robust model validation method: cross-validation.<br>
We have data aggregated from a number of sources including the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. Most of the data preparation process can be viewed here.


## Learning Goals:

- Describe the elements of K-fold cross validation
- Recognize how K-fold cross validation is superior to validating on a single train-test split
- Apply K-fold cross validation to a dataset
- Apply K-fold cross validation to the Module 1 project

## Activation
Let us talk about `training` and `testing`.

![train-test](why-train-test.png)

We split to prevent:

![fit-pit](overfit_underfit.png)
(found on [this blog](https://rmartinshort.jimdo.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/)). 

But what if, by random chance, the split leaves you with training data that isn't representative? What if it includes some wacky data?

![but what if](bad-split.png)

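K-fold cross validation averages that out: every row is used for testing exactly once, so one unlucky split cannot dominate the result. It does not prevent “overfitting” or “underfitting” by itself, but it makes both much easier to spot.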
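
To see why a single split can mislead, here is a minimal sketch that repeats an ordinary train-test split with ten different random seeds and scores the same model each time. The data is synthetic (generated with `make_regression`), and the sample size, noise level, and seeds are illustrative assumptions, not values from this lesson's dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 3 features, noisy linear target.
X, y = make_regression(n_samples=200, n_features=3, noise=25.0, random_state=0)

split_scores = []
for seed in range(10):
    # Each seed produces a different random train/test split of the same data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # .score() on a regressor returns R^2 on the held-out rows.
    split_scores.append(model.score(X_test, y_test))

split_scores = np.array(split_scores)
print("R^2 per split:", np.round(split_scores, 3))
print("lowest:", split_scores.min(), "highest:", split_scores.max())
```

The gap between the lowest and highest score is the sampling error of a single split: same data, same model, different luck.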


## Learning Goal 1: Describe the elements of K-fold Cross validation 

In the context of modeling, K-fold cross validation sits under Stage 6 (Predictive Modeling) of the 7-stage Data Science Lifecycle.

![chart](chart.png)

K-fold cross validation gives us a more reliable estimate of how well a machine learning model will perform on unseen data. It does this by dividing the dataset into several (“k”) folds, training on “k-1” folds, and testing on the remaining fold. This is repeated “k” times, so that each fold serves as the test set exactly once, and the “k” results are averaged.

![cross-val](cross-val-graphic.png)
(graphic from [here](https://towardsdatascience.com/cross-validation-70289113a072) )

We can then summarize performance by averaging the score calculated on each of the folds (accuracy for a classifier, R² for a regressor). This tends to give a more realistic picture of the machine learning model's performance.
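
The fold-by-fold procedure described above can be sketched by hand with scikit-learn's `KFold`. This is a minimal illustration on synthetic data; the choice of `k = 5`, the shuffling, and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption), so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=3, noise=25.0, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds...
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    # ...and test on the one fold that was held out (R^2 for a regressor).
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average the k results to get the cross-validated score.
mean_score = np.mean(fold_scores)
print("score per fold:", np.round(fold_scores, 3))
print("mean score:", mean_score)
```

A fresh model is created inside the loop so that nothing learned on one fold leaks into the next.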

The cross validation technique can be used to compare the performance of different machine learning models on the same data set. To understand this point better, let us consider the following example.

Go through [this blog](https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79) to drive the point home.

## Learning Goal 2:  Explain to Greg
![img2](thinking.jpeg)

You've hired Greg to build models for you. He's stressed and trying to tell you there isn't enough time to do a cross validation and that one train-test split should be enough.

Write down what you would say to Greg and then tell it to your neighbor.

## Learning Goal 3: Applying k-fold cross validation

### Try the code in each of these articles:

### One half of room:
This is a good tech blog:

[this blog is a good one](https://machinelearningmastery.com/k-fold-cross-validation/)

### Other half of room:
[another good example](https://medium.com/datadriveninvestor/k-fold-cross-validation-6b8518070833)


### Task: 
Write the most important parts of the code from each post on the board, then discuss:

- What did you need to specify?
- What new libraries did you use?

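In [None]:
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range; X is the feature matrix from the article's code
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

In [None]:
# Indented fragment in the original, presumably the body of a per-fold loop;
# best_svr, X_train and y_train are defined in the article's code, not here
best_svr.fit(X_train, y_train)
# scores.append(best_svr.score(X_test, y_test))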

## Build a model to predict cancer death rates


```
cancer_rates = pd.read_csv('https://query.data.world/s/5ylxfjp6oymzhuhhzwmlbqxzcw6etz')

households = pd.read_csv('https://download.data.world/s/3nopgtdm2fwjgidovkostutkfitlps')
```

[Here is the documentation](https://data.world/exercises/linear-regression-exercise-1/workspace/data-dictionary) for this data.

Integrate this new knowledge of k-fold cross validation to build a model and calculate the average performance. 

In [2]:
import pandas as pd
import numpy as np


In [4]:
cancer_rates = pd.read_csv('https://query.data.world/s/5ylxfjp6oymzhuhhzwmlbqxzcw6etz')
households = pd.read_csv('https://download.data.world/s/3nopgtdm2fwjgidovkostutkfitlps')
print(cancer_rates.head())
print(households.head())


   avganncount  avgdeathsperyear  target_deathrate  incidencerate  medincome  \
0       1397.0               469             164.9          489.8      61898   
1        173.0                70             161.3          411.6      48127   
2        102.0                50             174.7          349.7      49348   
3        427.0               202             194.8          430.4      44243   
4         57.0                26             144.4          350.1      49955   

   popest2015  povertypercent  studypercap           binnedinc  medianage  \
0      260131            11.2   499.748204   (61494.5, 125635]       39.3   
1       43269            18.6    23.111234  (48021.6, 51046.4]       33.0   
2       21026            14.6    47.560164  (48021.6, 51046.4]       45.0   
3       75882            17.1   342.637253    (42724.4, 45201]       42.8   
4       10321            12.5     0.000000  (48021.6, 51046.4]       48.3   

     ...      pctprivatecoveragealone  pctempprivcoverag

In [5]:
cancer_rates.isnull().sum()

avganncount                   0
avgdeathsperyear              0
target_deathrate              0
incidencerate                 0
medincome                     0
popest2015                    0
povertypercent                0
studypercap                   0
binnedinc                     0
medianage                     0
medianagemale                 0
medianagefemale               0
geography                     0
percentmarried                0
pctnohs18_24                  0
pcths18_24                    0
pctsomecol18_24            2285
pctbachdeg18_24               0
pcths25_over                  0
pctbachdeg25_over             0
pctemployed16_over          152
pctunemployed16_over          0
pctprivatecoverage            0
pctprivatecoveragealone     609
pctempprivcoverage            0
pctpubliccoverage             0
pctpubliccoveragealone        0
pctwhite                      0
pctblack                      0
pctasian                      0
pctotherrace                  0
pctmarri

In [8]:
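# how='all' only drops columns that are entirely NaN, so no column is removed here;
# the result is displayed but not assigned back to cancer_rates
cancer_rates.dropna(axis=1, how='all')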

Unnamed: 0,avganncount,avgdeathsperyear,target_deathrate,incidencerate,medincome,popest2015,povertypercent,studypercap,binnedinc,medianage,...,pctprivatecoveragealone,pctempprivcoverage,pctpubliccoverage,pctpubliccoveragealone,pctwhite,pctblack,pctasian,pctotherrace,pctmarriedhouseholds,birthrate
0,1397.000000,469,164.9,489.800000,61898,260131,11.2,499.748204,"(61494.5, 125635]",39.3,...,,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831
1,173.000000,70,161.3,411.600000,48127,43269,18.6,23.111234,"(48021.6, 51046.4]",33.0,...,53.8,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.372500,4.333096
2,102.000000,50,174.7,349.700000,49348,21026,14.6,47.560164,"(48021.6, 51046.4]",45.0,...,43.5,34.9,42.1,21.1,90.922190,0.739673,0.465898,2.747358,54.444868,3.729488
3,427.000000,202,194.8,430.400000,44243,75882,17.1,342.637253,"(42724.4, 45201]",42.8,...,40.3,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841
4,57.000000,26,144.4,350.100000,49955,10321,12.5,0.000000,"(48021.6, 51046.4]",48.3,...,43.9,35.1,44.0,22.7,94.104024,0.270192,0.665830,0.492135,54.027460,6.796657
5,428.000000,152,176.0,505.400000,52313,61023,15.6,180.259902,"(51046.4, 54545.6]",45.4,...,38.8,32.6,43.2,20.2,84.882631,1.653205,1.538057,3.314635,51.220360,4.964476
6,250.000000,97,175.9,461.800000,37782,41516,23.2,0.000000,"(37413.8, 40362.7]",42.6,...,35.0,28.3,46.4,28.7,75.106455,0.616955,0.866157,8.356721,51.013900,4.204317
7,146.000000,71,183.6,404.000000,40189,20848,17.8,0.000000,"(37413.8, 40362.7]",51.7,...,33.1,25.9,50.9,24.1,89.406636,0.305159,1.889077,2.286268,48.967033,5.889179
8,88.000000,36,190.5,459.400000,42579,13088,22.3,0.000000,"(40362.7, 42724.4]",49.3,...,37.8,29.9,48.1,26.6,91.787477,0.185071,0.208205,0.616903,53.446998,5.587583
9,4025.000000,1380,177.8,510.900000,60397,843954,13.1,427.748432,"(54545.6, 61494.5]",35.8,...,,44.4,31.4,16.5,74.729668,6.710854,6.041472,2.699184,50.063573,5.533430


In [17]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

X = cancer_rates[["povertypercent", "birthrate", "pctpubliccoveragealone"]]
y = cancer_rates["target_deathrate"]

# For a regressor, cross_val_score defaults to the estimator's R^2 score.
# Each call returns one score per fold; np.mean averages them.
cv_5_results = np.mean(cross_val_score(linreg, X, y, cv=5))
cv_10_results = np.mean(cross_val_score(linreg, X, y, cv=10))
cv_20_results = np.mean(cross_val_score(linreg, X, y, cv=20))

In [18]:
print(cv_5_results, cv_10_results, cv_20_results)

0.20081555565836567 0.1705449726746069 0.13379834338458024
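
The three numbers above are mean R² values. When `cv` is an integer and the estimator is a regressor, `cross_val_score` uses `KFold` without shuffling, so the row order of the data can influence each fold. The sketch below shows one way to look past the mean: it passes an explicitly shuffled `KFold`, names the metric, and reports the spread across folds. It runs on synthetic data so it is self-contained; `X_demo` and `y_demo` are hypothetical stand-ins that you would swap for the `X` and `y` built above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data; replace with the X and y from the cancer dataset.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=3, noise=30.0, random_state=1
)

# Shuffle rows before folding; a fixed random_state makes the folds repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

r2_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2"
)
# scikit-learn reports errors as negative values so that "higher is better".
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
rmse_scores = np.sqrt(-neg_mse)

print("R^2 per fold: ", np.round(r2_scores, 3))
print("R^2 mean/std: ", r2_scores.mean(), r2_scores.std())
print("RMSE per fold:", np.round(rmse_scores, 2))
```

Reporting the standard deviation alongside the mean shows how much the score depends on which rows land in the test fold.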


### Assessment

Did they achieve all the learning goals from the start? How do you confirm? You can use many different methods:

Review questions? (make into quiz)

- What is “training” and “testing”?
- What is underfitting?
- What is overfitting?
- What is the data science lifecycle? (Students should be able to articulate the 7 steps in the pie chart above.)
- What is k-fold cross validation?
- Why is it useful?

### Reflection/Key Takeaways


In machine learning, it is always a good idea to play around with different predictive models and their parameters to arrive at the best choice. Fine-tuning your machine learning model is helpful in achieving good results, and of course, cross validation helps you know if you are on the right track to get a good predictive model.


_Limitations of Cross Validation_ <br>
For cross validation to give some meaningful results, the training set and the validation set are required to be drawn from the same population. Also, human biases need to be controlled, or else cross validation will not be fruitful.

_**Other Applications**_

_Compare Performance_<br>
Suppose you want to build a classifier for the MNIST data set, which consists of hand-written numerals from 0 to 9. You are considering using either K Nearest Neighbours (KNN) or Support Vector Machine (SVM). To compare the performance of the two machine learning models on the given data set, you can use cross validation. This will help you determine which predictive model to work with for the MNIST data set.
Cross validation can also be used for selecting suitable parameters. The example mentioned below will illustrate this point well.
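
A minimal sketch of such a comparison follows. To keep it small and self-contained it uses scikit-learn's bundled `load_digits` data (8x8 digit images) as a stand-in for MNIST, which is an assumption of this example; the model settings shown are illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small digits dataset bundled with scikit-learn, used here in place of MNIST.
X_digits, y_digits = load_digits(return_X_y=True)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", gamma="scale"),
}

cv_means = {}
for name, clf in models.items():
    # For classifiers, an integer cv uses stratified folds and the default
    # score is accuracy. Both models are evaluated on the same 5 folds.
    scores = cross_val_score(clf, X_digits, y_digits, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```

Because both models are scored on identical folds, a difference in mean accuracy reflects the models rather than a lucky split.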

_Fine-tune Parameters_<br>
Suppose you have to build a K Nearest Neighbours (KNN) classifier for the MNIST data set. To use this classifier, you should provide an appropriate value of the parameter k to the classifier. Choosing the value of k intuitively is not a good idea (beware of overfitting!). You can play around with different values of the parameter k and use cross validation to estimate the performance of the predictive model corresponding to each k. You should finally go ahead with the value of k that gives the best performance of the predictive model on the given data set.
For the K Nearest Neighbours (KNN) classifier, you can even choose different metrics (default is ‘minkowski’ if you use ‘KNeighborsClassifier’ of sklearn). So you can use cross validation to determine which metric is the best for the data set you have.
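
One way to run this kind of search is scikit-learn's `GridSearchCV`, which cross-validates every combination in a parameter grid. The sketch below again uses `load_digits` as a stand-in for MNIST, and the candidate values for `n_neighbors` and `metric` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small digits dataset bundled with scikit-learn, used here in place of MNIST.
X_digits, y_digits = load_digits(return_X_y=True)

# Candidate values to try (illustrative, not exhaustive).
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "metric": ["minkowski", "manhattan"],
}

# Every combination is scored with 5-fold cross validation (accuracy by default).
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_digits, y_digits)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

The best cross-validated score is slightly optimistic because it was used to pick the winner, so keep a separate held-out test set if you need an unbiased final estimate.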

_References_
- https://machinelearningmastery.com/k-fold-cross-validation/
- https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79
- https://www.researchgate.net/post/What_is_the_purpose_of_performing_cross-validation
- https://medium.com/datadriveninvestor/k-fold-cross-validation-6b8518070833
- https://www.cs.tau.ac.il/~nin/Courses/NC05/pr_l13.pdf
- https://magoosh.com/data-science/k-fold-cross-validation/