## Mud card
- **Why is the CV score higher than the train score in some cases?**
   - we use MSE as our evaluation metric (lower is better), so the CV score should normally be higher than the train score
   - the unusual case is when the CV score is smaller than the train score
      - we work with small datasets due to the hub's limited computational resources
      - with so few points, unlucky splits happen

- **For k-fold with shuffle=True, when is the data shuffled? Is it shuffled before splitting into k folds? Wasn't the data already shuffled earlier?**
   - the data is shuffled first, then split into folds
   - train_test_split by default shuffles the data so if you use train_test_split first, it's OK to not shuffle in KFold
   - ALWAYS CHECK THAT THE CODE DOES WHAT YOU INTEND IT TO DO!

- **Why is the standard deviation of the k-fold test MSE smaller than that of the basic pipeline?**
   - KFold CV was run only once, so we only had one set of points in test
      - the only source of uncertainty came from changing the CV fold
   - we ran the basic ML pipeline 10 times, so we had 10 different train/CV/test sets

- **What is the meaning of "rank_test_score"?**
   - the rank_test_score column in the results ranks the hyperparameter combinations by their mean_test_score; rank 1 is what GridSearchCV believes to be the best combination
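
The shuffling question from the mud card above can be checked directly on a toy array. This is a minimal sketch, not part of the original notebook: `X_demo`, `ordered_folds` and `shuffled_folds` are hypothetical names, and the data is just the integers 0 to 9 so that every value equals its index.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X_demo = np.arange(10).reshape(-1, 1)  # toy data: each value equals its index

# shuffle=False: the CV part of each fold is a contiguous block of indices
kf_ordered = KFold(n_splits=5, shuffle=False)
ordered_folds = [cv_idx.tolist() for _, cv_idx in kf_ordered.split(X_demo)]
print('no shuffle:', ordered_folds)

# shuffle=True: the indices are permuted first, then cut into folds
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_folds = [cv_idx.tolist() for _, cv_idx in kf_shuffled.split(X_demo)]
print('shuffle:   ', shuffled_folds)

# train_test_split shuffles by default, so X_other already comes out in random order
X_other, X_test = train_test_split(X_demo, test_size=0.2, random_state=0)
print('X_other order:', X_other.ravel())
```

Printing the indices like this is a cheap way to confirm that the splitter does what you intend before running the full pipeline.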

In [1]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

np.random.seed(10)
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

n_samples = 100

X = np.random.rand(n_samples)
y = true_fun(X) + np.random.randn(n_samples) * 0.1

In [2]:
def ML_pipeline_kfold(X,y,random_state,n_folds):
    # split the data
    X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state = random_state)
    CV_scores = []
    test_scores = []
    # k folds - each fold will give us a CV and a test score
    kf = KFold(n_splits=n_folds,shuffle=True,random_state=random_state)
    for train_index, CV_index in kf.split(X_other,y_other):
        X_train, X_CV = X_other[train_index], X_other[CV_index]
        y_train, y_CV = y_other[train_index], y_other[CV_index]
        # simple preprocessing
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_c = scaler.transform(X_CV)
        X_t = scaler.transform(X_test)
        # tune ridge hyper-parameter, alpha
        alpha = np.logspace(-5,2,num=8)
        train_score = []
        CV_score = []
        regs = []
        for a in alpha:
            reg = Ridge(alpha = a)
            reg.fit(X_train,y_train)
            train_score.append(mean_squared_error(y_train,reg.predict(X_train)))
            CV_score.append(mean_squared_error(y_CV,reg.predict(X_c)))
            regs.append(reg)
        # find the best alpha in this fold
        best_alpha = alpha[np.argmin(CV_score)]
        # grab the best model
        reg = regs[np.argmin(CV_score)]
        CV_scores.append(np.min(CV_score))
        # calculate test score using the best model
        test_scores.append(mean_squared_error(y_test,reg.predict(X_t)))
    return CV_scores,test_scores

In [3]:
def ML_pipeline_kfold_GridSearchCV(X,y,random_state,n_folds):
    # create a test set
    X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state = random_state)
    # splitter for _other
    kf = KFold(n_splits=n_folds,shuffle=True,random_state=random_state)
    # create the pipeline: preprocessor + supervised ML method
    scaler = StandardScaler()
    pipe = make_pipeline(scaler,Ridge())
    # the parameter(s) we want to tune
    param_grid = {'ridge__alpha': np.logspace(-3,4,num=8)}
    # prepare gridsearch
    grid = GridSearchCV(pipe, param_grid=param_grid,scoring = make_scorer(mean_squared_error,greater_is_better=False),
                        cv=kf, return_train_score = True)
    # do kfold CV on _other
    grid.fit(X_other, y_other)
    return grid, grid.score(X_test, y_test)

In [4]:
grid, test_score = ML_pipeline_kfold_GridSearchCV(X[:,np.newaxis],y,42,5)
results = pd.DataFrame(grid.cv_results_)
print('CV MSE:',-np.around(results[results['rank_test_score'] == 1]['mean_test_score'].values[0],2),\
      '+/-',np.around(results[results['rank_test_score'] == 1]['std_test_score'].values[0],2))
print('test MSE:',-np.around(test_score,2))
print(grid.best_estimator_)
print(grid.best_score_)
print(grid.best_index_)
results

CV MSE: 0.19 +/- 0.03
test MSE: 0.16
Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('ridge',
                 Ridge(alpha=0.1, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)
-0.18765006501383993
2


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ridge__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001283 | 0.000445 | 0.000439 | 0.000177 | 0.001 | {'ridge__alpha': 0.001} | -0.2023 | -0.157711 | -0.168699 | -0.160049 | ... | -0.187654 | 0.03481 | 3 | -0.180344 | -0.191596 | -0.188761 | -0.192012 | -0.168994 | -0.184341 | 0.008747 |
| 1 | 0.000757 | 8.6e-05 | 0.00025 | 8e-06 | 0.01 | {'ridge__alpha': 0.01} | -0.202294 | -0.157719 | -0.168705 | -0.160029 | ... | -0.187654 | 0.034814 | 2 | -0.180344 | -0.191596 | -0.188761 | -0.192012 | -0.168994 | -0.184341 | 0.008747 |
| 2 | 0.000871 | 0.000126 | 0.000379 | 6.2e-05 | 0.1 | {'ridge__alpha': 0.1} | -0.202237 | -0.157799 | -0.168761 | -0.159828 | ... | -0.18765 | 0.034859 | 1 | -0.180344 | -0.191597 | -0.188762 | -0.192013 | -0.168995 | -0.184342 | 0.008747 |
| 3 | 0.000792 | 0.000164 | 0.000258 | 3.3e-05 | 1.0 | {'ridge__alpha': 1.0} | -0.20174 | -0.15865 | -0.169361 | -0.157909 | ... | -0.187673 | 0.035312 | 4 | -0.180407 | -0.191656 | -0.188824 | -0.192083 | -0.169056 | -0.184405 | 0.008748 |
| 4 | 0.000729 | 7.5e-05 | 0.000252 | 1.3e-05 | 10.0 | {'ridge__alpha': 10.0} | -0.202714 | -0.170812 | -0.17843 | -0.145902 | ... | -0.192327 | 0.04006 | 5 | -0.185235 | -0.196205 | -0.193658 | -0.197511 | -0.173772 | -0.189276 | 0.008851 |
| 5 | 0.000684 | 3.8e-05 | 0.000283 | 7e-05 | 100.0 | {'ridge__alpha': 100.0} | -0.299603 | -0.303486 | -0.282612 | -0.179732 | ... | -0.289302 | 0.064467 | 6 | -0.279937 | -0.285426 | -0.288467 | -0.303964 | -0.26627 | -0.284813 | 0.012232 |
| 6 | 0.000764 | 8.7e-05 | 0.00025 | 1.4e-05 | 1000.0 | {'ridge__alpha': 1000.0} | -0.454587 | -0.4753 | -0.419325 | -0.279924 | ... | -0.430692 | 0.08269 | 7 | -0.416955 | -0.414514 | -0.425639 | -0.457985 | -0.400098 | -0.423038 | 0.019308 |
| 7 | 0.000934 | 0.000243 | 0.000329 | 0.000114 | 10000.0 | {'ridge__alpha': 10000.0} | -0.486653 | -0.509482 | -0.446598 | -0.30221 | ... | -0.459481 | 0.085772 | 8 | -0.444815 | -0.440761 | -0.45353 | -0.489302 | -0.427309 | -0.451143 | 0.020869 |
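
To make the link between rank_test_score, mean_test_score and the best_* attributes concrete, here is a small sketch on synthetic data similar to the data generated at the top of this notebook. The names `X_toy`, `y_toy`, `grid_toy`, `res` and `best_row` are hypothetical and not part of the original code. Because the scorer is built with greater_is_better=False, all scores are negative MSE values, so rank 1 corresponds to the largest (least negative) mean_test_score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data (assumption: noisy cosine, 100 points)
rng = np.random.RandomState(10)
X_toy = rng.rand(100, 1)
y_toy = np.cos(1.5 * np.pi * X_toy[:, 0]) + rng.randn(100) * 0.1

pipe = make_pipeline(StandardScaler(), Ridge())
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)  # returns -MSE
grid_toy = GridSearchCV(pipe, {'ridge__alpha': np.logspace(-3, 4, num=8)},
                        scoring=neg_mse,
                        cv=KFold(n_splits=5, shuffle=True, random_state=42))
grid_toy.fit(X_toy, y_toy)

res = pd.DataFrame(grid_toy.cv_results_)
# the row with rank 1 has the largest mean_test_score, i.e. the smallest mean MSE
best_row = res.loc[res['rank_test_score'] == 1].iloc[0]
print('best params:', best_row['params'])
print('CV MSE:', -best_row['mean_test_score'], '+/-', best_row['std_test_score'])
```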


## Cross Validation with iid and non-iid data
By the end of this lecture, you will be able to
- use GridSearchCV with pipelines
- apply stratified splits to imbalanced data
- split based on group ID and time

## <font color='lightgray'>Cross Validation with iid and non-iid data</font>
<font color='lightgray'>By the end of this lecture, you will be able to</font>
- **use GridSearchCV with pipelines**
- <font color='lightgray'>apply stratified splits to imbalanced data</font>
- <font color='lightgray'>split based on group ID and time</font>

### Some notable differences between my KFold and KFold with GridSearchCV
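- if multiple parameters give an (almost) equally good CV score, the two approaches can select different values
   - my function uses np.argmin, which returns the first (smallest) alpha among exact ties
   - GridSearchCV returns the candidate with rank_test_score 1; in the table above this is alpha = 0.1, although alpha = 0.001 and 0.01 differ only in the sixth decimal of the mean CV score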
- GridSearchCV calculates only one test score
   - my function returns n_folds test scores
   - the GridSearchCV approach refits the best model to X_other and y_other and that model is used to calculate the test score
   - it's unclear which one is better
      - my approach allows us to estimate some uncertainty due to splitting (but not due to the choice of test set)
      - the GridSearchCV approach returns one test score but it is based on more data (likely more accurate)
- 7 lines of code in GridSearchCV
   - 28 lines of code in my function

## Estimate the uncertainty from random test sets in KFold CV
### Exercise 1 
Calculate the test score for 10 different random splits. What are the mean and the standard deviation of the test score?

## <font color='lightgray'>Cross Validation with iid and non-iid data</font>
<font color='lightgray'>By the end of this lecture, you will be able to</font>
- <font color='lightgray'>use GridSearchCV with pipelines</font>
- **apply stratified splits to imbalanced data**
- <font color='lightgray'>split based on group ID and time</font>

## Imbalanced data: use stratified folds
<center><img src="figures/stratified_kfold.png" width="600"></center>


In [6]:
from sklearn.model_selection import StratifiedKFold
help(StratifiedKFold)

Help on class StratifiedKFold in module sklearn.model_selection._split:

class StratifiedKFold(_BaseKFold)
 |  Stratified K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets.
 |  
 |  This cross-validation object is a variation of KFold that returns
 |  stratified folds. The folds are made by preserving the percentage of
 |  samples for each class.
 |  
 |  Read more in the :ref:`User Guide <cross_validation>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=3
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.20
 |          ``n_splits`` default value will change from 3 to 5 in v0.22.
 |  
 |  shuffle : boolean, optional
 |      Whether to shuffle each class's samples before splitting into batches.
 |  
 |  random_state : int, RandomState instance or None, optional, default=None
 |      If int, random_state is the seed used by the random number generator;
 |      If RandomState instance, random_state 
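
As a quick check of what "preserving the percentage of samples for each class" means, the sketch below compares KFold and StratifiedKFold on made-up imbalanced labels. The arrays `X_imb` and `y_imb` and the helper `class1_fraction_per_fold` are hypothetical and only serve this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# hypothetical imbalanced labels: 90 points in class 0, 10 points in class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # the features play no role in the split itself

def class1_fraction_per_fold(splitter):
    # fraction of class 1 points in the CV part of each fold
    return [float(np.mean(y_imb[cv_idx])) for _, cv_idx in splitter.split(X_imb, y_imb)]

plain = class1_fraction_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = class1_fraction_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print('KFold:          ', plain)
print('StratifiedKFold:', strat)
```

With stratification every fold contains 2 of the 10 minority points, so the class 1 fraction is 0.1 in each fold; with plain KFold it can fluctuate from fold to fold.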

## Stratified train_test_split

In [7]:
help(train_test_split) # give the class labels to the stratify parameter

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_s
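
A minimal sketch of passing the class labels to the stratify parameter, using made-up labels (`X_feat` and `y_lab` are hypothetical names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 90 points in class 0, 10 points in class 1
y_lab = np.array([0] * 90 + [1] * 10)
X_feat = np.arange(len(y_lab)).reshape(-1, 1)

# stratify=y_lab keeps the class proportions the same in both parts
X_other, X_test, y_other, y_test = train_test_split(
    X_feat, y_lab, test_size=0.2, random_state=0, stratify=y_lab)

print('class 1 fraction in other:', y_other.mean())
print('class 1 fraction in test: ', y_test.mean())
```

Both parts keep the 10% minority fraction of the full dataset (8 of 80 points and 2 of 20 points).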

## <font color='lightgray'>Cross Validation with iid and non-iid data</font>
<font color='lightgray'>By the end of this lecture, you will be able to</font>
- <font color='lightgray'>use GridSearchCV with pipelines</font>
- <font color='lightgray'>apply stratified splits to imbalanced data</font>
- **split based on group ID and time**

## When the iid assumption breaks down
- What is the intended use of the model? What is it supposed to do/predict?
- What data will you have available at the time the model makes its prediction?
- Your cross validation must simulate the intended use of the model!

## An example: seizure project
- you can read the publication [here](https://ieeexplore.ieee.org/document/8857552)
- classification problem:
   - epileptic seizures vs. non-epileptic psychogenic seizures
- data from the Empatica wrist sensor
   - heart rate, skin temperature, EDA, blood volume pulse (BVP), acceleration
- data collection:
   - patients come to the hospital for a few days
   - EEG and video recording to determine seizure type
   - wrist sensor data is collected
- question:
   - Can we use the wrist sensor data to differentiate the two seizure types on new patients?

In [16]:
df = pd.read_csv('data/seizure_data.csv')
print(df[df['patient ID'] == 32])

    patient ID            seizure_ID  ACC_mean  BVP_mean  EDA_mean    HR_mean  \
5           32  ID32__day3_arm_1_sz1  1.028539 -0.092102  0.112795  64.748167   
6           32  ID32__day3_arm_1_sz1  1.027986  0.745437  0.130486  63.715667   
7           32  ID32__day2_arm_1_sz0  1.002146  0.150810  0.189272  61.838500   
8           32  ID32__day2_arm_1_sz0  1.005410  0.482859  1.226038  66.240833   
9           32  ID32__day1_arm_1_sz0  0.997017 -0.925122  0.200990  56.103667   
10          32  ID32__day1_arm_1_sz0  1.009207  1.618456  1.679754  64.668167   
27          32  ID32__day1_arm_1_sz0  1.000290  0.046690  0.123165  54.289500   
28          32  ID32__day1_arm_1_sz0  1.010351  0.125039  0.471180  65.060667   
29          32  ID32__day2_arm_1_sz0  1.018163  0.254302  0.206010  61.875833   
30          32  ID32__day2_arm_1_sz0  1.016785  1.242893  0.954649  66.216167   
34          32  ID32__day3_arm_1_sz1  1.008867  0.070180  0.195966  65.995667   
35          32  ID32__day3_a

In [17]:
y = df['label']
patient_ID = df['patient ID']
seizure_ID = df['seizure_ID']
X = df.drop(columns=['patient ID','seizure_ID','label'])
classes, counts = np.unique(y,return_counts=True)
print('balance:',np.max(counts/len(y)))

balance: 0.6884057971014492


In [10]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
def ML_pipeline_kfold_GridSearchCV(X,y,random_state,n_folds):
    # create a test set
    X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state = random_state,stratify=y)
    # splitter for _other
    kf = StratifiedKFold(n_splits=n_folds,shuffle=True,random_state=random_state)
    # create the pipeline: preprocessor + supervised ML method
    scaler = StandardScaler()
    pipe = make_pipeline(scaler,SVC())
    # the parameter(s) we want to tune
    param_grid = {'svc__C': np.logspace(-3,4,num=8),'svc__gamma': np.logspace(-3,4,num=8)}
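    # prepare gridsearch
    # note: iid was deprecated in scikit-learn 0.22 and removed in 0.24; drop the argument on newer versions
    grid = GridSearchCV(pipe, param_grid=param_grid,scoring = make_scorer(accuracy_score),
                        cv=kf, return_train_score = True,iid=True)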
    # do kfold CV on _other
    grid.fit(X_other, y_other)
    return grid, grid.score(X_test, y_test)

In [11]:
test_scores = []
for i in range(5):
    grid, test_score = ML_pipeline_kfold_GridSearchCV(X,y,i*42,5)
    print(grid.best_params_)
    print('best CV score:',grid.best_score_)
    print('test score:',test_score)
    test_scores.append(test_score)
print('test accuracy:',np.around(np.mean(test_scores),2),'+/-',np.around(np.std(test_scores),2))

{'svc__C': 100.0, 'svc__gamma': 0.001}
best CV score: 0.9136363636363637
test score: 0.9285714285714286
{'svc__C': 10.0, 'svc__gamma': 0.01}
best CV score: 0.9454545454545454
test score: 0.9285714285714286
{'svc__C': 10.0, 'svc__gamma': 0.01}
best CV score: 0.9227272727272727
test score: 0.9464285714285714
{'svc__C': 10.0, 'svc__gamma': 0.01}
best CV score: 0.9363636363636364
test score: 0.9285714285714286
{'svc__C': 10.0, 'svc__gamma': 0.01}
best CV score: 0.9454545454545454
test score: 0.9107142857142857
test accuracy: 0.93 +/- 0.01


## This is wrong! A very bad case of data leakage!
- the textbook case of information leakage!
- if we just do KFold CV blindly, points from the same patient end up in different sets
   - the model is then evaluated on patients it has already seen during training
   - but when you deploy the model and apply it to data from new patients, each patient's data will be seen for the first time
- the ML pipeline needs to mimic the intended use of the model!
   - we want to split the points based on the patient ID!
   - we want all points from the same patient to be in exactly one of train, CV, or test

## Group-based split: GroupKFold
<center><img src="figures/groupkfold.png" width="600"></center>
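
Whether a splitter keeps each patient in a single set can be verified by intersecting the group IDs of the train and CV parts of every fold. The sketch below uses made-up data with 6 hypothetical patients and 4 points each. The names `groups`, `X_g`, `y_g` and `shared_groups` are illustrative and do not refer to the seizure dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# hypothetical data: 6 patients with 4 points each
groups = np.repeat(np.arange(6), 4)
X_g = np.zeros((len(groups), 1))
y_g = np.tile([0, 1], len(groups) // 2)

def shared_groups(splitter, **split_kwargs):
    # for each fold, the set of patients present in both the train and the CV part
    return [set(groups[tr].tolist()) & set(groups[cv].tolist())
            for tr, cv in splitter.split(X_g, y_g, **split_kwargs)]

leaky = shared_groups(KFold(n_splits=3, shuffle=True, random_state=0))
clean = shared_groups(GroupKFold(n_splits=3), groups=groups)
print('KFold, shared patients per fold:     ', leaky)
print('GroupKFold, shared patients per fold:', clean)
```

An empty set in every fold means no patient is split across train and CV, which is what GroupKFold guarantees.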


In [12]:
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GroupShuffleSplit
def ML_pipeline_groups_GridSearchCV(X,y,groups,random_state,n_folds):
    # create a test set based on groups
    splitter = GroupShuffleSplit(n_splits=1,test_size=0.2,random_state=random_state)
    for i_other,i_test in splitter.split(X, y, groups):
        X_other, y_other, groups_other = X.iloc[i_other], y.iloc[i_other], groups.iloc[i_other]
        X_test, y_test, groups_test = X.iloc[i_test], y.iloc[i_test], groups.iloc[i_test]
    # check the split
#     print(pd.unique(groups))
#     print(pd.unique(groups_other))
#     print(pd.unique(groups_test))
    # splitter for _other
    kf = GroupKFold(n_splits=n_folds)
    # create the pipeline: preprocessor + supervised ML method
    scaler = StandardScaler()
    pipe = make_pipeline(scaler,SVC())
    # the parameter(s) we want to tune
    param_grid = {'svc__C': np.logspace(-3,4,num=8),'svc__gamma': np.logspace(-3,4,num=8)}
    # prepare gridsearch
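    # note: iid was deprecated in scikit-learn 0.22 and removed in 0.24; drop the argument on newer versions
    grid = GridSearchCV(pipe, param_grid=param_grid,scoring = make_scorer(accuracy_score),
                        cv=kf, return_train_score = True,iid=True)
    # do kfold CV on _other; pass groups by keyword so GroupKFold receives the patient IDs
    grid.fit(X_other, y_other, groups=groups_other)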
    return grid, grid.score(X_test, y_test)

In [13]:
test_scores = []
for i in range(5):
    grid, test_score = ML_pipeline_groups_GridSearchCV(X,y,patient_ID,i*42,5)
    print(grid.best_params_)
    print('best CV score:',grid.best_score_)
    print('test score:',test_score)
    test_scores.append(test_score)
print('test accuracy:',np.around(np.mean(test_scores),2),'+/-',np.around(np.std(test_scores),2))

{'svc__C': 10.0, 'svc__gamma': 0.001}
best CV score: 0.8143459915611815
test score: 0.6410256410256411
{'svc__C': 10000.0, 'svc__gamma': 0.001}
best CV score: 0.6455696202531646
test score: 0.5847457627118644
{'svc__C': 10.0, 'svc__gamma': 0.001}
best CV score: 0.6494845360824743
test score: 0.9390243902439024
{'svc__C': 10.0, 'svc__gamma': 0.001}
best CV score: 0.7907949790794979
test score: 0.43243243243243246
{'svc__C': 10000.0, 'svc__gamma': 0.001}
best CV score: 0.6756756756756757
test score: 0.8901098901098901
test accuracy: 0.7 +/- 0.19


## The takeaway
- an incorrect cross validation pipeline gives misleading results
   - usually the model appears to be pretty accurate
   - but the performance is poor when the model is deployed
- this can be avoided by a careful cross validation pipeline
   - think about how your model will be used
   - mimic that future use in CV

## Data leakage in time series data is similar!
- do NOT use information in CV which will not be available once your model is deployed
   - don't use future information!
   
<center><img src="figures/timeseriessplit.png" width="600"></center>
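
A minimal sketch of a time-ordered split with scikit-learn's TimeSeriesSplit, assuming the rows are already sorted in time. `X_time` is a hypothetical array of 12 observations whose values equal their time index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_time = np.arange(12).reshape(-1, 1)  # hypothetical observations, ordered in time

tscv = TimeSeriesSplit(n_splits=3)
splits = [(tr.tolist(), cv.tolist()) for tr, cv in tscv.split(X_time)]
for tr, cv in splits:
    # every CV index comes after every train index, so no future information is used
    print('train:', tr, 'CV:', cv)
```

The training window grows with each split, and the CV points always lie in the future relative to the training points.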


Now you can
- use GridSearchCV with pipelines
- apply stratified splits to imbalanced data
- split based on group ID and time