## In this notebook we are going to perform and test different types of cross validation

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Here we will use the advertising dataset to perform classic K-Fold cross-validation and inspect how KFold splits the data.
<br></br>
First we create the KFold object and split the data into two subsets: a train set and a test set. Cross-validation is performed only on the train set. The test set is held out and used just once, for the final evaluation, so that the reported performance is not optimistically biased by decisions made during cross-validation.

In [2]:
data = pd.read_csv('../data/Advertising.csv')

In [3]:
kf = KFold(n_splits = 5)

X_train,X_test,y_train,y_test = train_test_split(data.drop("sales",axis=1),data['sales'],
                                                 test_size = 0.2, random_state=97)
# Chain the preprocessing and modelling steps in a pipeline
pipe = Pipeline([("scaler",StandardScaler()),("model",LinearRegression())])

In [4]:
#Check the index of the train set rows
np.sort(X_train.index)

array([  1,   3,   4,   6,   7,   8,   9,  10,  11,  13,  14,  15,  17,
        18,  19,  21,  23,  24,  25,  26,  27,  28,  29,  31,  32,  33,
        34,  35,  36,  39,  40,  41,  42,  43,  44,  46,  48,  50,  51,
        52,  53,  54,  57,  59,  60,  61,  62,  64,  66,  67,  69,  70,
        71,  72,  73,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,
        86,  87,  88,  89,  90,  91,  92,  93,  95,  96,  98, 100, 101,
       102, 103, 104, 105, 107, 108, 109, 111, 112, 113, 114, 115, 116,
       117, 118, 120, 123, 124, 125, 126, 127, 129, 130, 131, 132, 133,
       134, 135, 136, 137, 139, 142, 143, 144, 147, 148, 149, 150, 151,
       152, 153, 154, 155, 156, 157, 158, 159, 161, 162, 163, 164, 165,
       166, 169, 170, 171, 172, 173, 174, 175, 177, 178, 179, 180, 181,
       182, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
       196, 197, 198, 199], dtype=int64)

In [5]:
#Perform k-fold cross validation for each fold of the train set
scores = []
mean_sq = []
for train, test in kf.split(X = X_train, y = y_train):
    pipe.fit(X_train.iloc[train],y_train.iloc[train])
    #get scores and save into an array
    preds = pipe.predict(X_train.iloc[test])
    errors = mean_squared_error(y_train.iloc[test], preds)
    score = pipe.score(X_train.iloc[test], y_train.iloc[test])
    scores.append(score)
    mean_sq.append(errors)

(scores,mean_sq)

([0.8368030348383042,
  0.825623080292534,
  0.8967243723102103,
  0.8933323393255432,
  0.9161489573583848],
 [2.771159373302651,
  4.689293378364903,
  2.7620905257466006,
  3.208287188848609,
  2.0347036886053127])
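The notebook cross-validates on the train set but never shows the single final evaluation on the test set described earlier. A minimal sketch of that last step follows. It assumes synthetic data from `make_regression` as a stand-in for the advertising dataset, so the numbers it prints are not comparable to the scores above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 200 rows and 3 features, like the advertising data
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=97)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

# Cross-validation only ever sees the train set
cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=KFold(n_splits=5), scoring="r2")

# Final step: refit on the whole train set, then score once on the held-out test set
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

print("mean CV R^2:", cv_r2.mean())
print("test R^2:   ", test_r2)
```

The mean cross-validation score is the estimate used while developing the model; the test score is computed once at the end and is not used to make further modelling choices.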

Let's stop for a moment and check what the KFold object returns.
<br></br>
We loop over the enumerated output of the KFold object's split function (using the enumerate function), which yields one split per iteration. Here we split the entire dataset, which is not good practice; we do it only for illustration.
Each split consists of two arrays: the train fold and the test fold.
<br></br>
Keep in mind that the returned values are the positions of the rows (0 to n-1), not their index labels in the dataframe, so they should be used with .iloc rather than .loc.
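To make the difference between positions and index labels concrete, here is a small sketch on a toy dataframe (hypothetical data, not the advertising dataset) whose index labels start at 100.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame: index labels are 100..105, row positions are 0..5
df = pd.DataFrame({"x": np.arange(6)}, index=np.arange(100, 106))

train_pos, test_pos = next(KFold(n_splits=3).split(df))
print(test_pos)                 # positions of the first test fold
print(df.iloc[test_pos].index)  # index labels of those same rows

# .iloc selects by position and works with any index.
# .loc would look for the labels 0 and 1, which do not exist in this frame.
try:
    df.loc[test_pos]
except KeyError:
    print("KeyError: .loc treats the positions as labels")
```

With a default index (0, 1, 2, ...) positions and labels coincide, which is why .loc happens to work on a freshly loaded dataframe but fails after a shuffle, a filter, or a train/test split.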


In [6]:
for index, (train, test) in enumerate(kf.split(X = data[['TV','radio','newspaper']], y = data['sales'])):
    print('\niteration: {0}\ntrain set:\n{1}\n\ntest set:\n{2}'.format(index, train, test))
    print('train set: {0}\ntest set: {1}'.format(len(train), len(test)))


iteration: 0
train set:
[ 40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75
  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93
  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199]

test set:
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
train set: 160
test set: 40

iteration: 1
train set:
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33 

In [7]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

Let's repeat the same process in a simpler way, using the cross_validate function.

In [8]:
lin_reg = LinearRegression()
scores = cross_validate(lin_reg, X_train, y_train, scoring = ['neg_mean_squared_error','r2'],cv = kf)
scores

{'fit_time': array([0.00699449, 0.00858402, 0.00525188, 0.00499845, 0.00598884]),
 'score_time': array([0.00528383, 0.0049994 , 0.00400114, 0.0030005 , 0.0059979 ]),
 'test_neg_mean_squared_error': array([-2.77115937, -4.68929338, -2.76209053, -3.20828719, -2.03470369]),
 'test_r2': array([0.83680303, 0.82562308, 0.89672437, 0.89333234, 0.91614896])}
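These scores agree with the manual loop from In [5] to the printed precision, although that loop used a pipeline with StandardScaler and this call uses a bare LinearRegression. This is expected: ordinary least squares with an intercept produces the same predictions when the features are shifted and rescaled. A minimal sketch on synthetic data (`make_regression` is a stand-in, not the advertising dataset) that checks this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=160, n_features=3, noise=5.0, random_state=0)
kf = KFold(n_splits=5)

plain = LinearRegression()
scaled = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

# With a single scoring string, the results are stored under "test_score"
r2_plain = cross_validate(plain, X, y, scoring="r2", cv=kf)["test_score"]
r2_scaled = cross_validate(scaled, X, y, scoring="r2", cv=kf)["test_score"]

# The two sets of fold scores should match up to floating-point error
print(np.allclose(r2_plain, r2_scaled))
```

This equivalence is specific to unregularized linear regression. For models such as Ridge, Lasso or k-nearest neighbours, scaling changes the fit, and the pipeline itself should be passed to cross_validate so the scaler is fitted on each training fold only.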

Comparing with the results of the manual loop above, we can see that the scores are the same (note that cross_validate reports the negated mean squared error). Now let's perform stratified cross-validation. For this we'll use another dataset, in which the target is binary (1/0, yes/no, this/that).

In [9]:
from sklearn.model_selection import StratifiedKFold

In [10]:
st_k_fold = StratifiedKFold(n_splits = 5)

In [11]:
data_2 = pd.read_csv('../data/census_income_data.csv')

data_2['income'].value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

At first glance we can see that the label/target is imbalanced: about 24% of the rows are >50K. To keep that proportion approximately equal in every fold, we perform stratified k-fold cross-validation.
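To see what stratification buys, the sketch below compares KFold and StratifiedKFold on synthetic labels (an assumption, not the census data). The labels are deliberately sorted, which is the worst case for an unshuffled KFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 80 negatives followed by 20 positives
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the features play no role in how the splits are made

def positive_rates(splitter):
    """Share of the positive class in each test fold."""
    return [float(y[test].mean()) for _, test in splitter.split(X, y)]

plain_rates = positive_rates(KFold(n_splits=5))
strat_rates = positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_rates)  # four folds without positives, one with only positives
print("StratifiedKFold:", strat_rates)  # every fold keeps the overall 20% share
```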

In [12]:
# split() returns row positions, so select with .iloc rather than .loc
for index, (train, test) in enumerate(st_k_fold.split(X = data_2.drop('income',axis=1), y = data_2['income'])):
    print("\niteration {0} (train)\n{1}".format(index,data_2['income'].iloc[train].value_counts()))
    print("\niteration {0} (test)\n{1}".format(index,data_2['income'].iloc[test].value_counts()))


iteration 0 (train)
<=50K    19776
>50K      6272
Name: income, dtype: int64

iteration 0 (test)
<=50K    4944
>50K     1569
Name: income, dtype: int64

iteration 1 (train)
<=50K    19776
>50K      6273
Name: income, dtype: int64

iteration 1 (test)
<=50K    4944
>50K     1568
Name: income, dtype: int64

iteration 2 (train)
<=50K    19776
>50K      6273
Name: income, dtype: int64

iteration 2 (test)
<=50K    4944
>50K     1568
Name: income, dtype: int64

iteration 3 (train)
<=50K    19776
>50K      6273
Name: income, dtype: int64

iteration 3 (test)
<=50K    4944
>50K     1568
Name: income, dtype: int64

iteration 4 (train)
<=50K    19776
>50K      6273
Name: income, dtype: int64

iteration 4 (test)
<=50K    4944
>50K     1568
Name: income, dtype: int64


Now we can see that the proportion of the label is preserved across all folds.
<br></br>
Finally, we're going to perform stratified group k-fold cross-validation. It tries to preserve the proportion of the target variable while keeping every group entirely inside a single fold, so that no group appears in both the train and the test set of a split. Here we use the person's education level as the group, since the model could be biased by it.
<br></br>
This way, each test fold measures how well the model predicts the outcome for people who belong to groups it has not seen during training.

In [13]:
data_2['education'].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

In [14]:
from sklearn.model_selection import StratifiedGroupKFold

In [15]:
st_gr = StratifiedGroupKFold(n_splits = 5)
groups = data_2['education']

In [16]:
# split() returns row positions, so select with .iloc rather than .loc
for index, (train, test) in enumerate(st_gr.split(X = data_2.drop('income',axis=1), y = data_2['income'], groups = groups)):
    print("\n train:\n{0}".format(data_2['income'].iloc[train].value_counts()))
    print("test:\n{0}".format(data_2['income'].iloc[test].value_counts()))
    print("groups:\n{0}".format(data_2['education'].iloc[test].value_counts()))


 train:
<=50K    15894
>50K      6166
Name: income, dtype: int64
test:
<=50K    8826
>50K     1675
Name: income, dtype: int64
groups:
HS-grad    10501
Name: education, dtype: int64

 train:
<=50K    18816
>50K      6454
Name: income, dtype: int64
test:
<=50K    5904
>50K     1387
Name: income, dtype: int64
groups:
Some-college    7291
Name: education, dtype: int64

 train:
<=50K    21565
>50K      7031
Name: income, dtype: int64
test:
<=50K    3155
>50K      810
Name: income, dtype: int64
groups:
11th           1175
Assoc-acdm     1067
7th-8th         646
Prof-school     576
5th-6th         333
1st-4th         168
Name: education, dtype: int64

 train:
<=50K    21535
>50K      5620
Name: income, dtype: int64
test:
<=50K    3185
>50K     2221
Name: income, dtype: int64
groups:
Bachelors    5355
Preschool      51
Name: education, dtype: int64

 train:
<=50K    21070
>50K      6093
Name: income, dtype: int64
test:
<=50K    3650
>50K     1748
Name: income, dtype: int64
groups:
Masters    

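Unlike the previous cross-validation types, here the folds differ in size and the proportion of the target variable is not the same in every fold. The splitter balances the classes only as far as it can while keeping each education level entirely inside one fold. With only 16 groups of very different sizes and income profiles, the result is approximate: in the test sets the share of >50K ranges from about 16% (first fold, HS-grad only) to about 41% (fourth fold, mostly Bachelors), against 24% in the whole dataset.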
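The share of >50K in each test fold can be computed directly from the counts printed above. The short calculation below copies those counts by hand and uses the class totals from the value_counts() output of In [11].

```python
# (<=50K, >50K) counts of each test fold, copied from the output above
test_counts = [(8826, 1675), (5904, 1387), (3155, 810), (3185, 2221), (3650, 1748)]

# Share of >50K in the whole dataset, from the class totals shown earlier
overall = 7841 / (24720 + 7841)

rates = [high / (low + high) for low, high in test_counts]
for i, rate in enumerate(rates):
    print(f"fold {i}: {rate:.1%} >50K in test (overall {overall:.1%})")
```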