## K-fold cross-validation - Classification

Use plain k-fold when the dataset is balanced, for example a binary classification problem with about 50% Yes and 50% No samples.

In [2]:
import pandas as pd
from sklearn import model_selection

if __name__ == '__main__':
    df = pd.read_csv('advertising.csv')
    df['kfold'] = -1
    df = df.sample(frac = 1).reset_index(drop = True)
    kf = model_selection.KFold(n_splits = 5)
    for fold,(t_,v_) in enumerate(kf.split(X = df)):
        df.loc[v_,'kfold'] = fold
    df.to_csv('advertising_fold.csv', index=False)
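
Once the `kfold` column exists, each fold can serve in turn as the validation set while the remaining folds are used for training. The following is a minimal sketch of that loop, not code from this notebook: it assumes a synthetic dataset from `make_classification` in place of `advertising.csv`, and the `LogisticRegression` model, the fixed `random_state` and the `run_fold` helper are illustrative choices.

```python
import pandas as pd
from sklearn import datasets, linear_model, metrics, model_selection

# Synthetic, roughly balanced binary dataset (assumed stand-in for advertising.csv).
X, y = datasets.make_classification(n_samples=500, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
df["target"] = y

# Assign folds as in the cell above, with a fixed seed added for repeatability.
df["kfold"] = -1
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
kf = model_selection.KFold(n_splits=5)
for fold, (t_, v_) in enumerate(kf.split(X=df)):
    df.loc[v_, "kfold"] = fold


def run_fold(data, fold):
    """Train on every fold except `fold`, then score on `fold` (hypothetical helper)."""
    train = data[data.kfold != fold]
    valid = data[data.kfold == fold]
    features = [c for c in data.columns if c not in ("target", "kfold")]
    model = linear_model.LogisticRegression(max_iter=1000)
    model.fit(train[features], train.target)
    preds = model.predict(valid[features])
    return metrics.accuracy_score(valid.target, preds)


# One validation accuracy per fold.
scores = [run_fold(df, fold) for fold in range(5)]
print(scores)
```

Averaging the five scores gives a single cross-validated estimate of the model's accuracy.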

## Stratified k-fold - Classification

Use stratified k-fold when the dataset is skewed, for example a binary classification problem with:
* 90% positive samples
* 10% negative samples

Some classes have a lot of samples, and some don’t have that many. With simple k-fold, the targets are not guaranteed to be equally distributed across the folds. Stratified k-fold keeps the ratio of classes in each fold the same as in the full dataset, so we choose it in this case.

* The rule is simple. If it’s a standard classification problem, choose stratified k-fold blindly.

In [None]:
import pandas as pd
from sklearn import model_selection

if __name__ == '__main__':
    df = pd.read_csv('advertising.csv')
    # placeholder fold id; every row gets a real fold below
    df['kfold'] = -1
    # shuffle the rows
    df = df.sample(frac = 1).reset_index(drop=True)
    # labels used for stratification
    y = df.target.values
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f,(t_,v_) in enumerate(kf.split(X =df, y = y)):
        df.loc[v_, 'kfold'] = f
    df.to_csv('advertise_folds.csv', index=False)
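
To see the difference between the two splitters on skewed targets, it helps to look at the share of positives in each validation fold. The sketch below assumes a hypothetical label array with 900 positives followed by 100 negatives and leaves it unshuffled on purpose, which is the worst case for plain k-fold. The `positive_rates` helper is illustrative.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical skewed labels: 900 positives followed by 100 negatives.
y = np.array([1] * 900 + [0] * 100)
X = np.zeros((len(y), 1))  # feature values are irrelevant for splitting


def positive_rates(splitter):
    """Share of positive samples in each validation fold."""
    return [float(y[valid_idx].mean()) for _, valid_idx in splitter.split(X, y)]


plain = positive_rates(model_selection.KFold(n_splits=5))
stratified = positive_rates(model_selection.StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With this ordering, plain `KFold` puts every negative sample into the last fold, so four folds contain positives only. `StratifiedKFold` keeps each fold at 90% positives. Shuffling before a plain k-fold split reduces the imbalance between folds but does not guarantee equal ratios.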

## Stratified k-fold - Regression - time series data

We cannot use stratified k-fold directly, but there are ways to change the problem a bit so that we can use stratified k-fold for regression problems. Mostly, simple k-fold cross-validation works for any regression problem. However, if you see that the distribution of targets is not consistent, you can use stratified k-fold.

* To use stratified k-fold for a regression problem, we first have to divide the target into bins, and then we can use stratified k-fold in the same way as for classification problems.

* If you have a lot of samples (> 10k, > 100k), then you don’t need to care about the number of bins. Just divide the data into 10 or 20 bins.

* If you do not have a lot of samples, you can use a simple rule like Sturges’ rule to calculate the appropriate number of bins:

Number of bins = 1 + log2(N)

where N is the number of samples in the dataset.
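
As a quick worked example of the rule, the sketch below evaluates it for a few dataset sizes. The `sturges_bins` helper is a hypothetical name, and rounding down to an integer is an assumption made here so that the result can be passed as a bin count.

```python
import numpy as np


def sturges_bins(n_samples):
    """Number of bins from Sturges' rule, rounded down to an integer."""
    return int(np.floor(1 + np.log2(n_samples)))


for n in (100, 1000, 15000):
    print(n, sturges_bins(n))
```

This gives 7 bins for 100 samples, 10 bins for 1,000 samples and 14 bins for 15,000 samples, so the bin count grows slowly with the dataset size.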

In [3]:
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn import model_selection

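def create_folds(data):
    # placeholder fold id; every row gets a real fold below
    data['kfold'] = -1
    # shuffle the rows
    data = data.sample(frac=1).reset_index(drop = True)
    # number of bins by Sturges' rule; pd.cut needs an integer count
    num_bins = int(np.floor(1 + np.log2(len(data))))
    # bin the continuous target
    data.loc[:,'bins'] = pd.cut(
        data['target'],bins = num_bins, labels= False)
    kf = model_selection.StratifiedKFold(n_splits=5)
    # stratify on the bin labels instead of the raw target
    for f,(t_,v_) in enumerate(kf.split(X = data, y = data.bins.values)):
        data.loc[v_,'kfold'] = f
    # the bins column is no longer needed
    data = data.drop('bins', axis = 1)
    return data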


if __name__ == '__main__':
    # creating a dataset with 15000 samples, 100 features and 1 target
    X, y = datasets.make_regression(n_samples = 15000, n_features = 100, n_targets = 1)
    df = pd.DataFrame(X, columns= [f'f_{i}' for i in range(X.shape[1])])
    df.loc[:,'target'] = y
    df = create_folds(df)


In [5]:
df.head()

| | f_0 | f_1 | f_2 | f_3 | f_4 | f_5 | f_6 | f_7 | f_8 | f_9 | ... | f_92 | f_93 | f_94 | f_95 | f_96 | f_97 | f_98 | f_99 | target | kfold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.344447 | -1.995182 | -1.514464 | 0.245371 | -1.265125 | 0.029895 | -0.581069 | -1.080737 | -0.031236 | -1.133373 | ... | 0.144927 | -0.768895 | -0.731983 | -0.925336 | -0.203425 | 0.897261 | -1.263357 | -0.564291 | 20.425703 | 0 |
| 1 | -0.48044 | -0.42891 | 0.34524 | -1.039374 | 0.363398 | -1.942665 | 0.169723 | -0.929829 | -0.564044 | -0.824234 | ... | -0.328204 | -0.928913 | -0.759538 | 0.132935 | -0.288166 | -2.238947 | 1.145854 | 0.121889 | 54.909779 | 0 |
| 2 | -0.51008 | -1.30815 | -0.824694 | 2.449228 | 0.542844 | -1.945811 | -0.458384 | -1.496419 | 0.803799 | -0.592972 | ... | 0.618356 | -0.498378 | 0.008155 | 0.074344 | -0.252396 | -0.890504 | -1.085554 | 0.158722 | 126.928854 | 0 |
| 3 | -0.030511 | -0.463971 | -1.067302 | 2.280652 | -0.176592 | 0.947312 | -0.310826 | 1.960855 | 0.178239 | 1.12596 | ... | 0.91804 | -1.033999 | 0.731701 | 1.401393 | 0.101335 | 1.763901 | 0.799613 | 0.039592 | 408.295951 | 0 |
| 4 | 1.749481 | -0.294192 | 0.194189 | 0.100953 | 1.148083 | 0.030195 | 2.620489 | -0.149118 | 0.161522 | 0.536576 | ... | 0.024839 | -0.500578 | 1.338352 | 0.424361 | 0.738802 | 0.557522 | -0.616515 | 0.088568 | 603.676395 | 0 |


## Train-test split with sklearn

Use a single train/test split when you don't have much time to run full cross-validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
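
For a self-contained version of the split, the sketch below assumes a synthetic skewed dataset from `make_classification` (a hypothetical stand-in for real data) and adds the `stratify` argument of `train_test_split`, so that the hold-out set keeps the same class proportions as the training set.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Hypothetical skewed binary dataset: roughly 90% of one class, 10% of the other.
X, y = datasets.make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With 1,000 samples and `test_size=0.33`, 670 rows go to training and 330 to testing, and the share of the minority class is nearly identical in both parts.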