### K-Fold Cross Validation
- KFold divides the samples into k groups (folds) of approximately equal size. In each iteration, k-1 folds are used for training and the remaining fold is used for testing. This process is repeated k times, so every fold serves as the test set exactly once.
- KFold(n_splits=5, *, shuffle=False, random_state=None)
 - n_splits: int, default=5. Number of folds.
 - shuffle: bool, default=False. Whether to shuffle the data before splitting it into folds. Samples within each split are not shuffled.
 - random_state: int, default=None. Controls the randomness of the shuffle. It affects the ordering of the indices only when shuffle=True and has no effect otherwise.
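
The interaction between `shuffle` and `random_state` can be sketched on a toy array. This is a minimal illustration, not part of the original notebook: the array `data` and the helper `fold_test_indices` are made-up names.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(6)  # toy data, illustrative only


def fold_test_indices(cv, samples):
    """Return the test indices of every fold as plain Python lists."""
    return [test.tolist() for _, test in cv.split(samples)]


# Without shuffling, the test folds are consecutive blocks of indices.
plain = fold_test_indices(KFold(n_splits=3), data)

# With shuffling, a fixed random_state makes the split reproducible.
shuffled_a = fold_test_indices(KFold(n_splits=3, shuffle=True, random_state=42), data)
shuffled_b = fold_test_indices(KFold(n_splits=3, shuffle=True, random_state=42), data)

print("No shuffle  :", plain)
print("Shuffled    :", shuffled_a)
print("Reproducible:", shuffled_a == shuffled_b)
```

Whichever seed is used, every index still appears in exactly one test fold. Only the assignment of samples to folds changes.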

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

In [2]:
X = ["a",'b','c','d','e','f']
kf = KFold(n_splits=3,shuffle=False,random_state=None)

In [3]:
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [4]:
for train, test in kf.split(X):
    print("Train:",train,"Test:",test)

Train: [2 3 4 5] Test: [0 1]
Train: [0 1 4 5] Test: [2 3]
Train: [0 1 2 3] Test: [4 5]


### Stratified KFold
- This technique is a variation of K-Fold that divides the data into k stratified folds, so that each fold preserves the percentage of samples of each class present in the data.
- It generates test sets such that all of them contain the same distribution of classes, or as close to it as possible.
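
To make the difference from plain K-Fold concrete, the sketch below counts the classes in each test fold for both splitters. The imbalanced toy labels and the helper `class_counts_per_fold` are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 8 samples of class 0, then 4 of class 1 (illustrative only).
labels = np.array([0] * 8 + [1] * 4)
features = np.zeros((len(labels), 1))  # feature values do not influence the split


def class_counts_per_fold(cv):
    """Count how many samples of each class land in every test fold."""
    return [
        np.bincount(labels[test], minlength=2).tolist()
        for _, test in cv.split(features, labels)
    ]


kfold_counts = class_counts_per_fold(KFold(n_splits=4))
strat_counts = class_counts_per_fold(StratifiedKFold(n_splits=4))

print("KFold          :", kfold_counts)
print("StratifiedKFold:", strat_counts)
```

Because the toy labels are sorted by class, unshuffled KFold produces test folds that contain a single class, while StratifiedKFold keeps the 2:1 class ratio in every fold.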

In [5]:
from sklearn.model_selection import StratifiedKFold

In [6]:
X = np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y= np.array([0,0,1,0,1,1])
skf = StratifiedKFold(n_splits=3,random_state=None,shuffle=False)

In [7]:
for train_index,test_index in skf.split(X,y):
    print("Train:",train_index,'Test:',test_index)
    X_train,X_test = X[train_index], X[test_index]
    y_train,y_test = y[train_index], y[test_index]

Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]


### LeaveOneOut Cross Validation
- This is a simple technique in which the training data includes all observations except one, which is held out for testing.
- For n samples, we have n different training sets.
- Although each model is trained on almost all of the data, fitting n models on n different training sets makes it computationally very expensive.
- Because almost all of the data (n-1 of the n samples) is used to build each model, the models are nearly identical to each other, and the resulting estimate tends to have higher variance than one from k-fold.
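
The cost argument above can be checked by counting splits. The following sketch (with a made-up toy array `samples`) compares the number of model fits required by LeaveOneOut and by 5-fold, and confirms that LeaveOneOut produces the same test sets as KFold with `n_splits` equal to the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

samples = np.arange(10, 110, 10)  # ten toy observations, illustrative only

loo = LeaveOneOut()
n_fits_loo = loo.get_n_splits(samples)  # one fit per sample
n_fits_5fold = KFold(n_splits=5).get_n_splits(samples)

# LeaveOneOut behaves like KFold with as many folds as there are samples.
loo_tests = [test.tolist() for _, test in loo.split(samples)]
kfold_n_tests = [
    test.tolist() for _, test in KFold(n_splits=len(samples)).split(samples)
]

print("Fits needed by LeaveOneOut:", n_fits_loo)
print("Fits needed by 5-fold     :", n_fits_5fold)
print("Same test sets as KFold(n):", loo_tests == kfold_n_tests)
```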

In [8]:
from sklearn.model_selection import LeaveOneOut

In [9]:
X = [10,20,30,40,50,60,70,80,90,100]
l = LeaveOneOut()

In [10]:
for train, test in l.split(X):
    print("%s %s"% (train,test))

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]


- For these reasons, 5-fold or 10-fold cross validation is generally preferred over LOOCV.

### Hold Out Cross Validation
- Here we split the data into two sets: a train set and a test set.
- The split ratio (70:30, 80:20, or even 60:40) depends on the use case.
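
As a hedged sketch of a 70:30 hold-out split on labelled data, the example below also passes the optional `stratify` argument of `train_test_split`, which keeps the class ratio the same in both sets. The toy arrays `features` and `labels` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 10 per class (illustrative only).
features = np.arange(20).reshape(-1, 1)
labels = np.array([0] * 10 + [1] * 10)

# 70:30 split; stratify=labels preserves the class ratio in both sets.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=1
)

print("Train size:", len(x_tr), "class counts:", np.bincount(y_tr))
print("Test size :", len(x_te), "class counts:", np.bincount(y_te))
```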

In [11]:
from sklearn.model_selection import train_test_split
X = [10,20,30,40,50,60,70,80,90,100]

In [12]:
train, test = train_test_split(X, test_size=0.3, random_state=1)
print("Train:", train, "\tTest:", test)


In [13]:
df=pd.read_csv("data.csv")

Dataset: Breast Cancer Wisconsin (Diagnostic), https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

In [14]:
df.head(10)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
5,843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,
6,844359,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
7,84458202,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,
8,844981,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,
9,84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,


In [15]:
df.shape

(569, 33)

In [16]:
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

In [17]:
df=df.drop(["Unnamed: 32"], axis=1)

In [18]:
df.shape

(569, 32)

In [19]:
df.diagnosis.value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

In [20]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [21]:
x=df.drop(["id","diagnosis"],axis=1)

In [22]:
y=df.diagnosis

### Hold Out Method

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [24]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=5)

In [25]:
dt=DecisionTreeClassifier()

In [26]:
dt.fit(x_train,y_train)

DecisionTreeClassifier()

In [27]:
dt.score(x_train,y_train)

1.0

In [28]:
result=dt.score(x_test,y_test)

In [29]:
print("The accuracy score for the hold-out method : ",result)

The accuracy score for the hold-out method :  0.9298245614035088
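
A single hold-out score depends on which samples happen to fall in the test set. The sketch below repeats the 70:30 split with several seeds to show that spread. It assumes that scikit-learn's bundled `load_breast_cancer` dataset can stand in for `data.csv`, and it fixes the tree's `random_state` so that only the split varies; neither choice comes from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
features, labels = load_breast_cancer(return_X_y=True)

holdout_scores = []
for seed in range(5):
    # A different seed gives a different 70:30 partition of the same data.
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)
    holdout_scores.append(model.score(x_te, y_te))

print("Hold-out accuracy per seed:", [round(s, 3) for s in holdout_scores])
print("Spread (max - min):", round(max(holdout_scores) - min(holdout_scores), 3))
```

This sensitivity to the particular split is one motivation for averaging over several folds, as the cross validation methods in the following sections do.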


### K-Fold

In [30]:
from sklearn.model_selection import KFold, cross_val_score

In [31]:
kf=KFold(n_splits=5)

In [32]:
kf_score=cross_val_score(dt,x,y,cv=kf)

In [33]:
print("The cross validation score for the Kfold method with 5 fold is : ",kf_score)

The cross validation score for the Kfold method with 5 fold is :  [0.87719298 0.90350877 0.94736842 0.93859649 0.89380531]


In [34]:
kf_score_mean=kf_score.mean()

In [35]:
kf_score_mean

0.9120943952802361
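
To show what `cross_val_score` does with a `KFold` splitter, the sketch below reproduces its scores with an explicit loop. It assumes scikit-learn's bundled `load_breast_cancer` dataset as a stand-in for `data.csv` and fixes `random_state=0` on the tree so that both runs are deterministic; the exact scores may therefore differ from the ones printed above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
features, labels = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)
cv = KFold(n_splits=5)

# Manual loop: fit a fresh copy of the model on each training fold,
# then score it on the held-out fold.
manual_scores = []
for train_idx, test_idx in cv.split(features):
    model = clone(tree)
    model.fit(features[train_idx], labels[train_idx])
    manual_scores.append(model.score(features[test_idx], labels[test_idx]))
manual_scores = np.array(manual_scores)

# The helper performs the same steps internally.
helper_scores = cross_val_score(tree, features, labels, cv=cv)

print("Manual loop     :", manual_scores.round(4))
print("cross_val_score :", helper_scores.round(4))
print("Mean accuracy   :", round(helper_scores.mean(), 4))
```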

### Stratified KFold

In [36]:
from sklearn.model_selection import StratifiedKFold

In [37]:
skf=StratifiedKFold(n_splits=10)

In [38]:
skf_score=cross_val_score(dt,x,y,cv=skf)

In [39]:
print("The cross validation score for the StratifiedKfold method with 10 fold is : ",skf_score)

The cross validation score for the StratifiedKfold method with 10 fold is :  [0.89473684 0.85964912 0.92982456 0.87719298 0.94736842 0.89473684
 0.89473684 0.94736842 0.9122807  0.92857143]


In [40]:
skf_score_mean=skf_score.mean()

In [41]:
skf_score_mean

0.9086466165413534
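
For a classifier, passing an integer as `cv` to `cross_val_score` makes scikit-learn use `StratifiedKFold` internally, so the explicit splitter above can be abbreviated. The sketch below checks this equivalence under two assumptions that are not in the original notebook: the bundled `load_breast_cancer` dataset replaces `data.csv`, and the tree's `random_state` is fixed so both runs are comparable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
features, labels = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Explicit splitter, as in the cells above.
explicit_scores = cross_val_score(
    tree, features, labels, cv=StratifiedKFold(n_splits=10)
)

# Integer shorthand: for classifiers this also stratifies the folds.
implicit_scores = cross_val_score(tree, features, labels, cv=10)

print("Explicit StratifiedKFold:", explicit_scores.round(4))
print("cv=10 shorthand         :", implicit_scores.round(4))
```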

### LeaveOneOut Method

In [42]:
from sklearn.model_selection import LeaveOneOut

In [43]:
lv=LeaveOneOut()

In [44]:
lv_score=cross_val_score(dt,x,y,cv=lv)

In [45]:
print("The cross validation score for the LeaveOneOut method is : ",lv_score)

The cross validation score for the LeaveOneOut method is :  [1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.
 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 

In [46]:
lv_score.mean()

0.9209138840070299
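
Each LeaveOneOut score is either 0 or 1 because every test set holds a single sample, so the mean is simply the fraction of samples predicted correctly. The sketch below verifies this with `cross_val_predict`. It assumes the bundled `load_breast_cancer` dataset as a stand-in for `data.csv`, keeps only every fifth sample to stay fast, and fixes the tree's `random_state`; all three choices are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
features, labels = load_breast_cancer(return_X_y=True)
features, labels = features[::5], labels[::5]  # every 5th sample, for speed

tree = DecisionTreeClassifier(random_state=0)
loo = LeaveOneOut()

# One score per sample: 1.0 if the held-out sample was classified correctly.
loo_scores = cross_val_score(tree, features, labels, cv=loo)

# Collect the held-out prediction for every sample and score them together.
loo_predictions = cross_val_predict(tree, features, labels, cv=loo)
pooled_accuracy = accuracy_score(labels, loo_predictions)

print("Distinct per-sample scores:", np.unique(loo_scores))
print("Mean of LOO scores        :", round(loo_scores.mean(), 4))
print("Pooled accuracy           :", round(pooled_accuracy, 4))
```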