# Cross-validation
Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating each one on the complementary subset. Use cross-validation to detect overfitting, i.e., a model that fits its training data well but fails to generalize to new data.

The purpose of cross-validation is to test the ability of a machine learning model to predict new data. It also helps flag problems such as overfitting or selection bias, and it gives insight into how the model will generalize to an independent dataset.
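As a rough illustration of how cross-validation exposes overfitting, the sketch below compares accuracy on the training data with cross-validated accuracy. The synthetic data set, the noise level, and the choice of an unpruned decision tree are all assumptions made for this example, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (illustrative assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0
)

# An unpruned tree can memorise the training data, noise included
tree = DecisionTreeClassifier(random_state=0)

# Accuracy measured on the same data the model was fitted on
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 folds, each scored on data the model did not see
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"training accuracy: {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

A large gap between the two numbers suggests the model has memorised the training data rather than learned a pattern that generalizes.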


# Types of Cross-Validation
The four types of cross-validation covered here are:

1. Holdout method
2. K-Fold cross-validation
3. Stratified K-Fold cross-validation
4. Leave-P-Out cross-validation

# Holdout Cross-Validation
 

The holdout method is the simplest to understand and apply. The data sample is divided into two parts: a training data set and a testing data set.

 

Before the division takes place, the data sample is shuffled so that the samples are mixed and the training data set is not biased by the original ordering. Because the training data set is larger than the testing data set (typically two to four times larger), the model is trained on many more samples than it is tested on.

 

Usually, the ratio of training data to testing data is 70:30 or 80:20. The model is first trained on the training data set and then evaluated on the testing data set.
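The whole holdout workflow (shuffle, split 80:20, train, test) might look like the sketch below. The synthetic data and the logistic regression model are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Shuffle, then keep 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=0
)

# Train on the training part only
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out part
test_acc = model.score(X_test, y_test)

print("train size:", len(X_train), "test size:", len(X_test))
print(f"holdout accuracy: {test_acc:.2f}")
```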

 

Although this method is easy and efficient, it has drawbacks.

Even though the training data set is larger than the testing data set, it may still not be representative of the whole data sample.

Conversely, the testing data set could contain essential characteristics of the data that the model never sees during training. The result therefore depends on which samples happen to land in each part. This method is also known as the train/test split approach.
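One way to see this drawback is to repeat the holdout split with different random seeds and compare the scores. The sketch below does that on a small synthetic data set; the data, the model, and the number of repetitions are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic data set with some label noise (illustrative assumption)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=5, flip_y=0.1, random_state=0
)

holdout_scores = []
for seed in range(5):
    # Same data and same model; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

print("test accuracy per split:", [round(s, 2) for s in holdout_scores])
print("spread:", round(max(holdout_scores) - min(holdout_scores), 2))
```

Any spread between the scores comes purely from which samples ended up in the test set, which is the instability that the methods in the following sections try to average out.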

 ![image.png](attachment:image.png)
 

In [12]:


import numpy as np
from sklearn.model_selection import train_test_split

# Sample data: 18 values
d = np.array([1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

# Shuffle and hold out 20% of the values as the test set
x_train, x_test = train_test_split(d, test_size=0.2, random_state=3)

print('Train:', x_train, 'Test:', x_test)

Train: [ 7 16  9 10 15 12 20 19  8  1 18 11  4 13] Test: [14  3  2 17]


# K-Fold Cross-Validation
In a data-driven world, there is rarely enough data to train a model. Setting part of it aside for validation makes this worse: with less training data we risk underfitting and missing important patterns and trends in the data set, which increases bias. Ideally, we want a method that leaves ample data for training the model and also ample data for validation.

In K-Fold cross-validation, the data is divided into k subsets (folds). It can be seen as the holdout method repeated k times: each time, one of the k subsets is used as the validation set and the other k-1 subsets together form the training set. The error is averaged over all k trials to estimate the overall performance of the model.

Each data point is in a validation set exactly once and in a training set k-1 times. This reduces bias, because most of the data is used for fitting in every trial, and it reduces the variance of the estimate, because every data point is eventually used for validation.
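The procedure described above can be sketched as an explicit loop: fit on k-1 folds, score on the remaining fold, and average the k errors. The regression data, the linear model, and the mean squared error metric are assumptions chosen for illustration; the counter simply checks that every sample is validated exactly once.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
validation_counts = np.zeros(len(X_demo), dtype=int)

for train_idx, val_idx in kfold_demo.split(X_demo):
    # Fit on the k-1 training folds
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # Score on the single held-out fold
    predictions = model.predict(X_demo[val_idx])
    fold_errors.append(mean_squared_error(y_demo[val_idx], predictions))
    # Record which samples were used for validation in this trial
    validation_counts[val_idx] += 1

# The cross-validated estimate is the average of the k fold errors
mean_error = float(np.mean(fold_errors))

print("error per fold:", [round(e, 1) for e in fold_errors])
print("mean error:", round(mean_error, 1))
print("every sample validated exactly once:", bool((validation_counts == 1).all()))
```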

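<b>Pros:</b>

It needs far less computation than exhaustive methods such as Leave-P-Out, because only k models are trained.

A single outlier has limited influence on the result, because it falls in only one validation fold and the scores are averaged over all folds.

Averaging over k folds reduces the variability that comes from relying on one particular train/test split.


<b>Cons:</b>

With imbalanced data sets, some folds may contain very few samples of a minority class, or none at all, which distorts the evaluation.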

![image-2.png](attachment:image-2.png)


In [67]:

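from numpy import array
from sklearn.model_selection import KFold

# Sample data: six observations
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# 3 folds, shuffled with a fixed seed
# (current scikit-learn versions require shuffle and random_state as keywords)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# Each iteration yields index arrays for the training and test folds
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))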

train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]




# Stratified K-Fold Cross-Validation
Stratified K-Fold is a variation of K-Fold that returns stratified folds: each fold preserves, as closely as possible, the percentage of samples of each class found in the full data set. Like `KFold`, scikit-learn's `StratifiedKFold` provides train/test indices to split the data into train/test sets.
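To make the difference concrete, the sketch below counts how many minority-class samples land in each validation fold with plain `KFold` and with `StratifiedKFold`. The 90/10 label vector, sorted by class and left unshuffled, is an assumption chosen to make the effect obvious.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced labels, sorted by class: 90 samples of class 0, then 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

# Number of class-1 samples in each validation fold, without stratification
plain_counts = [
    int(y_demo[val_idx].sum())
    for _, val_idx in KFold(n_splits=5).split(X_demo)
]

# Same count, with stratification on y_demo
strat_counts = [
    int(y_demo[val_idx].sum())
    for _, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo)
]

print("KFold minority samples per fold:          ", plain_counts)
print("StratifiedKFold minority samples per fold:", strat_counts)
```

Without stratification and without shuffling, all ten minority samples fall into the last fold; with stratification each fold receives two of them, matching the 10% share in the full data set.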

![image-3.png](attachment:image-3.png)

In [20]:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Ten samples; y holds the class labels (nine of class 1, one of class 0)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1, 1])

# No random_state is set, so the folds differ from run to run
skf = StratifiedKFold(n_splits=5, shuffle=True)

# The labels are passed to split() so the folds can be stratified
for train, validate in skf.split(data, y):
    print('Train:', data[train], '      Test:', data[validate])

Train: [ 2  3  5  6  7  8  9 10]       Test: [1 4]
Train: [1 2 4 5 6 7 8 9]       Test: [ 3 10]
Train: [ 1  3  4  5  6  7  9 10]       Test: [2 8]
Train: [ 1  2  3  4  5  6  8 10]       Test: [7 9]
Train: [ 1  2  3  4  7  8  9 10]       Test: [5 6]




# Leave-P-Out Cross-Validation
In this strategy, p observations are used for validation and the remaining observations are used for training.

For a data set with n observations, n-p observations are used for training and p for validation, and this is repeated for every possible choice of the p validation observations.

Because the method is exhaustive, it trains and tests on all possible combinations. The number of combinations grows rapidly with n and p, so it quickly becomes computationally expensive.
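The number of train/validation combinations is the binomial coefficient "n choose p". The sketch below checks this against `LeavePOut.get_n_splits` for a small case and then prints the count for a few larger values; the specific values of n and p are arbitrary choices for illustration.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Small case: 8 observations, 2 held out each time
n, p = 8, 2
X_demo = np.arange(n)

# scikit-learn reports how many splits LeavePOut will generate for this data
n_splits = LeavePOut(p=p).get_n_splits(X_demo)
print("splits reported by LeavePOut:", n_splits)
print("n choose p:", comb(n, p))

# How quickly the count grows with n and p
for n_obs in (10, 20, 50):
    print(f"n={n_obs}: p=2 -> {comb(n_obs, 2)}, p=5 -> {comb(n_obs, 5)}")
```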

![image-4.png](attachment:image-4.png)



In [4]:

import numpy as np
from sklearn.model_selection import LeavePOut

# Eight observations, matching the output shown below
d = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Hold out every possible pair of observations in turn
lpo = LeavePOut(p=2)

for train, validate in lpo.split(d):
    print("Train set:{}".format(d[train]), "Test set:{}".format(d[validate]))

Train set:[3 4 5 6 7 8] Test set:[1 2]
Train set:[2 4 5 6 7 8] Test set:[1 3]
Train set:[2 3 5 6 7 8] Test set:[1 4]
Train set:[2 3 4 6 7 8] Test set:[1 5]
Train set:[2 3 4 5 7 8] Test set:[1 6]
Train set:[2 3 4 5 6 8] Test set:[1 7]
Train set:[2 3 4 5 6 7] Test set:[1 8]
Train set:[1 4 5 6 7 8] Test set:[2 3]
Train set:[1 3 5 6 7 8] Test set:[2 4]
Train set:[1 3 4 6 7 8] Test set:[2 5]
Train set:[1 3 4 5 7 8] Test set:[2 6]
Train set:[1 3 4 5 6 8] Test set:[2 7]
Train set:[1 3 4 5 6 7] Test set:[2 8]
Train set:[1 2 5 6 7 8] Test set:[3 4]
Train set:[1 2 4 6 7 8] Test set:[3 5]
Train set:[1 2 4 5 7 8] Test set:[3 6]
Train set:[1 2 4 5 6 8] Test set:[3 7]
Train set:[1 2 4 5 6 7] Test set:[3 8]
Train set:[1 2 3 6 7 8] Test set:[4 5]
Train set:[1 2 3 5 7 8] Test set:[4 6]
Train set:[1 2 3 5 6 8] Test set:[4 7]
Train set:[1 2 3 5 6 7] Test set:[4 8]
Train set:[1 2 3 4 7 8] Test set:[5 6]
Train set:[1 2 3 4 6 8] Test set:[5 7]
Train set:[1 2 3 4 6 7] Test set:[5 8]
Train set:[1 2 3 4 5 8] T

# Implementation on a dataset

In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [57]:
df = pd.read_csv('C:/Users/deshm/Desktop/dataml/heart.csv')
df

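| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |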


In [68]:
# KFold

In [58]:
kf = KFold(n_splits=5)

In [61]:
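# X is not defined in the cells shown here; it is assumed to be
# the 13 feature columns from `age` to `thal` (everything except `target`)
X = df.loc[:, 'age':'thal']
X.shape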

(303, 13)

In [62]:
303/5

60.6

In [69]:
60*4

240
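The arithmetic above is only approximate, because 303 is not divisible by 5. `KFold` splits as evenly as it can: the first `n_samples % n_splits` folds get one extra sample. The sketch below checks the resulting sizes on a placeholder array with 303 rows, which stands in for `X` and is an assumption made so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 303, 5

# Placeholder with the same number of rows as the heart data set
placeholder = np.zeros((n_samples, 1))

# (training size, test size) for each of the 5 iterations
fold_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in KFold(n_splits=n_splits).split(placeholder)
]

for i, (n_train, n_test) in enumerate(fold_sizes, start=1):
    print(f"iteration {i}: train={n_train}, test={n_test}")
```

The first three iterations test on 61 samples and train on 242; the last two test on 60 and train on 243.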

In [60]:
i = 1
for train_set, test_set in kf.split(X=X):
    print("iteration ", i)
    print(train_set, " having :" , len(train_set))
    print(test_set, " having :" , len(test_set))
    print("-------------------------")
    i += 1

iteration  1
[ 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204
 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222
 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258
 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276
 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294
 295 296 297 298 299 300 301 302]  hav