
# TRAIN TEST SPLIT

Evaluating a machine learning model can be quite tricky. Usually, we split the data set into a training set, used to fit the model, and a testing set, used to evaluate it. We then measure the model's performance on the testing set with a metric such as accuracy or an error measure.

In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function.

In [None]:
# In this lesson we will explore the train_test_split function
# Therefore we need no more than the function itself and NumPy
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
# Let's generate a new array 'a' which will contain all integers from 1 to 100
# The function np.arange works like the built-in 'range', except that it returns a NumPy array
a = np.arange(1, 101)

In [None]:
a

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

In [None]:
# Similarly, let's create another ndarray 'b', which will contain integers from 501 to 600
# We have intentionally picked these numbers so we can easily compare the two
# Obviously, the difference between the elements of the two arrays is 500 for any two corresponding elements
b = np.arange(501,601)
b

array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591,
       592, 593, 594, 595, 596, 597, 598, 599, 600])

In [None]:
train_test_split(a) # splits 'a' randomly into 2 arrays (by default 75% train, 25% test)

[array([ 14,  63,  28,  65,  10,  47,  33,  41,  38,  26,  97,  39,  32,
         25,  21,  18,  87,  86,  40,  78,  55,  70,  45, 100,  81,  84,
         52,   4,  99,  98,  88,  53,  46,  29,  27,  66,   3,  60,  72,
         15,  69,  37,  24,  48,  96,   5,  56,  30,  22,   1,  36,  71,
         76,  67,  80,  59,   2,  42,  95,  74,  19,  94,  89,  85,  11,
         73,  16,  75,   7,  23,  31,   8,  79,  17,  12]),
 array([57, 54, 50, 43, 82, 35, 83, 13, 61,  9, 93, 92, 44,  6, 90, 64, 34,
        58, 49, 68, 77, 62, 20, 91, 51])]

In [None]:
# There are several different arguments we can set when we call this function
# Most often, we have inputs and targets, so we have to split 2 different arrays
# We are simulating this situation by splitting 'a' and 'b'

# You can specify the 'test_size' or the 'train_size'
# If only one is given, the other is set to its complement
# Common splits are 75-25, 80-20, 85-15, 90-10

# Finally, set a 'random_state' whenever you need reproducible results
# In this way you ensure that when you are splitting the data you will always get the SAME random shuffle

# Note 2 arrays (a and b) will be split into 4
# The order is train1, test1, train2, test2
# It is very useful to store them in 4 variables, so we can later use them
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=365)

In [None]:
a_train

array([ 25,  32,  99,  73,  91,  66,   3,  59,  94,   1,   8,  15,  90,
        54,  31,  20,  77,  82,  30,  35,  95,  42,  38,   7,  11,  50,
        21,  48,   2,  17,  10,  58,  68,  43,  41,  16,  88,  72,  79,
       100,  80,  39,  24,  86,  22,  23,  62,  76,  18,  47,  55,  26,
        60,  19,  71,  64,  51,  63,  65,  28,  12,  78,  13,  44,  75,
        87,  40,   4,  29,  49,  37,  57,  27,  74,   6,  45,  92,  34,
        53,  83])

In [None]:
a_test

array([ 9, 69, 81, 56, 33, 93, 84, 61, 46, 89, 85, 67, 97,  5, 70, 36, 98,
       96, 14, 52])

In [None]:
b_train # notice: 'b' is shuffled exactly like 'a', so 25 pairs with 525, and so on

array([525, 532, 599, 573, 591, 566, 503, 559, 594, 501, 508, 515, 590,
       554, 531, 520, 577, 582, 530, 535, 595, 542, 538, 507, 511, 550,
       521, 548, 502, 517, 510, 558, 568, 543, 541, 516, 588, 572, 579,
       600, 580, 539, 524, 586, 522, 523, 562, 576, 518, 547, 555, 526,
       560, 519, 571, 564, 551, 563, 565, 528, 512, 578, 513, 544, 575,
       587, 540, 504, 529, 549, 537, 557, 527, 574, 506, 545, 592, 534,
       553, 583])

In [None]:
b_test

array([509, 569, 581, 556, 533, 593, 584, 561, 546, 589, 585, 567, 597,
       505, 570, 536, 598, 596, 514, 552])

In [None]:
# To split the arrays in their original order, without shuffling, pass shuffle=False
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, shuffle=False)

In [None]:
a_train

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80])

In [None]:
a_test

array([ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
        94,  95,  96,  97,  98,  99, 100])

In [None]:
b_test.shape,b_train.shape

((20,), (80,))

In [None]:
b_test

array([581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593,
       594, 595, 596, 597, 598, 599, 600])

In [None]:
b_train

array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580])

In [None]:
# using iris data
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

((150, 4), (150,))

In [None]:
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.4)
X_train.shape,y_train.shape,X_test.shape,y_test.shape

((90, 4), (90,), (60, 4), (60,))

In [None]:
from sklearn import svm
clf=svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9666666666666667

# Problem in Train-Test Split

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
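The three-way partition described above can be sketched as two successive calls to train_test_split. This is only an illustration: the split fractions, the candidate values of C, and the names X_rest, X_val and best_C are assumptions made for this example, not part of the original lesson.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# First hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder 75/25 into training and validation sets
# (25% of the remaining 80% is 20% of the full dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Tune C on the validation set only (candidate values are hypothetical).
best_C, best_score = None, -1.0
for C in (0.01, 0.1, 1, 10):
    score = svm.SVC(kernel='linear', C=C).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# The test set is touched once, after the hyperparameter has been chosen.
final_model = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(X_train.shape, X_val.shape, X_test.shape)
print(best_C, test_score)
```

Note that only 90 of the 150 samples remain for training here, which is the loss of data the paragraph above warns about.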

A solution to this problem is a procedure called cross-validation (CV for short). 

# Cross Validation

Under CV, a test set should still be held out for final evaluation, but a separate validation set is no longer needed. In the basic approach, called K-fold cross-validation, the training data is split into K sections (folds), and each fold is used as the testing set exactly once while the remaining K-1 folds are used for training. Let's take the scenario of 5-fold cross-validation (K=5).


![image.png](attachment:image.png)

The general procedure is as follows:

* Shuffle the dataset randomly.
* Split the dataset into k groups.
* For each unique group:
       a. Take the group as a hold-out or test data set.
       b. Take the remaining groups as a training data set (consisting of k-1 folds).
       c. Fit a model on the training set and evaluate it on the test set.
       d. Retain the evaluation score and discard the model.
* Summarize the skill of the model using the sample of model evaluation scores.

This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
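The procedure can be written out by hand with KFold, which may make each step easier to see. This is a minimal sketch on the iris data with a linear SVM; the choice of k=5, the seed, and the name fold_scores are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

# Shuffle the data and split it into k=5 groups.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kf.split(X):
    # One group is the test set, the remaining k-1 groups form the training set.
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X[train_index], y[train_index])
    # Keep the score; the model itself is discarded on the next iteration.
    fold_scores.append(model.score(X[test_index], y[test_index]))

# Summarize the skill of the model from the k scores.
fold_scores = np.array(fold_scores)
print(fold_scores, fold_scores.mean())
```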


### Note
Cross-validation is primarily used in scenarios where prediction is the main aim, and the user wants to estimate how well a predictive model will perform in real-world situations.

Cross-validation evaluates the model on held-out data during the training phase, which helps to detect problems like overfitting and underfitting. However, you must remember that both the validation and the training set must be drawn from the same distribution, or else the validation scores will be misleading.




### Benefits of Cross-Validation
* It helps evaluate the quality of your model.
* It helps to detect, and so avoid, problems of overfitting and underfitting.
* It lets you select the model that will deliver the best performance on unseen data.

### Trade-offs Between Cross-Validation and Train-Test Split

* Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take more time to run, because it fits one model per fold. So it is doing more total work.

* Given these tradeoffs, when should you use each approach? On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation. For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

* There's no simple threshold for what constitutes a large vs small dataset. If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.

* Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment gives the same results, train-test split is probably sufficient.

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 10 consecutive times (with different splits each time):

In [None]:
from sklearn.model_selection import cross_val_score
clf=svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
scores

array([1.        , 0.93333333, 1.        , 1.        , 0.86666667,
       1.        , 0.93333333, 1.        , 1.        , 1.        ])

The mean score and a rough 95% interval for the score estimate (the mean plus or minus two standard deviations) are then given by:

In [None]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.97 (+/- 0.09)


In [None]:
# The 'scoring' parameter selects the metric; 'f1_macro' is the macro-averaged F1 score
scores=cross_val_score(clf,X,y,cv=10,scoring='f1_macro')
scores

array([1.        , 0.93265993, 1.        , 1.        , 0.86111111,
       1.        , 0.93265993, 1.        , 1.        , 1.        ])

### Configuration of k
The k value must be chosen carefully for your data sample.

A poorly chosen value for k may give a misleading idea of the skill of the model, such as a score with a high variance (one that may change a lot based on the data used to fit the model), or a high bias (such as an overestimate of the skill of the model).
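One way to get a feel for the effect of k is to run cross_val_score with a few different values and compare the mean and spread of the scores. The sketch below does this on the iris data; the candidate values 3, 5 and 10 and the names results and scores_k are assumptions for illustration, and the numbers it prints say nothing general about which k is best.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel='linear', C=1)

results = {}
for k in (3, 5, 10):
    # An integer cv means k-fold CV (stratified for classifiers).
    scores_k = cross_val_score(model, X_iris, y_iris, cv=k)
    results[k] = (scores_k.mean(), scores_k.std())
    print("k=%d: mean=%.3f std=%.3f" % (k, scores_k.mean(), scores_k.std()))
```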

### Variations on Cross-Validation
There are a number of variations on the k-fold cross validation procedure.

Four commonly used variations are as follows:

* Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a single train/test split is created to evaluate the model.
* Repeated: This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
* Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
* LOOCV: Taken to another extreme, k may be set to the total number of observations in the dataset such that each observation is given a chance to be held out of the dataset. This is called leave-one-out cross-validation, or LOOCV for short.
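To see why stratification matters, the sketch below counts the class labels in each test fold of the iris data, first with plain KFold and then with StratifiedKFold. The helper class_counts_per_test_fold is a hypothetical name introduced for this example. Because the iris labels are stored sorted by class, unshuffled KFold puts a single class in each test fold, while StratifiedKFold keeps the three classes balanced.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold

X_iris, y_iris = datasets.load_iris(return_X_y=True)  # labels are sorted by class

def class_counts_per_test_fold(splitter):
    # For every split, count how many samples of each class land in the test fold.
    return [np.bincount(y_iris[test_index], minlength=3).tolist()
            for _, test_index in splitter.split(X_iris, y_iris)]

plain_counts = class_counts_per_test_fold(KFold(n_splits=3))
strat_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)
```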

### 1. KFold
KFold divides all the samples into k groups of samples, called folds, of equal sizes (if possible). The prediction function is learned using k-1 folds, and the fold left out is used for test.

It takes as arguments the number of splits, whether or not to shuffle the sample, and the seed for the pseudorandom number generator used prior to the shuffle.

In [None]:
import numpy as np
from sklearn.model_selection import KFold
X = np.array(["l","m","o","p"])
rkf = KFold(n_splits=3)
for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
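The shuffle and seed arguments mentioned above can be sketched as follows. The helper shuffled_test_folds and the seed value 0 are assumptions for illustration; the point is only that a fixed random_state makes the shuffled folds reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.array(["l", "m", "o", "p"])

def shuffled_test_folds(seed):
    # shuffle=True permutes the samples before splitting;
    # random_state fixes the permutation so the folds can be reproduced.
    kf = KFold(n_splits=2, shuffle=True, random_state=seed)
    return [test_index.tolist() for _, test_index in kf.split(samples)]

first_run = shuffled_test_folds(0)
second_run = shuffled_test_folds(0)
print(first_run)
print(first_run == second_run)  # same seed, same folds
```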


### 2. RepeatedKFold 
RepeatedKFold repeats K-Fold n times, with a different randomization in each repetition. It can be used when one needs to run KFold n times, producing different splits in each repetition.


In [None]:
import numpy as np
from sklearn.model_selection import RepeatedKFold
X = np.array([[18,77], [66,87], [51, 62], [83, 84],[9,8]])
rkf = RepeatedKFold(n_splits=2, n_repeats=2)
for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [0 4] TEST: [1 2 3]
TRAIN: [1 2 3] TEST: [0 4]
TRAIN: [0 1] TEST: [2 3 4]
TRAIN: [2 3 4] TEST: [0 1]


### 3. Repeated Stratified KFold 
Repeats Stratified K-Fold n times with different randomization in each repetition.

In [None]:
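import numpy as np
from sklearn.model_selection import RepeatedKFold
# Note: this cell calls RepeatedKFold, so its splits are not stratified.
# RepeatedStratifiedKFold also needs the class labels y, passed as split(X, y).
X = np.array([[18,77], [66,87], [51, 62], [83, 84],[9,8]])
rkf = RepeatedKFold(n_splits=3, n_repeats=2)
for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)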

TRAIN: [0 1 3] TEST: [2 4]
TRAIN: [2 3 4] TEST: [0 1]
TRAIN: [0 1 2 4] TEST: [3]
TRAIN: [1 3 4] TEST: [0 2]
TRAIN: [0 2 3] TEST: [1 4]
TRAIN: [0 1 2 4] TEST: [3]
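A stratified version needs class labels, because the folds are built to preserve the class proportions. The following is a minimal sketch using RepeatedStratifiedKFold; the sixth sample, the labels in y_demo, and the seed are hypothetical values added for this example so that each class has enough members for 3 folds.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical data: 6 samples, 3 per class.
X_demo = np.array([[18, 77], [66, 87], [51, 62], [83, 84], [9, 8], [40, 41]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
stratified_splits = list(rskf.split(X_demo, y_demo))  # labels are required here

for train_index, test_index in stratified_splits:
    print("TRAIN:", train_index, "TEST:", test_index,
          "TEST LABELS:", y_demo[test_index])
```

With 3 samples per class and 3 folds, every test fold contains one sample of each class, in both repetitions.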


### 4. LeaveOneOut (or LOO) 
LOO is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data as only one sample is removed from the training set:

In [None]:
from sklearn.model_selection import LeaveOneOut
X=[1,2,3,4]
loo=LeaveOneOut()
for train_index,test_index in loo.split(X):
    print(train_index,test_index)

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


LOO is more computationally expensive than k-fold cross validation, because it builds n models from n samples instead of k models, where n > k.

In terms of accuracy, LOO often results in high variance as an estimator for the test error. Intuitively, since n-1 of the n samples are used to build each model, models constructed from folds are virtually identical to each other and to the model built from the entire training set.

However, if the learning curve is steep for the training size in question, then 5- or 10- fold cross validation can overestimate the generalization error.
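The difference in cost can be seen by passing a LeaveOneOut splitter to cross_val_score and comparing it with 10-fold cross validation. This is a rough sketch on the iris data with the same linear SVM as before; the variable names are assumptions, and the accuracies it prints are specific to this dataset.

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel='linear', C=1)

# LOO fits one model per sample; each score is 0 or 1 (one test sample per fit).
loo_scores = cross_val_score(model, X_iris, y_iris, cv=LeaveOneOut())
# 10-fold fits only 10 models.
kfold_scores = cross_val_score(model, X_iris, y_iris, cv=10)

print("LOO:     %d fits, mean accuracy %.3f" % (len(loo_scores), loo_scores.mean()))
print("10-fold: %d fits, mean accuracy %.3f" % (len(kfold_scores), kfold_scores.mean()))
```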

### Conclusion
As a general rule, most authors, and empirical evidence, suggest that 5- or 10- fold cross validation should be preferred to LOO.