# Methods of training and testing a model:  

## 1. Feeding all data to model:  

In this approach we feed 100% of the data to the model and then test the model on that same data. It's like preparing a kid for a school test by giving him some questions and then asking those same questions in the test. The model's performance score will be close to 100%, which is misleading: it only shows that the model memorized the data, not how it will perform on data it has never seen.
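The gap between the two kinds of score can be made visible with a small experiment. The following is only a sketch: the choice of `DecisionTreeClassifier`, the 30% hold-out, and `random_state=0` are assumptions made for illustration and are not part of the notebook below.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Hold out 30% of the samples so the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed model for the illustration: an unrestricted decision tree.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

seen_score = model.score(X_train, y_train)   # tested on the training data
unseen_score = model.score(X_test, y_test)   # tested on held-out data

print(f"score on data the model was trained on: {seen_score:.3f}")
print(f"score on data the model has never seen: {unseen_score:.3f}")
```

The first number is the one this approach reports; the second is closer to what we actually care about.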
  
## 2. Splitting the data:  

In this approach we split the dataset into a training set and a testing set by some percentage, commonly 70% for training and 30% for testing. This is called the **Train Test Split method**. The score now depends on the model, the dataset, and which samples happen to land in each part. That last point is the weakness: suppose we prepare the kid by giving him 70 questions out of 100 that are all algebra, and the other 30, which we ask in the test, are all calculus. The kid will do poorly in the test even though he studied everything he was given.
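The algebra/calculus situation can be reproduced with toy data. This is a sketch under stated assumptions: the 100 "questions" and their labels are made up for the example, and `stratify=y` is shown as one option `train_test_split` offers for keeping class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 "algebra" questions (label 0) followed by
# 30 "calculus" questions (label 1), stored in that order.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

# Without shuffling, the last 30 rows become the test set:
# training sees only algebra, testing asks only calculus.
_, _, y_train_bad, y_test_bad = train_test_split(X, y, test_size=30, shuffle=False)
print("unshuffled train labels:", np.unique(y_train_bad))
print("unshuffled test labels: ", np.unique(y_test_bad))

# stratify=y asks for the same class proportions in both parts.
_, _, y_train_ok, y_test_ok = train_test_split(
    X, y, test_size=30, stratify=y, random_state=0
)
print("stratified train counts per class:", np.bincount(y_train_ok))
print("stratified test counts per class: ", np.bincount(y_test_ok))
```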
  
## 3. K Fold Cross Validation:  

To avoid the problems with the approaches above, we have K Fold Cross Validation. In this approach we divide the samples into K folds (parts). Say we have 100 elements and divide them into 5 folds of 20 elements each. Then we run K iterations. In the first, fold 1 is the test set and the other 4 are the training set; in the second, fold 2 is the test set and the other 4 are the training set; and so on until every fold has been used as the test set once, which here means until fold 5 is the test set. Once the iterations are done, we take the average of all the scores as the final score, as shown below:

![K-Fold](resources/K-Fold.png)
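The fold rotation described above can be sketched by hand with NumPy, before using any scikit-learn helper. The variable names here are illustrative only; the notebook cells further down use `KFold` for the same job.

```python
import numpy as np

n_samples, k = 100, 5
indices = np.arange(n_samples)

# Cut the 100 indices into 5 consecutive folds of 20 elements each.
folds = np.array_split(indices, k)

splits = []
for i in range(k):
    test_idx = folds[i]
    # Every fold except the i-th one is used for training.
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, test_idx))
    print(f"iteration {i + 1}: test = {test_idx[0]}..{test_idx[-1]}, "
          f"train size = {len(train_idx)}")
```

Each element lands in the test set exactly once and in the training set in the other four iterations.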

#### We'll use scikit-learn's built-in digits image dataset and perform K-Fold on different models to evaluate their performance.

In [10]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

In [2]:
digits = load_digits()

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(digits.data,digits.target, test_size=0.2)

In [6]:
digits.data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]], shape=(1797, 64))

In [7]:
digits.target

array([0, 1, 2, ..., 8, 9, 8], shape=(1797,))

In [11]:
lr = LogisticRegression()
lr.fit(X_train, Y_train)


LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
                   fit_intercept=True, intercept_scaling=1, class_weight=None,
                   random_state=None, solver='lbfgs', max_iter=100)


In [12]:
lr.score(X_test, Y_test)

0.9666666666666667

In [13]:
svm = SVC()
svm.fit(X_train, Y_train)

SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True,
    probability=False, tol=0.001, cache_size=200, class_weight=None)


In [14]:
svm.score(X_test, Y_test)

0.9888888888888889

In [15]:
rf = RandomForestClassifier()
rf.fit(X_train, Y_train)

RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None,
                       min_samples_split=2, min_samples_leaf=1,
                       min_weight_fraction_leaf=0.0, max_features='sqrt',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       bootstrap=True)


In [16]:
rf.score(X_test, Y_test)

0.9833333333333333

#### Here we evaluated the performance of different models one by one using the train test split method. The problem is that the split is random: every time we run the train test split code, different samples end up in the training and testing sets, so the scores of the models change too. Because of this we can't run the code once or twice and decide which model performs best. To find the best-performing model this way, we would have to rerun the code many times manually and compare the results.
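How much the score moves between splits can be checked directly. The sketch below is an illustration, not part of the original notebook: `split_and_score` is a hypothetical helper, and passing `random_state` to `train_test_split` is used here only to make each split repeatable.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

def split_and_score(seed):
    """Split with the given seed, fit an SVC, and return its test score."""
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=seed
    )
    return SVC().fit(X_train, y_train).score(X_test, y_test)

# Different seeds give different splits, so the score moves around.
scores = [split_and_score(seed) for seed in range(5)]
print("scores over 5 random splits:", np.round(scores, 4))

# The same seed reproduces the same split and therefore the same score.
print("seed 0 twice:", split_and_score(0), split_and_score(0))
```

Fixing the seed makes a single run reproducible, but it does not remove the dependence on which split was drawn.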

#### Now let's try the **K-Fold method**

In [17]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=4)
kf

KFold(n_splits=4, random_state=None, shuffle=False)

In [18]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 5 6 7 8] [3 4]
[0 1 2 3 4 7 8] [5 6]
[0 1 2 3 4 5 6] [7 8]


In [19]:
def get_score(model, X_train, X_test, Y_train, Y_test):
    model.fit(X_train, Y_train)
    return model.score(X_test, Y_test)

In [20]:
from sklearn.model_selection import StratifiedKFold
# StratifiedKFold works like KFold but keeps the class proportions similar in every fold
folds = StratifiedKFold(n_splits=3)

In [28]:
scores_l = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    Y_train, Y_test = digits.target[train_index], digits.target[test_index]

    scores_l.append(get_score(LogisticRegression(), X_train, X_test, Y_train, Y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, Y_train, Y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators = 40), X_train, X_test, Y_train, Y_test))

In [29]:
scores_l

[0.9215358931552587, 0.9415692821368948, 0.9165275459098498]

In [30]:
scores_svm

[0.9649415692821369, 0.9799666110183639, 0.9649415692821369]

In [31]:
scores_rf

[0.9232053422370617, 0.9532554257095158, 0.9165275459098498]

In [32]:
from sklearn.model_selection import cross_val_score

In [38]:
# cross_val_score runs the split/fit/score loop for us; cv defaults to 5 folds
cross_val_score(LogisticRegression(), digits.data, digits.target)

array([0.92222222, 0.86944444, 0.94150418, 0.93871866, 0.89693593])

In [35]:
cross_val_score(SVC(), digits.data, digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])

In [36]:
cross_val_score(RandomForestClassifier(), digits.data, digits.target)

array([0.93611111, 0.90277778, 0.95821727, 0.95821727, 0.92200557])
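The last step of K-Fold, taking the average of the per-fold scores, is not shown in the cells above. A minimal sketch of it follows, assuming the same three models with default settings; the `models` dictionary and the variable names are illustrative, and the printed numbers will vary from run to run because `RandomForestClassifier` is randomized.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

warnings.filterwarnings("ignore")  # same choice as the first cell of this notebook

digits = load_digits()

models = {
    "LogisticRegression": LogisticRegression(),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(),
}

mean_scores = {}
for name, model in models.items():
    # One score per fold; cv=5 is written out to match the default used above.
    fold_scores = cross_val_score(model, digits.data, digits.target, cv=5)
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: mean = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")

best = max(mean_scores, key=mean_scores.get)
print("highest mean score:", best)
```

Comparing the means, with the standard deviations as a rough indication of how stable each model is across folds, gives a fairer basis for choosing a model than a single train test split.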