# Cross Validation

Once we have a dataset, the first thing we need to decide is which model to use. For example, for a classification problem you may consider an SVM, KNN, logistic regression, Naive Bayes and so on. But which model is best for that dataset? The only way to answer that question is to evaluate the performance of each model. We do this using the cross-validation procedure.

Traditionally, when we create a model, we first train it with data and then evaluate it by comparing the model's predictions against the true values. There are several ways we could do this.

### 1. Train the model on the whole dataset and test it against the same dataset

In this technique we train the model with the whole dataset and test it against that same dataset. The problem is that the model has already seen every data point we test it on, so the score tells us little about how it will perform on new data. It is like a teacher testing students with the same exercise questions that were given as homework: the students will likely get 100%, but this does not really test their understanding.
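To make the problem concrete, here is a minimal sketch that scores a model on the data it was fitted on and then on held-out data. The choice of a 1-nearest-neighbour classifier, the iris dataset, and `random_state=0` are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# A 1-nearest-neighbour model effectively memorises its training points
# (an illustrative choice, not a recommendation).
model = KNeighborsClassifier(n_neighbors=1)

# Score on the very same data the model was fitted on.
model.fit(X, y)
same_data_score = model.score(X, y)

# Score on data that was held out during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
held_out_score = model.score(X_test, y_test)

print(f"same data: {same_data_score:.3f}  held out: {held_out_score:.3f}")
```

The first number is usually at or near a perfect score regardless of how well the model generalises, which is why it is not a useful evaluation.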

### 2. Splitting the dataset into training and testing samples

This technique involves splitting the original dataset into training and testing datasets. We train the model with the training dataset and evaluate it with the testing dataset, which tells us how well it handles unseen data points. Usually we use a larger percentage of the data for training and a smaller percentage for evaluation. The issue with this method is that the split itself may be biased: the training or testing set may contain characteristics that barely appear in the other set, so the score may not reflect how well the model predicts unseen data in general. Shuffling the original dataset before splitting is one solution to this problem, but there is a better way.
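As a small sketch of the shuffling idea, `train_test_split` can shuffle the data and, with its `stratify` argument, keep the class proportions the same in both parts. The 80/20 ratio and `random_state=42` below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# shuffle=True is the default; stratify=y keeps each class equally
# represented in the training and testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

# Number of samples per class in each part.
print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```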

### 3. Using K fold cross validation

This is a more reliable way to evaluate a model. You split the original dataset into a number of folds (subsets), say 7. You then run 7 iterations; in each iteration a different fold is used as the testing set while the other 6 folds together form the training set. In each iteration you note the score of the model, and afterwards you average the 7 scores. That average is the overall score of that specific model.

#### Let's see this in action. First we'll use the split technique on several models and evaluate them
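The fold rotation described above can be sketched with plain NumPy before bringing in any library helper. The sample count of 14 and the variable names are assumptions chosen to keep the printout small; only the indices are shown, no model is trained.

```python
import numpy as np

n_samples, k = 14, 7
indices = np.arange(n_samples)

# Cut the indices into k consecutive folds of (nearly) equal size.
folds = np.array_split(indices, k)

splits = []
for i in range(k):
    # Fold i is the testing set; all remaining folds form the training set.
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, test_idx))
    print(f"iteration {i + 1}: test={test_idx} train={train_idx}")
```

Every sample ends up in the testing set exactly once, so each data point contributes to the final averaged score.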

In [180]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np

In [181]:
from sklearn.datasets import load_iris

In [182]:
data = load_iris()

In [183]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.8)

In [184]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9666666666666667

In [185]:
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

1.0

In [186]:
nb = GaussianNB()
nb.fit(X_train, y_train)
nb.score(X_test, y_test)

0.9333333333333333

## Running the notebook a couple of times

Something you might have noticed if you ran the notebook a couple of times is that the score of each model differs from run to run, because `train_test_split` shuffles the data randomly each time. From a single run we therefore can't conclude which model is the best to use. One solution is to repeat the split a couple of times and find the average score of each model, but that is ad hoc and time consuming. Let's use k-fold cross-validation to see how it can solve these issues.
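For reference, the repeat-and-average approach mentioned above could look roughly like this. The ten repetitions and the use of the loop counter as `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # A different seed gives a different random split on each repetition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=seed
    )
    scores.append(SVC().fit(X_train, y_train).score(X_test, y_test))

scores = np.array(scores)
print(f"mean: {scores.mean():.3f}  std: {scores.std():.3f}")
```

The standard deviation gives a feel for how much a single split can mislead.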

## Using the K fold cross validation

In [187]:
from sklearn.model_selection import KFold

In [188]:
# help(KFold)

In [189]:
kf = KFold(n_splits=5)

In [190]:
for train, test in kf.split([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]):
    print(f'train set:{train}  test: {test}')

train set:[ 3  4  5  6  7  8  9 10 11 12 13 14]  test: [0 1 2]
train set:[ 0  1  2  6  7  8  9 10 11 12 13 14]  test: [3 4 5]
train set:[ 0  1  2  3  4  5  9 10 11 12 13 14]  test: [6 7 8]
train set:[ 0  1  2  3  4  5  6  7  8 12 13 14]  test: [ 9 10 11]
train set:[ 0  1  2  3  4  5  6  7  8  9 10 11]  test: [12 13 14]


From the above output, KFold split the 15 samples into 5 folds of 3 samples each. In each iteration one fold is used as the testing set and the remaining 4 folds (12 samples) are used as the training set. For example, the first iteration used indices 0, 1 and 2 as the testing set, while the second iteration used 3, 4 and 5. **Note that the variables train and test are actually indices**, not the data values themselves; by convention the variable names are train_index and test_index. Check the official documentation for more on KFold.
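Because the folds above are consecutive blocks, data that is sorted (for example by class) can produce unrepresentative folds. A minimal sketch of the `shuffle` option follows; the sample values, the name `kf_shuffled`, and `random_state=0` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Fifteen made-up sample values, so indices and values are easy to tell apart.
samples = np.arange(10, 160, 10)

kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

test_seen = []
for train_index, test_index in kf_shuffled.split(samples):
    # The splitter yields indices; use them to select the actual data.
    print(f"test indices: {test_index}  test values: {samples[test_index]}")
    test_seen.extend(test_index.tolist())
```

Each index still appears in exactly one testing fold; only the assignment of samples to folds is randomised.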


## Other types of K fold cross validation

1. **StratifiedKFold**

Takes class information into account to avoid building folds with imbalanced class distributions (for binary or multiclass classification tasks).

2. **GroupKFold**

K-fold iterator variant with non-overlapping groups: samples from the same group never appear in both the training and the testing set.

3. **RepeatedKFold**

Repeats K-Fold n times.
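A short sketch of the last two variants on toy data. The arrays `X_demo`, `y_demo` and `groups` are made up for illustration; in practice a group might be a patient or a user that contributes several samples.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, RepeatedKFold

X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # hypothetical group labels

# GroupKFold: a group is never split between training and testing.
gkf = GroupKFold(n_splits=4)
group_splits = list(gkf.split(X_demo, y_demo, groups=groups))
for train_index, test_index in group_splits:
    print(f"test groups: {np.unique(groups[test_index])}")

# RepeatedKFold: 4 folds repeated 2 times gives 8 train/test splits.
rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=0)
repeated_splits = list(rkf.split(X_demo))
print(len(repeated_splits))
```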

### Let's use StratifiedKFold

In [191]:
from sklearn.model_selection import StratifiedKFold

In [192]:
sf = StratifiedKFold(n_splits=5)
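To see what stratification does, the sketch below counts the classes in each testing fold of the iris dataset. Unlike `KFold`, the `split` method here needs the labels as well. The names `skf` and `fold_class_counts` are introduced just for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5)

fold_class_counts = []
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each of the 3 classes land in this testing fold.
    counts = np.bincount(y[test_index], minlength=3)
    fold_class_counts.append(counts.tolist())
    print(f"test fold class counts: {counts}")
```

Iris has 50 samples per class, so every testing fold receives the same number of samples from each class.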

## Keeping the code DRY

You might have noticed how much we repeat our code when creating each classifier above. Let's use a built-in function to reduce the amount of code we write :) phew!

In [193]:
from sklearn.model_selection import cross_val_score

In [194]:
# help(cross_val_score)

In [195]:
cross_val_score(SVC(), data.data, data.target)

array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])

In [196]:
cross_val_score(RandomForestClassifier(), data.data, data.target)

array([0.96666667, 0.96666667, 0.93333333, 0.9       , 1.        ])

In [197]:
cross_val_score(GaussianNB(), data.data, data.target)

array([0.93333333, 0.96666667, 0.93333333, 0.93333333, 1.        ])
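It helps to know which splitter `cross_val_score` uses when none is given. In recent scikit-learn versions, a classifier with no `cv` argument is evaluated with 5-fold `StratifiedKFold`; the sketch below checks that and shows how to pass a different splitter through `cv`. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Default behaviour: no cv argument.
default_scores = cross_val_score(SVC(), X, y)

# Explicit stratified splitter, expected to match the default for a classifier.
stratified_scores = cross_val_score(SVC(), X, y, cv=StratifiedKFold(n_splits=5))

# Plain KFold without shuffling, which ignores the class labels when splitting.
plain_scores = cross_val_score(SVC(), X, y, cv=KFold(n_splits=5))

print("default:   ", default_scores, default_scores.mean())
print("stratified:", stratified_scores, stratified_scores.mean())
print("plain:     ", plain_scores, plain_scores.mean())
```

Because iris is ordered by class, the plain splitter tends to give less even scores than the stratified one.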

## How the cross_val_score works

It feels magical that we only provide the data and the model and get the scores back. How does this work under the hood? Let's build our own version of this function so you get an idea of how it works; actions speak louder than words.

#### Steps

1. Create a function that takes in the model and the train and test datasets, and returns the model's score.

2. Iterate through the indices returned by the KFold splitter and call the function we created in each iteration.

In [198]:
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [199]:
for train_index, test_index in kf.split(data.data):
    X_train, X_test = data.data[train_index], data.data[test_index]
    y_train, y_test = data.target[train_index], data.target[test_index]
    print(get_score(SVC(), X_train, X_test, y_train, y_test))

1.0
1.0
0.8333333333333334
0.9333333333333333
0.7


## Putting it together

In [200]:
def get_score_major(model, x_value, y_value):
    """Return one test score per fold, using the KFold splitter `kf` defined above."""
    scores = []

    for train_index, test_index in kf.split(x_value):
        X_train, X_test = x_value[train_index], x_value[test_index]
        y_train, y_test = y_value[train_index], y_value[test_index]

        # Refit the model on this fold's training part, then score it on the held-out part.
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))

    return np.array(scores)

In [201]:
get_score_major(SVC(), data.data, data.target)

array([1.        , 1.        , 0.83333333, 0.93333333, 0.7       ])

Since the function returns a NumPy array, we can use NumPy's aggregate functions such as mean.

In [202]:
get_score_major(SVC(), data.data, data.target).mean()

0.8933333333333333

In [203]:
get_score_major(RandomForestClassifier(), data.data, data.target).mean()

0.9

In [204]:
get_score_major(GaussianNB(), data.data, data.target).mean()

0.9466666666666667

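Clearly we can see how k-fold can be used to evaluate models and select the best one. In this case the Gaussian naive Bayes algorithm looks best suited for this dataset. Note that these averages are lower than the `cross_val_score` results above: our version uses plain `KFold` without shuffling on a dataset that is ordered by class, whereas `cross_val_score` uses stratified folds for classifiers by default, so the ranking here should be read with some caution. We can also use this technique for testing different parameters for a model, a process called **parameter tuning**.

We'll take a closer look at parameter tuning in another article. If you found this helpful, consider subscribing to [my YouTube channel](https://www.youtube.com/c/CodeWithPrince)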
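As a small taste of that idea, the sketch below compares a few values of the SVC regularisation parameter `C` by their mean cross-validation score. The candidate values and the names `mean_scores` and `best_C` are arbitrary choices for illustration, not a tuning recipe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

mean_scores = {}
for C in (0.1, 1.0, 10.0):
    # Same data and same 5 folds for every candidate, so the means are comparable.
    mean_scores[C] = cross_val_score(SVC(C=C), X, y, cv=5).mean()

# Pick the candidate with the highest mean score.
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_C)
```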

By Prince Krampah