# Cross Validation

It's been almost two years since I started learning about machine learning and its internal structures and processes. One thing still concerns me: no matter how careful you are with your data, you can't always avoid the `bias and variance problem`, and there aren't many creative solutions to it. But there is one thing you can do, and that is `Cross Validation`. Cross validation might not make the bias-variance problem obsolete, but it will definitely help you spot and reduce it in a model. So, this article is about Cross Validation.

As always, I'm `Md. Rishat Talukder`, AKA `Itvaya` (nobody knows me by that name). Let's get started.

- [LinkedIn](https://www.linkedin.com/in/pro-programmer/)
- [YouTube](http://www.youtube.com/@itvaya)
- [GitHub](https://github.com/RishatTalukder/Machine-Learning-Zero-to-Hero)
- [Gmail](mailto:talukderrishat2@gmail.com)
- [Discord](https://discord.gg/ZB495XggcF)

# The Intuition of the Bias-Variance Problem

## The Intuition

Let's say you are a student. You have an `exam coming` in one week, but you are `under-prepared` for it.

So, you `study hard`. So hard that by exam time you have covered 80% of the syllabus. You are confident that you should at least get a moderate grade.

And you did! You got 90%. Congrats!

But when you took the test, you realised that the other 20% of the syllabus was barely covered: the teacher wrote almost all the questions from the 80% that you studied hard for.

So, here's the question for you: _**Does this mean that you are secretly brilliant, or were you just lucky? If questions from the remaining 20% of the syllabus had been on the test, do you think you would have got 100%?**_

The answer is clearly `No`.

Now, imagine you took 5 more exams on the same syllabus. This time the questions are more `scrambled` across the syllabus. You average around 78%.

This time the result looks trustworthy.

Here are some crucial points:

- One exam is not enough to get a proper view of your performance. After taking 5 exams, you get a much better one.

- You got lucky on the first test because the questions were not scrambled. On the later tests the questions were scrambled, and you performed worse on some of them.

Now, keep this intuition in mind because we are going to apply this in machine learning terms.

## The Bias-Variance Problem

Now, in the above example, the question was, _**Does this mean that you are secretly brilliant or were you just lucky?**_

Let's make the points clear for the first exam:

- You covered `80%` of the syllabus. The syllabus is the `dataset` and the 80% is the `training set`.
- You took the first test, where almost all the questions were from that `80%` of the syllabus. The first test is the `testing set`.
- You scored `90%` on the first test. YOU are the `MODEL`.

Now, let's rephrase the question:

- You made a model that is trained on `80%` of the dataset and then tested on the remaining `20%`, where most of the inputs are similar to the training set.

- It learned the patterns in that 80% of the data very well, so when faced with similar data it performed well: 90%.

Which is amazing, but this is a classic example of `bias`: the score is pushed upward because the test data looks so much like the data the model was trained on.

That is what we will actually see if we run the model on more diverse data multiple times.

And if you have a dataset that contains the patterns from the other 20% of the data, the model will not perform as well as it did on the first test.

So, one thing is clear: building and testing a model on a single configuration of the data is not enough to get a proper view of its performance.
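To make the "one exam is not enough" point concrete, here is a minimal sketch that trains the same model on several different 80/20 splits and records the test accuracy of each. The iris dataset, the `LogisticRegression` model and the five seeds are assumptions picked purely for illustration; none of them come from this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative dataset and model (assumptions, not part of the article).
X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # Each seed produces a different 80/20 split: a different "exam".
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # accuracy on this split

for seed, score in enumerate(scores):
    print(f"split {seed}: accuracy = {score:.2f}")
print(f"mean over {len(scores)} splits = {sum(scores) / len(scores):.2f}")
```

The individual accuracies may differ from split to split, and the mean is usually a more trustworthy summary than any single one of them.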

So, this is where `cross-validation` comes into play.

# Cross Validation

In general, splitting the data once into `training` and `testing` sets is not enough to get a proper view of the model performance.

We usually do that single split using scikit-learn's `train_test_split()` function.

Let's talk about this function first.

## train_test_split()

Let's say you have a simple target variable `y` and a set of features `X`.

| X | y |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |
| 8 | 3 |
| 9 | 3 |
| 10 | 3 |

When we apply `train_test_split()`, it shuffles the rows by default. That matters because, as you can see, the data is in order: if we took the first 70% as the `training` set without shuffling, the first 7 rows would be in the `training` set and the last 3 rows would be in the `testing` set. That would be a blunder, because the last 3 rows are all `class 3` and no other row has `class 3`.

So, no model could ever learn the patterns of `class 3`, because it does not exist in the `training` set. You might call this `high variance`, but I cannot even say that: if the model never sees `class 3`, what's the point of testing it on a `testing` set that contains nothing else?

So, to avoid this, the data has to be shuffled.
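To see the bad split for yourself, you can switch shuffling off with the `shuffle=False` argument of `train_test_split()`. This is a minimal sketch on the ten-row toy data from the table above; the variable names are just for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The ordered toy data from the table above.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3])

# shuffle=False keeps the original row order:
# the first 7 rows go to training, the last 3 to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)

train_classes = sorted(set(y_train.tolist()))
test_classes = sorted(set(y_test.tolist()))
print("classes in training set:", train_classes)  # [1, 2]
print("classes in testing set: ", test_classes)   # [3]
```

The model would be trained without a single `class 3` row and then tested on nothing but `class 3`.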

So, let's say that after shuffling the data we have:

| X | y |
| --- | --- |
| 8 | 3 |
| 9 | 3 |
| 10 | 3 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |

Now, if we take the first 70% of the data as the `training` set, the `training` set will contain all three classes and the `testing` set will contain only class `2`.

So, now the model will be able to learn the patterns of class `3` and should also perform well, right?

Not quite. Shuffling fixes the worst kind of bad split, but look at the shuffled dataset again: the `training` set does contain all three classes, yet class `2` has only 1 example in it, and the `testing` set contains `only` class `2`. So, no matter how hard we try to measure the performance of the model, we cannot get a proper view of it from this split.
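The class counts on each side of that hand-made split can be tallied in a few lines. This sketch uses only the standard library and the shuffled row order from the table above; the 7/3 cut is written out explicitly.

```python
from collections import Counter

# Rows in the shuffled order shown in the table above.
features = [8, 9, 10, 1, 2, 3, 4, 5, 6, 7]
targets = [3, 3, 3, 1, 1, 1, 2, 2, 2, 2]

n_train = 7  # 70% of the 10 rows

train_counts = Counter(targets[:n_train])
test_counts = Counter(targets[n_train:])

print("training class counts:", dict(train_counts))  # {3: 3, 1: 3, 2: 1}
print("testing class counts: ", dict(test_counts))   # {2: 3}
```

Class `2` is learned from a single example and is also the only class the model is ever graded on.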

Because `train_test_split()` shuffles before splitting, the training and testing sets are both random samples of the same data. While shuffling helps prevent catastrophic splits, a single split is still sensitive to randomness, which is why cross-validation is often preferred for reliable model evaluation.

> **Note**: `train_test_split()` has an argument called `stratify` that can be used to ensure that the `training` and `testing` sets have the same distribution of the target variable.

So, let's see the `train_test_split()` function in action.

I'll recreate the same dataset and split it into `training` and `testing` sets.

In [2]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature' : np.arange(1,11),
    'Target' : np.array([1,1,1,2,2,2,2,3,3,3])
})
df

|   | Feature | Target |
| --- | --- | --- |
| 0 | 1 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 4 | 2 |
| 4 | 5 | 2 |
| 5 | 6 | 2 |
| 6 | 7 | 2 |
| 7 | 8 | 3 |
| 8 | 9 | 3 |
| 9 | 10 | 3 |


Now, let's split the data into `training` and `testing` sets.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['Feature']],
    df['Target'],
    test_size=0.30,
    random_state=43
)

The data is split into `training` and `testing` sets. Now, let's see what they look like.

In [19]:
train_set = pd.concat([X_train, y_train], axis=1)
train_set

|   | Feature | Target |
| --- | --- | --- |
| 8 | 9 | 3 |
| 1 | 2 | 1 |
| 3 | 4 | 2 |
| 4 | 5 | 2 |
| 7 | 8 | 3 |
| 2 | 3 | 1 |
| 5 | 6 | 2 |


In [20]:
testing_set = pd.concat([X_test, y_test], axis=1)
testing_set

|   | Feature | Target |
| --- | --- | --- |
| 0 | 1 | 1 |
| 6 | 7 | 2 |
| 9 | 10 | 3 |


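In the output above, the `testing` set happens to contain one example of each class, but nothing in a plain shuffled split guarantees that. With a different `random_state`, a class such as `class 1` can be missing from the `testing` set entirely, and then there will be an unbalanced evaluation of the model.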
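How much the class make-up of the `testing` set depends on the seed is easy to check. Below is a minimal sketch on the same ten-row toy data; the five `random_state` values (0 to 4) are arbitrary choices for illustration, and no `stratify` argument is used.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3])

test_class_counts = {}
for seed in range(5):
    # Plain shuffled split: no stratification.
    _, _, _, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    # Tally how many rows of each class landed in the testing set.
    test_class_counts[seed] = dict(sorted(Counter(y_test.tolist()).items()))
    print(f"random_state={seed}: testing-set class counts = {test_class_counts[seed]}")
```

Any seed whose counts are missing a class, or are dominated by one class, would give a lopsided evaluation.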

And this is when you can use an argument called `stratify`, which ensures that the `testing` and `training` sets have the same distribution of the `target variable`.

> Pass the class labels themselves to `stratify` (an array-like such as `df['Target']`), not just the column name.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df[['Feature']],
    df['Target'],
    test_size=0.30,
    random_state=43,
    stratify=df['Target']
)

Now, let's see what it looks like.

In [21]:
train_set = pd.concat([X_train, y_train], axis=1)
train_set

|   | Feature | Target |
| --- | --- | --- |
| 8 | 9 | 3 |
| 1 | 2 | 1 |
| 3 | 4 | 2 |
| 4 | 5 | 2 |
| 7 | 8 | 3 |
| 2 | 3 | 1 |
| 5 | 6 | 2 |


In [22]:
testing_set = pd.concat([X_test, y_test], axis=1)
testing_set

|   | Feature | Target |
| --- | --- | --- |
| 0 | 1 | 1 |
| 6 | 7 | 2 |
| 9 | 10 | 3 |


And we see that the `testing` set has all three classes, and the `training` set has all three classes as well. With `stratify`, the class proportions are preserved in both sets as closely as the set sizes allow.
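Rather than eyeballing the tables, we can count the labels on each side of the stratified split. This sketch rebuilds the same toy DataFrame and uses `collections.Counter` for the tally, which is my choice here and not something the notebook itself uses.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Feature": np.arange(1, 11),
    "Target": np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3]),
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["Feature"]],
    df["Target"],
    test_size=0.30,
    random_state=43,
    stratify=df["Target"],  # the labels themselves, not a column name
)

# Count how many rows of each class ended up on each side.
train_counts = dict(sorted(Counter(y_train.tolist()).items()))
test_counts = dict(sorted(Counter(y_test.tolist()).items()))
print("training class counts:", train_counts)
print("testing class counts: ", test_counts)
```

With class sizes 3, 4 and 3 and a three-row testing set, each class should appear once in the testing set, which matches the tables above.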

So, this solves the problem, right?

YEAH! BUUUUUUUUT. (There's always a but.)

Remember the exam intuition? 

In this case,

**One stratified split** = **One fair exam**

Because the exam now covers all topics, has balanced questions, and is well designed. But it is still just `one exam`, right?

Even though everything looks good, there might be another, equally fair way to design the exam that gives you a different score, so a single split of a stratified, shuffled dataset is not enough.
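As a rough illustration of "several fair exams", scikit-learn's `StratifiedShuffleSplit` can draw more than one stratified split from the same data. This is only a sketch on the toy dataset: the choice of `StratifiedShuffleSplit`, the five splits and the seed are my assumptions, and it is not necessarily the cross-validation scheme this article goes on to use.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3])

# Five independent stratified 70/30 splits: five "fair exams".
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.30, random_state=0)

test_folds = []
for i, (train_idx, test_idx) in enumerate(splitter.split(X, y)):
    test_features = sorted(X[test_idx].ravel().tolist())
    test_classes = sorted(y[test_idx].tolist())
    test_folds.append((test_features, test_classes))
    print(f"exam {i + 1}: test features = {test_features}, test classes = {test_classes}")
```

Every exam covers all three classes, but the rows chosen to represent each class can change from one exam to the next, so a model's score can change too.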
