# **Cross Validation**

Cross-validation is a technique for assessing how well the results of a statistical analysis will generalize to an independent data set.

It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice.
It is also used to compare the performance of different predictive modeling procedures and to select the best one.

In machine learning, cross-validation is a resampling procedure for evaluating models on a limited data sample.
The procedure has a single parameter, k, which is the number of groups the data sample is split into.

The procedure involves:
 - Partitioning the data into k subsets, called folds.
 - Training the model on k-1 folds and testing it on the remaining fold.
 - Repeating this k times, so that each fold serves as the test set exactly once.
 - Averaging the k scores to produce a single estimate.
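
The steps above can be sketched by hand with NumPy alone. This is a minimal illustration, not how scikit-learn implements it: `k_fold_indices`, `cross_validate_manual`, and the toy `majority_class` "model" are hypothetical helpers introduced only for this example.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive, unshuffled folds."""
    indices = np.arange(n_samples)
    # np.array_split also copes with n_samples not being divisible by k
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_validate_manual(fit_predict, X, y, k=5):
    """Return per-fold accuracies and their mean for a fit_predict callable."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        y_hat = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(y_hat == y[test_idx]))
    return np.array(scores), float(np.mean(scores))

def majority_class(X_train, y_train, X_test):
    """Toy stand-in for a model: always predict the training fold's majority class."""
    majority = np.bincount(y_train).argmax()
    return np.full(len(X_test), majority)

X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.array([0, 1] * 10)
fold_scores, mean_score = cross_validate_manual(majority_class, X_demo, y_demo, k=5)
print(fold_scores, mean_score)
```

Each row index lands in exactly one test fold, which is what lets the averaged score use every observation for testing once.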

Cross-validation gives a more reliable estimate of model performance than a single train-test split, because the score no longer depends on one particular partition of the data.

It does not prevent overfitting by itself, but it helps to detect it: a model that scores well on its training folds and poorly on the held-out folds is not generalizing to unseen data.

Cross-validation is particularly useful when the dataset is small, as it makes efficient use of the available data: every observation is used for training in k-1 iterations and for testing in exactly one.

It is widely used in machine learning and statistics, both to evaluate a single model and to compare candidate models, before a model is applied to real-world data.

In [558]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [559]:
np.random.seed(42)
fastball_speed = np.random.randint(88, 108, size=1000)
tommy_john = np.where(fastball_speed > 95, np.random.choice([0, 1], size=1000, p=[0.3, 0.7]), 0)

d = {
    'fastball_speed': fastball_speed,
    'tommy_john': tommy_john,
}

df = pd.DataFrame(d)
df

| | fastball_speed | tommy_john |
|---|---|---|
| 0 | 94 | 0 |
| 1 | 107 | 1 |
| 2 | 102 | 1 |
| 3 | 98 | 1 |
| 4 | 95 | 0 |
| ... | ... | ... |
| 995 | 90 | 0 |
| 996 | 107 | 0 |
| 997 | 98 | 0 |
| 998 | 93 | 0 |


In [560]:
X = df[['fastball_speed']]
y = df['tommy_john']

In [561]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [562]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(800, 1) (800,) (200, 1) (200,)


In [563]:
# y_train is already a 1-D pandas Series, so it could be passed to fit() directly.
# .values.ravel() flattens the target into a 1-D NumPy array; it only matters when the
# target is a single-column DataFrame of shape (n, 1), which scikit-learn would
# otherwise convert with a warning.
logr = LogisticRegression()
logr.fit(X_train, y_train.values.ravel())

In [564]:
y_pred = logr.predict(X_test)
display(y_pred)

array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0])

In [565]:
logr.score(X_test, y_test)

0.795

In [566]:
cv_scores = cross_val_score(logr, X, y, cv=5)  # run cross-validation once and reuse the result
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

Cross-validation scores: [0.775 0.77  0.755 0.8   0.755]
Mean cross-validation score: 0.7709999999999999
Accuracy Score: 0.795


In [567]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X, y, test_size=0.3, random_state=50)

In [568]:
logr.fit(X2_train, y2_train.values.ravel())

In [569]:
logr.score(X2_test, y2_test)

0.78
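
The two hold-out scores above (0.795 and 0.78) differ only because the split changed. The sketch below repeats that experiment over several seeds to show how much a single-split score can move. It uses synthetic stand-in data generated in the same spirit as the notebook's `df` (an assumption, not the same rows), and `max_iter=1000` is an illustrative setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): speeds 88-107, injuries only above 95 mph
rng = np.random.default_rng(0)
speed = rng.integers(88, 108, size=1000)
injured = np.where(speed > 95, rng.choice([0, 1], size=1000, p=[0.3, 0.7]), 0)
X_demo = speed.reshape(-1, 1)

holdout_scores = []
for seed in range(10):
    # Same data and model each time; only the random partition changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, injured, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

print("min:", min(holdout_scores), "max:", max(holdout_scores))
```

The gap between the minimum and maximum is the noise that averaging over folds is meant to reduce.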

In [570]:
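# Perform cross-validation
# cv is the number of folds; cv=5 splits the dataset into 5 parts
# For a classifier, an integer cv makes cross_val_score use StratifiedKFold without shuffling
# cross_val_score fits clones of logr, so the earlier fit on X2_train does not affect these scores
cvs = cross_val_score(logr, X, y.values.ravel(), cv=5)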

In [571]:
cvs

array([0.775, 0.77 , 0.755, 0.8  , 0.755])

In [572]:
np.average(cvs)

0.7709999999999999
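
Alongside the mean, the spread of the fold scores indicates how stable the estimate is. A small sketch using the five scores printed above (copied by hand into `cvs_demo`, a name introduced here):

```python
import numpy as np

# Fold scores copied from the cross_val_score output above
cvs_demo = np.array([0.775, 0.77, 0.755, 0.8, 0.755])

mean_score = cvs_demo.mean()
std_score = cvs_demo.std(ddof=1)  # sample standard deviation across the 5 folds

print(f"{mean_score:.3f} +/- {std_score:.3f}")  # 0.771 +/- 0.019
```

Reporting both numbers makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.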

**K-Fold Cross Validation**

K-fold cross-validation evaluates a model by splitting the dataset into K subsets (folds).

The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, with each fold being used as the test set once.

The results are then averaged to provide a more reliable estimate of the model's performance.

In scikit-learn, `KFold` splits the dataset into K consecutive folds and does not shuffle the rows unless `shuffle=True` is passed.
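
To make "consecutive folds" concrete, the following sketch prints the test indices `KFold` produces on ten dummy rows, with and without shuffling. `X_demo` and the seed `0` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)

# Default KFold: no shuffling, so the test folds are consecutive blocks of rows
plain_folds = [test.tolist() for _, test in KFold(n_splits=5).split(X_demo)]
print(plain_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# With shuffling, the same rows are covered, but assigned to folds in a seeded random order
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_folds = [test.tolist() for _, test in shuffled.split(X_demo)]
print(shuffled_folds)
```

Shuffling matters when the rows are stored in a meaningful order (for example sorted by the target), since consecutive blocks would then be unrepresentative.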


In [573]:
from sklearn.model_selection import KFold

# KFold cross-validation
# Create a KFold object with 5 splits that shuffles the data before splitting
# n_splits=5 means the dataset will be split into 5 parts
# shuffle=True means the data will be shuffled before splitting
# random_state=None leaves the shuffle unseeded, so the folds differ between calls and runs;
# pass an integer (e.g. random_state=42) for reproducible splits
kf = KFold(n_splits=5, shuffle=True, random_state=None)

# Note: with an unseeded kf, the accuracy and F1 runs below are scored on different splits
kf_score = cross_val_score(logr, X, y.values.ravel(), cv=kf, scoring='accuracy')
kf_score_2 = cross_val_score(logr, X, y.values.ravel(), cv=kf, scoring='f1')

In [574]:
kf_score, kf_score_2

(array([0.775, 0.775, 0.765, 0.75 , 0.79 ]),
 array([0.69512195, 0.72727273, 0.62745098, 0.77464789, 0.72131148]))

In [575]:
np.average(kf_score), np.average(kf_score_2)

(0.771, 0.7091610043236352)
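
When several metrics are wanted, one option is `cross_validate`, which scores all of them on the same folds in a single pass. The sketch below assumes synthetic stand-in data and a seeded splitter (`random_state=42` is an arbitrary choice); it is not the notebook's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption), shaped like the notebook's example
rng = np.random.default_rng(0)
speed = rng.integers(88, 108, size=1000)
injured = np.where(speed > 95, rng.choice([0, 1], size=1000, p=[0.3, 0.7]), 0)
X_demo = speed.reshape(-1, 1)

# A seeded splitter yields the same folds every time it is used
kf_fixed = KFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    injured,
    cv=kf_fixed,
    scoring=["accuracy", "f1"],
)
acc = results["test_accuracy"]
f1 = results["test_f1"]
print(acc.mean(), f1.mean())
```

Because both arrays come from identical folds, accuracy and F1 can be compared fold by fold.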

**Stratified K-Fold**

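StratifiedKFold is a variation of KFold that preserves the class proportions of the target in every fold. Like KFold, it does not shuffle the data by default.

It is particularly useful for classification tasks where the class distribution is imbalanced, since a plain split could leave some folds with very few examples of the minority class.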
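
A small sketch of the effect on an imbalanced toy target (8 negatives followed by 2 positives, chosen only for illustration): it counts how many positives end up in each test fold under `KFold` and under `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced target: rows are ordered, with the two positives at the end
y_demo = np.array([0] * 8 + [1] * 2)
X_demo = np.zeros((10, 1))  # features are irrelevant for the split itself

plain_pos = [
    int(y_demo[test].sum()) for _, test in KFold(n_splits=2).split(X_demo, y_demo)
]
strat_pos = [
    int(y_demo[test].sum())
    for _, test in StratifiedKFold(n_splits=2).split(X_demo, y_demo)
]

print(plain_pos)  # [0, 2]
print(strat_pos)  # [1, 1]
```

With the plain split, one test fold contains no positives at all, so metrics such as F1 are not meaningful for it; the stratified split keeps the 80/20 ratio in both folds.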

In [576]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=8, shuffle=False, random_state=None)

skf_score = cross_val_score(logr, X, y.values.ravel(), cv=skf, scoring='accuracy')
skf_score_2 = cross_val_score(logr, X, y.values.ravel(), cv=skf, scoring='f1')


In [577]:
skf_score, skf_score_2

(array([0.76 , 0.784, 0.784, 0.768, 0.736, 0.784, 0.816, 0.736]),
 array([0.67391304, 0.71578947, 0.73786408, 0.70707071, 0.67326733,
        0.73267327, 0.76767677, 0.65263158]))

In [578]:
np.average(skf_score), np.average(skf_score_2)

(0.771, 0.7076107803233278)

In [579]:
# Chain scaling and the classifier so they are fitted and applied together as one estimator
scaler = StandardScaler()

pipe1 = make_pipeline(scaler, logr)
pipe1.fit(X_train, y_train.values.ravel())

In [580]:
pipe1_score = pipe1.score(X_test, y_test)
pipe1_score

0.795

In [581]:
pipe1_cross_val = cross_val_score(pipe1, X, y.values.ravel(), cv=20)
pipe1_cross_val

array([0.72, 0.82, 0.78, 0.78, 0.76, 0.86, 0.74, 0.72, 0.82, 0.74, 0.74,
       0.72, 0.82, 0.8 , 0.72, 0.86, 0.8 , 0.68, 0.8 , 0.74])

In [582]:
np.average(pipe1_cross_val)

0.7710000000000001

In [583]:
# An integer cv on a classifier uses StratifiedKFold, so cv=8 reuses the same folds as skf above
pipe1_cross_val_2 = cross_val_score(pipe1, X, y.values.ravel(), cv=8, scoring='accuracy')
pipe1_cross_val_2

array([0.76 , 0.784, 0.784, 0.768, 0.736, 0.784, 0.816, 0.736])

In [584]:
np.average(pipe1_cross_val_2)

0.771
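
Passing the whole pipeline to `cross_val_score` means the scaler is re-fitted on the training folds of each iteration, so no statistics from the test fold leak into preprocessing. The sketch below spells out that per-fold loop by hand and compares it with `cross_val_score`. The data is a synthetic stand-in, and names such as `skf_demo` and `fold_scaler` are introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), shaped like the notebook's example
rng = np.random.default_rng(0)
speed = rng.integers(88, 108, size=400)
injured = np.where(speed > 95, rng.choice([0, 1], size=400, p=[0.3, 0.7]), 0)
X_demo = speed.reshape(-1, 1).astype(float)

skf_demo = StratifiedKFold(n_splits=5)

# Automatic: the pipeline is cloned and fitted from scratch on each set of training folds
auto_scores = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()), X_demo, injured, cv=skf_demo
)

# Manual equivalent: fit the scaler on training rows only, then reuse it on the test rows
manual_scores = []
for train_idx, test_idx in skf_demo.split(X_demo, injured):
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    fold_model = LogisticRegression().fit(
        fold_scaler.transform(X_demo[train_idx]), injured[train_idx]
    )
    manual_scores.append(
        fold_model.score(fold_scaler.transform(X_demo[test_idx]), injured[test_idx])
    )

print(np.allclose(auto_scores, manual_scores))
```

Scaling the full dataset before splitting would instead compute the mean and variance from rows that later serve as test data, which is the leakage the pipeline avoids.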