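## Objective: Implement and explain cross-validation strategies
- Why cross-validation?
    - A statistical method that evaluates model performance by repeatedly partitioning the data into training and validation subsets
    - Helps check that the model generalises well to unseen data
    - Helps detect overfitting (it does not prevent it by itself)
    - Supports model selection and hyperparameter tuning
    - Reduces the variance of the performance estimate compared with a single train/validation split
- Tips
    - K = 5 or 10 is commonly used for large datasets
    - Use LOOCV (leave-one-out cross-validation) for small datasets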

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier

In [2]:
url = "https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv"
df = pd.read_csv(url)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [4]:
df['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In [5]:
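# Separate the features from the target column
x, y = df.drop(columns='Class'), df['Class']
# Stratify so the rare positive class (492 of 284,807 rows) keeps the same proportion in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)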

In [6]:
rf = RandomForestClassifier(random_state=42)

## K-fold cross-validation
- Split the data into k folds of (approximately) equal size
- Train the model on k-1 folds and validate on the remaining fold
- Repeat k times so that each fold serves as the validation set exactly once
- A good default for general-purpose datasets
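The splitting procedure described above can be sketched on a toy array. This is only an illustration: `X_toy`, `kf_demo`, and `val_counts` are made-up names, and the ten-sample array stands in for real data so the fold membership is easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, purely illustrative (not the credit card dataset)
X_toy = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each sample lands in a validation fold
val_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_toy), start=1):
    val_counts[val_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")

# Each sample should be used for validation exactly once across the 5 folds
print("validation uses per sample:", val_counts.tolist())
```

With 10 samples and 5 folds, each iteration trains on 8 samples and validates on the remaining 2.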

In [None]:
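# 5 shuffled folds; random_state makes the split reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Macro-averaged metrics weight both classes equally, which matters for imbalanced data
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

# Fit and score the model once per fold for every metric in `scoring`
cv_results = cross_validate(rf, x_train, y_train, cv=kf, scoring=scoring)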

print('Kfold CV results: ')
for metric in scoring:
    scores = cv_results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")

## Stratified K-Fold Cross-validation
- Ensures each fold maintains (approximately) the same class distribution as the original dataset
- Useful whenever the classes are imbalanced
- Best for classification tasks with imbalanced data
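To make the difference from plain K-fold concrete, the sketch below counts the positive samples that end up in each validation fold. It uses synthetic labels (95 negatives, 5 positives) rather than the credit card data, and `y_toy`, `X_toy`, and `positives_per_fold` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 95 negatives, 5 positives (illustrative only)
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split


def positives_per_fold(splitter):
    """Return the number of positive labels in each validation fold."""
    return [int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)]


kf_pos = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
skf_pos = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per validation fold:          ", kf_pos)
print("StratifiedKFold positives per validation fold:", skf_pos)
```

The stratified splitter places one positive in every fold, whereas plain K-fold distributes them by chance and may leave a fold with no positives at all, which makes precision and recall for that fold undefined or misleading.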

In [None]:
# Same setup as before, but each fold preserves the class ratio of y_train
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(rf, x_train, y_train, cv=skf, scoring=scoring)

print('Stratified Kfold CV results: ')
for metric in scoring:
    scores = cv_results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")

## Leave-One-Out Cross-Validation (LOOCV)
- Uses a single data point as the validation set and the rest as the training set
- Repeats the process for every data point (n model fits for n samples)
- Maximises the training data available in each fold
- Computationally expensive
- Best for small datasets
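A minimal LOOCV sketch follows. Several assumptions differ from the cells above: it uses a small synthetic dataset from `make_classification` rather than the 284,807-row credit card data (where LOOCV would need one fit per row), and it swaps in `LogisticRegression` to keep the run fast. The names `X_small`, `y_small`, `model`, and `loo_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption: stands in for a genuinely small real dataset)
X_small, y_small = make_classification(
    n_samples=60, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

loo = LeaveOneOut()
model = LogisticRegression(max_iter=1000)

# One fit per sample: each score is 1.0 (held-out point classified correctly) or 0.0
loo_scores = cross_val_score(model, X_small, y_small, cv=loo, scoring="accuracy")

print(f"number of fits: {len(loo_scores)}")
print(f"LOOCV accuracy: {loo_scores.mean():.3f}")
```

Because every validation set holds a single sample, per-fold precision, recall, and F1 are not meaningful here; accuracy averaged over all folds is the usual summary.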