# Cross-Validation

In this section, we'll be learning about different cross-validation techniques you can use for ML models. We'll go over these cross validation techniques by analyzing a fraud detection dataset. At large tech companies, fraud becomes an important problem that directly affects the company's bottom line. For example, Uber had a HUGE fraud problem especially when they expanded into international markets. 

In this notebook, we'll be covering:

- Train-Test-Split

- Leave-One-Out Cross Validation

- K-Fold Cross Validation

- Date Split

- Time Series Split

- Expanding Window

- Monte Carlo Cross Validation

Let's get started!

## Import Libraries

First, we'll import the standard python libraries we commonly use for data analysis.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
# Load data
df = pd.read_csv("../data/Fraud_data.csv")
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,date
0,2,2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,2021-11-27
1,3,3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,2021-01-01
2,251,251,1,TRANSFER,2806.0,C1420196421,2806.0,0.0,C972765878,0.0,0.0,1,0,2021-03-28
3,252,252,1,CASH_OUT,2806.0,C2101527076,2806.0,0.0,C1007251739,26202.0,0.0,1,0,2022-05-20
4,680,680,1,TRANSFER,20128.0,C137533655,20128.0,0.0,C1848415041,0.0,0.0,1,0,2021-09-28


In [13]:
# # Sample down to improve speed
# pos = df[df['isFraud'] == 1].copy()
# neg = df[df['isFraud'] == 0].sample(100000)

# df = pos.append(neg)

## Train-Test-Split

Train-test-split is the simplest form of cross-validation. We simply randomly slice our dataset into a training set and testing set. Typically, the most important parameters are: 

`X`: The feature set you're looking to split. 

`y`: The target variable you're looking to split.

`test_size`: The size of your testing set. Typically, this is denoted as a fraction such as `0.33`. 

`random_state`: This is the seed of the random shuffle. I recommend setting a seed so everytime you rerun your notebook, your results stay consistent. 

`stratify`: This is an optional argument. But stratifying will reduce the variance in the random shuffle to ensure that your training and testing sets are more similar than not.

In [16]:
df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'step', 'type', 'amount', 'nameOrig',
       'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'date'],
      dtype='object')

In [30]:
# let's import libraries form sklearn

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import average_precision_score
from sklearn.ensemble import RandomForestClassifier


In [18]:
# We'll only work on these columns 
featuers = [
    'amount', 
       'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest'
]


# split dataset
X=df[featuers]
y = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)



In [22]:
print(f"X_train: ", X_train.shape)
print(f"X_test: ", X_test.shape)
print(f"y_train: ", y_train.shape)
print(f"y_test: ", y_test.shape)


X_train:  (72502, 5)
X_test:  (35711, 5)
y_train:  (72502,)
y_test:  (35711,)


In [19]:
# Build a model
model = RandomForestClassifier()

In [20]:
model.fit(X_train, y_train)

In [26]:
y_pred = model.predict(X_test)
y_pred[-5:]

array([0, 0, 0, 0, 1])

In [25]:
print(average_precision_score(y_pred, y_test))

0.9373928609092386


## K-Fold Cross Validation

K-Fold is we’re essentially dividing our dataset into multiple datasets, then running train-test-split multiple times, across these subsets.

Import parameters we should keep in mind: 

`n_splits`: This is the number of splits we want to make within our dataset. 

`shuffle`: This tells us whether we should shuffle our data before splitting into folds. 

`random_state`: This is the random seed we're setting, similar to train-test-split.


The general procedure is as follows:

    1. Shuffle the dataset randomly.
    2. Split the dataset into k groups
    3. For each unique group:
        3.1 Take the group as a hold out or test data set
        3.2 Take the remaining groups as a training data set
        3.3 Fit a model on the training set and evaluate it on the test set
        3.4 Retain the evaluation score and discard the model
    4. Summarize the skill of the model using the sample of model evaluation scores


In [28]:
# let's implement kfold

kf = KFold(n_splits=2, shuffle=True, random_state=42)
kf.get_n_splits(X)


2

In [29]:
folds = {}
for train, test, in kf.split(X):
    # Fold
    fold_number = 1

    # Store fold number
    folds[fold_number] = (df.iloc[train], df.iloc[test])
    print(f"Train: {df.iloc[train]}, Test: {df.iloc[test]}")

    fold_number += 1

Train:         Unnamed: 0.1  Unnamed: 0  step      type     amount     nameOrig  \
0                  2           2     1  TRANSFER     181.00  C1305486145   
2                251         251     1  TRANSFER    2806.00  C1420196421   
5                681         681     1  CASH_OUT   20128.00  C1118430673   
9               1115        1115     1  TRANSFER   35063.63  C1364127192   
10              1116        1116     1  CASH_OUT   35063.63  C1635772897   
...              ...         ...   ...       ...        ...          ...   
108206       2657203     2657203   210  CASH_OUT  287206.37   C758909241   
108207       4179135     4179135   304   PAYMENT   32388.33  C1059105200   
108209       3776228     3776228   280  CASH_OUT  484590.48  C1291622491   
108210        584905      584905    33  CASH_OUT  521837.86   C966210331   
108211       4760428     4760428   334  CASH_OUT   36100.43   C792702729   

        oldbalanceOrg  newbalanceOrig     nameDest  oldbalanceDest  \
0         

Typically, after completing K-Fold Cross-Validation we'll want to calculate a cross-validation score. Typically, we'll get the scores for each fold, then take an average:

In [31]:
model = RandomForestClassifier()

scores = cross_val_score(model, X, y, scoring='accuracy', cv=kf, n_jobs=-1)

scores



array([0.99362375, 0.99334639])

In [32]:
print(np.mean(scores))

0.9934850698282488


## Leave One Out Cross Validation

Another type of cross-validation is Leave One Out Cross-Validation. The idea here is that we’re training our model on all the data, then leaving one data point out, and evaluating our model on a single data point. We do this for every datapoint in the entire dataset. For the sake of time, we're goint to limit our dataset to 100 data points here:

In [33]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score

loo = LeaveOneOut()
loo.get_n_splits(X)


all_preds = []

for train_index, test_index in loo.split(X[:100]):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    correct = y_preds[0] == y_test.values[0]
    
    all_preds.append(correct)

TRAIN: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
 97 98 99] TEST: [0]
TRAIN: [ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
 97 98 99] TEST: [1]
TRAIN: [ 0  1  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
 97 98 99] TEST: [2]
TRAIN: [ 0  1  2  4  5  6  7  8  9 10 11

In [34]:
sum(all_preds)/len(all_preds)

np.float64(1.0)