# Cross-Validation

In this section, we'll be learning about different cross-validation techniques you can use for ML models. We'll go over these cross validation techniques by analyzing a fraud detection dataset. At large tech companies, fraud becomes an important problem that directly affects the company's bottom line. For example, Uber had a HUGE fraud problem especially when they expanded into international markets. 

In this notebook, we'll be covering:

- Train-Test-Split

- Leave-One-Out Cross Validation

- K-Fold Cross Validation

- Date Split

- Time Series Split

- Expanding Window

- Monte Carlo Cross Validation

Let's get started!

## Import Libraries

First, we'll import the standard python libraries we commonly use for data analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load Data

Next, we'll load our fraud dataset.

In [None]:
df = pd.read_csv("fraud_data.csv")

## Sample down to improve speed
pos = df[df['isFraud'] == 1].copy()
neg = df[df['isFraud'] == 0].sample(100000)

df = pos.append(neg)

## Train-Test-Split

First method we'll go over is train-test-split. Train-test-split is the simplest form of cross-validation. We simply randomly slice our dataset into a training set and testing set. Typically, the most important parameters are: 

`X`: The feature set you're looking to split. 

`y`: The target variable you're looking to split.

`test_size`: The size of your testing set. Typically, this is denoted as a fraction such as `0.33`. 

`random_state`: This is the seed of the random shuffle. I recommend setting a seed so everytime you rerun your notebook, your results stay consistent. 

`stratify`: This is an optional argument. But stratifying will reduce the variance in the random shuffle to ensure that your training and testing sets are more similar than not.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

features = [
    'amount',
    'oldbalanceOrg',
    'newbalanceOrig',
    'oldbalanceDest',
    'newbalanceDest'
]

X = df[features]
y = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier()

model.fit(X_train, y_train)
y_preds = model.predict(X_test)

print(average_precision_score(y_preds, y_test))

## K-Fold Cross Validation

The next common method of cross validation is K-Fold. To review, K-Fold is we’re essentially dividing our dataset into multiple datasets, then running train-test-split multiple times, across these subsets.

Import parameters we should keep in mind: 

`n_splits`: This is the number of splits we want to make within our dataset. 

`shuffle`: This tells us whether we should shuffle our data before splitting into folds. 

`random_state`: This is the random seed we're setting, similar to train-test-split.

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=2, shuffle = True, random_state = 42)
kf.get_n_splits(X)

folds = {}

for train, test in kf.split(X):
    # Fold
    fold_number = 1
    # Store fold number
    folds[fold_number] = (df.iloc[train], df.iloc[test])
    print('train: %s, test: %s' % (df.iloc[train], df.iloc[test]))
    fold_number += 1

Typically, after completing K-Fold Cross-Validation we'll want to calculate a cross-validation score. Typically, we'll get the scores for each fold, then take an average:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

scores = cross_val_score(model, X, y, scoring='accuracy', cv=kf, n_jobs=-1)

print(np.mean(scores))

## Leave One Out Cross Validation

Another type of cross-validation is Leave One Out Cross-Validation. The idea here is that we’re training our model on all the data, then leaving one data point out, and evaluating our model on a single data point. We do this for every datapoint in the entire dataset. For the sake of time, we're goint to limit our dataset to 100 data points here:

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score

loo = LeaveOneOut()
loo.get_n_splits(X)


all_preds = []

for train_index, test_index in loo.split(X[:100]):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    correct = y_preds[0] == y_test.values[0]
    
    all_preds.append(correct)

In [88]:
sum(all_preds)/len(all_preds)

1.0

## Train-Test-Split Date Split

In many instances, you don't want to randomly slice your data into training and testing sets, but instead, you want to split it by time. In this case, you'll want to split by date: 

In [89]:
DATE = '2021-12-31'

train_df = df[df['date'] < DATE].copy()
test_df = df[df['date'] >= DATE].copy()

X_train = train_df[features]
X_test = test_df[features]

y_train = train_df['isFraud']
y_test = test_df['isFraud']


model = RandomForestClassifier()

model.fit(X_train, y_train)
y_preds = model.predict(X_test)

print(average_precision_score(y_preds, y_test))

0.923704488908898


## Sliding Window/Time Series KFold

The problem with splitting by date is that the resulting training and testing sets can vary, depending on the date you select. The solution to this is to use a Time Series K-Fold Split. In time series Kfold, we're combining the elements of KFold and train-test date split. In Time Series Kfold, we'll be splitting our dataset multiple times, using differing dates. The size of the training sets will be the same for each fold: 

In [101]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit()

all_scores = []

for train_index, test_index in tscv.split(X):
#     print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    pr_auc = average_precision_score(y_preds, y_test)
    
    all_scores.append(pr_auc)
    
    
print(all_scores)

[0.012475741613529248, 0.0068755198225672306, 0.005212087607429998, 0.004214028278347657, 0.0032159689492653174]


## Expanding Window

One of the problems with standard time series split, is the size of the training sets stay the same. The solution to this is to use an expanding window split. The idea here, is with each incremental date split, we're using **all** the data rather than sliding the window. You can adapt the sci-kit learn version of TimeSeriesSplit. We also wrote a simple implementation as well: 

In [None]:
class ExpandingWindowCV:
    def fit(self, date_col, date_range = None, custom_range = None):
        self.date_col = date_col
        self.date_range = date_range
        self.custom_range = custom_range
        
        if date_range is not None and custom_range is not None:
            raise ValueError("Date Range and Custom Range both cannot be None.")
    
    def split(self, df):
        if self.date_range is None:         
            dates = list(set(df[self.date_col].astype(str).values))
        
        if self.date_range is not None:
            dates = pd.date_range(start=self.date_range[0], end=self.date_range[1])
            dates = [str(d.date()) for d in dates]
        
        if self.custom_range is not None:
            dates = self.custom_range
            
        for d in dates:
            df_train = df[df[self.date_col].astype(str) <= d].copy()
            df_test = df[df[self.date_col].astype(str) > d].copy()
            yield df_train, df_test
            
ew = ExpandingWindowCV()
ew.fit(date_col = 'date', date_range = ['2022-01-02','2022-01-08'])
ew.split(df)

In [93]:
all_scores = []

for train_df, test_df in ew.split(df):
    X_train = train_df[features]
    X_test = test_df[features]

    y_train = train_df['isFraud']
    y_test = test_df['isFraud']


    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    pr_auc = average_precision_score(y_preds, y_test)
    
    all_scores.append(pr_auc)
    
all_scores

[0.9220560208631237,
 0.9226582019111925,
 0.9243608853720093,
 0.9229844807192916,
 0.9278032540111424,
 0.921955918184176,
 0.9253343888992496]

## Monte Carlo Cross Validation

The last method we'll go over is Monte Carlo Cross Validation. Monte Carlos Cross Validation is where we randomly select a sub-sample (with replacement) from our dataset for the training set, use the rest for the testing set. Repeat this (with replacement) N number of times, to create a distribution of evaluation scores:

In [94]:
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
rs.get_n_splits(df)

all_scores = []
for train_index, test_index in rs.split(df):
#     print("TRAIN:", train_index, "TEST:", test_index)

    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model = RandomForestClassifier()

    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    pr_auc = average_precision_score(y_preds, y_test)
    
    all_scores.append(pr_auc)

In [95]:
all_scores

[0.9340994244507013,
 0.9356447139288409,
 0.9230229692911995,
 0.9411872091474103,
 0.9278329879205108]

## Summary

In this section, you learned about a variety of cross validation techniques: 

- Train-Test-Split

- Leave-One-Out Cross Validation

- K-Fold Cross Validation

- Date Split

- Time Series Split

- Expanding Window

- Monte Carlo Cross Validation