# Thank you to Kishal Mandal

The idea for this notebook comes from Kishal Mandal's [notebook](https://www.kaggle.com/kishalmandal/fold-is-power/notebook). Please have a look at it.

# Problem Statement

In this competition, we predict whether or not an email is spam.

It is a binary (2-class) classification problem with balanced classes. The training dataset has 600,000 observations, 101 input variables (`id` and `f0` to `f99`), and 1 output variable (`target`). There are no missing values.

<u>Goal of the Competition</u>: The goal of the Tabular Playground Series November 2021 is to predict whether or not an email is spam.

<u>Goal of this Notebook</u>: In this notebook, we prepare the training dataset using several resampling techniques: KFold, StratifiedKFold, and Repeated Random Splits.

We are going to cover the following steps:
1. Load Data
2. Resampling using KFold Cross Validation
3. Resampling using StratifiedKFold Cross Validation
4. Resampling using Repeated Random Splits
5. Resampling using Leave One Out Cross Validation
6. References

Let's get started.

# Load Data

In [None]:
!pip install datatable

In [None]:
# Load libraries
import datatable as dt
print(dt.__version__)
import time
from pathlib import Path
import numpy as np
import pandas as pd

# to print all outputs of a cell
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Read the training data with datatable and time how long the load takes
start = time.time()
data_dir = Path('../input/tabular-playground-series-nov-2021/')
dt_train = dt.fread(data_dir / "train.csv")
end = time.time()
print(f"Loaded train.csv in {end - start:.2f} seconds")

In [None]:
# peek at the data
dt_train.head(5)

# number of rows and columns in training dataset
dt_train.shape

In [None]:
# Convert datatable Frame to pandas DataFrame.
df_train = dt_train.to_pandas()

# Resampling using KFold Cross Validation

Cross validation estimates the performance of an algorithm with less variance than a single train-test split. It works by splitting the dataset into k parts (e.g. k = 5 or k = 10). Each part is called a fold. The algorithm is trained on k−1 folds and tested on the one fold that was held back. This is repeated so that each fold gets a turn as the held-back test set. After running cross validation we end up with k performance scores, which we can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data, because the algorithm is trained and evaluated multiple times on different data. The choice of k is a trade-off: each test partition must be large enough to be a reasonable sample of the problem, while k must be large enough to allow enough train-test repetitions for a fair estimate of the algorithm's performance on unseen data.
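
As a minimal sketch of the procedure described above, the snippet below runs 5-fold cross validation on a small synthetic dataset. The synthetic data, the `LogisticRegression` model, and k = 5 are assumptions chosen for illustration; they are not part of this notebook's pipeline. It checks that every row is held out exactly once and then summarizes the k scores by their mean and standard deviation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the competition data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=11)

kf = KFold(n_splits=5, shuffle=True, random_state=11)

# Count how many times each row is placed in a held-out fold.
held_out_counts = np.zeros(len(X), dtype=int)
for train_idx, valid_idx in kf.split(X):
    held_out_counts[valid_idx] += 1

# One accuracy score per fold, summarized by mean and standard deviation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(f"rows held out exactly once: {(held_out_counts == 1).all()}")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```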

In [None]:
from sklearn.model_selection import KFold

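def create_kfolds(data, num_splits):
    # Add a "kfold" column holding the validation fold index of each row.
    # Assumes `data` has a default RangeIndex, so split positions match .loc labels.
    data["kfold"] = -1
    kf = KFold(n_splits=num_splits, shuffle=True, random_state=11)
    # KFold ignores y; it is passed only to mirror the StratifiedKFold call.
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data['target'])):
        data.loc[v_, 'kfold'] = f
    return data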

In [None]:
df_train_10folds = create_kfolds(df_train.copy(), num_splits=10)
df_train_20folds = create_kfolds(df_train.copy(), num_splits=20)
df_train_30folds = create_kfolds(df_train.copy(), num_splits=30)
df_train_40folds = create_kfolds(df_train.copy(), num_splits=40)
df_train_50folds = create_kfolds(df_train.copy(), num_splits=50)

In [None]:
df_train_10folds['kfold'].value_counts()
df_train_20folds['kfold'].value_counts()
df_train_30folds['kfold'].value_counts()
df_train_40folds['kfold'].value_counts()
df_train_50folds['kfold'].value_counts()

In [None]:
df_train_10folds.to_csv('train_10folds.csv', index=False)
df_train_20folds.to_csv('train_20folds.csv', index=False)
df_train_30folds.to_csv('train_30folds.csv', index=False)
df_train_40folds.to_csv('train_40folds.csv', index=False)
df_train_50folds.to_csv('train_50folds.csv', index=False)
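
The saved `kfold` column can later drive a training loop: for each fold index, train on the rows with a different index and validate on the rows with that index. The sketch below illustrates this on a small synthetic frame. The data, the three feature columns, the `LogisticRegression` model, and the AUC metric are all assumptions made for the example, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Small synthetic frame standing in for train_10folds.csv (hypothetical data).
rng = np.random.default_rng(11)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f0", "f1", "f2"])
df["target"] = (df["f0"] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Assign folds the same way create_kfolds does.
df["kfold"] = -1
splitter = KFold(n_splits=5, shuffle=True, random_state=11)
for fold, (_, valid_idx) in enumerate(splitter.split(df)):
    df.loc[valid_idx, "kfold"] = fold

features = ["f0", "f1", "f2"]
scores = []
for fold in sorted(df["kfold"].unique()):
    train = df[df["kfold"] != fold]   # k-1 folds for training
    valid = df[df["kfold"] == fold]   # the held-back fold for validation
    model = LogisticRegression(max_iter=1000).fit(train[features], train["target"])
    preds = model.predict_proba(valid[features])[:, 1]
    scores.append(roc_auc_score(valid["target"], preds))

print(f"AUC per fold: {np.round(scores, 3)}")
print(f"mean AUC: {np.mean(scores):.3f}")
```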

# Resampling using StratifiedKFold Cross Validation

The split into folds can also be constrained so that each fold has the same proportion of observations with a given categorical value, such as the class outcome. StratifiedKFold does this for the target: every fold keeps approximately the class proportions of the full dataset.
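
To make the difference from plain k-fold concrete, here is a small sketch that compares the share of the positive class in each validation fold. The 90/10 imbalanced toy labels are a hypothetical example (the competition data is balanced, so the effect there is much smaller).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels with a 90/10 class imbalance (hypothetical data for illustration).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are drawn

def positive_rate_per_fold(splitter):
    """Return the share of class 1 in each validation fold."""
    return [y[valid_idx].mean() for _, valid_idx in splitter.split(X, y)]

plain = positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=11))
stratified = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
)

print("KFold:          ", np.round(plain, 2))
print("StratifiedKFold:", np.round(stratified, 2))
```

With stratification every fold receives the same share of positives, whereas plain KFold leaves the per-fold share to chance.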

In [None]:
from sklearn.model_selection import StratifiedKFold

# function copied from https://www.kaggle.com/kishalmandal/fold-is-power/notebook?scriptVersionId=78574712&cellId=5
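def create_stratified_kfolds(data, num_splits):
    # Add a "stratifiedkfold" column holding the validation fold index of each row.
    # Assumes `data` has a default RangeIndex, so split positions match .loc labels.
    data["stratifiedkfold"] = -1
    kf = StratifiedKFold(n_splits=num_splits, shuffle=True, random_state=11)
    # y is required here: folds are drawn to preserve the class proportions of target.
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data['target'])):
        data.loc[v_, 'stratifiedkfold'] = f
    return data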

In [None]:
df_train_stratified_10folds = create_stratified_kfolds(df_train.copy(), num_splits=10)
df_train_stratified_20folds = create_stratified_kfolds(df_train.copy(), num_splits=20)
df_train_stratified_30folds = create_stratified_kfolds(df_train.copy(), num_splits=30)
df_train_stratified_40folds = create_stratified_kfolds(df_train.copy(), num_splits=40)
df_train_stratified_50folds = create_stratified_kfolds(df_train.copy(), num_splits=50)

In [None]:
df_train_stratified_10folds['stratifiedkfold'].value_counts()
df_train_stratified_20folds['stratifiedkfold'].value_counts()
df_train_stratified_30folds['stratifiedkfold'].value_counts()
df_train_stratified_40folds['stratifiedkfold'].value_counts()
df_train_stratified_50folds['stratifiedkfold'].value_counts()

In [None]:
df_train_stratified_10folds.to_csv('train_stratified_10folds.csv', index=False)
df_train_stratified_20folds.to_csv('train_stratified_20folds.csv', index=False)
df_train_stratified_30folds.to_csv('train_stratified_30folds.csv', index=False)
df_train_stratified_40folds.to_csv('train_stratified_40folds.csv', index=False)
df_train_stratified_50folds.to_csv('train_stratified_50folds.csv', index=False)

# Resampling using Repeated Random Splits

Another variation on k-fold cross validation is to create a random split of the data, like a single train/test split, but to repeat the splitting and evaluation multiple times, like cross validation. This combines the speed of a train/test split with the reduced variance of the performance estimate that k-fold cross validation provides. We can also repeat the process as many times as needed to make the estimate more stable. A downside is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation.
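
The redundancy mentioned above can be seen directly by counting how often each row lands in a test split. The following is a minimal sketch on placeholder data; the row count, the number of repetitions, and `test_size=0.2` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_rows = 1000
X = np.zeros((n_rows, 1))  # placeholder features; only the row count matters here

# Five independent 80/20 random splits (sizes chosen for illustration).
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=11)

# Count how many times each row is drawn into a test split.
times_in_test = np.zeros(n_rows, dtype=int)
for train_idx, test_idx in ss.split(X):
    times_in_test[test_idx] += 1

print("rows never held out:     ", int((times_in_test == 0).sum()))
print("rows held out once:      ", int((times_in_test == 1).sum()))
print("rows held out repeatedly:", int((times_in_test > 1).sum()))
```

Unlike k-fold, the test splits are drawn independently, so they do not partition the data: some rows are tested several times and others never.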

In [None]:
from sklearn.model_selection import ShuffleSplit

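def create_folds_shuffle(data, num_splits):
    # Add a "shufflesplit" column recording the split in which each row was held out.
    # Caveat: ShuffleSplit draws each test set independently (10% of rows by default),
    # so test sets overlap. A row drawn several times keeps only the last split index,
    # and a row that is never drawn keeps the initial value -1.
    data["shufflesplit"] = -1
    kf = ShuffleSplit(n_splits=num_splits, random_state=11)
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data['target'])):
        data.loc[v_, 'shufflesplit'] = f
    return data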

In [None]:
df_train_shuffle_10folds = create_folds_shuffle(df_train.copy(), num_splits=10)
df_train_shuffle_20folds = create_folds_shuffle(df_train.copy(), num_splits=20)
df_train_shuffle_30folds = create_folds_shuffle(df_train.copy(), num_splits=30)
df_train_shuffle_40folds = create_folds_shuffle(df_train.copy(), num_splits=40)
df_train_shuffle_50folds = create_folds_shuffle(df_train.copy(), num_splits=50)

In [None]:
df_train_shuffle_10folds['shufflesplit'].value_counts()
df_train_shuffle_20folds['shufflesplit'].value_counts()
df_train_shuffle_30folds['shufflesplit'].value_counts()
df_train_shuffle_40folds['shufflesplit'].value_counts()
df_train_shuffle_50folds['shufflesplit'].value_counts()

In [None]:
df_train_shuffle_10folds.to_csv('train_shuffle_10folds.csv', index=False)
df_train_shuffle_20folds.to_csv('train_shuffle_20folds.csv', index=False)
df_train_shuffle_30folds.to_csv('train_shuffle_30folds.csv', index=False)
df_train_shuffle_40folds.to_csv('train_shuffle_40folds.csv', index=False)
df_train_shuffle_50folds.to_csv('train_shuffle_50folds.csv', index=False)

# Resampling using Leave One Out Cross Validation

In Leave One Out Cross Validation, we configure cross validation so that each fold contains a single observation (k is set to the number of observations in the dataset). The result is a large number of performance measures that can be summarized to estimate the accuracy of the model on unseen data. A downside is that it is computationally far more expensive than k-fold cross validation, because the model is trained once per observation.

We have not used Leave One Out Cross Validation here: with 600,000 observations it would require 600,000 model fits, which would be computationally very expensive.
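
For completeness, here is a minimal sketch of what Leave One Out looks like on a tiny toy array. The six-row array is a hypothetical example; this is not run on the competition data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(12).reshape(6, 2)  # six toy observations with two features each

loo = LeaveOneOut()
splits = list(loo.split(X))

# There is one split per observation, and each split holds out a single row.
print("number of splits:", loo.get_n_splits(X))
for train_idx, test_idx in splits[:3]:
    print("train:", train_idx, "test:", test_idx)
```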

# References

1. Thank you to Jason Brownlee for [Machine Learning Mastery](https://machinelearningmastery.com/).
2. Thank you to Kishal Mandal for his [notebook](https://www.kaggle.com/kishalmandal/fold-is-power/notebook).