# DATA-611 Final Project - Eland - Train Test Split

This notebook handles upsampling our dataset to avoid class imbalance and then splitting it into train and test files.

It starts with `cleaned.csv` and outputs `train.csv` and `test.csv`.

This notebook was originally developed in Azure Machine Learning Studio, using the Python 3.8 - AzureML kernel on a STANDARD_E4DS_V4 compute instance.

## Dependencies

In [38]:
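# imblearn and sklearn are import names; the PyPI packages are imbalanced-learn and scikit-learn
%pip install pandas
%pip install imbalanced-learn
%pip install scikit-learn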

Note: you may need to restart the kernel to use updated packages.


In [39]:
# Set up Plotly express for visualization
%pip install plotly

import plotly.express as px

px.defaults.template = 'plotly_white'
px.defaults.color_continuous_scale = px.colors.sequential.Plasma
px.defaults.color_discrete_sequence = px.colors.qualitative.Vivid

Note: you may need to restart the kernel to use updated packages.


## Load Data

In [40]:
import pandas as pd

df = pd.read_csv('cleaned.csv', index_col=0)
df.head()

ID,Credit Amount,Is Male,Age in Years,Repay Delay Sep,Repay Delay Aug,Repay Delay Jul,Prior Pay Sep,Prior Pay Aug,Prior Pay Jul,Defaulted,Graduate School,Is Married,Prior Pay Total,Repay Delay Total
1,20000,False,24,2,2,0,0,689,0,True,0,True,689,4
2,120000,False,26,0,2,0,0,1000,1000,True,0,False,5000,4
3,90000,False,34,0,0,0,1518,1500,1000,False,0,False,11018,0
4,50000,False,37,0,0,0,2000,2019,1200,False,0,True,8388,0
5,50000,True,57,0,0,0,2000,36681,10000,False,0,True,59049,0


## Class Imbalance Detection
Let's check for class imbalance in our target label, `Defaulted`.

In [41]:
df['Defaulted'].value_counts(normalize=True)

False    0.823066
True     0.176934
Name: Defaulted, dtype: float64

In [42]:
# Plot the class distribution
fig = px.histogram(df, x='Defaulted', color='Defaulted', title='Class Balance')
fig.update_layout(xaxis_title='Defaulted', legend_title='Defaulted')
fig.show()

Congratulations, that's a class imbalance: the minority class makes up 17.7% of the data, which is below the 20% I've learned as the recommended minimum. Let's upsample the minority class with SMOTE.
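
The 20% rule of thumb mentioned above can be wrapped in a small reusable check. This is only a sketch: the helper names `minority_share` and `is_imbalanced` and the `threshold` default are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def minority_share(labels: pd.Series) -> float:
    """Return the fraction of rows that belong to the least common class."""
    return float(labels.value_counts(normalize=True).min())


def is_imbalanced(labels: pd.Series, threshold: float = 0.20) -> bool:
    """Flag a label column whose minority class falls below the threshold."""
    return minority_share(labels) < threshold


# Toy labels (not the project data): 1 positive row out of 10
toy_labels = pd.Series([True] + [False] * 9)
print(minority_share(toy_labels))
print(is_imbalanced(toy_labels))
```

On the real data this would be called as `is_imbalanced(df['Defaulted'])`.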

## Synthetic Minority Over-sampling Technique (SMOTE)

In [43]:
# Use SMOTE to upsample our minority class (defaulted)
from imblearn.over_sampling import SMOTE

# Create our X and y
X = df.drop(columns=['Defaulted'])
y = df['Defaulted']

# Create our SMOTE object; fixing the seed keeps the synthetic samples reproducible
smote = SMOTE(random_state=123)

# Resample our data
X_smote, y_smote = smote.fit_resample(X, y)

# Plot the new class distribution
fig = px.histogram(y_smote, title='Class Balance after SMOTE', color=y_smote)
fig.update_layout(xaxis_title='Defaulted', legend_title='Defaulted')
fig.show()
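
To make the resampling step less of a black box, here is a simplified sketch of the idea behind SMOTE: pick a minority row, pick one of its nearest minority neighbors, and place a new point somewhere on the line between them. This is an illustration under assumptions, not imblearn's implementation; the function name `synthesize_minority` and its parameters are hypothetical.

```python
import numpy as np


def synthesize_minority(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 123) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)  # cannot use more neighbors than there are other rows
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        # Pick a random minority sample as the starting point
        base = X_min[rng.integers(len(X_min))]
        # Euclidean distance from the base sample to every minority sample
        dists = np.linalg.norm(X_min - base, axis=1)
        # Take the k closest rows, skipping the closest entry
        # (the sample itself when there are no duplicate rows)
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = X_min[rng.choice(neighbors)]
        # Place the new point at a random position on the segment base -> neighbor
        gap = rng.random()
        synthetic[i] = base + gap * (neighbor - base)
    return synthetic


# Toy minority class (not the project data): the four corners of the unit square
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = synthesize_minority(corners, n_new=10, k=2)
print(new_rows.shape)
```

Because every synthetic row is an interpolation between two existing rows, the new points always fall between real minority samples rather than outside their range.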

In [44]:
# Merge these back into a single dataframe
df = pd.concat([X_smote, y_smote], axis=1)
df.head()

,Credit Amount,Is Male,Age in Years,Repay Delay Sep,Repay Delay Aug,Repay Delay Jul,Prior Pay Sep,Prior Pay Aug,Prior Pay Jul,Graduate School,Is Married,Prior Pay Total,Repay Delay Total,Defaulted
0,20000,False,24,2,2,0,0,689,0,0,True,689,4,True
1,120000,False,26,0,2,0,0,1000,1000,0,False,5000,4,True
2,90000,False,34,0,0,0,1518,1500,1000,0,False,11018,0,False
3,50000,False,37,0,0,0,2000,2019,1200,0,True,8388,0,False
4,50000,True,57,0,0,0,2000,36681,10000,0,True,59049,0,False


## Train / Test Split

Let's do a stratified split of our data into train and test sets, using the prescribed 20% test size and a random state of 123.

In [45]:
# To stratify on multiple columns, let's create a new column that combines them.
# These are the features we want evenly distributed across the train / test split.
df['Stratify'] = df['Defaulted'].astype(str) + '_' + df['Is Male'].astype(str) + '_' + df['Graduate School'].astype(str)

df['Stratify'].unique()

array(['True_False_0', 'False_False_0', 'False_True_0', 'False_True_1',
       'False_False_1', 'True_False_1', 'True_True_1', 'True_True_0'],
      dtype=object)
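
One thing worth checking before splitting on a combined key: `train_test_split` raises a `ValueError` when any stratum has fewer than two rows. The sketch below builds the same kind of combined key and reports the size of the rarest stratum. The helper name `smallest_stratum` is an assumption for illustration, and the toy frame is not the project data.

```python
import pandas as pd


def smallest_stratum(df: pd.DataFrame, columns: list) -> int:
    """Build a combined stratification key and return the size of its rarest value."""
    # Join the string form of each column with underscores, row by row
    key = df[columns].astype(str).agg('_'.join, axis=1)
    return int(key.value_counts().min())


# Toy frame: the False_True combination appears only once
toy = pd.DataFrame({
    'Defaulted': [True, True, False, False, False],
    'Is Male': [True, True, False, False, True],
})
rarest = smallest_stratum(toy, ['Defaulted', 'Is Male'])
print(rarest)
```

A result below 2 means that stratum would need to be merged with another or dropped from the key before splitting.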

In [46]:
from sklearn.model_selection import train_test_split

# Drop the label, the stratification column, and our potential fairness concern areas
X = df.drop(columns=['Defaulted', 'Stratify', 'Is Male', 'Age in Years'])
y = df['Defaulted']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df['Stratify'], test_size=0.2, random_state=123)
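
A quick sanity check after a stratified split is to compare the class shares of the two label sets; they should be nearly identical. This is a minimal sketch, and the helper name `class_share_gap` is an assumption rather than anything from the original notebook.

```python
import pandas as pd


def class_share_gap(train_labels: pd.Series, test_labels: pd.Series) -> float:
    """Return the largest absolute difference in class share between two label sets."""
    train_share = train_labels.value_counts(normalize=True)
    test_share = test_labels.value_counts(normalize=True)
    # Align on class label; a class missing from one side counts as a share of 0
    gap = train_share.subtract(test_share, fill_value=0).abs()
    return float(gap.max())


# Toy labels (not the project data): both sets are 50% positive
toy_train = pd.Series([True] * 4 + [False] * 4)
toy_test = pd.Series([True, False])
print(class_share_gap(toy_train, toy_test))
```

With the variables from the cell above, the call would be `class_share_gap(y_train, y_test)`.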

## Data Checkpointing

Before we move on to training, let's save our train and test datasets.

In [47]:
# Save the test set to test.csv
df_test = pd.concat([X_test, y_test], axis=1)
df_test.to_csv('test.csv')

In [48]:
# Save the train set to train.csv
df_train = pd.concat([X_train, y_train], axis=1)
df_train.to_csv('train.csv')
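
Since `to_csv` writes the index as the first column by default, whichever notebook reads these files back should pass `index_col=0`, as was done for `cleaned.csv`. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of a real file; the data and variable names are made up for illustration.

```python
import io

import pandas as pd

# Toy frame with a shuffled index, similar to what a random split produces
original = pd.DataFrame(
    {'Credit Amount': [20000, 120000], 'Defaulted': [True, False]},
    index=[7, 3],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer)

# Reading with index_col=0 restores the saved index
restored = pd.read_csv(io.StringIO(buffer.getvalue()), index_col=0)

# Reading without it leaves the index behind as an extra 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(buffer.getvalue()))

print(list(restored.columns))
print(list(naive.columns))
```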