___
# 10 - Resampling

### Time to look into oversampling / undersampling techniques.

Resampling changes the class distribution of the target variable in our training data. There are two main approaches: oversampling and undersampling. In their simplest (random) form, oversampling duplicates randomly chosen samples from the minority class (increasing the number of observations of the under-represented class), whilst undersampling deletes randomly chosen samples from the majority class (decreasing the number of observations of the over-represented class).
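
To make the two ideas concrete before reaching for `imblearn`, here is a minimal pure-Python sketch on toy data. The helper names (`random_oversample`, `random_undersample`) and the toy lists are hypothetical and only illustrate the concept; this is not how `RandomOverSampler` or `RandomUnderSampler` are implemented internally.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)  # pick an existing row of this class to duplicate
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 6 majority rows (class 0) and 2 minority rows (class 1)
X_toy = [[i] for i in range(8)]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

X_over, y_over = random_oversample(X_toy, y_toy)
X_under, y_under = random_undersample(X_toy, y_toy)

print(Counter(y_over))   # both classes now have 6 rows
print(Counter(y_under))  # both classes now have 2 rows
```

Oversampling keeps every original row but repeats minority rows, so the training set grows; undersampling throws majority rows away, so it shrinks.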

In [29]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, classification_report

from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

from imblearn.pipeline import Pipeline, make_pipeline

In [30]:
train = pd.read_csv('encoded_train.csv')
test = pd.read_csv('encoded_test.csv')

In [31]:
y_train = train['casualty_severity']
X_train = train.drop('casualty_severity', axis=1)

y_test = test['casualty_severity']
X_test = test.drop('casualty_severity', axis=1)

In [32]:
def resample_dist(sampler_class, X_train=X_train, y_train=y_train):
  X_trans, y_trans = sampler_class.fit_resample(X_train, y_train)
  y_0 = y_trans.value_counts()[0] / len(y_trans)
  y_1 = y_trans.value_counts()[1] / len(y_trans)

  resampler_dict = {
    'Sampler type': sampler_class,
    'y=0': y_0,
    'y=1': y_1,
    'new_training_size': len(y_trans),
    'training_change': (len(y_trans) - len(y_train)) / len(y_train)
  }

  return resampler_dict

In [33]:
def evaluate_model(model_class, X_train=X_train, y_train=y_train):
    # Fixed seed so that all four metrics are scored on the same stratified folds
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    y = y_train.to_numpy().ravel()

    # Stratified 5-fold cross-validation on the training data; scores are converted to percentages
    accuracy_scores = cross_val_score(model_class, X_train, y, cv=kf, scoring='accuracy') * 100
    precision_scores = cross_val_score(model_class, X_train, y, cv=kf, scoring='precision') * 100
    recall_scores = cross_val_score(model_class, X_train, y, cv=kf, scoring='recall') * 100
    f1_scores = cross_val_score(model_class, X_train, y, cv=kf, scoring='f1') * 100

    metrics_dict = {
        'Model Type': model_class,
        'CV_mean_accuracy': np.round(accuracy_scores.mean(), 1),
        'CV_mean_precision': np.round(precision_scores.mean(), 1),
        'CV_mean_recall': np.round(recall_scores.mean(), 1),
        'CV_mean_F1': np.round(f1_scores.mean(), 1)
    }

    return metrics_dict

### Let's try random resampling first, starting with oversampling

In [34]:
random_overs_RF_pipeline = make_pipeline(RandomOverSampler(random_state=42), 
                              RandomForestClassifier())

random_overs_XGB_pipeline = make_pipeline(RandomOverSampler(random_state=42), 
                              XGBClassifier())

In [35]:
models = {'XGB_oversampled': random_overs_XGB_pipeline, 'RF_oversampled': random_overs_RF_pipeline}  
model_metric_dict = {}

for key, values in models.items():
  metrics_dict = evaluate_model(values)
  model_metric_dict.update({key: metrics_dict})

results = pd.DataFrame.from_dict(model_metric_dict).T.round(2)
results.sort_values(by='CV_mean_F1', ascending=False)

| Model | Model Type | CV_mean_accuracy | CV_mean_precision | CV_mean_recall | CV_mean_F1 |
|---|---|---|---|---|---|
| RF_oversampled | `(RandomOverSampler(random_state=42), RandomFor...` | 77.3 | 59.6 | 55.8 | 57.5 |
| XGB_oversampled | `(RandomOverSampler(random_state=42), XGBClassi...` | 72.1 | 49.8 | 63.1 | 56.0 |


### Now, random undersampling

In [36]:
random_unders_RF_pipeline = make_pipeline(RandomUnderSampler(random_state=42), 
                              RandomForestClassifier())

random_unders_XGB_pipeline = make_pipeline(RandomUnderSampler(random_state=42), 
                              XGBClassifier())

models_under = {'XGB_undersampled': random_unders_XGB_pipeline, 'RF_undersampled': random_unders_RF_pipeline}
for key, values in models_under.items():
  metrics_dict = evaluate_model(values)
  model_metric_dict.update({key: metrics_dict})


In [37]:
results = pd.DataFrame.from_dict(model_metric_dict).T.round(2)
results.sort_values(by='CV_mean_F1', ascending=False)

| Model | Model Type | CV_mean_accuracy | CV_mean_precision | CV_mean_recall | CV_mean_F1 |
|---|---|---|---|---|---|
| RF_oversampled | `(RandomOverSampler(random_state=42), RandomFor...` | 77.3 | 59.6 | 55.8 | 57.5 |
| RF_undersampled | `(RandomUnderSampler(random_state=42), RandomFo...` | 70.2 | 47.5 | 73.7 | 57.5 |
| XGB_oversampled | `(RandomOverSampler(random_state=42), XGBClassi...` | 72.1 | 49.8 | 63.1 | 56.0 |
| XGB_undersampled | `(RandomUnderSampler(random_state=42), XGBClass...` | 68.2 | 45.3 | 70.1 | 54.3 |


### Let's try something a little spicier: SMOTE (Synthetic Minority Oversampling Technique)

In [38]:
smote_XGB_pipeline = make_pipeline(SMOTE(random_state=42), XGBClassifier())

smote_RF_pipeline = make_pipeline(SMOTE(random_state=42), RandomForestClassifier())

models_smote = {'XGB_SMOTE': smote_XGB_pipeline, 'RF_SMOTE': smote_RF_pipeline}

for key, values in models_smote.items():
  metrics_dict = evaluate_model(values)
  model_metric_dict.update({key: metrics_dict})

results = pd.DataFrame.from_dict(model_metric_dict).T.round(2)
results.sort_values(by='CV_mean_F1', ascending=False)

| Model | Model Type | CV_mean_accuracy | CV_mean_precision | CV_mean_recall | CV_mean_F1 |
|---|---|---|---|---|---|
| RF_SMOTE | `(SMOTE(random_state=42), RandomForestClassifie...` | 78.3 | 64.3 | 51.7 | 58.2 |
| RF_oversampled | `(RandomOverSampler(random_state=42), RandomFor...` | 77.3 | 59.6 | 55.8 | 57.5 |
| RF_undersampled | `(RandomUnderSampler(random_state=42), RandomFo...` | 70.2 | 47.5 | 73.7 | 57.5 |
| XGB_oversampled | `(RandomOverSampler(random_state=42), XGBClassi...` | 72.1 | 49.8 | 63.1 | 56.0 |
| XGB_undersampled | `(RandomUnderSampler(random_state=42), XGBClass...` | 68.2 | 45.3 | 70.1 | 54.3 |
| XGB_SMOTE | `(SMOTE(random_state=42), XGBClassifier(base_sc...` | 75.1 | 58.3 | 39.4 | 46.8 |
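
Unlike random oversampling, SMOTE does not copy rows: it creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core step in pure Python. The function name `smote_sketch`, the value `k=2` and the toy points are hypothetical choices for illustration; `imblearn`'s `SMOTE` is more involved than this.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        # New point lies on the line segment between x and the chosen neighbour
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Toy minority class (hypothetical): four points on the corners of a unit square
minority_pts = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority_pts, n_new=5)

for p in new_points:
    print([round(v, 2) for v in p])
```

Every synthetic point sits between two real minority points, so the new samples stay inside the region the minority class already occupies.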


### Undersampling using Tomek links

In [39]:
tomek_XGB_pipeline = make_pipeline(TomekLinks(), XGBClassifier())

tomek_RF_pipeline = make_pipeline(TomekLinks(), RandomForestClassifier())

models_tomek = {'XGB_tomek': tomek_XGB_pipeline, 'RF_tomek': tomek_RF_pipeline}

for key, values in models_tomek.items():
  metrics_dict = evaluate_model(values)
  model_metric_dict.update({key: metrics_dict})

results = pd.DataFrame.from_dict(model_metric_dict).T.round(2)
results.sort_values(by='CV_mean_F1', ascending=False)

| Model | Model Type | CV_mean_accuracy | CV_mean_precision | CV_mean_recall | CV_mean_F1 |
|---|---|---|---|---|---|
| RF_SMOTE | `(SMOTE(random_state=42), RandomForestClassifie...` | 78.3 | 64.3 | 51.7 | 58.2 |
| RF_oversampled | `(RandomOverSampler(random_state=42), RandomFor...` | 77.3 | 59.6 | 55.8 | 57.5 |
| RF_undersampled | `(RandomUnderSampler(random_state=42), RandomFo...` | 70.2 | 47.5 | 73.7 | 57.5 |
| RF_tomek | `(TomekLinks(), RandomForestClassifier())` | 77.9 | 61.9 | 55.5 | 57.1 |
| XGB_oversampled | `(RandomOverSampler(random_state=42), XGBClassi...` | 72.1 | 49.8 | 63.1 | 56.0 |
| XGB_undersampled | `(RandomUnderSampler(random_state=42), XGBClass...` | 68.2 | 45.3 | 70.1 | 54.3 |
| XGB_tomek | `(TomekLinks(), XGBClassifier(base_score=None, ...` | 75.0 | 59.2 | 43.2 | 49.6 |
| XGB_SMOTE | `(SMOTE(random_state=42), XGBClassifier(base_sc...` | 75.1 | 58.3 | 39.4 | 46.8 |
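
A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbour. Removing the majority-class member of each link cleans up the class boundary rather than balancing the classes. The pure-Python sketch below illustrates the idea on one-dimensional toy data; the helper names and the data are hypothetical, and the brute-force neighbour search is only suitable for tiny examples.

```python
import math


def nearest_neighbour(i, X):
    """Index of the point closest to X[i], excluding i itself (brute force)."""
    return min(
        (j for j in range(len(X)) if j != i),
        key=lambda j: math.dist(X[i], X[j]),
    )


def tomek_links(X, y):
    """Pairs (i, j) of opposite-class points that are mutual nearest neighbours."""
    links = []
    for i in range(len(X)):
        j = nearest_neighbour(i, X)
        if i < j and y[i] != y[j] and nearest_neighbour(j, X) == i:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label=0):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): points 2.0 (class 0) and 2.4 (class 1) straddle the boundary
X_pts = [[0.0], [1.1], [2.0], [2.4], [5.0], [6.0]]
y_pts = [0, 0, 0, 1, 1, 1]

print(tomek_links(X_pts, y_pts))
X_clean, y_clean = remove_majority_in_links(X_pts, y_pts)
print(X_clean, y_clean)
```

Only the borderline majority point is removed, which is why this kind of undersampling shrinks the training set far less than random undersampling does.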


In [40]:
resamplers = {'Random Oversampler': RandomOverSampler(random_state=42), 'Random undersampler': RandomUnderSampler(random_state=42), 'SMOTE': SMOTE(random_state=42), 'Tomek links': TomekLinks()}

resampler_metric_dict = {}

for key, values in resamplers.items():
  resampler_dict = resample_dist(values)
  resampler_metric_dict.update({key: resampler_dict})

results = pd.DataFrame.from_dict(resampler_metric_dict).T.round(2)
results

| Resampler | Sampler type | y=0 | y=1 | new_training_size | training_change |
|---|---|---|---|---|---|
| Random Oversampler | `RandomOverSampler(random_state=42)` | 0.5 | 0.5 | 16312 | 0.438829 |
| Random undersampler | `RandomUnderSampler(random_state=42)` | 0.5 | 0.5 | 6362 | -0.438829 |
| SMOTE | `SMOTE(random_state=42)` | 0.5 | 0.5 | 16312 | 0.438829 |
| Tomek links | `TomekLinks()` | 0.706766 | 0.293234 | 10848 | -0.043133 |
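
The numbers in this table can be sanity-checked by hand. Assuming the two random samplers balance the classes exactly 50/50 (as the `y=0` and `y=1` columns suggest), the oversampled size is twice the majority count and the undersampled size is twice the minority count, which lets us back out the original class counts. The sketch below reproduces the `training_change` column from those derived counts; the variable names are illustrative only.

```python
# Sizes reported in the table above
over_size = 16312    # random oversampling and SMOTE
under_size = 6362    # random undersampling
tomek_size = 10848   # Tomek links

# Assumption: both random samplers end at an exact 50/50 split
majority = over_size // 2       # 8156 rows of class 0
minority = under_size // 2      # 3181 rows of class 1
original = majority + minority  # 11337 rows before resampling


def training_change(new_size, old_size):
    """Relative change in training-set size, as computed in resample_dist."""
    return (new_size - old_size) / old_size


over_change = training_change(over_size, original)
under_change = training_change(under_size, original)
tomek_change = training_change(tomek_size, original)

# Tomek links left the minority class untouched, so only majority rows were removed
tomek_majority_share = (tomek_size - minority) / tomek_size

print(round(over_change, 6), round(under_change, 6), round(tomek_change, 6))
print(round(tomek_majority_share, 6))
```

The over- and undersampling changes are mirror images (about +44% and -44%) because both equal the gap between the two class counts divided by the original size, while Tomek links removes only about 4% of the rows.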


___

### With resampling, we've increased the cross-validated F1 score of the XGBoost classifier from 45.7% to 56.0% (random oversampling) - a ~23% relative increase. The cross-validated F1 score of the RF classifier has gone from 56.9% to 58.2% (SMOTE) - only a ~2% relative increase, but nothing to be sniffed at!
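
As a quick check on the relative-increase arithmetic, the sketch below takes the baseline F1 scores quoted above (45.7 and 56.9) and the best cross-validated F1 per model from the results tables (56.0 for `XGB_oversampled`, 58.2 for `RF_SMOTE`). The helper name `relative_increase` is hypothetical.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


xgb_gain = relative_increase(45.7, 56.0)  # XGBoost: baseline vs. random oversampling
rf_gain = relative_increase(56.9, 58.2)   # Random forest: baseline vs. SMOTE

print(f"XGB: {xgb_gain:.1f}%  RF: {rf_gain:.1f}%")
```

Because the cross-validation folds are shuffled, these figures can move by a point or so between runs, so small differences between resamplers should be read with some caution.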


Now we can move on to hyperparameter tuning our model on the resampled training data. After that, we'll finally see how we do on the test data!