## Preprocessing and Modelling

In this notebook, I preprocess the data (balancing the classes and scaling the features) and then apply the following algorithms:

- Support Vector Classifier
- Random Forest
- XGBoost

**Contents:**
- [Import libraries and data](#Import-libraries-and-data)
- [Preprocessing](#Preprocessing)
- [Modelling](#Modelling)

### Import libraries and data

In [44]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-white'
plt.style.use('seaborn-white')
import seaborn as sns
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

# preprocessing
from sklearn.preprocessing import MinMaxScaler
from prettytable import PrettyTable
from imblearn.over_sampling import SMOTE
from collections import Counter

# modelling
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

print("All imported successfully!")
%matplotlib inline

All imported successfully!


In [29]:
# read data
train = pd.read_csv("../data/train-reviewed.csv")
test = pd.read_csv("../data/test-clean.csv")

### Preprocessing
Here I hold out a validation set from the training data, 
balance the training classes using `SMOTE`, and then scale the features. 

#### Train-val split

In [30]:
# create features and target variable separately
features = [col for col in train.columns if col != 'Response']
X = train[features]
y = train['Response']

In [31]:
# stratified split; test_size is not set, so scikit-learn's default of 25% is held out
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)
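
To illustrate what `stratify=y` does, here is a minimal sketch on made-up data. The toy arrays and the explicit `test_size=0.2` are assumptions for illustration only and are not values from this notebook. It should show the class proportions being preserved on both sides of the split.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 100 rows, 80 of class 0 and 20 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# explicit 20% hold-out; without test_size, scikit-learn holds out 25%
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_counts = Counter(y_tr.tolist())
val_counts = Counter(y_va.tolist())
print("train:", train_counts)
print("val:  ", val_counts)
```

Both parts keep the 80/20 class ratio of the toy target, which matters when some classes are rare.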

#### Balance classes with SMOTE
By default, `SMOTE` will oversample all classes to have the same number of samples as the class with the most samples.

In [32]:
# apply SMOTE
smote = SMOTE(random_state=42)  # fixed seed so the synthetic samples are reproducible
counter_before = Counter(y_train)
print("Count before SMOTE: ", counter_before)

#fit and resample with SMOTE
X_train_sm, y_train_sm = smote.fit_resample(X_train,y_train)

counter_after = Counter(y_train_sm)
print("Count after SMOTE: ", counter_after)

Count before SMOTE:  Counter({8: 14587, 6: 8401, 7: 6005, 2: 4898, 1: 4642, 5: 4067, 4: 1070, 3: 757})
Count after SMOTE:  Counter({6: 14587, 8: 14587, 4: 14587, 2: 14587, 7: 14587, 5: 14587, 1: 14587, 3: 14587})
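
The idea behind the synthetic samples can be sketched in plain NumPy. This is a simplified illustration of the interpolation step only, not imblearn's implementation. The helper `smote_like_samples`, the toy points, and `k=2` are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# hypothetical minority class with 4 samples; the majority class has 10
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
n_majority = 10
X_synth = smote_like_samples(X_minority, n_new=n_majority - len(X_minority))
X_balanced = np.vstack([X_minority, X_synth])
print(X_balanced.shape)
```

Each synthetic point lies on a line segment between two existing minority samples, so no new point falls outside the region the class already occupies.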


#### Scaling Data
Some features are already normalized to the range 0 to 1, so I use `MinMaxScaler` to bring the remaining features into the same bounds. 

In [33]:
#instantiate
mms = MinMaxScaler()

#fit and transform train set
X_train_sm_sc = mms.fit_transform(X_train_sm)

#transform the validation and test sets
X_val_sc = mms.transform(X_val)
test_sc = mms.transform(test)
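
As a small sketch of why the scaler is fitted on the training set only, the example below uses made-up one-feature arrays (hypothetical values, not from this dataset). Because `transform` reuses the minimum and maximum learned during `fit`, unseen rows can land outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy values (hypothetical), one feature
X_fit = np.array([[0.0], [5.0], [10.0]])  # stands in for the training set
X_new = np.array([[-2.0], [12.0]])        # stands in for validation/test rows

scaler = MinMaxScaler()
X_fit_sc = scaler.fit_transform(X_fit)    # learns min=0 and max=10
X_new_sc = scaler.transform(X_new)        # reuses the training min and max

print(X_fit_sc.ravel())
print(X_new_sc.ravel())
```

Values slightly outside the bounds on the validation or test set are expected with this approach and do not indicate a bug.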

### Modelling
I wrap each of the following classifiers in a pipeline, tune it with a grid search, and compare the cross-validated Cohen's kappa scores to determine which one performs best.

In [46]:
# instantiate classifiers
rfc = RandomForestClassifier()
svc = SVC()
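# note: recent XGBoost releases expect labels 0..n_classes-1; shift the target if fit raises a ValueError
xgbc = XGBClassifier()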

In [47]:
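# build pipelines; the scaler is refit inside each CV fold
# (redundant when the input has already been scaled, as X_train_sm_sc has)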
p1 = Pipeline([('mm', MinMaxScaler()),
              ('rfc', rfc)])
p2 = Pipeline([('mm', MinMaxScaler()),
              ('svc', svc)])
p3 = Pipeline([('mm', MinMaxScaler()),
              ('xgbc', xgbc)])

In [48]:
# params
params1 = [{'rfc__n_estimators': [100, 250, 500], 
           'rfc__max_depth': [10,20],
           'rfc__min_samples_split': [5,10,20]}]
params2 = [{'svc__C': [1], 
           'svc__kernel': ['linear']}] 
params3 = [{'xgbc__n_estimators': [50, 100, 250, 500],
           'xgbc__max_depth': [3,5,10],
           'xgbc__learning_rate': [0.05,0.1,0.3]}]

In [49]:
# set up gridsearch for each algo
gridcvs = {}

# set folds for inner
inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)

# make scorer
kappa_scorer = make_scorer(cohen_kappa_score)

for paramgrid, estimator, name in zip((params1,params2,params3),
                                     (p1,p2,p3),
                                     ('Random Forest Classifier', 'Support Vector Classifier', 'XGBoost Classifier')):
    gcv = GridSearchCV(estimator = estimator,
                      param_grid = paramgrid,
                      scoring = kappa_scorer,
                      n_jobs=-1,
                      cv=inner_cv,
                      verbose=0,
                      refit=True)
    gridcvs[name]=gcv
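
For reference, the metric behind `kappa_scorer` can be sketched by hand. The helper `cohen_kappa` and the toy labels below are hypothetical. The sketch computes the unweighted statistic, which is what `cohen_kappa_score` returns with its default arguments.

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa computed from the confusion matrix."""
    y_true, y_pred = list(y_true), list(y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}

    # confusion matrix: rows are true labels, columns are predictions
    cm = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1

    n = cm.sum()
    p_observed = np.trace(cm) / n                            # actual agreement
    p_expected = (cm.sum(axis=1) @ cm.sum(axis=0)) / n ** 2  # agreement by chance
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
print(kappa)
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement. `cohen_kappa_score` also accepts a `weights` argument (`'linear'` or `'quadratic'`) that penalises larger disagreements more heavily on ordinal targets; this notebook uses the unweighted default.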

In [None]:
%%time
# score on algos
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# create table
a = PrettyTable(title="Cross-validated kappa score", header_style='title', max_table_width=110)
a.field_names =["Algorithms", "Kappa score", "Standard Deviation"]

# get cross val score 
for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                  X=X_train_sm_sc,
                                  y=y_train_sm,
                                  cv=outer_cv,
                                  scoring=kappa_scorer)
    a.add_row([name, f'{nested_score.mean():.1%}', f'+/- {nested_score.std():.1%}'])
    print(f'Done with {name}')

#print table
print(a)
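
The nested cross-validation pattern used above can be tried quickly on a small synthetic problem. The sketch below is only an illustration under assumed settings: the dataset, the decision tree, and the tiny grid are hypothetical stand-ins chosen so that it runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# small synthetic problem (hypothetical stand-in for the real data)
X_demo, y_demo = make_classification(n_samples=150, n_features=6,
                                     n_informative=4, random_state=0)

scorer = make_scorer(cohen_kappa_score)
inner = KFold(n_splits=2, shuffle=True, random_state=0)  # tunes hyperparameters
outer = KFold(n_splits=3, shuffle=True, random_state=0)  # estimates performance

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [2, 4]},
                      scoring=scorer, cv=inner)

# each outer fold runs its own grid search on the outer-training part only
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer, scoring=scorer)
print(f'{nested_scores.mean():.3f} +/- {nested_scores.std():.3f}')
```

Because the hyperparameters are chosen inside each outer fold, the outer scores are not biased by the tuning step, at the cost of fitting many more models.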

### Kaggle Submission