## Preprocessing and Modelling

In this notebook, I preprocess the data (balancing the classes and scaling the features) and then apply the following algorithms:

- Support Vector Classifier
- Random Forest
- XGBoost

**Contents:**
- [Import libraries and data](#Import-libraries-and-data)
- [Preprocessing](#Preprocessing)
- [Modelling](#Modelling)

### Import libraries and data

In [None]:
# imports
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import seaborn as sns
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

# preprocessing
from sklearn.preprocessing import MinMaxScaler
from prettytable import PrettyTable
from imblearn.over_sampling import SMOTE

# modelling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

print("All imported successfully!")
%matplotlib inline

In [None]:
# read data
train = pd.read_csv("../data/train-clean.csv")

### Preprocessing
Here I split the training set to obtain a validation set, balance the classes using `SMOTE`, and then scale the data.

#### Train-val split

In [None]:
# create features and target variable separately
features = [col for col in train.columns if col != 'Response']
X = train[features]
y = train['Response']

In [None]:
# split with test size 20% (the default would be 25%), keeping class proportions via stratify
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
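
To illustrate what `stratify=y` achieves, here is a minimal standard-library sketch of a stratified split: each class is shuffled and split separately, so the held-out set keeps the original class proportions. The helper `stratified_split` and the toy labels are hypothetical and only approximate the idea; this is not how scikit-learn implements it.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within the class only
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


# toy labels (invented): 80 samples of class 0 and 20 of class 1
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

print(Counter(labels[i] for i in test_idx))  # Counter({0: 16, 1: 4})
```

The 80/20 class ratio of the full toy set is reproduced in the held-out 20%.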

#### Balance classes with SMOTE
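By default (`sampling_strategy='auto'`), `SMOTE` oversamples every class except the majority class until each has as many samples as the majority class. Here I pass `sampling_strategy='minority'`, which resamples only the smallest class.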

In [None]:
# apply SMOTE
smote = SMOTE(sampling_strategy='minority')
counter_before = Counter(y_train)
print("Count before SMOTE: ", counter_before)

# fit and resample with SMOTE
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

counter_after = Counter(y_train_sm)
print("Count after SMOTE: ", counter_after)
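
To make the idea behind SMOTE concrete, here is a simplified sketch of its core step: a synthetic point is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours. The function `synthetic_point` and the coordinates are hypothetical; the `imblearn` implementation additionally performs the nearest-neighbour search and decides how many points to generate per class.

```python
import random


def synthetic_point(sample, neighbour, rng):
    """Interpolate between a minority sample and one of its neighbours."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)
sample = [1.0, 2.0]     # a minority-class row (invented values)
neighbour = [3.0, 6.0]  # one of its nearest minority-class neighbours

new_row = synthetic_point(sample, neighbour, rng)
print(new_row)
```

Because the new row always lies between two existing minority rows, the oversampled class stays within the region it already occupies in feature space.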

#### Scaling Data
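Since some features have already been normalized (ranging between 0 and 1), I use `MinMaxScaler` to bring the remaining features into the same range. The scaler is the first step of each pipeline in the Modelling section, so it is fit only on the training portion of each cross-validation fold.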
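
As a minimal sketch of what min-max scaling does, the snippet below rescales a single column by hand, using bounds learned from the training values only. The helpers `fit_min_max` and `apply_min_max` are hypothetical and the numbers are invented for illustration; in this notebook the scaling itself is done by `MinMaxScaler`.

```python
def fit_min_max(values):
    """Learn the lower and upper bound of a column from training values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values relative to the learned bounds: (v - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # constant column: map everything to 0.0 rather than divide by zero
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# made-up ages, purely illustrative
train_ages = [20, 30, 40, 60]
val_ages = [25, 70]

lo, hi = fit_min_max(train_ages)
train_scaled = apply_min_max(train_ages, lo, hi)
val_scaled = apply_min_max(val_ages, lo, hi)

print(train_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(val_scaled)    # [0.125, 1.25]
```

Note that a validation value above the training maximum maps outside the 0 to 1 range, because the bounds come from the training data alone.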

### Modelling
I apply the following classification algorithms in a pipeline and determine which one achieves the best score.

In [None]:
# instantiate classifiers
rfc = RandomForestClassifier()
svc = SVC()
xgbc = XGBClassifier()

In [None]:
# build pipeline
p1 = Pipeline([('mm', MinMaxScaler()),
              ('rfc', rfc)])
p2 = Pipeline([('mm', MinMaxScaler()),
              ('svc', svc)])
p3 = Pipeline([('mm', MinMaxScaler()),
              ('xgbc', xgbc)])

In [None]:
# params
params1 = [{'rfc__n_estimators': [100, 250, 500], 
           'rfc__max_depth': [10,20],
           'rfc__min_samples_split': [5,10,20]}]
params2 = [{'svc__C': [1], 
           'svc__kernel': ['linear']}] 
params3 = [{'xgbc__n_estimators': [50, 100, 250, 500],
           'xgbc__max_depth': [3,5,10],
           'xgbc__learning_rate': [0.05,0.1,0.3]}]

In [None]:
# set up gridsearch for each algo
from sklearn.metrics import make_scorer

# 'cohen_kappa_score' is not a built-in scoring string, so wrap the metric in a scorer
kappa_scorer = make_scorer(cohen_kappa_score)

gridcvs = {}

inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)

for paramgrid, estimator, name in zip((params1, params2, params3),
                                      (p1, p2, p3),
                                      ('Random Forest Classifier', 'Support Vector Classifier', 'XGBoost Classifier')):
    gcv = GridSearchCV(estimator=estimator,
                       param_grid=paramgrid,
                       scoring=kappa_scorer,
                       n_jobs=-1,
                       cv=inner_cv,
                       verbose=0,
                       refit=True)
    gridcvs[name] = gcv

In [None]:
%%time
# score on algos (nested CV: the grid search runs inside each outer fold)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

a = PrettyTable(title="Cross-validated Cohen's kappa", header_style='title', max_table_width=110)
a.field_names = ["Algorithms", "Cohen's kappa", "Standard Deviation"]
for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                   X=X_train_sm,
                                   y=y_train_sm,
                                   cv=outer_cv,
                                   scoring=kappa_scorer)
    # kappa is a coefficient, not a percentage, so report it as a plain number
    a.add_row([name, f'{nested_score.mean():.3f}', f'+/- {nested_score.std():.3f}'])
    print(f'Done with {name}')
# print table
print(a)
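
For readers unfamiliar with the metric, Cohen's kappa compares the observed agreement between predictions and labels with the agreement expected by chance from the class frequencies: kappa = (p_o - p_e) / (1 - p_e). Below is a minimal hand-rolled sketch of the unweighted statistic, which is what `cohen_kappa_score` computes with its default `weights=None`. The helper `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa; assumes chance agreement is below 1."""
    n = len(y_true)
    # p_o: fraction of samples where prediction and label agree
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # p_e: agreement expected if predictions were independent of labels
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# toy labels (invented): 7 of 10 predictions are correct
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 3))  # 0.4
```

Here the accuracy is 0.7, but half of the agreement would be expected by chance, so kappa is only 0.4. A kappa of 0 means no better than chance and 1 means perfect agreement.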

### Kaggle Submission