### Machine Learning

Note: The following models were commented out (using """xxx""") as they took a long time (more than 40 mins) to run. Please uncomment them if you'd like to run them on your machine.
- SVM
- Random Forest
- GradientBoost
- XGBoost classifier 

In [1]:
# wip, refactor as we firm model choices
# Import libraries for pre-processing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

### Import files

In [2]:
# Import the files
df_test = pd.read_csv("https://raw.githubusercontent.com/AngShengJun/dsi14P4/master/assets/working/df_test_weather_cleaned.csv")
df_train = pd.read_csv("https://raw.githubusercontent.com/AngShengJun/dsi14P4/master/assets/working/df_train_weather_cleaned.csv")

In [3]:
df_test.head(1)

Unnamed: 0.1,Unnamed: 0,id,date,lat,long,wk,yr,tavg,stnpress,dewpt,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,0,1,2008-06-11,41.95469,-87.800991,23,2008,75.0,29.31,55.5,...,0,0,0,0,0,0,0,0,0,0


In [4]:
df_train.head(1)

Unnamed: 0.1,Unnamed: 0,date,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,0,2007-05-29,41.95469,-87.800991,0,1,22,2007,75.5,29.415,...,0,0,0,0,0,0,0,0,0,0


### Prep

In [5]:
# Make a copy of test
df_testcopy = df_test.copy()

In [6]:
# Drop unnecessary col
df_testcopy.drop(['id','Unnamed: 0'],axis=1,inplace=True)
# Review
df_testcopy.head(1)

Unnamed: 0,date,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,2008-06-11,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Make a copy of train
df_traincopy = df_train.copy()

In [8]:
# Drop unnecessary col
df_traincopy.drop(['Unnamed: 0'],axis=1,inplace=True)
# Review
df_traincopy.head(1)

Unnamed: 0,date,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,dewpt,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,2007-05-29,41.95469,-87.800991,0,1,22,2007,75.5,29.415,58.5,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Sanity check
print(df_traincopy.shape)
print(df_testcopy.shape)

(8610, 156)
(116293, 154)


In [10]:
# Drop date column from test and train
df_traincopy.drop(['date'],axis=1,inplace=True)
df_testcopy.drop(['date'],axis=1,inplace=True)

In [11]:
# trc - TrainComplete set
X_trc = df_traincopy.drop(['num_mos','wnv'],axis=1)
y_trc = df_traincopy['wnv']

In [12]:
df_traincopy.head(1)

Unnamed: 0,lat,long,wnv,num_mos,wk,yr,tavg,stnpress,dewpt,precip,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,0,1,22,2007,75.5,29.415,58.5,0.0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
df_testcopy.head(1)

Unnamed: 0,lat,long,wk,yr,tavg,stnpress,dewpt,precip,windspeed,daylight,...,trap_T230,trap_T231,trap_T232,trap_T233,trap_T235,trap_T236,trap_T237,trap_T238,trap_T900,trap_T903
0,41.95469,-87.800991,23,2008,75.0,29.31,55.5,0.0,9.15,15.166667,...,0,0,0,0,0,0,0,0,0,0


### Train-Validate-Split

In [14]:
# Train-validate-split
X_train,X_val,y_train,y_val = train_test_split(X_trc,y_trc,test_size=0.3,random_state=42, stratify=y_trc)

In [15]:
print(y_train.value_counts(normalize=True))

0    0.946906
1    0.053094
Name: wnv, dtype: float64


The positive class is wnv (West Nile virus present); the negative class is no wnv. The classes are highly imbalanced: only about 5% of training rows are positive. With this kind of imbalance, many classification algorithms predict the infrequent class poorly, and that is the positive class we are interested in.

We will use SMOTE (Synthetic Minority Over-sampling TEchnique) to mitigate the class imbalance. SMOTE synthesizes new minority-class observations from existing ones: a minority-class observation is picked at random, its k nearest minority-class neighbors are found, and synthetic points are placed on the line segments between the observation and those neighbors.
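The interpolation step described above can be sketched in plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_samples`, the toy `minority_rows`, and the choice of `k=2` are all assumptions made for the example.

```python
import random

def smote_like_samples(minority, k=2, n_new=4, seed=0):
    """Create synthetic points between minority rows and their k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority rows by squared Euclidean distance to the chosen row.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate: gap=0 reproduces the base row, gap close to 1 approaches the neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority observations.
minority_rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 9.0)]
new_rows = smote_like_samples(minority_rows)
print(new_rows)
```

Every synthetic row lies on a segment between two real minority rows, so SMOTE never creates points outside the region the minority class already spans.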

### Resampling

In [16]:
from imblearn.over_sampling import SMOTE

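# fit_resample is the current method name; newer imbalanced-learn releases no longer provide fit_sample
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X_train, y_train)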

In [17]:
# Review class balance after SMOTE application
print(y_sm.value_counts(normalize=True))

1    0.5
0    0.5
Name: wnv, dtype: float64


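On the resampled training set the classes are balanced, so the baseline accuracy there is 0.5, and a model needs to perform better than this. The validation set is not resampled: it keeps the original class mix (about 95% no wnv), so always predicting the majority class would already score about 0.95 accuracy on it.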

### Classification Models

Model workflow:
- pipeline for standard scaler (transformer) and classifier model (estimator), where relevant.
- gridsearch for best model parameters.
- metrics for evaluation: F1 score and roc_auc since we are dealing with imbalanced class distribution.
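Putting the scaler inside the pipeline matters because the grid search then learns the scaling statistics from the training folds only and reuses them on held-out data. A minimal sketch of that idea in plain Python follows; the helper names and the temperature values are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply previously learned statistics; never re-estimate them on new data."""
    return [(x - mu) / sigma for x in column]

train_tavg = [70.0, 75.0, 80.0]   # hypothetical training temperatures
val_tavg = [75.0, 85.0]           # hypothetical validation temperatures

mu, sigma = fit_scaler(train_tavg)
train_scaled = transform(train_tavg, mu, sigma)
val_scaled = transform(val_tavg, mu, sigma)
print(train_scaled, val_scaled)
```

The validation values are not centred on zero after scaling, which is expected: they are expressed relative to the training distribution.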

### Logistic Regression Model

In [18]:
pipe1 = Pipeline([
    ('ss', StandardScaler()),
    ('lor', LogisticRegression(solver='lbfgs',random_state=42)),
])

In [19]:
pipe1.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('lor',
   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                      intercept_scaling=1, l1_ratio=None, max_iter=100,
                      multi_class='auto', n_jobs=None, penalty='l2',
                      random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                      warm_start=False))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'lor': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'lor__C': 1.0,
 'lor__class_weight': None,
 'lor__dual': False,
 '

In [20]:
# Define the pipe parameters
pipe1_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'lor__max_iter': [100,200,300]}

In [21]:
# Instantiate Gridsearch
gs1 = GridSearchCV(pipe1,\
                  param_grid=pipe1_params,\
                  cv=10)
# Fit GridSearch to the cleaned training data.
gs1.fit(X_sm,y_sm)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('lor',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
                

In [22]:
# Check the results of the grid search
# Google Colab warns that the max_iter limit was reached, while Jupyter runs fine with the provided params
print(f"Best parameters: {gs1.best_params_}")
print(f"Best score: {gs1.best_score_}")

Best parameters: {'lor__max_iter': 100, 'ss__with_mean': True, 'ss__with_std': True}
Best score: 0.9677737597676785


In [23]:
# Save model
model1 = gs1.best_estimator_

In [24]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model1.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model1.score(X_val, y_val)}")

Accuracy on train set: 0.9686350096372875
Accuracy on validate set: 0.9454123112659698


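Validate accuracy (0.945) is well above the 0.5 baseline of the balanced training set, but it is no better than the roughly 0.947 obtained by always predicting the majority class on the imbalanced validation set, so accuracy alone says little here. The model is also slightly overfitted, with about a 2% drop in validate accuracy compared to train accuracy.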

In [25]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs1.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

Unnamed: 0,pred no wnv,pred wnv
Actual no wnv,2442,4
Actual wnv,137,0
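The counts in this matrix are enough to compare the model against a trivial classifier that always predicts "no wnv". The sketch below only re-does that arithmetic; the variable names are chosen for the example.

```python
# Counts read from the validation confusion matrix for logistic regression.
tn, fp, fn, tp = 2442, 4, 137, 0

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
# A "model" that always predicts the majority class (no wnv) gets every negative right.
majority_baseline = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

The model's accuracy sits just below the majority-class figure, which is why sensitivity, precision, F1 and ROC AUC are more informative than accuracy for this data.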


In [26]:
# return nparray as a 1-D array.
confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()

In [27]:
# Summary of metrics for log reg model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

Sensitivity: 0.0
Precision: 0.0
F1: nan
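F1 comes out as `nan` because precision and sensitivity are both 0, so the formula divides 0 by 0. One way to avoid this is to guard each division, as in the sketch below. Reporting 0 when a ratio is undefined is a common convention but an assumption here, not the notebook's choice, and the helper names are hypothetical.

```python
def safe_divide(numerator, denominator, default=0.0):
    """Return numerator / denominator, or `default` when the denominator is zero."""
    return numerator / denominator if denominator else default

def summarise(tn, fp, fn, tp):
    """Sensitivity, precision and F1 from confusion-matrix counts, with guarded divisions."""
    sens = safe_divide(tp, tp + fn)
    prec = safe_divide(tp, tp + fp)
    f1 = safe_divide(2 * prec * sens, prec + sens)
    return {"sensitivity": sens, "precision": prec, "f1": f1}

# Counts from the logistic regression confusion matrix on the validation set.
print(summarise(tn=2442, fp=4, fn=137, tp=0))
```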


In [28]:
pred_proba = [i[1] for i in gs1.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

Unnamed: 0,validate_values,pred_probs
5602,0,0.187528
1012,0,0.000989
7561,0,0.034399
7887,0,0.001305
4422,0,0.05358


In [29]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

0.7901146516583011
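ROC AUC can be well above 0.5 even when sensitivity is 0, because it depends only on how the predicted probabilities rank positives against negatives, not on the 0.5 cut-off used by `predict`. The sketch below computes it directly as the share of (positive, negative) pairs ranked correctly; the labels and scores are hypothetical, not taken from the notebook.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.05, 0.80]
print(roc_auc(labels, scores))
```

In this toy data one positive has a probability of only 0.35, so a 0.5 threshold would miss it, yet it still outranks two of the three negatives and contributes to the AUC.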

### SVM

Note: Code for SVM commented out, takes significant time to run (more than 30 mins).

In [30]:
pipe2 = Pipeline([
    ('ss', StandardScaler()),
    ('svm', SVC())
])

In [31]:
pipe2.get_params()

{'memory': None,
 'steps': [('ss', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('svm',
   SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
       max_iter=-1, probability=False, random_state=None, shrinking=True,
       tol=0.001, verbose=False))],
 'verbose': False,
 'ss': StandardScaler(copy=True, with_mean=True, with_std=True),
 'svm': SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False),
 'ss__copy': True,
 'ss__with_mean': True,
 'ss__with_std': True,
 'svm__C': 1.0,
 'svm__break_ties': False,
 'svm__cache_size': 200,
 'svm__class_weight': None,
 'svm__coef0': 0.0,
 'svm__decision_function_shape': 'ovr',
 'svm__degree': 3,
 'svm__gamma': 'scale',
 'svm__kernel': 'rbf

In [32]:
# Define the pipe parameters
pipe2_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'svm__C': [1,10],
                'svm__gamma': ['scale','auto'],
                'svm__kernel': ['rbf','linear','poly']
               }
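Much of the run time comes from the number of model fits the grid implies. A quick count, using the same grid as `pipe2_params` and assuming the 10-fold CV used for the other grid searches, can be sketched as follows.

```python
from itertools import product

# Same grid as pipe2_params above.
svm_grid = {
    "ss__with_mean": [True],
    "ss__with_std": [True],
    "svm__C": [1, 10],
    "svm__gamma": ["scale", "auto"],
    "svm__kernel": ["rbf", "linear", "poly"],
}
cv_folds = 10

combinations = list(product(*svm_grid.values()))
# One fit per combination per fold, plus the final refit of the best combination.
total_fits = len(combinations) * cv_folds + 1
print(len(combinations), total_fits)
```

The same count applied to the other grids in this notebook shows where time can be saved: dropping a value from any list divides the work for that parameter proportionally.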

In [None]:
# Initiate Gridsearch
gs2 = GridSearchCV(pipe2,
                  param_grid=pipe2_params,
                  cv=10)

# Fit gs2
gs2.fit(X_sm, y_sm)

In [None]:
# Check the results of the grid search

print(f"Best parameters: {gs2.best_params_}")
print(f"Best score: {gs2.best_score_}")

In [None]:
# Save model
model2 = gs2.best_estimator_

In [None]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model2.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model2.score(X_val, y_val)}")

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs2.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

In [None]:
# return nparray as a 1-D array.
confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()

In [None]:
# Summary of metrics for SVM model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

In [None]:
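# SVC() above is built with probability=False, so predict_proba is unavailable;
# decision_function scores are valid inputs to roc_auc_score
pred_proba = gs2.decision_function(X_val)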

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

In [None]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

### KNN Classifier

In [None]:
pipe3 = Pipeline([
    ('ss', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [None]:
pipe3.get_params()

In [None]:
# Define the pipe parameters
pipe3_params = {'ss__with_mean': [True],
                'ss__with_std': [True],
                'knn__n_neighbors' : [3,5,7],
                'knn__metric': ['euclidean','manhattan']}

In [None]:
# Instantiate Gridsearch
gs3 = GridSearchCV(pipe3,\
                  param_grid=pipe3_params,\
                  cv=10)
# Fit GridSearch to the cleaned training data.
gs3.fit(X_sm,y_sm)

In [None]:
# Check the results of the grid search

print(f"Best parameters: {gs3.best_params_}")
print(f"Best score: {gs3.best_score_}")

In [None]:
# Save model
model3 = gs3.best_estimator_

In [None]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model3.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model3.score(X_val, y_val)}")

The model is overfitted with about 4% drop in validate accuracy compared to train accuracy.

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs3.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

In [None]:
# return nparray as a 1-D array.
confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()

In [None]:
# Summary of metrics for KNN model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

In [None]:
pred_proba = [i[1] for i in gs3.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

In [None]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

### Decision Tree

In [None]:
gs4 = GridSearchCV(estimator = DecisionTreeClassifier(),\
                   param_grid = {'max_depth': [7, 9, 11],\
                                 'min_samples_split': [10, 15, 20],\
                                 'min_samples_leaf': [2, 3, 4],\
                                 'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10]},\
                   cv = 5,\
                   verbose = 2)

In [None]:
# Fit GridSearch to the cleaned training data.
gs4.fit(X_sm,y_sm)

In [None]:
# Check the results of the grid search

print(f"Best parameters: {gs4.best_params_}")
print(f"Best score: {gs4.best_score_}")

In [None]:
# Save model
model4 = gs4.best_estimator_

In [None]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model4.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model4.score(X_val, y_val)}")

The model is overfitted, with about an 8% drop in validate accuracy compared to train accuracy. The accuracy is lower than for all previous models.

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs4.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

In [None]:
# return nparray as a 1-D array.
confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()

In [None]:
# Summary of metrics for decision tree model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

In [None]:
pred_proba = [i[1] for i in gs4.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

In [None]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

### Random Forest + SMOTE oversampling & random undersampling

Note: Code for Random Forest commented out, takes significant time to run (more than 30 mins).

For the Random Forest classifier, we explore the combined effect of SMOTE oversampling and random undersampling: first oversample the minority class with SMOTE to about a 1:10 ratio, then undersample the majority class to achieve about a 1:2 ratio. Both steps sit inside the pipeline, so the grid search is fitted on X_train rather than on the pre-resampled X_sm.
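To see what the two ratios do to the class counts, the arithmetic can be sketched as below. The counts are illustrative assumptions (roughly what a 70% stratified split of 8,610 rows with 5.3% positives would give), the function name is hypothetical, and the rounding may differ slightly from what imbalanced-learn does internally.

```python
def resampled_counts(n_major, n_minor, over_ratio, under_ratio):
    """Class counts after oversampling the minority, then undersampling the majority.

    Each ratio is minority / majority, which is how a float sampling_strategy is read.
    """
    n_minor_over = max(n_minor, int(over_ratio * n_major))
    n_major_under = min(n_major, int(n_minor_over / under_ratio))
    return n_major_under, n_minor_over

# Illustrative counts (assumption): 5,707 negatives and 320 positives before resampling.
print(resampled_counts(n_major=5707, n_minor=320, over_ratio=0.1, under_ratio=0.5))
```

Compared with plain SMOTE to a 1:1 ratio, this combination creates far fewer synthetic positives and discards most negatives, which trades some information for a smaller, faster training set.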

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Aliased so it does not shadow sklearn's Pipeline imported earlier
from imblearn.pipeline import Pipeline as ImbPipeline

# Define pipeline: the samplers are applied only when fitting, not when predicting
rf_pipeline = ImbPipeline([
    ('over', SMOTE()),
    ('under', RandomUnderSampler()),
    ('rf', RandomForestClassifier())
])

In [None]:
# Define dictionary of hyperparameters.
pipeline_params = {
    'over__k_neighbors' : [1,2,3,4,5,6,7,8,9,10],
    'over__sampling_strategy' : [0.1],
    'under__sampling_strategy' : [0.5],
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [4, 6, 10, 12],
    'rf__random_state': [13]
}

In [None]:
# Instantiate our GridSearchCV object.
rf_gs = GridSearchCV(rf_pipeline, # What is the model we want to fit?
                                 pipeline_params, # What is the dictionary of hyperparameters?
                                 cv=5, # What number of folds in CV will we use?
                                 verbose=1,
                                 scoring='roc_auc')

In [None]:
# Fit the GridSearchCV object to the data.
rf_gs.fit(X_train, y_train)

In [None]:
# Save the best model
rf_gs_best = rf_gs.best_estimator_

In [None]:
# best_estimator_.score() uses the classifier's default metric (accuracy);
# scoring='roc_auc' only affects rf_gs.best_score_ and rf_gs.score()
print(f"Accuracy on train set: {rf_gs_best.score(X_train, y_train)}")
print(f"Accuracy on validate set: {rf_gs_best.score(X_val, y_val)}")

In [None]:
prediction_rf = rf_gs_best.predict(X_val)

In [None]:
# Confusion matrix
cm = confusion_matrix(y_val, prediction_rf)
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = cm.ravel()

In [None]:
# Summary of metrics for random forest model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

In [None]:
# predict_proba belongs to the fitted pipeline, not to the array of predictions
pred_proba = [i[1] for i in rf_gs_best.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

In [None]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

### Gradient Boosting Classifier

Note: Code for GradientBoostingClassifier commented out, takes significant time to run (more than 30 mins).

In [None]:
# Build upon the hyper-parameters used in the Decision Tree model
# Keep the learning rate in the lower range; usually 0.1 to 0.2
gs5 = GridSearchCV(estimator = GradientBoostingClassifier(random_state=42),\
                   param_grid = {'learning_rate' : [0.1,0.2],\
                                 'n_estimators' : [100,125],
                                 'min_samples_split': [10,15],\
                                 'min_samples_leaf': [2,3],\
                                 'max_depth': [9,11],\
                                 'ccp_alpha': [0,0.1]},\
                   cv = 5,\
                   verbose = 2)

In [None]:
# CAUTION: Takes long time to run (more than 50 mins)
# Fit GridSearch to the cleaned training data.
gs5.fit(X_sm,y_sm)

In [None]:
# Check the results of the grid search

print(f"Best parameters: {gs5.best_params_}")
print(f"Best score: {gs5.best_score_}")

In [None]:
# Save model
model5 = gs5.best_estimator_

In [None]:
# Score model on train set and validate set
print(f"Accuracy on train set: {model5.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {model5.score(X_val, y_val)}")

The model is overfitted, with about a 9% drop in validate accuracy compared to train accuracy. Validate accuracy (0.909) is lower than for logistic regression, SVM and KNN, but higher than for the decision tree and random forest.

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gs5.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

In [None]:
# return nparray as a 1-D array.
confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()

In [None]:
# Summary of metrics for gradient boosting model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

In [None]:
pred_proba = [i[1] for i in gs5.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

In [None]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

### XGBoost Classifier

Note: Code for XGBoost Classifier commented out, takes significant time to run (more than 30 mins).

The XGBoost classifier is not explicitly covered in class; we explore the capabilities of XGBoost in this section. XGBoost implements parallel processing and should run faster than GBM.
Update: it took 40 mins, slightly faster than GB, though this is likely due to the different params used.

In [None]:
from xgboost import XGBClassifier

In [None]:
gsX = GridSearchCV(estimator = XGBClassifier(random_state=42),\
                   param_grid = {'max_depth': [9,11],\
                                 'learning_rate' : [0.1],\
                                 'n_estimators' : [100,125],\
                                 'objective' : ['binary:logistic'],\
                                 'gamma': [0.5,1],\
                                 'min_child_weight': [1,5],\
                                 'subsample': [0.5,1.0],\
                                 'colsample_bytree': [0.5,1.0] },\
                   cv = 5,\
                   verbose = 2)

Notes on params:
- objective 'binary:logistic': logistic regression for binary classification; the model outputs a predicted probability (not a class).
- gamma (default=0): the minimum loss reduction required to make a split; larger values make the model more conservative.
- min_child_weight (default=1): the minimum sum of weights of all observations required in a child. Used to control overfitting; higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.
- subsample (default=1): the fraction of observations randomly sampled for each tree. Lower values make the algorithm more conservative and help prevent overfitting. Typical values: 0.5 to 1.
- colsample_bytree (default=1): the fraction of columns randomly sampled for each tree. Typical values: 0.5 to 1.
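For the `binary:logistic` objective, the summed tree outputs are treated as log-odds and passed through the logistic function to give a probability. A small sketch of that mapping, with a hypothetical helper name and made-up margins:

```python
import math

def logistic(margin):
    """Map a raw additive score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical raw scores: negative favours "no wnv", positive favours "wnv".
for margin in (-2.0, 0.0, 2.0):
    print(margin, round(logistic(margin), 4))
```

A margin of 0 corresponds to a probability of 0.5, which is the usual cut-off between the two predicted classes.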

In [None]:
# CAUTION: Takes long time to run (about 40 mins)
# Fit GridSearch to the cleaned training data.
gsX.fit(X_sm,y_sm)


In [None]:
# Check the results of the grid search

print(f"Best parameters: {gsX.best_params_}")
print(f"Best score: {gsX.best_score_}")

In [None]:
# Save model
modelX = gsX.best_estimator_

In [None]:
# Score model on train set and validate set
print(f"Accuracy on train set: {modelX.score(X_sm, y_sm)}")
print(f"Accuracy on validate set: {modelX.score(X_val, y_val)}")

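The model is overfitted, with about a 6% drop in validate accuracy compared to train accuracy. Validate accuracy (0.914) is lower than for logistic regression, SVM and KNN, but higher than for the decision tree, random forest and gradient boosting models.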

In [None]:
# Confusion matrix
# Pass in true values, predicted values to confusion matrix
# Convert confusion matrix into dataframe
# Positive class (class 1) is wnv
preds = gsX.predict(X_val)
cm = confusion_matrix(y_val, preds)
cm_df = pd.DataFrame(cm,columns=['pred no wnv','pred wnv'], index=['Actual no wnv','Actual wnv'])
cm_df

In [None]:
# return nparray as a 1-D array.
confusion_matrix(y_val, preds).ravel()
# Save TN/FP/FN/TP values.
tn, fp, fn, tp = confusion_matrix(y_val,preds).ravel()

In [None]:
# Summary of metrics for XGBoost model
sens = tp/(tp+fn)
prec = tp/(tp+fp)
f1 = 2*(prec*sens)/(prec+sens)
print(f"Sensitivity: {round(sens,4)}")
print(f"Precision: {round(prec,4)}")
print(f"F1: {round(f1,4)}")

In [None]:
pred_proba = [i[1] for i in gsX.predict_proba(X_val)]

pred_df = pd.DataFrame({'validate_values': y_val,
                        'pred_probs':pred_proba})
pred_df.head()

In [None]:
# Calculate ROC AUC.
roc_auc_score(pred_df['validate_values'],pred_df['pred_probs'])

### Summary of Model Metrics 

In [None]:
# Summary of model scores in a DataFrame (F1 is undefined for LogReg, so it is stored as NaN)
summary_df = pd.DataFrame({'accuracy(val)' : [0.945, 0.918, 0.934, 0.814, 0.852, 0.909, 0.914],
                           'sensitivity' :   [0,     0.226, 0.066, 0.650, 0.524, 0.226, 0.285],
                           'precision' :     [0,     0.226, 0.173, 0.171, 0.200, 0.196, 0.241],
                           'F1' :            [np.nan, 0.226, 0.095, 0.271, 0.290, 0.210, 0.261],
                           'roc_auc' :       [0.791, 0.795, 0.721, 0.804, 0.698, 0.847, 0.849]})
# Transpose dataframe
summary_dft = summary_df.T
# Rename columns
summary_dft.columns = ['LogReg','SVM', 'KNN', 'DT', 'RF(Smote O&U)', 'GBc', 'XGBc']
summary_dft

We pick the XGBoost classifier as the best model: it has the highest roc_auc (0.849) and an F1 score (0.261) close to the best observed. Next, we generate the predicted probabilities on the test set for the Kaggle submission.
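One way to make the comparison explicit is to rank the models on each metric using the figures from the summary table. The sketch below does that in plain Python; the dictionary layout and the `ranked` helper are assumptions made for the example.

```python
# Validation metrics copied from the summary table; LogReg's F1 is undefined (NaN).
metrics = {
    "LogReg":        {"f1": float("nan"), "roc_auc": 0.791},
    "SVM":           {"f1": 0.226, "roc_auc": 0.795},
    "KNN":           {"f1": 0.095, "roc_auc": 0.721},
    "DT":            {"f1": 0.271, "roc_auc": 0.804},
    "RF(Smote O&U)": {"f1": 0.290, "roc_auc": 0.698},
    "GBc":           {"f1": 0.210, "roc_auc": 0.847},
    "XGBc":          {"f1": 0.261, "roc_auc": 0.849},
}

def ranked(metric):
    """Model names sorted from best to worst on one metric, skipping NaN values."""
    valid = {name: vals[metric] for name, vals in metrics.items()
             if vals[metric] == vals[metric]}  # NaN is the only value not equal to itself
    return sorted(valid, key=valid.get, reverse=True)

print("Top 3 by F1:     ", ranked("f1")[:3])
print("Top 3 by roc_auc:", ranked("roc_auc")[:3])
```

XGBc is the only model that appears near the top of both rankings, while the random forest leads on F1 but has the lowest roc_auc.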

In [None]:
# instantiate the best model with the best hyperparams
best_model = modelX

### Model Evaluation

In [None]:
# Check train set headers
df_traincopy.head(1)

In [None]:
# Check test set headers
df_testcopy.head(1)

In [None]:
# making a copy of the train_kaggle df
X_train_kaggle = df_traincopy.copy().drop(['num_mos','wnv'],axis=1)
y_train_kaggle = df_traincopy.copy()['wnv']

X_test_kaggle = df_testcopy.copy()

In [None]:
X_train_kaggle.columns.difference(X_test_kaggle.columns)

In [None]:
X_test_kaggle.columns.difference(X_train_kaggle.columns)

In [None]:
#checking shape
X_train_kaggle.shape

In [None]:
X_train_kaggle.head()

In [None]:
X_test_kaggle.head()

In [None]:
#checking shape
X_test_kaggle.shape

In [None]:
# Scale variables
# Commented out (the best model is tree-based, so scaling is not required)
#ss = StandardScaler()
#X_train_kaggle_ss = ss.fit_transform(X_train_kaggle)
#X_test_kaggle_ss = ss.transform(X_test_kaggle)

In [None]:
# Resampling using SMOTE to oversample the minority class
X_train_kaggle_sm, y_train_kaggle_sm = smote.fit_resample(X_train_kaggle, y_train_kaggle)

In [None]:
# Refit the best model on the full resampled training data
best_model.fit(X_train_kaggle_sm, y_train_kaggle_sm)

In [None]:
# Score model on the full resampled train set
print(f"Accuracy on train set: {best_model.score(X_train_kaggle_sm,y_train_kaggle_sm)}")

In [None]:
# predicting kaggle output
predict_kaggle = best_model.predict(X_test_kaggle)

In [None]:
# probability prediction
predict_proba_kaggle = best_model.predict_proba(X_test_kaggle)

In [None]:
#Saving an output CSV file for submission
output = pd.DataFrame({'Id': df_test['id'], 'WnvPresent': predict_proba_kaggle[:,1]})
output.to_csv('my_submission_DSI_final.csv', index=False)