## Final model and predictions

### Hyperparameter tuning 

For the AdaBoost model, we tune two hyperparameters: `learning_rate`, the weight applied to each classifier at each boosting iteration (a higher learning rate increases the contribution of each classifier), and `n_estimators`, the maximum number of estimators used.

In [2]:
import numpy as np
import pandas as pd

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

## this is to suppress warnings I was getting in this code. 
import warnings
# Suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)


In [3]:
## these are our parameters we want to tune.
param_grid = {"n_estimators": np.arange(50,750,100),
              "learning_rate": [0.01, 0.1, 1]}
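
Before running the search, it can help to check how many model fits this grid implies. The following is a minimal sketch using scikit-learn's `ParameterGrid`; the fold and target counts (`n_folds`, `n_targets`) are illustrative names that mirror the `cv=5` setting and the ten `VL1r` targets used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {"n_estimators": np.arange(50, 750, 100),
              "learning_rate": [0.01, 0.1, 1]}

# Number of hyperparameter combinations the grid search will try
n_candidates = len(ParameterGrid(param_grid))

# Assumed to match the notebook: 5-fold CV and 10 target columns
n_folds = 5
n_targets = 10

# Cross-validation fits only; GridSearchCV also refits the best
# candidate once per search when refit=True (its default)
total_fits = n_candidates * n_folds * n_targets

print(param_grid["n_estimators"].tolist())  # 50, 150, ..., 650
print(n_candidates, total_fits)             # 21 candidates, 1050 CV fits
```

With 7 values of `n_estimators` and 3 learning rates, each target requires 105 cross-validation fits, which is why the tuning cell below can take a while to run.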

In [5]:
#importing the clean survey training data to tune the model
survey_train = pd.read_csv('Data/survey_data_train.csv')

In [6]:
#features we are focusing on for our model
features = ['S2', 'D4', 'Fan_magnitude']

## the outputs we are predicting
targets = ['VL1r1','VL1r2','VL1r4','VL1r5','VL1r7',
           'VL1r10','VL1r11','VL1r12','VL1r13' ,'VL1r14']

In [7]:
## initialize our model
Ada = AdaBoostClassifier()

## dictionary for our hyperparameters
VL_dict = {}

## for our outputs, we determine the best parameters and store those in a dictionary
## to use later.
for VL in targets:
    print(VL)
    search = GridSearchCV(Ada, param_grid, cv=5).fit(survey_train[features], survey_train[VL])
    VL_dict[VL] = search.best_params_

VL1r1
VL1r2
VL1r4
VL1r5
VL1r7
VL1r10
VL1r11
VL1r12
VL1r13
VL1r14
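
The loop above keeps only `best_params_`. As a hedged illustration of what else a fitted search exposes, the sketch below runs `GridSearchCV` on a small synthetic dataset (a stand-in, not the survey data) with a deliberately tiny grid. For a classifier with no `scoring` argument, `GridSearchCV` uses the estimator's own `score` method, which is accuracy. The names `X_demo`, `y_demo`, `small_grid`, and `demo_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey data: 3 features, binary target (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Deliberately small grid so the example runs quickly
small_grid = {"n_estimators": [10, 20],
              "learning_rate": [0.1, 1]}

demo_search = GridSearchCV(AdaBoostClassifier(random_state=0),
                           small_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(round(demo_search.best_score_, 4))
```

Storing `best_score_` alongside `best_params_` would make it possible to compare the cross-validated accuracy with the test accuracy reported later.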


In [8]:
## viewing our dictionary of best parameters for each VL output
VL_dict

{'VL1r1': {'learning_rate': 1, 'n_estimators': 350},
 'VL1r2': {'learning_rate': 0.1, 'n_estimators': 450},
 'VL1r4': {'learning_rate': 0.1, 'n_estimators': 350},
 'VL1r5': {'learning_rate': 1, 'n_estimators': 550},
 'VL1r7': {'learning_rate': 0.01, 'n_estimators': 650},
 'VL1r10': {'learning_rate': 0.1, 'n_estimators': 150},
 'VL1r11': {'learning_rate': 0.1, 'n_estimators': 150},
 'VL1r12': {'learning_rate': 0.1, 'n_estimators': 250},
 'VL1r13': {'learning_rate': 1, 'n_estimators': 150},
 'VL1r14': {'learning_rate': 0.1, 'n_estimators': 250}}

### Now, we run the final test on the chosen model

With the best-performing hyperparameters found above, we fit the model on the training data and evaluate it on the test data. We store the accuracy, precision, recall, and feature importances for each `VL1r` question in separate dictionaries.

In [9]:
from sklearn.metrics import accuracy_score, confusion_matrix

survey_test = pd.read_csv('Data/survey_data_test.csv')

In [10]:
accuracy = {}
precision = {}
recall = {}
ada_importance = {}

## reminder of our features and outputs
# features = ['S2', 'D4', 'Fan_magnitude']
# targets = ['VL1r1','VL1r2','VL1r4','VL1r5','VL1r7',
#            'VL1r10','VL1r11','VL1r12','VL1r13' ,'VL1r14']

for VL, params in VL_dict.items():
    ## initialize the model with the best_params_ found above
    Ada = AdaBoostClassifier(**params)

    ## fit the model with the training data
    Ada.fit(survey_train[features].values, survey_train[VL].values)

    ## predict the test data
    pred = Ada.predict(survey_test[features].values)

    ## store the accuracy score for the test VL values and predicted values
    accuracy[VL] = accuracy_score(survey_test[VL].values, pred)

    ## computing the confusion matrix for the VL value
    ## (rows are true classes, columns are predicted classes)
    matrix = confusion_matrix(survey_test[VL].values, pred)

    ## isolating true negatives, false positives, false negatives, true positives
    TN = matrix[0,0]
    FP = matrix[0,1]
    FN = matrix[1,0]
    TP = matrix[1,1]

    ## storing recall and precision in a dictionary
    recall[VL] = np.round(TP/(TP+FN),4)
    precision[VL] = np.round(TP/(TP+FP),4)

    ## store the feature importance for each VL
    ada_importance[VL] = Ada.feature_importances_
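
As a sanity check on the manual precision and recall formulas in the loop, the sketch below compares them with scikit-learn's `precision_score` and `recall_score` on a small hand-made example. The arrays `y_true` and `y_pred` are made-up illustrative labels, not survey data, and the check assumes a binary target coded 0/1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up binary labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = matrix.ravel()

# Manual formulas, as used in the loop
manual_precision = TP / (TP + FP)
manual_recall = TP / (TP + FN)

# Library equivalents (positive class is 1 by default)
lib_precision = precision_score(y_true, y_pred)
lib_recall = recall_score(y_true, y_pred)

print(manual_precision, lib_precision)  # both 0.75
print(manual_recall, lib_recall)        # both 0.75
```

The library functions also take a `zero_division` argument, which may be convenient if a model never predicts the positive class for some target.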

In [11]:
## viewing our final accuracy scores
accuracy

{'VL1r1': 0.6560540279787748,
 'VL1r2': 0.6406174626145683,
 'VL1r4': 0.7838880849011095,
 'VL1r5': 0.6348287506029908,
 'VL1r7': 0.5914134105161601,
 'VL1r10': 0.7211770381090208,
 'VL1r11': 0.7650747708634829,
 'VL1r12': 0.9276410998552822,
 'VL1r13': 0.7930535455861071,
 'VL1r14': 0.8335745296671491}

In [12]:
## viewing the feature importance
# features = ['S2', 'D4', 'Fan_magnitude']
ada_importance

{'VL1r1': array([0.22285714, 0.04571429, 0.73142857]),
 'VL1r2': array([0.34666667, 0.18222222, 0.47111111]),
 'VL1r4': array([0.38571429, 0.14857143, 0.46571429]),
 'VL1r5': array([0.21090909, 0.01454545, 0.77454545]),
 'VL1r7': array([0.37076923, 0.01846154, 0.61076923]),
 'VL1r10': array([0.33333333, 0.18666667, 0.48      ]),
 'VL1r11': array([0.3 , 0.32, 0.38]),
 'VL1r12': array([0.332, 0.288, 0.38 ]),
 'VL1r13': array([0.27333333, 0.06      , 0.66666667]),
 'VL1r14': array([0.416, 0.16 , 0.424])}
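
The raw arrays are easier to read once labeled with the feature names. The sketch below shows one way to do this, using three rows copied from the output above; `importance_subset`, `importance_df`, and `top_feature` are illustrative names, and the same construction should apply to the full `ada_importance` dictionary.

```python
import numpy as np
import pandas as pd

features = ['S2', 'D4', 'Fan_magnitude']

# Three rows copied from the ada_importance output above
importance_subset = {
    'VL1r1': np.array([0.22285714, 0.04571429, 0.73142857]),
    'VL1r11': np.array([0.3, 0.32, 0.38]),
    'VL1r14': np.array([0.416, 0.16, 0.424]),
}

# One row per target, one column per feature
importance_df = pd.DataFrame(np.vstack(list(importance_subset.values())),
                             index=list(importance_subset.keys()),
                             columns=features)

# Feature with the largest importance for each target
top_feature = importance_df.idxmax(axis=1)

print(importance_df.round(3))
print(top_feature)
```

In the output above, `Fan_magnitude` has the largest importance for every target, although the margin over `S2` is small for `VL1r14`.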

In [None]:
precision

In [None]:
recall