## Summary: Compare model results and final model selection
Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

Accuracy:
1. How do the algorithms handle data of different shapes, such as short and fat (few rows, many features) or long and skinny (many rows, few features)?
2. How do they handle the complexity of feature relationships?
3. How do they handle messy data?

Latency:
1. How long will it take to train?
2. How long will it take to predict?

Which algorithm generates the best model for this given problem?
Using the tendencies and strengths of the algorithms, I can narrow down the algorithms considered for a given problem.

The bottom line is that sometimes I will simply not know in advance which algorithm will perform best.

In this section, I will do the following:
1. Evaluate all of my saved models on the validation set
2. Select the best model based on performance on the validation set
3. Evaluate that model on the holdout test set

### Read in Data

In [1]:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score
from time import time

val_features = pd.read_csv('../data/val_features.csv')
val_labels = pd.read_csv('../data/val_labels.csv')

te_features = pd.read_csv('../data/test_features.csv')
te_labels = pd.read_csv('../data/test_labels.csv')



### Read in Models

In [2]:
# Since the models were all saved with the same file name template, I will read
# them in using a loop.

# I will store the model objects in a dictionary, with the model name as the key
# and the model object as the value.

# In the next step, I can take those model objects and make predictions with them.

# I loop through the list of model names and pass the file location to joblib.load
# to pull in each pickled model object and store it in the dictionary.

# The curly braces are a placeholder: str.format fills them in with each model name.
models = {}

for mdl in ['LR', 'SVM', 'MLP', 'RF', 'GB']:
    models[mdl] = joblib.load('../data/{}_model.pkl'.format(mdl))
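
As a small illustration of the string template used in the loop above, this sketch builds the file paths without loading anything. The `paths` dictionary is a name introduced here for illustration only; the template and the model names are the ones used in the cell.

```python
# Model names and file name template, as used in the loading loop.
model_names = ['LR', 'SVM', 'MLP', 'RF', 'GB']
template = '../data/{}_model.pkl'

# str.format replaces the {} placeholder with each model name.
# `paths` is a hypothetical helper dict, not part of the original notebook.
paths = {name: template.format(name) for name in model_names}

for name, path in paths.items():
    print(name, '->', path)
```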

In [3]:
models

{'LR': LogisticRegression(C=1),
 'SVM': SVC(C=0.1, kernel='linear'),
 'MLP': MLPClassifier(activation='tanh', learning_rate='adaptive'),
 'RF': RandomForestClassifier(max_depth=4, n_estimators=50),
 'GB': GradientBoostingClassifier(n_estimators=50)}

### Evaluate models on the validation set

![Evaluation Metrics](../img/eval_metrics.png)
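
For reference, the three metrics reported in this section can be written out from the confusion-matrix counts. This is a minimal sketch using the standard definitions; the function name `basic_metrics` and the toy labels are made up for illustration, and the notebook itself relies on scikit-learn's accuracy_score, precision_score and recall_score.

```python
def basic_metrics(labels, preds):
    """Accuracy, precision and recall for binary 0/1 labels (illustrative helper)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much really was positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything really positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy data, not taken from the Titanic dataset.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall = basic_metrics(toy_labels, toy_preds)
print('Accuracy: {} / Precision: {} / Recall: {}'.format(accuracy, precision, recall))
```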

In [4]:
# Function that will help me evaluate each of my 5 models on the validation set.

# It accepts the following arguments: the name of the model, the model object itself,
# the features (either the validation set or the test set),
# and the labels (validation labels or test labels).

# time() returns the current time whenever it is called.
# I store the time immediately before the predict call and the time immediately
# after it, so that I can calculate how long it took to make the actual predictions.

def evaluate_model(name, model, features, labels):
    start = time()
    pred = model.predict(features)
    end = time()
    accuracy = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall = round(recall_score(labels, pred), 3)
    print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(name,
                                                                                   accuracy,
                                                                                   precision,
                                                                                   recall,
                                                                                   round((end - start)*1000, 1)))
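
A single pair of time() calls around one predict call gives a noisy latency figure. As a hedged sketch (not part of the original notebook), latency could instead be averaged over several repeats with time.perf_counter, which is intended for measuring short durations. The helper name `mean_latency_ms` and the stand-in `DummyModel` are hypothetical; any object with a predict method would work in its place.

```python
from time import perf_counter


def mean_latency_ms(model, features, repeats=10):
    """Average wall-clock time of model.predict(features), in milliseconds."""
    total = 0.0
    for _ in range(repeats):
        start = perf_counter()
        model.predict(features)
        total += perf_counter() - start
    return total / repeats * 1000


class DummyModel:
    """Stand-in for a fitted model: predicts 1 when the first feature is positive."""

    def predict(self, features):
        return [1 if row[0] > 0 else 0 for row in features]


# Toy feature rows, only here so the sketch runs on its own.
toy_features = [[0.5, 1.0], [-1.2, 0.3], [2.0, -0.7]]
latency = mean_latency_ms(DummyModel(), toy_features)
print('Mean latency: {:.4f}ms'.format(latency))
```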

# Results:
Now I have the models dictionary, where the keys are the model names and the values are the stored model objects.

I will loop through this dictionary and extract the name and the model, that is, the key and the value of each entry.
I do this by calling models.items(), which yields each key together with its value.

Within each iteration, I call evaluate_model, passing in the name first and then the model object. Right now I want to evaluate on the validation set, so I pass in the validation features and labels.

In [5]:
for name, mdl in models.items():
    evaluate_model(name, mdl, val_features, val_labels)
    
# Before I continue into the results, there is one important thing to note.
# I have mentioned before that if I ran random forest twice, I would get
# different results.

# It is critical to understand that this was only in the training phase.
# What I am dealing with now is a stored, fitted, concrete model.

# At this point, if I run this cell twice, I will get the same results.
# The only difference might be the latency; accuracy, precision and recall will remain the same.

# A couple of things to note here:
# 1st, gradient boosting and random forest are generating the best results on this unseen data.
# They tie on accuracy (0.809). Gradient boosting has the best precision (0.804)
# but slightly lower recall than random forest (0.631 vs 0.646).

# 2nd, in this run MLP and random forest take the longest to make predictions (8.0ms and 6.0ms),
# while gradient boosting is the fastest (1.0ms). These timings come from a single predict call
# and can shift between runs. This brings me to a conversation about trade-offs.
# There are 2 types of trade-offs. The first is precision vs recall.

# Typically I will have to choose which of the two I would rather have in my model. This will
# depend on the question I want to answer or the business case.

# For instance, if this is a spam detection problem, then I would optimize for precision.
# In other words, if my model says that it is spam, it had better be spam, or else I will be
# blocking real emails that people would like to see.

# On the other side, if this is a fraud detection model, I would more likely optimize for recall,
# because missing any real fraudulent transaction could cost thousands of dollars.

# The second trade-off is between overall accuracy (by which I mean accuracy, precision and recall)
# and latency.

# In this run, the best model on accuracy and precision (gradient boosting) is also the fastest
# to make predictions, so I do not have to trade accuracy for latency here.

LR -- Accuracy: 0.775 / Precision: 0.712 / Recall: 0.646 / Latency: 2.0ms
SVM -- Accuracy: 0.747 / Precision: 0.672 / Recall: 0.6 / Latency: 3.0ms
MLP -- Accuracy: 0.781 / Precision: 0.717 / Recall: 0.662 / Latency: 8.0ms
RF -- Accuracy: 0.809 / Precision: 0.792 / Recall: 0.646 / Latency: 6.0ms
GB -- Accuracy: 0.809 / Precision: 0.804 / Recall: 0.631 / Latency: 1.0ms
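
The selection step can also be done programmatically. The sketch below copies the validation numbers printed above into a dictionary and picks the model with the highest accuracy, breaking ties on precision. The tie-break rule and the name `val_results` are assumptions for illustration; a different business case might rank on recall or latency instead.

```python
# Validation metrics copied from the printed output above.
val_results = {
    'LR': {'accuracy': 0.775, 'precision': 0.712, 'recall': 0.646, 'latency_ms': 2.0},
    'SVM': {'accuracy': 0.747, 'precision': 0.672, 'recall': 0.6, 'latency_ms': 3.0},
    'MLP': {'accuracy': 0.781, 'precision': 0.717, 'recall': 0.662, 'latency_ms': 8.0},
    'RF': {'accuracy': 0.809, 'precision': 0.792, 'recall': 0.646, 'latency_ms': 6.0},
    'GB': {'accuracy': 0.809, 'precision': 0.804, 'recall': 0.631, 'latency_ms': 1.0},
}


def rank_key(name):
    """Assumed ranking rule: accuracy first, then precision as the tie-breaker."""
    metrics = val_results[name]
    return (metrics['accuracy'], metrics['precision'])


best_name = max(val_results, key=rank_key)
print('Best model on the validation set:', best_name)
```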


### Evaluate best model on test set

In [6]:
# I should see performance that aligns fairly closely with the validation set.

# The reason I evaluate on both the validation set and the test set is
# that I used performance on the validation set to select my best model.

# In a sense, the validation set played a role in my selection of the best model
# for this problem.

# The test set is not used for any kind of model selection, so it gives an
# unbiased view of how I can expect this model to perform moving forward.
evaluate_model('Gradient Boost', models['GB'], te_features, te_labels)

Gradient Boost -- Accuracy: 0.816 / Precision: 0.852 / Recall: 0.684 / Latency: 2.0ms
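
To put a number on how closely the test result tracks the validation result, the two sets of printed metrics for gradient boosting can be compared directly. This is only a sketch: the dictionary names are made up for illustration, and the values are copied by hand from the outputs above.

```python
# Gradient boosting metrics copied from the validation and test outputs.
gb_val = {'accuracy': 0.809, 'precision': 0.804, 'recall': 0.631}
gb_test = {'accuracy': 0.816, 'precision': 0.852, 'recall': 0.684}

# Positive gap means the model scored higher on the test set than on validation.
gaps = {metric: round(gb_test[metric] - gb_val[metric], 3) for metric in gb_val}

for metric, gap in gaps.items():
    print('{}: {:+.3f}'.format(metric, gap))
```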


In [7]:
# I can see that performance is relatively close to what I saw on the validation set.
# Accuracy, precision and recall are all slightly higher.

# So now I have a good feel for the likely performance of the model on new data.
# I can be confident in proposing this model as the best of the five for predicting whether people
# aboard the Titanic survived or not.