## Pipeline: Evaluate results on validation set

Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, I will use what I learned in the last section to fit the best few models on the full training set and then evaluate those models on the validation set.

### Read in data

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

tr_features = pd.read_csv('../data/train_features.csv')
tr_labels = pd.read_csv('../data/train_labels.csv')

val_features = pd.read_csv('../data/val_features.csv')
val_labels = pd.read_csv('../data/val_labels.csv')

te_features = pd.read_csv('../data/test_features.csv')
te_labels = pd.read_csv('../data/test_labels.csv')
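
Before fitting anything, it can help to confirm that each features file lines up with its labels file. The sketch below is an assumption and not part of the original notebook: `check_split` is a hypothetical helper, and the tiny frames with made-up columns stand in for the real CSVs.

```python
import pandas as pd


def check_split(features: pd.DataFrame, labels: pd.DataFrame) -> int:
    """Return the row count if features and labels line up, else raise."""
    if len(features) != len(labels):
        raise ValueError(
            f"row mismatch: {len(features)} feature rows vs {len(labels)} label rows"
        )
    if labels.shape[1] != 1:
        raise ValueError(f"expected one label column, got {labels.shape[1]}")
    return len(features)


# Toy stand-ins for one features/labels pair (hypothetical columns and values).
demo_features = pd.DataFrame({"Pclass": [3, 1, 2], "Age": [22.0, 38.0, 26.0]})
demo_labels = pd.DataFrame({"Survived": [0, 1, 1]})

n_rows = check_split(demo_features, demo_labels)
print(n_rows)
```

The same call could be repeated for the train, validation, and test pairs loaded above.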



### Fit best models on full training set

Results from the last section (rows prefixed with `---` are the three combinations refit below):
```
BEST PARAMS: {'max_depth': 10, 'n_estimators': 100}

0.772 (+/-0.068) for {'max_depth': 2, 'n_estimators': 5}
0.805 (+/-0.098) for {'max_depth': 2, 'n_estimators': 50}
0.792 (+/-0.144) for {'max_depth': 2, 'n_estimators': 100}
0.794 (+/-0.052) for {'max_depth': 10, 'n_estimators': 5}
0.82 (+/-0.057) for {'max_depth': 10, 'n_estimators': 50}
---0.832 (+/-0.054) for {'max_depth': 10, 'n_estimators': 100}
0.801 (+/-0.052) for {'max_depth': 20, 'n_estimators': 5}
0.807 (+/-0.034) for {'max_depth': 20, 'n_estimators': 50}
0.803 (+/-0.026) for {'max_depth': 20, 'n_estimators': 100}
0.794 (+/-0.06) for {'max_depth': None, 'n_estimators': 5}
---0.807 (+/-0.014) for {'max_depth': None, 'n_estimators': 50}
---0.809 (+/-0.033) for {'max_depth': None, 'n_estimators': 100}
```

In [2]:
# I will take the three hyperparameter combinations marked with --- in the results above
# and refit those models on the full training set.

# Why do I need to refit on the full training set?
# I want to evaluate these models on the validation set, so I let each one learn from
# the full training set instead of limiting it to only 80% of it.

# Once this cell has run, rf1, rf2, and rf3 are fitted models that I can use to make
# predictions on unseen data.
rf1 = RandomForestClassifier(n_estimators=100, max_depth=10)
rf1.fit(tr_features, tr_labels.values.ravel())

rf2 = RandomForestClassifier(n_estimators=100, max_depth=None)
rf2.fit(tr_features, tr_labels.values.ravel())

rf3 = RandomForestClassifier(n_estimators=50, max_depth=None)
rf3.fit(tr_features, tr_labels.values.ravel())
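
The `.values.ravel()` call in each `fit` is there because `pd.read_csv` returns the labels as a one-column DataFrame, while scikit-learn classifiers expect a 1-D target and warn when given a column vector. A minimal illustration, using a made-up `Survived` column in place of the real labels file:

```python
import pandas as pd

# Hypothetical stand-in for train_labels.csv: one column, one row per passenger.
labels_df = pd.DataFrame({"Survived": [0, 1, 1, 0]})

two_d = labels_df.values          # NumPy array of shape (4, 1)
one_d = labels_df.values.ravel()  # flattened to shape (4,)

print(two_d.shape, one_d.shape)
```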

### Evaluate models on validation set
Now I will evaluate the fitted models above on the validation set.
The only examples these models have seen at this point are those in the training set.

This is the true test: it measures each model's ability to generalize to unseen data.
If a model is overfit or underfit, it will show up here.

I will be using accuracy, precision, and recall to evaluate these models to select the one that generalizes best to the validation set.
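
As a reminder of what these three metrics measure, here is a small hand computation from the confusion counts. It is only a sketch: `classification_metrics` is a hypothetical helper, the labels are made up, and it assumes 1 is the positive class (survived). The notebook itself relies on scikit-learn's `accuracy_score`, `precision_score`, and `recall_score`.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)                   # share of all predictions that are right
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # share of predicted positives that are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # share of actual positives that are found
    return accuracy, precision, recall


# Made-up labels: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here accuracy is 5/8, precision is 2/3, and recall is 2/4, which shows how a model can look acceptable on one metric and weak on another.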

In [4]:
# To evaluate these, I loop over the three models. For each model, I call the predict
# method and pass in the validation features, which returns an array of predictions.

# Once I have the predictions, I compute accuracy, precision, and recall.
for mdl in [rf1, rf2, rf3]:
    y_pred = mdl.predict(val_features)
    accuracy = round(accuracy_score(val_labels, y_pred), 3)
    precision = round(precision_score(val_labels, y_pred), 3)
    recall = round(recall_score(val_labels, y_pred), 3)
    print('MAX DEPTH: {} / # OF EST: {} -- A: {} / P: {} / R: {}'.format(mdl.max_depth,
                                                                         mdl.n_estimators,
                                                                         accuracy,
                                                                         precision,
                                                                         recall))

# In this run, the model that performed best in cross-validation (max_depth=10,
# n_estimators=100) also ties for the best validation scores, matching the
# max_depth=None, n_estimators=50 model on all three metrics.

MAX DEPTH: 10 / # OF EST: 100 -- A: 0.821 / P: 0.824 / R: 0.737
MAX DEPTH: None / # OF EST: 100 -- A: 0.799 / P: 0.786 / R: 0.724
MAX DEPTH: None / # OF EST: 50 -- A: 0.821 / P: 0.824 / R: 0.737
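
The cells above do not pass `random_state` to `RandomForestClassifier`, so the fitted trees, and therefore these scores, can shift from run to run. A minimal sketch of pinning the seed, using synthetic data from `make_classification` instead of the Titanic files (the seed value 42 and the smaller forest size are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the Titanic features (not the real dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Two forests with the same settings and the same seed.
rf_a = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42).fit(X, y)

# With identical seeds, both forests make identical predictions.
same_predictions = bool((rf_a.predict(X) == rf_b.predict(X)).all())
print(same_predictions)
```

Fixing the seed makes a comparison between models repeatable, though it does not remove the underlying variance between different seeds.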


### Evaluate the best model on the test set
Up to this point I have completed steps 1 through 5 of the following workflow, which leaves step 6:

1. Explore & clean the data
2. Split data into train/validation/test
3. Fit an initial model and evaluate
4. Tune hyperparameters
5. Evaluate on validation set
6. Final model selection and evaluation on test set

Now that I have the best model, I need to evaluate it on the test set to get a truly unbiased view of how it should perform moving forward.

This final test set is purely for evaluation: it lets me check that performance matches what I have seen so far, and it gives me more confidence in how the model will perform moving forward.

In [6]:
# I am using my best model, rf1, which was already fit on the full training set.
# I will make predictions using the test features.
y_pred = rf1.predict(te_features)
accuracy = round(accuracy_score(te_labels, y_pred), 3)
precision = round(precision_score(te_labels, y_pred), 3)
recall = round(recall_score(te_labels, y_pred), 3)
print('MAX DEPTH: {} / # OF EST: {} -- A: {} / P: {} / R: {}'.format(rf1.max_depth,
                                                                     rf1.n_estimators,
                                                                     accuracy,
                                                                     precision,
                                                                     recall))

# Looking at these metrics and circling all the way back to grid search CV: the model
# with these hyperparameter settings had 83.2% accuracy in cross-validation, 82.1% on
# the validation set, and 79.2% on this test set.

# This highlights how the performance of a given model can vary with the data I give it.

MAX DEPTH: 10 / # OF EST: 100 -- A: 0.792 / P: 0.741 / R: 0.662
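
To make that comparison concrete, the three accuracy figures reported for this model can be lined up side by side. This is only arithmetic on the numbers already shown, not a new experiment:

```python
# Accuracy of the max_depth=10, n_estimators=100 model, as reported above.
scores = {"cross-validation": 0.832, "validation": 0.821, "test": 0.792}

# Drop in accuracy relative to the cross-validation estimate, in percentage points.
baseline = scores["cross-validation"]
drops = {name: round((baseline - acc) * 100, 1) for name, acc in scores.items()}

for name, drop in drops.items():
    print(f"{name}: {scores[name]:.3f} ({drop:.1f} points below CV)")
```

The validation score sits about 1.1 points below the cross-validation estimate and the test score about 4.0 points below it.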
