# Getting Set Up

This notebook assumes you have done the setup required in Week 1 and 2.

# Outline

- [Train-Validation-Test Split](#section1)
- [Pipeline](#section2)
- [Hypertuning](#section3)
    - [Grid Search](#section3a)
    - [Priotiritizing Parameters](#section3b)
    - [Other Strategies](#section3c)
- [Troubleshooting](#section4)
    - [Imbalanced Classes](#section4a)
    - [Information Leakage](#section4b)
- [Lab](#section5)

<a id = 'section1'></a>

# Train-Validation-Test Split

Before you begin fitting a model on a new dataset you should, almost always, split your initial dataset into a "train" dataset, a "validation" dataset and a "test" dataset.  The train dataset gives us a way to have our model "learn".  The validation dataset gives us a way to judge the performance of the model against other potential models.  The test dataset gives us an idea of how well our model generalizes for **unseen** data.  

In practice, we will use the train dataset to train all your potential models.  The validation dataset will be passed to each of these models to judge the performance of each of these models allowing us to compare models against eachother.  Then once finding our optimal model we finally pass the test dataset to judge the model's performance on **unseen** data and the performance based on the test dataset will be the one reported in your academic paper or to your employer.

You should generally keep 20-50% of the data for the validation and test sets and use the remaining 50-80% for training.

Never just split your data into the first 80% and the remaining 20% for your validation and test sets.  You should always split your data as randomly as possible. The slightest inclusion of a non-random process in the selection of the training set can skew model parameters. Data is frequently sorted in some way (by date or even by the value you are trying to predict).

There is a method implemented in Scikit that splits the dataset randomly for us called [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split).  We can use this method twice to perform a train-validation-test split done below.

![a](images/dataset.png)
![a](images/testtrainvalidation.png)
[source](https://cdn-images-1.medium.com/max/948/1*4G__SV580CxFj78o9yUXuQ.png)

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# make the split reproducible
random_state = 42

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# get our split percentages
validation_size = .25
test_size = .25
validation_test_size = validation_size + test_size
test_size_adjusted = test_size / validation_test_size

# perform the first split which gets us the train data and the validation/test data that
# we must split one more time
X_train, X_validation_test, y_train, y_validation_test =  train_test_split(X, y,\
                                                                           test_size = validation_test_size,\
                                                                           random_state = random_state)

# perform the second split which splits the validation/test data into two distinct datasets
X_validation, X_test, y_validation, y_test = train_test_split(X_validation_test, y_validation_test,\
                                                              test_size = test_size_adjusted,\
                                                              random_state = random_state)

<a id = 'section2'></a>

# Pipeline

In data science a ***pipeline*** is a chain of modelling related tasks.  There can be up to $n$ modelling tasks in any given pipeline.  We start with some initial input, which is fed into the first modelling task.  The output of the first modelling task is then fed to the next second modelling task, and so on and so forth, until we reach the final modelling task and output.  

In the context of this course, we use a pipeline with two modelling tasks.  The initial input is an article that we want to classify as fake news or not.  The first modelling task takes our article and embeds it.  The output of the first model, the embeddings, are fed into the final modelling task, the classifier.  The final output of our pipeline, the classification will indicate whether the initial input, the article, is fake or not.

When using Scikit you can use its builtin pipelining feature to build pipelines using your Scikit models. To see how to use this tool you may look at this [example](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#example-model-selection-grid-search-text-feature-extraction-py).  In this example, a text feature extractor is composed with a linear classifier that uses stochastic gradient descent.

<a id = 'section3'></a>

# Hypertuning

With any machine learning algorithm we must pass some set of parameters to initialize the model. For any model the set of hyperparameters we use depends on the data we are trying to train on.  The process of finding the optimal set of hyperparameters for your model for the given dataset is called ***hypertuning***.  The process of hypertuning innvolves training multiple models with different sets of hyperparameters and using some metric or culmination of metrics(i.e. F1 Score, Precision, Recall, etc.) to determine the optimal set of hyperparameters.  We choose the optimal set of hyperparameters based on the model using the optimal set of hyperparameters producing the best overall metrics for the validation and/or training set of data.

<a id = 'section3a'></a>

## Grid Search

Grid searches are typically used when you don't know (and often don't care too much about the meaning of) a set of optimal parameters to a given estimator or set of estimators. They are essentially a set of for loops that try out a series of parameters and construct a single model for each case (hence a grid). Scikits has a [grid search class](http://scikit-learn.org/stable/modules/grid_search.html#grid-search) that will automate an exhaustive or optimized search for one or more estimator parameters.

Also somewhat confusingly, people will often conflate "pipeline" and "grid search", sometimes using the former to mean the latter. You can do a grid search as part of a pipeline, using a final function to estimate model quality, and the output of the tested models as input. Scikits has an [example of this here](http://scikit-learn.org/stable/modules/pipeline.html#pipeline).

There are two kinds of Grid Search, exhaustive and random.

![a](images/gridsearch.png)

[source](https://cdn-images-1.medium.com/max/1920/1*Uxo81NjcpqNXYJCeqnK1Pw.png)

### Exhaustive

Exhaustive grid search is nothing more than a series of for loops, each iterating over a dictionary of possible hyperparameter values. The best performance of any of the searched parameters is kept and the chosen hyperparameters are returned.  Scikit has a method for this, though you could write your own doing something similar to this pseudo-code:

```
results = {}
parameter_vals = {'p1':[a_1,a_2...a_K], 'p2':[b_1, b_2, ... b_M], ... , 'pN':[zz_1, zz_2, ..., zz_N]}

parameter_sets = generate_parameter_grid by exhaustive combinations
for set in parameter_sets
    test accuracy of model(set)
results[set] = accuracy   
return argmax(results)
```

### Random

A random search for parameter values uses a generating function (typically a selected distribution, i.e. rbf/beta/gamma with user-input parameters) to produce candidate value sets for the hyperparameters. This has one main benefits over an exhaustive search:

    - A budget can be chosen independent of the number of parameters and possible values. Thus the user only has one parameter to handle.

Below is an example of how to perform both a **random** and **exhaustive** gridsearch.

In [1]:
import numpy as np

from time import time
from operator import itemgetter
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# load digit dataset
digits = load_digits()
# split data into inputs and output
X, y = digits.data, digits.target

# build a random forest classifier
clf = RandomForestClassifier(n_estimators=20)


# Utility function to report best scores
def report(grid_scores, n_top=3):
    # sort scores based on metric so we can grab the n_top models
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    # iterate over the n_top models
    for i in range(n_top):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              grid_scores['mean_test_score'][i],
              grid_scores['std_test_score'][i]))
        print("Parameters: {0}".format(grid_scores['params'][i]))
        print("")


# specify parameters and distributions to sample from - 
# what methods might we consider that would improve these estimates?
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# number of models we are going to train
n_iter_search = 20
# create our randomized gridsearch classifier
#      clf, is the model we are performing the search on
#      param_dist, is a dictionary of paramater distributions that we will sample over
#      n_iter_search, number of models we are going to train
#      True, the scores from our training for each model will be returned when we perform the gridsearch
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, return_train_score=True)
# start a timer so we know how long the random gridsearch took
start = time()
# perform the random gridsearch
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
# print the top 3 model outputs from the random gridsearch
report(random_search.cv_results_)

# use a full grid over all parameters. 
# The grid search will generate parameter sets for each and every one of these
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2,3,10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# create an exhaustive gridsearch object
#      clf, is the model we are performing the search on
#      param_grid dictionary with the parameter settings the search will try
#      True, the scores from our training for each model will be returned when we perform the gridsearch
grid_search = GridSearchCV(clf, param_grid=param_grid, return_train_score=True)
# start a timer so we know how long the exhaustive gridsearch took
start = time()
# perform the exhaustive gridsearch
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_)))
# print the top 3 model outputs from the exhaustive gridsearch
report(grid_search.cv_results_)

RandomizedSearchCV took 3.41 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.923 (std: 0.017)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 2}

Model with rank: 2
Mean validation score: 0.905 (std: 0.013)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 10}

Model with rank: 3
Mean validation score: 0.768 (std: 0.024)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 2, 'min_samples_split': 2}

GridSearchCV took 44.23 seconds for 22 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.761 (std: 0.026)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 2}

Model with rank: 2
Mean validation score: 0.785 (std: 0.020)

<a id = 'section3b'></a>

## Priotiritizing Parameters

When hypertuning, it is critical to remember that not all hyperparameters have equal importance.  With most models, a subset of hyperparameters will have a major impact on the model's performance, while the remaining hyperparameters will do little to nothing to impact a model's performance or there is an established value that you should use for a hyperparameter regardless of the data.  Hence, our hypertuning should focus on finding optimal values for this subset of important hyperparameters. 

For example, with neural networks, the two most important hyperparameters to tune are the learning rate and weight regularization of the optimizer.  Both of these parameters control the rate at which the neural network learns.  If you are "aggressive" with these parameters then we might overshoot the optimal weights, though if we are too "lenient" with these parameters we might undershoot the optimal weights.

<a id = 'section3c'></a>

## Other Strategies 

There are other ways to perform hypertuning beside grid search.  One alternative is ***Bayesian Optimization***.  Bayesian Optimization approximates a posterior distribution based on the model you are trying to train to find the optimal set of hyperparameters.  Here is an [implementation in Python](https://github.com/fmfn/BayesianOptimization).

<a id = 'section4'></a>

# Troubleshooting 

In data science there are a multitude of problems that can arise either on the data and/or modeling side.  Fortunately, for us, a lot of problems we face in data science have been encountered by others and approaches have been established to solves these problems.  In this section we will look at two common problems that arise in data science and some tools of the trade for how to address them.

<a id = 'section4a'></a>

## Imbalanced Classes

***Imbalanced Classes*** occurs when performing a classification learning and the a subset of the potential classes we could output make up a substantial majority of our data.  

To ensure our model is able to learn about all classes we could use a model that is robust to class imbalances.  One example of a model that is robust to class imbalances is the class weighted Support Vector Machine.  Essentially this model places a higher penalty on misclassifying observations of the minority class causing the model to put equal importance to each class despite the disparity in number of observations for each class.  Scikit's version of this SVM can be found [here](http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#).

Though sometimes we might want to use a model not robust to imbalanced classes for our imbalanced data(i.e. this course using XGBoost on our imbalanced article dataset).  In these cases it is best to resample your data such that your sampled data corrects the imbalance.  This balanced sampled data is then used to train your model without it being affected by the imbalanced data. 

![a](images/unbalanced.png)

[source](http://www.svds.com/wp-content/uploads/2016/08/messy.png)

<a id = 'section4b'></a>

## Information Leakage 

***Information Leakage*** occurs when training data has extra information that makes it seem like our model produces better results than it actuall would in the "real world".  The usual way we combat this is by performing a train, validation and test split to our data.  We only use the test set to judge the how well our final model will perform when put into production.  Though, sometimes we do not have a sufficient amount of data to to have a pure test set.  One way to combat information leakage(i.e. insufficient data) is to perform a KFold Cross Validation.

### KFold Cross Validation

Ideally, when training a model, we'd like to   When we lack sufficient data, when can still gauge the performance of a model using KFold Cross Validation.  We can get a more accurate value of the error by using KFold Cross Validation.

Basically, we break the data into k groups. We take one of these groups and make it the test set. We then train the data on the remaining groups and calculate the error on the test group. We can repeat this process k times and average all the results. This gives us a more accurate picture of the error.

You can perform a KFold Cross Validation in Scikit using this [method](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)

![cross validation](images/cross-validation.jpg)

([source](http://cse3521.artifice.cc/classification-evaluation.html))

<a id = 'section5'></a>

# Lab 

Now you will create your own pipeline where you take the embeddings we created in Lecture 2 and feed them into XGBoost, that we learned about in Lecture 1.

1) Setup a pipeline where embeddings we created in Lecture 2 are fed into XGBoost

2) How did your first iteration of the pipeline do?

3) How could we improve the performance of the pipeline?

4) What parameters are important to tune for the [embedding process?](https://radimrehurek.com/gensim/models/doc2vec.html)

5) What parameters are important to tune for [XGBoost?](http://xgboost.readthedocs.io/en/latest/python/python_api.html)

6) Now that you know what parameters are important to both processes in the pipeline, hypertune both models.

7) Are there any sources of information leakage? Explain.

8) Is the data balanced? How do we know the balances of the data?

9) If the data is imbalanced what can we do to make our pipeline robust to the imbalances?

10) Should our test set be balanced or not?  Explain.

11) Based on the data we have, should we perform KFold Cross Validation and/or a train-validation-test split?

12) If time permits, write some code so that we can have balanced classes.