# Getting Set Up

This notebook assumes you have completed the setup required in Weeks 1 and 2.

# Outline

- [Pipeline](#section1)
- [Hypertuning](#section2)
    - [Grid Search](#section2a)
    - [Prioritizing Parameters](#section2b)
    - [Other Strategies](#section2c)
- [Troubleshooting](#section3)
    - [Imbalanced Datasets](#section3a)
    - [Information Leakage](#section3b)
- [Lab](#section4)

<a id = 'section1'></a>

# Pipeline

In data science, a ***pipeline*** is a chain of modelling tasks.  A pipeline can contain any number $n$ of modelling tasks.  We start with some initial input, which is fed into the first modelling task.  The output of the first task is then fed into the second task, and so on, until the final task produces the pipeline's output.
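The chaining idea can be sketched in a few lines of plain Python. This is only an illustration: `run_pipeline`, `tokenize`, and `count_tokens` are made-up names for this sketch and do not come from any library.

```python
from functools import reduce


def run_pipeline(tasks, initial_input):
    """Feed initial_input through each task in order and return the final output."""
    return reduce(lambda value, task: task(value), tasks, initial_input)


# Two toy "modelling tasks" (hypothetical stand-ins for real models).
def tokenize(text):
    return text.lower().split()


def count_tokens(tokens):
    return len(tokens)


# The output of tokenize becomes the input of count_tokens.
result = run_pipeline([tokenize, count_tokens], "Breaking News Today")
print(result)
```

Because each task only sees the previous task's output, tasks can be swapped or added without changing the rest of the chain.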

In this course, we use a pipeline with two modelling tasks.  The initial input is an article that we want to classify as fake news or not.  The first modelling task takes the article and embeds it.  The output of the first task, the embeddings, is fed into the final modelling task, the classifier.  The final output of the pipeline, the classification, indicates whether the input article is fake or not.

When using scikit-learn, you can use its built-in `Pipeline` class to chain scikit-learn models together. To see how to use this tool, look at this [example](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#example-model-selection-grid-search-text-feature-extraction-py).  In this example, a text feature extractor is composed with a linear classifier trained with stochastic gradient descent.
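A minimal sketch of such a two-step pipeline follows. It assumes a recent scikit-learn release; the four-sentence corpus and its labels are invented purely for illustration, and a TF-IDF feature extractor stands in for whatever embedding model the course uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only (1 = fake, 0 = not fake).
articles = [
    "miracle cure doctors hate this secret",
    "shocking secret celebrities do not want you to know",
    "city council approves budget for road repairs",
    "central bank holds interest rates steady",
]
labels = [1, 1, 0, 0]

# Step 1 turns text into feature vectors; step 2 classifies those vectors.
pipeline = Pipeline([
    ("embed", TfidfVectorizer()),
    ("classify", SGDClassifier(random_state=0)),
])

# fit() trains both steps in order; predict() pushes new text through both.
pipeline.fit(articles, labels)
prediction = pipeline.predict(["secret miracle cure revealed"])
print(prediction)
```

The step names (`"embed"`, `"classify"`) are arbitrary labels chosen here; they matter later because they are how individual steps' parameters are addressed.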

<a id = 'section2'></a>

# Hypertuning

Every machine learning algorithm takes some set of hyperparameters: values that are passed in when the model is initialized rather than learned during training.  The best set of hyperparameters for a model depends on the data we are training on.  The process of finding the optimal set of hyperparameters for your model on a given dataset is called ***hypertuning***.  Hypertuning involves training multiple models with different sets of hyperparameters and using some metric or combination of metrics (e.g. F1 score, precision, recall) to determine the optimal set.
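To make the comparison step concrete, here is a small sketch that computes precision, recall, and F1 from scratch and picks the better of two candidates. The labels and predictions are made up, and `setting_a` / `setting_b` are hypothetical names for two hyperparameter sets.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up validation labels and predictions from two hypothetical settings.
y_true = [1, 1, 1, 0, 0, 0]
predictions = {
    "setting_a": [1, 1, 0, 0, 0, 1],
    "setting_b": [1, 1, 1, 0, 1, 1],
}

# Score each candidate with F1 and keep the best one.
f1_by_setting = {
    name: precision_recall_f1(y_true, y_pred)[2]
    for name, y_pred in predictions.items()
}
best_setting = max(f1_by_setting, key=f1_by_setting.get)
print(f1_by_setting, best_setting)
```

Which metric to optimize is a modelling decision: the same two candidates could rank differently under precision alone.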

<a id = 'section2a'></a>

## Grid Search

Grid searches are typically used when you don't know (and often don't care too much about the meaning of) the optimal parameters for a given estimator or set of estimators. They are essentially a set of for loops that try out a series of parameter values and construct one model for each combination (hence a grid). scikit-learn has a [grid search class](http://scikit-learn.org/stable/modules/grid_search.html#grid-search) that automates an exhaustive or randomized search over one or more estimator parameters.

Somewhat confusingly, people often conflate "pipeline" and "grid search", sometimes using the former to mean the latter. The two are distinct but combine naturally: you can do a grid search as part of a pipeline, with a final scoring function that takes the output of the tested models as input and estimates model quality. scikit-learn has an [example of this here](http://scikit-learn.org/stable/modules/pipeline.html#pipeline).
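One common way to combine the two in scikit-learn is to hand a whole pipeline to `GridSearchCV` as its estimator. The sketch below assumes a recent scikit-learn release; the corpus, labels, and candidate values are invented for illustration and are far too small for real conclusions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus: four "fake" (1) and four "real" (0) headlines.
texts = [
    "miracle cure doctors hate this secret",
    "shocking secret celebrities do not want you to know",
    "you will not believe this one weird trick",
    "secret miracle trick shocks doctors",
    "city council approves budget for road repairs",
    "central bank holds interest rates steady",
    "council votes on new budget for schools",
    "bank reports steady rates for the quarter",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("embed", TfidfVectorizer()),
    ("classify", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "embed__ngram_range": [(1, 1), (1, 2)],
    "classify__alpha": [1e-4, 1e-2],
}

# Every combination is refit on each cross-validation split and scored with F1.
search = GridSearchCV(pipeline, param_grid=param_grid, cv=2, scoring="f1")
search.fit(texts, labels)
print(search.best_params_)
```

Searching over the whole pipeline means the feature extractor is refit inside every cross-validation split, so its settings are tuned together with the classifier's.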

There are two kinds of grid search: exhaustive and random.

### Exhaustive

Exhaustive grid search is nothing more than a series of for loops, one per hyperparameter, each iterating over that hyperparameter's candidate values (stored in a dictionary). The performance of every combination is recorded, and the best-performing hyperparameters are returned.  scikit-learn has a class for this (`GridSearchCV`), though you could write your own along the lines of this example:

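```
from itertools import product


def exhaustive_search(parameter_vals, evaluate):
    # parameter_vals maps each hyperparameter name to its (hashable) candidate values, e.g.
    # {'p1': [a_1, ..., a_K], 'p2': [b_1, ..., b_M], ..., 'pN': [zz_1, ..., zz_N]}.
    # evaluate is any function that takes one parameter set and returns the accuracy of model(set).
    results = {}
    names = list(parameter_vals)
    # generate the parameter grid by exhaustive combination
    for values in product(*(parameter_vals[name] for name in names)):
        results[values] = evaluate(dict(zip(names, values)))  # recorded inside the loop
    best = max(results, key=results.get)  # argmax over the recorded accuracies
    return dict(zip(names, best))
```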
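The cost of an exhaustive search grows multiplicatively with each hyperparameter added. The short sketch below counts the combinations for a small grid; the hyperparameter names and values are hypothetical and chosen only to illustrate the arithmetic.

```python
from itertools import product
from math import prod

# Hypothetical grid; names and values are illustrative only.
parameter_vals = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 10],
    "criterion": ["gini", "entropy"],
}

# The grid size is the product of the number of candidates per hyperparameter.
grid_size = prod(len(values) for values in parameter_vals.values())

# Enumerating the grid explicitly yields exactly that many parameter sets.
parameter_sets = [
    dict(zip(parameter_vals, combo))
    for combo in product(*parameter_vals.values())
]
print(grid_size, len(parameter_sets))  # 3 * 3 * 2 = 18 for both
```

Adding a fourth hyperparameter with five candidate values would raise the count from 18 to 90 models, each of which must be trained and evaluated.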

### Random

A random search for parameter values uses a generating function (typically a chosen probability distribution, e.g. uniform, beta, or gamma, with user-supplied parameters) to produce candidate value sets for the hyperparameters. This has two main benefits over an exhaustive search:

1) A budget can be chosen independently of the number of parameters and possible values. Thus the user has only one quantity to manage: the number of candidates to try.

2) Adding parameters that do not influence performance does not decrease efficiency. In an exhaustive grid search, by contrast, every additional parameter multiplies the number of combinations, even when the manually selected values for it have very little influence on the result.
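Both benefits can be seen in a plain-Python sketch of random search. Everything here is an assumption made for illustration: `random_search`, the samplers, and the toy objective are hypothetical and are not how scikit-learn implements `RandomizedSearchCV`.

```python
import math
import random


def random_search(param_samplers, evaluate, n_iter, seed=0):
    """Draw n_iter random parameter sets and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):  # the budget is fixed up front
        params = {name: draw(rng) for name, draw in param_samplers.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical samplers: each draws one candidate value from a distribution.
param_samplers = {
    "max_depth": lambda rng: rng.randint(1, 10),
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, 0),
    "unused_flag": lambda rng: rng.choice([True, False]),  # no effect on score
}


def toy_score(params):
    # Toy objective, best at max_depth == 5 and learning_rate == 0.01.
    depth_penalty = abs(params["max_depth"] - 5)
    rate_penalty = abs(math.log10(params["learning_rate"]) + 2)
    return -(depth_penalty + rate_penalty)


best_params, best_score = random_search(param_samplers, toy_score, n_iter=20)
print(best_params, best_score)
```

The irrelevant `unused_flag` costs nothing extra: the search still evaluates exactly 20 candidates, whereas adding it to an exhaustive grid would double the number of models.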

```
import numpy as np

from time import time
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)


# Utility function to report best scores
def report(cv_results, n_top=3):
    # rank_test_score is 1 for the best candidate, so sort ascending
    order = np.argsort(cv_results["rank_test_score"])[:n_top]
    for rank, idx in enumerate(order, start=1):
        print("Model with rank: {0}".format(rank))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              cv_results["mean_test_score"][idx],
              cv_results["std_test_score"][idx]))
        print("Parameters: {0}".format(cv_results["params"][idx]))
        print("")


# specify parameters and distributions to sample from -
# what methods might we consider that would improve these estimates
#
# min_samples_split must be at least 2, so its range starts at 2
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# use a full grid over all parameters.
# The grid search will generate parameter sets for each and every one of these
#
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_["params"])))
report(grid_search.cv_results_)
```

<a id = 'section2b'></a>

## Prioritizing Parameters

When hypertuning, it is critical to remember that not all hyperparameters are equally important.  With most models, a subset of hyperparameters has a major impact on the model's performance, while the remaining hyperparameters have little to no effect.  Hence, our hypertuning should focus on finding optimal values for this subset of important hyperparameters.
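One rough way to find that important subset is to look at how much the average score moves as each hyperparameter changes. The sketch below is a simple heuristic under stated assumptions, not a standard library routine: `parameter_sensitivity` is a hypothetical helper, and the toy objective is constructed so that `max_depth` matters and `min_samples_leaf` barely does.

```python
from itertools import product
from statistics import mean


def parameter_sensitivity(results):
    """Map each hyperparameter to the spread of its per-value mean scores.

    results is a list of (params dict, score) pairs from a completed search.
    """
    sensitivity = {}
    for name in results[0][0]:
        scores_by_value = {}
        for params, score in results:
            scores_by_value.setdefault(params[name], []).append(score)
        means = [mean(scores) for scores in scores_by_value.values()]
        sensitivity[name] = max(means) - min(means)
    return sensitivity


def toy_score(params):
    # Toy objective: strongly tied to max_depth, weakly to min_samples_leaf.
    return (0.9 - 0.1 * abs(params["max_depth"] - 5)
            - 0.001 * params["min_samples_leaf"])


# Hypothetical grid, evaluated exhaustively with the toy objective.
grid = {"max_depth": [1, 5, 9], "min_samples_leaf": [1, 2, 3]}
results = []
for combo in product(*grid.values()):
    params = dict(zip(grid, combo))
    results.append((params, toy_score(params)))

sensitivity = parameter_sensitivity(results)
ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
print(ranked, sensitivity)
```

A large spread suggests a hyperparameter worth tuning finely; a spread near zero suggests it can be left at a default while the search budget goes elsewhere. Note that this averages over the other hyperparameters and can miss interactions between them.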

<a id = 'section2c'></a>

## Other Strategies 

<a id = 'section3'></a>

# Troubleshooting 

<a id = 'section3a'></a>

## Imbalanced Datasets 

<a id = 'section3b'></a>

## Information Leakage 

<a id = 'section4'></a>

# Lab 