# Tuning the Algorithm Parameters

![Tuning the model like tuning a guitar](how-to-tune-a-guitar.jpg)

### The code below is split into the following sections:

* Import the required modules
* User-defined function to report the top "n" models
* A short introduction to the dataset used (background reading)
* Load the dataset for processing
* Grid Search
    - create a grid dictionary of hyper-parameters
    - execute the Grid Search Optimisation
    - output the results
* Random Search
    - create a dictionary of hyper-parameters
    - execute the Random Search Optimisation
    - output the results

# The coding begins...

In [1]:
print(__doc__)

# Import the required modules
import numpy as np

from time import time
from operator import itemgetter
from scipy.stats import randint as sp_randint

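# Note: sklearn.grid_search was removed in newer scikit-learn releases,
# where these classes are provided by sklearn.model_selection instead.
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV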
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

Automatically created module for IPython interactive environment


In [2]:
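# Utility function to report best scores
def report(grid_scores, n_top=3):
    # Each entry is (parameters, mean_validation_score, cv_validation_scores),
    # so itemgetter(1) sorts by mean validation score, best first.
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]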
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")
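
The `report` function above relies on `grid_scores_`, which belongs to the older search API. Newer scikit-learn releases expose the results as a `cv_results_` dictionary instead. The following is a minimal sketch of an equivalent report for that layout, assuming the keys `rank_test_score`, `mean_test_score`, `std_test_score` and `params`. The function name `report_cv_results` and the scores in `example` are made up for illustration and are not results from this notebook.

```python
def report_cv_results(cv_results, n_top=3):
    """Print the n_top best candidates from a cv_results_-style dict."""
    ranks = list(cv_results["rank_test_score"])
    # Candidate indices ordered by rank (rank 1 is the best).
    order = sorted(range(len(ranks)), key=lambda i: ranks[i])[:n_top]
    for position, i in enumerate(order, start=1):
        print("Model with rank: {0}".format(position))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            cv_results["mean_test_score"][i],
            cv_results["std_test_score"][i]))
        print("Parameters: {0}".format(cv_results["params"][i]))
        print("")
    return order


# Hand-made stand-in for search.cv_results_ (illustrative values only).
example = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.90, 0.93, 0.88],
    "std_test_score": [0.01, 0.02, 0.01],
    "params": [{"max_depth": 3}, {"max_depth": None}, {"max_depth": 5}],
}
top = report_cv_results(example, n_top=2)
```

With a fitted search object, the call would be `report_cv_results(search.cv_results_)`.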

### The Digit Dataset

This dataset is made up of 1797 8x8 images. Each image, like the one shown below, is of a hand-written digit. In order to utilize an 8x8 figure like this, we’d have to first transform it into a feature vector with length 64.

![Digit](plot_digits_last_image_001.png)
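
The flattening step can be sketched with NumPy. The 8x8 array below is a synthetic stand-in rather than a real digit image. In this notebook no manual reshaping is needed, because the `data` attribute returned by `load_digits` already holds the flattened vectors.

```python
import numpy as np

# Synthetic stand-in for one 8x8 grayscale digit image (not real data).
image = np.arange(64).reshape(8, 8)

# Flatten row by row into a single feature vector of length 64.
feature_vector = image.reshape(-1)

print(image.shape)           # (8, 8)
print(feature_vector.shape)  # (64,)
```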


More information about the "Digits" dataset is available at:
* http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

* http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html



In [3]:
# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

In [4]:
# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)
print("\n")

GridSearchCV took 43.57 seconds for 216 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.934 (std: 0.012)
Parameters: {'bootstrap': False, 'min_samples_leaf': 3, 'min_samples_split': 1, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.933 (std: 0.011)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 3, 'criterion': 'entropy', 'max_features': 10, 'max_depth': None}

Model with rank: 3
Mean validation score: 0.931 (std: 0.012)
Parameters: {'bootstrap': True, 'min_samples_leaf': 1, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}
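
The figure of 216 candidates follows directly from the grid: it is the product of the number of values listed for each hyper-parameter (2 x 3 x 3 x 3 x 2 x 2). A small sketch using only the standard library enumerates the same combinations. The order in which scikit-learn visits them may differ from the order shown here.

```python
from itertools import product

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Build one dict per combination of values (the Cartesian product).
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

print(len(candidates))  # 216
print(candidates[0])
```

Adding one more value to any list multiplies the total, which is why a full grid becomes expensive quickly.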





In [5]:
# specify parameters and distributions to sample from
param_dist = {"max_depth": [5, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 108
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
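
To make the sampling concrete, here is a rough sketch of what drawing candidates from `param_dist` amounts to, written with the standard `random` module rather than scikit-learn's own sampler. It assumes that list entries are sampled uniformly and relies on the fact that `sp_randint(1, 11)` draws integers from 1 to 10 (the upper bound is exclusive). The helper name `sample_candidate` and the seed are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_candidate(rng):
    """Draw one hyper-parameter setting, mimicking param_dist."""
    return {
        "max_depth": rng.choice([5, None]),
        "max_features": rng.randrange(1, 11),       # 1..10, like sp_randint(1, 11)
        "min_samples_split": rng.randrange(1, 11),
        "min_samples_leaf": rng.randrange(1, 11),
        "bootstrap": rng.choice([True, False]),
        "criterion": rng.choice(["gini", "entropy"]),
    }


n_iter_search = 108
candidates = [sample_candidate(rng) for _ in range(n_iter_search)]

print(len(candidates))  # 108
print(candidates[0])
```

Unlike the grid, the budget here is fixed by `n_iter_search`, independent of how many values each hyper-parameter can take.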

In [6]:
start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
print("\n")
report(random_search.grid_scores_)

RandomizedSearchCV took 25.30 seconds for 108 candidates parameter settings.


Model with rank: 1
Mean validation score: 0.935 (std: 0.003)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 3, 'criterion': 'gini', 'max_features': 4, 'max_depth': None}

Model with rank: 2
Mean validation score: 0.932 (std: 0.011)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 1, 'criterion': 'gini', 'max_features': 9, 'max_depth': None}

Model with rank: 3
Mean validation score: 0.929 (std: 0.005)
Parameters: {'bootstrap': False, 'min_samples_leaf': 1, 'min_samples_split': 5, 'criterion': 'gini', 'max_features': 10, 'max_depth': None}



![Comparison](GridRandomComparison2.jpg)

With half as many candidates (108 versus 216), Random Search reaches almost the same best mean validation score as Grid Search (0.935 versus 0.934).
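
The cells above were run against an older scikit-learn API. As a hedged sketch, the randomized search could be ported to a newer release roughly as follows, assuming the `sklearn.model_selection` module is available. Two things here are assumptions rather than part of the original run: `min_samples_split` starts at 2, because newer releases reject a value of 1, and the budget (`n_iter=5`, `cv=3`) is deliberately tiny so the sketch finishes quickly. The scores it prints will not match the outputs shown above.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(n_estimators=20, random_state=0)

# Same search space as the notebook, except min_samples_split starts at 2.
param_dist = {
    "max_depth": [5, None],
    "max_features": randint(1, 11),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 11),
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# Small budget for illustration only; the notebook used 108 candidates.
search = RandomizedSearchCV(clf, param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
print("Best mean validation score: {0:.3f}".format(search.best_score_))
```

Swapping in `GridSearchCV(clf, param_grid=...)` follows the same pattern, with the results available through `cv_results_` in both cases.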

####  Light-reading stuff :-)  

This blog post is about handwriting recognition with Python. It explains how images are recognised and which training methods are used to recognise and predict the digit in an image.

http://blog.yhat.com/posts/digit-recognition-with-node-and-python.html

#### References:

* Random Forest - http://goo.gl/F14BqE

* GridSearchCV - http://goo.gl/Fca3kX

* RandomizedSearchCV - http://goo.gl/T4MZct

* How to find the best model parameters in scikit-learn (video) - https://goo.gl/1xDhtm

* Random Search for Hyper-Parameter Optimization -  http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf

* How to Evaluate Machine Learning Models: Hyperparameter Tuning by Alice Zheng  - http://goo.gl/B7KNJs

* Comparing randomized search and grid search for hyperparameter estimation - http://goo.gl/9q1qgd

![Thank You](2ThankYou.jpeg)