# Hour 2: Introduction to Classification:
Python and Classification Labcamp - Milano, 25/05/2016

Alex Loosley (a.loosley@reply.de)
<br>Alex Salles (a.salles@reply.de)

# What is Classification:

When an object can be **labeled** by a discrete class (*e.g.* dog or cat), the act of determining that class is called classification.

## Iris Dataset:

In the Pandas intro, we downloaded the Iris dataset.  Excerpt from [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set):

<blockquote>
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3]
</blockquote>

<img src="https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg" width="300">

## Visualization of Iris Dataset:

<img src="../doc/ml_logo.jpg" width="500">

# Classification Algorithms:

### Logistic Regression:
Find straight lines (linear decision boundaries) that separate two classes.  In a three-class problem, find one line per class between *is_class* and *is_not_class* (*e.g.* iris-setosa vs. everything else, iris-versicolor vs. everything else, *etc.*).
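
The *is_class* vs. *is_not_class* idea can be sketched with scikit-learn's `OneVsRestClassifier`. This is a minimal illustration, not the method used later in this notebook: the wrapper, the `max_iter=1000` setting, and the variable names are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target  # first two features only

# Train one binary "is_class vs. is_not_class" logistic regression per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

# Each binary model defines one straight line: w0*x0 + w1*x1 + b = 0
for name, model in zip(iris.target_names, ovr.estimators_):
    w, b = model.coef_[0], model.intercept_[0]
    print("{} vs. rest: {:.2f}*x0 + {:.2f}*x1 + {:.2f} = 0".format(name, w[0], w[1], b))

n_boundaries = len(ovr.estimators_)
```

With three iris species, three binary models are fitted, and a new point is assigned to the class whose binary model gives the highest score.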

### Trees, Naive Bayes, SVM, Neural Networks, etc.
![Picture from SKlearn](http://scikit-learn.org/stable/_images/plot_classifier_comparison_001.png)
[Link to SKlearn classification page](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

We'll implement a few of these below on the iris dataset:

# Metrics for Classification:

There is a plethora of metrics for classification, and which ones apply depends on whether the predictions are given as class labels or as class probabilities.
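
To make the distinction concrete, here is a small sketch of the two kinds of prediction a scikit-learn classifier can return. The classifier settings and variable names are chosen for illustration only.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X[:3])        # hard class labels, shape (3,)
probas = clf.predict_proba(X[:3])  # one probability per class, shape (3, 3)

print(labels)
print(probas.round(3))
```

Each row of `probas` sums to 1, and the predicted label is the class with the largest probability in that row.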

## Metrics for Class Predictions:

Let's start with the simplest.

Recall this well-known table, the confusion matrix:

|                     | Observation Positive     | Observation Negative    |
|---------------------|:------------------------:|:-----------------------:|
| Prediction Positive |     True Positive        | False Positive (Type I) |
| Prediction Negative | False Negative (Type II) |     True Negative       |

There are many summary statistics one can compute from this table:
1. The **Accuracy** gives the fraction of labels correctly predicted (True Positives and True Negatives over everything).  
1. The **Hamming Loss** gives the fraction of labels incorrectly predicted.  It is 1 - Accuracy.
1. The **Precision** is true positives divided by all positive predictions: $p = \frac{TP}{FP+TP}$
1. The **Recall** is true positives divided by all positive observations: $r = \frac{TP}{FN+TP}$
1. There is also the **F-beta** score: $F_\beta=\frac{(1+\beta^2)\cdot\textrm{precision}\cdot\textrm{recall}}{\beta^2 \cdot\textrm{precision}+\textrm{recall}}$

    This is a weighted harmonic mean of the precision and recall (weighted by $\beta$: $\beta > 1$ favours recall, $\beta < 1$ favours precision), and the **F-1** score is the special case when $\beta = 1$.
1. The **Jaccard Similarity Coefficient** is the true positives divided by the sum of true positives, false negatives, and false positives: $J = \frac{TP}{TP+FN+FP}$
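
The statistics above can be computed by hand from the four cells of the table. The sketch below uses made-up binary labels (hypothetical data, 1 = positive and 0 = negative) and cross-checks a few values against `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score

# Hypothetical labels, invented for this example
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives (Type I)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives (Type II)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / float(len(pairs))
hamming_loss = 1 - accuracy
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
beta = 2.0  # beta > 1 weights recall more heavily than precision
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
jaccard = tp / float(tp + fn + fp)

# Cross-check the hand-computed values against scikit-learn
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f_beta - fbeta_score(y_true, y_pred, beta=beta)) < 1e-12

print("TP={} FP={} FN={} TN={}".format(tp, fp, fn, tn))
print("accuracy={:.2f} precision={:.2f} recall={:.2f}".format(accuracy, precision, recall))
print("F-beta={:.3f} Jaccard={:.2f}".format(f_beta, jaccard))
```

Here the classifier finds three of the four positives (recall 0.75) but two of its five positive predictions are wrong (precision 0.6), so the two metrics tell different stories about the same predictions.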

**Questions:**
1. What's the interpretation of precision or recall? When would you want each?  
1. Is Harvard's admissions process high precision or high recall? 
1. Should one optimize on recall or precision for an HIV test?
1. How does Sir William Blackstone's aphorism "Better that ten guilty persons escape than that one innocent suffer" compare with Captain Louis Renault's order to "Round up the Usual Suspects" in the film "Casablanca"?

MATERIAL ABOVE MODIFIED FROM MATERIAL I HAD FROM [THE DATA INCUBATOR](http://www.thedataincubator.com).

## IRIS Dataset Classification Example With Logistic Regression:
Let's first grab the data and plot it:

In [None]:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:, :2], columns=iris.feature_names[:2])  # we only take the first two features.
y = iris.target

In [None]:
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=y)
plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.title('Three Class Example:')

### Goal:
Classify flower type based on these two features: 
* sepal width
* sepal length

### Method:
* Train Logistic Regression classifier based on the given data points (X,y)
* Use Logistic Regression classifier to predict the flower type of a meshgrid of points spanning our feature space

In [None]:
h = .02  # step size in the mesh

# Create a logistic regression classifier; the large C means weak regularization.
logreg = linear_model.LogisticRegression(C=1e5)

# Fit the classifier to the data.
logreg.fit(X, y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X.iloc[:, 0].min() - .5, X.iloc[:, 0].max() + .5
y_min, y_max = X.iloc[:, 1].min() - .5, X.iloc[:, 1].max() + .5

# Create a meshgrid of points:
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
X_predict = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=iris.feature_names[:2])

In [None]:
X_predict.head(2)

`ravel` along with [np.c_](https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html) can be used to create an array of points from the meshgrid:
* `ravel` flattens each matrix into a 1-D array
* `np.c_` stacks the two 1-D arrays as columns, pairing each x value with its y value
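
A tiny grid makes the round trip easier to follow. In this sketch the 3x2 grid and the `values` array (a stand-in for classifier predictions) are made up for illustration.

```python
import numpy as np

# A 3 (x) by 2 (y) grid instead of the fine mesh used above
xx, yy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
# xx is [[0, 1, 2], [0, 1, 2]] and yy is [[0, 0, 0], [1, 1, 1]]

# Flatten both grids and pair them up column-wise: one (x, y) row per grid point
points = np.c_[xx.ravel(), yy.ravel()]
print(points.shape)  # (6, 2)

# Stand-in for model predictions: one value per point
values = points[:, 0] + 10 * points[:, 1]

# Reshape back to the grid layout so it can be passed to a plotting function
grid = values.reshape(xx.shape)
print(grid)
```

Because `ravel` and `reshape` use the same element order, each value in `grid` ends up at the grid position of the point it was computed from.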

Make predictions within feature space and plot:

In [None]:
y_predict = logreg.predict(X_predict)

# Put the result into a color plot
y_predict = y_predict.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, y_predict, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Flower Type Prediction Plot:')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

### Questions:

- How well does our model perform? How do we measure it?

- Why do we divide the data into a training set and a test set?

# Exercises:

## Variance-Bias Tradeoff

The *Bias* corresponds to how far we expect the model to deviate from reality because of parametric assumptions (e.g. we forced the model to be linear or to be a tree of maximum depth 2).  It is given by the *In-Sample Error* and always goes down with complexity.  High bias models correspond to *underfitting*.

The *Variance* accounts for the fact that the model was only trained on a (noisy) subset of the data and that the idiosyncratic noise in the data is therefore likely to contribute some variance to the model.  The more complex we allow the model to be, the more likely we are to overfit by picking up more of this noise.  High variance models correspond to *overfitting*.

We can also think of bias as unmodelled data and variance as modelled noise.  As we increase the complexity of the model, we will necessarily model more of the data (reduce bias, reduce underfitting) but also start modelling noise (increase variance, increase overfitting).  Here's a helpful diagram of the decomposition.  Notice that at the optimal point, we have not yet learned all of our signal (there is still unmodelled data left) and we have already picked up some noise and overfitting.

![Bias-Variance from Dartmouth](../doc/bias-variance.png)

(This section's material is from [The Data Incubator](http://www.thedataincubator.com), a prestigious data science training program)
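
The tradeoff can be observed numerically. The sketch below is an illustration under assumptions that are not part of the original material: it uses decision-tree depth as the complexity knob, the first two iris features, and a fixed train/test split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

depths = [1, 2, 3, 5, 10]  # increasing model complexity
train_errors, test_errors = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # score() returns accuracy, so 1 - score is the misclassification rate
    train_errors.append(1 - tree.score(X_train, y_train))
    test_errors.append(1 - tree.score(X_test, y_test))

for depth, tr, te in zip(depths, train_errors, test_errors):
    print("max_depth={:>2}: in-sample error={:.3f}, out-of-sample error={:.3f}".format(
        depth, tr, te))
```

The in-sample error shrinks as the tree gets deeper, while the out-of-sample error typically stops improving at some depth; the gap between the two is a sign of overfitting.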

- split the data set into training set and test set
- train a Logistic Regression model with the training set
- predict the results of the test set
- using the predicted values and the "true" values, create a confusion matrix
- interpret the results

## Optional:
- repeat the exercise with different classifiers and compare the results

### Exercise Solutions:

In [None]:
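# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix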


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# create the model
logreg_validation = linear_model.LogisticRegression(C=1e5)

# train the model
logreg_validation.fit(X_train, y_train)


In [None]:
# predict the output
y_predict = logreg_validation.predict(X_test)


In [None]:
# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_predict)

In [None]:
print(cm)
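
One possible way to approach the last two steps (interpreting the results and the optional comparison) is sketched below. The choice of classifiers and their settings are assumptions made for this example, not part of the exercise statement.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Illustrative selection of classifiers to compare
classifiers = {
    "logistic regression": LogisticRegression(C=1e5, max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    y_hat = clf.fit(X_train, y_train).predict(X_test)
    cm = confusion_matrix(y_test, y_hat)
    # The diagonal counts correct predictions; each row sums to the
    # number of test samples that truly belong to that class.
    accuracy = np.trace(cm) / float(cm.sum())
    recall_per_class = np.diag(cm) / cm.sum(axis=1).astype(float)
    results[name] = (cm, accuracy, recall_per_class)
    print(name)
    print(cm)
    print("accuracy = {:.3f}, per-class recall = {}".format(
        accuracy, recall_per_class.round(2)))
```

Note that scikit-learn puts the true classes in rows and the predicted classes in columns, which is the transpose of the table shown in the metrics section. Off-diagonal entries show which species get confused with each other.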