## Simple machine learning vocabulary (More tomorrow)

* **Sample**: An object that your machine learning algorithm is trying to understand. Also known as an **example**.
* **Feature**: A measurable property that describes your sample.
    * The input to most machine learning algorithms is a **feature vector**, which is a numeric encoding of the data about your sample.
* **Classification**: Describes a task where you have to predict labels, categories, etc. from a finite set of possibilities
    * Examples: Spam detection, image identification, image annotation (if your annotation set is finite), sentiment analysis
    * Classification is done in the **supervised setting**, where we train an algorithm using an already labeled dataset. 
* A **dataset** for classification is usually in the form of a matrix $X$ of samples (one sample per row) and a vector of labels $y$ (one label per sample)
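
As a rough illustration of the $X$ and $y$ layout described above, here is a minimal sketch using NumPy. The numbers are made up for illustration and do not come from any real dataset.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 3 numeric features (columns).
X = np.array([
    [85.0, 92.0, 45.0],
    [90.0, 64.0, 31.0],
    [88.0, 70.0, 20.0],
    [95.0, 55.0, 60.0],
])

# One label per sample, in the same row order as X.
y = np.array([1, 0, 0, 1])

n_samples, n_features = X.shape
first_feature_vector = X[0]  # the feature vector of the first sample
print(n_samples, n_features, first_feature_vector, y[0])
```

Each row of `X` is one feature vector, and `y[i]` is the label that belongs to row `X[i]`.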

## Example dataset of feature vectors

![](images/features.png)

Example class labels

![](images/liver-classes.png)

## UCI Liver Disorders dataset

* Predict whether a patient has a liver disorder based on blood test results and alcohol consumption
* Dataset information [here](https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.names)

In [None]:
import pandas as pd

prefix = "../datasets/"

columns=["mean corpuscular volume", "protein1", "protein2", 
         "protein3", "protein4", "number of drinks", "Has Liver Disorder"]
df = pd.read_csv(prefix + "liver.data", header=None, names=columns)

In [None]:
X = df.drop("Has Liver Disorder", axis=1)
y = df["Has Liver Disorder"]

## Protocol for (simple) evaluation of supervised models

(The simplicity of this procedure is frowned upon. But we'll talk about robust evaluation of models tomorrow.)

1. Split your dataset into a **training** and **testing** dataset
    * **Training** set represents data that has already been observed and labeled.
    * **Testing** set represents data that has not been labeled yet, which we seek to label. (But, in this experimental setting, we have these labels for evaluation.)
![](images/train_test_split.svg)        
2. Train your model on your training set
3. Evaluate your model on your testing set

![](images/supervised_workflow.svg)
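
The three steps above can be sketched in plain Python. This is a simplified illustration, not how any particular library implements it: `split_train_test` is a hypothetical helper, the toy data is made up, and the "model" is just a baseline that always predicts the most common training label.

```python
import random
from collections import Counter


def split_train_test(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical toy dataset: 8 samples with one feature each.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1]

# 1. Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = split_train_test(X, y)

# 2. "Train": memorize the most common label in the training set.
majority_label = Counter(y_train).most_common(1)[0][0]

# 3. Evaluate: predict on the held-out samples and count the matches.
predictions = [majority_label for _ in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print("Held-out accuracy:", accuracy)
```

The key point is that the test labels are only used in step 3, never during training.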

## Common classification models
* [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) ([Cox](https://en.wikipedia.org/wiki/David_Cox_(statistician)), 1958)
* [Support Vector Machine](https://en.wikipedia.org/wiki/Support_vector_machine) ([Vapnik](https://en.wikipedia.org/wiki/Vladimir_Vapnik), Chervonenkis, 1963)
* [Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network) (Backpropagation algorithm by Rumelhart, [Hinton](https://en.wikipedia.org/wiki/Geoffrey_Hinton), Williams, 1986)
* [Random Forests](https://en.wikipedia.org/wiki/Random_forest) (Breiman, 2001) and Decision Trees ([ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) in 1986 by [Quinlan](https://en.wikipedia.org/wiki/Ross_Quinlan))
* [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)?...
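
In scikit-learn, several of the models listed above share the same `fit`/`predict`/`score` interface, so they can be swapped in and out of the same workflow. The sketch below assumes a recent scikit-learn version and uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the liver dataset). Neural networks are left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the liver dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train on the training set
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(name, scores[name])
```

The scores on synthetic data say nothing about how these models would rank on a real problem; the point is only that the code around the model stays the same.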

In [None]:
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the model - a Logistic Regression
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Use the model to predict on unseen data
predictions = clf.predict(X_test)

In [None]:
# Finally, check how you did on this dataset
from sklearn.metrics import accuracy_score

print("Score on liver classification: ", accuracy_score(y_test, predictions))
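
For reference, `accuracy_score` with its default settings reports the fraction of test samples whose predicted label matches the true label. Writing $n$ for the number of test samples, $y_i$ for the true labels, and $\hat{y}_i$ for the predictions (notation introduced here for illustration):

$$\text{accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[\hat{y}_i = y_i]$$

For example, 3 correct predictions out of 4 test samples give an accuracy of $3/4 = 0.75$.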