# Classification

A classification problem is one in which we predict a discrete (categorical) outcome.

Here are some example questions:

* Does a patient have cancer?
* Will a team win the next game?
* Will the customer buy my product?
* Will I get the loan?

In binary classification there are exactly two labels, conventionally written 1 (positive) and 0 (negative).

## Classification Algorithms

1. Logistic Regression
2. Decision Trees
3. Random Forests

### Logistic Regression

Logistic regression finds a line (more generally, a hyperplane) that separates the two classes.

![Classification datapoints](images/classification.png)

There are two classes of data points:

* red circles
* blue pluses

![Logistic Regression decision boundary](images/logistic_regression.png)
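
As a rough illustration of what "finding a line" means, the sketch below applies a logistic model with hand-picked parameters to two points. The weights `w`, the bias `b`, and the sample points are made up for illustration; they are not fitted to the data in the figure.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic model with weights w and bias b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)

def predict(x, w, b, threshold=0.5):
    """Label a point 1 if its probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, w, b) >= threshold else 0

# Hypothetical parameters: the decision boundary is the line x1 + x2 - 5 = 0.
w, b = [1.0, 1.0], -5.0
print(predict([4.0, 3.0], w, b))  # point on the positive side of the line -> 1
print(predict([1.0, 2.0], w, b))  # point on the negative side of the line -> 0
```

Points exactly on the line get a probability of 0.5; training a real model amounts to choosing `w` and `b` so that as many points as possible fall on the correct side.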

### Decision Trees

![Decision Tree Example](images/dtree.gif)

(If you flip it upside down it will look more like a tree.)

## Decision Trees: Choosing which feature to split on

A good split produces groups that are (nearly) pure:

* All those with a criminal record shouldn't be given loans and all those without a record should be given loans

A bad split leaves both groups as mixed as the data was before splitting:

* 50% of women should be given a loan and 50% of men should be given a loan
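
One common way to put a number on "good" and "bad" is Gini impurity. The text above does not name a criterion, so treat this as an assumed choice. The sketch scores the two splits described above using invented labels.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0 for a pure group, 0.5 for a 50/50 mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up labels: 1 = loan granted, 0 = loan denied.
good = split_impurity([0, 0, 0, 0], [1, 1, 1, 1])  # criminal record vs. no record
bad = split_impurity([1, 0, 1, 0], [1, 0, 1, 0])   # women vs. men
print(good, bad)  # 0.0 0.5
```

A tree-building algorithm would evaluate every candidate feature this way and split on the one with the lowest weighted impurity.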

## Downside of Decision Trees: Overfitting

A fully grown tree can end up with a bucket that holds a single datapoint, for example one person with income > $70K and a criminal record.

We can't conclude from that one example that every datapoint landing in this bucket will have the same outcome.

## Solution: Random Forests

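A random forest is a collection of decision trees whose predictions are combined. (scikit-learn built 10 trees by default before version 0.22 and builds 100 since.)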

* **Bootstrap Aggregation (Bagging):** Each tree gets a random sample *with replacement* of the dataset to build the tree with.

* **Random subset of features:** Only consider a subset of the features when finding the best one to split on.
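
The two sources of randomness above can be sketched in plain Python. This is a minimal illustration rather than how any particular library implements it: the helper names are hypothetical, the feature names are illustrative, and a simple majority vote stands in for the step that combines the trees (scikit-learn averages predicted probabilities instead).

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement: some rows repeat, some are left out."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(features, k, rng):
    """Pick k distinct features to consider at one split."""
    return rng.sample(features, k)

def majority_vote(votes):
    """Combine the 0/1 predictions of the individual trees."""
    return 1 if sum(votes) * 2 > len(votes) else 0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ['pclass', 'age', 'fare', 'female']  # illustrative names
for tree_id in range(3):
    rows = bootstrap_sample(8, rng)
    candidates = random_feature_subset(features, 2, rng)
    print(tree_id, sorted(rows), candidates)

print(majority_vote([1, 0, 1]))  # two of three trees vote 1 -> 1
```

Because each tree sees different rows and different candidate features, the trees make different mistakes, and combining them smooths out the overfitting of any single tree.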

# Metrics

We use several metrics to evaluate how good the model is:

**Accuracy:** This is the percent of predictions that were correct.

**Precision:** This is the fraction of datapoints you predicted as positive that are truly positive.

```
 number predicted positive that are truly positive
------------------------------------------------------
 number predicted positive (true + false positives)
```

**Recall:** This is the fraction of truly positive datapoints that you predicted as positive.

```
 number predicted positive that are truly positive
------------------------------------------------------
             number of true positives and false negatives
```

# Example: Titanic

Goal: predict whether a passenger survived.

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
```

```python
import pandas as pd
import numpy as np

# Load the Titanic passenger data and preview the first rows
df = pd.read_csv('data/titanic.csv')
df.head()
```

|   | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |


```python
# Summary statistics for the numeric columns
df.describe()
```

|       | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


* There are 891 datapoints
* 38% of the people survived
* Only 714 of the 891 datapoints have the age feature filled in

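```python
# One-hot encode the port of embarkation into the columns C, Q and S
df = pd.concat([df, pd.get_dummies(df['embarked'])], axis=1)
# Boolean indicator for sex
df['female'] = df['sex'] == 'female'
# Fill missing ages with the mean age
df['age_filled'] = df['age'].fillna(df['age'].mean())
df.head()
```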

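|   | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | C | Q | S | female | age_filled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | 0 | 1 | False | 22 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 | 0 | True | 38 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 0 | 1 | True | 26 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 | 0 | 1 | True | 35 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | 0 | 1 | False | 35 |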
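
For readers unfamiliar with the two preprocessing steps used here, the following plain-Python sketch shows roughly what one-hot encoding and mean imputation do. The helper names and the tiny input lists are hypothetical; this is not the pandas implementation.

```python
def one_hot(values, categories):
    """One indicator column per category; a missing value (None) gets all zeros."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

embarked = ['S', 'C', 'S', None]  # made-up sample with one missing port
ages = [22.0, 38.0, None, 35.0]   # made-up sample with one missing age

print(one_hot(embarked, ['C', 'Q', 'S']))
print(fill_with_mean(ages))
```

Mean imputation keeps every row usable by the model, at the cost of hiding which ages were actually observed.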


```python
# Build the feature matrix X and the label vector y
features = ['pclass', 'age_filled', 'sibsp', 'parch', 'fare', 'C', 'Q', 'S', 'female']
X = df[features].values.astype(float)
y = df['survived'].values

print("X dimensions:", X.shape)
print("y dimensions:", y.shape)
```

```
X dimensions: (891, 9)
y dimensions: (891,)
```


```python
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 3-fold cross-validation over the rows of X
kfolds = KFold(n_splits=3)

def run_model(X, y, kfolds, Model):
    """Fit Model on each training fold and print the mean test metrics."""
    accuracies = []
    precisions = []
    recalls = []
    for train_index, test_index in kfolds.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = Model()
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        # scikit-learn metrics take (y_true, y_pred), in that order
        accuracies.append(accuracy_score(y_test, y_predict))
        precisions.append(precision_score(y_test, y_predict))
        recalls.append(recall_score(y_test, y_predict))

    print("accuracy:", np.mean(accuracies))
    print("precision:", np.mean(precisions))
    print("recall:", np.mean(recalls))

print("Logistic Regression:")
run_model(X, y, kfolds, LogisticRegression)
```

```
Logistic Regression:
accuracy: 0.789001122334
precision: 0.747860965872
recall: 0.675874970883
```

```python
from sklearn.tree import DecisionTreeClassifier

print("Decision Tree:")
run_model(X, y, kfolds, DecisionTreeClassifier)
```

```
Decision Tree:
accuracy: 0.757575757576
precision: 0.686929824561
recall: 0.670857718961
```

```python
from sklearn.ensemble import RandomForestClassifier

print("Random Forest:")
run_model(X, y, kfolds, RandomForestClassifier)
```

```
Random Forest:
accuracy: 0.79797979798
precision: 0.759372435843
recall: 0.683279830538
```

Let's see if we can add some additional features that will help us. This is called *feature engineering*.

1. A missing age might mean something!
2. The length of the name might reflect social status.
3. Use the cabin column. Did they have an assigned cabin?

```python
# Flag missing ages, measure name length, and flag passengers without a cabin
df['missing_age'] = pd.isnull(df['age'])
df['name_length'] = df['name'].apply(len)
df['no_cabin'] = pd.isnull(df['cabin'])
```

```python
new_features = features + ['missing_age', 'name_length', 'no_cabin']
X = df[new_features].values.astype(float)
y = df['survived'].values

print("Random Forest:")
run_model(X, y, kfolds, RandomForestClassifier)
```

```
Random Forest:
accuracy: 0.794612794613
precision: 0.769083820663
recall: 0.660883269276
```
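
The `run_model` helper above averages each metric over the folds produced by `KFold`. As a rough sketch of what an unshuffled k-fold split does, the function below partitions row indices into contiguous blocks and uses each block as the test set once. The name `kfold_indices` is hypothetical, and three folds is an assumption matching the old scikit-learn default.

```python
def kfold_indices(n_samples, n_splits=3):
    """Yield (train, test) index lists; each contiguous block is the test set once."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional row so every row is used.
    fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 891 rows split into 3 folds: train on 594 rows, test on the remaining 297.
for train, test in kfold_indices(891):
    print(len(train), len(test))
```

Every row is used for testing exactly once, so the averaged metrics reflect performance on data the model did not see during fitting.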


```python
# Fit a forest on all the data and rank the features by importance
print("feature importances:")
rf_model = RandomForestClassifier().fit(X, y)
sorted(zip(rf_model.feature_importances_, new_features), reverse=True)
```

```
feature importances:

[(0.24996092924865998, 'female'),
 (0.18060870673998769, 'fare'),
 (0.17764809383355634, 'age_filled'),
 (0.16876470705763175, 'name_length'),
 (0.075846096630100765, 'pclass'),
 (0.036383190959322818, 'sibsp'),
 (0.031386514124276281, 'no_cabin'),
 (0.029385544726632774, 'parch'),
 (0.015162873337054216, 'missing_age'),
 (0.01500098039067875, 'S'),
 (0.013131614687145129, 'C'),
 (0.0067207482649535603, 'Q')]
```
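
One way to read this list is to look at cumulative importance. The sketch below copies the values from the output above, rounded to four decimals, and prints a running total. Adding up the one-hot columns `C`, `Q` and `S` to judge `embarked` as a single feature is a rough heuristic assumed here, not something the model reports.

```python
# Importances copied from the output above, rounded to four decimals.
importances = [
    (0.2500, 'female'), (0.1806, 'fare'), (0.1776, 'age_filled'),
    (0.1688, 'name_length'), (0.0758, 'pclass'), (0.0364, 'sibsp'),
    (0.0314, 'no_cabin'), (0.0294, 'parch'), (0.0152, 'missing_age'),
    (0.0150, 'S'), (0.0131, 'C'), (0.0067, 'Q'),
]

total = sum(score for score, _ in importances)
running = 0.0
for score, name in importances:
    running += score
    print(f"{name:12s} {score:.4f}  cumulative {running / total:.1%}")

# C, Q and S all encode the same original column (embarked),
# so their importances are summed to judge that column as a whole.
embarked_total = sum(score for score, name in importances if name in ('C', 'Q', 'S'))
top_four = sum(score for score, _ in importances[:4])
print("embarked (C + Q + S):", round(embarked_total, 4))
print("top four features:", round(top_four, 4))
```

The four strongest features carry roughly 78% of the total importance, and the engineered `name_length` feature is among them.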