# Breast Cancer Assignment - Advanced Validation

### Due 03/14/2016
### By Jacob Metzger

0.  Using breast_cancer.csv, create a random forest model that predicts malignant given the other relevant variables.  Use a single holdout (test/train split).  Use Grid Search to optimize model hyperparameters.  Measure the model's performance using AUC, Accuracy, Precision, and Recall.

1.  Implement K-Fold Cross Validation, with 10 folds, on your Breast Cancer Model

2.  Report on how the K-Fold CV score compared to your single holdout AUC

3.  Write a short description of your model's performance.   Include AUC, Accuracy, Precision, and Recall in your discussion.

In [1]:
from __future__ import division #for floating division
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score
%matplotlib inline
np.random.seed(314) # Set for reproducibility. 

In [2]:
# The following is clipped from the Random Forests == Awesome notebook in the course notes

# Here is a simple function to show descriptive stats on the categorical variables
def describe_categorical(X):
    """
    Just like .describe(), but returns the results for
    categorical variables only.
    """
    from IPython.display import display, HTML
    display(HTML(X[X.columns[X.dtypes == "object"]].describe().to_html()))

In [3]:
#Another function taken from the Random Forests == Awesome notebook from the course page

# Look at all the columns in the dataset
def printall(X, max_rows=10):
    from IPython.display import display, HTML
    display(HTML(X.to_html(max_rows=max_rows)))

In [4]:
X = pd.read_csv("breast_cancer.csv")

In [5]:
X.head()

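| | Unnamed: 0 | id number | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | malignant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |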


In [6]:
X.describe()

| | Unnamed: 0 | id number | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | malignant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 349.0 | 1071704.098712 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.440629 | 3.437768 | 2.866953 | 1.589413 | 0.344778 |
| std | 201.928205 | 617095.729819 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 3.665507 | 2.438364 | 3.053634 | 1.715078 | 0.475636 |
| min | 0.0 | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 174.5 | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 |
| 50% | 349.0 | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 |
| 75% | 523.5 | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 1.0 | 1.0 |
| max | 698.0 | 13454352.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 1.0 |


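It's worth pointing out here that the mean of the `malignant` column is about 0.345, so only about a third of the 699 samples are malignant. The target classes are therefore imbalanced, at roughly one malignant case for every two benign ones.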
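The imbalance can be quantified directly from the `describe()` output above. The following is a minimal sketch that uses only the reported row count (699) and the mean of `malignant` (0.344778). The variable names are illustrative, and the majority-class baseline is an added reference point rather than something computed in the notebook:

```python
# Values read from the describe() output above.
n_rows = 699
malignant_mean = 0.344778

# The mean of a 0/1 column is the fraction of ones.
n_malignant = int(round(n_rows * malignant_mean))
n_benign = n_rows - n_malignant

# Accuracy of a trivial classifier that always predicts the majority class (benign).
baseline_accuracy = n_benign / float(n_rows)

print("malignant: %d, benign: %d" % (n_malignant, n_benign))
print("majority-class baseline accuracy: %.3f" % baseline_accuracy)
```

Any accuracy figure for this dataset is best read against that baseline of roughly 65.5%.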

In [7]:
y = X.pop("malignant")

### Single-holdout test-train split

In [8]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [9]:
# Baseline model with default hyperparameters (before tuning)

from sklearn.ensemble import RandomForestClassifier
rfModel = RandomForestClassifier(n_jobs=-1)
rfModel.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Implement Gridsearch for hyperparameter optimization

In [10]:
#The following commented code is adapted from lecture entitled Random Forests == Awesome
#Remove single comment # to reproduce. Double # lines were not used.

#from sklearn.grid_search import GridSearchCV
#n_estimator_options = [200, 300, 400, 500]
#max_features_options = ["auto", None, "sqrt", "log2", 0.9, 0.2]
#min_samples_leaf_options = [1, 2, 3, 4, 5]
##min_samples_split_options = [1,2,3,4,5]
##max_depth_options = [1,2,3,4,5]
##max_leaf_nodes_options = [2,3,4,5, None]
##min_weight_fraction_leaf_options = [0.0,0.1,0.2,0.3,0.4]

#estimator = GridSearchCV(rfModel, dict(
#        n_estimators=n_estimator_options,
#        max_features=max_features_options,
#        min_samples_leaf=min_samples_leaf_options,
##        min_samples_split=min_samples_split_options,
##        max_depth=max_depth_options,
##        max_leaf_nodes = max_leaf_nodes_options,
##        min_weight_fraction_leaf = min_weight_fraction_leaf_options
#    ), cv=3, n_jobs=-2, scoring="roc_auc")

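#estimator.fit(X, y)  # NOTE: fits on all rows, so the holdout rows also influence the chosen hyperparameters; fitting on X_train, y_train would avoid this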

#print estimator.best_estimator_

#rfModel = estimator.best_estimator_
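To get a feel for the cost of the search above, the grid can be enumerated with the standard library. This is a rough sketch that only counts candidate settings for the three option lists that were actually used. It does not run scikit-learn, and `param_grid` is just an illustrative name:

```python
from itertools import product

# The three option lists that were active in the grid search cell.
param_grid = {
    "n_estimators": [200, 300, 400, 500],
    "max_features": ["auto", None, "sqrt", "log2", 0.9, 0.2],
    "min_samples_leaf": [1, 2, 3, 4, 5],
}

# Every combination of one value per hyperparameter is one candidate model.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

cv_folds = 3  # matches cv=3 in the GridSearchCV call
n_cv_fits = len(candidates) * cv_folds

print("candidates: %d" % len(candidates))
print("cross-validation fits: %d" % n_cv_fits)
```

With 4 × 6 × 5 = 120 candidates and 3 folds, the search trains 360 forests (plus a final refit of the best one by default), so it is expensive to re-run.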

In [11]:
# Generate model based on optimized parameters. Comment out if using the code above instead.

rfModel = RandomForestClassifier(n_jobs=-2, max_features=0.2, min_samples_leaf=4, n_estimators=300, random_state=42)
rfModel.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.2, max_leaf_nodes=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-2,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

### Performance metrics on single-holdout model (note no confidence estimates)

In [12]:
from sklearn.metrics import roc_auc_score, mean_squared_error, precision_score, recall_score, f1_score, adjusted_mutual_info_score

# Predict once and reuse the hard 0/1 labels for every metric.
y_pred = rfModel.predict(X_test)

# Note: this AUC is computed from hard labels rather than predicted probabilities.
# With 0/1 labels, MSE is the misclassification rate, so accuracy = 1 - MSE.
print "ROC:  ", roc_auc_score(y_test, y_pred)
print "MSE:  ", mean_squared_error(y_test, y_pred)
print "Prec: ", precision_score(y_test, y_pred)
print "Rec:  ", recall_score(y_test, y_pred)
print "F1:   ", f1_score(y_test, y_pred)

ROC:   0.956725146199
MSE:   0.0428571428571
Prec:  0.914893617021
Rec:   0.955555555556
F1:    0.934782608696
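The five numbers above are mutually consistent with a single confusion matrix. The counts below are inferred from the printed metrics (they are not shown in the notebook), assuming a 140-row test set, i.e. 20% of 699 rounded up. The sketch shows how each metric follows from the counts, including accuracy, and that an AUC computed from hard 0/1 predictions reduces to the average of recall and specificity:

```python
# Confusion-matrix counts inferred from the printed holdout metrics (an assumption).
tp, fp, fn, tn = 43, 4, 2, 91
n_test = tp + fp + fn + tn  # 140 rows

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)        # also called sensitivity
specificity = tn / float(tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# With 0/1 labels every error contributes exactly 1 to the squared error,
# so MSE is the misclassification rate and accuracy is its complement.
mse = (fp + fn) / float(n_test)
accuracy = 1 - mse

# A ROC curve built from hard labels has a single operating point;
# its area is the mean of sensitivity and specificity.
auc_from_labels = (recall + specificity) / 2

for name, value in [("ROC", auc_from_labels), ("MSE", mse), ("Prec", precision),
                    ("Rec", recall), ("F1", f1), ("Acc", accuracy)]:
    print("%-5s %.6f" % (name + ":", value))
```

Under these inferred counts the holdout accuracy is about 0.957, which supplies the accuracy figure the assignment asks for.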


### Implement 10-fold Cross Validation

In [13]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem
from sklearn.metrics import make_scorer

roc_scores = cross_val_score(rfModel, X, y, cv=10, scoring="roc_auc", n_jobs=-2)

# This scorer returns the negated MSE (so that higher is better); flip the sign to recover the MSE.
mse_scores = cross_val_score(rfModel, X, y, cv=10, scoring="mean_squared_error", n_jobs=-2)
mse_scores *= -1

precision_scores = cross_val_score(rfModel, X, y, cv=10, scoring="precision", n_jobs=-2)
recall_scores = cross_val_score(rfModel, X, y, cv=10, scoring="recall", n_jobs=-2)
f1_scores = cross_val_score(rfModel, X, y, cv=10, scoring="f1", n_jobs=-2)

# Adjusted mutual information is computed for reference but not printed below.
mi_scores = cross_val_score(rfModel, X, y, cv=10, scoring=make_scorer(adjusted_mutual_info_score), n_jobs=-2)

print "ROC:  ", roc_scores.mean(),"+-", 2.262*sem(roc_scores) #2.262 is the two-sided 95% t critical value for 9 degrees of freedom (10 folds)
print "MSE:  ", mse_scores.mean(),"+-", 2.262*sem(mse_scores)
print "Prec: ", precision_scores.mean(), "+-", 2.262*sem(precision_scores)
print "Rec:  ", recall_scores.mean(), "+-", 2.262*sem(recall_scores)
print "F1:   ", f1_scores.mean(), "+-", 2.262*sem(f1_scores)

ROC:   0.991584943639 +- 0.00744875957896
MSE:   0.0299408042458 +- 0.0187908146862
Prec:  0.950912087912 +- 0.0425340837936
Rec:   0.967 +- 0.0231158359476
F1:    0.957804363103 +- 0.0252544966201
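The `+-` values above are half-widths of a t-based 95% confidence interval: the critical value 2.262 (9 degrees of freedom, for 10 folds) times the standard error of the fold scores. A minimal standard-library sketch of that calculation follows. `t_interval` is a hypothetical helper, and the fold scores are placeholder values for illustration, not the notebook's actual per-fold results:

```python
import math

def t_interval(scores, t_crit):
    """Return (mean, half_width) of a t-based confidence interval.

    t_crit must match the sample size, e.g. 2.262 for a two-sided
    95% interval with 10 scores (9 degrees of freedom).
    """
    n = len(scores)
    mean = sum(scores) / float(n)
    # Sample variance with n - 1 in the denominator, as scipy.stats.sem uses by default.
    variance = sum((s - mean) ** 2 for s in scores) / float(n - 1)
    standard_error = math.sqrt(variance / n)
    return mean, t_crit * standard_error

# Placeholder fold scores (hypothetical, for illustration only).
fold_scores = [0.98, 1.00, 0.99, 0.97, 1.00, 0.99, 1.00, 0.98, 0.99, 1.00]
mean, half_width = t_interval(fold_scores, t_crit=2.262)
print("mean = %.4f +- %.4f" % (mean, half_width))
```

Because the folds share most of their training rows, the fold scores are not independent, so such an interval is best read as a rough indication of variability rather than an exact 95% bound.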


### Discussion

Based on the above, the 10-fold cross-validated scores for the random forest model are generally better than those of the single holdout: ROC AUC 0.992 vs. 0.957, MSE 0.030 vs. 0.043, precision 0.951 vs. 0.915, recall 0.967 vs. 0.956, and F1 0.958 vs. 0.935. Because the labels are 0/1, MSE is the misclassification rate, so accuracy is roughly 95.7% on the holdout and 97.0% under cross-validation. The model itself did not change; much of the difference comes from the specific test-train split used in the single holdout. The two ROC figures are not strictly comparable, however: the holdout value was computed from hard class predictions (`rfModel.predict`), whereas the `roc_auc` scorer ranks samples by continuous scores (here, predicted probabilities), which usually yields a higher value.

The cross-validated scores are averaged over 10 different test-train splits and show that, on average, the model has an excellent ROC AUC and a low MSE. Cross-validation also lets us attach approximate 95% confidence intervals as a measure of how stable the model is across different slices of the data. The intervals are fairly narrow, the widest being for precision (about ±0.043). The cross-validated precision and recall are close to each other, which is reflected in an F1 score (their harmonic mean) that lies between them.

The cross-validated scores are generally more trustworthy than the single-holdout scores because they measure performance on ten different held-out slices rather than one. This helps to detect overfitting, because in every fold the model is scored on rows it was not trained on. One caveat: the commented-out grid search was fit on the full dataset, so the hyperparameters were chosen with knowledge of rows that later serve as test data, and both sets of scores may be slightly optimistic as a result.
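The rotating holdout behind these scores can be sketched without any library. This is plain, unshuffled k-fold splitting written for illustration (`kfold_indices` is a hypothetical helper). scikit-learn's `cross_val_score` with an integer `cv` and a classifier uses a stratified variant, which additionally keeps the class ratio similar in every fold:

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 699 rows and 10 folds, as in the notebook.
folds = list(kfold_indices(699, 10))
for i, (train, test) in enumerate(folds):
    print("fold %d: train=%d rows, test=%d rows" % (i, len(train), len(test)))
```

Each row is used for testing exactly once and for training in the other nine folds, so every fold score is measured on rows the fitted model has not seen.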