The Challenge:

Your task is to develop a model that predicts whether a biopsied breast cell is benign (not harmful) or malignant (cancerous), given a set of attributes about the cell.

There are many ways you can explore, visualize, engineer your features, and tell a story with this data! Being able to clearly communicate your thought process is one of the most important parts of a data challenge. Some important questions to think about are: how can you best explore the data? Why did you select your particular model? How did you validate your model?

Please code and annotate your analysis in a Jupyter notebook.

The dataset consists of 699 cells for which you have the following features:

Sample code number: id number
Clump Thickness: 1 - 10
Uniformity of Cell Size: 1 - 10
Uniformity of Cell Shape: 1 - 10
Marginal Adhesion: 1 - 10
Single Epithelial Cell Size: 1 - 10
Bare Nuclei: 1 - 10
Bland Chromatin: 1 - 10
Normal Nucleoli: 1 - 10
Mitoses: 1 - 10
Class: (2 for benign, 4 for malignant)
The dataset is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

In [167]:
# import libraries

import pandas as pd
import io
import requests
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
#import statsmodels.discrete.discrete_model as sm
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB as GNB
from sklearn.naive_bayes import MultinomialNB as MNB
from sklearn.neural_network import MLPClassifier
import numpy as np
import seaborn as sns

# useful for evaluating predictive capabilities 
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import (brier_score_loss, precision_score, recall_score,
                             f1_score)
from sklearn.metrics import classification_report,confusion_matrix

# not using this but good to keep in mind!
from sklearn.model_selection import train_test_split

# will use this for preprocessing the data
from sklearn.preprocessing import StandardScaler

In [22]:
# from data description, these are the column names
names_list = ['id','clump_thickness','uniform_size','uniform_shape','adhesion','epithel_size',\
              'bare_nuclei','bland_chromatin','nucleoli','mitoses','Class']

# import the data into pandas dataframe
link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
f = requests.get(link).content
data = pd.read_csv(io.StringIO(f.decode('utf-8')), names = names_list)

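# keep only numeric types
# note: this drops 'bare_nuclei', which pandas reads as text because
# missing values in the raw file are marked with '?'
data = data.select_dtypes(include=[np.number])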

['id', 'clump_thickness', 'uniform_size', 'uniform_shape', 'adhesion', 'epithel_size', 'bland_chromatin', 'nucleoli', 'mitoses', 'Class']
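
Note that `bare_nuclei` is absent from the column list above: the raw file marks missing values with `?`, so the column is read as text and `select_dtypes` drops it. A minimal sketch of keeping the column instead, assuming a tiny inline sample (the three rows are made up for illustration, not taken from the dataset) and a median fill, which is only one of several reasonable choices:

```python
import io
import pandas as pd

names_list = ['id', 'clump_thickness', 'uniform_size', 'uniform_shape',
              'adhesion', 'epithel_size', 'bare_nuclei', 'bland_chromatin',
              'nucleoli', 'mitoses', 'Class']

# hypothetical stand-in rows in the same layout as the raw file
raw = u"""1,5,1,1,1,2,1,3,1,1,2
2,8,10,10,8,7,?,9,7,1,4
3,4,2,1,1,2,3,3,1,1,2
"""

# treat '?' as NaN so bare_nuclei is parsed as a numeric column
sample = pd.read_csv(io.StringIO(raw), names=names_list, na_values='?')

# fill the gap with the column median (dropping the row is another option)
median = sample['bare_nuclei'].median()
sample['bare_nuclei'] = sample['bare_nuclei'].fillna(median)

print(sample['bare_nuclei'].tolist())
```

With the real file, the same `na_values='?'` argument could be passed to the `pd.read_csv` call above; whether to impute or drop the affected rows is a modeling decision.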


In [135]:
# let's look at the histograms
data.hist()
plt.figure(1)
plt.savefig('histograms.pdf')
#plt.show()

The target variable 'Class' is binary, so my first thought is to use a logistic regression.

I'll start by including all of the independent variables except for 'id', as this should be irrelevant to predicting whether a tumor is benign or malignant.

In [51]:
# let's first start with a logistic regression to see if we can 
# accurately classify the data into benign or malignant
# first change the benign and malignant scores to 0 and 1, respectively 
data['Class'] = data['Class'].replace(to_replace=[2,4], value = [0,1])

#### need to split up training/testing set 80/20 ########
train = data.sample(frac = 0.8, random_state = 1)
test = data.loc[~data.index.isin(train.index)]

### train a logistic regression on a selection of the columns ###
train_cols = ['clump_thickness', 'uniform_size', 'uniform_shape', 'epithel_size','adhesion','nucleoli']

# subset the set used for the training data and add a constant
logit_train_data = train[train_cols]
logit_train_data = sm.add_constant(logit_train_data)

logit = sm.Logit(train['Class'],logit_train_data)
result = logit.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.139046
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:                  Class   No. Observations:                  559
Model:                          Logit   Df Residuals:                      552
Method:                           MLE   Df Model:                            6
Date:                Wed, 21 Feb 2018   Pseudo R-squ.:                  0.7831
Time:                        20:11:35   Log-Likelihood:                -77.727
converged:                       True   LL-Null:                       -358.30
                                        LLR p-value:                5.588e-118
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -6.7598      0.642    -10.537      0.000      -8.017      -5.502
clump_thicknes
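
The 80/20 split above is done with `DataFrame.sample`. As a hedged alternative, here is a sketch using `train_test_split` (imported earlier but unused) with `stratify`, so that both subsets keep the same benign/malignant ratio. The frame `toy` is synthetic and only stands in for `data`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for `data`: 100 rows, 35% malignant (Class == 1)
rng = np.random.RandomState(1)
toy = pd.DataFrame({
    'clump_thickness': rng.randint(1, 11, size=100),
    'uniform_size': rng.randint(1, 11, size=100),
    'Class': [0] * 65 + [1] * 35,
})

# 80/20 split that preserves the class ratio in both parts
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy['Class'])

print(len(train_toy), len(test_toy))
print(train_toy['Class'].mean(), test_toy['Class'].mean())
```

Stratifying matters most when one class is much rarer than the other; here it mainly guards against an unlucky draw in a small test set.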

In [56]:
# use the test data to predict the target variable on the test set
logit_test_data = test[train_cols]
logit_test_data = sm.add_constant(logit_test_data)

# get the prediction and round to the nearest integer 
# since this is a binary classification
predict = result.predict(logit_test_data)

round_predict = [round(x) for x in predict]

All of the tested independent variables have p-values below 0.05, which indicates that they are statistically significant at the 5% level.

In [88]:
# let's look at the predictions by histogramming them
plt.figure(2)

bins = np.arange(0,1.1,.1)
plt.hist(round_predict,bins,label = 'LR predicted', alpha = 0.3)
plt.hist(test['Class'],bins,label ='real', alpha = 0.3)
plt.xlabel('Class')
plt.ylabel('prediction')
plt.title("Histogram of LR prediction results")
plt.legend()
plt.savefig('LogReg_Results.pdf')

In [172]:
# y_test is only defined in a later cell, so use test['Class'] here
print("\n LR confusion matrix:")
print(confusion_matrix(test['Class'], round_predict))
print(classification_report(test['Class'], round_predict))


 LR confusion matrix:
[[86  3]
 [ 7 44]]
             precision    recall  f1-score   support

          0       0.92      0.97      0.95        89
          1       0.94      0.86      0.90        51

avg / total       0.93      0.93      0.93       140



It looks like this model overestimates the number of benign tumors and underestimates the number of malignant tumors: 7 of the 51 malignant tumors in the test set are classified as benign (a recall of 0.86). That's a pretty big deal... For this kind of problem, the patient and doctor should probably err on the side of caution and choose a prediction method that would rather diagnose a benign tumor as malignant than vice versa.
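
One way to err on the side of caution is to lower the probability cutoff below 0.5, so that borderline cases are labeled malignant. A small sketch with made-up numbers (the arrays `y_true` and `p_malignant`, the helper `classify`, and the 0.3 cutoff are illustrative assumptions, not values from this analysis):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up true labels (1 = malignant) and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_malignant = np.array([0.05, 0.20, 0.35, 0.10, 0.45, 0.90, 0.40, 0.80])

def classify(probs, cutoff):
    """Label a case malignant (1) when its probability reaches the cutoff."""
    return (probs >= cutoff).astype(int)

for cutoff in (0.5, 0.3):
    cm = confusion_matrix(y_true, classify(p_malignant, cutoff))
    false_neg = cm[1, 0]   # malignant predicted as benign
    false_pos = cm[0, 1]   # benign predicted as malignant
    print("cutoff %.1f: false negatives=%d, false positives=%d"
          % (cutoff, false_neg, false_pos))
```

In this notebook the same idea would apply to the output of `result.predict` before rounding. A suitable cutoff would have to be chosen on validation data, since it trades missed malignant cases against false alarms.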

Another approach would be to use Linear and Quadratic Discriminant Analysis.

In [112]:
# train linear discriminant analysis
lda = LDA()
lda_model = lda.fit(train[train_cols],train['Class'])
results_lda = lda_model.predict(test[train_cols])

# train quadratic discriminant analysis
qda = QDA()
qda_model = qda.fit(train[train_cols],train['Class'])
results_qda = qda_model.predict(test[train_cols])

plt.figure(3)
#plt.hist(round_predict,bins,label = 'LR predicted', alpha = 0.3)
plt.hist(results_lda,bins,label='LDA predict',alpha =0.3)
plt.hist(test['Class'],bins,label ='real', alpha = 0.3)
plt.legend()
plt.xlabel('Class')
plt.ylabel('prediction')
plt.title("Histogram of LDA prediction results")
plt.savefig('LDA_Results.pdf')

plt.figure(4)
plt.hist(results_qda,bins,label='QDA predict',alpha =0.3)
plt.hist(test['Class'],bins,label ='real', alpha = 0.3)
plt.legend()
plt.xlabel('Class')
plt.ylabel('prediction')
plt.title("Histogram of QDA prediction results")
plt.savefig('QDA_Results.pdf')

plt.figure(5)
plt.hist(round_predict,bins,label = 'LR predicted', alpha = 0.4)
plt.hist(results_lda,bins,label='LDA predict',alpha =0.3)
plt.hist(results_qda,bins,label='QDA predict',alpha =0.2)
plt.hist(test['Class'],bins,label ='real', alpha = 0.1)
plt.title("Histogram of multiple prediction results")
plt.legend()
plt.savefig('Combined_Results.pdf')

In [169]:
# y_test is only defined in a later cell, so use test['Class'] here
print("\n LDA confusion matrix:")
print(confusion_matrix(test['Class'], results_lda))
print(classification_report(test['Class'], results_lda))


 LDA confusion matrix:
[[87  2]
 [ 8 43]]
             precision    recall  f1-score   support

          0       0.92      0.98      0.95        89
          1       0.96      0.84      0.90        51

avg / total       0.93      0.93      0.93       140



In [168]:
# y_test is only defined in a later cell, so use test['Class'] here
print("\n QDA confusion matrix:")
print(confusion_matrix(test['Class'], results_qda))
print(classification_report(test['Class'], results_qda))


 QDA confusion matrix:
[[85  4]
 [ 1 50]]
             precision    recall  f1-score   support

          0       0.99      0.96      0.97        89
          1       0.93      0.98      0.95        51

avg / total       0.97      0.96      0.96       140



QDA does much better. In fact, it starts to overpredict malignant tumors relative to benign ones (54 predicted vs. 51 actual), misses only 1 malignant tumor, and has an accuracy of about 96% (135 of 140 correct).
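
Accuracy and malignant recall can be read directly off the confusion matrices printed above. A short sketch of that arithmetic (the helper `summarize` is introduced here purely for illustration):

```python
import numpy as np

# confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class)
matrices = {
    'LR':  np.array([[86, 3], [7, 44]]),
    'LDA': np.array([[87, 2], [8, 43]]),
    'QDA': np.array([[85, 4], [1, 50]]),
}

def summarize(cm):
    """Return (accuracy, recall for the malignant class)."""
    accuracy = np.trace(cm) / float(cm.sum())
    malignant_recall = cm[1, 1] / float(cm[1].sum())
    return accuracy, malignant_recall

for name in ('LR', 'LDA', 'QDA'):
    acc, rec = summarize(matrices[name])
    print("%s: accuracy=%.3f, malignant recall=%.3f" % (name, acc, rec))
```

Malignant recall is arguably the number to watch here, since a missed malignant tumor is the costlier mistake.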

Let's try a few more, such as Gaussian Naive Bayes, which assumes that the likelihood of the features is Gaussian. I'm not really sure we can say that. Perhaps we should check the features for normality.

In [130]:
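# check for normality of the values in the training columns
print("Check the Shapiro test for the variables:\n")
for col in train_cols:
    # shapiro returns the W statistic and the p-value
    w_stat, p_value = stats.shapiro(data[col])
    print("%s (%s, %s)" % (col, w_stat, p_value))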


Check the Shapiro test for the variables:

clump_thickness (0.877661943435669, 5.2220136990493934e-23)
uniform_size (0.6862915754318237, 4.978149239932788e-34)
uniform_shape (0.7180565595626831, 1.1502892540151622e-32)
epithel_size (0.6991004943847656, 1.7103106686548345e-33)
adhesion (0.6512163877487183, 2.0616883781547084e-35)
nucleoli (0.6399070024490356, 7.802240438649756e-36)



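Most of these variables are far from normally distributed: the Shapiro W statistics range from about 0.64 to 0.88 (W should be near 1 for normal data), and the p-values are all far below 0.05. Therefore, the Gaussian Naive Bayes method is making some pretty major assumptions about the input variables. Even so, it seems to do OK at predicting the target variable.

I tried the multinomial Naive Bayes below, which is designed for discrete count features, but this definitely did not improve the prediction results. Note that both Naive Bayes models below are fit and scored on the full 699 rows, so their error counts are not directly comparable to the held-out test results above.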

In [131]:
gnb = GNB()
gnb_pred = gnb.fit(data[train_cols], data['Class']).predict(data[train_cols])

print("\nNow for the Gaussian Naive Bayes...")
print("Number of mislabeled points out of a total %d points : %d" % (data[train_cols].shape[0],(data['Class'] != gnb_pred).sum()))


Now for the Gaussian Naive Bayes...
Number of mislabeled points out of a total 699 points : 34


In [134]:
mnb = MNB()
mnb_pred = mnb.fit(data[train_cols], data['Class']).predict(data[train_cols])

print("\nNow for the Multinomial Naive Bayes...")
print("Number of mislabeled points out of a total %d points : %d" % (data[train_cols].shape[0],(data['Class'] != mnb_pred).sum()))


Now for the Multinomial Naive Bayes...
Number of mislabeled points out of a total 699 points : 125
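
A minimal sketch of counting Naive Bayes errors on a held-out split rather than on the rows used for fitting. It uses synthetic, deliberately well-separated stand-in data (`X_demo` and `y_demo` are assumptions, not the real dataset), so the error count it prints says nothing about the actual tumors.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# synthetic stand-in: benign rows score low (1-3), malignant rows score high (6-10)
benign = rng.randint(1, 4, size=(120, 6))
malignant = rng.randint(6, 11, size=(80, 6))
X_demo = np.vstack([benign, malignant])
y_demo = np.array([0] * 120 + [1] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# fit on the training part only, then count errors on unseen rows
gnb_demo = GaussianNB().fit(X_tr, y_tr)
held_out_pred = gnb_demo.predict(X_te)
n_wrong = int((y_te != held_out_pred).sum())

print("Number of mislabeled points out of a total %d points : %d"
      % (len(y_te), n_wrong))
```

Applied to this notebook, the equivalent would be fitting on `train[train_cols]` and predicting on `test[train_cols]`, which would put the Naive Bayes models on the same footing as the LR, LDA, and QDA results.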


In [154]:
# use this to scale the data
scaler = StandardScaler()

X_train = train[train_cols]
X_test = test[train_cols]

y_train = train['Class']
y_test = test['Class']

# Fit only to the training data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# train the classifier
# the hidden layers number is tricky - not sure how to pick these
# Gave error when I used (30,30,30)
mlp = MLPClassifier(hidden_layer_sizes=(10,10,10))
mlp_model = mlp.fit(X_train,y_train)

# predict on the test set
mlp_predictions = mlp.predict(X_test)

In [165]:
# this is the confusion matrix
#print("f1 score: ", f1_score(y_test, mlp_predictions, average='micro'))
print("\n MLP confusion matrix:")
print(confusion_matrix(y_test, mlp_predictions))


 MLP confusion matrix:
[[85  4]
 [ 4 47]]


In [150]:
# this is the classification report
print(classification_report(y_test,mlp_predictions))

             precision    recall  f1-score   support

          0       0.98      0.97      0.97        89
          1       0.94      0.96      0.95        51

avg / total       0.96      0.96      0.96       140



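MLP gives us a 96% accuracy rate according to the classification report. The confusion matrix above works out to 132 of 140 correct (about 94%); the two cells appear to come from different training runs, and the network's random initialization is not fixed, so the numbers do not match exactly.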
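
Because `MLPClassifier` starts from random weights, repeated runs can give slightly different confusion matrices and reports. A sketch, on synthetic stand-in data, of fixing `random_state` so that results are repeatable (the seed value 0, `max_iter=1000`, and the helper `fit_predict` are arbitrary assumptions for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# synthetic stand-in data: two well-separated groups of 1-10 scores
X_demo = np.vstack([rng.randint(1, 4, size=(60, 6)),
                    rng.randint(6, 11, size=(40, 6))]).astype(float)
y_demo = np.array([0] * 60 + [1] * 40)
X_scaled = StandardScaler().fit_transform(X_demo)

def fit_predict(seed):
    """Train a fresh network with the given seed and return its predictions."""
    mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10),
                        random_state=seed, max_iter=1000)
    # predicting on the training rows is enough to compare two runs
    return mlp.fit(X_scaled, y_demo).predict(X_scaled)

first = fit_predict(0)
second = fit_predict(0)

# identical seeds should give identical predictions
print(np.array_equal(first, second))
```

Passing the same `random_state` to the `MLPClassifier` call above would keep the confusion matrix and the classification report in sync across reruns.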

Finally, let's try K-Nearest Neighbors for classification.