<H1>Diagnosing Malignant Tumors with Logistic Regression</H1>
<H3>By Michael Klear</H3><br>
This is an analysis of the <a href='http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29'>UCI Wisconsin Breast Cancer</a> dataset. It's a simple and short dataset that provides enough information to create a highly sensitive linear logistic regression diagnosis model.<br><br>
Acknowledgements to the publishers of this dataset can be found in <a href='https://github.com/AlliedToasters/CancerDiagnosis.git'>this document</a>. Thanks, guys!


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn import ensemble

In [2]:
df = pd.read_csv('breast-cancer-wisconsin.data', header=None)

I rename the columns to indicate what they measure, as provided by the <a href='https://github.com/AlliedToasters/CancerDiagnosis.git'>dataset metada</a>.

In [3]:
df.columns = [ 
    'patient_id',
    'clump_thickness',
    'unfrm_cell_size',
    'unfrm_cell_shape',
    'mrg_adhesion',
    'sing_epi_cell_size',
    'bare_nuclei',
    'bland_chrom',
    'norm_nucleoli',
    'mitosis',
    'malig'
]

In [4]:
#Relable the outcome variable to 0 for benign, 1 for malignant.
df.malig = np.where(df.malig==4, 1, 0) 

16 of the ~700 samples include null fields ('?') for the bare nuclei measurement. I could just drop them, but we have a small data set to begin with. Instead I'll try to roughly estimate them with a linear model based on the complete samples.

In [5]:
complete_data = df[df['bare_nuclei']!='?'].copy()
incomplete_data = df[df['bare_nuclei']=='?'].copy()
features = [
    'clump_thickness',
    'unfrm_cell_size',
    'unfrm_cell_shape',
    'mrg_adhesion',
    'sing_epi_cell_size',
    'bland_chrom',
    'norm_nucleoli',
    'mitosis',
    'malig'
]
X = complete_data[features].copy()
X_ = incomplete_data[features].copy()
for feature in features:
    X[feature] = X[feature]/X[feature].max() #scale for regularization
    X_[feature] = X_[feature]/X_[feature].max() 
Y = complete_data['bare_nuclei']

#Regularize to deal with colinearity of other features
reg = Ridge()
params = {
    'alpha': [8, 7, 6]
}
srch = GridSearchCV(reg, params, cv=5)
srch.fit(X, Y)
print('best score: ', srch.best_score_, 'best parameters: ', srch.best_params_)

#Make predictions and set these values in df
incomplete_data['predicted_value'] = srch.predict(X_).astype(int)
for row in incomplete_data.index:
    df.set_value(row, 'bare_nuclei', incomplete_data.loc[row].predicted_value)
    
df.bare_nuclei = df.bare_nuclei.astype(int)

best score:  0.643045532848 best parameters:  {'alpha': 7}


<H2>Logistic Regression Model</H2><br>
Now that we've dealt with all of our null values, we can put together a logistic regression model. Using regularization, GridSearch indicates that the model performs best with very little regularization (parameter C set to 1.)

In [6]:
fts = [
    'clump_thickness',
    'unfrm_cell_size',
    'unfrm_cell_shape',
    'mrg_adhesion',
    'sing_epi_cell_size',
    'bare_nuclei',
    'bland_chrom',
    'norm_nucleoli',
    'mitosis'
]
X = df[fts]
Y = df['malig']


mod = LogisticRegression(class_weight='balanced')
params = {
    'penalty': ['l1', 'l2'],
    'C': [1, .999, 1.001],
}

#Apply GridSearchCV
srch = GridSearchCV(mod, params, cv=2)
srch.fit(X, Y)
print('Grid search results: ', srch.best_score_, srch.best_params_) #We see best performance C=1 and loss 'l1'.

mod.fit(X, Y)
coefficients = pd.DataFrame()
coefficients['measurement'] = ['intercept'] + fts
coefficients['coefficient'] = [mod.intercept_] + list(mod.coef_.reshape(-1, 1))
coefficients

Grid search results:  0.97138769671 {'C': 1, 'penalty': 'l1'}


Unnamed: 0,measurement,coefficient
0,intercept,[-6.05440107912]
1,clump_thickness,[0.274181866616]
2,unfrm_cell_size,[0.181069635919]
3,unfrm_cell_shape,[0.28718440646]
4,mrg_adhesion,[0.153864198781]
5,sing_epi_cell_size,[-0.0413967666258]
6,bare_nuclei,[0.399203132543]
7,bland_chrom,[0.160175031869]
8,norm_nucleoli,[0.162696093664]
9,mitosis,[0.223789469312]


<H3>Model Performance</H3><br>
To evaluate our model, I train on half of the data and test on the other half.

In [7]:
cutoff = int(len(df)*.5)

#Set training and test sets
X_train = df[fts][:cutoff]
X_test = df[fts][cutoff:]
Y_train = df['malig'][:cutoff]
Y_test = df['malig'][cutoff:]

mod = LogisticRegression(penalty='l2', C=1, class_weight='balanced')
mod.fit(X_train, Y_train)

Y_ = mod.predict(X_test)
print('confusion at p=.5 threshold:\n ', pd.crosstab(Y_, Y_test), '\n')

confusion at p=.5 threshold:
  malig    0   1
row_0         
0      265   3
1        2  80 



<H3>Adjusting Sensitivity to Avoid False Negatives</H3><br>
We can see that the model performs with a high degree of accuracy, with only five mislabeled points. Given the high cost of a false negative (a malignant tumor being classified as benign), we may be better off diagnosing all tumors with p>.25 as malignant (increasing model sensitivity). Let's see how it performs at this higher sensitivity:

In [8]:
Y_ = np.where(mod.predict_proba(X_test)[:, 1] >= .25, 1, 0)
print('confusion at p=.25 threshold:\n ', pd.crosstab(Y_, Y_test), '\n')

confusion at p=.25 threshold:
  malig    0   1
row_0         
0      254   0
1       13  83 



We can see that this is an effective way to avoid false negatives (type II errors). This adjustment in test sensitivity results in only a slightly lower overall accuracy (6 mislabeled rows, up from 5 at p=.5 threshold.)<br><br>
<H2>Interpreting the Model</H2><br>
The simple linear regression model provides the benefit of interpretable characteristics. We can look at the coefficients to see how the probability of malignancy is calculated by the model.

In [9]:
coefficients['p_delta_per_standard_deviation'] = coefficients.coefficient*pd.Series([0]+list(df[fts].std()))
coefficients.head(10)

Unnamed: 0,measurement,coefficient,p_delta_per_standard_deviation
0,intercept,[-6.05440107912],[-0.0]
1,clump_thickness,[0.274181866616],[0.77202502968]
2,unfrm_cell_size,[0.181069635919],[0.552526590062]
3,unfrm_cell_shape,[0.28718440646],[0.853487004104]
4,mrg_adhesion,[0.153864198781],[0.439340638859]
5,sing_epi_cell_size,[-0.0413967666258],[-0.0916648556471]
6,bare_nuclei,[0.399203132543],[1.44699971223]
7,bland_chrom,[0.160175031869],[0.390565071824]
8,norm_nucleoli,[0.162696093664],[0.496814305971]
9,mitosis,[0.223789469312],[0.383816382582]


In [10]:
#Look at summary of data to help interpret coefficients.
df.describe()

Unnamed: 0,patient_id,clump_thickness,unfrm_cell_size,unfrm_cell_shape,mrg_adhesion,sing_epi_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitosis,malig
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,1071704.0,4.41774,3.134478,3.207439,2.806867,3.216023,3.503577,3.437768,2.866953,1.589413,0.344778
std,617095.7,2.815741,3.051459,2.971913,2.855379,2.2143,3.62472,2.438364,3.053634,1.715078,0.475636
min,61634.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,0.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,0.0
75%,1238298.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,1.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0


The probability is simply the intercept (~-6) added to each measurement times its respective coefficient. We can see that all but one measurement (row 5, single epithelial cell size) have a positive correlation with an increase in probability of malignancy.<br><br>
All measurements contribute significantly to the probability, but a single standard deviation change in "bare nuclei" results in the highest change in probability (see 'p_delta_per_standard_deviation' column above), making this the most determinant of all of our measurements in finding the probability.<br><br>
<H2>Conclusion</H2><br>
This model is not perfect. However, it performs sufficiently well to provide sensitive diagnoses and gives us some information about what factors are most important in determing malignancy. This is a great example of the usefulness of a simple logistic regression model.