### Exercise mammography
Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsies resulting from mammogram interpretation means that approximately 70% of biopsies are unnecessary and have benign outcomes. To reduce the high number of unnecessary breast biopsies, several computer-aided diagnosis (CAD) systems have been proposed in recent years. These systems help physicians decide whether to perform a breast biopsy on a suspicious lesion seen in a mammogram or to perform a short-term follow-up examination instead.  
 
This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes, together with the ground truth (the severity field), for 516 benign and 445 malignant masses that were identified on full-field digital mammograms collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006.  
 
Each instance has an associated BI-RADS assessment ranging from 1 (definitely benign) to 5 (highly suggestive of malignancy), assigned in a double-review process by physicians. Assuming that all cases with a BI-RADS assessment greater than or equal to a given value (varying from 1 to 5) are malignant and the other cases benign, sensitivities and associated specificities can be calculated. These indicate how well a CAD system performs compared to the radiologists.
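The thresholding described above can be sketched as follows. This is only an illustration: the `birads` and `severity` lists are made-up values, not rows from the data set, and `sensitivity_specificity` is a hypothetical helper name.

```python
# Made-up illustration values, NOT taken from the data set.
birads = [5, 4, 3, 4, 5, 2, 4, 5]      # radiologist's BI-RADS assessment
severity = [1, 0, 0, 1, 1, 0, 0, 1]    # ground truth: 0 = benign, 1 = malignant


def sensitivity_specificity(birads, severity, threshold):
    """Treat every case with BI-RADS >= threshold as malignant."""
    predicted = [1 if b >= threshold else 0 for b in birads]
    pairs = list(zip(predicted, severity))
    tp = sum(p == 1 and s == 1 for p, s in pairs)
    tn = sum(p == 0 and s == 0 for p, s in pairs)
    fp = sum(p == 1 and s == 0 for p, s in pairs)
    fn = sum(p == 0 and s == 1 for p, s in pairs)
    sensitivity = tp / (tp + fn)   # share of malignant cases that are flagged
    specificity = tn / (tn + fp)   # share of benign cases that are cleared
    return sensitivity, specificity


for threshold in range(1, 6):
    sens, spec = sensitivity_specificity(birads, severity, threshold)
    print(f"threshold {threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity: with these toy values, a threshold of 4 flags every malignant case but clears only half of the benign ones, while a threshold of 5 clears all benign cases but misses one malignant case.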
  
Number of Instances: 961  
Number of Attributes: 6 (1 goal field, 1 non-predictive, 4 predictive attributes)  
Attribute Information:

1. BI-RADS assessment: 1 to 5 (ordinal) = assessment by radiologist:
   - 0: incomplete
   - 1: negative
   - 2: benign findings
   - 3: probably benign
   - 4: suspicious abnormality
   - 5: highly suspicious of malignancy
2. Age: patient's age in years (integer)
3. Shape: mass shape (nominal):
   - round=1
   - oval=2
   - lobular=3
   - irregular=4
4. Margin: mass margin (nominal):
   - circumscribed=1
   - microlobulated=2
   - obscured=3
   - ill-defined=4
   - spiculated=5
5. Density: mass density (ordinal):
   - high=1
   - iso=2
   - low=3
   - fat-containing=4
6. Severity: benign=0 or malignant=1 (binominal)

Missing Attribute Values: Yes

- BI-RADS assessment: 2
- Age: 5
- Shape: 31
- Margin: 48
- Density: 76
- Severity: 0

Class Distribution: benign: 516; malignant: 445
 
See https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass for the source of the data set.
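Because missing values are encoded as `?`, one possible way to handle them is to let pandas turn them into `NaN` while reading the file. The sketch below is an assumption-laden illustration: it parses a small inline sample (five rows in the column layout this notebook displays) instead of the real file, and `sample_csv`, `sample` and `complete` are names made up for the example.

```python
import io

import pandas as pd

# Small inline sample standing in for the real CSV file (assumption: same column layout).
sample_csv = """BIRADS,Age,Shape,Margin,Density,Severity
5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
"""

# na_values="?" makes pandas read every "?" as NaN, so the columns stay numeric.
sample = pd.read_csv(io.StringIO(sample_csv), na_values="?")

# Count the missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows without any missing value.
complete = sample.dropna()
print(len(complete))
```

Reading the file this way keeps the columns numeric, so later comparisons such as "BI-RADS of 4 or higher" operate on numbers rather than on strings.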


In [346]:
# Import pandas and load the mammography data set
import pandas as pd
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/mammography.csv'
mammo = pd.read_csv(url)

In [347]:
mammo

Unnamed: 0,BIRADS,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1
...,...,...,...,...,...,...
956,4,47,2,1,3,0
957,4,56,4,5,3,1
958,4,64,4,5,3,0
959,5,66,4,5,3,1


In [348]:
# Calculate how often breast cancer occurs on average in this data set (0.46…)
mammo['Severity'].mean()

0.4630593132154006

In [349]:
# Calculate how many of the samples are benign and how many are malignant (benign: 516, malignant: 445)
mammo.groupby('Severity')['BIRADS'].count()

Severity
0    516
1    445
Name: BIRADS, dtype: int64

In [350]:
# Missing attributes are indicated as "?". Remove all rows that contain a "?".
cols = ['BIRADS', 'Age', 'Shape', 'Margin', 'Density']
# Keep only rows where none of these columns is "?"; copy() avoids a
# SettingWithCopyWarning when new columns are added later on.
mammo = mammo[(mammo[cols] != '?').all(axis=1)].copy()

In [351]:
mammo.count()

BIRADS      830
Age         830
Shape       830
Margin      830
Density     830
Severity    830
dtype: int64

In [352]:
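# Calculate the percentage of the BI-RADS assessments that were correct, assuming a BI-RADS assessment of 4 or higher is malignant. (51 %)
import numpy as np
# BIRADS is still stored as strings because of the "?" markers, so convert it
# to numbers before comparing; a string comparison with '4' only works by accident.
mammo['BIRADS1'] = np.where(pd.to_numeric(mammo['BIRADS']) >= 4, 1, 0)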

In [353]:
mammo

Unnamed: 0,BIRADS,Age,Shape,Margin,Density,Severity,BIRADS1
0,5,67,3,5,3,1,1
2,5,58,4,5,3,1,1
3,4,28,1,1,3,0,1
8,5,57,1,5,3,1,1
10,5,76,1,4,3,1,1
...,...,...,...,...,...,...,...
956,4,47,2,1,3,0,1
957,4,56,4,5,3,1,1
958,4,64,4,5,3,0,1
959,5,66,4,5,3,1,1


In [354]:
mammo['CORRECT'] = np.where(mammo['BIRADS1'] == mammo['Severity'],1,0)

In [355]:
mammo

Unnamed: 0,BIRADS,Age,Shape,Margin,Density,Severity,BIRADS1,CORRECT
0,5,67,3,5,3,1,1,1
2,5,58,4,5,3,1,1,1
3,4,28,1,1,3,0,1,0
8,5,57,1,5,3,1,1,1
10,5,76,1,4,3,1,1,1
...,...,...,...,...,...,...,...,...
956,4,47,2,1,3,0,1,0
957,4,56,4,5,3,1,1,1
958,4,64,4,5,3,0,1,0
959,5,66,4,5,3,1,1,1


In [356]:
mammo['CORRECT'].mean()

0.5120481927710844

In [357]:
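# Use random forest classification to determine a model for predicting the severity (benign/malignant)
# of new mammography results. Determine the accuracy of the classifier (+/- 79%)
# Determine the optimal number of trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Features: everything except the BI-RADS columns, the helper column and the target.
# astype(int) converts the values that were read as strings (because of "?") to numbers.
X = mammo.drop(columns=['BIRADS', 'BIRADS1', 'CORRECT', 'Severity']).astype(int)
y = mammo['Severity']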

In [358]:
# Hold out 30% of the data as the final test set.
X_remainder, X_test, y_remainder, y_test = train_test_split(X,y,test_size=0.30)

best_accuracy = 0
best_trees = 0

# Try forests of 50, 100, ..., 500 trees. Each candidate gets a fresh random
# train/validation split, so the comparison between candidates is noisy.
for trees in range(50,550,50):
    X_train, X_validation, y_train, y_validation = train_test_split(X_remainder,y_remainder,test_size=0.30)
    model = RandomForestClassifier(n_estimators=trees)
    model.fit(X_train, y_train)
    y_validation2 = model.predict(X_validation)
    accuracy = accuracy_score(y_validation, y_validation2)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_trees = trees
        # Despite its name, best_validation holds the best model's predictions on the test set.
        best_validation = model.predict(X_test)

print('Optimal number of trees = % s' %(best_trees))
print('Accuracy on validation set = % 3.2f' % (best_accuracy))
accuracyOnTestSet = accuracy_score(y_test, best_validation)
print('Accuracy on test set = % 3.2f' % (accuracyOnTestSet))

Optimal number of trees = 100
Accuracy on validation set =  0.78
Accuracy on test set =  0.75
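Because the loop above draws a new random split for every candidate, the reported optimum can change from run to run. One possible alternative, sketched here under clearly stated assumptions, is k-fold cross-validation with scikit-learn's `GridSearchCV`. The data below is synthetic (`make_classification`), not the mammography data, and the grid of 50, 100 and 150 trees is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 300 samples with 4 features,
# mirroring the 4 predictive attributes used in this exercise.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Every candidate number of trees is scored on the same 5 folds,
# which makes the candidates directly comparable.
param_grid = {"n_estimators": [50, 100, 150]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best number of trees:", search.best_params_["n_estimators"])
print("Mean cross-validated accuracy: %.2f" % search.best_score_)
```

With the real data, `X_demo` and `y_demo` would be replaced by the training portion (`X_remainder`, `y_remainder`), and the held-out test set would still be used only once, for the final accuracy estimate.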


In [359]:
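# Which two factors are the most decisive for malignant breast tumors? (Age and margin)
# Note: `model` is the forest fitted in the last loop iteration (500 trees),
# which is not necessarily the one with the best validation accuracy.
pd.DataFrame(model.feature_importances_,columns=['Importance'],index=X_train.columns).sort_values(by='Importance',ascending=False)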

Unnamed: 0,Importance
Age,0.448168
Margin,0.292037
Shape,0.236111
Density,0.023684


In [365]:
# Determine the false negative rate and the false positive rate.
# What proportion of the real cancer cases are we missing with this test? (+/- 25%)
# best_validation holds the predictions of the best model on the test set.
results = pd.DataFrame({'true':y_test,'estimated':best_validation})

In [366]:
results['TP'] = np.where((results['true'] == 1) & (results['estimated'] == 1),1,0)
results['TN'] = np.where((results['true'] == 0) & (results['estimated'] == 0),1,0)
results['FP'] = np.where((results['true'] == 0) & (results['estimated'] == 1),1,0)
results['FN'] = np.where((results['true'] == 1) & (results['estimated'] == 0),1,0)

In [367]:
results.head()

Unnamed: 0,true,estimated,TP,TN,FP,FN
274,1,0,0,0,0,1
864,1,1,1,0,0,0
589,1,1,1,0,0,0
893,0,0,0,1,0,0
313,1,1,1,0,0,0


In [368]:
FPrate = results['FP'].sum()/(results['FP'].sum() + results['TN'].sum())
print(FPrate)

0.344


In [369]:
FNrate = results['FN'].sum()/(results['FN'].sum() + results['TP'].sum())
print(FNrate)

0.16129032258064516
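The same two rates can also be read from a confusion matrix. The sketch below uses `pd.crosstab` on made-up labels (not the predictions from this notebook) to show how the four counts map onto the false positive rate FP / (FP + TN) and the false negative rate FN / (FN + TP).

```python
import pandas as pd

# Made-up labels for illustration only: 0 = benign, 1 = malignant.
true = pd.Series([1, 1, 1, 0, 1, 0, 0, 1, 0, 0], name="true")
estimated = pd.Series([0, 1, 1, 0, 1, 1, 0, 1, 0, 0], name="estimated")

# Rows are the true classes, columns are the predicted classes.
confusion = pd.crosstab(true, estimated)
print(confusion)

tn, fp = int(confusion.loc[0, 0]), int(confusion.loc[0, 1])
fn, tp = int(confusion.loc[1, 0]), int(confusion.loc[1, 1])

# False positive rate: benign cases wrongly flagged as malignant.
fp_rate = fp / (fp + tn)
# False negative rate: malignant cases that the classifier misses.
fn_rate = fn / (fn + tp)
print(fp_rate, fn_rate)
```

In a screening context the false negative rate is usually the more critical number, because it is the share of real cancer cases that the test fails to detect.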
