# Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

The dataset contains 961 instances of masses detected in mammograms, each described by the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density: high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

### Techniques to apply

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* And a neural network using Keras.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember some techniques such as SVM also require the input data to be normalized first.

Many techniques also have "hyperparameters" that need to be tuned. Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.
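As a rough illustration of what tuning can look like, here is a minimal sketch using scikit-learn's `GridSearchCV` with a KNN classifier. The data is a synthetic stand-in generated with `make_classification` (an assumption, not the mammogram data), and the candidate values in `param_grid` are illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for four scaled features and a binary label (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Illustrative candidate values only.
param_grid = {"n_neighbors": [3, 7, 11, 15], "weights": ["uniform", "distance"]}

# Every combination is scored with 10-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern should carry over to the other estimators by swapping in the estimator and a grid of its own hyperparameters.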




## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas DataFrame.

In [2]:
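import pandas as pd
# "?" marks missing values; header=0 consumes the file's first line as a header row, which names then replaces.
columns = ["BI_RADS", "age", "shape", "margin", "density", "severity"]
data = pd.read_csv("mammographic_masses.data.txt", header=0, names=columns, na_values=["?"])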

Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity).

In [3]:
data.head()

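|   | BI_RADS | age | shape | margin | density | severity |
|---|---------|-----|-------|--------|---------|----------|
| 0 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 1 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 2 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 3 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |
| 4 | 4.0 | 65.0 | 1.0 | NaN | 3.0 | 0 |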
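To see how the `header`, `names`, and `na_values` options interact, the sketch below reads a tiny in-memory sample instead of the real file. The three lines are the first three rows of the `head()` output written back in raw comma-separated form, and the sketch assumes the raw file has no header line of its own.

```python
import io
import pandas as pd

columns = ["BI_RADS", "age", "shape", "margin", "density", "severity"]

# Three raw lines in the file's layout; "?" marks a missing value.
raw = "4,43,1,1,?,1\n5,58,4,5,3,1\n4,28,1,1,3,0\n"

# header=None: every line is data, and names supplies the column labels.
with_header_none = pd.read_csv(
    io.StringIO(raw), header=None, names=columns, na_values=["?"]
)

# header=0: the first line is treated as a header row and replaced by names.
with_header_zero = pd.read_csv(
    io.StringIO(raw), header=0, names=columns, na_values=["?"]
)

print(len(with_header_none))  # all three lines kept as records
print(len(with_header_zero))  # first line consumed as the header
print(with_header_none["density"].isna().sum())  # "?" became NaN
```

If the file really has no header line, `header=None` keeps every record, while `header=0` silently drops the first one.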


Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [4]:
data.describe()

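|       | BI_RADS | age | shape | margin | density | severity |
|-------|---------|-----|-------|--------|---------|----------|
| count | 958.0 | 955.0 | 929.0 | 912.0 | 884.0 | 960.0 |
| mean | 4.347599 | 55.475393 | 2.721206 | 2.79386 | 2.910633 | 0.4625 |
| std | 1.783838 | 14.482917 | 1.243428 | 1.565702 | 0.380647 | 0.498852 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 45.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |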


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.

In [5]:
data.corr()

|          | BI_RADS | age | shape | margin | density | severity |
|----------|---------|-----|-------|--------|---------|----------|
| BI_RADS | 1.0 | 0.094487 | 0.185987 | 0.162731 | 0.038643 | 0.231346 |
| age | 0.094487 | 1.0 | 0.364015 | 0.410717 | 0.02876 | 0.431572 |
| shape | 0.185987 | 0.364015 | 1.0 | 0.742751 | 0.07862 | 0.563413 |
| margin | 0.162731 | 0.410717 | 0.742751 | 1.0 | 0.109121 | 0.574269 |
| density | 0.038643 | 0.02876 | 0.07862 | 0.109121 | 1.0 | 0.063774 |
| severity | 0.231346 | 0.431572 | 0.563413 | 0.574269 | 0.063774 | 1.0 |
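Note that `data.corr()` measures correlation between attribute values; it does not directly show whether rows with missing fields differ from complete rows. One possible way to look at that, sketched here on a small synthetic frame (the values and the `demo` name are made up for illustration), is to count the gaps per column and compare summary statistics of incomplete and complete rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in frame (assumption, not the mammogram data).
demo = pd.DataFrame({
    "age": rng.integers(18, 97, size=200).astype(float),
    "density": rng.integers(1, 5, size=200).astype(float),
    "severity": rng.integers(0, 2, size=200),
})
# Blank out roughly 10% of the density values at random.
demo.loc[rng.random(200) < 0.1, "density"] = np.nan

# How many values are missing in each column?
missing_per_column = demo.isnull().sum()

# Flag rows with at least one missing field, then compare the two groups.
incomplete = demo.isnull().any(axis=1)
comparison = demo.groupby(incomplete)[["age", "severity"]].mean()

print(missing_per_column)
print(comparison)
```

If the two groups have similar means, that is consistent with values missing at random; a large gap would suggest that dropping rows could bias the data.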


If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [6]:
data.dropna(inplace=True)
data.describe()

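|       | BI_RADS | age | shape | margin | density | severity |
|-------|---------|-----|-------|--------|---------|----------|
| count | 829.0 | 829.0 | 829.0 | 829.0 | 829.0 | 829.0 |
| mean | 4.393245 | 55.768396 | 2.781665 | 2.810615 | 2.915561 | 0.484922 |
| std | 1.889394 | 14.675456 | 1.243088 | 1.566276 | 0.351136 | 0.500074 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 46.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |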


Next you'll need to convert the Pandas DataFrame into NumPy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [7]:
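feature_names = ["age", "shape", "margin", "density"]
feature_data = data[feature_names].values
class_data = data[["severity"]].values  # column vector; flattened with .ravel() where a 1-D array is needed
feature_data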



array([[58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       [57.,  1.,  5.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [8]:
from sklearn import preprocessing
# Fit the scaler on the feature matrix, then rescale each column to zero mean and unit variance.
scaler = preprocessing.StandardScaler().fit(feature_data)
features_scaled = scaler.transform(feature_data)
features_scaled

array([[ 0.15215552,  0.98067959,  1.39867207,  0.24061945],
       [-1.89330809, -1.43412253, -1.15669795,  0.24061945],
       [ 0.0839734 , -1.43412253,  1.39867207,  0.24061945],
       ...,
       [ 0.56124824,  0.98067959,  1.39867207,  0.24061945],
       [ 0.69761248,  0.98067959,  1.39867207,  0.24061945],
       [ 0.424884  ,  0.17574555,  0.12098706,  0.24061945]])
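For reference, `StandardScaler` replaces each value with its z-score, (x - column mean) / column standard deviation. The sketch below checks this by hand on the age, shape, and margin columns of the six feature rows printed earlier in this section; because the statistics come from six rows only, the resulting numbers will not match the full-dataset output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# age, shape, margin for the six rows shown in the feature_data output.
rows = np.array([
    [58, 4, 5],
    [28, 1, 1],
    [57, 1, 5],
    [64, 4, 5],
    [66, 4, 5],
    [62, 3, 3],
], dtype=float)

scaled = StandardScaler().fit_transform(rows)

# Manual z-score using the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (rows - rows.mean(axis=0)) / rows.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```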

## Decision Trees


In [9]:
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows for testing; random_state fixes the split for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(features_scaled, class_data, test_size=0.25, random_state=1)

Now create a DecisionTreeClassifier and fit it to your training data.

In [10]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

Use the fitted tree to predict labels for the test set.

In [11]:
predicted_labels=clf.predict(X_test)
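The fitted tree itself can also be inspected. A minimal sketch, assuming a scikit-learn version that provides `sklearn.tree.export_text`, is shown below; it trains a shallow tree on synthetic stand-in data (not the mammogram data) and prints its decision rules as text. The `max_depth=3` limit is only there to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["age", "shape", "margin", "density"]

# Synthetic stand-in for the four scaled features (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

# A shallow tree keeps the printed rules readable.
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

# Render the learned splits as indented text, one line per node.
tree_text = export_text(demo_tree, feature_names=feature_names)
print(tree_text)
```

With the notebook's own objects, the equivalent call would be `export_text(clf, feature_names=[...])` on the fitted `clf`; `sklearn.tree.plot_tree` offers a graphical alternative when matplotlib is available.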

Measure the accuracy of the resulting decision tree model using your test data.

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predicted_labels)

0.7451923076923077

Now instead of a single train/test split, use K-Fold cross validation to get a better measure of your model's accuracy (K=10). 

In [13]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, features_scaled, class_data.ravel(), cv=10)
scores.mean()

0.7442109903026741
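To make explicit what `cross_val_score` does with `cv=10` for a classifier, the sketch below spells out the loop by hand: the data is split into 10 stratified folds, a fresh copy of the model is trained on 9 of them, and it is scored on the held-out fold. The data is again a synthetic stand-in, and the tree's `random_state` is fixed so both approaches produce identical scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = DecisionTreeClassifier(random_state=0)

# Manual 10-fold loop: train on nine folds, score accuracy on the tenth.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The library helper performs the same procedure in one call.
library_scores = cross_val_score(model, X_demo, y_demo, cv=10)

print(np.mean(manual_scores), library_scores.mean())
```

Averaging over ten held-out folds gives a steadier estimate than a single train/test split, which depends on which rows happen to land in the test set.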

## Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier
clf= RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train,y_train.ravel())
predicted_vals=clf.predict(X_test)
accuracy_score(y_test,predicted_vals)


0.7836538461538461

## SVM



In [15]:
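from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Linear-kernel SVM; gamma is not used by the linear kernel, so it is left at its default.
clf = make_pipeline(StandardScaler(), SVC(kernel='linear'))
clf.fit(X_train, y_train.ravel())  # ravel() gives the 1-D label array that scikit-learn expects
svm_scores = clf.predict(X_test)  # predicted labels for the test set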


In [16]:
accuracy_score(y_test,svm_scores)

0.7644230769230769

## KNN


In [17]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=10)
cv_scores = cross_val_score(neigh, features_scaled, class_data.ravel(), cv=10)
cv_scores.mean()

0.7912136350279165

Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [18]:
from sklearn import neighbors
for n in range(1, 51):  # K = 1 through 50 inclusive
    clf = neighbors.KNeighborsClassifier(n_neighbors=n)
    cv_scores = cross_val_score(clf, features_scaled, class_data.ravel(), cv=10)
    print(n, cv_scores.mean())

1 0.717660887452248
2 0.6995739053776081
3 0.7574199235968263
4 0.7344989714957391
5 0.7767411107846017
6 0.7731266529532765
7 0.7936526594181605
8 0.7791507493388188
9 0.7900235086688217
10 0.7912136350279165
11 0.7972818101674992
12 0.7840287981193066
13 0.784014105201293
14 0.7779753158977373
15 0.7791801351748457
16 0.7743755509844255
17 0.7767851895386423
18 0.7731707317073171
19 0.7876432559506317
20 0.7840141052012929
21 0.7864531295915369
22 0.7804143402879812
23 0.7804143402879812
24 0.7792095210108728
25 0.7852776961504555
26 0.7852483103144284
27 0.7828533646782251
28 0.7864531295915369
29 0.7852483103144284
30 0.7900675874228622
31 0.7876579488686454
32 0.7876579488686453
33 0.7864678225095504
34 0.785263003232442
35 0.785263003232442
36 0.7888627681457537
37 0.7888921539817807
38 0.7864678225095505
39 0.785263003232442
40 0.785263003232442
41 0.7816485454011166
42 0.7828533646782251
43 0.7816485454011166
44 0.7816485454011166
45 0.7828533646782251
46 0.7840581839553336
47 
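In the scores printed above, the highest mean accuracy is at K=11 (about 0.797). Rather than scanning the printout by eye, the scores can be collected in a dictionary and the best K picked programmatically. The sketch below shows the pattern on synthetic stand-in data, so its numbers are unrelated to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Mean 10-fold accuracy for each K from 1 to 50.
mean_scores = {}
for k in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=10).mean()

# Pick the K with the highest mean accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 4))
```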

## Naive Bayes



In [19]:
from sklearn.naive_bayes import MultinomialNB

# MultinomialNB requires non-negative features, so rescale every column to the [0, 1] range.
scaler = preprocessing.MinMaxScaler()
all_features_minmax = scaler.fit_transform(features_scaled)

clf = MultinomialNB()
cv_scores = cross_val_score(clf, all_features_minmax, class_data.ravel(), cv=10)

cv_scores.mean()

0.7851895386423743

## Revisiting SVM



In [20]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='rbf', C=C)
cv_scores = cross_val_score(svc, features_scaled, class_data.ravel(), cv=10)
cv_scores.mean()

0.8033352923890685

In [21]:
C = 1.0
svc = svm.SVC(kernel='sigmoid', C=C)
cv_scores = cross_val_score(svc, features_scaled, class_data.ravel(), cv=10)
cv_scores.mean()

0.7395092565383485

In [22]:
C = 1.0
svc = svm.SVC(kernel='poly', C=C)
cv_scores = cross_val_score(svc, features_scaled, class_data.ravel(), cv=10)
cv_scores.mean()

0.7912577137819572

## Logistic Regression



In [23]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0)
lr_scores= cross_val_score(clf, features_scaled, class_data.ravel(), cv=10)
lr_scores.mean()

0.8069791360564208

## Neural Networks



In [24]:
from keras.models import Sequential
from keras.layers import Dense
# Note: keras.wrappers.scikit_learn exists only in older Keras/TensorFlow releases;
# the separate SciKeras package provides a replacement KerasClassifier wrapper.
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold


In [26]:
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(4, input_dim=4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [30]:
# evaluate model with standardized dataset
estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, features_scaled, class_data.ravel(), cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 80.69% (3.40%)


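Logistic regression (80.70% mean accuracy under 10-fold cross-validation) and the Keras neural network (80.69%) finish in a virtual tie, narrowly ahead of the RBF-kernel SVM (80.33%).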