# Final Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

This data set contains 961 instances of masses detected in mammograms, each described by the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery results from false positives in mammogram readings. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Your assignment

Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* And, as a bonus challenge, a neural network using Keras.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember some techniques such as SVM also require the input data to be normalized first.

Many techniques also have "hyperparameters" that need to be tuned. Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?

Below I've set up an outline of a notebook for this project, with some guidance and hints. If you're up for a real challenge, try doing this project from scratch in a new, clean notebook!


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [119]:
import pandas as pd
path_file = 'mammographic_masses.data.txt'
inputData = pd.read_csv(path_file)
inputData.head()

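|   | 5 | 67 | 3 | 5.1 | 3.1 | 1 |
|---|---|----|---|-----|-----|---|
| 0 | 4 | 43 | 1 | 1 | ? | 1 |
| 1 | 5 | 58 | 4 | 5 | 3 | 1 |
| 2 | 4 | 28 | 1 | 1 | 3 | 0 |
| 3 | 5 | 74 | 1 | 5 | ? | 1 |
| 4 | 4 | 65 | 1 | ? | 3 | 0 |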


Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):

In [121]:
inputData = pd.read_csv(path_file, na_values='?',
                        names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'])
inputData.head()


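|   | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---------|-----|-------|--------|---------|----------|
| 0 | 5.0 | 67.0 | 3.0 | 5.0 | 3.0 | 1 |
| 1 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 2 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 3 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 4 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |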


Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [122]:
inputData.describe()

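|       | BI-RADS | Age | Shape | Margin | Density | Severity |
|-------|---------|-----|-------|--------|---------|----------|
| count | 959.0 | 956.0 | 930.0 | 913.0 | 885.0 | 961.0 |
| mean  | 4.348279 | 55.487448 | 2.721505 | 2.796276 | 2.910734 | 0.463059 |
| std   | 1.783031 | 14.480131 | 1.242792 | 1.566546 | 0.380444 | 0.498893 |
| min   | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25%   | 4.0 | 45.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50%   | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75%   | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max   | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |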


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Is there any apparent pattern in which rows have missing fields? If there were, we'd have to go back and try to fill that data in.

In [123]:
# Show every row that is missing at least one feature value
inputData.loc[inputData['Age'].isnull() |
              inputData['Shape'].isnull() |
              inputData['Margin'].isnull() |
              inputData['Density'].isnull()]

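|     | BI-RADS | Age | Shape | Margin | Density | Severity |
|-----|---------|-----|-------|--------|---------|----------|
| 1   | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 4   | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |
| 5   | 4.0 | 65.0 | 1.0 | NaN | 3.0 | 0 |
| 6   | 4.0 | 70.0 | NaN | NaN | 3.0 | 0 |
| 7   | 5.0 | 42.0 | 1.0 | NaN | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 778 | 4.0 | 60.0 | NaN | 4.0 | 3.0 | 0 |
| 819 | 4.0 | 35.0 | 3.0 | NaN | 2.0 | 0 |
| 824 | 6.0 | 40.0 | NaN | 3.0 | 4.0 | 1 |
| 884 | 5.0 | NaN | 4.0 | 4.0 | 3.0 | 1 |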
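
Eyeballing the rows only goes so far. Here's a minimal sketch of one way to put a number on it: compare the share of incomplete rows within each severity class. The `demo` frame below is a hypothetical hand-made stand-in, not the real data; in the notebook you would run the same two steps on `inputData` with all four feature columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for inputData; swap in the real dataframe.
demo = pd.DataFrame({
    "Age":      [67, 43, np.nan, 28, 74, 65],
    "Density":  [3, np.nan, 3, 3, np.nan, 3],
    "Severity": [1, 1, 1, 0, 1, 0],
})

# Flag rows that have at least one missing feature value.
demo["has_missing"] = demo[["Age", "Density"]].isnull().any(axis=1)

# Share of incomplete rows within each class. Similar shares suggest
# that dropping incomplete rows will not skew the class balance much.
missing_rate_by_class = demo.groupby("Severity")["has_missing"].mean()
print(missing_rate_by_class)
```

If the two rates differ a lot on the real data, that's a hint that dropping rows could bias the model toward one class.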


If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [124]:
inputData.dropna(inplace=True)
inputData.describe()

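|       | BI-RADS | Age | Shape | Margin | Density | Severity |
|-------|---------|-----|-------|--------|---------|----------|
| count | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 |
| mean  | 4.393976 | 55.781928 | 2.781928 | 2.813253 | 2.915663 | 0.485542 |
| std   | 1.888371 | 14.671782 | 1.242361 | 1.567175 | 0.350936 | 0.500092 |
| min   | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25%   | 4.0 | 46.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50%   | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75%   | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max   | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |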
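
Notice that BI-RADS still has a minimum of 0 and a maximum of 55, even though the attribute list says it runs from 1 to 5. Here's a rough sketch of how you might hunt for erroneous values like that: check each column against its documented range. The ranges come from the attribute list at the top of this notebook; the `demo` rows are hypothetical, so substitute `inputData` in practice.

```python
import pandas as pd

# Valid ranges taken from the attribute list at the top of this notebook.
valid_ranges = {
    "BI-RADS": (1, 5),
    "Shape": (1, 4),
    "Margin": (1, 5),
    "Density": (1, 4),
}

# Hypothetical sample rows; use inputData in the real notebook.
demo = pd.DataFrame({
    "BI-RADS": [5, 4, 55, 0],
    "Shape":   [3, 1, 4, 2],
    "Margin":  [5, 1, 5, 3],
    "Density": [3, 3, 3, 4],
})

# Count values that fall outside the documented range of each column.
# Series.between is inclusive on both ends by default.
out_of_range = {
    col: int((~demo[col].between(low, high)).sum())
    for col, (low, high) in valid_ranges.items()
}
print(out_of_range)
```

Run this after dropping missing rows, since a NaN is never "between" two bounds and would be counted as out of range.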


Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [125]:
# Convert the pandas dataframe to numpy arrays.
# Note: this list keeps BI-RADS, although the text above says to use only
# age, shape, margin, and density; remove it to follow the assignment as written.
features = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density']
# Feature columns as a numpy array
all_feature = inputData[features].values

# Severity column (the class labels) as a numpy array
all_classes = inputData['Severity'].values

all_feature

array([[ 5., 67.,  3.,  5.,  3.],
       [ 5., 58.,  4.,  5.,  3.],
       [ 4., 28.,  1.,  1.,  3.],
       ...,
       [ 4., 64.,  4.,  5.,  3.],
       [ 5., 66.,  4.,  5.,  3.],
       [ 4., 62.,  3.,  3.,  3.]])

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [126]:
from sklearn.preprocessing import StandardScaler

# Standardize each feature column to zero mean and unit variance
scaler = StandardScaler()
scaledFeature = scaler.fit_transform(all_feature)
scaledFeature

array([[ 0.3211177 ,  0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.3211177 ,  0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-0.20875843, -1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [-0.20875843,  0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.3211177 ,  0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [-0.20875843,  0.42406719,  0.17563638,  0.11923341,  0.24046607]])
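
If you're wondering what StandardScaler actually did to the numbers, here's a small sketch that reproduces it by hand: subtract each column's mean, then divide by that column's standard deviation. The rows in `X` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature rows (age, shape, margin, density).
X = np.array([[67., 3., 5., 3.],
              [58., 4., 5., 3.],
              [28., 1., 1., 3.],
              [43., 1., 1., 2.]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0, which is what StandardScaler uses).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so a feature measured in years (age) no longer dominates features coded 1 to 5.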

## Decision Trees

Before moving to K-Fold cross validation and random forests, start by creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.

In [127]:
#  creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaledFeature, all_classes, test_size=0.25, random_state=42)

Now create a DecisionTreeClassifier and fit it to your training data.

In [54]:
#  creating a DecisionTreeClassifier and fit it to your training data
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

Display the resulting decision tree.

In [55]:
# Display the decision tree (requires the GraphViz executables on the PATH)

from io import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=features)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

InvocationException: GraphViz's executables not found
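
If GraphViz isn't installed, as in the error above, one alternative is scikit-learn's `export_text`, which prints the tree as indented plain text and needs no external executables. This is a sketch on synthetic stand-in data from `make_classification`, not the mammogram data; the feature names are just labels borrowed from this notebook, and the shallow `max_depth=2` is only there to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features; not the mammogram data.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
feature_names = ["Age", "Shape", "Margin", "Density"]

# A shallow tree keeps the printed rules readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree's split rules as indented plain text.
tree_text = export_text(small_tree, feature_names=feature_names)
print(tree_text)
```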

Measure the accuracy of the resulting decision tree model using your test data.

In [146]:
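# Compute the decision tree's accuracy.
# Note: this scores the training split; pass X_test, y_test instead to get
# the held-out accuracy the prompt asks for.
clf.score(X_train,y_train)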

0.837620578778135
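
The score above is measured on the same rows the tree was trained on, which tends to flatter the model. Here's a minimal sketch of the difference between training and held-out accuracy, using synthetic noisy data from `make_classification` as a stand-in (an assumption for illustration, not the mammogram data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise; not the mammogram data.
X, y = make_classification(n_samples=800, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

# An unconstrained tree can memorize its training rows.
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = demo_tree.score(X_tr, y_tr)  # accuracy on rows the tree has seen
test_acc = demo_tree.score(X_te, y_te)   # accuracy on held-out rows
print(train_acc, test_acc)
```

The held-out number is the one to report when comparing models, which is why the assignment asks for test data and K-Fold cross validation.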

Now instead of a single train/test split, use K-Fold cross validation to get a better measure of your model's accuracy (K=10). Hint: use model_selection.cross_val_score

In [147]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(clf,X_train,y_train, cv=10))

[0.73015873 0.80952381 0.72580645 0.70967742 0.67741935 0.77419355
 0.79032258 0.79032258 0.72580645 0.77419355]
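
Ten separate numbers are hard to compare across models, so it helps to boil them down to a mean. The sketch below also wraps the scaler and the classifier in a pipeline, so the scaler is re-fit on each training fold instead of seeing the whole data set up front. The data is a synthetic stand-in from `make_classification`, and the pipeline is my suggestion rather than something the assignment requires.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the mammogram data.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Scaling happens inside each fold, so test folds never influence the scaler.
pipeline = make_pipeline(StandardScaler(),
                         DecisionTreeClassifier(random_state=0))

fold_scores = cross_val_score(pipeline, X, y, cv=10)
print(fold_scores.mean(), fold_scores.std())
```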


Now try a RandomForestClassifier instead. Does it perform better?

In [148]:
# using random forest 
from sklearn.ensemble import RandomForestClassifier
RandomForest = RandomForestClassifier(max_depth=2, random_state=0)
RandomForest.fit(X_train,y_train)
RandomForest.score(X_train,y_train)

0.8488745980707395

## SVM

Next try using svm.SVC with a linear kernel. How does it compare to the decision tree?

In [149]:
from sklearn.svm import SVC

C = 1.0
svc = SVC(kernel='linear', C=C).fit(X_train,y_train)


In [150]:
svc.score(X_train,y_train)

0.8279742765273312

## KNN
How about K-Nearest-Neighbors? Hint: use neighbors.KNeighborsClassifier - it's a lot easier than implementing KNN from scratch like we did earlier in the course. Start with a K of 10. K is an example of a hyperparameter - a parameter on the model itself which may need to be tuned for best results on your particular data set.

In [151]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=10)
KNN.fit(X_train,y_train)
KNN.score(X_train,y_train)

0.8215434083601286

Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [152]:
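# Run KNN for a range of neighbor counts and print the accuracy for each K.
# range(1, 50) covers K = 1..49; use range(1, 51) to include K = 50.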
def loopingK():
    for i in range(1,50):
        KNN = KNeighborsClassifier(n_neighbors=i)
        KNN.fit(X_train,y_train)
        print(i, KNN.score(X_train,y_train)) 

loopingK()

1 0.9228295819935691
2 0.8456591639871383
3 0.8665594855305466
4 0.8360128617363344
5 0.8392282958199357
6 0.8344051446945338
7 0.8263665594855305
8 0.819935691318328
9 0.8215434083601286
10 0.8215434083601286
11 0.819935691318328
12 0.8247588424437299
13 0.8183279742765274
14 0.819935691318328
15 0.8135048231511254
16 0.8070739549839229
17 0.8086816720257235
18 0.8086816720257235
19 0.8118971061093248
20 0.819935691318328
21 0.815112540192926
22 0.815112540192926
23 0.8135048231511254
24 0.8038585209003215
25 0.8086816720257235
26 0.8086816720257235
27 0.8022508038585209
28 0.8038585209003215
29 0.8086816720257235
30 0.8070739549839229
31 0.8006430868167203
32 0.8022508038585209
33 0.8022508038585209
34 0.797427652733119
35 0.7990353697749196
36 0.7990353697749196
37 0.7958199356913184
38 0.7958199356913184
39 0.797427652733119
40 0.797427652733119
41 0.8038585209003215
42 0.8038585209003215
43 0.8022508038585209
44 0.8006430868167203
45 0.8022508038585209
46 0.8022508038585209
47 0.8
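
K=1 looks like the winner above, but those scores are measured on the training rows, where with K=1 every point is its own nearest neighbor. A fairer way to pick K is to cross-validate each candidate. Here's a sketch on synthetic stand-in data from `make_classification` (an assumption for illustration); the dictionary name `cv_by_k` is my own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the mammogram data.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Mean 10-fold accuracy for every K from 1 to 50 inclusive.
cv_by_k = {}
for k in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_by_k[k] = cross_val_score(knn, X, y, cv=10).mean()

# K with the highest mean cross-validated accuracy.
best_k = max(cv_by_k, key=cv_by_k.get)
print(best_k, cv_by_k[best_k])
```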

## Naive Bayes

Now try naive_bayes.MultinomialNB. How does its accuracy stack up? Hint: you'll need to use MinMaxScaler to get the features in the range MultinomialNB requires.

In [153]:
# GaussianNB is used here in place of MultinomialNB: it accepts the
# standardized (possibly negative) features directly, so no MinMaxScaler is needed.
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb.score(X_train, y_train)

0.797427652733119
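
For the MultinomialNB variant that the prompt describes, here's a minimal sketch. MultinomialNB rejects negative feature values when fitting, so the features are first squeezed into the range 0 to 1 with MinMaxScaler. The data is a synthetic stand-in from `make_classification`, not the mammogram data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (contains negative values); not the mammogram data.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Rescale every column to [0, 1] so MultinomialNB will accept it.
X_scaled = MinMaxScaler().fit_transform(X)

nb_scores = cross_val_score(MultinomialNB(), X_scaled, y, cv=10)
print(nb_scores.mean())
```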

## Revisiting SVM

svm.SVC may perform differently with different kernels. The choice of kernel is an example of a "hyperparameter." Try the rbf, sigmoid, and poly kernels and see what the best-performing kernel is. Do we have a new winner?

In [154]:
# svc with rbf
svc = SVC(kernel='rbf', C=C).fit(X_train,y_train)
svc.score(X_train,y_train)

0.837620578778135

In [155]:
svc = SVC(kernel='sigmoid', C=C).fit(X_train,y_train)
svc.score(X_train,y_train)

0.747588424437299

In [157]:
svc = SVC(kernel='poly', C=8).fit(X_train,y_train)
svc.score(X_train,y_train)

0.8327974276527331
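
Trying kernels one cell at a time works, but scikit-learn's `GridSearchCV` can search kernels and `C` values together and score each combination with cross validation. This is a sketch on synthetic stand-in data from `make_classification`; the particular `C` values in the grid are arbitrary choices of mine, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; not the mammogram data.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X = StandardScaler().fit_transform(X)

# Every kernel is paired with every C value (12 combinations in total).
param_grid = {
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
    "C": [0.1, 1.0, 10.0],
}

# Each combination is scored with 10-fold cross validation.
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```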

## Logistic Regression

We've tried all these fancy techniques, but fundamentally this is just a binary classification problem. Try Logistic Regression, which is a simple way to tackle this sort of thing.

In [158]:
from sklearn.linear_model import LogisticRegression
logRegress= LogisticRegression(random_state=0)
logRegress.fit(X_train,y_train)
logRegress.score(X_train,y_train)

0.8135048231511254

## Neural Networks

As a bonus challenge, let's see if an artificial neural network can do even better. You can use Keras to set up a neural network with 1 binary output neuron and see how it performs. Don't be afraid to run a large number of epochs to train the model if necessary.

In [93]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout 
from tensorflow.keras.optimizers import RMSprop # optimization function


In [159]:
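# Set up a small feed-forward network: 5 inputs -> 5 -> 10 -> 1 output
model = Sequential()
model.add(Dense(5, activation = 'relu', input_shape=(5,)))
model.add(Dense(10, activation='relu'))
# Note: 'sigmoid' is the usual activation for a single binary output trained
# with binary_crossentropy; 'relu' is kept here to match the recorded run below.
model.add(Dense(1, activation='relu'))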
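
A word on the output layer: binary cross-entropy expects the network's output to be a probability between 0 and 1. A sigmoid always delivers that, while a relu output can be exactly 0 or larger than 1. Here's a small numpy sketch of the difference; the `logits` values are hypothetical raw outputs, and the two helper functions are written out by hand for illustration rather than taken from Keras.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p):
    """Mean binary cross-entropy for predicted probabilities p."""
    return float(-np.mean(y_true * np.log(p)
                          + (1 - y_true) * np.log(1 - p)))

logits = np.array([-2.0, 0.5, 3.0])  # hypothetical raw outputs
y_true = np.array([0.0, 1.0, 1.0])

p_sigmoid = sigmoid(logits)          # valid probabilities
p_relu = np.maximum(logits, 0.0)     # 0.0, 0.5 and 3.0: not probabilities

sigmoid_loss = binary_crossentropy(y_true, p_sigmoid)
print(p_sigmoid, p_relu, sigmoid_loss)
```

Feeding `p_relu` into the same loss would mean taking the log of a negative number (1 - 3.0), so it is left out of the calculation here.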


In [139]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 5)                 30        
_________________________________________________________________
dense_13 (Dense)             (None, 10)                60        
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 11        
Total params: 101
Trainable params: 101
Non-trainable params: 0
_________________________________________________________________


In [160]:


model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model.fit(X_train,y_train,
                    batch_size=83,
                    epochs=200,
                    verbose=2,
                    validation_data=(X_test,y_test))

Train on 622 samples, validate on 208 samples
Epoch 1/200
622/622 - 1s - loss: 7.2465 - acc: 0.5161 - val_loss: 3.6433 - val_acc: 0.5096
Epoch 2/200
622/622 - 0s - loss: 3.4515 - acc: 0.5161 - val_loss: 2.9495 - val_acc: 0.5096
Epoch 3/200
622/622 - 0s - loss: 3.1198 - acc: 0.5161 - val_loss: 2.7411 - val_acc: 0.5096
Epoch 4/200
622/622 - 0s - loss: 2.8758 - acc: 0.5161 - val_loss: 2.5979 - val_acc: 0.5096
Epoch 5/200
622/622 - 0s - loss: 2.5082 - acc: 0.5161 - val_loss: 2.1182 - val_acc: 0.5096
Epoch 6/200
622/622 - 0s - loss: 1.9932 - acc: 0.5161 - val_loss: 1.6584 - val_acc: 0.5096
Epoch 7/200
622/622 - 0s - loss: 1.5611 - acc: 0.5161 - val_loss: 1.3876 - val_acc: 0.5096
Epoch 8/200
622/622 - 0s - loss: 1.3565 - acc: 0.5161 - val_loss: 1.2981 - val_acc: 0.5096
Epoch 9/200
622/622 - 0s - loss: 1.2614 - acc: 0.5161 - val_loss: 1.1883 - val_acc: 0.5096
Epoch 10/200
622/622 - 0s - loss: 1.1620 - acc: 0.5161 - val_loss: 1.1181 - val_acc: 0.5096
Epoch 11/200
622/622 - 0s - loss: 1.0786 - 

Epoch 90/200
622/622 - 0s - loss: 0.4383 - acc: 0.8167 - val_loss: 0.4668 - val_acc: 0.8606
Epoch 91/200
622/622 - 0s - loss: 0.4391 - acc: 0.8167 - val_loss: 0.4617 - val_acc: 0.8654
Epoch 92/200
622/622 - 0s - loss: 0.4378 - acc: 0.8215 - val_loss: 0.4673 - val_acc: 0.8654
Epoch 93/200
622/622 - 0s - loss: 0.4381 - acc: 0.8183 - val_loss: 0.4692 - val_acc: 0.8654
Epoch 94/200
622/622 - 0s - loss: 0.4378 - acc: 0.8183 - val_loss: 0.4579 - val_acc: 0.8654
Epoch 95/200
622/622 - 0s - loss: 0.4366 - acc: 0.8232 - val_loss: 0.5186 - val_acc: 0.8606
Epoch 96/200
622/622 - 0s - loss: 0.4376 - acc: 0.8183 - val_loss: 0.4543 - val_acc: 0.8702
Epoch 97/200
622/622 - 0s - loss: 0.4365 - acc: 0.8248 - val_loss: 0.4575 - val_acc: 0.8654
Epoch 98/200
622/622 - 0s - loss: 0.4356 - acc: 0.8248 - val_loss: 0.4587 - val_acc: 0.8606
Epoch 99/200
622/622 - 0s - loss: 0.4362 - acc: 0.8248 - val_loss: 0.4550 - val_acc: 0.8702
Epoch 100/200
622/622 - 0s - loss: 0.4361 - acc: 0.8248 - val_loss: 0.4560 - val

Epoch 179/200
622/622 - 0s - loss: 0.4153 - acc: 0.8408 - val_loss: 0.5631 - val_acc: 0.8606
Epoch 180/200
622/622 - 0s - loss: 0.4356 - acc: 0.8376 - val_loss: 0.5571 - val_acc: 0.8606
Epoch 181/200
622/622 - 0s - loss: 0.4336 - acc: 0.8376 - val_loss: 0.5649 - val_acc: 0.8606
Epoch 182/200
622/622 - 0s - loss: 0.4337 - acc: 0.8392 - val_loss: 0.5549 - val_acc: 0.8606
Epoch 183/200
622/622 - 0s - loss: 0.4190 - acc: 0.8424 - val_loss: 0.5544 - val_acc: 0.8558
Epoch 184/200
622/622 - 0s - loss: 0.4152 - acc: 0.8424 - val_loss: 0.5540 - val_acc: 0.8558
Epoch 185/200
622/622 - 0s - loss: 0.4151 - acc: 0.8424 - val_loss: 0.5551 - val_acc: 0.8558
Epoch 186/200
622/622 - 0s - loss: 0.4154 - acc: 0.8424 - val_loss: 0.5545 - val_acc: 0.8558
Epoch 187/200
622/622 - 0s - loss: 0.4149 - acc: 0.8392 - val_loss: 0.5552 - val_acc: 0.8654
Epoch 188/200
622/622 - 0s - loss: 0.4149 - acc: 0.8441 - val_loss: 0.5562 - val_acc: 0.8606
Epoch 189/200
622/622 - 0s - loss: 0.4148 - acc: 0.8424 - val_loss: 0.

## Do we have a winner?

Which model, and which choice of hyperparameters, performed the best? Feel free to share your results!

In [None]:
# The neural network: it reached roughly 86% validation accuracy in the recorded run above.
