# Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

This dataset contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn models treat as plain numbers unless they are explicitly encoded, they are close enough to ordinal that we shouldn't just discard them. "Shape", for example, is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery arises from false positives in mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Let's begin

### Import libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer  # fill in missing values
from sklearn.preprocessing import StandardScaler  # feature scaling
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.metrics import accuracy_score, classification_report  # evaluate the model's predictions
from sklearn.model_selection import cross_val_score  # k-fold cross validation

%matplotlib inline

### Import data set

In [2]:
data = pd.read_csv('mammographic_masses.data.txt',
                   names=['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity'],
                   na_values='?')
data.head()

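|   | BI_RADS | age | shape | margin | density | severity |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 67.0 | 3.0 | 5.0 | 3.0 | 1 |
| 1 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 2 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 3 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 4 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |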


In [3]:
data.describe()

|   | BI_RADS | age | shape | margin | density | severity |
|---|---|---|---|---|---|---|
| count | 959.0 | 956.0 | 930.0 | 913.0 | 885.0 | 961.0 |
| mean | 4.348279 | 55.487448 | 2.721505 | 2.796276 | 2.910734 | 0.463059 |
| std | 1.783031 | 14.480131 | 1.242792 | 1.566546 | 0.380444 | 0.498893 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 45.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |


In [4]:
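# Note: this slice keeps BI_RADS (column 0) among the features;
# data.iloc[:, 1:-1] would drop it, as described in the introduction.
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values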

In [5]:
X

array([[ 5., 67.,  3.,  5.,  3.],
       [ 4., 43.,  1.,  1., nan],
       [ 5., 58.,  4.,  5.,  3.],
       ...,
       [ 4., 64.,  4.,  5.,  3.],
       [ 5., 66.,  4.,  5.,  3.],
       [ 4., 62.,  3.,  3.,  3.]])
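
The introduction says BI-RADS should be discarded, while the array above still has five columns. As a minimal sketch of selecting only the four intended features by name, here is a toy frame built from the first four rows shown by `data.head()`. The names `toy`, `feature_names`, `X_toy`, and `y_toy` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's `data`, using the first four rows from data.head().
toy = pd.DataFrame({
    'BI_RADS':  [5, 4, 5, 4],
    'age':      [67, 43, 58, 28],
    'shape':    [3, 1, 4, 1],
    'margin':   [5, 1, 5, 1],
    'density':  [3, np.nan, 3, 3],
    'severity': [1, 1, 1, 0],
})

# Select features by name so BI_RADS is left out explicitly.
feature_names = ['age', 'shape', 'margin', 'density']
X_toy = toy[feature_names].values
y_toy = toy['severity'].values

print(X_toy.shape)  # (4, 4)
print(y_toy)
```

Selecting by column name rather than by position makes it harder to include BI-RADS by accident. Note that applying this to the real data would change every score reported in the rest of the notebook.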

### Cleaning the missing data (nan)

In [6]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') 
X = imputer.fit_transform(X)

In [7]:
X

array([[ 5.        , 67.        ,  3.        ,  5.        ,  3.        ],
       [ 4.        , 43.        ,  1.        ,  1.        ,  2.91073446],
       [ 5.        , 58.        ,  4.        ,  5.        ,  3.        ],
       ...,
       [ 4.        , 64.        ,  4.        ,  5.        ,  3.        ],
       [ 5.        , 66.        ,  4.        ,  5.        ,  3.        ],
       [ 4.        , 62.        ,  3.        ,  3.        ,  3.        ]])

### Train/Test split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)

In [9]:
X_train.shape

(720, 5)

In [10]:
X_test.shape

(241, 5)
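
The split above is random on every run, so the scores that follow will vary between runs. A small sketch of a reproducible, class-balanced split is shown below on synthetic stand-in data. `X_demo` and `y_demo` are assumptions made up for the example; `random_state` and `stratify` are standard `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 features, 40% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # (75, 4) (25, 4)
print(y_tr.mean(), y_te.mean())  # the positive rate is 0.4 in both parts
```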

### Feature scaling

In [11]:
stand = StandardScaler()
X_train = stand.fit_transform(X_train)
X_test = stand.transform(X_test)
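
In the cells above, the imputer is fit on the full dataset before the split, so the test rows influence the imputed means. One possible alternative, sketched here on synthetic stand-in data, is to wrap imputation, scaling, and the model in a scikit-learn pipeline so that each step is fit only on the training portion of every fold. `X_demo`, `y_demo`, and `pipe` are names made up for this example, and logistic regression is just a placeholder model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 5% of the values missing.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Each step is re-fit on the training part of every fold, never on held-out rows.
pipe = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(scores.mean())
```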

## Decision Trees

In [12]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train,y_train)

DecisionTreeClassifier(random_state=0)

In [13]:
y_pred = dtree.predict(X_test)
#print(np.concatenate((y_test.reshape(len(y_test),-1),y_pred.reshape(len(y_pred),-1)),1))

In [14]:
# check out the accuracy
accuracy_score(y_test, y_pred)*100

77.59336099585063

In [15]:
# use K-Fold cross validation to get a better measure of your model's accuracy
k_fold = cross_val_score(estimator=dtree, X=X_train, y=y_train, cv=10)
k_fold.mean()*100

76.52777777777779

## RandomForestClassifier

In [16]:
from sklearn.ensemble import RandomForestClassifier
randomf = RandomForestClassifier(random_state=0)
randomf.fit(X_train, y_train)

RandomForestClassifier(random_state=0)

In [17]:
y_pred = randomf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7925311203319502

In [18]:
k_fold = cross_val_score(estimator=randomf, X=X_train, y=y_train, cv=10)
k_fold.mean()*100

78.05555555555557

## SVM

In [19]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)

SVC(kernel='linear')

In [20]:
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)

0.8340248962655602

In [21]:
k_fold = cross_val_score(estimator=svc, X=X_train, y=y_train, cv=10)
k_fold.mean()*100

82.6388888888889

## KNN

In [22]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [23]:
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

0.7883817427385892

Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [24]:
# cv is not passed here, so scikit-learn's default of 5 folds applies (the other models use cv=10)
k_fold = cross_val_score(estimator=knn, X=X_train, y=y_train)
k_fold.mean()*100

78.75
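
The loop over K suggested above could look roughly like the following sketch. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo`, `y_demo`, `results`, and `best_k` are names made up for the example. In the notebook itself, `X_train` and `y_train` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Mean 10-fold cross-validation accuracy for each K from 1 to 50.
results = {}
for k in range(1, 51):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(knn_k, X_demo, y_demo, cv=10).mean()

best_k = max(results, key=results.get)
print(best_k, results[best_k])
```

Plotting `results` against K is a common way to judge whether the choice of K makes a substantial difference or whether the curve is essentially flat.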

## Naive Bayes

In [25]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

In [26]:
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)

0.8257261410788381

In [27]:
k_fold = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
k_fold.mean()

0.8013888888888889

## Logistic Regression


In [28]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
log.fit(X_train, y_train)

LogisticRegression()

In [29]:
y_pred = log.predict(X_test)
accuracy_score(y_test,y_pred)

0.8257261410788381

In [30]:
k_fold = cross_val_score(estimator=log, X=X_train, y=y_train,cv=10)
k_fold.mean()

0.8319444444444445
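
The same fit-and-cross-validate step is repeated by hand for each classifier above. As a sketch, the comparison can be gathered into a single loop. Synthetic data is used here so the block runs on its own; `X_demo`, `y_demo`, `models`, and `summary` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# The same classifiers and settings used in the sections above.
models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(random_state=0),
    'linear SVM': SVC(kernel='linear'),
    'KNN (K=5)': KNeighborsClassifier(n_neighbors=5),
    'naive Bayes': GaussianNB(),
    'logistic regression': LogisticRegression(),
}

# Mean 10-fold cross-validation accuracy per model.
summary = {
    name: cross_val_score(model, X_demo, y_demo, cv=10).mean()
    for name, model in models.items()
}

# Print the models from best to worst.
for name, score in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:20s} {score:.3f}')
```

Using the same `cv` value and the same score scale for every model makes the numbers directly comparable.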

## Artificial Neural Networks


In [31]:
from keras.models import Sequential
from keras.layers import Dense

In [32]:
# initialize the ANN
model = Sequential()

# add the first hidden layer (the input shape is inferred on the first call to fit)
model.add(Dense(6, activation='relu'))
# add the output layer
model.add(Dense(1, activation='sigmoid'))
# compile the ANN
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [33]:
model.fit(x=X_train,y=y_train,epochs=100,verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7f9904302cd0>

In [34]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 6)                 36        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 43
Trainable params: 43
Non-trainable params: 0
_________________________________________________________________


In [35]:
model.metrics_names

['loss', 'accuracy']

In [36]:
model.evaluate(x=X_test, y=y_test)



[0.3779608905315399, 0.8257261514663696]

In [39]:
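# predict_classes was removed in newer TensorFlow/Keras releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
prediction = (model.predict(X_test) > 0.5).astype('int32').ravel()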

In [40]:
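from sklearn.metrics import classification_report, accuracy_score
# Only the last expression of a cell is displayed, so the report below is computed but not shown;
# wrap it in print() to see per-class precision and recall.
classification_report(y_test, prediction)
accuracy_score(y_test, prediction)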

0.8257261410788381
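
Accuracy alone does not show how many of the errors are false positives, which is the concern raised in the introduction. The sketch below prints a confusion matrix and a classification report for a small hand-made set of labels. `y_true` and `y_hat` are invented for the illustration and are not results from the models above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made labels: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

# print() is needed to display the report when it is not the last expression in a cell.
print(classification_report(y_true, y_hat, target_names=['benign', 'malignant']))
```

In the notebook, `y_test` and any model's `y_pred` could be passed in the same way to count false positives (`fp`) per model.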