
# Predicting Whether a Mammogram Mass is Benign or Malignant

The data is the public "Mammographic Mass" dataset from the UCI Machine Learning Repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If we can build a better way to interpret mammograms through supervised machine learning, it could improve a lot of lives.

The dataset contains 961 instances of masses detected in mammograms, each with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

## Data Preparation

In [1]:
import numpy as np
import pandas as pd

masses=pd.read_csv("mammographic_masses.data.txt", names=["Bi_RADS","age","shape","margin","density","severity"], na_values = '?')
masses.head()

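|   | Bi_RADS | age | shape | margin | density | severity |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 67.0 | 3.0 | 5.0 | 3.0 | 1 |
| 1 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 2 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 3 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 4 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |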


In [2]:
masses.describe()

|   | Bi_RADS | age | shape | margin | density | severity |
|---|---|---|---|---|---|---|
| count | 959.0 | 956.0 | 930.0 | 913.0 | 885.0 | 961.0 |
| mean | 4.348279 | 55.487448 | 2.721505 | 2.796276 | 2.910734 | 0.463059 |
| std | 1.783031 | 14.480131 | 1.242792 | 1.566546 | 0.380444 | 0.498893 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 45.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |


In [3]:
masses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Bi_RADS   959 non-null    float64
 1   age       956 non-null    float64
 2   shape     930 non-null    float64
 3   margin    913 non-null    float64
 4   density   885 non-null    float64
 5   severity  961 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 45.2 KB


Blank fields were recorded as '?' in the raw file, which we converted to NaN when loading the data. We will drop the rows that contain them.

In [4]:
masses_dropped = masses.dropna()
masses_dropped.describe()

|   | Bi_RADS | age | shape | margin | density | severity |
|---|---|---|---|---|---|---|
| count | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 |
| mean | 4.393976 | 55.781928 | 2.781928 | 2.813253 | 2.915663 | 0.485542 |
| std | 1.888371 | 14.671782 | 1.242361 | 1.567175 | 0.350936 | 0.500092 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 46.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |
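
The interaction between `na_values='?'` and `dropna()` can be checked on a small sample. The sketch below rebuilds the first five records from the `masses.head()` output above as an in-memory CSV (the exact raw text, with `?` in place of each NaN, is an assumption) and counts the missing values per column before dropping incomplete rows.

```python
import io

import pandas as pd

# First five records as shown by masses.head(); '?' stands in for each NaN
# (assumed raw formatting, reconstructed from the displayed output)
raw = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
    "5,74,1,5,?,1\n"
)
columns = ["Bi_RADS", "age", "shape", "margin", "density", "severity"]
sample = pd.read_csv(raw, names=columns, na_values="?")

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one NaN
sample_dropped = sample.dropna()
print(len(sample), "rows before,", len(sample_dropped), "rows after dropna()")
```

Running `masses.isna().sum()` on the full dataset in the same way shows which columns account for most of the dropped rows.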


Assign the features and classes to be used. Severity (0 or 1) records whether the mass is benign or malignant.

In [5]:
feature_columns = ["age", "shape", "margin", "density"]
classes_columns = ["severity"]

# Select the columns by name and convert them to NumPy arrays
features = masses_dropped[feature_columns].values
classes = masses_dropped[classes_columns].values
classes = np.ravel(classes, order="C")  # flatten from shape (n, 1) to (n,)
print(features)

[[67.  3.  5.  3.]
 [58.  4.  5.  3.]
 [28.  1.  1.  3.]
 ...
 [64.  4.  5.  3.]
 [66.  4.  5.  3.]
 [62.  3.  3.  3.]]


Normalise the features so that each one has zero mean and unit variance.

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
print(scaled_features)


[[ 0.7650629   0.17563638  1.39618483  0.24046607]
 [ 0.15127063  0.98104077  1.39618483  0.24046607]
 [-1.89470363 -1.43517241 -1.157718    0.24046607]
 ...
 [ 0.56046548  0.98104077  1.39618483  0.24046607]
 [ 0.69686376  0.98104077  1.39618483  0.24046607]
 [ 0.42406719  0.17563638  0.11923341  0.24046607]]
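
As a check on what `StandardScaler` does, the sketch below standardises a few rows by hand and compares the result with the scaler's output. The array holds only the age, shape, and margin values of four rows printed above; it is an illustrative subset, so the numbers differ from the full-dataset output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative subset: age, shape, margin for four of the rows printed above
X = np.array([
    [67.0, 3.0, 5.0],
    [58.0, 4.0, 5.0],
    [28.0, 1.0, 1.0],
    [62.0, 3.0, 3.0],
])

# Subtract each column's mean and divide by its population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Both approaches should agree to floating-point precision
print(np.allclose(manual, scaled))
```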


## Decision Trees

Split the scaled data into a training set (75%) and a test set (25%).

In [7]:
from sklearn import tree
from sklearn.model_selection import train_test_split

(training_inputs, testing_inputs, training_classes,
 testing_classes) = train_test_split(scaled_features,classes,
                                     train_size=0.75, random_state=1)

Now create a DecisionTreeClassifier and fit it to the training data.

In [8]:
clf_dt = tree.DecisionTreeClassifier()
clf_dt = clf_dt.fit(training_inputs, training_classes)

Measure accuracy on the held-out test set.

In [9]:
clf_dt.score(testing_inputs, testing_classes)

0.7403846153846154
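
Accuracy alone does not separate false positives from false negatives, which matters given the motivation stated at the top. A minimal sketch of how the two could be counted with `confusion_matrix` follows; the label arrays are made up for illustration and are not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 0 = benign, 1 = malignant
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

# For binary labels the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
print("false positives:", fp)
print("false negatives:", fn)
print("accuracy:", accuracy)
```

With the notebook's own variables, the equivalent call would be `confusion_matrix(testing_classes, clf_dt.predict(testing_inputs))`.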

Implement K-fold cross-validation to get a better measure of the model's accuracy.

In [10]:
# 10-fold cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf_dt,scaled_features, classes, cv=10)
print(scores.mean())

0.733734939759036
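
To make explicit what `cross_val_score` does with `cv=10`, the sketch below repeats the same procedure by hand. For a classifier, an integer `cv` is interpreted as a stratified K-fold split without shuffling. The data here is synthetic (generated with `make_classification`, not the mammography data), and a fixed `random_state` is an added assumption so that the two runs are comparable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 4 features, binary target
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Manual loop: fit a fresh copy of the model on 9 folds, score it on the 10th
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Library call used in the notebook
library_scores = cross_val_score(clf, X, y, cv=10)

print(np.mean(manual_scores), library_scores.mean())
```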


## Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier()

In [12]:
# classes was already flattened to 1-D in the data preparation step
scores = cross_val_score(clf_rf, scaled_features, classes, cv=10)
print(scores.mean())

0.7614457831325302


## SVM

Fit svm.SVC with a linear kernel.

In [13]:
from sklearn import svm
svc = svm.SVC(kernel='linear').fit(training_inputs,training_classes)

In [14]:
svc.score(testing_inputs, testing_classes)

0.7692307692307693

In [15]:
svc_cv = cross_val_score(svc,scaled_features,classes,cv=10)

In [16]:
print(svc_cv.mean())

0.7975903614457832
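
One caveat about the cross-validation scores in this notebook: the scaler was fitted on all rows before the folds were split, so each fold's test rows contributed to the scaling statistics. The effect is probably small here, but a pipeline avoids it by refitting the scaler inside every fold. The sketch below shows the pattern on synthetic data from `make_classification`, not the mammography data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 200 samples, 4 features, binary target
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# The scaler is fitted on the training folds only, then applied to the test fold
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipeline_scores = cross_val_score(model, X, y, cv=10)
print(pipeline_scores.mean())
```

In the notebook itself this would mean passing the unscaled `features` array, rather than `scaled_features`, to `cross_val_score` together with the pipeline.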


Optimise the kernel by comparing the remaining options with 10-fold cross-validation.

In [28]:
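h_param = ["rbf", "sigmoid", "poly"]
for kernel in h_param:
    # cross_val_score clones and refits the estimator on each fold,
    # so a separate fit on the training split is not needed here
    svc = svm.SVC(kernel=kernel)
    svc_cv = cross_val_score(svc, scaled_features, classes, cv=10)
    print(kernel, ": ", svc_cv.mean())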

rbf :  0.8012048192771084
sigmoid :  0.7457831325301204
poly :  0.7903614457831326
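
The same comparison can be expressed with `GridSearchCV`, which also makes it easy to vary a second hyperparameter at the same time. The sketch below is one possible setup, not the search used in this notebook: the values for `C` are arbitrary assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data: 200 samples, 4 features, binary target
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Kernels from the notebook; the C values are illustrative assumptions
param_grid = {
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
    "C": [0.1, 1, 10],
}

# Each of the 12 combinations is scored with 10-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```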


## KNN
Start with K = 10 neighbours.

In [17]:
from sklearn import neighbors

clf_knn = neighbors.KNeighborsClassifier(n_neighbors=10)
cv_score = cross_val_score(clf_knn,scaled_features, classes, cv=10)
cv_score.mean()

0.7915662650602409

Optimise the number of nearest neighbours by trying K from 1 to 20.

In [18]:
K = []
cv = []
for k in range(1,21):
    clf_knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(clf_knn,scaled_features,classes, cv=10)
    score = score.mean()
    print("For {} neighbors, the mean accuracy is {}".format(k,score))
    K.append(k)
    cv.append(score)

d = {"KNN":K, "score":cv }
df = pd.DataFrame(d)
df

For 1 neighbors, the mean accuracy is 0.7325301204819278
For 2 neighbors, the mean accuracy is 0.6903614457831325
For 3 neighbors, the mean accuracy is 0.7542168674698796
For 4 neighbors, the mean accuracy is 0.7349397590361446
For 5 neighbors, the mean accuracy is 0.7710843373493976
For 6 neighbors, the mean accuracy is 0.7686746987951807
For 7 neighbors, the mean accuracy is 0.7951807228915662
For 8 neighbors, the mean accuracy is 0.7771084337349398
For 9 neighbors, the mean accuracy is 0.7903614457831326
For 10 neighbors, the mean accuracy is 0.7915662650602409
For 11 neighbors, the mean accuracy is 0.7891566265060241
For 12 neighbors, the mean accuracy is 0.783132530120482
For 13 neighbors, the mean accuracy is 0.7879518072289157
For 14 neighbors, the mean accuracy is 0.7867469879518072
For 15 neighbors, the mean accuracy is 0.7867469879518072
For 16 neighbors, the mean accuracy is 0.7831325301204819
For 17 neighbors, the mean accuracy is 0.7783132530120482
For 18 neighbors, the me

|   | KNN | score |
|---|---|---|
| 0 | 1 | 0.73253 |
| 1 | 2 | 0.690361 |
| 2 | 3 | 0.754217 |
| 3 | 4 | 0.73494 |
| 4 | 5 | 0.771084 |
| 5 | 6 | 0.768675 |
| 6 | 7 | 0.795181 |
| 7 | 8 | 0.777108 |
| 8 | 9 | 0.790361 |
| 9 | 10 | 0.791566 |


In [19]:
df_sorted = df.sort_values(by=["score"], ascending=False)

In [20]:
df_sorted

|   | KNN | score |
|---|---|---|
| 6 | 7 | 0.795181 |
| 9 | 10 | 0.791566 |
| 8 | 9 | 0.790361 |
| 10 | 11 | 0.789157 |
| 12 | 13 | 0.787952 |
| 14 | 15 | 0.786747 |
| 13 | 14 | 0.786747 |
| 19 | 20 | 0.785542 |
| 18 | 19 | 0.784337 |
| 11 | 12 | 0.783133 |


So K = 7 neighbours gives the highest mean accuracy (0.795) of the values tested.
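
The best row can also be picked out directly instead of sorting the whole table. The sketch below uses `idxmax` on a small frame that contains only the first ten scores displayed above (the variable name `df_head` is hypothetical).

```python
import pandas as pd

# Scores for K = 1..10, copied from the table displayed above
df_head = pd.DataFrame({
    "KNN": list(range(1, 11)),
    "score": [0.73253, 0.690361, 0.754217, 0.73494, 0.771084,
              0.768675, 0.795181, 0.777108, 0.790361, 0.791566],
})

# idxmax returns the index label of the highest score
best = df_head.loc[df_head["score"].idxmax()]
print(int(best["KNN"]), best["score"])
```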

## Naive Bayes

MultinomialNB requires non-negative feature values, so we rescale the raw features to the range [0, 1] with MinMaxScaler instead of using the standardised features.

In [23]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
minmax_features = scaler.fit_transform(features)
clf_nb = MultinomialNB()
cv_score = cross_val_score(clf_nb,minmax_features, classes, cv=10)
cv_score.mean()

0.7855421686746988
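
The reason for switching scalers can be demonstrated on a few rows: standardised features contain negative values, which `MultinomialNB` rejects with a `ValueError`, while min-max scaled features lie in [0, 1]. The arrays below are illustrative values only (the labels in particular are made up).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values only: age, shape, margin, with made-up labels
X = np.array([
    [67.0, 3.0, 5.0],
    [58.0, 4.0, 5.0],
    [28.0, 1.0, 1.0],
    [62.0, 3.0, 3.0],
])
y = np.array([1, 1, 0, 0])

standardized = StandardScaler().fit_transform(X)  # contains negative values
minmax = MinMaxScaler().fit_transform(X)          # every value in [0, 1]

try:
    MultinomialNB().fit(standardized, y)
    rejected = False
except ValueError as err:
    rejected = True
    print("MultinomialNB rejected the standardised input:", err)

# The min-max scaled features are accepted
MultinomialNB().fit(minmax, y)
print(minmax.min(), minmax.max())
```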

## Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression()
cv_scores = cross_val_score(clf_lr, scaled_features, classes, cv=10)
print(cv_scores.mean())

0.8072289156626505


## Conclusion
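Aside from the decision tree and random forest (0.73 and 0.76), the best configuration of each model reaches a cross-validated accuracy of around 0.80, with logistic regression highest at 0.807. There is scope to fine-tune hyperparameters to increase accuracy further.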