# Predict whether a mammogram mass is benign or malignant

Use the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

The dataset contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the class we will attempt to predict from them.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.
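
If treating "shape" and "margin" as ordinal turns out to be too strong an assumption, one-hot encoding is a common alternative. The following is a minimal sketch, not part of the original workflow, using `pandas.get_dummies` on a few rows copied from the dataset preview; the `sample` and `encoded` names are illustrative.

```python
import pandas as pd

# A few (age, shape, margin) rows taken from the dataset preview.
sample = pd.DataFrame({
    "age":    [67, 43, 58, 28],
    "shape":  [3, 1, 4, 1],
    "margin": [5, 1, 5, 1],
})

# Expand each nominal column into one indicator column per observed code.
# Codes that do not occur in the data (e.g. shape=2 here) get no column.
encoded = pd.get_dummies(sample, columns=["shape", "margin"])

print(list(encoded.columns))
```

The trade-off is a wider input layer: four shape codes and five margin codes would replace two numeric columns with up to nine indicator columns.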

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Task

Build a Multi-Layer Perceptron and train it to classify masses as benign or malignant based on its features.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.



## Preparing the data

In [1]:
data_path = '/content/drive/MyDrive/Mammogram/mammographic_masses.data'
names_path = '/content/drive/MyDrive/Mammogram/mammographic_masses.names'

In [2]:
import pandas as pd

masses_data = pd.read_csv(data_path)
masses_data.head()

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0


In [3]:
# The file has no header row and marks missing values with '?',
# so supply the column names and tell pandas to read '?' as NaN.
masses_data = pd.read_csv(data_path, na_values=['?'], names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
masses_data.head()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [4]:
# Describe the dataset
masses_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BI-RADS,959.0,4.348279,1.783031,0.0,4.0,4.0,5.0,55.0
age,956.0,55.487448,14.480131,18.0,45.0,57.0,66.0,96.0
shape,930.0,2.721505,1.242792,1.0,2.0,3.0,4.0,4.0
margin,913.0,2.796276,1.566546,1.0,1.0,3.0,4.0,5.0
density,885.0,2.910734,0.380444,1.0,3.0,3.0,3.0,4.0
severity,961.0,0.463059,0.498893,0.0,0.0,0.0,1.0,1.0
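
The summary above shows a BI-RADS maximum of 55, well outside the documented range, which suggests at least one data-entry error in that column. BI-RADS is discarded anyway, but the same kind of check can be applied to the four features. The sketch below is one possible way to do it: the shape, margin, and density ranges come from the attribute list, the age bounds are an assumption, and the `demo` frame is made up purely to exercise the function.

```python
import pandas as pd

# Code ranges from the attribute list; the age bounds are an assumption.
VALID_RANGES = {
    "age": (0, 120),
    "shape": (1, 4),
    "margin": (1, 5),
    "density": (1, 4),
}

def out_of_range_counts(df, ranges):
    """Count non-missing values outside [low, high] for each column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts

# Hypothetical rows: one implausible age and one invalid margin code.
demo = pd.DataFrame({
    "age": [67, 43, 130],
    "shape": [3, 1, 4],
    "margin": [5, 1, 7],
    "density": [3, None, 3],
})
print(out_of_range_counts(demo, VALID_RANGES))
```

Calling `out_of_range_counts(masses_data, VALID_RANGES)` on the real frame would report how many feature values, if any, fall outside these ranges.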


### Examine the rows before dropping them to ensure no bias is introduced into the dataset. Is there any pattern in which rows have missing fields?

In [5]:
masses_data.loc[masses_data['age'].isnull() | masses_data['shape'].isnull() |
                masses_data['margin'].isnull() | masses_data['density'].isnull()]

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1


Since the missing data seems randomly distributed, we drop the rows with missing values.

In [6]:
masses_data.dropna(inplace=True)
masses_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BI-RADS,830.0,4.393976,1.888371,0.0,4.0,4.0,5.0,55.0
age,830.0,55.781928,14.671782,18.0,46.0,57.0,66.0,96.0
shape,830.0,2.781928,1.242361,1.0,2.0,3.0,4.0,4.0
margin,830.0,2.813253,1.567175,1.0,1.0,3.0,4.0,5.0
density,830.0,2.915663,0.350936,1.0,3.0,3.0,3.0,4.0
severity,830.0,0.485542,0.500092,0.0,0.0,0.0,1.0,1.0
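
Comparing the two summaries, mean severity moves from about 0.463 across all 961 rows to about 0.486 across the 830 complete rows, so the dropped rows may be benign somewhat more often than the kept ones. A minimal sketch for checking this directly is shown below. It groups rows by whether any feature is missing and compares the malignant rate in each group. The `missingness_summary` helper is hypothetical, and the six rows are the first six rows of the dataset as displayed earlier, so the numbers only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# First six rows of the dataset as shown in the previews (BI-RADS omitted).
sample = pd.DataFrame({
    "age":      [67, 43, 58, 28, 74, 65],
    "shape":    [3, 1, 4, 1, 1, 1],
    "margin":   [5, 1, 5, 1, 5, np.nan],
    "density":  [3, np.nan, 3, 3, np.nan, 3],
    "severity": [1, 1, 1, 0, 1, 0],
})
feature_cols = ["age", "shape", "margin", "density"]

def missingness_summary(df, features, target="severity"):
    """Row count and mean target for rows with and without missing features."""
    has_missing = df[features].isna().any(axis=1).rename("has_missing")
    return df.groupby(has_missing)[target].agg(n_rows="size", malignant_rate="mean")

summary = missingness_summary(sample, feature_cols)
print(summary)
```

Run on the full frame before `dropna`, a large gap between the two `malignant_rate` values would indicate that dropping rows shifts the class balance.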


In [7]:
# Create features & classes numpy array
X = masses_data[['age', 'shape', 'margin', 'density']].values
y = masses_data['severity'].values
feature_names = ['age', 'shape', 'margin', 'density']
X

array([[67.,  3.,  5.,  3.],
       [58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

In [8]:
# Normalize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
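
Fitting the scaler on the full dataset means that, during cross-validation, each held-out fold has already influenced the scaling statistics. With standardization the effect is usually small, but it can be avoided by placing the scaler inside a pipeline so it is refit on each training fold. The sketch below illustrates the pattern under two loudly stated assumptions: it uses synthetic data from `make_classification` rather than the mammogram data, and it swaps in scikit-learn's `MLPClassifier` for the Keras model so the example stays self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with four features (NOT the real mammogram data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# The scaler is a pipeline step, so cross_val_score refits it on the
# training portion of every fold and only transforms the held-out part.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(6, 3), max_iter=500, random_state=0),
)

demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(demo_scores.mean())
```

The same idea should carry over to the wrapped Keras model by using it as the final pipeline step and passing the unscaled `X` to `cross_val_score`.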

In [9]:
!pip install keras -q

In [10]:
import keras
from keras import layers

In [11]:
def create_model():
  model = keras.Sequential()
  model.add(layers.Dense(6, input_shape=(4, ), kernel_initializer='normal', activation='relu'))
  model.add(layers.Dense(3, kernel_initializer='normal', activation='relu'))
  model.add(layers.Dense(1, kernel_initializer='normal', activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model
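
For a sense of scale, the network defined above is very small. The following sketch works out the number of trainable parameters by hand, assuming plain fully connected layers with one bias per unit (the Keras default for `Dense`); the `dense_param_count` helper is illustrative only.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# 4 inputs -> 6 units -> 3 units -> 1 output
# (4*6 + 6) + (6*3 + 3) + (3*1 + 1) = 30 + 21 + 4
total = dense_param_count([4, 6, 3, 1])
print(total)  # 55
```

With 830 usable rows and only 55 parameters, the model has limited capacity to overfit, which fits the choice of such narrow hidden layers.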

In [None]:
!pip install scikeras[tensorflow]

In [17]:
from sklearn.model_selection import cross_val_score
from scikeras.wrappers import KerasClassifier

model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)

scores = cross_val_score(model, X_scaled, y, cv=10)

print(scores)
print(scores.mean())

[0.75903614 0.80722892 0.86746988 0.8313253  0.81927711 0.72289157
 0.79518072 0.81927711 0.87951807 0.78313253]
0.808433734939759
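
The mean alone hides how much the folds differ. As a small sketch, the fold accuracies printed above can be summarized with their spread as well; the values are copied from that output, and the sample standard deviation (`ddof=1`) is one reasonable choice among several.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
fold_scores = np.array([
    0.75903614, 0.80722892, 0.86746988, 0.8313253, 0.81927711,
    0.72289157, 0.79518072, 0.81927711, 0.87951807, 0.78313253,
])

mean_acc = fold_scores.mean()
std_acc = fold_scores.std(ddof=1)  # sample standard deviation across folds

print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} "
      f"(min {fold_scores.min():.3f}, max {fold_scores.max():.3f})")
```

Accuracy treats both kinds of error the same, so it does not say how many of the mistakes are false positives, which is the error type the introduction is most concerned about.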
