Start by importing the mammographic_masses.data.txt file into a Pandas dataframe and take a look at it.

In [27]:
import pandas as pd

masses_data = pd.read_csv('mammographic_masses.data.txt')
masses_data.head()

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0


Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI-RADS, age, shape, margin, density, and severity):

In [28]:
masses_data = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
masses_data.head()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
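
The first attempt above used the first data row as a header, because the file has no header line. As a minimal sketch of how `na_values` and `names` behave, the snippet below reads three made-up rows from an in-memory string instead of the real file; the variable names `raw`, `columns`, and `sample` are illustrative only.

```python
import io
import pandas as pd

# Three made-up rows in the same layout as the data file: no header row,
# and '?' standing in for a missing value.
raw = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n4,65,1,?,3,0\n")

columns = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']

# Passing names= without header= makes pandas treat the first line as data.
sample = pd.read_csv(raw, na_values=['?'], names=columns)

print(sample)
print(sample.isnull().sum())  # missing values per column
```

Columns that contain a NaN are read as floats, which is why the integer codes show up as 5.0, 4.0, and so on.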


Evaluate whether the data needs cleaning ... your model is only as good as the data it's given.

In [29]:
masses_data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


There are quite a few missing values in the data set. Before we drop every row that's missing data, let's make sure doing so won't bias the data. Is there any apparent pattern to which rows have missing fields? If there were, we'd have to go back and fill that data in rather than drop it.

In [30]:
masses_data.loc[(masses_data['age'].isnull()) |
              (masses_data['shape'].isnull()) |
              (masses_data['margin'].isnull()) |
              (masses_data['density'].isnull())]

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1
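
Eyeballing the rows is a reasonable first pass. One rough way to put a number on it, sketched here on a made-up frame rather than the real data, is to compare the share of incomplete rows in each class. The names `toy`, `has_missing`, and `missing_rate_by_class` are illustrative; in the notebook you would apply the same idea to `masses_data`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; in the notebook, use masses_data.
toy = pd.DataFrame({
    'age':      [67, 43, np.nan, 28, 74, 65],
    'density':  [3, np.nan, 3, 3, np.nan, 3],
    'severity': [1, 1, 1, 0, 1, 0],
})

# True for every row that has at least one missing field.
has_missing = toy.isnull().any(axis=1)

# Share of incomplete rows within each class.
missing_rate_by_class = has_missing.groupby(toy['severity']).mean()
print(missing_rate_by_class)
```

Similar rates across classes suggest, but do not prove, that dropping the incomplete rows will not shift the class balance much.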


If the missing data seems randomly distributed, go ahead and drop rows with missing data.

In [31]:
masses_data.dropna(inplace=True)
masses_data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [32]:
features = masses_data[['age', 'shape', 'margin', 'density']].values

classes = masses_data['severity'].values

feature_names = ['age', 'shape', 'margin', 'density']

features

array([[67.,  3.,  5.,  3.],
       [58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. 

In [33]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
features = scaler.fit_transform(features)
features

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
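
The cell above scales the whole array at once, which is the simplest option. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only, so that no statistics from the test rows leak into training. The arrays `X` and `y` are random stand-ins, not the mammography data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature and class arrays (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=1)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...and reuse those same statistics on the held-out rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

After scaling, each training column has a mean of about 0 and a standard deviation of about 1; the test columns will be close to that but not exact.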

## Decision Trees

Start by creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.

In [34]:
import numpy
from sklearn.model_selection import train_test_split

numpy.random.seed(1234)

(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(features, classes, train_size=0.75, random_state=1)

In [35]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier(random_state=1)

# Train the classifier on the training set
DT.fit(training_inputs, training_classes)

DecisionTreeClassifier(random_state=1)

In [36]:
DT.score(testing_inputs, testing_classes)

0.7355769230769231
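
A score from a single split depends on which rows happen to land in the test set. As a hedged sketch of one way to get a steadier estimate, the snippet below runs 10-fold cross-validation with `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in; in the notebook you would pass `features` and `classes` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the features and classes arrays.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=1)

tree = DecisionTreeClassifier(random_state=1)

# cv=10 splits the data into 10 folds; each fold serves as the test set once.
scores = cross_val_score(tree, X, y, cv=10)
print(scores)
print(scores.mean())
```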

## Random Forest

In [37]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=10, random_state=1)
RF.fit(training_inputs, training_classes)

RandomForestClassifier(n_estimators=10, random_state=1)

In [38]:
RF.score(testing_inputs, testing_classes)

0.7596153846153846

## SVM

svm.SVC performs differently with different kernels. The choice of kernel is an example of a "hyperparameter." Start with a linear kernel as a baseline, then try the rbf, sigmoid, and poly kernels and see which performs best.

In [39]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='linear', C=C)

svc.fit(training_inputs, training_classes)

SVC(kernel='linear')

In [40]:
svc.score(testing_inputs, testing_classes)

0.7692307692307693

In [41]:
C = 1.0
svc = svm.SVC(kernel='rbf', C=C)

svc.fit(training_inputs, training_classes)
svc.score(testing_inputs, testing_classes)

0.7788461538461539

In [42]:
C = 1.0
svc = svm.SVC(kernel='sigmoid', C=C)

svc.fit(training_inputs, training_classes)
svc.score(testing_inputs, testing_classes)

0.7067307692307693

In [43]:
C = 1.0
svc = svm.SVC(kernel='poly', C=C)

svc.fit(training_inputs, training_classes)
svc.score(testing_inputs, testing_classes)

0.75
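
The four cells above differ only in the kernel name, so the comparison can be collapsed into a loop. This is a minimal sketch on synthetic data from `make_classification`; the names `kernel_scores` and `best_kernel` are illustrative, and the scores it prints are not the notebook's scores.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the scaled features and classes.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=1)

# Fit one SVC per kernel and record its accuracy on the held-out rows.
kernel_scores = {}
for kernel in ['linear', 'rbf', 'sigmoid', 'poly']:
    model = svm.SVC(kernel=kernel, C=1.0)
    model.fit(X_train, y_train)
    kernel_scores[kernel] = model.score(X_test, y_test)

best_kernel = max(kernel_scores, key=kernel_scores.get)
print(kernel_scores)
print(best_kernel)
```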

## KNN
How about K-Nearest-Neighbors? Using scikit-learn is a lot easier than implementing KNN from scratch like we did earlier in the course. Start with a K of 10. K is an example of a hyperparameter: a parameter on the model itself which may need to be tuned for best results on your particular data set.

In [44]:
from sklearn import neighbors

KNN = neighbors.KNeighborsClassifier(n_neighbors=10)
KNN.fit(training_inputs, training_classes)

KNN.score(testing_inputs, testing_classes)


0.7692307692307693
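
Since K may need tuning, one simple approach is to try a range of values and keep the best. The sketch below does this with 5-fold cross-validation on synthetic data; the range 1 to 20 and the names `k_scores` and `best_k` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the scaled features and classes.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=1)

# Mean 5-fold accuracy for each candidate K.
k_scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(k_scores, key=k_scores.get)
print(best_k, k_scores[best_k])
```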

## Logistic Regression

We've tried all these fancy techniques, but fundamentally this is just a binary classification problem. Try Logistic Regression, which is a simple way of tackling this sort of thing.

In [45]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()

LR.fit(training_inputs, training_classes)

LogisticRegression()

In [46]:
LR.score(testing_inputs, testing_classes)

0.7788461538461539

# STACKING
### Implementing a meta-Learner

In [47]:
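from sklearn.metrics import accuracy_score

# Note: this is a fresh split with a different seed, so X_test will generally
# include rows the base models were already trained on.
X_train, X_test, y_train, y_test = train_test_split(features, classes, test_size=0.2, random_state=42)

pred1 = DT.predict(X_test)
pred2 = RF.predict(X_test)
pred3 = svc.predict(X_test)  # svc is the most recently fitted SVC (poly kernel)
pred4 = LR.predict(X_test)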

In [48]:
stacked_X = numpy.column_stack((pred1, pred2,pred3,pred4))

In [49]:
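meta_learner = DecisionTreeClassifier(random_state=42)
# Note: the meta-learner is fitted on the same rows it is scored on below,
# so the reported accuracy is an in-sample figure and likely optimistic.
meta_learner.fit(stacked_X, y_test)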

DecisionTreeClassifier(random_state=42)

In [50]:
stacked_pred = meta_learner.predict(stacked_X)

In [51]:
accuracy = accuracy_score(y_test, stacked_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9036144578313253
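
The accuracy above is measured on the same rows the meta-learner was fitted on, so it is likely to overstate how the stack performs on unseen data. A hedged alternative, sketched here on synthetic data, is scikit-learn's `StackingClassifier`, which trains the meta-learner on out-of-fold predictions and leaves a test set untouched for scoring. The base learners mirror the ones used in this notebook, and `stack_method='predict'` is chosen to feed hard class predictions to the meta-learner as the cells above do.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the scaled features and classes.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

base_learners = [
    ('dt', DecisionTreeClassifier(random_state=1)),
    ('rf', RandomForestClassifier(n_estimators=10, random_state=1)),
    ('svc', SVC(kernel='poly', C=1.0)),
    ('lr', LogisticRegression()),
]

# cv=5: the meta-learner is trained on predictions each base learner made
# for folds it was not fitted on.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=DecisionTreeClassifier(random_state=42),
    stack_method='predict',
    cv=5,
)
stack.fit(X_train, y_train)

# Scored on rows that neither the base learners nor the meta-learner saw.
held_out_accuracy = stack.score(X_test, y_test)
print("Held-out accuracy:", held_out_accuracy)
```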


# Saving Model for Usage

In [52]:
import pickle as pkl

# Use a context manager so the file is closed once the model is written
with open('meta_learner.pkl', 'wb') as f:
    pkl.dump(meta_learner, f)
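
To use the saved model later, it has to be loaded back with `pickle.load`. The round trip is sketched below with a tiny made-up decision tree written to a temporary directory, so it runs on its own and does not touch the real `meta_learner.pkl`.

```python
import os
import pickle as pkl
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set so the example runs on its own.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Write to a temporary directory to avoid touching real files.
path = os.path.join(tempfile.mkdtemp(), 'meta_learner.pkl')
with open(path, 'wb') as f:
    pkl.dump(model, f)

# Load the model back and check that it predicts as before.
with open(path, 'rb') as f:
    restored = pkl.load(f)

print(restored.predict(X))
```

Only unpickle files from sources you trust, and load the model with the same scikit-learn version that saved it where possible.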