## Neural networks and the Breast cancer dataset

We will use the Breast cancer dataset to check an application of neural networks with real-world data. Load the breast cancer dataset from sklearn and print the description. 

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
import pandas as pd

In [3]:
from sklearn.datasets import load_breast_cancer
breast_cancer_dataset = load_breast_cancer()
print(breast_cancer_dataset.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

Print the dataset `target names`, the `feature names` and the `shape` of the `data`. These are the dimensions of the input features.

In [4]:
print(breast_cancer_dataset.target_names)

['malignant' 'benign']


In [11]:
print(breast_cancer_dataset.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [11]:
print(breast_cancer_dataset.data.shape)

(569, 30)


Now perform a train test split with `random_state=0` and train a MLP with the default parameters and `random state=42`. After training print the accuracy on the training and the test set.

In [4]:
X = breast_cancer_dataset.data
Y = breast_cancer_dataset.target

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, random_state = 0)

from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=42).fit(X_train, Y_train)
print(f'Final Training Accuracy: {clf.score(X_train,Y_train)*100}%')
print(f'Model Accuracy: {clf.score(X_test,Y_test)*100}%')



Final Training Accuracy: 93.89671361502347%
Model Accuracy: 91.6083916083916%


You may compare the performance of the MLP with the RandomForestClassifier. You can use `n_estimators=1000` for the Random Forest. Print the accuracies.

In [4]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier(n_estimators = 1000)
clf2.fit(X_train,Y_train)

RandomForestClassifier(n_estimators=1000)

In [5]:
print(f'Final Training Accuracy: {clf2.score(X_train,Y_train)*100}%')
print(f'Model Accuracy: {clf2.score(X_test,Y_test)*100}%')

Final Training Accuracy: 100.0%
Model Accuracy: 97.2027972027972%


Which model has obtained the best performance? Why do you think that is? Compute the standard deviation for each feature in the dataset.

In [7]:
dataFrame = pd.DataFrame(breast_cancer_dataset.data, columns = breast_cancer_dataset.feature_names)
i = 0
for column in dataFrame:
    print(column)
    print(dataFrame.std()[column])
    i+=1


mean radius
3.524048826212078
mean texture
4.301035768166949
mean perimeter
24.2989810387549
mean area
351.9141291816527
mean smoothness
0.014064128137673616
mean compactness
0.0528127579325122
mean concavity
0.0797198087078935
mean concave points
0.03880284485915359
mean symmetry
0.027414281336035712
mean fractal dimension
0.007060362795084459
radius error
0.2773127329861041
texture error
0.5516483926172023
perimeter error
2.021854554042107
area error
45.49100551613178
smoothness error
0.003002517943839067
compactness error
0.017908179325677377
concavity error
0.030186060322988394
concave points error
0.006170285174046866
symmetry error
0.008266371528798399
fractal dimension error
0.0026460709670891942
worst radius
4.833241580469324
worst texture
6.146257623038323
worst perimeter
33.60254226903635
worst area
569.3569926699492
worst smoothness
0.022832429404835458
worst compactness
0.15733648891374194
worst concavity
0.20862428060813235
worst concave points
0.0657323411959421
worst sym

Now normalize the data, subtract the mean and divide by the standard deviation. You have to compute the mean and std on the training set, and use the same one for the test set.

In [10]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train_scaled = scaler.fit(X_train).transform(X_train)
X_test_scaled = scaler.fit(X_test).transform(X_test)




After doing that, you can check that the mean and std has been actually set to 0 and 1.

[1.40323217e+01 1.94583916e+01 9.14481119e+01 6.44385315e+02
 9.63432867e-02 1.06341049e-01 8.92437413e-02 4.82491189e-02
 1.83213287e-01 6.33356643e-02 4.06295105e-01 1.23063497e+00
 2.94077273e+00 3.92688182e+01 7.20275524e-03 2.66701818e-02
 3.24728252e-02 1.20779580e-02 2.08567133e-02 4.03964336e-03
 1.61273077e+01 2.57941259e+01 1.06671608e+02 8.59537063e+02
 1.31966993e-01 2.58521189e-01 2.80253776e-01 1.12600979e-01
 2.91345455e-01 8.51539860e-02]


Run again the MLP on the normalized data and print the accuracy on the training and the test. Did the results improve?

In [11]:
mlp = MLPClassifier(random_state=42)   

y_pred = mlp.fit(X_train_scaled, Y_train).predict(X_test_scaled)

print('Accuracy on the training subset: {:.3f}'.format(mlp.score(X_train_scaled, Y_train)))
print('Accuracy on the test subset: {:.3f}'.format(mlp.score(X_test_scaled, Y_test)))




Accuracy on the training subset: 0.993
Accuracy on the test subset: 0.958




You should have a warning saying that the optimization has not converged. This usually means we should add more iterations, set `max_iter` to 1000. What are the accuracies now? 

In [12]:
mlp = MLPClassifier(max_iter = 1000, random_state=42)   

y_pred = mlp.fit(X_train_scaled, Y_train).predict(X_test_scaled)

print('Accuracy on the training subset: {:.3f}'.format(mlp.score(X_train_scaled, Y_train)))
print('Accuracy on the test subset: {:.3f}'.format(mlp.score(X_test_scaled, Y_test)))




Accuracy on the training subset: 1.000
Accuracy on the test subset: 0.958


Now try different values of regularizer hyperparam `alpha`, which values gives the best results? You can compare to the RandomForest with normalized data.

In [18]:
mlp = MLPClassifier(max_iter = 1000, alpha = 0.9, random_state=42)   

y_pred = mlp.fit(X_train_scaled, Y_train).predict(X_test_scaled)

print('Accuracy on the training subset: {:.3f}'.format(mlp.score(X_train_scaled, Y_train)))
print('Accuracy on the test subset: {:.3f}'.format(mlp.score(X_test_scaled, Y_test)))





Accuracy on the training subset: 0.988
Accuracy on the test subset: 0.972
