# INFO 371 Lab - Artificial Neural Networks

In this lab you will experiment with Multi-Layer Perceptron models (a.k.a. neural networks) and examine how your architecture choices affect accuracy. As in the trees lab, we will use the Wisconsin
Diagnostic Breast Cancer (WDBC) data for classification.

The aim of this lab is to give you some experience with neural networks. Try to get the best accuracy you can (while also minimizing overfitting)!

In [258]:
import pandas as pd
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
import random

## Classification 
In this task you will work with the WDBC data. As a reminder, your task is to
predict __diagnosis__ ("M" = cancer, "B" = no cancer).


1. Load wdbc data and ensure it looks good.


2. Create your feature matrix $X$ and label vector $y$. The former should contain all 30 features, i.e. every column except __diagnosis__ and __id__. The latter should be __diagnosis__, converted to either a logical or a numeric variable (otherwise sklearn will fail).


3.  Split your data into training and validation chunks (or do cross validation below, but that is slower).


In [259]:
#code goes here
#1
df = pd.read_csv("wdbc.csv.bz2")
df.head(5)

Unnamed: 0,id,diagnosis,radius.mean,texture.mean,perimeter.mean,area.mean,smoothness.mean,compactness.mean,concavity.mean,concpoints.mean,...,radius.worst,texture.worst,perimeter.worst,area.worst,smoothness.worst,compactness.worst,concavity.worst,concpoints.worst,symmetry.worst,fracdim.worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [260]:
print(df.shape)
print(df.isna().sum())

(569, 32)
id                   0
diagnosis            0
radius.mean          0
texture.mean         0
perimeter.mean       0
area.mean            0
smoothness.mean      0
compactness.mean     0
concavity.mean       0
concpoints.mean      0
symmetry.mean        0
fracdim.mean         0
radius.se            0
texture.se           0
perimeter.se         0
area.se              0
smoothness.se        0
compactness.se       0
concavity.se         0
concpoints.se        0
symmetry.se          0
fracdim.se           0
radius.worst         0
texture.worst        0
perimeter.worst      0
area.worst           0
smoothness.worst     0
compactness.worst    0
concavity.worst      0
concpoints.worst     0
symmetry.worst       0
fracdim.worst        0
dtype: int64


In [261]:
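#2
# Features: every column except the label and the patient id
X = df.loc[:, ~df.columns.isin(['diagnosis', 'id'])]
# Label: keep diagnosis as a one-column frame, coded 1 = malignant (M), 0 = benign (B)
y = df.diagnosis.to_frame()

y.diagnosis = np.where(y.diagnosis == 'M', 1, 0)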

In [262]:
#3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
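
The split above is unseeded, so every run produces a different partition and slightly different accuracies. The following is a minimal sketch, assuming a toy label list in place of the WDBC labels, of how the `random_state` and `stratify` parameters of `train_test_split` make the split repeatable and keep the class proportions equal in both parts. The seed value 371, the helper name `seeded_split`, and the toy data are illustrative choices, not part of the lab.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 rows, 60 of class 0 and 40 of class 1.
X_toy = [[i] for i in range(100)]
y_toy = [0] * 60 + [1] * 40

def seeded_split(X, y, seed=371):
    """Split 80/20 with a fixed seed, preserving class proportions."""
    return train_test_split(X, y, test_size=0.20, random_state=seed, stratify=y)

X_tr, X_te, y_tr, y_te = seeded_split(X_toy, y_toy)
print(Counter(y_tr), Counter(y_te))
```

With a fixed seed, rerunning the cell gives the same rows in each part, which makes accuracy differences between models easier to attribute to the models rather than to the split.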

Now everything should be ready for us to train some Neural Networks.  Your
task is to analyze the effect of changing the __hidden_layer_sizes__ hyperparameter of [MLPClassifier](https://scikit-learn.org/stable/modules/neural_networks_supervised.html). 


4. Get a baseline testing and training accuracy for your model using the default parameters.


5. Experiment with different numbers of layers (and numbers of nodes in each layer) to try to get the best possible accuracy. The __hidden_layer_sizes__ parameter accepts a tuple whose length is the number of hidden layers and whose elements are the numbers of nodes in each layer. For example, hidden_layer_sizes=(5, 2, 3) gives a neural network with 3 hidden layers, where the first layer has 5 nodes, the second has 2 nodes, and the third has 3 nodes.


6. Report the best "architecture" you found. Tell us how many layers and how many nodes in total you used to get your best accuracy.


7. Does your model overfit? Back up your claim with evidence. 
        Hint: you need to examine test/train accuracy as you vary the complexity of the model.


8. Finally, compare the best accuracy you achieved using neural networks with the accuracy of a default Random Forest model. Which model gives you better accuracy? Which model is more prone to overfitting with this dataset?
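
To get a feel for how quickly "complexity" grows with `hidden_layer_sizes`, it can help to count trainable parameters. The sketch below is plain arithmetic for a fully connected network and assumes 30 input features and a single output unit for the binary label; the helper name `count_mlp_parameters` is hypothetical and not part of scikit-learn.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected feed-forward network."""
    layer_widths = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_widths, layer_widths[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

print(count_mlp_parameters(30, (5, 2, 3)))  # 180
print(count_mlp_parameters(30, (100,)))     # 3201
```

The second call uses `(100,)`, which is the default value of `hidden_layer_sizes` in `MLPClassifier`. Under these assumptions the narrow three-layer example has far fewer parameters than the single default layer, so depth alone is not a measure of model size.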



In [263]:
#code goes here
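#4
# Majority-class baseline: always predicts the most frequent training label.
# Note: this is a DummyClassifier, not an MLPClassifier with default parameters.
m = DummyClassifier(strategy = "most_frequent")
m.fit(X_train, y_train)
y_pred_train = m.predict(X_train)
y_pred_test = m.predict(X_test)

# Training accuracy first, then testing accuracy
print(accuracy_score(y_train, y_pred_train), accuracy_score(y_test, y_pred_test))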

0.6373626373626373 0.5877192982456141
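
The two numbers printed above are simply the share of the most frequent training class in each split. A minimal sketch of that calculation follows. The class counts (290 of 455 training rows and 67 of 114 test rows in the majority class) are inferred from the printed accuracies rather than read from the data, and the helper name `majority_baseline` is hypothetical.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    train_acc = sum(label == majority_label for label in train_labels) / len(train_labels)
    test_acc = sum(label == majority_label for label in test_labels) / len(test_labels)
    return majority_label, train_acc, test_acc

# Counts inferred from the printed accuracies (0 = benign, 1 = malignant).
train_labels = [0] * 290 + [1] * 165
test_labels = [0] * 67 + [1] * 47

label, train_acc, test_acc = majority_baseline(train_labels, test_labels)
print(label, round(train_acc, 3), round(test_acc, 3))  # 0 0.637 0.588
```

Any trained model should beat these values to be considered useful on this split.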


In [264]:
#5
best_lyr = ()
best_acc = 0
# Eight random architectures: two each with 1, 2, 3 and 4 hidden layers,
# every layer getting between 1 and 101 nodes (randint includes both ends).
lyrs = [tuple(random.randint(1, 101) for _ in range(depth))
        for depth in (1, 1, 2, 2, 3, 3, 4, 4)]
for lyr in lyrs:
    clf = MLPClassifier(hidden_layer_sizes = lyr, max_iter=1000)
    clf.fit(X_train, y_train.values.ravel())
    # Note: the architecture is selected by its accuracy on the test split
    score = clf.score(X_test, y_test)
    if best_acc < score:
        best_lyr = lyr
        best_acc = score

print("hidden_layer_sizes = " + str(best_lyr) + ": ")
print(best_acc)

hidden_layer_sizes = (35, 46): 
0.9385964912280702
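
Choosing the architecture by its test accuracy means the reported test score is no longer an independent estimate. The sketch below shows one possible alternative: pick the architecture on a separate validation split and score the test set once at the end. It assumes scikit-learn's bundled copy of the WDBC data (`load_breast_cancer`) as a stand-in for `wdbc.csv.bz2`, uses three illustrative architectures rather than the random ones above, and standardizes the features with `StandardScaler`, since scikit-learn's documentation notes that multi-layer perceptrons are sensitive to feature scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled WDBC copy stands in for wdbc.csv.bz2.
# It codes malignant as 0 and benign as 1, the reverse of the coding in this lab.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

candidates = [(10,), (30, 10), (50, 25, 10)]  # illustrative architectures
best_arch, best_val, best_model = None, -1.0, None
for arch in candidates:
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=arch, max_iter=1000, random_state=0))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val:
        best_arch, best_val, best_model = arch, val_acc, model

# The test set is used once, only for the selected architecture.
test_acc = best_model.score(X_test, y_test)
print(best_arch, round(best_val, 3), round(test_acc, 3))
```

Cross-validation (for example with `cross_val_score`) would serve the same purpose at a higher computational cost.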


In [265]:
#6
print("The best architecture: " + str(best_lyr))
print("The total number of layers: " + str(len(best_lyr)))

# Hidden nodes only; the input features and the output node are not counted
print("The total number of nodes: " + str(sum(best_lyr)))

The best architecture: (35, 46)
The total number of layers: 2
The total number of nodes: 81


In [266]:
#7
clf = MLPClassifier(hidden_layer_sizes = best_lyr, max_iter=1000)
clf.fit(X_train, y_train.values.ravel())
print(clf.score(X_train, y_train), clf.score(X_test, y_test))

0.9296703296703297 0.9385964912280702


The model does not appear to overfit: the training accuracy (0.930) and the testing accuracy (0.939) are both high and close to each other, and the testing accuracy is in fact slightly higher than the training accuracy.
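
One way to gather further evidence, following the hint in step 7, is to track training and testing accuracy as the network grows. The sketch below again assumes scikit-learn's bundled WDBC copy as a stand-in for `wdbc.csv.bz2`; the architectures, the seed, and the helper name `accuracy_by_complexity` are illustrative, and no results are claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumption: the bundled WDBC copy stands in for wdbc.csv.bz2.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

def accuracy_by_complexity(architectures):
    """Return (architecture, train accuracy, test accuracy, gap) per model."""
    rows = []
    for arch in architectures:
        clf = MLPClassifier(hidden_layer_sizes=arch, max_iter=1000, random_state=0)
        clf.fit(X_train, y_train)
        train_acc = clf.score(X_train, y_train)
        test_acc = clf.score(X_test, y_test)
        rows.append((arch, train_acc, test_acc, train_acc - test_acc))
    return rows

# Illustrative architectures, ordered from least to most complex.
rows = accuracy_by_complexity([(2,), (10,), (50,), (50, 50)])
for arch, train_acc, test_acc, gap in rows:
    print(f"{str(arch):>10}  train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:+.3f}")
```

A gap that widens as the architecture grows, with training accuracy rising while testing accuracy stalls or falls, would point to overfitting.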

In [267]:
#8
clf_r = RandomForestClassifier()
clf_r.fit(X_train, y_train.values.ravel())
print(clf_r.score(X_train, y_train), clf_r.score(X_test, y_test))

1.0 0.9649122807017544


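The Random Forest model achieved higher accuracy on this dataset (0.965 versus 0.939 on the test set), but it appears more prone to overfitting: its training accuracy (1.000) is noticeably higher than its testing accuracy (0.965), whereas the neural network's training and testing accuracies are nearly equal.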
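
The comparison can be made explicit by computing the gap between training and testing accuracy for each model. The sketch below copies the accuracies from the printed outputs in this notebook; the helper name `generalization_gap` is hypothetical.

```python
# Accuracies copied from the printed outputs in this notebook: (train, test).
results = {
    "MLP (35, 46)": (0.9296703296703297, 0.9385964912280702),
    "Random Forest": (1.0, 0.9649122807017544),
}

def generalization_gap(train_acc, test_acc):
    """Training minus testing accuracy; large positive values suggest overfitting."""
    return train_acc - test_acc

for name, (train_acc, test_acc) in results.items():
    gap = generalization_gap(train_acc, test_acc)
    print(f"{name}: gap = {gap:+.3f}")
# MLP (35, 46): gap = -0.009
# Random Forest: gap = +0.035
```

With only 114 test rows, a single misclassified row moves the testing accuracy by about 0.9 percentage points, so gaps of this size come from a handful of rows and may change with a different split.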