# INFO 3071 Lab - Artificial Neural Networks

This lab asks you to experiment with Multi-Layer Perceptron models (a.k.a. neural networks) and examine how your architecture choices affect accuracy. As in the trees lab, we will use the Wisconsin
Diagnostic Breast Cancer (WDBC) data for classification.

The aim of this lab is to give you some experience with neural networks. Try to get the best accuracy you can (while also minimizing overfitting)!

## Classification 
In this task you will work with the WDBC data. As a reminder, your task is to
predict __diagnosis__ ('M' = cancer, 'B' = no cancer).


1. Load the WDBC data and check that it looks reasonable.


2. Create your feature matrix $X$ and label vector $y$. The former should contain all 30 features, i.e. every column except __diagnosis__ and __id__. The latter should be __diagnosis__, converted to a logical or numeric variable (otherwise sklearn will fail).


3. Split your data into training and validation sets (or use cross-validation below, but that is slower).


In [1]:
# Load the data, build the feature matrix and 0/1 labels, then hold out 20% for testing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv("wdbc.csv.bz2")

y = df.diagnosis
y = (y == 'M').astype(int)
df = df.drop(columns=['id', 'diagnosis'])

x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.20)



Now everything should be ready for us to train some Neural Networks.  Your
task is to analyze the effect of changing the __hidden_layer_sizes__ hyperparameter of [MLPClassifier](https://scikit-learn.org/stable/modules/neural_networks_supervised.html). 


4. Get baseline testing and training accuracies for your model using the default parameters.


5. Experiment with different numbers of layers (and numbers of nodes in each layer) to try to get the best possible accuracy. The __hidden_layer_sizes__ parameter accepts a tuple whose length is the number of hidden layers and whose elements are the numbers of nodes in each layer. For example, setting hidden_layer_sizes=(5, 2, 3) gives a neural network with 3 hidden layers, where the first layer has 5 nodes, the second 2 nodes, and the third 3 nodes.


6. Report the best "architecture" you found. Tell us how many layers and how many nodes in total you used to get your best accuracy.


7. Does your model overfit? Back up your claim with evidence. 
        Hint: you need to examine test/train accuracy as you vary the complexity of the model.


8. Finally, compare the best accuracy you achieved using neural networks with the accuracy of a default Random Forest model. Which model gives you better accuracy? Which model is more prone to overfitting with this dataset?
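To make the meaning of __hidden_layer_sizes__ concrete, the sketch below fits a small network and inspects the shapes of its weight matrices. It assumes a synthetic 30-feature dataset from `make_classification` as a stand-in for the WDBC features; the names `X_demo`, `y_demo`, and `demo` are illustrative only.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 30-feature WDBC matrix (assumption: not the lab data).
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)

# Three hidden layers with 5, 2, and 3 nodes, as in the example in step 5.
demo = MLPClassifier(hidden_layer_sizes=(5, 2, 3), max_iter=50, random_state=0)
with warnings.catch_warnings():
    # So few iterations will not converge; the shapes are all we need here.
    warnings.simplefilter("ignore", ConvergenceWarning)
    demo.fit(X_demo, y_demo)

# One weight matrix per pair of adjacent layers: input -> 5 -> 2 -> 3 -> output.
layer_shapes = [w.shape for w in demo.coefs_]
print(layer_shapes)

# Total number of hidden nodes, the quantity asked for in step 6.
total_hidden_nodes = sum(demo.hidden_layer_sizes)
print(total_hidden_nodes)
```

For a binary target the output layer has a single unit, so the last matrix has one column.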



In [19]:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
model.fit(x_train, y_train)
print("default testing accuracy:", model.score(x_test, y_test))
print("default training accuracy:", model.score(x_train, y_train))

default testing accuracy: 0.9473684210526315
default training accuracy: 0.9406593406593406


In [None]:
# Try every 3-hidden-layer architecture with 2 to 6 nodes per layer (125 fits)
test_accuracy = []
train_accuracy = []
for i in range(2,7):
    for j in range(2,7):
        for k in range(2,7):
            model = MLPClassifier(hidden_layer_sizes=(i,j,k))
            model.fit(x_train, y_train)
            test = ((model.score(x_test, y_test)), (i,j,k))
            train = ((model.score(x_train, y_train)), (i,j,k))
            test_accuracy.append(test)
            train_accuracy.append(train)

In [18]:
print("testing architecture: ", max(test_accuracy))
print("training architecture: ",max(train_accuracy))

testing architecture:  (0.956140350877193, (5, 3, 5))
training architecture:  (0.9428571428571428, (5, 3, 4))


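The results above show that, among the 3-hidden-layer architectures tried, (5, 3, 5) gave the best testing accuracy (0.956) and (5, 3, 4) gave the best training accuracy (0.943). Taking testing accuracy as the criterion, the best architecture is (5, 3, 5): 3 hidden layers with 13 hidden nodes in total. To keep the run time manageable, the loop above only considers three hidden layers with 2 to 6 nodes each. The model does not seem to overfit noticeably: the highest training accuracy in the whole search (0.943) is still below the best testing accuracy (0.956), so the (5, 3, 5) model scores at least as well on the test set as on the training set, and the default model's training and testing accuracies (0.941 and 0.947) are similarly close.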
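The overfitting check from the hint in step 7 (compare training and testing accuracy as model complexity grows) can be sketched as a small table with one row per architecture. This is a minimal sketch under stated assumptions: `accuracy_gap_table` is a hypothetical helper, the data is synthetic, and the three architectures are arbitrary examples rather than the lab's search grid.

```python
import warnings

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def accuracy_gap_table(architectures, X_tr, X_te, y_tr, y_te, seed=0):
    """Fit one MLP per architecture and tabulate train/test accuracy and their gap."""
    rows = []
    for arch in architectures:
        model = MLPClassifier(hidden_layer_sizes=arch, max_iter=300, random_state=seed)
        with warnings.catch_warnings():
            # Some networks may stop before converging; silence that warning here.
            warnings.simplefilter("ignore", ConvergenceWarning)
            model.fit(X_tr, y_tr)
        train_acc = model.score(X_tr, y_tr)
        test_acc = model.score(X_te, y_te)
        rows.append({
            "architecture": arch,
            "hidden_nodes": sum(arch),
            "train_acc": train_acc,
            "test_acc": test_acc,
            "gap": train_acc - test_acc,  # a large positive gap suggests overfitting
        })
    return pd.DataFrame(rows).sort_values("hidden_nodes").reset_index(drop=True)


# Synthetic stand-in data (assumption); in the lab, pass x_train, x_test, y_train, y_test.
X_syn, y_syn = make_classification(n_samples=300, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

gap_table = accuracy_gap_table([(2, 2, 2), (5, 3, 5), (50, 50, 50)], X_tr, X_te, y_tr, y_te)
print(gap_table)
```

Keeping both accuracies from the same fitted model in one row avoids comparing the best training score of one architecture with the best testing score of another.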

In [17]:
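from sklearn.ensemble import RandomForestClassifier

# Random forest for comparison (criterion='entropy' instead of the default 'gini')
forest = RandomForestClassifier(criterion='entropy')
m = forest.fit(x_train, y_train)
print(m.score(x_train, y_train))  # training accuracy
m.score(x_test, y_test)  # testing accuracy (shown as the cell output)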


1.0


0.956140350877193

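Based on the results above, the random forest and the neural network are comparable in testing accuracy (both reach 0.956 on this split), but the random forest may be more prone to overfitting: it reaches perfect training accuracy (1.0), leaving a gap of about 0.04 between its training and testing accuracy, whereas the neural network shows no such gap. The forest here was fit with criterion='entropy' rather than the default 'gini'. Looking at the literature, though, my findings seem to go against the norm, which considers random forests more robust than neural networks.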
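A single 80/20 split gives a fairly coarse comparison: assuming the standard 569-row WDBC file, the test set holds 114 rows, so 0.956 versus 0.947 is a difference of one test sample. The cross-validation option mentioned in step 3 could be sketched as follows. This assumes scikit-learn's bundled copy of the data (`load_breast_cancer`) as a stand-in for `wdbc.csv.bz2`, and the fold count, seeds, and `max_iter` value are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Bundled copy of the WDBC data (assumption: stands in for wdbc.csv.bz2).
# Its target coding (0 = malignant, 1 = benign) is the reverse of the coding used in this lab.
X_cv, y_cv = load_breast_cancer(return_X_y=True)

# The same shuffled, stratified folds are reused for both models.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "MLP (5, 3, 5)": MLPClassifier(hidden_layer_sizes=(5, 3, 5), max_iter=500, random_state=0),
}

cv_scores = {}
for name, estimator in candidates.items():
    # One accuracy value per fold; the spread shows how much a single split can vary.
    cv_scores[name] = cross_val_score(estimator, X_cv, y_cv, cv=folds)
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

If the fold-to-fold standard deviation is larger than the difference between the two means, the comparison on this dataset is probably too close to call.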