# ANN (MLP) Code Examples

In [1]:
# set up (only the imports this cell uses)
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

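# get data: drop the ID column, then separate the features from the target
df = pd.read_csv("assets/wdbc.csv")
df = df.drop(columns=['ID'])
X = df.drop(columns=['Diagnosis'])
y = df['Diagnosis']

# neural network with two hidden layers (60 and 30 nodes);
# note that the features are passed in unscaled
model = MLPClassifier(hidden_layer_sizes=(60, 30), max_iter=10000)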

# do the 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)
print("Fold Accuracies: {}".format(scores))
print("Accuracy: {}".format(scores.mean()))


Fold Accuracies: [ 0.94782609  0.37391304  0.90265487  0.92920354  0.88495575]
Accuracy: 0.8077106579453636
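
One fold above scores about 0.37 while the others are near 0.9. scikit-learn's documentation notes that MLPs are sensitive to feature scaling, so one thing worth trying is to standardize the features inside a pipeline, which refits the scaler on the training folds only. The following is a minimal sketch, not the notebook's own code: it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `assets/wdbc.csv`, and the `random_state` and smaller `max_iter` are added here only to keep the run short and repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data (assumption): scikit-learn's bundled copy of the WDBC data
X, y = load_breast_cancer(return_X_y=True)

# the scaler is part of the model, so each CV fold scales with its own training data
scaled_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(60, 30), max_iter=1000, random_state=0),
)

scaled_scores = cross_val_score(scaled_model, X, y, cv=5)
print("Fold Accuracies: {}".format(scaled_scores))
print("Accuracy: {}".format(scaled_scores.mean()))
```

Comparing the spread of these fold accuracies with the unscaled run is a quick way to check whether scaling was the cause of the weak fold.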


## MLP Grid Search

We can also use a grid search to look for a better network architecture.

BEWARE: an exhaustive grid search over every MLP parameter is rarely feasible. The number of combinations multiplies with each parameter added to the grid (combinatorial explosion), and every combination has to be trained once per fold.

Here the grid varies only the number of nodes in the first hidden layer; the second hidden layer is fixed at 30 nodes.
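
To make the combinatorial growth concrete, the sketch below counts the candidates in a slightly larger grid using scikit-learn's `ParameterGrid`. The extra parameters and their value lists are illustrative assumptions chosen for this example; they are not a recommended search space.

```python
from sklearn.model_selection import ParameterGrid

# illustrative grid (assumption): the 11 layer sizes used here plus four more parameters
illustrative_grid = {
    "hidden_layer_sizes": [(n, 30) for n in (5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)],
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "learning_rate_init": [1e-4, 1e-3, 1e-2, 1e-1],
}

# number of candidate settings = product of the list lengths
n_candidates = len(ParameterGrid(illustrative_grid))

# with 5-fold cross validation every candidate is trained five times
n_fits = n_candidates * 5
print("Candidates: {}, model fits with cv=5: {}".format(n_candidates, n_fits))
```

Four additional parameters turn 11 candidates into 2640, or 13200 model fits with 5 folds, which is why the grid in this section is restricted to one dimension.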



In [2]:
# set up
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
# helper from a local bootstrap.py (assumed; it is not part of scikit-learn)
from bootstrap import bootstrap

# get data
df = pd.read_csv("assets/wdbc.csv")
df = df.drop(['ID'],axis=1)
X  = df.drop(['Diagnosis'],axis=1)
actual_y = df['Diagnosis']

# neural network
model = MLPClassifier(max_iter=10000)

# grid search over the size of the first hidden layer (second layer fixed at 30)
param_grid = {'hidden_layer_sizes': [ (5,30), (10,30), (20,30), (30,30), 
                                     (40,30), (50,30), (60,30), (70,30), 
                                     (80,30), (90,30), (100,30)]}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, actual_y)
print("Grid Search: best parameters: {}".format(grid.best_params_))

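# evaluate the best model on the same data it was refit on;
# this is training accuracy (grid.best_score_ holds the cross-validated accuracy)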
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
print("Accuracy: {}".format(accuracy_score(actual_y, predict_y)))

# build the confusion matrix
labels = ['B', 'M']
cm = confusion_matrix(actual_y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

# bootstrapped confidence interval
print("Confidence interval best MLP: {}".format(bootstrap(best_model,df,'Diagnosis')))

Grid Search: best parameters: {'hidden_layer_sizes': (50, 30)}
Accuracy: 0.9033391915641477
Confusion Matrix:
     B    M
B  350    7
M   48  164
Confidence interval best MLP: (0.30701754385964913, 0.95614035087719296)
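
The `bootstrap` helper used in the cell above is not shown in this notebook, so its exact procedure is unknown. As a rough illustration of the general idea, the sketch below computes a percentile bootstrap interval for accuracy: each round trains on rows sampled with replacement and scores on the rows that were left out (the out-of-bag rows). The function name `bootstrap_ci`, its parameters, the stand-in data from `load_breast_cancer`, and the scaling step are all assumptions made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def bootstrap_ci(model, X, y, n_rounds=200, alpha=0.05, seed=0):
    """Hypothetical helper: percentile interval of out-of-bag accuracy."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    scores = []
    for _ in range(n_rounds):
        # sample n row indices with replacement for training
        train_idx = rng.integers(0, n, size=n)
        # rows never drawn in this round form the test set
        oob_idx = np.setdiff1d(np.arange(n), train_idx)
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        scores.append(fitted.score(X[oob_idx], y[oob_idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# stand-in data (assumption): scikit-learn's bundled copy of the WDBC data
X_demo, y_demo = load_breast_cancer(return_X_y=True)
demo_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50, 30), max_iter=500, random_state=0),
)

# only 10 rounds to keep the demo quick; use several hundred in practice
ci_lower, ci_upper = bootstrap_ci(demo_model, X_demo, y_demo, n_rounds=10)
print("Confidence interval: ({}, {})".format(ci_lower, ci_upper))
```

Because every round scores on rows the model did not see, the interval describes held-out accuracy, unlike the single accuracy figure printed above, which is measured on the training data.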


# Team Exercise

Use the Crohn’s Disease dataset: [CrohnD](https://vincentarelbundock.github.io/Rdatasets/datasets.html)

You will need to preprocess the data before you can use it by recoding the categorical values as numbers:

c1 -> 0, c2 -> 1, F -> 0, M -> 1
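
A recoding like the one above can be done with pandas `map`. This is a minimal sketch on a tiny hand-made frame, not on the real data; the column names `country` and `sex` are assumptions about the CrohnD file and should be checked against the actual CSV header.

```python
import pandas as pd

# tiny hand-made frame for illustration only; these are not real CrohnD rows
toy = pd.DataFrame({
    "country": ["c1", "c2", "c1"],   # assumed column name
    "sex": ["F", "M", "M"],          # assumed column name
})

# replace each category label with its numeric code
toy["country"] = toy["country"].map({"c1": 0, "c2": 1})
toy["sex"] = toy["sex"].map({"F": 0, "M": 1})
print(toy)
```

Any label missing from the mapping dictionary becomes `NaN`, so it is worth checking `toy.isna().sum()` after recoding the real data.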

Build an ANN/MLP with the best cross-validated performance you can find.

Compare it to either a tree or a KNN (or both).

Report whether the difference between the models is statistically significant.
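
One possible way to check significance, sketched below, is a paired t-test on per-fold accuracies obtained from the same folds for both models. This is only one option and an approximate one, since fold scores are not fully independent; comparing bootstrapped confidence intervals is another. The synthetic data from `make_classification`, the model settings, and all variable names are assumptions for illustration and would be replaced by the preprocessed CrohnD data and your own tuned models.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption), used only to make the sketch runnable
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# one fixed splitter so both models are scored on identical folds
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0),
)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

mlp_scores = cross_val_score(mlp, X_syn, y_syn, cv=folds)
tree_scores = cross_val_score(tree, X_syn, y_syn, cv=folds)

# paired test: compares the two models fold by fold
result = ttest_rel(mlp_scores, tree_scores)
print("MLP mean: {}, tree mean: {}".format(mlp_scores.mean(), tree_scores.mean()))
print("t = {}, p = {}".format(result.statistic, result.pvalue))
```

A p-value below the chosen threshold (0.05 is a common convention) would suggest the difference in mean accuracy is unlikely to be due to fold-to-fold noise alone.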


# Teams

