## Classification MLP

In [15]:
#import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neural_network import MLPClassifier

def check_NaN(dataframe):
    #report missing values in total and per column
    print("Total NaN:", dataframe.isnull().values.sum())
    print("NaN by column:\n",dataframe.isnull().sum())

def one_hot_encode(dataframe, col_name):
    #replace col_name with one 0/1 column per category, named <col_name>_<value>
    return pd.get_dummies(dataframe, columns=[col_name], prefix=[col_name], dtype=int)

### Using a Multi-Layered Perceptron (MLP) to Classify Mushrooms as Edible or Poisonous
In this notebook, we'll use the mushroom classification dataset (available at https://www.kaggle.com/uciml/mushroom-classification) to train an MLP to determine whether a mushroom is edible (e) or poisonous (p), based on its physical characteristics.

In [16]:
#load the dataset
data = pd.read_csv("./mushrooms.csv")

In [17]:
#check out its features
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


Let's choose gill-size (narrow or broad), spore-print-color, odor, cap-shape, gill-spacing, cap-color and cap-surface as our features, and keep class as the label. The spore-print-color codes are: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y.

In [18]:
chosen_features = data.filter(['class','gill-size','spore-print-color',"odor","cap-shape", "gill-spacing","cap-color", "cap-surface"])
chosen_features.head()

Unnamed: 0,class,gill-size,spore-print-color,odor,cap-shape,gill-spacing,cap-color,cap-surface
0,p,n,k,p,x,c,n,s
1,e,b,n,a,x,c,y,s
2,e,b,n,l,b,c,w,s
3,p,n,k,p,x,c,w,y
4,e,b,n,n,x,w,g,s


In [19]:
#always remember to check for NaN values
check_NaN(chosen_features)

Total NaN: 0
NaN by column:
 class                0
gill-size            0
spore-print-color    0
odor                 0
cap-shape            0
gill-spacing         0
cap-color            0
cap-surface          0
dtype: int64


One-hot encode the chosen features:

In [20]:
subset = one_hot_encode(chosen_features, 'class')
subset = one_hot_encode(subset, 'gill-size')
subset = one_hot_encode(subset, 'spore-print-color')
subset = one_hot_encode(subset, 'cap-shape')
subset = one_hot_encode(subset, 'odor')
subset = one_hot_encode(subset, 'gill-spacing')
subset = one_hot_encode(subset, 'cap-color')
subset = one_hot_encode(subset, 'cap-surface')

subset.head()

Unnamed: 0,class_e,class_p,gill-size_b,gill-size_n,spore-print-color_b,spore-print-color_h,spore-print-color_k,spore-print-color_n,spore-print-color_o,spore-print-color_r,...,cap-color_n,cap-color_p,cap-color_r,cap-color_u,cap-color_w,cap-color_y,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y
0,0,1,0,1,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,1,0
2,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,1,0,1,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,1
4,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
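To see what one-hot encoding does on a small scale, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not rows from mushrooms.csv). It also shows that `pd.get_dummies` accepts several column names in one call, which could stand in for the repeated helper calls above.

```python
import pandas as pd

# Toy frame standing in for two mushroom columns (illustrative values only).
toy = pd.DataFrame({
    "gill-size": ["n", "b", "b"],
    "spore-print-color": ["k", "n", "n"],
})

# One call encodes both columns; each category becomes its own 0/1 column.
encoded = pd.get_dummies(toy, columns=["gill-size", "spore-print-color"], dtype=int)

print(list(encoded.columns))
print(encoded.values.tolist())
```

Only categories that actually occur in the data get a column, so a spore print colour that is absent from the frame produces no indicator.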


Now, let's use 'class_e' as the target. This means that if the MLP returns 1, the mushroom is edible, and if it returns 0, the mushroom is poisonous. 'class_p' is simply the complement of 'class_e', so we drop it. Likewise, we keep 'gill-size_b' and drop 'gill-size_n', because gill size has only two values: it is broad when 'gill-size_b' = 1 and narrow when it = 0. All the other one-hot columns are kept as features to train on.
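The reason one indicator of a two-valued column can be dropped is that the pair always sums to 1. A minimal sketch of that check, again on made-up labels rather than the real dataset:

```python
import pandas as pd

# Illustrative labels only; not rows from mushrooms.csv.
toy = pd.DataFrame({"class": ["p", "e", "e", "p"]})
encoded = pd.get_dummies(toy, columns=["class"], dtype=int)

# For a two-valued column the two indicators always sum to 1,
# so one of them can be dropped without losing information.
is_complement = bool((encoded["class_e"] + encoded["class_p"] == 1).all())
target = encoded.drop(columns=["class_p"])

print(is_complement)
print(list(target.columns))
```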

In [21]:
#keep every one-hot column except the redundant complements 'class_p' and 'gill-size_n'
final = subset.drop(["class_p", "gill-size_n"], axis = 1)

In [22]:
#Create the train/test splits as we did before
x_train, x_test, y_train, y_test = train_test_split(
    final.drop(['class_e'], axis=1),
    final['class_e'],
    test_size=0.2,
    random_state=1)
print("x train/test ",x_train.shape, x_test.shape)
print("y train/test ",y_train.shape, y_test.shape)

x train/test  (6499, 41) (1625, 41)
y train/test  (6499,) (1625,)
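A plain random split does not guarantee that both parts contain the same share of edible mushrooms. As a hedged sketch, `train_test_split` accepts a `stratify` argument that preserves the label proportions. The data below is synthetic and the 30% positive rate is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 binary features, 30% positive labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3))
y = np.array([1] * 30 + [0] * 70)

# stratify=y keeps the 30/70 label ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```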


In [23]:
#Convert them from pandas to numpy arrays
x = x_train.values
y = y_train.values
x_t = x_test.values
y_t = y_test.values

#### MLP Training and Evaluation
Let's create an MLP. Currently the only loss function it supports is the Cross-Entropy loss function, which is used by default. By default, it uses the ReLU activation function. It also has a default of 1 hidden layer, containing 100 neurons. 

Here are some parameter options you can explore:
* activation: {'identity', 'logistic', 'tanh', 'relu'}, default='relu'
* hidden_layer_sizes: tuple, length = n_layers - 2, default=(100,), where the ith element represents the number of neurons in the ith hidden layer.
* Find out more here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
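As an illustration of how these options fit together, the sketch below builds an MLP with two hidden layers and a tanh activation on synthetic data. The layer sizes, `max_iter` and `random_state` values are arbitrary choices for the example, not recommendations for the mushroom data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: the label simply copies the first binary feature.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0]

# Two hidden layers (16 and 8 neurons) with tanh activations.
clf = MLPClassifier(
    hidden_layer_sizes=(16, 8),
    activation="tanh",
    max_iter=500,
    random_state=0,
)
clf.fit(X, y)

# n_layers_ counts the input and output layers as well, hence length = n_layers - 2.
print(clf.n_layers_)
print([w.shape for w in clf.coefs_])
```

The weight matrices in `coefs_` connect consecutive layers, so their shapes trace the architecture from the 5 inputs through both hidden layers to the single output unit used for a binary target.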

In [33]:
MLP = MLPClassifier() #options to try: activation='logistic', hidden_layer_sizes=(1,)

In [34]:
#train the mlp
MLP.fit(x, y)

In [35]:
predictions = MLP.predict(x_t)
#Calculate the mean squared error (for 0/1 labels this equals the misclassification rate) and the accuracy
print("Mean squared error: ",np.mean((predictions - y_t) ** 2))
print("Accuracy:",str(round(metrics.accuracy_score(y_t, predictions)*100))+"%")

Mean squared error:  0.0012307692307692308
Accuracy: 100%
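Accuracy, F1 score and mean squared error measure different things and are easy to confuse. A minimal sketch on made-up labels (not the mushroom predictions) shows how they relate. For 0/1 labels the mean squared error is exactly 1 minus the accuracy, while F1 depends only on how the positive class is handled.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 3 positives, 5 negatives, with one miss of each kind.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])

accuracy = metrics.accuracy_score(y_true, y_pred)   # fraction of correct predictions
f1 = metrics.f1_score(y_true, y_pred)               # harmonic mean of precision and recall
mse = np.mean((y_pred - y_true) ** 2)               # equals the error rate for 0/1 labels
cm = metrics.confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted

print(accuracy, f1, mse)
print(cm)
```

For a poisonous-versus-edible decision the confusion matrix is worth reading directly, because a poisonous mushroom predicted as edible is a far more costly error than the reverse.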


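Test a single mushroom. The input must contain 41 values in the same column order as x_train, where index 0 is gill-size_b (1 = broad) and index 3 is spore-print-color_k (black). Note that the hand-written vector below sets index 4 (spore-print-color_n, brown) rather than index 3 and switches on more than one spore-print-color column, so treat it as a quick check that the model accepts a single sample rather than as the encoding of a real mushroom.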

In [36]:
test_mushroom = [1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0]
prediction = MLP.predict([test_mushroom])

In [37]:
if prediction==1:
    print('Edible')
else:
    print('Poisonous')

Edible
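Writing a 41-element vector by hand is error-prone, because it is easy to set the wrong index. One possible alternative is to build the vector from column names. The helper `encode_mushroom` and the shortened column list below are hypothetical and only illustrate the idea. In the notebook the list would come from `x_train.columns`.

```python
# Hypothetical, shortened feature list; the notebook would use list(x_train.columns).
feature_columns = [
    "gill-size_b",
    "spore-print-color_b",
    "spore-print-color_h",
    "spore-print-color_k",
    "spore-print-color_n",
]

def encode_mushroom(active_columns, feature_columns):
    """Return a 0/1 list with 1 at each named column, in training column order."""
    unknown = set(active_columns) - set(feature_columns)
    if unknown:
        raise ValueError(f"Unknown columns: {sorted(unknown)}")
    return [1 if col in active_columns else 0 for col in feature_columns]

# Broad gills and a black spore print.
vector = encode_mushroom({"gill-size_b", "spore-print-color_k"}, feature_columns)
print(vector)
```

The resulting list can be passed to `predict` wrapped in an outer list, as in the cell above, and a misspelled column name raises an error instead of silently producing a wrong vector.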
