Mushroom Classification Challenge - Safe to eat or deadly poison?
---
Given a set of mushroom features, decide whether the mushroom is poisonous or edible. 
The classifier will be trained on a dataset from [Kaggle](https://www.kaggle.com/uciml/mushroom-classification), described as follows: 

"This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family Mushroom drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be'' for Poisonous Oak and Ivy."



In [13]:
#
# First, let's import required dependencies
#

import numpy as np
import pandas as pd

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [14]:
#
# Then, let's load and inspect the dataset (requires mushrooms.csv from https://www.kaggle.com/uciml/mushroom-classification
# to be present in the root of the Colab file manager)
#

df = pd.read_csv("mushrooms.csv")
display(df)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,e,?,s,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,e,?,s,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,e,?,s,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,t,?,s,k,w,w,p,w,o,e,w,v,l
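
Before encoding, it can help to check how many distinct values each column takes. The sketch below is a minimal, standard-library illustration that uses only the nine rows visible in the preview above, restricted to four columns; `COLUMNS`, `PREVIEW_ROWS` and `summarize` are hypothetical names introduced here. On the full DataFrame, `df.nunique()` and `df['class'].value_counts()` give the same kind of summary.

```python
from collections import Counter

# Rows copied from the preview above, reduced to four columns:
# (class, odor, stalk-root, veil-type)
COLUMNS = ("class", "odor", "stalk-root", "veil-type")
PREVIEW_ROWS = [
    ("p", "p", "e", "p"),
    ("e", "a", "c", "p"),
    ("e", "l", "c", "p"),
    ("p", "p", "e", "p"),
    ("e", "n", "e", "p"),
    ("e", "n", "?", "p"),
    ("e", "n", "?", "p"),
    ("e", "n", "?", "p"),
    ("p", "y", "?", "p"),
]

def summarize(rows, columns):
    """Return {column name: Counter of the values seen in that column}."""
    return {name: Counter(row[i] for row in rows) for i, name in enumerate(columns)}

summary = summarize(PREVIEW_ROWS, COLUMNS)
for name, counts in summary.items():
    print(name, dict(counts))
```

In these nine rows `veil-type` takes a single value, so it carries no information here, and `stalk-root` contains `?`, which the encoder will treat as just another category. Whether the same holds for the full dataset should be checked on `df` itself.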


In [15]:
# As we can see, the dataset contains single-character codes, with the class labels in the first
# column and the feature names in the header row.
# However, to train the classifier we need 2 separate arrays for classes and features,
# both with numerical values.
#
# Let's pre-process the data. OneHotEncoder will replace the character codes with 0-1 values.
# Note that it'll increase the number of columns, since not all features are binary.

# Separate encoders for labels and features, so each keeps its own fitted categories
label_encoder = OneHotEncoder(drop='first', dtype=int)
feature_encoder = OneHotEncoder(drop='first', dtype=int)

# Separate the poisonous/edible column
y = df.loc[:, 'class'].values

# Remove the aforementioned column from the dataset
X = df.drop(['class'], axis=1)

# One-hot encode both. Note reshape(), as the column returned by df.loc is one-dimensional,
# and toarray(), which turns the sparse encoder output into a dense NumPy array
# to prevent Keras from throwing sparse tensor errors later on (X is densified for the same reason).
# With drop='first' the 'e' category is dropped, so y is 1 for poisonous and 0 for edible.
y = label_encoder.fit_transform(y.reshape(-1, 1)).toarray()
X = feature_encoder.fit_transform(X).toarray()

#
# Finally, let's split our dataset into Train/Val/Test subsets (roughly 64% / 16% / 20%)
#

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, shuffle=True)
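
To make the effect of `drop='first'` concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration only, not scikit-learn's implementation; `one_hot_drop_first` is a hypothetical helper, and it assumes categories are ordered alphabetically, as `OneHotEncoder` does by default for string categories.

```python
def one_hot_drop_first(values):
    """One-hot encode a list of category codes, dropping the first (sorted) category.

    Returns (kept_categories, encoded_rows). A value equal to the dropped
    category is encoded as all zeros.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the all-zeros baseline
    encoded = [[1 if value == cat else 0 for cat in kept] for value in values]
    return kept, encoded

# The class column has two codes, so it collapses to a single 0/1 column
kept, encoded = one_hot_drop_first(["p", "e", "e", "p", "e"])
print(kept, encoded)

# A feature with four codes expands to three columns
odor_kept, odor_encoded = one_hot_drop_first(["p", "a", "l", "p", "n"])
print(odor_kept, odor_encoded)
```

Because `'e'` sorts before `'p'`, the remaining class column is 1 for poisonous and 0 for edible.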

In [16]:
#
# Now we need a model. In this particular case a very basic dense + dropout model should suffice.
# Dropout between the dense layers helps prevent overfitting.
# Sigmoid activation in the final layer, as we're working with 2 classes.
#

model = Sequential()
model.add(Dense(32, activation='relu'))
model.add(Dropout(rate=0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(rate=0.25))
model.add(Dense(1, activation='sigmoid')) 

# Binary crossentropy since we're facing a binary classification problem,
# and the Adam optimizer with a custom learning rate for smoother training
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics=['accuracy'])

#
# Our classifier learns rather quickly; 10 epochs turned out to be sufficient for 99%+ accuracy
#

history = model.fit(X_train, y_train, batch_size=16, epochs=10, validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
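
For reference, the binary cross-entropy reported as `loss` during training can be written out by hand. This is a simplified sketch, not Keras's implementation; the clipping constant `eps` is an assumption added to keep the logarithm finite.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss
confident = binary_cross_entropy([1, 0], [0.99, 0.01])
# Undecided predictions (p = 0.5) give log(2)
undecided = binary_cross_entropy([1, 0], [0.5, 0.5])
print(confident, undecided)
```

The loss falls as the predicted probability for the true class approaches 1, which is why a test loss close to zero goes together with high accuracy.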


In [17]:
#
# Finally, let's evaluate the classifier
#

score = model.evaluate(X_test, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.011611659079790115
Test accuracy: 0.9975384473800659
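
Accuracy alone does not say which kind of mistake the classifier makes, and for this task labelling a poisonous mushroom as edible is the costlier error. The sketch below counts the four outcomes from predicted probabilities. It assumes label 1 means poisonous (which is what `drop='first'` yields when `'e'` sorts before `'p'`) and a 0.5 decision threshold; `confusion_counts` and the sample values are hypothetical. In the notebook, the probabilities would come from `model.predict(X_test)`.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP/FP/TN/FN, treating label 1 (poisonous) as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 0, 1]
y_prob = [0.97, 0.02, 0.40, 0.60, 0.88]
counts = confusion_counts(y_true, y_prob)
print(counts)
```

A false negative (`fn`) here is a poisonous mushroom predicted as edible, so that count is the one to watch.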


# Ending note

---
Although 99%+ training accuracy could indicate overfitting, evaluation on the held-out test set also shows over 99% accuracy, so the model generalizes to unseen samples from this dataset. Possibly this particular dataset is simple enough even for a very basic dense + dropout classifier.
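
One way to probe whether the dataset is simple is to see how far a single feature gets. The sketch below builds a majority-vote rule on `odor` using only the nine rows visible in the dataset preview; `PREVIEW` and `fit_single_feature_rule` are hypothetical names, and the rule is scored on the same rows it was built from, so the result says nothing about the full dataset.

```python
from collections import Counter, defaultdict

# (odor, class) pairs copied from the nine rows shown in the dataset preview
PREVIEW = [("p", "p"), ("a", "e"), ("l", "e"), ("p", "p"), ("n", "e"),
           ("n", "e"), ("n", "e"), ("n", "e"), ("y", "p")]

def fit_single_feature_rule(pairs):
    """Map each feature value to the most common class observed with it."""
    by_value = defaultdict(Counter)
    for value, label in pairs:
        by_value[value][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}

rule = fit_single_feature_rule(PREVIEW)
hits = sum(rule[value] == label for value, label in PREVIEW)
accuracy = hits / len(PREVIEW)
print(rule, accuracy)
```

Running the same check on the full `df`, with a proper train/test split, would show whether a near-trivial rule already separates the classes.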
