**PREDICTING WHETHER A MUSHROOM IS SAFE TO EAT OR NOT.**

classes: edible=e, poisonous=p

cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s

cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

bruises: bruises=t,no=f

odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

gill-attachment: attached=a,descending=d,free=f,notched=n

gill-spacing: close=c,crowded=w,distant=d

gill-size: broad=b,narrow=n

gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

stalk-shape: enlarging=e,tapering=t

stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

veil-type: partial=p,universal=u

veil-color: brown=n,orange=o,white=w,yellow=y

ring-number: none=n,one=o,two=t

ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler


from sklearn.neural_network import MLPClassifier

# to calculate the performances of the models 
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score # score evaluation
from sklearn.model_selection import cross_val_predict # prediction

In [None]:
df = pd.read_csv('mushrooms.csv')
df.info()

In [None]:
df.head()

Although there are no null values, every column has dtype object. The output of head() shows that the values are single-letter codes, so we need to encode them as numbers before a model can use them.

In [None]:
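# Confirm that no column contains NaN values.
# Note: isnull() only detects NaN; stalk-root marks missing values with '?' (see the attribute list), which is not counted here.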
df.isnull().sum()

In [None]:
# We want to predict whether a mushroom is poisonous, so we start by
# visualizing how many samples of each class the data contains.
class_counts = df['class'].value_counts()

plt.figure(figsize=(6.5, 4))
plt.bar(class_counts.index, class_counts.values, color=['lightblue', 'lightgreen'])
plt.show()

In [None]:
df_encoded = df.copy()

Le = LabelEncoder()

# encode every column (features and target) as integer codes, refitting the encoder per column
for features in df.columns:
    df_encoded[features] = Le.fit_transform(df_encoded[features])

df_encoded.head()

class: poisonous = 1, edible = 0
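
The mapping above follows from how LabelEncoder assigns codes: it sorts the distinct labels and uses each label's position as its integer. A minimal sketch on a toy column (the values below are illustrative and not read from mushrooms.csv) shows how the mapping can be checked rather than assumed:

```python
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for df['class'] (illustrative values only)
toy_class = ['p', 'e', 'e', 'p', 'e']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_class)

# classes_ holds the sorted labels; each label's index is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)            # {'e': 0, 'p': 1}
print(encoded.tolist())   # [1, 0, 0, 1, 0]
```

The same check applies to any feature column. Keep in mind that the integer codes impose an arbitrary alphabetical order on categories that have no natural order.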

Now that all the data is encoded, we can do some exploratory analysis.

In [None]:


plt.figure(figsize=(12,10))
sns.heatmap(df_encoded.corr(), cmap='Greens')

In [None]:
# Drop veil-type: it has the same value in every row, so it carries no information

df_encoded.drop(['veil-type'], axis = 1, inplace = True)
df_encoded.head()
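
A constant column such as veil-type can also be found programmatically instead of by inspection. The sketch below uses a small made-up frame (the column names are borrowed from the dataset, the values are illustrative) and flags every column with exactly one distinct value:

```python
import pandas as pd

# Toy frame; 'veil-type' mimics a column with a single repeated value
toy = pd.DataFrame({
    'veil-type': ['p', 'p', 'p', 'p'],
    'odor': ['a', 'n', 'f', 'n'],
    'bruises': ['t', 'f', 't', 'f'],
})

# nunique() counts distinct values; a count of 1 means the column is constant
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)  # ['veil-type']

toy_reduced = toy.drop(columns=constant_cols)
```

A constant column has zero variance, so its correlation with every other column is undefined (NaN), which is another reason to remove it before plotting the correlation heatmap.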

In [None]:
len(df_encoded.columns)

In [None]:

plt.figure(figsize=(12,10))
sns.heatmap(df_encoded.corr(), cmap='Greens')

Now we can build a model to fit our data.<br>
First we split the data into a training set and a test set.
We will use a neural network with the logistic (sigmoid) function as its activation. Because no solver is specified, MLPClassifier uses its default, 'adam', which is a stochastic gradient-based optimizer.
As before, poisonous = 1 and edible = 0.
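
For reference, the logistic activation mentioned above squashes any real input into the range (0, 1). The helper below is a standalone illustration written for this note (the name logistic is ours); it is not how scikit-learn implements the activation internally:

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs map close to 0, large positive inputs close to 1
for z in (-4.0, 0.0, 4.0):
    print(z, round(logistic(z), 3))
# -4.0 0.018
# 0.0 0.5
# 4.0 0.982
```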


# Baseline Implementation

## Neural Network

In [None]:

# Separate the features from the target column
X = df_encoded.drop(columns='class')
Y = df_encoded['class']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20)
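
The split above is random and unseeded, so each run produces a different test set. If reproducible results are wanted, one option is to fix random_state and stratify on the target. The sketch below uses toy data, and the seed value 42 is an arbitrary choice made for this example:

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 5 per class (illustrative only)
X_toy = [[i] for i in range(10)]
y_toy = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.20,
    stratify=y_toy,    # keep the class ratio the same in both subsets
    random_state=42,   # hypothetical seed; any fixed integer works
)
print(len(X_tr), len(X_te))  # 8 2
```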

In [None]:
clf = MLPClassifier(hidden_layer_sizes=(21,21,21),activation='logistic',early_stopping = True)

In [None]:
clf.fit(X_train,y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print(confusion_matrix(y_test,y_pred))

In [None]:

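plt.figure(figsize = (8,6))
# fmt='d' prints the counts as whole numbers instead of scientific notation
heatmap = sns.heatmap(confusion_matrix(y_test,y_pred), annot=True, fmt='d')
# Workaround for matplotlib 3.1.1, which cut off the top and bottom rows of heatmaps
bottom, top = heatmap.get_ylim()
heatmap.set_ylim(bottom + 0.5, top - 0.5)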

The confusion matrix above shows that, of the 1625 test rows, the model correctly predicted whether the mushroom was poisonous for 1622 and misclassified the remaining 3. Because the train/test split is not seeded, the exact counts vary from run to run.
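
The counts quoted above come from the diagonal of the confusion matrix: scikit-learn puts true classes in rows and predicted classes in columns, so the diagonal holds the correct predictions. A small sketch with a made-up 2x2 matrix (the helper name and the toy counts are hypothetical) shows the arithmetic:

```python
def accuracy_from_confusion(cm):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Toy matrix: rows are true classes, columns are predicted classes
toy_cm = [[3, 1],
          [0, 4]]
print(accuracy_from_confusion(toy_cm))  # 0.875

# The figures reported in the text: 1622 correct out of 1625
print(round(1622 / 1625, 4))  # 0.9982
```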

In [None]:
print(classification_report(y_test,y_pred))

The classification report shows a very high f1-score. The f1-score is the harmonic mean of precision and recall, so a high value means the model is both rarely wrong when it predicts a class and rarely misses members of that class.
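
To make the f1-score concrete, it can be computed by hand from the counts of true positives, false positives, and false negatives. The function name and the counts below are made up for illustration:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # share of positive predictions that are right
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Toy counts: precision = 0.8, recall = 1.0
print(round(f1_from_counts(tp=4, fp=1, fn=0), 3))  # 0.889
```

For this task, recall on the poisonous class is worth reading separately in the report, since a missed poisonous mushroom is the costlier mistake.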

In [None]:
print(accuracy_score(y_test,y_pred))

Overall the model is quite accurate, with an accuracy score of 99.8%.

# Implementation with hyperparameter tuning

The accuracy above was achieved with MLPClassifier's default regularization strength (alpha=0.0001) and unscaled features. We will increase alpha to 0.002 and standardize the features in an attempt to improve the model's learning.

## Neural Network

In [None]:
reg_clf = MLPClassifier(hidden_layer_sizes=(21,21,21),activation='logistic',early_stopping = True,alpha=0.002)

In [None]:
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)

In [None]:
# Now apply the transformations to the data:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#print(X_train,X_test)

In [None]:
reg_clf.fit(X_train,y_train)

In [None]:
reg_y_pred = reg_clf.predict(X_test)

In [None]:
print(confusion_matrix(y_test,reg_y_pred))

In [None]:
print(classification_report(y_test,reg_y_pred))

In [None]:
print(accuracy_score(y_test,reg_y_pred))
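
The notebook imports cross_val_score but evaluates on a single split. One way to get a steadier estimate is to wrap the scaler and the classifier in a pipeline and cross-validate it, so the scaler is refit on each training fold only. The sketch below is an assumption about how this could be done, not part of the original analysis: it uses synthetic data from make_classification in place of the mushroom features, and the seed values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=21, random_state=0)

# Scaling happens inside the pipeline, so each fold is scaled using
# statistics from its own training portion only
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(21, 21, 21), activation='logistic',
                  early_stopping=True, alpha=0.002, random_state=0),
)

# 5-fold cross-validation; returns one accuracy score per fold
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean())
```

With the real data, X_demo and y_demo would be replaced by the unscaled X and Y defined earlier.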