# Mushroom Dataset (Part 2)

<b>Goal:</b> Predict whether a mushroom is edible or not

Source: https://archive.ics.uci.edu/ml/datasets/Mushroom

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

From the Audubon Society Field Guide: mushrooms are described in terms of physical characteristics and classified as poisonous or edible.

    cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
    cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
    cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
    bruises?: bruises=t,no=f
    odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
    gill-attachment: attached=a,descending=d,free=f,notched=n
    gill-spacing: close=c,crowded=w,distant=d
    gill-size: broad=b,narrow=n
    gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
    stalk-shape: enlarging=e,tapering=t
    stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
    stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
    stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
    veil-type: partial=p,universal=u
    veil-color: brown=n,orange=o,white=w,yellow=y
    ring-number: none=n,one=o,two=t
    ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
    spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
    population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
    habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

## Clean and preprocess data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import cm as cm
import random
import sys
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error

In [2]:
columns = ["edible", "cap-shape", "cap-surface", "cap-color", "bruises?",
            "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color",
            "stalk-shape", "stalk-root", "stalk-surface-above-ring",
            "stalk-surface-below-ring", "stalk-color-above-ring",
            "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
            "ring-type", "spore-print-color", "population", "habitat"
            ]

data_mush = pd.read_csv('agaricus-lepiota.data.csv', names=columns, index_col=None)

In [3]:
data_mush.head()

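| | edible | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |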


In [4]:
data_mush.describe()

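| | edible | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |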


In [5]:
# '?' marks a missing value in stalk-root; replace it with NaN
data_mush['stalk-root'] = data_mush['stalk-root'].replace('?', np.nan)

n_missing = data_mush['stalk-root'].isnull().sum()
print('There are a total of ', n_missing, "out of ", len(data_mush['stalk-root']), " missing")

There are a total of  2480 out of  8124  missing
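
As an alternative to replacing `?` after loading, pandas can treat it as missing while reading the file. The sketch below uses a tiny in-memory CSV as a stand-in for `agaricus-lepiota.data.csv`; the three column names and the sample rows are assumptions chosen for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt; '?' is the missing-value marker used in the UCI file.
csv_text = "p,x,?\ne,b,c\ne,x,?\np,f,b\n"

toy = pd.read_csv(
    io.StringIO(csv_text),
    names=["edible", "cap-shape", "stalk-root"],
    na_values="?",  # parse '?' as NaN at load time
)

# Share of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

Reporting the share per column makes it easy to see at a glance which columns are candidates for dropping.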


In [6]:
# Drop stalk-root because roughly 30% of its values are missing (2480 of 8124)
data_mush = data_mush.drop(['stalk-root'],axis=1)
None

In [7]:
# Separate the feature variables X and the target variable y (edible):
X = data_mush.drop(['edible'], axis=1)
y = data_mush['edible']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

In [9]:
from sklearn.pipeline import Pipeline

ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
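
To illustrate what `handle_unknown='ignore'` does inside the pipeline, here is a minimal sketch on toy data (the `odor` values are made up for the example). A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'm' appears only in the test frame.
train = pd.DataFrame({"odor": ["a", "n", "f"]})
test = pd.DataFrame({"odor": ["n", "m"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# Learned categories are sorted: a, f, n.
encoded = enc.transform(test).toarray()
print(enc.categories_)
print(encoded)
```

Because the encoder is fitted only on the training split when it sits inside a `Pipeline`, this setting keeps `predict` from failing if the test split contains a rare category.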


In [10]:
# ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
# X_train_oh = ohe.fit_transform(X_train)
# X_test_oh = ohe.transform(X_test)
# feature_names = ohe.get_feature_names(X_train.columns)

In [11]:
#y values using label encoder to convert to binary.
#le = preprocessing.LabelEncoder()


#convert dependent variable to binary 
#y_train = le.fit_transform( train.edible.values )

#y_test = le.fit_transform( test.edible.values )

In [12]:
X_train

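| | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6025 | x | s | n | f | s | f | c | n | b | t | ... | k | w | w | p | w | o | e | w | v | p |
| 1594 | x | f | g | f | n | f | w | b | n | t | ... | f | w | w | p | w | o | e | k | s | g |
| 6479 | x | s | n | f | s | f | c | n | b | t | ... | k | p | p | p | w | o | e | w | v | p |
| 2127 | x | y | g | t | n | f | c | b | n | t | ... | s | w | p | p | w | o | p | k | v | d |
| 6330 | k | y | e | f | f | f | c | n | b | t | ... | k | p | p | p | w | o | e | w | v | d |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7409 | k | y | e | f | f | f | c | n | b | t | ... | k | p | p | p | w | o | e | w | v | l |
| 3325 | f | y | e | t | n | f | c | b | u | t | ... | s | g | g | p | w | o | p | n | v | d |
| 1414 | f | f | w | f | n | f | w | b | k | t | ... | s | w | w | p | w | o | e | n | a | g |
| 5787 | f | y | e | t | n | f | c | b | w | e | ... | s | e | e | p | w | t | e | w | c | w |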


In [13]:
y[:10]

0    p
1    e
2    e
3    p
4    e
5    e
6    e
7    e
8    p
9    e
Name: edible, dtype: object

## Running models

## KNN

In [14]:
from sklearn.neighbors import KNeighborsClassifier

print("-------------------------------------------------------------")

#Call model into variable
knn = KNeighborsClassifier(n_neighbors=5)
knn = Pipeline([('one_hot_enc', ohe), ('estimator', knn)])

#Train model
knn.fit(X_train,y_train)
print(knn)

#KNN predict
knnpredictions = knn.predict(X_test)

print("-------------------------------------------------------------")
#Confusion Matrix
knn_cm = confusion_matrix(y_test,knnpredictions)
print(knn_cm)
print("-------------------------------------------------------------")
#Classification Report
print(classification_report(y_test,knnpredictions))
print("-------------------------------------------------------------")
#Score (Accuracy)
print(knn.score(X_test,y_test))
print("-------------------------------------------------------------")

-------------------------------------------------------------
Pipeline(memory=None,
         steps=[('one_hot_enc',
                 OneHotEncoder(categorical_features=None, categories=None,
                               drop=None, dtype=<class 'numpy.float64'>,
                               handle_unknown='ignore', n_values=None,
                               sparse=True)),
                ('estimator',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)
-------------------------------------------------------------
[[1023    0]
 [   0 1008]]
-------------------------------------------------------------
              precision    recall  f1-score   support

           e       1.00      1.00      1.00      1023
           p       1.00   

In [15]:
pd.Series(y_test).value_counts()/len(y_test)

e    0.503693
p    0.496307
Name: edible, dtype: float64

The classes are balanced, so class imbalance cannot be the reason why the accuracy is so high.

### Is KNN perfect for this problem, or is there 'leakage'?
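
One simple way to probe the leakage question is to check whether identical feature rows occur in both the training and the test split, since a nearest-neighbour model would classify such rows trivially. The sketch below runs on toy frames; `X_train_toy` and `X_test_toy` are hypothetical stand-ins for the notebook's `X_train` and `X_test`.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and X_test.
X_train_toy = pd.DataFrame({"odor": ["a", "f", "n"], "habitat": ["g", "u", "d"]})
X_test_toy = pd.DataFrame({"odor": ["f", "l"], "habitat": ["u", "m"]})

# Inner join on all columns keeps only test rows that also exist in the training split.
overlap = X_test_toy.merge(
    X_train_toy.drop_duplicates(),
    how="inner",
    on=list(X_test_toy.columns),
)
n_shared = len(overlap)
print(n_shared, "of", len(X_test_toy), "test rows also appear in the training split")
```

A count of zero would not rule out every form of leakage, but a large count would suggest that the perfect score partly reflects memorised rows.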

## Naive Bayes

In [16]:
print("-------------------------------------------------------------")
from sklearn.naive_bayes import BernoulliNB
#Call model into variable
nb = BernoulliNB()
nb = Pipeline([('one_hot_enc', ohe), ('estimator', nb)])

#Train model
nb.fit(X_train,y_train)
print(nb)

#Bayes predict
nbpredictions = nb.predict(X_test)

print("-------------------------------------------------------------")
#Confusion Matrix
nb_cm = confusion_matrix(y_test,nbpredictions)
print(nb_cm)
print("-------------------------------------------------------------")
#Classification Report
print(classification_report(y_test,nbpredictions))
print("-------------------------------------------------------------")
print(nb.score(X_test,y_test))

-------------------------------------------------------------
Pipeline(memory=None,
         steps=[('one_hot_enc',
                 OneHotEncoder(categorical_features=None, categories=None,
                               drop=None, dtype=<class 'numpy.float64'>,
                               handle_unknown='ignore', n_values=None,
                               sparse=True)),
                ('estimator',
                 BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None,
                             fit_prior=True))],
         verbose=False)
-------------------------------------------------------------
[[1009   14]
 [ 101  907]]
-------------------------------------------------------------
              precision    recall  f1-score   support

           e       0.91      0.99      0.95      1023
           p       0.98      0.90      0.94      1008

    accuracy                           0.94      2031
   macro avg       0.95      0.94      0.94      2031
weighted avg       0.9

NB is worse than KNN, but we can get coefficients to understand the data better.

In [17]:
# Feature log-probabilities for class 'p' (older scikit-learn exposed these as coef_[0])
nb_log_prob = nb.named_steps['estimator'].feature_log_prob_[1]
df_coeff = pd.DataFrame({'feature': nb.named_steps['one_hot_enc'].get_feature_names_out(X_train.columns),
              'coefficient': nb_log_prob, 'abs_coefficient': np.abs(nb_log_prob)})

In [18]:
df_coeff.sort_values(by='abs_coefficient',ascending=False).head(30)

| | feature | coefficient | abs_coefficient |
|---|---|---|---|
| 111 | habitat_w | -7.975908 | 7.975908 |
| 16 | cap-color_r | -7.975908 | 7.975908 |
| 61 | stalk-color-above-ring_e | -7.975908 | 7.975908 |
| 62 | stalk-color-above-ring_g | -7.975908 | 7.975908 |
| 64 | stalk-color-above-ring_o | -7.975908 | 7.975908 |
| 25 | odor_l | -7.975908 | 7.975908 |
| 70 | stalk-color-below-ring_e | -7.975908 | 7.975908 |
| 71 | stalk-color-below-ring_g | -7.975908 | 7.975908 |
| 22 | odor_a | -7.975908 | 7.975908 |
| 73 | stalk-color-below-ring_o | -7.975908 | 7.975908 |


Naive Bayes works reasonably well, but many features share exactly the same coefficient.
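
The repeated value -7.975908 has a likely explanation. Assuming the coefficients are `BernoulliNB`'s smoothed log-probabilities of each feature given class `p`, every feature that never occurs with a poisonous training sample gets the same floor value, log(alpha / (N_p + 2 * alpha)). The sketch below redoes that arithmetic with the counts reported in this notebook (3916 poisonous samples overall, 1008 of them in the test split) and the default `alpha=1.0`.

```python
import math

n_poisonous_total = 3916  # (y == 'p').sum() over the full data set
n_poisonous_test = 1008   # support of class 'p' in the test split
n_poisonous_train = n_poisonous_total - n_poisonous_test

alpha = 1.0  # BernoulliNB default (Laplace smoothing)

# Smoothed log-probability of a feature with zero occurrences in class 'p'.
log_prob_unseen = math.log((0 + alpha) / (n_poisonous_train + 2 * alpha))
print(round(log_prob_unseen, 4))
```

Under that assumption, the tied rows simply list one-hot features such as `odor_a` and `odor_l` that were never observed together with a poisonous mushroom in the training split.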

## SVM

In [19]:
# Support Vector Machine
from sklearn.svm import LinearSVC

#Call model into variable
svm = LinearSVC()
svm = Pipeline([('one_hot_enc', ohe), ('estimator', svm)])

print("-------------------------------------------------------------")

#Train model
svm.fit(X_train,y_train)
print(svm)

#SVM predict
svmpredictions = svm.predict(X_test)

print("-------------------------------------------------------------")
svm_cm = confusion_matrix(y_test,svmpredictions)
print(svm_cm)
print("-------------------------------------------------------------")
print(classification_report(y_test,svmpredictions))
print("-------------------------------------------------------------")
print(svm.score(X_test,y_test))

-------------------------------------------------------------
Pipeline(memory=None,
         steps=[('one_hot_enc',
                 OneHotEncoder(categorical_features=None, categories=None,
                               drop=None, dtype=<class 'numpy.float64'>,
                               handle_unknown='ignore', n_values=None,
                               sparse=True)),
                ('estimator',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
-------------------------------------------------------------
[[1023    0]
 [   0 1008]]
-------------------------------------------------------------
              precision    recall  f1-score   support

           e       1.00      1.

SVM is also perfect and we can also get coefficients

In [20]:
# One row per one-hot feature with its LinearSVC weight
svm_coef = svm.named_steps['estimator'].coef_[0]
df_coeff = pd.DataFrame({'feature': svm.named_steps['one_hot_enc'].get_feature_names_out(X_train.columns),
              'coefficient': svm_coef, 'abs_coefficient': np.abs(svm_coef)})

In [21]:
df_coeff.sort_values(by='abs_coefficient',ascending=False).head(10)

| | feature | coefficient | abs_coefficient |
|---|---|---|---|
| 95 | spore-print-color_r | 1.68805 | 1.68805 |
| 23 | odor_c | 1.33972 | 1.33972 |
| 25 | odor_l | -1.140768 | 1.140768 |
| 22 | odor_a | -1.140462 | 1.140462 |
| 27 | odor_n | -0.839024 | 0.839024 |
| 24 | odor_f | 0.789267 | 0.789267 |
| 52 | stalk-surface-above-ring_k | 0.658581 | 0.658581 |
| 86 | ring-type_f | -0.637131 | 0.637131 |
| 111 | habitat_w | -0.610314 | 0.610314 |
| 93 | spore-print-color_n | -0.57369 | 0.57369 |
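
When reading the signs, it helps to know that scikit-learn sorts class labels, so for a binary `LinearSVC` a positive weight pushes the prediction towards `classes_[1]`, which is `p` here. The toy example below (made-up two-column one-hot data, not the mushroom set) is a minimal sketch of that convention.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy one-hot matrix: column 0 fires only for poisonous rows, column 1 only for edible rows.
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_toy = np.array(["p", "p", "e", "e"])

clf = LinearSVC(random_state=0).fit(X_toy, y_toy)

# classes_ is sorted, so classes_[1] == 'p' and positive weights vote for 'p'.
print(clf.classes_)
print(clf.coef_[0])
```

This matches the counts that follow: `spore-print-color_r` and `odor_c` (positive weights) occur only with poisonous samples, while `odor_l` and `odor_a` (negative weights) occur only with edible ones.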


In [22]:
(y == 'p').sum()

3916

In [23]:
(X['spore-print-color'] == 'r').sum()

72

In [24]:
((X['spore-print-color'] == 'r') & (y == 'p')).sum()

72

In [25]:
(X['odor'] == 'c').sum()

192

In [26]:
((X['odor'] == 'c') & (y == 'p')).sum()

192

In [27]:
(X['odor'] == 'l').sum()

400

In [28]:
((X['odor'] == 'l') & (y == 'p')).sum()

0

In [29]:
(X['odor'] == 'a').sum()

400

In [30]:
((X['odor'] == 'a') & (y == 'p')).sum()

0

In [31]:
(X['odor'] == 'n').sum()

3528

In [32]:
((X['odor'] == 'n') & (y == 'p')).sum()

120

In [33]:
(X['odor'] == 'f').sum()

2160

In [34]:
((X['odor'] == 'f') & (y == 'p')).sum()

2160

In [35]:
(X['stalk-surface-above-ring'] == 'k').sum()

2372

In [36]:
((X['stalk-surface-above-ring'] == 'k') & (y == 'p')).sum()

2228

In [37]:
(X['ring-type'] == 'f').sum()

48

In [38]:
((X['ring-type'] == 'f') & (y == 'p')).sum()

0

In [39]:
(X['habitat'] == 'w').sum()

192

In [40]:
((X['habitat'] == 'w') & (y == 'p')).sum()

0

In [41]:
(X['spore-print-color'] == 'n').sum()

1968

In [42]:
((X['spore-print-color'] == 'n') & (y == 'p')).sum()

224

# Looks like certain features are indicative of the mushroom being poisonous
That is also why experts can tell you whether a mushroom is poisonous with absolute confidence.
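
The pairs of counting cells above can be condensed into a single cross-tabulation per feature. The sketch below uses a small made-up frame rather than `data_mush`; the column names mirror the notebook, and the rows are illustrative only.

```python
import pandas as pd

# Made-up sample standing in for data_mush[['odor', 'edible']].
toy = pd.DataFrame({
    "odor":   ["f", "f", "a", "n", "n", "n"],
    "edible": ["p", "p", "e", "e", "e", "p"],
})

# Rows: feature values, columns: class labels, cells: counts.
counts = pd.crosstab(toy["odor"], toy["edible"])

# Share of poisonous samples for each feature value.
counts["share_p"] = counts["p"] / counts.sum(axis=1)
print(counts)
```

Values with a share of exactly 0 or 1 are the ones that separate the classes on their own, which is what the individual checks above were looking for.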