Case Study on Probability for Data Science
Problem Statement:
Build a suitable machine learning model to predict whether a mushroom is
edible or poisonous (e or p) using the given dataset.
Along with other ML algorithms, a Naïve Bayes classifier should be applied.
Any data pre-processing that turns out to be necessary should be done as well.



Attribute Information:
• cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
• cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
• cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u,
red=e, white=w, yellow=y
• bruises: bruises=t, no=f
• odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n,
pungent=p, spicy=s
• gill-attachment: attached=a, descending=d, free=f, notched=n
• gill-spacing: close=c, crowded=w, distant=d
• gill-size: broad=b, narrow=n
• gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o,
pink=p, purple=u, red=e, white=w, yellow=y
• stalk-shape: enlarging=e, tapering=t
• stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r,
missing=?
• stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
• stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
• stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o,
pink=p, red=e, white=w, yellow=y
• stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o,
pink=p, red=e, white=w, yellow=y
• veil-type: partial=p, universal=u
• veil-color: brown=n, orange=o, white=w, yellow=y
• ring-number: none=n, one=o, two=t
• ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p,
sheathing=s, zone=z
• spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o,
purple=u, white=w, yellow=y
• population: abundant=a, clustered=c, numerous=n, scattered=s, several=v,
solitary=y
• habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w,
woods=d

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import MinMaxScaler

In [2]:
#Reading the data
data= pd.read_csv('C:/Users/Stevelal/Downloads/mushrooms.csv')

In [3]:
data

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


In [4]:
# Checking for null values
data.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
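
Note that the attribute list marks missing stalk-root values with the character "?", which isnull() does not count because "?" is an ordinary string. The following is a minimal sketch of how such placeholders could be counted. The small frame `toy` and its values are made up for illustration and are not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the mushroom data (values are illustrative only)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "b", "?"],
})

# isnull() reports no missing values because "?" is a regular string
nan_counts = toy.isnull().sum()

# Count the "?" placeholder explicitly, column by column
placeholder_counts = (toy == "?").sum()

# Optionally convert the placeholder to NaN so pandas treats it as missing
toy_nan = toy.replace("?", np.nan)

print(placeholder_counts)
print(toy_nan.isnull().sum())
```

Whether to keep "?" as its own category, impute it, or drop the column is a modelling choice; the sketch only shows how to detect it.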

In [5]:
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [6]:
data.columns

Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],
      dtype='object')

In [7]:
# Label-encoding the variables (every column, including the target, is categorical)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Refit the encoder on each column in turn, replacing the letters with integer codes
for col in data.columns:
    data[col] = le.fit_transform(data[col])
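
Label encoding assigns integer codes in alphabetical order, so the codes carry an ordering that the original categories do not have. Models such as logistic regression may treat that ordering as meaningful. One common alternative is one-hot encoding; the sketch below uses `pd.get_dummies` on a made-up frame (`toy`, `x_toy` and `y_toy` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Toy frame with the same kind of single-letter codes as the dataset
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "habitat": ["u", "g", "m", "u"],
})

# Target: map the two class letters to 0/1 explicitly (e -> 0, p -> 1),
# which matches what LabelEncoder produces for this column
y_toy = toy["class"].map({"e": 0, "p": 1})

# Features: one indicator column per (feature, category) pair
x_toy = pd.get_dummies(toy.drop(columns=["class"]))

print(x_toy.columns.tolist())
```

The results reported in this notebook were obtained with label encoding, so scores from a one-hot encoded version would not be directly comparable.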

In [8]:
data

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,3,2,4,0,5,0,0,0,11,...,2,5,5,0,1,1,4,0,1,2
8120,0,5,2,4,0,5,0,0,0,11,...,2,5,5,0,0,1,4,0,4,2
8121,0,2,2,4,0,5,0,0,0,5,...,2,5,5,0,1,1,4,0,1,2
8122,1,3,3,4,0,8,1,0,1,0,...,1,7,7,0,2,1,0,7,4,2


In [9]:
# Importing train_test_split and splitting the data (80% train, 20% test)
from sklearn.model_selection import train_test_split
y = data['class']
x = data.drop(['class'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=30, test_size=0.2)
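
The split above is purely random. If the class proportions should be identical in the training and test sets, `train_test_split` accepts a `stratify` argument. This is a sketch on synthetic labels (`x_toy` and `y_toy` are made up), not a change to the split used for the results below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 40 of them in class 1
x_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the 40/60 class ratio in both parts of the split
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=30, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```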

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
classifier = BernoulliNB()
Logit_model = LogisticRegression()
dt_model = DecisionTreeClassifier()

In [11]:
# Checking with Logistic regression model
Logit_model.fit(x_train, y_train)
y_pred2= Logit_model.predict(x_test)
print('Accuracy Score is:', accuracy_score(y_test, y_pred2))
print('Recall Score is:', recall_score(y_test, y_pred2))
print('Precision Score:', precision_score(y_test, y_pred2))
print('F1 score is:', f1_score(y_test, y_pred2))

Accuracy Score is: 0.955076923076923
Recall Score is: 0.9463806970509383
Precision Score: 0.9553450608930988
F1 score is: 0.9508417508417508


In [12]:
cm= confusion_matrix(y_test, y_pred2)
cm

array([[846,  33],
       [ 40, 706]], dtype=int64)
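
The four scores printed for logistic regression can be recomputed from this confusion matrix. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, and class 1 is "p" (poisonous) because the encoder sorts the labels alphabetically. The sketch below is only a worked check of the arithmetic.

```python
import numpy as np

# Confusion matrix reported above for logistic regression
# (rows = true class, columns = predicted class; class 1 is "p")
cm = np.array([[846, 33],
               [40, 706]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all test samples predicted correctly
recall = tp / (tp + fn)              # share of poisonous mushrooms that were caught
precision = tp / (tp + fp)           # share of "poisonous" predictions that were right
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, f1)
```

Read this way, the 40 in the lower-left cell counts poisonous mushrooms that were predicted to be edible, which is the costlier kind of error for this problem.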

In [13]:
# Checking with GaussianNB
# (GaussianNB and the metric functions were already imported above)
gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)
y_pred = gnb_model.predict(x_test)
print('Accuracy Score is:', accuracy_score(y_test, y_pred))
print('Recall Score is:', recall_score(y_test, y_pred))
print('Precision Score:', precision_score(y_test, y_pred))
print('F1 score is:', f1_score(y_test, y_pred))

Accuracy Score is: 0.9027692307692308
Recall Score is: 0.9195710455764075
Precision Score: 0.875
F1 score is: 0.8967320261437908


In [14]:
cm= confusion_matrix(y_test, y_pred)
cm

array([[781,  98],
       [ 60, 686]], dtype=int64)

In [15]:
# Checking with BernoulliNB 
from sklearn.naive_bayes import BernoulliNB
classifier= BernoulliNB()
classifier.fit(x_train, y_train)
y_pred1= classifier.predict(x_test)
print('Accuracy Score is:', accuracy_score(y_test, y_pred1))
print('Recall Score is:', recall_score(y_test, y_pred1))
print('Precision Score:', precision_score(y_test, y_pred1))
print('F1 score is:', f1_score(y_test, y_pred1))

Accuracy Score is: 0.8326153846153846
Recall Score is: 0.7265415549597856
Precision Score: 0.8885245901639345
F1 score is: 0.799410029498525


In [16]:
cm= confusion_matrix(y_test, y_pred1)
cm

array([[811,  68],
       [204, 542]], dtype=int64)
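
One possible reason for the weaker BernoulliNB scores is that BernoulliNB models binary features. With its default `binarize=0.0`, every label code greater than 0 is mapped to 1, so most categories of a feature collapse into a single value. scikit-learn also provides `CategoricalNB`, which treats each integer code as its own category. The sketch below illustrates the difference on made-up codes (`X_toy`, `y_toy`); it was not run on the mushroom data, so no scores are claimed for it.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

# Toy label-encoded matrix: two categorical features with codes 0..2
X_toy = np.array([[0, 2], [1, 2], [2, 0], [2, 1], [1, 0], [0, 1]])
y_toy = np.array([0, 0, 1, 1, 1, 0])

# What BernoulliNB effectively sees with binarize=0.0: codes 1 and 2 merge into 1
binarized = (X_toy > 0.0).astype(int)

# BernoulliNB applies that binarization internally before fitting
bern_nb = BernoulliNB().fit(X_toy, y_toy)

# CategoricalNB keeps every integer code as a separate category
cat_nb = CategoricalNB().fit(X_toy, y_toy)

print(binarized)
print(bern_nb.predict(X_toy), cat_nb.predict(X_toy))
```

When using `CategoricalNB` with a train/test split, every category code seen at prediction time must be known to the model; its `min_categories` parameter can be used to declare the number of categories per feature up front.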

In [17]:
# Checking with the Decision tree Classifier model
dt_model.fit(x_train, y_train)
y_pred3= dt_model.predict(x_test)
print('Accuracy Score is:', accuracy_score(y_test, y_pred3))
print('Recall Score is:', recall_score(y_test, y_pred3))
print('Precision Score:', precision_score(y_test, y_pred3))
print('F1 score is:', f1_score(y_test, y_pred3))

Accuracy Score is: 1.0
Recall Score is: 1.0
Precision Score: 1.0
F1 score is: 1.0


In [18]:
cm= confusion_matrix(y_test, y_pred3)
cm

array([[879,   0],
       [  0, 746]], dtype=int64)
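
A perfect score on a single 80/20 split is worth confirming on more than one partition of the data. A simple check is k-fold cross-validation with `cross_val_score`. The sketch below uses synthetic data with a hypothetical labelling rule (`X_toy`, `y_toy`), so its numbers say nothing about the mushroom dataset itself.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features (not the real data)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 4, size=(200, 5))

# Hypothetical rule: the label depends on the first feature only
y_toy = (X_toy[:, 0] >= 2).astype(int)

# 5-fold cross-validation: one accuracy score per held-out fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=5)

print(scores.mean(), scores.std())
```

On the real data the same call would take `x` and `y` from the split cell in place of the toy arrays.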

Among all the machine learning algorithms tried, the Decision Tree Classifier suits the data best, as the accuracy score, recall and 