Objective
====

The goal of this project is to classify a mushroom as either poisonous or edible, based on the features in the given dataset.

In [None]:
# The dataset comes from Kaggle.com

**1. IMPORT NECESSARY LIBRARIES**

In [1]:
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

**2. LOAD DATA**

In [2]:
df1 = pd.read_csv("mushrooms.csv")
df = df1.copy()
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


**3. UNDERSTANDING THE DATA-SET**

Data Summary
--

This dataset contains descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. The samples are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended for eating. This last group is combined with the poisonous class. The guide clearly states that there is no simple rule for determining whether a mushroom is edible. This is unlike poison oak and poison ivy, which have the rule "leaflets three, let it be"; the outdoor activity of 'shrooming' can therefore be a bit more challenging!

**The variables of the dataset are the following:**

**Dependent Variable**

class:
<ul>
<li>edible=e
<li>poisonous=p
</ul>

**Independent Variables**

<ul>
<li>cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
<li>cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
<li>cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
<li>bruises: bruises=t,no=f
<li>odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
<li>gill-attachment: attached=a,descending=d,free=f,notched=n
<li>gill-spacing: close=c,crowded=w,distant=d
<li>gill-size: broad=b,narrow=n
<li>gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
<li>stalk-shape: enlarging=e,tapering=t
<li>stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
<li>stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
<li>stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
<li>stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
<li>stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
<li>veil-type: partial=p,universal=u
<li>veil-color: brown=n,orange=o,white=w,yellow=y
<li>ring-number: none=n,one=o,two=t
<li>ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
<li>spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
<li>population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
<li>habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
</ul>
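The single-letter codes listed above can be turned back into readable labels with a lookup per column. The following is a minimal sketch on a toy frame rather than the real file; the names `sample`, `code_labels`, and `decoded` are illustrative assumptions, and only two columns are mapped.

```python
import pandas as pd

# Toy stand-in for a few rows of the mushroom data (not the real file).
sample = pd.DataFrame({
    "class": ["p", "e", "e"],
    "bruises": ["t", "t", "f"],
})

# Code-to-label lookups copied from the variable list above.
code_labels = {
    "class": {"e": "edible", "p": "poisonous"},
    "bruises": {"t": "bruises", "f": "no"},
}

# Replace each code with its label; columns without a lookup stay untouched.
decoded = sample.copy()
for column, labels in code_labels.items():
    decoded[column] = decoded[column].map(labels)

print(decoded)
```

The same pattern should extend to the remaining columns by adding their code lists to the lookup dictionary.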

**Exploring the data**

In [3]:
df.shape

(8124, 23)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [5]:
df.columns

Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],
      dtype='object')

In [6]:
df.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


*Checking for missing values*

In [7]:
df.isnull().values.any()

False

In [None]:
df.isnull().sum()
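Note that the variable list gives `missing=?` for stalk-root, so missing entries are stored as the string "?" and `isnull()` will not report them. The sketch below shows one way to count that placeholder; it uses a small toy frame (`toy` is an assumed name, not the real data) so the effect is easy to see.

```python
import pandas as pd

# Toy stand-in: "?" plays the role of the missing-value code for stalk-root.
toy = pd.DataFrame({
    "stalk-root": ["e", "?", "b", "?"],
    "odor": ["n", "a", "f", "n"],
})

# isnull() sees no gaps because "?" is an ordinary string.
print(toy.isnull().sum())

# Count the "?" placeholder per column instead.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Applying `(df == "?").sum()` to the full dataset would show how many rows are affected. Another option, if preferred, is to pass `na_values="?"` to `pd.read_csv` so that the placeholder is loaded as NaN.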

**4. EXPLORATORY DATA ANALYSIS ON CATEGORICAL DATA**

*It is important to know which variables in the dataset are categorical.*

In [None]:
df.dtypes

In [None]:
cat_cols = [col for col in df.columns if df[col].dtype == 'O']
print('Number of Categorical Variables : ', len(cat_cols))

In [None]:
cat_cols

*Here are some examples of the categorical features in the dataset:*

<ul>
<li>bruises
</ul>

In [None]:
df["bruises"].value_counts()

<ul>
<li>cap-shape
</ul>

In [None]:
df["cap-shape"].value_counts()

<ul>
<li>cap-color
</ul>

In [None]:
df["cap-color"].value_counts()

<ul>
<li>cap-surface
</ul>

In [None]:
df["cap-surface"].value_counts()

<ul>
<li>gill-size
</ul>

In [None]:
df["gill-size"].value_counts()

<ul>
<li>veil-type
</ul>

In [None]:
df["veil-type"].value_counts()

The variable "veil-type" takes a single value (p, partial) in every row, so it carries no information about the class. We recommend dropping it for this exercise.
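A column with only one distinct value can be found and dropped programmatically. This is a minimal sketch on a toy frame; `toy`, `constant_cols`, and `reduced` are assumed names, and the real notebook would apply the same calls to `df`.

```python
import pandas as pd

# Toy stand-in: "veil-type" is constant, like in the real dataset.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["p", "a", "l", "p"],
})

# Columns with at most one distinct value cannot help separate the classes.
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

# Drop them and keep the rest of the frame unchanged.
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(reduced.columns.tolist())
```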

In [None]:
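# Every column holds string codes, so encode each one as integers before calling corr()
df.apply(lambda col: col.astype("category").cat.codes).corr()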

In [None]:
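# Make a heatmap of the correlations between the integer-encoded columns
encoded = df.apply(lambda col: col.astype("category").cat.codes)
plt.figure(figsize=(40, 20))
sns.heatmap(encoded.corr(), annot=True)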