# Machine Learning on Mushroom Toxicity

### Description of Dataset

This dataset contains 61,069 hypothetical mushrooms with caps, based on 173 species (353 mushrooms per species). Each mushroom is labeled as definitely edible, definitely poisonous, or of unknown edibility and not recommended; the last category was merged into the poisonous class.

### Variables

One binary target class: edible=e or poisonous=p (the latter also contains mushrooms of unknown edibility).

The twenty remaining variables (n: nominal, m: metrical):

1. cap-diameter (m): float number in cm
2. cap-shape (n): bell=b, conical=c, convex=x, flat=f, sunken=s, spherical=p, others=o
3. cap-surface (n): fibrous=i, grooves=g, scaly=y, smooth=s, shiny=h, leathery=l, silky=k, sticky=t, wrinkled=w, fleshy=e
4. cap-color (n): brown=n, buff=b, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, blue=l, orange=o, black=k
5. does-bruise-or-bleed (n): bruises-or-bleeding=t, no=f
6. gill-attachment (n): adnate=a, adnexed=x, decurrent=d, free=e, sinuate=s, pores=p, none=f, unknown=?
7. gill-spacing (n): close=c, distant=d, none=f
8. gill-color (n): see cap-color + none=f
9. stem-height (m): float number in cm
10. stem-width (m): float number in mm
11. stem-root (n): bulbous=b, swollen=s, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r
12. stem-surface (n): see cap-surface + none=f
13. stem-color (n): see cap-color + none=f
14. veil-type (n): partial=p, universal=u
15. veil-color (n): see cap-color + none=f
16. has-ring (n): ring=t, none=f
17. ring-type (n): cobwebby=c, evanescent=e, flaring=r, grooved=g, large=l, pendant=p, sheathing=s, zone=z, scaly=y, movable=m, none=f, unknown=?
18. spore-print-color (n): see cap-color
19. habitat (n): grasses=g, leaves=l, meadows=m, paths=p, heaths=h, urban=u, waste=w, woods=d
20. season (n): spring=s, summer=u, autumn=a, winter=w
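
Since every nominal column stores single-letter codes, a small lookup table can make printed rows easier to read. The following is a minimal sketch using the cap-shape and season codes from the list above; the names `CAP_SHAPE`, `SEASON`, and `decode` are illustrative and not part of the dataset or the notebook.

```python
# Code-to-label lookups copied from the variable list above.
CAP_SHAPE = {"b": "bell", "c": "conical", "x": "convex", "f": "flat",
             "s": "sunken", "p": "spherical", "o": "others"}
SEASON = {"s": "spring", "u": "summer", "a": "autumn", "w": "winter"}


def decode(code, table):
    """Return the readable label for a code, or the code itself if it is unlisted."""
    return table.get(code, code)


print(decode("x", CAP_SHAPE))  # convex
print(decode("u", SEASON))     # summer
```

Falling back to the raw code keeps unexpected values visible instead of raising an error.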

### Objectives

Identify which features are most strongly associated with poisonous and with edible mushrooms.
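
One simple way to look at such associations for a nominal feature is a normalized cross-tabulation against the class. The sketch below assumes a tiny hand-made frame named `toy`; its rows are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy rows for illustration only; these are not taken from the real dataset.
toy = pd.DataFrame({
    "class":   ["p", "p", "e", "e", "p", "e"],
    "habitat": ["d", "d", "g", "g", "g", "d"],
})

# Share of each class within every habitat value (each row sums to 1).
rates = pd.crosstab(toy["habitat"], toy["class"], normalize="index")
print(rates)
```

Applied to the real frame, the same call could be repeated per nominal column to see which values skew toward `p` or `e`.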




## Step 1: Data Cleaning & Preprocessing

In [28]:
# Import packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

In [29]:
# Import data
mushrooms = pd.read_csv('./Resources/secondary_data_shuffled.csv', sep=';')

In [30]:
# Check the data
mushrooms.tail()

Unnamed: 0,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
61064,p,12.79,x,e,n,t,p,,e,9.6,...,c,,y,,,f,f,,d,u
61065,p,2.42,x,d,w,f,a,d,p,3.52,...,,h,w,,,f,f,,g,u
61066,e,12.33,s,,u,f,s,c,u,7.71,...,,,u,,,f,f,,d,a
61067,p,3.85,s,w,u,f,a,c,u,5.32,...,,s,u,,,f,f,,l,a
61068,p,1.98,x,i,k,f,a,,w,3.16,...,,,w,,,f,f,p,g,a


In [31]:
# The data must be cleaned, normalized, and standardized prior to modeling.
# This step only drops every column that contains at least one missing value.
clean = mushrooms.dropna(axis=1)
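
Dropping whole columns can discard a lot of information, so it may help to check how much is missing per column first. This is a minimal sketch on a toy frame; the frame contents and the 0.5 threshold are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; not the real data.
toy = pd.DataFrame({
    "cap-diameter": [1.2, 3.4, 5.6, 7.8],
    "stem-root":    [np.nan, np.nan, np.nan, "b"],
    "gill-spacing": ["c", np.nan, "d", "c"],
})

# Fraction of missing values per column, largest first.
missing = toy.isna().mean().sort_values(ascending=False)
print(missing)

# Keep columns whose missing fraction is at most a chosen threshold
# (0.5 is an arbitrary illustrative value).
kept = toy.loc[:, toy.isna().mean() <= 0.5]
print(list(kept.columns))
```

By contrast, `dropna(axis=1)` as used above keeps only columns with no missing values at all.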

## Step 2: Data Exploration & Visualization

In [32]:
clean.head()

Unnamed: 0,class,cap-diameter,cap-shape,cap-color,does-bruise-or-bleed,gill-color,stem-height,stem-width,stem-color,has-ring,habitat,season
0,e,1.26,x,y,f,w,5.04,1.73,y,f,d,a
1,e,10.32,f,b,f,b,4.68,19.44,w,t,d,a
2,p,0.92,x,p,f,p,4.59,1.15,k,f,d,u
3,p,4.27,x,p,f,w,4.55,6.52,w,f,d,a
4,e,3.08,f,w,f,w,2.67,5.18,w,f,m,a


In [33]:
clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-color             61069 non-null  object 
 4   does-bruise-or-bleed  61069 non-null  object 
 5   gill-color            61069 non-null  object 
 6   stem-height           61069 non-null  float64
 7   stem-width            61069 non-null  float64
 8   stem-color            61069 non-null  object 
 9   has-ring              61069 non-null  object 
 10  habitat               61069 non-null  object 
 11  season                61069 non-null  object 
dtypes: float64(3), object(9)
memory usage: 5.6+ MB


In [34]:

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (classification_report, confusion_matrix, roc_curve,
                             roc_auc_score, accuracy_score, precision_recall_curve, auc)
from sklearn.tree import export_graphviz

In [37]:
X = clean.drop(['class'], axis=1)
y = clean['class']

In [39]:
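# Label-encode every feature column as integer codes.
# Note: this loop also re-codes the three float columns (cap-diameter,
# stem-height, stem-width) as ranks of their sorted unique values,
# which is visible in the X.head() output.
labelencoder_clean = LabelEncoder()
for column in X.columns:
    X[column] = labelencoder_clean.fit_transform(X[column])

# Encode the target; LabelEncoder sorts the classes, so e -> 0 and p -> 1.
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)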

In [40]:
X.head()

Unnamed: 0,cap-diameter,cap-shape,cap-color,does-bruise-or-bleed,gill-color,stem-height,stem-width,stem-color,has-ring,habitat,season
0,84,6,11,0,10,394,119,12,0,0,0
1,990,2,0,0,0,358,1890,11,1,0,0
2,50,6,7,0,7,349,61,4,0,0,2
3,385,6,7,0,10,345,598,11,0,0,0
4,266,2,10,0,10,157,464,11,0,4,0
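
The output above shows that the measured columns (for example cap-diameter 1.26 became 84) were replaced by integer codes along with the nominal ones. One possible alternative, sketched here on invented toy rows, is to encode only the non-numeric columns and leave the measurements untouched. Swapping in `OrdinalEncoder` is an assumption of this sketch, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows for illustration only.
toy = pd.DataFrame({
    "cap-diameter": [1.26, 10.32, 0.92],
    "cap-shape":    ["x", "f", "x"],
    "season":       ["a", "a", "u"],
})

# Pick the non-numeric (nominal) columns; numeric columns are left as they are.
cat_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

encoded = toy.copy()
# OrdinalEncoder assigns codes by sorted category order within each column.
encoded[cat_cols] = OrdinalEncoder().fit_transform(toy[cat_cols])
print(encoded)
```

Ordinal codes still impose an arbitrary order on nominal values, so one-hot encoding may be preferable for models that treat the codes as magnitudes.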


## Step 3: Predictive Analyses

### Principal Components Analysis

In [None]:
# Import the PCA module
from sklearn.decomposition import PCA
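
The cell above only imports the module. As a rough sketch of how PCA might be applied, the example below standardizes a synthetic matrix and projects it onto two components. The array `X_demo` and the choice of two components are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric feature matrix (100 rows, 5 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))

# Standardize first so that no single column dominates the variance.
X_scaled = StandardScaler().fit_transform(X_demo)

# Project onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratios indicate how much of the total variance each retained component captures.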

### Random Forest

In [None]:
# Import the Random Forest module
from sklearn.ensemble import RandomForestClassifier
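
The cell above only imports the classifier. A minimal sketch of a random forest workflow follows, using synthetic data from `make_classification` in place of the mushroom features. The hyperparameters and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the forest and evaluate accuracy on the held-out split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)

# Impurity-based importances, one value per feature, summing to 1.
importances = forest.feature_importances_
print(test_accuracy)
print(importances)
```

Feature importances are one way to approach the stated objective of finding the features most associated with each class.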

### Logistic Regression

In [None]:
# Import the Logistic Regression modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
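
The cell above only imports the modules. A minimal sketch of cross-validated logistic regression follows, again on synthetic data. The pipeline with `StandardScaler`, the `max_iter` value, and the five folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Accuracy on each of the five folds.
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores)
print(scores.mean())
```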