# Mushroom classification - introduction
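In this notebook, we explore the mushroom dataset to understand its characteristics, then build a predictive model that determines whether a mushroom is edible. For reference, the categorical attributes of the original UCI mushroom dataset are listed below. Note that the file loaded in this notebook, `secondary_data.csv`, uses a related but different set of columns, as the `df.info()` output in the data cleaning section shows.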
1. cap-shape:                bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?:                 bruises=t,no=f
5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment:          attached=a,descending=d,free=f,notched=n
7. gill-spacing:             close=c,crowded=w,distant=d
8. gill-size:                broad=b,narrow=n
9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape:              enlarging=e,tapering=t
11. stalk-root:               bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type:                partial=p,universal=u
17. veil-color:               brown=n,orange=o,white=w,yellow=y
18. ring-number:              none=n,one=o,two=t
19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population:               abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

For prediction, we will use Logistic Regression, Random Forest Classifier, SVM, XGBoost, and Neural Networks. We will compare their performance and select the best model.

First, we will perform exploratory data analysis (EDA) to get familiar with the data and gain some initial insights.

# 1. Data cleaning
Before we start with EDA or modeling, we need to address the quality of the available data, which mainly means handling missing values.


In [24]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import statsmodels.api as sm
from patsy import dmatrices

df = pd.read_csv('../Mushroom-Classification/MushroomDataset/secondary_data.csv', sep=';')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-surface           46949 non-null  object 
 4   cap-color             61069 non-null  object 
 5   does-bruise-or-bleed  61069 non-null  object 
 6   gill-attachment       51185 non-null  object 
 7   gill-spacing          36006 non-null  object 
 8   gill-color            61069 non-null  object 
 9   stem-height           61069 non-null  float64
 10  stem-width            61069 non-null  float64
 11  stem-root             9531 non-null   object 
 12  stem-surface          22945 non-null  object 
 13  stem-color            61069 non-null  object 
 14  veil-type             3177 non-null   object 
 15  veil-color         
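
The non-null counts above already show which columns are mostly empty. As a rough check, the sketch below recomputes the missing share for the columns whose counts are visible in that output. The `non_null` dictionary is typed in by hand from the output and is only illustrative; columns cut off in the listing are not included.

```python
# Non-null counts copied by hand from the df.info() output above
# (only the visible columns that contain missing values).
total_rows = 61069
non_null = {
    "cap-surface": 46949,
    "gill-attachment": 51185,
    "gill-spacing": 36006,
    "stem-root": 9531,
    "stem-surface": 22945,
    "veil-type": 3177,
}

# Share of missing values per column
missing_share = {col: 1 - count / total_rows for col, count in non_null.items()}

# Columns that are more than half empty
mostly_missing = sorted(col for col, share in missing_share.items() if share > 0.5)

# Print the shares from most to least missing
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<16} {share:.1%}")
print(mostly_missing)
```

Of the visible columns, only `stem-root`, `stem-surface`, and `veil-type` are more than half empty.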

In [25]:
# Functions
def remove_columns_with_many_missing(df, threshold=0.5):
    # Calculate the proportion of missing values for each column
    missing_proportion = df.isnull().mean()
    
    # Identify columns to drop
    columns_to_drop = missing_proportion[missing_proportion > threshold].index
    
    # Drop the columns
    df_cleaned = df.drop(columns=columns_to_drop)
    
    return df_cleaned
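
To see how the threshold behaves, here is a small sketch on a made-up DataFrame. The `toy` frame and its column names are hypothetical, and the function is repeated so the block runs on its own. Because the comparison is a strict `>`, a column that is exactly 50% missing is kept.

```python
import pandas as pd

def remove_columns_with_many_missing(df, threshold=0.5):
    # Same logic as the function above: drop columns whose
    # missing proportion is strictly greater than the threshold
    missing_proportion = df.isnull().mean()
    columns_to_drop = missing_proportion[missing_proportion > threshold].index
    return df.drop(columns=columns_to_drop)

# Hypothetical data: 0%, 50%, and 75% missing
toy = pd.DataFrame({
    "full": ["a", "b", "c", "d"],
    "half": ["a", None, "c", None],
    "sparse": [None, None, None, "d"],
})

# Default threshold of 0.5 drops only the 75%-missing column
kept = list(remove_columns_with_many_missing(toy).columns)

# A stricter threshold also drops the 50%-missing column
stricter = list(remove_columns_with_many_missing(toy, threshold=0.25).columns)

print(kept, stricter)
```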

In [26]:
# remove columns that have more than 50% missing values
df = remove_columns_with_many_missing(df)

# fill the remaining missing values with a new 'missing' category
df.fillna('missing', inplace=True)

# map class values (edible -> 0, poisonous -> 1); the mapping is currently disabled
classMap = {'e': 0, 'p': 1}
#df['class'] = df['class'].map(classMap)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-surface           61069 non-null  object 
 4   cap-color             61069 non-null  object 
 5   does-bruise-or-bleed  61069 non-null  object 
 6   gill-attachment       61069 non-null  object 
 7   gill-spacing          61069 non-null  object 
 8   gill-color            61069 non-null  object 
 9   stem-height           61069 non-null  float64
 10  stem-width            61069 non-null  float64
 11  stem-color            61069 non-null  object 
 12  has-ring              61069 non-null  object 
 13  ring-type             61069 non-null  object 
 14  habitat               61069 non-null  object 
 15  season             
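
The `classMap` dictionary is defined, but the mapping line is commented out, so `class` remains an object column. If a numeric target is needed later, one way to apply the mapping is sketched below on made-up data (the `labels` and `unexpected` Series are hypothetical). `Series.map` returns NaN for any value that is missing from the dictionary, so it is worth checking for that before overwriting the column.

```python
import pandas as pd

classMap = {'e': 0, 'p': 1}

# Hypothetical labels standing in for df['class']
labels = pd.Series(['e', 'p', 'p', 'e'], name='class')

# Apply the mapping and confirm that every label was recognized
encoded = labels.map(classMap)
if encoded.isna().any():
    raise ValueError("class column contains labels missing from classMap")
encoded = encoded.astype(int)
print(encoded.tolist())

# An unknown label is turned into NaN rather than raising an error
unexpected = pd.Series(['e', 'x']).map(classMap)
n_unmapped = int(unexpected.isna().sum())
print(n_unmapped)
```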

In [36]:
# let's visualize the class distribution by cap shape

capShapedf = df[['class', 'cap-shape']].groupby(['cap-shape', 'class']).size().reset_index(name='Count')

fig = px.bar(capShapedf, x='cap-shape', y='Count', color='class', barmode='group',
            labels={'cap-shape': 'Cap Shape', 'Count': 'Count', 'class': 'Class'})

fig.show()

In [33]:
capShapedf

  cap-shape class     0
0         b     e  1258
1         b     p  4436
2         c     e   774
3         c     p  1041
4         f     e  6502
5         f     p  6902
6         o     e   825
7         o     p  2635
8         p     e  1567
9         p     p  1031
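
Raw counts are hard to compare across cap shapes because the groups differ in size. Below is a minimal sketch of turning them into a poisonous share per cap shape. It uses only the rows displayed above, typed in by hand, so any shapes not shown in this output are not covered; the `poisonous_share` column name is an illustrative choice.

```python
import pandas as pd

# Counts copied by hand from the table above
counts = pd.DataFrame({
    "cap-shape": ["b", "b", "c", "c", "f", "f", "o", "o", "p", "p"],
    "class": ["e", "p"] * 5,
    "Count": [1258, 4436, 774, 1041, 6502, 6902, 825, 2635, 1567, 1031],
})

# One row per cap shape, one column per class
wide = counts.pivot(index="cap-shape", columns="class", values="Count")

# Fraction of poisonous samples within each cap shape
wide["poisonous_share"] = wide["p"] / (wide["e"] + wide["p"])

print(wide.sort_values("poisonous_share", ascending=False).round(3))
```

Among the rows shown, shape `b` has the highest poisonous share and shape `p` the lowest.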
