# Mushroom Prediction: A Preliminary Notebook

<div style="border: 2px double #dcdcdc; padding: 10px; border-radius: 5px; background-color: #202020; max-width: 97.5%; overflow-x: auto;">
<h2> Version Log </h2>
<p>
<br>[5_8_2024]: Updated the data-cleaning process.
<br>[4_8_2024]: Debugged the models and their parameters. Improved the code for easier migration to Kaggle.
<br>[3_8_2024]: Completed data preprocessing. Constructed and trained various base-level models and ensembled them into a meta-model. 
                Made predictions, MCC = 0.98
<br>[2_8_2024]: Completed data cleaning 
</p>
</div>

## 1. Setup Environment:

In [1]:
## This is a Jupyter notebook for the Kaggle project: Mushroom Classification
# %pip install ydata-profiling
# %pip install numpy
# %pip install --upgrade pandas
# %pip install --upgrade matplotlib
# %pip install --upgrade seaborn
# %pip install --upgrade scikit-learn
# %pip install --upgrade scipy
# %pip install --upgrade catboost
# %pip install --upgrade xgboost
# %pip install --upgrade lightgbm
# %pip install ipywidgets

In [2]:
## Import libraries
import os

## Data analysis and wrangling
import numpy as np
import pandas as pd
import random as rnd

## Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import set_config
from ydata_profiling import ProfileReport
%matplotlib inline 
from scipy.stats import boxcox

# Metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import matthews_corrcoef


# Machine learning: classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier

# Model selection
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


#Palette
palette = ['#328ca9', '#0e6ea9', '#2c4ea3', '#193882', '#102446']

# Set the style of the visualization
sns.set(style="whitegrid")

# Random seed used throughout the notebook for reproducibility
SEED = 42

## 2.  Problem identification

### Problem Statement:

This is one of the 2024 playground competitions on Kaggle. 

The main goal of the project is to develop a classifier that labels mushrooms as edible or poisonous based on physical characteristics presented in tabular format. 
The performance of the model will be assessed by the Matthews correlation coefficient (MCC), which is calculated as:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
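
As a quick check of the formula, the sketch below computes MCC from raw confusion-matrix counts on a small hand-made example. TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. The helper name `mcc_from_counts` and the toy labels are illustrative assumptions, and returning 0 for a zero denominator is a common convention rather than something stated in the competition description.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when the denominator is zero
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Toy labels (1 = poisonous, 0 = edible), made up for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

mcc = mcc_from_counts(tp, tn, fp, fn)
print(tp, tn, fp, fn, mcc)  # 3 3 1 1 0.5
```

In the notebook itself, `sklearn.metrics.matthews_corrcoef` computes the same quantity directly from label arrays.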


<br> Reference:
<br> [1] Walter Reade, Ashley Chow. (2024). Binary Prediction of Poisonous Mushrooms. Kaggle. https://kaggle.com/competitions/playground-series-s4e8
<br> [2] https://archive.ics.uci.edu/dataset/73/mushroom 

## 3. Reading Data

In [3]:
## Reading data

# Check if running on Kaggle
if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    train_df = pd.read_csv('/kaggle/input/playground-series-s4e8/train.csv')
    test_df = pd.read_csv('/kaggle/input/playground-series-s4e8/test.csv')
else:
    # Local run: the data files are expected in the Input folder
    train_df = pd.read_csv(os.path.join('Input', 'train.csv'))
    test_df = pd.read_csv(os.path.join('Input', 'test.csv'))

## 4. Data Inspection

### 4.1. The number of features

In [4]:
# Number of columns and rows in the dataset
print(train_df.columns.values)
print(train_df.shape)

['id' 'class' 'cap-diameter' 'cap-shape' 'cap-surface' 'cap-color'
 'does-bruise-or-bleed' 'gill-attachment' 'gill-spacing' 'gill-color'
 'stem-height' 'stem-width' 'stem-root' 'stem-surface' 'stem-color'
 'veil-type' 'veil-color' 'has-ring' 'ring-type' 'spore-print-color'
 'habitat' 'season']
(3116945, 22)


<div style="border: 2px solid #999999; padding: 10px; border-radius: 5px; background-color: #282828; max-width: 97.5%; overflow-x: auto;">
<p>
<br> - The training dataset has 3116945 entries and 22 columns. 
<br> - The id is the index of the data entry, the class is the target variable, and the remaining 20 columns are features.  
<br> - The cap, gill, stem, veil and ring are different parts of a mushroom that can be found in the anatomy shown below. 
<br> - The features are described below, based on https://archive.ics.uci.edu/dataset/73/mushroom. According to that source, every entry in a categorical feature should contain <b> exactly 1 letter </b>. 
</p>
</div>
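
The single-letter rule can be checked column by column. The following is a minimal sketch on a made-up dataframe; the helper name `count_invalid` and the toy values are assumptions for illustration, and the notebook's own cleaning step may be implemented differently.

```python
import pandas as pd

# Toy data standing in for two categorical columns (values are invented)
toy = pd.DataFrame({
    "cap-shape": ["x", "f", "7 x", "bell", None],
    "habitat": ["d", "g", "d", "3.5", "l"],
})

def count_invalid(series):
    """Count non-missing entries that are not exactly one ASCII letter."""
    values = series.dropna().astype(str)
    return int((~values.str.fullmatch(r"[A-Za-z]")).sum())

invalid_counts = {col: count_invalid(toy[col]) for col in toy.columns}
print(invalid_counts)  # {'cap-shape': 2, 'habitat': 1}
```

Missing values are ignored here on purpose, so the count reflects only malformed entries.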


1. id                   = the index of the data entry
2. class                = e (edible) / p (poisonous) - <b>The target variable</b>
3. cap-diameter         = the diameter of the mushroom cap
4. cap-shape            = a shape descriptor of the mushroom's cap
5. cap-surface          = a surface descriptor of the cap
6. cap-color            = the color of the mushroom cap
7. does-bruise-or-bleed = whether the mushroom changes color when bruised or cut
8. gill-attachment      = how the gills are attached to the stem
9. gill-spacing         = the spacing of the gills under the cap
10. gill-color          = the color of the gills
11. stem-height         = the height of the stem
12. stem-width          = the width of the stem
13. stem-root           = the type of root at the base of the stem
14. stem-surface        = a surface descriptor of the stem
15. stem-color          = the color of the stem
16. veil-type           = the type of the veil
17. veil-color          = the color of the veil
18. has-ring            = whether the mushroom has a ring
19. ring-type           = the type of the ring
20. spore-print-color   = the color of the spore print, obtained by placing the cut cap gill-side down on a sheet of blank paper
21. habitat             = the habitat in which the mushroom grows
22. season              = the season in which the mushroom was collected



![Anatomy of a mushroom](<Anatomy of a Mushroom Graphic-1.webp>)

In [5]:
# Look at the first 5 rows of the dataset
train_df.head()

Unnamed: 0,id,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
0,0,e,8.8,f,s,u,f,a,c,w,...,,,w,,,f,f,,d,a
1,1,p,4.51,x,h,o,f,a,c,n,...,,y,o,,,t,z,,d,w
2,2,e,6.94,f,s,b,f,x,c,w,...,,s,n,,,f,f,,l,w
3,3,e,3.88,f,y,g,f,s,,g,...,,,w,,,f,f,,d,u
4,4,e,5.85,x,l,w,f,d,,w,...,,,w,,,f,f,,g,a


In [6]:
# Look at the last 5 rows of the dataset
train_df.tail()

Unnamed: 0,id,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
3116940,3116940,e,9.29,f,,n,t,,,w,...,b,,w,u,w,t,g,,d,u
3116941,3116941,e,10.88,s,,w,t,d,c,p,...,,,w,,,f,f,,d,u
3116942,3116942,p,7.82,x,e,e,f,a,,w,...,,,y,,w,t,z,,d,a
3116943,3116943,e,9.45,p,i,n,t,e,,p,...,,y,w,,,t,p,,d,u
3116944,3116944,p,3.2,x,s,g,f,d,c,w,...,,,w,,,f,f,,g,u


### 4.2. Data type of the variables

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3116945 entries, 0 to 3116944
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   id                    int64  
 1   class                 object 
 2   cap-diameter          float64
 3   cap-shape             object 
 4   cap-surface           object 
 5   cap-color             object 
 6   does-bruise-or-bleed  object 
 7   gill-attachment       object 
 8   gill-spacing          object 
 9   gill-color            object 
 10  stem-height           float64
 11  stem-width            float64
 12  stem-root             object 
 13  stem-surface          object 
 14  stem-color            object 
 15  veil-type             object 
 16  veil-color            object 
 17  has-ring              object 
 18  ring-type             object 
 19  spore-print-color     object 
 20  habitat               object 
 21  season                object 
dtypes: float64(3), int64(1), object(18)
memory

<div style="border: 2px solid #999999; padding: 10px; border-radius: 5px; background-color: #282828; max-width: 97.5%; overflow-x: auto;">
<p>
<br> Most of the variables are categorical; the exceptions are id, cap-diameter, stem-height and stem-width, which are numerical.  </p>
</div>


In [8]:
# Display the unique values and the count of unique values in the dataset
train_df.describe(include=['O'])


Unnamed: 0,class,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
count,3116945,3116905,2445922,3116933,3116937,2593009,1858510,3116888,359922,1136084,3116907,159452,375998,3116921,2988065,267263,3116900,3116945
unique,2,74,83,78,26,78,48,63,38,60,59,22,24,23,40,32,52,4
top,p,x,t,n,f,a,c,w,b,s,w,u,w,f,f,k,d,a
freq,1705396,1436026,460777,1359542,2569743,646034,1331054,931538,165801,327610,1196637,159373,279070,2368820,2477170,107310,2177573,1543321


### 4.3. Handling missing, distinct and duplicated entries:

#### 4.3.1 Number of Missing/ distinct and duplicated entries:

In [9]:
print("Missing Data in the training dataset (%)")
print("="*100)
print((train_df.isnull().sum()/len(train_df)*100).sort_values(ascending = False).round(3))
print("="*100)

Missing Data in the training dataset (%)
veil-type               94.884
spore-print-color       91.425
stem-root               88.453
veil-color              87.937
stem-surface            63.551
gill-spacing            40.374
cap-surface             21.528
gill-attachment         16.809
ring-type                4.135
gill-color               0.002
habitat                  0.001
cap-shape                0.001
stem-color               0.001
has-ring                 0.001
cap-color                0.000
does-bruise-or-bleed     0.000
cap-diameter             0.000
id                       0.000
stem-width               0.000
class                    0.000
stem-height              0.000
season                   0.000
dtype: float64


In [10]:
print("Distinct values in the training dataset (count)")
print("="*100)
print(train_df.nunique().sort_values(ascending = False))
print("="*100)

Distinct values in the training dataset (count)
id                      3116945
stem-width                 5836
cap-diameter               3913
stem-height                2749
cap-surface                  83
cap-color                    78
gill-attachment              78
cap-shape                    74
gill-color                   63
stem-surface                 60
stem-color                   59
habitat                      52
gill-spacing                 48
ring-type                    40
stem-root                    38
spore-print-color            32
does-bruise-or-bleed         26
veil-color                   24
has-ring                     23
veil-type                    22
season                        4
class                         2
dtype: int64


In [11]:
print("Duplicated Data in the training dataset (%)")
print("="*100)
print((train_df.duplicated().sum()/len(train_df)*100).round(3))
print("="*100)

Duplicated Data in the training dataset (%)
0.0


<div style="border: 2px solid #999999; padding: 10px; border-radius: 5px; background-color: #282828; max-width: 97.5%; overflow-x: auto;">
<p>
<br>  Several features have missing values, and for some of them the missing entries make up a large share of the column (e.g. veil-type and spore-print-color, each more than 90% missing). </p>
</div>

b. Understand the cause and type of missing data. 

- Different types of missing data may call for different handling methods. The following are some types of missing data and their typical handling methods.

    - Missing Completely at Random (MCAR):
        - Handling method: Simple imputation
        - Cause: The probability of missing an entry is completely independent of the values of the variables in the dataset, as well as the unobserved data.
        - Example: a sensor malfunctions and randomly fails to record some measurements
    
    - Missing at Random (MAR): 
        - Handling method: advanced imputation techniques
        - Cause: The probability of a data point being missing depends on the observed variables in the dataset, but not on the unobserved (missing) data.
        - Example: income data is more likely to be missing for individuals with lower education levels
    
    - Missing Not at Random (MNAR):
        - Handling method: pattern mixture models or selection models
        - Cause: The probability of a data point being missing depends on the unobserved (missing) data itself.
        - Example: individuals with higher income are less likely to report their income
        
    - Systematic Missing Data:
        - Handling method: Simple imputation or creating a proxy variable
        - Cause: Entire variables or features are missing from the dataset, typically due to issues in data collection or data processing.
        - Example: A sensor was not installed on certain devices, resulting in the absence of data for a specific feature.
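
A rough first diagnostic for the mechanisms listed above is to compare how often a feature is missing across groups of an observed variable, for example the target class. The sketch below uses invented values. A clear gap between groups suggests that the data are not missing completely at random, although this check alone cannot separate MAR from MNAR.

```python
import pandas as pd

# Toy data: 'stem-root' is missing more often for one class (values are invented)
toy = pd.DataFrame({
    "class":     ["e", "e", "e", "e", "p", "p", "p", "p"],
    "stem-root": ["b", None, "b", "s", None, None, None, "b"],
})

# Share of missing 'stem-root' entries within each class
missing_rate_by_class = toy["stem-root"].isna().groupby(toy["class"]).mean()
print(missing_rate_by_class)
```

In this toy example the missing rate is 0.25 for class `e` and 0.75 for class `p`.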

<div style="border: 2px solid #999999; padding: 10px; border-radius: 5px; background-color: #282828; max-width: 97.5%; overflow-x: auto;">
<p>
<br> - By inspecting the entries in various columns, it is found that some categorical features contain entries with digits or more than one letter, which might be due to issues in data collection or data processing. 
<br>
<br> - Therefore, we decided to impute these entries with the most frequent value in the column. To reduce cardinality, we also decided to group all values that account for less than 5% of the column count into a new category "Other".</p>
</div>
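
The rare-category grouping described above can be sketched as a small helper. The function name `group_rare` and the toy series are hypothetical. The sketch learns the frequent labels from the training column only and applies them to both splits. Whether missing values should stay missing or be folded into "Other" is a design choice; this version keeps them missing so that a later imputation step can handle them.

```python
import numpy as np
import pandas as pd

def group_rare(train_col, test_col, threshold=0.05, other="Other"):
    """Replace labels rarer than `threshold` in the training column with `other`."""
    freq = train_col.value_counts(normalize=True)  # NaN is excluded by default
    keep = freq[freq > threshold].index

    def _apply(col):
        # Keep frequent labels and NaN; everything else becomes `other`
        return col.where(col.isin(keep) | col.isna(), other)

    return _apply(train_col), _apply(test_col)

# Toy columns (values are invented)
train = pd.Series(["w"] * 15 + ["n"] * 10 + ["k"] + [np.nan], name="gill-color")
test = pd.Series(["w", "k", "z", np.nan], name="gill-color")

train_out, test_out = group_rare(train, test)
print(test_out.tolist())
```

The label `z` never appears in the training column, so it is also mapped to "Other" in the test column.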

c. Handling of the missing/distinct/duplicated data.



In [12]:
# Drop the 'id' column since it's unnecessary for the prediction process
train_df = train_df.drop(['id'], axis=1)
test_df = test_df.drop(['id'], axis=1)

# Drop features that have > 50% missing values
train_df = train_df.dropna(thresh=0.5*len(train_df), axis=1)

# Store the target column 'class' in y, then drop it from the features
y = train_df['class']
train_df = train_df.drop(['class'], axis=1)

# Drop the same features in the test dataset
test_df = test_df[train_df.columns]
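
To make the column filter above concrete, here is a small sketch of how `dropna(thresh=..., axis=1)` behaves. `thresh` is the minimum number of non-missing values a column needs in order to be kept, so a column with exactly 50% missing values survives. The toy dataframe and the explicit integer conversion are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 'veil-type' is 75% missing, 'gill-spacing' is exactly 50% missing
toy = pd.DataFrame({
    "cap-diameter": [8.8, 4.51, 6.94, 3.88],
    "veil-type":    [None, None, None, "u"],
    "gill-spacing": ["c", None, "d", None],
})

# Keep columns that have at least half of their values present
min_non_null = int(np.ceil(0.5 * len(toy)))
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))  # ['cap-diameter', 'gill-spacing']
```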


In [13]:
# Separate the numerical and categorical columns
numerical_cols = train_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = train_df.select_dtypes(include=[object]).columns.tolist()

# Replace entries that contain non-letter characters, or that are longer than one character, with NaN
train_df = train_df.replace({'^.*[^a-zA-Z].*$': np.nan, '^.{2,}$': np.nan}, regex=True)
# Relabel 'f'/'t' as 'False'/'True'; this applies to every string column, not only the boolean ones
train_df = train_df.replace({'f': 'False', 't': 'True'}, regex=True)
test_df = test_df.replace({'^.*[^a-zA-Z].*$': np.nan, '^.{2,}$': np.nan}, regex=True)
test_df = test_df.replace({'f': 'False', 't': 'True'}, regex=True)

# Inspect the number of distinct values in each column
print("Distinct values in the training dataset (count)")
print("="*100)
print(train_df.nunique().sort_values(ascending = False))
print("="*100)

Distinct values in the training dataset (count)
stem-width              5836
cap-diameter            3913
stem-height             2749
cap-shape                 23
cap-surface               23
cap-color                 23
gill-attachment           23
gill-color                23
stem-color                23
ring-type                 23
habitat                   23
does-bruise-or-bleed      22
has-ring                  21
gill-spacing              19
season                     4
dtype: int64
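
The cell above applies the 'f'/'t' relabeling to every string column, so a value such as cap-shape 'f' also becomes 'False'. If the relabeling is meant only for the yes/no features, one possible alternative is to restrict it to those columns. The sketch below assumes that `does-bruise-or-bleed` and `has-ring` are the boolean features; that choice and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data (values are invented)
toy = pd.DataFrame({
    "cap-shape":            ["f", "x", "t"],
    "has-ring":             ["f", "t", "f"],
    "does-bruise-or-bleed": ["t", "f", "f"],
})

bool_cols = ["does-bruise-or-bleed", "has-ring"]  # assumed boolean features
flag_map = {"f": "False", "t": "True"}

restricted = toy.copy()
# Exact-value replacement, limited to the boolean columns
restricted[bool_cols] = restricted[bool_cols].replace(flag_map)
print(restricted)
```

With this variant, `cap-shape` keeps its original letter codes while the two flag columns carry readable labels.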


In [14]:
# Inspect the proportion of each value in one selected column
percentage = train_df['gill-color'].value_counts(normalize=True)
percentage

gill-color
w        0.298874
n        0.174340
y        0.150623
p        0.110249
g        0.068071
o        0.050410
k        0.041058
False    0.038403
r        0.020148
e        0.017982
b        0.015159
u        0.014566
l        0.000018
d        0.000017
True     0.000017
s        0.000015
x        0.000011
c        0.000010
a        0.000009
h        0.000008
z        0.000005
m        0.000005
i        0.000003
Name: proportion, dtype: float64

In [15]:
# Keep a category only if it accounts for more than 5% of the non-missing entries in the training column; otherwise replace it with 'Other'
# Note: NaN is never in labels_to_keep, so missing entries are also mapped to 'Other' here
threshold = 0.05
for col in categorical_cols:
    value_counts = train_df[col].value_counts(normalize=True)
    labels_to_keep = value_counts[value_counts > threshold].index.tolist()
    train_df[col] = np.where(train_df[col].isin(labels_to_keep), train_df[col], 'Other')
    test_df[col] = np.where(test_df[col].isin(labels_to_keep), test_df[col], 'Other')

# Preview the updated dataframe
train_df

Unnamed: 0,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,stem-width,stem-color,has-ring,ring-type,habitat,season
0,8.80,False,s,Other,False,a,c,w,4.51,15.39,w,False,False,d,a
1,4.51,x,h,o,False,a,c,n,4.79,6.48,Other,True,Other,d,w
2,6.94,False,s,Other,False,x,c,w,6.85,9.93,n,False,False,l,w
3,3.88,False,y,g,False,s,Other,g,4.16,6.53,w,False,False,d,u
4,5.85,x,Other,w,False,d,Other,w,3.37,8.36,w,False,False,g,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3116940,9.29,False,Other,n,True,Other,Other,w,12.14,18.81,w,True,Other,d,u
3116941,10.88,s,Other,w,True,d,c,p,6.65,26.97,w,False,False,d,u
3116942,7.82,x,Other,e,False,a,Other,w,9.51,11.06,y,True,Other,d,a
3116943,9.45,Other,Other,n,True,e,Other,p,9.13,17.77,w,True,Other,d,u


In [16]:
# Check the unique values in the categorical columns
print("Distinct values in the training dataset (count)")
print("="*100)
print(train_df.nunique().sort_values(ascending = False))
print("="*100)

Distinct values in the training dataset (count)
stem-width              5836
cap-diameter            3913
stem-height             2749
cap-surface                8
cap-color                  7
gill-attachment            7
gill-color                 7
cap-shape                  5
gill-spacing               4
stem-color                 4
habitat                    4
season                     4
does-bruise-or-bleed       3
has-ring                   3
ring-type                  2
dtype: int64


In [17]:
# Inspect the proportion of each value in one selected column
percentage = train_df['gill-color'].value_counts(normalize=True)
percentage

gill-color
w        0.298863
n        0.174333
y        0.150617
Other    0.147467
p        0.110244
g        0.068068
o        0.050408
Name: proportion, dtype: float64

In [18]:
# Impute missing values with the most frequent value (mode) of each column
for col in train_df.columns:
    if train_df[col].isnull().sum() > 0:
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        train_df[col] = train_df[col].fillna(train_df[col].mode()[0])

for col in test_df.columns:
    if test_df[col].isnull().sum() > 0:
        test_df[col] = test_df[col].fillna(test_df[col].mode()[0])

# Check if there are any missing values
print("Missing Data in the training dataset (%)")
print("="*100)
print((train_df.isnull().sum()/len(train_df)*100).sort_values(ascending = False).round(3))
print("="*100)

Missing Data in the training dataset (%)
cap-diameter            0.0
cap-shape               0.0
cap-surface             0.0
cap-color               0.0
does-bruise-or-bleed    0.0
gill-attachment         0.0
gill-spacing            0.0
gill-color              0.0
stem-height             0.0
stem-width              0.0
stem-color              0.0
has-ring                0.0
ring-type               0.0
habitat                 0.0
season                  0.0
dtype: float64
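
A common variant of mode imputation learns the fill values from the training set only and reuses them for the test set, so that both splits are filled consistently. The sketch below shows that variant on invented data. It is an alternative under that assumption, not the exact procedure used in the cell above, where each dataframe is filled with its own mode.

```python
import numpy as np
import pandas as pd

# Toy data (values are invented)
train = pd.DataFrame({
    "cap-diameter": [5.0, np.nan, 5.0, 7.0],
    "habitat":      ["d", "d", None, "g"],
})
test = pd.DataFrame({
    "cap-diameter": [np.nan, 6.0],
    "habitat":      [None, "l"],
})

# Most frequent value of each training column (missing values are ignored by mode())
fill_values = {col: train[col].mode()[0] for col in train.columns}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```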


In [19]:
train_df['does-bruise-or-bleed'].dtype

dtype('O')

In [20]:
# Save the cleaned data
out_dir = "/kaggle/working" if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ else os.path.join("Output", "Cleaned_Data")
train_df.to_csv(os.path.join(out_dir, "train_cleaned.csv"), index=False)
test_df.to_csv(os.path.join(out_dir, "test_cleaned.csv"), index=False)
y.to_csv(os.path.join(out_dir, "target.csv"), index=False)

In [21]:
# Reload the saved files to check them
train_df = pd.read_csv(os.path.join(out_dir, "train_cleaned.csv"))
test_df = pd.read_csv(os.path.join(out_dir, "test_cleaned.csv"))
y = pd.read_csv(os.path.join(out_dir, "target.csv"))
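
One thing a CSV round trip can change is the dtype: string labels such as 'True' and 'False' may be parsed as booleans when the file is read back. The sketch below illustrates this with an in-memory buffer and shows how passing `dtype` to `pd.read_csv` keeps a column as text. The toy dataframe is invented, and whether the notebook needs this depends on how the columns are used later.

```python
import io
import pandas as pd

# Toy data: a flag column stored as text labels (values are invented)
df = pd.DataFrame({
    "has-ring": ["True", "False", "True"],
    "stem-width": [15.39, 6.48, 9.93],
})
csv_text = df.to_csv(index=False)  # returns the CSV as a string when no path is given

# Default parsing infers a boolean column from the text labels
default = pd.read_csv(io.StringIO(csv_text))
# Forcing the dtype keeps the labels as strings
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"has-ring": str})

print(default["has-ring"].dtype)
print(as_text["has-ring"].tolist())
```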


<div style="border: 2px solid #999999; padding: 10px; border-radius: 5px; background-color: #282828; max-width: 97.5%; overflow-x: auto;">
<p>
Here we successfully:
<br> - 1. removed all the mistyped entries in categorical features, 
<br> - 2. reduced the number of unique values in each feature by grouping values that account for less than 5% of the total count into a new category "Other", and 
<br> - 3. imputed the remaining missing values with the most frequent value of each feature.
</p>
</div>

### 4.4. Summary of the dataset by ProfileReport

In [22]:
# profile=ProfileReport(train_df,title='Pandas Profiling Report',explorative=True)
# profile.to_notebook_iframe()