# A Machine Learning Approach to Mushroom Classification: Predicting Edibility and Toxicity Based on Physical Characteristics Using the UC Irvine Audubon Society Field Guide Dataset    

Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/applied-ml-data-git-hub <br>
UCI Machine Learning Repository Link: https://archive.ics.uci.edu/dataset/73/mushroom <br>
18 March 2025 <br>

### Introduction
The objective of this mushroom classification project is to predict whether a mushroom is edible or poisonous based on its physical and chemical characteristics. The dataset describes hypothetical samples corresponding to 23 species of gilled mushrooms from the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this last group was combined with the poisonous class.

Unlike other biological classifications, there is no simple rule (such as "leaflets three, let it be" for poison ivy) to determine a mushroom’s safety. Instead, various morphological and chemical features must be analyzed, making machine learning a valuable tool for automating classification.

This project trains and compares three classifiers: <br>

- Decision Tree Classifier (DT)  

    A Decision Tree splits data into smaller groups based on decision rules (such as "is cap color white?"). It works like a flowchart, where each decision point leads to another question until a final classification is reached. <br>

    Strengths: Easy to interpret and fast to train. <br>
    Weaknesses: Can overfit if the tree becomes too complex. <br>

- Support Vector Machine (SVM)  

    A Support Vector Machine (SVM) finds the best boundary (a hyperplane) to separate edible and poisonous mushrooms. It is effective in handling non-linear relationships through the use of different kernel functions. <br>

    Strengths: Works well with complex data and is effective when a clear margin of separation exists. <br>
    Weaknesses: Computationally expensive for large datasets. <br>

- Neural Network (NN)  

    A Neural Network is inspired by how the human brain processes information. It consists of layers of interconnected "neurons" that process input data and learn patterns. Given the dataset’s complexity, a neural network may identify subtle relationships between mushroom characteristics and toxicity. <br>

    Strengths: Can capture complex patterns and non-linear relationships. <br>
    Weaknesses: Requires careful tuning to avoid overfitting and may need more training data. <br>
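
As a rough illustration of how the three model families above can be trained side by side, the sketch below fits each one on a small hypothetical table. The toy data, the made-up labeling rule (a FOUL odor means poisonous), and the hyperparameters are all assumptions chosen for illustration; they are not findings about the real dataset or the settings used later in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two categorical features, repeated to get 40 rows.
toy = pd.DataFrame({
    "odor": ["ALMOND", "FOUL", "NONE", "FOUL", "ANISE", "NONE", "FOUL", "ALMOND"] * 5,
    "gill_size": ["BROAD", "NARROW", "BROAD", "NARROW", "BROAD", "BROAD", "NARROW", "NARROW"] * 5,
})
# Made-up labeling rule, for illustration only.
toy["eat"] = toy["odor"].map(lambda o: "POISONOUS" if o == "FOUL" else "EDIBLE")

# One-hot encode the categorical features so every model receives numeric input.
X = pd.get_dummies(toy[["odor", "gill_size"]], dtype=int)
y = toy["eat"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative hyperparameters, not tuned values.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "SVM": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out rows
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

All three estimators share the same `fit` and `score` interface in scikit-learn, which is what makes a comparison loop like this possible.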


### Imports
Python libraries are collections of pre-written code that provide specific functionalities, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as os, sys, math, json, and datetime, come built-in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like pandas, numpy, matplotlib, seaborn, and scikit-learn, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>

Pandas is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

NumPy (Numerical Python) is a foundational library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. NumPy is a key component in scientific computing and machine learning. <br>
https://numpy.org/doc/stable/ <br>

Matplotlib is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

Seaborn is a statistical data visualization library built on top of Matplotlib, designed for creating visually appealing and informative plots. It simplifies complex visualizations, such as heatmaps, violin plots, and pair plots, making it easier to identify patterns and relationships in datasets. <br>
https://seaborn.pydata.org/ <br>

Scikit-learn provides a variety of tools for machine learning, including data preprocessing, model selection, and evaluation. It contains essential functions for building predictive models and analyzing datasets. <br>
sklearn.metrics: This module provides various performance metrics for evaluating machine learning models. <br>
https://scikit-learn.org/stable/modules/model_evaluation.html<br>

IPython.core.display is a module from the IPython library that provides tools for displaying rich output in Jupyter Notebooks, including formatted text, images, HTML, and interactive widgets. It enhances visualization and interaction within Jupyter environments. <br>
https://ipython.readthedocs.io/en/stable/api/generated/IPython.core.display.html <br>

In [24]:
# Data handling
import pandas as pd
import numpy as np

# Machine learning imports
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Notebook display helpers (rich DataFrame output in Jupyter / VS Code)
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display


# Load the dataset mushroom_dataset.csv by defining path
file_path = r"c:\Projects\ml04\data\mushroom_dataset.csv"

# Load the dataset with headers
mushroom_df = pd.read_csv(file_path)

### Section 1. Import and Inspect Data

We load the dataset, preview its first rows, and display basic structural information that will guide later preprocessing decisions.

#### Section 1.1. Load the dataset and display the first 10 rows. <br>

In [25]:
# Display dataset structure
print("\nDataset Shape:", mushroom_df.shape)

# Display data types of each column
print("\nData Types of Each Column:")
print(mushroom_df.dtypes)

# Show the first 10 rows of the dataset
display(mushroom_df.head(10))



Dataset Shape: (8417, 23)

Data Types of Each Column:
eat                         object
cap_shape                   object
cap_surface                 object
cap_color                   object
bruise                      object
odor                        object
gill_attached               object
gill_space                  object
gill_size                   object
gill_color                  object
stalk_shape                 object
stalk_root                  object
stalk_surface_above_ring    object
stalk_surface_below_ring    object
stalk_color_above_ring      object
stalk_color_below_ring      object
veil_type                   object
veil_color                  object
ring_number                 object
ring_type                   object
spore_print_number          object
population                  object
habitat                     object
dtype: object


Unnamed: 0,eat,cap_shape,cap_surface,cap_color,bruise,odor,gill_attached,gill_space,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_number,population,habitat
0,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
1,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
2,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,PINK,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
3,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,PINK,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
4,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,BROWN,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
5,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,BROWN,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
6,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ANISE,FREE,CROWDED,NARROW,WHITE,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
7,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ANISE,FREE,CROWDED,NARROW,WHITE,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
8,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ANISE,FREE,CROWDED,NARROW,PINK,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
9,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ANISE,FREE,CROWDED,NARROW,PINK,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS


#### Section 1.2. Check for missing values and display summary statistics.

In [26]:
# Check for standard missing values (NaN)
print("\nMissing Values per Column:")
print(mushroom_df.isnull().sum())

# Check for non-standard missing values (e.g., '?', empty strings)
print("\nChecking for Non-Standard Missing Values:")

for col in mushroom_df.columns:
    missing_values = (mushroom_df[col] == '?').sum() + (mushroom_df[col] == '').sum()
    if missing_values > 0:
        print(f"{col}: {missing_values} missing values")
# continues on next block of code


Missing Values per Column:
eat                         0
cap_shape                   1
cap_surface                 1
cap_color                   1
bruise                      1
odor                        1
gill_attached               1
gill_space                  1
gill_size                   1
gill_color                  1
stalk_shape                 1
stalk_root                  1
stalk_surface_above_ring    1
stalk_surface_below_ring    1
stalk_color_above_ring      1
stalk_color_below_ring      1
veil_type                   1
veil_color                  1
ring_number                 1
ring_type                   1
spore_print_number          1
population                  1
habitat                     1
dtype: int64

Checking for Non-Standard Missing Values:
stalk_root: 2480 missing values
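
One possible way to act on this result is to convert the placeholder strings into real missing values, so that `isnull()` counts them together with the NaN entries. The sketch below uses a small hypothetical table rather than the real file; the column values shown are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the mushroom table, for illustration only.
toy = pd.DataFrame({
    "stalk_root": ["BULBOUS", "?", "CLUB", "?", None],
    "odor": ["ALMOND", "NONE", "FOUL", "ANISE", None],
})

# Treat '?' and empty strings as missing so that isnull() counts them too.
cleaned = toy.replace({"?": np.nan, "": np.nan})

# Missing values per column, now including the former '?' placeholders.
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

After this step a single `isnull().sum()` call reports both kinds of missing data, which simplifies any later imputation or row-dropping decision.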


In [27]:
# Split code to eliminate output truncation
# Display summary statistics (including categorical values)
print("\nSummary Statistics:")
print(mushroom_df.describe(include="all"))


Summary Statistics:
           eat cap_shape cap_surface cap_color bruise  odor gill_attached  \
count     8417      8416        8416      8416   8416  8416          8416   
unique       3         6           4        10      2     9             2   
top     EDIBLE    CONVEX       SCALY     BROWN     NO  NONE          FREE   
freq      4488      3796        3268      2320   5040  3808          8200   

       gill_space gill_size gill_color  ... stalk_surface_below_ring  \
count        8416      8416       8416  ...                     8416   
unique          2         2         12  ...                        4   
top         CLOSE     BROAD       BUFF  ...                   SMOOTH   
freq         6824      5880       1728  ...                     5076   

       stalk_color_above_ring stalk_color_below_ring veil_type veil_color  \
count                    8416                   8416      8416       8416   
unique                      9                      9         1          4   
...

#### Section 1.2.1. List the unique values in each column

Because the summary statistics show that each column has only a small number of distinct values, we write all of them to a `unique_values_output.txt` file to get a clearer picture of the dataset.

In [33]:
# Define the output file path
output_file = r"c:\Projects\ml04\data\unique_values_output.txt"

# Save unique values to a text file
with open(output_file, "w") as f:
    for col in mushroom_df.columns:
        unique_values = mushroom_df[col].unique()
        f.write(f"\nColumn: {col} ({len(unique_values)} unique values)\n")
        
        # Write unique values in a structured format
        for value in unique_values:
            f.write(f"- {value}\n")
        
        f.write("-" * 50 + "\n")

print(f"\n✅ Unique values saved to '{output_file}'. Open this file to view the full output.")




✅ Unique values saved to 'c:\Projects\ml04\data\unique_values_output.txt'. Open this file to view the full output.
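
For a quick in-notebook check that does not require opening a file, a similar summary can be sketched with `nunique()` and `value_counts()`. The table below is a hypothetical stand-in for `mushroom_df`, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for mushroom_df, for illustration only.
toy = pd.DataFrame({
    "eat": ["EDIBLE", "POISONOUS", "EDIBLE", "EDIBLE"],
    "odor": ["ALMOND", "FOUL", "NONE", "ANISE"],
    "veil_type": ["PARTIAL", "PARTIAL", "PARTIAL", "PARTIAL"],
})

# Number of distinct non-missing values per column.
unique_counts = toy.nunique()
print(unique_counts)

# Columns with a single distinct value cannot help a classifier separate classes.
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print("Constant columns:", constant_columns)

# Frequency of each value in one column, including missing entries.
print(toy["odor"].value_counts(dropna=False))
```

The summary statistics above list `veil_type` with one unique value, so a check like this would flag it as a candidate for removal.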


### Reflection 1: What do you notice about the dataset? Are there any data issues?

When reviewing the mushroom dataset, I noticed that all the features are categorical, which means that I will need to encode them properly before applying machine learning models. The dataset contains various characteristics of mushrooms, such as cap shape, color, odor, and other physical attributes, which will be useful in distinguishing between edible and poisonous mushrooms. One issue I found is the presence of missing or ambiguous values. The `stalk_root` column contains 2,480 question marks, which likely represent missing data, and every feature column has one NaN entry; both will need to be handled appropriately. The `eat` column also appears to have three unique values rather than the expected two, which should be checked before modeling. <br>

Additionally, since all the features are categorical, I need to choose an encoding that does not impose an artificial ordering on nominal values such as colors or odors. Another potential issue is class imbalance, which could affect model performance if one class significantly outnumbers the other. Before proceeding with modeling, I will need to explore the distribution of classes and handle any missing or erroneous values to ensure the dataset is clean and ready for analysis. <br>
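
The two checks described here can be sketched as follows: `value_counts(normalize=True)` reports the class shares, and `pd.get_dummies` one-hot encodes nominal features without implying an order. The small table is hypothetical and only illustrates the calls; it is not a sample of the real data.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "eat": ["EDIBLE", "POISONOUS", "EDIBLE", "POISONOUS", "EDIBLE"],
    "cap_color": ["WHITE", "BROWN", "WHITE", "RED", "BROWN"],
    "odor": ["ALMOND", "FOUL", "NONE", "FOUL", "ANISE"],
})

# Share of each class; values far from an even split would indicate imbalance.
class_share = toy["eat"].value_counts(normalize=True)
print(class_share)

# One-hot encode the nominal features: one 0/1 column per category value,
# so no artificial ordering between categories is introduced.
features = pd.get_dummies(toy.drop(columns="eat"), dtype=int)
print(features.columns.tolist())
```

For reference, the summary statistics above report 4,488 EDIBLE rows out of 8,417, roughly 53%, which suggests the classes are fairly balanced.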

After reviewing `unique_values_output.txt`, I identified additional data cleaning and transformation opportunities. Some categorical variables could be simplified or grouped to reduce complexity while retaining meaningful distinctions. Additionally, certain columns, such as those with only two values, could be converted into boolean features, which might improve model interpretability and efficiency. These transformations should produce a cleaner feature set from which machine learning models can extract patterns more easily. <br>
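
A minimal sketch of the boolean conversion mentioned above, assuming that `bruise` and `gill_size` each take only the two values seen in the outputs earlier in this notebook. The new column names (`bruises`, `gill_narrow`) are hypothetical.

```python
import pandas as pd

# Hypothetical rows using values that appear in the outputs above.
toy = pd.DataFrame({
    "bruise": ["BRUISES", "NO", "NO", "BRUISES"],
    "gill_size": ["NARROW", "BROAD", "BROAD", "NARROW"],
})

# Map each two-valued column to a boolean; any unmapped value would become NaN,
# which makes unexpected categories easy to spot.
toy["bruises"] = toy["bruise"].map({"BRUISES": True, "NO": False})
toy["gill_narrow"] = toy["gill_size"].map({"NARROW": True, "BROAD": False})

booleans = toy[["bruises", "gill_narrow"]]
print(booleans)
```

An explicit mapping is used instead of a generic encoder so that the meaning of `True` is documented in the code itself.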

## References:

Data-Git-Hub. (2025). GitHub - Data-Git-Hub/ml04. GitHub. https://github.com/Data-Git-Hub/ml04

Mushroom [Dataset]. (1981). UCI Machine Learning Repository. https://doi.org/10.24432/C5959T
