# Mushroom Classification Project

**Author:** Lindsay Foster 

**Date:** November 2025  

**Course:** Applied Machine Learning – Midterm Project 

## Overview: 
This project applies machine learning classification techniques to predict whether a mushroom is edible or poisonous based on its physical characteristics.  
The analysis follows a structured workflow including data exploration, preprocessing, feature selection, model training, evaluation, and comparison of multiple classifiers.

In [2]:
# imports
from ucimlrepo import fetch_ucirepo
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree

## Section 1. Import and Inspect the Data

In [6]:
# Fetch dataset (fetch_ucirepo is already imported in the imports cell;
# id=73 is the UCI Mushroom dataset)
mushroom = fetch_ucirepo(id=73)

# Data (as pandas DataFrames)
X = mushroom.data.features
y = mushroom.data.targets

# Show first 10 rows of features and target
print("🔹 Features (X) - First 10 rows:")
display(X.head(10))

print("\n🔹 Target (y) - First 10 rows:")
display(y.head(10))

mushroom_df = pd.concat([X, y], axis=1)
mushroom_df.head(10)


🔹 Features (X) - First 10 rows:


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,x,s,n,t,p,f,c,n,k,e,...,s,w,w,p,w,o,p,k,s,u
1,x,s,y,t,a,f,c,b,k,e,...,s,w,w,p,w,o,p,n,n,g
2,b,s,w,t,l,f,c,b,n,e,...,s,w,w,p,w,o,p,n,n,m
3,x,y,w,t,p,f,c,n,n,e,...,s,w,w,p,w,o,p,k,s,u
4,x,s,g,f,n,f,w,b,k,t,...,s,w,w,p,w,o,e,n,a,g
5,x,y,y,t,a,f,c,b,n,e,...,s,w,w,p,w,o,p,k,n,g
6,b,s,w,t,a,f,c,b,g,e,...,s,w,w,p,w,o,p,k,n,m
7,b,y,w,t,l,f,c,b,n,e,...,s,w,w,p,w,o,p,n,s,m
8,x,y,w,t,p,f,c,n,p,e,...,s,w,w,p,w,o,p,k,v,g
9,b,s,y,t,a,f,c,b,g,e,...,s,w,w,p,w,o,p,k,s,m



🔹 Target (y) - First 10 rows:


Unnamed: 0,poisonous
0,p
1,e
2,e
3,p
4,e
5,e
6,e
7,e
8,p
9,e


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,x,s,n,t,p,f,c,n,k,e,...,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,...,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,...,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,...,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,...,w,w,p,w,o,e,n,a,g,e
5,x,y,y,t,a,f,c,b,n,e,...,w,w,p,w,o,p,k,n,g,e
6,b,s,w,t,a,f,c,b,g,e,...,w,w,p,w,o,p,k,n,m,e
7,b,y,w,t,l,f,c,b,n,e,...,w,w,p,w,o,p,n,s,m,e
8,x,y,w,t,p,f,c,n,p,e,...,w,w,p,w,o,p,k,v,g,p
9,b,s,y,t,a,f,c,b,g,e,...,w,w,p,w,o,p,k,s,m,e


In [7]:
# Combine features and target for convenience
mushroom_df = pd.concat([X, y], axis=1)

# Check basic info
print("🔹 Dataset Overview:")
mushroom_df.info()

# Check for missing values
print("\n🔹 Missing Values per Column:")
print(mushroom_df.isnull().sum())

# Check for duplicates
duplicate_count = mushroom_df.duplicated().sum()
print(f"\n🔹 Number of Duplicate Rows: {duplicate_count}")

# Display summary statistics
# Note: every column is object dtype, so describe() has no numerical columns
# to summarize and returns count/unique/top/freq instead
print("\n🔹 Summary Statistics:")
display(mushroom_df.describe())

# Display basic info for categorical columns (same result here, requested explicitly)
print("\n🔹 Summary of Categorical Columns:")
display(mushroom_df.describe(include=['object']))


🔹 Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cap-shape                 8124 non-null   object
 1   cap-surface               8124 non-null   object
 2   cap-color                 8124 non-null   object
 3   bruises                   8124 non-null   object
 4   odor                      8124 non-null   object
 5   gill-attachment           8124 non-null   object
 6   gill-spacing              8124 non-null   object
 7   gill-size                 8124 non-null   object
 8   gill-color                8124 non-null   object
 9   stalk-shape               8124 non-null   object
 10  stalk-root                5644 non-null   object
 11  stalk-surface-above-ring  8124 non-null   object
 12  stalk-surface-below-ring  8124 non-null   object
 13  stalk-color-above-ring    8124 non-null   object
 14  s

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,6,4,10,2,9,2,2,2,12,2,...,9,9,1,4,3,5,9,6,7,2
top,x,y,n,f,n,f,c,b,b,t,...,w,w,p,w,o,p,w,v,d,e
freq,3656,3244,2284,4748,3528,7914,6812,5612,1728,4608,...,4464,4384,8124,7924,7488,3968,2388,4040,3148,4208



🔹 Summary of Categorical Columns:


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,6,4,10,2,9,2,2,2,12,2,...,9,9,1,4,3,5,9,6,7,2
top,x,y,n,f,n,f,c,b,b,t,...,w,w,p,w,o,p,w,v,d,e
freq,3656,3244,2284,4748,3528,7914,6812,5612,1728,4608,...,4464,4384,8124,7924,7488,3968,2388,4040,3148,4208


### Reflection 1: 
The data is mostly intact; however, the `stalk-root` column has 2,480 missing values (about 30.5% of the 8,124 rows). Since this feature is categorical and has a large number of missing entries, I plan to impute these as "unknown" rather than drop the column or the affected rows.
This approach preserves all rows for model training while still allowing the model to learn from the known root types.
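
As a rough illustration of this trade-off, the sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from the real dataset) to compare dropping rows with a missing `stalk-root` against filling the gap with an "unknown" label. The names `toy`, `dropped`, and `imputed` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mushroom data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "stalk-root": ["b", np.nan, "e", np.nan, "c", "b"],
    "odor": ["n", "a", "l", "p", "n", "f"],
})

# Share of missing entries per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# Option A: drop rows with a missing stalk-root (loses data)
dropped = toy.dropna(subset=["stalk-root"])

# Option B: keep every row and mark the gap as its own category
imputed = toy.assign(**{"stalk-root": toy["stalk-root"].fillna("unknown")})

print(missing_pct.round(1))
print("rows after dropna:", len(dropped))
print("rows after fillna:", len(imputed))
```

In this toy case, dropping removes a third of the rows, while filling keeps all of them and leaves no missing values in the column.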

In [9]:
# Fill missing values in 'stalk-root' with the label 'unknown'
mushroom_df['stalk-root'] = mushroom_df['stalk-root'].fillna('unknown')

# Verify the fix
print(mushroom_df['stalk-root'].value_counts(dropna=False))
print("\nRemaining missing values:", mushroom_df['stalk-root'].isnull().sum())


stalk-root
b          3776
unknown    2480
e          1120
c           556
r           192
Name: count, dtype: int64

Remaining missing values: 0
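
One way to see why a separate "unknown" label can still be useful later: if the categorical features are one-hot encoded before modeling (an assumption here, since no encoding step is shown in this section), "unknown" simply becomes its own indicator column alongside the known root types. A minimal sketch on made-up values, with `toy` and `encoded` as illustrative names:

```python
import pandas as pd

# Hypothetical toy column mirroring the imputed categories above
toy = pd.DataFrame({"stalk-root": ["b", "unknown", "e", "unknown", "c"]})

# One-hot encode: each category, including "unknown", gets its own indicator column
encoded = pd.get_dummies(toy, columns=["stalk-root"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A classifier can then treat "root type not recorded" as a signal in its own right instead of discarding those rows.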
