# Midterm Project: Mushroom Classification  
**Name:** Kiruthikaa Natarajan Srinivasan  
**Date:** November 3, 2025  

## Introduction  
This project applies classification modeling techniques to the Mushroom dataset from the UCI Machine Learning Repository.  
The goal is to predict whether a mushroom is edible or poisonous based on its physical characteristics.  
The dataset includes 8124 instances and 22 categorical features such as cap shape, odor, gill color, and habitat.  
This project demonstrates my ability to explore data, build models, evaluate performance, and communicate insights clearly and professionally.

In [1]:
import sys
print(sys.executable)


c:\Repos\ml_classification_kiruthikaa\.venv\Scripts\python.exe


## Section 1. Import and Inspect the Data 
Import the external Python libraries used (e.g., pandas, numpy, matplotlib, seaborn, sklearn).

### 1.1 Load the dataset and display the first 10 rows.

In [2]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

# Load the mushroom dataset
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
    'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]

df = pd.read_csv('data/agaricus-lepiota.data', header=None, names=column_names)

# Display the first 10 rows
df.head(10)

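| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| 5 | e | x | y | y | t | a | f | c | b | n | ... | s | w | w | p | w | o | p | k | n | g |
| 6 | e | b | s | w | t | a | f | c | b | g | ... | s | w | w | p | w | o | p | k | n | m |
| 7 | e | b | y | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | s | m |
| 8 | p | x | y | w | t | p | f | c | n | p | ... | s | w | w | p | w | o | p | k | v | g |
| 9 | e | b | s | y | t | a | f | c | b | g | ... | s | w | w | p | w | o | p | k | s | m |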
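
The preview above elides the middle columns (shown as `...`). As a minimal sketch, assuming a wide DataFrame similar to `df`, pandas can be asked to render every column by lifting the `display.max_columns` limit temporarily. The `demo` frame and its `col-N` names are hypothetical stand-ins, not part of the mushroom data.

```python
import pandas as pd

# Hypothetical stand-in for `df`: 30 single-letter columns, wide enough
# that the default display settings may elide the middle ones.
demo = pd.DataFrame({f"col-{i}": list("abc") for i in range(30)})

# option_context changes the setting only inside the `with` block,
# so the notebook's global display options stay untouched.
with pd.option_context("display.max_columns", None):
    full_text = repr(demo.head(3))

print(full_text)
```

In the notebook itself, the same `with` block could wrap `display(df.head(10))` to show all 23 columns at once.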


### 1.2 Check for missing values and display summary statistics.

In [4]:
# Check for missing values
missing_values = df.isnull().sum()

# Display summary statistics
summary_stats = df.describe(include='all')

missing_values, summary_stats

(class                       0
 cap-shape                   0
 cap-surface                 0
 cap-color                   0
 bruises                     0
 odor                        0
 gill-attachment             0
 gill-spacing                0
 gill-size                   0
 gill-color                  0
 stalk-shape                 0
 stalk-root                  0
 stalk-surface-above-ring    0
 stalk-surface-below-ring    0
 stalk-color-above-ring      0
 stalk-color-below-ring      0
 veil-type                   0
 veil-color                  0
 ring-number                 0
 ring-type                   0
 spore-print-color           0
 population                  0
 habitat                     0
 dtype: int64,
        class cap-shape cap-surface cap-color bruises  odor gill-attachment  \
 count   8124      8124        8124      8124    8124  8124            8124   
 unique     2         6           4        10       2     9               2   
 top        e         x           y

### Interpretation of Missing Values and Summary Statistics

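- **No null values**: `isnull()` reports 0 for all 23 columns (8124 non-null entries each). This does not mean the data is complete: the UCI dataset description encodes missing values as `'?'`, all in `stalk-root`, and `isnull()` does not flag them, so that column needs a separate check.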
- **Categorical overview**:
  - Most features are categorical with varying numbers of unique values (e.g., `cap-color` has 10, `odor` has 9).
  - The `class` column (target) has two values: `'e'` (edible) and `'p'` (poisonous), with edible being slightly more frequent.
- **Dominant values**:
  - Many features have a dominant category. For example:
    - `odor`: `'n'` (none) appears 3528 times.
    - `veil-type`: `'p'` is the only value, so this column carries no information and may be dropped later.
    - `ring-number`: `'o'` appears 7488 times out of 8124, so the feature is heavily skewed toward one value.
- **Next step**: Visualize the class distribution to understand target balance and prepare for encoding.
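
The checks discussed above can be sketched in a few lines of pandas. This is an illustrative example on a small hypothetical frame (`toy`), not the real dataset. It assumes that missing values appear as the string `'?'` and that a column with a single unique value is a candidate for removal.

```python
import pandas as pd

# Hypothetical toy frame mimicking the dataset's encoding; not the real data.
toy = pd.DataFrame({
    "class":      ["e", "p", "e", "p", "e"],
    "stalk-root": ["e", "?", "c", "?", "b"],
    "veil-type":  ["p", "p", "p", "p", "p"],
})

# Count '?' placeholders per column; isnull() would report 0 for these.
placeholder_counts = (toy == "?").sum()

# Columns with a single unique value carry no information for a classifier.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Relative frequency of each target class.
class_share = toy["class"].value_counts(normalize=True)

print(placeholder_counts)
print(constant_cols)
print(class_share)
```

Applied to `df`, the same three expressions would show how many `'?'` entries each column holds, confirm whether `veil-type` is constant, and give the edible/poisonous split before any plotting.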
