# Mushroom dataset — Cleaning and preparation

This notebook applies the same data-cleaning steps used in the Iris Naive Bayes notebook, adapted to the `mushrooms.csv` dataset. Steps: read the CSV, treat `'?'` as missing and drop the affected rows, tidy column names, shuffle and reset the index, separate features and label (one-hot encoding the features), and create a train/test split.


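### sklearn
- general-purpose machine learning library (scikit-learn)
- tools to train and tune models
- preprocessing and numerical utilities
- model evaluation (e.g. `classification_report`)
- train/test splitting (`train_test_split`, used below)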
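
As a rough illustration of the train-and-evaluate workflow listed above, here is a minimal sketch on made-up binary features. The toy arrays and the choice of `BernoulliNB` are assumptions for the example, not something this notebook runs on the mushroom data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

# Hypothetical toy data: two binary features, labels 'p' and 'e'
X_toy = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
y_toy = np.array(['p', 'p', 'p', 'e', 'e', 'e'])

# Fit a Naive Bayes model for binary features and predict on the same rows
model = BernoulliNB()
model.fit(X_toy, y_toy)
pred = model.predict(X_toy)

# Per-class precision, recall and F1 as a text table
report = classification_report(y_toy, pred)
print(report)
```

Evaluating on the training rows is only for brevity here; a real benchmark would use held-out data such as the test split built at the end of this notebook.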

In [1]:
# Imports
import pandas as pd # data manipulation and analysis
import numpy as np # fast math operations over arrays
from sklearn.model_selection import train_test_split # split data into train and test sets


### Read the csv using pandas
The dataset file is `mushrooms.csv` in the same folder as this notebook.

In [2]:
# Read the file. The file has a header row with column names separated by commas.
df = pd.read_csv('mushrooms.csv')
df.shape


(8124, 23)
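
As an alternative sketch, `read_csv` can map the `'?'` marker to NaN at load time through its `na_values` parameter. The inline CSV text below is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for mushrooms.csv
csv_text = "class,stalk-root\np,e\ne,?\n"

# na_values tells the parser to treat '?' as a missing value
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')
missing_counts = toy.isna().sum().to_dict()
print(missing_counts)
```

This would make a separate `replace('?', np.nan)` step unnecessary, at the cost of hiding the marker from anyone inspecting the raw frame.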

In [3]:
# Display first rows to inspect
df.head()


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [4]:
# Replace '?' with NaN
df = df.replace('?', np.nan)
# Extract rows with any missing values to a temporary DataFrame
temp_missing = df[df.isna().any(axis=1)].copy()
print('rows with any missing before:', temp_missing.shape[0])
# Drop those rows from the main df
df = df.dropna().reset_index(drop=True)
print('rows remaining after drop:', df.shape[0])

# Optional: inspect the first few missing rows
temp_missing.head()

rows with any missing before: 2480
rows remaining after drop: 5644


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
3984,e,x,y,b,t,n,f,c,b,e,...,s,e,w,p,w,t,e,w,c,w
4023,p,x,y,e,f,y,f,c,n,b,...,s,w,w,p,w,o,e,w,v,p
4076,e,f,y,u,f,n,f,c,n,h,...,f,w,w,p,w,o,f,h,y,d
4100,p,x,y,e,f,y,f,c,n,b,...,s,p,p,p,w,o,e,w,v,d
4104,p,x,y,n,f,f,f,c,n,b,...,s,p,p,p,w,o,e,w,v,l
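
Dropping 2480 of 8124 rows discards roughly 31% of the data, so it can help to check which columns actually carry the missing marker before deciding to drop. A minimal sketch on a made-up frame (the column names mimic the dataset, the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'stalk-root': ['e', '?', 'b', '?'],
    'odor': ['p', 'a', 'l', 'n'],
})
toy = toy.replace('?', np.nan)

# Count missing values per column and keep only the columns that have any
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

If the missing values turn out to be concentrated in a single column, dropping that column instead of the rows is another option worth weighing.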


### Tidy column names
Replace hyphens with underscores to make column access easier in Python.

In [5]:
# Rename columns: replace '-' with '_' so the names are valid Python identifiers (e.g. df.cap_shape)
df.columns = [c.replace('-', '_') for c in df.columns]

# show columns
df.columns.tolist()


['class',
 'cap_shape',
 'cap_surface',
 'cap_color',
 'bruises',
 'odor',
 'gill_attachment',
 'gill_spacing',
 'gill_size',
 'gill_color',
 'stalk_shape',
 'stalk_root',
 'stalk_surface_above_ring',
 'stalk_surface_below_ring',
 'stalk_color_above_ring',
 'stalk_color_below_ring',
 'veil_type',
 'veil_color',
 'ring_number',
 'ring_type',
 'spore_print_color',
 'population',
 'habitat']

In [6]:
# Save extracted missing rows and the cleaned DataFrame to CSV files
try:
    # If temp_missing and df exist in the notebook kernel, save directly
    temp_missing.to_csv('mushrooms_missing_rows.csv', index=False)
    df.to_csv('mushrooms_clean.csv', index=False)
    print(f"Saved 'mushrooms_missing_rows.csv' ({len(temp_missing)} rows)")
    print(f"Saved 'mushrooms_clean.csv' ({len(df)} rows)")
except NameError:
    # The variables are not defined (earlier cells were not run in this kernel), so recompute and save.
    # Avoid this branch by running the previous cells first.
    print('Kernel variables not found: recomputing from source CSV and saving...')
    df_src = pd.read_csv('mushrooms.csv', dtype=str)
    df_src = df_src.replace('?', np.nan)
    temp_missing = df_src[df_src.isna().any(axis=1)].copy()
    temp_missing.to_csv('mushrooms_missing_rows.csv', index=False)
    df_clean = df_src.dropna().reset_index(drop=True)
    # Match the main path, where the cleaned columns were renamed in cell 5
    df_clean.columns = [c.replace('-', '_') for c in df_clean.columns]
    df_clean.to_csv('mushrooms_clean.csv', index=False)
    print(f"Recomputed and saved 'mushrooms_missing_rows.csv' ({len(temp_missing)} rows)")
    print(f"Recomputed and saved 'mushrooms_clean.csv' ({len(df_clean)} rows)")

Saved 'mushrooms_missing_rows.csv' (2480 rows)
Saved 'mushrooms_clean.csv' (5644 rows)


### Shuffle and reset index
Shuffling helps avoid any ordering effects. We use a fixed random_state for reproducibility.

In [7]:
df_random = df.sample(frac=1, random_state=42).reset_index(drop=True)
df_random.shape


(5644, 23)
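
A small sketch of what the fixed `random_state` buys: sampling the same frame twice with the same seed gives the same order, and `frac=1` keeps every row. The toy frame is made up for the example.

```python
import pandas as pd

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({'value': range(10)})

# Two shuffles with the same seed
a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(a.equals(b))                                 # same order both times
print(sorted(a['value']) == list(toy['value']))    # no rows lost or duplicated
```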

### Separate features and label
The `class` column is the label (edible `e` or poisonous `p`). We'll separate feature matrix X and label vector y.

In [8]:
# label column is 'class' 
y = df_random['class']
# For X, one-hot encode all non-numeric columns excluding class and keep numeric columns as-is
non_numeric = df_random.select_dtypes(exclude=['number']).columns.tolist()
# remove 'class' if present so model doesn't have actual y label
non_numeric = [c for c in non_numeric if c != 'class']
numeric_cols = df_random.select_dtypes(include=['number']).columns.tolist()
# One-hot encode non-numeric columns; make a binary for each category associated with each feature
if len(non_numeric) > 0:
    X_cat = pd.get_dummies(df_random[non_numeric], drop_first=False)
else:
    X_cat = pd.DataFrame(index=df_random.index)
# Combine numeric columns (if any) with encoded categorical columns
if len(numeric_cols) > 0:
    X_num = df_random[numeric_cols].reset_index(drop=True)
    X = pd.concat([X_num, X_cat.reset_index(drop=True)], axis=1)
else:
    X = X_cat.reset_index(drop=True)

X.shape, y.shape

((5644, 98), (5644,))
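
To make the encoding concrete, here is a minimal sketch of `pd.get_dummies` on a made-up two-column frame. With `drop_first=False`, each distinct value in each column becomes its own indicator column, which is how 22 categorical features expand to 98 columns above.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up
toy = pd.DataFrame({
    'cap_shape': ['x', 'b', 'x'],
    'odor': ['p', 'a', 'l'],
})

# One indicator column per (feature, value) pair
encoded = pd.get_dummies(toy, drop_first=False)
print(encoded.columns.tolist())
# ['cap_shape_b', 'cap_shape_x', 'odor_a', 'odor_l', 'odor_p']

# Each row has exactly one active indicator per original column
print((encoded.sum(axis=1) == toy.shape[1]).all())
```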

In [9]:
# Show the distribution of target classes to check how balanced the dataset is
y.value_counts()


class
e    3488
p    2156
Name: count, dtype: int64
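
The counts above can be turned into proportions to read the balance at a glance. This sketch re-enters the two counts by hand; in the notebook itself, `y.value_counts(normalize=True)` should give the same numbers.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series({'e': 3488, 'p': 2156})

# Share of each class in the cleaned dataset
proportions = counts / counts.sum()
print(proportions.round(3))
```

Edible samples make up about 62% of the cleaned rows, so the classes are somewhat imbalanced but not severely so.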

In [10]:
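# Create a cleaned DataFrame with the one-hot encoded features and the label, then save to CSV.
# Note: this overwrites the 'mushrooms_clean.csv' written in cell 6, which held the un-encoded rows.
# Reset index on y to align with X
df_clean = pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)
# Save the encoded dataset (incomplete rows were already dropped)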
df_clean.to_csv('mushrooms_clean.csv', index=False)
print('Saved mushrooms_clean.csv with shape:', df_clean.shape)

df_clean.head()

Saved mushrooms_clean.csv with shape: (5644, 99)


Unnamed: 0,cap_shape_b,cap_shape_c,cap_shape_f,cap_shape_k,cap_shape_s,cap_shape_x,cap_surface_f,cap_surface_g,cap_surface_s,cap_surface_y,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,class
0,False,False,True,False,False,False,False,False,True,False,...,False,False,False,False,True,False,False,False,False,e
1,False,False,False,False,False,True,False,False,False,True,...,False,False,True,False,True,False,False,False,False,p
2,False,False,True,False,False,False,False,False,False,True,...,False,True,False,True,False,False,False,False,False,e
3,False,False,True,False,False,False,False,False,False,True,...,False,True,False,False,True,False,False,False,False,p
4,False,False,False,False,False,True,False,False,True,False,...,True,False,False,False,False,False,False,False,True,p


In [11]:
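def get_clean_X_y(sort_by=None, ascending=True):
    """Return the cleaned feature matrix X and label y from the current notebook state.
    If sort_by is provided, it must be a column name present in X and will be used to sort both X and y.
    """
    # Validate variables exist
    try:
        X_local = X.copy()
        y_local = y.copy()
    except NameError:
        raise RuntimeError('X and y are not defined in the notebook. Run the earlier cells to build them.')

    if sort_by is not None:
        if sort_by not in X_local.columns:
            raise KeyError(f"sort_by column '{sort_by}' is not present in X")
        # Stable sort of X, then reorder y by the same index so rows stay aligned
        X_local = X_local.sort_values(by=sort_by, ascending=ascending, kind='mergesort')
        y_local = y_local.loc[X_local.index]

    return X_local, y_local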

### Train/test split
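We'll split 80% train / 20% test. The categorical features were already one-hot encoded above, so `X` needs no further encoding before the split. The rows were also shuffled earlier, so the split keeps their current order (`shuffle=False`). This notebook stops at cleaning and splitting; no model is trained here.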

In [12]:
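# Rows were already shuffled in cell 7, so keep their order here.
# random_state is omitted because train_test_split ignores it when shuffle=False.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)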
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((4515, 98), (1129, 98), (4515,), (1129,))
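
The split above is not stratified, so the class ratio in train and test depends on the earlier shuffle. One possible variation (not what this notebook does) is to pass `stratify`, which keeps the class proportions the same in both parts. A minimal sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 'e' rows and 40 'p' rows
toy_X = pd.DataFrame({'feature': range(100)})
toy_y = pd.Series(['e'] * 60 + ['p'] * 40, name='class')

# stratify requires shuffling, so shuffle is left at its default (True)
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=12, stratify=toy_y
)

train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts)  # 48 'e' and 32 'p'
print(test_counts)   # 12 'e' and 8 'p'
```

Note that `stratify` cannot be combined with `shuffle=False`; scikit-learn raises an error in that case.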