In [1]:
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = "mushrooms.csv"  # Adjust path if needed
df = pd.read_csv(file_path)

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


### Step 1: Reading the Dataset

#### Code Explanation:
1. **Loaded the dataset** using `pandas.read_csv()`.
2. **Checked the structure** with `df.info()`, which confirms:
   - There are **8124 rows** and **23 columns**.
   - All columns are **categorical (object dtype)**.
   - There are **no null values**: every listed column reports 8124 non-null entries.
3. **Displayed the first five rows** using `df.head()` to inspect the dataset.

#### Why This Step?
- It helps to **understand the dataset** before preprocessing.
- **Data types and missing values** must be verified to decide on the next steps.
- Since all features are categorical, they must be **encoded as numeric values** in the next step before a scikit-learn model can process them.
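
As a minimal sketch of the checks described above, the snippet below runs the same kind of inspection on a tiny hand-made frame. The three columns and their values are illustrative stand-ins, not rows read from `mushrooms.csv`, and the `?` placeholder is an assumption included only to show that a non-null count does not catch placeholder strings.

```python
import pandas as pd

# Toy stand-in for the mushroom data (values are illustrative only).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "stalk-root": ["e", "c", "?", "e"],  # "?" is a hypothetical placeholder
})

# Nulls per column: this is what the Non-Null Count in df.info() reflects.
null_counts = toy.isna().sum()

# Distinct values per column: how many categories will need encoding.
unique_counts = toy.nunique()

# Placeholder strings are not nulls, so they have to be counted explicitly.
placeholder_counts = (toy == "?").sum()

print(null_counts.to_dict())
print(unique_counts.to_dict())
print(placeholder_counts.to_dict())
```

Running the same three checks on `df` would show whether any column hides placeholder values behind a full non-null count.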

In [2]:
from sklearn.preprocessing import LabelEncoder

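# Encode every categorical column as integer codes.
# A separate LabelEncoder is fitted and kept per column, so each mapping
# can be inspected or inverted later (a single shared encoder would only
# remember the last column it was fitted on).
encoders = {}
for column in df.columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])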

# Display the first few rows after encoding
df.head()

# Check class distribution
df['class'].value_counts(normalize=True) * 100

class
0    51.797144
1    48.202856
Name: proportion, dtype: float64

### Step 2: Data Preprocessing (Label Encoding & Class Distribution Check)

#### Code Explanation:
1. **Label Encoding**:  
   - All features in our dataset are categorical.
   - scikit-learn estimators, including its Naïve Bayes classifiers, expect numerical input.
   - `LabelEncoder()` from `sklearn.preprocessing` converts each column's categories into integer codes, assigned in sorted order.

2. **Checking Class Distribution**:
   - The target variable (`class`) indicates whether a mushroom is **edible (`e`) or poisonous (`p`)**. Because `LabelEncoder` assigns codes in sorted order, `e` becomes `0` and `p` becomes `1`.
   - `value_counts(normalize=True) * 100` calculates the percentage of each class.
   - The output shows about **51.8% edible and 48.2% poisonous**, so the classes are close to balanced.

#### Why This Step?
- **Label encoding** is necessary for numerical processing.
- **Class distribution** helps to understand if one class is dominant, which may affect model performance.
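
To make the encoding step concrete, here is a small sketch on a toy target column. The five labels are made up for illustration; only the two letters match the dataset's `class` column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column using the same two letters as the dataset's `class` column.
labels = pd.Series(["p", "e", "e", "p", "e"], name="class")

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 corresponds to "e" and index 1 to "p".
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# Percentage of each encoded class, computed the same way as in the cell above.
distribution = pd.Series(codes).value_counts(normalize=True) * 100
print(distribution.to_dict())

# The fitted encoder can translate codes back to the original letters.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is what makes `inverse_transform` possible, which helps when predictions need to be reported as `e`/`p` rather than `0`/`1`.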

In [3]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df.drop(columns=["class"])  # Features (all columns except the target)
y = df["class"]  # Target (edible or poisonous)

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6499, 22), (1625, 22), (6499,), (1625,))

### Step 3: Splitting the Data

#### Code Explanation:
1. **Defining Features and Target Variable**:
   - The **features (X)** are all columns **except "class"** (i.e., all independent variables).
   - The **target variable (y)** is the "class" column (edible or poisonous mushrooms).

2. **Splitting the Dataset**:
   - `train_test_split()` from `sklearn.model_selection` is used to divide the dataset.
   - **80% of the data** is used for **training** (`X_train`, `y_train`).
   - **20% of the data** is used for **testing** (`X_test`, `y_test`).
   - `random_state=42` ensures reproducibility.
   - `stratify=y` maintains the **same class proportion** in both training and testing sets.

#### Why This Step?
- **Estimates how well the model generalises** by evaluating it on data it has not seen during training.
- **Prevents data leakage** by keeping training and testing separate.
- **Stratification** keeps the class proportions the same in both sets, so the evaluation is not skewed by an unrepresentative split.
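
The effect of `stratify` is easiest to see on deliberately imbalanced data. The sketch below uses synthetic labels (70% class 0, 30% class 1) and a dummy feature column; none of it comes from the mushroom dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 split; the feature column is just a row index.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both subsets keep the 30% share of class 1.
print(y_tr.mean(), y_te.mean())

# Training and test rows never overlap.
overlap = set(X_tr.ravel().tolist()) & set(X_te.ravel().tolist())
print(len(overlap))
```

Without `stratify`, the class share in a 20-row test set would vary from one `random_state` to the next.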

In [4]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Naïve Bayes model
nb_model = GaussianNB()

# Train the model on the training data
nb_model.fit(X_train, y_train)

### Step 4: Training the Naïve Bayes Classifier

#### Code Explanation:
1. **Choosing the Classifier**:
   - `GaussianNB()` from `sklearn.naive_bayes` is used as the classifier.
   - **Naïve Bayes assumes features are independent** given the class label.

2. **Training the Model**:
   - `.fit(X_train, y_train)` is called to train the model using the **training dataset**.

#### Why This Step?
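- This step is crucial as the model learns the relationship between features and the target variable.
- The **Gaussian Naïve Bayes** classifier supports **probabilistic reasoning**, but it models each feature as normally distributed within each class, so it treats the integer codes from Step 2 as continuous values. For categorical data this is an approximation; scikit-learn's `CategoricalNB` is the variant designed for categorical features.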
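
For comparison, scikit-learn also ships `CategoricalNB`, which estimates a separate probability for each category instead of fitting a Gaussian to the codes. The following is a minimal sketch on synthetic integer-coded data, not the notebook's method; the array names and the toy pattern (code `1` of the first feature marks class 1) are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Synthetic integer-coded data: 12 rows, two categorical features.
# Feature 0 cycles through codes 0, 1, 2; feature 1 alternates 0, 1.
idx = np.arange(12)
X_cat = np.column_stack([idx % 3, idx % 2])

# The target depends on feature 0 in a non-ordered way:
# code 1 belongs to class 1, codes 0 and 2 belong to class 0.
y_cat = (X_cat[:, 0] == 1).astype(int)

# alpha=1.0 is Laplace smoothing (the scikit-learn default).
cat_model = CategoricalNB(alpha=1.0)
cat_model.fit(X_cat, y_cat)

# Per-class category counts for feature 0 (rows: classes, columns: categories).
print(cat_model.category_count_[0])

# Predictions on the training rows, for illustration only.
cat_pred = cat_model.predict(X_cat)
print(cat_pred)
```

On real data, every category that can appear at prediction time needs to be covered when fitting (for example by encoding before the split), otherwise `CategoricalNB` has no probability to look up for it.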

In [7]:
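from sklearn.metrics import classification_report, confusion_matrix, precision_score, f1_score

# y_pred is not defined in the cells shown here (In [5] and In [6] are not included),
# so compute the test-set predictions first; presumably an earlier cell did the same
y_pred = nb_model.predict(X_test)

# Generate the classification report
class_report = classification_report(y_test, y_pred)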

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Calculate Precision and F1 Score
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print classification report
print("Classification Report:\n", class_report)

# Print confusion matrix
print("Confusion Matrix:\n", conf_matrix)

# Print precision and F1 score
print(f"Precision Score: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93       842
           1       0.92      0.93      0.93       783

    accuracy                           0.93      1625
   macro avg       0.93      0.93      0.93      1625
weighted avg       0.93      0.93      0.93      1625

Confusion Matrix:
 [[778  64]
 [ 52 731]]
Precision Score: 0.9287
F1 Score: 0.9286


### Step 5: Full Classification Report & Confusion Matrix

#### Code Explanation:
1. **Classification Report**:
   - `classification_report(y_test, y_pred)` provides:
     - **Precision**: Measures how many of the predicted positives were actually positive.
     - **Recall**: Measures how many actual positives were correctly identified.
     - **F1-score**: The harmonic mean of precision and recall.
     - **Support**: Number of occurrences for each class.

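2. **Confusion Matrix**:
   - `confusion_matrix(y_test, y_pred)` returns a matrix whose rows are the true classes and whose columns are the predicted classes. `LabelEncoder` maps `e` (edible) to `0` and `p` (poisonous) to `1`, so treating poisonous as the positive class, the output above reads:
     - **True Negatives (TN) = 778**: Edible mushrooms correctly classified as edible.
     - **False Positives (FP) = 64**: Edible mushrooms incorrectly classified as poisonous.
     - **False Negatives (FN) = 52**: Poisonous mushrooms incorrectly classified as edible.
     - **True Positives (TP) = 731**: Poisonous mushrooms correctly classified as poisonous.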

3. **Precision and F1 Score**:
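   - `precision_score(y_test, y_pred, average='weighted')`: Computes precision for each class (correct predictions of that class divided by all predictions of that class), then averages the per-class values weighted by support.
   - `f1_score(y_test, y_pred, average='weighted')`: Computes the F1-score for each class and averages them the same way, balancing precision and recall.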
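
The arithmetic behind the weighted scores can be checked by hand from the confusion matrix printed above. This is a sketch using only NumPy; the variable names are illustrative, and the matrix values are copied from the output.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[778, 64],
               [52, 731]])

support = cm.sum(axis=1)      # true samples per class: [842, 783]
predicted = cm.sum(axis=0)    # predicted samples per class
correct = np.diag(cm)         # correctly classified samples per class

precision_per_class = correct / predicted
recall_per_class = correct / support
f1_per_class = 2 * precision_per_class * recall_per_class / (
    precision_per_class + recall_per_class
)

# average='weighted' weights each class's score by its share of the support.
weights = support / support.sum()
weighted_precision = float(np.sum(weights * precision_per_class))
weighted_f1 = float(np.sum(weights * f1_per_class))
accuracy = float(correct.sum() / cm.sum())

print(f"Precision Score: {weighted_precision:.4f}")  # 0.9287
print(f"F1 Score: {weighted_f1:.4f}")                # 0.9286
print(f"Accuracy: {accuracy:.4f}")
```

Both weighted values match the `Precision Score` and `F1 Score` lines in the cell output, which confirms that the reported numbers are support-weighted averages of the per-class scores.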

#### Why This Step?
- **The classification report provides a breakdown of model performance per class.**
- **The confusion matrix helps understand misclassifications.**
- **Precision and F1-score give insight into model reliability beyond just accuracy.**
- This is **crucial when false negatives or false positives are costly**: here, the 52 poisonous mushrooms predicted as edible are the most dangerous errors.