**Author:** Ronan Green  
**Model:** Naive Bayes Classifier  
**Brief Description:**  
Naive Bayes is a probabilistic classification model based on Bayes’ Theorem, with the assumption that features are conditionally independent given the class label. 

**Note:**  
This notebook was created by Ronan Green. A full breakdown of the findings, methodology, and references used can be found at the end of the notebook.
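
To make the independence assumption in the brief description concrete, here is a minimal sketch of the calculation for two classes. The priors and likelihoods are hypothetical numbers chosen only for illustration; they are not estimated from the mushroom data.

```python
from math import prod

# Hypothetical toy numbers, chosen only to illustrate the calculation.
priors = {"edible": 0.5, "poisonous": 0.5}
# P(feature value | class) for two observed feature values.
likelihoods = {
    "edible": {"odor=none": 0.8, "bruises=yes": 0.6},
    "poisonous": {"odor=none": 0.1, "bruises=yes": 0.2},
}

def naive_bayes_posterior(priors, likelihoods):
    """Return P(class | features), assuming features are independent given the class."""
    # Unnormalised score: prior times the product of the per-feature likelihoods.
    scores = {c: priors[c] * prod(likelihoods[c].values()) for c in priors}
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```

With these numbers the edible class receives a posterior of 0.96, because both observed feature values are far more likely under it.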

In [6]:
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = "mushrooms.csv"  # Adjust path if needed
df = pd.read_csv(file_path)

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


#### Code Explanation:

1. **Import pandas:** We use the `pandas` library for data handling and manipulation.
2. **`df.info()`:** Summarises the dataset: row count, column names, non-null counts, and data types.
3. **`df.head()`:** Displays the first five rows of the dataset.

---

#### Why This Step?

- **Load the dataset:** Import the dataset before any analysis or preprocessing can occur.
- **Check the structure:** `df.info()` reports 8,124 rows, and the listed columns are `object` (string) columns with no null entries, so the features will need encoding before modelling.
- **Inspect the first rows:** Quickly confirms whether the dataset has been read properly, ensuring we have the right structure, column headings, and data format before proceeding.

In [7]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to all categorical columns
for column in df.columns:
    df[column] = label_encoder.fit_transform(df[column])

# Display the first few rows after encoding
df.head()

# Check class distribution
df['class'].value_counts(normalize=True) * 100

class
0    51.797144
1    48.202856
Name: proportion, dtype: float64

### Step 2: Data Preprocessing (Label Encoding & Class Distribution Check)

#### Code Explanation:
1. **Label Encoding**:  
   - All features in our dataset are categorical.
   - scikit-learn's Naïve Bayes estimators require numerical input.
   - `LabelEncoder()` from `sklearn.preprocessing` converts each category into an integer code. The codes are arbitrary labels, so their order carries no meaning.

2. **Checking Class Distribution**:
   - The target variable (`class`) determines whether a mushroom is **edible or poisonous**.
   - `value_counts(normalize=True) * 100` is used to calculate the percentage of each class.
   - This helps identify if there is an **imbalance in class distribution**.

#### Why This Step?
- **Label encoding** is necessary for numerical processing.
- **Class distribution** helps to understand if one class is dominant, which may affect model performance.
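
The encoding step can be sketched on a tiny stand-in frame to show which integer each label receives. The frame and the `encoders` dictionary are assumptions for illustration (the notebook reuses a single encoder); keeping one fitted encoder per column makes the mapping easy to inspect afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for two mushroom columns (values are illustrative).
toy = pd.DataFrame({"class": ["p", "e", "e", "p"], "odor": ["p", "a", "l", "n"]})

encoders = {}  # hypothetical helper dict: one fitted encoder per column
for column in toy.columns:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ is sorted, so position i holds the label that became integer i.
class_mapping = {label: code for code, label in enumerate(encoders["class"].classes_)}
print(class_mapping)  # {'e': 0, 'p': 1}
```

Because `LabelEncoder` sorts the labels, `e` (edible) becomes `0` and `p` (poisonous) becomes `1`, which is how the class distribution above should be read.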

In [8]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df.drop(columns=["class"])  # Features (all columns except the target)
y = df["class"]  # Target (edible or poisonous)

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6499, 22), (1625, 22), (6499,), (1625,))

### Step 3: Splitting the Data

#### Code Explanation:
1. **Defining Features and Target Variable**:
   - The **features (X)** are all columns **except "class"** (i.e., all independent variables).
   - The **target variable (y)** is the "class" column (edible or poisonous mushrooms).

2. **Splitting the Dataset**:
   - `train_test_split()` from `sklearn.model_selection` is used to divide the dataset.
   - **80% of the data** is used for **training** (`X_train`, `y_train`).
   - **20% of the data** is used for **testing** (`X_test`, `y_test`).
   - `random_state=42` ensures reproducibility.
   - `stratify=y` maintains the **same class proportion** in both training and testing sets.

#### Why This Step?
- **Ensures the model generalises well** by evaluating it on unseen data.
- **Prevents data leakage** by keeping training and testing separate.
- **Stratification** keeps the edible/poisonous ratio (about 52/48) the same in the training and test sets, so the evaluation is not distorted by an unlucky split.
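
The effect of `stratify` is easier to see on deliberately imbalanced labels. The sketch below uses synthetic data (an assumed 80/20 split, not the mushroom dataset) and the same `test_size` and `random_state` as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 80/20 imbalance (illustrative only).
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = y_tr.mean()  # fraction of class 1 in the training split
test_share = y_te.mean()   # fraction of class 1 in the test split
print(train_share, test_share)
```

Both shares stay at roughly 0.2, matching the full dataset.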

In [None]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Naïve Bayes model
nb_model = GaussianNB()

# Train the model on the training data
nb_model.fit(X_train, y_train)

### Step 4: Training the Naïve Bayes Classifier

#### Code Explanation:
1. **Choosing the Classifier**:
   - `GaussianNB()` from `sklearn.naive_bayes` is used as the classifier.
   - **Naïve Bayes assumes features are independent** given the class label.

2. **Training the Model**:
   - `.fit(X_train, y_train)` is called to train the model using the **training dataset**.

#### Why This Step?
- This step is crucial as the model learns the relationship between features and the target variable.
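- The **Gaussian Naïve Bayes** classifier models each feature as normally distributed within each class. The features here are label-encoded categories rather than continuous measurements, so this is an approximation, but it keeps the **probabilistic reasoning** of Naïve Bayes and serves as the baseline model.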
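
Since label encoding produces integer codes with no inherent order, it can be worth checking how a Gaussian model copes with them. The sketch below is an assumption-laden toy example, not part of the original analysis: it builds synthetic data in which codes 0 and 2 belong to one class and codes 1 and 3 to the other, then compares `GaussianNB` with scikit-learn's `CategoricalNB`, which estimates a separate probability per category.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data: x0 is informative (codes 0-3), x1 is uninformative (codes 0-2).
# Codes 0 and 2 belong to class 1; codes 1 and 3 belong to class 0.
x0 = np.tile([0, 1, 2, 3], 30)
x1 = np.repeat([0, 1, 2], 40)
X_toy = np.column_stack([x0, x1])
y_toy = np.isin(x0, [0, 2]).astype(int)

# Accuracy on the training data itself, which is enough for this illustration.
gauss_acc = GaussianNB().fit(X_toy, y_toy).score(X_toy, y_toy)
cat_acc = CategoricalNB().fit(X_toy, y_toy).score(X_toy, y_toy)
print(f"GaussianNB: {gauss_acc:.2f}  CategoricalNB: {cat_acc:.2f}")
```

On this toy data the Gaussian model separates the classes only by their mean code and gets half the rows wrong, while the categorical model classifies all of them correctly. Real datasets are rarely this extreme, so this is a caution rather than a prediction about the mushroom results.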

In [10]:
from sklearn.metrics import classification_report, confusion_matrix, precision_score, f1_score

# Predict the target variable for the test data
y_pred = nb_model.predict(X_test)

# Generate the classification report
class_report = classification_report(y_test, y_pred)

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Calculate Precision and F1 Score
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print classification report
print("Classification Report:\n", class_report)

# Print confusion matrix
print("Confusion Matrix:\n", conf_matrix)

# Print precision and F1 score
print(f"Precision Score: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93       842
           1       0.92      0.93      0.93       783

    accuracy                           0.93      1625
   macro avg       0.93      0.93      0.93      1625
weighted avg       0.93      0.93      0.93      1625

Confusion Matrix:
 [[778  64]
 [ 52 731]]
Precision Score: 0.9287
F1 Score: 0.9286


### Step 5: Full Classification Report & Confusion Matrix

#### Code Explanation:
1. **Classification Report**:
   - `classification_report(y_test, y_pred)` provides:
     - **Precision**: Measures how many of the predicted positives were actually positive.
     - **Recall**: Measures how many actual positives were correctly identified.
     - **F1-score**: The harmonic mean of precision and recall.
     - **Support**: Number of occurrences for each class.

2. **Confusion Matrix**:
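   - `confusion_matrix(y_test, y_pred)` generates a matrix whose rows are the actual classes and whose columns are the predicted classes. `LabelEncoder` sorts the labels, so `0` is edible (`e`) and `1` is poisonous (`p`). Treating poisonous as the positive class:
     - **True Positives (TP)**: Poisonous mushrooms correctly classified as poisonous (731).
     - **False Positives (FP)**: Edible mushrooms incorrectly classified as poisonous (64).
     - **False Negatives (FN)**: Poisonous mushrooms incorrectly classified as edible (52).
     - **True Negatives (TN)**: Edible mushrooms correctly classified as edible (778).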

3. **Precision and F1 Score**:
   - `precision_score(y_test, y_pred, average='weighted')`: Computes precision for each class (correct predictions of that class divided by all predictions of that class) and averages the results, weighted by each class's support.
   - `f1_score(y_test, y_pred, average='weighted')`: Provides a balance between precision and recall.

#### Why This Step?
- **The classification report provides a breakdown of model performance per class.**
- **The confusion matrix helps understand misclassifications.**
- **Precision and F1-score give insight into model reliability beyond just accuracy.**
- This is **crucial for datasets where false negatives or false positives are highly important** (e.g., predicting poisonous mushrooms incorrectly could be dangerous).
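
The reported scores can be reproduced by hand from the confusion matrix, which is a useful check on how the metrics are defined. The counts below are copied from the output above; the variable names are only for illustration.

```python
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class; class 1 is the positive class).
tn, fp = 778, 64
fn, tp = 52, 731

precision_1 = tp / (tp + fp)   # of everything predicted as class 1, how much was class 1
recall_1 = tp / (tp + fn)      # of all actual class 1 rows, how many were found
precision_0 = tn / (tn + fn)   # same idea with class 0 as the class of interest
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Weighted precision: per-class precision averaged by class support.
support_0, support_1 = tn + fp, fn + tp
weighted_precision = (support_0 * precision_0 + support_1 * precision_1) / (support_0 + support_1)

print(f"{precision_1:.2f} {recall_1:.2f} {accuracy:.2f} {weighted_precision:.4f}")
# 0.92 0.93 0.93 0.9287
```

The 52 false negatives are the costly errors here, since they are poisonous mushrooms labelled as edible.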

In [11]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Initialize different Naïve Bayes variants
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()

# Train and evaluate Bernoulli Naïve Bayes
bernoulli_nb.fit(X_train, y_train)
y_pred_bernoulli = bernoulli_nb.predict(X_test)
bernoulli_f1 = f1_score(y_test, y_pred_bernoulli, average='weighted')

# Train and evaluate Multinomial Naïve Bayes
multinomial_nb.fit(X_train, y_train)
y_pred_multinomial = multinomial_nb.predict(X_test)
multinomial_f1 = f1_score(y_test, y_pred_multinomial, average='weighted')

# Print comparison of F1 Scores
print(f"GaussianNB F1 Score: {f1:.4f}")  # From previous step
print(f"BernoulliNB F1 Score: {bernoulli_f1:.4f}")
print(f"MultinomialNB F1 Score: {multinomial_f1:.4f}")

GaussianNB F1 Score: 0.9286
BernoulliNB F1 Score: 0.8518
MultinomialNB F1 Score: 0.8096


### Step 6: Model Optimisation

#### Code Explanation:
1. **Trying Different Naïve Bayes Variants**:
   - **BernoulliNB**: Used for binary features (0/1), commonly applied in text classification.
   - **MultinomialNB**: Works best with frequency-based data, such as word counts in NLP.

2. **Training and Evaluating Each Model**:
   - Each model is **trained** on `X_train` and `y_train`.
   - **Predictions** are made using `X_test`.
   - **F1-score is calculated** for each model.

3. **Comparing F1 Scores**:
   - The performance of each model is compared.
   - This helps to determine **which Naïve Bayes variant performs best**.

#### Why This Step?
- **Different types of Naïve Bayes perform differently on various datasets**.
- **Comparing models helps to choose the best fit for our dataset**.
- The **highest F1-score** indicates the model with the best balance between precision and recall.
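
One likely reason BernoulliNB loses information on this data is its default `binarize=0.0` setting, which thresholds every feature before fitting. The sketch below applies the same rule with `Binarizer` to a hypothetical label-encoded column with four categories.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Label-encoded codes for one hypothetical feature with four categories.
codes = np.array([[0], [1], [2], [3]])

# Same rule as BernoulliNB's default binarize=0.0: values > 0 become 1, the rest 0.
binary_view = Binarizer(threshold=0.0).fit_transform(codes).ravel()
print(binary_view)
```

Codes 1, 2, and 3 all collapse to 1, so the model can only distinguish the first category from all the others.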

In [12]:
import pandas as pd

# Define column names from the adult.names file
column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status", 
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", 
    "hours-per-week", "native-country", "income"
]

# Load train dataset (adult.data) and assign column names
train_data = pd.read_csv("adult.data", names=column_names, sep=r",\s*", engine="python")

# Load test dataset (adult.test), skipping the first row (it is not a data record)
test_data = pd.read_csv("adult.test", names=column_names, sep=r",\s*", engine="python", skiprows=1)

# Display dataset info
train_data.info(), test_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16281 entries, 0 to 16280
Data columns (tot

(None, None)

#### Code Explanation:
1. **Imports `pandas`**  
   - `pandas` is used for handling tabular data in Python.
  
2. **Defines Column Names (`column_names`)**  
   - Since the dataset files (`adult.data` and `adult.test`) **do not contain headers**, I manually defined column names based on the `adult.names` file.
   - The dataset contains **15 columns**: 14 features plus the **target column (`income`)**.

3. **Loads the Training Data (`adult.data`)**  
   - `pd.read_csv("adult.data", names=column_names, sep=",\s*", engine="python")`:
     - Reads the dataset from `"adult.data"`.
     - Assigns **column names** to match the `column_names` list.
     - Uses `sep=",\s*"` to **handle inconsistent spaces** after commas.
     - Uses `engine="python"` to properly interpret the separator.

4. **Loads the Test Data (`adult.test`)**  
   - The `"adult.test"` file begins with a line that is **not a data record**, so I use `skiprows=1` to **ignore the first row**.
   - Everything else is handled the same way as the training data.

5. **Displays Dataset Information (`train_data.info()`, `test_data.info()`)**  
   - Provides an overview of the dataset structure:
     - Number of rows and columns.
     - Data types of each column.
     - Whether there are missing values.

#### **Why This Step?**
- **Ensures correct column names** since the dataset lacks headers.  
- **Handles different formats between train and test data**, ensuring consistency.  
- **Allows us to inspect the dataset** before proceeding with cleaning and preprocessing.
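
The regex separator can be tried on a couple of in-memory rows before pointing it at the real files. The rows and the shortened column list below are illustrative assumptions; the raw-string form `r",\s*"` avoids Python treating `\s` as a string escape.

```python
import io
import pandas as pd

# Two illustrative rows in the same comma-plus-spaces style as adult.data.
raw_text = "39, State-gov, 77516\n50,  Self-emp-not-inc, 83311\n"
demo_columns = ["age", "workclass", "fnlwgt"]  # subset of the real column list

demo = pd.read_csv(
    io.StringIO(raw_text),
    names=demo_columns,
    sep=r",\s*",      # raw string: a comma followed by any amount of whitespace
    engine="python",  # regex separators need the Python parsing engine
)
print(demo["workclass"].tolist())  # ['State-gov', 'Self-emp-not-inc']
```

The values arrive without leading spaces, whether one or two spaces follow the comma.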

In [13]:
# Remove any leading/trailing spaces in column values (avoids encoding issues)
train_data = train_data.applymap(lambda x: x.strip() if isinstance(x, str) else x)
test_data = test_data.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# Replace '?' with NaN for easier handling
train_data.replace("?", pd.NA, inplace=True)
test_data.replace("?", pd.NA, inplace=True)

# Drop rows with missing values
train_data.dropna(inplace=True)
test_data.dropna(inplace=True)

# Ensure "income" labels in test data match train data exactly
test_data["income"] = test_data["income"].str.replace(".", "", regex=False)  # Removes extra period

# Label Encode categorical features (EXCLUDING income)
categorical_columns = train_data.select_dtypes(include=["object"]).columns
categorical_columns = categorical_columns.drop("income")  # Exclude income from encoding

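label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    # Fit on the training data only, then reuse the same mapping for the test data
    train_data[col] = le.fit_transform(train_data[col])
    test_data[col] = le.transform(test_data[col])
    label_encoders[col] = le  # keep the fitted encoder so codes can be decoded later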

# Display unique values in "income" to confirm they remain unchanged
print("Unique values in train_data['income']:", train_data["income"].unique())
print("Unique values in test_data['income']:", test_data["income"].unique())

# Display processed dataset
train_data.head()

train_data['income'].value_counts(normalize=True) * 100




Unique values in train_data['income']: ['<=50K' '>50K']
Unique values in test_data['income']: ['<=50K' '>50K']


income
<=50K    75.107751
>50K     24.892249
Name: proportion, dtype: float64

#### Code Explanation:

1. **Removes Leading/Trailing Spaces in All String Values**
   - `train_data.applymap(lambda x: x.strip() if isinstance(x, str) else x)`
   - `test_data.applymap(lambda x: x.strip() if isinstance(x, str) else x)`
   - This **ensures consistency** in categorical values that might have accidental spaces (e.g., `" Male"` vs `"Male"`).

2. **Handles Missing Values (`?` → NaN)**
   - `train_data.replace("?", pd.NA, inplace=True)`
   - `test_data.replace("?", pd.NA, inplace=True)`
   - In this dataset, missing values are represented as `"?"`. We **convert them to missing values (`pd.NA`)** so that pandas can detect and drop them.

3. **Drops Rows with Missing Values**
   - `train_data.dropna(inplace=True)`
   - `test_data.dropna(inplace=True)`
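   - Instead of imputing missing values, this step **removes rows with missing values**. This is simple, but it shrinks the data (the test set goes from 16,281 to 15,060 rows) and can itself bias the sample if values are not missing at random.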

4. **Fixes Formatting Issues in `income` Labels (Test Set)**
   - `test_data["income"] = test_data["income"].str.replace(".", "", regex=False)`
   - The `"income"` column in `adult.test` has an **extra period (`.`) at the end** (e.g., `'>50K.'` instead of `'>50K'`).
   - This step **removes the period** to match the format of `adult.data`.

5. **Label Encodes Categorical Features (EXCLUDING `income`)**
   - `categorical_columns = train_data.select_dtypes(include=["object"]).columns`
   - `categorical_columns = categorical_columns.drop("income")`
   - `LabelEncoder()` is applied to **all categorical columns except `"income"`**. Each encoder is fitted on the training data and reused on the test data, so both sets share the same integer codes.

6. **Verifies `income` Labels Remain Unchanged**
   - `print("Unique values in train_data['income']:", train_data["income"].unique())`
   - `print("Unique values in test_data['income']:", test_data["income"].unique())`
   - This ensures that `income` is **still in its original format (`<=50K`, `>50K`)** and has NOT been encoded.

7. **Displays the First Few Rows of the Processed Data**
   - `train_data.head()`
   - Helps verify that transformations were applied correctly.

#### **Why This Step?**
- **Standardizes categorical values** by removing spaces.
- **Handles missing data** to improve model quality.
- **Ensures consistency between train and test sets** (`income` label formatting).
- **Encodes categorical variables into numbers**, making them usable by Naïve Bayes.
- **Leaves `income` unencoded**, as it is the target variable.
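
The cleaning and encoding steps can be sketched end to end on a few illustrative rows. The frames, the `clean` helper, and the use of `str.rstrip` are assumptions for this example rather than the notebook's exact code; the point is that the encoder is fitted on the training column and then reused, and that keeping it allows codes to be decoded again.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only; the real data come from adult.data / adult.test.
train_demo = pd.DataFrame({"workclass": ["Private", "?", "State-gov"],
                           "income": ["<=50K", ">50K", ">50K"]})
test_demo = pd.DataFrame({"workclass": ["State-gov", "Private"],
                          "income": ["<=50K.", ">50K."]})

def clean(frame):
    """Mark '?' as missing, drop incomplete rows, strip the trailing period from income."""
    frame = frame.replace("?", pd.NA).dropna().copy()
    frame["income"] = frame["income"].str.rstrip(".")
    return frame

train_demo, test_demo = clean(train_demo), clean(test_demo)

# Fit on the training column only, then reuse the same mapping on the test column.
workclass_encoder = LabelEncoder().fit(train_demo["workclass"])
train_demo["workclass"] = workclass_encoder.transform(train_demo["workclass"])
test_demo["workclass"] = workclass_encoder.transform(test_demo["workclass"])

# The stored encoder can translate codes back to the original labels.
decoded = workclass_encoder.inverse_transform(test_demo["workclass"]).tolist()
print(decoded)  # ['State-gov', 'Private']
```

Note that `transform` raises an error if the test data contain a category that never appeared in training, which is worth checking on the full dataset.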



In [14]:
# Define X (features) and y (target) WITHOUT encoding income
X_train = train_data.drop(columns=["income"])
y_train = train_data["income"]  # Remains in original format (<=50K, >50K)

X_test = test_data.drop(columns=["income"])
y_test = test_data["income"]  # Remains in original format (<=50K, >50K)

# Train the Naïve Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Confirm model training
print("Model training complete.")





Model training complete.


#### Code Explanation:

1. **Uses Gaussian Naïve Bayes Classifier**
   - `from sklearn.naive_bayes import GaussianNB` which was done earlier.
   - This loads the **Gaussian Naïve Bayes (GNB) classifier**, which assumes that numerical features follow a **Gaussian (normal) distribution**.
   - It's well-suited for classification tasks where features are **continuous**.

2. **Defines Features (`X`) and Target Variable (`y`)**
   - `X_train = train_data.drop(columns=["income"])`
   - `y_train = train_data["income"]`
   - `X_test = test_data.drop(columns=["income"])`
   - `y_test = test_data["income"]`
   - Here, I:
     - **Separate the target variable (`income`)** from the dataset.
     - **Exclude `income` from feature variables (`X`)** to ensure the model only learns from independent features.
     - Keep `y_train` and `y_test` **in their original format (`<=50K`, `>50K`)**.

3. **Initializes the Naïve Bayes Model**
   - `nb_model = GaussianNB()`
   - This creates an instance of the **Gaussian Naïve Bayes classifier**.

4. **Trains the Model on the Training Data**
   - `nb_model.fit(X_train, y_train)`
   - The model learns the **probabilistic relationships** between the features (`X_train`) and the target labels (`y_train`).

5. **Confirms Model Training**
   - `print("Model training complete.")`
   - This outputs a message to confirm that training was successful.

#### **Why This Step?**
- **Trains the classifier on real-world census data**, allowing it to predict income categories.
- **Ensures `income` is not encoded**, keeping it in its original `<=50K` or `>50K` format.
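- **GaussianNB is used because the dataset includes continuous numerical features (age, capital-gain, hours-per-week, etc.), which it models as normally distributed within each class.** This is an approximation: features such as `capital-gain` are heavily skewed rather than bell-shaped.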
- **Prepares the model for evaluation in the next step**, where we test its performance on unseen data.



In [15]:
from sklearn.metrics import accuracy_score
# Predict on test data
y_pred = nb_model.predict(X_test)

# Generate classification report and confusion matrix
class_report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Compute Precision and F1 Score
precision = precision_score(y_test, y_pred, average="weighted", zero_division=1)
f1 = f1_score(y_test, y_pred, average="weighted", zero_division=1)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Classification Report:\n", class_report)
print("Confusion Matrix:\n", conf_matrix)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision Score: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Debugging: Check unique values in y_test and y_pred
print("Unique values in y_test:", y_test.unique())
print("Unique values in y_pred:", set(y_pred))


Classification Report:
               precision    recall  f1-score   support

       <=50K       0.81      0.95      0.87     11360
        >50K       0.65      0.31      0.42      3700

    accuracy                           0.79     15060
   macro avg       0.73      0.63      0.64     15060
weighted avg       0.77      0.79      0.76     15060

Confusion Matrix:
 [[10739   621]
 [ 2564  1136]]
Accuracy: 0.7885
Precision Score: 0.7678
F1 Score: 0.7592
Unique values in y_test: ['<=50K' '>50K']
Unique values in y_pred: {np.str_('>50K'), np.str_('<=50K')}


#### Code Explanation:

1. **Makes Predictions on the Test Data**
   - `y_pred = nb_model.predict(X_test)`
   - The trained **Naïve Bayes model predicts income categories (`<=50K`, `>50K`)** for the unseen test dataset.
   - The predictions are stored in `y_pred`.

2. **Generates a Classification Report**
   - `class_report = classification_report(y_test, y_pred)`
   - This provides key performance metrics, including:
     - **Precision**: How many of the predicted labels were correct?
     - **Recall**: How many actual labels were correctly identified?
     - **F1-score**: A balance between precision and recall.
     - **Support**: The number of instances per class.

3. **Creates a Confusion Matrix**
   - `conf_matrix = confusion_matrix(y_test, y_pred)`
   - This **compares actual vs. predicted values**, showing:
     - **True Positives (TP)**: Correctly predicted `>50K` instances.
     - **False Positives (FP)**: Mistakenly classified `<=50K` as `>50K`.
     - **False Negatives (FN)**: Failed to detect `>50K` correctly.
     - **True Negatives (TN)**: Correctly predicted `<=50K` instances.

4. **Computes Accuracy, Precision, and F1 Score**
   - `precision = precision_score(y_test, y_pred, average="weighted", zero_division=1)`
   - `f1 = f1_score(y_test, y_pred, average="weighted", zero_division=1)`
   - `accuracy = accuracy_score(y_test, y_pred)`
   - These metrics provide a **global evaluation of the model**.
   - `zero_division=1` sets a metric to 1 (instead of emitting a warning) when its denominator is zero, for example when a class receives no predictions.

5. **Prints Evaluation Results**
   - Outputs:
     - Classification report
     - Confusion matrix
     - Accuracy
     - Precision score
     - F1-score

6. **Debugging: Checks for Consistency Between True and Predicted Labels**
   - `print("Unique values in y_test:", y_test.unique())`
   - `print("Unique values in y_pred:", set(y_pred))`
   - Ensures both `y_test` and `y_pred` contain only the expected labels (`<=50K`, `>50K`).
   - If an error occurs, this helps identify whether there’s an encoding mismatch or an issue in preprocessing.

#### **Why This Step?**
- **Assesses how well the model generalizes to unseen data.**
- **The classification report provides a breakdown of model performance per class.**
- **The confusion matrix helps visualize misclassifications.**
- **Ensures the output labels are correctly formatted and consistent with `y_test`.**
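
With string labels it helps to state the label order explicitly when reading a confusion matrix. The sketch below uses a handful of hypothetical predictions (not the notebook's results) to show how the four cells unpack and how `pos_label` selects `>50K` as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, just to show the layout of the matrix.
y_true_demo = ["<=50K", "<=50K", ">50K", ">50K", ">50K"]
y_pred_demo = ["<=50K", ">50K", ">50K", "<=50K", "<=50K"]

labels = ["<=50K", ">50K"]  # row/column order; the second label acts as "positive"
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=labels).ravel()

# pos_label tells scikit-learn which string label is the positive class.
precision_pos = precision_score(y_true_demo, y_pred_demo, pos_label=">50K")
recall_pos = recall_score(y_true_demo, y_pred_demo, pos_label=">50K")
print(tn, fp, fn, tp, precision_pos, recall_pos)
```

Reading the notebook's own matrix the same way gives 10,739 true negatives, 621 false positives, 2,564 false negatives, and 1,136 true positives.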



In [16]:
# Initialize Multinomial Naïve Bayes
mnb_model = MultinomialNB()

# Train the model
mnb_model.fit(X_train, y_train)

# Predict on test data
y_pred_mnb = mnb_model.predict(X_test)

# Evaluate performance
acc_mnb = accuracy_score(y_test, y_pred_mnb)
f1_mnb = f1_score(y_test, y_pred_mnb, average="weighted")
precision_mnb = precision_score(y_test, y_pred_mnb, average="weighted")

print("**MultinomialNB**")
print(f"Accuracy: {acc_mnb:.4f}")
print(f"F1 Score: {f1_mnb:.4f}")
print(f"Precision Score: {precision_mnb:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_mnb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_mnb))

**MultinomialNB**
Accuracy: 0.7771
F1 Score: 0.7362
Precision Score: 0.7512

Classification Report:
               precision    recall  f1-score   support

       <=50K       0.79      0.95      0.87     11360
        >50K       0.63      0.23      0.34      3700

    accuracy                           0.78     15060
   macro avg       0.71      0.59      0.60     15060
weighted avg       0.75      0.78      0.74     15060


Confusion Matrix:
 [[10847   513]
 [ 2844   856]]


In [17]:
# Initialize Bernoulli Naïve Bayes
bnb_model = BernoulliNB()

# Train the model
bnb_model.fit(X_train, y_train)

# Predict on test data
y_pred_bnb = bnb_model.predict(X_test)

# Evaluate performance
acc_bnb = accuracy_score(y_test, y_pred_bnb)
f1_bnb = f1_score(y_test, y_pred_bnb, average="weighted")
precision_bnb = precision_score(y_test, y_pred_bnb, average="weighted")

print("**BernoulliNB**")
print(f"Accuracy: {acc_bnb:.4f}")
print(f"F1 Score: {f1_bnb:.4f}")
print(f"Precision Score: {precision_bnb:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_bnb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_bnb))


**BernoulliNB**
Accuracy: 0.7284
F1 Score: 0.7445
Precision Score: 0.7868

Classification Report:
               precision    recall  f1-score   support

       <=50K       0.89      0.73      0.80     11360
        >50K       0.47      0.73      0.57      3700

    accuracy                           0.73     15060
   macro avg       0.68      0.73      0.69     15060
weighted avg       0.79      0.73      0.74     15060


Confusion Matrix:
 [[8280 3080]
 [1010 2690]]


### **Comparison of Naïve Bayes Models on the Adult Income Dataset**

We tested three different variations of the **Naïve Bayes** algorithm on the **Adult Income dataset**:

|Model|Accuracy|F1 Score|Precision|
|---|---|---|---|
|**GaussianNB**|**0.79**|**0.76**|**0.77**|
|**BernoulliNB**|0.73|0.74|**0.79**|
|**MultinomialNB**|0.78|0.74|0.75|

---

## **Understanding Each Model's Performance**

### **Gaussian Naïve Bayes (`GaussianNB`)**

- **Accuracy**: **79%** (Best)
- **F1 Score**: **0.76**
- **Precision**: **0.77**
- **Best for handling continuous numerical features**, such as `age`, `capital-gain`, `hours-per-week`.
- **Downside:** Still struggled with **class imbalance**, as the recall for the `>50K` class was **only 31%**.

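#### **Why Does GaussianNB Work Best?**

- **Models every feature as a continuous variable**, which suits columns such as `age` and `hours-per-week` better than the count-based (MultinomialNB) or binary (BernoulliNB) assumptions. Label-encoded categorical columns are treated as continuous too, which is an approximation.  
- **Does not assume binary or count-based data**, making it better suited for the mixed dataset.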

---

### **Multinomial Naïve Bayes (`MultinomialNB`)**

- **Accuracy**: 77.7%
- **F1 Score**: **0.74**
- **Precision**: 0.75
- Struggled because **MultinomialNB assumes count-based features** (e.g., word frequency in text classification).
- **Issue:** The Adult dataset contains **numerical continuous features**, which do not fit MultinomialNB’s assumptions.

#### **Key Differences from GaussianNB:**

- **Only works well for count-based features**.  
- **Does not handle continuous numerical features properly**.  
- **Lower recall for `>50K` (only 23%)**, meaning it **struggled to detect high-income earners**.

---

### **Bernoulli Naïve Bayes (`BernoulliNB`)**

- **Accuracy**: 72.8% (Lowest)
- **F1 Score**: 0.74
- **Precision**: **0.79** (Highest)
- **BernoulliNB assumes binary features** (0/1 values), making it a poor fit for this dataset.
- **Performed decently but underperformed GaussianNB due to continuous numerical features**.

#### **Key Differences from GaussianNB:**

- **BernoulliNB is designed for binary data**, but the Adult dataset has continuous features.  
- **Higher weighted precision (0.79)**, driven by the `<=50K` class (0.89). Precision for `>50K` was the lowest of the three models (0.47), with 3,080 false positives.  
- **Recall for `>50K` was 73%**, well above GaussianNB (31%) and MultinomialNB (23%), at the cost of many more false positives.

---

## **Final Conclusion: Which Model is Best?**

| Model             | Best Use Case                                              | Performance on Adult Dataset                         |
| ----------------- | ---------------------------------------------------------- | ---------------------------------------------------- |
| **GaussianNB**    | Best for **continuous numerical features**                 | **Best Overall (79% accuracy)**                      |
| **MultinomialNB** | Best for **count-based data (word frequency, text data)**  | **Struggled with numerical features (78% accuracy)** |
| **BernoulliNB**   | Best for **binary feature datasets**                       | **Lowest accuracy (73%), highest `>50K` recall**     |

- **Final Choice:** **GaussianNB performed best** on accuracy and weighted F1, because its continuous-feature assumption fits this mixed dataset better than the alternatives.  
- **MultinomialNB and BernoulliNB struggled** on accuracy because they are designed for count-based and binary data, which do not fit the Adult dataset well. BernoulliNB did, however, detect far more `>50K` cases than the other two models.
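
Because the weighted averages are dominated by the `<=50K` class, it is worth lining the three models up on the minority class alone. The sketch below recomputes precision and recall for `>50K` from the confusion matrices printed in the cells above; the helper name is hypothetical.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]],
# with `>50K` as the positive class.
matrices = {
    "GaussianNB": [[10739, 621], [2564, 1136]],
    "MultinomialNB": [[10847, 513], [2844, 856]],
    "BernoulliNB": [[8280, 3080], [1010, 2690]],
}

def minority_metrics(matrix):
    """Return (precision, recall) for the `>50K` class."""
    (tn, fp), (fn, tp) = matrix
    return tp / (tp + fp), tp / (tp + fn)

summary = {name: minority_metrics(m) for name, m in matrices.items()}
for name, (precision, recall) in summary.items():
    print(f"{name:14s} precision={precision:.2f} recall={recall:.2f}")
```

The figures match the classification reports: GaussianNB 0.65 / 0.31, MultinomialNB 0.63 / 0.23, and BernoulliNB 0.47 / 0.73. Which trade-off is preferable depends on whether missed high earners or false alarms are more costly.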

In [18]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train GaussianNB with SMOTE
nb_smote = GaussianNB()
nb_smote.fit(X_train_smote, y_train_smote)

# Predictions and evaluation
y_pred_smote = nb_smote.predict(X_test)
accuracy_smote = accuracy_score(y_test, y_pred_smote)
print("SMOTE Approach:")
print(f"Accuracy: {accuracy_smote:.4f}")
print(classification_report(y_test, y_pred_smote))

SMOTE Approach:
Accuracy: 0.7864
              precision    recall  f1-score   support

       <=50K       0.81      0.94      0.87     11360
        >50K       0.63      0.31      0.41      3700

    accuracy                           0.79     15060
   macro avg       0.72      0.63      0.64     15060
weighted avg       0.76      0.79      0.76     15060



### **Code Explanation: SMOTE Implementation**

1. **Import `SMOTE` from `imblearn.over_sampling`**
    
    - `SMOTE` (Synthetic Minority Over-sampling Technique) is used to balance imbalanced datasets by **generating synthetic examples** of the minority class instead of just duplicating existing samples.
2. **Apply SMOTE to the Training Data**
    
    - `SMOTE` is applied to **increase the number of samples in the minority class (`>50K`)**, helping to balance the dataset.
    - The `fit_resample` method creates **new synthetic instances** rather than duplicating existing ones.
    - `random_state=42` ensures that the oversampling process is **reproducible**.
3. **Train a Gaussian Naïve Bayes Model on SMOTE-Augmented Data**
    
    - After applying SMOTE, the balanced dataset is used to train a **Gaussian Naïve Bayes (`GaussianNB`) classifier**.
    - Training on a balanced dataset is intended to stop the model from **overly favouring the majority class (`<=50K`)**.
4. **Make Predictions on the Test Data**
    
    - The trained model is used to **predict labels for the test set**, which remains unchanged (i.e., still imbalanced).
    - This step helps evaluate whether **the SMOTE-augmented model performs better on real-world imbalanced data**.
5. **Evaluate Model Performance**
    
    - The accuracy score is computed to measure **overall correctness**.
    - A classification report is generated, showing **precision, recall, and F1-score** for each class (`<=50K` and `>50K`).
    - The key metric to watch is **recall for the `>50K` class**, as an improvement indicates that the model is correctly identifying more high-income individuals.

### **Why This Step?**

- **SMOTE is meant to mitigate class imbalance**, so that the model does not ignore the `>50K` class.
- **The aim is higher recall for `>50K`**, i.e. fewer false negatives (cases where `>50K` is misclassified as `<=50K`). In this run, recall for `>50K` stayed at 0.31.
- **Comparison with the original model** determines if **SMOTE improves classification or introduces unwanted noise** (potentially lowering precision).
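
To make the "synthetic examples" idea concrete, here is a minimal sketch of the interpolation step behind SMOTE, written with NumPy only. It is a simplified illustration, not the `imblearn` implementation: the helper name `smote_like_samples` and the toy array are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                      # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its k nearest neighbours
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Toy minority-class points (illustrative values only)
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
X_new = smote_like_samples(X_min, n_new=6, k=2)
print(X_new.shape)
```

Each synthetic row lies on a line segment between two existing minority rows, which is why SMOTE adds variety rather than exact duplicates.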

In [20]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

class_weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight_dict = {cls: weight for cls, weight in zip(np.unique(y_train), class_weights)}

# NOTE: GaussianNB has no `class_weight` parameter, and the weights computed
# above are never passed to the model, so this is effectively the unweighted baseline.
nb_weighted = GaussianNB()
nb_weighted.fit(X_train, y_train)  # class_weight_dict is not used here

# Predictions and evaluation
y_pred_weighted = nb_weighted.predict(X_test)
accuracy_weighted = accuracy_score(y_test, y_pred_weighted)
print("\nClass Weight Approach:")
print(f"Accuracy: {accuracy_weighted:.4f}")
print(classification_report(y_test, y_pred_weighted))


Class Weight Approach:
Accuracy: 0.7885
              precision    recall  f1-score   support

       <=50K       0.81      0.95      0.87     11360
        >50K       0.65      0.31      0.42      3700

    accuracy                           0.79     15060
   macro avg       0.73      0.63      0.64     15060
weighted avg       0.77      0.79      0.76     15060



### **Code Explanation: Class Weight Adjustment**

1. **Import `compute_class_weight` from `sklearn.utils.class_weight`**
    
    - The `compute_class_weight` function **calculates class weights** to counteract class imbalance.
    - The intent is to **adjust model sensitivity** to the underrepresented class (`>50K`) by assigning a higher weight to its instances.
2. **Compute Class Weights**
    
    - The `balanced` strategy ensures that each class's weight is **inversely proportional to its frequency** in the dataset.
    - This means that the majority class (`<=50K`) receives a **lower weight**, while the minority class (`>50K`) receives a **higher weight** to compensate for underrepresentation.
    - The weights are stored in a dictionary format, mapping **each class to its corresponding weight**.
3. **Train a Gaussian Naïve Bayes Model with Class Weights**
    
    - **`GaussianNB` has no `class_weight` parameter**, unlike some other classifiers (e.g., `LogisticRegression`).
    - As a result, `class_weight_dict` is computed but never used: the model is trained on the **original dataset** exactly as the baseline was.
    - The purpose was to **compare its performance against SMOTE**, which actively increases the number of `>50K` instances, but the reported results are those of the unweighted model.
4. **Make Predictions on the Test Data**
    
    - The trained model is evaluated on the **original imbalanced test set** to see how well it generalises.
5. **Evaluate Model Performance**
    
    - The accuracy score is calculated to **measure overall correctness**.
    - A classification report is generated to show **precision, recall, and F1-score** for both classes (`<=50K` and `>50K`).
    - The goal is to **see if weighting the classes improves recall for `>50K`** without significantly harming precision.
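
As a worked example of the `balanced` strategy, each class weight is `n_samples / (n_classes * class_count)`. The sketch below applies that formula in plain Python to a hypothetical 75/25 split; the counts are illustrative, not the actual training counts, and `balanced_class_weights` is a made-up helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels with a 75% / 25% split
labels = ["<=50K"] * 750 + [">50K"] * 250
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the majority class gets a weight below 1 and the minority class a weight of 2, so both classes contribute the same total weight.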

### **Why This Step?**

- **Class weighting attempts to reduce bias** in classification without modifying the dataset size (unlike SMOTE).
- **The `>50K` class receives a higher weight**, which would encourage the model to **pay more attention to minority instances** if the weights were passed to it.
- **Comparison with SMOTE determines whether adjusting class sensitivity is more effective than generating synthetic samples.**
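
One way the weighting could actually reach the model is through the `sample_weight` argument of `GaussianNB.fit`. The following is a sketch under assumptions, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the Adult data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in (assumption): roughly 75% class 0 and 25% class 1
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# One weight per training row, inversely proportional to its class frequency
row_weights = compute_sample_weight("balanced", y_tr)

nb_plain = GaussianNB().fit(X_tr, y_tr)
nb_weighted = GaussianNB().fit(X_tr, y_tr, sample_weight=row_weights)

print("priors (unweighted):", nb_plain.class_prior_)
print("priors (weighted):  ", nb_weighted.class_prior_)
print("minority recall (unweighted):", recall_score(y_te, nb_plain.predict(X_te)))
print("minority recall (weighted):  ", recall_score(y_te, nb_weighted.predict(X_te)))
```

Because every row in a class receives the same weight, the per-class means are unchanged and the main effect is to equalise the class priors. Passing `priors=[0.5, 0.5]` to `GaussianNB` should have a similar effect. Whether this helps on the real data would need to be checked by rerunning the notebook.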

## **Data Sources**

- **[UCI Machine Learning Repository - Adult Income Dataset](https://archive.ics.uci.edu/dataset/2/adult)**
    
    - This dataset contains **48,842 instances** with **14 features** plus a binary target variable (`income`).
    - Features include **demographic and financial attributes**, such as `age`, `education`, `occupation`, `hours-per-week`, and `capital-gain`.
    - The target variable (`income`) is divided into two classes:
        - `<=50K` (earns $50,000 or less per year)
        - `>50K` (earns more than $50,000 per year)
    - The dataset is **split into a training set (`adult.data`) and a test set (`adult.test`)**, but inconsistencies exist between them.
- **[Kaggle - Mushroom Classification Dataset](https://www.kaggle.com/datasets/uciml/mushroom-classification)**
    
    - This dataset contains **8,124 instances** with **23 categorical columns** (22 features plus the `class` target), describing characteristics of mushrooms such as `cap-shape`, `gill-color`, `odor`, and `habitat`.
    - The target variable (`class`) indicates whether a mushroom is **edible (`e`)** or **poisonous (`p`)**.
    - The dataset is **fully labeled and does not contain missing values**, making it an excellent candidate for Naïve Bayes classification.

---

## **Pre-Processing**

### **Mushroom Dataset Preprocessing**

- The dataset was **already well-structured**, requiring **minimal preprocessing**:
    - Converted categorical variables into numerical values using **Label Encoding**.
    - No missing values or inconsistencies were found.
    - The dataset was then split into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets.

### **Adult Income Dataset Preprocessing Challenges**

Unlike the mushroom dataset, the Adult Income dataset presented several **formatting and preprocessing challenges**:

- **Inconsistent Formatting in Test Set (`adult.test`)**
    
    - The test dataset contained an **extra period (`.`) at the end of the `income` labels** (e.g., `'>50K.'` vs. `'>50K'`).
    - Initial attempts to encode `income` failed due to mismatches between train and test labels.
    - **Solution:** Used `str.replace(".", "", regex=False)` to standardize labels across both datasets.

- **Handling Missing Values (`?` in Place of NaN)**
    
    - The dataset used `"?"` to represent missing values instead of `NaN`.
    - Replaced all `?` values with `NaN` and **dropped rows with missing values**.

- **Categorical Feature Encoding**
    
    - The dataset contained **both numerical and categorical features**.
    - Applied **Label Encoding** to categorical features **except `income`**, which remained in string format.
    - This step ensured compatibility with Naïve Bayes models.

- **Addressing Class Imbalance**
    
    - **Initial accuracy (79%)** with `GaussianNB`, but class imbalance caused many `>50K` instances to be misclassified.
    - **To address this, two techniques were tested**:
        1. **SMOTE (Synthetic Minority Over-sampling Technique)**
        2. **Class Weighting**

- **Results after applying SMOTE and Class Weights**:

| Approach | Accuracy | Precision (`<=50K` / `>50K`) | Recall (`<=50K` / `>50K`) | F1-Score (`<=50K` / `>50K`) |
| -------------- | ---- | ----------- | ----------- | ----------- |
| Original Model | 0.79 | 0.81 / 0.65 | 0.95 / 0.31 | 0.87 / 0.42 |
| SMOTE          | 0.79 | 0.81 / 0.63 | 0.94 / 0.31 | 0.87 / 0.41 |
| Class Weights  | 0.79 | 0.81 / 0.65 | 0.95 / 0.31 | 0.87 / 0.42 |

- **Observations**:
    - **SMOTE and class weighting show no improvement:** apart from minor fluctuations, recall for `>50K` stays at 0.31 in all three runs. The Class Weights row is identical to the original model because the computed weights were never passed to `GaussianNB`.
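
The Adult Income cleaning steps described in this section (label normalisation, `?` handling, and label encoding) can be sketched as follows. The tiny DataFrame is hand-made for illustration and only mimics the quirks of `adult.test`; the column subset and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hand-made stand-in for a few rows of adult.test (illustrative values only)
df = pd.DataFrame({
    "age": [39, 50, 38, 28],
    "workclass": ["Private", "?", "Private", "State-gov"],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "income": ["<=50K.", ">50K.", ">50K.", "<=50K."],
})

# 1. Strip the trailing period so test labels match the training labels
df["income"] = df["income"].str.replace(".", "", regex=False)

# 2. Treat "?" as missing and drop incomplete rows
df = df.replace("?", np.nan).dropna().reset_index(drop=True)

# 3. Label-encode the categorical features, leaving `income` as strings
categorical_cols = ["workclass", "education"]
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

On the real data, fitting each encoder on the training set and reusing it on the test set would keep the integer codes consistent between the two files.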

---

## **Data Understanding & Visualization**

- **Mushroom Dataset**
    - A nearly even split (roughly 51% to 49%) between **edible (`e`) and poisonous (`p`) mushrooms**, making it well-balanced for classification.

- **Adult Income Dataset**
    
    - **Highly imbalanced class distribution**
        - A skewed split of roughly 75% `<=50K` to 25% `>50K`.

    - **Key Findings:**
        - Individuals with **higher education levels and capital gains** were more likely to earn `>50K`.
        - **Hours per week and age** played a significant role in classification.

---

## **Algorithms**

Understanding the different Naïve Bayes variations was **critical** for selecting the correct model:

| **Naïve Bayes Type** | **Best Use Case**                                                               |
| -------------------- | ------------------------------------------------------------------------------- |
| **GaussianNB**       | Works with **continuous numerical features** (e.g., age, capital gain/loss)     | 
| **MultinomialNB**    | Works with **count-based data** (e.g., word frequencies in text classification) | 
| **BernoulliNB**      | Works with **binary features** (e.g., presence/absence of words in text)        |

### **Mushroom Dataset (Baseline Naïve Bayes Model)**

- **Model Used**: `GaussianNB`
- **Why?**
    - `GaussianNB` had the highest accuracy of the three variants, even though `MultinomialNB` initially seemed better suited to the dataset's categorical features.

### **Adult Income Dataset (Challenges & Model Selection)**

- **Model Used**: `GaussianNB`
- **Why?**
    - `GaussianNB` had the highest accuracy of the three variants. I also experimented with `BernoulliNB` and `MultinomialNB`, but they underperformed due to feature distribution mismatches.
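
A comparison of the three variants like the one described above could be run along these lines. This is a sketch on synthetic label-encoded data, not the notebook's experiment: the random features and the rule that defines `y` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic label-encoded data (assumption): 6 categorical columns with codes 0-4
rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(600, 6))
# Hypothetical target that depends on the first two columns
y = (X[:, 0] + X[:, 1] >= 4).astype(int)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),  # requires non-negative features
    "BernoulliNB": BernoulliNB(),      # binarizes each feature at 0.0 by default
}

# Mean 5-fold cross-validated accuracy for each variant
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here so that the ranking does not hinge on a single train/test split.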

---

## **Model Training and Evaluation**

- **Mushroom Dataset**
    - **Training Set Size**: 80% of the data (about 6,500 instances)
    - **Test Set Size**: 20% (about 1,600 instances)
    - **Achieved over 90% accuracy**, indicating that mushrooms can be classified effectively based on physical attributes.

- **Adult Income Dataset**
    
    - **Challenges in Model Accuracy**:
        - **Initial accuracy (79%)** is only modestly above the roughly 75% that always predicting `<=50K` would score on this test set.
        - **Class imbalance affected recall for `>50K` predictions**: many were incorrectly classified as `<=50K`.
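
The mushroom split sizes quoted above can be checked with a short sketch. The row count comes from `df.info()`; `test_size=0.2` and `random_state=42` are assumptions about how the split was called.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8124  # rows in mushrooms.csv, per df.info()
indices = np.arange(n_rows)

# test_size and random_state are assumed values, not confirmed by the notebook
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)
print(len(train_idx), len(test_idx))  # 6499 1625
```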

### **Performance Metrics**

| **Dataset**      | **Model Used** | **Accuracy** |
| ---------------- | -------------- | ------------ |
| **Mushroom**     | `GaussianNB`   | **93%**      |
| **Adult Income** | `GaussianNB`   | **79%**      |

**Confusion Matrix Insights**

- **Mushroom Dataset**:
    - Few misclassifications, suggesting strong predictive capability.
- **Adult Income Dataset**:
    - **High false negatives** for the `>50K` class, showing bias toward predicting `<=50K`.
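
To put the Adult Income figures in context, the arithmetic below derives the majority-class baseline and an approximate false-negative count from the support counts and `>50K` recall printed in the classification reports. The false-negative figure is approximate because the report rounds recall to two decimals.

```python
# Support counts and >50K recall taken from the classification reports
support_low, support_high = 11360, 3700
recall_high = 0.31  # rounded to two decimals in the report

total = support_low + support_high
majority_baseline = support_low / total  # accuracy of always predicting <=50K
approx_false_negatives = round(support_high * (1 - recall_high))

print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"Approximate >50K cases predicted as <=50K: {approx_false_negatives}")
```

A baseline of about 0.75 means the 79% accuracy reflects only a small gain over always predicting `<=50K`, which is why recall for `>50K` is the more informative metric here.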

---

## **Online Resources & Sources**

- **[UCI Machine Learning Repository - Adult Income Dataset](https://archive.ics.uci.edu/dataset/2/adult)**
    
    - Provided the real-world census data used for the Adult Income classification.
- **[Kaggle - Mushroom Classification Dataset](https://www.kaggle.com/datasets/uciml/mushroom-classification)**
    
    - Provided labeled mushroom data for classification.
- **[Machine Learning Plus - Understanding Naïve Bayes](https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/)**
    
    - Helped with understanding the differences between `GaussianNB`, `BernoulliNB`, and `MultinomialNB`.

- **ChatGPT**

    - Helped with finding my dataset and clarifying some confusion I had about `GaussianNB`, `BernoulliNB`, and `MultinomialNB`.

---

## **Tools & Technologies Used**

- **Python Libraries**:
    
    - `Pandas` and `NumPy` for data manipulation.
    - `Scikit-learn` for Naïve Bayes models and performance evaluation.
    - `Matplotlib` for data visualization.
- **Development Environment**:
    
    - **Jupyter Notebook** for interactive model development and evaluation.

---

## **Challenges Faced**

1. **Formatting Issues in the Adult Dataset**
    
    - The period in `income` labels (`>50K.`) caused mismatches in training vs. test data.
    - **Solution**: Standardized labels using `str.replace(".", "", regex=False)`.

2. **Selecting the Right Naïve Bayes Model**
    
    - **GaussianNB was chosen for Adult Income**, after testing **BernoulliNB and MultinomialNB**.

3. **Class Imbalance in Adult Dataset**
    
    - The dataset was **skewed**, with **>50K instances underrepresented**.
    - **Two balancing techniques (SMOTE and Class Weights) were tested.**
    - The experiment showed that **neither technique improved recall for `>50K`** with `GaussianNB`, while overall accuracy stayed at about 79%.

4. **Understanding Naïve Bayes Variants**
    
    - Took time to **differentiate between GaussianNB, BernoulliNB, and MultinomialNB** for different datasets.

---

## **Conclusion**

- **Mushroom classification was highly effective with Naïve Bayes (over 90% accuracy).**
- **Adult Income classification performed decently (79%) but struggled with class imbalance.**
- **SMOTE and Class Weights were tested to correct this, but neither improved minority class prediction: recall for `>50K` stayed at 0.31.**
- **GaussianNB was the best fit for Adult Income and Mushroom Classification.**
