# Multi-Label Email Classification Project

This notebook implements a multi-label text classification system for emails using various machine learning models. Multi-label classification allows each email to be assigned multiple categories simultaneously (e.g., an email could be both "Business" and "Customer Support").

## Project Overview
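- **Dataset**: `imnim/multiclass-email-classification` from Hugging Face (2,105 emails; despite the name, each email can carry several labels)
- **Task**: Predict multiple labels for each email
- **Models**: ComplementNB, MultinomialNB, LogisticRegression (each wrapped in `OneVsRestClassifier`)
- **Features**: TF-IDF vectors over unigrams and bigrams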

## Import Libraries and Setup

We import all necessary libraries for:
- **Text processing**: `re` for regex, `nltk` for NLP operations
- **Data handling**: `datasets` for loading data, `sklearn` for ML tasks
- **Models**: Naive Bayes variants and Logistic Regression (Random Forest is also imported and listed in the model dictionary, but it is not trained in this notebook)
- **Evaluation**: Metrics to measure model performance

In [1]:
# Regular expressions for text cleaning
import re

# Dataset loading
from datasets import load_dataset

# Train-test splitting and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Feature extraction (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer

# Multi-label classification wrapper
from sklearn.multiclass import OneVsRestClassifier

# Classification models
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, classification_report

# NLP tools
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download stopwords (common words like "the", "is", "and")
nltk.download("stopwords")

# Initialize text processing tools
stop_words = set(stopwords.words("english"))  # Words to remove
stemmer = PorterStemmer()  # Reduces words to their root form (e.g., "running" -> "run")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hout\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Define Text Preprocessing Function

This function cleans and normalizes email text to improve model performance:
1. **Remove placeholders** like [Your Name]
2. **Lowercase** all text for consistency
3. **Remove punctuation** to focus on words
4. **Remove stopwords** (common words that don't add meaning)
5. **Stem words** to reduce them to root forms
6. **Collapse whitespace** for clean output

In [2]:
def preprocess_email(text):
    """
    Preprocess email text for machine learning.
    
    This function applies multiple text cleaning steps to normalize
    the input and make it suitable for feature extraction.
    
    Steps:
    - Remove placeholders like [Your Name]
    - Lowercase conversion
    - Remove punctuation
    - Remove stopwords (common words)
    - Apply stemming (reduce words to root form)
    - Collapse extra whitespace
    
    Args:
        text (str): Raw email text
        
    Returns:
        str: Cleaned and normalized text
    """
    if text is None:
        return ""
    
    # Remove square bracket placeholders (e.g., [Company Name])
    text = re.sub(r'\[.*?\]', '', text)
    
    # Convert to lowercase for consistency
    text = text.lower()
    
    # Replace punctuation with spaces (\w keeps letters, digits, and underscores)
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Tokenize (split into words)
    tokens = text.split()
    
    # Remove stopwords and apply stemming
    # Keep only words longer than 1 character and not in stopwords list
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words and len(t) > 1]
    
    # Rejoin tokens with single spaces (split + join also collapses extra whitespace)
    return " ".join(tokens)
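
To make the effect of the non-stemming steps visible without downloading NLTK data, here is a minimal sketch of the same pipeline. The function name `clean_without_stemming` and the five-word stopword set are illustrative assumptions, not part of the notebook, and stemming is deliberately left out.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
DEMO_STOPWORDS = {"the", "is", "at", "a", "to"}


def clean_without_stemming(text):
    """Apply the placeholder, lowercase, punctuation and stopword steps only."""
    if text is None:
        return ""
    text = re.sub(r"\[.*?\]", "", text)    # drop [Your Name]-style placeholders
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes a space
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS and len(t) > 1]
    return " ".join(tokens)


sample = "Dear Team, the meeting is at 10:00 AM. Best, [Your Name]"
print(clean_without_stemming(sample))
# dear team meeting 10 00 am best
```

Note that replacing punctuation with spaces splits `10:00` into the two tokens `10` and `00`, which is why times appear that way in the cleaned notebook output.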

## Load and Explore the Dataset

We load the multi-class email classification dataset from Hugging Face and combine the subject and body of each email into a single text field.

In [3]:
# Load the multi-class email classification dataset
dataset = load_dataset("imnim/multiclass-email-classification")
train_data = dataset["train"]

# Combine subject and body into single text field
# Handle None values by converting to empty string
texts_raw = [(s or "") + " " + (b or "") for s, b in zip(train_data["subject"], train_data["body"])]

# Extract labels (each email can have multiple labels)
labels_raw = train_data["labels"]

# Display sample data to understand structure
print("Sample raw text:", texts_raw[0])
print("Sample raw labels:", labels_raw[0])
print("Total examples:", len(texts_raw))

Sample raw text: Meeting Reminder: Quarterly Sales Review Tomorrow Dear Team, Just a friendly reminder that our Quarterly Sales Review meeting is scheduled for tomorrow at 10:00 AM in the conference room. Please make sure to bring your sales reports and any relevant updates. Coffee and pastries will be provided. Looking forward to a productive meeting. Best regards, [Your Name]
Sample raw labels: ['Business', 'Reminders']
Total examples: 2105


## Split Data and Preprocess

We split the data into training (80%) and testing (20%) sets, then apply our preprocessing function to clean the text. Labels are converted to binary format using MultiLabelBinarizer.
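
As a small illustration of the binary label format, the sketch below runs `MultiLabelBinarizer` on three made-up label lists (the lists are assumptions; only the label names come from the dataset). Each column corresponds to one entry of `classes_`, which scikit-learn sorts alphabetically.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists, made up for illustration.
train_labels = [["Business", "Reminders"], ["Personal"], ["Business"]]

mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform(train_labels)

print(list(mlb_demo.classes_))  # column order of the indicator matrix
print(y_demo)                   # one row per email, one column per label

# inverse_transform maps indicator rows back to tuples of label names.
restored = mlb_demo.inverse_transform(y_demo)
print(restored)
```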

In [4]:
# Split data into 80% training and 20% testing
# Use random_state for reproducibility
X_train_raw, X_test_raw, labels_train_raw, labels_test_raw = train_test_split(
    texts_raw, labels_raw, test_size=0.2, random_state=42, shuffle=True
)

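# Apply preprocessing to both train and test sets
# preprocess_email is stateless (each email is cleaned independently), so it cannot leak
# test information; the steps that must be fit on training data only are the label
# binarizer and the TF-IDF vectorizer below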
X_train = [preprocess_email(t) for t in X_train_raw]
X_test = [preprocess_email(t) for t in X_test_raw]

# Convert labels to binary format for multi-label classification
# MultiLabelBinarizer creates a binary matrix where 1 = label present, 0 = label absent
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(labels_train_raw)  
y_test = mlb.transform(labels_test_raw)        

# Show example of preprocessing effect
print("Processed example (train index 0):")
print("RAW:", X_train_raw[0][:400])
print("CLEAN:", X_train[0][:400])

Processed example (train index 0):
RAW: Important Update: Company Meeting Schedule Change Dear Team, Due to unforeseen circumstances, the upcoming company meeting scheduled for Friday has been rescheduled to Monday next week at 10:00 AM. We apologize for any inconvenience this may cause and appreciate your understanding. Please make the necessary adjustments to your calendars. Thank you. Best regards, [Your Name]
CLEAN: import updat compani meet schedul chang dear team due unforeseen circumst upcom compani meet schedul friday reschedul monday next week 10 00 apolog inconveni may caus appreci understand pleas make necessari adjust calendar thank best regard


## Feature Extraction with TF-IDF

Convert text to numerical features using TF-IDF (Term Frequency-Inverse Document Frequency):
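- **max_features=15000**: Keep at most the 15,000 terms with the highest term frequency across the corpus (the fitted vocabulary reported below has 12,641 terms, so the cap is not reached here)
- **ngram_range=(1,2)**: Consider both single words (unigrams) and adjacent word pairs (bigrams)
- **sublinear_tf=True**: Replace each raw term count `tf` with `1 + log(tf)` so repeated words count less than linearly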
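
The weighting can be sketched in plain Python. This is a simplified re-implementation for illustration only, assuming scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization; the helper names `ngrams` and `tfidf` and the three toy documents are not from the notebook.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams from n_min to n_max words, shorter n-grams first."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def tfidf(docs):
    """Sublinear TF times smoothed IDF, L2-normalized per document."""
    term_lists = [ngrams(d.split()) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        weights = {
            t: (1 + math.log(c)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["compani meet schedul chang", "invoic payment due", "meet reschedul monday"]
vectors = tfidf(docs)
print(ngrams("meet schedul chang".split()))
print(round(vectors[0]["compani"], 3), round(vectors[0]["meet"], 3))
```

In the first document, `meet` gets a lower weight than `compani` because it also occurs in the third document, so its IDF is smaller.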

In [5]:
# Initialize TF-IDF vectorizer
# TF-IDF converts text to numerical features by weighing term importance
vectorizer = TfidfVectorizer(
    max_features=15000,      
    ngram_range=(1, 2),    
    sublinear_tf=True      
)

# Fit vectorizer on training data and transform both sets
X_train_vec = vectorizer.fit_transform(X_train)  
X_test_vec = vectorizer.transform(X_test)       

print("TF-IDF feature count:", len(vectorizer.get_feature_names_out()))

TF-IDF feature count: 12641


## Define Models 


In [6]:
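# Dictionary of candidate models for optional batch training
# OneVsRestClassifier wraps each model to handle multi-label classification
# Note: later cells build each classifier directly and do not read this dictionary;
# RandomForest is never trained, and MultinomialNB (evaluated below) is not listed here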
models = {
    "ComplementNB": OneVsRestClassifier(ComplementNB()),
    "LogisticRegression": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "RandomForest": OneVsRestClassifier(RandomForestClassifier(n_estimators=200, random_state=42))
}
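
A dictionary of this shape could be consumed by a small loop that fits every model and collects the same three metrics used later in the notebook. The sketch below is one possible way to do that, not the notebook's own code: `evaluate_models` is a hypothetical helper, and the synthetic data from `make_multilabel_classification` stands in for the TF-IDF matrix and label matrix. Random Forest is left out to keep the demo fast.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import ComplementNB, MultinomialNB


def evaluate_models(models, X_train, y_train, X_test, y_test, threshold=0.5):
    """Fit each model and return {name: {metric: value}} (hypothetical helper)."""
    results = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        # Threshold the per-label probabilities, as the notebook cells do.
        y_pred = (clf.predict_proba(X_test) >= threshold).astype(int)
        results[name] = {
            "subset_accuracy": accuracy_score(y_test, y_pred),
            "micro_f1": f1_score(y_test, y_pred, average="micro", zero_division=0),
            "macro_f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
        }
    return results


# Synthetic, non-negative count features as a stand-in for real data (assumption).
X, y = make_multilabel_classification(
    n_samples=300, n_features=30, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

demo_models = {
    "ComplementNB": OneVsRestClassifier(ComplementNB()),
    "MultinomialNB": OneVsRestClassifier(MultinomialNB()),
    "LogisticRegression": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
}
results = evaluate_models(demo_models, X_tr, y_tr, X_te, y_te)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```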

## Train and Evaluate ComplementNB

**Complement Naive Bayes** is designed to handle imbalanced datasets by learning from the complement of each class. It often performs better than standard Naive Bayes for text classification.
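
The "complement" idea can be illustrated with a stripped-down sketch: for each class, word statistics are estimated from all documents that do not belong to it, and a document is assigned to the class whose complement fits it worst. This is a simplified illustration under stated assumptions (raw counts, Laplace smoothing with `alpha=1`, no weight normalization), not scikit-learn's implementation; the function names and toy counts are made up.

```python
import math


def complement_weights(docs, labels, alpha=1.0):
    """Log-probabilities of each feature estimated from the complement of each class."""
    n_features = len(docs[0])
    weights = {}
    for c in sorted(set(labels)):
        comp = [0.0] * n_features
        for doc, label in zip(docs, labels):
            if label != c:  # only documents outside class c contribute
                for i, count in enumerate(doc):
                    comp[i] += count
        total = sum(comp) + alpha * n_features
        weights[c] = [math.log((comp[i] + alpha) / total) for i in range(n_features)]
    return weights


def predict(weights, doc):
    """Pick the class whose complement explains the document worst (lowest score)."""
    scores = {c: sum(x * w for x, w in zip(doc, ws)) for c, ws in weights.items()}
    return min(scores, key=scores.get)


# Columns: counts of "invoice" and "party" (toy vocabulary).
docs = [[3, 0], [2, 1], [0, 4]]
labels = ["finance", "finance", "events"]
weights = complement_weights(docs, labels)

print(predict(weights, [2, 0]))  # finance
print(predict(weights, [0, 3]))  # events
```

Because each class's parameters come from every other class's documents, a rare class still draws on a large pool of text, which is the usual argument for why this variant copes better with imbalance.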

**ComplementNB Model Performance Summary**

- The Complement Naive Bayes model achieved 62.0% subset accuracy (every label of an email predicted exactly), with Micro F1 = 0.796 and Macro F1 = 0.759, the highest of the three models on all three metrics.

- It performed very well on categories with clear keyword patterns such as Finance & Bills, Travel & Bookings, Job Application, Business, and Events & Invitations (F1-scores above 0.80).

- Moderate performance was seen in Customer Support (F1 0.76), Promotions (0.75), and Reminders (0.68).

- The model was weakest on Personal (F1 0.35) and Newsletters (F1 0.60). Precision stays high for both (1.00 and 0.88) while recall is low (0.21 and 0.46), so the model rarely assigns these labels at the 0.5 threshold. A plausible cause is the more personal or varied language in these categories.

Overall, ComplementNB is reliable on keyword-rich categories but misses many emails in classes with less consistent or more informal text.
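
The three headline metrics differ in what they average over, which a hand computation on a tiny example makes concrete. The sketch below uses made-up indicator matrices and hypothetical helper names; it mirrors the standard definitions rather than calling scikit-learn.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label row is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_counts(y_true, y_pred, j):
    """True positives, false positives and false negatives for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    # Micro: pool the counts over all labels first, so frequent labels dominate.
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    return micro, macro


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(subset_accuracy(y_true, y_pred))  # 0.5
print(round(micro, 3), round(macro, 3))  # 0.8 0.778
```

Two of the four rows miss a single label, which halves subset accuracy while the F1 scores stay much higher. That is the same pattern as in the reported results, where subset accuracy sits well below Micro F1.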

In [7]:
# Initialize ComplementNB model wrapped in OneVsRestClassifier
# OneVsRestClassifier trains one binary classifier per label
clf_cnb = OneVsRestClassifier(ComplementNB())
clf_cnb.fit(X_train_vec, y_train)

# Predict probabilities for each label
y_proba_cnb = clf_cnb.predict_proba(X_test_vec)

# Convert probabilities to binary predictions (threshold = 0.5)
y_pred_cnb = (y_proba_cnb >= 0.5).astype(int)

# Calculate evaluation metrics
acc = accuracy_score(y_test, y_pred_cnb)       # Subset accuracy (all labels must match)
micro = f1_score(y_test, y_pred_cnb, average='micro')  # Micro-averaged F1
macro = f1_score(y_test, y_pred_cnb, average='macro')  # Macro-averaged F1

print("✅ ComplementNB")
print(f"Subset Accuracy: {acc:.3f}, Micro F1: {micro:.3f}, Macro F1: {macro:.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_cnb, target_names=mlb.classes_))

✅ ComplementNB
Subset Accuracy: 0.620, Micro F1: 0.796, Macro F1: 0.759

Classification Report:
                      precision    recall  f1-score   support

            Business       0.76      0.93      0.83       174
    Customer Support       0.80      0.73      0.76        48
Events & Invitations       0.76      0.91      0.82       127
     Finance & Bills       0.87      0.98      0.93        63
     Job Application       1.00      0.81      0.89        26
         Newsletters       0.88      0.46      0.60        46
            Personal       1.00      0.21      0.35        52
          Promotions       0.72      0.78      0.75        27
           Reminders       0.68      0.67      0.68        70
   Travel & Bookings       1.00      0.95      0.97        58

           micro avg       0.80      0.79      0.80       691
           macro avg       0.85      0.74      0.76       691
        weighted avg       0.82      0.79      0.78       691
         samples avg       0.82   



## Train and Evaluate MultinomialNB

**Multinomial Naive Bayes** is a fast text-classification model that learns from word frequency patterns. It works best when each class has clear, consistent vocabulary, but struggles with categories that use varied or informal language.

**MultinomialNB Model Performance Summary** 

- The model achieved 58.9% subset accuracy, with Micro F1 = 0.776 and Macro F1 = 0.701, the lowest of the three models.

- It performed very well on Finance & Bills, Travel & Bookings, Job Application, Events & Invitations, and Business (F1-scores of 0.83 or higher), where text patterns are more consistent.

- Performance was moderate for Promotions (F1 0.75) and weaker for Reminders (F1 0.59).

- The model struggled with Customer Support, Newsletters, and Personal emails (F1 of 0.50, 0.46, and 0.21). Precision is 0.93 or higher for all three but recall is at most 0.33, so most emails in these categories are missed; varied and less predictable language is a likely cause.

To sum up, MultinomialNB is precise on structured, keyword-rich categories but has low recall on categories with informal or diverse writing styles.

In [8]:
# Initialize and train MultinomialNB model
clf_mnb = OneVsRestClassifier(MultinomialNB())
clf_mnb.fit(X_train_vec, y_train)

# Predict probabilities and convert to binary predictions
y_proba_mnb = clf_mnb.predict_proba(X_test_vec)
y_pred_mnb = (y_proba_mnb >= 0.5).astype(int)

# Calculate evaluation metrics
acc = accuracy_score(y_test, y_pred_mnb)
micro = f1_score(y_test, y_pred_mnb, average='micro')
macro = f1_score(y_test, y_pred_mnb, average='macro')

print("✅ MultinomialNB")
print(f"Subset Accuracy: {acc:.3f}, Micro F1: {micro:.3f}, Macro F1: {macro:.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_mnb, target_names=mlb.classes_))

✅ MultinomialNB
Subset Accuracy: 0.589, Micro F1: 0.776, Macro F1: 0.701

Classification Report:
                      precision    recall  f1-score   support

            Business       0.76      0.93      0.83       174
    Customer Support       1.00      0.33      0.50        48
Events & Invitations       0.86      0.87      0.86       127
     Finance & Bills       0.97      0.92      0.94        63
     Job Application       1.00      0.81      0.89        26
         Newsletters       0.93      0.30      0.46        46
            Personal       1.00      0.12      0.21        52
          Promotions       0.86      0.67      0.75        27
           Reminders       0.80      0.47      0.59        70
   Travel & Bookings       1.00      0.93      0.96        58

           micro avg       0.86      0.71      0.78       691
           macro avg       0.92      0.63      0.70       691
        weighted avg       0.88      0.71      0.74       691
         samples avg       0.81  



## Train and Evaluate Logistic Regression

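**Logistic Regression** is a linear, discriminative model that often performs well for text classification. Unlike Naive Bayes, it does not assume that features are conditionally independent given the label, so it can weight correlated terms (for example, overlapping unigrams and bigrams) more appropriately.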
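
As a rough illustration of how each one-vs-rest logistic model turns a feature vector into a label decision, the sketch below applies a sigmoid to a linear score and thresholds the result. The weights, bias, and feature values are made up, and `predict_label` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Return (probability, decision) for one binary label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, p >= threshold


weights, bias = [2.0, -1.0], -0.5  # toy parameters for one label

p_yes, yes = predict_label(weights, bias, [1.0, 0.5])  # score z = 1.0
p_no, no = predict_label(weights, bias, [0.0, 1.0])    # score z = -1.5
print(round(p_yes, 3), yes)  # 0.731 True
print(round(p_no, 3), no)    # 0.182 False
```

With a threshold of 0.5, the decision is equivalent to checking whether the linear score is non-negative.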

**Logistic Regression Model Summary**

- The Logistic Regression model achieved 61.3% subset accuracy, with Micro F1 = 0.795 and Macro F1 = 0.731: nearly level with ComplementNB on Micro F1 but lower on Macro F1.

- It performed very well on consistent categories like Business, Events & Invitations, Finance & Bills, Job Application, and Travel & Bookings (F1-scores of 0.84 or higher).

- Performance was moderate on Customer Support (F1 0.74) and Promotions (0.73), and weaker on Reminders (0.59).

- It struggled with Newsletters and Personal (F1 0.43 and 0.35), where writing style varies a lot: precision is 0.93 or higher, but recall is only 0.28 and 0.21.

In summary, Logistic Regression is precise across most classes but still misses many emails with informal or highly varied text.
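
High precision combined with low recall often means a fixed 0.5 cutoff is too strict for a label. One option, which this notebook does not use, is to choose a per-label threshold on a held-out validation split. The sketch below shows the idea for a single label with made-up probabilities; `best_threshold` is a hypothetical helper, and thresholds should be tuned on validation data rather than on the test set.

```python
def f1_at_threshold(y_true, proba, threshold):
    """F1 for one label when probabilities are cut at the given threshold."""
    pred = [int(p >= threshold) for p in proba]
    tp = sum(t == 1 and q == 1 for t, q in zip(y_true, pred))
    fp = sum(t == 0 and q == 1 for t, q in zip(y_true, pred))
    fn = sum(t == 1 and q == 0 for t, q in zip(y_true, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, proba, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, proba, th))


# Toy validation data for one label (assumption): three positives, two negatives.
y_val = [1, 1, 1, 0, 0]
p_val = [0.9, 0.4, 0.35, 0.3, 0.1]

print(f1_at_threshold(y_val, p_val, 0.5))                 # 0.5
print(best_threshold(y_val, p_val, [0.5, 0.4, 0.35, 0.3]))  # 0.35
```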

In [9]:
# Initialize and train Logistic Regression model
# max_iter=1000 allows more iterations for convergence
clf_lr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf_lr.fit(X_train_vec, y_train)

# Predict probabilities and convert to binary predictions
y_proba_lr = clf_lr.predict_proba(X_test_vec)
y_pred_lr = (y_proba_lr >= 0.5).astype(int)

# Calculate evaluation metrics
acc = accuracy_score(y_test, y_pred_lr)
micro = f1_score(y_test, y_pred_lr, average='micro')
macro = f1_score(y_test, y_pred_lr, average='macro')

print("✅ LogisticRegression")
print(f"Subset Accuracy: {acc:.3f}, Micro F1: {micro:.3f}, Macro F1: {macro:.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_lr, target_names=mlb.classes_))

✅ LogisticRegression
Subset Accuracy: 0.613, Micro F1: 0.795, Macro F1: 0.731

Classification Report:
                      precision    recall  f1-score   support

            Business       0.78      0.95      0.85       174
    Customer Support       0.91      0.62      0.74        48
Events & Invitations       0.81      0.90      0.85       127
     Finance & Bills       0.97      0.94      0.95        63
     Job Application       1.00      0.73      0.84        26
         Newsletters       0.93      0.28      0.43        46
            Personal       1.00      0.21      0.35        52
          Promotions       0.94      0.59      0.73        27
           Reminders       0.80      0.47      0.59        70
   Travel & Bookings       1.00      0.93      0.96        58

           micro avg       0.85      0.74      0.80       691
           macro avg       0.91      0.66      0.73       691
        weighted avg       0.87      0.74      0.77       691
         samples avg       0



## Model Comparison: MultinomialNB vs ComplementNB vs Logistic Regression

**MultinomialNB:** Fast baseline with the highest macro-averaged precision (0.92) but the lowest macro-averaged recall (0.63); it misses many Customer Support, Newsletters, and Personal emails.

**ComplementNB:** Highest macro-averaged recall (0.74) and the best subset accuracy, Micro F1, and Macro F1; still weak on Personal (recall 0.21).

**Logistic Regression:** Precision close to MultinomialNB (macro 0.91) with somewhat better recall (0.66). Its Micro F1 (0.795) nearly matches ComplementNB, but its Macro F1 (0.731) is lower because of weak recall on Newsletters, Personal, and Reminders.

Summary: ComplementNB has the highest scores on all three headline metrics, Logistic Regression is a close second on Micro F1 and subset accuracy, and MultinomialNB is a fast baseline that trades recall for precision.
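
The comparison can be checked mechanically from the numbers printed in the evaluation cells. The sketch below copies those reported values into a dictionary (the variable names are illustrative) and ranks the models per metric.

```python
# Values copied from the printed metrics and the "macro avg" report rows above.
reported = {
    "ComplementNB": {"subset_accuracy": 0.620, "micro_f1": 0.796, "macro_f1": 0.759,
                     "macro_precision": 0.85, "macro_recall": 0.74},
    "MultinomialNB": {"subset_accuracy": 0.589, "micro_f1": 0.776, "macro_f1": 0.701,
                      "macro_precision": 0.92, "macro_recall": 0.63},
    "LogisticRegression": {"subset_accuracy": 0.613, "micro_f1": 0.795, "macro_f1": 0.731,
                           "macro_precision": 0.91, "macro_recall": 0.66},
}

rankings = {}
for metric in ["subset_accuracy", "micro_f1", "macro_f1", "macro_precision", "macro_recall"]:
    # Sort model names from best to worst on this metric.
    rankings[metric] = sorted(reported, key=lambda name: reported[name][metric], reverse=True)
    print(f"{metric:16s}", " > ".join(rankings[metric]))
```

ComplementNB ranks first on every metric except macro precision, where MultinomialNB leads.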

## Analyze Specific Examples

Let's examine how each model performs on specific test examples. We'll see:
- The original email text
- The cleaned/processed version
- True labels vs. predicted labels
- Prediction probabilities for each label
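
The step from per-label probabilities to predicted label names is the same for all three models, so it can be factored into a small helper. The sketch below is one way to write it; `labels_from_proba` is a hypothetical name, and of the example probabilities only the first two come from the Example #15 output, while the third is made up.

```python
def labels_from_proba(proba, classes, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    return [label for label, p in zip(classes, proba) if p >= threshold]


def ranked_labels(proba, classes):
    """Return (label, probability) pairs sorted from most to least confident."""
    return sorted(zip(classes, proba), key=lambda pair: pair[1], reverse=True)


classes = ["Business", "Customer Support", "Events & Invitations"]
proba = [0.982, 0.908, 0.100]  # third value is a made-up placeholder

print(labels_from_proba(proba, classes))
print(ranked_labels(proba, classes)[0])
```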

In [10]:
# Select specific test examples to analyze
example_indices = [15, 72, 170]

# Loop through each example
for idx in example_indices:
    # Preprocess the email text
    email_text_processed = preprocess_email(X_test_raw[idx])
    
    # Transform to TF-IDF vector
    vec = vectorizer.transform([email_text_processed])
    
    # Get prediction probabilities for all labels
    proba = clf_cnb.predict_proba(vec)[0]
    
    # Convert probabilities to binary predictions (threshold = 0.5)
    pred_bin = (proba >= 0.5).astype(int)
    
    # Extract predicted label names
    pred_labels = [mlb.classes_[i] for i, val in enumerate(pred_bin) if val == 1]

    # Display results
    print("-"*80)
    print(f"Example #{idx} (ComplementNB)")
    print("Raw text:", X_test_raw[idx][:400], "...")
    print("Cleaned text:", email_text_processed[:400], "...")
    print("True labels:", labels_test_raw[idx])
    print("Predicted labels:", pred_labels)
    print("Probabilities:", {k: round(v, 3) for k, v in zip(mlb.classes_, proba)})

--------------------------------------------------------------------------------
Example #15 (ComplementNB)
Raw text: Important Update: Company Policy Changes Dear Team, I hope this email finds you well. I wanted to inform you about some recent updates to our company policies. These changes are aimed at improving efficiency and ensuring compliance with industry standards. Please review the attached document for detailed information. Should you have any questions or need further clarification, feel free to reach o ...
Cleaned text: import updat compani polici chang dear team hope email find well want inform recent updat compani polici chang aim improv effici ensur complianc industri standard pleas review attach document detail inform question need clarif feel free reach thank attent matter best regard ...
True labels: ['Business', 'Customer Support']
Predicted labels: ['Business', 'Customer Support']
Probabilities: {'Business': 0.982, 'Customer Support': 0.908, 'Events & Invitations': 0

In [11]:
# Analyze the same examples using MultinomialNB
for idx in example_indices:
    # Preprocess and vectorize
    email_text_processed = preprocess_email(X_test_raw[idx])
    vec = vectorizer.transform([email_text_processed])
    
    # Get predictions from MultinomialNB model
    proba = clf_mnb.predict_proba(vec)[0]
    pred_bin = (proba >= 0.5).astype(int)
    pred_labels = [mlb.classes_[i] for i, val in enumerate(pred_bin) if val == 1]

    # Display results
    print("-"*80)
    print(f"Example #{idx} (MultinomialNB)")
    print("Raw text:", X_test_raw[idx][:400], "...")
    print("Cleaned text:", email_text_processed[:400], "...")
    print("True labels:", labels_test_raw[idx])
    print("Predicted labels:", pred_labels)
    print("Probabilities:", {k: round(v, 3) for k, v in zip(mlb.classes_, proba)})

--------------------------------------------------------------------------------
Example #15 (MultinomialNB)
Raw text: Important Update: Company Policy Changes Dear Team, I hope this email finds you well. I wanted to inform you about some recent updates to our company policies. These changes are aimed at improving efficiency and ensuring compliance with industry standards. Please review the attached document for detailed information. Should you have any questions or need further clarification, feel free to reach o ...
Cleaned text: import updat compani polici chang dear team hope email find well want inform recent updat compani polici chang aim improv effici ensur complianc industri standard pleas review attach document detail inform question need clarif feel free reach thank attent matter best regard ...
True labels: ['Business', 'Customer Support']
Predicted labels: ['Business', 'Customer Support']
Probabilities: {'Business': 0.979, 'Customer Support': 0.539, 'Events & Invitations': 

In [12]:
# Analyze the same examples using Logistic Regression
for idx in example_indices:
    # Preprocess and vectorize
    email_text_processed = preprocess_email(X_test_raw[idx])
    vec = vectorizer.transform([email_text_processed])
    
    # Get predictions from Logistic Regression model
    proba = clf_lr.predict_proba(vec)[0]
    pred_bin = (proba >= 0.5).astype(int)
    pred_labels = [mlb.classes_[i] for i, val in enumerate(pred_bin) if val == 1]

    # Display results
    print("-"*80)
    print(f"Example #{idx} (LogisticRegression)")
    print("Raw text:", X_test_raw[idx][:400], "...")
    print("Cleaned text:", email_text_processed[:400], "...")
    print("True labels:", labels_test_raw[idx])
    print("Predicted labels:", pred_labels)
    print("Probabilities:", {k: round(v, 3) for k, v in zip(mlb.classes_, proba)})

--------------------------------------------------------------------------------
Example #15 (LogisticRegression)
Raw text: Important Update: Company Policy Changes Dear Team, I hope this email finds you well. I wanted to inform you about some recent updates to our company policies. These changes are aimed at improving efficiency and ensuring compliance with industry standards. Please review the attached document for detailed information. Should you have any questions or need further clarification, feel free to reach o ...
Cleaned text: import updat compani polici chang dear team hope email find well want inform recent updat compani polici chang aim improv effici ensur complianc industri standard pleas review attach document detail inform question need clarif feel free reach thank attent matter best regard ...
True labels: ['Business', 'Customer Support']
Predicted labels: ['Business', 'Customer Support']
Probabilities: {'Business': 0.83, 'Customer Support': 0.675, 'Events & Invitation

## Example Analysis – Multilabel Classification

**ComplementNB:** Predicted all true labels correctly on the three inspected examples (test indices 15, 72, and 170), including the multi-label emails. Its probabilities for the correct classes are high (for Example #15: Business 0.982, Customer Support 0.908).

**MultinomialNB:** Correct on simpler examples, but missed one label in a multi-label case (Example #72), in line with its lower recall on the full test set. On Example #15 it assigns Customer Support only 0.539, just above the 0.5 threshold.

**Logistic Regression:** Predicted all labels correctly in most cases, with lower probability scores than ComplementNB (for Example #15: Business 0.83, Customer Support 0.675).

Takeaway:
On these three examples, ComplementNB predicts multi-label emails with high confidence, MultinomialNB can miss labels that sit near the threshold, and Logistic Regression is correct with more moderate probabilities. Three examples are only anecdotal, so the aggregate metrics above remain the more reliable basis for comparison.