

This notebook demonstrates the rule-based email labeling process used in our priority classification system. We walk through how emails are automatically assigned HIGH, MEDIUM, or LOW priority labels based on content analysis.

In this notebook, we will:
- Understand the rule-based labeling heuristics
- See real examples from the Enron dataset
- Analyze the distribution of priority labels
- Test custom labeling rules

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("📚 Libraries loaded successfully!")
print(f"📁 Working directory: {Path.cwd()}")


## 📖 Step 1: Load Sample Data

Let's start by loading our processed Enron email dataset to see the labeled examples.

In [None]:
# Load the processed email dataset
data_path = Path('../data/enron_sample_data.csv')

if data_path.exists():
    df = pd.read_csv(data_path)
    print(f"✅ Loaded {len(df)} emails from dataset")
    print(f"📊 Columns: {list(df.columns)}")
    print(f"\n📈 Dataset shape: {df.shape}")
else:
    print("❌ Dataset not found. Please run the training script first.")
    print("💡 Run: python scripts/train_model.py --extract 1000")
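
The later cells read the `subject`, `body`, and `priority` columns, so it can help to confirm they exist right after loading. The following is a minimal sketch of such a check; `missing_columns` is a hypothetical helper (not part of the project code), and the default list of required columns is an assumption based on what the rest of this notebook accesses.

```python
def missing_columns(columns, required=("subject", "body", "priority")):
    """Return the required column names that are absent, in the order given.

    `columns` can be any iterable of names, for example `df.columns`.
    The default `required` tuple is an assumption based on the columns
    that later cells of this notebook read.
    """
    present = set(columns)
    return [name for name in required if name not in present]


# Illustrative call with a made-up header
print(missing_columns(["subject", "body", "date"]))
```

In the notebook this could be called as `missing_columns(df.columns)` immediately after `pd.read_csv`, stopping early with a clear message if the returned list is not empty.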

In [None]:
# Display basic information about the dataset
if 'df' in locals():
    print("📋 Dataset Overview:")
    print("=" * 50)
    display(df.head())
    
    print("\n📊 Data Types:")
    print(df.dtypes)
    
    print("\n🔍 Missing Values:")
    print(df.isnull().sum())

## 🎯 Step 2: Priority Distribution Analysis

Let's analyze how emails are distributed across priority levels and understand the labeling patterns.

In [None]:
if 'df' in locals():
    # Analyze priority distribution
    priority_counts = df['priority'].value_counts()
    priority_percentages = df['priority'].value_counts(normalize=True) * 100
    
    print("📊 Priority Distribution:")
    print("=" * 30)
    for priority in ['HIGH', 'MEDIUM', 'LOW']:
        count = priority_counts.get(priority, 0)
        percentage = priority_percentages.get(priority, 0)
        print(f"{priority:>6}: {count:>4} emails ({percentage:>5.1f}%)")
    
    # Create visualization
    # value_counts() sorts by frequency, so colors are looked up by label
    # rather than by position to keep HIGH red, MEDIUM orange, and LOW green
    priority_colors = {'HIGH': '#ff4757', 'MEDIUM': '#ffa502', 'LOW': '#2ed573'}
    colors = [priority_colors.get(p, '#a4b0be') for p in priority_counts.index]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    # Bar chart
    priority_counts.plot(kind='bar', ax=ax1, color=colors)
    ax1.set_title('📊 Email Count by Priority Level')
    ax1.set_xlabel('Priority Level')
    ax1.set_ylabel('Number of Emails')
    ax1.tick_params(axis='x', rotation=0)
    
    # Pie chart
    priority_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%', colors=colors)
    ax2.set_title('🥧 Priority Distribution')
    ax2.set_ylabel('')
    
    plt.tight_layout()
    plt.show()

## 🔍 Step 3: Understanding the Labeling Algorithm

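Now let's implement the labeling algorithm that assigns priorities based on email content. It checks HIGH keywords first, then LOW, then MEDIUM; if no keyword matches, it falls back to subject and recipient heuristics and finally defaults to MEDIUM.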

In [None]:
class EmailLabeler:
    """Intelligent email priority labeling system"""
    
    def __init__(self):
        # HIGH priority indicators
        self.high_indicators = [
            'urgent', 'asap', 'emergency', 'critical', 'deadline today',
            'immediately', 'crisis', 'breaking', 'alert', 'help!',
            'server down', 'system failure', 'security breach', 'urgent response',
            'action required', 'time sensitive'
        ]
        
        # MEDIUM priority indicators
        self.medium_indicators = [
            'meeting', 'schedule', 'deadline', 'important', 'review',
            'approval', 'decision', 'update', 'conference call',
            'project', 'budget', 'quarterly', 'client', 'proposal',
            'contract', 'presentation', 'report'
        ]
        
        # LOW priority indicators
        self.low_indicators = [
            'fyi', 'for your information', 'newsletter', 'announcement',
            'heads up', 'just to let you know', 'social', 'lunch',
            'birthday', 'holiday', 'vacation', 'thanks', 'congratulations',
            'welcome', 'notice'
        ]
    
    def assign_priority(self, subject, body, to_field=''):
        """Assign priority based on email content"""
        # Keep the original subject for the ALL CAPS check, then lowercase for matching
        raw_subject = str(subject or '')
        subject = raw_subject.lower()
        body = str(body or '').lower()
        combined = subject + ' ' + body
        
        reasons = []  # Track why this priority was assigned
        
        # Check for HIGH priority indicators (these take precedence over everything else)
        high_matches = [indicator for indicator in self.high_indicators if indicator in combined]
        if high_matches:
            reasons.append(f"High keywords: {', '.join(high_matches)}")
            return 'HIGH', reasons
        
        # Check LOW before MEDIUM so that obviously low-priority emails are not
        # promoted by generic business terms such as 'update' or 'report'
        low_matches = [indicator for indicator in self.low_indicators if indicator in combined]
        if low_matches:
            reasons.append(f"Low keywords: {', '.join(low_matches)}")
            return 'LOW', reasons
        
        # Check for MEDIUM priority indicators
        medium_matches = [indicator for indicator in self.medium_indicators if indicator in combined]
        if medium_matches:
            reasons.append(f"Medium keywords: {', '.join(medium_matches)}")
            return 'MEDIUM', reasons
        
        # Additional heuristics (only reached when no keyword matched)
        if subject:
            # Short, direct subjects often indicate urgency
            if len(subject.split()) <= 3 and any(word in subject for word in ['help', 'issue', 'problem']):
                reasons.append("Short urgent subject")
                return 'HIGH', reasons
            
            # ALL CAPS subject indicates urgency; this must use the original text,
            # because the lowercased subject can never satisfy isupper()
            if raw_subject.isupper() and len(raw_subject) > 5:
                reasons.append("ALL CAPS subject")
                return 'HIGH', reasons
        
        # Multiple recipients might indicate importance
        to_count = str(to_field or '').count('@')
        if to_count > 5:
            reasons.append(f"Multiple recipients ({to_count})")
            return 'MEDIUM', reasons
        
        # Default to MEDIUM for business emails
        reasons.append("Default business email classification")
        return 'MEDIUM', reasons

# Initialize the labeler
labeler = EmailLabeler()
print("🏷️ Email labeler initialized!")
print(f"📈 High priority keywords: {len(labeler.high_indicators)}")
print(f"📊 Medium priority keywords: {len(labeler.medium_indicators)}")
print(f"📉 Low priority keywords: {len(labeler.low_indicators)}")
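
`assign_priority` matches keywords with plain substring tests (`indicator in combined`), so `'alert'` also fires on "alertness" and `'report'` on "reported". One possible refinement, sketched here under the assumption that whole-word matching is the desired behavior, is to require a non-word character on each side of the keyword. `find_keywords` is a hypothetical helper for illustration and is not used elsewhere in this notebook.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in `text` as whole words or phrases."""
    text = str(text or "").lower()
    found = []
    for keyword in keywords:
        # (?<!\w) and (?!\w) require a non-word character (or the string edge)
        # on each side; unlike \b this also works for keywords like 'help!'.
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(keyword)
    return found


sample = "Thanks to your alertness, the issue was reported quickly."
keywords = ["alert", "report", "thanks"]

# Whole-word matching keeps only 'thanks'
print(find_keywords(sample, keywords))
# Substring matching (as in assign_priority) reports all three keywords
print([kw for kw in keywords if kw in sample.lower()])
```

The trade-off is that inflected forms such as "meetings" would no longer match `'meeting'`, so the keyword lists would need to include those variants explicitly.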

## 🧪 Step 4: Test the Labeling Algorithm

Let's test our labeling algorithm with some example emails to see how it works.

In [None]:
# Test emails with different priority levels
test_emails = [
    {
        'subject': 'URGENT: Server maintenance required',
        'body': 'The production server is experiencing critical issues and needs immediate attention. Please contact the IT team ASAP.',
        'expected': 'HIGH'
    },
    {
        'subject': 'Team meeting tomorrow',
        'body': 'Please join us for the quarterly review meeting tomorrow at 2 PM in the conference room.',
        'expected': 'MEDIUM'
    },
    {
        'subject': 'FYI: New company newsletter',
        'body': 'Attached is the latest company newsletter with updates from various departments.',
        'expected': 'LOW'
    },
    {
        'subject': 'Help needed',
        'body': 'Server crashed',
        'expected': 'HIGH'
    },
    {
        'subject': 'Birthday party invitation',
        'body': 'You are invited to join us for lunch to celebrate Sarah\'s birthday next Friday.',
        'expected': 'LOW'
    }
]

print("🧪 Testing Email Labeling Algorithm")
print("=" * 50)

correct_predictions = 0
for i, email in enumerate(test_emails, 1):
    predicted_priority, reasons = labeler.assign_priority(
        email['subject'], 
        email['body']
    )
    
    is_correct = predicted_priority == email['expected']
    correct_predictions += is_correct
    
    status = "✅" if is_correct else "❌"
    
    print(f"\n{status} Test {i}: {email['subject'][:30]}...")
    print(f"   Expected: {email['expected']:<6} | Predicted: {predicted_priority:<6}")
    print(f"   Reason: {reasons[0] if reasons else 'No specific reason'}")

accuracy = correct_predictions / len(test_emails) * 100
print(f"\n🎯 Labeling Accuracy: {accuracy:.1f}% ({correct_predictions}/{len(test_emails)})")
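
A single accuracy figure hides which priority level the rules get wrong, and five hand-picked emails are only a smoke test. The sketch below tallies results per class; the helper names and the result pairs are made up for illustration and do not come from the dataset.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count how often each (expected, predicted) label pair occurs."""
    return Counter(pairs)


def per_class_recall(pairs):
    """Fraction of emails of each expected class that were predicted correctly."""
    totals = Counter(expected for expected, _ in pairs)
    hits = Counter(expected for expected, predicted in pairs if expected == predicted)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up (expected, predicted) pairs for illustration only
results = [("HIGH", "HIGH"), ("HIGH", "MEDIUM"), ("MEDIUM", "MEDIUM"), ("LOW", "LOW")]
print(confusion_counts(results))
print(per_class_recall(results))
```

In the test loop above, the pairs could be collected as `(email['expected'], predicted_priority)` and passed to these helpers once the loop finishes.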

## 📊 Step 5: Analyze Real Dataset Labels

Now let's look at some real emails from the dataset alongside the priority labels they were assigned.

In [None]:
if 'df' in locals():
    # Analyze a sample of emails from each priority level
    print("📧 Sample Emails by Priority Level")
    print("=" * 60)
    
    for priority in ['HIGH', 'MEDIUM', 'LOW']:
        priority_emails = df[df['priority'] == priority]
        
        if len(priority_emails) > 0:
            icon = {'HIGH': '🔴', 'MEDIUM': '🟡'}.get(priority, '🟢')
            print(f"\n{icon} {priority} PRIORITY EXAMPLES:")
            print("-" * 40)
            
            # Show up to 3 examples
            sample_size = min(3, len(priority_emails))
            samples = priority_emails.sample(n=sample_size, random_state=42)
            
            for idx, (_, email) in enumerate(samples.iterrows(), 1):
                subject = str(email.get('subject', 'No subject'))[:50]
                body = str(email.get('body', 'No body'))[:100]
                
                print(f"{idx}. Subject: {subject}...")
                print(f"   Body: {body}...")
                print()
        else:
            print(f"\n❌ No {priority} priority emails found in dataset")
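
The cell above prints labels already stored in the `priority` column. To check how closely the rules in this notebook reproduce those stored labels, the labeler can be re-applied row by row and compared. The sketch below is self-contained: `toy_assign_priority` is a hypothetical stand-in for `labeler.assign_priority`, and the three-row frame is made-up data, not taken from the Enron sample.

```python
import pandas as pd


def toy_assign_priority(subject, body):
    """Hypothetical stand-in for labeler.assign_priority; returns (label, reasons)."""
    text = f"{subject} {body}".lower()
    if "urgent" in text:
        return "HIGH", ["High keywords: urgent"]
    if "fyi" in text:
        return "LOW", ["Low keywords: fyi"]
    return "MEDIUM", ["Default business email classification"]


# Made-up rows for illustration only
sample = pd.DataFrame({
    "subject": ["URGENT: outage", "FYI: newsletter", "Budget review"],
    "body": ["Please respond", "Attached", "See the draft"],
    "priority": ["HIGH", "LOW", "LOW"],
})

# Re-label every row and keep only the label part of the (label, reasons) tuple
sample["relabeled"] = [
    toy_assign_priority(s, b)[0] for s, b in zip(sample["subject"], sample["body"])
]

agreement = (sample["relabeled"] == sample["priority"]).mean()
print(f"Agreement with stored labels: {agreement:.1%}")
print(pd.crosstab(sample["priority"], sample["relabeled"]))
```

With the real data, `sample` would be replaced by `df` and `toy_assign_priority` by `labeler.assign_priority`. Disagreements may simply mean the stored labels were produced by a different version of the rules.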

## 🔍 Step 6: Keyword Analysis

Let's analyze which keywords appear most frequently in each priority category.

In [None]:
if 'df' in locals():
    def extract_keywords(text, keyword_list):
        """Extract keywords that appear in the text"""
        text = str(text).lower()
        found_keywords = []
        for keyword in keyword_list:
            if keyword in text:
                found_keywords.append(keyword)
        return found_keywords
    
    # Analyze keyword frequency by priority
    keyword_analysis = {}
    
    for priority in ['HIGH', 'MEDIUM', 'LOW']:
        priority_emails = df[df['priority'] == priority]
        
        if len(priority_emails) > 0:
            all_keywords = []
            
            # Get appropriate keyword list
            if priority == 'HIGH':
                keyword_list = labeler.high_indicators
            elif priority == 'MEDIUM':
                keyword_list = labeler.medium_indicators
            else:
                keyword_list = labeler.low_indicators
            
            # Extract keywords from all emails in this priority
            for _, email in priority_emails.iterrows():
                combined_text = str(email.get('subject', '')) + ' ' + str(email.get('body', ''))
                keywords = extract_keywords(combined_text, keyword_list)
                all_keywords.extend(keywords)
            
            # Count keyword frequency
            keyword_counts = Counter(all_keywords)
            keyword_analysis[priority] = keyword_counts
    
    # Display top keywords for each priority
    print("🔍 Most Frequent Keywords by Priority")
    print("=" * 50)
    
    for priority, keyword_counts in keyword_analysis.items():
        icon = "🔴" if priority == 'HIGH' else "🟡" if priority == 'MEDIUM' else "🟢"
        print(f"\n{icon} {priority} Priority Top Keywords:")
        
        if keyword_counts:
            top_keywords = keyword_counts.most_common(10)
            for keyword, count in top_keywords:
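                # Each keyword is counted at most once per email, so this is the
                # share of emails in this priority level that contain the keyword
                percentage = (count / len(df[df['priority'] == priority])) * 100
                print(f"  {keyword:20} {count:3d} emails ({percentage:4.1f}%)")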
        else:
            print("  No matching keywords found")

## 📈 Step 7: Email Length Analysis

Let's see if there's a correlation between email length and priority level.

In [None]:
if 'df' in locals():
    # Calculate email lengths
    df['subject_length'] = df['subject'].fillna('').astype(str).str.len()
    df['body_length'] = df['body'].fillna('').astype(str).str.len()
    df['total_length'] = df['subject_length'] + df['body_length']
    
    # Analyze length by priority
    length_stats = df.groupby('priority')[['subject_length', 'body_length', 'total_length']].agg([
        'mean', 'median', 'std'
    ]).round(2)
    
    print("📏 Email Length Analysis by Priority")
    print("=" * 60)
    display(length_stats)
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Subject length distribution
    for priority in ['HIGH', 'MEDIUM', 'LOW']:
        data = df[df['priority'] == priority]['subject_length']
        axes[0, 0].hist(data, alpha=0.7, label=priority, bins=20)
    axes[0, 0].set_title('📝 Subject Length Distribution')
    axes[0, 0].set_xlabel('Subject Length (characters)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].legend()
    
    # Body length distribution (log scale for better visualization)
    for priority in ['HIGH', 'MEDIUM', 'LOW']:
        data = df[df['priority'] == priority]['body_length']
        data = data[data > 0]  # Remove zero-length bodies
        axes[0, 1].hist(np.log10(data + 1), alpha=0.7, label=priority, bins=20)
    axes[0, 1].set_title('📄 Body Length Distribution (Log Scale)')
    axes[0, 1].set_xlabel('Log10(Body Length + 1)')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].legend()
    
    # Box plot for subject length
    df.boxplot(column='subject_length', by='priority', ax=axes[1, 0])
    axes[1, 0].set_title('📝 Subject Length by Priority')
    axes[1, 0].set_xlabel('Priority')
    axes[1, 0].set_ylabel('Subject Length')
    
    # Box plot for total length (log scale)
    df_log = df.copy()
    df_log['log_total_length'] = np.log10(df_log['total_length'] + 1)
    df_log.boxplot(column='log_total_length', by='priority', ax=axes[1, 1])
    axes[1, 1].set_title('📊 Total Length by Priority (Log Scale)')
    axes[1, 1].set_xlabel('Priority')
    axes[1, 1].set_ylabel('Log10(Total Length + 1)')
    
    plt.suptitle('Email Length Analysis', fontsize=16)
    plt.tight_layout()
    plt.show()
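
The plots above show the distributions but do not quantify the relationship. One way to put a number on it, sketched here under the assumption that the labels can be treated as ordered (LOW < MEDIUM < HIGH), is a Spearman rank correlation. `PRIORITY_RANK` and `length_priority_correlation` are hypothetical names, and the toy frame is made-up data.

```python
import pandas as pd

# Assumed ordinal encoding of the labels; the notebook itself treats them as strings
PRIORITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def length_priority_correlation(frame, length_col="total_length"):
    """Spearman rank correlation between an email length column and priority."""
    ordinal = frame["priority"].map(PRIORITY_RANK)
    valid = ordinal.notna() & frame[length_col].notna()
    # Spearman's rho is Pearson's r computed on ranks (ties get average ranks)
    return frame.loc[valid, length_col].rank().corr(ordinal[valid].rank())


# Made-up data in which longer emails have higher priority
toy = pd.DataFrame({
    "priority": ["LOW", "LOW", "MEDIUM", "MEDIUM", "HIGH", "HIGH"],
    "total_length": [10, 20, 30, 40, 50, 60],
})
rho = length_priority_correlation(toy)
print(f"Spearman rho: {rho:.3f}")
```

A value near 0 on the real data would suggest that length carries little information about priority, while a value near +1 or -1 would suggest a strong monotonic relationship.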

## 🎮 Step 8: Interactive Labeling Test

Try the labeling algorithm with your own custom emails!

In [None]:
def test_custom_email():
    """Interactive function to test custom emails"""
    print("🎮 Custom Email Priority Testing")
    print("=" * 40)
    print("Enter an email to see how it would be labeled:")
    print("(Leave empty and press Enter to use examples)\n")
    
    subject = input("📧 Email Subject: ").strip()
    
    if not subject:
        # Use example emails if no input provided
        examples = [
            ("EMERGENCY: Database corruption detected", "All systems are down. Need immediate assistance."),
            ("Project deadline reminder", "Please remember the project is due next Friday."),
            ("Office lunch invitation", "Join us for pizza in the break room at noon!")
        ]
        
        print("Using example emails:\n")
        for i, (subj, body) in enumerate(examples, 1):
            priority, reasons = labeler.assign_priority(subj, body)
            priority_icon = "🔴" if priority == 'HIGH' else "🟡" if priority == 'MEDIUM' else "🟢"
            
            print(f"{i}. Subject: {subj}")
            print(f"   Body: {body}")
            print(f"   {priority_icon} Priority: {priority}")
            print(f"   Reasoning: {reasons[0] if reasons else 'Default classification'}\n")
    else:
        body = input("📝 Email Body: ").strip()
        to_field = input("👥 To field (optional): ").strip()
        
        priority, reasons = labeler.assign_priority(subject, body, to_field)
        priority_icon = "🔴" if priority == 'HIGH' else "🟡" if priority == 'MEDIUM' else "🟢"
        
        print(f"\n{priority_icon} Predicted Priority: {priority}")
        print(f"🧠 Reasoning: {reasons[0] if reasons else 'Default classification'}")

# Run the interactive test
test_custom_email()

## 📋 Step 9: Summary and Insights

Let's summarize what we've learned about the email labeling process.

In [None]:
if 'df' in locals():
    print("📋 Email Labeling Process Summary")
    print("=" * 50)
    
    total_emails = len(df)
    priority_dist = df['priority'].value_counts()
    
    print(f"📊 Dataset Statistics:")
    print(f"   Total emails processed: {total_emails:,}")
    print(f"   High priority: {priority_dist.get('HIGH', 0):,} ({priority_dist.get('HIGH', 0)/total_emails*100:.1f}%)")
    print(f"   Medium priority: {priority_dist.get('MEDIUM', 0):,} ({priority_dist.get('MEDIUM', 0)/total_emails*100:.1f}%)")
    print(f"   Low priority: {priority_dist.get('LOW', 0):,} ({priority_dist.get('LOW', 0)/total_emails*100:.1f}%)")
    
    print(f"\n🎯 Labeling Algorithm Features:")
    print(f"   High priority keywords: {len(labeler.high_indicators)}")
    print(f"   Medium priority keywords: {len(labeler.medium_indicators)}")
    print(f"   Low priority keywords: {len(labeler.low_indicators)}")
    
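    # Computed directly from the text so this cell does not depend on Step 7 having run
    avg_subject_len = df['subject'].fillna('').astype(str).str.len().mean()
    avg_body_len = df['body'].fillna('').astype(str).str.len().mean()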
    
    print(f"\n📏 Email Characteristics:")
    print(f"   Average subject length: {avg_subject_len:.1f} characters")
    print(f"   Average body length: {avg_body_len:.1f} characters")
    
    print(f"\n🔍 Key Insights:")
    print(f"   • The most common label in this dataset is {priority_dist.idxmax()} (MEDIUM is also the fallback for unmatched emails)")
    print("   • HIGH priority emails flag messages that need immediate attention")
    print("   • LOW priority emails are informational and can be processed later")
    print("   • Keyword-based heuristics give a fast, rule-driven signal of urgency and importance")
    print("   • The labeling algorithm provides transparency by returning the reason for each label")
    
    print(f"\n🚀 Next Steps:")
    print(f"   1. Train machine learning models using these labeled emails")
    print(f"   2. Evaluate model performance with cross-validation")
    print(f"   3. Fine-tune the labeling rules based on domain-specific needs")
    print(f"   4. Deploy the system for real-time email classification")
else:
    print("❌ No dataset loaded. Please run the data extraction first.")
    print("💡 Command: python scripts/train_model.py --extract 1000")

## 🎓 Conclusion

This notebook has demonstrated the intelligent email labeling process used in our priority classification system. The key takeaways are:

### 🏆 Achievements
- **Rule-based Intelligence**: Our heuristic approach effectively identifies priority levels
- **Transparency**: Each classification includes reasoning for interpretability
- **Scalability**: The algorithm can process thousands of emails quickly
- **Flexibility**: Keywords and rules can be easily customized for different domains

### 🔬 Technical Approach
- **Multi-level Analysis**: Subject, body, and metadata are all considered
- **Keyword Matching**: Domain-specific terms indicate urgency and importance
- **Fallback Logic**: Default classification ensures all emails are labeled
- **Training Ready**: The heuristic labels serve as training targets for ML models, although they are rule-generated rather than human-verified ground truth

### 📈 Business Impact
- **Productivity**: Automatic prioritization saves time and reduces email overload
- **Responsiveness**: Critical emails are identified for immediate attention
- **Organization**: Systematic approach to email management
- **Throughput**: Rule matching is cheap enough to handle large volumes of corporate email

This labeling process forms the foundation for training our machine learning models, enabling accurate automatic email priority classification in production environments.

---

**📧 Built for MIS520 Final Project | Fall 2025**