# Enron Email Extraction

This notebook extracts various fields from the Enron email dataset, including:
- Message-ID
- Date
- From
- To
- Cc / Bcc
- Subject
- Mime-Version, Content-Type, Content-Transfer-Encoding
- Body (plain text/HTML content)
- Attachments
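
As a quick illustration of the fields listed above, the sketch below reads them from a small in-memory message with the standard library `email` package. The sample message and the `FIELDS` list are made up for this example and are not taken from the Enron corpus.

```python
from email import message_from_string, policy

# Hypothetical sample message, not taken from the Enron corpus.
RAW = """\
Message-ID: <12345.example@example.com>
Date: Mon, 14 May 2001 16:39:00 -0700
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Quarterly forecast
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Please see the numbers below.
"""

msg = message_from_string(RAW, policy=policy.default)

FIELDS = ["Message-ID", "Date", "From", "To", "Cc", "Bcc", "Subject",
          "Mime-Version", "Content-Type", "Content-Transfer-Encoding"]

# Header lookup is case-insensitive; headers that are absent fall back to "".
headers = {name: str(msg.get(name, "")) for name in FIELDS}

# For a single-part text message, get_content() returns the decoded body.
body = msg.get_content()

for name, value in headers.items():
    print(f"{name}: {value!r}")
print(f"Body: {body!r}")
```

Headers that are missing from the message (here `Cc` and `Bcc`) come back as empty strings rather than raising an error.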

## Additional Cleaning For Modern EML Files

We also had to clean the message body. Modern .eml files are usually cluttered with HTML, CSS, scripts, and other artifacts, so we extended the parser to handle them and keep the extracted text readable for the end user.

In [85]:
# Import required libraries
import os
import email
import pandas as pd
from email import policy
from email.parser import BytesParser
from bs4 import BeautifulSoup
import json
import warnings
import re
import unicodedata

warnings.filterwarnings('ignore')

print("Imported libraries successfully.")

Imported libraries successfully.


In [86]:
# Set of known zero-width / formatting chars that often survive isprintable()
ZERO_WIDTH = {
    "\u034F",       # Combining Grapheme Joiner (Mn)
    "\u200B", "\u200C", "\u200D",  # ZWSP/ZWNJ/ZWJ (Cf)
    "\uFEFF",       # BOM / ZWNBSP (Cf)
    "\u2060",       # Word joiner (Cf)
    "\u00AD",       # Soft hyphen (Cf)
}

# Many wide/special spaces to normalize to a plain space
SPACE_LIKE = {
    "\u00A0", "\u1680", "\u2000", "\u2001", "\u2002", "\u2003",
    "\u2004", "\u2005", "\u2006", "\u2007", "\u2008", "\u2009",
    "\u200A", "\u202F", "\u205F", "\u3000"
}

def clean_tada_body(text: str) -> str:
    """Strip invisible characters and normalize whitespace in an extracted email body."""
    if not text:
        return ""

    # 1) Unicode normalize to simplify odd forms
    text = unicodedata.normalize("NFKC", text)

    out = []
    for ch in text:
        # Keep newlines intact (we'll collapse multiples later)
        if ch == "\n":
            out.append("\n")
            continue

        # Drop zero-width / formatting junk
        if ch in ZERO_WIDTH or unicodedata.category(ch) == "Cf":
            continue

        # Normalize any exotic spaces to a regular space
        if ch in SPACE_LIKE or ch.isspace():
            out.append(" ")
        else:
            out.append(ch)

    text = "".join(out)

    # 2) Collapse runs of spaces to a single space
    text = re.sub(r"[ ]{2,}", " ", text)

    # 3) Clean spaces around newlines
    text = re.sub(r" *\n *", "\n", text)

    # 4) Limit multiple blank lines to max one blank line (e.g., two \n)
    text = re.sub(r"\n{3,}", "\n\n", text)

    # 5) Trim each line and overall
    text = "\n".join(line.strip() for line in text.splitlines()).strip()

    return text
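
The cleaning function above relies on Unicode general categories, so it may help to see what `unicodedata` reports for a few of the characters involved. The sketch below is only an illustration; the `samples` dictionary is a hypothetical selection, not part of the notebook.

```python
import unicodedata

# Hypothetical selection of characters that show up in marketing emails.
samples = {
    "ZERO WIDTH SPACE": "\u200b",
    "SOFT HYPHEN": "\u00ad",
    "COMBINING GRAPHEME JOINER": "\u034f",
    "NO-BREAK SPACE": "\u00a0",
}

# General category per character: Cf = format, Mn = nonspacing mark, Zs = space separator.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}

# NFKC folds compatibility spaces such as NO-BREAK SPACE into a plain space.
normalized = unicodedata.normalize("NFKC", "a\u00a0b")

for name, cat in categories.items():
    print(f"{name}: {cat}")
print(repr(normalized))
```

The Combining Grapheme Joiner is reported as `Mn` rather than `Cf`, which is why a category check on `Cf` alone would not remove it and an explicit set such as `ZERO_WIDTH` is still useful.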

In [87]:
def clean_html_with_bs4(html_content):
    """
    Clean HTML content using BeautifulSoup to extract readable text
    while removing CSS, scripts, and other artifacts.
    
    Args:
        html_content (str): Raw HTML content from email
    
    Returns:
        str: Clean, readable text
    """
    if not html_content:
        return ""
    
    try:
        # Parse HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Remove script and style elements
        for script in soup(["script", "style"]):
            script.extract()
        
        # Remove hidden elements; ignore case and spaces so "display: none" also matches
        for hidden in soup.find_all(style=lambda x: x and 'display:none' in x.replace(' ', '').lower()):
            hidden.extract()
        
        # Extract text while trying to preserve some structure
        clean_text = soup.get_text(separator='\n', strip=True)
        
        # Clean up excessive whitespace and newlines
        lines = [line.strip() for line in clean_text.split('\n') if line.strip()]
        
        return '\n'.join(lines)
    
    except Exception as e:
        print(f"Error parsing HTML content: {e}")
        # Fallback: return original content if parsing fails
        return html_content.strip()

print("clean_html_with_bs4 function defined!")

clean_html_with_bs4 function defined!
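
Inline styles in real emails are written in many ways (`display:none`, `display: none`, `DISPLAY:NONE !important`), so the hidden-element check is sensitive to formatting. The sketch below shows one possible standalone predicate for that check. `is_hidden_style` is a hypothetical helper introduced here for illustration, and it only looks at `display`, not at other ways of hiding content.

```python
import re

def is_hidden_style(style):
    """Return True when an inline style value sets display to none (hypothetical helper)."""
    if not style:
        return False
    # Remove all whitespace and lowercase, so "Display : None" is treated like "display:none".
    compact = re.sub(r"\s+", "", style).lower()
    # Check each declaration; startswith also accepts "display:none!important".
    return any(decl.startswith("display:none") for decl in compact.split(";"))

for value in ("display:none", "color: red; display: none", "DISPLAY:NONE !important", "color: red", None):
    print(repr(value), is_hidden_style(value))
```

A predicate like this should be usable as the attribute filter, for example `soup.find_all(style=is_hidden_style)`, since it already handles a missing attribute value.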


In [88]:
# Function to parse individual EML files with modern HTML support
def parse_modern_eml_file(file_path):
    """
    Parse an EML file and extract all requested fields, with support for
    HTML content cleaning using BeautifulSoup.
    
    Args:
        file_path (str): Path to the EML file
    
    Returns:
        dict: Extracted email data
    """
    with open(file_path, 'rb') as f:
        msg = BytesParser(policy=policy.default).parse(f)
    
    # Extract standard email headers
    subject = msg.get('Subject', 'No Subject')
    sender = msg.get('From', 'No Sender')
    recipients = msg.get('To', 'No Recipients')
    cc = msg.get('Cc', '')
    bcc = msg.get('Bcc', '')
    date = msg.get('Date', 'No Date')
    message_id = msg.get('Message-ID', 'No Message-ID')
    
    # Extract MIME-related headers
    mime_version = msg.get('Mime-Version', '')
    content_type = msg.get('Content-Type', '')
    content_transfer_encoding = msg.get('Content-Transfer-Encoding', '')
    
    # Decode subject if needed (policy.default normally returns it already decoded)
    try:
        if subject:
            decoded_subject = email.header.decode_header(subject)
            subject = ''.join(
                part[0].decode(part[1] or 'utf-8') if isinstance(part[0], bytes) else str(part[0])
                for part in decoded_subject
            )
    except Exception:
        # Keep the subject as parsed if it cannot be decoded
        pass
    
    # Extract email body with HTML cleaning support
    body = ''
    attachments = []
    
    if msg.is_multipart():
        for part in msg.walk():
            # Use a separate name so the top-level Content-Type header read above is not overwritten
            part_type = part.get_content_type()
            
            # Handle plain text parts
            if part_type == 'text/plain' and not part.get_filename():
                part_body = part.get_payload(decode=True).decode('utf-8', errors='ignore')
                if body:  # If we already have body, append
                    body += '\n' + part_body
                else:
                    body = part_body
            
            # Handle HTML parts with BeautifulSoup cleaning
            elif part_type == 'text/html' and not part.get_filename():
                html_content = part.get_payload(decode=True).decode('utf-8', errors='ignore')
                # Apply the same Unicode cleanup used for single-part HTML messages
                clean_content = clean_tada_body(clean_html_with_bs4(html_content))
                if body:  # If we already have body, append
                    body += '\n' + clean_content
                else:
                    body = clean_content
            
            # Handle attachments
            elif part.get_filename():
                attachments.append(part.get_filename())
    else:
        # Single part message
        payload = msg.get_payload(decode=True)
        if payload:
            content_type = msg.get_content_type()
            
            if content_type == 'text/html':
                body = clean_html_with_bs4(payload.decode('utf-8', errors='ignore'))
                body = clean_tada_body(body)
            else:
                body = payload.decode('utf-8', errors='ignore')
    
    return {
        'file': file_path,
        'message_id': message_id,
        'date': date,
        'from': sender,
        'to': recipients,
        'cc': cc,
        'bcc': bcc,
        'subject': subject,
        'body': body,
        'mime_version': mime_version,
        'content_type': content_type,
        'content_transfer_encoding': content_transfer_encoding,
        'attachments': '; '.join(attachments) if attachments else ''
    }

print("parse_modern_eml_file function defined!")

parse_modern_eml_file function defined!
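
Many modern emails are `multipart/alternative`, carrying the same content as both plain text and HTML. A loop that appends every text part can therefore end up with the body twice. The sketch below shows one possible alternative using `EmailMessage.get_body()` from the standard library, which selects a single representation by preference. The message is built in memory with made-up addresses purely for illustration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Build a hypothetical multipart/alternative message in memory.
original = EmailMessage()
original["Subject"] = "Hotline update"
original["From"] = "sender@example.com"
original["To"] = "reader@example.com"
original.set_content("Our hotline number has changed.")
original.add_alternative(
    "<html><body><p>Our hotline number has <b>changed</b>.</p></body></html>",
    subtype="html",
)

# Round-trip through bytes, as if the message had been read from an .eml file.
msg = BytesParser(policy=policy.default).parsebytes(original.as_bytes())

# walk() visits the container first, then each alternative.
part_types = [part.get_content_type() for part in msg.walk()]

# get_body() returns the first available representation from the preference list.
preferred = msg.get_body(preferencelist=("plain", "html"))
body = preferred.get_content() if preferred is not None else ""

print(part_types)
print(repr(body))
```

`get_content()` also decodes the part using its declared charset, so no manual `decode('utf-8')` call is needed in this variant. Swapping the preference order to `("html", "plain")` would select the HTML part instead, which could then be passed through the HTML cleaning step.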


In [89]:
# Function to extract emails from folder with modern EML support
def extract_modern_emails_from_folder(folder_path, limit=1000):
    """
    Extract email data from all files in the folder with modern HTML support
    
    Args:
        folder_path (str): Path to folder containing EML files
        limit (int): Maximum number of emails to process
    
    Returns:
        list: List of extracted email data dictionaries
    """
    emails_data = []
    email_count = 0
    
    print(f"Starting extraction from {folder_path}...")
    
    if os.path.isfile(folder_path):
        # If it's a single file, process it directly
        try:
            email_data = parse_modern_eml_file(folder_path)
            emails_data.append(email_data)
            email_count = 1
            print(f'Successfully processed 1 email from {folder_path}')
        except Exception as e:
            print(f'Error processing {folder_path}: {e}')
    else:
        # If it's a folder, walk through all files
        for root, dirs, files in os.walk(folder_path):
            if email_count >= limit:
                break
                
            for file in files:
                if email_count >= limit:
                    break
                    
                file_path = os.path.join(root, file)
                try:
                    # Match extensions case-insensitively so files like "MAIL.EML" are not skipped
                    if file.lower().endswith(('.eml', '.msg')):
                        email_data = parse_modern_eml_file(file_path)
                        emails_data.append(email_data)
                        email_count += 1
                        
                        if email_count % 100 == 0:
                            print(f'Processed {email_count} emails...')
                            
                except Exception as e:
                    print(f'Error processing {file_path}: {e}')
                    
            if email_count >= limit:
                break
    
    print(f"Extraction completed! Processed {len(emails_data)} emails.")
    return emails_data

print("extract_modern_emails_from_folder function defined!")

extract_modern_emails_from_folder function defined!
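
The folder walk with a `limit` can also be expressed as a generator combined with `itertools.islice`, which avoids repeating the limit check at several loop levels. This is a minimal sketch under that assumption. `iter_email_paths` is a hypothetical helper, and the files are created in a temporary directory only for the demonstration.

```python
import os
import tempfile
from itertools import islice

def iter_email_paths(folder_path, extensions=(".eml", ".msg")):
    """Yield paths of files under folder_path whose extension matches (hypothetical helper)."""
    for root, _dirs, files in os.walk(folder_path):
        for name in sorted(files):
            # Compare in lowercase so "b.EML" is matched as well.
            if name.lower().endswith(extensions):
                yield os.path.join(root, name)

with tempfile.TemporaryDirectory() as tmp:
    # Create a few throwaway files, including one that should be ignored.
    os.makedirs(os.path.join(tmp, "inbox"))
    for name in ("a.eml", "b.EML", "notes.txt", os.path.join("inbox", "c.eml")):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("Subject: test\n\nbody\n")

    all_paths = [os.path.relpath(p, tmp) for p in iter_email_paths(tmp)]
    # islice stops the walk as soon as the limit is reached.
    limited = list(islice(iter_email_paths(tmp), 2))

print(all_paths)
print(len(limited))
```

Each yielded path could then be handed to `parse_modern_eml_file` inside a `try`/`except`, as the function above does.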


In [90]:
# Test the parsing with the TADA email file
tada_eml_path = 'tada.eml'

if os.path.exists(tada_eml_path):
    print('Parsing TADA email with modern EML parser...')
    
    # Parse the TADA email
    tada_data = parse_modern_eml_file(tada_eml_path)
    
    # Convert to DataFrame for easier viewing
    df_tada = pd.DataFrame([tada_data])
    
    print('\n=== TADA EMAIL PARSING RESULTS ===')
    print(f"Subject: {tada_data['subject']}")
    print(f"From: {tada_data['from']}")
    print(f"Date: {tada_data['date']}")
    print(f"Content-Type: {tada_data['content_type']} (Original)")
    print('\n=== CLEANED BODY PREVIEW ===')
    print(tada_data['body'][:500] + ('...' if len(tada_data['body']) > 500 else ''))
    
else:
    print(f'TADA EML file {tada_eml_path} not found.')
    tada_data = None

print("")
print("Modern EML parser test completed!")

Parsing TADA email with modern EML parser...

=== TADA EMAIL PARSING RESULTS ===
Subject: Ding Dong~ Your TADA Updates are Here! 🔔
From: TADA <noreply@info.tada.global>
Date: Mon, 08 Sep 2025 08:32:23 +0000
Content-Type: text/html (Original)

=== CLEANED BODY PREVIEW ===
📞 New TADA Hotline Number
From today onwards, TADA’s hotline has been updated.
For any enquiries or support, please contact us at our new number:
☎️ 6750 0833
Kindly take note of our updated hotline for your convenience.
Contact Us
Heading to Downtown East? Get this!
Need a break from the busy life?
Until 31 October, use code [DTE2025] to save $5 OFF to/from Downtown East up to 2 redemptions!
*Terms and conditions apply.
Get the Deal!
Exclusive Discounts in Thailand, Hong Kong, and Cambodia for ...

Modern EML parser test completed!


In [91]:
# Save the extracted data to JSON if we have data
if 'tada_data' in locals() and tada_data:
    output_file = 'modern_emails_extracted.json'
    
    # Save as JSON with proper formatting
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump({'emails': [tada_data], 'total_count': 1}, f, indent=2, ensure_ascii=False)
    
    print(f'Parsed data saved to {output_file}')
    
    # Show file details
    file_size = os.path.getsize(output_file)
    print(f'Output file size: {file_size:,} bytes')
else:
    print('No data to save - email parsing failed or file not found.')

Parsed data saved to modern_emails_extracted.json
Output file size: 1,884 bytes
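
The `ensure_ascii=False` argument matters for emails like this one, whose subject and body contain emoji and typographic quotes. The sketch below compares the two settings on a small made-up record. Both forms load back to the same data and differ only in how readable and how large the file is.

```python
import json

# Hypothetical record with non-ASCII characters, similar to the parsed email above.
record = {"subject": "Your updates are here! 🔔", "body": "TADA’s hotline"}

escaped = json.dumps(record)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters are written as-is

# Both encodings decode back to the original record.
round_trip_ok = json.loads(escaped) == json.loads(readable) == record

print(escaped)
print(readable)
print(round_trip_ok)
```

When writing the readable form to disk, the file should be opened with `encoding='utf-8'`, as the cell above does.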


In [92]:
tada_data['body']

"📞 New TADA Hotline Number\nFrom today onwards, TADA’s hotline has been updated.\nFor any enquiries or support, please contact us at our new number:\n☎️ 6750 0833\nKindly take note of our updated hotline for your convenience.\nContact Us\nHeading to Downtown East? Get this!\nNeed a break from the busy life?\nUntil 31 October, use code [DTE2025] to save $5 OFF to/from Downtown East up to 2 redemptions!\n*Terms and conditions apply.\nGet the Deal!\nExclusive Discounts in Thailand, Hong Kong, and Cambodia for DBS Cardholders!\nPlanning a trip to Thailand, Hong Kong, or Cambodia?\nUntil 31 December 2025, you can save with TADA when paying with your\nDBS Cards\n!\nThailand:\nUse code\n[DBSTADABK]\nto get\n40 THB OFF\n.\nHong Kong:\nUse code\n[DBSTADAHK]\nto get\n35 HKD OFF\n.\nCambodia:\nUse code\n[DBSTADAKH]\nto get\n4,000 riels OFF\n.\nRide TADA for easier and more rewarding journeys!\nManage Payment Methods\nHello, Hong Kong! Let’s Ride 🇭🇰\nHeading to Hong Kong soon?\nUntil 31 December 2