<a href="https://colab.research.google.com/github/Kibet-Rotich/Data-analysis-ML-AI/blob/master/Enron_Email_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EMAIL PRIORITY CLASSIFIER USING THE ENRON EMAIL DATASET

This machine learning project is a more complex version of my SMS spam dataset project, where I adapted a spam/ham dataset to fit a specific use case. In this project, I use the Enron email dataset, a large collection of real corporate email that suits this use case better. Each raw email carries the information I need, such as the sender, the recipients, and the content of the email, including the subject line and body. Once again, I will use a random forest classifier.

Let's get started!

# Data collection

I downloaded the Enron dataset from Kaggle, but it is quite large (about 1.32 GB). To make it more manageable on Colab, I drew a random sample of 10,000 records using the script below, which I ran locally on my machine.

In [None]:
import pandas as pd

def sample_emails(input_file, output_file, sample_size=10000, random_state=42):
    """
    Sample random records from a CSV file and save them to a new CSV.

    Args:
        input_file (str): Path to the input CSV file
        output_file (str): Path where the sampled dataset will be saved
        sample_size (int): Number of records to sample
        random_state (int): Random seed for reproducibility
    """
    # Read the original dataset
    print("Reading the original dataset...")
    df = pd.read_csv(input_file)

    # Take random sample
    print(f"Sampling {sample_size} records from {len(df)} total records...")
    sampled_df = df.sample(n=min(sample_size, len(df)), random_state=random_state)

    # Save to new file
    sampled_df.to_csv(output_file, index=False)
    print(f"Successfully saved {len(sampled_df)} records to {output_file}")

    return sampled_df

#Usage
if __name__ == "__main__":
    INPUT_FILE = "emails.csv"
    OUTPUT_FILE = "sampled_enron_emails.csv"

    df = sample_emails(INPUT_FILE, OUTPUT_FILE)
    print(f"Dataset shape: {df.shape}")
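
Reading the full 1.32 GB file with `pd.read_csv` requires enough memory to hold all of it. As an alternative, here is a minimal sketch of reservoir sampling with Python's standard `csv` module, which streams the file and keeps a uniform random sample of rows. The function name `reservoir_sample_csv` is illustrative and not part of the script above, and the rows it picks will not match those chosen by `df.sample`.

```python
import csv
import random

def reservoir_sample_csv(input_file, output_file, sample_size=10000, seed=42):
    """Stream a CSV and keep a uniform random sample of its rows (Algorithm R)."""
    # Raw emails can be longer than the csv module's default field size limit
    csv.field_size_limit(2**31 - 1)
    rng = random.Random(seed)
    reservoir = []

    with open(input_file, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i < sample_size:
                # Fill the reservoir with the first sample_size rows
                reservoir.append(row)
            else:
                # Replace a random slot with probability sample_size / (i + 1)
                j = rng.randint(0, i)  # inclusive on both ends
                if j < sample_size:
                    reservoir[j] = row

    with open(output_file, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)

    return len(reservoir)
```

Only `sample_size` rows are held in memory at any time, at the cost of a single pass over the whole file.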

In [19]:
import pandas as pd
df = pd.read_csv("sampled_enron_emails.csv")
df.head()


Unnamed: 0,file,message
0,shackleton-s/sent/1912.,Message-ID: <21013688.1075844564560.JavaMail.e...
1,farmer-d/logistics/1066.,Message-ID: <22688499.1075854130303.JavaMail.e...
2,parks-j/deleted_items/202.,Message-ID: <27817771.1075841359502.JavaMail.e...
3,stokley-c/chris_stokley/iso/client_rep/41.,Message-ID: <10695160.1075858510449.JavaMail.e...
4,germany-c/all_documents/1174.,Message-ID: <27819143.1075853689038.JavaMail.e...


# Data labelling

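Since the data is not labeled, meaning there is no indication of whether an email is important or not, I need to label it myself. I used three heuristics: priority keywords such as "urgent" or "ASAP" in the subject line, executive titles such as "CEO" in the sender address, and urgency words such as "immediate" or "emergency" anywhere in the message. An email that matches any of these is labeled high priority; everything else is labeled normal.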

In [3]:
def parse_email_headers(message):
    """Extract headers from raw email message"""
    headers = {}
    lines = message.split('\n')
    current_key = None

    for line in lines:
        if line.strip() == '':
            break
        if ': ' in line and not line.startswith(' '):
            key, value = line.split(': ', 1)
            current_key = key.lower()
            headers[current_key] = value.strip()
        elif current_key and line.startswith(' '):
            headers[current_key] += ' ' + line.strip()

    return headers

def label_emails(df):
    """Label emails based on headers and content"""
    # Create new columns for parsed data
    df['headers'] = df['message'].apply(parse_email_headers)
    df['subject'] = df['headers'].apply(lambda x: x.get('subject', ''))
    df['from'] = df['headers'].apply(lambda x: x.get('from', ''))
    df['date'] = df['headers'].apply(lambda x: x.get('date', ''))

    # Priority indicators
    priority_keywords = ['urgent', 'asap', 'important', 'priority', 'critical']
    exec_titles = ['ceo', 'cfo', 'president', 'vp', 'director']

    # Label as high priority if:
    conditions = (
        df['subject'].str.lower().apply(lambda x: any(k in x for k in priority_keywords)) |  # Priority keywords in subject
        df['from'].str.lower().apply(lambda x: any(t in x for t in exec_titles)) |          # From executives
        df['message'].str.lower().str.contains('urgent|asap|immediate|emergency')            # Priority words in body
    )

    df['importance'] = 'normal'
    df.loc[conditions, 'importance'] = 'high'

    return df

# Usage
df = pd.read_csv("sampled_enron_emails.csv")
df = label_emails(df)

# Print statistics
print(f"Total emails: {len(df)}")
print(f"High priority: {(df['importance'] == 'high').sum()}")

Total emails: 10000
High priority: 895


The resulting labelled data marks 895 of the 10,000 emails (about 9%) as high priority, which seems a reasonable fraction to me, although it leaves the two classes imbalanced.
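
As a cross-check on the hand-written header parser, the standard library's `email` package can parse the same kind of raw message. This is only a sketch: the sample message is made up in the style of the dataset, and this is not the parser that produced the labels above.

```python
from email import message_from_string

# Hypothetical raw message with a folded (multi-line) Subject header
raw = (
    "Message-ID: <example@hypothetical>\n"
    "Date: Tue, 29 Aug 2000 01:26:00 -0700 (PDT)\n"
    "From: sara.shackleton@enron.com\n"
    "Subject: Re: Credit\n"
    " Derivatives\n"
    "\n"
    "Body text here.\n"
)

msg = message_from_string(raw)

# Header lookup is case-insensitive; collapse whitespace left by header folding
subject = " ".join(msg["Subject"].split())
sender = msg["From"]
body = msg.get_payload()

print(subject)
print(sender)
print(body.strip())
```

The parser treats everything before the first blank line as headers, so a message that starts with a blank line yields no headers at all.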

In [5]:
df.to_csv("sampled_enron_emails_labelled.csv", index=False) #save labelled data

In [21]:
df = pd.read_csv("sampled_enron_emails_labelled.csv")
print(f"Columns: {df.columns.tolist()}")
print(f"Shape: {df.shape}")
df.head()


Columns: ['file', 'message', 'headers', 'subject', 'from', 'date', 'importance']
Shape: (10000, 7)


Unnamed: 0,file,message,headers,subject,from,date,importance
0,shackleton-s/sent/1912.,Message-ID: <21013688.1075844564560.JavaMail.e...,{'message-id': '<21013688.1075844564560.JavaMa...,Re: Credit Derivatives,sara.shackleton@enron.com,"Tue, 29 Aug 2000 01:26:00 -0700 (PDT)",normal
1,farmer-d/logistics/1066.,Message-ID: <22688499.1075854130303.JavaMail.e...,{'message-id': '<22688499.1075854130303.JavaMa...,Meter #1591 Lamay Gaslift,pat.clynes@enron.com,"Mon, 24 Apr 2000 05:43:00 -0700 (PDT)",normal
2,parks-j/deleted_items/202.,Message-ID: <27817771.1075841359502.JavaMail.e...,{'message-id': '<27817771.1075841359502.JavaMa...,Re: man night again?,knipe3@msn.com,"Thu, 2 May 2002 04:54:27 -0700 (PDT)",normal
3,stokley-c/chris_stokley/iso/client_rep/41.,Message-ID: <10695160.1075858510449.JavaMail.e...,{'message-id': '<10695160.1075858510449.JavaMa...,"Enron 480, 1480 charges",kalmeida@caiso.com,"Wed, 8 Aug 2001 14:35:08 -0700 (PDT)",normal
4,germany-c/all_documents/1174.,Message-ID: <27819143.1075853689038.JavaMail.e...,{'message-id': '<27819143.1075853689038.JavaMa...,Transport Deal,chris.germany@enron.com,"Wed, 21 Jun 2000 04:58:00 -0700 (PDT)",normal
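
The `from` column holds bare email addresses, so the executive-title rule in `label_emails` is a substring test on addresses. The sketch below compares that substring test with a word-boundary regular expression; the addresses are invented for illustration, and the regex variant is a possible refinement rather than what was used to create the labels.

```python
import re

EXEC_TITLES = ["ceo", "cfo", "president", "vp", "director"]
# \b requires the title to stand alone, not sit inside a longer word
TITLE_PATTERN = re.compile(r"\b(?:" + "|".join(EXEC_TITLES) + r")\b")

def substring_match(sender):
    """Same logic as the labelling rule: any title anywhere in the string."""
    s = sender.lower()
    return any(title in s for title in EXEC_TITLES)

def word_match(sender):
    """Stricter variant: the title must be delimited by non-word characters."""
    return TITLE_PATTERN.search(sender.lower()) is not None

# Hypothetical sender addresses
senders = ["cfo@enron.com", "vp.trading@enron.com", "jvpatel@enron.com"]
for s in senders:
    print(s, substring_match(s), word_match(s))
```

The third address matches the substring test only because "vp" happens to occur inside a surname, which is the kind of false positive the stricter pattern avoids.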


# Feature extraction

Now that the emails are labelled, I can extract features from the dataset: TF-IDF terms from the subject and message text, plus a few metadata features.

In [16]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
!pip install nltk
import nltk
nltk.download('punkt_tab')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pickle

class EmailFeatureExtractor:
    def __init__(self, max_features=1000, tfidf_vectorizer=None, label_encoder=None):
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('wordnet')

        self.max_features = max_features
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        self.tfidf = tfidf_vectorizer if tfidf_vectorizer else TfidfVectorizer(max_features=max_features)
        self.label_encoder = label_encoder if label_encoder else LabelEncoder()

    def preprocess_text(self, text):
        if not isinstance(text, str):
            return ''

        text = text.lower()
        text = ' '.join(
            self.lemmatizer.lemmatize(token)
            for token in word_tokenize(text)
            if token not in self.stop_words and len(token) > 2
        )
        return text

    def extract_metadata_features(self, df):
        features = pd.DataFrame()

        # Email characteristics
        features['subject_length'] = df['subject'].fillna('').str.len()
        features['subject_word_count'] = df['subject'].fillna('').str.split().str.len()
        features['message_length'] = df['message'].fillna('').str.len()
        features['message_word_count'] = df['message'].fillna('').str.split().str.len()

        # Sender features
        features['sender_is_executive'] = df['from'].str.lower().str.contains(
            'ceo|cfo|president|vp|director', na=False).astype(int)

        # Time features
        features['has_date'] = df['date'].notna().astype(int)

        return features

    def extract_features(self, df, is_training=True):
        print("Preprocessing text...")
        combined_text = df['subject'].fillna('') + ' ' + df['message'].fillna('')
        processed_text = combined_text.apply(self.preprocess_text)

        print("Extracting TF-IDF features...")
        if is_training:
            tfidf_features = self.tfidf.fit_transform(processed_text)
        else:
            tfidf_features = self.tfidf.transform(processed_text)  # No fitting during testing

        print("Extracting metadata features...")
        metadata_features = self.extract_metadata_features(df)

        X = np.hstack([
            tfidf_features.toarray(),
            metadata_features.values
        ])

        y = None
        if 'importance' in df.columns:
            if is_training:
                y = self.label_encoder.fit_transform(df['importance'])
            else:
                y = self.label_encoder.transform(df['importance'])  # Transform only during testing

        feature_names = (
            self.tfidf.get_feature_names_out().tolist() +
            metadata_features.columns.tolist()
        )

        print(f"Extracted {X.shape[1]} features")
        return X, y, feature_names

    def load_vectorizer_and_encoder(self):
        with open('tfidf_vectorizer.pkl', 'rb') as f:
            self.tfidf = pickle.load(f)
        with open('label_encoder.pkl', 'rb') as f:
            self.label_encoder = pickle.load(f)
#example usage
# if __name__ == "__main__":
#     df = pd.read_csv("sampled_enron_emails_labelled.csv")
#     extractor = EmailFeatureExtractor(max_features=1000)
#     X, y, feature_names = extractor.extract_features(df, is_training=True)
#     print(f"Feature matrix shape: {X.shape}")
#     print(f"Number of classes: {len(np.unique(y))}")




[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [37]:
print(feature_names)
print(X.shape)
print(y.shape)

['00', '000', '01', '02', '03', '04', '05', '06', '07', '0700', '08', '0800', '09', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '19', '1968', '1999', '20', '2000', '2001', '2002', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '345', '35', '36', '37', '38', '39', '3d', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '500', '51', '52', '53', '55', '56', '57', '58', '59', '646', '713', '7bit', '800', '853', '99', 'able', 'access', 'according', 'account', 'accounting', 'act', 'action', 'activity', 'add', 'added', 'addition', 'additional', 'address', 'administration', 'agency', 'ago', 'agreement', 'alan', 'align', 'all', 'allen', 'already', 'also', 'although', 'america', 'american', 'amount', 'analysis', 'analyst', 'and', 'anderson', 'andrew', 'andy', 'ann', 'announcement', 'another', 'ansi_x3', 'answer', 'anyone', 'anything', 'aol', 'application', 'approval', 'approved', 'apr', 'april', 'area', 'arnold', 'around', 'article'

The feature extractor produced 1,006 features: 1,000 TF-IDF terms (the `max_features` limit) plus 6 metadata features.

# Model Training

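To train the model, I split the data into training and testing sets, fit the feature extractor on the training set only, and applied it unchanged to the test set. I then trained the random forest classifier, evaluated it, and saved the model together with the fitted vectorizer and label encoder.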

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import pandas as pd
import pickle

# Load your training data
train_df = pd.read_csv("sampled_enron_emails_labelled.csv")

# Split data into training and testing sets
train_data, test_data = train_test_split(train_df, test_size=0.2, random_state=42)

# Training
extractor = EmailFeatureExtractor(max_features=1000)
X_train, y_train, feature_names = extractor.extract_features(train_data, is_training=True)
X_test, y_test, _ = extractor.extract_features(test_data, is_training=False)

# Train your classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Evaluate the model
predictions = rf_classifier.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, predictions))

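# Save the model, the fitted vectorizer, and the label encoder
with open('rf_classifier.pkl', 'wb') as f:
    pickle.dump(rf_classifier, f)

with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(extractor.tfidf, f)

with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(extractor.label_encoder, f)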


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Preprocessing text...
Extracting TF-IDF features...
Extracting metadata features...
Extracted 1006 features
Preprocessing text...
Extracting TF-IDF features...
Extracting metadata features...
Extracted 1006 features
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.55      0.69       168
           1       0.96      1.00      0.98      1832

    accuracy                           0.96      2000
   macro avg       0.95      0.77      0.84      2000
weighted avg       0.96      0.96      0.95      2000
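
The report names the classes 0 and 1 rather than `high` and `normal`. Those indices come from `LabelEncoder`, which sorts class names alphabetically, so with the labels used here 0 should correspond to `high` and 1 to `normal` (consistent with the supports of 168 and 1,832). A small sketch on toy labels to confirm the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the same two class names as the notebook
labels = ["normal", "high", "normal", "normal"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_[i] is the class name that was assigned index i
print(encoder.classes_.tolist())  # ['high', 'normal']
print(encoded.tolist())           # [1, 0, 1, 1]

# inverse_transform maps predicted indices back to class names
decoded = encoder.inverse_transform([0, 1])
print(decoded.tolist())           # ['high', 'normal']
```

Read this way, the recall of 0.55 belongs to the high-priority class and the recall of 1.00 to the normal class.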



# Usage

I generated my own test messages, extracted their features, and fed them to the model for prediction.

In [18]:
# Test messages
test_messages = [
    """
    From: john.doe@enron.com
    To: team@enron.com
    Subject: URGENT: Emergency Board Meeting

    Emergency board meeting called for 3PM today to discuss critical market developments. Immediate attendance required. Executive decisions pending.
    """,
    """
    From: cfo@enron.com
    To: finance@enron.com
    Subject: Critical Financial Report Needed

    Need Q4 projections ASAP for investor meeting tomorrow morning. This is top priority.
    """,
    """
    From: jane.smith@enron.com
    To: team@enron.com
    Subject: Weekly Team Update

    Here's a summary of this week's progress on the Thompson project. Next review scheduled for Friday.
    """,
    """
    From: hr@enron.com
    To: all@enron.com
    Subject: Office Holiday Schedule

    Attached is the upcoming holiday schedule for December. Please submit time-off requests by end of week.
    """,
    """
    From: communications@enron.com
    To: all@enron.com
    Subject: Monthly Newsletter - February Edition

    Hello Team,
    Attached is the February edition of our monthly newsletter. It includes updates on company initiatives, employee highlights, and upcoming events.
    Best,
    Enron Communications Team
    """,
    """
    From: jane.doe@enron.com
    To: john.smith@enron.com
    Subject: Quick Catch-Up

    Hi John,
    Just checking in to see how things are going with your new project. Let me know if you'd like to grab coffee sometime this week.
    Cheers,
    Jane

    """,
    """
    From: it.support@enron.com
    To: all@enron.com
    Subject: Scheduled System Maintenance

    Dear Team,
    Please be informed that there will be scheduled maintenance on the company servers this Saturday from 2 AM to 4 AM. Minimal disruption is expected.
    Regards,
    IT Support

    """,
    """
    From: project.manager@enron.com
    To: project.team@enron.com
    Subject: Recap of Today's Meeting

    Hello Team,
    Here’s a summary of today’s project meeting, including key takeaways and action items. Please review and reach out if you have questions.
    Best,
    Project Manager

    """,
    """
    From: hr@enron.com
    To: all@enron.com
    Subject: Team Building Event Invitation

    Hi Everyone,
    We’re excited to invite you to our upcoming team-building event next Friday at the company lounge. There will be games, snacks, and lots of fun!
    Best,
    HR Team

    """
]

# Process test messages
test_df = pd.DataFrame({
    'message': test_messages,
    'file': ['test_' + str(i) for i in range(len(test_messages))]
})

# Use your labeling function here
test_df = label_emails(test_df)

# Extract features using the loaded vectorizer and encoder
extractor = EmailFeatureExtractor()

with open('tfidf_vectorizer.pkl', 'rb') as f:
    extractor.tfidf = pickle.load(f)

with open('label_encoder.pkl', 'rb') as f:
    extractor.label_encoder = pickle.load(f)

with open('rf_classifier.pkl', 'rb') as f:
    rf_classifier = pickle.load(f)

X_test_new = extractor.extract_features(test_df, is_training=False)[0]

# Predict
predictions = rf_classifier.predict(X_test_new)
# LabelEncoder sorts class names alphabetically: 'high' -> 0, 'normal' -> 1
print("Predictions:", [("High Priority" if p == 0 else "Normal") for p in predictions])

Preprocessing text...
Extracting TF-IDF features...
Extracting metadata features...
Extracted 1006 features
Predictions: ['Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The model gave all nine test messages the same prediction, which is not ideal: at least the first two, which contain words like "urgent" and "ASAP", should stand out from the rest. Because `LabelEncoder` sorts class names alphabetically, `high` is encoded as 0 and `normal` as 1, which matches the classification report above (168 test emails in class 0, 1,832 in class 1). Every prediction here was 1, so the model assigned all nine messages to the majority class, normal.

Several things probably contribute. Only about 9% of the training emails are labelled high priority, and recall for that class on the held-out set was just 0.55, so the model already misses nearly half of the high-priority emails. The test messages also begin with a blank line and are indented, so `parse_email_headers` stops at the first line and returns no headers, leaving the subject and sender features empty. Finally, the saved extractor has a fixed set of 1,006 features (1,000 TF-IDF terms plus 6 metadata columns); my short test messages contain few of those terms, so most of their feature values are 0 and the model has little signal to separate them.
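
One thing worth checking in isolation is how `parse_email_headers` handles the triple-quoted test messages, which start with a newline and are indented. The sketch below copies the function from the labelling cell and runs it on one test message before and after cleaning with `textwrap.dedent`; the cleaning step is a suggestion, and whether it changes the predictions would need a fresh run.

```python
import textwrap

def parse_email_headers(message):
    """Extract headers from raw email message (copied from the labelling cell)."""
    headers = {}
    current_key = None
    for line in message.split('\n'):
        if line.strip() == '':
            break  # the first blank line ends the header block
        if ': ' in line and not line.startswith(' '):
            key, value = line.split(': ', 1)
            current_key = key.lower()
            headers[current_key] = value.strip()
        elif current_key and line.startswith(' '):
            headers[current_key] += ' ' + line.strip()
    return headers

raw = """
    From: john.doe@enron.com
    To: team@enron.com
    Subject: URGENT: Emergency Board Meeting

    Emergency board meeting called for 3PM today.
    """

# The message starts with an empty line, so parsing stops immediately
print(parse_email_headers(raw))  # {}

# Remove the common indentation and the leading/trailing blank lines
cleaned = textwrap.dedent(raw).strip()
print(parse_email_headers(cleaned)["subject"])  # URGENT: Emergency Board Meeting
```

With empty headers, the `subject` and `from` columns are empty strings, so the subject-length and sender features carry no information for these messages.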