#### **1. Dataset Exploration & Preprocessing:**

* In this notebook, I perform the dataset exploration and preprocessing steps.
* **Step 1:** Read the data from `../data/emails.csv` using pandas.
* **Step 2:** Checked the shape and null values in the dataset.
* **Step 3:** Removed the header metadata by splitting each message at the first blank line, which separates the `email body` from the rest of the email.
* **Step 4:** Added a column called `body` which contains the main content of the emails (cleaned body).
* **Step 5:** Categorized the emails into several types (e.g., meeting, project, request) by parsing keywords from the email body.
* **Step 6:** Finally, kept only the emails whose `email_type` is one of the keyword categories (meeting, project, request, reminder) and saved them to `../data/filtered_enron_emails.csv`.


In [1]:
# required imports
import pandas as pd

In [2]:
# read the dataset
df = pd.read_csv("../data/emails.csv")

In [None]:
# shape of the dataset
df.shape

(517401, 2)

In [None]:
# preview the first 10 rows
df.head(10)

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...
5,allen-p/_sent_mail/1002.,Message-ID: <30965995.1075863688265.JavaMail.e...
6,allen-p/_sent_mail/1003.,Message-ID: <16254169.1075863688286.JavaMail.e...
7,allen-p/_sent_mail/1004.,Message-ID: <17189699.1075863688308.JavaMail.e...
8,allen-p/_sent_mail/101.,Message-ID: <20641191.1075855687472.JavaMail.e...
9,allen-p/_sent_mail/102.,Message-ID: <30795301.1075855687494.JavaMail.e...


In [None]:
# check for null values
df.isnull().sum()

file       0
message    0
dtype: int64

In [None]:
# print a few emails to understand the message structure
print(df['message'].values[0])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


In [7]:
print(df['message'].values[1])

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the

In [8]:
print(df['message'].values[2])

Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: phillip.allen@enron.com
To: leah.arsdall@enron.com
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!


In [9]:
print(df['message'].values[11])

Message-ID: <25459584.1075855687536.JavaMail.evans@thyme>
Date: Fri, 13 Oct 2000 06:45:00 -0700 (PDT)
From: phillip.allen@enron.com
To: stagecoachmama@hotmail.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: stagecoachmama@hotmail.com
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Lucy,

 Here are the rentrolls:



 Open them and save in the rentroll folder.  Follow these steps so you don't 
misplace these files.

 1.  Click on Save As
 2.  Click on the drop down triangle under Save in:
 3.  Click on the  (C): drive
 4.  Click on the appropriate folder
 5.  Click on Save:

Phillip
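
The messages printed above share one layout: a block of `Name: value` headers, a blank line, then the body. As a minimal sketch of that structure (not what this notebook uses), Python's standard `email` module can parse such a string. The `raw_message` below is an abbreviated copy of the third message shown above, with some headers left out for brevity.

```python
from email import message_from_string

# Abbreviated copy of df['message'].values[2]; several headers are omitted.
raw_message = (
    "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>\n"
    "Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: leah.arsdall@enron.com\n"
    "Subject: Re: test\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "test successful.  way to go!!!"
)

msg = message_from_string(raw_message)

sender = msg["From"]       # header lookup by name
subject = msg["Subject"]
body = msg.get_payload()   # for a non-multipart message this is the body text

print(sender)
print(subject)
print(body)
```

Parsing this way also keeps fields such as `From` and `Subject` available, which the plain split used below discards.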


##### **Data Preprocessing**

In [10]:
import re

# Step 1: Split header and body using the first blank line
def extract_email_body(raw_message):
    parts = raw_message.split('\n\n', 1)  # split into header and body
    return parts[1] if len(parts) > 1 else raw_message  # return body

# Step 2: Clean the body
def clean_body(text):
    text = text.lower()  # lowercase
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace/newlines into one space
    # drop everything except letters, digits, whitespace, and the characters . , '
    text = re.sub(r'[^a-zA-Z0-9\s\.\,\']', '', text)
    return text.strip()

In [11]:
# Apply both steps: extract the body, then clean it
df['body'] = df['message'].apply(extract_email_body).apply(clean_body)

In [12]:
df.head()

Unnamed: 0,file,message,body
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,here is our forecast
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,traveling to have a business meeting takes the...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,test successful. way to go
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,"randy, can you send me a schedule of the salar..."
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,let's shoot for tuesday at 1145.
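
The header/body split in `extract_email_body` depends on the first `\n\n` in the message. The sketch below repeats that function on three made-up inputs (hypothetical strings, not rows from the dataset) to show the edge cases: later blank lines stay inside the body, and a message with no `\n\n` is returned unchanged, headers included.

```python
def extract_email_body(raw_message):
    parts = raw_message.split('\n\n', 1)  # split once, at the first blank line
    return parts[1] if len(parts) > 1 else raw_message

# Hypothetical inputs for illustration only.
with_body = "Subject: Re: test\nX-FileName: pallen.nsf\n\ntest successful.\n\nway to go"
headers_only = "Subject: Re: test\nX-FileName: pallen.nsf"
crlf_message = "Subject: Re: test\r\n\r\ntest successful."

body_a = extract_email_body(with_body)      # blank line inside the body is preserved
body_b = extract_email_body(headers_only)   # no blank line: the input comes back as-is
body_c = extract_email_body(crlf_message)   # '\r\n\r\n' contains no '\n\n', so no split

print(repr(body_a))
print(repr(body_b))
print(repr(body_c))
```

If the data ever contained `\r\n` line endings, normalizing them before the split would be one way to handle the third case; whether that occurs in this dataset is not checked here.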


In [None]:
# message before cleaning
print(df['message'].values[2])

Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: phillip.allen@enron.com
To: leah.arsdall@enron.com
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!


In [None]:
# message after cleaning
print(df['body'].values[2])

test successful. way to go
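
To make the effect of `clean_body` easier to see, here is a small sketch that repeats the function and runs it on a few strings. Only the first string comes from the dataset; the others are made-up inputs chosen to show which characters are dropped.

```python
import re

def clean_body(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9\s\.\,\']', '', text)
    return text.strip()

samples = [
    "test successful.  way to go!!!",     # body of the message printed above
    "Let's shoot for Tuesday at 11:45.",  # hypothetical: colon is removed
    "Don’t forget the 10/16/2000 call",   # hypothetical: curly apostrophe and slashes are removed
    "a - b",                              # hypothetical: removal happens after whitespace collapsing
]

cleaned = [clean_body(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two things stand out. The straight apostrophe in `let's` survives, while the curly one in `Don’t` is stripped. And because whitespace is collapsed before characters are removed, deleting a symbol that sat between two spaces leaves a double space behind, as in the last sample.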


##### **Email Categorization**

In [15]:
# Detect emails that are about meetings, project updates, requests, or reminders

# Define keyword categories (checked in this order; the first match wins)
keywords = {
    'meeting': ['meeting', 'schedule', 'calendar', 'zoom', 'teams'],
    'project': ['project', 'update', 'status', 'deadline', 'milestone'],
    'request': ['request', 'can you', 'could you', 'please send'],
    # Note: clean_body strips the curly apostrophe (’), so 'don’t forget'
    # cannot match a cleaned body as written.
    'reminder': ['reminder', 'don’t forget', 'as discussed'],
}

# Tag emails with categories
def detect_category(text):
    for category, words in keywords.items():
        for word in words:
            if word in text:
                return category
    return 'other'

df['email_type'] = df['body'].apply(detect_category)

In [16]:
df.head()

Unnamed: 0,file,message,body,email_type
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,here is our forecast,other
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,traveling to have a business meeting takes the...,meeting
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,test successful. way to go,other
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,"randy, can you send me a schedule of the salar...",meeting
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,let's shoot for tuesday at 1145.,other
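
`detect_category` uses plain substring tests, so a keyword also matches inside a longer word, and the dictionary order decides which category wins when several keywords occur. The sketch below contrasts it with a whole-word variant built on `\b` regex boundaries. `detect_category_substring` restates the notebook's logic under a different name, and `detect_category_whole_word` is a hypothetical alternative, not something this notebook runs; both test sentences are made up.

```python
import re

# Copied from the notebook's keyword dictionary.
keywords = {
    'meeting': ['meeting', 'schedule', 'calendar', 'zoom', 'teams'],
    'project': ['project', 'update', 'status', 'deadline', 'milestone'],
    'request': ['request', 'can you', 'could you', 'please send'],
    'reminder': ['reminder', 'don’t forget', 'as discussed'],
}

def detect_category_substring(text):
    # Same logic as detect_category: first substring hit wins.
    for category, words in keywords.items():
        for word in words:
            if word in text:
                return category
    return 'other'

# One compiled pattern per category, requiring word boundaries on both sides.
patterns = {
    category: re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')
    for category, words in keywords.items()
}

def detect_category_whole_word(text):
    # Hypothetical variant: same category order, but whole words only.
    for category, pattern in patterns.items():
        if pattern.search(text):
            return category
    return 'other'

text_a = "the steamship docks at noon"       # contains 'teams' inside 'steamship'
text_b = "can you send the project status"   # has both project and request keywords

print(detect_category_substring(text_a), detect_category_whole_word(text_a))
print(detect_category_substring(text_b), detect_category_whole_word(text_b))
```

The first sentence is tagged `meeting` by the substring version only because `steamship` contains `teams`. The second is tagged `project` by both versions even though it also contains `can you`, since `project` comes before `request` in the dictionary.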


In [17]:
# Check distribution
df['email_type'].value_counts()

email_type
other       293750
meeting     114892
project      52893
request      51973
reminder      3893
Name: count, dtype: int64
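
As a quick arithmetic check on the distribution above, the sketch below turns the reported counts into percentages using plain Python. The numbers are copied from the output; on the DataFrame itself, `df['email_type'].value_counts(normalize=True)` would give the same proportions.

```python
# Counts copied from the value_counts() output above.
counts = {
    'other': 293750,
    'meeting': 114892,
    'project': 52893,
    'request': 51973,
    'reminder': 3893,
}

total = sum(counts.values())          # should equal the number of rows in df
kept = total - counts['other']        # emails that matched at least one keyword

shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total, kept)
for name, pct in shares.items():
    print(f"{name:<9}{pct:>6}%")
```

The total matches the 517401 rows reported by `df.shape`, and more than half of the emails fall into `other`, meaning they match none of the keywords.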

In [None]:
# keep only the emails tagged with one of the keyword categories
selected_types = list(keywords)
df_filtered = df[df['email_type'].isin(selected_types)][['file', 'body', 'email_type']].reset_index(drop=True)

In [19]:
df_filtered.head()

Unnamed: 0,file,body,email_type
0,allen-p/_sent_mail/10.,traveling to have a business meeting takes the...,meeting
1,allen-p/_sent_mail/1000.,"randy, can you send me a schedule of the salar...",meeting
2,allen-p/_sent_mail/1003.,please cc the following distribution list with...,project
3,allen-p/_sent_mail/102.,forwarded by phillip k allenhouect on 10162000...,reminder
4,allen-p/_sent_mail/103.,"mr. buckner, for delivered gas behind san dieg...",request


In [20]:
df_filtered.shape

(223651, 3)

In [None]:
# check distribution after filtering
df_filtered['email_type'].value_counts()

email_type
meeting     114892
project      52893
request      51973
reminder      3893
Name: count, dtype: int64

In [None]:
# save the filtered data for further analysis
df_filtered.to_csv("../data/filtered_enron_emails.csv", index=False)
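
The cell above writes a single CSV. If one file per category is preferred instead, a `groupby` loop is one way to do it. The sketch below is only an illustration: it uses a tiny made-up DataFrame in place of `df_filtered`, writes to a temporary directory rather than `../data`, and the `<email_type>_emails.csv` naming is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_filtered; the rows are invented for illustration.
toy_filtered = pd.DataFrame({
    'file': ['a/1.', 'a/2.', 'b/1.', 'b/2.'],
    'body': ['business meeting notes', 'project status', 'can you send it', 'meeting moved'],
    'email_type': ['meeting', 'project', 'request', 'meeting'],
})

out_dir = Path(tempfile.mkdtemp())  # temporary directory, not ../data

written = []
for email_type, group in toy_filtered.groupby('email_type'):
    path = out_dir / f"{email_type}_emails.csv"  # hypothetical naming scheme
    group.to_csv(path, index=False)
    written.append(path.name)

print(written)
```

Each file then holds only the rows of one `email_type`, with the same three columns as the combined file.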