# The Enron Email Dataset

## 1. Downloading data

More info about dataset: [link](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset)

Please provide your Kaggle credentials to download this dataset. Learn more: [link](http://bit.ly/kaggle-creds)

In [1]:
import os, sys, email
import pandas as pd
import numpy as np

import opendatasets as od

od.download_kaggle_dataset("https://www.kaggle.com/datasets/wcukierski/enron-email-dataset", data_dir='.')

In [2]:
data = pd.read_csv('enron-email-dataset/emails.csv')
data.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [3]:
print(data.message[0])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


## 2. Parse data

Let's define some useful function for email parsing

In [11]:
def get_text_from_email(msg: email.message.Message) -> str:
    """To get the content from email objects"""
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line: str) -> frozenset:
    """To separate multiple email addresses"""
    if line:
        addresses = line.split(',')
        addresses = tuple(map(lambda x: x.strip(), addresses))
    else:
        addresses = tuple()
    return addresses

In [13]:
df = pd.DataFrame()
df['Date'] = data['message'].apply(lambda msg: pd.to_datetime(email.message_from_string(msg)['Date'], errors='ignore'))
df['From'] = data['message'].apply(lambda msg: split_email_addresses(email.message_from_string(msg)['From']))
df['To'] = data['message'].apply(lambda msg: split_email_addresses(email.message_from_string(msg)['To']))
df['Subject'] = data['message'].apply(lambda msg: email.message_from_string(msg)['Subject'])
df['SubjectType'] = df['Subject'].apply(lambda x: 'replied' if 're:' == x[:3].lower() else
                                                  'forwarded' if 'fw:' == x[:3].lower() else
                                                  'empty' if len(x) == 0 else
                                                  'ordinary')
df['Content'] = data['message'].apply(lambda msg: get_text_from_email(email.message_from_string(msg)))
df.head()

Unnamed: 0,Date,From,To,Subject,SubjectType,Content
0,2001-05-14 16:39:00-07:00,"(phillip.allen@enron.com,)","(tim.belden@enron.com,)",,empty,Here is our forecast\n\n
1,2001-05-04 13:51:00-07:00,"(phillip.allen@enron.com,)","(john.lavorato@enron.com,)",Re:,replied,Traveling to have a business meeting takes the...
2,2000-10-18 03:00:00-07:00,"(phillip.allen@enron.com,)","(leah.arsdall@enron.com,)",Re: test,replied,test successful. way to go!!!
3,2000-10-23 06:13:00-07:00,"(phillip.allen@enron.com,)","(randall.gay@enron.com,)",,empty,"Randy,\n\n Can you send me a schedule of the s..."
4,2000-08-31 05:07:00-07:00,"(phillip.allen@enron.com,)","(greg.piper@enron.com,)",Re: Hello,replied,Let's shoot for Tuesday at 11:45.


In [14]:
df = df.drop_duplicates().sort_values(by='Date', ignore_index=True)

In [15]:
df.to_csv('./enron-email-dataset/messages.csv', sep='|')

Description of `messages`:
- `Date` - date of sending message
- `From` - who send message
- `CorpFromFlg` - `True` if 'from' email is corporate
- `To` - who get message
- `CorpToFlg` - `True` if 'to' email is corporate
- `Subject` - email's subject
- `SubjectType` - one of [`empty`, `replied`, `forwarded`, `ordinary`]
- `Content` - content of message

In [None]:
from tqdm import tqdm
employee = pd.DataFrame(columns=['Email', 'Name', 'FirstMessageDate', 'LastMessageDate'])
for message_id, row in tqdm(df.iterrows(), total=len(df)):
    for email in row['From']:
        if '@enron.com' in email and email not in employee['Email']:
            employee = pd.concat(
                [employee if not employee.empty else None,
                 pd.DataFrame(
                     {
                         'Email': [email],
                         'Name': [[_.capitalize() for _ in email.replace('@enron.com', '').split('.')]],
                         'FirstMessageDate': [row['Date']],
                         'LastMessageDate': [None]
                     })
                 ]
            )
employee.head()

 31%|███       | 78307/255493 [17:45<1:01:42, 47.85it/s]

In [None]:
for idx, row in employee.iterrows():
    employee.loc[idx, 'LastMessageDate'] = df[df['From'] in row['Email']][['Date', 'From']].sort_values(by='Date').iloc[-1, 1]]