# Enron Dataset Pre-processing
## Introduction
The Enron email dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.

This notebook presents an exploratory analysis of the Enron dataset, looking at the different types of emails, the users and the relations between them. It also applies some preprocessing to the dataset for our use.

Let's start by importing the usual Python libraries for data analysis. We'll use pandas for manipulating our dataset and the standard `email` module for parsing the messages.

In [1]:
# import necessary libraries
import email
import pandas as pd

## Importing the dataset
The dataset obtained from the Kaggle website is in zipped format. After extraction, we obtain a CSV file, which can then be easily imported using pandas.

In [2]:
# Let's import the dataset
data = pd.read_csv('data/emails.csv')
# Let's check some data
data.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


Let's check the number of records in this dataset. There are 517401 rows of email messages in two columns, file and message.

In [3]:
data.shape

(517401, 2)

The file column contains the location of the email file inside the dataset. We can obtain the user name from the file column. The message column contains the raw email, including both the headers and the body.

In [4]:
data.file[0]

'allen-p/_sent_mail/1.'

Let's check one of the email messages from the dataset.

In [5]:
print(data.message[6])

Message-ID: <16254169.1075863688286.JavaMail.evans@thyme>
Date: Tue, 22 Aug 2000 07:44:00 -0700 (PDT)
From: phillip.allen@enron.com
To: david.l.johnson@enron.com, john.shafer@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: david.l.johnson@enron.com, John Shafer
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Please cc the following distribution list with updates:

Phillip Allen (pallen@enron.com)
Mike Grigsby (mike.grigsby@enron.com)
Keith Holst (kholst@enron.com)
Monique Sanchez
Frank Ermis
John Lavorato


Thank you for your help

Phillip Allen



## Starting the Pre-processing
Let's start the pre-processing of the dataset. We will start by creating some helper functions to process the email message. The `get_text_from_email()` function will help us extract the plain text content from the email while ignoring any other parts of a multipart message. The `split_email_addresses()` function will help us extract the set of email addresses from the 'From' and 'To' fields of the email messages.

In [6]:
# Some helper functions
def get_text_from_email(msg):
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return ''.join(parts)

def split_email_addresses(line):
    if line:
        address = line.split(',')
        addresses = frozenset(map(lambda x: x.strip(), address))
    else:
        addresses = None
    return addresses
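To see what the two helpers return, here is a small check on a hand-written message. The addresses and the message text are made up for illustration and are not taken from the dataset; the helpers are repeated so the snippet runs on its own.

```python
import email

def get_text_from_email(msg):
    # Concatenate the payloads of all text/plain parts
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return ''.join(parts)

def split_email_addresses(line):
    # Split a comma-separated header value into a frozenset of addresses
    if line:
        address = line.split(',')
        addresses = frozenset(map(lambda x: x.strip(), address))
    else:
        addresses = None
    return addresses

# Hypothetical message, for illustration only
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Test\n"
    "\n"
    "Hello there\n"
)

msg = email.message_from_string(raw)
body = get_text_from_email(msg)                  # text after the blank line
recipients = split_email_addresses(msg['To'])    # frozenset of two addresses
missing = split_email_addresses(msg['Cc'])       # header absent, so None

print(repr(body))
print(recipients)
print(missing)
```

A message without a Content-Type header is treated as `text/plain` by the `email` module, which is why the body is picked up here.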

Let's parse the email messages to organize their contents. We will use the `email` module to parse each entry of the message column into an `email.message.Message` object, which allows dictionary-style access to the headers, and collect these objects in a list. Then, we will drop the message column from the dataset, since its contents are now available in the messages list.

In [7]:
messages = list(map(email.message_from_string, data['message']))
data.drop('message', axis=1, inplace=True)

We can obtain the header information of the email messages by taking the keys from the email message. Then, we can use those keys to create respective columns for the email messages in the dataframe.

In [8]:
keys = messages[0].keys()
for key in keys:
    data[key] = [message[key] for message in messages]
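One thing to keep in mind is that the keys are taken from the first message only, so a header that is absent from that message (a `Cc` header, for instance, if other messages carry one) would not get a column. A minimal sketch of collecting the union of header names instead is given below. The sample messages are hypothetical, and whether the extra columns are useful for this dataset is an assumption.

```python
import email

# Two hypothetical messages; only the second one has a Cc header
raw_messages = [
    "From: alice@example.com\nTo: bob@example.com\nSubject: One\n\nFirst\n",
    "From: bob@example.com\nTo: alice@example.com\nCc: carol@example.com\n"
    "Subject: Two\n\nSecond\n",
]
messages = [email.message_from_string(raw) for raw in raw_messages]

first_keys = messages[0].keys()

# Header names across all messages, de-duplicated in first-seen order
all_keys = list(dict.fromkeys(k for m in messages for k in m.keys()))

# message[key] returns None when the header is missing
columns = {key: [m[key] for m in messages] for key in all_keys}

print(first_keys)
print(all_keys)
print(columns['Cc'])
```

Each entry of `columns` could then be assigned to the dataframe in the same way as in the cell above.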

Now, let's clean the message content by using our `get_text_from_email()` helper function to exclude everything except the text/plain content type. Then, using the `split_email_addresses()` function, we'll extract the email addresses from the 'From' and 'To' fields of the email.

In [15]:
data['content'] = list(map(get_text_from_email, messages))
data['From'] = data['From'].map(split_email_addresses)
data['To'] = data['To'].map(split_email_addresses)
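Splitting on commas works for plain address lists like the ones shown in this notebook. If a header contained display names with commas, a plain split would break them apart. As a hedged alternative, the standard library's `email.utils.getaddresses()` can do the parsing; `parse_addresses` below is a hypothetical helper name and the sample value is made up.

```python
from email.utils import getaddresses

def parse_addresses(header_value):
    # Hypothetical alternative to split_email_addresses()
    if not header_value:
        return None
    # getaddresses() returns (display name, address) pairs
    pairs = getaddresses([header_value])
    return frozenset(addr for _name, addr in pairs if addr)

# Made-up header value with a comma inside a quoted display name
value = 'bob@example.com, "Doe, Jane" <jane.doe@example.com>'

naive_pieces = [piece.strip() for piece in value.split(',')]
parsed = parse_addresses(value)

print(naive_pieces)   # three pieces, the display name is cut in two
print(parsed)         # two addresses
```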

Now, let's extract the user name from the 'file' column and drop it from the dataframe. The final result of the processing is below.

In [16]:
data['user'] = data['file'].map(lambda x: x.split('/')[0])
data.drop('file', axis=1, inplace=True)
data.head()

Unnamed: 0,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,content,user
0,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",(phillip.allen@enron.com),(tim.belden@enron.com),,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Tim Belden <Tim Belden/Enron@EnronXGate>,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,allen-p
1,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",(phillip.allen@enron.com),(john.lavorato@enron.com),Re:,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,allen-p
2,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",(phillip.allen@enron.com),(leah.arsdall@enron.com),Re: test,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Leah Van Arsdall,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,allen-p
3,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",(phillip.allen@enron.com),(randall.gay@enron.com),,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Randall L Gay,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",allen-p
4,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",(phillip.allen@enron.com),(greg.piper@enron.com),Re: Hello,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Greg Piper,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,allen-p


When working with large datasets, we should keep the system's memory in mind. Python keeps every object alive for as long as a variable still refers to it, so memory fills up quickly with data that is no longer needed. It is therefore good practice to delete such data and free the memory. Let's delete the messages list, which we no longer require.

In [17]:
del messages
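To get a rough idea of how much memory a dataframe occupies, pandas offers `DataFrame.memory_usage()`. The sketch below uses a small stand-in dataframe rather than the real data, so the numbers it prints say nothing about the Enron dataset itself.

```python
import pandas as pd

# Small stand-in dataframe, for illustration only
df = pd.DataFrame({
    'user': ['allen-p'] * 1000,
    'content': ['Here is our forecast'] * 1000,
})

# Shallow usage counts the column arrays only
shallow_bytes = df.memory_usage().sum()

# deep=True also inspects the Python objects stored in object columns
per_column = df.memory_usage(deep=True)
deep_bytes = per_column.sum()

print(per_column)
print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

Running the same two calls on `data` before and after a cleanup step is one way to check whether the step actually reduced the footprint of the dataframe.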

Now, let's drop the columns we do not need from the dataset: Mime-Version, Content-Type and Content-Transfer-Encoding. The Message-ID column will be set as the index of our dataframe. We also need to convert the Date string into an actual datetime datatype so that we can apply temporal analysis to the email messages.

In [18]:
data = data.set_index('Message-ID').drop(['Mime-Version', 'Content-Type', 'Content-Transfer-Encoding'], axis=1)
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True)
data.dtypes

Date          datetime64[ns]
From                  object
To                    object
Subject               object
X-From                object
X-To                  object
X-cc                  object
X-bcc                 object
X-Folder              object
X-Origin              object
X-FileName            object
content               object
user                  object
dtype: object
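The date strings carry a UTC offset and a trailing zone comment such as `(PDT)`. Note that the `infer_datetime_format` argument is deprecated as of pandas 2.0, and how strings with differing offsets are handled may depend on the pandas version. As a version-independent sketch, the standard library can parse a single header value and normalize it to UTC; the sample string is the Date header of the message printed earlier.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

raw_date = 'Tue, 22 Aug 2000 07:44:00 -0700 (PDT)'

# Timezone-aware datetime that keeps the original -07:00 offset
local_dt = parsedate_to_datetime(raw_date)

# Same instant expressed in UTC, convenient for comparing across offsets
utc_dt = local_dt.astimezone(timezone.utc)

print(local_dt.isoformat())
print(utc_dt.isoformat())
```

Within pandas, passing `utc=True` to `pd.to_datetime()` similarly yields a timezone-aware UTC column. Whether UTC or local sender time is preferable depends on the analysis, which is an open choice here.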

The dataset is now processed according to our requirements, and we can save it for further processing. We will create an `emails-processed.csv` file using the `to_csv()` function of the pandas dataframe.

In [19]:
data.to_csv('emails-processed.csv')
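A caveat when saving: CSV stores text only, so the frozensets in the 'From' and 'To' columns are written as their string representation and come back as plain strings on reload. One possible workaround, sketched below on a small made-up dataframe, is to join the addresses into a comma-separated string before saving and split them again after loading. `join_addresses` and `restore_addresses` are hypothetical helper names.

```python
import io
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    'user': ['allen-p', 'allen-p'],
    'To': [frozenset({'bob@example.com', 'carol@example.com'}), None],
})

def join_addresses(value):
    # frozenset -> 'a@x.com,b@x.com'; anything else -> empty string
    return ','.join(sorted(value)) if isinstance(value, frozenset) else ''

def restore_addresses(value):
    # Non-empty string -> frozenset; missing values -> None
    return frozenset(value.split(',')) if isinstance(value, str) and value else None

out = df.copy()
out['To'] = out['To'].map(join_addresses)

# Write to an in-memory buffer instead of a file for this demo
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
restored = [restore_addresses(v) for v in loaded['To']]
print(restored)
```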

The saved dataset is about 1.2 GB. So, let's compress it for better portability.

In [20]:
!tar czfv emails-processed.tar.gz emails-processed.csv

emails-processed.csv
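As an alternative to the shell command, pandas can write and read gzip-compressed CSV files directly. The sketch below uses a tiny stand-in dataframe and a temporary directory; the file name `emails-sample.csv.gz` is hypothetical. Note that this produces a plain gzip file, not a tar archive.

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in dataframe, for illustration only
df = pd.DataFrame({
    'user': ['allen-p'],
    'content': ['Here is our forecast'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'emails-sample.csv.gz')

    # Compress while writing
    df.to_csv(path, index=False, compression='gzip')

    # On reading, the compression is inferred from the .gz extension
    restored = pd.read_csv(path)

print(restored)
```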


This new dataset is much more structured than the original one and will be used for further analysis of the Enron dataset.