In [1]:
# import packages
import pandas as pd # Data processing
import numpy as np # Linear Algebra
from os import listdir, path # Get files from directory

## 1. Extract email information

In this notebook we extract the email information from the Enron Email dataset. This dataset is publicly available and can be downloaded from *https://www.cs.cmu.edu/~./enron/*. Each of the 150 employees has their own folder, with subfolders for their inbox, sent emails, deleted items, discussion threads, etc. For this project we focused on the deleted items, inbox messages and sent emails.

The mails are stored in a separate folder for each employee. The top-level folder (the *head folder*) is called maildir; within it we find one folder per employee. We begin by checking how many employee folders there are and by looking at an example of a specific email.
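
To get a feel for the layout, one could count how often each subfolder name occurs across the employee folders. The sketch below assumes the `Enron/maildir` layout described above; `count_subfolders` is a hypothetical helper name and is not part of the original notebook.

```python
from collections import Counter
from os import listdir, path


def count_subfolders(maildir):
    """Count how many employee folders contain each subfolder name."""
    counts = Counter()
    for employee in sorted(listdir(maildir)):
        employee_dir = path.join(maildir, employee)
        # Skip stray files at the top level of maildir
        if not path.isdir(employee_dir):
            continue
        for name in listdir(employee_dir):
            # Only count directories, not loose files
            if path.isdir(path.join(employee_dir, name)):
                counts[name] += 1
    return counts


# Example usage (assumes the dataset was unpacked to 'Enron/maildir'):
# count_subfolders('Enron/maildir').most_common(10)
```

Folder names that appear for most employees are natural candidates for the analysis, while rare names are usually personal archive folders.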

The goal is to extract the email addresses of the sender and the receivers, the subject, the message ID, and the date and time of every email.

#### 1.1 Look at an example

In [2]:
# List of all employees in the directory
Employees = listdir('Enron/maildir')

In [3]:
# Check how many employees are present in the dataset
len(Employees)

150

In [4]:
# An example of an email
with open('Enron/maildir/allen-p/deleted_items/448', 'r') as file:
    print(file.read())

Message-ID: <12840380.1075862163884.JavaMail.evans@thyme>
Date: Tue, 27 Nov 2001 12:06:55 -0800 (PST)
From: tim.heizenrader@enron.com
To: john.lavorato@enron.com, k..allen@enron.com, john.zufferli@enron.com, 
	mike.grigsby@enron.com
Subject: West Power Briefing
Cc: tim.belden@enron.com, mike.swerzbin@enron.com, cooper.richey@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: tim.belden@enron.com, mike.swerzbin@enron.com, cooper.richey@enron.com
X-From: Heizenrader, Tim </O=ENRON/OU=NA/CN=RECIPIENTS/CN=THEIZEN>
X-To: Lavorato, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Jlavora>, Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>, Zufferli, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Jzuffer>, Grigsby, Mike </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mgrigsb>
X-cc: Belden, Tim </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tbelden>, Swerzbin, Mike </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mswerzb>, Richey, Cooper </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Crichey>
X-bcc: 
X-Fol

#### 1.2 Extract information from email

To create our social network we are interested in the employee who sent the email and the employees who received it. We also extract the date and time, the message ID and the subject; these will be important further down the line to remove duplicate emails. We start by creating a function **ExtractInfo** which gives us all the variables of interest. All variables are stored in a dataframe.
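
As a point of comparison, the same header fields can also be read with Python's standard `email` package instead of regular expressions. This is only a sketch of an alternative, not the approach used in this notebook; `parse_headers` and the shortened `sample` message are illustrative assumptions.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def parse_headers(raw_mail):
    """Return sender, receivers, date, subject and message ID of one raw email."""
    msg = message_from_string(raw_mail)
    # getaddresses splits comma-separated address headers into (name, address) pairs
    receivers = [address for _, address in getaddresses(msg.get_all('To', []))]
    date_header = msg.get('Date')
    return {
        'From': msg.get('From'),
        'To': receivers,
        # parsedate_to_datetime keeps the UTC offset instead of dropping it
        'Date': parsedate_to_datetime(date_header) if date_header else None,
        'Subject': msg.get('Subject'),
        'Message_ID': msg.get('Message-ID'),
    }


# Shortened copy of the example mail shown in section 1.1
sample = (
    "Message-ID: <12840380.1075862163884.JavaMail.evans@thyme>\n"
    "Date: Tue, 27 Nov 2001 12:06:55 -0800 (PST)\n"
    "From: tim.heizenrader@enron.com\n"
    "To: john.lavorato@enron.com, k..allen@enron.com, john.zufferli@enron.com, \n"
    "\tmike.grigsby@enron.com\n"
    "Subject: West Power Briefing\n"
    "\n"
    "Body text\n"
)
info = parse_headers(sample)
print(info['From'], len(info['To']), info['Date'])
```

The parser handles header lines that continue on the next line, so no manual clean-up of `\n\t` is needed in this variant.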

In [5]:
# Import packages
import re
from datetime import datetime

In [9]:
def ExtractInfo(path):
    
    """Extract all information needed for our project from the email stored at the given path.
    Returns one dataframe with a row for every direct receiver (To field), containing the sender,
    the receiver, the date and time, the subject and the Message_ID. Returns None if there is no To field."""
    
    # Open the file and read its contents
    with open(path, 'r') as file:
        mail = file.read()
    
    # Unfold header lines that continue on the next line (newline followed by a tab)
    mail = re.sub(r'\n\t', '', mail)
    # Split to list
    mail_list = re.split(r'\n', mail)
    
    # Extract message ID
    Message_ID = [line for line in mail_list if line.startswith('Message-ID:')]
    if len(Message_ID)>0:
        # We are only interested in the digits
        Message_ID = ''.join(re.findall(r'\d', Message_ID[0]))
    
    # Extract date from mail
    Date = [line for line in mail_list if line.startswith('Date:')]
    if len(Date)>0:
        Date = re.sub(r'Date: ', '', Date[0])
        Date = Date.split(' -')[0]
        Date = datetime.strptime(Date, '%a, %d %b %Y %H:%M:%S')
        
    # Extract subject
    Subject = [line for line in mail_list if line.startswith('Subject:')]
    if len(Subject)>0:
        Subject = re.sub('Subject: ', '', Subject[0])
    
    # Extract From
    From = [line for line in mail_list if line.startswith('From:')]
    # Remove the 'From: ' prefix from the first matching line, which gives a string
    From = re.sub('From: ', '', From[0])
    
    # Extract To; there can be multiple receivers, so split on ', '
    To = [line for line in mail_list if line.startswith('To:')]
    # There must be a value for To
    if len(To)>0:
        To = re.sub('To: ', '', To[0])
        To = To.split(', ')
        # Create dataframe containing from, to, date, message id and subject
        From_To = pd.DataFrame({'From': [From for name in To], 'To': To})
        From_To['Date'] = Date
        From_To['Subject'] = Subject
        From_To['Message_ID'] = Message_ID
    else: From_To = None
    
    # Return the dataframe (or None if there is no To field)
    return From_To

In [10]:
# Check the function on a specific mail
From_To= ExtractInfo('Enron/maildir/allen-p/deleted_items/448')

In [11]:
From_To

| | From | To | Date | Subject | Message_ID |
|---|---|---|---|---|---|
| 0 | tim.heizenrader@enron.com | john.lavorato@enron.com | 2001-11-27 12:06:55 | West Power Briefing | 128403801075862163884 |
| 1 | tim.heizenrader@enron.com | k..allen@enron.com | 2001-11-27 12:06:55 | West Power Briefing | 128403801075862163884 |
| 2 | tim.heizenrader@enron.com | john.zufferli@enron.com | 2001-11-27 12:06:55 | West Power Briefing | 128403801075862163884 |
| 3 | tim.heizenrader@enron.com | mike.grigsby@enron.com | 2001-11-27 12:06:55 | West Power Briefing | 128403801075862163884 |
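
The `Message_ID` column above keeps only the digits of the `Message-ID` header. The sketch below reproduces that step on the example header and places a hypothetical alternative, `full_message_id`, next to it that keeps the whole identifier. Digits-only IDs could in principle coincide for different mails; whether that happens in this dataset is not checked here.

```python
import re

header = 'Message-ID: <12840380.1075862163884.JavaMail.evans@thyme>'

# Digits-only ID, built the same way as in ExtractInfo
digits_only = ''.join(re.findall(r'\d', header))


def full_message_id(line):
    """Hypothetical alternative: return the text between the angle brackets."""
    match = re.search(r'<([^>]+)>', line)
    return match.group(1) if match else None


print(digits_only)
print(full_message_id(header))
```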


#### 1.3 Loop over all folders of interest

For this project we will be looking at all mails in the following folders:
* deleted_items
* inbox
* _sent_mail
* sent

We create an additional function which loops over all employees, selects the correct folder and extracts the variables we are interested in. 
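
The directory walk itself can be sketched separately from the parsing. The generator below is an illustrative variant built on `pathlib`; the name `iter_mail_files` and the `FOLDERS` constant are assumptions introduced for this example, and the folder list simply mirrors the four folders listed above.

```python
from pathlib import Path

FOLDERS = ['deleted_items', 'inbox', '_sent_mail', 'sent']


def iter_mail_files(maildir, folders=FOLDERS):
    """Yield (employee, folder, file path) for every mail file in the given folders."""
    for employee_dir in sorted(Path(maildir).iterdir()):
        if not employee_dir.is_dir():
            continue
        for folder in folders:
            folder_dir = employee_dir / folder
            # Not every employee has every folder
            if not folder_dir.is_dir():
                continue
            for file_path in sorted(folder_dir.iterdir()):
                # Skip nested subfolders; only regular files are mails
                if file_path.is_file():
                    yield employee_dir.name, folder, file_path


# Example usage (assumes the dataset was unpacked to 'Enron/maildir'):
# for employee, folder, file_path in iter_mail_files('Enron/maildir'):
#     ...
```

Keeping the walk in a generator means the list of paths never has to be held in memory at once, which may help on the full dataset.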

In [14]:
# Collect the mail information of one folder for all employees
def GetMailInfo(folder):
    """Loop over all employees and extract the mail information from the given folder."""

    # Collect the per-mail dataframes in a list and concatenate once at the end
    frames = []

    # Loop over employees and the directory of interest
    for employee in Employees:
        Employee_directory = '/'.join(['Enron/maildir', employee])
        if folder in listdir(Employee_directory):
            files = listdir('/'.join([Employee_directory, folder]))
            # Construct the path of every file in the folder
            file_paths = [Employee_directory + '/' + folder + '/' + file for file in files]
            # Make sure the path is a file and not a folder
            file_paths = [file_path for file_path in file_paths if path.isfile(file_path)]
            for file_path in file_paths:
                From_To = ExtractInfo(file_path)
                # ExtractInfo returns None for mails without a To field
                if From_To is not None:
                    frames.append(From_To)

    # Return one dataframe; keep the expected columns when no mail was found
    if not frames:
        return pd.DataFrame(columns=['From', 'To', 'Date', 'Subject', 'Message_ID'])
    return pd.concat(frames)

**Extract all information from deleted emails**

In [12]:
From_To_All_deleted_items = GetMailInfo('deleted_items')

In [33]:
print('Total amount of mails:', len(From_To_All_deleted_items))
print('Number of unique email addresses:', len(From_To_All_deleted_items.To.unique()))

Total amount of mails: 343031
Number of unique email addresses: 20272


**Extract all information from inbox**

In [21]:
From_To_All_inbox = GetMailInfo('inbox')

In [35]:
print('Total amount of mails:', len(From_To_All_inbox))
print('Number of unique email addresses:', len(From_To_All_inbox.To.unique()))

Total amount of mails: 392313
Number of unique email addresses: 22570


**Extract all information from _sent_mail**

In [18]:
From_To_All_sent_mail = GetMailInfo('_sent_mail')

In [19]:
print('Total amount of mails:', len(From_To_All_sent_mail))
print('Number of unique email addresses:', len(From_To_All_sent_mail.To.unique()))

Total amount of mails: 50370
Number of unique email addresses: 7039


**Extract all information from sent**

In [29]:
From_To_All_sent = GetMailInfo('sent')

In [39]:
print('Total amount of mails:', len(From_To_All_sent))
print('Number of unique email addresses:', len(From_To_All_sent.To.unique()))

Total amount of mails: 115179
Number of unique email addresses: 9719


#### 1.4 Concatenate data frames

Now that we have extracted all the necessary information from the folders of interest we will concatenate all these data frames.

In [46]:
# Concatenate all direct mails
Direct_Mails_all = pd.concat([From_To_All_deleted_items, From_To_All_inbox, From_To_All_sent_mail, From_To_All_sent])
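
The same mail can end up in more than one folder, for example in the sender's sent folder and in a receiver's inbox. The sketch below uses made-up rows to show how such rows could be collapsed after concatenation. Identifying duplicates by sender, receiver, date and subject is an assumption made for this example; the notebook's own deduplication step is not part of this section.

```python
import pandas as pd

date = pd.Timestamp('2001-11-27 12:06:55')

# Made-up rows: the first mail appears in both an inbox and a sent folder
inbox = pd.DataFrame({'From': ['a@enron.com'], 'To': ['b@enron.com'],
                      'Date': [date], 'Subject': ['Briefing']})
sent = pd.DataFrame({'From': ['a@enron.com', 'a@enron.com'],
                     'To': ['b@enron.com', 'c@enron.com'],
                     'Date': [date, date], 'Subject': ['Briefing', 'Briefing']})

# ignore_index gives the combined frame a fresh 0..n-1 index
combined = pd.concat([inbox, sent], ignore_index=True)

# Drop rows that agree on all four identifying columns
deduplicated = combined.drop_duplicates(
    subset=['From', 'To', 'Date', 'Subject']).reset_index(drop=True)

print(len(combined), len(deduplicated))
```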

In [52]:
# Export
Direct_Mails_all.to_csv('All direct mails', index=False)