## Summary
The Enron email dataset is large (~1.3 GB, more than 500,000 rows). On a PC with 8 GB of RAM, it cannot be read and wrangled in one pass. This code reads 'emails.csv' in chunks of 20,000 rows, wrangles each chunk, and discards the raw message text to free up memory. Each wrangled chunk is appended to a list, and the complete dataframe is built by concatenating that list. Finally, the clean dataframe is written to a CSV file.
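
As a minimal sketch of the chunk-then-concatenate pattern described above, the following uses a tiny in-memory CSV as a hypothetical stand-in for 'emails.csv' (the sample rows, the chunk size of 4, and the variable names are assumptions for illustration only):

```python
import io

import pandas as pd

# Hypothetical stand-in for 'emails.csv': ten rows with the same two columns.
csv_text = "file,message\n" + "".join(
    f"user{i % 2}/inbox/{i}.,body {i}\n" for i in range(10)
)

processed_chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Wrangle the chunk: derive the 'user' column, then drop the bulky raw column.
    chunk["user"] = chunk["file"].str.split("/").str[0]
    chunk = chunk.drop(columns=["message"])
    processed_chunks.append(chunk)

# Build the complete dataframe from the list of wrangled chunks.
result = pd.concat(processed_chunks, axis=0)
print(len(processed_chunks), result.shape)
```

Only one raw chunk is held in memory at a time; the list keeps just the smaller, wrangled pieces.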

In [1]:
import numpy as np 
import pandas as pd 
import email # https://docs.python.org/3.6/library/email.message.html

In [2]:
sample_df = pd.read_csv('emails.csv', nrows=5)
print(sample_df.head())
print('\n')
print('Sample email:')
print('\n')
print(sample_df['message'][0])

                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


Sample email:


Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


## Message Object
- An email message consists of headers and a payload (which is also referred to as the content). Headers are RFC 5322 or RFC 6532 style field names and values, where the field name and value are separated by a colon. The colon is not part of either the field name or the field value. The payload may be a simple text message, or a binary object, or a structured sequence of sub-messages each with their own set of headers and their own payload.
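
A short sketch of how the standard-library `email` package exposes headers and payload. The message text here is a made-up example (the addresses and Message-ID are hypothetical), not a row from the dataset:

```python
import email

# Hypothetical raw message: headers, a blank line, then the payload.
raw = (
    "Message-ID: <1.example@thyme>\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Forecast\n"
    "\n"
    "Here is our forecast\n"
)

msg = email.message_from_string(raw)

header_names = msg.keys()      # header field names, in order of appearance
sender = msg["From"]           # value of a header that is present
missing = msg["X-cc"]          # a header that is absent yields None
payload = msg.get_payload()    # the content after the blank line

print(header_names)
print(sender, missing, msg.is_multipart())
print(repr(payload))
```

Because looking up an absent header returns `None` rather than raising an error, a column built from such lookups simply has missing values for messages that lack that header.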

In [3]:
# Helper functions using methods from email library with message objects

def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)
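
To illustrate what `walk()` does for a structured message, here is a small sketch that builds a hypothetical two-part message (plain text plus HTML) with the standard-library MIME classes and passes it through the same helper. The helper is repeated so the block runs on its own; the message content is invented for illustration:

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return ''.join(parts)


# Hypothetical multipart message with a plain-text and an HTML alternative.
multi = MIMEMultipart("alternative")
multi.attach(MIMEText("plain body", "plain"))
multi.attach(MIMEText("<p>html body</p>", "html"))

# Round-trip through a string, as the notebook does with the CSV column.
parsed = email.message_from_string(multi.as_string())
text = get_text_from_email(parsed)
print(repr(text))
```

`walk()` visits the container first and then each sub-part, so only the `text/plain` part contributes to the result and the HTML alternative is skipped.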

In [4]:
# Read the large CSV file in chunks, build a dataframe from each chunk, and append it to a list for a final concatenation
mylist = []
i=0
# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
for df_chunk in pd.read_csv('emails.csv', chunksize=20000, on_bad_lines='skip', low_memory=False):

    # Parsing email messages
    
    # Return a message object structure from a string and make a list of all message objects
    messages = list(map(email.message_from_string, df_chunk['message']))
    df_chunk.drop('message', axis=1, inplace=True)

    # Get fields from the parsed email objects; messages[0], the first email in the chunk, supplies the keys,
    # so headers missing from that message are not extracted for this chunk
    keys = messages[0].keys() # keys: header field names, separated from their values by a ":"
    for key in keys:
        df_chunk[key] = [one[key] for one in messages] 
    
    # Parse content from emails
    df_chunk['content'] = list(map(get_text_from_email, messages)) # this function is defined above

    del messages
    
    # Extract the root of file column (filepath) as 'user'
    df_chunk['user'] = df_chunk['file'].map(lambda x:x.split('/')[0])
    df_chunk = df_chunk.set_index('Message-ID')\
                .drop(['file', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding'], axis=1)
    
    # Make Date column a datetime object
    df_chunk['Date'] = pd.to_datetime(df_chunk['Date'], infer_datetime_format=True)
    
    # i += 1
    # print('Chunk number', i, 'is processed.')
    mylist.append(df_chunk)
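
The Date headers carry a UTC offset, e.g. `-0700 (PDT)` in the sample email. As a sketch of what that offset implies, the standard library's `email.utils.parsedate_to_datetime` can parse the sample header and convert it to UTC. This is only an illustration with the standard library; how `pd.to_datetime` treats such offsets can depend on the pandas version:

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Date header of the sample email shown earlier in this notebook.
header_value = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"

local_dt = parsedate_to_datetime(header_value)   # timezone-aware, offset -07:00
utc_dt = local_dt.astimezone(timezone.utc)       # same instant expressed in UTC

print(local_dt.isoformat())  # 2001-05-14T16:39:00-07:00
print(utc_dt.isoformat())    # 2001-05-14T23:39:00+00:00
```

The UTC value, 23:39, is consistent with the first `Date` entry in the printed dataframe, which suggests the timestamps there are expressed in UTC rather than in the sender's local time.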

In [5]:
# Complete dataframe
df = pd.concat(mylist, axis= 0)
print(df.head())
df.info()

                                               Bcc   Cc                Date  \
Message-ID                                                                    
<18782981.1075855378110.JavaMail.evans@thyme>  NaN  NaN 2001-05-14 23:39:00   
<15464986.1075855378456.JavaMail.evans@thyme>  NaN  NaN 2001-05-04 20:51:00   
<24216240.1075855687451.JavaMail.evans@thyme>  NaN  NaN 2000-10-18 10:00:00   
<13505866.1075863688222.JavaMail.evans@thyme>  NaN  NaN 2000-10-23 13:13:00   
<30922949.1075863688243.JavaMail.evans@thyme>  NaN  NaN 2000-08-31 12:07:00   

                                                                  From  \
Message-ID                                                               
<18782981.1075855378110.JavaMail.evans@thyme>  phillip.allen@enron.com   
<15464986.1075855378456.JavaMail.evans@thyme>  phillip.allen@enron.com   
<24216240.1075855687451.JavaMail.evans@thyme>  phillip.allen@enron.com   
<13505866.1075863688222.JavaMail.evans@thyme>  phillip.allen@enron.com   
<3

In [6]:
del mylist
df.to_csv('out.csv')
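
If even the concatenated dataframe were too large for memory, one alternative (not what this notebook does) would be to append each wrangled chunk to the output file as soon as it is ready. A minimal sketch, again using a hypothetical in-memory CSV and a temporary output path as assumptions:

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for 'emails.csv'.
csv_text = "file,message\n" + "".join(
    f"user{i % 2}/inbox/{i}.,body {i}\n" for i in range(10)
)
# Hypothetical output location in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out_sketch.csv")

for n, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=4)):
    chunk["user"] = chunk["file"].str.split("/").str[0]
    chunk = chunk.drop(columns=["message"])
    # First chunk creates the file with a header; later chunks are appended without one.
    chunk.to_csv(out_path, mode="w" if n == 0 else "a", header=(n == 0), index=False)

check = pd.read_csv(out_path)
print(check.shape)
```

This only works cleanly when every chunk has the same columns in the same order, since later chunks are written without a header row.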