In [53]:
import numpy as np
import pandas as pd
import email
import sklearn
import nltk

# Data Preprocessing and Exploration

In [16]:
#loading the data set
df = pd.read_csv('emails.csv')
print(f'the dataset has dimensions: {np.shape(df)} with {len(df)} entries')

the dataset has dimensions: (517401, 2) with 517401 entries


To understand the dataset, I display the first few rows below. Each message is stored as a single raw string, so it will be important to separate the headers from the body of each message.

In [17]:
df.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


Since we are only interested in the content of the messages, we can drop the file column.

In [19]:
df = df.drop('file',axis = 1)

In [20]:
sample_email = df.loc[1, 'message']
print(sample_email)

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the

Python's built-in <em>email</em> package is designed to parse messages in this format, as shown below. <br><br>
Note that the To and From headers return the person's email address, while the X-To and X-From headers return the person's name. As the output below shows, the recipient's name in the file is not clean: the X-To value contains routing information in addition to the name.

In [40]:
e = email.message_from_string(sample_email)
print('Date Sent:',e.get('date'))
print('Sender:',e.get('X-From'))
print('Sender (email):',e.get('From'))
print('Recipient:',e.get('X-To'))
print('Recipient (email):',e.get('To'))
print('Body:',e.get_payload())

Date Sent: Fri, 4 May 2001 13:51:00 -0700 (PDT)
Sender: Phillip K Allen
Sender (email): phillip.allen@enron.com
Recipient: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
Recipient (email): john.lavorato@enron.com
Body: Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the presenter speaks and the others are quiet just waiting for their turn.   The meetings might be better if held in a round table discussion format.  

My suggestion for where to go is Austin.  Play golf and rent a ski boat and jet ski's.  Flying somewhere takes 
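
One possible way to tidy the X-To value shown above is to strip the angle-bracketed routing block. This is only a sketch: the helper name clean_display_name is my own, and it assumes the unwanted part always sits inside '<...>', which I have only checked against this one sample.

```python
import re

def clean_display_name(x_header):
    """Return an X-To/X-From value with any '<...>' routing block removed."""
    if not x_header:
        return ""
    # Drop every '<...>' block together with the whitespace in front of it.
    return re.sub(r"\s*<[^>]*>", "", x_header).strip()

raw = "John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>"
print(clean_display_name(raw))
```

Headers with several recipients, or names in other formats, may need additional handling.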

In [23]:
#Load the dataframe of parsed emails from pickle (run this cell instead of the cell below)
import pickle
with open('df_emails.pickle','rb') as df_emails:
    df = pickle.load(df_emails)

In [24]:
#DO NOT RUN unless changing the code in this cell
def get_emails(df):
    #parse each raw message string into an email.message.Message object
    return [email.message_from_string(m) for m in df['message']]
emails = get_emails(df)
df['emails'] = emails

#save the parsed emails with pickle so they do not have to be parsed again
import pickle
with open('df_emails.pickle','wb') as df_emails:
    pickle.dump(df, df_emails)

KeyboardInterrupt: 

Now we can use the parsed email objects to extract the fields we want (To, From, Date, X-cc, and X-bcc) along with the message body, and add them to our dataframe.

In [46]:
for i in ['To','From','Date','X-cc','X-bcc']:
    df[i] = [e.get(i) for e in df['emails']]
df['text'] = [e.get_payload() for e in df['emails']]
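
The Date column extracted above is still a plain string. As a hedged sketch, the standard library's email.utils.parsedate_to_datetime can turn such a header into a timezone-aware datetime. The wrapper name parse_email_date is my own, and returning None for unparseable values is an assumption about how bad rows should be handled.

```python
from email.utils import parsedate_to_datetime

def parse_email_date(value):
    """Parse an RFC 2822 style Date header; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Missing or malformed headers end up here.
        return None

sent = parse_email_date("Fri, 4 May 2001 13:51:00 -0700 (PDT)")
print(sent.isoformat())
```

Applied to the dataframe, this could look like df['Date'].map(parse_email_date), although I have not timed that on the full dataset.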

## Data Visualization
With the emails separated by content type, we can now view information about the different categories, such as who appears in the dataset most and least often, which words are commonly used, etc.

In [52]:
#Top 10 recipients (values of the 'To' header)
df['To'].value_counts()[:10]

pete.davis@enron.com         9155
tana.jones@enron.com         5677
sara.shackleton@enron.com    4974
vkaminski@aol.com            4870
jeff.dasovich@enron.com      4350
kate.symes@enron.com         3517
all.worldwide@enron.com      3324
mark.taylor@enron.com        3295
kay.mann@enron.com           3085
gerald.nemec@enron.com       3074
Name: To, dtype: int64
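
Counting whole header values treats a header that lists several addresses as one distinct value. If per-address counts are wanted, a rough sketch using the standard library's email.utils.getaddresses is shown here. The function name count_recipients and the toy headers are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from email.utils import getaddresses

def count_recipients(to_headers):
    """Count how often each individual address appears across To headers."""
    counts = Counter()
    for header in to_headers:
        # Skip missing values such as None or NaN.
        if not isinstance(header, str) or not header.strip():
            continue
        for _name, address in getaddresses([header]):
            if address:
                counts[address.lower()] += 1
    return counts

toy_headers = ["a@example.com, b@example.com", "a@example.com", None]
recipient_counts = count_recipients(toy_headers)
print(recipient_counts.most_common(10))
```

With the notebook's dataframe, count_recipients(df['To']) would be the corresponding call.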

Below I define a function that first splits the text into words, then removes tokens that are only punctuation or filler (stop) words. It then reduces each remaining word to its root (stem) and counts how often each stem appears in the subset. Stemming matters because we want different forms of a word, such as its past and present tense, to count as the same word when measuring frequency.

In [135]:
from nltk.corpus import stopwords
import string
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stop_words = stopwords.words('english')
#build the exclusion set once; set lookups avoid rebuilding a list for every word
excluded = set(stop_words) | set(string.punctuation) | {'--'}

def get_top_words(series, n=25):
    wordlist = list()
    for i in series:
        tok = nltk.word_tokenize(i)
        for w in tok:
            w = w.lower()
            if w not in excluded:
                wordlist.append(ps.stem(w))
    #change n to view a higher or lower number of top words
    return pd.Series(wordlist).value_counts()[:n]
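
For comparison, here is a lighter-weight sketch of the same counting idea that updates a collections.Counter as it goes instead of building one long word list. To keep it self-contained it swaps NLTK's tokenizer and stop-word list for a simple regular expression and a tiny hypothetical stop-word set, and it does no stemming, so its counts will not match the function above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's list.
TOY_STOP_WORDS = frozenset({"the", "a", "to", "of", "and", "is", "it"})
# Simple regex tokenizer standing in for nltk.word_tokenize.
TOKEN_RE = re.compile(r"[a-z']+")

def count_words(texts, stop_words=TOY_STOP_WORDS):
    """Count non-stop-word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

toy_texts = ["The meeting is in Austin.", "Cancel the meeting and the trip."]
word_counts = count_words(toy_texts)
print(word_counts.most_common(3))
```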

Due to computational limitations, I cannot run this over all 500,000+ emails, nor can I shuffle the order of all 500,000 emails. I therefore list the top 25 words for the first 10,000 emails. Since the emails are ordered by sender, this subset will be biased, but we can still gain some insight from it.

In [213]:
get_top_words(df['text'][:10000])

KeyboardInterrupt: 
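
If drawing random row positions turns out to be affordable even when shuffling the whole dataframe is not, the sender-order bias could be reduced by sampling positions instead of taking the first 10,000 rows. The sketch below is untested on this dataset, and the helper name sample_positions and the seed are my own choices.

```python
import random

def sample_positions(n_rows, k, seed=0):
    """Return k distinct row positions in [0, n_rows), sorted ascending."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_rows), min(k, n_rows)))

# 517401 is the row count reported when the dataset was loaded.
positions = sample_positions(n_rows=517401, k=10000)
print(len(positions), positions[:3])
```

The sampled texts could then be selected with df['text'].iloc[positions] and passed to get_top_words.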