# Enron Emails dev challenge project

Public dataset Enron Emails, https://www.cs.cmu.edu/~./enron/.
Dataset version May 7, 2015.

By Anette Karhu

## Task outline
### 1) Calculate how many emails were sent from each sender address to each recipient.
The result should be a CSV file that contains three columns (with header row included):

- sender: the sending email address
- recipient: the recipient email address
- count: number of emails sent from sender to recipient

If an email has multiple recipients, CCs or BCCs, count it as if it had been sent to each recipient individually.

### 2) Calculate the average number of emails received per day per employee per day of week (Monday, Tuesday, etc.).
An employee is here defined as a person whose shortened name appears as a folder name in maildir, for example taylor-m.

The result should be a CSV file that contains three columns (with header row included):

- employee: the short name of the employee
- day_of_week: day of week as a number 0-6, where 0 is Monday, 1 is Tuesday, etc.
- avg_count: average number of emails received on the corresponding day of week by the corresponding employee
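
The notebook does not reach task 2, so the following is only a minimal sketch of one way to compute it with pandas. The function name `average_received_per_weekday` and the input column `received_at` are hypothetical. It also assumes one particular reading of "average": the mean of the daily counts over the calendar dates on which the employee received at least one email, so dates with zero emails are not counted.

```python
import pandas as pd


def average_received_per_weekday(received: pd.DataFrame) -> pd.DataFrame:
    """Average daily email count per employee and weekday.

    `received` is assumed to have one row per received email, with the
    columns 'employee' (folder short name) and 'received_at' (datetime).
    """
    df = received.copy()
    df["date"] = df["received_at"].dt.date
    df["day_of_week"] = df["received_at"].dt.dayofweek  # Monday=0 ... Sunday=6

    # Number of emails per employee per calendar date.
    daily = (
        df.groupby(["employee", "day_of_week", "date"])
        .size()
        .rename("daily_count")
        .reset_index()
    )

    # Mean of those daily counts per employee and weekday.
    return (
        daily.groupby(["employee", "day_of_week"])["daily_count"]
        .mean()
        .rename("avg_count")
        .reset_index()
    )


# Made-up sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "employee": ["taylor-m", "taylor-m", "taylor-m", "allen-p"],
    "received_at": pd.to_datetime([
        "2001-05-07 09:00", "2001-05-07 15:30",
        "2001-05-14 10:00", "2001-05-08 08:00",
    ]),
})
result = average_received_per_weekday(sample)
```

If days without any email should count as zero, the daily counts would first have to be reindexed over the full date range of the dataset before taking the mean.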

In [2]:
from email.parser import BytesParser, Parser
from email.policy import default
import pandas as pd
from email.message import EmailMessage
import os

In [3]:
# Start with a small part of the data, as the full dataset is over 2 GB.

# One user's folder for testing
rootdir = r'C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir\allen-p'
# All users' folders
maindir = r'C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir'

In [4]:
# List all folders, i.e. the usernames of email holders.
all_users = os.listdir(maindir)

In [5]:
# TODO: how to go through all files in batches?

# TODO: test the program with one user:
# go through all of the user's emails in all folders,
# make a zipped list and DataFrame of each email's sender and receiver,
# split the tuples,
# count the number of sent emails.

arnolds_mail_root = r'C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir\arnold-j'
for i in os.walk(arnolds_mail_root):
    print(i)

('C:\\Users\\Anette\\Documents\\Enron_Emails_project\\enron_emails\\maildir\\arnold-j', ['2000_conference', 'active_international', 'all_documents', 'avaya', 'bmc', 'bridge', 'bristol_babcock', 'colleen_koenig', 'compaq', 'computer_associates', 'continental_airlines', 'cooper_cameron', 'corestaff', 'deleted_items', 'dell', 'discussion_threads', 'ebs', 'ees', 'enron_europe', 'etol', 'fedex', 'ge', 'hp', 'human_resources', 'inbox', 'kinko_s', 'nepco', 'nepco_europe', 'notes_inbox', 'oec', 'pcc_values', 'personal', 'purchasing', 'requisite', 'sap', 'sarah_joy_hunter', 'sent', 'sent_items', 'sonoco', 'sony', 'sparefinders_com', 'tasks', 'universal_studios', 'vulcan_signs', 'weekly_report', '_sent_mail'], [])
('C:\\Users\\Anette\\Documents\\Enron_Emails_project\\enron_emails\\maildir\\arnold-j\\2000_conference', [], ['1_', '2_', '3_'])
('C:\\Users\\Anette\\Documents\\Enron_Emails_project\\enron_emails\\maildir\\arnold-j\\active_international', [], ['1_'])
('C:\\Users\\Anette\\Documents\\Enr
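
One possible answer to the batching TODO above is to turn the `os.walk` traversal into a generator and consume it in fixed-size chunks, so that only one batch of paths is held in memory at a time. This is a sketch, not the notebook's method: the helper names `iter_mail_files` and `in_batches` are hypothetical, and the demonstration runs on a throwaway directory rather than the real maildir.

```python
import os
import tempfile
from itertools import islice
from typing import Iterable, Iterator, List


def iter_mail_files(root: str) -> Iterator[str]:
    """Yield the path of every file below root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def in_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group paths into lists of at most batch_size items."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Demonstration on a temporary folder holding five empty files.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "inbox"))
    for i in range(5):
        with open(os.path.join(tmp, "inbox", f"{i}_"), "w") as fh:
            fh.write("")
    batch_sizes = [len(b) for b in in_batches(iter_mail_files(tmp), 2)]
```

On the real data the same loop would be started with `in_batches(iter_mail_files(maindir), 1000)`, where the batch size of 1000 is an arbitrary choice.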

In [6]:
# Root folder of arnold-j's sent emails
arnolds_sent_mail_root = r'C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir\arnold-j\_sent_mail'

# File names in the folder, sorted by name length
file_names = sorted(os.listdir(arnolds_sent_mail_root), key=len)
# print(file_names)

# Full path of every file in the folder
sent_mails_dirs = [os.path.join(arnolds_sent_mail_root, file_name) for file_name in file_names]
# print(sent_mails_dirs)

In [7]:
# Loop over one sender's (arnold-j's) sent emails in the _sent_mail folder
# and collect the From, To, Cc and Bcc headers.
# TODO: build a nested dictionary in the loop, with one row per email holding
# that email's own header values.
# TODO: add an email only if it is not already in the dictionary, or delete duplicates.
# TODO: convert the dictionary into a pd.DataFrame that is easy to export as CSV.


receiver_list = []
sender_list = []
# Open each MIME-formatted email and parse its headers with the email library.
# Missing To, Bcc and Cc headers (parsed as None) are skipped.
for mail in sent_mails_dirs:
    with open(mail, 'rb') as fp:
        headers = BytesParser(policy=default).parse(fp)
        sender_list.append(str(headers['from']))
        if headers['to'] is not None:
            receiver_list.append(str(headers['to']))
        if headers['bcc'] is not None:
            receiver_list.append(str(headers['bcc']))
        if headers['cc'] is not None:
            receiver_list.append(str(headers['cc']))

# Zip the separate sender and receiver lists together into a list of tuples.
# NOTE: sender_list gets one entry per email, but receiver_list gets one entry
# per To, Bcc and Cc header that is present, so the lists can differ in length.
# zip() stops at the shorter list and pairs entries by position only.
tuples = list(zip(sender_list, receiver_list))

print(len(receiver_list), len(sender_list))

# Open question: do we want all emails in the dataset, or only the emails
# sent by the user?

828 814
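
The two lengths printed above differ (828 receiver entries against 814 senders), which is consistent with some emails contributing more than one of the To, Bcc and Cc headers. A way to keep senders and receivers aligned is to build the pairs per message instead of zipping two independent lists. The sketch below assumes a hypothetical helper `sender_recipient_pairs` and uses a made-up message; `email.utils.getaddresses` splits the header values into individual addresses.

```python
from email import message_from_string
from email.policy import default
from email.utils import getaddresses


def sender_recipient_pairs(message):
    """Return one (sender, recipient) tuple per address in To, Cc and Bcc."""
    sender = str(message["from"])
    recipient_headers = []
    for field in ("to", "cc", "bcc"):
        # get_all returns every header with that name, or the default [].
        recipient_headers.extend(str(value) for value in message.get_all(field, []))
    return [
        (sender, address)
        for _name, address in getaddresses(recipient_headers)
        if address
    ]


# Made-up message for illustration, not taken from the dataset.
raw = (
    "From: john.arnold@enron.com\n"
    "To: a@example.com, b@example.com\n"
    "Cc: c@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
pairs = sender_recipient_pairs(message_from_string(raw, policy=default))
```

Because every tuple is created together with its own sender, the result can be passed straight to `pd.DataFrame(pairs, columns=['sender', 'receiver'])` and already has one row per recipient.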


In [9]:
# Create a dataframe 
dataframe = pd.DataFrame(tuples, columns=['sender', 'receiver'])
dataframe


| | sender | receiver |
|---|---|---|
| 0 | john.arnold@enron.com | slafontaine@globalp.com |
| 1 | john.arnold@enron.com | jenwhite7@zdnetonebox.com |
| 2 | john.arnold@enron.com | greg.whalley@enron.com |
| 3 | john.arnold@enron.com | sarah.wesner@enron.com |
| 4 | john.arnold@enron.com | ina.rangel@enron.com |
| ... | ... | ... |
| 809 | john.arnold@enron.com | slafontaine@globalp.com |
| 810 | john.arnold@enron.com | phillip.allen@enron.com |
| 811 | john.arnold@enron.com | jennifer.shipos@enron.com |
| 812 | john.arnold@enron.com | jennifer.fraser@enron.com |


In [10]:
# Split multiple receivers into separate rows. The column names are lost in
# the concat step, so set them again and restore the original column order: sender, receiver.
splitted_receivers_df = pd.concat([pd.Series(row['sender'], row['receiver'].split(', ')) for _, row in dataframe.iterrows()]).reset_index()
splitted_receivers_df.columns =['receiver', 'sender']
splitted_receivers_df = splitted_receivers_df.reindex(columns=['sender', 'receiver'])
splitted_receivers_df

| | sender | receiver |
|---|---|---|
| 0 | john.arnold@enron.com | slafontaine@globalp.com |
| 1 | john.arnold@enron.com | jenwhite7@zdnetonebox.com |
| 2 | john.arnold@enron.com | greg.whalley@enron.com |
| 3 | john.arnold@enron.com | sarah.wesner@enron.com |
| 4 | john.arnold@enron.com | ina.rangel@enron.com |
| ... | ... | ... |
| 857 | john.arnold@enron.com | slafontaine@globalp.com |
| 858 | john.arnold@enron.com | phillip.allen@enron.com |
| 859 | john.arnold@enron.com | jennifer.shipos@enron.com |
| 860 | john.arnold@enron.com | jennifer.fraser@enron.com |
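
The same row-per-receiver split can also be written with `Series.str.split` and `DataFrame.explode`, which keeps the column names and order without a per-row loop. This is an alternative sketch on made-up data, assuming (as the cell above does) that the addresses are separated by a comma and a space.

```python
import pandas as pd

# Made-up rows: the first email has two receivers in one field.
pairs = pd.DataFrame({
    "sender": ["john.arnold@enron.com", "john.arnold@enron.com"],
    "receiver": ["a@example.com, b@example.com", "c@example.com"],
})

exploded = (
    pairs.assign(receiver=pairs["receiver"].str.split(", "))  # string -> list
    .explode("receiver")                                     # list -> one row each
    .reset_index(drop=True)
)
```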


In [12]:
# Count how many times each receiver address appears in the list of sent
# emails, per sender.
counted_data = splitted_receivers_df.pivot_table(index=['sender', 'receiver'], aggfunc='size')
counted_data = pd.DataFrame(counted_data)
counted_data.rename(columns={0:'count'}, inplace=True)
# print(counted_data.columns)
counted_data

| sender | receiver | count |
|---|---|---|
| john.arnold@enron.com | adam.r.bayer@vanderbilt.edu | 3 |
| john.arnold@enron.com | aedc@aedc.org | 1 |
| john.arnold@enron.com | airam.arteaga@enron.com | 1 |
| john.arnold@enron.com | alan_batt@oxy.com | 1 |
| john.arnold@enron.com | allen.elliott@enron.com | 2 |
| john.arnold@enron.com | ... | ... |
| john.arnold@enron.com | w.duran@enron.com | 1 |
| john.arnold@enron.com | websupport@moneynet.com | 4 |
| john.arnold@enron.com | wine@bassins.com | 1 |
| john.arnold@enron.com | wsx@wsx.wsex.com | 1 |
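
An equivalent count can be obtained with `groupby(...).size()`, and `reset_index(name='count')` then gives a flat DataFrame with three ordinary columns instead of a two-level index. The sketch below uses made-up addresses and is an alternative to the pivot table cell, not a replacement for it.

```python
import pandas as pd

# Made-up sender/receiver rows, one row per delivered copy of an email.
pairs = pd.DataFrame({
    "sender": ["s@example.com", "s@example.com", "s@example.com"],
    "receiver": ["a@example.com", "b@example.com", "a@example.com"],
})

counts = (
    pairs.groupby(["sender", "receiver"])
    .size()                      # number of rows per (sender, receiver) pair
    .reset_index(name="count")   # back to flat columns: sender, receiver, count
)
```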


In [16]:
# Root folder for CSV files
root_file_for_csv = r'C:\Users\Anette\Documents\enron_emails'

# Write the counted DataFrame (built from the test tuples above) to a new
# CSV file, emails_sent_totals.csv.
csv_file_root = os.path.join(root_file_for_csv, 'emails_sent_totals.csv') 
# print(csv_file_root)
counted_data.to_csv(csv_file_root)
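
The task outline asks for the columns `sender`, `recipient` and `count`, while the notebook names the second column `receiver`. A small sketch of an export that matches the requested header, assuming a flat DataFrame with made-up values and writing to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Made-up counts in the notebook's column naming.
counts = pd.DataFrame({
    "sender": ["s@example.com", "s@example.com"],
    "receiver": ["a@example.com", "b@example.com"],
    "count": [2, 1],
})

buffer = io.StringIO()
(
    counts.rename(columns={"receiver": "recipient"})  # header name from the task
    .to_csv(buffer, index=False)                      # no extra index column
)
csv_text = buffer.getvalue()
```

Passing a file path such as `csv_file_root` in place of `buffer` would write the same content to disk.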