# Enron Emails dev challenge project

This project uses the public Enron Emails dataset, https://www.cs.cmu.edu/~./enron/,
in its May 7, 2015 version.

By Anette Karhu

## Task 1 outline
### 1) Calculate how many emails were sent from each sender address to each recipient.
The result should be a CSV file with a header row and three columns:

- sender: the sending email address
- recipient: the recipient email address
- count: the number of emails sent from sender to recipient

If an email has multiple recipients, CCs or BCCs, count it as if it had been sent to each recipient individually.
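
As a minimal sketch of this counting rule, the snippet below expands a few hand-made emails into (sender, recipient) pairs and tallies them with collections.Counter. The sample addresses and the helper name count_pairs are made up for illustration and are not part of the project code.

```python
from collections import Counter


def count_pairs(emails):
    """Count (sender, recipient) pairs; every To, Cc and Bcc address counts once per email."""
    counts = Counter()
    for mail in emails:
        # Treat To, Cc and Bcc alike: each address is its own recipient.
        recipients = mail.get('to', []) + mail.get('cc', []) + mail.get('bcc', [])
        for recipient in recipients:
            counts[(mail['from'], recipient)] += 1
    return counts


# Hypothetical sample data, not taken from the dataset.
sample_emails = [
    {'from': 'a@example.com', 'to': ['b@example.com'], 'cc': ['c@example.com']},
    {'from': 'a@example.com', 'to': ['b@example.com', 'c@example.com']},
]

for (sender, recipient), count in sorted(count_pairs(sample_emails).items()):
    print(sender, recipient, count)
```

The first email contributes one pair for its To address and one for its Cc address, so both pairs end up with the same weight as a directly addressed email.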

## How to execute the code

Call the main function to run the program. It goes through every email
in the Enron Emails dataset and writes to a CSV file (emails_sent_totals.csv)
how many emails each sender sent to each receiver.

Pass the location of the Enron Emails dataset to the class: Enron_emails('add/path/here').
In this project the location was:
C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir.

Pass the path of the output file, including the .csv filename, to the method:
count_similarities('csv/path/here.csv'). This creates emails_sent_totals.csv.

Give the paths as quoted strings. On Windows, prefix the string with r, i.e.
r'path/to/email/set', because otherwise backslashes in the path are read as escape sequences and errors occur.
On Linux or Mac a plain 'path/to/email/set' is enough; the r prefix is not needed there.
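
The effect of the r prefix can be seen in a short sketch. The path used below is a made-up example, chosen because it contains \t and \n, which Python reads as a tab and a newline in a plain string.

```python
# Hypothetical Windows path; \t and \n are escape sequences in a plain string.
plain_path = 'C:\temp\new'
raw_path = r'C:\temp\new'

print(repr(plain_path))  # the backslashes were consumed: a tab and a newline remain
print(repr(raw_path))    # every backslash is kept

# Forward slashes avoid the problem; Python accepts them in Windows paths too.
portable_path = 'C:/temp/new'
print(portable_path)
```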

When the program has finished, None is printed and the CSV file has been created at the given path.
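
To check the result, the output file can be read back and the largest counts listed. The sketch below uses the standard csv module on a small in-memory sample instead of the real file. The rows and the helper name top_pairs are invented for illustration, and the columns are read by position, so only the order sender, recipient, count is assumed.

```python
import csv
import io


def top_pairs(csv_file, limit=3):
    """Return the rows with the highest counts from a sender, recipient, count CSV."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    rows = [(sender, recipient, int(count)) for sender, recipient, count in reader]
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


# Invented sample content; with the real file use open(path, newline='', encoding='utf-8').
sample = io.StringIO(
    'sender,recipient,count\n'
    'a@example.com,b@example.com,2\n'
    'a@example.com,c@example.com,5\n'
    'b@example.com,a@example.com,1\n'
)

for row in top_pairs(sample, limit=2):
    print(row)
```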


In [1]:
from email.parser import BytesHeaderParser
import pandas as pd
import os
import csv
import re

In [2]:
class Enron_emails:
    '''
    A class to handle the Enron emails.
    
    The program goes through every email in the Enron Emails dataset and
    parses the email headers to select only the From, To, Cc, and Bcc fields, i.e. the email addresses.
    These email addresses are then saved into a csv for memory consumption reasons.
    
    At first the CSV has two columns: the first column, the sender,
    refers to the From header in the emails. The second column, the receivers,
    contains the data from the To, Cc, and Bcc headers in the emails.
    
    Lastly, a new column 'count' is added to the CSV file to show
    how many emails a certain sender has sent to a certain receiver.
    '''
    def __init__(self, root_directory):
        '''
        Initialize with root directory of where the enron emails are located.
        '''
        self.root_directory = root_directory
    
    def users_directories_files(self):
        '''
        List of paths to all files for all users.
        Used for reading email data from all users and all folders.
        '''
        all_dirs = [(os.path.join(root,file)) for root,dirs,files in os.walk(self.root_directory) for file in files]
        return all_dirs


    def load_parse_and_save(self, csv_file_path):
        '''
        Loads all email data and opens the files in binary mode. The function parses the From, To, Cc,
        and Bcc headers with BytesHeaderParser and saves the parsed email addresses first into a list
        of tuples and then into a csv-file if they contain values. This new csv-file has two columns,
        the sender and the receiver email addresses.
        '''
        sender_receiver_list = []
        for mail in self.users_directories_files():
            with open(mail, 'rb') as fp:
                email = BytesHeaderParser().parse(fp)
            if email['from'] is None:
                # Without a From header there is no sender to count for.
                continue
            sender = str(email['from'])
            # To, Cc and Bcc are all treated as receivers of the same email.
            for header in ('to', 'cc', 'bcc'):
                if email[header] is not None:
                    sender_receiver_list.append((sender, str(email[header])))
        # The with-block closes the file, so every row is written before the file is read again.
        with open(csv_file_path, 'w', newline='', encoding="utf-8") as csv_file:
            csv.writer(csv_file).writerows(sender_receiver_list)
        return
    
    def count_similarities(self, path_to_csv):
        '''
        This function first builds the intermediate csv-file, opens it as a pandas DataFrame and
        cleans special characters from the email data. Next, it splits the receiver column
        into separate rows, so that every row holds only one sender and one receiver.
        Finally, the function counts how many times each sender and receiver pair occurs
        and saves this information into the csv-file with a new count column.
        '''
        self.load_parse_and_save(path_to_csv)
        # The intermediate file has no header row, so name the columns explicitly;
        # otherwise the first data row would be consumed as the header.
        csv_as_df = pd.read_csv(path_to_csv, header=None, names=['sender', 'receiver'], dtype=str)
        # Remove the tuple residue (' and ',) from the receiver strings.
        csv_as_df['receiver'] = csv_as_df['receiver'].str.replace("('", '', regex=False)
        csv_as_df['receiver'] = csv_as_df['receiver'].str.replace("',)", '', regex=False)
        # Split the receivers into their own rows: one sender and one receiver per row.
        csv_as_df['receiver'] = csv_as_df['receiver'].str.split(',')
        splitted_receivers_df = csv_as_df.explode('receiver').reset_index(drop=True)
        # Clean each address on its own: remove any word before '<' (or '<.'), the closing '>'
        # and the whitespace left over from folded header lines.
        receivers = splitted_receivers_df['receiver'].str.replace(r'.*<\.?', '', regex=True)
        splitted_receivers_df['receiver'] = receivers.str.replace('>', '', regex=False).str.strip()
        splitted_receivers_df = splitted_receivers_df[splitted_receivers_df['receiver'].str.len() > 0]
        # Count how many rows each sender and receiver pair has.
        counted_data = splitted_receivers_df.groupby(['sender', 'receiver']).size().reset_index(name='count')
        # Save in the requested format: sender,recipient,count.
        counted_data = counted_data.rename(columns={'receiver': 'recipient'})
        counted_data.to_csv(path_to_csv, index=False)
        return
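
An alternative to cleaning the receiver strings by hand is to let the standard library split an address header. The sketch below parses a made-up message with BytesHeaderParser and uses email.utils.getaddresses to get one address per recipient, which also copes with display names and folded header lines. The sample message and the helper name sender_recipient_pairs are illustrative assumptions, not part of the class above.

```python
from email.parser import BytesHeaderParser
from email.utils import getaddresses

# Made-up message; the To header is folded over two lines as in long recipient lists.
raw_message = (
    b'From: alice@example.com\n'
    b'To: bob@example.com, Carol Smith <carol@example.com>,\n'
    b'\tdave@example.com\n'
    b'Cc: erin@example.com\n'
    b'Subject: test\n'
    b'\n'
    b'Body text.\n'
)


def sender_recipient_pairs(message_bytes):
    """Return (sender, recipient) tuples for every To, Cc and Bcc address."""
    headers = BytesHeaderParser().parsebytes(message_bytes)
    senders = getaddresses(headers.get_all('from', []))
    sender = senders[0][1] if senders else ''
    if not sender:
        # No usable From address, so there is nothing to count.
        return []
    pairs = []
    for name in ('to', 'cc', 'bcc'):
        # get_all returns every occurrence of the header; getaddresses splits
        # the values into (display name, address) tuples.
        for _, address in getaddresses(headers.get_all(name, [])):
            if address:
                pairs.append((sender, address))
    return pairs


for pair in sender_recipient_pairs(raw_message):
    print(pair)
```

Each returned tuple already holds exactly one sender and one recipient, so the pairs could be counted directly without a separate splitting step.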
        

In [3]:
def main():
    r'''
    The main function that runs the program. The program goes through all emails
    in the Enron Emails dataset and writes to a csv (emails_sent_totals.csv)
    how many emails each sender sent to each receiver.
    
    To execute the code, pass the location of the Enron Emails dataset to the
    class: Enron_emails('add/path/here'). Here the location has been:
    C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir.
    
    Also pass the output path, including the .csv filename, to the method:
    count_similarities('csv/path/here.csv'). This creates emails_sent_totals.csv.
    
    Give the paths as quoted strings. On Windows, prefix the string with r, i.e.
    r'path/to/email/set', as backslashes in the path are otherwise read as escape sequences and errors occur.
    On Linux or Mac 'path/to/email/set' is enough; the r prefix is not needed there.
    
    When the program has executed, None is printed. The csv-file has been created at the given path.
    '''
    enron_all_mails = Enron_emails(r'C:\Users\Anette\Documents\Enron_Emails_project\enron_emails\maildir')
    # count_similarities returns None, so None is printed once the csv-file has been written.
    print(enron_all_mails.count_similarities(r'C:\Users\Anette\Documents\enron_emails\emails_sent_totals.csv'))

if __name__ == "__main__":
    main()


None
