# Exploratory Data Analysis of Enron Dataset

### About Enron
The Enron scandal was a series of events involving dubious accounting practices that led to the 2001 bankruptcy of Enron Corporation, an energy, commodities, and services company, and to the subsequent dissolution of its auditor, the accounting firm Arthur Andersen.

### About Enron Email Dataset
The Enron Email Dataset is a collection of approximately 500,000 emails written by employees of the Enron Corporation. The Federal Energy Regulatory Commission obtained the emails during its investigation of Enron's collapse, and they became publicly available afterwards. The dataset is notable for several reasons:

#### Contents and Structure
Email Collection: The dataset includes approximately 500,000 emails from around 150 Enron employees. These emails were collected as part of the legal proceedings and investigations into the company's fraud.

Data Structure: The dataset is typically organized into folders and subfolders that mirror each employee's mailbox. Every message carries header fields such as sender, recipients, subject, and date, followed by the message body.

Formats: The emails are distributed as plain text. The widely used Carnegie Mellon release does not include attachments. This notebook reads a CSV export (`emails.csv`) with one row per message: a `file` column holding the mailbox path and a `message` column holding the raw text.

#### Key Features
Time Range: The emails span several years, primarily from 2000 to 2001, capturing a crucial period before and during the exposure of the Enron scandal.

Employee Roles: The dataset includes emails from a diverse range of Enron employees, from executives to lower-level staff, which provides insights into various levels of the company's operations and culture.

#### Uses and Applications
Research and Analysis: Researchers use the dataset to study corporate communication patterns, decision-making processes, and organizational behavior. It provides a real-world example of how corporate fraud can be intertwined with day-to-day operations and communications.

Machine Learning and NLP: The dataset is widely used in natural language processing (NLP) and machine learning for tasks such as text classification, sentiment analysis, and network analysis. It offers a rich source of text data for training and evaluating models.

Legal and Ethical Studies: The dataset serves as a case study in legal and ethical research, illustrating the consequences of corporate malfeasance and the importance of transparency and accountability in business practices.

#### Accessibility
The Enron Email Dataset is publicly accessible and is mirrored in several academic and research archives. It was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes), and a commonly used copy is distributed by Carnegie Mellon University. Because the emails contain personal information about real people, any use of the data should take privacy and research ethics into account.

Overall, the Enron Email Dataset is a valuable resource for understanding corporate dynamics and fraud, offering a unique window into the inner workings of a major company during a critical period.


In [1]:
# !pip3 install pandas
# !pip3 install numpy
# !pip3 install matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
df = pd.read_csv('emails.csv')

df.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...
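
The `file` values shown above follow an `owner/folder/message` layout. The following is a minimal sketch of how such a path could be taken apart; `split_mailbox_path` is a hypothetical helper that is not part of the original notebook, and it assumes every path has at least an owner and a file name.

```python
def split_mailbox_path(path):
    """Split 'owner/folder/.../name.' into (owner, folder, message_name)."""
    parts = path.split("/")
    owner = parts[0]
    # File names in this export end with a dot, e.g. '1.'
    message_name = parts[-1].rstrip(".")
    # Everything between owner and file name is the (possibly nested) folder
    folder = "/".join(parts[1:-1])
    return owner, folder, message_name


print(split_mailbox_path("allen-p/_sent_mail/1."))
# ('allen-p', '_sent_mail', '1')
```

Applied to the whole column (for example with `df["file"].apply(split_mailbox_path)`), this would allow grouping messages by mailbox owner or by folder.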


In [2]:
#=== make a copy of the dataframe
emails_df = df.copy()

emails_df.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [3]:
#=== create a function to split text
import re

def split_text(text, match):
    # Unfold header continuation lines ("\n\t") before splitting
    text = re.sub(r"\n\t", "", text)
    return re.split(match, text)

#=== create a function to extract proper text from the email body
def extract_body(text, substr):
    # Keep everything after the last match of `substr` (the X-FileName header)
    result = re.split(substr, text)[-1]
    # Note: this removes every newline and every hyphen from the body
    result = re.sub(r"[\n-]", "", result)
    return result

#=== clean up the data fields
#- function to extract email addresses
def extract_emails(text, substr):
    # Return an empty list when the line is not the expected header (e.g. "Cc: ")
    if substr not in text:
        return []
    # Raw string and an escaped dot, so "." only matches a literal dot
    return re.findall(r"[^\s]+@\w+\.\w+", str(text))

#- function to extract subject
def extract_subject(text):
    list_of_words = re.split(r"\s", text)
    words_to_drop = {"Subject:", "re:", "Re:", "RE:", "fw:", "Fw:", "FW:"}
    desired_words = [word for word in list_of_words if word not in words_to_drop]

    # Keep only tokens that start with at least three word characters
    r = re.compile(r"\w{3,}")
    return list(filter(r.match, desired_words))

#- function to extract the name of entity
def extract_entity(text):
    # `text` is a list of addresses; join it so one regex pass covers all of them
    string = " ".join(text)
    # The entity is the first domain label after "@", e.g. "enron"
    return set(re.findall(r"@(\w+)", string))
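
To make the intent of the address and entity helpers concrete, here is a small self-contained sketch that runs the same kind of regular expression on a single header line. `EMAIL_PATTERN`, `find_addresses`, and `entities` are illustrative names, not functions from this notebook, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# One or more non-space characters, "@", then a dotted domain such as enron.com
EMAIL_PATTERN = re.compile(r"[^\s,<>]+@\w+(?:\.\w+)+")


def find_addresses(header_line):
    """Return every email-like token found in one header line."""
    return EMAIL_PATTERN.findall(header_line)


def entities(addresses):
    """Return the first domain label of each address, e.g. 'enron'."""
    return {addr.split("@", 1)[1].split(".", 1)[0] for addr in addresses}


found = find_addresses("To: tim.belden@enron.com, john.lavorato@enron.com")
print(found)            # ['tim.belden@enron.com', 'john.lavorato@enron.com']
print(entities(found))  # {'enron'}
```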

In [4]:
#=== store output in new column
emails_df["message_tidy"] = emails_df.message.apply(lambda x : split_text(x, "\n"))

#=== take a look at the output
print(emails_df.head())
print(emails_df.message.head(1))
print(emails_df.message_tidy.head(1))

                       file  \
0     allen-p/_sent_mail/1.   
1    allen-p/_sent_mail/10.   
2   allen-p/_sent_mail/100.   
3  allen-p/_sent_mail/1000.   
4  allen-p/_sent_mail/1001.   

                                             message  \
0  Message-ID: <18782981.1075855378110.JavaMail.e...   
1  Message-ID: <15464986.1075855378456.JavaMail.e...   
2  Message-ID: <24216240.1075855687451.JavaMail.e...   
3  Message-ID: <13505866.1075863688222.JavaMail.e...   
4  Message-ID: <30922949.1075863688243.JavaMail.e...   

                                        message_tidy  
0  [Message-ID: <18782981.1075855378110.JavaMail....  
1  [Message-ID: <15464986.1075855378456.JavaMail....  
2  [Message-ID: <24216240.1075855687451.JavaMail....  
3  [Message-ID: <13505866.1075863688222.JavaMail....  
4  [Message-ID: <30922949.1075863688243.JavaMail....  
0    Message-ID: <18782981.1075855378110.JavaMail.e...
Name: message, dtype: object
0    [Message-ID: <18782981.1075855378110.JavaMail....
Name: m

In [5]:
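#=== pull out useful data and post them into columns
# Header lines are picked by position, which assumes a fixed header order.
# Indexes 5 and 9 hold "Cc:" and "Bcc:" only for messages that carry those headers
# in that position; otherwise they hold other headers (e.g. "Mime-Version:", "X-To:"),
# and extract_emails() later returns an empty result for them.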
emails_df["date"] = emails_df.message_tidy.apply(lambda x : x[1])
emails_df["sender_email"] = emails_df.message_tidy.apply(lambda x : x[2])
emails_df["recipient_email"] = emails_df.message_tidy.apply(lambda x : x[3])
emails_df["subject"] = emails_df.message_tidy.apply(lambda x : x[4])
emails_df["cc"] = emails_df.message_tidy.apply(lambda x : x[5])
emails_df["bcc"] = emails_df.message_tidy.apply(lambda x : x[9])
emails_df["body"] = emails_df.message.apply(lambda x : extract_body(x, r"X-FileName: [\w]*[\s]*[(Non\-Privileged).pst]*[\w-]*[.nsf]*").strip())

In [6]:
print(emails_df.head(1))

                    file                                            message  \
0  allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...   

                                        message_tidy  \
0  [Message-ID: <18782981.1075855378110.JavaMail....   

                                          date                   sender_email  \
0  Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)  From: phillip.allen@enron.com   

            recipient_email    subject                 cc  \
0  To: tim.belden@enron.com  Subject:   Mime-Version: 1.0   

                                              bcc                  body  
0  X-To: Tim Belden <Tim Belden/Enron@EnronXGate>  Here is our forecast  
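
In the row printed above, the positional lookup stores `Mime-Version: 1.0` in `cc` and the `X-To:` line in `bcc`, because this message has no `Cc:` or `Bcc:` header. One possible alternative, sketched here under the assumption that Python's standard `email` package is acceptable for this analysis, is to look headers up by name. The sample message is rebuilt from the values printed above; its `Message-ID` is a placeholder, since the real ID is truncated in the output.

```python
from email import message_from_string

sample = "\n".join([
    "Message-ID: <sample-id@example>",  # placeholder, not the real ID
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "From: phillip.allen@enron.com",
    "To: tim.belden@enron.com",
    "Subject: ",
    "Mime-Version: 1.0",
    "X-To: Tim Belden <Tim Belden/Enron@EnronXGate>",
    "",  # a blank line separates headers from the body
    "Here is our forecast",
])

msg = message_from_string(sample)

# Missing headers fall back to an empty string instead of shifting positions
wanted = ("Date", "From", "To", "Subject", "Cc", "Bcc")
fields = {name: msg.get(name, "") for name in wanted}
body = msg.get_payload().strip()

print(fields["From"], "->", fields["To"])
print(repr(fields["Cc"]))
print(body)
```

With this approach a missing `Cc:` header simply yields an empty value, so no header from another position can end up in the wrong column.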


In [7]:
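#- extract date info
# "Date: Mon, 14 May 2001 ..." -> day name at [6:9], day/month/year at [11:22]
emails_df["day_of_week"] = emails_df.loc[:,"date"].apply(lambda x : x[6:9])
# strip() removes the trailing space left by single-digit days such as "4 May 2001"
emails_df.loc[:,"date"] = emails_df.loc[:,"date"].apply(lambda x : x[11:22].strip())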

#- extract sender and recipient email
emails_df.loc[:,"sender_email"] = emails_df.loc[:,"sender_email"].apply(lambda x : extract_emails(x, "From: "))
emails_df.loc[:,"recipient_email"] = emails_df.loc[:,"recipient_email"].apply(lambda x : extract_emails(x, "To: "))
emails_df.loc[:,"cc"] = emails_df.loc[:,"cc"].apply(lambda x : extract_emails(x, "Cc: "))
emails_df.loc[:,"bcc"] = emails_df.loc[:,"bcc"].apply(lambda x : extract_emails(x, "Bcc: "))
emails_df["all_recipient_emails"] = emails_df.apply(lambda x : list(x["recipient_email"]) + list(x["cc"]) + list(x["bcc"]), axis = 1)
emails_df["num_recipient"] = emails_df.recipient_email.apply(lambda x : len(x)) + emails_df.cc.apply(lambda x : len(x)) + \
                                emails_df.bcc.apply(lambda x : len(x))
    
#- extract sender and recipient entity info
emails_df["sender_entity"]    = emails_df.loc[:,"sender_email"].apply(lambda x : extract_entity(x))
emails_df["recipient_entity_to"] = emails_df.loc[:,"recipient_email"].apply(lambda x : extract_entity(x))
emails_df["recipient_entity_cc"] = emails_df.loc[:,"cc" ].apply(lambda x : extract_entity(x))
emails_df["recipient_entity_bcc"] = emails_df.loc[:,"bcc"].apply(lambda x : extract_entity(x))
emails_df["all_recipient_entities"] = emails_df.apply(lambda x : \
                                                 x["recipient_entity_to" ] | \
                                                 x["recipient_entity_cc" ] | \
                                                 x["recipient_entity_bcc"], axis = 1)

emails_df["sender_entity"] = emails_df.sender_entity.apply(lambda x : list(x))
emails_df["all_recipient_entities"] = emails_df.all_recipient_entities.apply(lambda x : list(x))

#- extract subject
emails_df.loc[:,"subject"] = emails_df.loc[:,"subject"].apply(lambda x : extract_subject(x))

#=== select and reorder the columns
# explicit copy, so later edits to df never touch emails_df
df = emails_df.loc[:,["date","day_of_week","subject","body","sender_email","all_recipient_emails",
                                 "sender_entity","all_recipient_entities","num_recipient"]].copy()

print(df.head())

           date day_of_week  subject  \
0   14 May 2001         Mon       []   
1   4 May 2001          Fri       []   
2   18 Oct 2000         Wed   [test]   
3   23 Oct 2000         Mon       []   
4   31 Aug 2000         Thu  [Hello]   

                                                body  \
0                               Here is our forecast   
1  Traveling to have a business meeting takes the...   
2                     test successful.  way to go!!!   
3  Randy, Can you send me a schedule of the salar...   
4                  Let's shoot for Tuesday at 11:45.   

                sender_email       all_recipient_emails sender_entity  \
0  [phillip.allen@enron.com]     [tim.belden@enron.com]       [enron]   
1  [phillip.allen@enron.com]  [john.lavorato@enron.com]       [enron]   
2  [phillip.allen@enron.com]   [leah.arsdall@enron.com]       [enron]   
3  [phillip.allen@enron.com]    [randall.gay@enron.com]       [enron]   
4  [phillip.allen@enron.com]     [greg.piper@enron.com]  

In [8]:
df.head(10)

Unnamed: 0,date,day_of_week,subject,body,sender_email,all_recipient_emails,sender_entity,all_recipient_entities,num_recipient
0,14 May 2001,Mon,[],Here is our forecast,[phillip.allen@enron.com],[tim.belden@enron.com],[enron],[enron],1
1,4 May 2001,Fri,[],Traveling to have a business meeting takes the...,[phillip.allen@enron.com],[john.lavorato@enron.com],[enron],[enron],1
2,18 Oct 2000,Wed,[test],test successful. way to go!!!,[phillip.allen@enron.com],[leah.arsdall@enron.com],[enron],[enron],1
3,23 Oct 2000,Mon,[],"Randy, Can you send me a schedule of the salar...",[phillip.allen@enron.com],[randall.gay@enron.com],[enron],[enron],1
4,31 Aug 2000,Thu,[Hello],Let's shoot for Tuesday at 11:45.,[phillip.allen@enron.com],[greg.piper@enron.com],[enron],[enron],1
5,31 Aug 2000,Thu,[Hello],"Greg, How about either next Tuesday or Thursda...",[phillip.allen@enron.com],[greg.piper@enron.com],[enron],[enron],1
6,22 Aug 2000,Tue,[],Please cc the following distribution list with...,[phillip.allen@enron.com],"[david.l.johnson@enron.com, john.shafer@enron....",[enron],[enron],2
7,14 Jul 2000,Fri,"[PRC, review, phone, calls]",any morning between 10 and 11:30,[phillip.allen@enron.com],[joyce.teixeira@enron.com],[enron],[enron],1
8,17 Oct 2000,Tue,"[High, Speed, Internet, Access]",1. login: pallen pw: ke9davis I don't think t...,[phillip.allen@enron.com],[mark.scott@enron.com],[enron],[enron],1
9,16 Oct 2000,Mon,"[fixed, forward, other, Collar, floor, gas, pr...",Forwarded by Phillip K Allen/HOU/ECT on 10/16/...,[phillip.allen@enron.com],[zimam@enron.com],[enron],[enron],1
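
The `date` and `day_of_week` columns above are plain strings cut out of the `Date:` header by character position. If real timestamps are needed (for sorting or for grouping by month), one option is the standard library's RFC 2822 date parser. This is a sketch of that alternative, not the method used in this notebook; the header value is the one printed for the first email.

```python
from email.utils import parsedate_to_datetime

# Value of the Date header, without the leading "Date: " label
header_value = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"

sent_at = parsedate_to_datetime(header_value)

print(sent_at.date().isoformat())  # 2001-05-14
print(sent_at.weekday())           # 0, i.e. Monday
print(sent_at.utcoffset())         # offset from UTC taken from "-0700"
```

The result is a timezone-aware `datetime`, so the weekday is derived from the date itself and does not depend on fixed string offsets.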


In [14]:
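print(df.shape)
# dropna() returns a new frame, so the result must be assigned;
# axis='columns' drops every column that contains at least one NaN
df = df.dropna(axis = 'columns')
print(df.shape)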

(517401, 9)
(517401, 9)


In [14]:


#Shuffle Dataset (pass random_state=<int> to sample() for a reproducible order)
df = df.sample(frac = 1).reset_index(drop=True)

#Divide Dataset into three equal parts
#one_third_count = math.ceil(len(df) / 3)

#df_1 = df.iloc[:one_third_count]

#df_2 = df.iloc[one_third_count:one_third_count * 2]

#df_3 = df.iloc[one_third_count*2:]
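
The commented-out split relies on three consecutive slices whose boundaries are multiples of `ceil(n / 3)`. The sketch below shows the same boundary arithmetic on a plain list, so it can be checked without loading the dataset; `split_in_three` is a hypothetical helper, not part of the notebook.

```python
import math


def split_in_three(items):
    """Return three consecutive slices of near-equal length."""
    size = math.ceil(len(items) / 3)
    # Rows [0, size), [size, 2*size) and [2*size, end)
    return items[:size], items[size:2 * size], items[2 * size:]


first, second, third = split_in_three(list(range(10)))
print(len(first), len(second), len(third))  # 4 4 2
```

For a DataFrame the same boundaries would go into `df.iloc[...]` as row slices. The last part is shorter whenever the row count is not divisible by three.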


In [19]:
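import nltk
nltk.download('stopwords')
# Import stopwords with nltk
from nltk.corpus import stopwords
# A set makes each membership test constant-time, which matters across ~517k emails
stop = set(stopwords.words('english'))
# Note: the comparison is case-sensitive and the NLTK list is lowercase, so "The" is kept
df['body_stop'] = df['body'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))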

[nltk_data] Downloading package stopwords to C:\Users\DHAKA
[nltk_data]     WASA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
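
The stopword filter in the cell above compares words exactly as they appear, so capitalized stopwords such as "Here" or "The" are not removed by a lowercase list. A minimal sketch of a case-insensitive variant follows. `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for the NLTK list, used only so the example runs without downloading any data, and `remove_stopwords` is an illustrative name.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
SAMPLE_STOPWORDS = {"the", "is", "our", "here", "to", "a"}


def remove_stopwords(text, stopwords):
    """Drop whitespace-separated words whose lowercase form is a stopword."""
    return " ".join(word for word in text.split() if word.lower() not in stopwords)


print(remove_stopwords("Here is our forecast", SAMPLE_STOPWORDS))  # forecast
```

Only the comparison is lowercased; the words that remain keep their original casing.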


In [None]:
# import word_tokenize from nltk (needs the 'punkt' tokenizer data; uncomment the downloads on first run)
#nltk.download('punkt')
#nltk.download('punkt_tab')
from nltk import word_tokenize
# replace each stopword-filtered body with its list of tokens
df['body_stop'] = df['body_stop'].apply(word_tokenize)
df.head(10)