# Introduction
In this assignment, I manipulate and examine the Enron email database. The information is pulled from two places. The first is the enron index on the SSCC, accessed through a VPN client. The second is the full pickled email dataframe.

# Part 1
To start the assignment, I loaded the packages I needed and then got a feel for the information available from the SSCC through the VPN client. From there, I extracted the messages from the enron index that include a Ken Lay email address in a message header.

In [None]:
#Packages
from elasticsearch import Elasticsearch, helpers
import pandas as pd
from datetime import datetime
import pprint
import pickle
import re

Then I got a feel for using Elasticsearch.

In [None]:
#Connecting to the enron index in ES
es=Elasticsearch('http://enron:spsdata@129.105.88.91:9200')

#Query spec to match anything in a message, i.e. to retrieve all messages
query={"query" : {"match_all" : {}}}

#Count how many messages there are in Enron
count_results=es.search(size=0,index='enron',doc_type='email',body=query) 
count_results 

#Understanding the ingredients
msgs=es.search(index='enron',doc_type='email',body=query)
msgs

#Print keys in msgs
for key in msgs.keys():
    print(key)

#Type of the value stored under msgs['hits']['hits']
type(msgs['hits']['hits'])

#Pretty print the first email document; each document has a unique _id
pprint.pprint(msgs['hits']['hits'][0])
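
The search response is a nested dictionary, and the overall match count sits under `hits`. As a minimal sketch, assuming a response shaped like the one explored above, the hypothetical helper `total_hits` pulls that count out. It runs on made-up responses rather than a live connection, and it handles the two layouts I am aware of: older Elasticsearch versions report `hits.total` as a plain integer, while 7.x and later wrap it in a dictionary with a `value` key.

```python
def total_hits(response):
    """Return the total number of matching documents from a search response."""
    total = response["hits"]["total"]
    # Newer Elasticsearch versions return {"value": n, "relation": "eq"}
    if isinstance(total, dict):
        return total["value"]
    return total

# Made-up responses for illustration only (not real Enron counts)
old_style = {"hits": {"total": 3, "hits": []}}
new_style = {"hits": {"total": {"value": 3, "relation": "eq"}, "hits": []}}

print(total_hits(old_style))
print(total_hits(new_style))
```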

From there I ran a nested query against the message headers, specifically the "X-From" header, to find messages containing the following words: 'Ken Lay', 'Kenneth Lay', 'Ken'.

In [None]:
query={"query":{"nested":{"path":"headers","query":{"match":{"headers.X-From":"Ken Lay|Kenneth Lay|Ken"}}}}}
es.count(index='enron',doc_type='email',body=query) 
es.search(size=5,index='enron',doc_type='email',body=query)

It looks like there were 1,226 emails that included a Ken Lay email address in the message header.
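
One caveat, as far as I understand the `match` query: the search text is analyzed into separate terms and, by default, a header matches if it contains any one of them, so the `|` characters are unlikely to act as explicit alternatives. The sketch below shows one way to build a stricter nested query that requires every term. `nested_header_match` is a hypothetical helper, not part of the elasticsearch package, and the field name assumes the same `headers` mapping used above.

```python
def nested_header_match(field, text, operator="or"):
    """Build a nested query body matching `text` in one header field."""
    return {
        "query": {
            "nested": {
                "path": "headers",
                "query": {
                    "match": {
                        "headers." + field: {"query": text, "operator": operator}
                    }
                },
            }
        }
    }

# Require both terms, "ken" and "lay", to appear in X-From
strict_query = nested_header_match("X-From", "Ken Lay", operator="and")
print(strict_query)
# With a live connection, the body could be passed to the client as before, e.g.
# es.count(index='enron', doc_type='email', body=strict_query)
```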

# Part 2
In part two I used the pickled pandas dataframe containing all emails from Enron. I needed to find out how many different Ken Lay email addresses there were in the messages and provide a count.

In [None]:
#Create a data frame from the pickled data
df_enron_email = pd.read_pickle("eemail_df.pkl")

I then replaced any NA values with blanks.

In [None]:
df_enron_email.fillna("", inplace = True)

The Date column was then converted to a uniform datetime format.

In [None]:
#Standardize Date Time
df_enron_email['Date'] = pd.to_datetime(df_enron_email.Date, errors='ignore')
#Make sure there are no null fields
df_enron_email.Date.isnull().sum()
#Double check DB
df_enron_email.head(15)

I then renamed a column header and searched for viable candidate email addresses belonging to Ken Lay.

In [None]:
#Unique Email Count - Change Column header
df_enron_email.rename(columns={'X-From': 'XFrom'}, inplace = True)

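#Find candidate rows whose From address matches any of these patterns
#.copy() gives an independent frame, so the rename below does not trigger SettingWithCopyWarning
df_i = df_enron_email[df_enron_email.From.str.contains('ken|Lay|lay|klay|chairman')].copy()
df_i.rename(columns={'Message-ID': 'MessageID'}, inplace=True)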

From there, I built a list of candidate addresses to search for in the data frame. The matching rows were then grouped by address so the counts could be viewed. Six unique addresses were found to belong to Ken Lay, one of which was shared with Skilling.

In [None]:
#Plug viable options into a value list and count unique messages per address.
value_list = ['ken.skilling@enron.com','chairman.ken@enron.com','kenneth.lay@enron.com','ken.lay-@enron.com','ken.lay-.chairman.of.the.board@enron.com','ken.lay@enron.com','no.address@enron.com',"ken_lay"]
df_2 = df_i[df_i.From.isin(value_list)]
#print so the per-address counts show even though this is not the last line of the cell
print(df_2.groupby('From').MessageID.nunique())
df_2
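
To make the filter-and-group step concrete, here is a small sketch on a made-up data frame. The addresses and message IDs are invented for illustration, and only the column names mirror the ones used above.

```python
import pandas as pd

# Toy data standing in for the Enron frame (values are invented)
toy = pd.DataFrame({
    "From": ["ken.lay@enron.com", "ken.lay@enron.com",
             "kenneth.lay@enron.com", "someone.else@enron.com"],
    "MessageID": ["<1>", "<2>", "<3>", "<4>"],
})

candidates = ["ken.lay@enron.com", "kenneth.lay@enron.com"]

# Keep only rows whose sender is in the candidate list
matches = toy[toy.From.isin(candidates)]

# Count distinct messages per sender address
per_address = matches.groupby("From").MessageID.nunique()
print(per_address)
print("Distinct addresses:", len(per_address))
```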

# Part 3
In part three, I was to determine how many of the messages were "To:" Ken Lay, and were "From:" Ken Lay. A count was provided for each of these.

In [None]:
#To and From Ken Lay:
query={"query":{"nested":{"path":"headers","query" : {"multi_match" : {"fields" : ["headers.From", "headers.To"],"query":"Ken Lay"}}}}}
es.count(index='enron',doc_type='email',body=query)
es.search(size=5,index='enron',doc_type='email',body=query)

The first search turned up 0 results, so a second search was done using the fields "X-To" and "X-From". There were 5,808 emails in total that were to or from "Ken Lay".

In [None]:
#Nothing turned up, so use the X- fields instead
query={"query":{"nested":{"path":"headers","query" : {"multi_match" : {"fields" : ["headers.X-From", "headers.X-To"],"query":"Ken Lay"}}}}}
es.count(index='enron',doc_type='email',body=query)
es.search(size=5,index='enron',doc_type='email',body=query)
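
The `multi_match` query returns one combined count. If a separate count per direction is wanted, one option is to build one nested `match` query per header field. The sketch below only constructs the query bodies; `per_field_queries` is a hypothetical helper, and the commented call assumes the same `es` client and index as above.

```python
def per_field_queries(fields, text):
    """Return one nested match query body per header field."""
    queries = {}
    for field in fields:
        queries[field] = {
            "query": {
                "nested": {
                    "path": "headers",
                    "query": {"match": {"headers." + field: text}},
                }
            }
        }
    return queries

field_queries = per_field_queries(["X-To", "X-From"], "Ken Lay")
for field, body in field_queries.items():
    print(field, body)
    # With a live connection, each body would give its own count, e.g.
    # es.count(index='enron', doc_type='email', body=body)
```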

# Part 4
In part 4, I was to determine who sent Lay the most emails and how many, and who Lay sent the most emails to and how many.

It looks like Kenneth Thibodeaux sent Lay the most emails, 58 in total.

In [None]:
#--Who sent Lay the most emails:
#Keep only messages whose X-To header contains Lay's address (literal match);
#.copy() keeps the helper column out of the original data frame
df_temp = df_enron_email[df_enron_email["X-To"].str.contains("klay@enron.com", regex=False)].copy()
df_temp["Count"] = 1
#'X-From' was renamed to 'XFrom' in Part 2
number_to_klay = df_temp.groupby("XFrom")["Count"].sum()
number_to_klay = number_to_klay.sort_values(ascending = False)
print("User who sent the most emails to Ken Lay:", number_to_klay.index[0])
print("Number of emails sent by", number_to_klay.index[0], "to Ken Lay:", number_to_klay.iloc[0])

Lay sent the most emails to "All Enron Worldwide@enron.com" at a total of 273 emails.

In [None]:
#--Who Lay sent the most emails to:
#'X-From' was renamed to 'XFrom' in Part 2
df_temp = df_enron_email[df_enron_email["XFrom"].str.contains("Ken Lay")].copy()
df_temp["Count"] = 1
number_from_klay = df_temp.groupby("X-To")["Count"].sum()
number_from_klay = number_from_klay.sort_values(ascending = False)
print("User who Ken Lay sent the most emails to:", number_from_klay.index[0])
print("Number of emails sent to", number_from_klay.index[0], "from Ken Lay:", number_from_klay.iloc[0])
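
The group-and-sort pattern used in both cells can also be written with `value_counts`, which counts and sorts in one step. This is a minimal sketch on invented data; the sender names and the `sender` and `recipient` column names are assumptions made for the example.

```python
import pandas as pd

# Toy data (invented names); every message here is addressed to Lay
toy = pd.DataFrame({
    "sender": ["alice", "bob", "alice", "carol", "alice", "bob"],
    "recipient": ["klay@enron.com"] * 6,
})

# Messages addressed to Lay, matched as a literal substring
to_lay = toy[toy["recipient"].str.contains("klay@enron.com", regex=False)]

# value_counts sorts by count, largest first
counts = to_lay["sender"].value_counts()
top_sender = counts.index[0]
top_count = int(counts.iloc[0])
print(top_sender, top_count)
```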

# Part 5
In part 5, I was to determine whether more emails were sent before or after Enron filed for bankruptcy. It looks like the majority of the emails occurred after Enron filed for bankruptcy.

In [None]:
#Parse the dates so the comparison is chronological rather than alphabetical;
#values that cannot be parsed become NaT and are counted in neither group
dates = pd.to_datetime(df_enron_email["Date"], errors="coerce", utc=True)
date_min = dates.min()
date_max = dates.max()

#Cutoff at the start of 2 December 2001 (PST)
cutoff = pd.Timestamp("2001-12-02 00:00:00-08:00")
before_bank = dates < cutoff
print("Number of emails before bankruptcy:", before_bank.sum())
after_bank = dates >= cutoff
print("Number of emails after bankruptcy:", after_bank.sum())
#Emails Before: 181,951
#Emails After: 307,944
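
Dates in this header format begin with the weekday name, so comparing them as plain text sorts by the letters of the weekday rather than by time. The sketch below, on four invented dates, contrasts a text comparison with a parsed comparison. It assumes the `Date` values look like RFC 2822 strings and uses `email.utils.parsedate_to_datetime` from the standard library; the cutoff of 2 December 2001 is the date Enron filed for bankruptcy.

```python
from email.utils import parsedate_to_datetime
import pandas as pd

# Invented dates: two before the cutoff, two after
raw = pd.Series([
    "Fri, 30 Nov 2001 09:15:00 -0800 (PST)",
    "Sat, 1 Dec 2001 18:30:00 -0800 (PST)",
    "Mon, 3 Dec 2001 08:00:00 -0800 (PST)",
    "Wed, 5 Dec 2001 12:45:00 -0800 (PST)",
])

# Text comparison: ordered alphabetically, so "Mon" sorts before "Sun"
text_before = int((raw < "Sun, 2 Dec 2001 00:00:00 -0800 (PST)").sum())

# Parsed comparison: convert each string to a timezone-aware datetime first
parsed = pd.to_datetime(raw.map(parsedate_to_datetime), utc=True)
cutoff = pd.Timestamp("2001-12-02 00:00:00-08:00")
before = int((parsed < cutoff).sum())
after = int((parsed >= cutoff).sum())

print("Text comparison says before:", text_before)
print("Parsed comparison says before:", before, "after:", after)
```

On this toy data the text comparison counts three messages as "before" because the Monday message sorts ahead of "Sun", while the parsed comparison counts two.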

# Part 6
In part six I was to determine how many of the messages in number four mentioned Arthur Andersen, Enron's accounting firm. 855 emails mention Arthur Andersen.

In [None]:
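#Flag emails whose subject or body contains the literal text "Arthur Andersen"
subject_hit = df_enron_email["Subject"].str.contains("Arthur Andersen", regex=False, na=False)
body_hit = df_enron_email["body"].str.contains("Arthur Andersen", regex=False, na=False)
arthur_count = subject_hit | body_hit
print("Number of emails which mention Arthur Andersen:", arthur_count.sum())
#885 emails mention Arthur Andersen, Enron's accounting firm.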
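
The firm's name is easy to misspell ("Anderson") and may appear in different letter case. If such variants should count as mentions too, a case-insensitive pattern is one option. This is a sketch on invented text, not a rerun of the count above; the pattern and the toy rows are assumptions.

```python
import pandas as pd

# Toy emails (invented); rows 0 and 2 mention the firm in some form
toy = pd.DataFrame({
    "Subject": ["Audit update", "Lunch", "Re: arthur anderson call", "Status"],
    "body": ["Arthur Andersen will review the filings.", "Noon works.",
             "See below.", "Nothing to report."],
})

# Accept either spelling and any amount of whitespace between the words
pattern = r"arthur\s+anders[eo]n"

# True where the pattern appears in either column, ignoring case
mentions = (toy["Subject"].str.contains(pattern, case=False, regex=True)
            | toy["body"].str.contains(pattern, case=False, regex=True))
mention_count = int(mentions.sum())
print("Emails mentioning the firm:", mention_count)
```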