## 2. Extract Employee information

One of the issues with the Enron email corpus is that employees often have multiple email addresses. We therefore start from two resources that list the different addresses used by each employee and combine them. The names of the employees who were prosecuted are stored in *criminals.csv*. In the last step, all addresses belonging to these employees are extracted.

The two resources:
* http://www.ahschulz.de/enron-email-data/
* https://github.com/Sun121sun/ENRON-EMAILS-AND-EMPLOYEE

In [1]:
# Import packages
import pandas as pd
import numpy as np

In [2]:
# First resource
df = pd.read_csv('Enron/EmployeeList.csv')
df = df.drop('Unnamed: 0', axis=1)
df.head()

Unnamed: 0,eid,firstName,lastName,Email_id,Email2,Email3,EMail4,folder,status
0,13,Marie,Heard,marie.heard@enron.com,,,,heard-m,
1,6,Mark,Taylor,mark.e.taylor@enron.com,mark.taylor@enron.com,e.taylor@enron.com,,taylor-m,Employee
2,19,Lindy,Donoho,lindy.donoho@enron.com,ldonoho@enron.com,,,donoho-l,Employee
3,115,Lisa,Gang,lisa.gang@enron.com,,,,gang-l,
4,129,Jeffrey,Skilling,jeff.skilling@enron.com,jeffrey.skilling@enron.com,,,skilling-j,CEO


In [3]:
# Second resource
employees = pd.read_excel('Enron/enron_emp.xlsx')
employees.head()

Unnamed: 0,num,name,email1,email2,email3,Unnamed: 5
0,0,ALBERT MEYERS,albert.meyers@enron.com,,,
1,1,ANDREA RING,andrea.ring@enron.com,,,
2,2,ANDREW FASTOW,andrew.fastow@enron.com,,,
3,3,ANDREW LEWIS,andrew.lewis@enron.com,h..lewis@enron.com,,
4,4,ANDY ZIPPER,andy.zipper@enron.com,,,


In [4]:
# Get criminal names
criminal_names = pd.read_csv('Enron/criminals.csv').name.map(lambda x: x.lower())

**Join two data sets together**

To join the two datasets, the names need to be in the same format in both.

In [5]:
# Combine first name and last name
df['FullNames'] = df.apply(lambda x: ' '.join([x.firstName, x.lastName]), axis=1)

In [6]:
# Set all letters to lowercase
df['FullNames'] = df.FullNames.map(lambda x: x.lower())

In [7]:
# Check result
df.head()

Unnamed: 0,eid,firstName,lastName,Email_id,Email2,Email3,EMail4,folder,status,FullNames
0,13,Marie,Heard,marie.heard@enron.com,,,,heard-m,,marie heard
1,6,Mark,Taylor,mark.e.taylor@enron.com,mark.taylor@enron.com,e.taylor@enron.com,,taylor-m,Employee,mark taylor
2,19,Lindy,Donoho,lindy.donoho@enron.com,ldonoho@enron.com,,,donoho-l,Employee,lindy donoho
3,115,Lisa,Gang,lisa.gang@enron.com,,,,gang-l,,lisa gang
4,129,Jeffrey,Skilling,jeff.skilling@enron.com,jeffrey.skilling@enron.com,,,skilling-j,CEO,jeffrey skilling


In [8]:
# Set names to lower case
employees['name'] = employees.name.map(lambda x: x.lower())
employees.head()

Unnamed: 0,num,name,email1,email2,email3,Unnamed: 5
0,0,albert meyers,albert.meyers@enron.com,,,
1,1,andrea ring,andrea.ring@enron.com,,,
2,2,andrew fastow,andrew.fastow@enron.com,,,
3,3,andrew lewis,andrew.lewis@enron.com,h..lewis@enron.com,,
4,4,andy zipper,andy.zipper@enron.com,,,


In [9]:
# Look at all names which are in the first resource but not in the second
names_df = [name for name in df.FullNames if name not in employees.name.values]
df[df.FullNames.isin(names_df)].drop(['eid', 'firstName', 'lastName', 'folder', 'status'], axis=1).sort_values('FullNames')

Unnamed: 0,Email_id,Email2,Email3,EMail4,FullNames
51,bill.williams@enron.com,,,,bill williams
52,brad.mckay@enron.com,bmckay@enron.com,,,brad mckay
60,craig.dean@enron.com,,,,craig dean
66,c..giron@enron.com,darron.giron@enron.com,darron.c.giron@enron.com,,darron giron
68,debra.perlingiere@enron.com,,,,debra perlingiere
71,doug.gilbert-smith@enron.com,gilbert-smith@enron.com,,,doug gilbert-smith
73,dutch.quigley@enron.com,,,,dutch quigley
82,geoff.storey@enron.com,gstorey@enron.com,,,geoff storey
39,jason.williams@enron.com,jason.r.williams@enron.com,,,jason williams
4,jeff.skilling@enron.com,jeffrey.skilling@enron.com,,,jeffrey skilling


In [10]:
# Look at all names which are in the second but not in the first
names_employees = [name for name in employees.name if name not in df.FullNames.values]
employees[employees.name.isin(names_employees)].sort_values('name')

Unnamed: 0,num,name,email1,email2,email3,Unnamed: 5
2,2,andrew fastow,andrew.fastow@enron.com,,,
6,6,ben glisan,ben.glisan@enron.com,,,
9,9,boyle dan,dan.boyle@enron.com,,,
10,10,bradley mckay,brad.mckay@enron.com,,,
11,11,brown james,james.brown@enron.com,,,
12,12,calger christopher,christopher.calger@enron.com,calger@enron.com,f..calger@enron.com,
19,19,clint dean,clint.dean@enron.com,,,
20,20,colwell wesley,wes.colwell@enron.com,colwell@enron.com,,
26,26,daron giron,darron.giron@enron.com,c..giron@enron.com,,
29,29,despain timothy,tim.despain@enron.com,,,


When we compare the datasets above, we can see that some employees appear in both but under different names: the name is spelled differently, written out in full, or given in reversed order. This is the case for the following pairs (name in `df` and name in `employees`):

* brad mckay and bradley mckay
* darron giron and daron giron
* doug gilbert-smith and douglas gilbert smith
* geoff storey and geoffery storey
* jeffrey skilling and jeffery skilling
* lisa gang and gang lisa
* bill williams and williams iii bill
* debra perlingiere and perlingiere debra
* dutch quigley and quigley dutch
* jason williams and williams jason (trading)
* larry may and lawrence may
* matt motley and matthew motley
* mike grigsby and michael grigsby
* mike maggi and michael maggi
* mike swerzbin and micheal swerzbin
* steven harris and harris steven

We will rename these entries in `employees` so that they match the names used in `df`.
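
The pairs above were collected by eye. As a rough illustration of how such candidates could be surfaced automatically, the sketch below compares names on an order-insensitive key and uses `difflib.get_close_matches` from the standard library. The helper names `name_key` and `candidate_matches`, the cutoff of 0.8, and the toy name lists are assumptions made for this example and are not part of the original notebook. Any suggested pair would still need a manual check.

```python
import difflib
import re


def name_key(name):
    """Build an order-insensitive key (hypothetical helper).

    Lowercases, drops parenthesised notes such as "(trading)",
    treats hyphens as spaces and sorts the remaining tokens.
    """
    cleaned = re.sub(r"\(.*?\)", " ", name.lower()).replace("-", " ")
    return " ".join(sorted(cleaned.split()))


def candidate_matches(left_names, right_names, cutoff=0.8):
    """Return (left, right) pairs whose keys are similar (hypothetical helper)."""
    right_keys = {name_key(n): n for n in right_names}
    pairs = []
    for name in left_names:
        close = difflib.get_close_matches(
            name_key(name), list(right_keys), n=1, cutoff=cutoff
        )
        if close:
            pairs.append((name, right_keys[close[0]]))
    return pairs


# Toy data taken from the pairs listed above, plus two unrelated names
left = ["brad mckay", "lisa gang", "doug gilbert-smith", "marie heard"]
right = ["bradley mckay", "gang lisa", "douglas gilbert smith", "andrew fastow"]

pairs = candidate_matches(left, right)
for left_name, right_name in pairs:
    print(left_name, "<->", right_name)
```

Sorting the tokens handles reversed names such as "gang lisa", while the similarity cutoff handles short forms such as "brad" and "bradley". Nicknames that share few letters with the full name (for example "larry" and "lawrence") may fall below the cutoff and still have to be found manually.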

In [11]:
# Create lists with the names that have to be replaced and their replacements
NamesToBeRplaced = ['bradley mckay', 'daron giron', 'douglas gilbert smith', 'geoffery storey', 'jeffery skilling',
                    'gang lisa', 'williams iii bill', 'perlingiere debra', 'quigley dutch', 'williams jason (trading)',
                    'lawrence may', 'matthew motley', 'michael grigsby', 'michael maggi', 'micheal swerzbin',
                    'harris steven']
NamesToReplaceWith = ['brad mckay', 'darron giron', 'doug gilbert-smith', 'geoff storey', 'jeffrey skilling',
                      'lisa gang', 'bill williams', 'debra perlingiere', 'dutch quigley', 'jason williams',
                      'larry may', 'matt motley', 'mike grigsby', 'mike maggi', 'mike swerzbin',
                      'steven harris']

In [12]:
# Check if the length of both lists is equal
len(NamesToBeRplaced) == len(NamesToReplaceWith)

True

In [13]:
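# Loop over the pairs of old and new names
for old_name, new_name in zip(NamesToBeRplaced, NamesToReplaceWith):
    # Overwrite the name in every row that matches the old spelling.
    # Selecting rows with a boolean mask does not rely on the 'num'
    # column being identical to the index, and it does not fail
    # when an old name is absent.
    employees.loc[employees.name == old_name, 'name'] = new_name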
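
The same renaming can also be written without an explicit loop. The sketch below builds a mapping from two parallel lists and passes it to `Series.replace`. The frame `employees_demo` and the shortened lists are toy stand-ins invented for this example.

```python
import pandas as pd

# Toy stand-in for the employees data frame (assumed data)
employees_demo = pd.DataFrame(
    {"name": ["bradley mckay", "gang lisa", "andrew fastow"]}
)

old_names = ["bradley mckay", "gang lisa"]
new_names = ["brad mckay", "lisa gang"]

# Pair each old spelling with its replacement
mapping = dict(zip(old_names, new_names))

# Replace exact matches; names not in the mapping are left untouched
employees_demo["name"] = employees_demo["name"].replace(mapping)
print(employees_demo)
```

A dictionary keeps each old name next to its replacement, so the separate check that both lists have the same length is no longer needed.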

Next we create a subset of each of the two data frames that contains only the employee names and their corresponding addresses. We then join these two subsets on the names.
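
To illustrate what joining on the names means here, the following sketch uses two small made-up frames indexed by name. With `axis=1`, `pd.concat` aligns rows on the index and by default keeps the union of all index labels, so a person who appears in only one source gets missing values in the columns of the other. The frames and addresses below are invented for the example.

```python
import pandas as pd

# Two toy frames indexed by name (assumed data, not from the corpus)
left = pd.DataFrame(
    {"Email_id": ["a.one@example.com", "b.two@example.com"]},
    index=["a one", "b two"],
)
right = pd.DataFrame(
    {"email1": ["b.two@example.com", "c.three@example.com"]},
    index=["b two", "c three"],
)

# Column-wise concatenation aligns on the index (outer join by default)
combined = pd.concat([left, right], axis=1)
print(combined)
```

This alignment only works as intended when each name occurs once per frame. To my understanding, pandas raises an error when the indexes differ and one of them contains duplicate labels, so duplicated names would have to be resolved first.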

In [14]:
# Create a subset of the employees data frame
SubEmployees = employees[['name', 'email1', 'email2', 'email3']]
SubEmployees = SubEmployees.set_index('name')

In [16]:
# Create a subset of the df data frame
SubDF = df[['FullNames', 'Email_id', 'Email2', 'Email3', 'EMail4']]
SubDF = SubDF.set_index('FullNames')

In [17]:
# Concatenate the two data frames
EmployeeAddresses = pd.concat([SubDF, SubEmployees], axis=1)

In [19]:
# Preview the combined data frame
EmployeeAddresses.head()

Unnamed: 0,Email_id,Email2,Email3,EMail4,email1,email2,email3
marie heard,marie.heard@enron.com,,,,marie.heard@enron.com,,
mark taylor,mark.e.taylor@enron.com,mark.taylor@enron.com,e.taylor@enron.com,,mark.taylor@enron.com,legal <.taylor@enron.com>,e.taylor@enron.com
lindy donoho,lindy.donoho@enron.com,ldonoho@enron.com,,,lindy.donoho@enron.com,,
lisa gang,lisa.gang@enron.com,,,,lisa.gang@enron.com,,
jeffrey skilling,jeff.skilling@enron.com,jeffrey.skilling@enron.com,,,jeff.skilling@enron.com,jskilli@enron.com,jeffreyskilling@yahoo.com


In the next part we are going to extract information regarding the criminals.

In [21]:
# Change the spelling of one of the criminals to match the combined data frame
criminal_names[10] = 'jeffrey skilling'

In [25]:
# Get all addresses used by criminals
criminal_addresses = EmployeeAddresses.loc[criminal_names]

In [26]:
criminal_addresses

Unnamed: 0,Email_id,Email2,Email3,EMail4,email1,email2,email3
andrew fastow,,,,,andrew.fastow@enron.com,,
ben glisan,,,,,ben.glisan@enron.com,,
boyle dan,,,,,dan.boyle@enron.com,,
brown james,,,,,james.brown@enron.com,,
calger christopher,,,,,christopher.calger@enron.com,calger@enron.com,f..calger@enron.com
colwell wesley,,,,,wes.colwell@enron.com,colwell@enron.com,
david delainey,david.w.delainey@enron.com,david.delainey@enron.com,w..delainey@enron.com,,david.delainey@enron.com,w..delainey@enron.com,dave.delainey@enron.com
despain timothy,,,,,tim.despain@enron.com,,
fastow lea,,,,,lfastow@pdq.net,lfastow@pop.pdq.net,
hirko joseph,,,,,joehirko@aol.com,,


In [30]:
# Export data frame
criminal_addresses.to_csv('files/criminal_addresses.csv')