In [73]:
""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).
    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }
    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:
    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""

import pickle
import numpy as np
import pandas as pd

from time import time

In [76]:
"""
convert dos linefeeds (crlf) to unix (lf)
usage: dos2unix.py 
"""
original = "final_project_dataset.pkl"
destination = "final_project_dataset_unix.pkl"

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content)-outsize))

Done. Saved 6705 bytes.


In [109]:
# load the converted pickle; the context manager closes the file afterwards
with open("final_project_dataset_unix.pkl", "rb") as infile:
    enron_data = pickle.load(infile)

### #1 How many data points (people) are in the dataset?

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load Python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
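To make the dict-of-dicts layout concrete, here is a minimal sketch that builds a tiny stand-in dictionary, pickles it, and loads it back. The two-person excerpt reuses values that appear elsewhere in this notebook; the file name `toy_enron.pkl` and the variable names are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dataset: person name -> features dict.
# Missing values are stored as the string "NaN", as in the real records.
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000, "exercised_stock_options": 19250000},
    "METTS MARK": {"bonus": 600000, "exercised_stock_options": "NaN"},
}

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_enron.pkl")

with open(path, "wb") as outfile:
    pickle.dump(toy_data, outfile)

with open(path, "rb") as infile:
    loaded = pickle.load(infile)

print(len(loaded))                            # number of people
print(loaded["SKILLING JEFFREY K"]["bonus"])  # one feature of one person
```

The round trip returns an ordinary Python dictionary, so `len()` and nested key lookups work exactly as they do on the real `enron_data`.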

In [4]:
# check the number of people
len(enron_data)

146

### #2 For each person, how many features are available?

In [5]:
# number of features for each person
len(enron_data['METTS MARK'])

21

### #3 How many POIs are there in the E+F dataset?

In [6]:
enron_data['METTS MARK']['poi']

False

In [7]:
# initialize the counter
count = 0
for person in enron_data:
    if enron_data[person]['poi'] == 1:
        count+=1

In [8]:
# check how many POIs are in the dataset
count

18

In [28]:
# quick alternative calculating poi
pois = [x for x, y in enron_data.items() if y['poi']]
len(pois)

18

### #4 How many POIs were there in total?

* We compiled a list of all POI names, `poi_names.txt`, and associated email addresses, `poi_email_addresses.py`
* Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails.
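One way to line the names file up with the dataset keys is to normalise each line into the `LASTNAME FIRSTNAME` form used by `enron_data`. The sketch below assumes each line looks like `(y) Last, First`; that layout, the made-up sample names, and the helper `parse_poi_line` are illustrative assumptions, not taken from the real file.

```python
# Hypothetical sample lines; the real file's exact layout is an assumption.
sample_lines = ["(y) Doe, Jane\n", "(n) Roe, Richard\n"]


def parse_poi_line(line):
    """Split '(y) Last, First' into (has_y_flag, 'LAST FIRST')."""
    flag, _, rest = line.strip().partition(" ")
    last, _, first = rest.partition(",")
    return flag == "(y)", f"{last.strip()} {first.strip()}".upper()


parsed = [parse_poi_line(line) for line in sample_lines]
print(parsed)
```

Because the dataset keys may also carry a middle initial, matching would probably need a prefix comparison (for example `key.startswith(name)`) rather than strict equality.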

In [13]:
import os
os.listdir()

['.ipynb_checkpoints',
 'enron_mail_20150507',
 'enron_mail_20150507.tar.gz',
 'explore_enron_data.ipynb',
 'final_project_dataset.pkl',
 'final_project_dataset_unix.pkl',
 'poi_email_addresses.py',
 'poi_names.txt',
 '__pycache__']

In [21]:
with open('poi_names.txt', mode="r") as file:
    poi_names = file.readlines()
# strip the newline character and create list of names
poi_names = [x.strip() for x in poi_names]
poi_names[0][:3] == "(y)"

True

In [27]:
# total POIs
len(poi_names)

35

In [26]:
# POIs in dataset
count = 0
for name in poi_names:
    if name[:3] == '(y)':
        count +=1
print("POI names: ", count)

POI names:  4


In [18]:
# import emails
from poi_email_addresses import poiEmails

email_list = poiEmails()
email_list[:3]

['kenneth_lay@enron.net', 'kenneth_lay@enron.com', 'klay.enron@enron.com']

As we can see, the E+F dataset contains only 18 of the 35 POIs. This could make it difficult to learn patterns. In general, more data is better--only having 18 data points doesn't give you that many examples to learn from.

### #5 What is the total value of the stock belonging to James Prentice?

In [34]:
enron_data['PRENTICE JAMES']['total_stock_value']

1095040

### #6 How many email messages do we have from Wesley Colwell to persons of interest?

In [36]:
enron_data['COLWELL WESLEY']['from_this_person_to_poi']

11

### #7 What’s the value of stock options exercised by Jeffrey K Skilling?

In [38]:
enron_data['SKILLING JEFFREY K']['exercised_stock_options']

19250000

### #8 Which of these schemes was Enron involved in?

* selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
* causing electrical grid failures in California
* a plan in collaboration with Blockbuster movies to stream movies over the internet

### #9 Who was the CEO of Enron during most of the time that fraud was being perpetrated?

Skilling, Jeffrey

### #10 Who was chairman of the Enron board of directors?
Lay, Kenneth

### #11 Who was CFO (chief financial officer) of Enron during most of the time that fraud was going on?
Fastow, Andrew

### #12 Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)?

In [62]:
top_pois = ['SKILLING JEFFREY K', 'LAY KENNETH L', 'FASTOW ANDREW S']

[[name, enron_data[name]['total_payments']] for name in top_pois]

[['SKILLING JEFFREY K', 8682716],
 ['LAY KENNETH L', 103559793],
 ['FASTOW ANDREW S', 2424083]]

In [71]:
# get the payments
top_payments = {name: enron_data[name]['total_payments'] for name in top_pois}
top_payments

{'SKILLING JEFFREY K': 8682716,
 'LAY KENNETH L': 103559793,
 'FASTOW ANDREW S': 2424083}

In [70]:
# sort the dict items by payment and take the highest-paid individual
sorted(top_payments.items(), key=lambda kv: kv[1])[-1]

('LAY KENNETH L', 103559793)

In [79]:
enron_data['METTS MARK']

{'salary': 365788,
 'to_messages': 807,
 'deferral_payments': 'NaN',
 'total_payments': 1061827,
 'loan_advances': 'NaN',
 'bonus': 600000,
 'email_address': 'mark.metts@enron.com',
 'restricted_stock_deferred': 'NaN',
 'deferred_income': 'NaN',
 'total_stock_value': 585062,
 'expenses': 94299,
 'from_poi_to_this_person': 38,
 'exercised_stock_options': 'NaN',
 'from_messages': 29,
 'other': 1740,
 'from_this_person_to_poi': 1,
 'poi': False,
 'long_term_incentive': 'NaN',
 'shared_receipt_with_poi': 702,
 'restricted_stock': 585062,
 'director_fees': 'NaN'}

## Let's use Pandas to quickly understand the data set 

In [87]:
enron_data = pd.DataFrame.from_dict(enron_data, orient='index')
enron_data.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,,4175000.0,phillip.allen@enron.com,-126027.0,-3081055.0,1729541,...,47.0,1729541.0,2195.0,152.0,65.0,False,304805.0,1407.0,126027.0,
BADUM JAMES P,,,178980.0,182466,,,,,,257817,...,,257817.0,,,,False,,,,
BANNANTINE JAMES M,477.0,566.0,,916197,,,james.bannantine@enron.com,-560222.0,-5104.0,5243487,...,39.0,4046157.0,29.0,864523.0,0.0,False,,465.0,1757552.0,
BAXTER JOHN C,267102.0,,1295738.0,5634343,,1200000.0,,,-1386055.0,10623258,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714.0,
BAY FRANKLIN R,239671.0,,260455.0,827696,,400000.0,frank.bay@enron.com,-82782.0,-201641.0,63014,...,,,,69.0,,False,,,145796.0,


In [89]:
# reindex the table
enron_data.reset_index(level=enron_data.index.names, inplace=True)

In [90]:
# check the changes
enron_data.head()

Unnamed: 0,index,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
0,ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,,4175000.0,phillip.allen@enron.com,-126027.0,-3081055.0,...,47.0,1729541.0,2195.0,152.0,65.0,False,304805.0,1407.0,126027.0,
1,BADUM JAMES P,,,178980.0,182466,,,,,,...,,257817.0,,,,False,,,,
2,BANNANTINE JAMES M,477.0,566.0,,916197,,,james.bannantine@enron.com,-560222.0,-5104.0,...,39.0,4046157.0,29.0,864523.0,0.0,False,,465.0,1757552.0,
3,BAXTER JOHN C,267102.0,,1295738.0,5634343,,1200000.0,,,-1386055.0,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714.0,
4,BAY FRANKLIN R,239671.0,,260455.0,827696,,400000.0,frank.bay@enron.com,-82782.0,-201641.0,...,,,,69.0,,False,,,145796.0,


In [94]:
# rename index column
enron_data.rename(columns={'index':'name'}, inplace=True)

In [95]:
# check the changes
enron_data.head()

Unnamed: 0,name,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
0,ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,,4175000.0,phillip.allen@enron.com,-126027.0,-3081055.0,...,47.0,1729541.0,2195.0,152.0,65.0,False,304805.0,1407.0,126027.0,
1,BADUM JAMES P,,,178980.0,182466,,,,,,...,,257817.0,,,,False,,,,
2,BANNANTINE JAMES M,477.0,566.0,,916197,,,james.bannantine@enron.com,-560222.0,-5104.0,...,39.0,4046157.0,29.0,864523.0,0.0,False,,465.0,1757552.0,
3,BAXTER JOHN C,267102.0,,1295738.0,5634343,,1200000.0,,,-1386055.0,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714.0,
4,BAY FRANKLIN R,239671.0,,260455.0,827696,,400000.0,frank.bay@enron.com,-82782.0,-201641.0,...,,,,69.0,,False,,,145796.0,


In [96]:
enron_data.shape

(146, 22)

In [97]:
# save it to a csv file
enron_data.to_csv('final_project_datase.csv', index=False)

In [112]:
# test the new file
enron_df = pd.read_csv('final_project_datase.csv')
enron_df.head()

Unnamed: 0,name,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
0,ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442.0,,4175000.0,phillip.allen@enron.com,-126027.0,-3081055.0,...,47.0,1729541.0,2195.0,152.0,65.0,False,304805.0,1407.0,126027.0,
1,BADUM JAMES P,,,178980.0,182466.0,,,,,,...,,257817.0,,,,False,,,,
2,BANNANTINE JAMES M,477.0,566.0,,916197.0,,,james.bannantine@enron.com,-560222.0,-5104.0,...,39.0,4046157.0,29.0,864523.0,0.0,False,,465.0,1757552.0,
3,BAXTER JOHN C,267102.0,,1295738.0,5634343.0,,1200000.0,,,-1386055.0,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714.0,
4,BAY FRANKLIN R,239671.0,,260455.0,827696.0,,400000.0,frank.bay@enron.com,-82782.0,-201641.0,...,,,,69.0,,False,,,145796.0,


### #13 How many folks in this dataset have a quantified salary? What about a known email address?

In [113]:
enron_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 22 columns):
name                         146 non-null object
salary                       95 non-null float64
to_messages                  86 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
loan_advances                4 non-null float64
bonus                        82 non-null float64
email_address                111 non-null object
restricted_stock_deferred    18 non-null float64
deferred_income              49 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
from_poi_to_this_person      86 non-null float64
exercised_stock_options      102 non-null float64
from_messages                86 non-null float64
other                        93 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          146 non-null bool
long_term_inc
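
The same counts can be read straight off the original dictionary, where a missing value is stored as the string `'NaN'` rather than a real NaN. The sketch below uses a three-person excerpt with values taken from the outputs above; the helper name `count_known` is an assumption made for illustration.

```python
# Toy excerpt of the dict-of-dicts; values follow the notebook's own outputs.
toy_data = {
    "METTS MARK": {"salary": 365788, "email_address": "mark.metts@enron.com"},
    "BAXTER JOHN C": {"salary": 267102, "email_address": "NaN"},
    "BADUM JAMES P": {"salary": "NaN", "email_address": "NaN"},
}


def count_known(data, feature):
    """Count people whose value for `feature` is not the string 'NaN'."""
    return sum(
        1 for features in data.values()
        if features.get(feature, "NaN") != "NaN"
    )


known_salaries = count_known(toy_data, "salary")
known_emails = count_known(toy_data, "email_address")
print(known_salaries, known_emails)
```

On the DataFrame loaded from CSV, the `'NaN'` strings have already become real missing values, which is why the non-null counts reported by `info()` answer the same question.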

### #14 Dict to array conversion

* A Python dictionary can’t be read directly into an sklearn classification or regression algorithm; instead, sklearn needs a numpy array or a list of lists, where each inner list is one data point and its elements are that point’s features.
* When a feature has no value for a particular person, the dict-to-array conversion also replaces that value with 0 (zero).
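
The conversion described above can be sketched as follows. This is a minimal illustration, not the course's own helper: the function name `dict_to_array`, its arguments, and the three-person excerpt are assumptions, and it only handles numeric features.

```python
import numpy as np

# Toy excerpt of the dict-of-dicts; values follow the notebook's own outputs.
toy_data = {
    "METTS MARK": {"salary": 365788, "bonus": 600000},
    "BAXTER JOHN C": {"salary": 267102, "bonus": 1200000},
    "BADUM JAMES P": {"salary": "NaN", "bonus": "NaN"},
}


def dict_to_array(data, features):
    """Return one row per person (sorted by name), one column per feature.

    Values stored as the string 'NaN', or missing entirely, become 0.0.
    """
    rows = []
    for name in sorted(data):
        person = data[name]
        row = [
            0.0 if person.get(feature, "NaN") == "NaN" else float(person[feature])
            for feature in features
        ]
        rows.append(row)
    return np.array(rows)


X = dict_to_array(toy_data, ["salary", "bonus"])
print(X.shape)
print(X)
```

Sorting the names keeps the row order reproducible. Note that replacing a missing value with 0 makes it indistinguishable from a genuine zero.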

### #15 How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in final_project/enron61702insiderpay.pdf, which is missing some POI’s (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POI’s, we do have emails.

While it would be straightforward to add these POI’s and their email information to the E+F dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. You will walk through that here.

In [146]:
sum(enron_df['total_payments'].isnull())

21

In [127]:
# percentage of NaN for total payments
sum(enron_df['total_payments'].isnull())/enron_df.shape[0] * 100

14.383561643835616

### #16 How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POI’s as a whole is this?

In [137]:
sum(enron_df[enron_df['poi'] == True]['total_payments'].isnull())

0

In [141]:
enron_df[enron_df['poi'] == True]['total_payments'].shape

(18,)

### #17 If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a “NaN” value with POIs or non-POIs?

* Non-POIs: no training points would have "NaN" for total_payments when the class label is "POI".

### #18 What is the new number of people in the dataset? What is the new number of folks with “NaN” for total payments?

If you added in, say, 10 more data points which were all POI’s, and put “NaN” for the total payments for those folks, the numbers you just calculated would change.

In [147]:
# number in dataset
enron_df.shape[0] + 10

156

In [148]:
# NaN for total payments
sum(enron_df['total_payments'].isnull()) + 10

31

### #19 What is the new number of POI’s in the dataset? What is the new number of POI’s with NaN for total_payments?

* Now there are 28 POI's, 10 of whom have "NaN" for total_payments
* That's 36% of the POI's who have "NaN" for total_payments, a big jump from before.

In [151]:
# new number of pois
enron_df[enron_df['poi'] == True].shape[0] + 10

28
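
The shift can be worked through with the counts from the cells above. This is a small arithmetic sketch; the variable names are assumptions, and the starting figure of zero POIs with "NaN" follows from the answer to #17.

```python
# Counts taken from the notebook outputs above.
n_people, n_nan = 146, 21    # everyone / everyone with NaN total_payments
n_poi, n_poi_nan = 18, 0     # POIs / POIs with NaN total_payments
added = 10                   # hypothetical new POIs, all with NaN

# Share of POIs with NaN total_payments, before and after the addition.
poi_nan_before = n_poi_nan / n_poi
poi_nan_after = (n_poi_nan + added) / (n_poi + added)

# Share of the whole dataset with NaN total_payments, before and after.
all_nan_before = n_nan / n_people
all_nan_after = (n_nan + added) / (n_people + added)

print(f"POIs with NaN:   {poi_nan_before:.1%} -> {poi_nan_after:.1%}")
print(f"People with NaN: {all_nan_before:.1%} -> {all_nan_after:.1%}")
```

The share among POIs moves far more than the share across the whole dataset, which is the imbalance the next question asks about.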

### #20 Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI?

It totally could!

Adding in the new POIs in this example, none of whom we have financial information for, has introduced a subtle problem: our lack of financial information about them can be picked up by an algorithm as a clue that they’re POIs. Another way to think about this is that there’s now a difference in how we generated the data for our two classes--non-POIs all come from the financial spreadsheet, while many POIs get added in by hand afterwards. That difference can trick us into thinking we have better performance than we do. Suppose you use your POI detector to decide whether a new, unseen person is a POI, and that person isn’t on the spreadsheet. Then all their financial data would contain “NaN”, but the person is very likely not a POI (there are many more non-POIs than POIs in the world, and even at Enron). You’d be likely to accidentally identify them as a POI, though!

This goes to show that, when generating or augmenting a dataset, you should be exceptionally careful if your data come from different sources for different classes. It can easily lead to the type of bias or mistake that we showed here. There are ways to deal with this. For example, you wouldn’t have to worry about this problem if you used only email data; in that case, discrepancies in the financial data wouldn’t matter because financial features aren’t being used. There are also more sophisticated ways of estimating how much of an effect these biases can have on your final answer; those are beyond the scope of this course.

For now, the takeaway message is to be very careful about introducing features that come from different sources depending on the class! It’s a classic way to accidentally introduce biases and mistakes.
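
To see what a classifier would actually be offered, the sketch below builds label and "missing total_payments" indicator lists that match the counts in this notebook, then measures how often a person with a missing value is a POI. The list layout and the helper name `poi_rate_given_missing` are assumptions made for illustration; no model is trained here.

```python
def poi_rate_given_missing(labels, missing):
    """Fraction of people with a missing value who are labelled POI."""
    hits = [label for label, is_missing in zip(labels, missing) if is_missing]
    return sum(hits) / len(hits)


# Before: 146 people, 18 POIs (none missing), 21 non-POIs missing.
labels_before = [1] * 18 + [0] * 128
missing_before = [0] * 18 + [1] * 21 + [0] * 107

# After: 10 hypothetical extra POIs, all with missing total_payments.
labels_after = labels_before + [1] * 10
missing_after = missing_before + [1] * 10

rate_before = poi_rate_given_missing(labels_before, missing_before)
rate_after = poi_rate_given_missing(labels_after, missing_after)
base_rate_after = sum(labels_after) / len(labels_after)

print(f"P(POI | missing) before: {rate_before:.1%}")
print(f"P(POI | missing) after:  {rate_after:.1%}")
print(f"P(POI) after:            {base_rate_after:.1%}")
```

After the addition, a missing value goes from ruling out POI status in the training data to being associated with it more strongly than the base rate, even though the only thing that changed is where the rows came from.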