In [2]:
#!/usr/bin/python

""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).

    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""

import pickle

# Pickle files are binary, so open in "rb" mode and close the file when done.
with open("C:/Users/Jon Targaryen/Desktop/mach learn/ud120-projects/final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

> **Info**: The aggregated Enron email + financial dataset is stored in a dictionary, where each key is a 
person’s name and the value is a dictionary containing all the features of that person. The email + finance (E+F) data dictionary
is stored as a pickle file, which is a handy way to store and load Python objects directly. 
Use datasets_questions/explore_enron_data.py to load the dataset.
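
As a minimal sketch of that layout and of the pickle round trip, the snippet below builds a tiny dict of dicts, saves it, and loads it back. The file name `toy_dataset.pkl`, the variable `toy_data`, and the second record are made up for illustration; only the two Skilling figures are taken from this notebook.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dataset: person name -> features dict.
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000, "exercised_stock_options": 19250000},
    "DOE JANE": {"bonus": "NaN", "exercised_stock_options": "NaN"},  # hypothetical person
}

# Pickle files are binary, so write and read them in "wb" / "rb" mode.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_data, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# The reloaded object is an ordinary dict of dicts again.
print(loaded["SKILLING JEFFREY K"]["bonus"])
```

Missing values are stored as the string "NaN" rather than as a float, which is why the cells below compare against `'NaN'`.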



In [3]:
# How many data points (people) are in the dataset?

print len(enron_data)

146


In [4]:
# For each person, how many features are available?
# features in data set

for key, value in enron_data.items():
    # Each value is a features dict, so its length is the number of features.
    print(key, len(value))

    
    

('METTS MARK', 21)
('BAXTER JOHN C', 21)
('ELLIOTT STEVEN', 21)
('CORDES WILLIAM R', 21)
('HANNON KEVIN P', 21)
('MORDAUNT KRISTINA M', 21)
('MEYER ROCKFORD G', 21)
('MCMAHON JEFFREY', 21)
('HORTON STANLEY C', 21)
('PIPER GREGORY F', 21)
('HUMPHREY GENE E', 21)
('UMANOFF ADAM S', 21)
('BLACHMAN JEREMY M', 21)
('SUNDE MARTIN', 21)
('GIBBS DANA R', 21)
('LOWRY CHARLES P', 21)
('COLWELL WESLEY', 21)
('MULLER MARK S', 21)
('JACKSON CHARLENE R', 21)
('WESTFAHL RICHARD K', 21)
('WALTERS GARETH W', 21)
('WALLS JR ROBERT H', 21)
('KITCHEN LOUISE', 21)
('CHAN RONNIE', 21)
('BELFER ROBERT', 21)
('SHANKMAN JEFFREY A', 21)
('WODRASKA JOHN', 21)
('BERGSIEKER RICHARD P', 21)
('URQUHART JOHN A', 21)
('BIBI PHILIPPE A', 21)
('RIEKER PAULA H', 21)
('WHALEY DAVID A', 21)
('BECK SALLY W', 21)
('HAUG DAVID L', 21)
('ECHOLS JOHN B', 21)
('MENDELSOHN JOHN', 21)
('HICKERSON GARY J', 21)
('CLINE KENNETH W', 21)
('LEWIS RICHARD', 21)
('HAYES ROBERT E', 21)
('MCCARTY DANNY J', 21)
('KOPPER MICHAEL J', 21)
('LEF

> **Info**:   The “poi” feature records whether the person is a person of interest, according to our definition. 


In [6]:
# How many POIs are there in the E+F dataset?

counter = 0
for i in enron_data.values():
    if i['poi'] == True:
        counter+=1
print " # POI is %d " %counter   

 # POI is 18 


> **Info**: We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).


In [13]:
# How many POI’s were there total? 
# (Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, 
# so we don’t have their emails.)

poi_name_record = open("C:/Users/Jon Targaryen/Desktop/mach learn/ud120-projects/final_project/poi_names.txt").read().split("\n")

poi_name_total = [record for record in poi_name_record if "(y)" in record or "(n)" in record]

print("Total number of POIs: ", len(poi_name_total))

('Total number of POIs: ', 35)


The main concern is whether there is enough data to really learn the patterns. In general, more data is better: the dataset contains only 18 of the 35 POIs, which doesn't give an algorithm many examples to learn from.

In [18]:
# What is the total value of the stock belonging to James Prentice?

enron_data["PRENTICE JAMES"]["total_stock_value"]

1095040

In [21]:
# How many email messages do we have from Wesley Colwell to persons of interest?
enron_data["COLWELL WESLEY"]["from_this_person_to_poi"]


11

In [22]:
# What’s the value of stock options exercised by Jeffrey K Skilling?

enron_data["SKILLING JEFFREY K"]["exercised_stock_options"]

19250000

In [20]:
# Of these three individuals (Lay, Skilling and Fastow), who took home the most money
# (largest value of “total_payments” feature)?
# How much money did that person get?

mykeys = ["SKILLING JEFFREY K","LAY KENNETH L","FASTOW ANDREW S"]
tot_value = list()
for key, value in enron_data.iteritems():
    if key in mykeys:
        tot_value = tot_value + [key,value["total_payments"]]
print tot_value

['LAY KENNETH L', 103559793, 'FASTOW ANDREW S', 2424083, 'SKILLING JEFFREY K', 8682716]
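
The printed list still has to be scanned by eye to answer the question. A small sketch of picking the largest value directly is shown here, using the three figures from the output above; the names `total_payments`, `top_name` and `top_value` are illustrative only.

```python
# total_payments values copied from the output above.
total_payments = {
    "LAY KENNETH L": 103559793,
    "FASTOW ANDREW S": 2424083,
    "SKILLING JEFFREY K": 8682716,
}

# max() iterates over the keys and ranks them by their payment value.
top_name = max(total_payments, key=total_payments.get)
top_value = total_payments[top_name]
print("%s %d" % (top_name, top_value))
```

Of the three, Kenneth Lay has the largest total_payments value, 103559793.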


In [24]:
# How many folks in this dataset have a quantified salary? What about a known email address?

count = 0
for key, value in enron_data.iteritems():
    if value["salary"]!='NaN':
        count += 1
print "people who have a quantified salary: " + str(count)

count1 = 0
for key, value in enron_data.iteritems():
    if value["email_address"]!='NaN':
        count1+= 1
print "people who have an email address: " + str(count1)


people who have a quantified salary: 95
people who have an email address: 111


95 have a quantified salary. 111 have a known email address.
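
The same counting loop works for any feature, so it could be wrapped in a small helper. The sketch below assumes a hypothetical function `count_known` and a made-up three-person `toy_data`; it is not part of the course code.

```python
def count_known(data, feature):
    """Count people whose value for `feature` is not the string 'NaN'."""
    return sum(1 for person in data.values() if person[feature] != "NaN")


# Made-up records, used only to exercise the helper.
toy_data = {
    "PERSON A": {"salary": 100000, "email_address": "a@example.com"},
    "PERSON B": {"salary": "NaN", "email_address": "b@example.com"},
    "PERSON C": {"salary": "NaN", "email_address": "NaN"},
}

print(count_known(toy_data, "salary"))
print(count_known(toy_data, "email_address"))
```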

> **Info**: A Python dictionary can’t be read directly into an sklearn classification or regression algorithm; instead, it needs a numpy array or a list of lists (each inner list is one data point, and its elements are the features of that point).
We’ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.
When a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero).
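
To make the conversion concrete, here is a rough sketch of the idea, not the actual featureFormat() implementation. The function `dict_to_rows` and the records in `toy_data` are hypothetical; the sketch only shows a dict of dicts becoming a list of lists with "NaN" replaced by 0.

```python
def dict_to_rows(data, feature_names):
    """Return one list per person, with 'NaN' values replaced by 0."""
    rows = []
    for name in sorted(data):  # sorted so the row order is reproducible
        person = data[name]
        rows.append([0 if person[f] == "NaN" else person[f] for f in feature_names])
    return rows


# Made-up records for illustration.
toy_data = {
    "PERSON A": {"poi": True, "salary": 250000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 40000},
}

rows = dict_to_rows(toy_data, ["poi", "salary", "bonus"])
print(rows)
```

A list of lists like this can be passed to `numpy.array` if an array is needed. Note that after the replacement, a missing value and a genuine zero look the same.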

As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in final_project/enron61702insiderpay.pdf, which is missing some POI’s (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POI’s, we do have emails.

While it would be straightforward to add these POI’s and their email information to the E+F dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. I will walk through that here.


In [31]:
# How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? 
# What percentage of people in the dataset as a whole is this?

count2 = 0
for key, value in enron_data.iteritems():
    if value["total_payments"]=='NaN':
        count2 += 1
print "people in set with no number of their payments: " + str(count2)

print "percentage of people in the set with payment value missing: " + str(float(count2)/len(enron_data)*100)

people in set with no number of their payments: 21
percentage of people in the set with payment value missing: 14.3835616438


 21 out of 146 (about 14%) of the people in the dataset don't have "total_payments" filled in.

In [41]:
# How many POIs in the E+F dataset have “NaN” for their total payments? 
# What percentage of POI’s as a whole is this?


# number of pois

counter = 0
for i in enron_data.values():
    if i['poi'] == True:
        counter+=1
print "POI is %d " %counter  


# pois with 'NaN' payments
count3 = 0
for key, value in enron_data.iteritems():
    # "poi" is stored as a boolean, so test it directly rather than against the string 'True'
    if value["poi"] and value["total_payments"]=='NaN':
        count3 += 1
print "POI people in set with no number of their payments: " + str(count3)

# The percentage is taken over the POIs (counter), not over the whole dataset
print "Percentage of POI people in the set with payment value missing: " + str(float(count3)/counter*100)

POI is 18 
POI people in set with no number of their payments: 0
Percentage of POI people in the set with payment value missing: 0.0


0 out of 18 POIs (0%) are missing total_payments.

If a machine learning algorithm were to use "total_payments" as a feature, 
would you expect it to associate a “NaN” value with POIs or non-POIs?

No training points would have "NaN" for total_payments when the class label is "POI".
The "NaN" would be associated with non-POIs.

If you added in, say, 10 more data points which were all POI’s, and put “NaN” for the total payments for those folks, 
the numbers you just calculated would change.
What is the new number of people of the dataset? What is the new number of folks with “NaN” for total payments?


New number of people in the dataset = 146 + 10 = 156.

New number of people with "NaN" for total_payments = 21 + 10 = 31.


Now there are 156 people in the dataset, 31 of whom have "NaN" for total_payments. 

That is 31/156 = 19.87% of the dataset with "NaN" for total_payments.


What is the new number of POI’s in the dataset? What is the new number of POI’s with NaN for total_payments?

New number of POI's in the data set = 18 + 10 = 28.

New number of POI’s with NaN for total_payments = 0 + 10 = 10.

Share of POIs with "NaN" for total_payments = 10/28 = 35.71%.


That's about 36% of POIs with "NaN" for total_payments, a big jump from 0% before.
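
The arithmetic for this thought experiment can be checked with a few lines. This is only a sketch using the counts computed earlier in the notebook (146 people, 21 with "NaN", 18 POIs, 0 POIs with "NaN") plus the 10 hypothetical additions; the helper `percent` and the variable names are illustrative.

```python
def percent(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


# Counts from the cells above.
n_people, n_nan = 146, 21
n_poi, n_poi_nan = 18, 0
n_added = 10  # hypothetical POIs added with "NaN" total_payments

before_all = percent(n_nan, n_people)
before_poi = percent(n_poi_nan, n_poi)
after_all = percent(n_nan + n_added, n_people + n_added)
after_poi = percent(n_poi_nan + n_added, n_poi + n_added)

print("all people: %.2f%% -> %.2f%%" % (before_all, after_all))
print("POIs only:  %.2f%% -> %.2f%%" % (before_poi, after_poi))
```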

Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI?

Yes, it totally could.


Adding the new POIs in this example, none of whom we have financial information for, has introduced a subtle problem: our lack of financial information about them can be picked up by an algorithm as a clue that they’re POIs. Another way to think about this is that there’s now a difference in how we generated the data for our two classes: non-POIs all come from the financial spreadsheet, while many POIs get added in by hand afterwards. That difference can trick us into thinking we have better performance than we do. Suppose you use your POI detector to decide whether a new, unseen person is a POI, and that person isn’t on the spreadsheet. Then all their financial data would be “NaN”, but the person is very likely not a POI (there are many more non-POIs than POIs in the world, and even at Enron). You’d be likely to accidentally identify them as a POI, though!
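
One way to see the clue numerically is to look at it from the other direction: among people with "NaN" for total_payments, what fraction are POIs? The sketch below uses the counts from this notebook and the 10 hypothetical additions; `poi_share` and the variable names are made up for illustration.

```python
def poi_share(n_poi, n_total):
    """Fraction of a group that are POIs."""
    return float(n_poi) / n_total


# Before: 21 people have "NaN", none of them POIs; 18 of 146 people are POIs.
nan_before = poi_share(0, 21)
base_before = poi_share(18, 146)

# After adding 10 POIs with "NaN": 10 of 31 "NaN" people are POIs; 28 of 156 overall.
nan_after = poi_share(10, 31)
base_after = poi_share(28, 156)

print("before: POI share among NaN = %.3f, overall = %.3f" % (nan_before, base_before))
print("after:  POI share among NaN = %.3f, overall = %.3f" % (nan_after, base_after))
```

Before the additions, a "NaN" points away from POI status; afterwards, the POI share among "NaN" rows is higher than the overall POI share, so the missing value itself starts to look like evidence.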

For now, the takeaway message is to be very careful about introducing features that come from different sources depending on the class! It’s a classic way to accidentally introduce biases and mistakes.