## Enron Data POI Classifier 
### Jo Anna Capp

In the 1990's Enron Corporation was one of the largest.....


In [2]:
#set working directory
import os
os.chdir('D:/Documents/Udacity/IntroMachineLearning/ud120projectsmaster/ud120projectsmaster/UdacityP5')

In [3]:
#import all packages and modules here
import sys
import pickle
sys.path.append("../tools/")
import pandas
import numpy
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
from ggplot import *

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

In [4]:
features_list = ['poi']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

Let's look at the structure of the dataset and check for missing values.

In [5]:
#total individuals
print "There are ", len(data_dict.keys()), "executives of interest in the Enron dataset"
#number of pois
num_poi = 0
for dic in data_dict.values():
    if dic['poi'] == 1: 
        num_poi += 1
print "There are ", num_poi, "identified persons of interest within the dataset"
print "Data Dictionary Keys:"
print(data_dict.keys())
#data dictionary format
print "A typical key:value list: ", data_dict["SKILLING JEFFREY K"]


There are  146 executives of interest in the Enron dataset
There are  18 identified persons of interest within the dataset
Data Dictionary Keys:
['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIM

This brief exploration shows 146 executives in the dataset, 18 of them identified POIs. Each record holds 21 fields (20 features plus the poi label), for a total of 146 × 21 = 3,066 values. There are also a number of missing values, which I'll investigate in the next section.
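
Since the printed output is cut off, here is a minimal sketch of the dictionary's nested shape and of the same two counts done directly on it. The two records and their values are hypothetical placeholders, not entries from the dataset; the only thing carried over from the notebook is that missing values are stored as the string 'NaN'.

```python
from collections import Counter

# Hypothetical two-person excerpt; the real data_dict has the same nested
# shape: {person_name: {feature_name: value_or_'NaN'}}.
toy_dict = {
    "PERSON A": {"poi": True, "salary": 250000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
}

# Count how many people are flagged as POI.
n_poi = sum(1 for person in toy_dict.values() if person["poi"])

# Count the 'NaN' placeholder strings per feature.
missing = Counter(
    feature
    for person in toy_dict.values()
    for feature, value in person.items()
    if value == "NaN"
)

print("POIs: {}".format(n_poi))
print("Missing per feature: {}".format(dict(missing)))
```

Counting on the raw dictionary avoids any dependency on pandas and makes it explicit that the placeholder is a string rather than a float NaN.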

### EDA and Outlier Removal

In [13]:
#change dataset to pandas dataframe
df = pandas.DataFrame.from_records(list(data_dict.values()))
employees = pandas.Series(list(data_dict.keys()))

#count number of NA values
df.replace(to_replace='NaN', value=numpy.nan, inplace=True)
print "Number of NaN values for each feature:"
print df.isnull().sum()
print "Shape of the dataframe: ", df.shape

Number of NaN values for each feature:
bonus                         64
deferral_payments            107
deferred_income               97
director_fees                129
email_address                 35
exercised_stock_options       44
expenses                      51
from_messages                 60
from_poi_to_this_person       60
from_this_person_to_poi       60
loan_advances                142
long_term_incentive           80
other                         53
poi                            0
restricted_stock              36
restricted_stock_deferred    128
salary                        51
shared_receipt_with_poi       60
to_messages                   60
total_payments                21
total_stock_value             20
dtype: int64
Shape of the dataframe:  (146, 21)


Some features have a large number of NaN values, particularly loan_advances, director_fees, restricted_stock_deferred, and deferral_payments. However, the data schema provided (enron61702insiderpay.pdf) shows that these "missing values" are actually 0, so I will convert them to 0.

In [12]:
#replace missing values with 0
df.replace(to_replace=numpy.nan, value=0, inplace=True)
df.describe()

| | bonus | deferral_payments | deferred_income | director_fees | email_address | exercised_stock_options | expenses | from_messages | from_poi_to_this_person | from_this_person_to_poi | ... | long_term_incentive | other | poi | restricted_stock | restricted_stock_deferred | salary | shared_receipt_with_poi | to_messages | total_payments | total_stock_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | ... | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 | 146.0 |
| mean | 1333474.0 | 438796.5 | -382762.2 | 19422.49 | 0.0 | 4182736.0 | 70748.27 | 358.60274 | 38.226027 | 24.287671 | ... | 664683.9 | 585431.8 | 0.123288 | 1749257.0 | 20516.37 | 365811.4 | 692.986301 | 1221.589041 | 4350622.0 | 5846018.0 |
| std | 8094029.0 | 2741325.0 | 2378250.0 | 119054.3 | 0.0 | 26070400.0 | 432716.3 | 1441.259868 | 73.901124 | 79.278206 | ... | 4046072.0 | 3682345.0 | 0.329899 | 10899950.0 | 1439661.0 | 2203575.0 | 1072.969492 | 2226.770637 | 26934480.0 | 36246810.0 |
| min | 0.0 | -102500.0 | -27992890.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | -2604490.0 | -7576788.0 | 0.0 | 0.0 | 0.0 | 0.0 | -44093.0 |
| 25% | 0.0 | 0.0 | -37926.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 8115.0 | 0.0 | 0.0 | 0.0 | 0.0 | 93944.75 | 228869.5 |
| 50% | 300000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 608293.5 | 20182.0 | 16.5 | 2.5 | 0.0 | ... | 0.0 | 959.5 | 0.0 | 360528.0 | 0.0 | 210596.0 | 102.5 | 289.0 | 941359.5 | 965955.0 |
| 75% | 800000.0 | 9684.5 | 0.0 | 0.0 | 0.0 | 1714221.0 | 53740.75 | 51.25 | 40.75 | 13.75 | ... | 375064.8 | 150606.5 | 0.0 | 814528.0 | 0.0 | 270850.5 | 893.5 | 1585.75 | 1968287.0 | 2319991.0 |
| max | 97343620.0 | 32083400.0 | 0.0 | 1398517.0 | 0.0 | 311764000.0 | 5235198.0 | 14368.0 | 528.0 | 609.0 | ... | 48521930.0 | 42667590.0 | 1.0 | 130322300.0 | 15456290.0 | 26704230.0 | 5521.0 | 15149.0 | 309886600.0 | 434509500.0 |
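
The blanket replacement above also writes 0 into non-numeric columns such as email_address. If that is not wanted, a more selective variant could fill only the numeric columns. This is a sketch under that assumption, using a hypothetical three-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row excerpt with column names borrowed from the dataset.
toy = pd.DataFrame({
    "salary": [250000.0, np.nan, 180000.0],
    "loan_advances": [np.nan, np.nan, 400000.0],
    "email_address": ["a@example.com", np.nan, "c@example.com"],
})

# Fill only the numeric columns; text columns such as email_address keep NaN.
numeric_cols = toy.select_dtypes(include=[np.number]).columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(0)

print(filled)
```

Working on a copy keeps the unfilled frame available in case the count of missing values is needed again later.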


In [7]:
#pairplot to visualize feature distributions
def splom_viz(df, labels=None):
    ax = sns.pairplot(df, hue="poi", diag_kind='kde', size=2, vars=['poi','salary', 'total_payments', 'bonus', 
                 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 
                 'restricted_stock', 'to_messages', 'from_poi_to_this_person', 'from_messages',
                 'from_this_person_to_poi', 'shared_receipt_with_poi'])
    plt.show()

splom_viz(df)

Looking at this data, there definitely appear to be outliers. Looking at the data dictionary keys again, two of them are not names of people: 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'. I'll remove these and then look at the pairplot again.

#### Outlier Removal

In [8]:
#outlier removal
df_subset = df.drop(df.index[[data_dict.keys().index("TOTAL"), data_dict.keys().index("THE TRAVEL AGENCY IN THE PARK")]])
df_subset.describe()

#pairplot to visualize distributions and correlations
splom_viz(df_subset)
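
The drop above acts on the DataFrame only, while the export step later in the notebook assigns `my_dataset = data_dict`, which still contains both entries. One possible way to remove the same keys at the dictionary level is sketched below. The miniature dictionary and its values are hypothetical; only the two key names come from the dataset.

```python
# Hypothetical miniature of data_dict with the two non-person keys present.
toy_dict = {
    "PERSON A": {"poi": False, "salary": 250000},
    "TOTAL": {"poi": False, "salary": 1000000},
    "THE TRAVEL AGENCY IN THE PARK": {"poi": False, "salary": "NaN"},
}

non_person_keys = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]

# Build a cleaned copy instead of mutating the original dictionary.
cleaned = {
    name: features
    for name, features in toy_dict.items()
    if name not in non_person_keys
}

print(sorted(cleaned.keys()))
```

Building a new dictionary rather than calling `pop` keeps the raw data intact, so the cell can be rerun without reloading the pickle.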

After these two entries are removed, the pairplot shows that most of the remaining extreme values belong to POIs, so these "outliers" are in fact real data. The exceptions are the features 'from_poi_to_this_person', 'from_this_person_to_poi', and 'from_messages'. In the summary statistics above, the maximum of 'from_messages' is 14368, more than two orders of magnitude above its 75th percentile (51.25), and the maxima of the other two features (528 and 609) are at least one order of magnitude above theirs. Who are these people?

In [9]:
#identify keys for potential outliers
for key, value in data_dict.items():
    if value['from_poi_to_this_person'] != 'NaN' and value['from_poi_to_this_person'] > 500: 
        print "Max from_poi_to_this_person: ", key

for key, value in data_dict.items():
    if value['from_this_person_to_poi'] != 'NaN' and value['from_this_person_to_poi'] > 500: 
        print "Max from_this_person_to_poi: ", key
        
for key, value in data_dict.items():
    if value['from_messages'] != 'NaN' and value['from_messages'] > 14000: 
        print "Max from_messages: ", key

Max from_poi_to_this_person:  LAVORATO JOHN J
Max from_this_person_to_poi:  DELAINEY DAVID W
Max from_messages:  KAMINSKI WINCENTY J
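
As a more compact alternative to the three loops above, the same lookup can be sketched with pandas, assuming the frame is indexed by person name (the notebook's `df` is not, so this is an illustration only). The names and counts below are hypothetical.

```python
import pandas as pd

# Hypothetical counts, indexed by person name; not values from the dataset.
toy = pd.DataFrame(
    {
        "from_poi_to_this_person": [12, 300, 45],
        "from_this_person_to_poi": [400, 8, 20],
        "from_messages": [150, 90, 9000],
    },
    index=["PERSON A", "PERSON B", "PERSON C"],
)

# idxmax returns, for each column, the index label of the row holding the maximum.
top_senders = toy.idxmax()
print(top_senders)
```

Unlike the hard-coded thresholds of 500 and 14000, `idxmax` needs no prior knowledge of where the maximum lies.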


Since these keys all belong to different people, I will keep the email data and assume it is real. Finally, the pairplots and statistics show negative values for deferred_income, deferral_payments, restricted_stock, and restricted_stock_deferred. Are these outliers, real data, or errors in the dataset?

#### Checking financial features

One final check is to make sure there aren't any mistakes in the financial data. The data schema shows that two features, total_payments and total_stock_value, are sums of the other financial features. If the negative values observed above are real, then total_payments and total_stock_value should equal the sum of their component features. If there is an error in the dataset, these values will not match.
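
A minimal sketch of such a check follows. The two component lists are my assumption about how the schema groups the columns and should be verified against enron61702insiderpay.pdf; `find_mismatches` is a hypothetical helper, and the two rows are made-up numbers chosen so that one is consistent and one is not.

```python
import pandas as pd

# Assumed component lists; verify against the schema PDF before relying on them.
PAYMENT_PARTS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                 "deferral_payments", "loan_advances", "other", "expenses",
                 "director_fees"]
STOCK_PARTS = ["exercised_stock_options", "restricted_stock",
               "restricted_stock_deferred"]


def find_mismatches(frame, parts, total_col):
    """Return the rows where the parts do not add up to the reported total."""
    computed = frame[parts].sum(axis=1)
    return frame[computed != frame[total_col]]


# Hypothetical rows: PERSON A is consistent (100 - 40 = 60), PERSON B is not.
rows = {
    "PERSON A": {"salary": 100, "deferred_income": -40, "total_payments": 60},
    "PERSON B": {"salary": 100, "bonus": 50, "total_payments": 500},
}
toy = (pd.DataFrame.from_dict(rows, orient="index")
       .reindex(columns=PAYMENT_PARTS + ["total_payments"])
       .fillna(0))

bad_rows = find_mismatches(toy, PAYMENT_PARTS, "total_payments")
print(bad_rows.index.tolist())
```

On the real data the same helper could be called as `find_mismatches(df_subset, STOCK_PARTS, "total_stock_value")`. The exact comparison assumes whole-dollar amounts; with fractional values a tolerance such as `numpy.isclose` would be safer.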

### Feature Selection

In [None]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
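
The notebook does not yet define a new feature for Task 3. Purely as an illustration, the sketch below adds a ratio feature to each record. The feature name `fraction_from_poi` and the helper `safe_fraction` are hypothetical choices of mine, not something this analysis has settled on, and the records are placeholders.

```python
def safe_fraction(part, whole):
    """Return part / whole, or 0.0 when a value is missing or whole is 0."""
    if part == "NaN" or whole == "NaN" or whole == 0:
        return 0.0
    return float(part) / whole


# Hypothetical records using the dataset's 'NaN' string convention.
toy_dict = {
    "PERSON A": {"from_poi_to_this_person": 20, "to_messages": 200},
    "PERSON B": {"from_poi_to_this_person": "NaN", "to_messages": "NaN"},
}

# Add the derived feature to every record in place.
for features in toy_dict.values():
    features["fraction_from_poi"] = safe_fraction(
        features["from_poi_to_this_person"], features["to_messages"])

print(toy_dict["PERSON A"]["fraction_from_poi"])
```

A ratio of this kind normalizes for how much email a person receives overall. Any new feature name would also have to be appended to `features_list` before `featureFormat` picks it up.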

In [None]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
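
The Pipeline approach mentioned in the comments above can be sketched as follows. The choice of scaling, two principal components, and Naive Bayes is only an example, and the data is synthetic, standing in for the `features` and `labels` lists produced by `targetFeatureSplit`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels.
rng = np.random.RandomState(42)
toy_features = rng.normal(size=(60, 5))
toy_labels = (toy_features[:, 0] > 0.5).astype(int)

# Each step is fitted on the training data only, then applied in order.
toy_clf = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("nb", GaussianNB()),
])
toy_clf.fit(toy_features, toy_labels)
predictions = toy_clf.predict(toy_features)

print(predictions[:10])
```

Wrapping the steps in a single estimator means the scaler and PCA are refitted inside every cross-validation split, which avoids leaking test data into the preprocessing.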

In [None]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
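
A single train/test split can be unstable with so few POIs. The sketch below shows stratified shuffle split validation with precision and recall pooled over all splits. It assumes scikit-learn 0.18 or later, where these utilities live in `sklearn.model_selection` rather than `sklearn.cross_validation`, and it uses synthetic data with a class balance loosely modelled on the dataset.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data: 18 positives out of 144 rows.
rng = np.random.RandomState(42)
toy_labels = np.array([1] * 18 + [0] * 126)
toy_features = rng.normal(size=(144, 3)) + toy_labels[:, None] * 2.0

# Each split preserves the class ratio in both the train and test parts.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

true_all, pred_all = [], []
for train_idx, test_idx in splitter.split(toy_features, toy_labels):
    model = GaussianNB().fit(toy_features[train_idx], toy_labels[train_idx])
    pred_all.extend(model.predict(toy_features[test_idx]))
    true_all.extend(toy_labels[test_idx])

# Pool predictions from all splits before computing the metrics.
precision = precision_score(true_all, pred_all)
recall = recall_score(true_all, pred_all)
print("precision: {:.3f}, recall: {:.3f}".format(precision, recall))
```

Pooling the predictions before scoring gives more stable precision and recall than averaging per-split scores when some splits contain only a handful of positives.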

In [None]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)