## Enron Data POI Classifier 
### Jo Anna Capp

In the 1990's Enron Corporation was one of the largest.....


In [1]:
#set working directory
import os
os.chdir('D:/Documents/Udacity/IntroMachineLearning/ud120projectsmaster/ud120projectsmaster/UdacityP5')

In [2]:
#import all packages and modules here
import sys
import pickle
sys.path.append("../tools/")
import pandas
import numpy
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
from ggplot import *

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

In [3]:
features_list = ['poi']

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

Let's look at the structure of the dataset and check for missing values.

In [4]:
#total individuals
print "There are ", len(data_dict.keys()), "executives of interest in the Enron dataset"
#number of pois
num_poi = 0
for dic in data_dict.values():
    if dic['poi'] == 1: 
        num_poi += 1
print "There are ", num_poi, "identified persons of interest within the dataset"
print "Data Dictionary Keys:"
print(data_dict.keys())
#data dictionary format
print "A typical key:value list: ", data_dict["SKILLING JEFFREY K"]


There are  146 executives of interest in the Enron dataset
There are  18 identified persons of interest within the dataset
Data Dictionary Keys:
['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIM

I can see from this brief exploration that there are 146 executives in the dataset, 18 identified POIs, and 21 features per person, for a total of 3066 data points (146 x 21). There are also a number of missing values, stored as the string 'NaN'. I'll investigate those in the next section.

### EDA and Outlier Removal

In [5]:
#change dataset to pandas dataframe
df = pandas.DataFrame.from_records(list(data_dict.values()))
#delete 'email addresses' from df
if 'email_address' in list(df.columns.values):
    df.drop('email_address', axis=1, inplace=True)

persons = pandas.Series(list(data_dict.keys()))

#count number of NA values
df.replace(to_replace='NaN', value=numpy.nan, inplace=True)
print "Number of NaN values for each feature:"
print df.isnull().sum()
print "Shape of the dataframe: ", df.shape

Number of NaN values for each feature:
bonus                         64
deferral_payments            107
deferred_income               97
director_fees                129
exercised_stock_options       44
expenses                      51
from_messages                 60
from_poi_to_this_person       60
from_this_person_to_poi       60
loan_advances                142
long_term_incentive           80
other                         53
poi                            0
restricted_stock              36
restricted_stock_deferred    128
salary                        51
shared_receipt_with_poi       60
to_messages                   60
total_payments                21
total_stock_value             20
dtype: int64
Shape of the dataframe:  (146, 20)


There are quite a lot of NaN values for some of the features, particularly loan advances (142), director fees (129), restricted stock deferred (128), and deferral payments (107). Each of these is missing for more than 100 of the 146 entries, so I will not add them to my classifier. For the rest of the features, I may try to impute the missing data, or I may discard the feature. Next, I'll look at the structure of each feature and the correlations between features.
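As a sketch of what imputation could look like, the example below fills each missing value with the median of its column. The median is an assumption made for illustration (it is less sensitive to skewed financial values than the mean), not necessarily the approach used later, and the `toy` frame holds made-up placeholder values rather than Enron figures.

```python
import numpy
import pandas

# Toy data: placeholder values, not taken from the Enron dataset
toy = pandas.DataFrame({
    'salary': [200000.0, numpy.nan, 300000.0, 250000.0],
    'bonus': [numpy.nan, 500000.0, numpy.nan, 700000.0],
})

# median() skips NaN by default; fillna() with a Series fills column by column
imputed = toy.fillna(toy.median())
print(imputed)
```

Whether imputation is appropriate depends on why a value is missing; for payment fields, a missing entry may simply mean that no payment was made.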

In [6]:
#delete features with more than 100 missing values
#(select the columns first rather than dropping while iterating over the frame)
sparse_columns = df.columns[df.isnull().sum() > 100]
df.drop(sparse_columns, axis=1, inplace=True)

df.describe()

statistic,bonus,deferred_income,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,long_term_incentive,other,restricted_stock,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,82.0,49.0,102.0,95.0,86.0,86.0,86.0,66.0,93.0,110.0,95.0,86.0,86.0,125.0,126.0
mean,2374235.0,-1140475.0,5987054.0,108728.9,608.790698,64.895349,41.232558,1470361.0,919065.0,2321741.0,562194.3,1176.465116,2073.860465,5081526.0,6773957.0
std,10713330.0,4025406.0,31062010.0,533534.8,1841.033949,86.979244,100.073111,5942759.0,4589253.0,12518280.0,2716369.0,1178.317641,2582.700981,29061720.0,38957770.0
min,70000.0,-27992890.0,3285.0,148.0,12.0,0.0,0.0,69223.0,2.0,-2604490.0,477.0,2.0,57.0,148.0,-44093.0
25%,431250.0,-694862.0,527886.2,22614.0,22.75,10.0,1.0,281250.0,1215.0,254018.0,211816.0,249.75,541.25,394475.0,494510.2
50%,769375.0,-159792.0,1310814.0,46950.0,41.0,35.0,8.0,442035.0,52382.0,451740.0,259996.0,740.5,1211.0,1101393.0,1102872.0
75%,1200000.0,-38346.0,2547724.0,79952.5,145.5,72.25,24.75,938672.0,362096.0,1002370.0,312117.0,1888.25,2634.75,2093263.0,2949847.0
max,97343620.0,-833.0,311764000.0,5235198.0,14368.0,528.0,609.0,48521930.0,42667590.0,130322300.0,26704230.0,5521.0,15149.0,309886600.0,434509500.0


In [7]:
#pairplot to visualize feature distributions
def splom_viz(df, labels=None):
    ax = sns.pairplot(df, hue="poi", diag_kind='kde', size=2, vars=['poi','salary', 'total_payments', 'bonus', 
                 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 
                 'restricted_stock', 'to_messages', 'from_poi_to_this_person', 'from_messages',
                 'from_this_person_to_poi', 'shared_receipt_with_poi'])
    plt.show()

splom_viz(df)

Looking at this data, there definitely appear to be outliers. If I look at the data dictionary keys again, I see that two of them are not names of people: 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK'. I'll remove these and then look again at the pairplot.

#### Outlier Removal

In [8]:
#outlier removal: drop the two rows whose keys are not individual people
non_person_keys = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]
df_subset = df.drop(persons[persons.isin(non_person_keys)].index)
df_subset.describe()

#pairplot to visualize distributions and correlations
splom_viz(df_subset)

After these outliers are removed, I can see on the pairplot that most of the remaining outliers are classified as POI, so these "outliers" are in fact real data. The exceptions are the features 'from_poi_to_this_person', 'from_this_person_to_poi', and 'from_messages'. Looking at the statistics above, the max 'from_messages' is 14368, roughly two orders of magnitude higher than the 75th percentile (145.5). The maxima of the other two features are also far above their 75th percentiles: 528 versus 72.25 for 'from_poi_to_this_person' and 609 versus 24.75 for 'from_this_person_to_poi'. Who are these people?

In [9]:
#identify keys for potential outliers
for key, value in data_dict.items():
    if value['from_poi_to_this_person'] != 'NaN' and value['from_poi_to_this_person'] > 500: 
        print "Max from_poi_to_this_person: ", key

for key, value in data_dict.items():
    if value['from_this_person_to_poi'] != 'NaN' and value['from_this_person_to_poi'] > 500: 
        print "Max from_this_person_to_poi: ", key
        
for key, value in data_dict.items():
    if value['from_messages'] != 'NaN' and value['from_messages'] > 14000: 
        print "Max from_messages: ", key

Max from_poi_to_this_person:  LAVORATO JOHN J
Max from_this_person_to_poi:  DELAINEY DAVID W
Max from_messages:  KAMINSKI WINCENTY J


Since these keys are all different people, I will keep the email data and assume it is real.
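The lookups above rely on hand-picked thresholds (500 and 14000). As an alternative sketch, the holder of a feature's maximum can be found directly, skipping the 'NaN' strings the dataset uses for missing values. `toy_dict` and `person_with_max` are hypothetical names introduced for this example, and the values are placeholders rather than Enron data.

```python
# Toy stand-in for data_dict: names and counts are made-up placeholders
toy_dict = {
    'PERSON A': {'from_messages': 40, 'from_this_person_to_poi': 'NaN'},
    'PERSON B': {'from_messages': 'NaN', 'from_this_person_to_poi': 12},
    'PERSON C': {'from_messages': 900, 'from_this_person_to_poi': 3},
}

def person_with_max(dataset, feature):
    """Return the (key, value) pair with the largest non-missing value."""
    present = [(key, record[feature]) for key, record in dataset.items()
               if record[feature] != 'NaN']
    return max(present, key=lambda pair: pair[1])

for feature in ['from_messages', 'from_this_person_to_poi']:
    print(person_with_max(toy_dict, feature))
```

This avoids having to read the maximum off the summary table before writing the filter.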

### Feature Selection

In [None]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
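Note that the outlier rows were dropped from the DataFrame only, so `my_dataset`, which is assigned from `data_dict`, would still contain the `TOTAL` and `THE TRAVEL AGENCY IN THE PARK` keys. A minimal sketch of removing them from a dictionary is shown below; `toy_dict` and `cleaned` are hypothetical names with placeholder values.

```python
# Toy stand-in for data_dict; the values are placeholders
toy_dict = {
    'PERSON A': {'poi': False},
    'TOTAL': {'poi': False},
    'THE TRAVEL AGENCY IN THE PARK': {'poi': False},
}

# Shallow copy so the original dictionary keeps all of its keys
cleaned = dict(toy_dict)
for key in ['TOTAL', 'THE TRAVEL AGENCY IN THE PARK']:
    # pop with a default does not raise if the key is already gone
    cleaned.pop(key, None)

print(sorted(cleaned.keys()))
```

Applied to the real data, the same loop would run before the dictionary is passed to `featureFormat`.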

In [None]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
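The comment above points to Pipelines for multi-stage operations such as PCA. A minimal sketch of that idea follows, chaining scaling, PCA, and the Gaussian naive Bayes classifier. The data is synthetic, and the step names, `n_components=2`, and the variable `pipeline_clf` are assumptions chosen only for illustration.

```python
import numpy
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 40 samples, 5 features, binary labels
rng = numpy.random.RandomState(42)
toy_features = rng.normal(size=(40, 5))
toy_labels = numpy.array([0, 1] * 20)

# Each step is a (name, estimator) pair; only the last step is a classifier
pipeline_clf = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('nb', GaussianNB()),
])

# fit() runs the transformers in order, then trains the classifier
pipeline_clf.fit(toy_features, toy_labels)
predictions = pipeline_clf.predict(toy_features)
print(predictions.shape)
```

Because the pipeline behaves like a single estimator, it could be assigned to `clf` and exported in the same way as a plain classifier.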

In [None]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
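try:
    from sklearn.model_selection import train_test_split  # scikit-learn 0.18 and later
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases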
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
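The task description above mentions stratified shuffle split cross validation. A sketch of what that does is shown below, assuming scikit-learn 0.18 or later, where the class lives in `sklearn.model_selection` and takes `n_splits`. The labels are synthetic, built only to mirror the 18 POIs among 146 entries reported earlier; `n_splits=5` and the variable names are illustrative choices.

```python
import numpy
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels mirroring the class balance reported earlier:
# 18 POIs among 146 entries
toy_labels = numpy.array([1] * 18 + [0] * 128)
toy_features = numpy.zeros((len(toy_labels), 1))  # placeholder features

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
poi_counts = []
for train_idx, test_idx in splitter.split(toy_features, toy_labels):
    # Stratification keeps the POI share roughly equal in every test fold
    poi_counts.append(int(toy_labels[test_idx].sum()))

print(poi_counts)
```

With so few POIs, a plain random split could leave a test fold with almost none of them, which would make precision and recall unstable.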

In [None]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)