# Enron Case - Person of Interest Identifier - Intro to Machine Learning Final Project

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, we will play detective, and build a person of interest identifier based on financial and email data made public as a result of the Enron scandal.

In [19]:
#!/usr/bin/python

import sys
import pickle
import numpy as np
import pandas as pd
import copy
pd.set_option("display.max_columns",100)
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

In [20]:
### Import plotly offline library
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.plotly as py
from plotly.graph_objs  import *

# Data Exploration

In [21]:
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

### How many people do we have in our dataset?

In [22]:
len(data_dict)

146

Our dataset contains data about 146 people.

### How many features do we have in our dataset?

In [23]:
features_in_dataset = set()
for key, value in data_dict.iteritems():
    for feature in value:
        features_in_dataset.add(feature)
len(features_in_dataset)

21

Our dataset contains 20 features and 1 label (POI: Person Of Interest) we want to predict.

In [24]:
df_data_dict = pd.DataFrame.from_dict(data_dict).transpose()

In [25]:
df_data_dict['poi'].value_counts()

False    128
True      18
Name: poi, dtype: int64

18 people are labelled as persons of interest.
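
With only 18 POIs among 146 records, the classes are heavily imbalanced. As a rough illustration, the arithmetic below shows why plain accuracy would be a misleading metric here. The counts come from the output above; the "always negative" baseline is a hypothetical classifier introduced only for this sketch.

```python
# Counts taken from the value_counts() output above.
n_poi = 18
n_non_poi = 128
n_total = n_poi + n_non_poi

# Share of records labelled as POI.
poi_share = n_poi / float(n_total)

# Hypothetical baseline: a classifier that always predicts "not a POI".
baseline_accuracy = n_non_poi / float(n_total)
baseline_recall = 0 / float(n_poi)  # it never finds a single POI

print("POI share: %.3f" % poi_share)
print("Always-negative accuracy: %.3f" % baseline_accuracy)
print("Always-negative recall: %.3f" % baseline_recall)
```

Such a baseline would look accurate while identifying no POI at all, which is one reason to evaluate with precision, recall and F1 instead.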

In [26]:
# creation of a copy of the dictionary to track missing values
data_dict_missing_values = copy.deepcopy(data_dict)

In [27]:
for key, value in data_dict_missing_values.iteritems():
    for features, values in value.iteritems():
        if features != 'poi':
            if values == 'NaN':
                data_dict_missing_values[key][features] = "NoValue"
            else:
                data_dict_missing_values[key][features] = "Value"
df_data_dict_missing_values = pd.DataFrame.from_dict(data_dict_missing_values).transpose()

#### Distribution of missing data among all features, grouped by type of POI

In [28]:
(df_data_dict_missing_values.set_index('poi').stack().groupby(level=[0,1]).value_counts()
 .sort_index(ascending=[False, True, False]).unstack([1,2]).fillna(0).astype(int))

| Feature | poi=False, Value | poi=False, NoValue | poi=True, Value | poi=True, NoValue |
|---|---|---|---|---|
| bonus | 66 | 62 | 16 | 2 |
| deferral_payments | 34 | 94 | 5 | 13 |
| deferred_income | 38 | 90 | 11 | 7 |
| director_fees | 17 | 111 | 0 | 18 |
| email_address | 93 | 35 | 18 | 0 |
| exercised_stock_options | 90 | 38 | 12 | 6 |
| expenses | 77 | 51 | 18 | 0 |
| from_messages | 72 | 56 | 14 | 4 |
| from_poi_to_this_person | 72 | 56 | 14 | 4 |
| from_this_person_to_poi | 72 | 56 | 14 | 4 |
| loan_advances | 3 | 125 | 1 | 17 |
| long_term_incentive | 54 | 74 | 12 | 6 |
| other | 75 | 53 | 18 | 0 |
| restricted_stock | 93 | 35 | 17 | 1 |
| restricted_stock_deferred | 18 | 110 | 0 | 18 |
| salary | 78 | 50 | 17 | 1 |
| shared_receipt_with_poi | 72 | 56 | 14 | 4 |
| to_messages | 72 | 56 | 14 | 4 |
| total_payments | 107 | 21 | 18 | 0 |
| total_stock_value | 108 | 20 | 18 | 0 |
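
The same information can also be summarised as a missing-value rate per feature. The sketch below uses a tiny hypothetical dictionary (`toy_data`, not the Enron data) in the same shape as `data_dict`, where missing entries are the string 'NaN'. The helper name `missing_rate_per_feature` is introduced here for illustration only.

```python
def missing_rate_per_feature(dictionary, skip=('poi',)):
    """Return {feature: fraction of records whose value is the string 'NaN'}."""
    counts = {}
    for record in dictionary.values():
        for feature, value in record.items():
            if feature in skip:
                continue
            missing, total = counts.get(feature, (0, 0))
            counts[feature] = (missing + (value == 'NaN'), total + 1)
    return {f: m / float(t) for f, (m, t) in counts.items()}

# Hypothetical toy records, only to show the shape of the input.
toy_data = {
    'PERSON A': {'poi': False, 'bonus': 100, 'salary': 'NaN'},
    'PERSON B': {'poi': True, 'bonus': 'NaN', 'salary': 'NaN'},
    'PERSON C': {'poi': False, 'bonus': 300, 'salary': 50},
    'PERSON D': {'poi': False, 'bonus': 400, 'salary': 'NaN'},
}
rates = missing_rate_per_feature(toy_data)
for feature in sorted(rates):
    print("%s: %.2f" % (feature, rates[feature]))
```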


#### Investigation of emails / receipts features

In [29]:
count_email_data_missing = 0
features_to_check = ['shared_receipt_with_poi','to_messages','from_messages','from_poi_to_this_person','from_this_person_to_poi']
for key, value in data_dict.iteritems():
    count_per_key = 0
    for features, values in value.iteritems():
        if features in features_to_check:
            if values == 'NaN':
                count_per_key += 1
    if count_per_key == 5:
        count_email_data_missing += 1
print count_email_data_missing

60


#### Creation of a new dictionary and cleaning of missing data based on features exploration

In [30]:
list_of_features = ['bonus','long_term_incentive','deferral_payments','deferred_income','director_fees','expenses','exercised_stock_options','restricted_stock','restricted_stock_deferred']
### Replace 'NaN' with 0
data_dict_nan_cleaned = copy.deepcopy(data_dict)
for key, value in data_dict_nan_cleaned.iteritems():
    for feature, values in value.iteritems():
        if feature in list_of_features:
            if values == 'NaN':
                data_dict_nan_cleaned[key][feature] = 0

### Data visualization

In [31]:
### Creation of a function to plot 2 features
def plot_2_features(dictionary, feature_1, feature_2, title_plot):
    '''
    This function aims at plotting two features together to assess the strength of their relationship.
    '''
    data_poi = []
    data_feature_1 = []
    data_feature_2 = []
    for key, value in dictionary.iteritems():
        for feature, values in value.iteritems():
            if feature == 'poi':
                if values == True:
                    data_poi.append(1)
                else:
                    data_poi.append(0)
            if feature == feature_1:
                if values == 'NaN':
                    data_feature_1.append(0)
                else:
                    data_feature_1.append(values)
            if feature == feature_2:
                if values == 'NaN':
                    data_feature_2.append(0)
                else:
                    data_feature_2.append(values)
    iplot({
        'data': [
            Scatter(x=data_feature_1,
                    y=data_feature_2,
            mode = 'markers',
            marker = Marker(
                        color = data_poi,
                        colorscale='Bluered',
                        showscale = True
                    ))
        ],
        'layout': Layout(xaxis=XAxis(title=feature_1), yaxis=YAxis(title=feature_2), title= title_plot)}, show_link=False)

In [32]:
plot_2_features(data_dict, 'from_poi_to_this_person', 'from_this_person_to_poi', 'Relationship between # of emails received from poi and # of emails sent to poi')

In [33]:
plot_2_features(data_dict, 'expenses', 'shared_receipt_with_poi', 'Relationship between expenses and # of receipt shared with poi')

In [34]:
plot_2_features(data_dict, 'total_stock_value', 'bonus', 'Relationship between Total stock value and bonus')

By plotting these features:
- we identified some outliers that should be removed.
- we identified an opportunity to create new features describing an individual's interaction with POIs.

### Removing outliers & new features creation

During our data exploration, we identified several outliers that we need to investigate to decide whether they should be removed from the dataset.

In [35]:
for key, value in data_dict.iteritems():
    if value['expenses'] != 'NaN' and value['expenses'] > 5000000:
        print key
        print value

TOTAL
{'salary': 26704229, 'to_messages': 'NaN', 'deferral_payments': 32083396, 'total_payments': 309886585, 'exercised_stock_options': 311764000, 'bonus': 97343619, 'restricted_stock': 130322299, 'shared_receipt_with_poi': 'NaN', 'restricted_stock_deferred': -7576788, 'total_stock_value': 434509511, 'expenses': 5235198, 'loan_advances': 83925000, 'from_messages': 'NaN', 'other': 42667589, 'from_this_person_to_poi': 'NaN', 'poi': False, 'director_fees': 1398517, 'deferred_income': -27992891, 'long_term_incentive': 48521928, 'email_address': 'NaN', 'from_poi_to_this_person': 'NaN'}


The TOTAL row has been included in the dataset as if it were a person. We will remove this entry from our analysis and rerun the graphs above.
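
One way to catch such an aggregate row programmatically is to look for a record whose value equals the sum of all the other records for the same feature. The following is a minimal sketch on hypothetical toy data (`toy_salaries` and `find_aggregate_keys` are names made up for this example), not the check that was used in this notebook.

```python
def find_aggregate_keys(values_by_key):
    """Return keys whose value equals the sum of every other key's value.

    Missing values (the string 'NaN') are ignored.
    """
    numeric = {k: v for k, v in values_by_key.items() if v != 'NaN'}
    grand_total = sum(numeric.values())
    # If one row is the total of the others, grand_total is twice that row.
    return sorted(k for k, v in numeric.items()
                  if v != 0 and 2 * v == grand_total)

# Hypothetical values, only to illustrate the idea.
toy_salaries = {
    'PERSON A': 100,
    'PERSON B': 250,
    'PERSON C': 'NaN',
    'TOTAL': 350,
}
print(find_aggregate_keys(toy_salaries))  # prints ['TOTAL']
```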

In [36]:
del data_dict_nan_cleaned['TOTAL']

As you'll notice below, after removing the 'TOTAL' entry, the relationship between the selected features and the labels is easier to see.

In [37]:
plot_2_features(data_dict_nan_cleaned, 'total_stock_value', 'bonus', 'Relationship between Total stock value and bonus')

In [38]:
plot_2_features(data_dict_nan_cleaned, 'expenses', 'shared_receipt_with_poi', 'Relationship between expenses and # of receipt shared with poi')

#### Using Total Payments and Total Stock Values to identify discrepancies 

In [39]:
list_financial_features = ['salary','bonus','long_term_incentive','deferral_payments','deferred_income','loan_advances','other','director_fees','expenses']
list_stock_features = ['exercised_stock_options','restricted_stock','restricted_stock_deferred']
list_individuals_financial_discrepancies = {}
list_individuals_stock_discrepancies = {}
for key, value in data_dict.iteritems():
    financial_sum = 0
    stock_sum = 0
    for features, values in value.iteritems():
        if features in list_financial_features:
            if values != 'NaN':
                financial_sum += values
        if features in list_stock_features:
            if values != 'NaN':
                stock_sum += values
    if financial_sum != value['total_payments']:
        if value['total_payments'] == 'NaN':
            if financial_sum != 0:
                list_individuals_financial_discrepancies[key] = value
        else:
            list_individuals_financial_discrepancies[key] = value
    if stock_sum != value['total_stock_value']:
        if value['total_stock_value'] == 'NaN':
            if stock_sum != 0:
                list_individuals_stock_discrepancies[key] = value
        else:
            list_individuals_stock_discrepancies[key]= value
            
print len(list_individuals_financial_discrepancies)
print len(list_individuals_stock_discrepancies)

2
2


In [40]:
list_individuals_financial_discrepancies

{'BELFER ROBERT': {'bonus': 'NaN',
  'deferral_payments': -102500,
  'deferred_income': 'NaN',
  'director_fees': 3285,
  'email_address': 'NaN',
  'exercised_stock_options': 3285,
  'expenses': 'NaN',
  'from_messages': 'NaN',
  'from_poi_to_this_person': 'NaN',
  'from_this_person_to_poi': 'NaN',
  'loan_advances': 'NaN',
  'long_term_incentive': 'NaN',
  'other': 'NaN',
  'poi': False,
  'restricted_stock': 'NaN',
  'restricted_stock_deferred': 44093,
  'salary': 'NaN',
  'shared_receipt_with_poi': 'NaN',
  'to_messages': 'NaN',
  'total_payments': 102500,
  'total_stock_value': -44093},
 'BHATNAGAR SANJAY': {'bonus': 'NaN',
  'deferral_payments': 'NaN',
  'deferred_income': 'NaN',
  'director_fees': 137864,
  'email_address': 'sanjay.bhatnagar@enron.com',
  'exercised_stock_options': 2604490,
  'expenses': 'NaN',
  'from_messages': 29,
  'from_poi_to_this_person': 0,
  'from_this_person_to_poi': 1,
  'loan_advances': 'NaN',
  'long_term_incentive': 'NaN',
  'other': 137864,
  'poi'

By doing this analysis, we identified two discrepancies in the BELFER ROBERT record which could affect our analysis:
- The deferral payments amount has been entered with the wrong sign: it should be a positive number equal to the total payments.
- The restricted stock deferred amount has been entered with the wrong sign: it should be a negative number equal to the total stock value.

In [41]:
data_dict_nan_cleaned['BELFER ROBERT']['deferral_payments'] = 102500

In [42]:
data_dict_nan_cleaned['BELFER ROBERT']['restricted_stock_deferred'] = -44093

#### Check of names in the dataset


In [43]:
for key, value in data_dict.iteritems():
    print key

METTS MARK
BAXTER JOHN C
ELLIOTT STEVEN
CORDES WILLIAM R
HANNON KEVIN P
MORDAUNT KRISTINA M
MEYER ROCKFORD G
MCMAHON JEFFREY
HORTON STANLEY C
PIPER GREGORY F
HUMPHREY GENE E
UMANOFF ADAM S
BLACHMAN JEREMY M
SUNDE MARTIN
GIBBS DANA R
LOWRY CHARLES P
COLWELL WESLEY
MULLER MARK S
JACKSON CHARLENE R
WESTFAHL RICHARD K
WALTERS GARETH W
WALLS JR ROBERT H
KITCHEN LOUISE
CHAN RONNIE
BELFER ROBERT
SHANKMAN JEFFREY A
WODRASKA JOHN
BERGSIEKER RICHARD P
URQUHART JOHN A
BIBI PHILIPPE A
RIEKER PAULA H
WHALEY DAVID A
BECK SALLY W
HAUG DAVID L
ECHOLS JOHN B
MENDELSOHN JOHN
HICKERSON GARY J
CLINE KENNETH W
LEWIS RICHARD
HAYES ROBERT E
MCCARTY DANNY J
KOPPER MICHAEL J
LEFF DANIEL P
LAVORATO JOHN J
BERBERIAN DAVID
DETMERING TIMOTHY J
WAKEHAM JOHN
POWERS WILLIAM
GOLD JOSEPH
BANNANTINE JAMES M
DUNCAN JOHN H
SHAPIRO RICHARD S
SHERRIFF JOHN R
SHELBY REX
LEMAISTRE CHARLES
DEFFNER JOSEPH M
KISHKILL JOSEPH G
WHALLEY LAWRENCE G
MCCONNELL MICHAEL S
PIRO JIM
DELAINEY DAVID W
SULLIVAN-SHAKLOVITZ COLLEEN
WROBEL BRUC

The key 'THE TRAVEL AGENCY IN THE PARK' does not refer to an individual, so we remove it from the dataset as well.

In [44]:
del data_dict_nan_cleaned['THE TRAVEL AGENCY IN THE PARK']

#### New features creation


Using 4 existing features (from_this_person_to_poi and from_messages, from_poi_to_this_person and to_messages), we will create 2 new features:
- ratio_of_emails_sent_to_poi
- ratio_of_emails_received_from_poi

In [45]:
for key, value in data_dict_nan_cleaned.iteritems():
    if value['from_this_person_to_poi'] == 'NaN':
        value['ratio_of_emails_sent_to_poi'] = 0
    else:
        value['ratio_of_emails_sent_to_poi'] = value['from_this_person_to_poi'] / float(value['from_messages'])
    if value['from_poi_to_this_person'] == 'NaN':
        value['ratio_of_emails_received_from_poi'] = 0
    else:
        value['ratio_of_emails_received_from_poi'] = value['from_poi_to_this_person'] / float(value['to_messages'])    
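
The ratio computation can also be factored into a small helper that treats a missing or zero denominator as 0, which avoids a division by zero if a person has a message count of 0. This is only a sketch: `safe_ratio` is a hypothetical helper, the `person` record is made up, and returning 0 for missing data mirrors the convention used in the cell above.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0 if a value is 'NaN' or the denominator is 0."""
    if numerator == 'NaN' or denominator == 'NaN' or denominator == 0:
        return 0
    return numerator / float(denominator)

# Hypothetical record in the same shape as the dataset entries.
person = {'from_this_person_to_poi': 5, 'from_messages': 20,
          'from_poi_to_this_person': 'NaN', 'to_messages': 'NaN'}

sent_ratio = safe_ratio(person['from_this_person_to_poi'], person['from_messages'])
received_ratio = safe_ratio(person['from_poi_to_this_person'], person['to_messages'])
print("sent: %s, received: %s" % (sent_ratio, received_ratio))
```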

In [46]:
plot_2_features(data_dict_nan_cleaned, 'ratio_of_emails_sent_to_poi', 'ratio_of_emails_received_from_poi', 'Relationship between ratio of emails received from poi and ratio of emails sent to poi')

We will create two different dictionaries to test two different sets of features. As we noticed before, email data are not complete for every individual. We will therefore test:
- financial features on all remaining observations
- financial + email features on only the observations that have email data

In [47]:
data_financial = copy.deepcopy(data_dict_nan_cleaned)

In [48]:
# keep only the observations that have email data in the data_email_financial dictionary
data_email_financial = {}
for key, value in data_financial.iteritems():
    if value['from_messages'] != 'NaN':
        data_email_financial[key] = value   

In [49]:
print len(data_financial)
print len(data_email_financial)

144
86


# Test with financial features and full dataset

In [50]:
### Selection of features we'll use based on our feature exploration
### ('bonus' is listed twice, so featureFormat returns it as two identical columns)
features_list_financial = ['poi','deferral_payments','exercised_stock_options','bonus','restricted_stock','restricted_stock_deferred','director_fees','long_term_incentive','deferred_income','expenses','bonus']

In [51]:
### Extract features and labels from dataset for local testing
data_fin = featureFormat(data_financial, features_list_financial, sort_keys = True)
labels_fin, features_fin = targetFeatureSplit(data_fin)

In [52]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import MinMaxScaler

In [None]:
#Defining the Stratified Shuffle Split
sss = StratifiedShuffleSplit(
    n_splits = 1000,
    test_size=0.1,
    train_size=None,
    random_state=42)
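
To get a feel for what each of the 1000 splits looks like, the arithmetic below estimates the size of one test fold. It assumes that all 144 records and all 18 POIs survive `featureFormat` (which may drop some rows, so the real numbers can be smaller), and the exact rounding used by scikit-learn may differ by one sample.

```python
n_samples = 144   # records left after removing the two non-person entries
n_poi = 18        # POIs among them
test_size = 0.1   # same value as in the StratifiedShuffleSplit above

# Expected size and POI count of a single test fold.
expected_test_samples = n_samples * test_size
expected_test_poi = n_poi * test_size

print("records per test fold: about %.1f" % expected_test_samples)
print("POIs per test fold: about %.1f" % expected_test_poi)
```

With only one or two POIs in each test fold, a single misclassification changes precision and recall considerably, which is why the scores are averaged over many splits.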

In [54]:
def grid_search(features, labels):
    # Pipeline 1: scale, keep the k best features, then k-nearest neighbours
    pipeline1 = Pipeline([
        ('scale', MinMaxScaler()),
        ('kbest', SelectKBest()),
        ('kneighbors', KNeighborsClassifier()),
    ])

    # Pipeline 2: decision tree on the unscaled features
    pipeline2 = Pipeline([
        ('tree', DecisionTreeClassifier()),
    ])

    # Pipeline 3: keep the k best features, then a support vector classifier
    pipeline3 = Pipeline([
        ('kbest', SelectKBest()),
        ('svc', SVC()),
    ])

    parameters1 = {
        'kneighbors__n_neighbors': [3, 7, 10],
        'kneighbors__weights': ['uniform', 'distance'],
        'kbest__k': [3, 5, 10]
    }

    parameters2 = {
        'tree__criterion': ('gini', 'entropy'),
        'tree__splitter': ('best', 'random'),
        'tree__min_samples_split': [2, 10, 20],
        'tree__max_depth': [10, 15, 20, 25, 30],
        'tree__max_leaf_nodes': [5, 10, 30]
    }

    parameters3 = {
        'svc__C': [0.01, 0.1, 1.0],
        'svc__kernel': ['rbf', 'poly'],
        'svc__gamma': [0.01, 0.1, 1.0],
        'kbest__k': [3, 5, 10]
    }

    pars = [parameters1, parameters2, parameters3]
    pips = [pipeline1, pipeline2, pipeline3]

    print "starting Gridsearch"
    for i in range(len(pars)):
        gs = GridSearchCV(pips[i], pars[i], scoring = 'f1', cv= sss, n_jobs=-1)
        gs = gs.fit(features, labels)
        print "finished pipeline"+ str(i+1) + " Gridsearch"
        print("The best parameters are %s with a score of %0.2f" % (gs.best_params_, gs.best_score_))
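
Because every parameter combination is fitted on each of the 1000 splits, the search is expensive. The sketch below counts the candidates per grid. The grids are copied from `grid_search` above, while `count_combinations` is a helper introduced only for this illustration.

```python
def count_combinations(param_grid):
    """Number of candidate settings in a grid (product of the option counts)."""
    total = 1
    for options in param_grid.values():
        total *= len(options)
    return total

n_splits = 1000  # same as in the StratifiedShuffleSplit defined earlier

grids = {
    'kneighbors': {'kneighbors__n_neighbors': [3, 7, 10],
                   'kneighbors__weights': ['uniform', 'distance'],
                   'kbest__k': [3, 5, 10]},
    'tree': {'tree__criterion': ('gini', 'entropy'),
             'tree__splitter': ('best', 'random'),
             'tree__min_samples_split': [2, 10, 20],
             'tree__max_depth': [10, 15, 20, 25, 30],
             'tree__max_leaf_nodes': [5, 10, 30]},
    'svc': {'svc__C': [0.01, 0.1, 1.0],
            'svc__kernel': ['rbf', 'poly'],
            'svc__gamma': [0.01, 0.1, 1.0],
            'kbest__k': [3, 5, 10]},
}

# Cross-validation fits = candidates x splits.
fits = {name: count_combinations(grid) * n_splits for name, grid in grids.items()}
for name in sorted(fits):
    print("%s: %d candidates, %d cross-validation fits"
          % (name, count_combinations(grids[name]), fits[name]))
```

If the run time becomes a problem, reducing `n_splits` or trimming the grids are the two obvious levers.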

In [None]:
grid_search(features_fin, labels_fin)

starting Gridsearch



F-score is ill-defined and being set to 0.0 due to no predicted samples.


finished pipeline1 Gridsearch
The best parameters are {'kbest__k': 3, 'kneighbors__n_neighbors': 3, 'kneighbors__weights': 'uniform'} with a score of 0.31



finished pipeline2 Gridsearch
The best parameters are {'tree__criterion': 'gini', 'tree__max_depth': 15, 'tree__min_samples_split': 2, 'tree__splitter': 'random', 'tree__max_leaf_nodes': 30} with a score of 0.33



# Test with email & financial data on reduced dataset


In [38]:
### Selection of features we'll use based on our feature exploration
### ('bonus' is listed twice, so featureFormat returns it as two identical columns)
features_list_email_fin = ['poi','deferral_payments','exercised_stock_options','bonus','restricted_stock','restricted_stock_deferred','director_fees','long_term_incentive','deferred_income','expenses','bonus','ratio_of_emails_sent_to_poi','ratio_of_emails_received_from_poi']

In [39]:
### Extract features and labels from dataset for local testing
data_email_fin = featureFormat(data_email_financial, features_list_email_fin, sort_keys = True)
labels_email_fin, features_email_fin = targetFeatureSplit(data_email_fin)

In [40]:
grid_search(features_email_fin, labels_email_fin)

starting Gridsearch



F-score is ill-defined and being set to 0.0 due to no predicted samples.


Features [5] are constant.



finished pipeline1 Gridsearch
The best parameters are {'kbest__k': 3, 'kneighbors__n_neighbors': 3, 'kneighbors__weights': 'uniform'} with a score of 0.26




finished pipeline2 Gridsearch
The best parameters are {'tree__criterion': 'entropy', 'tree__max_depth': 20, 'tree__min_samples_split': 2, 'tree__splitter': 'random', 'tree__max_leaf_nodes': 30} with a score of 0.32




finished pipeline3 Gridsearch
The best parameters are {'svc__gamma': 1.0, 'kbest__k': 10, 'svc__kernel': 'poly', 'svc__C': 0.01} with a score of 0.07
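
The repeated "F-score is ill-defined" warning appears whenever a model predicts no POI at all on a test fold: precision is then 0/0 and the score is reported as 0.0. The function below is a plain-Python sketch of the F1 computation that makes this case explicit. It is not scikit-learn's implementation, and `f1_from_predictions` and the toy label lists are names chosen for this example.

```python
def f1_from_predictions(y_true, y_pred):
    """Compute F1 for the positive class (1); return 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fp == 0 or tp + fn == 0:
        # No predicted positives (or no actual positives):
        # precision or recall would be 0/0, so fall back to 0.0.
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical test fold with two POIs out of six people.
y_true = [0, 0, 0, 0, 1, 1]
all_negative = [0, 0, 0, 0, 0, 0]   # the model flags nobody
one_hit = [0, 1, 0, 0, 1, 0]        # one POI found, one false alarm

f1_all_negative = f1_from_predictions(y_true, all_negative)
f1_one_hit = f1_from_predictions(y_true, one_hit)
print("all negative: %.2f" % f1_all_negative)
print("one hit: %.2f" % f1_one_hit)
```

Folds scored as 0.0 for this reason pull the averaged F1 down, which may partly explain the low scores reported above.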
