# P5 Enron Fraud Detectors using Enron Emails and Financial Data
by Alexey Chesnok

*Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]*

The goal of this project is to analyze the [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/) with predictive machine learning techniques and identify individuals who might have been involved in the Enron fraud. The dataset already contains 18 records labeled as "POI" (Person of Interest): individuals who were indicted, settled without admitting guilt, or testified in exchange for immunity. My task is to pinpoint potential additional persons of interest, using their financial information from the dataset and their email interactions with the existing POIs.
The dataset consists of 146 records with 21 features (14 financial features, 6 email features and 1 predefined POI label).
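
With 18 POIs among 146 records the two classes are heavily imbalanced, which matters when reading accuracy figures. As a rough illustration using only the counts quoted above, this sketch computes the accuracy that a trivial "nobody is a POI" classifier would reach:

```python
# Counts taken from the dataset description above.
n_records = 146
n_poi = 18

# Share of records labeled as POI.
poi_share = float(n_poi) / n_records

# A classifier that always answers "not a POI" is right for every non-POI.
baseline_accuracy = float(n_records - n_poi) / n_records

print("POI share: {:.3f}".format(poi_share))
print("always-non-POI accuracy: {:.3f}".format(baseline_accuracy))
```

The baseline comes out at roughly 0.88, so an accuracy in that neighborhood does not by itself show that a model has learned anything about POIs.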


In [1]:
import sys
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import scipy
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn import svm, grid_search
from sklearn import cross_validation

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
import tester

plt.style.use('ggplot')



In [2]:
#Select features
financial_features = ['salary', 
                      'deferral_payments', 
                      'total_payments', 
                      'loan_advances', 
                      'bonus', 
                      'restricted_stock_deferred', 
                      'deferred_income', 
                      'total_stock_value', 
                      'expenses', 
                      'exercised_stock_options', 
                      'other', 
                      'long_term_incentive', 
                      'restricted_stock', 
                      'director_fees']

email_features = ['to_messages', 
                 'email_address', 
                 'from_poi_to_this_person', 
                 'from_messages', 
                 'from_this_person_to_poi', 
                 'shared_receipt_with_poi'] 

features_list = email_features + financial_features
features_list.insert(0, 'poi')
                

In [3]:
print features_list

['poi', 'to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']


In [4]:
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

In [None]:
data_dict

A significant number of features have missing values. In addition, the predefined POIs have no entries for director_fees and restricted_stock_deferred, which makes both features useless for separating POIs from non-POIs, and "email_address" is a text identifier rather than a numeric feature. All three are dropped later.
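
Missing values can also be counted directly on the dictionary form of the data, before any conversion to a DataFrame. This is a minimal sketch: `count_missing` is a hypothetical helper, and the toy records are made up for illustration, not taken from the real dataset. It assumes missing values are stored as the string 'NaN', as in the pickle file.

```python
from collections import Counter

def count_missing(records, missing="NaN"):
    """Return a Counter {feature: number of records where it equals `missing`}."""
    counts = Counter()
    for person_features in records.values():
        for feature, value in person_features.items():
            if value == missing:
                counts[feature] += 1
    return counts

# Hypothetical toy records, not real Enron data.
toy = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "director_fees": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": 50, "director_fees": "NaN"},
}

missing = count_missing(toy)
# Features with the most missing values come first.
for feature, n in missing.most_common():
    print("{}: {}".format(feature, n))
```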

In [5]:
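#Convert the dictionary to a DataFrame (one column per person)
data_dict = pd.DataFrame.from_dict(data_dict)
#Transpose so that each row is a person and each column a feature
data_dict = data_dict.T
#Reorder the columns to match features_list
data_dict = data_dict[features_list]
list(data_dict.index.values)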


['ALLEN PHILLIP K',
 'BADUM JAMES P',
 'BANNANTINE JAMES M',
 'BAXTER JOHN C',
 'BAY FRANKLIN R',
 'BAZELIDES PHILIP J',
 'BECK SALLY W',
 'BELDEN TIMOTHY N',
 'BELFER ROBERT',
 'BERBERIAN DAVID',
 'BERGSIEKER RICHARD P',
 'BHATNAGAR SANJAY',
 'BIBI PHILIPPE A',
 'BLACHMAN JEREMY M',
 'BLAKE JR. NORMAN P',
 'BOWEN JR RAYMOND M',
 'BROWN MICHAEL',
 'BUCHANAN HAROLD G',
 'BUTTS ROBERT H',
 'BUY RICHARD B',
 'CALGER CHRISTOPHER F',
 'CARTER REBECCA C',
 'CAUSEY RICHARD A',
 'CHAN RONNIE',
 'CHRISTODOULOU DIOMEDES',
 'CLINE KENNETH W',
 'COLWELL WESLEY',
 'CORDES WILLIAM R',
 'COX DAVID',
 'CUMBERLAND MICHAEL S',
 'DEFFNER JOSEPH M',
 'DELAINEY DAVID W',
 'DERRICK JR. JAMES V',
 'DETMERING TIMOTHY J',
 'DIETRICH JANET R',
 'DIMICHELE RICHARD G',
 'DODSON KEITH',
 'DONAHUE JR JEFFREY M',
 'DUNCAN JOHN H',
 'DURAN WILLIAM D',
 'ECHOLS JOHN B',
 'ELLIOTT STEVEN',
 'FALLON JAMES B',
 'FASTOW ANDREW S',
 'FITZGERALD JAY L',
 'FOWLER PEGGY',
 'FOY JOE',
 'FREVERT MARK A',
 'FUGH JOHN L',
 'GAHN 

In [6]:
data_dict = data_dict.replace('NaN', np.nan)

In [7]:
print data_dict.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
poi                          146 non-null bool
to_messages                  86 non-null float64
email_address                111 non-null object
from_poi_to_this_person      86 non-null float64
from_messages                86 non-null float64
from_this_person_to_poi      86 non-null float64
shared_receipt_with_poi      86 non-null float64
salary                       95 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
loan_advances                4 non-null float64
bonus                        82 non-null float64
restricted_stock_deferred    18 non-null float64
deferred_income              49 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
exercised_stock_options      102 non-null float64
other                        93 non-null float6

In [8]:
data_dict_poi = data_dict.loc[data_dict['poi'] == True]
print data_dict_poi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18 entries, BELDEN TIMOTHY N to YEAGER F SCOTT
Data columns (total 21 columns):
poi                          18 non-null bool
to_messages                  14 non-null float64
email_address                18 non-null object
from_poi_to_this_person      14 non-null float64
from_messages                14 non-null float64
from_this_person_to_poi      14 non-null float64
shared_receipt_with_poi      14 non-null float64
salary                       17 non-null float64
deferral_payments            5 non-null float64
total_payments               18 non-null float64
loan_advances                1 non-null float64
bonus                        16 non-null float64
restricted_stock_deferred    0 non-null float64
deferred_income              11 non-null float64
total_stock_value            18 non-null float64
expenses                     18 non-null float64
exercised_stock_options      12 non-null float64
other                        18 non-null float64


In [9]:
print data_dict.isnull().sum(axis=1).sort_values(ascending=False).head()


LOCKHART EUGENE E                20
GRAMM WENDY L                    18
WROBEL BRUCE                     18
WHALEY DAVID A                   18
THE TRAVEL AGENCY IN THE PARK    18
dtype: int64


I also removed the record "THE TRAVEL AGENCY IN THE PARK", which is not a person, and "LOCKHART EUGENE E", which has missing values for every feature and is therefore useless for the investigation.
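
Records with no usable information can be found programmatically rather than by eye. The sketch below assumes the dictionary form of the data with 'NaN' strings for missing values. `empty_records` is a hypothetical helper and the toy records are invented for illustration.

```python
def empty_records(records, ignore=("poi",), missing="NaN"):
    """Names of records whose features (outside `ignore`) are all missing."""
    names = []
    for name, feats in records.items():
        # The label is always present, so it is left out of the check.
        values = [v for k, v in feats.items() if k not in ignore]
        if values and all(v == missing for v in values):
            names.append(name)
    return sorted(names)

# Hypothetical toy records, not real Enron data.
toy = {
    "PERSON A": {"poi": False, "salary": 100, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
}

print(empty_records(toy))
```

Non-person rows such as an aggregate or a company name cannot be detected this way and still need a manual look at the record names.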

In [10]:
def PlotHelper(feature_1, feature_2):
    sns.FacetGrid(data_dict, hue="poi").map(plt.scatter, feature_1, feature_2).add_legend()
    plt.show()

In [11]:
PlotHelper('total_payments', 'total_stock_value')


In [12]:
for index, row in data_dict.iterrows():
    #Comparisons with NaN are always False, so missing salaries are skipped automatically
    if row["salary"] > 1000000:
        print index

FREVERT MARK A
LAY KENNETH L
SKILLING JEFFREY K
TOTAL


In [13]:
data_dict['total_payments'].idxmax()


'TOTAL'

The outliers in the financial features were FREVERT MARK A, LAY KENNETH L, SKILLING JEFFREY K, and TOTAL. "TOTAL" is the spreadsheet summary row rather than a person, so it is removed. The other three are real individuals (Lay and Skilling are known POIs) and are kept for further investigation. I also removed the feature "email_address", which is a text identifier, and the features director_fees and restricted_stock_deferred, which have no entries for any POI.

In [14]:
#Clean up
data_dict = data_dict.drop(['TOTAL', 'THE TRAVEL AGENCY IN THE PARK','LOCKHART EUGENE E'])
#Remove features that carry no signal (email_address, director_fees, restricted_stock_deferred)
del data_dict['email_address']
del data_dict['director_fees']
del data_dict['restricted_stock_deferred']

*2) What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]*


I added 2 engineered email features to the list: "proportion_to_poi" (the share of all emails sent by the person that went to POIs) and "proportion_from_poi" (the share of all emails received by the person that came from POIs). Both track the level of interaction with known persons of interest independently of how many emails a person sends or receives overall.
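
Ratios like these need a decision about missing counts and zero denominators. The sketch below shows one way to compute such a ratio for a single person. `proportion` is a hypothetical helper, and mapping missing or undefined ratios to 0.0 is an assumption chosen to match the zero-filling applied to the dataset in this notebook.

```python
import math

def proportion(part, total):
    """Share of `part` in `total`; 0.0 when a value is missing or total is 0."""
    for value in (part, total):
        # Treat None and float NaN as missing (assumption of this sketch).
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return 0.0
    if total == 0:
        return 0.0
    return float(part) / total

# Hypothetical counts: 5 of 20 sent emails went to POIs.
print(proportion(5, 20))
# Missing email data and a zero denominator both fall back to 0.0.
print(proportion(float("nan"), 10))
print(proportion(3, 0))
```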

In [15]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.

In [16]:
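#Share of this person's sent emails that went to POIs
data_dict['proportion_to_poi'] = data_dict['from_this_person_to_poi']/data_dict['from_messages']
#Share of this person's received emails that came from POIs
data_dict['proportion_from_poi'] = data_dict['from_poi_to_this_person']/data_dict['to_messages']
#Division by zero yields inf; treat those cases as 0
data_dict = data_dict.replace([np.inf, -np.inf], 0)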

In [17]:
my_dataset = data_dict.T
my_dataset.fillna(value=0, inplace = True)

In [None]:
my_dataset

In [18]:
features_list = features_list + ['proportion_to_poi', 'proportion_from_poi']
features_list = [e for e in features_list if e not in ('email_address', 'director_fees', 'restricted_stock_deferred')]

In [19]:
print features_list

['poi', 'to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'proportion_to_poi', 'proportion_from_poi']


To identify the best-performing features I tried three approaches: simple column correlation with the POI label, Lasso regression, and the univariate feature selection tool SelectKBest.
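
For reference, the correlation of a numeric feature with a 0/1 label is an ordinary Pearson correlation. The sketch below computes it with the standard library only. `pearson` is a hypothetical helper and the sample values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences; NaN if one is constant."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a 0/1 label and one numeric feature.
poi = [0, 0, 0, 1, 1]
bonus = [1.0, 2.0, 3.0, 8.0, 9.0]
print(pearson(poi, bonus))
```

A coefficient computed from only a handful of non-missing values is likely unreliable. That appears to be the case for loan_advances, which has just 4 non-null entries in the `info()` output, so its near-perfect correlation should be read with caution.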

In [20]:
#correlations = zip(features_list[1:], data_dict.corrwith(data_dict['poi']))
#print(correlations)
#sorted(correlations, key = lambda x: x[1], reverse=True)
print data_dict.corrwith(data_dict['poi'])

poi                        1.000000
to_messages                0.058954
from_poi_to_this_person    0.167722
from_messages             -0.074308
from_this_person_to_poi    0.112940
shared_receipt_with_poi    0.228313
salary                     0.264976
deferral_payments         -0.098428
total_payments             0.230102
loan_advances              0.999851
bonus                      0.302384
deferred_income           -0.265698
total_stock_value          0.366462
expenses                   0.060292
exercised_stock_options    0.503551
other                      0.120270
long_term_incentive        0.254723
restricted_stock           0.224814
proportion_to_poi          0.339938
proportion_from_poi        0.104406
dtype: float64


In [23]:
#Select features with correlation above 0.3
best_correlations_list = ['poi',
                          'loan_advances',
                          'exercised_stock_options',
                          'total_stock_value',
                          'bonus',
                          'proportion_to_poi']
print "Features with correlations above 0.3"
print best_correlations_list

Features with correlations above 0.3
['poi', 'loan_advances', 'exercised_stock_options', 'total_stock_value', 'bonus', 'proportion_to_poi']


In [27]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [28]:
from sklearn.linear_model import Lasso
#features = selector.fit_transform(features, labels)
regression = Lasso()
regression.fit(features, labels)
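#Pair each feature name with a flag telling whether Lasso kept a non-zero coefficient
coefficients = zip(features_list[1:], regression.coef_ != 0)
#Sort so that features with zero coefficients (False) come first
best_coefs = sorted(coefficients, key = lambda x: x[1])
#print best_coefs
best_lasso_list = list(map(lambda x: x[0], best_coefs))
#Drop three features by hand (presumably the ones whose coefficients Lasso shrank to zero)
best_lasso_list = ["poi"] + [e for e in best_lasso_list if e not in ('proportion_to_poi', 'from_poi_to_this_person', 'proportion_from_poi')]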
print("Best Lasso Regression features:")
print best_lasso_list

Best Lasso Regression features:
['poi', 'to_messages', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock']


After several rounds of testing, the top 5 SelectKBest features were the ones selected for the final model:

In [29]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

k=5
selector = SelectKBest(f_classif, k=k)
#Only the scores are needed here, so fit without transforming
selector.fit(features, labels)
print("SelectKBest feature scores:")
scores = zip(features_list[1:],selector.scores_)
#scores = zip(features_list[1:],-np.log10(selector.pvalues_))
sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
#print sorted_scores
best_kbest_list = ["poi"] + list(map(lambda x: x[0], sorted_scores))[0:k]
print("Best SelectKBest features:")
print best_kbest_list

SelectKBest feature scores:
Best SelectKBest features:
['poi', 'exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'proportion_to_poi']


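After selecting the 5 best-performing features, I scaled them with MinMaxScaler so that they share a common range. This matters for the Support Vector Machine model, which works with distances between points and would otherwise be dominated by the features with the largest values.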
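
Min-max scaling maps each feature to the range [0, 1] using (x - min) / (max - min). The sketch below applies the formula to one column. `min_max_scale` is a hypothetical helper, the sample values are invented, and sending a constant column to 0.0 is an assumption of this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]

# Hypothetical values on very different scales.
salaries = [200000, 400000, 1000000]
ratios = [0.0, 0.1, 0.5]

print(min_max_scale(salaries))
print(min_max_scale(ratios))
```

After scaling, a dollar amount and an email ratio contribute on comparable terms to the distance between two people.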

In [30]:
data = featureFormat(my_dataset, best_kbest_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [None]:
#data = featureFormat(my_dataset, best_correlations_list, sort_keys = True)
#labels, features = targetFeatureSplit(data)

In [31]:
scaler = MinMaxScaler()
features = scaler.fit_transform(features)

In [None]:
#data = featureFormat(my_dataset, best_lasso_list, sort_keys = True)
#labels, features = targetFeatureSplit(data)

In [None]:
#data = featureFormat(my_dataset, features_list, sort_keys = True)
#labels, features = targetFeatureSplit(data)

In [None]:
#print labels, features

*3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]*

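I tried the Naive Bayes and Support Vector Machine (SVM) algorithms. On a single train/test split, SVM initially had better accuracy and precision, especially after parameter tuning with GridSearchCV, although its recall was lower than that of Naive Bayes. Under k-fold cross-validation, however, Naive Bayes performed better, and it is the classifier I kept.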

In [46]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html
from sklearn.naive_bayes import GaussianNB
clf_n = GaussianNB()
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.4, random_state=42)
clf_n.fit(features_train, labels_train)
pred = clf_n.predict(features_test)

acc = accuracy_score(labels_test, pred)
rec = recall_score(labels_test, pred)
prec = precision_score(labels_test, pred)

print "Naive bayes model"
print "accuracy:", acc
print "precision:", prec
print "recall:", rec



Naive bayes model
accuracy: 0.849056603774
precision: 0.375
recall: 0.5


In [47]:
from sklearn import svm
clf = svm.SVC(kernel='rbf', C=10, gamma=1)
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.4, random_state=42)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

accuracy = accuracy_score(labels_test, pred)
recall = recall_score(labels_test, pred)
precision = precision_score(labels_test, pred)

print "SVM"
print "accuracy: ", accuracy
print "precision: ", precision
print "recall: ", recall

SVM
accuracy:  0.905660377358
precision:  0.666666666667
recall:  0.333333333333


*4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]*

Parameter tuning means searching for the parameter values that give the best algorithm performance. Without tuning, a better-performing configuration may be missed, and a poorly tuned model can underfit or overfit the data. I used the GridSearchCV function to identify the best-performing parameters for the Support Vector Machine, tuning the C, gamma and kernel parameters. The Gaussian Naive Bayes algorithm has essentially no parameters to tune.
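
A grid search evaluates every combination of the candidate values. The sketch below only enumerates the combinations, using the same values as the SVM grid in this notebook, to show how quickly the search space grows. `iter_grid` is a hypothetical helper; no model is trained here.

```python
from itertools import product

# Candidate values mirroring the SVM grid used in this notebook.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf"],
}

def iter_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))

candidates = list(iter_grid(param_grid))
print("combinations to evaluate: {}".format(len(candidates)))
```

With 5 x 4 x 2 = 40 combinations and one fit per cross-validation fold, the number of model fits grows quickly. Since gamma has no effect on a linear kernel, some of these combinations are redundant.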


In [34]:
from sklearn.grid_search import GridSearchCV
def param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    kernels = ['linear', 'rbf']
    param_grid = {'C': Cs, 'gamma' : gammas, 'kernel' : kernels }
    #Exhaustively evaluate every combination with nfolds-fold cross-validation
    search = GridSearchCV(svm.SVC(), param_grid, cv=nfolds)
    search.fit(X, y)
    return search.best_params_

In [35]:
print param_selection(features, labels, 18)

{'kernel': 'rbf', 'C': 10, 'gamma': 1}


*5. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]*

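Validation means testing the model on data that was not used for training, which gives an independent estimate of its performance. A classic mistake is to evaluate on the same data the model was trained on, which hides overfitting and overstates performance. I used the KFold validation technique with shuffling, to minimize the impact of record order on the training portion of the dataset. The updated average evaluation metrics for the Support Vector Machine model are: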
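
With so few POIs, a plain k-fold split can produce test folds that contain no POI at all, which leaves recall undefined for that fold. Stratified splitting avoids this by spreading each class evenly across the folds. The sketch below is a simplified, unshuffled illustration. `stratified_folds` is a hypothetical helper, and the counts of 18 POIs and 125 non-POIs are assumed from the record counts in this notebook. scikit-learn provides `StratifiedKFold` for real use.

```python
def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold, spreading each class evenly."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for label in sorted(by_class):
        # Deal the indices of this class round-robin over the folds.
        for position, index in enumerate(by_class[label]):
            folds[position % n_folds].append(index)
    return folds

# Assumed class counts: 18 POIs and 125 non-POIs.
labels = [1] * 18 + [0] * 125
folds = stratified_folds(labels, 10)

for number, fold in enumerate(folds):
    n_poi = sum(labels[i] for i in fold)
    print("fold {}: {} samples, {} POIs".format(number, len(fold), n_poi))
```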

In [36]:
from sklearn.cross_validation import KFold
kf = KFold(len(labels), n_folds=10, shuffle = True)
clf = svm.SVC(kernel='rbf', C=10, gamma=1)
accuracy = []
recall = []
precision = []

for train_indices, test_indices in kf:
    features_train = [features[ii] for ii in train_indices]
    features_test = [features[ii] for ii in test_indices]
    labels_train = [labels[ii] for ii in train_indices]
    labels_test = [labels[ii] for ii in test_indices]
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)

    accuracy.append(accuracy_score(labels_test, pred))
    recall.append(recall_score(labels_test, pred))
    precision.append(precision_score(labels_test, pred))

print "SVM"
print "accuracy: ", np.average(accuracy)
print "precision: ", np.average(precision)
print "recall: ", np.average(recall)


SVM
accuracy:  0.855494505495
precision:  0.1
recall:  0.025




In [54]:
from sklearn.cross_validation import KFold
kf = KFold(len(labels), n_folds=10, shuffle = True)
clf = GaussianNB()
accuracy = []
recall = []
precision = []

for train_indices, test_indices in kf:
    features_train = [features[ii] for ii in train_indices]
    features_test = [features[ii] for ii in test_indices]
    labels_train = [labels[ii] for ii in train_indices]
    labels_test = [labels[ii] for ii in test_indices]
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)

    accuracy.append(accuracy_score(labels_test, pred))
    recall.append(recall_score(labels_test, pred))
    precision.append(precision_score(labels_test, pred))

print "Naive Bayes"
print "accuracy: ", np.average(accuracy)
print "precision: ", np.average(precision)
print "recall: ", np.average(recall)

Naive Bayes
accuracy:  0.862637362637
precision:  0.383333333333
recall:  0.333333333333


In [None]:
#Export the final classifier together with the features it was trained on
dump_classifier_and_data(clf, my_dataset, best_kbest_list)

*6. Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]*

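For model evaluation I chose the following metrics:
Accuracy - the proportion of cases in which the predicted POI status matched the actual status, i.e. the overall ability of the algorithm to predict the outcome, whether POI or non-POI.
Recall - the proportion of actual POIs that the algorithm identifies, i.e. its ability to find all POIs.
Precision - the proportion of people flagged as POIs who really are POIs, i.e. the ability of the algorithm not to label non-POIs as POIs.
After cross-validating both models with KFold, Naive Bayes performed better than SVM on all 3 metrics: on average about 0.863 accuracy, 0.38 precision and 0.33 recall, against about 0.855, 0.10 and 0.03 for SVM. In plain terms, roughly 4 in 10 people flagged by Naive Bayes are real POIs, and it finds about a third of the POIs in the data.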
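
All three metrics follow from the counts of true and false positives and negatives. The sketch below computes them for a small made-up example. `classification_metrics` is a hypothetical helper, and returning 0.0 when a ratio is undefined is an assumption of this sketch.

```python
def classification_metrics(true_labels, predicted_labels):
    """Return (accuracy, precision, recall) for 0/1 labels, POI = 1."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # POIs correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-POIs wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # POIs that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-POIs correctly cleared
    accuracy = float(tp + tn) / len(pairs)
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical example: 3 real POIs among 10 people, 2 people flagged.
true_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

accuracy, precision, recall = classification_metrics(true_labels, predicted)
print("accuracy: {:.2f}".format(accuracy))
print("precision: {:.2f}".format(precision))
print("recall: {:.2f}".format(recall))
```

In this example the accuracy is 0.70 even though only one of the three POIs is found, which is why precision and recall are more informative than accuracy on an imbalanced dataset.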