# Identify Fraud from Enron Email

## Introduction

> In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. This project is to build a person of interest identifier based on financial and email data made public as a result of the Enron scandal.

> A Person of Interest (PoI) is an individual who meets at least one of the following criteria:
>* Individuals who were indicted
>* Individuals who reached a settlement or plea deal with the government
>* Individuals who testified in exchange for prosecution immunity.

> The full project and all data for this project can be seen in [this Github repository](https://github.com/TrikerDev/Identify-Fraud-from-Enron-Email). Data such as financial records, emails from employees, PoI names, etc, can be seen [here](https://github.com/TrikerDev/Identify-Fraud-from-Enron-Email/tree/master/Identify%20Fraud%20from%20Enron%20Email/Enron%20Data).

## Data Exploration

In [1]:
# Importing packages
import sys
import pickle
sys.path.append("Enron Data/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

In [2]:
# Loading data
with open("Enron Data/final_project_dataset.pkl", "rb") as data_file:  # pickle files should be opened in binary mode
    data_dict = pickle.load(data_file)

In [3]:
# The number of executives in the dataset
len(data_dict)

146

> This shows that there are 146 executives in the dataset.

In [4]:
# The names of the executives
print data_dict.keys()

['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY 

> These are the names of all 146 entries in the dataset. Scanning over the list, we can see that most are names of people, but several entries don't make sense in this context. These will be taken care of in the next section.

In [234]:
# count people of interest
count_poi = 0
poi_name = []
for entry in data_dict:
    if data_dict[entry]['poi'] == 1:
        count_poi += 1
        poi_name.append(entry)
print "There are " + str(count_poi) + " person of interest."
print poi_name

There are 18 person of interest.
['HANNON KEVIN P', 'COLWELL WESLEY', 'RIEKER PAULA H', 'KOPPER MICHAEL J', 'SHELBY REX', 'DELAINEY DAVID W', 'LAY KENNETH L', 'BOWEN JR RAYMOND M', 'BELDEN TIMOTHY N', 'FASTOW ANDREW S', 'CALGER CHRISTOPHER F', 'RICE KENNETH D', 'SKILLING JEFFREY K', 'YEAGER F SCOTT', 'HIRKO JOSEPH', 'KOENIG MARK E', 'CAUSEY RICHARD A', 'GLISAN JR BEN F']


> This shows that there are 18 PoIs in total, along with their names. Since the dataset has 146 entries and only 18 of them are PoIs, the remaining 128 entries are not labeled as persons of interest.
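> The class balance matters later when interpreting accuracy, so it can help to compute it explicitly. The sketch below is a minimal illustration on a toy dictionary shaped like data_dict; the names `toy_data` and `class_balance` are hypothetical and not part of the project code.

```python
# Toy stand-in for data_dict: each record carries a boolean 'poi' flag.
# The names and values are hypothetical, not taken from the Enron data.
toy_data = {
    "PERSON A": {"poi": True},
    "PERSON B": {"poi": False},
    "PERSON C": {"poi": False},
    "PERSON D": {"poi": False},
}


def class_balance(records):
    """Return (number of PoIs, number of non-PoIs, PoI share of all records)."""
    n_poi = sum(1 for person in records.values() if person["poi"])
    n_total = len(records)
    return n_poi, n_total - n_poi, n_poi / float(n_total)


n_poi, n_non_poi, poi_share = class_balance(toy_data)
print("%d PoIs, %d non-PoIs, PoI share %.3f" % (n_poi, n_non_poi, poi_share))

# The same arithmetic applied to the counts reported above (18 of 146 entries).
reported_share = 18 / 146.0
print("Reported PoI share: %.3f" % reported_share)
```

> With roughly one PoI for every eight entries, the two classes are far from balanced, which is worth keeping in mind when reading the scores further down.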

## Outlier Removal

> Right away, we can see some outliers in this dataset. The first is 'THE TRAVEL AGENCY IN THE PARK'. This is supposed to be a list of the names of executives, so this entry is clearly in the wrong place. Another outlier is 'TOTAL', which holds the sum of each feature across all other entries rather than data for a person. We don't want the total of every person on the list, so both outliers will be removed.

In [5]:
# Removing THE TRAVEL AGENCY IN THE PARK outlier
data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0)

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 362096,
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 362096,
 'total_stock_value': 'NaN'}

In [6]:
# Removing TOTAL outlier
data_dict.pop('TOTAL', 0)

{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}

> Another way to find potential outliers is to find individuals who may have Total Payments as NaN. This could potentially mean that this person should not be included.

In [7]:
# List of people with no total payment data
for entry in data_dict:
    if data_dict[entry]['total_payments'] == 'NaN':
        print entry

CORDES WILLIAM R
LOWRY CHARLES P
CHAN RONNIE
WHALEY DAVID A
CLINE KENNETH W
LEWIS RICHARD
MCCARTY DANNY J
POWERS WILLIAM
PIRO JIM
WROBEL BRUCE
MCDONALD REBECCA
SCRIMSHAW MATTHEW
GATHMANN WILLIAM D
GILLIS JOHN
MORAN MICHAEL P
LOCKHART EUGENE E
SHERRICK JEFFREY B
FOWLER PEGGY
CHRISTODOULOU DIOMEDES
HUGHES JAMES A
HAYSLETT RODERICK J


> This could be narrowed down further by adding a second criterion: the total stock value.

In [8]:
# List of people with no total payment data and no total stock value data
for entry in data_dict:
    if data_dict[entry]['total_payments'] == 'NaN' and data_dict[entry]['total_stock_value'] == 'NaN':
        print entry

CHAN RONNIE
POWERS WILLIAM
LOCKHART EUGENE E


> Investigating these individuals, we can see that they are missing lots of information.

In [9]:
# Investigating CHAN RONNIE
data_dict['CHAN RONNIE']

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': -98784,
 'director_fees': 98784,
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 32460,
 'restricted_stock_deferred': -32460,
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

In [10]:
# Investigating POWERS WILLIAM
data_dict['POWERS WILLIAM']

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': -17500,
 'director_fees': 17500,
 'email_address': 'ken.powers@enron.com',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 26,
 'from_poi_to_this_person': 0,
 'from_this_person_to_poi': 0,
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 12,
 'to_messages': 653,
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

In [11]:
# Investigating LOCKHART EUGENE E
data_dict['LOCKHART EUGENE E']

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

> LOCKHART EUGENE E is the most obvious outlier here, as this individual has NaN for every feature, and he is also not a PoI. This is an outlier that will be removed.

> CHAN RONNIE is also an outlier, as he has NaN for most features and is not a PoI. The only values he has are director fees (98784) and restricted stock (32460), and each is cancelled out to zero by a deferred amount of exactly the same size (deferred_income of -98784 and restricted_stock_deferred of -32460). This leaves us with another blank slate. This outlier will be removed.

> POWERS WILLIAM does have NaN or 0 for many of the features; however, he has several features that the other two did not have, such as message counts and shared receipts. This gives us more information that can be used in the investigation. So even though much information is missing, we can still gather some data from this individual. This is not an outlier.
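> The manual inspection above can be partly automated by counting how many features each entry is missing. The following is a minimal sketch on made-up records; `count_missing`, `sparse_entries`, and the toy names are hypothetical helpers introduced only for illustration.

```python
# Fields that are not numeric features and should not count as "missing data".
NON_FEATURES = ("poi", "email_address")


def count_missing(record):
    """Count features stored as the string 'NaN', skipping the label and address."""
    return sum(1 for name, value in record.items()
               if name not in NON_FEATURES and value == "NaN")


def sparse_entries(records, max_present=0):
    """Return names whose number of non-missing features is at most max_present."""
    flagged = []
    for name, record in records.items():
        n_features = len([key for key in record if key not in NON_FEATURES])
        if n_features - count_missing(record) <= max_present:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical miniature records, shaped like entries of data_dict.
toy = {
    "EMPTY PERSON": {"poi": False, "salary": "NaN", "bonus": "NaN", "to_messages": "NaN"},
    "PARTIAL PERSON": {"poi": False, "salary": "NaN", "bonus": "NaN", "to_messages": 653},
    "FULL PERSON": {"poi": True, "salary": 100000, "bonus": 50000, "to_messages": 1200},
}

print(sparse_entries(toy))                  # entries with no usable features
print(sparse_entries(toy, max_present=1))   # entries with at most one usable feature
```

> A count like this only finds entries that are empty or nearly empty. A case like CHAN RONNIE, where the few available values cancel each other out, still needs the manual judgment applied above.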

In [12]:
# Removing LOCKHART EUGENE E outlier
data_dict.pop('LOCKHART EUGENE E', 0)

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

In [13]:
# Removing CHAN RONNIE outlier
data_dict.pop('CHAN RONNIE', 0)

{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': -98784,
 'director_fees': 98784,
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 32460,
 'restricted_stock_deferred': -32460,
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

In [14]:
# Saving this new dataset without the outliers
my_dataset = data_dict

## Processing Features

> In this section we are going to create a new feature to add to the feature list. The original features will be put into a list below, and then a new feature will be created.

In [15]:
# Printing all features
print len((my_dataset['SKILLING JEFFREY K'].keys()))
print (my_dataset['SKILLING JEFFREY K'].keys())

21
['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']


> We can see that there are 21 original features. We are going to add these into a list.

## Original Features

In [16]:
# List of original features
features_list = [
 'poi',
 'bonus',
 'deferral_payments',
 'deferred_income',
 'director_fees',
 'email_address',
 'exercised_stock_options',
 'expenses',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'loan_advances',
 'long_term_incentive',
 'other',
 'restricted_stock',
 'restricted_stock_deferred',
 'salary',
 'shared_receipt_with_poi',
 'to_messages',
 'total_payments',
 'total_stock_value'
]

## New Feature

> The new feature we will create combines 'from_poi_to_this_person' and 'from_this_person_to_poi'. The theory is that a PoI would have many more emails to and from other PoIs, whereas non-PoIs would probably communicate less with PoIs.

> We will create this new feature by adding 'from_poi_to_this_person' and 'from_this_person_to_poi' together to get the total number of PoI emails. We will then add 'to_messages' and 'from_messages' together to get the total number of emails in general. Dividing PoI emails by total emails gives the fraction of each person's email traffic that was exchanged with PoIs.
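> The calculation described above can be sketched for a single record as follows. This is an illustrative variant, not the notebook's own implementation: the function name `fraction_poi_emails_for` is hypothetical, and it treats any missing ('NaN') input as a fraction of 0.0.

```python
def fraction_poi_emails_for(record):
    """Share of a person's emails exchanged with PoIs; 0.0 if any input is missing."""
    keys = ("from_poi_to_this_person", "from_this_person_to_poi",
            "to_messages", "from_messages")
    values = [record.get(key, "NaN") for key in keys]
    if "NaN" in values:
        return 0.0
    from_poi, to_poi, received, sent = [float(value) for value in values]
    total = received + sent
    if total == 0:
        return 0.0  # avoid dividing by zero when no emails are recorded
    return (from_poi + to_poi) / total


# Made-up record: 50 PoI emails out of 200 emails in total.
example = {"from_poi_to_this_person": 30, "from_this_person_to_poi": 20,
           "to_messages": 150, "from_messages": 50}
print(fraction_poi_emails_for(example))

# Email counts copied from the POWERS WILLIAM record shown earlier.
powers = {"from_poi_to_this_person": 0, "from_this_person_to_poi": 0,
          "to_messages": 653, "from_messages": 26}
print(fraction_poi_emails_for(powers))
```

> Computing the value per record, rather than through parallel lists, avoids relying on the dictionary being iterated in the same order every time.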

In [17]:
# Sums two features for every person; uses 0.0 when either value is missing
def get_total_list(feature1, feature2):
    new_list = []
    for i in my_dataset:
        if my_dataset[i][feature1] == 'NaN' or my_dataset[i][feature2] == 'NaN':
            new_list.append(0.)
        else:
            # always append a value so the list stays aligned with my_dataset
            new_list.append(float(my_dataset[i][feature1]) + float(my_dataset[i][feature2]))
    return new_list

In [18]:
# Total PoI emails list
poi_emails_list = get_total_list('from_this_person_to_poi', 'from_poi_to_this_person')

In [19]:
# Total emails list
total_emails_list = get_total_list('to_messages', 'from_messages')

In [20]:
# Divides one list by another list
def fraction_list(list1, list2):
    new_list = []
    for i in range(0,len(list1)):
        if list2[i] == 0.0:
            new_list.append(0.0)
        else:
            new_list.append(float(list1[i])/float(list2[i]))
    return new_list

In [21]:
# Getting new list by dividing previously created lists
fraction_poi_emails = fraction_list(poi_emails_list, total_emails_list)

In [22]:
# Adding this new feature to the dataset
count = 0
for i in my_dataset:
    my_dataset[i]['fraction_poi_emails'] = fraction_poi_emails[count]
    count += 1

In [23]:
# printing all features
print len((my_dataset['SKILLING JEFFREY K'].keys()))
print (my_dataset['SKILLING JEFFREY K'].keys())

22
['to_messages', 'deferral_payments', 'expenses', 'poi', 'deferred_income', 'email_address', 'long_term_incentive', 'restricted_stock_deferred', 'from_messages', 'shared_receipt_with_poi', 'loan_advances', 'fraction_poi_emails', 'other', 'director_fees', 'bonus', 'total_stock_value', 'from_poi_to_this_person', 'from_this_person_to_poi', 'restricted_stock', 'salary', 'total_payments', 'exercised_stock_options']


> Printing all features again, we can now see that there are 22 features, as 'fraction_poi_emails' has been added to the list. Now we are going to add this new feature into the features_list.

> However, the 'email_address' feature is actually unnecessary so we will remove that one from the feature list. There are still 22 features in the dataset, but we will only be using 21 of them for the rest of the analysis. This brings us back to 21 features, essentially replacing 'email_address' with 'fraction_poi_emails'.

In [61]:
# Adding 'fraction_poi_emails' to features_list, and removing 'email_address'
features_list = [
 'poi',
 'bonus',
 'deferral_payments',
 'deferred_income',
 'director_fees',
 'exercised_stock_options',
 'expenses',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'loan_advances',
 'long_term_incentive',
 'other',
 'restricted_stock',
 'restricted_stock_deferred',
 'salary',
 'shared_receipt_with_poi',
 'to_messages',
 'total_payments',
 'total_stock_value',
 'fraction_poi_emails'
]

## Selecting Best Classifier

## Decision Tree

> First, we are going to run decision tree with the original features.

In [258]:
import numpy as np
np.random.seed(42)
from time import time
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# features_list
features_list = ['poi','salary', 'from_poi_to_this_person', 'from_this_person_to_poi', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'director_fees', 'deferred_income', 'long_term_incentive']

data = featureFormat(my_dataset, features_list)
labels, features = targetFeatureSplit(data)

# split data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.4, random_state = 42)

# choose decision tree
from sklearn.tree import DecisionTreeClassifier
t0 = time()
clf = DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.7543859649122807
Precision:  0.25
Recall:  0.375


> Now, we will run decision tree with the added feature.

In [252]:
# features_list
features_list = ['poi','salary', 'from_poi_to_this_person', 'fraction_poi_emails', 'from_this_person_to_poi', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'director_fees', 'deferred_income', 'long_term_incentive']

data = featureFormat(my_dataset, features_list)
labels, features = targetFeatureSplit(data)

# split data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.4, random_state = 42)

# choose decision tree
from sklearn.tree import DecisionTreeClassifier
t0 = time()
clf = DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.7719298245614035
Precision:  0.3076923076923077
Recall:  0.5


> Adding the new feature improved the Accuracy, Precision and Recall scores. We will keep this added feature for now.

## Selecting Best Features

> With our added feature, we improved all 3 of the performance scores; however, there may be a better way to do this. We are going to run SelectKBest and choose the most important features from our dataset. We will then run decision tree again with the selected features and see if the performance has improved.

In [164]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


# features_list
features_list = ['poi','salary', 'from_poi_to_this_person', 'fraction_poi_emails', 'from_this_person_to_poi', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'director_fees', 'deferred_income', 'long_term_incentive']

data = featureFormat(my_dataset, features_list)
labels, features = targetFeatureSplit(data)

# split data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.4, random_state = 42)
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=10)
selectedFeatures = selector.fit(features,labels)
feature_names = [features_list[i] for i in selectedFeatures.get_support(indices=True)]
print 'Best features: ', feature_names

Best features:  ['poi', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'restricted_stock_deferred', 'expenses', 'director_fees', 'deferred_income']


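> We can see here that the model has chosen 10 features. Note that the printed names are shifted by one position: get_support(indices=True) returns column indices into features, which no longer contains the 'poi' label, while features_list still has 'poi' at index 0. This is why the label 'poi' itself appears in the output, and why each printed name is really the entry listed just before the selected feature in features_list. Under either reading, the feature that was created and added was not selected as a best feature. We are now going to run decision tree again with the features as printed.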
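> One detail worth checking in the cell above is how the selector's column indices are mapped back to names. The sketch below uses plain lists to show the alignment, assuming (as in this project) that the first entry of features_list is the label and is split off before the selector sees the data. The function `selected_names` and the toy values are hypothetical.

```python
def selected_names(features_list, support_indices):
    """Map selector column indices to names, skipping the label in position 0."""
    predictor_names = features_list[1:]  # names of the columns the selector saw
    return [predictor_names[i] for i in support_indices]


# Toy example: the label comes first, followed by three predictors.
toy_features_list = ["poi", "salary", "bonus", "expenses"]
toy_support = [0, 2]  # pretend the selector kept columns 0 and 2

aligned = selected_names(toy_features_list, toy_support)
misaligned = [toy_features_list[i] for i in toy_support]

print(aligned)     # names of the columns actually selected
print(misaligned)  # indexing the full list shifts every name by one
```

> Indexing the full list makes the label show up as if it were a selected feature, which is a quick way to spot this kind of shift.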

In [199]:
# features_list
features_list = ['poi', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'restricted_stock_deferred', 'expenses', 'director_fees', 'deferred_income']
data = featureFormat(my_dataset, features_list)
labels, features = targetFeatureSplit(data)

# split data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.4, random_state = 42)

# choose decision tree
from sklearn.tree import DecisionTreeClassifier
t0 = time()
clf = DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

acc = accuracy_score(labels_test, pred)
print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.8421052631578947
Precision:  0.42857142857142855
Recall:  0.375


> Running decision tree with only these 10 features raised accuracy (0.77 to 0.84) and precision (0.31 to 0.43) compared with the run that included the new feature, although recall dropped from 0.5 back to 0.375. We will continue using just these 10 features.

## Nearest Centroid

In [191]:
from sklearn.neighbors import NearestCentroid

clf = NearestCentroid()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)


acc = accuracy_score(labels_test, pred)
print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.8771929824561403
Precision:  0.6
Recall:  0.375


## Linear SVC

In [224]:
from sklearn import svm

clf = svm.LinearSVC()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)


acc = accuracy_score(labels_test, pred)

print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.7543859649122807
Precision:  0.2857142857142857
Recall:  0.5


## Logistic Regression

In [225]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)


acc = accuracy_score(labels_test, pred)

print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.8771929824561403
Precision:  0.6
Recall:  0.375


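> Out of all the algorithms, Logistic Regression and Nearest Centroid share the highest accuracy (0.877) and precision (0.6) with the selected features, while Linear SVC has the highest recall (0.5) at the cost of lower accuracy and precision. We will continue with Logistic Regression and try to tune it to achieve even better results.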

## Tuning Classifier

In [253]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print "Best estimator found by grid search:"
print clf.best_estimator_

Best estimator found by grid search:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)




In [254]:
# features_list
features_list = ['poi', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'restricted_stock_deferred', 'expenses', 'director_fees', 'deferred_income']
data = featureFormat(my_dataset, features_list)
labels, features = targetFeatureSplit(data)

# split data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.4, random_state = 42)

clf = LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

acc = accuracy_score(labels_test, pred)

print 'Accuracy: ' + str(acc)
print 'Precision: ', precision_score(labels_test, pred)
print 'Recall: ', recall_score(labels_test, pred)

Accuracy: 0.8771929824561403
Precision:  0.6
Recall:  0.375


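> Tuning the parameters still results in the original performance of the Logistic Regression model. Note that GridSearchCV reported C=0.1 as the best value, while the cell above sets C=0.001; the scores shown are for C=0.001.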

## Dumping classifier, dataset, and features_list

In [255]:
dump_classifier_and_data(clf, my_dataset, features_list)

## Questions

#### Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?

> The goal of this project was to create a PoI identifier based on public Enron financial and email data. Machine learning helps by finding the attributes that PoIs have in common and the attributes that non-PoIs have in common. Once these are loaded into a machine learning algorithm, the algorithm can try to detect who might be a PoI and who might not. The dataset started with 146 entries, 18 of them labeled as PoIs, each with financial and email features. There were several outliers that had to be removed: an entry that was not a person ('THE TRAVEL AGENCY IN THE PARK'), a 'TOTAL' row summing all other entries, and two individuals with no usable data (LOCKHART EUGENE E and CHAN RONNIE).

#### What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.

> The feature I created was the fraction of a person's emails that were sent to or received from PoIs, compared to the total emails that person sent and received. This caused the algorithm to have better performance than just using the original features. However, I then used SelectKBest with k=10 to choose the features to use, and the feature I added was not chosen. Using the selected features gave higher accuracy and precision than using the feature I created, so I used the features from SelectKBest and did not add any others.

#### What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?

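> The algorithm I ended up using was Logistic Regression. I also tried Decision Tree, Nearest Centroid, and Linear SVC with the selected features. Logistic Regression and Nearest Centroid tied for the highest accuracy and precision, ahead of Decision Tree and Linear SVC, although Linear SVC had the highest recall.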

#### What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).

> An algorithm's parameters must be tuned so that the model fits the data well without fitting it too closely. Poorly chosen values can lead to inaccurate predictions, through either underfitting or overfitting. Before tuning, SelectKBest was used to choose the features to use, discarding features that would likely hurt performance. I then used GridSearchCV to search for the best value of the C parameter of the Logistic Regression model. A related preprocessing step is feature scaling. I did not apply feature scaling to Logistic Regression, although regularized models such as this one can be affected by feature scale. For models that are sensitive to the magnitude of their inputs, scaling matters: without it, features with large numeric ranges can carry too much weight and features with small ranges too little, which skews the model toward a few features. Scaling puts all features on a comparable range so that each one can contribute.
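> To make the scaling idea concrete, here is a minimal sketch of min-max scaling in plain Python. It was not used in this analysis; the function name `min_max_scale` and the sample numbers are made up for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(value - lowest) / float(highest - lowest) for value in values]


# Two made-up features on very different scales.
salaries = [200000, 400000, 1000000]
email_counts = [50, 100, 250]

scaled_salaries = min_max_scale(salaries)
scaled_emails = min_max_scale(email_counts)
print(scaled_salaries)
print(scaled_emails)
```

> After scaling, both features span the same range, so a difference in salary no longer outweighs a difference in email count simply because the raw numbers are larger.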

#### What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?

> Validation is the process of checking how your model performs on unseen data. A classic mistake is tuning your model so that it predicts the training data very well but then performs poorly on unseen, out-of-sample testing data. This is called overfitting. One of the major goals of validation is to detect and avoid overfitting, which can be done by holding out test data or through cross-validation. This analysis was validated with scikit-learn's train_test_split function, which held out 40% of the data for testing. The models were trained on the remaining 60%, and their predictions on the test set were compared to the actual labels.
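> Because PoIs are rare in this dataset, a single random split can leave very few of them in the test set. One common remedy is a stratified split, which keeps the class ratio similar in both parts. The sketch below is a plain-Python illustration of that idea, not what this notebook ran; `stratified_split` and the toy labels are assumptions made for the example.

```python
import random


def stratified_split(labels, test_fraction=0.4, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in each part."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Collect and shuffle the positions of this class only.
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 18 positives and 124 negatives (hypothetical stand-in data).
labels = [1] * 18 + [0] * 124
train_idx, test_idx = stratified_split(labels)

n_test_poi = sum(labels[i] for i in test_idx)
print("test size: %d, PoIs in test: %d" % (len(test_idx), n_test_poi))
```

> Repeating such a split several times with different seeds and averaging the scores gives a steadier estimate than a single split on a dataset this small.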

#### Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

> The Logistic Regression model with the selected features and tuned parameters reached an accuracy of about 87.7%, a precision of 60%, and a recall of 37.5% on the held-out test set. In plain terms: about 88 out of every 100 people in the test set were labeled correctly; when the model flags someone as a PoI, it is right about 60% of the time; and it finds about 37.5% of the people who really are PoIs. Accuracy should be read with caution here, because only 18 of the 146 original entries are PoIs, so a model that labels everyone as a non-PoI would already be right about 87.7% of the time (128 of 146). Precision and recall are therefore the more informative measures of how well the model identifies PoIs.
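> The three scores can be traced back to counts of correct and incorrect predictions. The sketch below assumes a confusion matrix of 3 true positives, 2 false positives, 5 false negatives, and 47 true negatives. These counts are an assumption: they are not printed anywhere in the notebook, but they are consistent with the reported accuracy, precision, and recall. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Compute (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Assumed counts, chosen to be consistent with the reported scores.
accuracy, precision, recall = metrics(tp=3, fp=2, fn=5, tn=47)
print("accuracy %.3f, precision %.3f, recall %.3f" % (accuracy, precision, recall))

# Baseline on the same assumed test set: label everyone as a non-PoI.
base_accuracy, base_precision, base_recall = metrics(tp=0, fp=0, fn=8, tn=49)
print("baseline accuracy %.3f, recall %.3f" % (base_accuracy, base_recall))
```

> Under these assumed counts, the always-negative baseline is nearly as accurate as the model but never finds a single PoI, which is why precision and recall say more here than accuracy does.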