# Identify Fraud From Enron Email ML Project
---

## Project Overview
---

In this project, I play detective and put my machine learning skills to use by building an algorithm to identify Enron employees who may have committed fraud, based on the public Enron financial and email dataset.

### Project Context

The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.

The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. The email and finance data have been combined into a single dataset, which I will explore in this project.

### Project Goals

* Deal with an imperfect, real-world dataset
* Validate a machine learning result using test data
* Evaluate a machine learning result using quantitative metrics
* Create, select and transform features
* Compare the performance of machine learning algorithms
* Tune machine learning algorithms for maximum performance
* Communicate your machine learning algorithm results clearly

## Understanding the Dataset and Question
---

### Data Exploration

The <b>size</b> of the dataset is as follows:

In [None]:
print "Population: ", len(data_dict)
print "Property Number: ", len(data_dict.values()[0])
print "Property Names:", data_dict.values()[0].keys()

<b>Population:</b>  146<br>
<b>Property Number:</b>  21<br>
<b>Property Names:</b> ['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']

The property <b>'poi'</b> is the classification label, so it is excluded from the quality analysis and from the selection of properties.<br>
The <b>'email_address'</b> property is a string that is not useful for classification, so it is not taken into account in the following steps.

To <b>measure the data quality</b> of the properties, we checked the percentage of NaNs in each of them:

In [None]:
showNaNs(features, features_list[1:])

In [None]:
NaN Percent
34.7 salary
43.8 bonus
29.9 exercised_stock_options
13.2 total_stock_value
73.6 deferral_payments
13.9 total_payments
24.3 restricted_stock
40.3 shared_receipt_with_poi
88.2 restricted_stock_deferred
34.7 expenses
97.9 loan_advances
39.9 to_messages
40.3 from_messages
36.1 other
54.2 from_this_person_to_poi
88.9 director_fees
66.7 deferred_income
54.9 long_term_incentive
48.6 from_poi_to_this_person
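
The `showNaNs` helper itself is not shown in this write-up. As a rough sketch of how such a percentage could be computed, the function below counts missing values per property. It assumes (this is an assumption, not taken from the code above) that each person is a dict of property values and that a missing value is stored as the string `'NaN'`. The function name and the toy records are hypothetical.

```python
def nan_percentages(records, feature_names):
    """Return {feature: percent of records whose value is missing}.

    Assumption: a missing value is stored as the string "NaN".
    """
    total = len(records)
    result = {}
    for name in feature_names:
        missing = sum(1 for rec in records if rec.get(name, "NaN") == "NaN")
        result[name] = round(100.0 * missing / total, 1)
    return result


# Toy records, invented for illustration only (not real Enron values).
toy_records = [
    {"salary": 1000, "bonus": "NaN"},
    {"salary": "NaN", "bonus": "NaN"},
    {"salary": 3000, "bonus": 500},
    {"salary": 4000, "bonus": "NaN"},
]

toy_percentages = nan_percentages(toy_records, ["salary", "bonus"])
for name in sorted(toy_percentages):
    print("%.1f %s" % (toy_percentages[name], name))
```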

Properties with low quality (more than 60% NaN) are not taken into account:
<br>73.6 deferral_payments
<br>88.2 restricted_stock_deferred
<br>97.9 loan_advances
<br>88.9 director_fees
<br>66.7 deferred_income
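
The 60% cut-off can be checked with a few lines of code. The sketch below copies the percentages from the table above into a dict and splits the property names at the threshold. The variable names are illustrative only.

```python
# NaN percentages copied from the table above.
nan_percent = {
    "salary": 34.7, "bonus": 43.8, "exercised_stock_options": 29.9,
    "total_stock_value": 13.2, "deferral_payments": 73.6,
    "total_payments": 13.9, "restricted_stock": 24.3,
    "shared_receipt_with_poi": 40.3, "restricted_stock_deferred": 88.2,
    "expenses": 34.7, "loan_advances": 97.9, "to_messages": 39.9,
    "from_messages": 40.3, "other": 36.1, "from_this_person_to_poi": 54.2,
    "director_fees": 88.9, "deferred_income": 66.7,
    "long_term_incentive": 54.9, "from_poi_to_this_person": 48.6,
}

MAX_NAN_PERCENT = 60.0  # properties above this threshold are dropped

kept = sorted(n for n, p in nan_percent.items() if p <= MAX_NAN_PERCENT)
dropped = sorted(n for n, p in nan_percent.items() if p > MAX_NAN_PERCENT)

print("%d kept, %d dropped: %s" % (len(kept), len(dropped), ", ".join(dropped)))
```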

At this point there are 14 properties to consider.
<br>
Using code that plots the properties two at a time, with each point labeled as POI / non-POI, I analyzed whether each property helps to discern the classification.

In addition to finding an outlier (see the next section), we keep the following 6 properties which, after graphical analysis, appear to be related to the POI / non-POI classification: <b>['salary', 'bonus', 'total_stock_value', 'total_payments','expenses', 'long_term_incentive']</b>

We will also include the properties related to messages sent to / received from POIs, but they are converted to percentages in a later step.

### Outlier Investigation

Plotting the selected properties reveals a clear outlier. Inspecting the dataset shows that it corresponds to the <b>'TOTAL'</b> entry, so we remove it from the dataset.

Other points also look like outliers in some plots, but they help us detect POIs, so they are not removed.
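
Removing the entry takes a single dictionary operation. The sketch below uses a small invented stand-in for `data_dict`, so every name other than `'TOTAL'` and every value is hypothetical.

```python
# Toy stand-in for data_dict; names other than 'TOTAL' and all values are invented.
toy_data_dict = {
    "EMPLOYEE A": {"salary": 200000, "poi": False},
    "EMPLOYEE B": {"salary": 350000, "poi": True},
    "TOTAL": {"salary": 550000, "poi": False},
}

# pop() with a default avoids a KeyError if the entry was already removed.
toy_data_dict.pop("TOTAL", None)

print(sorted(toy_data_dict))
```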

## Optimize Feature Selection/Engineering
---

I use a scaler to rescale all selected properties to the range between their minimum and maximum values.

In [None]:
features = MinMaxScaler().fit_transform(features)
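
For reference, min-max scaling maps each value `x` of a property to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The pure-Python sketch below shows the arithmetic for a single column. It is an illustration, not the scikit-learn implementation; the function name and sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]


toy_column = [100.0, 150.0, 300.0]  # invented values
scaled = min_max_scale(toy_column)
print(scaled)
```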

I convert the properties 'from_this_person_to_poi', 'from_poi_to_this_person' and the total from/to/shared messages into percentages of the corresponding message totals.

In [None]:
features_list.append('perc_poi_messages')
include_perc_poi_messages(my_dataset, 'perc_poi_messages',
                          ['from_this_person_to_poi', 'from_poi_to_this_person', 'shared_receipt_with_poi'],
                          ['to_messages', 'from_messages', 'shared_receipt_with_poi'])

features_list.append('perc_this_person_to_poi')
include_perc_poi_messages(my_dataset, 'perc_this_person_to_poi', ['from_this_person_to_poi'], ['from_messages'])

features_list.append('perc_poi_to_this_person')
include_perc_poi_messages(my_dataset, 'perc_poi_to_this_person', ['from_poi_to_this_person'], ['to_messages'])
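
The body of `include_perc_poi_messages` is not shown here. The sketch below is one possible way to build such a ratio feature, using the same argument shape (dataset, new property name, numerator properties, denominator properties). The function name `add_ratio_feature`, the treatment of `'NaN'` as 0, and the fallback to 0.0 for an empty denominator are all assumptions made for illustration.

```python
def _as_number(value):
    """Treat missing values (None or the string 'NaN') as 0."""
    return 0.0 if value in (None, "NaN") else float(value)


def add_ratio_feature(dataset, new_name, numerator_keys, denominator_keys):
    """Add dataset[person][new_name] = sum(numerators) / sum(denominators)."""
    for record in dataset.values():
        numerator = sum(_as_number(record.get(k)) for k in numerator_keys)
        denominator = sum(_as_number(record.get(k)) for k in denominator_keys)
        # Assumption: people without any messages get a ratio of 0.0.
        record[new_name] = numerator / denominator if denominator else 0.0


# Toy records, invented for illustration only.
toy_dataset = {
    "PERSON A": {"from_this_person_to_poi": 10, "from_messages": 40},
    "PERSON B": {"from_this_person_to_poi": "NaN", "from_messages": "NaN"},
}
add_ratio_feature(toy_dataset, "perc_this_person_to_poi",
                  ["from_this_person_to_poi"], ["from_messages"])
print(toy_dataset["PERSON A"]["perc_this_person_to_poi"])
```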

Additionally, I use PCA to reduce the number of dimensions and so speed up the algorithm. 8 is the smallest number of components I could use before performance started to degrade.

In [None]:
pca = PCA(n_components=8)
features = pca.fit_transform(features)

## Pick and Tune an Algorithm
---

The following algorithms have been used for the POI detector:
* Gaussian NB
* SVC
* Decision Tree Classifier

In [None]:
#clf = GaussianNB()
#clf = svm.SVC(gamma="auto", C=8000.0, kernel='rbf')
#clf = tree.DecisionTreeClassifier()

The best results were given by SVC and Naive Bayes.
SVC was finally selected, with its parameters chosen using GridSearchCV.

In [None]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10, 100, 1000, 3000, 5000, 8000, 9000, 10000, 11000]}
clf = GridSearchCV(svm.SVC(gamma="auto"), parameters, cv=5, iid=False, scoring='f1')
clf.fit(features_train, labels_train)
print clf.best_params_


The grid search iterates over 'kernel' and 'C', while 'gamma' is kept fixed. The scoring function to maximize is f1.
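
As a possible variation (not the setup used in this project), the scaler, PCA and SVC can be wrapped in a scikit-learn `Pipeline` and passed to `GridSearchCV`. The preprocessing is then refit on the training part of each cross-validation fold. The sketch below runs on synthetic data from `make_classification`, not on the Enron dataset. The 9 features, the class weights, the reduced `C` grid and the random seeds are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (NOT the Enron dataset).
X, y = make_classification(n_samples=140, n_features=9, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)

# Scaling and PCA are refit inside every cross-validation fold.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=8)),
    ("svc", SVC(gamma="auto")),
])

# Step-name prefixes ("svc__") route each parameter to the right step.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [1, 10, 100]}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1")
search.fit(X, y)

print(search.best_params_)
```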

## Validate and Evaluate
---


Because the dataset is very small, I did not expect very good results.

The results for the selected metrics are shown next; they are better than I initially expected.

In [None]:
print "Accuracy Train", round(clf.score(features_train, labels_train), 2)
print "Accuracy Test", round(clf.score(features_test, labels_test), 2)

prediction_test = clf.predict(features_test)
print "Precision Score", round(precision_score(labels_test, prediction_test, average='binary'), 2)
print "Recall Score", round(recall_score(labels_test, prediction_test, average='binary'), 2)

In [None]:
Accuracy Train 0.98
Accuracy Test 0.91
Precision Score 0.75
Recall Score 0.5

In this project context (detecting culprits) I am mainly interested in maximizing precision, without neglecting the other metrics (accuracy, recall). I therefore used f1 as the scoring function and reviewed the parameters and properties to choose the configuration that produced high precision.<br>
A precision of 75% seems very good (in the context of this exercise): only 25% of the people predicted as POI are actually non-POI (false positives).<br>
On the other hand, recall is only 50%: half of the true POIs are not detected (false negatives).<br>
The overall accuracy on the test set is high: 0.91.
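
To make the relationship between these metrics concrete, the sketch below computes precision, recall, f1 and accuracy from confusion-matrix counts. The counts are hypothetical: they were chosen only because they reproduce the rounded scores reported above, and they are not the actual confusion matrix of this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) from confusion-matrix counts.

    Sketch only: assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    precision = tp / float(tp + fp)   # share of predicted POIs that are real POIs
    recall = tp / float(tp + fn)      # share of real POIs that were detected
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical counts, consistent with the rounded scores reported above.
precision, recall, f1, accuracy = classification_metrics(tp=3, fp=1, fn=3, tn=37)
print("precision=%.2f recall=%.2f f1=%.2f accuracy=%.2f"
      % (precision, recall, f1, accuracy))
```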

## References
[Wikipedia] https://en.wikipedia.org/wiki/Enron_scandal<br>
[Scikit Learn Web] https://scikit-learn.org/stable/

I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from Websites, books, forums, blog posts, github repositories, etc.