# Identify Fraud From Enron Email ML Project
---

## Project Overview
---

In this project, I play detective and put my machine learning skills to use by building an algorithm that identifies Enron employees who may have committed fraud, based on the public Enron financial and email dataset.

### Project Context

The Enron scandal, publicized in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.

The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. The email and finance data have been combined into a single dataset, which I will explore in this project.

### Project Goals

* Deal with an imperfect, real-world dataset
* Validate a machine learning result using test data
* Evaluate a machine learning result using quantitative metrics
* Create, select and transform features
* Compare the performance of machine learning algorithms
* Tune machine learning algorithms for maximum performance
* Communicate your machine learning algorithm results clearly

## Understanding the Dataset and Question
---

### Data Exploration

The <b>size</b> of the dataset is as follows:

In [None]:
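# Every person maps to the same set of properties, so inspect the first one.
first_person = next(iter(data_dict.values()))
print("Population: %d" % len(data_dict))
print("Property Number: %d" % len(first_person))
print("Property Names: %s" % list(first_person.keys()))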

<b>Population:</b>  146<br>
<b>Property Number:</b>  21<br>
<b>Property Names:</b> ['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']

The property <b>'poi'</b> is the classification label. It is not taken into account in the analysis of quality and selection of properties.<br>
The <b>'email_address'</b> property is a string that is not useful for classification. It will not be taken into account in the following steps

After removing the people whose properties are all NaN, the class balance is around <b>12.5% POI (18 POI / 125 non-POI)</b>.

To <b>measure the data quality</b> of the properties, I checked the percentage of NaN values in each one (again after removing the people whose properties are all NaN):

In [None]:
poi_utils_show_nans(features, features_list[1:])

| NaN Percent | Property                  |
|-------------|---------------------------|
| 34.7        | salary                    |
| 43.8        | bonus                     |
| 29.9        | exercised_stock_options   |
| 13.2        | total_stock_value         |
| 73.6        | deferral_payments         |
| 13.9        | total_payments            |
| 24.3        | restricted_stock          |
| 40.3        | shared_receipt_with_poi   |
| 88.2        | restricted_stock_deferred |
| 34.7        | expenses                  |
| 97.9        | loan_advances             |
| 39.9        | to_messages               |
| 40.3        | from_messages             |
| 36.1        | other                     |
| 54.2        | from_this_person_to_poi   |
| 88.9        | director_fees             |
| 66.7        | deferred_income           |
| 54.9        | long_term_incentive       |
| 48.6        | from_poi_to_this_person   |
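As a rough sketch of how such percentages could be computed, the helper below counts, for each property, how many people have a missing value. The name `nan_percentages` and the toy `sample` dictionary with its values are illustrative assumptions, not the project's `poi_utils` code, and the sketch assumes missing values are stored as the string `'NaN'`.

```python
def nan_percentages(data, properties, missing="NaN"):
    """Return {property: percent of people whose value is missing}.

    Assumes `data` maps person name -> {property: value} and that
    missing values are stored as the string 'NaN' (an assumption).
    """
    total = len(data)
    result = {}
    for prop in properties:
        n_missing = sum(1 for person in data.values()
                        if person.get(prop, missing) == missing)
        result[prop] = round(100.0 * n_missing / total, 1)
    return result


# Toy data with made-up values, only to show the call.
sample = {
    "PERSON A": {"salary": 100, "bonus": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"salary": 300, "bonus": 50},
    "PERSON D": {"salary": 400, "bonus": 60},
}
percentages = nan_percentages(sample, ["salary", "bonus"])
for prop, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(pct, prop)
```

Properties with a very high percentage (such as 'loan_advances' in the table above) carry little information and are natural candidates to drop during feature selection.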

I checked all the names in the dataset to look for errors.
I found <b>'THE TRAVEL AGENCY IN THE PARK'</b>, which I remove from the data because it is not a person to analyze.
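One possible way to run such a name check is sketched below. It flags keys whose number of words is unusual for a "LASTNAME FIRSTNAME [INITIAL]" pattern. The function name `suspicious_names` and the thresholds are my own assumptions for illustration; the flagged names still need a manual review.

```python
def suspicious_names(names, min_tokens=2, max_tokens=4):
    """Return names whose word count is unusual for a person name.

    The 2..4 word range is a heuristic assumption, not a rule of the dataset.
    """
    flagged = []
    for name in names:
        n_tokens = len(name.split())
        if n_tokens < min_tokens or n_tokens > max_tokens:
            flagged.append(name)
    return flagged


# A few keys mentioned in this document, used as a small example.
names = ["BELFER ROBERT", "HAUG DAVID L",
         "THE TRAVEL AGENCY IN THE PARK", "TOTAL"]
flagged = suspicious_names(names)
print(flagged)
```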

### Outlier Investigation

I plotted the properties two at a time, with each point labeled as POI / non-POI, to check for unexpected values.

The plots of the selected properties show one clear outlier. Inspecting the dataset shows that it corresponds to the <b>'TOTAL'</b> entry, so I remove it from the dataset.
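An entry like 'TOTAL' can also be spotted without a plot by ranking people on a single property. The sketch below is only an illustration: `top_values` is a hypothetical helper, the salary figures are made up, and missing values are assumed to be the string `'NaN'`.

```python
def top_values(data, prop, n=3, missing="NaN"):
    """Return the n (name, value) pairs with the largest value of `prop`."""
    pairs = [(name, person[prop]) for name, person in data.items()
             if person.get(prop, missing) != missing]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Toy values, made up for illustration only.
sample = {
    "PERSON A": {"salary": 200000},
    "PERSON B": {"salary": 350000},
    "PERSON C": {"salary": "NaN"},
    "TOTAL": {"salary": 550000},
}
ranking = top_values(sample, "salary", n=2)
print(ranking)

# Drop the aggregate row once it has been identified.
sample.pop("TOTAL", None)
```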

I also remove the following entries:
<br><b>'BHATNAGAR SANJAY'</b>: wrong data in 'exercised_stock_options' and 'restricted_stock' (the values appear to be swapped).
<br><b>'BELFER ROBERT'</b>: 'restricted_stock_deferred' must be negative, but the recorded value is not.
<br><b>'HAUG DAVID L'</b>: 'total_payments' is 475, with no other payment data and only stock values.

Some other points also look like outliers, but they help to detect POIs, so they should not be removed.

## Optimize Feature Selection/Engineering
---

I use a scaler to rescale all selected properties to the range between their minimum and maximum.

I convert 'from_this_person_to_poi', 'from_poi_to_this_person' and the combined from/to/shared message counts into proportions of the corresponding message totals.
I also add the following properties:

* 'bonus_salary': ratio of 'bonus' to 'salary'.
* 'incentive_salary': ratio of 'long_term_incentive' to 'salary'.
* 'stock_payment': ratio of 'total_stock_value' to 'total_payments'.
* 'total': sum of 'total_payments' and 'total_stock_value'.
* 'total_salary': ratio of 'total' to 'salary'.

In [None]:
poi_utils.include_div_poi(my_dataset, features_list, 'perc_poi_messages',
                          ['from_this_person_to_poi', 'from_poi_to_this_person', 'shared_receipt_with_poi'],
                          ['to_messages', 'from_messages', 'shared_receipt_with_poi'])
poi_utils.include_div_poi(my_dataset, features_list, 'perc_this_person_to_poi', ['from_this_person_to_poi'], ['from_messages'])
poi_utils.include_div_poi(my_dataset, features_list, 'perc_poi_to_this_person', ['from_poi_to_this_person'], ['to_messages'])
poi_utils.include_div_poi(my_dataset, features_list, 'bonus_salary', ['bonus'], ['salary'])
poi_utils.include_div_poi(my_dataset, features_list, 'incentive_salary', ['long_term_incentive'], ['salary'])
poi_utils.include_div_poi(my_dataset, features_list, 'stock_payment', ['total_stock_value'], ['total_payments'])
poi_utils.include_add_poi(my_dataset, features_list, 'total', ['total_payments', 'total_stock_value'])
poi_utils.include_div_poi(my_dataset, features_list, 'total_salary', ['total'], ['salary'])
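The `poi_utils` helpers are not shown here, so the following is only a minimal sketch of what a ratio feature of this kind might look like. The name `add_ratio_feature`, the toy data and the choice to store 0.0 when a value is missing or the denominator is zero are all assumptions of this sketch.

```python
def add_ratio_feature(data, new_name, numerator, denominator, missing="NaN"):
    """Add data[person][new_name] = numerator / denominator for every person.

    Stores 0.0 when either value is missing or the denominator is zero
    (a design choice of this sketch, not necessarily the original one).
    """
    for person in data.values():
        num = person.get(numerator, missing)
        den = person.get(denominator, missing)
        if num == missing or den == missing or den == 0:
            person[new_name] = 0.0
        else:
            person[new_name] = float(num) / den


# Toy values, made up for illustration only.
sample = {
    "PERSON A": {"bonus": 50, "salary": 100},
    "PERSON B": {"bonus": "NaN", "salary": 200},
    "PERSON C": {"bonus": 10, "salary": 0},
}
add_ratio_feature(sample, "bonus_salary", "bonus", "salary")
print(sample["PERSON A"]["bonus_salary"])
```

Guarding against missing values and zero denominators matters here because many properties have a high share of NaN values.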

I chain a scaler, a selector and a PCA step in a pipeline, applied to both training and prediction, to speed up the algorithm:

* A scaler (MinMaxScaler) to rescale each property to the range between its minimum and maximum.
* A selector (SelectKBest) to keep the most significant features.
* A PCA to reduce the selected features to their principal components.
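To make the scaling step concrete, here is a small pure-Python sketch of min-max scaling on a single property. It is a simplified illustration of the idea behind MinMaxScaler, not the library code, and `min_max_scale` is a hypothetical name.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant property has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]


scaled = min_max_scale([10, 20, 40])
print(scaled)
```

Rescaling matters for distance-based classifiers and for PCA, where a property with a large numeric range (such as a payment total) would otherwise dominate properties expressed as proportions.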

## Pick and Tune an Algorithm
---

The following algorithms have been used for the POI detector:
* Gaussian NB
* KNeighbors Classifier
* Decision Tree Classifier

### Gaussian NB

The following code shows the parameters used in the tuning of the scaler, selector, reducer and classifier.

In [None]:
# GaussianNB
pipe_nbc = Pipeline([
        ('scaler', MinMaxScaler()),
        ('selector', SelectKBest()),
        ('reducer', PCA()),
        ('classifier', GaussianNB())
    ])

param_grid_nbc = {
    'scaler':                [None, MinMaxScaler()],
    'selector__k':           [10, 11, 12, 14, 16, 'all'],
    'reducer__n_components': [5, 6, 7, 8],
}

### KNeighbors Classifier

The following code shows the parameters used in the tuning of the scaler, selector, reducer and classifier.

In [None]:
# KNeighborsClassifier
pipe_knc = Pipeline([
        ('scaler', MinMaxScaler()),
        ('selector', SelectKBest()),
        ('reducer', PCA()),
        ('classifier', KNeighborsClassifier())
    ])

param_grid_knc = {
    'scaler':                  [None, MinMaxScaler()],
    'selector__k':             [10, 11, 12, 14, 15, 18, 'all'],
    'reducer__n_components':   [5, 6, 7, 8, 10],
    'classifier__n_neighbors': [2, 3, 4, 5, 6],
    'classifier__weights':     ['uniform', 'distance'],
    'classifier__algorithm':   ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'classifier__leaf_size':   [20, 30, 40],
    'classifier__p':           [1, 2, 3],
}

### Decision Tree Classifier

The following code shows the parameters used in the tuning of the scaler, selector, reducer and classifier.

In [None]:
# DecisionTreeClassifier
pipe_dtc = Pipeline([
        ('scaler', None),
        ('selector', SelectKBest()),
        ('reducer', PCA()),
        ('classifier', DecisionTreeClassifier())
    ])

param_grid_dtc = {
    'scaler':                        [None, MinMaxScaler()],
    'selector__k':                   [10, 11, 12, 14, 16, 'all'],
    'reducer__n_components':         [5, 6, 7, 8],
    'classifier__criterion':         ['gini', 'entropy'],
    'classifier__splitter':          ['best', 'random'],
    'classifier__min_samples_split': [2, 3, 4, 5],
    'classifier__class_weight':      ['balanced', None],
    'classifier__min_samples_leaf':  [1, 2, 3, 4],
    'classifier__max_depth':         [None, 5, 10, 20],
}
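The size of these grids has a direct impact on search time. As a quick sketch, the number of candidate configurations is the product of the number of values listed for each parameter. The counts below are read from the three grids above; the helper name `count_candidates` is hypothetical, and 10 cross-validation splits are assumed.

```python
from functools import reduce
from operator import mul


def count_candidates(value_counts):
    """Multiply the number of values per parameter to get the grid size."""
    return reduce(mul, value_counts, 1)


# Number of values listed for each parameter in the three grids above.
grid_value_counts = {
    "Gaussian NB": [2, 6, 4],
    "KNeighbors": [2, 7, 5, 5, 2, 4, 3, 3],
    "Decision Tree": [2, 6, 4, 2, 2, 4, 2, 4, 4],
}
n_splits = 10  # assumed number of cross-validation splits

candidates = {name: count_candidates(counts)
              for name, counts in grid_value_counts.items()}
for name, n in candidates.items():
    print(name, n, "candidates,", n * n_splits, "fits")
```

The Gaussian NB grid is small, while the other two grids each require several hundred thousand fits under that assumption.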

### GridSearchCV

I use GridSearchCV to automatically select the parameter combination that gives each algorithm its best score.
<br>I use StratifiedShuffleSplit for cross-validation to make the most of the small dataset: it repeatedly draws stratified training and test splits during the search for the best algorithm.
<br>I use F1 scoring in this case to balance precision and recall.

In [None]:
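# `pipe` and `param_grid` stand for one of the pipeline/grid pairs defined above,
# for example pipe_nbc and param_grid_nbc.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=sss, verbose=1, n_jobs=2)
grid = grid.fit(features_train, labels_train)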
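For readers who want to try the search end to end, here is a self-contained sketch on synthetic data. It is not the project's data or final configuration: the dataset comes from `make_classification`, the grid is deliberately tiny, and `'passthrough'` is used to disable the scaler step. It assumes a reasonably recent scikit-learn is installed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the Enron features (not the real data).
features, labels = make_classification(
    n_samples=140, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=42)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("selector", SelectKBest()),
    ("reducer", PCA()),
    ("classifier", GaussianNB()),
])

# Deliberately small grid so the sketch runs quickly.
param_grid = {
    "scaler": ["passthrough", MinMaxScaler()],
    "selector__k": [4, 6],
    "reducer__n_components": [2, 3],
}

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=sss)
grid.fit(features, labels)

print(grid.best_params_)
print(round(grid.best_score_, 2))
```

Keeping the scaler, selector and PCA inside the pipeline means they are refit on each training split, so no information from the held-out split leaks into feature selection.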

### Results

| Param/Score |    Gaussian NB    |    KNeighbors      |    Decision Tree        |
|-------------|-------------------|--------------------|-------------------------|
|Scaler       | MinMaxScaler()    | None               | MinMaxScaler()          |
|Selector     | k = 12            | k = all            | k = 14                  |
|Reducer      | n_components = 6  | n_components = 9   | n_components = 5        |
|Classifier   | N/A               | n_neighbors = 2    | criterion = gini        |
|.            |.                  | weights = distance | splitter = random       |
|.            |.                  | algorithm = auto   | min_samples_split = 3   |
|.            |.                  | leaf_size = 2      | class_weight = balanced |
|.            |.                  | p =                | min_samples_leaf = 1    |
|.            |.                  |.                   | max_depth = 5           |
|.            |.                  |.                   |.                        |
|Accuracy     | 0.82              | 0.79               | 0.69                    |
|Precision    | 0.36              | 0.24               | 0.23                    |
|Recall       | 0.30              | 0.21               | 0.48                    |
|F1           | 0.33              | 0.22               | 0.31                    |

The best result was obtained with a Pipeline of MinMaxScaler, SelectKBest(k=12), PCA(n_components=6) and <b>Gaussian NB</b>.
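As a consistency check, F1 is the harmonic mean of precision and recall, so the F1 row can be recomputed from the precision and recall rows of the table. The short sketch below does this; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the results table.
results = {
    "Gaussian NB": (0.36, 0.30),
    "KNeighbors": (0.24, 0.21),
    "Decision Tree": (0.23, 0.48),
}
f1_values = {name: round(f1_from(p, r), 2) for name, (p, r) in results.items()}
print(f1_values)
```

The recomputed values match the F1 row of the table, and they show why the Decision Tree, despite the highest recall, still ranks below Gaussian NB: its low precision pulls the harmonic mean down.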

## Validate and Evaluate
---


Validating the algorithm is a fundamental step: it ensures that the quality of the predictions matches our expectations.<br>
In this case accuracy is not very informative, because only about 12.5% of the people are POIs. What matters is that precision and recall reach meaningful values.<br>
It is also necessary to check that the algorithm runs fast enough for how quickly we need the answers.
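A short sketch shows why accuracy alone can mislead on this dataset. With the class balance reported earlier (18 POI / 125 non-POI), a trivial classifier that never predicts POI still reaches a high accuracy while finding no POIs at all. The helper `binary_metrics` is a hypothetical name, and precision is reported as 0.0 when there are no positive predictions.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / float(len(y_true))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Class balance reported in the data exploration section.
labels = [1] * 18 + [0] * 125
always_non_poi = [0] * len(labels)

accuracy, precision, recall = binary_metrics(labels, always_non_poi)
print(round(accuracy, 2), precision, recall)
```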

Because the dataset is very small, I did not expect very good results.

The code below prints the selected metrics; the results are better than I initially expected.

In [None]:
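# grid.score() returns F1 here because the search was built with scoring='f1',
# so accuracy is computed explicitly with accuracy_score.
from sklearn.metrics import accuracy_score, precision_score, recall_score

prediction_train = grid.predict(features_train)
prediction_test = grid.predict(features_test)
print("Accuracy Train %.2f" % accuracy_score(labels_train, prediction_train))
print("Accuracy Test %.2f" % accuracy_score(labels_test, prediction_test))
print("Precision Score %.2f" % precision_score(labels_test, prediction_test, average='binary'))
print("Recall Score %.2f" % recall_score(labels_test, prediction_test, average='binary'))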

In this project's context (fraud detection), I am interested in maximizing precision without neglecting the other metrics (accuracy, recall). I therefore chose F1 scoring and reviewed the parameters and properties to pick the configuration that produced high precision.

## References
[Wikipedia] https://en.wikipedia.org/wiki/Enron_scandal<br>
[Scikit-learn] https://scikit-learn.org/stable/<br>
[Quora] https://www.quora.com/How-do-I-properly-use-SelectKBest-GridSearchCV-and-cross-validation-in-the-sklearn-package-together<br>
[Busigence] http://busigence.com/blog/hyperparameter-optimization-and-why-is-it-important<br>

I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from Websites, books, forums, blog posts, github repositories, etc.