# Identifying Fraud from Enron Emails

Marcus Kehn



## Introduction
> Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question.


Some readers may not know of Enron: at one time it was one of the largest companies in the world. Shortly thereafter, widespread corporate fraud caused it to collapse into bankruptcy. This project deals with the aftermath of the rise and fall of a giant torn apart from the inside. We have been given access to something rather rare: a massive, publicly accessible dataset of personally identifying information that is legal to obtain. It became available as a result of a federal court case against Enron.

The dataset taken from these court proceedings has been augmented with a list of persons of interest in the federal court case mentioned above. Persons of interest (POIs) are people who were charged and found guilty of crimes, pleaded guilty in exchange for immunity, or reached a settlement.


The combined dataset has:
* 146 rows
* 14 financial features
* 6 email features
* one label (POI)



The goal of this project is to create a model that, given the ideal features, can correctly decide whether a person is a POI.
Because legally obtainable personally identifying data in bulk is so rare, this dataset has been used many times to help identify and reduce fraud.

## Data Exploration & Outlier Investigation
> Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]

First, we convert the dict to a DataFrame, as it's easier to work with.

The first thing I notice is that 'NaN' values and numeric data share the same columns. This tells me that the data is not stored as float or int values.
This means we need a bit of data cleaning. The first step is to focus on only the numeric columns, as NaN values in the string email_address column are acceptable for now.

Here we iterate through the column names and add every name except email_address, the only string column, to a list.
We then cast the numeric columns to float using pandas.DataFrame.astype(float), which is available on our variable df because it is a pandas.DataFrame.
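
The conversion and cast described above could look roughly like the sketch below. The two-record `data_dict` is a hypothetical stand-in for the project's data dictionary, and `numeric_columns` is an illustrative name, not necessarily the one used in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical two-record sample shaped like the project's data dictionary
data_dict = {
    "DOE JOHN": {"salary": 100000, "bonus": "NaN",
                 "email_address": "john.doe@enron.com", "poi": False},
    "ROE JANE": {"salary": "NaN", "bonus": 50000,
                 "email_address": "NaN", "poi": True},
}

# One row per person, one column per feature
df = pd.DataFrame.from_dict(data_dict, orient="index")

# Every column except email_address is treated as numeric
# (the boolean poi label becomes 0.0 / 1.0)
numeric_columns = [col for col in df.columns if col != "email_address"]

# float('NaN') is a valid conversion, so the 'NaN' strings become real NaN values
df[numeric_columns] = df[numeric_columns].astype(float)

print(df.dtypes)
```

After the cast, missing values in the numeric columns are genuine `numpy.nan` values, so pandas methods such as `isna()` work on them directly.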

Next, we replace the 'NaN' strings in the email_address column with numpy.nan so we can handle them later.

In [None]:
df.loc[df.email_address == 'NaN', 'email_address'] = np.nan

During the exploratory data analysis phase, I was able to weed out three obvious outliers:

* TOTAL: The sum of all the other records, so its values are much too large.
* THE TRAVEL AGENCY IN THE PARK: A travel agency cannot be an employee, so we remove it.
* LOCKHART EUGENE E: This record contains only null values, so we remove it for a more accurate representation of the data.

In [None]:
df = df[(df.index != 'TOTAL') & (df.index != 'THE TRAVEL AGENCY IN THE PARK') & (df.index != 'LOCKHART EUGENE E')]
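
As a rough illustration of how such records can be surfaced, the sketch below flags rows whose numeric features are all missing and the row holding the largest salary. The four-row frame and its index labels are hypothetical, and this is only one possible way to look for outliers, not necessarily how they were found here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: two ordinary rows, an aggregate row,
# and a row with no numeric data at all
df = pd.DataFrame(
    {
        "salary": [200000.0, 300000.0, 500000.0, np.nan],
        "bonus": [50000.0, 150000.0, 200000.0, np.nan],
    },
    index=["DOE JOHN", "ROE JANE", "TOTAL", "EMPTY RECORD"],
)

# Rows in which every numeric feature is missing carry no information
all_missing = df.index[df.isna().all(axis=1)].tolist()

# The largest salary belongs to the aggregate row (idxmax skips NaN)
largest_salary = df["salary"].idxmax()

# Drop the flagged rows by index label
cleaned = df.drop(index=all_missing + [largest_salary])

print(all_missing, largest_salary)
print(cleaned)
```

A non-person entry such as a travel agency cannot be detected this way; it has to be spotted by reading the index labels.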

Next, we move on to feature selection.


## Feature Selection
>What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]


We've developed two extra features: exercised_expense_ratio, which is the ratio of exercised_stock_options to expenses, and bonus_expense_ratio, which is
the ratio of bonus to expenses for the same record.

In [None]:
### Create the two new ratio features
df['exercised_expense_ratio'] = df.exercised_stock_options / df.expenses
df['bonus_expense_ratio'] = df.bonus / df.expenses
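
One thing to watch with ratio features is that a missing numerator or denominator yields NaN, and a zero denominator yields infinity. The sketch below shows this on a hypothetical four-row frame. Replacing the undefined ratios with 0 is an assumption made for illustration, not a step taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the normal case, a missing denominator,
# a missing numerator, and a zero denominator
df = pd.DataFrame(
    {
        "bonus": [100000.0, 50000.0, np.nan, 20000.0],
        "expenses": [20000.0, np.nan, 10000.0, 0.0],
    }
)

# Element-wise division: NaN propagates, and division by 0.0 gives inf
ratio = df["bonus"] / df["expenses"]

# Assumption for illustration: treat every undefined ratio as 0
df["bonus_expense_ratio"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

print(df)
```

Whether 0 is a sensible fill value depends on the model; it is used here only to keep the column finite.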

Instead of iterating over every feature included by default, we manually chose 3 of the 14 financial features at a time and put them into our feature list. From there, we used RFECV to verify that they were all of equal importance and to automatically remove any that weren't.

In [None]:
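from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2652124)
rfecv = RFECV(estimator=best_random_tree, step=1, cv=rkf,
              min_features_to_select=min_features_to_select, n_jobs=-1)
rfecv.fit(features_train, labels_train)  # ranking_ and n_features_ exist only after fitting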

In [None]:
###Print the ranking of features that RFECV suggests we use: 1 means keep, and 2 or higher marks less impactful features, in order of impact
print(rfecv.ranking_)

###[1 1 1 1 1]

In [None]:
###Print the ideal number of features suggested by the algorithm
print("Ideal number of features : %d" % rfecv.n_features_)

###Ideal number of features : 5

<img height="800" src="graph.png" width="800"/>

This means the final precision and recall values we obtain are based on features selected programmatically with RFECV.
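
For readers who want to try the selection step in isolation, here is a minimal, self-contained sketch of RFECV with a random forest and RepeatedKFold. It runs on synthetic data from `make_classification` rather than the Enron features, and the estimator settings, seeds, and `scoring="f1"` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold

# Synthetic stand-in for the project's features: 5 columns, 2 informative
X, y = make_classification(n_samples=150, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

# Small forest and few repeats so the sketch runs quickly
forest = RandomForestClassifier(n_estimators=9, random_state=0)
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

# Recursively drop one feature per step, scoring each subset by cross-validated F1
rfecv = RFECV(estimator=forest, step=1, cv=rkf, scoring="f1",
              min_features_to_select=1)
rfecv.fit(X, y)

print(rfecv.ranking_)     # 1 marks a selected feature
print(rfecv.n_features_)  # number of features kept
print(rfecv.support_)     # boolean mask of the kept features
```

A ranking made up entirely of 1s, as in the output above, means RFECV kept every feature it was given.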

## Algorithm Selection

>What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”]

We test two different algorithms, DecisionTreeClassifier and RandomForestClassifier. I went with these for two reasons:

1. Experience
2. No need to scale features

The ability to build, deploy, and realize an ML algorithm is very important, so I chose something I could work with quickly.
We also combine this with RFECV to implement recursive feature elimination with cross-validation.
This gives us a bit more confidence in our algorithm's results.

We eventually went with a combination of RandomForestClassifier and RFECV, with RepeatedKFold as our cross-validator.

In [None]:
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2652124)
rfecv = RFECV(estimator=best_random_tree, step=1, cv=rkf,
              min_features_to_select=min_features_to_select, n_jobs=-1)
rfecv.fit(features_train, labels_train)
y = rfecv.predict(features_test)

## Parameter and Algorithm Tuning
> What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]

Hyperparameters are settings we can use to change the behavior of the chosen algorithm. By tuning them, we try to find the optimal setting for the maximum F1 score.

Tuning needs caution, because overtuning the hyperparameters may result in overfitting and undertuning can lead to underfitting.

I used GridSearchCV to find the best parameters for the RandomForestClassifier from the following grid of possible choices:

In [None]:
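from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rclf = RandomForestClassifier()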

param_grid = {
    "n_estimators": [9, 18, 27, 36],
    "max_depth": [None, 1, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4, 6]}

# Use GridSearchCV to find the optimal hyperparameters for the classifier
rclf_grid = GridSearchCV(rclf, param_grid=param_grid, scoring='f1', cv=5)
rclf_grid.fit(features, labels)
# Get the best hyperparameters for the random forest
print(rclf_grid.best_params_)
best_random_tree = rclf_grid.best_estimator_

## Validation

We touched on this briefly before: we use cross-validation through RFECV, with RepeatedKFold as the
cross-validation splitter.

We don't want to underfit or overfit, but we also want consistent results from a "random" classifier.

For the sake of consistency and testing, we therefore set the random_state parameter of RepeatedKFold, which fixes the fold assignments.

In [None]:
# An integer seed must lie between 0 and 2**32 - 1, so we reuse the seed from the earlier cells
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2652124)
rfecv = RFECV(estimator=best_random_tree, step=1, cv=rkf,
              min_features_to_select=min_features_to_select, n_jobs=-1)
rfecv.fit(features_train, labels_train)


## Evaluation Metrics
> Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

Precision, recall, and F1 are useful for determining the success of our predictions.

- Precision measures: of all the sample records we classified as true, how many are actually true?
- Recall measures: of all the actually true sample records, how many did we classify as true?

F1 is the harmonic mean of precision and recall, with 1 as the ideal score and 0 as the least ideal.

- Our precision score of 0.44904 means that ~44.9% of the individuals labeled as POI were actually POI.

- Our recall score of 0.30400 means ~30.4% of POI in the dataset were identified correctly.

In [None]:
Accuracy: 0.84729	Precision: 0.44904	Recall: 0.30400	F1: 0.36255	F2: 0.32499
	Total predictions: 14000	True positives:  608	False positives:  746	False negatives: 1392	True negatives: 11254
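
As a sanity check, each score can be recomputed from the four confusion-matrix counts reported above. The sketch below uses the standard definitions of accuracy, precision, recall, and the F-beta score. The helper name `f_beta` is illustrative.

```python
# Confusion-matrix counts taken from the output above
tp, fp, fn, tn = 608, 746, 1392, 11254

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # share of predicted POIs that are real POIs
recall = tp / (tp + fn)     # share of real POIs that were found

def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(precision, recall, 1)  # harmonic mean of precision and recall
f2 = f_beta(precision, recall, 2)

print(f"Accuracy: {accuracy:.5f} Precision: {precision:.5f} "
      f"Recall: {recall:.5f} F1: {f1:.5f} F2: {f2:.5f}")
```

The high accuracy mostly reflects the large number of true negatives: non-POIs far outnumber POIs, which is why precision and recall are the more informative scores here.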

Sources:

http://scikit-learn.org/stable/modules/feature_selection.html

http://scikit-learn.org/stable/modules/model_evaluation.html

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedKFold.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html