# Question Answers for Enron Fraud Project
Patrick Cook   06/23/2021

### Summary and Introduction

The goal of this project is to use machine learning algorithms from the sklearn library to detect persons of interest related to the accounting fraud and eventual collapse of [Enron Corporation](https://en.wikipedia.org/wiki/Enron). The requirement is a model whose precision and recall scores are both above 0.30. Machine learning is useful for quickly comparing a high number of variables over thousands of observations and finding the most significant predictors of a target. The sklearn library provides optimized implementations of these methods and makes it easier to choose the top features, pick a good model, and tune its parameters for the best predictive scores.

Enron Corporation was founded in 1985 through a merger of two small regional energy companies and 15 years later, in 2000, reported revenue of over 100 billion dollars. A little over a year later the company filed for bankruptcy. An investigation found that Enron, along with partner companies, had participated in schemes and even accounting fraud to hide losses.

During data cleaning and exploration, the [Enron Fraud data set](https://github.com/trsvchn/ud120-projects-py3-jupyter) was found to contain 21 features about email messages, financial information and a 'poi' Boolean identifying persons of interest. There were initially 146 observations with 144 people, 1 business entity and a 'TOTAL' row. The 'TOTAL' row was removed early in the data exploration and analysis. The data set is unbalanced with only 18 persons of interest out of the 145 observations in the data set. 

There were multiple outliers in all categories except loan advances. The greatest outlier was the 'TOTAL' row identified above, followed by the top executive staff of Enron. The other outliers were real people involved in Enron. Their values were found to be legitimate, of the kind found in other real-world data, and many of the most extreme outliers were persons of interest. These outliers are possible strong markers for persons of interest, and removing them would hinder the model's ability to learn. In addition, the data set was already unbalanced, and removing these persons of interest would increase the imbalance. Therefore, these outliers were left in and scaling was used. There were numerous missing values in many of the features (see the table below). The observations with the highest missing counts were removed, and the remaining missing values were filtered during the featureFormat and testing process by the Udacity-provided functions.

Count of missing (null) values per feature:

|**feature**|**missing count**|
| :-- | :--: |
|loan_advances|                142|
|director_fees|                129|
|restricted_stock_deferred|    128|
|deferral_payments|            107|
|deferred_income|               97|
|long_term_incentive|           80|
|bonus|                         64|
|from_messages|                 60|
|shared_receipt_with_poi|       60|
|from_this_person_to_poi|       60|
|from_poi_to_this_person|       60|
|to_messages|                   60|
|other|                         53|
|expenses|                      51|
|salary|                        51|
|exercised_stock_options|       44|
|restricted_stock|              36|
|total_payments|                21|
|total_stock_value|             20|
|poi|                            0|

There were 12 observations removed, including 'TOTAL', for having too few non-null features, being a business rather than a person, or containing invalid data ('TOTAL'). This left the final data set with 134 observations. The observations were further reduced to 121 during model preprocessing by the feature_format and tester scripts, which drop observations that contain too few non-null features.
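As a minimal sketch of the cleaning steps above, the following counts missing values per feature and drops sparse observations. It assumes the Udacity dictionary layout in which missing values are stored as the string 'NaN'. The records, the helper names `missing_counts` and `drop_sparse`, and the `min_non_null` threshold are all hypothetical, not the values used in the project.

```python
# Toy records in the style of the Udacity data dictionary, where missing
# values are stored as the string 'NaN'. Names and numbers are made up.
data = {
    "PERSON A": {"poi": True, "salary": 250000, "bonus": 1000000, "expenses": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN", "expenses": "NaN"},
    "PERSON C": {"poi": False, "salary": 90000, "bonus": "NaN", "expenses": 4000},
    "TOTAL": {"poi": False, "salary": 340000, "bonus": 1000000, "expenses": 4000},
}


def missing_counts(records):
    """Count 'NaN' entries per feature across all records."""
    counts = {}
    for features in records.values():
        for name, value in features.items():
            counts[name] = counts.get(name, 0) + int(value == "NaN")
    return counts


def drop_sparse(records, min_non_null=2, exclude=("TOTAL",)):
    """Drop excluded keys and records with too few non-null features.

    min_non_null is a hypothetical threshold; 'poi' is not counted.
    """
    kept = {}
    for key, features in records.items():
        non_null = sum(
            1 for name, value in features.items()
            if name != "poi" and value != "NaN"
        )
        if key not in exclude and non_null >= min_non_null:
            kept[key] = features
    return kept


counts = missing_counts(data)
cleaned = drop_sparse(data)
print(counts)
print(sorted(cleaned))
```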

### Feature Selection, Reduction and Engineering
After cleaning and data exploration, the 20 features were analyzed using correlation matrices and pair-plot distributions. These were compared for all observations and then, for only persons of interest observations. 

It was felt that to_messages and from_messages alone would not contribute to the identification of persons of interest and could even lead to false positive and false negative identifications if used. Because of a strong correlation between shared_receipt_with_poi and to_messages, a ratio of the two was created. An additional ratio of from_this_person_to_poi and to_messages was created because of a moderate correlation between the two. Other new features added were the bonus to total pay ratio, the exercised stock options to total pay ratio, and total pay minus salary. These were based on the intuition that any payouts for fraud would not be found in salary but in less regular and less monitored pay.
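The engineered features described above could be computed along these lines. This is a sketch under assumptions: the new feature names are taken from the score table in the next section, but their mapping to formulas, the use of total_payments as "total pay", and the choice to return 0.0 for missing inputs are guesses for illustration. The helpers `safe_ratio` and `add_features` are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when a value is missing
    ('NaN') or the denominator is zero. The 0.0 fallback is an assumption."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return 0.0
    return numerator / denominator


def add_features(person):
    """Return a copy of one person's record with the engineered features added."""
    out = dict(person)
    out["bonus_totalpay_ratio"] = safe_ratio(person["bonus"], person["total_payments"])
    out["exec_stock_tot_ratio"] = safe_ratio(
        person["exercised_stock_options"], person["total_payments"]
    )
    out["shared_rec_to_mes_ratio"] = safe_ratio(
        person["shared_receipt_with_poi"], person["to_messages"]
    )
    if person["total_payments"] == "NaN" or person["salary"] == "NaN":
        out["total_minus_sal"] = 0.0
    else:
        out["total_minus_sal"] = person["total_payments"] - person["salary"]
    return out


# Made-up record for illustration only.
person = {
    "bonus": 200, "salary": 300, "total_payments": 1000,
    "exercised_stock_options": 500,
    "shared_receipt_with_poi": 30, "to_messages": 120,
}
engineered = add_features(person)
print(engineered)
```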

#### First Feature Selection Process
After running ```SelectKBest(score_func=f_classif, k='all')``` on training sets generated by running StratifiedShuffleSplit for 100 splits, the mean of each feature's scores was taken. Features were chosen if their mean score was above 50% of the maximum score. For this run, the maximum score was 17.2, so the cut score was 8.6. All features with scores higher than 8.6 were used, giving 7 features in the list. The reasoning for choosing features within 50% of the top score was to include only the high-quality features; I felt anything below 50% of the highest score would not be as successful at correct predictions. The feature scores chart below shows that deferred_income was borderline and might be excluded given slight changes to the selection methodology described above. Therefore, to further verify the SelectKBest results, the **7 features chosen will be further tested during classifier selection** and an optimal features list used. Running SelectKBest with f_classif gives the following mean scores; the features within 50% of the top score are the first seven rows.

|**features**|**mean kbest scores**|
| :-- | --: |
|exercised_stock_options|    17.199257|
|total_stock_value      |    16.417045|
|bonus_totalpay_ratio   |    13.690021|
|bonus                  |    13.230604|
|salary                 |    11.425296|
|from_poi_total_from_ratio|  11.132003|
|deferred_income        |     8.692029|
|long_term_incentive    |     7.926842|
|restricted_stock       |     6.328934|
|total_payments         |     6.098792|
|total_minus_sal        |     5.865185|
|shared_rec_to_mes_ratio|     5.812289|
|shared_receipt_with_poi|     5.123269|
|expenses               |     3.813305|
|other                  |     3.647613|
|from_poi_to_this_person|     3.576881|
|from_this_person_to_poi|     1.796707|
|deferral_payments      |     0.386815|
|exec_stock_tot_ratio   |     0.337839|


<img src="images/best_features_mean.png" alt="SelectKBest Features" width="600"/>



Using SelectKBest, the top 7 features, given below, were picked because their mean scores were above the cut score of 8.6.

1.  exercised_stock_options
2.  total_stock_value
3.  bonus_totalpay_ratio
4.  bonus
5.  salary
6.  from_poi_total_from_ratio
7.  deferred_income
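The selection procedure in this section can be sketched as follows. The example runs on synthetic data from `make_classification` rather than the Enron data, so the feature names and scores are placeholders, and the 30% test size for the splits is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced stand-in for the Enron features (not the real data).
X, y = make_classification(
    n_samples=134, n_features=8, n_informative=3, n_redundant=0,
    weights=[0.87], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Score every feature on the training portion of each of 100 splits.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
scores = []
for train_idx, _ in splitter.split(X, y):
    selector = SelectKBest(score_func=f_classif, k="all")
    selector.fit(X[train_idx], y[train_idx])
    scores.append(selector.scores_)

# Average the scores over the splits and keep features above 50% of the maximum.
mean_scores = np.mean(scores, axis=0)
cut_score = 0.5 * mean_scores.max()
ranked = sorted(zip(feature_names, mean_scores), key=lambda pair: -pair[1])
selected = [name for name, score in ranked if score > cut_score]

print(f"cut score: {cut_score:.2f}")
print(selected)
```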

These features were further analyzed by comparing the full list of features against reduced sets of the top 7 features in the chosen models. Of the seven initial features, the final model achieved its best scores with the top features: exercised_stock_options, total_stock_value and bonus. Scaling and non-scaling were both tested on all models: scaling did not change the results of models that do not require it and benefited models that do. During tuning, scaling was therefore used on all models tested, since it is necessary for models such as KNeighborsClassifier and has little to no effect on the others.


## Algorithms Tested and Used
### and Further Feature Testing

The 7 features above were tested on the following classifiers, chosen based on sklearn's [Choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) guide, to determine the best base estimator:

- LogisticRegression
- GaussianNB
- LinearSVC
- KNeighborsClassifier
- RandomForestClassifier
- GradientBoostingClassifier

#### Further classifier testing 
The top 7 features initially chosen were tested again on each classifier by using a for loop that chose only the top feature, then the top 2 features, then the top 3 features, and so on until the whole list of 7 top features had been included. The feature set giving the highest score was then used during hyperparameter tuning.
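The loop described above might look like the sketch below. It uses synthetic data whose columns are assumed to be already ordered from best to worst SelectKBest score, GaussianNB as a single example classifier, and mean F1 as the comparison score; the split settings are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in; assume columns are sorted from best to worst score.
X, y = make_classification(
    n_samples=134, n_features=7, n_informative=3, n_redundant=0,
    weights=[0.87], random_state=0,
)
cv = StratifiedShuffleSplit(n_splits=50, test_size=0.3, random_state=0)

# Evaluate the top 1 feature, then the top 2, and so on up to all 7.
results = {}
for k in range(1, X.shape[1] + 1):
    f1_scores = cross_val_score(GaussianNB(), X[:, :k], y, cv=cv, scoring="f1")
    results[k] = f1_scores.mean()

best_k = max(results, key=results.get)
print(results)
print("best number of top features:", best_k)
```

In the project this loop would be repeated for each candidate classifier so that every model gets its own best feature list.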

#### Algorithm Selection
It was found that GaussianNB, KNeighborsClassifier, RandomForestClassifier and GradientBoostingClassifier gave similar F1 and recall scores when each used its own modified feature list. Therefore, it was decided to hyper-tune all 4 classifiers, each with the best feature list for that model, and determine which best met the goal of maximizing F1 scores while balancing the precision versus recall trade-off. The base classifier test results, **including an analysis of the Top KBest features that performed the best**, are given below.

<br />
<br />
<img src='images/best_features.png' alt='Classifier comparison including number of top features used' width='80%'/>
<br />
<br />

All four algorithms were picked for tuning and the KNeighborsClassifier was chosen as the best classifier after tuning.




## Algorithm Tuning
Tuning an algorithm means adjusting its parameters to optimize the score or scores used to compare algorithms and decide which is the best choice. Some algorithms have few or no parameters to adjust, while others have many, some of which duplicate or even undo the effects of others. Poor tuning can lead to high bias or to overfitting, either of which gives poor results on real-world data.

Tuning should begin with the smallest number of parameters that have the greatest effect on the desired score. Finer control parameters can then be adjusted and optimized to see their effects on the score. Overfitting can make the classifier seem like it will work well, only for it to do poorly on unseen or real-world data.

To tune the 4 chosen algorithms, GridSearchCV was used with StratifiedShuffleSplit for cross-validation and a Pipeline that included the StandardScaler. This was done individually for each classifier. A broad range of values was initially picked using a list, np.arange or np.logspace, based on each parameter's accepted values. Then, a manual binary-style search was used to check values above and below the reported best value until the F1, recall and precision scores decreased or did not change significantly. Finally, the hyper-tuned model was run through the provided testing function to verify consistent scores.
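A minimal sketch of this tuning setup for KNeighborsClassifier is shown below, using GridSearchCV, sklearn's StratifiedShuffleSplit and a scaling Pipeline. The data is synthetic, and the parameter grid, number of splits and the choice of F1 as the search score are illustrative assumptions rather than the grid actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected features (not the real data).
X, y = make_classification(
    n_samples=134, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.87], random_state=0,
)

# Scaling lives inside the pipeline so it is fit on each training split only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical starting grid; step names prefix the parameter names.
param_grid = {
    "knn__n_neighbors": [2, 4, 6, 8],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After a broad pass like this, the grid can be narrowed around `search.best_params_` and the search repeated, which corresponds to the manual search above and below the best value.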

It was found that KNeighborsClassifier was the best model with the following parameters:

KNeighborsClassifier(leaf_size=2, n_neighbors=4, p=1, weights='distance')

This model was scaled using a Pipeline and then sent to tester.py test_classifier() for final results.

This gave the following scores when run on 1,300,000 samples:

	Accuracy: 0.87244	Precision: 0.63699	Recall: 0.39719	F1: 0.48929	F2: 0.42953

Smaller runs of 13,000 and 130,000 samples gave the same results to two decimal places.

The model is expected to have an accuracy of 87%. The precision is expected to be about 64%, meaning 64% of those identified as persons of interest will actually be persons of interest. This also means that 36% of those identified as persons of interest are not persons of interest. The recall is expected to be about 40%, meaning 40% of the persons of interest will be identified, leaving 60% of the persons of interest unidentified or missed.

This trade-off was chosen because it had the highest precision value while still obtaining the highest recall rate. It favors precision in order to reduce the cost of investigating innocent people and to minimize possible lawsuits from those falsely accused.

## Validation
Validation means training a model on one set of data (the training set) and scoring it on a separate set (the test set). The test data, or any other data held back from training, is the source of the model's performance statements. It is the independent data used to estimate the prediction scores that will be achieved in the real world, and those are the scores that should appear in reports and statements of the model's performance.

A common mistake is to train your model on the entire data set, giving it knowledge of all the 'answers'. **If the entire dataset is used for training, the model will likely have a high score when run again on any test data from the original data set.** This is because it has seen all the features and targets before. The model was created, refined and tuned on this data. **This leads to overconfidence and an overstatement of the model's ability to predict outcomes, because the high scores come from test data that is really just a smaller set of training data run through the model a second time.** When the model **sees brand new data that has never been seen before, it will likely perform poorly and not match the stated confidence levels given by the training data.**

To prevent this, cross-validation was conducted using StratifiedShuffleSplit. This was chosen because the target variable is unbalanced, with only 18 positives out of 134 observations. The test size was 30%, and 100, 1000, 10000 and 100000 iterations were run to verify that the results converged.
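The gap between scores on training data and scores on held-out data can be illustrated with a short sketch. It uses synthetic data and the tuned KNeighborsClassifier parameters inside a scaling pipeline; the number of splits is an assumption. With weights='distance', every training point is its own zero-distance neighbor, so scoring on the training rows gives a perfect F1 that says nothing about unseen data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the Enron data (not the real data).
X, y = make_classification(
    n_samples=134, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.87], random_state=0,
)
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(leaf_size=2, n_neighbors=4, p=1, weights="distance"),
)

# Mistake: fit on everything, then score on the same rows.
train_f1 = f1_score(y, model.fit(X, y).predict(X))

# Validation: fit on 70% and score on the held-out 30%, repeated over many splits.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
held_out = []
for train_idx, test_idx in splitter.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    held_out.append(f1_score(y[test_idx], predictions, zero_division=0))
held_out_f1 = float(np.mean(held_out))

print("F1 on training rows:", train_f1)
print("mean F1 on held-out rows:", round(held_out_f1, 3))
```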

## Evaluation Metrics
The performance metrics used to measure a model's performance will be Precision, Recall and the F1 score. These are based on the counts of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) values.

In terms of the Enron Fraud data set,
- A true positive occurs when a person is **predicted** by the model to be a person of interest and the person **is** a person of interest ("Got caught").
- A false positive occurs when a person is **predicted** by the model to be a person of interest and the person **is not** a person of interest ("Falsely accused").
- A true negative occurs when a person is **predicted** by the model to **not** be a person of interest and the person **is not** a person of interest ("Innocent and not suspected").
- A false negative occurs when a person is **predicted** by the model to **not** be a person of interest but **is** a person of interest ("Not suspected but guilty" or "Got away with it").

Precision gives the ratio of true positives ("Got caught") to all positive predictions ("Got caught" and "Falsely accused"). A 1 would be a perfect score (everyone who was accused was guilty) and a 0 would be the worst possible score (everyone accused was innocent and no guilty people got caught).

$$ Precision=\frac{TP}{TP+FP} $$

<br>
Recall gives the ratio of true positives ("Got caught") to the total number of actual positives in the dataset ("Got caught" and "Got away with it").

$$ Recall=\frac{TP}{TP+FN} $$
<br>
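As a small worked example of the two formulas above, the functions below compute precision and recall from confusion-matrix counts. The counts are hypothetical and are not results from this project.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when there are no positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 4 caught, 2 falsely accused, 6 got away with it.
tp, fp, fn = 4, 2, 6
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```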

The F1 score, the harmonic mean of precision and recall, has a penalizing effect for unbalanced (unequal) precision and recall scores. This penalty is accomplished by multiplying the recall and precision scores together in the numerator, so that the lower score "weights" the higher score and pulls the result lower.

As an example of the effect of the penalty:

If precision is 0.80 but recall is 0.5, then multiplying by recall reduces the precision score by 50%:

$$ {0.8}\times{0.5} = 0.4 $$

The whole formula for F1 based on the precision and recall scores is:

$$ F1=\frac{2({precision}\times{recall})}{precision+recall} $$

So the above example gives:

$$ F1=\frac{2({0.8}\times{0.5})}{0.8+0.5}=\frac{2(0.4)}{1.3} = \frac{0.8}{1.3}\approx 0.6154 $$

<br>
<br>
The F1 score is lower than the original precision score and slightly higher than the original recall score.
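The worked example can be checked with a few lines of Python. The sketch uses the general F-beta formula, of which F1 (beta = 1) and the F2 score reported by the tester (beta = 2) are special cases; the function name `f_beta` is chosen here for illustration. Applying it to the precision and recall reported for the chosen classifier reproduces the reported F1 and F2 values to five decimal places.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Worked example from the text: precision 0.8, recall 0.5.
example_f1 = f_beta(0.8, 0.5)

# Precision and recall reported for the tuned KNeighborsClassifier.
reported_f1 = f_beta(0.63699, 0.39719)
reported_f2 = f_beta(0.63699, 0.39719, beta=2)

print(round(example_f1, 4))
print(round(reported_f1, 5), round(reported_f2, 5))
```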

### Meaning of Chosen Classifier's Scores
The results of the chosen best classifier, KNeighborsClassifier(leaf_size=2, n_neighbors=4, p=1, weights='distance'), run on 1,300,000 samples are:

	Accuracy: 0.87244	Precision: 0.63699	Recall: 0.39719	F1: 0.48929	F2: 0.42953

Smaller runs of 13,000 and 130,000 samples gave the same results to two decimal places.

Based on these results, this model is expected to have an accuracy of 87%. The precision is expected to be about 64%, meaning 64% of those identified as persons of interest will actually be persons of interest. This also means that 36% of those identified as persons of interest are not persons of interest. The recall is expected to be about 40%, meaning 40% of the persons of interest will be identified, leaving 60% of the persons of interest unidentified or missed.

This trade-off was chosen because it had the highest precision value while still obtaining the highest recall rate. It favors precision in order to reduce the cost of investigating innocent people and to minimize possible lawsuits from those falsely accused.