# Project Summary

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.

The objective of this project is to build a Person of Interest identifier based on financial and email data made public as a result of the Enron scandal. A Person of Interest (POI) is an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity. 

To accomplish this objective, the project uses machine learning. Machine learning algorithms identify patterns in data and build models from the analyzed dataset that can then be applied to new datasets to make predictions. More specifically, the project relies on supervised learning, a machine learning task that learns a function mapping an input to an output from example input-output pairs. The supervised learning algorithm analyzes the training data and produces an inferred function that can be used to classify new examples.

The project consists of six major steps:
1. Data Overview
2. Feature Selection and Engineering
3. Algorithm Selection
4. Parameter Tuning
5. Analysis Validation
6. Evaluation


# 1. Data Overview

## Data Exploration

The first step in developing the project was to get an overview of the data. Therefore, an exploratory data analysis was performed to determine the size of the data, the number of features, the number of POIs, etc. The results of the analysis show:
1. The data set comprises 146 entries, representing the Enron executives whose email and financial data are present in the data set
2. Each data set entry has 21 features
3. There are 18 POIs in the dataset out of the 34 POIs identified in the fraud investigation (not all of them worked for Enron), meaning that the remaining 128 entries are non-POIs
4. There are 1358 missing values in the dataset, distributed as follows:

| Features                          | No. of missing values  |
|-----------------------------------|------------------------|
|   salary                          | 51                     |
|   to_messages                     | 60                     |
|   deferral_payments               | 107                    |
|   total_payments                  | 21                     |
|   loan_advances                   | 142                    |
|   bonus                           | 64                     |
|   email_address                   | 35                     |
|   restricted_stock_deferred       | 128                    |
|   total_stock_value               | 20                     |
|   shared_receipt_with_poi         | 60                     |
|   long_term_incentive             | 80                     |
|   exercised_stock_options         | 44                     |
|   from_messages                   | 60                     |
|   other                           | 53                     |
|   from_poi_to_this_person         | 60                     |
|   from_this_person_to_poi         | 60                     |
|   poi                             | 0                      |
|   deferred_income                 | 97                     |
|   expenses                        | 51                     |
|   restricted_stock                | 36                     |
|   director_fees                   | 129                    |
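
The per-feature counts above can be reproduced with a few lines of code. The sketch below is a minimal illustration and not necessarily the code used for this report: it assumes the data is a dictionary of per-person feature dictionaries and that a missing value is stored as the string "NaN". The function name `count_missing` and the toy records are hypothetical.

```python
from collections import Counter

MISSING = "NaN"  # assumption: placeholder string used for a missing value


def count_missing(data_dict, missing=MISSING):
    """Return a Counter mapping each feature name to its number of missing values."""
    counts = Counter()
    for features in data_dict.values():
        for name, value in features.items():
            # A boolean adds 1 when the value is missing and 0 otherwise,
            # so features without missing values still appear with a count of 0.
            counts[name] += value == missing
    return counts


# Toy records invented for illustration; they are not taken from the Enron data.
sample = {
    "PERSON A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
}

missing = count_missing(sample)
for feature, n_missing in missing.items():
    print(feature, n_missing)
print("total missing:", sum(missing.values()))
```

Summing the counter values gives the overall number of missing values, which corresponds to the total reported in the list above.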


## Outlier Investigation


For the outlier analysis, two features were extracted from the dictionary, namely "salary" and "bonus". Subsequently they were used as input for the scatterplot below.

![Outliers included](Figure_1_w_out.png)


Looking at the scatterplot, one outlier stands out immediately. A closer look at the data showed that this outlier actually represents the total value of the salaries and bonuses, so it was removed. Additionally, "THE TRAVEL AGENCY IN THE PARK" was removed from the dataset because it was not considered representative and all of its values were zero, except for the "other" column.
The scatterplot below shows the data after removing the outlier.

![Outliers excluded](Figure_1_wo_out.png)
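
Removing the two entries amounts to dropping their keys from the data dictionary. The following is a minimal sketch under the assumption that the aggregate row is stored under the key "TOTAL" (the key name is an assumption, it is not stated above); `remove_entries` and the toy values are hypothetical.

```python
def remove_entries(data_dict, keys):
    """Return a copy of data_dict without the given keys; absent keys are ignored."""
    drop = set(keys)
    return {name: feats for name, feats in data_dict.items() if name not in drop}


# "TOTAL" is an assumed key for the row holding the column totals.
NON_PERSON_KEYS = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]

# Toy values invented for illustration only.
sample = {
    "PERSON A": {"salary": 100000, "bonus": 50000},
    "TOTAL": {"salary": 9000000, "bonus": 7000000},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": 0, "bonus": 0},
}

cleaned = remove_entries(sample, NON_PERSON_KEYS)
print(sorted(cleaned))
```

Returning a copy instead of deleting in place keeps the original dictionary available for plotting the "before" scatterplot.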

After removing the outlier, the scatterplot still shows a few possible outliers. According to the provided PDF with the financial data, these values belong to:
* Kenneth Lay
* Jeffrey Skilling
* Mark Frevert
* John Lavorato

Even though their bonuses and salaries are higher than the rest, these "outliers" were not removed since they are considered valid data points.

# 2. Feature Selection and Engineering

## Feature Engineering

After removing the outliers from the Salary vs Bonus scatter plot, two other important features were identified, namely "from_poi_to_this_person" and "from_this_person_to_poi". Since no strong pattern was observed in the raw counts, two new features were created:
* fraction_from_poi - the ratio of the number of emails from POIs to the total number of from messages
* fraction_to_poi - the ratio of the number of emails to POIs to the total number of to messages
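
The two ratios can be computed per person as sketched below. This is only an illustration: the denominators follow the wording of the list above, the string "NaN" is assumed to mark missing values, and the helper names `compute_fraction` and `add_fraction_features` are hypothetical. Missing or zero totals are mapped to 0.0, which is one possible convention.

```python
import math


def compute_fraction(poi_messages, all_messages):
    """Return poi_messages / all_messages, or 0.0 if a value is missing or the total is 0."""
    for value in (poi_messages, all_messages):
        if value == "NaN" or (isinstance(value, float) and math.isnan(value)):
            return 0.0
    if all_messages == 0:
        return 0.0
    return poi_messages / all_messages


def add_fraction_features(record):
    """Return a copy of one person's record with the two engineered features added."""
    out = dict(record)
    # Denominators as described in the list above (an assumption of this sketch).
    out["fraction_from_poi"] = compute_fraction(
        record.get("from_poi_to_this_person", "NaN"), record.get("from_messages", "NaN")
    )
    out["fraction_to_poi"] = compute_fraction(
        record.get("from_this_person_to_poi", "NaN"), record.get("to_messages", "NaN")
    )
    return out


# Toy record invented for illustration.
person = {
    "from_poi_to_this_person": 10,
    "from_messages": 40,
    "from_this_person_to_poi": 5,
    "to_messages": 50,
}
enriched = add_fraction_features(person)
print(enriched["fraction_from_poi"], enriched["fraction_to_poi"])
```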

The scatterplot of the two newly created features can be seen below.
![Messages to and from POIs](scaled.png)

Some black stars can be observed in the scatterplot; they represent the messages exchanged between POIs. Highlighting them shows that non-POIs exchanged many more messages among themselves than the POIs did.

## Feature Selection

The next step was to select the features to be used by the algorithm. A forest of trees was used to evaluate the importance of the features, and the top ten features by importance were selected. Moreover, some features considered important and relevant for the analysis were added manually. The final list of selected features is:
1. total_payments (0.113846)
2. restricted_stock_deferred (0.090760)
3. fraction_from_poi (0.089054)
4. director_fees (0.075426)
5. poi (0.066242)
6. total_stock_value (0.063815)
7. deferral_payments (0.061640)
8. exercised_stock_options (0.061081)
9. deferred_income (0.058439)
10. bonus (0.049589)
11. salary (0.031417)
12. expenses (0.014175)
13. from_poi_to_this_person (0.000512)
14. fraction_to_poi

Of the engineered features, fraction_from_poi was included in the selected list since it ranked third by importance. Initially, the second engineered feature, fraction_to_poi, was not included because it had the lowest importance score. However, adding it to the list improved the accuracy of the AdaBoost classifier from 0.819 to 0.847.
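
A ranking like the one above can be obtained from the `feature_importances_` attribute of a tree ensemble in scikit-learn. The sketch below assumes `ExtraTreesClassifier` as the forest (the report does not name the exact estimator, and `RandomForestClassifier` exposes the same attribute) and uses synthetic data from `make_classification` in place of the Enron features. The function `rank_features` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def rank_features(X, y, names, n_estimators=250, seed=0):
    """Fit a forest of trees and return (name, importance) pairs, most important first."""
    forest = ExtraTreesClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(X, y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]  # indices sorted by decreasing importance
    return [(names[i], float(importances[i])) for i in order]


# Synthetic stand-in data; the feature names are placeholders.
X, y = make_classification(n_samples=144, n_features=6, n_informative=3, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

ranking = rank_features(X, y, names, n_estimators=50)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name} ({score:.6f})")
```

The importances of a fitted forest sum to 1, so each score can be read as a feature's relative share. The label itself should not be part of the feature matrix when ranking.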



# 3. Algorithm Selection

To select the algorithm, precision and recall were computed in addition to accuracy for several classifiers.

| Classifiers                   | Accuracy | Precision | Recall   |
|-------------------------------|----------|-----------|----------|
|   AdaBoost                    | 0.83053  | 0.33333   | 0.2710   |
|   RandomForest                | 0.86173  | 0.43707   | 0.1285   |
|   DecisionTree                | 0.81053  | 0.27935   | 0.2665   |
|   GaussianNB                  | 0.81273  | 0.28766   | 0.2740   |

AdaBoost and Random Forest are the classifiers selected for parameter tuning.


# 4. Parameter Tuning

Machine learning algorithms have parameters that must be set manually by the person using the algorithm. Parameter tuning means selecting the parameter values that optimize the performance of an algorithm.

Two algorithms were selected for parameter tuning, AdaBoost and RandomForest, since they had the highest accuracy and precision (> 0.3). The goal of parameter tuning is to increase the recall to at least 0.3.

### RandomForest parameter tuning

The first parameter tuned was "n_estimators". The best performance was reached with n_estimators = 100. However, the improvement was not sufficient since recall remained below 0.3.

| Parameters                   | Accuracy | Precision | Recall   |
|------------------------------|----------|-----------|----------|
|   n_estimators = 10          | 0.86173  | 0.43707   | 0.1285   |
|   n_estimators = 50          | 0.86467  | 0.47440   | 0.1390   |
|   n_estimators = 100         | 0.86567  | 0.48731   | 0.1440   |

Since the recall scores were far from the desired level (> 0.3), the parameters of the AdaBoost classifier were tuned next.

### AdaBoost parameter tuning

The first parameter tuning performed for AdaBoost was changing the algorithm from SAMME.R to SAMME.

| Parameters                   | Accuracy | Precision|Recall    |
|------------------------------|----------|----------|----------|
|   algorithm = SAMME.R        |0.83027   |0.33231   |0.2705    |
|   algorithm = SAMME          |0.84280   |0.36460   |0.2410    |

It can be observed that while the precision increased, the recall decreased. Therefore, the algorithm was kept at SAMME.R and the "n_estimators" parameter was selected for tuning. The default value for this parameter is 50, and the value was gradually increased in order to reach the desired recall level.

| Parameters                   | Accuracy | Precision | Recall   |
|------------------------------|----------|-----------|----------|
|   n_estimators = 50          | 0.83027  | 0.33231   | 0.2705   |
|   n_estimators = 100         | 0.8418   | 0.3769    | 0.2855   |
|   n_estimators = 150         | 0.846    | 0.39427   | 0.289    |
|   n_estimators = 200         | 0.84687  | 0.39905   | 0.2935   |
|   n_estimators = 300         | 0.84573  | 0.39547   | 0.2970   |
|   n_estimators = 600         | 0.84573  | 0.39671   | 0.3015   |

The recall score reached 0.30 when n_estimators was set to 600. Therefore, this is the selected model.
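
A sweep over the number of estimators like the one in the table can be sketched as follows. This is an illustration under several assumptions: synthetic, imbalanced data from `make_classification` replaces the Enron features, the scores are averaged over a `StratifiedShuffleSplit` (the report does not state which validation scheme produced its numbers), and the function `sweep_n_estimators` is hypothetical. The boosting algorithm is left at the library default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

METRICS = ("accuracy", "precision", "recall")


def sweep_n_estimators(X, y, values, n_splits=5, seed=42):
    """Return {n_estimators: {metric: mean score over the splits}} for AdaBoost."""
    cv = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=seed)
    results = {}
    for n in values:
        clf = AdaBoostClassifier(n_estimators=n, random_state=seed)
        scores = cross_validate(clf, X, y, cv=cv, scoring=METRICS)
        results[n] = {m: float(scores[f"test_{m}"].mean()) for m in METRICS}
    return results


# Synthetic stand-in data with roughly one positive in eight samples.
X, y = make_classification(n_samples=144, weights=[0.875, 0.125], random_state=0)

results = sweep_n_estimators(X, y, values=[10, 25, 50])
for n, row in results.items():
    print(n, {metric: round(value, 4) for metric, value in row.items()})
```

Reporting precision and recall next to accuracy matters here because, with so few positive samples, a classifier can reach high accuracy while missing most POIs.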

# 5. Analysis Validation

Validation means partitioning the dataset into two different subsets. The model is fitted on one subset, called the training set, while the validation of the analysis is performed on the other subset, called the testing set.
If the data is not partitioned and the model trains and tests on the same dataset, overfitting goes undetected and the reported performance is overly optimistic.
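
The partitioning can be illustrated with a small stratified hold-out split, which keeps the share of POIs roughly equal in both subsets. This is a minimal sketch, not the validation code behind the reported scores: the function `stratified_split`, the 30% test fraction and the label counts in the example are assumptions chosen for illustration.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one sample of every class in the testing set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Illustrative labels: 18 positives (POIs) and 126 negatives.
labels = [1] * 18 + [0] * 126
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "POIs in the testing set")
```

With so few POIs, a plain random split could leave the testing set with almost no positive samples, which is why the split is done per class.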

# 6. Evaluation

As previously mentioned, the evaluation metrics used to assess the final model were accuracy, recall and precision. The table below shows the performance of the model.

| Classifier                   | Accuracy | Precision | Recall   |
|------------------------------|----------|-----------|----------|
|   AdaBoost                   | 0.84573  | 0.39671   | 0.3015   |

Accuracy is the number of correct predictions divided by the total number of predictions made by the model. In this case, the model predicted correctly in approx. 85% of the cases. Precision is the share of the items flagged by the model that are actually relevant: of all the people identified as POIs, approx. 40% were actually POIs. A recall of 30% means that the model correctly identified 30% of all the POIs in the dataset.
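
The three definitions can be made concrete by computing them from the counts of true and false positives and negatives. The sketch below uses invented labels and predictions, not the model's actual output, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels where 1 marks a POI."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # POIs flagged as POIs
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-POIs flagged as POIs
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # POIs that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-POIs left unflagged

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Invented example: 3 POIs among 10 people, 2 people flagged by the model.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy is high mainly because most people are non-POIs, while the recall shows that two of the three POIs were missed. The same effect explains the gap between accuracy and recall in the table above.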
