# Enron POI Identifier 

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.


## Question 1
>Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]

Using the financial and email data made public as a result of the Enron scandal, I have produced a person of interest (POI) identifier that classifies each member of the corporation as a POI or a non-POI. A POI is a person of interest in the fraud case: an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution.

The features available for each person are listed below:

- **financial features:** ``` ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']``` 

(all units are in US dollars)



- **email features:** ```['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']```

(units are generally the number of email messages; the notable exception is 'email_address', which is a text string)



- **POI label:** ```['poi']```

(boolean, represented as integer)

### Data Summary


- Number of people: **146**
- Number of POIs: **18**
- Number of non-POIs: **128**


- Total data points: **3066** (146 people × 21 features each)
- Null values: **1358**
- Non-null values: **1708**

Many of the features contain missing (NaN) values. The number of NaN values per feature is:

```
salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 35
from_poi_to_this_person       60
```
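As a rough illustration, per-feature counts like the ones above could be produced with a few lines of Python. This is a minimal sketch, not the project's actual code: `sample_data` and `count_missing` are hypothetical names, and it assumes the data is a dict of dicts in which missing values are stored as the string `"NaN"`.

```python
from collections import Counter

# Hypothetical two-person sample in a dict-of-dicts layout.
# Assumption: missing values are stored as the string "NaN".
sample_data = {
    "PERSON A": {"salary": 250000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
}


def count_missing(data_dict):
    """Return a Counter mapping each feature name to its number of "NaN" entries."""
    missing = Counter()
    for person_features in data_dict.values():
        for name, value in person_features.items():
            # Adding 0 still registers the feature, so complete features report 0.
            missing[name] += 1 if value == "NaN" else 0
    return missing


missing_counts = count_missing(sample_data)
for name, count in missing_counts.items():
    print(name, count)
```

Summing the counts over all features gives the total number of null values for the dataset.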



### Investigating Outliers 

Next I investigated outliers in my features. When I plotted salary against bonus in a scatterplot, one outlier called "TOTAL" immediately popped out. Its name suggests an aggregate row rather than a person, so it does not belong in the data and I removed it from the final dataset.

<img src="img/outliers.png">

After removing the outlier, we can see that the data had been heavily skewed by that value. Having examined the remaining salary and bonus values, I decided to keep them in the dataset, as they look valid and could contribute to the POI classifier.

<img src="img/removed.png">
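Removing the "TOTAL" entry can be sketched as follows. The miniature `data_dict` and its values are hypothetical and only mimic a dict keyed by person name, which is an assumption about how the data is held.

```python
# Hypothetical miniature dataset keyed by name; the values are made up.
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": 500000},
    "PERSON B": {"salary": 350000, "bonus": 1200000},
    "TOTAL": {"salary": 550000, "bonus": 1700000},
}

# pop() with a default does not raise a KeyError if the entry is already gone.
removed = data_dict.pop("TOTAL", None)

print(sorted(data_dict))
```

Only the "TOTAL" entry is dropped; the remaining high salary and bonus values stay in place.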

## Question 2
>What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]

### Feature Creation


Next I created two features: 
- ```fraction_to_poi```, the ratio of messages sent by this person to POIs to all messages sent by this person.
- ```fraction_from_poi```, the ratio of messages sent by POIs to this person to all messages sent to this person.

My reasoning is that POIs are likely to communicate with other POIs, so they should have a higher fraction of messages to and from each other. I think ```fraction_to_poi``` will hold more value: a POI may send global emails throughout the company, reaching a lot of employees who are not involved, whereas it is my hunch that a message sent TO a particular POI could indicate involvement in the fraud.
I examined this in a scatterplot with the POIs highlighted. The feature appears to have some value for classifying POIs, as the results suggest that POIs communicate frequently with each other.

<img src="img/fract_img.png">
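The two ratios can be computed along these lines. This is a minimal sketch: `compute_fraction` and the `person` record are hypothetical, and treating missing (`"NaN"`) counts as a fraction of 0.0 is an assumption of the sketch rather than something stated in this report.

```python
def compute_fraction(poi_messages, all_messages):
    """Return poi_messages / all_messages.

    Assumption: a missing ("NaN") count or a zero denominator yields 0.0.
    """
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return poi_messages / float(all_messages)


# Hypothetical record using the dataset's email feature names.
person = {
    "from_this_person_to_poi": 30,
    "from_messages": 120,
    "from_poi_to_this_person": "NaN",
    "to_messages": "NaN",
}

person["fraction_to_poi"] = compute_fraction(
    person["from_this_person_to_poi"], person["from_messages"])
person["fraction_from_poi"] = compute_fraction(
    person["from_poi_to_this_person"], person["to_messages"])

print(person["fraction_to_poi"], person["fraction_from_poi"])  # 0.25 0.0
```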

### Feature Selection
I used scikit-learn's ```SelectKBest``` to select the 10 most influential features. The scores for every candidate feature are shown below, highest first.

```
exercised_stock_options - 25.0975415287
total_stock_value - 24.4676540475
bonus - 21.0600017075
salary - 18.575703268
**fraction_to_poi - 16.8130220412**
deferred_income - 11.5955476597
long_term_incentive - 10.0724545294
restricted_stock - 9.34670079105
total_payments - 8.86672153711
shared_receipt_with_poi - 8.74648553213
loan_advances - 7.24273039654
expenses - 6.23420114051
from_poi_to_this_person - 5.34494152315
other - 4.2049708583
**fraction_from_poi - 3.44275149481**
from_this_person_to_poi - 2.42650812724
director_fees - 2.10765594328
to_messages - 1.69882434858
deferral_payments - 0.21705893034
from_messages - 0.164164498234
restricted_stock_deferred - 0.0649843117237
```
This leaves the following 10 most influential features:

```
['exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'fraction_to_poi', 'deferred_income', 'long_term_incentive', 'restricted_stock', 'total_payments', 'shared_receipt_with_poi']
```
Notice that our engineered feature ```fraction_to_poi``` makes an appearance in the top 10 features.
Most of the selected features are financial; only two of them (```fraction_to_poi``` and ```shared_receipt_with_poi```) relate to email.
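For readers unfamiliar with ```SelectKBest```, the sketch below shows how scores and a selection like the ones above can be obtained. It runs on synthetic data with made-up feature names, and it assumes the default scoring function `f_classif` (the ANOVA F-value); the report does not state which scoring function was actually used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: one informative column and two noise columns.
rng = np.random.RandomState(42)
labels = np.array([0] * 40 + [1] * 10)
informative = labels * 3.0 + rng.normal(size=50)
features = np.column_stack(
    [rng.normal(size=50), informative, rng.normal(size=50)])
feature_names = ["noise_a", "informative", "noise_b"]  # hypothetical names

# Assumption: f_classif, the SelectKBest default, is the scoring function.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(features, labels)

# Print every feature with its score, highest first.
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%s - %.3f" % (name, score))

# get_support() is a boolean mask over the columns that were kept.
selected = [name for name, keep
            in zip(feature_names, selector.get_support()) if keep]
print(selected)
```

The choice of `k` is a parameter like any other, so it can also be searched over rather than fixed by hand.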

### Feature Scaling

Most of my features have different units and some of them take very large values, so I needed to transform them. I used MinMaxScaler from sklearn to scale all my features to a given range (between 0 and 1).

```
from sklearn import preprocessing

# Rescale every feature column to the range 0 to 1.
scaler = preprocessing.MinMaxScaler()
features = scaler.fit_transform(features)
```
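To make the transformation concrete, here is a minimal pure-Python sketch of the min-max formula, `(x - min) / (max - min)`, applied to one column at a time. The function name and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        # Avoid dividing by zero; this sketch maps a constant column to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / float(high - low) for value in values]


# Hypothetical columns on very different scales.
salaries = [50000, 200000, 1000000]
to_messages = [100, 550, 1000]

print(min_max_scale(salaries))
print(min_max_scale(to_messages))  # [0.0, 0.5, 1.0]
```

After scaling, a dollar amount in the millions and an email count in the hundreds both lie between 0 and 1, so neither dominates a distance-based classifier such as K Nearest Neighbours purely because of its units.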

## Question 3
>What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]

### Algorithm Selection
http://scikit-learn.org/stable/tutorial/machine_learning_map/

<img src="img/flow.png">

To decide on a classification algorithm I used the cheat sheet above. Suitable algorithms for this problem are:
    
- LinearSVC
- KNeighbors Classifier
- Ensemble methods
    
For this report I am going to explore KNeighbors, Decision Trees and Linear SVC.

I created a basic test / train split on the data and obtained accuracy scores from each of the default classifiers.
Next I applied feature scaling and checked the results again.

| Classifier           | Accuracy (No Scaling) | Accuracy (Scaling) |
|----------------------|-----------------------|--------------------|
| Decision Tree        | 0.84                  | 0.86               |
| K Nearest Neighbours | 0.89                  | 0.86               |
| Linear SVC           | 0.8                   | 0.89               |



## Question 4

>What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]

### Algorithm Tuning

Upon testing with the basic test / train split, I realised this would not be an accurate way to test each classifier because of the small number of POIs in our dataset. When the data is split, the POIs may end up unevenly distributed between the training and test sets, which affects our recall and precision scores.

I took a look at how the assignment is graded: it uses a StratifiedShuffleSplit. Here we repeatedly 'shuffle' our data, creating testing and training sets in a stratified way (each training/testing split contains approximately the same proportion of POIs to non-POIs).
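The effect of stratification can be checked with a small sketch. It uses placeholder features together with the 18 POI / 128 non-POI label counts from the data summary; the number of splits and the test size are arbitrary choices for illustration, and the import assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Labels with the class balance from the data summary: 18 POIs, 128 non-POIs.
labels = np.array([1] * 18 + [0] * 128)
features = np.zeros((len(labels), 1))  # placeholder; only the labels matter here

# n_splits and test_size are arbitrary values chosen for this illustration.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

poi_in_test = []
for train_idx, test_idx in splitter.split(features, labels):
    # Count how many POIs landed in the test set of this split.
    poi_in_test.append(int(labels[test_idx].sum()))

print(poi_in_test)
```

Every split places close to 30% of the POIs in the test set, whereas a plain random split could by chance leave the test set with very few POIs or none at all.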

My next step was to create a pipeline to train and test each classifier. A pipeline allows multiple processing steps to be chained together and fed into ```GridSearchCV```.

The pipeline consisted of the following steps:
- Min/Max Scale features
- Select K Best features

```GridSearchCV``` was used to try each combination of classifier parameters and compute its F1 score; the parameters with the highest F1 score can then be selected.

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
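As a worked example, the F1 score can be computed directly from precision and recall. The helper name is hypothetical; the inputs are the rounded decision tree precision (0.39) and recall (0.56) reported in the results table of this section.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded decision tree precision and recall from the results table.
decision_tree_f1 = f1_from(0.39, 0.56)
print(round(decision_tree_f1, 2))  # 0.46
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values, so a classifier cannot score well by having high precision with very poor recall, or the reverse.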

```GridSearchCV``` performs the cross-validation internally. That is, with ```StratifiedShuffleSplit``` as the cross-validation method, ```GridSearchCV``` creates the 1000 test/train splits internally and returns the parameters for which the average metric is highest.
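Putting the pieces together, a pipeline with a grid search might look like the sketch below. It is an illustration under several assumptions: the data is synthetic, the parameter grid is hypothetical, only 10 splits are used to keep it quick, and the imports assume a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 80 samples, 5 features, 20 positives.
rng = np.random.RandomState(42)
labels = np.array([0] * 60 + [1] * 20)
features = rng.normal(size=(80, 5))
features[:, 0] += labels * 2.0  # make the first column informative

# Scale, select features, then classify; step names prefix the grid keys.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("kbest", SelectKBest()),
    ("DT", DecisionTreeClassifier(random_state=42)),
])

# Hypothetical grid; keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "kbest__k": [1, 3, 5],
    "DT__max_depth": [2, 4],
    "DT__min_samples_leaf": [1, 4],
}

# 10 stratified splits here purely for speed.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(features, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Keeping the scaler and feature selector inside the pipeline means they are refitted on the training portion of every split, so no information from the test portion leaks into the preprocessing.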

I tested and tuned Linear SVC, KNN and DT and had the following results:

| Classifier           | Accuracy | Precision | Recall   | F1       |
|----------------------|----------|-----------|----------|----------|
| Linear SVC           | 0.87     | 0.58      | 0.18     | 0.27     |
| K Nearest Neighbours | 0.84     | 0.26      | 0.07     | 0.12     |
| **Decision Tree**    | **0.83** | **0.39**  | **0.56** | **0.46** |


The following parameters were best when using a decision tree:

```
('DT', DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
            max_depth=2, max_features=None, max_leaf_nodes=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'))])
```

Next, I further tuned my decision tree algorithm by adding KBest features into the pipeline.


## Question 5
>What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]

Validation is the process of checking the model's predictions against data that hasn't been used to train the model. It measures how well the algorithm has learned from the training data in order to accurately label new test data.
We want our algorithm to perform well on the training data, but not so well that it simply reproduces the training data. This is called overfitting: the model memorises noise in the training set instead of learning the patterns that generalise to new data.
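Overfitting is easy to demonstrate with a small sketch on synthetic data. The labels below are pure noise, so there is no real pattern to learn; the data sizes and the unconstrained decision tree are choices made only for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Random features with random labels: nothing here generalises.
train_features = rng.normal(size=(200, 5))
train_labels = rng.randint(0, 2, size=200)
test_features = rng.normal(size=(200, 5))
test_labels = rng.randint(0, 2, size=200)

# With no depth limit the tree can memorise every training point.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(train_features, train_labels)

train_accuracy = tree.score(train_features, train_labels)
test_accuracy = tree.score(test_features, test_labels)
print(train_accuracy, test_accuracy)
```

The tree scores perfectly on the data it was trained on, while its accuracy on unseen data stays near chance. Judging the model by its training score alone would hide this, which is why performance is measured on held-out data.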

## Question 6
> Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

Recall -

Precision - 

F1 - 


### References
- Pipeline - http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
- Tuning - https://stackoverflow.com/questions/30102973/how-to-get-best-estimator-on-gridsearchcv-random-forest-classifier-scikit
- Cheatsheet - http://scikit-learn.org/stable/tutorial/machine_learning_map/
- F1 - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
- GridSearch - https://discussions.udacity.com/t/select-the-best-classifier-through-selectskbest-gridsearchcv/212664/5


