## Project goal
My personal target in this project is to identify at least one more person than was on the official list of the Enron fraud investigation. I am well aware that statistical methods are not sufficient as proof in a fraud case, so the name will not be published. However, I am interested in how very simple information, like the financial and email network data given here, can uncover connections.
Out of personal interest, I also want to see how far I can get with non-neural-network methods. As long as no PCA is used, the non-NN methods are frequently able to give insight.



### Some remarks
I am using Python 3.6.

I needed to change all print commands to include brackets and, more importantly, I needed to change the file modes for the pickle dump() and load() calls to use rb instead of r and wb instead of w.

Should the 2.7 version from Udacity not be able to execute my code, please try Python 3.6 and use my modified tester function.
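
As a minimal sketch of the binary file modes mentioned above: the dictionary content and the temporary file name below are made up for illustration and are not part of the project files.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the project data.
sample = {"SOME PERSON": {"salary": 1000, "poi": False}}

path = os.path.join(tempfile.mkdtemp(), "sample.pkl")

# Python 3: pickle writes bytes, so the file must be opened in binary mode.
with open(path, "wb") as out_file:
    pickle.dump(sample, out_file)

# Reading likewise needs "rb"; plain "r" would hand pickle a text stream.
with open(path, "rb") as in_file:
    restored = pickle.load(in_file)

print(restored == sample)
```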

## Data investigation
The code for the investigation can be found in 01_Data_investigation.py

### outlier detection
Outlier detection is the central goal of this project - any unusual salary would be "interesting". However, in the first phase, when checking the data for the first time, one point was really special:
<img src="Outlier detection.png">
This picture was saved from the console during the investigation phase. On checking the data, the point at the top right turned out to be a "Total" row that really shouldn't be in the dataset. It was removed.
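
A small sketch of such a removal, assuming the data is held as a dictionary of per-person feature dictionaries and that the aggregate row is stored under the key `"TOTAL"` (both are assumptions here; the names and numbers are invented):

```python
# Tiny made-up excerpt; only the structure matters, not the values.
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": 100000},
    "PERSON B": {"salary": 250000, "bonus": 300000},
    "TOTAL": {"salary": 450000, "bonus": 400000},
}

# pop() with a default does not raise if the key is already gone.
removed = data_dict.pop("TOTAL", None)

print(sorted(data_dict))
```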

### other data properties
This is a very small dataset with a total of 145 rows. Only 18 of those are POI, so the dataset is highly unbalanced, too. 
The data is also very sparse. I did a count of the NaN values:
```
print(data_df[data_df=="NaN"].count(axis=0))
bonus                         64
deferral_payments            107
deferred_income               97
director_fees                129
email_address                 34
exercised_stock_options       44
expenses                      51
from_messages                 59
from_poi_to_this_person       59
from_this_person_to_poi       59
loan_advances                142
long_term_incentive           80
other                         53
poi                            0
restricted_stock              36
restricted_stock_deferred    128
salary                        51
shared_receipt_with_poi       59
to_messages                   59
total_payments                21
total_stock_value             20
```
Only one field has no NaN values: POI, the label that Katie created manually for the data.
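
The same kind of count can be sketched without pandas. This assumes missing values are stored as the string `"NaN"` inside per-person dictionaries; the three records are invented:

```python
from collections import Counter

# Made-up records; missing values are the string "NaN", not a float.
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 150000, "bonus": 50000, "poi": False},
}

missing = Counter()
for features in data_dict.values():
    for name, value in features.items():
        # Add 0 for present values so every feature shows up in the result.
        missing[name] += 1 if value == "NaN" else 0

for name in sorted(missing):
    print(f"{name:10s}{missing[name]:4d}")
```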


## Feature selection
#### How did I pick them
In a first attempt, I selected all available financial information (after all, why commit fraud if you have no interest in doing so?) and relational information from the email data: emails sent to and from POI, and emails sent to this person with a POI in copy (probably indicating that the sender thinks there is a relation).

It is worth noting that most of the features are quite sparse. For example, in the financial information, not everybody receives a bonus paid in shares, and probably not everybody who does has their share bonus published. The same goes for most of the features. So I need every feature I can get.

Next I made a correlation matrix, not with the label, but internally - to see if there is redundant data that could be combined or removed.

The detailed list is as follows:

    features_list = ['poi', 
        'salary', 'bonus', 'director_fees',  'expenses',
        'loan_advances', 'long_term_incentive', 'other', 
        'total_payments',
        
        'deferral_payments', 'deferred_income',
        
        'exercised_stock_options', 'total_stock_value', 
        'restricted_stock', 'restricted_stock_deferred',
        
        'shared_receipt_with_poi', 'to_messages', 'from_messages', 
        'from_poi_to_this_person', 'from_this_person_to_poi']

<img src="Correlation_matrix.png">

It can be seen in the correlation matrix that nearly all features are correlated with the poi feature, which is good. To begin with, I would try to use them all. 
However, the first three stock features (exercised_stock_options, total_stock_value, restricted_stock) are so strongly correlated that it may be interesting to remove some.
The email features (last five lines) are also very much in line with each other.
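
To illustrate what such an internal correlation check computes, here is a minimal pure-Python sketch. The numbers are made up, and `total_stock_value` is deliberately built as the sum of the two other stock columns to mimic the redundancy; with pandas, `DataFrame.corr()` would be the usual route.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numbers, not taken from the Enron data.
columns = {
    "exercised_stock_options": [100.0, 400.0, 250.0, 900.0, 50.0],
    "restricted_stock": [80.0, 300.0, 200.0, 700.0, 60.0],
    "expenses": [30.0, 10.0, 50.0, 20.0, 40.0],
}
columns["total_stock_value"] = [
    e + r
    for e, r in zip(columns["exercised_stock_options"], columns["restricted_stock"])
]

names = list(columns)
corr = {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

for a in names:
    print(a.ljust(24), " ".join(f"{corr[(a, b)]:6.2f}" for b in names))
```

Pairs with a coefficient close to 1 are the candidates for combining or removing.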

### what about sparse data
Coming back to the sparsity problem: I experimented with leaving the sparsest features out of the estimator, but the results were no better, so they stay in. Using the final algorithm selected in the last part of this project, I checked whether removing those data fields has an influence on the performance.
``` 
results for leaving some sparse data out
All features, SVC: F1=0.488
w/o loan_advances, SVC: F1=0.487
w/o restricted_stock_deferred, SVC: F1=0.479
w/o director_fees, SVC: F1=0.481
```
So, sparse the data may be, but information it carries nevertheless. Inside the stack I leave them!

Astonishingly, some of the not-so-sparse data contained less information. When I removed the following three features from the stack, the algorithm's performance actually improved.
```
w/o 'exercised_stock_options', SVC: F1=0.511
w/o 'shared_receipt_with_poi', SVC: F1=0.515
w/o 'to_messages', SVC: F1=0.519
```
These are the features I finally removed from the list. Note that this final feature removal was done right at the end, after the algorithm was selected.

Before selecting the final features, I did a first try with some models.

## Model selection
I tested AdaBoost, SVC, Random Forest Classifier and Decision Tree Classifier.
All four of them are good classifiers. Only the SVC needs some preprocessing, which is also nice: the more I can do with the original data, the less I am in danger of accidentally introducing an error.

From here I tried various strategies:
1. Apply the algorithms with a minimum of preprocessing
2. Generate additional random features and let PCA / KBest choose the best ones
3. Choose from the features with PCA and KBest before applying the Algorithms
4. Add only the best few of the additional random features, use PCA/KBest to choose
5. Evaluate all strategies and choose the best.

After several unsuccessful tries, I changed my test setup so that it was *very* similar to the Udacity tester code, in order to yield good results there. I load the same deprecated StratifiedShuffleSplit function so that I can bring the folds up to a thousand in the end. During the experimentation phase, however, I kept the folds down to 10 for calculation speed. As Udacity requires optimizing for precision and recall, I use the F1 score as the evaluation metric.
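
A rough sketch of such an evaluation loop follows. It uses the current `sklearn.model_selection.StratifiedShuffleSplit` rather than the deprecated variant mentioned above, runs on synthetic stand-in data instead of the Enron set, and uses a decision tree only as a placeholder classifier. Pooling the predictions of all folds before scoring is a design choice of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 12% positives), NOT the Enron set.
X, y = make_classification(n_samples=145, n_features=10, n_informative=4,
                           weights=[0.88], random_state=42)

# 10 folds for quick experiments; raise n_splits for the final run.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

true_labels, predictions = [], []
for train_idx, test_idx in splitter.split(X, y):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    predictions.extend(clf.predict(X[test_idx]))
    true_labels.extend(y[test_idx])

# Score once over the pooled folds, so a single tiny fold without any
# predicted positives does not make the metric undefined.
pooled_f1 = f1_score(true_labels, predictions)
print("F1 over all folds:", round(pooled_f1, 3))
```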

#### F1 score results in the first try
* AdaBoost F1: 0.25
* Decision Tree F1: 0.24
* Random Forest F1: 0.30
* SVC F1: 0.42

I consider those results to be relatively poor, so I employed GridSearchCV to search for good parameters. The predefined StratifiedShuffleSplit was used as the cv parameter of the search.
The principle of the search was to refine in four or five steps, until the algorithms were well adapted to the data. As an example for the SVC:

First step
```
{'C': [1, 2, 5, 1e1, 2e1, 5e1, 1e2, 2e2, 5e2,
       1e3, 2e3, 5e3, 5e4, 5e5, 5e6]}
```
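
A coarse first step like this could be wired into GridSearchCV roughly as follows. This is only a sketch: the data is synthetic, and the scaling step, the pipeline step names and the `max_iter` cap (there only to keep the demo fast) are assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data, NOT the Enron set.
X, y = make_classification(n_samples=145, n_features=10, n_informative=4,
                           weights=[0.88], random_state=0)

# Assumed preprocessing: standard scaling before the SVC.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("svc", SVC(max_iter=100000))])

# Coarse grid for C; the "svc__" prefix routes the parameter to the
# pipeline step of that name.
param_grid = {"svc__C": [1, 2, 5, 1e1, 2e1, 5e1, 1e2, 2e2, 5e2,
                         1e3, 2e3, 5e3, 5e4, 5e5, 5e6]}

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

A following step would then place a finer grid around the best value found.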




### Turning knobs
Since the SVC was so much better for this problem than any of the other algorithms tried, I began to refine it by trying out more options: different kernels, and some other values for C and gamma near the chosen ones.

#### F1 score for different kernels of svc
* RBF 0.42
* Sigmoid 0.64
* linear - no solution
* poly (degree 3) 0.25

After the sigmoid variant proved to be so superior, the values for C and gamma were refined by a variant of successive approximation (done manually). The F1 score looked extremely promising at 0.64. The Udacity tester code reported a precision of 0.33 and a recall of 0.73.
I wasn't happy yet and wanted to increase the score by experimenting with the SVC parameter class_weight; however, that didn't change anything: I had started with the keyword 'balanced', as the docs say that this is the right one for unbalanced classes, and it turned out that it indeed was the best choice for this parameter.
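
For orientation, the weights behind `class_weight='balanced'` follow the heuristic `n_samples / (n_classes * class_count)` given in the scikit-learn documentation. A small sketch with the row counts stated in the data section (the actual training folds would be somewhat smaller):

```python
from collections import Counter

# Label counts taken from the text: 145 rows, 18 of them POI.
labels = [1] * 18 + [0] * (145 - 18)

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# weight = n_samples / (n_classes * count_of_that_class)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

The rare POI class ends up with a weight roughly seven times that of the majority class, so mistakes on POI cost correspondingly more during training.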

#### two evaluation metrics
Even though my code was optimized for the F1 score, that really means I was optimizing for low failure rates. Having a precision of 0.33 means that this algorithm points at two innocents for every POI that it identifies. 
Having a recall of 73% means that it catches 73% of the existing POIs. ([Wikipedia has a nice explanation](https://en.wikipedia.org/wiki/Precision_and_recall))
Those two metrics show that it is not so easy to identify fraud from email and financial data alone. It would be interesting to see how well a neural net performs here, although it is somewhere between difficult and impossible to find the reasons for a decision in the net, and those reasons are of prime importance in the justice system.
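
To make the two metrics concrete, here is a worked sketch with hypothetical counts. They are chosen only so that precision and recall land close to the reported 0.33 and 0.73; they are not the tester's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of flagged persons who really are POI
    recall = tp / (tp + fn)      # share of real POI that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 18 real POI, 13 of them found, plus 26 false alarms.
precision, recall, f1 = precision_recall_f1(tp=13, fp=26, fn=5)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

With 13 hits and 26 false alarms there are exactly two innocents per identified POI. Note that the harmonic mean of a precision near 0.33 and a recall near 0.73 is about 0.46, so the 0.64 quoted earlier presumably stems from a different evaluation run.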

Strikingly, the accuracy went down from 86% in the beginning to 76% - however, I think that in this case a low failure rate is more important than a high accuracy. 
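
A quick calculation, using only the row counts stated in the data section, hints at why accuracy is a weak yardstick on such unbalanced data:

```python
n_rows, n_poi = 145, 18  # counts stated in the data section

# A "classifier" that never flags anyone gets every non-POI right ...
baseline_accuracy = (n_rows - n_poi) / n_rows
# ... but finds none of the POI.
baseline_recall = 0 / n_poi

print(round(baseline_accuracy, 3), baseline_recall)
```

Never flagging anybody already yields an accuracy of roughly 88% while catching nobody.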

## Featuring II
There is still one task left: See if we can improve the result by feature engineering. Generally, there seem to be three strategies out in the wild:

* Use insider knowledge to make the right feature at the right scale 
* Generate loads of random features and choose the best with some kind of algorithm - Lasso, KBest, PCA
* Avoid feature engineering and use neural networks instead. 

Neural networks are not yet an option for me, and I have no insider knowledge about Enron at all.
So I will try some experimental random feature engineering here; then I employ KBest and PCA to define a cut-off, and then I let myself be surprised by the (hopefully) one or two valuable new features that I find in the mass. Another expectation is that the higher dimensionality leads to better separated groups. 

One drawback is that too many dimensions compared to data rows can easily break the algorithms.

A second drawback, which really gave me a bad feeling about this, is that calculated features generated from sparse data are a lot more sparse still. That is especially true for division features, which is sad, as a number divided by another one usually represents a ratio or percentage, and those are often helpful.

Nevertheless, I wanted to try this experiment, so here is what I did:

I multiplied every column with every column. 
I subtracted every column from every column. 
I divided every column by every column. 
The resulting 1200 features were fed into SelectKBest with the f_classif scoring function; separately, they were scaled and put into a PCA. 
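
The three operations can be sketched in plain Python as below. The function name, the dict-of-lists layout and the NaN handling for division by zero are assumptions for illustration. The count of 1200 would be consistent with 20 input columns combined as ordered pairs (3 x 20 x 20).

```python
import math
from itertools import product

def pairwise_features(columns):
    """Build product, difference and ratio columns for every ordered pair.

    `columns` maps a feature name to a list of numbers (hypothetical layout).
    """
    derived = {}
    for (name_a, col_a), (name_b, col_b) in product(columns.items(), repeat=2):
        pairs = list(zip(col_a, col_b))
        derived[f"{name_a}*{name_b}"] = [a * b for a, b in pairs]
        derived[f"{name_a}-{name_b}"] = [a - b for a, b in pairs]
        # Division by zero yields NaN instead of raising an exception.
        derived[f"{name_a}/{name_b}"] = [a / b if b else math.nan for a, b in pairs]
    return derived

# Made-up toy columns.
toy = {"salary": [100.0, 200.0], "bonus": [50.0, 0.0], "expenses": [10.0, 20.0]}
features = pairwise_features(toy)
print(len(features))
```

Because ordered pairs are used, every product appears twice (a*b and b*a), and each zero or missing divisor adds another hole to an already sparse table.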

I trained the four algorithms from above, in a for-loop, on anything between 1 and 100 features coming out of the two feature selection methods. 
I expected some kind of smooth, exponential-like curve, so I was quite surprised to see no steady decline or rise in my F1 value, but a quite wild zig-zag, and a nearly unreadable graphic in the end.
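
The loop can be sketched roughly like this, again on synthetic data, with only one placeholder classifier and only 10 feature counts to keep it short. Pipeline layout and parameter values are assumptions, not the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data, NOT the Enron set.
X, y = make_classification(n_samples=145, n_features=30, n_informative=5,
                           weights=[0.88], random_state=1)
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=1)

scores = {"kbest": [], "pca": []}
for k in range(1, 11):
    reducers = {
        "kbest": SelectKBest(f_classif, k=k),
        # PCA is sensitive to scale, so the data is standardized first.
        "pca": make_pipeline(StandardScaler(), PCA(n_components=k)),
    }
    for name, reducer in reducers.items():
        model = make_pipeline(reducer, DecisionTreeClassifier(random_state=1))
        mean_f1 = cross_val_score(model, X, y, scoring="f1", cv=cv).mean()
        scores[name].append(mean_f1)

for name, values in scores.items():
    print(name, [round(v, 2) for v in values])
```

Putting the reducer inside the pipeline means it is refitted on each training fold, so the test fold never influences which features are kept.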

<img src="Algo performance F1.png">


The top line is interesting: the graphic ends well below 0.6; however, with the original features and without much preprocessing, we were already better than that. So in total, this was not the best idea.

The two yellow-ish lines are AdaBoost - in this dataset, it really marks the lowest performance. Interestingly, the darker line (PCA) performs better at low feature counts, while the lighter line (K-Best) performs better later. 

The two green-ish lines represent the random forest classifier. Having had very good experiences with this classifier, I was a little bit taken aback that it performs so badly here! The darker green line is PCA again, which works less well than the lighter green line of K-Best. Random Forest is capable of selecting features itself, so it gets better the less K-Best is involved.

The two blue lines are a decision tree. In combination with PCA (darker line) it first seems to be a fair contender for first place, but that changes soon. Again, PCA wins the low-dimension prize while K-Best works better with a high feature count: the more the decision tree is allowed to choose features itself, the better it works. 

Finally, the two red lines depict the Support Vector Classifier. I was using the same parameters that had already been tuned before - and interestingly, it seems to work very well together with PCA. That might be because PCA delivers centered, decorrelated components, which may suit the SVC well.

Overall, all the other algorithms seem to like K-Best input more than PCA input. 

Now, for comparison, I tried the same with the original dataset.

<img src="Algo performance F1 original data.png">

First we note that the performance is much less random with the original features. Second, we can see that the peak performance, at nearly 0.6 F1 score, is better than the peak performance with the random additional features. Third, we can see that the two favorites both perform better with PCA than with K-Best, which is the opposite of the result above. Indeed, K-Best seems to work best on the right side of the scale: at the point where it actually does *not select* anymore.

Well - this exercise was nice, but originally I did this to see if there is one calculated feature that can actually improve the original score.

I executed a little code:

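    from sklearn.feature_selection import SelectKBest

    # Keep the five highest-scoring features and print their names
    kb = SelectKBest(k=5)
    kb.fit(lots_X, lots_Y)
    for i, choice in enumerate(kb.get_support()):
        if choice:
            print(features_list_all[i])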

And indeed, there were some important features, and they are exclusively multiplication features.
* deferred_income *multiplied with* exercised_stock_options
* deferred_income *multiplied with* total_stock_value
* exercised_stock_options *multiplied with* deferral_payments
* exercised_stock_options *multiplied with* deferred_income
* total_stock_value *multiplied with* deferred_income

Two of those five chosen features are actually duplicates! This might in part explain the erratic behavior of the algorithms - if they get "additional" features that really don't contain more information, it might harm the performance.
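
Since a*b equals b*a, such duplicates can be collapsed by treating each product as an unordered pair. A small sketch using the five names listed above:

```python
# The five selected multiplication features as (left, right) pairs.
selected = [
    ("deferred_income", "exercised_stock_options"),
    ("deferred_income", "total_stock_value"),
    ("exercised_stock_options", "deferral_payments"),
    ("exercised_stock_options", "deferred_income"),
    ("total_stock_value", "deferred_income"),
]

# Sorting the names inside each pair makes a*b and b*a identical,
# and the set then drops the repeats.
unique_pairs = sorted({tuple(sorted(pair)) for pair in selected})

for a, b in unique_pairs:
    print(f"{a} * {b}")
```

Three distinct products remain.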

I continue by adding the three distinct features to the original feature vector to see how the algorithms behave.
As AdaBoost and Random Forest didn't compare well to the other two, they are excluded from now on for speed and readability.

<img src="Algo performance F1 three Features.png">

This last graph is much easier on the eyes :) After adding the three additional features, both algorithms seem to peak around five features, which is earlier than before. As before, the decision tree gets better the less K-Best is involved - after all, the decision tree is quite capable of choosing features itself.

However, they don't come close to the performance of the first try.


## Conclusion:
This was like shopping. The first try was the best. 

As I didn't optimize for accuracy, my top result for it is 0.76, which is 0.10 less than the benchmark for this exercise. However, I am quite happy with a recall of 0.73 and a precision of 0.33 - those results are more important than the pure accuracy, given that this result points its finger at certain persons. Personally, I think that none of this is going to be usable for a new case: I can't tell if my POI indicator has a precision of 0.33 because only one out of three POI has been identified in the Enron fraud case, or because the sparse data does not allow a better prediction.
Similarly, a recall of 73% means that I let a quarter of the POIs slip... I am not yet happy at all.
