# Identifying fraud from Enron Emails

#### By Sargam Shah

### Question 1
**Summarize for us the goals of this project and how machine learning is useful in achieving this. As a part of your answer, give some background of the dataset and how it can be used to answer the question. Were there any outliers in the data when you got it, and how did you handle them?**

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.

The goal of this project is to play detective and put machine learning skills to use by building an algorithm that identifies Enron employees who may have committed fraud, based on the public Enron financial and email dataset.

The dataset contains 146 records with 1 label (POI), 14 financial features, and 6 email features. Of these records, 18 are labeled as a "Person of Interest" (POI).

I used scikit-learn and various machine learning techniques to predict whether a person is a "Person of Interest", using both financial and email data to detect culpable persons. Through exploratory data analysis and a check of the data exported to CSV, I found that 3 records needed to be removed:

TOTAL: A scatter-plot matrix showed that TOTAL is an extreme outlier, because it sums every financial record rather than describing a person. I removed TOTAL from the data dictionary.

THE TRAVEL AGENCY IN THE PARK: This appears to be a data-entry error, since it does not represent an individual.

LOCKHART EUGENE E: This record contains only NaN values.

I removed the latter two keys from the dictionary object in the same way.
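A minimal sketch of this clean-up step, assuming the data is a dictionary that maps each name to a dictionary of features and that missing values are stored as the string "NaN". The toy records, their values, and the helper names `find_all_nan_records` and `remove_records` are made up for illustration and are not taken from poi_id.py.

```python
# Toy stand-in for the project's data dictionary (name -> feature dict).
# All values are invented; missing values are assumed to be the string "NaN".
data_dict = {
    "PERSON A": {"salary": 200000, "bonus": 500000, "poi": False},
    "PERSON B": {"salary": 300000, "bonus": "NaN", "poi": True},
    "TOTAL": {"salary": 500000, "bonus": 500000, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": 1000, "poi": False},
    "LOCKHART EUGENE E": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

OUTLIER_KEYS = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK", "LOCKHART EUGENE E"]


def find_all_nan_records(data, label="poi"):
    """Return the names whose every feature except the label is "NaN"."""
    return [
        name
        for name, features in data.items()
        if all(value == "NaN" for key, value in features.items() if key != label)
    ]


def remove_records(data, keys):
    """Return a copy of `data` without the given keys; unknown keys are ignored."""
    return {name: features for name, features in data.items() if name not in keys}


all_nan = find_all_nan_records(data_dict)
cleaned = remove_records(data_dict, OUTLIER_KEYS)
print(all_nan)
print(sorted(cleaned))
```

Returning a filtered copy instead of deleting keys in place keeps the original dictionary available for later checks.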

### Question 2
**What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As a part of the assignment you should engineer your own feature that did not come ready-made in the dataset -- explain what feature you made and the rationale behind it.**

**Selection Process**

I used a univariate feature selection process, SelectKBest, in a pipeline with grid search to select the features. SelectKBest removes all but the k highest-scoring features. The number of features, 'k', was chosen through an exhaustive grid search using 'f1' as the scoring metric, which balances precision and recall.
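The selection process can be sketched roughly as follows. This is an illustration under assumptions, not the code from poi_id.py: the data is synthetic (`make_classification`), the classifier in the pipeline is a placeholder, and the candidate values for `k` and the 3-fold split are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the Enron features and POI labels.
X, y = make_classification(
    n_samples=140, n_features=20, n_informative=5,
    weights=[0.87, 0.13], random_state=42,
)

# Feature selection and classifier live in one pipeline, so the selection
# is re-fit on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
])

# Candidate values for k (assumed for this sketch).
param_grid = {"select__k": [5, 10, 15, "all"]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X, y)

best_k = search.best_params_["select__k"]
print("best k:", best_k)
print("mean cross-validated f1:", search.best_score_)
```

After fitting, `search.best_estimator_.named_steps["select"].scores_` holds the per-feature scores, which is the kind of value reported in a score table.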

**Features Selected**

I chose the following ten features for my POI identifier. The second column gives each feature's score from SelectKBest. The features are listed in descending order of score.

|Selected Features|Score↑|
|-------------------|-------|
|exercised_stock_options|22.510|
|total_stock_value|22.349|
|bonus|20.792|
|salary|18.289|
|deferred_income|11.425|
|long_term_incentive|9.922|
|restricted_stock|9.284|
|total_payments|8.772|
|shared_receipt_with_poi|8.589|
|loan_advances|7.184|

**Feature Scaling**

After feature engineering and feature selection with SelectKBest, I scaled all features using a min-max scaler. A brief inspection of the data exported to CSV shows that the email and financial features differ by several orders of magnitude. Feature scaling is therefore vital so that all features are weighted evenly.
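To illustrate the effect of min-max scaling, here is a small sketch with made-up numbers: one column on the scale of a salary in dollars and one on the scale of a message count. The values are invented and only show how both columns end up in the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: a dollar amount and a message count (invented values).
X = np.array([
    [200000.0, 100.0],
    [400000.0, 2000.0],
    [1000000.0, 500.0],
])

scaler = MinMaxScaler()
# Each column is rescaled as (x - column_min) / (column_max - column_min).
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

When the scaler is placed inside a scikit-learn `Pipeline`, it is fit on the training data only, so the minimum and maximum of the test data do not leak into training.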

**Selection of self-engineered feature and how it impacted the algorithm performance**

Unsurprisingly, 9 of the 10 selected features relate to financial data. Only 1 feature, shared_receipt_with_poi, relates to email, and it is the one we attempted to engineer (messages from/to POIs divided by the total messages to/from the person). The main purpose of composing a ratio of POI messages is that we expect POIs to contact each other more often than they contact non-POIs, and the relationship could be non-linear. The initial assumption behind this feature is that the relationship between POIs is much stronger than between POIs and non-POIs, and a quick back-of-the-envelope scatter plot in Excel suggested there might be some truth to that hypothesis. The fact that shared_receipt_with_poi was retained by SelectKBest suggests that it is a useful feature. It also slightly increased the precision and recall of most of the machine learning algorithms used later in the analysis (e.g. precision and recall for the Support Vector Classifier were 0.503 and 0.223 before adding the new feature, and 0.504 and 0.225 after).
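A ratio of POI messages to total messages could be computed along these lines. This is a hedged sketch: the input keys `from_poi_to_this_person`, `to_messages`, `from_this_person_to_poi` and `from_messages` are assumed to be the raw email counts in the data dictionary, the output names `fraction_from_poi` and `fraction_to_poi` are hypothetical, and missing values are assumed to be the string "NaN".

```python
def poi_message_fraction(poi_messages, all_messages):
    """Fraction of a person's messages that involve a POI.

    Returns 0.0 when either count is missing ("NaN") or the total is zero.
    """
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    return float(poi_messages) / float(all_messages)


# Invented record for illustration.
person = {
    "from_poi_to_this_person": 40,
    "to_messages": 800,
    "from_this_person_to_poi": 10,
    "from_messages": "NaN",
}

person["fraction_from_poi"] = poi_message_fraction(
    person["from_poi_to_this_person"], person["to_messages"]
)
person["fraction_to_poi"] = poi_message_fraction(
    person["from_this_person_to_poi"], person["from_messages"]
)
print(person["fraction_from_poi"], person["fraction_to_poi"])
```

Using a fraction rather than a raw count avoids rewarding people simply for sending or receiving a lot of email.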

### Question 3
**What algorithm did you end up using? What other one(s) did you use? How did model performance differ between algorithms?**

I tried more than 10 algorithms and found that the Random Forest Classifier, Support Vector Machine and Logistic Regression (not covered in class) had the potential to be improved further. Without any tuning, K-means clustering performed reasonably well, with precision and recall both above 0.3. I ended up using logistic regression for my final analysis. Logistic regression is widely used in the medical and legal fields, most prominently to predict whether a tumor is benign or malignant or whether a defendant is guilty or not guilty, and more recently in e-mail spam classifiers, so I wanted to test it here. Although the initial results were not as expected, I believed that further tuning could produce a much better result.

The post-tuning results are summarized in the table below:

| Algorithm | Precision | Recall |
|-----------|-----------|--------|
| Logistic Regression | 0.315 | 0.161 |
| Support Vector Classifier | 0.502 | 0.223 |
| Random Forest Classifier | 0.336 | 0.163 |
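A comparison like the one in the table can be set up roughly as below. This sketch uses synthetic data and untuned default models with 5-fold cross-validation, so it does not reproduce the numbers above. It only shows one way to collect precision and recall for several classifiers side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the Enron features and POI labels.
X, y = make_classification(
    n_samples=140, n_features=10, n_informative=4,
    weights=[0.87, 0.13], random_state=42,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    # For classifiers, an integer cv uses stratified folds.
    scores = cross_validate(model, X, y, cv=5, scoring=("precision", "recall"))
    results[name] = (scores["test_precision"].mean(), scores["test_recall"].mean())

for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

scikit-learn may emit a warning when a fold contains no positive predictions. In that case precision for the fold is reported as 0.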

### Question 4
**What does it mean to tune the parameters of an algorithm, and what can happen if you don't do this well? How did you tune the parameters of a particular algorithm?**

Parameter tuning refers to adjusting the settings of an algorithm before training in order to improve its performance on held-out data. Parameters influence the outcome of the learning process, and the more heavily they are tuned, the more biased the algorithm becomes toward the training data and test harness. The strategy can be effective, but it can also lead to fragile models that overfit the test harness and do not perform well in practice.

For most algorithms, my tuning achieved only marginal success and unremarkable improvement, but it was significantly successful for Logistic Regression and K-means clustering. Searching manually through the documentation, I arrived at the following parameters:

Logistic regression: C (inverse of regularization strength), class_weight (weights associated with classes), max_iter (maximum number of iterations taken for the solver to converge), random_state (the seed of the pseudo-random number generator used when shuffling the data), solver ('liblinear', since the dataset is very small).
```
LogisticRegression(C=1e-08, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1,
                   max_iter=100, multi_class='ovr', penalty='l2', random_state=42, solver='liblinear',
                   tol=0.001, verbose=0)
```

K-means clustering:
```
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
       n_jobs=1, precompute_distances='auto', random_state=None, tol=0.001,
       verbose=0)
```

### Question 5
**What is validation and what is a classic mistake you can make if you do it wrong? How did you validate your analysis?**

Validation comprises a set of techniques to make sure a model generalizes to data it was not trained on. A classic mistake, which I briefly made myself, is over-fitting, where the model performs well on the training set but substantially worse on the test set. To guard against this, we can use cross-validation (provided by the evaluate function in poi_id.py), where I ran 1000 trials and split the dataset in a 3:1 training-to-test ratio each time. The main reason for using StratifiedShuffleSplit rather than the other available splitting techniques is the nature of the dataset, which is extremely small and has only 18 Persons of Interest. A single split into a training and test set would not give a reliable estimate of error. Therefore, we need to randomly split the data over multiple trials while keeping the fraction of POIs in each trial roughly constant.
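The validation scheme can be sketched as follows. This is an approximation under assumptions, not the evaluate function from poi_id.py: the data is synthetic, the classifier settings are placeholders, and the sketch runs 100 trials instead of 1000 to keep it quick. True positives, false positives and false negatives are accumulated over all trials before precision and recall are computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced stand-in for the Enron features and POI labels.
X, y = make_classification(
    n_samples=143, n_features=10, n_informative=4,
    weights=[0.87, 0.13], random_state=42,
)

# test_size=0.25 gives a 3:1 training-to-test ratio in every trial.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=42)

tp = fp = fn = 0
positives_per_test_set = []
for train_idx, test_idx in splitter.split(X, y):
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X[train_idx], y[train_idx])
    predicted = clf.predict(X[test_idx])
    actual = y[test_idx]

    tp += int(np.sum((predicted == 1) & (actual == 1)))
    fp += int(np.sum((predicted == 1) & (actual == 0)))
    fn += int(np.sum((predicted == 0) & (actual == 1)))
    positives_per_test_set.append(int(actual.sum()))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.3f}, recall={recall:.3f}")
print("positives per test set:", set(positives_per_test_set))
```

Pooling the counts over all trials, rather than averaging per-trial scores, avoids undefined precision in trials where the classifier flags nobody.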

### Question 6
**Give at least two evaluation metrics, and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm's performance.**

The two main evaluation metrics used for this project were:
1. Precision
2. Recall

Precision is the ratio of true positives (people correctly predicted as POIs) to all people flagged as POIs, while recall is the ratio of true positives to all people who are actually POIs.

The best performance belongs to logistic regression (precision: 0.382, recall: 0.415), which is also the final model of choice. A precision of 0.382 means that if this model flags 100 people as POIs, about 38 of them are truly POIs and the remaining 62 are innocent. A recall of 0.415 means the model finds about 42% of all real POIs, so it still misses more than half of them. Because POIs are a small minority of the dataset, accuracy is not a good measure: even a model that labels everyone as non-POI would score a high accuracy and appear successful.
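A small worked example with invented labels shows how the metrics are computed and why accuracy can mislead on imbalanced data. Here 2 of 10 people are POIs. The first set of predictions flags two people and gets one right, while the baseline flags nobody.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1 = POI, 0 = non-POI (invented labels: 2 POIs among 10 people).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that flags two people: one real POI and one innocent person.
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# Baseline that never flags anyone.
y_none = [0] * 10

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 2
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 8 / 10

baseline_accuracy = accuracy_score(y_true, y_none)  # 8 / 10
baseline_recall = recall_score(y_true, y_none)      # 0 / 2

print(precision, recall, accuracy)
print(baseline_accuracy, baseline_recall)
```

Both sets of predictions reach the same accuracy, yet the baseline finds no POIs at all, which is why precision and recall are the more informative metrics here.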


### Files


- tools/: helper tools and functions
- final_project/poi_id.py: main submission file - POI identifier
- final_project/tester.py: Udacity-provided file that produces the test results for submission




