<b>Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]</b>

This project aimed to build a classifier on Enron data that identifies Enron employees who were considered Persons of Interest (POIs) in the court proceedings following the company's collapse. A POI is someone who was charged, settled, or testified in exchange for an immunity deal. The available data consists of two parts. The first is financial data about Enron employees, with features such as salary, bonus, and stock options. The second is the company email records, specifically how often each employee exchanged email with POIs and non-POIs. The rationale is that employees who engaged in fraud would communicate more with one another to organize fraudulent activities. Unfortunately, neither the financial nor the email data is complete; both contain missing values. I concluded that a missing field in the financial data represents a zero value. For example, a person who did not receive a bonus has a NaN value in the bonus field.
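
The zero-fill rule for missing financial values can be sketched as follows. This is an illustration under stated assumptions, not the project code: the two records are made up, the helper `fill_missing_financials` is hypothetical, and missing values are assumed to be stored as the string `"NaN"`.

```python
# Hypothetical records; missing values are assumed to be the string "NaN".
data_dict = {
    "EMPLOYEE A": {"salary": 250000, "bonus": "NaN", "poi": False},
    "EMPLOYEE B": {"salary": "NaN", "bonus": 1200000, "poi": True},
}
financial_features = ["salary", "bonus"]


def fill_missing_financials(records, features):
    """Return a copy of `records` with missing financial values set to 0."""
    cleaned = {}
    for name, fields in records.items():
        row = dict(fields)  # copy so the input stays untouched
        for feature in features:
            if row.get(feature) == "NaN":
                row[feature] = 0
        cleaned[name] = row
    return cleaned


cleaned = fill_missing_financials(data_dict, financial_features)
```

Returning a copy keeps the raw data available, which makes it easy to compare results with and without the zero-fill.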

Outliers were discovered in the financial data, and the clearest one was a row containing the total of all columns. Removing this outlier highlighted the importance of having a clean dataset, as the results of the machine learning algorithms improved greatly. I also found persons with much higher total payments than the rest of the POIs, so I removed those as well.
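
A minimal sketch of this kind of outlier removal is shown below. The records, the `"TOTAL"` key, and the helper `drop_records` are hypothetical stand-ins used only for illustration.

```python
# Hypothetical records; "TOTAL" stands in for the spreadsheet summary row.
records = {
    "EMPLOYEE A": {"total_payments": 300000},
    "EMPLOYEE B": {"total_payments": 450000},
    "TOTAL": {"total_payments": 750000},
}


def drop_records(records, keys_to_drop):
    """Return a copy of `records` without the listed keys."""
    return {name: row for name, row in records.items() if name not in keys_to_drop}


# Sorting by total payments puts candidate outliers at the front for review.
ranked = sorted(records, key=lambda name: records[name]["total_payments"], reverse=True)

cleaned = drop_records(records, {"TOTAL"})
```

Inspecting the top of `ranked` by eye before dropping anything helps separate data-entry artifacts from real but extreme values.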




<b>What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]</b>


I used the VarianceThreshold function to remove all features with a variance below the 80% threshold, then used the SelectKBest function to score each remaining feature. I then sorted the scores and kept the 7 highest-scoring features as the final inputs to the prediction models. The final features and their scores:

|         Feature         |        Score       |
|:-----------------------:|:------------------:|
| exercised_stock_options | 24.815079733218194 |
| total_stock_value       | 24.18289867856688  |
| bonus                   | 20.792252047181535 |
| salary                  | 18.289684043404513 |
| deferred_income         | 11.458476579280369 |
| long_term_incentive     | 9.922186013189823  |
| restricted_stock        | 9.2128106219771    |
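
The two-step selection can be sketched on synthetic data as below. The data, the feature names, the threshold of `0.0`, the `f_classif` scoring function, and `k=1` are assumptions chosen to keep the example small; they are not the values used in the project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data: one informative column, one noise column, one constant column.
rng = np.random.default_rng(42)
n = 60
y = np.array([0] * 45 + [1] * 15)
feature_names = ["informative", "noise", "constant"]
X = np.column_stack([
    y * 3.0 + rng.normal(size=n),  # shifts with the label
    rng.normal(size=n),            # unrelated to the label
    np.ones(n),                    # zero variance
])

# Step 1: drop features whose variance does not exceed the threshold.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)
kept = [name for name, keep in zip(feature_names, vt.get_support()) if keep]

# Step 2: score the remaining features and rank them by score.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X_var, y)
ranked = sorted(zip(kept, selector.scores_), key=lambda pair: pair[1], reverse=True)
top = [name for name, _ in ranked[:1]]
```

Sorting `selector.scores_` directly, rather than relying only on the transformed output, gives a score table like the one above.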


I engineered two features, 'msg_from_poi_prop' and 'msg_to_poi_prop'. The 'msg_from_poi_prop' feature is the proportion of the emails a person receives that come from a POI, and 'msg_to_poi_prop' is the proportion of the emails a person sends that go to a POI. My intuition was that POIs are more likely to contact each other than non-POIs, so these two features should be good predictors of a POI. I then compared how my best model, Naive Bayes, and the other algorithms performed with and without the new features. After adding the two features, performance dropped slightly for some algorithms and increased marginally for a couple of others.
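
A minimal sketch of how such proportion features could be computed is shown below. The input field names (`from_poi_to_this_person`, `to_messages`, `from_this_person_to_poi`, `from_messages`), the counts, and the helper `proportion` are assumptions for illustration, as is the choice to return 0 when a count is missing.

```python
def proportion(part, total):
    """Return part / total, or 0.0 when a count is missing or total is 0."""
    if part == "NaN" or total == "NaN" or total == 0:
        return 0.0
    return part / total


# Hypothetical email counts for one person.
person = {
    "from_poi_to_this_person": 30,
    "to_messages": 600,
    "from_this_person_to_poi": 10,
    "from_messages": 40,
}

# Share of received emails that came from a POI.
person["msg_from_poi_prop"] = proportion(
    person["from_poi_to_this_person"], person["to_messages"]
)
# Share of sent emails that went to a POI.
person["msg_to_poi_prop"] = proportion(
    person["from_this_person_to_poi"], person["from_messages"]
)
```

Using proportions rather than raw counts keeps a heavy email user from looking suspicious simply because of volume.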

There are 20 features in the dataset, but I did not use all of them. I chose k=7 to keep the 7 features with the highest scores. Because the raw values span widely different ranges, after choosing the 7 best features I used min-max scaling to linearly rescale each feature to a common range.
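
Min-max rescaling maps each value to (x - min) / (max - min), which is the transformation scikit-learn's `MinMaxScaler` applies with its default range of 0 to 1. The plain-Python sketch below illustrates the formula; the helper `min_max_scale` and the salary values are hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale `values` to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


salaries = [100000, 250000, 400000]
scaled = min_max_scale(salaries)
```

Scaling matters most for distance- or margin-based models such as K-means and SVM, where a feature with a large numeric range would otherwise dominate.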

<b>What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]</b>

The four algorithms I tried were Naive Bayes, K-means, Logistic Regression, and SVM. Naive Bayes turned out to be the best. It and the K-means model performed well on both the original feature set and the final set with the new features added, while the SVM model performed worse.
 
|      Algorithm      | Accuracy | Precision | Recall |
|:-------------------:|:--------:|-----------|--------|
| Naive Bayes         | 0.855    | 0.671     | 0.650  |
| K-means             | 0.328    | 0.239     | 0.133  |
| Logistic Regression | 0.859    | 0.645     | 0.574  |
| SVM                 | 0.866    | 0.508     | 0.513  |


<b>What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]</b>

Tuning an algorithm means searching for the parameter values that let the model perform at its best on the problem at hand. Done well, tuning brings an algorithm to its best performance. Choosing poor parameter values can lower its predictive power.

For the 4 algorithms, I used the GridSearchCV function to find the best parameters for each:

<ul>
<li><b>Naive Bayes:</b> A simple model with no parameters that needed to be specified. <br>
<li><b>K-means:</b> n_clusters was set to 5, tol (tolerance) to 1, and random_state to 42. <br>
<li><b>Logistic Regression:</b> C (inverse regularization strength) was set to 0.1, tol (stopping tolerance) to 1, and random_state to 42. <br>
<li><b>SVM:</b> kernel was set to linear, C to 1, gamma to 1, and random_state to 42.
</ul>
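
A grid search of this kind can be sketched with GridSearchCV as below. The synthetic data, the candidate values in `param_grid`, the `f1` scoring choice, and the 5-fold setting are assumptions for illustration and are not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the 7 selected features.
X, y = make_classification(
    n_samples=120,
    n_features=7,
    n_informative=3,
    weights=[0.85, 0.15],
    random_state=42,
)

# Candidate values to try; every combination is cross-validated.
param_grid = {"C": [0.01, 0.1, 1, 10], "tol": [1e-4, 1e-2, 1]}

search = GridSearchCV(
    LogisticRegression(random_state=42, max_iter=1000),
    param_grid,
    scoring="f1",  # balances precision and recall on the rare class
    cv=5,
)
search.fit(X, y)
best_params = search.best_params_
```

Because accuracy is dominated by the majority class on imbalanced data, a scoring choice that reflects precision and recall is usually a safer target for the search.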


<b>What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]</b>

Validation is the process of assessing how well the results of a statistical analysis will generalize to an independent data set. A classic mistake is overfitting: an overfit model performs well on the training data but fails considerably when making predictions about new or unseen data. To avoid this mistake, I tuned only a few parameters and wrote a function called validate_clf, which applies a cross-validation technique that splits the data into training and test sets 100 times, calculates the accuracy, precision, and recall for each split, and then takes the mean of each metric.
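
One way such a repeated-split validation could look is sketched below. It is not the project's validate_clf: the signature, the use of `StratifiedShuffleSplit`, the 30% test size, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB


def validate_clf(clf, X, y, n_splits=100, test_size=0.3, random_state=42):
    """Average accuracy, precision, and recall over repeated random splits."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=random_state
    )
    scores = {"accuracy": [], "precision": [], "recall": []}
    for train_idx, test_idx in splitter.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores["accuracy"].append(accuracy_score(y[test_idx], pred))
        # zero_division=0 scores a split as 0 when no positives are predicted.
        scores["precision"].append(precision_score(y[test_idx], pred, zero_division=0))
        scores["recall"].append(recall_score(y[test_idx], pred, zero_division=0))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}


# Synthetic, imbalanced stand-in data.
X, y = make_classification(
    n_samples=150, n_features=7, n_informative=3, weights=[0.85, 0.15], random_state=42
)
results = validate_clf(GaussianNB(), X, y)
```

Stratified splits keep the share of the rare class roughly constant in every test set, which stabilizes precision and recall when positives are scarce.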


<b>Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]</b>

The 3 evaluation metrics I used were:
<ul>
<li>Accuracy</li>
<li>Precision</li>
<li>Recall</li>
</ul>
<center><b>Average Performance of Metrics with Naive Bayes Algorithm</b></center>

|   Metric  | Average with Original Features | Average with New Features Included |
|:---------:|:------------------------------:|:----------------------------------:|
| Accuracy  |             0.855            |                     0.843                    |
| Precision |             0.671            |                     0.652                    |
| Recall    |             0.649            |                     0.644                    |
    
    
The accuracy score represents the model's ability to correctly predict both positives and negatives out of all predictions. An accuracy of 0.855 means that 85.5% of all cases were classified correctly. The precision score represents the share of the model's positive predictions that are correct. A precision of 0.671 means that out of every 100 persons the model classifies as POIs, about 67 actually are POIs. The recall score represents the share of actual positives that the model finds. A recall of 0.649 (0.65 rounded) means that out of every 100 true POIs in the dataset, about 65 are correctly classified as POIs.
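
The three definitions can be made concrete with confusion-matrix counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; `classification_metrics` is an illustrative helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Correct predictions over all predictions.
        "accuracy": (tp + tn) / total,
        # Correct POI predictions over all POI predictions.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Correct POI predictions over all actual POIs.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 20 actual POIs, 20 predicted POIs, 13 in common.
metrics = classification_metrics(tp=13, fp=7, fn=7, tn=73)
```

With these counts, accuracy is 86/100, while precision and recall are both 13/20, which shows how a high accuracy can coexist with more modest precision and recall when POIs are rare.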
