# Final Project of the Enron Incident

This notebook is my final project for the Udacity Machine Learning course. Its goal is to identify the persons of interest (POIs) in the Enron incident using the Enron email dataset. We'll go through the four stages of the machine learning process (Dataset/Question -> Features -> Algorithms -> Evaluation) and identify the POIs with our machine learning code.


Between the coding parts, I'll answer questions such as why I chose particular features for the new dataset, and point out anything notable in the code. At the end of the notebook, I'll summarize my answers to Udacity's questions to meet the project requirements.

In [1]:
import sys
import pickle
import matplotlib.pyplot
import numpy as np
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from tester import test_classifier



## Task 1: Show the Characteristics of the dataset

First, we print some basic characteristics of the dataset to get a general understanding of it.

In [2]:
def count_poi(dataset):
    # count the records whose "poi" label is true
    num_poi = 0
    for name in dataset:
        if dataset[name]["poi"] == 1:
            num_poi += 1
    return num_poi

In [3]:
# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
    
num_poi = count_poi(data_dict)

print "Number of data point: ", len(data_dict)
print "Number of features: ", len(data_dict[data_dict.keys()[0]])-1
print "Number of poi: ", num_poi
print "Number of non-poi: ", len(data_dict) - num_poi

Number of data point:  146
Number of features:  20
Number of poi:  18
Number of non-poi:  128


## Task 2: Remove the outliers

From the provided PDF *enron61702insiderpay*, we know the dataset contains three outliers: 'TOTAL', 'THE TRAVEL AGENCY IN THE PARK' and 'LOCKHART EUGENE E'. 'TOTAL' and 'THE TRAVEL AGENCY IN THE PARK' are not real people, and every value for 'LOCKHART EUGENE E' is NaN, so we remove these three keys from the dataset.

In [4]:
data_dict.pop("TOTAL")
data_dict.pop("THE TRAVEL AGENCY IN THE PARK")
data_dict.pop("LOCKHART EUGENE E")

num_poi = count_poi(data_dict)

print "Number of data point: ", len(data_dict)
print "Number of features: ", len(data_dict[data_dict.keys()[0]])-1
print "Number of poi: ", num_poi
print "Number of non-poi: ", len(data_dict) - num_poi

Number of data point:  143
Number of features:  20
Number of poi:  18
Number of non-poi:  125
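
A record like 'LOCKHART EUGENE E', whose values are all NaN, can also be found programmatically. The following is a minimal sketch on a made-up toy dictionary; the helper name `all_nan_records` and the toy names and values are assumptions for illustration, not part of the project code.

```python
# Toy records in the same shape as data_dict; names and values are made up.
toy_dict = {
    "PERSON A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"poi": True, "salary": 250000, "bonus": 50000},
}


def all_nan_records(dataset, label_key="poi"):
    """Return the names whose every feature (label excluded) is the string 'NaN'."""
    empty = []
    for name, record in dataset.items():
        values = [v for k, v in record.items() if k != label_key]
        if all(v == "NaN" for v in values):
            empty.append(name)
    return sorted(empty)


empty_names = all_nan_records(toy_dict)
```

On the toy data only "PERSON B" is returned, since it is the only record without a single usable value.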


## Task 3: Choose our feature

In the following steps, we use SelectKBest to select the 6 best features from features_list (excluding poi, which is the label rather than a feature). We start with all the numeric features of our data as features_list:

In [5]:
# features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi"
features_list = ['poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 
                 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 
                 'other', 'long_term_incentive', 'restricted_stock', 'director_fees',
                 'to_messages', 'from_poi_to_this_person', 'from_messages',
                 'from_this_person_to_poi', 'shared_receipt_with_poi']

Next, we count the *NaN* values for each feature. If a feature is missing for a large share of the people (here, 60 or more of the 143), it is unlikely to help. We record such features as awful features and remove them from features_list.

In [6]:
# detect the awful features
awful_feature = []
for feature in features_list:
    count = 0
    for name in data_dict:
        if data_dict[name][feature] == 'NaN':
            count += 1
    if count >= 60:
        awful_feature.append(feature)
    
print awful_feature

['deferral_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'long_term_incentive', 'director_fees']


In [7]:
# remove the awful features
for item in awful_feature:
    features_list.remove(item)
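
The cut-off can also be expressed as a fraction of the dataset instead of a fixed count. This is a sketch under assumptions: the function name `sparse_features`, the parameter `max_missing_fraction` and the toy data are hypothetical and only illustrate the idea.

```python
def sparse_features(dataset, feature_names, max_missing_fraction=0.5):
    """Return the features whose share of 'NaN' values reaches the threshold."""
    total = float(len(dataset))
    sparse = []
    for feature in feature_names:
        missing = sum(1 for record in dataset.values() if record[feature] == "NaN")
        if missing / total >= max_missing_fraction:
            sparse.append(feature)
    return sparse


# Made-up toy records: salary is missing once, director_fees three times.
toy_dict = {
    "PERSON A": {"salary": 100000, "director_fees": "NaN"},
    "PERSON B": {"salary": "NaN", "director_fees": "NaN"},
    "PERSON C": {"salary": 250000, "director_fees": 3000},
    "PERSON D": {"salary": 90000, "director_fees": "NaN"},
}
dropped = sparse_features(toy_dict, ["salary", "director_fees"])
```

The count of 60 used in the cell above corresponds to a fraction of roughly 0.42 for 143 people.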

Next, we load the data from data_dict with the featureFormat function. To make the data easy to save later, we also build a list of name ids; it is appended to the dataset as the last column in Task 4, so that we can convert the dataset back into a dict.

In [8]:
# sort_keys = False keeps the people in their original order
data = featureFormat(data_dict, features_list, sort_keys = False, remove_all_zeroes = False)

In [9]:
# keep the names in the same order as the rows of data, plus an integer id per name
name_list = data_dict.keys()
name_id = [i for i in range(len(name_list))]

# split the label (poi) from the features
labels, features = targetFeatureSplit(data)

In the code below, we import SelectKBest from sklearn to choose the top 6 of the 12 remaining features for our prediction.

In [10]:
from sklearn.feature_selection import SelectKBest, f_classif

# define a selector
selector = SelectKBest(f_classif, k=6)

# fit the selector and keep only the selected feature columns
transformed_features = selector.fit(features, labels).transform(features)

To inspect how SelectKBest ranked the features, we print the selector's scores with selector.scores_:

In [11]:
print "scores of the selector:\n", selector.scores_

scores of the selector:
[ 18.28968404   8.77277773  24.18144555   6.09417331  24.81507973
   4.18747751   9.13378204   1.64634113   5.24344971   0.16970095
   2.38261211   8.58942073]


Finally, we get the indices of the selected features to see which features were chosen:

In [12]:
index = selector.get_support(True)
new_features_list = np.array(features_list[1:])[index]
print "The whole feature list: \n", features_list[1:]
print "\n\n"
print "Indices chosen from the feature list: \n", index
print "\n\n"
print "Chosen features' names: \n", new_features_list

The whole feature list: 
['salary', 'total_payments', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'restricted_stock', 'to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']



Indices chosen from the feature list: 
[ 0  1  2  4  6 11]



Chosen features' names: 
['salary' 'total_payments' 'total_stock_value' 'exercised_stock_options'
 'restricted_stock' 'shared_receipt_with_poi']
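
To make the ranking easier to read, the scores printed above can be paired with the feature names and sorted. The sketch below copies the names and scores from the outputs above; the variable names `ranked` and `top_six` are my own.

```python
# Feature names and scores copied from the printed outputs above.
feature_names = [
    'salary', 'total_payments', 'total_stock_value', 'expenses',
    'exercised_stock_options', 'other', 'restricted_stock', 'to_messages',
    'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi',
    'shared_receipt_with_poi',
]
scores = [
    18.28968404, 8.77277773, 24.18144555, 6.09417331, 24.81507973,
    4.18747751, 9.13378204, 1.64634113, 5.24344971, 0.16970095,
    2.38261211, 8.58942073,
]

# Sort (name, score) pairs from the highest score to the lowest.
ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
top_six = [name for name, _ in ranked[:6]]
```

The six highest-scoring names are the same six features that the selector returned, with exercised_stock_options and total_stock_value at the top.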


## Task 4: Create new features

I think two new features, ratio_message_to_poi and ratio_message_from_poi, may help the model predict better. As implemented in the code below, they are computed as follows:

> ratio_message_to_poi = from_poi_to_this_person / to_messages

> ratio_message_from_poi = from_this_person_to_poi / from_messages

Despite its name, ratio_message_to_poi is therefore the share of a person's received messages that came from POIs, and ratio_message_from_poi is the share of a person's sent messages that went to POIs. We will test the influence of these new features in a later section by comparing the results with and without them.

In [13]:
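# column indices in data follow features_list:
# 8 = to_messages, 9 = from_poi_to_this_person,
# 10 = from_messages, 11 = from_this_person_to_poi
ratio_message_to_poi = []
ratio_message_from_poi = []
for i in range(len(data)):
    # featureFormat stores missing values as 0, so a 0 means "no ratio available"
    if data[i][8] == 0 or data[i][9] == 0:
        ratio_message_to_poi.append(0)
    else:
        ratio_message_to_poi.append(data[i][9]/data[i][8])

    if data[i][11] == 0 or data[i][10] == 0:
        ratio_message_from_poi.append(0)
    else:
        ratio_message_from_poi.append(data[i][11]/data[i][10])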

In [14]:
# append ratio_message_to_poi, ratio_message_from_poi and the name id as the last three columns
data = np.c_[data, ratio_message_to_poi, ratio_message_from_poi, name_id]

## Task 5: Scaling

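We use the scale function to apply z-score normalization to every feature column (the poi label in the first column and the name id in the last column are left untouched).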

In [15]:
from sklearn.preprocessing import scale
data = np.array(data)
data[:, 1:-1] = scale(data[:, 1:-1])
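
For reference, z-score normalization subtracts the column mean and divides by the column's standard deviation. The sketch below does this by hand for a single list of values, using the population standard deviation (dividing by n); the function name `z_scores` and the handling of a constant column are my own choices.

```python
import math


def z_scores(values):
    """Standardize a list of numbers to zero mean and unit (population) variance."""
    n = float(len(values))
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Mean is 5 and population standard deviation is 2 for this sample.
scaled = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After scaling, the values are expressed in standard deviations from the mean, so features measured in dollars and features measured in message counts end up on a comparable scale.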

## Task 6: Store the dataset as new_data_dict

We store the cleaned dataset as new_data_dict, which makes it easy to pass to the functions in tester.py.

In [16]:
# store the data as the original data_dict
new_data_dict = {}
for people in range(len(data)):
    name_id = data[people][-1]
    new_data_dict[name_list[int(name_id)]] = {}
    new_data_dict[name_list[int(name_id)]]["poi"] = data[people][0]
    for feature in index:
        new_data_dict[name_list[int(name_id)]][features_list[feature+1]] = data[people][feature+1]
        
    # add the new features value to the dict
    new_data_dict[name_list[int(name_id)]]["ratio_message_to_poi"] = data[people][-3]
    new_data_dict[name_list[int(name_id)]]["ratio_message_from_poi"] = data[people][-2]

In [17]:
new_features_list = new_features_list.tolist()
# add the poi feature to the first of the list
new_features_list.insert(0, "poi")
# add the ratio_message_to_poi, ratio_message_from_poi to the features_list
new_features_list.append("ratio_message_to_poi")
new_features_list.append("ratio_message_from_poi")

data = featureFormat(new_data_dict, new_features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

## Task 7: Try a variety of classifiers

In this section, we'll try a variety of classifiers and choose the best one as our final model. We'll also compare the predictions made with and without the new features we created, to show how much these features help.

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

### Prediction and Validation:

In the following blocks, we'll measure how much the new features help the model. First, we'll print the two feature lists to see how they differ. Then we'll make predictions with three machine learning techniques: SVM, decision tree (DT) and naive Bayes (NB).

We'll call the test_classifier() function to measure how well each model predicts. It uses StratifiedShuffleSplit for cross-validation, which is a merge of StratifiedKFold and ShuffleSplit and returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Validation matters because it guards against overfitting (predicting well on the training set but worse on the validation set) and so tells us how reliable our algorithm is.

Finally, test_classifier() prints accuracy, precision, recall, F1 and F2 as evaluation metrics, which we'll discuss in the summary section.
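
To illustrate what "preserving the percentage of samples for each class" means, here is a minimal pure-Python sketch of one stratified random split. It is not sklearn's implementation; the function name, the `test_fraction` and `seed` parameters, and the toy labels are assumptions for illustration.

```python
import random


def stratified_shuffle_split(labels, test_fraction=0.1, seed=42):
    """Split indices into train/test so each class keeps its share in the test set."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_shuffle_split(labels, test_fraction=0.1)
```

With 20% positives in the toy labels, the 10-sample test set contains exactly 2 positives. This matters for a dataset like ours, where POIs are a small minority and a purely random split could leave a fold with no POI at all.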

In [19]:
print "features with the two newly created features:\n", new_features_list[1:]

features with the two newly created features:
['salary', 'total_payments', 'total_stock_value', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi', 'ratio_message_to_poi', 'ratio_message_from_poi']


In [20]:
old_features_list = new_features_list[:-2]
print "features without the two newly created features:\n", old_features_list[1:]

features without the two newly created features:
['salary', 'total_payments', 'total_stock_value', 'exercised_stock_options', 'restricted_stock', 'shared_receipt_with_poi']


At the beginning of this part, we use GridSearchCV() to tune the parameters of a model and show how important parameter tuning is. We take the DT model and use new_features_list as the features for prediction. The search settles on criterion: gini and min_samples_leaf: 1, which are also the DecisionTreeClassifier defaults.

In [22]:
parameters = {'criterion':('gini', 'entropy'), 'min_samples_leaf': [1, 3]}
dt = tree.DecisionTreeClassifier()
clf = GridSearchCV(dt, parameters, scoring='f1')

test_classifier(clf, new_data_dict, new_features_list)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'criterion': ('gini', 'entropy'), 'min_samples_leaf': [1, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)
	Accuracy: 0.75191	Precision: 0.30393	Recall: 0.28250	F1: 0.29282	F2: 0.28654
	Total predictions: 11000	True positives:  565	False positives: 1294	False negatives: 1435	True negatives: 7706
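
The parameter grid above is small: it expands to four candidate settings. The sketch below only enumerates them with itertools; it does not fit anything, and the helper name `expand_grid` is hypothetical. A grid search fits every candidate under cross-validation and keeps the best-scoring one.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {'criterion': ('gini', 'entropy'), 'min_samples_leaf': [1, 3]}


def expand_grid(grid):
    """List every combination of parameter values as a dict."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(parameters)
```

After a GridSearchCV object has been fitted, the winning combination can be read from its best_params_ attribute.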



Then we fit different models to our dataset.

#### SVM

This model predicts that nearly everyone in the test set is a non-POI.

In [24]:
test_classifier(svm.SVC(kernel = "sigmoid"), new_data_dict, new_features_list)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.81236	Precision: 0.04286	Recall: 0.00150	F1: 0.00290	F2: 0.00186
	Total predictions: 11000	True positives:    3	False positives:   67	False negatives: 1997	True negatives: 8933



#### Decision Tree

This model meets the project requirement: both precision and recall are above 0.3.

In [25]:
test_classifier(tree.DecisionTreeClassifier(), new_data_dict, new_features_list)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.75327	Precision: 0.32703	Recall: 0.33750	F1: 0.33219	F2: 0.33535
	Total predictions: 11000	True positives:  675	False positives: 1389	False negatives: 1325	True negatives: 7611



#### GaussianNB

This model has the highest precision of the three, and its accuracy is higher than the decision tree's, but its recall (about 0.18) is below the 0.3 requirement.

In [26]:
test_classifier(GaussianNB(), new_data_dict, new_features_list)

GaussianNB(priors=None)
	Accuracy: 0.80036	Precision: 0.39135	Recall: 0.17650	F1: 0.24328	F2: 0.19827
	Total predictions: 11000	True positives:  353	False positives:  549	False negatives: 1647	True negatives: 8451



### Prediction without the newly created features:



#### SVM

In [27]:
test_classifier(svm.SVC(kernel = "sigmoid"), new_data_dict, old_features_list)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
	Accuracy: 0.79880	Precision: 0.28571	Recall: 0.00400	F1: 0.00789	F2: 0.00498
	Total predictions: 10000	True positives:    8	False positives:   20	False negatives: 1992	True negatives: 7980



#### Decision Tree

In [28]:
test_classifier(tree.DecisionTreeClassifier(), new_data_dict, old_features_list)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.75000	Precision: 0.36249	Recall: 0.32950	F1: 0.34521	F2: 0.33561
	Total predictions: 10000	True positives:  659	False positives: 1159	False negatives: 1341	True negatives: 6841



#### GaussianNB

In [29]:
test_classifier(GaussianNB(), new_data_dict, old_features_list)

GaussianNB(priors=None)
	Accuracy: 0.78570	Precision: 0.41248	Recall: 0.16850	F1: 0.23926	F2: 0.19111
	Total predictions: 10000	True positives:  337	False positives:  480	False negatives: 1663	True negatives: 7520



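The results above show that, with the newly created features, all three models get a slightly higher accuracy but a lower precision. Recall rises slightly for the decision tree and GaussianNB, and F1 falls for the SVM and the decision tree. Note that the two runs are scored on different numbers of predictions (11000 vs. 10000), so the comparison is only approximate.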

## Task 8: Dump your classifier, dataset, and features_list
You do not need to change anything below, but make sure
that the version of poi_id.py that you submit can be run on its own and
generates the necessary .pkl files for validating your results.

In [30]:
dump_classifier_and_data(tree.DecisionTreeClassifier(), new_data_dict, old_features_list)

## Summary

In this project, I first went through all the data in the dataset and dropped the 'bad features' that had many NaN values. Then I used SelectKBest to automatically select the top 6 features as our feature list.

Secondly, using the matplotlib module, I plotted the 6 features two at a time to check for outliers in the dataset. After finding the outliers, I sorted the dataset by the given feature and removed these points.

Thirdly, I applied a scale function to the dataset so that the algorithms can converge faster, which saves time during training.

Finally, I saved the dataset as new_data_dict so that other people can run the code easily.

For local testing, I chose three machine learning techniques: NB, SVM and DT. After running the code with different parameters, I chose the DT model as the final classifier because it is the only one of the three whose precision and recall both meet the 0.3 requirement.

To test the model, I split the dataset into a training set and a validation set, in order to estimate how well the model has been trained and to estimate its properties.

I use four evaluation metrics to show how well the model performs.

Accuracy is the number of correct predictions divided by the total number of predictions the model makes.

Precision is the ratio tp / (tp + fp), where tp is the number of true positives (the real label is POI and the predicted label is also POI) and fp is the number of false positives (the real label is non-POI but the predicted label is POI).

Recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn is the number of false negatives (the real label is POI but the predicted label is non-POI).

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.

> F1 = 2 \* (precision \* recall) / (precision + recall)
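
As a worked example, the four metrics can be recomputed from the confusion counts that test_classifier() printed for the decision tree with the new features (true positives 675, false positives 1389, false negatives 1325, true negatives 7611). The function name `scores_from_counts` is my own, and the sketch assumes there is at least one predicted positive and one real positive so that no division by zero occurs.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = float(tp + tn) / total
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the decision tree output with the new features.
accuracy, precision, recall, f1 = scores_from_counts(tp=675, fp=1389, fn=1325, tn=7611)
```

Rounded to five decimals, the results match the printed values: accuracy 0.75327, precision 0.32703, recall 0.33750 and F1 0.33219.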