# P5 Enron Fraud Detectors using Enron Emails and Financial Data
by Alexey Chesnok

*Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  [relevant rubric items: “data exploration”, “outlier investigation”]*

The goal of this project is to analyze the [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/) with predictive machine learning techniques and identify individuals who might have been involved in the Enron fraud. The data set already contains 18 records labeled as "POI" (Person of Interest): individuals who were indicted, settled without admitting guilt, or testified in exchange for immunity. A supervised classifier can learn from these labeled examples which patterns separate POIs from everyone else. My task is to pinpoint potential additional persons of interest, using their financial information from the dataset and their email interactions with existing POIs. 
The dataset consists of 146 records with 21 features (14 financial features, 6 email features and 1 predefined POI label).


In [1]:
import sys
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import scipy
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn import svm, grid_search
from sklearn import cross_validation

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
import tester

plt.style.use('ggplot')



In [2]:
#Select features
financial_features = ['salary', 
                      'deferral_payments', 
                      'total_payments', 
                      'loan_advances', 
                      'bonus', 
                      'restricted_stock_deferred', 
                      'deferred_income', 
                      'total_stock_value', 
                      'expenses', 
                      'exercised_stock_options', 
                      'other', 
                      'long_term_incentive', 
                      'restricted_stock', 
                      'director_fees']

email_features = ['to_messages', 
                 'email_address', 
                 'from_poi_to_this_person', 
                 'from_messages', 
                 'from_this_person_to_poi', 
                 'shared_receipt_with_poi'] 

features_list = email_features + financial_features
features_list.insert(0, 'poi')
                

In [3]:
print features_list

['poi', 'to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']


In [4]:
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

In [5]:
data_dict

{'ALLEN PHILLIP K': {'bonus': 4175000,
  'deferral_payments': 2869717,
  'deferred_income': -3081055,
  'director_fees': 'NaN',
  'email_address': 'phillip.allen@enron.com',
  'exercised_stock_options': 1729541,
  'expenses': 13868,
  'from_messages': 2195,
  'from_poi_to_this_person': 47,
  'from_this_person_to_poi': 65,
  'loan_advances': 'NaN',
  'long_term_incentive': 304805,
  'other': 152,
  'poi': False,
  'restricted_stock': 126027,
  'restricted_stock_deferred': -126027,
  'salary': 201955,
  'shared_receipt_with_poi': 1407,
  'to_messages': 2902,
  'total_payments': 4484442,
  'total_stock_value': 1729541},
 'BADUM JAMES P': {'bonus': 'NaN',
  'deferral_payments': 178980,
  'deferred_income': 'NaN',
  'director_fees': 'NaN',
  'email_address': 'NaN',
  'exercised_stock_options': 257817,
  'expenses': 3486,
  'from_messages': 'NaN',
  'from_poi_to_this_person': 'NaN',
  'from_this_person_to_poi': 'NaN',
  'loan_advances': 'NaN',
  'long_term_incentive': 'NaN',
  'other': 'NaN'

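A significant number of features have missing values. In addition, the predefined POIs have no entries for director_fees and restricted_stock_deferred, which makes both features irrelevant for further investigation. The "email_address" field is likewise of no use, since it is a text identifier rather than a numeric feature.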

In [6]:
#Transpose dataframe so that each row is a person
data_dict = pd.DataFrame.from_dict(data_dict)
#reorder features
data_dict = data_dict.T
data_dict = data_dict[features_list]
list(data_dict.index.values)


['ALLEN PHILLIP K',
 'BADUM JAMES P',
 'BANNANTINE JAMES M',
 'BAXTER JOHN C',
 'BAY FRANKLIN R',
 'BAZELIDES PHILIP J',
 'BECK SALLY W',
 'BELDEN TIMOTHY N',
 'BELFER ROBERT',
 'BERBERIAN DAVID',
 'BERGSIEKER RICHARD P',
 'BHATNAGAR SANJAY',
 'BIBI PHILIPPE A',
 'BLACHMAN JEREMY M',
 'BLAKE JR. NORMAN P',
 'BOWEN JR RAYMOND M',
 'BROWN MICHAEL',
 'BUCHANAN HAROLD G',
 'BUTTS ROBERT H',
 'BUY RICHARD B',
 'CALGER CHRISTOPHER F',
 'CARTER REBECCA C',
 'CAUSEY RICHARD A',
 'CHAN RONNIE',
 'CHRISTODOULOU DIOMEDES',
 'CLINE KENNETH W',
 'COLWELL WESLEY',
 'CORDES WILLIAM R',
 'COX DAVID',
 'CUMBERLAND MICHAEL S',
 'DEFFNER JOSEPH M',
 'DELAINEY DAVID W',
 'DERRICK JR. JAMES V',
 'DETMERING TIMOTHY J',
 'DIETRICH JANET R',
 'DIMICHELE RICHARD G',
 'DODSON KEITH',
 'DONAHUE JR JEFFREY M',
 'DUNCAN JOHN H',
 'DURAN WILLIAM D',
 'ECHOLS JOHN B',
 'ELLIOTT STEVEN',
 'FALLON JAMES B',
 'FASTOW ANDREW S',
 'FITZGERALD JAY L',
 'FOWLER PEGGY',
 'FOY JOE',
 'FREVERT MARK A',
 'FUGH JOHN L',
 'GAHN 

In [7]:
data_dict = data_dict.replace('NaN', np.nan)

In [8]:
print data_dict.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
poi                          146 non-null bool
to_messages                  86 non-null float64
email_address                111 non-null object
from_poi_to_this_person      86 non-null float64
from_messages                86 non-null float64
from_this_person_to_poi      86 non-null float64
shared_receipt_with_poi      86 non-null float64
salary                       95 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
loan_advances                4 non-null float64
bonus                        82 non-null float64
restricted_stock_deferred    18 non-null float64
deferred_income              49 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
exercised_stock_options      102 non-null float64
other                        93 non-null float6

In [9]:
data_dict_poi = data_dict.loc[data_dict['poi'] == True]
print data_dict_poi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18 entries, BELDEN TIMOTHY N to YEAGER F SCOTT
Data columns (total 21 columns):
poi                          18 non-null bool
to_messages                  14 non-null float64
email_address                18 non-null object
from_poi_to_this_person      14 non-null float64
from_messages                14 non-null float64
from_this_person_to_poi      14 non-null float64
shared_receipt_with_poi      14 non-null float64
salary                       17 non-null float64
deferral_payments            5 non-null float64
total_payments               18 non-null float64
loan_advances                1 non-null float64
bonus                        16 non-null float64
restricted_stock_deferred    0 non-null float64
deferred_income              11 non-null float64
total_stock_value            18 non-null float64
expenses                     18 non-null float64
exercised_stock_options      12 non-null float64
other                        18 non-null float64


In [10]:
print data_dict.isnull().sum(axis=1).sort_values(ascending=False).head()


LOCKHART EUGENE E                20
GRAMM WENDY L                    18
WROBEL BRUCE                     18
WHALEY DAVID A                   18
THE TRAVEL AGENCY IN THE PARK    18
dtype: int64


I have also removed the record "THE TRAVEL AGENCY IN THE PARK", which is not a person, and "LOCKHART EUGENE E", which has missing values for all features and is therefore useless for the investigation. 

In [11]:
def PlotHelper(feature_1, feature_2):
    sns.FacetGrid(data_dict, hue="poi").map(plt.scatter, feature_1, feature_2).add_legend()
    plt.show()

In [12]:
PlotHelper('total_payments', 'total_stock_value')


In [13]:
for index, row in data_dict.iterrows():
    # NaN compares as False, so records without a salary are skipped
    if row["salary"] > 1000000:
        print index

FREVERT MARK A
LAY KENNETH L
SKILLING JEFFREY K
TOTAL


In [14]:
data_dict['total_payments'].idxmax()


'TOTAL'

Several records stood out as outliers in the financial features: FREVERT MARK A, LAY KENNETH L, SKILLING JEFFREY K, and TOTAL. "TOTAL" is a spreadsheet summary row and is removed; the other three are real individuals, among them known POIs, so they are kept for further investigation. I have also removed the features "email_address", "director_fees" and "restricted_stock_deferred": the latter two have no entries for POIs, and the email address is a text identifier rather than a numeric feature.
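A minimal sketch of how such a spreadsheet artifact can be spotted and dropped. The small DataFrame and all of its values are invented for illustration and are not taken from the Enron data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Enron table; all numbers are invented for illustration.
toy = pd.DataFrame(
    {"salary": [200000.0, 1100000.0, np.nan, 1300000.0],
     "total_payments": [450000.0, 8000000.0, 90000.0, 8540000.0]},
    index=["PERSON A", "PERSON B", "PERSON C", "TOTAL"],
)

# idxmax returns the index label of the largest value and skips NaN.
largest = toy["total_payments"].idxmax()
print(largest)

# Rows that are not people are dropped by label; real individuals stay.
not_people = ["TOTAL"]
cleaned = toy.drop(not_people)
print(cleaned["total_payments"].idxmax())
```

Extreme values that belong to real people are deliberately left in place here, since they may be exactly the signal the classifier needs.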

In [15]:
#Clean up
data_dict = data_dict.drop(['TOTAL', 'THE TRAVEL AGENCY IN THE PARK','LOCKHART EUGENE E'])
#remove useless (email address, director_fees restricted_stock_deferred)
del data_dict['email_address']
del data_dict['director_fees']
del data_dict['restricted_stock_deferred']

*2) What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]*


I generated 2 additional email features and added them to the list: "proportion_to_poi" (the share of the emails sent by the person that went to POIs) and "proportion_from_poi" (the share of the emails received by the person that came from POIs). Both track the level of interaction with known persons of interest.
The feature "proportion_to_poi" ended up among the top features both by correlation coefficient (0.34) and in the SelectKBest runs, where it was consistently in the top 5.

In [None]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.

In [16]:
data_dict['proportion_to_poi'] = data_dict['from_this_person_to_poi']/data_dict['from_messages']
data_dict['proportion_from_poi'] = data_dict['from_poi_to_this_person']/data_dict['to_messages']
# Division by zero yields numeric infinity, not the string 'inf'
data_dict = data_dict.replace([np.inf, -np.inf], 0)
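The ratio features can also be computed with explicit handling of undefined results. The helper name `safe_ratio` and the sample counts below are hypothetical; this is a sketch of one possible approach, not the notebook's own code:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields 0 where the result is undefined."""
    ratio = numerator / denominator
    # Division by zero gives +/-inf and missing inputs give NaN; map both to 0.
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

# Invented message counts, for illustration only.
toy = pd.DataFrame({
    "from_this_person_to_poi": [65.0, 0.0, np.nan, 5.0],
    "from_messages": [2195.0, 29.0, np.nan, 0.0],
})
toy["proportion_to_poi"] = safe_ratio(
    toy["from_this_person_to_poi"], toy["from_messages"]
)
print(toy["proportion_to_poi"].tolist())
```

Mapping undefined ratios to 0 treats people without email data as having no POI contact, which is a modelling choice rather than a fact about them.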

In [17]:
my_dataset = data_dict.T
my_dataset.fillna(value=0, inplace = True)

In [18]:
my_dataset

Unnamed: 0,ALLEN PHILLIP K,BADUM JAMES P,BANNANTINE JAMES M,BAXTER JOHN C,BAY FRANKLIN R,BAZELIDES PHILIP J,BECK SALLY W,BELDEN TIMOTHY N,BELFER ROBERT,BERBERIAN DAVID,...,WASAFF GEORGE,WESTFAHL RICHARD K,WHALEY DAVID A,WHALLEY LAWRENCE G,WHITE JR THOMAS E,WINOKUR JR. HERBERT S,WODRASKA JOHN,WROBEL BRUCE,YEAGER F SCOTT,YEAP SOON
poi,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,True,False
to_messages,2902,0,566,0,0,0,7315,7991,0,0,...,400,0,0,6019,0,0,0,0,0,0
from_poi_to_this_person,47,0,39,0,0,0,144,228,0,0,...,22,0,0,186,0,0,0,0,0,0
from_messages,2195,0,29,0,0,0,4343,484,0,0,...,30,0,0,556,0,0,0,0,0,0
from_this_person_to_poi,65,0,0,0,0,0,386,108,0,0,...,7,0,0,24,0,0,0,0,0,0
shared_receipt_with_poi,1407,0,465,0,0,0,2639,5521,0,0,...,337,0,0,3920,0,0,0,0,0,0
salary,201955,0,477,267102,239671,80818,231330,213999,0,216582,...,259996,63744,0,510364,317543,0,0,0,158403,0
deferral_payments,2.86972e+06,178980,0,1.29574e+06,260455,684694,0,2.14401e+06,-102500,0,...,831299,0,0,0,0,0,0,0,0,0
total_payments,4.48444e+06,182466,916197,5.63434e+06,827696,860136,969068,5.50163e+06,102500,228474,...,1.0344e+06,762135,0,4.67757e+06,1.93436e+06,84992,189583,0,360300,55097
loan_advances,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
features_list = features_list + ['proportion_to_poi', 'proportion_from_poi']
features_list = [e for e in features_list if e not in ('email_address', 'director_fees', 'restricted_stock_deferred')]

In [20]:
print features_list

['poi', 'to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'proportion_to_poi', 'proportion_from_poi']


For feature selection I tested correlation coefficients with the POI label, Lasso regression and the univariate feature selection tool SelectKBest with various values of k to identify the best-performing features. 

In [21]:
#correlations = zip(features_list[1:], data_dict.corrwith(data_dict['poi']))
#print(correlations)
#sorted(correlations, key = lambda x: x[1], reverse=True)
print data_dict.corrwith(data_dict['poi'])

poi                        1.000000
to_messages                0.058954
from_poi_to_this_person    0.167722
from_messages             -0.074308
from_this_person_to_poi    0.112940
shared_receipt_with_poi    0.228313
salary                     0.264976
deferral_payments         -0.098428
total_payments             0.230102
loan_advances              0.999851
bonus                      0.302384
deferred_income           -0.265698
total_stock_value          0.366462
expenses                   0.060292
exercised_stock_options    0.503551
other                      0.120270
long_term_incentive        0.254723
restricted_stock           0.224814
proportion_to_poi          0.339938
proportion_from_poi        0.104406
dtype: float64


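For my first feature list I selected the features with the highest correlation coefficients; 5 of them exceeded 0.3. Note that loan_advances has only 4 non-null values in the raw data, so its near-perfect correlation rests on very few data points.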
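Threshold-based selection of this kind can be sketched in a few lines. The toy columns and values are invented, and the 0.3 cut-off simply mirrors the one used in this section:

```python
import pandas as pd

# Invented values; 'poi' is the boolean label, as in the project data.
toy = pd.DataFrame({
    "poi":   [True, False, False, True, False, False],
    "bonus": [5.0, 1.0, 1.5, 4.0, 0.5, 1.0],
    "other": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
})

# Pearson correlation of every feature with the label (cast to 0/1).
label = toy["poi"].astype(float)
corr = toy.drop(columns="poi").corrwith(label)

threshold = 0.3
selected = corr[corr > threshold].index.tolist()
print(selected)
```

Filtering on `corr > threshold` keeps only positively correlated features; using `corr.abs()` instead would also keep strongly negative ones such as a deferred-income style column.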

In [22]:
#Select features with correlation above 0.3
best_correlations_list = ['poi',
                          'loan_advances',
                          'exercised_stock_options',
                          'total_stock_value',
                          'bonus',
                          'proportion_to_poi']
print "Features with correlations above 0.3"
print best_correlations_list

Features with correlations above 0.3
['poi', 'loan_advances', 'exercised_stock_options', 'total_stock_value', 'bonus', 'proportion_to_poi']


In [34]:
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [24]:
from sklearn.linear_model import Lasso
#features = selector.fit_transform(features, labels)
regression = Lasso()
regression.fit(features, labels)
coeficients = zip(features_list[1:],regression.coef_ != 0)
best_coefs = sorted(coeficients, key = lambda x: x[1])
#print best_coefs
best_lasso_list = list(map(lambda x: x[0], best_coefs))
best_lasso_list = ["poi"] + [e for e in best_lasso_list if e not in ('proportion_to_poi', 'from_poi_to_this_person', 'proportion_from_poi')]
print("Best Lasso Regression features:")
print best_lasso_list

Best Lasso Regression features:
['poi', 'to_messages', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi', 'salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock']
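One common way to read a Lasso fit as a feature selector is to keep only the features whose coefficient was not shrunk to exactly zero. The sketch below uses synthetic data, hypothetical feature names and an illustrative `alpha`; it is not the selection procedure used in the cell above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
# Only the first column drives the target in this synthetic setup.
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

model = Lasso(alpha=0.1)  # alpha is an illustrative choice, not a tuned value
model.fit(X, y)

# Features whose coefficient was not shrunk to exactly zero are kept.
kept = [name for name, coef in zip(names, model.coef_) if coef != 0]
print(kept)
```

Because the L1 penalty depends on feature scale, features measured in dollars and features measured as ratios would normally be scaled before such a fit.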




Results of some testing rounds are as follows:

| Feature selection | Accuracy | Precision | Recall | F1 | F2 |
|---|---|---|---|---|---|
| Correlations above 0.3 | 0.76608 | 0.23565 | 0.23200 | 0.23381 | 0.23272 |
| Lasso Regression | 0.79847 | 0.25230 | 0.26050 | 0.25633 | 0.25882 |
| SelectKBest, k=3 | 0.84292 | 0.48444 | 0.32700 | 0.39045 | 0.34973 |
| SelectKBest, k=5 | 0.86014 | 0.51558 | 0.34750 | 0.41517 | 0.37174 |
| SelectKBest, k=7 | 0.85321 | 0.48289 | 0.38800 | 0.43027 | 0.40387 |
| SelectKBest, k=8 | 0.85443 | 0.48860 | 0.40700 | 0.44408 | 0.42106 |
| SelectKBest, k=10 | 0.83887 | 0.37447 | 0.31100 | 0.33980 | 0.32191 |

        

Since recall is arguably the more important metric in our case, and SelectKBest with a cut-off at 8 features yielded the highest recall and F1 scores, I chose k=8 for the final model runs.

In [35]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

k=8
selector = SelectKBest(f_classif, k=k)
# Only the fitted scores are needed here, so fit() is enough
selector.fit(features, labels)
print("SelectKBest feature scores:")
scores = zip(features_list[1:],selector.scores_)
#scores = zip(features_list[1:],-np.log10(selector.pvalues_))
sorted_scores = sorted(scores, key = lambda x: x[1], reverse=True)
#print sorted_scores
best_kbest_list = ["poi"] + list(map(lambda x: x[0], sorted_scores))[0:k]
print("Best SelectKBest features:")
print best_kbest_list

SelectKBest feature scores:
Best SelectKBest features:
['poi', 'exercised_stock_options', 'total_stock_value', 'bonus', 'salary', 'proportion_to_poi', 'deferred_income', 'long_term_incentive', 'restricted_stock']


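After selecting the 8 best-performing features, I scaled them with MinMaxScaler before using them in the Support Vector Machines model. SVM relies on distances between data points, so without scaling the features measured in dollars would dominate the ratio features.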
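Min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). The sketch below spells this out with NumPy on invented numbers. Fitting the minimum and maximum on the training rows only is a design choice assumed here so that the test rows do not leak into the scaling, and the helper name `min_max_scale` is hypothetical:

```python
import numpy as np

def min_max_scale(train, test):
    """Scale columns to [0, 1] using statistics from the training rows only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    return (train - lo) / span, (test - lo) / span

# Invented salary / ratio pairs, for illustration only.
train = np.array([[200000.0, 0.02], [1000000.0, 0.30], [600000.0, 0.10]])
test = np.array([[400000.0, 0.16]])
train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)
```

With this approach, test values outside the training range can fall slightly below 0 or above 1, which is expected.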

In [26]:
data = featureFormat(my_dataset, best_kbest_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

In [None]:
#data = featureFormat(my_dataset, best_correlations_list, sort_keys = True)
#labels, features = targetFeatureSplit(data)

In [27]:
scaler = MinMaxScaler()
features = scaler.fit_transform(features)

In [None]:
#data = featureFormat(my_dataset, best_lasso_list, sort_keys = True)
#labels, features = targetFeatureSplit(data)

In [None]:
#data = featureFormat(my_dataset, features_list, sort_keys = True)
#labels, features = targetFeatureSplit(data)

In [None]:
#print labels, features

*3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?  [relevant rubric item: “pick an algorithm”]*

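I tried the Naive Bayes and Support Vector Machines algorithms. SVM initially had better results on a train_test_split hold-out, especially after parameter tuning with GridSearchCV: in the runs below its accuracy is 0.86 vs. 0.80 for Naive Bayes, although its recall is lower (0.17 vs. 0.40).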

In [28]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html
from sklearn.naive_bayes import GaussianNB
clf_n = GaussianNB()
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.4, random_state=42)
clf_n.fit(features_train, labels_train)
pred = clf_n.predict(features_test)

acc = accuracy_score(labels_test, pred)
rec = recall_score(labels_test, pred)
prec = precision_score(labels_test, pred)

print "Naive bayes model"
print "accuracy:", acc
print "precision:", prec
print "recall:", rec



Naive bayes model
accuracy: 0.803571428571
precision: 0.444444444444
recall: 0.4


In [29]:
from sklearn import svm
clf = svm.SVC(kernel='rbf', C=10, gamma=1)
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.3, random_state=42)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

accuracy = accuracy_score(labels_test, pred)
recall = recall_score(labels_test, pred)
precision = precision_score(labels_test, pred)

print "SVM"
print "accuracy: ", accuracy
print "precision: ", precision
print "recall: ", recall

SVM
accuracy:  0.857142857143
precision:  0.5
recall:  0.166666666667


*4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]*

Parameter tuning in machine learning is the process of selecting the best parameter values for a model. This can be done either by running the model multiple times with different combinations of parameter values or by using model selection and evaluation tools like GridSearchCV. Tuning matters because it identifies the parameter values that give the best algorithm performance; without it, the optimal parameter configuration might be missed, and the model may underfit or overfit the data.

I had already performed some parameter tuning for SelectKBest while looking for the best value of k and its impact on the model metrics. In addition, I used GridSearchCV to identify the best-performing parameters for the Support Vector Machines algorithm, tuning C (0.001, 0.01, 0.1, 1, 10), gamma (0.001, 0.01, 0.1, 1) and kernel (linear or rbf). The Naive Bayes model doesn't have any parameters to tune.


In [30]:
from sklearn.grid_search import GridSearchCV
def param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    kernels = ['linear', 'rbf']
    param_grid = {'C': Cs, 'gamma' : gammas, 'kernel' : kernels }
    # Local name chosen so it does not shadow the imported grid_search module
    search = GridSearchCV(svm.SVC(), param_grid, cv=nfolds)
    search.fit(X, y)
    return search.best_params_
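In later scikit-learn releases `GridSearchCV` lives in `sklearn.model_selection`, since the `sklearn.grid_search` module was removed. A hedged sketch of the same search on synthetic data follows; wrapping the scaler in a `Pipeline` and scoring by F1 are assumptions made for this example, not settings taken from the notebook:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data standing in for the project features.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 3)),
               rng.normal(2.5, 1.0, size=(20, 3))])
y = np.array([0] * 60 + [1] * 20)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
param_grid = {
    "svm__C": [0.001, 0.01, 0.1, 1, 10],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
    "svm__kernel": ["linear", "rbf"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# scoring="f1" is an assumption here; the default would be accuracy.
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

On a data set with few positives, accuracy-based tuning can favour a model that never predicts the positive class, which is why a recall-sensitive score may be preferable.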

The best parameters I identified for the SVM model are:

In [31]:
print param_selection(features, labels, 18)

{'kernel': 'linear', 'C': 0.001, 'gamma': 0.001}


*5. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]*

Validation in machine learning can be defined as the process of dividing the data set into training and testing partitions and using the test portion to evaluate the performance of the trained model. It allows independent testing of the algorithm. 
A classic validation mistake is testing the model on the same subset of data it was trained on, which hides overfitting and does not allow a true evaluation of model performance.
I used the Stratified ShuffleSplit cross-validation technique provided in tester.py for model validation.
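A short sketch of stratified shuffle splitting with the newer `sklearn.model_selection` API, which replaced `sklearn.cross_validation`. The number of splits and the placeholder features are illustrative; only the class balance (18 POIs among the 143 records left after cleanup) comes from this project:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# 143 records with 18 positives mirrors the class balance after cleanup.
labels = np.array([1] * 18 + [0] * 125)
features = np.zeros((len(labels), 2))  # placeholder features

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=10)
splits = list(splitter.split(features, labels))
for train_idx, test_idx in splits:
    # No record may appear on both sides of a split.
    assert set(train_idx).isdisjoint(test_idx)
    print(len(test_idx), int(labels[test_idx].sum()))
```

Stratification keeps the share of POIs roughly equal in every test fold, which matters when the positive class is this small.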


In [36]:
from sklearn.cross_validation import StratifiedShuffleSplit
from feature_format import featureFormat, targetFeatureSplit
#clf = svm.SVC(kernel='linear', C=0.001, gamma=0.001)
#clf = clf_n
PERF_FORMAT_STRING = "\
\tAccuracy: {:>0.{display_precision}f}\tPrecision: {:>0.{display_precision}f}\t\
Recall: {:>0.{display_precision}f}\tF1: {:>0.{display_precision}f}\tF2: {:>0.{display_precision}f}"
RESULTS_FORMAT_STRING = "\tTotal predictions: {:4d}\tTrue positives: {:4d}\tFalse positives: {:4d}\
\tFalse negatives: {:4d}\tTrue negatives: {:4d}"

def test_classifier(clf, dataset, feature_list, folds = 1000):
    data = featureFormat(dataset, feature_list, sort_keys = True)
    labels, features = targetFeatureSplit(data)
    cv = StratifiedShuffleSplit(labels, folds, random_state = 10)
    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
    for train_idx, test_idx in cv: 
        features_train = []
        features_test  = []
        labels_train   = []
        labels_test    = []
        for ii in train_idx:
            features_train.append( features[ii] )
            labels_train.append( labels[ii] )
        for jj in test_idx:
            features_test.append( features[jj] )
            labels_test.append( labels[jj] )
        
        ### fit the classifier using training set, and test on test set
        clf.fit(features_train, labels_train)
        predictions = clf.predict(features_test)
        for prediction, truth in zip(predictions, labels_test):
            if prediction == 0 and truth == 0:
                true_negatives += 1
            elif prediction == 0 and truth == 1:
                false_negatives += 1
            elif prediction == 1 and truth == 0:
                false_positives += 1
            elif prediction == 1 and truth == 1:
                true_positives += 1
            else:
                print "Warning: Found a predicted label not == 0 or 1."
                print "All predictions should take value 0 or 1."
                print "Evaluating performance for processed predictions:"
                break
    try:
        total_predictions = true_negatives + false_negatives + false_positives + true_positives
        accuracy = 1.0*(true_positives + true_negatives)/total_predictions
        precision = 1.0*true_positives/(true_positives+false_positives)
        recall = 1.0*true_positives/(true_positives+false_negatives)
        f1 = 2.0 * true_positives/(2*true_positives + false_positives+false_negatives)
        f2 = (1+2.0*2.0) * precision*recall/(4*precision + recall)
        print clf
        print PERF_FORMAT_STRING.format(accuracy, precision, recall, f1, f2, display_precision = 5)
        print RESULTS_FORMAT_STRING.format(total_predictions, true_positives, false_positives, false_negatives, true_negatives)
        print ""
    except:
        print "Got a divide by zero when trying out:", clf
        print "Precision or recall may be undefined due to a lack of true positive predicitons."

*6. Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]*

For model evaluation I have chosen following metrics:

Accuracy - the proportion of cases in which the predicted POI status matched the actual POI status, i.e. the overall ability of the algorithm to predict the outcome, whether it is POI or non-POI.

Recall - out of all actual POIs in the dataset, how many the algorithm correctly labels as POIs.

Precision - out of all records labeled as POIs, how many are actual POIs.

F1 score - the harmonic mean of Precision and Recall.

After cross-validating both models with the Stratified ShuffleSplit method, the Naive Bayes model performed consistently better than SVM on all 4 metrics:
Accuracy: 0.85443 Precision: 0.48860 Recall: 0.40700 F1: 0.44408

In [37]:
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
clf = GaussianNB()
#clf = svm.SVC(kernel='linear', C=0.001, gamma=0.001)
#clf = svm.SVC(kernel='rbf', C=10, gamma=1)
test_classifier(clf, my_dataset, best_kbest_list)

GaussianNB(priors=None)
	Accuracy: 0.85443	Precision: 0.48860	Recall: 0.40700	F1: 0.44408	F2: 0.42106
	Total predictions: 14000	True positives:  814	False positives:  852	False negatives: 1186	True negatives: 11148
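
The reported metrics can be recomputed from the confusion counts in the output above. The helper name `summarize` is hypothetical; the formulas are the standard definitions of accuracy, precision, recall and F-beta:

```python
def summarize(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and F2 from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)

    def f_beta(beta):
        # F-beta weights recall beta times as much as precision.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": (tp + tn) / float(total),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }

# Counts taken from the tester output above.
scores = summarize(tp=814, fp=852, fn=1186, tn=11148)
for name in ("accuracy", "precision", "recall", "f1", "f2"):
    print("%s: %.5f" % (name, scores[name]))
```

In plain terms, roughly half of the people the model flags are actual POIs (precision of about 0.49), and it finds about 4 out of every 10 actual POIs (recall of about 0.41).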



In [None]:
# Export the feature list the classifier was actually validated on
dump_classifier_and_data(clf, my_dataset, best_kbest_list)