<h1>Enron 'person of interest' classifier: Miscellaneous analyses</h1>

This notebook runs a variety of analyses on the Enron data in order to a) explore the data and b) experiment with the classifiers, the features included, and other aspects of the classification pipeline to improve the results.

<h2>Data exploration</h2>

Here I attempt to understand the size and character of the data, and uncover some basic information about its features.

In [1]:
import pickle
import numpy as np

# Load the data
enron_data = pickle.load(open("final_project_dataset.pkl", "r"))

# Print the data type
print "Data type used for base data: %s" % type(enron_data).__name__

# Print the length of the data set
print "Number of rows in the data set: %s" % len(enron_data)
  
# Check how many POIs versus non-POIs are in the data
poi_count = 0
non_poi_count = 0

for key in enron_data:
    if enron_data[key]['poi'] == True:
        poi_count += 1
    elif enron_data[key]['poi'] == False:
        non_poi_count += 1

print "Number of POIs present: %s" % poi_count
print "Number of non-POIs present: %s" % non_poi_count
print "\n"

# Print list of features and percentage of records each covers
n_records = float(len(enron_data))
for internal_key in enron_data['METTS MARK']:
    nan_count = 0

    # Count the records where this feature is missing ('NaN')
    for name, record in enron_data.iteritems():
        if record[internal_key] == 'NaN':
            nan_count += 1
    print "Feature: %s; %s coverage" % (internal_key,
                    '{0:.0%}'.format(1 - nan_count / n_records))

Data type used for base data: dict
Number of rows in the data set: 146
Number of POIs present: 18
Number of non-POIs present: 128


Feature: salary; 65% coverage
Feature: to_messages; 59% coverage
Feature: deferral_payments; 27% coverage
Feature: total_payments; 86% coverage
Feature: exercised_stock_options; 70% coverage
Feature: bonus; 56% coverage
Feature: restricted_stock; 75% coverage
Feature: shared_receipt_with_poi; 59% coverage
Feature: restricted_stock_deferred; 12% coverage
Feature: total_stock_value; 86% coverage
Feature: expenses; 65% coverage
Feature: loan_advances; 3% coverage
Feature: from_messages; 59% coverage
Feature: other; 64% coverage
Feature: from_this_person_to_poi; 59% coverage
Feature: poi; 100% coverage
Feature: director_fees; 12% coverage
Feature: deferred_income; 34% coverage
Feature: long_term_incentive; 45% coverage
Feature: email_address; 76% coverage
Feature: from_poi_to_this_person; 59% coverage
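
The coverage figures above are the share of records in which a feature is not the string 'NaN'. As a minimal sketch of that calculation, assuming a small made-up dictionary in the same shape as the project data (the names, values and the `feature_coverage` helper are all hypothetical), dividing by the actual record count avoids hard-coding the size of the data set:

```python
# Toy records in the same shape as the project data; names and values are made up
toy_data = {
    "PERSON A": {"salary": 100, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 300, "bonus": 50, "poi": False},
    "PERSON D": {"salary": 400, "bonus": "NaN", "poi": False},
}

def feature_coverage(records):
    """Return {feature: fraction of records where the value is not 'NaN'}."""
    n_records = float(len(records))
    first_record = next(iter(records.values()))
    coverage = {}
    for feature in first_record:
        present = sum(1 for rec in records.values() if rec[feature] != "NaN")
        coverage[feature] = present / n_records
    return coverage

coverage = feature_coverage(toy_data)
for feature in sorted(coverage):
    print("Feature: %s; %s coverage" % (feature, "{0:.0%}".format(coverage[feature])))
```

The sketch is written to run under both Python 2 and Python 3, which is why it uses `print(...)` with a single pre-formatted string.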


<h2>Outlier detection</h2>

Here I attempt to detect and (where appropriate) remove outliers in the data.

In [2]:
import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load( open("final_project_dataset.pkl", "r") )

# Removes the spreadsheet TOTAL row; left commented out here so that
# the first figure still shows the outlier
# data_dict.pop( 'TOTAL', 0 )

# Print and optionally save figure of total stock v total payments
features = ["total_payments", "total_stock_value"]
data = featureFormat(data_dict, features)

for point in data:
    payments = point[0]
    stock = point[1]
    matplotlib.pyplot.scatter( payments, stock )

matplotlib.pyplot.xlabel("total payments")
matplotlib.pyplot.ylabel("total stock value")
# matplotlib.pyplot.savefig('./images/fig_02.png', dpi=300)
matplotlib.pyplot.show()

In [3]:
# Print out the outlier with the highest total payments
payments_test = data.max(axis=0)[0]
for m in data_dict.iteritems():
    if m[1]['total_payments'] == payments_test:
        print payments_test
        print m[0]

309886585.0
TOTAL


In [4]:
# Explore whether there are further payments / stock outliers
# by printing information on highest figures

print data.max(axis=0)[0]
print type(data).__name__
print data.max(axis=0)[1]
# NB: ndarray.sort() sorts in place and returns None (hence the "None"
# in the output). With axis=0 each column is sorted independently, so
# from here on a row no longer pairs one person's payments and stock.
print data.sort(axis=0)

# Return the 10 highest total payments after the first (TOTAL)
sorted_by_payments = data[data[:, 0].argsort()][::-1][1:11]

# Print the names of each of the top 10
for value in sorted_by_payments[:, [0]]:
    for m in data_dict.iteritems():
        if m[1]['total_payments'] == value[0]:
            print m[0]

# Return the 10 highest total stock values after the first (TOTAL)
sorted_by_stock = data[data[:, 1].argsort()][::-1][1:11]
print sorted_by_stock

# Print the names of each of the top 10
for value in sorted_by_stock[:, [1]]:
    for m in data_dict.iteritems():
        if m[1]['total_stock_value'] == value[0]:
            print m[0]

309886585.0
ndarray
434509511.0
None
LAY KENNETH L
FREVERT MARK A
BHATNAGAR SANJAY
LAVORATO JOHN J
SKILLING JEFFREY K
MARTIN AMANDA K
BAXTER JOHN C
BELDEN TIMOTHY N
DELAINEY DAVID W
WHALLEY LAWRENCE G
[[  1.03559793e+08   4.91100780e+07]
 [  1.72525300e+07   3.07660640e+07]
 [  1.54562900e+07   2.60936720e+07]
 [  1.04257570e+07   2.38179300e+07]
 [  8.68271600e+06   2.25425390e+07]
 [  8.40701600e+06   1.51441230e+07]
 [  5.63434300e+06   1.46221850e+07]
 [  5.50163000e+06   1.18847580e+07]
 [  4.74797900e+06   1.06232580e+07]
 [  4.67757400e+06   8.83191300e+06]]
LAY KENNETH L
HIRKO JOSEPH
SKILLING JEFFREY K
PAI LOU L
RICE KENNETH D
WHITE JR THOMAS E
FREVERT MARK A
YEAGER F SCOTT
BAXTER JOHN C
DERRICK JR. JAMES V
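
The name lookups above match array values back to the dictionary. A possible alternative, sketched here on made-up records (the names, values and the `top_names` helper are hypothetical), is to rank straight from the dictionary. Each name then stays paired with its own value and appears only once, even when two people share a value:

```python
# Toy records; names and values are made up for illustration
toy_data = {
    "PERSON A": {"total_payments": 500, "total_stock_value": 10},
    "PERSON B": {"total_payments": "NaN", "total_stock_value": 900},
    "PERSON C": {"total_payments": 200, "total_stock_value": 300},
    "TOTAL": {"total_payments": 700, "total_stock_value": 1210},
}

def top_names(records, feature, n=10, exclude=("TOTAL",)):
    """Return up to n (name, value) pairs with the largest value of feature."""
    ranked = [
        (name, rec[feature])
        for name, rec in records.items()
        if name not in exclude and rec[feature] != "NaN"
    ]
    # Sort on the value (descending), then on the name so ties have a fixed order
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]

top_payments = top_names(toy_data, "total_payments", n=2)
top_stock = top_names(toy_data, "total_stock_value", n=2)
print(top_payments)
print(top_stock)
```

Skipping 'NaN' values rather than treating them as zero is a design choice of this sketch; `featureFormat` converts them to zero instead.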


In [5]:
# Detecting email outliers
data_dict = pickle.load( open("final_project_dataset.pkl", "r") )

# Remove TOTAL; comment out this line to re-run the first figure
data_dict.pop( 'TOTAL', 0 )

# Print and optionally save figure 
features = ["to_messages", "from_messages"]
data = featureFormat(data_dict, features)

for point in data:
    to_messages = point[0]
    from_messages = point[1]
    matplotlib.pyplot.scatter( to_messages, from_messages )

matplotlib.pyplot.xlabel("to_messages")
matplotlib.pyplot.ylabel("from_messages")
# matplotlib.pyplot.savefig('./images/fig_03.png', dpi=300)
matplotlib.pyplot.show()

In [6]:
# Return the top 10 highest to_emails
sorted_by_email = data[data[:, 0].argsort()][::-1][0:10]

# Print the names of each of the top 10
for value in sorted_by_email[:, [0]]:
    for m in data_dict.iteritems():
        if m[1]['to_messages'] == value[0]:
            print m[0]

SHAPIRO RICHARD S
KEAN STEVEN J
KITCHEN LOUISE
BELDEN TIMOTHY N
BECK SALLY W
LAVORATO JOHN J
WHALLEY LAWRENCE G
KAMINSKI WINCENTY J
LAY KENNETH L
HAEDICKE MARK E


In [7]:
# Detecting email outliers to/from POIs

data_dict = pickle.load( open("final_project_dataset.pkl", "r") )

# Remove TOTAL; comment out this line to re-run the first figure
data_dict.pop( 'TOTAL', 0 )

# Print and optionally save figure 
features = ["from_this_person_to_poi", "from_poi_to_this_person"]
data = featureFormat(data_dict, features)

for point in data:
    to_messages = point[0]
    from_messages = point[1]
    matplotlib.pyplot.scatter( to_messages, from_messages )

matplotlib.pyplot.xlabel("to__poi_messages")
matplotlib.pyplot.ylabel("from__poi_messages")
# matplotlib.pyplot.savefig('./images/fig_04.png', dpi=300)
matplotlib.pyplot.show()

In [8]:
# Return the top 10 highest to_poi_emails
sorted_by_poi_email = data[data[:, 0].argsort()][::-1][0:10]

# Print the names of each of the top 10 (people with tied counts are
# printed once per tied row, hence the repeated names in the output)
for value in sorted_by_poi_email[:, [0]]:
    for m in data_dict.iteritems():
        if m[1]['from_this_person_to_poi'] == value[0]:
            print m[0]

DELAINEY DAVID W
LAVORATO JOHN J
KEAN STEVEN J
BECK SALLY W
KITCHEN LOUISE
MCCONNELL MICHAEL S
KITCHEN LOUISE
MCCONNELL MICHAEL S
KAMINSKI WINCENTY J
BELDEN TIMOTHY N
SHANKMAN JEFFREY A
BUY RICHARD B


In [9]:
# Print individuals with no email data (to_messages is 'NaN')

for m in data_dict.iteritems():
    if m[1]['to_messages'] == 'NaN':
        print m[0]

BAXTER JOHN C
ELLIOTT STEVEN
MORDAUNT KRISTINA M
LOWRY CHARLES P
WESTFAHL RICHARD K
WALTERS GARETH W
CHAN RONNIE
BELFER ROBERT
WODRASKA JOHN
URQUHART JOHN A
WHALEY DAVID A
ECHOLS JOHN B
MENDELSOHN JOHN
CLINE KENNETH W
KOPPER MICHAEL J
BERBERIAN DAVID
DETMERING TIMOTHY J
WAKEHAM JOHN
GOLD JOSEPH
DUNCAN JOHN H
LEMAISTRE CHARLES
KISHKILL JOSEPH G
SULLIVAN-SHAKLOVITZ COLLEEN
WROBEL BRUCE
LINDHOLM TOD A
MEYER JEROME J
BUTTS ROBERT H
CUMBERLAND MICHAEL S
GAHN ROBERT S
HERMANN ROBERT J
SCRIMSHAW MATTHEW
GATHMANN WILLIAM D
GILLIS JOHN
BAZELIDES PHILIP J
FASTOW ANDREW S
LOCKHART EUGENE E
OVERDYKE JR JERE C
PEREIRA PAULO V. FERRAZ
STABLER FRANK
BLAKE JR. NORMAN P
PRENTICE JAMES
GRAY RODNEY
THE TRAVEL AGENCY IN THE PARK
NOLES JAMES L
WHITE JR THOMAS E
CHRISTODOULOU DIOMEDES
JAEDICKE ROBERT
WINOKUR JR. HERBERT S
BADUM JAMES P
REYNOLDS LAWRENCE
DIMICHELE RICHARD G
YEAP SOON
YEAGER F SCOTT
HIRKO JOSEPH
PAI LOU L
BAY FRANKLIN R
FUGH JOHN L
SAVAGE FRANK
GRAMM WENDY L


In [10]:
# Print suspected empty records; manually confirm if any fully empty

for m in data_dict.iteritems():
    if m[1]['to_messages'] == 'NaN' and m[1]['total_payments'] == 'NaN':
        print m[0]
        print data_dict[m[0]]
        print "\n"


LOWRY CHARLES P
{'salary': 'NaN', 'to_messages': 'NaN', 'deferral_payments': 'NaN', 'total_payments': 'NaN', 'exercised_stock_options': 372205, 'bonus': 'NaN', 'restricted_stock': 153686, 'shared_receipt_with_poi': 'NaN', 'restricted_stock_deferred': -153686, 'total_stock_value': 372205, 'expenses': 'NaN', 'loan_advances': 'NaN', 'from_messages': 'NaN', 'other': 'NaN', 'from_this_person_to_poi': 'NaN', 'poi': False, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 'NaN', 'email_address': 'NaN', 'from_poi_to_this_person': 'NaN'}


CHAN RONNIE
{'salary': 'NaN', 'to_messages': 'NaN', 'deferral_payments': 'NaN', 'total_payments': 'NaN', 'exercised_stock_options': 'NaN', 'bonus': 'NaN', 'restricted_stock': 32460, 'shared_receipt_with_poi': 'NaN', 'restricted_stock_deferred': -32460, 'total_stock_value': 'NaN', 'expenses': 'NaN', 'loan_advances': 'NaN', 'from_messages': 'NaN', 'other': 'NaN', 'from_this_person_to_poi': 'NaN', 'poi': False, 'director_fees': 98784, 'def

In [11]:
# Remove remaining invalid records: a non-person entry and a record
# with no usable data

data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0)
data_dict.pop('LOCKHART EUGENE E', 0)

print len(data_dict)

143


<h2>Features</h2>

Here I evaluate the features in the data, ensure any that are clearly not of use are removed, and create new features as appropriate.

<h3>Feature evaluation</h3>

In [15]:
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import train_test_split

# Import features for analysis
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 
                 'total_payments','exercised_stock_options', 'bonus', 
                 'restricted_stock', 'shared_receipt_with_poi',
                 'restricted_stock_deferred', 'total_stock_value', 
                 'expenses','loan_advances','from_messages', 'other', 
                 'from_this_person_to_poi','director_fees', 
                 'deferred_income','long_term_incentive', 
                 'from_poi_to_this_person']

data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

# Perform K-best analysis
k_best = SelectKBest(k='all')
k_best.fit(features_train, labels_train)
features_unsorted = zip(features_list[1:], k_best.scores_)
features_sorted = sorted(features_unsorted, key=lambda tup: tup[1],\
                         reverse=True)

print features_sorted

[('bonus', 30.728774633399713), ('salary', 15.858730905995131), ('shared_receipt_with_poi', 10.722570813682712), ('total_stock_value', 10.633852048382538), ('exercised_stock_options', 9.6800414303809852), ('total_payments', 8.9591366476908583), ('deferred_income', 8.7922038527047608), ('restricted_stock', 8.058306312280525), ('long_term_incentive', 7.5551197773202938), ('loan_advances', 7.0379327981934612), ('from_poi_to_this_person', 4.9586666839661424), ('expenses', 4.1807214846470577), ('other', 3.2044591402721507), ('to_messages', 2.6161830046793662), ('director_fees', 1.6410979261701475), ('restricted_stock_deferred', 0.72712410971776964), ('from_messages', 0.43537409865824717), ('from_this_person_to_poi', 0.11120823866694469), ('deferral_payments', 0.0099823995896919059)]


<h3>New features</h3>

In [16]:
from __future__ import division
import pickle
    
# Basic fraction computation function
def computeFraction( poi_messages, all_messages ):
    """ given a number of messages to/from a POI (numerator)
        and the number of all messages to/from a person (denominator),
        return the fraction of that person's messages
        that are to/from a POI; missing ('NaN') or zero counts give 0
    """
    if poi_messages == 'NaN' or all_messages == 'NaN' or all_messages == 0:
        return 0
    else:
        fraction = poi_messages/all_messages
        return fraction
    
# Compute from_poi_proportion and to_poi_proportion using computeFraction function
submit_dict = {}
for name in data_dict:
    data_point = data_dict[name]
    
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    from_poi_proportion = computeFraction( from_poi_to_this_person, to_messages )
    data_point["from_poi_proportion"] = from_poi_proportion

    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    to_poi_proportion = computeFraction( from_this_person_to_poi, from_messages )
    data_point["to_poi_proportion"] = to_poi_proportion
    

# Re-run the K-Best evaluation with new features included

# Import features for analysis
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 
                 'total_payments','exercised_stock_options', 'bonus',
                 'restricted_stock', 'shared_receipt_with_poi',
                 'restricted_stock_deferred', 'total_stock_value',
                 'expenses', 'loan_advances', 'from_messages', 'other',
                 'from_this_person_to_poi','director_fees',
                 'deferred_income', 'long_term_incentive', 
                 'from_poi_to_this_person', 'from_poi_proportion',
                 'to_poi_proportion']

data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

# Perform K-best analysis
k_best = SelectKBest(k='all')
k_best.fit(features_train, labels_train)
features_unsorted = zip(features_list[1:], k_best.scores_)
features_sorted = sorted(features_unsorted, key=lambda tup: tup[1], reverse=True)

print features_sorted

[('bonus', 30.728774633399713), ('salary', 15.858730905995131), ('to_poi_proportion', 15.838094949193755), ('shared_receipt_with_poi', 10.722570813682712), ('total_stock_value', 10.633852048382538), ('exercised_stock_options', 9.6800414303809852), ('total_payments', 8.9591366476908583), ('deferred_income', 8.7922038527047608), ('restricted_stock', 8.058306312280525), ('long_term_incentive', 7.5551197773202938), ('loan_advances', 7.0379327981934612), ('from_poi_to_this_person', 4.9586666839661424), ('expenses', 4.1807214846470577), ('other', 3.2044591402721507), ('to_messages', 2.6161830046793662), ('director_fees', 1.6410979261701475), ('restricted_stock_deferred', 0.72712410971776964), ('from_poi_proportion', 0.51923116954782722), ('from_messages', 0.43537409865824717), ('from_this_person_to_poi', 0.11120823866694469), ('deferral_payments', 0.0099823995896919059)]


<h3>Feature scaling</h3>

In [17]:
from feature_format import featureFormat, targetFeatureSplit
from sklearn.preprocessing import MinMaxScaler

# Import features for analysis
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 
                 'total_payments','exercised_stock_options', 'bonus',
                 'restricted_stock', 'shared_receipt_with_poi',
                 'restricted_stock_deferred', 'total_stock_value',
                 'expenses', 'loan_advances', 'from_messages', 'other',
                 'from_this_person_to_poi','director_fees',
                 'deferred_income', 'long_term_incentive', 
                 'from_poi_to_this_person', 'from_poi_proportion',
                 'to_poi_proportion']

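# Initialize scaler, fit and transform the data with it.
# NB: this rescales `features` from the previous cell; features_train and
# features_test were split before scaling and stay unscaled until
# train_test_split is re-run (as in the feature-removal cell below).
scaler = MinMaxScaler()
features = scaler.fit_transform(features)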
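
Min-max scaling maps each feature to the range 0 to 1 with x' = (x - min) / (max - min). The sketch below shows that formula in plain Python on made-up rows. Note that it is a variation on the cell above rather than a copy of it: it learns the minimum and maximum from the training rows only and reuses them on the test rows, whereas the notebook fits the scaler on all of the data. The helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]

def apply_min_max(rows, mins, maxs):
    """Rescale each column with x' = (x - min) / (max - min)."""
    scaled = []
    for row in rows:
        scaled_row = []
        for x, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has no spread; map it to 0.0
            scaled_row.append((x - lo) / float(span) if span else 0.0)
        scaled.append(scaled_row)
    return scaled

# Made-up rows of [salary, bonus]
toy_train = [[100, 0], [300, 50], [500, 200]]
toy_test = [[200, 100], [600, 0]]

mins, maxs = fit_min_max(toy_train)
train_scaled = apply_min_max(toy_train, mins, maxs)
test_scaled = apply_min_max(toy_test, mins, maxs)
print(train_scaled)
print(test_scaled)
```

With this approach a test value above the training maximum scales to more than 1 (the 600 salary here becomes 1.25), which is expected when the test rows are kept out of the fit.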

<h3>Evaluation metrics</h3>

Here I define a helper that reports the basic evaluation metrics used to assess each classifier and any subsequent fine-tuning.

In [18]:
# Import relevant modules
from collections import Counter
from sklearn.metrics import precision_score, recall_score, f1_score,\
accuracy_score

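# Create function to test a classifier on each evaluation metric
def classifier_evaluation(truth, prediction):
    """ Print the confusion counts, keyed (truth, prediction), followed
        by accuracy, precision, recall and F1. Returns None, so wrapping
        the call in print also prints "None". """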
    confusion_matrix = Counter()
    positives = [1]
    
    truth_split = [i in positives for i in truth]
    prediction_split = [i in positives for i in prediction]
    for x, y in zip(truth_split, prediction_split):
        confusion_matrix[x,y] += 1
   
    print confusion_matrix
    print "Accuracy: ", accuracy_score(truth, prediction)
    print "Precision Score: ", precision_score(truth, prediction)
    print "Recall Score: ", recall_score(truth, prediction)
    print "F1 Score: ", f1_score(truth, prediction)
    

<h2>Classifiers</h2>

Here I select, test, and fine-tune classifiers.

<h3>Initial classifier test</h3>

In [19]:
""" NB: My goal in this section is to try out several
    classifiers and to get a sense of which seem strongest
    based on minimal fine-tuning. I'll select the two
    highest-performing algorithms to refine and improve. """

# Naive Bayes test
from sklearn.naive_bayes import GaussianNB
NB_classifier = GaussianNB()
NB_classifier.fit(features_train, labels_train)
labels_pred = NB_classifier.predict(features_test)
print "---> Naive Bayes classifier:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# Logistic regression test
from sklearn.linear_model import LogisticRegression
lreg_classifier = LogisticRegression(C=1000)
lreg_classifier.fit(features_train, labels_train)
labels_pred = lreg_classifier.predict(features_test)
print "---> Logistic regression:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# SVM test
from sklearn import svm
svm_classifier = svm.SVC(C=1000)
svm_classifier.fit(features_train, labels_train)
labels_pred = svm_classifier.predict(features_test)
print "---> SVM classifier:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

---> Naive Bayes classifier:
Counter({(False, True): 27, (False, False): 11, (True, True): 4, (True, False): 1})
Accuracy:  0.348837209302
Precision Score:  0.129032258065
Recall Score:  0.8
F1 Score:  0.222222222222
None


---> Logistic regression:
Counter({(False, False): 31, (False, True): 7, (True, False): 4, (True, True): 1})
Accuracy:  0.744186046512
Precision Score:  0.125
Recall Score:  0.2
F1 Score:  0.153846153846
None


---> SVM classifier:
Counter({(False, False): 38, (True, False): 5})
Accuracy:  0.883720930233
Precision Score:  0.0
Recall Score:  0.0
F1 Score:  0.0
None
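
Each Counter in the output above is keyed (truth, prediction), so the four metrics can be recomputed from the counts alone. As a worked example, the sketch below takes the Naive Bayes counts printed above and derives the same figures by hand. The zero guards are an assumption of this sketch, chosen to mirror the 0.0 scores reported for the SVM, which predicts no POIs at all.

```python
from collections import Counter

# Confusion counts keyed (truth, prediction), copied from the Naive Bayes output
counts = Counter({(False, True): 27, (False, False): 11,
                  (True, True): 4, (True, False): 1})

tp = counts[(True, True)]    # POIs correctly flagged
fp = counts[(False, True)]   # non-POIs wrongly flagged
fn = counts[(True, False)]   # POIs missed
tn = counts[(False, False)]  # non-POIs correctly left alone

accuracy = (tp + tn) / float(tp + fp + fn + tn)
precision = tp / float(tp + fp) if (tp + fp) else 0.0
recall = tp / float(tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print("Accuracy: %.3f" % accuracy)
print("Precision: %.3f" % precision)
print("Recall: %.3f" % recall)
print("F1: %.3f" % f1)
```

With only 5 POIs among the 43 test records, accuracy rewards predicting "non-POI" for everyone, which is why the SVM scores highest on accuracy while finding no POIs.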






<h3>Tuning: SVM</h3>

In [None]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer

# SVM test: initial
from sklearn import svm
svm_classifier = svm.SVC(C=1000)
svm_classifier.fit(features_train, labels_train)
labels_pred = svm_classifier.predict(features_test)
print "---> Initial SVM classifier:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# SVM tuning: GridSearch
scorer = make_scorer(f1_score)
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'), \
              'C':[1, 10, 100, 500, 1000, 1100, 1200, 1300, 1400, 1500, 1600 ], \
             'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]}
svr = svm.SVC()
svm_tuning = GridSearchCV(svr, parameters, scoring=scorer, verbose=1)
svm_tuning.fit(features_train, labels_train)

print svm_tuning.best_estimator_
# print svm_tuning.best_params_
print "\n"

# SVM test: gridsearch-tuned 
svm_classifier = svm.SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.05, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
svm_classifier.fit(features_train, labels_train)
labels_pred = svm_classifier.predict(features_test)
print "---> Gridsearch-tuned SVM classifier:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# Legacy manually tweaked classifier
# svm_classifier = svm.SVC(C=1600, kernel='rbf', gamma=0.05)
# svm_classifier.fit(features_train, labels_train)
# labels_pred = svm_classifier.predict(features_test)
# print "---> SVM classifier:"
# print classifier_evaluation(labels_test, labels_pred)
# print "\n"

<h3>Tuning: Logistic regression</h3>

In [22]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer

# Logistic regression: initial algorithm
from sklearn.linear_model import LogisticRegression
lreg_classifier = LogisticRegression(C=1000)
lreg_classifier.fit(features_train, labels_train)
labels_pred = lreg_classifier.predict(features_test)
print "---> Logistic regression:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# Logistic regression tuning: GridSearch
scorer = make_scorer(f1_score)
parameters = {"C":[0.05, 0.5, 1, 10, 100,500,1000,10000,100000, 1000000],
                    "tol":[10**-1, 10**-5, 10**-10],
                    "class_weight":['auto']}
lreg_classifier = LogisticRegression()
lreg_tuning = GridSearchCV(lreg_classifier, parameters, scoring=scorer, verbose=1)
lreg_tuning.fit(features_train, labels_train)

print lreg_tuning.best_estimator_
# print lreg_tuning.best_params_
print "\n"

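# Logistic regression: Gridsearch-tuned
# NB: these hard-coded parameters differ from the best estimator printed
# by this run of the grid search (C=1, tol=1e-05)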
lreg_classifier = LogisticRegression(C=0.05, class_weight='auto', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.1, verbose=0)
lreg_classifier.fit(features_train, labels_train)
labels_pred = lreg_classifier.predict(features_test)
print "---> Logistic regression: Gridsearch tuned"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.2s


---> Logistic regression:
Counter({(False, False): 31, (False, True): 7, (True, False): 4, (True, True): 1})
Accuracy:  0.744186046512
Precision Score:  0.125
Recall Score:  0.2
F1 Score:  0.153846153846
None


Fitting 3 folds for each of 30 candidates, totalling 90 fits
LogisticRegression(C=1, class_weight='auto', dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=1e-05,
          verbose=0)


---> Logistic regression: Gridsearch tuned
Counter({(False, True): 20, (False, False): 18, (True, True): 3, (True, False): 2})
Accuracy:  0.488372093023
Precision Score:  0.130434782609
Recall Score:  0.6
F1 Score:  0.214285714286
None




[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:    0.4s finished
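
The "Fitting 3 folds for each of 30 candidates, totalling 90 fits" line above follows directly from the size of the parameter grid: an exhaustive search tries every combination of the listed values and fits each one once per fold. The sketch below reproduces that arithmetic for the two grids used in these tuning cells. `count_fits` is a hypothetical helper, and the three folds are taken from the log output.

```python
from itertools import product

def count_fits(param_grid, n_folds=3):
    """Return (candidates, total fits) for an exhaustive grid search."""
    names = sorted(param_grid)
    candidates = list(product(*(param_grid[name] for name in names)))
    return len(candidates), len(candidates) * n_folds

svm_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": [1, 10, 100, 500, 1000, 1100, 1200, 1300, 1400, 1500, 1600],
    "gamma": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
}
lreg_grid = {
    "C": [0.05, 0.5, 1, 10, 100, 500, 1000, 10000, 100000, 1000000],
    "tol": [10**-1, 10**-5, 10**-10],
    "class_weight": ["auto"],
}

svm_candidates, svm_fits = count_fits(svm_grid)
lreg_candidates, lreg_fits = count_fits(lreg_grid)
print("SVM: %d candidates, %d fits" % (svm_candidates, svm_fits))
print("Logistic regression: %d candidates, %d fits" % (lreg_candidates, lreg_fits))
```

The cost grows multiplicatively: each value added to one parameter adds a full set of combinations of all the others, so the SVM grid is roughly ten times the size of the logistic regression grid.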


<h3>Naive Bayes recap</h3>

In [26]:
# Naive Bayes: initial algorithm
from sklearn.naive_bayes import GaussianNB
NB_classifier = GaussianNB()
NB_classifier.fit(features_train, labels_train)
labels_pred = NB_classifier.predict(features_test)
print "---> Naive Bayes classifier (Gaussian):"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# Multinomial Naive Bayes test
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(features_train, labels_train)
labels_pred = NB_classifier.predict(features_test)
print "---> Naive Bayes classifier (Multinomial):"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

---> Naive Bayes classifier (Gaussian):
Counter({(False, False): 35, (False, True): 3, (True, False): 3, (True, True): 2})
Accuracy:  0.860465116279
Precision Score:  0.4
Recall Score:  0.4
F1 Score:  0.4
None


---> Naive Bayes classifier (Multinomial):
Counter({(False, False): 38, (True, False): 5})
Accuracy:  0.883720930233
Precision Score:  0.0
Recall Score:  0.0
F1 Score:  0.0
None




<h3>Additional feature removal</h3>

Here I revisit feature selection from earlier and explore whether removing the lowest-scoring features impairs or improves the performance of the algorithms.

In [31]:
# Here my objective is to make a number of additional tweaks to
# try and improve the scores we're getting here. 

# Feature removal, to leave the 13 features scoring above 4 by K-best

features_list = ['poi', 'salary','total_payments',
	'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi',
	'total_stock_value', 'loan_advances', 'expenses','from_poi_to_this_person',
	'deferred_income', 'long_term_incentive', 'to_poi_proportion']

#  Initialize scaler, fit and transform the data with it
data = featureFormat(data_dict, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)

scaler = MinMaxScaler()
features = scaler.fit_transform(features)

features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

### Run tests

# SVM tuning: GridSearch
scorer = make_scorer(f1_score)
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'), \
              'C':[1, 10, 100, 500, 1000, 1100, 1200, 1300, 1400, 1500, 1600 ], \
             'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]}
svr = svm.SVC()
svm_tuning = GridSearchCV(svr, parameters, scoring=scorer, verbose=1)
svm_tuning.fit(features_train, labels_train)

print svm_tuning.best_estimator_
# print svm_tuning.best_params_
print "\n"

# SVM test: gridsearch-tuned 
from sklearn import svm
svm_classifier = svm.SVC(C=1200, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel='sigmoid', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
svm_classifier.fit(features_train, labels_train)
labels_pred = svm_classifier.predict(features_test)
print "---> Gridsearch-tuned SVM classifier:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# Logistic regression tuning: GridSearch
scorer = make_scorer(f1_score)
parameters = {"C":[0.05, 0.5, 1, 10, 10**2,10**5,10**10, 10**20],
                    "tol":[10**-1, 10**-5, 10**-10],
                    "class_weight":['auto']}
lreg_classifier = LogisticRegression()
lreg_tuning = GridSearchCV(lreg_classifier, parameters, scoring=scorer, verbose=1)
lreg_tuning.fit(features_train, labels_train)

print lreg_tuning.best_estimator_
# print lreg_tuning.best_params_
print "\n"

# Logistic regression: Gridsearch-tuned
lreg_classifier = LogisticRegression(C=0.05, class_weight='auto', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.1, verbose=0)
lreg_classifier.fit(features_train, labels_train)
labels_pred = lreg_classifier.predict(features_test)
print "---> Logistic regression: Gridsearch tuned"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

# Naive Bayes: initial Gaussian algorithm (no additional tuning used)
from sklearn.naive_bayes import GaussianNB
NB_classifier = GaussianNB()
NB_classifier.fit(features_train, labels_train)
labels_pred = NB_classifier.predict(features_test)
print "---> Naive Bayes classifier:"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

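# Naive Bayes multinomial tuning: GridSearch
# NB: fit_prior is listed here as the strings 'True' and 'False'; both
# are truthy, so the grid does not actually try fit_prior=False
scorer = make_scorer(f1_score)
parameters = {"alpha":[0.01, 0.1, 0.2, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 10.0],
                    "fit_prior":['True', 'False']}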
mnb_classifier = MultinomialNB()
mnb_tuning = GridSearchCV(mnb_classifier, parameters, scoring=scorer, verbose=1)
mnb_tuning.fit(features_train, labels_train)

print mnb_tuning.best_estimator_
# print mnb_tuning.best_params_
print "\n"

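# Naive Bayes multinomial: Gridsearch-tuned
# NB: these hard-coded values differ from the best estimator printed by
# this run of the grid search (alpha=0.1, fit_prior='True')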
mnb_classifier = MultinomialNB(alpha=1.5, fit_prior=False)
mnb_classifier.fit(features_train, labels_train)
labels_pred = mnb_classifier.predict(features_test)
print "---> Naive Bayes multinomial: Gridsearch tuned"
print classifier_evaluation(labels_test, labels_pred)
print "\n"

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:    0.3s
[Parallel(n_jobs=1)]: Done 450 jobs       | elapsed:    1.2s
[Parallel(n_jobs=1)]: Done 800 jobs       | elapsed:    2.7s
[Parallel(n_jobs=1)]: Done 924 out of 924 | elapsed:    3.3s finished


Fitting 3 folds for each of 308 candidates, totalling 924 fits
SVC(C=1200, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel='sigmoid', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.1s





---> Gridsearch-tuned SVM classifier:
Counter({(False, False): 36, (True, False): 4, (False, True): 2, (True, True): 1})
Accuracy:  0.860465116279
Precision Score:  0.333333333333
Recall Score:  0.2
F1 Score:  0.25
None


Fitting 3 folds for each of 24 candidates, totalling 72 fits
LogisticRegression(C=0.05, class_weight='auto', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.1, verbose=0)

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:    0.1s





---> Logistic regression: Gridsearch tuned
Counter({(False, False): 29, (False, True): 9, (True, True): 4, (True, False): 1})
Accuracy:  0.767441860465
Precision Score:  0.307692307692
Recall Score:  0.8
F1 Score:  0.444444444444
None


---> Naive Bayes classifier:
Counter({(False, False): 35, (False, True): 3, (True, False): 3, (True, True): 2})
Accuracy:  0.860465116279
Precision Score:  0.4
Recall Score:  0.4
F1 Score:  0.4
None


Fitting 3 folds for each of 24 candidates, totalling 72 fits
MultinomialNB(alpha=0.1, class_prior=None, fit_prior='True')


---> Naive Bayes multinomial: Gridsearch tuned
Counter({(False, False): 36, (True, True): 3, (False, True): 2, (True, False): 2})
Accuracy:  0.906976744186
Precision Score:  0.6
Recall Score:  0.6
F1 Score:  0.6
None




[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.1s finished
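
One detail of the multinomial grid above is worth a short illustration: `fit_prior` is listed as the strings 'True' and 'False' rather than as booleans. Assuming the estimator tests the flag by truthiness (for example with `if fit_prior:`), both strings behave as true, because any non-empty string is truthy in Python:

```python
# Strings are truthy whenever they are non-empty, whatever they spell
string_options = ['True', 'False']
boolean_options = [True, False]

string_as_bool = [bool(option) for option in string_options]
boolean_as_bool = [bool(option) for option in boolean_options]

print(string_as_bool)   # prints [True, True]
print(boolean_as_bool)  # prints [True, False]
```

Listing the booleans `True` and `False` in the grid would let the search compare the two settings; the results recorded above come from the string version.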


<h3>Validation</h3>

In [32]:
from sklearn.cross_validation import StratifiedKFold

def validation_test(clf, n_folds=3):
    """ Print precision and recall for clf, averaged over stratified
        folds of the notebook-level `features` and `labels` """
    stratkf = StratifiedKFold(labels, n_folds=n_folds)
    recall_list = []
    precision_list = []

    for train_idx, test_idx in stratkf:
        # Build the train and test sets for this fold from the indices
        features_train = [features[i] for i in train_idx]
        labels_train   = [labels[i] for i in train_idx]
        features_test  = [features[i] for i in test_idx]
        labels_test    = [labels[i] for i in test_idx]

        # Fit and predict labels with classifier of choice
        clf.fit(features_train, labels_train)
        labels_pred = clf.predict(features_test)

        # Track scores
        precision_list.append(precision_score(labels_test, labels_pred))
        recall_list.append(recall_score(labels_test, labels_pred))

    # Calculate and print average precision and recall across folds
    print "precision: ", sum(precision_list) / float(n_folds)
    print "recall: ", sum(recall_list) / float(n_folds)
    
print "Naive Bayes (Gaussian) validation test --->"
validation_test(NB_classifier)
print "\n"

print "Naive Bayes (multinomial) validation test --->"
validation_test(mnb_classifier)
print "\n"

print "Logistic regression validation test --->"
validation_test(lreg_classifier)
print "\n"

Naive Bayes (Gaussian) validation test --->
precision:  0.31746031746
recall:  0.388888888889


Naive Bayes (multinomial) validation test --->
precision:  0.483333333333
recall:  0.333333333333


Logistic regression validation test --->
precision:  0.317841880342
recall:  0.666666666667
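
Stratification matters here because POIs are rare: a plain split could leave a fold with no POIs at all, making precision and recall meaningless for that fold. The sketch below shows the idea on made-up labels by dealing each class's indices round-robin across the folds. It is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm, so its fold membership will differ; `stratified_folds` is a hypothetical helper.

```python
def stratified_folds(labels, n_folds=3):
    """Split indices into n_folds test folds, spreading each class evenly."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        # Deal this class's indices round-robin across the folds
        for position, index in enumerate(by_class[label]):
            folds[position % n_folds].append(index)

    splits = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        splits.append((train_idx, sorted(test_idx)))
    return splits

# Made-up labels: 3 POIs (1) among 9 records
toy_labels = [0, 0, 1, 0, 0, 1, 0, 0, 1]
splits = stratified_folds(toy_labels)
poi_per_fold = [sum(toy_labels[i] for i in test_idx) for _, test_idx in splits]
print(poi_per_fold)  # prints [1, 1, 1]
```

Every record lands in exactly one test fold, and each fold holds one of the three POIs, so each fold's scores are computed on the same class balance as the full data.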


