# Identifying Fraud at Enron

Enron Corporation was an American energy, commodities, and services company based in Houston, Texas. Before its bankruptcy in 2001, Enron employed approximately 20,000 staff and was one of the world's major electricity, natural gas, communications, and pulp and paper companies, with claimed revenues of nearly $101 billion during 2000.

The Enron scandal, revealed in October 2001, involved unethical practices and the exploitation of accounting limitations to misrepresent earnings and alter the balance sheet so that it indicated favorable performance. It led to the bankruptcy of the Enron Corporation. Most of these practices were perpetrated with the direct involvement or indirect knowledge of the CFO, the CEO, and a few other executives.

#### Enron corpus :
The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.
In this project, we analyse the latest available Enron dataset (courtesy of Udacity and https://www.cs.cmu.edu/~./enron/ ) to identify the persons of interest (POIs) among around 150 former employees of Enron Corporation.

### Project Goal : 
The aim of this project is to design a supervised machine learning model that can classify an employee as a POI or non-POI, using the financial and email features provided in the Enron dataset as inputs.

### Data Analysis :
The data from the final_project_dataset pickle file is loaded into a pandas DataFrame (enron_df) for easy analysis.


In [3]:
# About the data
import sys
import pickle
import pandas as pd
import numpy as np
sys.path.append("tools/")

with open("final_project_dataset.pkl", "rb") as data_file:  # pickle files should be opened in binary mode
    data_dict = pickle.load(data_file)
enron_df=pd.DataFrame.from_dict(data_dict, orient='index')
print enron_df.columns, '\n' ,'\n Total rowsXcolumns', enron_df.shape

Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments',
       u'exercised_stock_options', u'bonus', u'restricted_stock',
       u'shared_receipt_with_poi', u'restricted_stock_deferred',
       u'total_stock_value', u'expenses', u'loan_advances', u'from_messages',
       u'other', u'from_this_person_to_poi', u'poi', u'director_fees',
       u'deferred_income', u'long_term_incentive', u'email_address',
       u'from_poi_to_this_person'],
      dtype='object') 

 Total rowsXcolumns (146, 21)


In [4]:
print 'Total number of POI=',len(enron_df[enron_df.poi==True])
print 'Total number of non-POI=',len(enron_df[enron_df.poi==False])

Total number of POI= 18
Total number of non-POI= 128


Clearly this is an unbalanced dataset: 18 of the 146 rows are POIs, a ratio of roughly 12 to 88. As there are financial attributes, let us look at the 10 most highly paid employees.


In [5]:
# Convert the 'NaN' placeholder strings to real NaN values before sorting
enron_df=enron_df.replace('NaN',np.nan)
enron_df=enron_df.sort_values(by=['total_payments'],ascending=False)
enron_df[enron_df.total_payments.notnull()].head(10)


employee,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
TOTAL,26704229.0,,32083396.0,309886585.0,311764000.0,97343619.0,130322299.0,,-7576788.0,434509511.0,...,83925000.0,,42667589.0,,False,1398517.0,-27992891.0,48521928.0,,
LAY KENNETH L,1072321.0,4273.0,202911.0,103559793.0,34348384.0,7000000.0,14761694.0,2411.0,,49110078.0,...,81525000.0,36.0,10359729.0,16.0,True,,-300000.0,3600000.0,kenneth.lay@enron.com,123.0
FREVERT MARK A,1060932.0,3275.0,6426990.0,17252530.0,10433518.0,2000000.0,4188667.0,2979.0,,14622185.0,...,2000000.0,21.0,7427621.0,6.0,False,,-3367011.0,1617011.0,mark.frevert@enron.com,242.0
BHATNAGAR SANJAY,,523.0,,15456290.0,2604490.0,,-2604490.0,463.0,15456290.0,,...,,29.0,137864.0,1.0,False,137864.0,,,sanjay.bhatnagar@enron.com,0.0
LAVORATO JOHN J,339288.0,7259.0,,10425757.0,4158995.0,8000000.0,1008149.0,3962.0,,5167144.0,...,,2585.0,1552.0,411.0,False,,,2035380.0,john.lavorato@enron.com,528.0
SKILLING JEFFREY K,1111258.0,3627.0,,8682716.0,19250000.0,5600000.0,6843672.0,2042.0,,26093672.0,...,,108.0,22122.0,30.0,True,,,1920000.0,jeff.skilling@enron.com,88.0
MARTIN AMANDA K,349487.0,1522.0,85430.0,8407016.0,2070306.0,,,477.0,,2070306.0,...,,230.0,2818454.0,0.0,False,,,5145434.0,a..martin@enron.com,8.0
BAXTER JOHN C,267102.0,,1295738.0,5634343.0,6680544.0,1200000.0,3942714.0,,,10623258.0,...,,,2660303.0,,False,,-1386055.0,1586055.0,,
BELDEN TIMOTHY N,213999.0,7991.0,2144013.0,5501630.0,953136.0,5249999.0,157569.0,5521.0,,1110705.0,...,,484.0,210698.0,108.0,True,,-2334434.0,,tim.belden@enron.com,228.0
DELAINEY DAVID W,365163.0,3093.0,,4747979.0,2291113.0,3000000.0,1323148.0,2097.0,,3614261.0,...,,3069.0,1661.0,609.0,True,,,1294981.0,david.delainey@enron.com,66.0


The top 10 contains 4 POIs, 5 non-POIs and the row 'TOTAL', which aggregates all the other rows and is therefore an outlier.

In [6]:
enron_df[enron_df.drop('poi',axis=1).isnull().all(axis=1)]


employee,salary,to_messages,deferral_payments,total_payments,exercised_stock_options,bonus,restricted_stock,shared_receipt_with_poi,restricted_stock_deferred,total_stock_value,...,loan_advances,from_messages,other,from_this_person_to_poi,poi,director_fees,deferred_income,long_term_incentive,email_address,from_poi_to_this_person
LOCKHART EUGENE E,,,,,,,,,,,...,,,,,False,,,,,


There is one row that has no values populated (except for poi). As it belongs to the majority class (non-POI) and carries no information, I will remove it.

From the initial dataset, I will remove the outlier 'TOTAL' and the row for LOCKHART EUGENE E. As 'email_address' does not help in the prediction, I will drop that column too. Using the helper function featureFormatPandas, I will also replace NaN/np.inf with 0 and drop any row that has all NaN / 0 values.


In [7]:
enron_df=enron_df.drop('TOTAL',axis=0)
# drop rows with fewer than 2 populated values (i.e. rows where only 'poi' is set)
enron_df=enron_df.dropna(thresh=2)
enron_df=enron_df.replace('NaN',np.nan)

from feature_format import featureFormatPandas
enron_df=featureFormatPandas(enron_df,remove_all_zeroes=True,replace_NaN=True)

The following new features were tried:

1. total_money_value, total_poi_interaction - totals of the monetary and email attributes, on the assumption that a total carries more signal for identifying a POI than each individual attribute.
2. mails_to_poi_ratio, mails_from_poi_ratio - the share of an individual's sent mails that went to POIs, and the share of received mails that came from POIs. These ratios may be higher for a POI than for others, so the attributes were added to explore that idea.

Although none of them had a large impact, the ratios and totals did help to improve the scores.

I also explored the ratios of deferred income to total payments and of deferred stock to total stock, to see whether POIs (who were aware of the inevitable) held minimal deferred stock or cash. These never appeared among the top 10 or 15 features, so I removed them from the final analysis.


In [8]:
enron_df['total_money_value'] = enron_df['total_payments'] + enron_df['total_stock_value'] 
enron_df['total_poi_interaction'] = enron_df['shared_receipt_with_poi'] + \
                                    enron_df['from_this_person_to_poi'] + \
                                    enron_df['from_poi_to_this_person'] 
enron_df['mails_to_poi_ratio']=enron_df['from_this_person_to_poi'].div(enron_df['from_messages'])
enron_df['mails_from_poi_ratio']=enron_df['from_poi_to_this_person'].div(enron_df['to_messages'])
enron_df=featureFormatPandas(enron_df,remove_all_zeroes=True,replace_NaN=True)
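
The ratio features divide by a message count that can be 0 once missing values have been replaced, which is why the cell above runs featureFormatPandas a second time. As a minimal sketch of the same idea in plain Python 3 (the helper `safe_ratio` is hypothetical and not part of the project code), using the values of LAY KENNETH L from the table above:

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    if denominator == 0:
        return 0.0
    return numerator / denominator

# LAY KENNETH L: 16 of his 36 sent mails went to POIs,
# and 123 of the 4273 mails he received came from POIs.
mails_to_poi_ratio = safe_ratio(16.0, 36.0)
mails_from_poi_ratio = safe_ratio(123.0, 4273.0)

# A row without any email data (all counts replaced by 0) gets a ratio of 0.
no_email_ratio = safe_ratio(0.0, 0.0)

print("to POI: %.4f, from POI: %.4f, no email data: %.1f"
      % (mails_to_poi_ratio, mails_from_poi_ratio, no_email_ratio))
```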

The model was built in the following phases:

1. Feature scaling
2. Feature selection
3. Dimensionality reduction using PCA
4. Classification, for which various classifiers were evaluated
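
The four phases can be chained in a single scikit-learn Pipeline and tuned with GridSearchCV. The sketch below is illustrative only and rests on several assumptions: it uses synthetic data from make_classification instead of the Enron features, a single LogisticRegression in place of the final classifier, only 20 splits, and the newer sklearn.model_selection module (Python 3) rather than sklearn.cross_validation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the Enron features (assumption)
X, y = make_classification(n_samples=144, n_features=20, n_informative=5,
                           weights=[0.875, 0.125], random_state=42)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),                        # 1. feature scaling
    ("selection", SelectKBest(score_func=f_classif)),  # 2. feature selection
    ("pca", PCA(n_components=2, whiten=True)),         # 3. dimensionality reduction
    ("classifier", LogisticRegression(class_weight="balanced")),  # 4. classification
])

# Tune the number of selected features, scoring each candidate by F1
param_grid = {"selection__k": [5, 10, 15]}
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
```

Putting the scaler, selector and PCA inside the pipeline means they are refitted on the training part of every split, so no information from the held-out samples leaks into the preprocessing.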

##### Feature Scaling :

As the features are on very different scales (email counts in the hundreds to thousands, financial features in the millions), it is essential to apply feature scaling before passing the data to a classifier. In this analysis, I used MinMaxScaler.
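
Min-max scaling maps each feature to the range [0, 1] using (x - min) / (max - min). A minimal sketch of that arithmetic in plain Python 3 follows; the function `min_max_scale` is a hypothetical helper for illustration, and the sample values are the salary and from_messages entries of four rows from the table above.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# LAY, FREVERT, LAVORATO, SKILLING
salary = [1072321.0, 1060932.0, 339288.0, 1111258.0]
from_messages = [36.0, 21.0, 2585.0, 108.0]

scaled_salary = min_max_scale(salary)
scaled_from_messages = min_max_scale(from_messages)

# After scaling, both features live on the same 0..1 range
print([round(v, 3) for v in scaled_salary])
print([round(v, 3) for v in scaled_from_messages])
```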

##### Feature Selection : 

It is essential to keep only the relevant and important features and eliminate the others, both to reduce the negative impact of irrelevant features and to reduce training and testing time.

In the analysis, SelectKBest was used with k values varying from 5 to 22 in GridSearchCV. From the initial features, 15 were chosen, leaving out email_address, deferral_payments and restricted_stock_deferred. Below are the accuracy, precision, recall and F1 scores achieved as the new features were added.
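
SelectKBest ranks features by a univariate score and keeps the k highest; the final pipeline output in this report shows f_classif, the one-way ANOVA F-statistic, as the score function. The sketch below computes that statistic by hand in plain Python 3 on made-up toy data (the feature names and values are hypothetical) to show why a feature that separates the classes is ranked first.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total = len(values)
    grand_mean = sum(values) / n_total
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    # Variation of the class means around the overall mean
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    # Variation of the samples around their own class mean
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

labels = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "feature_a": [9, 8, 10, 1, 2, 1, 3, 2],  # clearly higher for class 1
    "feature_b": [5, 1, 4, 2, 5, 3, 1, 4],   # unrelated to the class
}

scores = {name: anova_f_score(vals, labels) for name, vals in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
top_k = ranked[:1]  # what SelectKBest(k=1) would keep on this toy data
print(scores, top_k)
```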




In [9]:

feature_sel_df=pd.DataFrame(columns=['Accuracy','Precision','Recall','f1score'] ,index=['no_new_features','total_poi_interaction','Mails_to_poi_ratio'
                                                                            ,'mails_from_poi_ratio','all_new_and_old_features'
                                                                           ])

feature_sel_df.loc['no_new_features']=pd.Series({'Accuracy':0.72627,'Precision':0.30303,'Recall':0.81000,'f1score':0.44106})
feature_sel_df.loc['total_poi_interaction']=pd.Series({'Accuracy':0.73480,'Precision':0.31255,'Recall':0.82450,'f1score':0.45327 })
feature_sel_df.loc['Mails_to_poi_ratio']=pd.Series({'Accuracy':0.73360 ,'Precision':0.31381,'Recall':0.84100,'f1score':0.45707})
feature_sel_df.loc['mails_from_poi_ratio']=pd.Series({'Accuracy':0.72620,'Precision':0.30341,'Recall':0.81300,'f1score':0.44191})
feature_sel_df.loc['all_new_and_old_features']=pd.Series({'Accuracy':0.73907,'Precision':0.32232,'Recall':0.86800,'f1score':0.47008})

feature_sel_df


| Feature set | Accuracy | Precision | Recall | f1score |
|---|---|---|---|---|
| no_new_features | 0.72627 | 0.30303 | 0.81 | 0.44106 |
| total_poi_interaction | 0.7348 | 0.31255 | 0.8245 | 0.45327 |
| Mails_to_poi_ratio | 0.7336 | 0.31381 | 0.841 | 0.45707 |
| mails_from_poi_ratio | 0.7262 | 0.30341 | 0.813 | 0.44191 |
| all_new_and_old_features | 0.73907 | 0.32232 | 0.868 | 0.47008 |


Although the improvements are minimal, there were no negative impacts, so I included the new features in the final analysis.

##### Dimensionality Reduction : 

PCA was used for dimensionality reduction. In the final pipeline it reduces the selected features to 2 components (n_components=2).

##### Algorithms and Metrics used :

The following types of classifiers were tried:

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV,SGDClassifier
from sklearn.ensemble import VotingClassifier,RandomForestClassifier,ExtraTreesClassifier,AdaBoostClassifier
from sklearn.svm import SVC,LinearSVC

All the classifiers were evaluated on their precision, recall and F-score. Accuracy cannot be trusted in this scenario: as this is an unbalanced dataset, a classifier can score a high accuracy simply by predicting almost everything as non-POI.
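
To make the point about accuracy concrete, the short Python 3 calculation below uses the class counts reported earlier (18 POIs, 128 non-POIs) and scores a trivial "classifier" that labels everyone as non-POI.

```python
n_poi = 18
n_non_poi = 128
n_total = n_poi + n_non_poi

# Predicting non-POI for every employee gets all non-POIs right
# and every POI wrong.
baseline_accuracy = n_non_poi / n_total
baseline_recall = 0 / n_poi  # no POI is ever identified

print("accuracy = %.4f, POI recall = %.1f" % (baseline_accuracy, baseline_recall))
```

This useless baseline reaches an accuracy of about 0.877, higher than any accuracy in the score tables of this report, while finding no POI at all.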

Precision and recall were the performance metrics used to evaluate the model. They are defined as follows:

Precision is the ratio of true positives (tp) to all predicted positives (tp + fp). It indicates how often the model was correct when it classified an input as positive. In our context, low precision means that we are classifying a lot of non-POIs as POIs.

Recall is the ratio of true positives to all actual positives (tp + fn). It indicates how many of the positive inputs the model recognised as positive. In our context, low recall means that we failed to identify many of the POIs.

The F1 score is the harmonic mean of precision and recall and indicates the balance between them. This balance is essential: an extremely low precision implies that we are marking a lot of non-POIs as POIs, while an extremely low recall means that we are failing to identify POIs.
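
These definitions translate directly into code. The Python 3 sketch below uses hypothetical helper names and made-up confusion counts for the first example; the second call checks the formula against the precision and recall printed in the final output of this report.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts (not from the project): 8 POIs found, 12 false alarms, 2 POIs missed
precision, recall = precision_recall(tp=8, fp=12, fn=2)
f1 = f1_score(precision, recall)
print("precision=%.2f recall=%.2f f1=%.4f" % (precision, recall, f1))

# Precision and recall reported for the final model
final_f1 = f1_score(0.32077, 0.86350)
print("final F1 = %.5f" % final_f1)
```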

##### Algorithm tuning by tuning Parameters (hyperparameters) : 

*Hyperparameters of an algorithm control the flexibility or freedom of the model, for example the depth or the number of leaves in decision trees, or C (the penalty for misclassification) and the kernel width in SVMs. Through them, the tendency of the model to adapt too closely to the training data (overfitting) can be controlled.*

All algorithms provide such parameters, which can be tuned to optimize the algorithm. This is almost the final, yet very essential, step before concluding the model development.

The parameters below were the main ones considered for tuning.

1. C - the regularization constant of SVM and linear-model-based classifiers. A high C value makes the model complex and prone to overfitting. The default value of C=1 did not give satisfactory performance.
2. class_weight - this being an unbalanced dataset, class_weight played a crucial role in tuning the algorithms. I opted for 'balanced', as this uses the inverse of the class frequencies to calculate the class weights.
3. tol - the training termination (tolerance) parameter. I had to modify its value for a few algorithms to get better scores.
4. selection (k) - the number of features to select in the feature selection algorithm. I tried small numbers of features but could not get better scores without using almost all the features. A balance between precision and recall occurred for a k value of 20.
5. kernel - this was tuned in the context of SVM (linear, rbf). Given the small number of samples, rbf performed well.
6. pca whiten - to reduce the effects of correlated features, I used whiten=True in PCA.
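
For the class_weight item, scikit-learn documents the 'balanced' heuristic as n_samples / (n_classes * count_of_class). A small Python 3 sketch of that formula, applied to the class counts of the raw dataset (18 POIs, 128 non-POIs), is shown below; the function name is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

weights = balanced_class_weights({"poi": 18, "non_poi": 128})
print(weights)
```

Each POI sample ends up weighted roughly seven times as heavily as a non-POI sample, so misclassifying a POI costs the classifier more during training.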

Besides these, I explored different base estimators with AdaBoostClassifier and was convinced that SGDClassifier as the base estimator gave better numbers.

##### GridSearchCV : 

GridSearchCV was used throughout, both to tune parameters and to evaluate the estimators. I have not included all the explored values in the final submission.

#####  Cross Validation :

Using the same data to train and test gives an over-optimistic estimate and hides overfitting, so the model may fail when exposed to unseen data. It is therefore essential to have separate training and test data. However, a single train/test split can give misleading results if there is a lot of variance between the two sets. A common solution to this problem is cross-validation. It works by splitting the dataset into k parts, each of which is called a fold. The algorithm is trained on k-1 folds and tested on the fold that was held back. This is repeated so that each fold serves once as the test set, and the performance measures are averaged.
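
The fold mechanics can be sketched in a few lines of plain Python 3. The generator below is a hypothetical helper that only produces index lists (no shuffling, no stratification), which is enough to show that every sample is tested exactly once.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
    print("train:", train_idx, "test:", test_idx)
```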


In this model, we used StratifiedShuffleSplit from sklearn.cross_validation, which keeps the POI / non-POI proportion in every split. It was run for 1000 iterations, with 10 percent of the data held out for testing in each iteration, and passed to GridSearchCV for use in evaluation.
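
As a rough illustration of what a stratified shuffle split does, the Python 3 sketch below shuffles each class separately and holds out about 10 percent of each. It is a simplified stand-in, not scikit-learn's implementation (whose rounding rules differ), and it assumes 144 rows remain after cleaning: 18 POIs and 126 non-POIs.

```python
import random

def stratified_shuffle_split(labels, n_splits, test_fraction, seed=0):
    """Yield (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Hold out the same fraction of every class (at least one sample)
            n_test = max(1, int(round(len(shuffled) * test_fraction)))
            test_idx.extend(shuffled[:n_test])
            train_idx.extend(shuffled[n_test:])
        yield sorted(train_idx), sorted(test_idx)

labels = [1] * 18 + [0] * 126  # assumed class counts after cleaning
splits = list(stratified_shuffle_split(labels, n_splits=5, test_fraction=0.1))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Under these assumptions each split tests 15 samples, which is consistent with the 15000 total predictions reported for 1000 splits in the final output.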

In [9]:
scores_df=pd.DataFrame(columns=['Accuracy','Precision','Recall','f1score'],
                       index=['VotingClassifier','SVC','LogisticRegression',
                              'LinearSVC','AdaBoostClassifier',
                              'DecisionTreeClassifier'])

scores_df.loc['VotingClassifier']=pd.Series({'Accuracy':0.73967,'Precision':0.32186,'Recall':0.86050,'f1score':0.46849})
scores_df.loc['SVC']=pd.Series({'Accuracy':0.72947,'Precision':0.31250,'Recall':0.85750,'f1score':0.45807})
scores_df.loc['LogisticRegression']=pd.Series({'Accuracy':0.69347,'Precision':0.29028,'Recall':0.89900,'f1score':0.43886})
scores_df.loc['LinearSVC']=pd.Series({'Accuracy':0.69347,'Precision':0.29028,'Recall':0.89900,'f1score':0.43886})
scores_df.loc['AdaBoostClassifier']=pd.Series({'Accuracy':0.77993,'Precision':0.32649,'Recall':0.61200,'f1score':0.42581})
scores_df.loc['DecisionTreeClassifier']=pd.Series({'Accuracy':0.62887,'Precision':0.23745,'Recall':0.8065,'f1score':0.36688})

print " Below is the table of scores for some classifiers "
print " Note : Coincidentally LinearSVC and LogisticRegression gave same numbers"
scores_df

 Below is the table of scores for some classifiers 
 Note : Coincidentally LinearSVC and LogisticRegression gave same numbers


| Classifier | Accuracy | Precision | Recall | f1score |
|---|---|---|---|---|
| VotingClassifier | 0.73967 | 0.32186 | 0.8605 | 0.46849 |
| SVC | 0.72947 | 0.3125 | 0.8575 | 0.45807 |
| LogisticRegression | 0.69347 | 0.29028 | 0.899 | 0.43886 |
| LinearSVC | 0.69347 | 0.29028 | 0.899 | 0.43886 |
| AdaBoostClassifier | 0.77993 | 0.32649 | 0.612 | 0.42581 |
| DecisionTreeClassifier | 0.62887 | 0.23745 | 0.8065 | 0.36688 |


#### Final Algorithm  chosen : 

The final model is the VotingClassifier from sklearn.ensemble, which combines conceptually different machine learning classifiers and uses a majority vote (hard vote) or the average predicted probabilities (soft vote) to predict the class labels. I passed the classifiers below as estimators to the VotingClassifier.

1. LogisticRegression
2. SVC
3. AdaBoostClassifier with SGDClassifier as base estimator
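
The final pipeline output shows voting='hard' with equal weights, meaning each of the three estimators casts one vote per sample and the majority label wins. A minimal Python 3 sketch of that rule follows; the helper name and the prediction lists are made up for illustration.

```python
from collections import Counter

def hard_vote(predictions_per_classifier):
    """Return the majority label for each sample across all classifiers."""
    n_samples = len(predictions_per_classifier[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_classifier)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Hypothetical predictions (1 = POI, 0 = non-POI) for four employees
lr_pred = [1, 0, 1, 0]
svc_pred = [1, 1, 0, 0]
ada_pred = [0, 1, 1, 0]

final_pred = hard_vote([lr_pred, svc_pred, ada_pred])
print(final_pred)
```

With three voters and two classes a tie cannot occur, which is one practical reason to combine an odd number of estimators.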

##### Final Features  included : 

'salary', 'to_messages', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'loan_advances', 'other', 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'total_poi_interaction', 'mails_to_poi_ratio', 'mails_from_poi_ratio'


In [10]:
run -t poi_id.py

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    9.8s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   32.6s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   51.2s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 6000 out of 6000 | elapsed:  4.1min finished


Fitting 1000 folds for each of 6 candidates, totalling 6000 fits
best features selected :
['salary', 'to_messages', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'total_stock_value', 'expenses', 'loan_advances', 'other', 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'total_poi_interaction', 'mails_to_poi_ratio', 'mails_from_poi_ratio']
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('selection', SelectKBest(k=19, score_func=<function f_classif at 0x00000000096B0588>)), ('pca', PCA(copy=True, n_components=2, whiten=True)), ('classifier', VotingClassifier(estimators=[('lr', LogisticRegression(C=1e-05, class_weight...ning_rate=1.0, n_estimators=1000, random_state=None))],
         voting='hard', weights=[1, 1, 1]))])
	Accuracy: 0.73800	Precision: 0.32077	Recall: 0.86350	F1: 0.46777	F2: 0.64517
	Total predictions: 15000	True positives: 1727	F



#### References :
1. Intro to Machine Learning, Udacity course
2. scikit-learn documentation
3. A few topics on the MachineLearningMastery blog by Jason Brownlee: http://machinelearningmastery.com/
4. Python for Data Analysis, book by Wes McKinney
5. Sebastian Raschka's blog and book (on Safari Books)
6. Many Stack Exchange and Stack Overflow discussions
7. Wikipedia, for a few definitions