Enron Dataset project:
Aim:
To build a machine learning classifier that identifies 'persons of interest' (POIs) in the Enron dataset. POIs are people who were indicted, reached a settlement or plea deal with the government, or testified in exchange for immunity from prosecution.

Resources: (copied from Udacity project details)
You should have python and sklearn running on your computer, as well as the starter code (both python scripts and the Enron dataset) that you downloaded as part of the first mini-project in the Intro to Machine Learning course. The starter code can be found in the final_project directory of the codebase that you downloaded for use with the mini-projects. Some relevant files: 

poi_id.py : starter code for the POI identifier, you will write your analysis here 

final_project_dataset.pkl : the dataset for the project, more details below 

tester.py : when you turn in your analysis for evaluation by a Udacity evaluator, you will submit the algorithm, dataset and list of features that you use (these are created automatically in poi_id.py). The evaluator will then use this code to test your result, to make sure we see performance that’s similar to what you report. You don’t need to do anything with this code, but we provide it for transparency and for your reference. 

emails_by_address : this directory contains many text files, each of which contains all the messages to or from a particular email address. It is for your reference, if you want to create more advanced features based on the details of the emails dataset.

Data:
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into public record, including tens of thousands of emails and detailed financial data for top executives. This is the data I will use in the project.

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
The features in the data fall into three major types, namely financial features, email features and POI labels.

financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)
email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally numbers of email messages; the notable exception is 'email_address', which is a text string)
POI label: ['poi'] (boolean, represented as integer)
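
As a rough illustration of the layout described above, the sketch below builds a tiny stand-in for the dataset dictionary and counts missing values per feature. The two records, their names and their values are invented for illustration (they are not rows from the Enron data), and count_missing is a hypothetical helper. The only things taken from the project are a few feature names and the convention, handled later in this notebook, that missing entries are stored as the string 'NaN'.

```python
# Hypothetical two-person stand-in for data_dict; names and values are made up.
toy_data = {
    "DOE JANE": {"poi": True, "salary": 250000, "bonus": "NaN",
                 "email_address": "jane.doe@example.com", "to_messages": 1200},
    "ROE RICHARD": {"poi": False, "salary": "NaN", "bonus": "NaN",
                    "email_address": "NaN", "to_messages": 300},
}

def count_missing(data):
    """Return {feature: number of records whose value is the string 'NaN'}."""
    counts = {}
    for record in data.values():
        for feature, value in record.items():
            counts.setdefault(feature, 0)
            if value == "NaN":
                counts[feature] += 1
    return counts

missing = count_missing(toy_data)
# The POI label is a boolean that can be used directly as an integer (True -> 1).
n_poi = sum(int(record["poi"]) for record in toy_data.values())
```

A per-feature count like this is one way to see how sparse each feature is before deciding whether to keep it.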

Data source: https://www.cs.cmu.edu/~./enron/

Udacity project rubric:
https://review.udacity.com/#!/rubrics/27/view
- This is the rubric I will model my project on. Some aspects may be covered in less depth than others, as I have prioritised what I see as most important for producing the best overall project I can.

#Please see my draft for all working and data exploration

In [70]:
import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
### Load the dictionary containing the dataset
with open('final_project_dataset.pkl', "rb") as data_file:  # pickles should be read in binary mode
    data_dict = pickle.load(data_file)
    
#move data into dataframe
import pandas as pd
import numpy as np
data_dframe = pd.DataFrame.from_dict(data_dict, orient='index')
data_dframe = data_dframe.replace('NaN', np.nan)
df = data_dframe

### Task 2: Remove outliers/fix data
# Overwrite the financial fields of two records whose values needed correcting
# (see the draft for the working).
df.loc['BELFER ROBERT', 'total_payments'] = 3285
df.loc['BELFER ROBERT', 'deferral_payments'] = 0
df.loc['BELFER ROBERT', 'restricted_stock'] = 44093
df.loc['BELFER ROBERT', 'restricted_stock_deferred'] = -44093
df.loc['BELFER ROBERT', 'total_stock_value'] = 0
df.loc['BELFER ROBERT', 'director_fees'] = 102500
df.loc['BELFER ROBERT', 'deferred_income'] = -102500
df.loc['BELFER ROBERT', 'exercised_stock_options'] = 0
df.loc['BELFER ROBERT', 'expenses'] = 3285
df.loc['BHATNAGAR SANJAY', 'expenses'] = 137864
df.loc['BHATNAGAR SANJAY', 'total_payments'] = 137864
df.loc['BHATNAGAR SANJAY', 'exercised_stock_options'] = 1.54563e+07
df.loc['BHATNAGAR SANJAY', 'restricted_stock'] = 2.60449e+06
df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'] = -2.60449e+06
df.loc['BHATNAGAR SANJAY', 'other'] = 0
df.loc['BHATNAGAR SANJAY', 'director_fees'] = 0
df.loc['BHATNAGAR SANJAY', 'total_stock_value'] = 1.54563e+07

# Drop the two rows that are not individual people.
df = df.drop(['TOTAL', 'THE TRAVEL AGENCY IN THE PARK'])
# Drop the columns that are not used as features.
df = df.drop(['loan_advances', 'restricted_stock_deferred', 'director_fees', 'email_address'], axis=1)

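### Task 3: Create new feature
# Proportion features describing how much of a person's email traffic involves POIs.
# NOTE: each of the first two ratios divides by the opposite mailbox total from its
# numerator (sent-to-POI count over total received, and received-from-POI count over
# total sent). The reported results below were produced with these definitions.
df['from_poi_prop'] = df['from_this_person_to_poi'] / df['to_messages']
df['to_poi_prop'] = df['from_poi_to_this_person'] / df['from_messages']
# Share of all of a person's messages (sent + received) that involve a POI.
df['poi_interaction'] = ((df['from_this_person_to_poi'] + df['from_poi_to_this_person'])
                         / (df['to_messages'] + df['from_messages']))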
# Create a new log-scaled dataframe.
features_to_test = [  # 'poi' is left out here because it does not need to be log scaled
                    'salary',
                    'poi_interaction',
                     'to_messages',
                     'deferral_payments',
                     'total_payments',
                     'bonus',
                     'from_poi_prop',
                     'total_stock_value',
                     'shared_receipt_with_poi',
                     'from_poi_to_this_person',
                     'exercised_stock_options',
                     'from_messages',
                     'other',
                     'from_this_person_to_poi',
                     'to_poi_prop',
                     'deferred_income',
                     'expenses',
                     'restricted_stock',
                     'long_term_incentive'
                   ]

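log_df = df.copy()  # copy so the log scaling does not overwrite df
for f in features_to_test:
    # +1 keeps zero values at zero, since log10(0 + 1) = 0.
    # Amounts below -1 (e.g. the negative deferred_income values) come out as NaN
    # and are filled by the imputer further down.
    log_df[f] = np.log10(df[f] + 1)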
    
features_to_test.insert(0, 'poi')

#reorder columns (so poi is first) to work with the targetFeatureSplit function.
log_df = log_df[['poi',
                 'salary',
                 'poi_interaction',
                 'to_messages',
                 'deferral_payments',
                 'total_payments',
                 'exercised_stock_options',
                 'bonus',
                 'restricted_stock',
                 'shared_receipt_with_poi',
                 'total_stock_value',
                 'expenses',
                 'from_messages',
                 'other',
                 'from_this_person_to_poi',
                 'deferred_income',
                 'long_term_incentive',
                 'from_poi_to_this_person',
                 'from_poi_prop',
                 'to_poi_prop']]


# NaN values have to be removed before any min-max scaling; here they are replaced
# with the column mean. Mean imputation may affect the results.
from sklearn import preprocessing

nan_remover = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
log_df = nan_remover.fit_transform(log_df)

# log_df is now a NumPy array with 'poi' in the first column, so it can skip
# featureFormat and go straight to targetFeatureSplit.

def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features

labels, features = targetFeatureSplit(log_df)

#scale features using MinMaxscaler!
mmscaler = preprocessing.MinMaxScaler()
features = mmscaler.fit_transform(features)
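
The log scaling step above uses log10(x + 1), which maps 0 to 0 but is undefined for x < -1; NumPy returns NaN for such inputs, and those NaNs are then filled by the mean imputer. Negative amounts do occur in this data (the deferred_income correction above is -102500). The sketch below shows one possible alternative, which is not what this notebook does: a sign-preserving transform, here called signed_log10 (a hypothetical helper), that keeps the magnitude of negative values instead of discarding it.

```python
import math

def signed_log10(x):
    """Sign-preserving log scale: sign(x) * log10(|x| + 1)."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return float("nan")  # leave missing values for the imputer
    return math.copysign(math.log10(abs(x) + 1), x)

# Zero stays at zero, positives behave like log10(x + 1),
# and a negative amount keeps its sign and order of magnitude.
examples = [0, 9, 99999, -102500]
scaled = [signed_log10(v) for v in examples]
```

On a DataFrame column the same idea could be written as np.sign(s) * np.log10(np.abs(s) + 1). Whether it helps the classifier here is untested; it only avoids turning every negative amount into an imputed mean.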


In [71]:
#MACHINE LEARNING TIME!!
# Only the imports used in this cell are kept.
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from time import time

##############################################################################################

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=40)
target_names = ['NonPoi', 'Poi']

##############################################################################################
# Logistic regression on the data, using parameters found with GridSearchCV



t = time()
pca = PCA(n_components=10)
logreg = linear_model.LogisticRegression(C=32, class_weight='balanced', random_state=40)
pipe = Pipeline(steps=[('PCA', pca),('LOG', logreg)])

pipe.fit(features_train, labels_train)
pred = pipe.predict(features_test)
print "training time:", round(time()-t, 3), "s"
print classification_report(labels_test, pred, target_names=target_names)
print 'accuracy=', accuracy_score(labels_test, pred)


training time: 0.004 s
             precision    recall  f1-score   support

     NonPoi       0.94      0.73      0.82        41
        Poi       0.31      0.71      0.43         7

avg / total       0.85      0.73      0.77        48

accuracy= 0.729166666667
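
The report above does not print the confusion matrix, but its rounded figures are consistent with one set of counts for the 48 test samples: 5 POIs found, 2 POIs missed, 11 non-POIs flagged and 30 non-POIs cleared. These counts are inferred from the report rather than taken from the run, so treat them as a reconstruction. The sketch below recomputes the metrics from them; prf is a hypothetical helper name.

```python
# Counts inferred from the rounded classification report (not printed by the run).
tp, fn = 5, 2    # POIs predicted as POI / as non-POI       (support 7)
fp, tn = 11, 30  # non-POIs predicted as POI / as non-POI   (support 41)

def prf(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for one class."""
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

poi = prf(tp, fp, fn)      # POI treated as the positive class
non_poi = prf(tn, fn, fp)  # non-POI treated as the positive class
accuracy = (tp + tn) / float(tp + tn + fp + fn)

print("Poi    precision=%.2f recall=%.2f f1=%.2f" % poi)
print("NonPoi precision=%.2f recall=%.2f f1=%.2f" % non_poi)
print("accuracy=%.4f" % accuracy)
```

Read this way, a POI precision of 5/16 means roughly two of every three people flagged are false alarms, while a recall of 5/7 means most test-set POIs were caught. Accuracy alone hides this trade-off: with 41 of the 48 test samples being non-POIs, a classifier that always predicts NonPoi would score 41/48, about 0.85.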


In [72]:
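# Feature names in the same column order as log_df ('poi' first).
features_list = [
                'poi',
                'salary',
                'poi_interaction',
                 'to_messages',
                 'deferral_payments',
                 'total_payments',
                 'exercised_stock_options',
                 'bonus',
                 'restricted_stock',
                 'shared_receipt_with_poi',
                 'total_stock_value',
                 'expenses',
                 'from_messages',
                 'other',
                 'from_this_person_to_poi',
                 'deferred_income',
                 'long_term_incentive',
                 'from_poi_to_this_person',
                 'from_poi_prop',
                 'to_poi_prop'
                   ]
# log_df is a NumPy array after imputation, while tester.py reads the dataset as a
# dict of {person: {feature: value}}, so rebuild that layout before dumping.
my_dataset = dict((name, dict(zip(features_list, row)))
                  for name, row in zip(df.index, log_df))
dump_classifier_and_data(pipe, my_dataset, features_list)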
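
Before handing the three objects to the tester, it can help to check that the dataset and the feature list agree. The sketch below is a hypothetical pre-submission check, assuming the dataset is a dict that maps each person to a dict of feature values (the layout of the original data_dict). check_dataset and the toy records are illustrative names and values, not part of the starter code.

```python
def check_dataset(dataset, features_list):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    # The starter code requires 'poi' to be the first feature.
    if not features_list or features_list[0] != "poi":
        problems.append("first feature must be 'poi'")
    for name, record in dataset.items():
        for feature in features_list:
            if feature not in record:
                problems.append("%s is missing %s" % (name, feature))
    return problems

# Toy records for illustration only.
toy_dataset = {
    "DOE JANE": {"poi": 1.0, "salary": 5.4, "bonus": 0.0},
    "ROE RICHARD": {"poi": 0.0, "salary": 4.9},
}
print(check_dataset(toy_dataset, ["poi", "salary", "bonus"]))
```

With the toy records, the check reports that the second person has no 'bonus' entry; on the real dataset an empty list would be the desired outcome.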