# Mini-Project - Datasets and Questions

##### Student Tags

Author: Anderson Hitoshi Uyekita    
Mini-Project: Datasets and Questions  
Course: Data Science - Foundations II  
COD: ND111  
Date: 17/01/2019    

***

## Table of Contents
- [Introduction](#intro)
- [Given code 1](#code1)
- [Part 1](#part_1)
- [Part 2](#part_2)
- [Given code 2](#code2)



***

## Given Code 1 <a id='code1'></a>

In [1]:
#!/usr/bin/python

""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).

    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""

import pickle

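# Open in binary mode ("rb"); pickle cannot read a text-mode file under Python 3.
with open("../final_project/final_project_dataset.pkl", "rb") as dataset_file:
    enron_data = pickle.load(dataset_file)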

## Importing Libraries

In [2]:
# Importing
import pandas as pd
import numpy as np

In [3]:
# Converting to a DataFrame (after transposing, one row per person).
df_enron = pd.DataFrame(data = enron_data).transpose()

# Printing the first 5 rows.
df_enron.head()

| | bonus | deferral_payments | deferred_income | director_fees | email_address | exercised_stock_options | expenses | from_messages | from_poi_to_this_person | from_this_person_to_poi | ... | long_term_incentive | other | poi | restricted_stock | restricted_stock_deferred | salary | shared_receipt_with_poi | to_messages | total_payments | total_stock_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALLEN PHILLIP K | 4175000.0 | 2869717.0 | -3081055.0 | NaN | phillip.allen@enron.com | 1729541.0 | 13868 | 2195.0 | 47.0 | 65.0 | ... | 304805.0 | 152.0 | False | 126027.0 | -126027.0 | 201955.0 | 1407.0 | 2902.0 | 4484442 | 1729541 |
| BADUM JAMES P | NaN | 178980.0 | NaN | NaN | NaN | 257817.0 | 3486 | NaN | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | 182466 | 257817 |
| BANNANTINE JAMES M | NaN | NaN | -5104.0 | NaN | james.bannantine@enron.com | 4046157.0 | 56301 | 29.0 | 39.0 | 0.0 | ... | NaN | 864523.0 | False | 1757552.0 | -560222.0 | 477.0 | 465.0 | 566.0 | 916197 | 5243487 |
| BAXTER JOHN C | 1200000.0 | 1295738.0 | -1386055.0 | NaN | NaN | 6680544.0 | 11200 | NaN | NaN | NaN | ... | 1586055.0 | 2660303.0 | False | 3942714.0 | NaN | 267102.0 | NaN | NaN | 5634343 | 10623258 |
| BAY FRANKLIN R | 400000.0 | 260455.0 | -201641.0 | NaN | frank.bay@enron.com | NaN | 129142 | NaN | NaN | NaN | ... | NaN | 69.0 | False | 145796.0 | -82782.0 | 239671.0 | NaN | NaN | 827696 | 63014 |


## Data Wrangling

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load Python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
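
The dict-of-dicts layout and the pickle round trip can be illustrated with a small sketch. The records below are a toy stand-in, not the real file: only Skilling's bonus comes from the starter-code docstring, and the "DOE JANE" entry is hypothetical.

```python
import pickle

# Toy stand-in for the real dataset (assumption: the real file holds many more
# people and features; "DOE JANE" is a hypothetical entry).
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000},
    "DOE JANE": {"bonus": "NaN"},
}

# Serialize to bytes and load back, mirroring pickle.dump / pickle.load on a file.
payload = pickle.dumps(toy_data)
restored = pickle.loads(payload)

# The outer key selects the person, the inner key selects the feature.
print(restored["SKILLING JEFFREY K"]["bonus"])
```

The restored object is an ordinary dictionary, so everything that follows (counting keys, looking up features) works on it directly.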

>How many data points (people) are in the dataset?

In [4]:
print "Persons:", df_enron.shape[0]

Persons: 146


>For each person, how many features are available?

In [5]:
print "Number of Variables/Features:", df_enron.shape[1]

Number of Variables/Features: 21


The “poi” feature records whether the person is a person of interest, according to our definition.

>How many POIs are there in the E+F dataset?

In [6]:
# True is a person of interest.
sum(df_enron.poi)

18
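
The three counts so far (people, features, POIs) can also be taken straight from the dictionary, without pandas. This is a minimal sketch on hypothetical toy records; the names and values are made up, and only the layout matches the dataset.

```python
# Hypothetical toy records in the same dict-of-dicts layout as enron_data.
toy_data = {
    "PERSON A": {"salary": 100, "poi": True},
    "PERSON B": {"salary": "NaN", "poi": False},
    "PERSON C": {"salary": 300, "poi": False},
}

# One outer key per person.
n_people = len(toy_data)

# Every person carries the same feature names, so inspecting one is enough.
n_features = len(next(iter(toy_data.values())))

# Count the records whose "poi" flag is True.
n_poi = sum(1 for person in toy_data.values() if person["poi"])

print("people: {}, features: {}, POIs: {}".format(n_people, n_features, n_poi))
```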

We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).

#### Loading the poi_names.txt

In [7]:
# Loading the file
df_names = pd.read_csv('../final_project/poi_names.txt', sep = "\t")

In [8]:
# Defining a better column name.
df_names.columns = ['poi']

>How many POI’s were there total? (Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails.)

In [9]:
# Number of POIs listed in the names file.
df_names.shape[0]

35

As you can see, we have many of the POIs in our E+F dataset, but not all of them.

>Why is that a potential problem?

We will return to this later to explain how a POI could end up not being in the Enron E+F dataset, so you fully understand the issue before moving on.

Like any dict of dicts, individual people/features can be accessed like so:

`enron_data["LASTNAME FIRSTNAME"]["feature_name"]`

or, sometimes,

`enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]`
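
Because the key must match exactly, including any middle initial, it can help to search the keys before indexing. The sketch below assumes a hypothetical helper, `find_keys`, that is not part of the course code; the toy records reuse Skilling's bonus from the starter-code docstring and one person from the `head()` output.

```python
# Toy records; the real dataset has many more people and features.
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000},
    "BADUM JAMES P": {"bonus": "NaN"},
}

def find_keys(data, last_name):
    """Return every key that starts with the given last name (hypothetical helper)."""
    prefix = last_name.upper() + " "
    return sorted(key for key in data if key.startswith(prefix))

# Look up the exact key first, then index with it.
matches = find_keys(toy_data, "Skilling")
print(matches)
print(toy_data[matches[0]]["bonus"])
```

An empty list from `find_keys` means the person is not in the dictionary under that last name, which avoids a `KeyError` from guessing the key.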

>What is the total value of the stock belonging to James Prentice?

In [10]:
df_enron.loc["PRENTICE JAMES", "total_stock_value"]

1095040

>How many email messages do we have from Wesley Colwell to persons of interest?

In [11]:
df_enron.loc["COLWELL WESLEY", "from_this_person_to_poi"]

11

>What’s the value of stock options exercised by Jeffrey K Skilling?

In [18]:
df_enron.loc["SKILLING JEFFREY K", "exercised_stock_options"]

19250000

In the coming lessons, we’ll talk about how the best features are often motivated by our human understanding of the problem at hand. In this case, that means knowing a little about the story of the Enron fraud.

If you have an hour and a half to spare, “Enron: The Smartest Guys in the Room” is a documentary that gives an amazing overview of the story. Alternatively, there are plenty of archival newspaper stories that chronicle the rise and fall of Enron.

**Which of these schemes was Enron not involved in?**

- [ ] selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
- [ ] causing electrical grid failures in California
- [x] illegally obtaining a government report that enabled them to corner the market on frozen concentrated orange juice futures
- [x] conspiring to give a Saudi prince expedited American citizenship
- [ ] a plan in collaboration with Blockbuster movies to stream movies over the internet

***

>Who was the CEO of Enron during most of the time that fraud was being perpetrated?

Jeffrey Skilling
***

>Who was chairman of the Enron board of directors?

Ken Lay
***

>Who was CFO (chief financial officer) of Enron during most of the time that fraud was going on?

Andrew Fastow
***

Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)?

>How much money did that person get?

In [27]:
df_enron.loc[["SKILLING JEFFREY K","LAY KENNETH L","FASTOW ANDREW S"]].total_payments

SKILLING JEFFREY K      8682716
LAY KENNETH L         103559793
FASTOW ANDREW S         2424083
Name: total_payments, dtype: object
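
Picking the largest of the three values can be done programmatically rather than by eye. A minimal sketch, using the three `total_payments` figures copied from the output above:

```python
# total_payments values copied from the output above.
total_payments = {
    "SKILLING JEFFREY K": 8682716,
    "LAY KENNETH L": 103559793,
    "FASTOW ANDREW S": 2424083,
}

# max() with key=dict.get returns the key whose value is largest.
top_earner = max(total_payments, key=total_payments.get)

print("{}: {}".format(top_earner, total_payments[top_earner]))
```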

For nearly every person in the dataset, not every feature has a value.

>**How is it denoted when a feature doesn’t have a well-defined value?**

The string "NaN" (a text value, not a floating-point NaN).
***

>How many folks in this dataset have a quantified salary? What about a known email address?

In [47]:
# Number of people whose salary is not the string "NaN".
(df_enron.salary != "NaN").sum()

95

In [48]:
# Number of people whose email address is not the string "NaN".
(df_enron.email_address != "NaN").sum()

111
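
The same kind of count can be sketched on the plain dictionary. The records below are the salary and email address of the five people shown by `df_enron.head()` earlier, with missing entries written as the string "NaN" as in the dataset; `count_known` is a hypothetical helper name.

```python
# Salary and email_address of the five people from the head() output;
# missing values are the string "NaN", not a float.
sample = {
    "ALLEN PHILLIP K": {"salary": 201955, "email_address": "phillip.allen@enron.com"},
    "BADUM JAMES P": {"salary": "NaN", "email_address": "NaN"},
    "BANNANTINE JAMES M": {"salary": 477, "email_address": "james.bannantine@enron.com"},
    "BAXTER JOHN C": {"salary": 267102, "email_address": "NaN"},
    "BAY FRANKLIN R": {"salary": 239671, "email_address": "frank.bay@enron.com"},
}

def count_known(data, feature):
    """Count the people whose value for `feature` is not the string "NaN"."""
    return sum(1 for person in data.values() if person[feature] != "NaN")

print(count_known(sample, "salary"))
print(count_known(sample, "email_address"))
```

The comparison is against the literal string because that is how this dataset marks a missing value; a check for floating-point NaN would not flag these entries.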

## Optional

A Python dictionary can’t be read directly into an sklearn classification or regression algorithm. Instead, sklearn needs a numpy array or a list of lists, where each inner list is one data point and its elements are the features of that point.

We’ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.

In the case when a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero).
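
The conversion described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the course helper: `to_rows` and `split_target` are hypothetical names, the toy records are made up, and the sketch returns lists rather than a numpy array and never drops all-zero rows.

```python
def to_rows(data, feature_names):
    """One list of floats per person, with the string "NaN" replaced by 0.0."""
    rows = []
    for name in sorted(data):  # sorted only to make the row order predictable
        row = []
        for feature in feature_names:
            value = data[name][feature]
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows

def split_target(rows):
    """Treat the first column as the label and the rest as features."""
    targets = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return targets, features

# Hypothetical toy records in the dataset's layout.
toy_data = {
    "PERSON A": {"poi": True, "salary": 100, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 50},
}

# The label ("poi") must come first so that split_target can peel it off.
rows = to_rows(toy_data, ["poi", "salary", "bonus"])
labels, features = split_target(rows)
print(rows)
```

Note that after this step a missing value and a genuine zero look the same, which is worth keeping in mind for the questions about "NaN" further down.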

## Given Code 2 <a id='code2'></a>

In [49]:
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

In [57]:
#!/usr/bin/python

""" 
    A general tool for converting data from the
    dictionary format to an (n x k) python list that's 
    ready for training an sklearn algorithm

    n--no. of key-value pairs in dictionary
    k--no. of features being extracted

    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person

    In addition to converting a dictionary to a numpy 
    array, you may want to separate the labels from the
    features--this is what targetFeatureSplit is for

    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:

    feature_list = ["poi", "salary", "bonus"] 
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)

    the line above (targetFeatureSplit) assumes that the
    label is the _first_ item in feature_list--very important
    that poi is listed first!
"""

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """


    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print "error: key ", feature, " not present"
                return
            value = dictionary[key][feature]
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features

>How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [84]:
# Percentage of NaN in total_payments
100*float(sum(df_enron.total_payments == 'NaN'))/df_enron.shape[0]

14.383561643835616

>How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POI’s as a whole is this?

In [93]:
# Among the POIs, what percentage has "NaN" for total_payments?
100*float(sum(df_enron[df_enron.poi].total_payments == 'NaN'))/df_enron[df_enron.poi].shape[0]

0.0

>If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a “NaN” value with POIs or non-POIs?

Non-POIs: no POI in the dataset has “NaN” for total_payments, so the value appears only among non-POIs.
***

If you added in, say, 10 more data points which were all POI’s, and put “NaN” for the total payments for those folks, the numbers you just calculated would change.

>**What is the new number of people of the dataset? What is the new number of folks with “NaN” for total payments?**

In [100]:
len(df_enron.index) + 10

156

In [103]:
sum(df_enron.total_payments == "NaN") + 10

31

>What is the new number of POI’s in the dataset? What is the new number of POI’s with NaN for total_payments?

In [107]:
sum(df_enron.poi) + 10

28

>Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI?

Yes
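
The shift can be made concrete with a little arithmetic on the counts from the cells above. In this sketch the 21 people with "NaN" before the change is derived from the 31 computed above minus the 10 added points, and `share` is a hypothetical helper name.

```python
# Counts taken from the cells above.
n_people = 146
n_nan = 31 - 10      # people with "NaN" total_payments before the change
n_poi = 18
n_poi_nan = 0        # no POI has "NaN" total_payments in the original data

added = 10           # hypothetical new POIs, all with "NaN" total_payments

def share(part, whole):
    """Percentage of `whole` made up by `part`."""
    return 100.0 * part / whole

# Among people with "NaN" total_payments, what share are POIs?
before = share(n_poi_nan, n_nan)
after = share(n_poi_nan + added, n_nan + added)

# Among POIs, what share has "NaN" total_payments?
poi_before = share(n_poi_nan, n_poi)
poi_after = share(n_poi_nan + added, n_poi + added)

print("POI share among NaN rows: {:.1f}% -> {:.1f}%".format(before, after))
print("NaN share among POIs:     {:.1f}% -> {:.1f}%".format(poi_before, poi_after))
```

Before the change, "NaN" never coincides with a POI; afterwards roughly a third of the "NaN" rows are POIs, so a classifier could pick up the missing value itself as a signal.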