# Wide and Deep Networks

Nick Chao

### Preparation (40 points total)
[10 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). 

[10 points] Identify groups of features in your data that should be combined into cross-product features. Provide justification for why these features should be crossed (or why some features should not be crossed). 

[10 points] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.

[10 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Convince me that your cross validation method is a realistic mirroring of how an algorithm would be used in practice. 

### Modeling (50 points total)
[20 points] Create several combined wide and deep networks to classify your data using Keras. Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations. Try to use the "history" return parameter that is part of Keras "fit" function.

[20 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two different number of layers. Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab. 

[10 points] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP) using the receiver operating characteristic and area under the curve.  Use proper statistical method to compare the performance of different models.  

### Exceptional Work (10 points total)
You have free rein to provide additional analyses.
One idea (required for 7000 level students): Capture the embedding weights from the deep network and perform t-SNE clustering on the output of these embedding layers. That is, pass the observations into the network, save the embedded weights (called embeddings), and then perform clustering of these output embeddings. Visualize and explain the results.

## Preparation

For this lab I use the same dataset as in the last two labs: the 2009 American Community Survey. The goal is to determine whether a person has an income greater than 100,000 dollars based on attributes such as sex, age, race, and place of work. The dataset starts with over 3 million entries and almost 300 attributes, which are narrowed down to about 1.3 million entries and 13 attributes.


|Attribute|Description|Type|Example|
|:---:|:---:|:---:|:---:|
| CIT | Citizenship Status | Int | 1. Citizen, 0. Non-citizen |
| AGEP | Age | Int | 23 |
| COW | Class of Worker | Float | 3. Local Government, 4. State Government |
| ENG | Ability to speak English  | Int | 1. Speaks English, 0. Doesn't Speak English |
| MAR | Marital Status | Int | 1. Married, 2. Widowed |
| MIL | Military Service | Int | 1. Yes, 0. No |
| SCHL | Educational Attainment  | Float | 21 Bachelor's Degree, 22 Master's Degree |
| SEX | Sex      | Int | True. Male |
| DIS | Disability | Int | True. Disabled |
| PINCP | Total Person's Income | Float | |
| POWSP | Place of work | Float | 048 Texas, 049 Utah |
| RAC1P | Detailed Race Code | Int | 1 White, 6 Asian |
| FOD1P | Field of Degree | Float | 2407 Computer Engineering, 2408 Electrical Engineering |

The remaining features are kept because they are attributes likely to be related to a person's income.

In [11]:
# Import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# from sklearn.linear_model import LogisticRegression
# from sklearn.utils.estimator_checks import check_estimator
# from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

from sklearn.model_selection import train_test_split
# from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import KFold, ShuffleSplit
# from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

from sklearn.metrics import accuracy_score, f1_score
# from sklearn.metrics import make_scorer
# from sklearn.metrics import precision_score, recall_score


import warnings
warnings.filterwarnings('ignore')

In [2]:
%time dataA = pd.read_csv('../data/ss09pusa.csv')
%time dataB = pd.read_csv('../data/ss09pusb.csv')
merged = pd.concat([dataA,dataB])

Wall time: 27.3 s
Wall time: 26 s


In [3]:
cols_to_save = ['CIT','AGEP','COW','ENG','MAR','MIL','SCHL','SEX','DIS','PINCP','POWSP','RAC1P','FOD1P']
new_data = merged.filter(items=cols_to_save)
new_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3030728 entries, 0 to 1466654
Data columns (total 13 columns):
CIT      int64
AGEP     int64
COW      float64
ENG      float64
MAR      int64
MIL      float64
SCHL     float64
SEX      int64
DIS      int64
PINCP    float64
POWSP    float64
RAC1P    int64
FOD1P    float64
dtypes: float64(7), int64(6)
memory usage: 323.7 MB


In [4]:
# Change citizenship to bool.
# Codes 1-4 are citizens (True) and code 5 is not a citizen (False).
# The codes start at 1, so they are listed explicitly.
new_data['CIT'] = new_data['CIT'].replace(to_replace=[1, 2, 3, 4, 5],
                                          value=[1, 1, 1, 1, 0])
new_data['CIT'] = new_data['CIT'].astype('bool')


# Change Ability to Speak English to boolean
# b is N/A; it is reasonable to assume these respondents speak English
new_data['ENG'] = new_data['ENG'].fillna(1)
# 1-2 speaks English well or very well, 3-4 speaks English not well or not at all.
new_data['ENG'] = new_data['ENG'].replace(to_replace=[1, 2, 3, 4],
                                          value=[1, 1, 0, 0])
new_data['ENG'] = new_data['ENG'].astype('bool')


# Change Military Status to Boolean
# b is N/A because less than 17 years old, so treat it as 0 (no service)
new_data['MIL'] = new_data['MIL'].fillna(0)
# 1-3 Yes, 4-5 No; the filled-in 0 stays 0
new_data['MIL'] = new_data['MIL'].replace(to_replace=[1, 2, 3, 4, 5],
                                          value=[1, 1, 1, 0, 0])
new_data['MIL'] = new_data['MIL'].astype('bool')


# Change Sex to bool
# 1 is male, 2 is female. Changing 2 to 0 for boolean conversion
new_data['SEX'] = new_data['SEX'].replace(to_replace=[1, 2], value=[1, 0])
new_data['SEX'] = new_data['SEX'].astype('bool')


# Change DIS to bool
# 1 is disabled, 2 is no disability. Changing 2 to 0 for boolean conversion
new_data['DIS'] = new_data['DIS'].replace(to_replace=[1, 2], value=[1, 0])
new_data['DIS'] = new_data['DIS'].astype('bool')


# Change Educational Attainment to int
# bb is N/A for less than 3 years old.
new_data['SCHL']=new_data['SCHL'].fillna(0)
# For this classification lets simplify some of these education levels.
# 0 between No schooling and Grade 8
# 1 between Grade 9 and Grade 12 no diploma
# 2 for High School degree or GED
# 3 Some college to Associate's degree
# 4 Bachelor's Degree
# 5 Master's Degree
# 6 Professional degree or Doctorate
new_data.SCHL.replace(to_replace = range(25),
                    value=[0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,2,2,3,3,3,4,5,6,6],
                    inplace=True)
new_data['SCHL'] = new_data['SCHL'].astype('int')


In [5]:
# delete younger than 18
new_data = new_data[new_data.AGEP >= 18]
#new_data

# Field of Study  -> 0
# Class of Worker -> Remove if Null
# Place of Work   -> Remove if Null

new_data['FOD1P'] = new_data['FOD1P'].fillna(0)
new_data = new_data[pd.notnull(new_data['COW'])]
new_data = new_data[pd.notnull(new_data['POWSP'])]

# Convert the Floats to Ints
# COW, POWSP, FOD1P, PINCP

new_data['COW'] = new_data['COW'].astype('int')
new_data['POWSP'] = new_data['POWSP'].astype('int')
new_data['FOD1P'] = new_data['FOD1P'].astype('int')
new_data['PINCP'] = new_data['PINCP'].astype('int')

future_data = new_data.copy(deep=True)  # independent copy, saved before income is binarized

In [6]:
# Binarize income: 1 if PINCP is above 99,999, otherwise 0
new_data['PINCP'] = (new_data['PINCP'] > 99999).astype('int32')


In [7]:
# Let's see how the income classes have split...
print('Number of people in each class:')
for value in new_data.PINCP.unique(): 
    print(str(value)+': ' +str(len(new_data[new_data['PINCP'] == value])))

Number of people in each class:
0: 1209410
1: 128869


In [8]:
# Finally, let's rename some of these columns so they make more sense.
new_data.rename(columns={'CIT': 'Citizenship','AGEP': 'Age','COW': 'Class of Work','ENG': 'Speaks English','MAR': 'Martial Status','MIL': 'Military Status','SCHL': 'Education Level','SEX': 'Male','DIS': 'Disabled?','PINCP': 'Income','POWSP': 'Place of Work','RAC1P': 'Race','FOD1P': 'Field of Study'}, inplace=True)
new_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1338279 entries, 1 to 1466654
Data columns (total 13 columns):
Citizenship        1338279 non-null bool
Age                1338279 non-null int64
Class of Work      1338279 non-null int32
Speaks English     1338279 non-null bool
Martial Status     1338279 non-null int64
Military Status    1338279 non-null bool
Education Level    1338279 non-null int32
Male               1338279 non-null bool
Disabled?          1338279 non-null bool
Income             1338279 non-null int32
Place of Work      1338279 non-null int32
Race               1338279 non-null int64
Field of Study     1338279 non-null int32
dtypes: bool(5), int32(5), int64(3)
memory usage: 72.7 MB


## Crossing Features

In [9]:
df_features = ['Citizenship', 'Age', 'Class of Work', 'Speaks English', 'Martial Status', 'Military Status', 
               'Education Level', 'Male', 'Disabled?', 'Place of Work', 'Race', 'Field of Study']

df_class = ['Income']

X = new_data[df_features]
y = new_data[df_class]
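
This section does not yet say which columns are crossed. As a minimal sketch of what a cross-product feature is, the example below combines two categorical columns into a single integer-encoded feature. The choice of `Education Level` and `Field of Study`, the sample rows, and the helper name `cross_columns` are assumptions made for illustration only, not choices made in this lab.

```python
# Illustrative sketch: build one crossed categorical feature from two columns.
# The rows and the column pair are hypothetical, not taken from the dataset.

def cross_columns(rows, col_a, col_b):
    """Return (codes, vocabulary) for the cross of two categorical columns."""
    vocabulary = {}  # maps an "a_b" token to an integer id
    codes = []
    for row in rows:
        token = f"{row[col_a]}_{row[col_b]}"
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
        codes.append(vocabulary[token])
    return codes, vocabulary


rows = [
    {"Education Level": 4, "Field of Study": 2407},
    {"Education Level": 5, "Field of Study": 2407},
    {"Education Level": 4, "Field of Study": 2408},
    {"Education Level": 4, "Field of Study": 2407},
]

codes, vocabulary = cross_columns(rows, "Education Level", "Field of Study")
print(codes)
print(vocabulary)
```

Each distinct pair of values gets its own id, so the wide branch of a network could learn a separate weight for each combination rather than for each column on its own.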

### Evaluation

Goal: Determine if a person's income is above $100,000.

Since this is a binary classification task, we need to choose an evaluation metric that suits it. We can start by establishing a business case. Suppose an advertising agency wants to predict income in order to decide which ads to target at a person: cheaper products for people with lower incomes and more expensive products for people with higher incomes. Targeting viewers with the wrong type of ad would be not only pointless but also costly, since those consumers are unlikely to purchase the advertised product.

In this case, we want both high recall and high precision. High recall means the model identifies most of the people whose income is above 100,000 and marks them as such. High precision means the model is usually correct when it claims a person's income is above 100,000. The F-score (here the F1 score, the harmonic mean of precision and recall) lets us evaluate the model on both at once.
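
To make this argument concrete, here is a small sketch of how precision, recall, and F1 follow from confusion counts, using the class counts reported earlier (1,209,410 people in class 0 and 128,869 in class 1). The helper name `precision_recall_f1` and the always-predict-0 baseline are illustrative assumptions; in the lab itself `sklearn.metrics.f1_score` would be the natural way to compute the score.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class counts from the income split reported earlier in this notebook.
negatives, positives = 1209410, 128869

# Hypothetical baseline that labels everyone as income class 0.
baseline_accuracy = negatives / (negatives + positives)
baseline_scores = precision_recall_f1(tp=0, fp=0, fn=positives)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline precision, recall, F1: {baseline_scores}")
```

The baseline reaches roughly 90% accuracy while its F1 score is 0, which is why accuracy alone would be misleading on this imbalanced dataset.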

### Training and Testing

To split the data for training and testing, I use a shuffled random split with 80% for training and 20% for testing, and I check that both sets have nearly the same class proportions. Because the dataset is very large (over a million entries), a single hold-out split should give a stable estimate, so cross validation should not be necessary.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

In [17]:
# Double-check that the split is evenly proportioned
print('Number of people in each class:')
for value in new_data.Income.unique(): 
    print(str(value)+': ' +str(len(new_data[new_data['Income'] == value])))
    
print('Number of people in each class:')
for value in y_train.Income.unique(): 
    print(str(value)+': ' +str(len(y_train[y_train['Income'] == value])))
    
print('Number of people in each class:')
for value in y_test.Income.unique(): 
    print(str(value)+': ' +str(len(y_test[y_test['Income'] == value])))

Number of people in each class:
0: 1209410
1: 128869
Number of people in each class:
0: 967500
1: 103123
Number of people in each class:
0: 241910
1: 25746
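
The counts above can be turned into positive-class rates to check the split directly. This is a small arithmetic sketch that uses only the numbers printed above; the variable names are illustrative.

```python
# (class 0 count, class 1 count) as printed above
counts = {
    "full": (1209410, 128869),
    "train": (967500, 103123),
    "test": (241910, 25746),
}

# Share of people in income class 1 for each set
rates = {name: pos / (neg + pos) for name, (neg, pos) in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.4%} in income class 1")

# Fraction of all rows that ended up in the test set
test_fraction = sum(counts["test"]) / sum(counts["full"])
print(f"test fraction: {test_fraction:.4f}")
```

The three rates agree to within roughly 0.02 percentage points, so the random split preserved the class balance closely. If an exact match were needed, `train_test_split` accepts a `stratify` argument (for example `stratify=y`), which was not used here.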
