In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load, Split, and Balance (1.5 points total)
##### **[.5 points]**
* (1) Load the data into memory and save it to a pandas data frame. Do not normalize or one-hot encode any of the features until asked to do so later in the rubric.

In [6]:
#Read in the data
data = pd.read_csv('./acs2017_census_tract_data.csv', low_memory=False)
data.head()

Unnamed: 0,TractId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001020100,Alabama,Autauga County,1845,899,946,2.4,86.3,5.2,0.0,...,0.5,0.0,2.1,24.5,881,74.2,21.2,4.5,0.0,4.6
1,1001020200,Alabama,Autauga County,2172,1167,1005,1.1,41.6,54.5,0.0,...,0.0,0.5,0.0,22.2,852,75.9,15.0,9.0,0.0,3.4
2,1001020300,Alabama,Autauga County,3385,1533,1852,8.0,61.4,26.5,0.6,...,1.0,0.8,1.5,23.1,1482,73.3,21.1,4.8,0.7,4.7
3,1001020400,Alabama,Autauga County,4267,2001,2266,9.6,80.3,7.1,0.5,...,1.5,2.9,2.1,25.9,1849,75.8,19.7,4.5,0.0,6.1
4,1001020500,Alabama,Autauga County,9965,5054,4911,0.9,77.5,16.4,0.0,...,0.8,0.3,0.7,21.0,4787,71.4,24.1,4.5,0.0,2.3


* (2) Remove any observations that have missing data.

In [8]:
# data.describe()
#Finding null values
print("Null Values before drop:\n")
print(data.isnull().sum())

data.dropna(axis=0,how="any",inplace=True)

print("\nNull Values after drop:\n")
print(data.isnull().sum())
# data.describe()

Null Values before drop:

TractId             0
State               0
County              0
TotalPop            0
Men                 0
Women               0
Hispanic            0
White               0
Black               0
Native              0
Asian               0
Pacific             0
VotingAgeCitizen    0
Income              0
IncomeErr           0
IncomePerCap        0
IncomePerCapErr     0
Poverty             0
ChildPoverty        0
Professional        0
Service             0
Office              0
Construction        0
Production          0
Drive               0
Carpool             0
Transit             0
Walk                0
OtherTransp         0
WorkAtHome          0
MeanCommute         0
Employed            0
PrivateWork         0
PublicWork          0
SelfEmployed        0
FamilyWork          0
Unemployment        0
dtype: int64

Null Values after drop:

TractId             0
State               0
County              0
TotalPop            0
Men                 0
Women      

* (3) Encode any string data as integers for now.

In [9]:
print("Categorical Features")
print(data.dtypes[data.dtypes != 'float64'][data.dtypes !='int64'])
data["State"] = data["State"].astype('category')
data["State"] = data["State"].cat.codes
data["County"] = data["County"].astype('category')
data["County"] = data["County"].cat.codes

print("Features After encoding")
print(data.dtypes[data.dtypes != 'float64'][data.dtypes !='int64'])


Categorical Features
State     object
County    object
dtype: object
Features After encoding
State      int8
County    int16
dtype: object
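
The integer codes produced by `cat.codes` follow the sorted order of the category labels, so the code-to-label mapping can be saved before the column is overwritten. The sketch below uses a small made-up frame (`toy`, `state_cat`, and `state_lookup` are illustrative names, not part of the notebook) rather than the census file:

```python
import pandas as pd

# Toy data invented for illustration; not taken from the census file
toy = pd.DataFrame({"State": ["Texas", "Alabama", "Texas", "Ohio"]})

# Convert to a categorical and keep the code -> label lookup before overwriting
state_cat = toy["State"].astype("category")
state_lookup = dict(enumerate(state_cat.cat.categories))
toy["State"] = state_cat.cat.codes

print(toy["State"].tolist())  # [2, 0, 2, 1]
print(state_lookup)           # {0: 'Alabama', 1: 'Ohio', 2: 'Texas'}
```

Because the codes only reflect alphabetical order, their numeric spacing carries no meaning for the model, which is one reason the rubric asks for one-hot encoding later.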


* (4) You have the option of keeping the "county" variable or removing it. Be sure to discuss why you decided to keep/remove this variable.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We decided to keep the county variable rather than averaging the data across each state. Counties within a state vary so much that classifying at the state level would likely lead to much lower accuracy.<br/>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We removed TractId because it is an ID number and not relevant as a predictor of our classes. We also removed the race variables because we wanted to keep the model from picking up a racial bias and to have it focus on variables such as income and the types of industry in the given county.<br/>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We also removed the data on how people commute to work. Commuting modes vary drastically and would mostly indicate whether the population lives in a city, which we judged to be a poor indicator of high child poverty. The counts of Men, Women, and VotingAgeCitizen had to be converted to fractions of the total population so that they are on a comparable scale; otherwise a tract with a larger population would appear to have more men and women than a tract with a smaller one. Finally, we removed the number of employed people over the age of 16 because that information is already reflected in the unemployment rate, and leaving it in would unintentionally give it extra weight.<br/>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;__NOTE:__ It may be worthwhile to add this data back in at the end to see whether it increases our accuracy, but for now we will leave it out.

In [10]:
#Data Cleaning
#Drop Non important columns
data.drop(columns=['TractId','Hispanic','White','Black','Native','Asian','Pacific','Employed','MeanCommute','OtherTransp','Walk','Transit','Carpool','Drive'],inplace=True)


#Convert counts to fractions of TotalPop so that they are not skewed by population
data['Men'] = data['Men'] / data['TotalPop']
data['Women'] = data['Women'] / data['TotalPop']
data['VotingAgeCitizen'] = data['VotingAgeCitizen'] / data['TotalPop']
data.describe()

Unnamed: 0,State,County,TotalPop,Men,Women,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,...,Service,Office,Construction,Production,WorkAtHome,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
count,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,...,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0,72718.0
mean,24.34037,998.356941,4443.485121,0.491322,0.508678,0.717249,61119.999326,9690.325642,30666.653222,4249.725969,...,18.847948,23.413165,9.263044,12.922312,4.612646,79.511827,14.149495,6.167661,0.171231,7.224917
std,15.102552,530.254496,2190.183318,0.040244,0.040244,0.104256,30511.06258,6119.407315,15844.127467,2991.009809,...,7.969609,5.591354,5.943849,7.592511,3.770733,7.95735,7.16479,3.798703,0.45163,5.099419
min,0.0,0.0,58.0,0.037275,0.006886,0.083826,2692.0,728.0,1631.0,351.0,...,0.0,0.0,0.0,0.0,0.0,17.5,0.0,0.0,0.0,0.0
25%,10.0,545.0,2958.0,0.469348,0.488619,0.672307,40380.0,5737.0,20624.0,2508.0,...,13.3,19.7,5.0,7.2,2.0,75.3,9.3,3.5,0.0,3.9
50%,24.0,1045.0,4137.0,0.490761,0.509239,0.739374,54413.0,8268.0,27249.0,3404.0,...,17.7,23.2,8.4,11.8,3.8,80.6,13.0,5.5,0.0,6.0
75%,38.0,1433.75,5532.75,0.511381,0.530652,0.783471,74688.0,11909.0,36413.0,4959.0,...,23.2,26.9,12.5,17.5,6.3,85.0,17.6,8.0,0.0,9.0
max,51.0,1953.0,65528.0,0.993114,0.962725,0.992776,249750.0,153365.0,220253.0,84414.0,...,70.9,72.3,68.1,60.5,82.8,100.0,80.7,47.4,22.3,62.8


#### Determining the cutoff for our Categories of child poverty
##### [.5 points] Balance the dataset so that about the same number of instances are within each class. Choose a method for balancing the dataset and explain your reasoning for selecting this method. One option is to choose quantization thresholds for the "ChildPoverty" variable that equally divide the data into four classes. Should balancing of the dataset be done for both the training and testing set? Explain.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We used pandas' built-in `qcut` function, which divides a series into n bins containing roughly equal numbers of values. Since we want four target classes, we passed n = 4, so the bin edges are the quartiles of the data: the first bin, low child poverty, covers the 0th to 25th percentile, moderate covers the 25th to 50th percentile, and so on. This leaves a roughly equal number of entries in each target class. Balance is extremely important for the training data because without it the model could become good at recognizing only the low poverty class and would have little data to learn from in the high or extreme classes, for example.<br/>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Balance is less important for the testing data. The model does not learn from the test set, so an uneven class mix there cannot skew it one way or the other. We would argue that an uneven number of each class in the testing set is almost beneficial, because it can show whether the model generalizes well.

In [12]:
#Coming up with divisors for child poverty

tmp = pd.qcut(data['ChildPoverty'],4,labels=['low','moderate','high','extreme'])
data['ChildPoverty'] = tmp
print(data.groupby(['ChildPoverty']).size())

ChildPoverty
low         18229
moderate    18171
high        18148
extreme     18170
dtype: int64
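
To see which thresholds `qcut` chose, `retbins=True` returns the bin edges alongside the labels. The sketch below runs on synthetic skewed values (a stand-in generated with NumPy, not the real `ChildPoverty` column), so the edges it prints say nothing about the census data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a skewed percentage column (not the real data)
rng = np.random.default_rng(0)
values = pd.Series(rng.beta(2, 5, size=1000) * 100)

labels = ["low", "moderate", "high", "extreme"]
binned, edges = pd.qcut(values, 4, labels=labels, retbins=True)

print(edges)                  # five edges: min, Q1, median, Q3, max
print(binned.value_counts())  # class sizes should be close to equal
```

Applied to the notebook's own column, the same call would report the poverty rates that separate low, moderate, high, and extreme.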


##### [.5 points] Assume you are equally interested in the classification performance for each class in the dataset. Split the dataset into 80% for training and 20% for testing. There is NO NEED to split the data multiple times for this lab.

In [37]:
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res


train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

print(f'Train_data size: {train_data.shape[0]} - {train_data.shape[0] / data.shape[0] * 100:.2f}% of original data')
print('Train_data classes:\n',train_data.groupby(['ChildPoverty']).size()) #Ensure that the number of classes stays relatively equivalent
print(f'\nTest_data size: {test_data.shape[0]} - {test_data.shape[0] / data.shape[0] * 100:.2f}% of original data')
print('Test_data classes:\n',test_data.groupby(['ChildPoverty']).size())

y = pd.DataFrame(train_data['ChildPoverty'])
X = train_data.drop(columns=['ChildPoverty'],inplace=False)
y = encode_and_bind(y,'ChildPoverty')


print()
print(f'Type of X: {type(X)}\nType of y: {type(y)}')

Train_data size: 58174 - 80.00% of original data
Train_data classes:
 ChildPoverty
low         14596
moderate    14447
high        14538
extreme     14593
dtype: int64

Test_data size: 14544 - 20.00% of original data
Test_data classes:
 ChildPoverty
low         3633
moderate    3724
high        3610
extreme     3577
dtype: int64

Type of X: <class 'pandas.core.frame.DataFrame'>
Type of y: <class 'pandas.core.frame.DataFrame'>
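
The split above relies on random sampling to keep the classes roughly even. If exact class proportions in both subsets are wanted, `train_test_split` accepts a `stratify` argument. This is an optional variation, not what the notebook ran, and the frame below is invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame: one feature and a balanced four-class label (invented data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=400),
    "label": np.repeat(["low", "moderate", "high", "extreme"], 100),
})

# stratify keeps each class at the same proportion in train and test
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

print(train["label"].value_counts())
print(test["label"].value_counts())
```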


### Pre-processing and Initial Modeling (2.5 points total)
You will be using a two layer perceptron from class for the next few parts of the rubric. There are several versions of the two layer perceptron covered in class, with example code. When selecting an example two layer network from class be sure that you use: (1) vectorized gradient computation, (2) mini-batching, (3) cross entropy loss, and (4) proper Glorot initialization, at a minimum. There is no need to use momentum or learning rate reduction (assuming you choose a sufficiently small learning rate). It is recommended to use sigmoids throughout the network, but not required.
[.5 points] Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Do not normalize or one-hot encode the data (not yet). Be sure that training converges by graphing the loss function versus the number of epochs.
[.5 points] Now (1) normalize the continuous numeric feature data. Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Be sure that training converges by graphing the loss function versus the number of epochs.
[.5 points] Now (1) normalize the continuous numeric feature data AND (2) one hot encode the categorical data. Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Be sure that training converges by graphing the loss function versus the number of epochs.
[1 points] Compare the performance of the three models you just trained. Are there any meaningful differences in performance? Explain, in your own words, why these models have (or do not have) different performances.
Use one-hot encoding and normalization on the dataset for the remainder of this lab assignment.
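
One possible way to carry out the normalization and one-hot steps is sketched below. It assumes scikit-learn's `StandardScaler` for scaling and `pd.get_dummies` for encoding, fits the scaler on the training rows only, and aligns the test columns to the training columns. The tiny frames and the column lists are hypothetical placeholders, not the notebook's data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature train/test frames for illustration
train = pd.DataFrame({"State": [0, 1, 2, 1],
                      "Income": [40000.0, 55000.0, 70000.0, 61000.0]})
test = pd.DataFrame({"State": [2, 0], "Income": [48000.0, 90000.0]})

numeric_cols = ["Income"]     # continuous features to normalize
categorical_cols = ["State"]  # integer-coded categories to one-hot encode

# Fit the scaler on training data only, then apply it to both sets
scaler = StandardScaler().fit(train[numeric_cols])
train_num = pd.DataFrame(scaler.transform(train[numeric_cols]),
                         columns=numeric_cols, index=train.index)
test_num = pd.DataFrame(scaler.transform(test[numeric_cols]),
                        columns=numeric_cols, index=test.index)

# One-hot encode; cast to category so get_dummies treats the codes as labels
train_cat = pd.get_dummies(train[categorical_cols].astype("category"), dtype=float)
test_cat = pd.get_dummies(test[categorical_cols].astype("category"), dtype=float)
# Add any category missing from the test set as an all-zero column
test_cat = test_cat.reindex(columns=train_cat.columns, fill_value=0.0)

X_train = pd.concat([train_num, train_cat], axis=1)
X_test = pd.concat([test_num, test_cat], axis=1)
print(X_train.columns.tolist())
```

Fitting on the training rows only keeps test-set statistics out of the model inputs.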

### Modeling (5 points total)
[1 points] Add support for a third layer in the multi-layer perceptron. Add support for saving (and plotting after training is completed) the average magnitude of the gradient for each layer, for each epoch (like we did in the flipped module for back propagation). For magnitude calculation, you are free to use either the average absolute values or the L1/L2 norm.
Quantify the performance of the model and graph the magnitudes for each layer versus the number of epochs.
[1 points] Repeat the previous step, adding support for a fourth layer.
[1 points] Repeat the previous step, adding support for a fifth layer.
[2 points] Implement an adaptive learning technique that was discussed in lecture and use it on the five layer network (such as AdaGrad, RMSProp, or AdaDelta). Discuss which adaptive method you chose. Compare the performance of your five layer model with and without the adaptive learning strategy. Do not use AdaM for the adaptive learning technique as it is part of the exceptional work.
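
As an illustration of one of the adaptive methods named above, here is a minimal NumPy sketch of an RMSProp-style update applied to a toy quadratic rather than a network. The function name `rmsprop_step` and the hyperparameter values are assumptions made for this example; the formulation covered in lecture may differ in detail:

```python
import numpy as np

def rmsprop_step(w, grad, cache, eta=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp-style update; returns new weights and running average."""
    # Exponentially decayed average of squared gradients
    cache = gamma * cache + (1.0 - gamma) * grad ** 2
    # Scale each weight's step by the root of its running average
    w = w - eta * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy objective f(w) = sum(w**2), whose gradient is 2*w
w = np.array([3.0, -2.0])
cache = np.zeros_like(w)
for _ in range(500):
    w, cache = rmsprop_step(w, 2.0 * w, cache)

print(w)  # both entries end up close to zero
```

In a multi-layer network, one such `cache` array would be kept per weight matrix and updated with that layer's gradient on every mini-batch.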

### Exceptional Work (1 points total)
5000 level student: You have free rein to provide additional analyses.
One idea (required for 7000 level students):  Implement adaptive momentum (AdaM) in the five layer neural network and quantify the performance.