# Lab 4: The Multi-Layer Perceptron
## by Michael Doherty, Leilani Guzman, and Carson Pittman

Our goal is to predict the child poverty rate for each county in the United States.

Link to the dataset: https://www.kaggle.com/datasets/muonneutrino/us-census-demographic-data/data

## 1. Load, Split, and Balance
### 1.1 Loading the Data

To begin, we load the data and store it in a pandas DataFrame.

In [1]:
import pandas as pd

df = pd.read_csv("data/acs2017_census_tract_data.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74001 entries, 0 to 74000
Data columns (total 37 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TractId           74001 non-null  int64  
 1   State             74001 non-null  object 
 2   County            74001 non-null  object 
 3   TotalPop          74001 non-null  int64  
 4   Men               74001 non-null  int64  
 5   Women             74001 non-null  int64  
 6   Hispanic          73305 non-null  float64
 7   White             73305 non-null  float64
 8   Black             73305 non-null  float64
 9   Native            73305 non-null  float64
 10  Asian             73305 non-null  float64
 11  Pacific           73305 non-null  float64
 12  VotingAgeCitizen  74001 non-null  int64  
 13  Income            72885 non-null  float64
 14  IncomeErr         72885 non-null  float64
 15  IncomePerCap      73256 non-null  float64
 16  IncomePerCapErr   73256 non-null  float6

As shown above, several columns in the dataset contain missing values. Because the dataset is so large, we will simply remove the observations that have missing data.
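
Before dropping rows, it can help to count how many values are missing per column and how many rows would be removed. The following is a minimal sketch on a small, made-up frame; the `toy` DataFrame and its values are hypothetical stand-ins, not rows from the census file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the census frame (values are made up for illustration)
toy = pd.DataFrame({
    "TotalPop": [1845, 2172, 0, 4267],
    "Income": [67826.0, 41287.0, np.nan, 46806.0],
    "ChildPoverty": [20.8, 35.8, np.nan, 21.1],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows with at least one missing value (the rows dropna() removes)
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Non-inplace variant of the drop, so the original frame stays available
cleaned = toy.dropna()

print(missing_per_column)
print("rows dropped:", rows_with_missing, "rows kept:", len(cleaned))
```

The same two checks could be run on `df` before calling `dropna` to confirm that only a small share of the observations is lost.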

We also need to change the <code>State</code> and <code>County</code> attributes from strings to numeric data so we can use them in our neural network. For now, we will simply encode them as integers by mapping each string to an integer (such as mapping 'Alabama' to 1, 'Alaska' to 2, etc.).

In [2]:
# remove rows with missing data
df.dropna(inplace=True)

# convert 'State' strings to integers, numbered in order of first appearance
# 'Alabama' = 1, 'Alaska' = 2, 'Arizona' = 3, etc.
state_to_int = {state: i for i, state in enumerate(df['State'].unique(), start=1)}
df['State'] = df['State'].map(state_to_int)

# convert 'County' strings to integers, numbered in order of first appearance
# 'Autauga County' = 1, 'Baldwin County' = 2, 'Barbour County' = 3, etc.
county_to_int = {county: i for i, county in enumerate(df['County'].unique(), start=1)}
df['County'] = df['County'].map(county_to_int)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72718 entries, 0 to 74000
Data columns (total 37 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TractId           72718 non-null  int64  
 1   State             72718 non-null  int64  
 2   County            72718 non-null  int64  
 3   TotalPop          72718 non-null  int64  
 4   Men               72718 non-null  int64  
 5   Women             72718 non-null  int64  
 6   Hispanic          72718 non-null  float64
 7   White             72718 non-null  float64
 8   Black             72718 non-null  float64
 9   Native            72718 non-null  float64
 10  Asian             72718 non-null  float64
 11  Pacific           72718 non-null  float64
 12  VotingAgeCitizen  72718 non-null  int64  
 13  Income            72718 non-null  float64
 14  IncomeErr         72718 non-null  float64
 15  IncomePerCap      72718 non-null  float64
 16  IncomePerCapErr   72718 non-null  float6

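| | TractId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001020100 | 1 | 1 | 1845 | 899 | 946 | 2.4 | 86.3 | 5.2 | 0.0 | ... | 0.5 | 0.0 | 2.1 | 24.5 | 881 | 74.2 | 21.2 | 4.5 | 0.0 | 4.6 |
| 1 | 1001020200 | 1 | 1 | 2172 | 1167 | 1005 | 1.1 | 41.6 | 54.5 | 0.0 | ... | 0.0 | 0.5 | 0.0 | 22.2 | 852 | 75.9 | 15.0 | 9.0 | 0.0 | 3.4 |
| 2 | 1001020300 | 1 | 1 | 3385 | 1533 | 1852 | 8.0 | 61.4 | 26.5 | 0.6 | ... | 1.0 | 0.8 | 1.5 | 23.1 | 1482 | 73.3 | 21.1 | 4.8 | 0.7 | 4.7 |
| 3 | 1001020400 | 1 | 1 | 4267 | 2001 | 2266 | 9.6 | 80.3 | 7.1 | 0.5 | ... | 1.5 | 2.9 | 2.1 | 25.9 | 1849 | 75.8 | 19.7 | 4.5 | 0.0 | 6.1 |
| 4 | 1001020500 | 1 | 1 | 9965 | 5054 | 4911 | 0.9 | 77.5 | 16.4 | 0.0 | ... | 0.8 | 0.3 | 0.7 | 21.0 | 4787 | 71.4 | 24.1 | 4.5 | 0.0 | 2.3 |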
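
The integer mapping built in the cell above numbers each string in order of first appearance. As a sketch of an alternative, `pd.factorize` produces the same ordering in one call; the `states` Series here is a hypothetical toy input, and adding 1 to the codes is only needed to match the 1-based numbering used in this lab.

```python
import pandas as pd

# Hypothetical toy column standing in for df['State']
states = pd.Series(["Alabama", "Alabama", "Alaska", "Arizona", "Alaska"])

# Dictionary mapping in first-seen order, starting at 1
state_to_int = {name: i for i, name in enumerate(states.unique(), start=1)}
mapped = states.map(state_to_int)

# pd.factorize assigns codes in first-seen order starting at 0, so add 1
codes, uniques = pd.factorize(states)
factorized = pd.Series(codes + 1, index=states.index)

print(mapped.tolist())
print(factorized.tolist())
print(list(uniques))
```

Both approaches give the same codes; `uniques` keeps the original strings so the encoding can be reversed later.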


We decided to keep the <code>County</code> variable rather than remove it. The <code>State</code> variable gives us enough information to place each observation geographically, but the <code>County</code> variable lets us break these locations down even further. We think this is important because different parts of a state often differ greatly in the makeup of their populations, especially in large states like Texas. Since we are predicting the child poverty rate for each county, being able to distinguish statistical features between counties matters, so we will keep the <code>County</code> variable.
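
One caveat worth checking is that county names are not unique across states, so an encoding based on the name alone gives same-named counties in different states the same integer. The sketch below illustrates this on a hypothetical toy frame and shows one possible workaround, encoding the (state, county) pair instead. It assumes the columns still hold their original strings, and the `pair_key` column is an illustrative name, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real frame has one row per census tract
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Colorado", "Colorado"],
    "County": ["Jefferson County", "Autauga County", "Jefferson County", "Adams County"],
})

# Encoding by county name alone: both Jefferson Counties share one code
name_only = pd.factorize(toy["County"])[0] + 1

# Encoding by the (state, county) pair: each county gets its own code
pair_key = toy["State"] + " / " + toy["County"]
pair_code = pd.factorize(pair_key)[0] + 1

print(name_only.tolist())
print(pair_code.tolist())
```

Whether the distinction matters depends on how the feature is used later; for a one-hot encoding, the pair-based codes would keep unrelated counties in separate columns.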

### 1.2 Splitting the Dataset

Now we need to split the dataset into training data and testing data. We'll use 80% of the data for training and 20% for testing. It's important that we do this before balancing the dataset, as we only want to balance the training data. The testing data should be a representative sample of the population, meaning it shouldn't necessarily be balanced (a truly random sample of the population likely wouldn't be balanced either). We want our model to correctly predict <code>ChildPoverty</code> for any sample of data, regardless of whether that sample is balanced. Thus, we will split the dataset first and then balance only the training data.

In [3]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["ChildPoverty"])
y = df["ChildPoverty"]

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
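
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split. As a minimal sketch, passing a fixed `random_state` makes the split repeatable; the seed value 42 and the synthetic `toy` frame are arbitrary choices for illustration, not values used in this lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (random values, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "TotalPop": rng.integers(500, 10000, size=100),
    "Unemployment": rng.uniform(0, 20, size=100),
    "ChildPoverty": rng.uniform(0, 60, size=100),
})

X = toy.drop(columns=["ChildPoverty"])
y = toy["ChildPoverty"]

# 80% training, 20% testing; random_state=42 is an arbitrary illustrative seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same seed selects the same rows
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
print(X_train.index.equals(X_train_again.index))
```

A fixed seed also keeps the quartile cutoffs computed from `y_train` stable between runs, which makes later comparisons between models easier to interpret.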

### 1.3 Balancing the Dataset

Now that we have split our dataset into training and testing data, we need to balance the training dataset; to do that, we'll split the training data into four classes based on the quartiles of <code>ChildPoverty</code>.

In [4]:
quartile_cutoffs = y_train.quantile([0.25, 0.5, 0.75])

quartiles = []

for val in y_train:
    if val < quartile_cutoffs[0.25]:
        quartiles.append('Quartile 1')
    elif val < quartile_cutoffs[0.5]:
        quartiles.append('Quartile 2')
    elif val < quartile_cutoffs[0.75]:
        quartiles.append('Quartile 3')
    else:
        quartiles.append('Quartile 4')
        
        
print('Quartile 1 count:', quartiles.count('Quartile 1'))
print('Quartile 2 count:', quartiles.count('Quartile 2'))
print('Quartile 3 count:', quartiles.count('Quartile 3'))
print('Quartile 4 count:', quartiles.count('Quartile 4'))
        
X_train['ChildPovertyClass'] = quartiles

Quartile 1 count: 14524
Quartile 2 count: 14472
Quartile 3 count: 14625
Quartile 4 count: 14553


As shown above, splitting the training data into quartiles (based on the values of <code>ChildPoverty</code> in <code>y_train</code>) gives about the same number of instances in each of the four classes. We believe this is the best method to balance the data: the quartile cutoffs produce roughly equal-sized classes by construction, and each class groups together instances that have similar <code>ChildPoverty</code> values.
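
The quartile labeling can also be written without an explicit loop. The sketch below uses `np.searchsorted` on a small hypothetical series (`y_toy`, with made-up values) and keeps the same boundary rule as the loop, where a value equal to a cutoff goes into the higher class. Storing the labels in a separate Series named `child_poverty_class` is an assumption of this sketch, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical target values standing in for y_train
y_toy = pd.Series([2.0, 5.0, 9.0, 14.0, 20.0, 27.0, 35.0, 50.0], name="ChildPoverty")

# 25th, 50th, and 75th percentile cutoffs
cutoffs = y_toy.quantile([0.25, 0.5, 0.75]).to_numpy()

# side="right" mirrors the strict "<" comparisons of the loop:
# a value equal to a cutoff is placed in the higher class
class_index = np.searchsorted(cutoffs, y_toy.to_numpy(), side="right")

labels = np.array(["Quartile 1", "Quartile 2", "Quartile 3", "Quartile 4"])
child_poverty_class = pd.Series(
    labels[class_index], index=y_toy.index, name="ChildPovertyClass"
)

print(cutoffs.tolist())
print(child_poverty_class.value_counts().sort_index())
```

Keeping the class labels in their own Series, rather than as a column of `X_train`, is one way to make sure a label derived from the target is not accidentally passed to the network as an input feature.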

## 2. Pre-Processing and Initial Modeling
### 2.1 Two-Layer Perceptron

### 2.2 Normalizing Continuous Numeric Data

### 2.3 One Hot Encoding Categorical Data

### 2.4 Comparison

## 3. Modeling
### 3.1 Adding a Third Layer

### 3.2 Adding a Fourth Layer

### 3.3 Adding a Fifth Layer

### 3.4 Implementing Adaptive Learning (RENAME TO WHICHEVER ONE WE CHOOSE)

## 4. Adaptive Momentum (AdaM)
### 4.1 Implementing Adaptive Momentum

### 4.2 Quantifying the Performance