## Lab 4, Group 4
### Names: Hailey DeMark, Deborah Park, Karis Park
### Student IDs: 48869449, 48878679, 48563429

Link to DataSet: https://www.kaggle.com/datasets/muonneutrino/us-census-demographic-data/data

## Load, Split, and Balance (1.5 points total)

* [.5 points] (1) Load the data into memory and save it to a pandas data frame. Do not normalize or one-hot encode any of the features until asked to do so later in the rubric. (2) Remove any observations that have missing data. (3) Encode any string data as integers for now. (4) You have the option of keeping the "county" variable or removing it. Be sure to discuss why you decided to keep/remove this variable.

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [15]:
# Load Data 
file = 'data.csv'
df = pd.read_csv(file)

# remove missing rows
df.dropna(inplace=True)

# convert string to integer
categorical_columns = df.select_dtypes(include='object').columns
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# remove county
df = df.drop(columns=['County'])

# check
print("Shape of cleaned data:", df.shape)
print("Remaining columns:", df.columns.tolist())

Shape of cleaned data: (72718, 36)
Remaining columns: ['TractId', 'State', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen', 'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty', 'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction', 'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment']
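
One way to inform the keep/remove decision for "County" is to look at its cardinality before dropping it. The sketch below is only an illustration on a made-up five-row frame (the `demo` frame and its values are hypothetical, not taken from the census file); it counts distinct values per string column and shows how many columns a later one-hot encoding of each would add.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from the census CSV.
demo = pd.DataFrame({
    "State": ["Texas", "Texas", "Ohio", "Ohio", "Ohio"],
    "County": ["Dallas County", "Travis County", "Franklin County",
               "Lucas County", "Franklin County"],
    "TotalPop": [4200, 3100, 5200, 2800, 3900],
})

# Number of distinct values in each string column.
cardinality = demo[["State", "County"]].nunique()
print(cardinality)

# One-hot encoding adds one column per distinct value, so a
# high-cardinality column widens the feature matrix considerably.
n_county_columns = pd.get_dummies(demo["County"]).shape[1]
print("One-hot columns for County:", n_county_columns)
```

Running the same two calls on the real frame (before the `drop`) gives the numbers needed to justify the choice in the write-up.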


### The next two requirements will need to be completed together as they might depend on one another:
* [.5 points] Balance the dataset so that about the same number of instances are within each class. Choose a method for balancing the dataset and explain your reasoning for selecting this method. One option is to choose quantization thresholds for the "ChildPoverty" variable that equally divide the data into four classes. Should balancing of the dataset be done for both the training and testing set? Explain.

* [.5 points] Assume you are equally interested in the classification performance for each class in the dataset. Split the dataset into 80% for training and 20% for testing. There is no need to split the data multiple times for this lab.
Note: You will need to one hot encode the target, but do not one hot encode the categorical data until instructed to do so in the lab. 

In [None]:
# Quantize 'ChildPoverty' into 4 classes, using its quartiles as thresholds
quantiles = np.quantile(df['ChildPoverty'], [0.25, 0.5, 0.75])

def quant_poverty(val):
    if val <= quantiles[0]:
        return 0
    elif val <= quantiles[1]:
        return 1
    elif val <= quantiles[2]:
        return 2
    else:
        return 3

df['PovertyClass'] = df['ChildPoverty'].apply(quant_poverty)

# sanity check: the four classes should be roughly equal in size
print("📊 Class distribution after quantization:")
print(df['PovertyClass'].value_counts().sort_index())

# drop the original regression target
df.drop(columns=['ChildPoverty'], inplace=True)

# split features and target
X = df.drop(columns=['PovertyClass']).values
y = df['PovertyClass'].values

# train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

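# sanity check: stratification should keep the classes balanced in both splits
from collections import Counter
print("✅ Class distribution in y_train:", Counter(y_train))
print("✅ Class distribution in y_test :", Counter(y_test))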


📊 Class distribution after quantization:
PovertyClass
0    18229
1    18171
2    18148
3    18170
Name: count, dtype: int64
✅ Class distribution in y_train: Counter({0: 14583, 1: 14537, 3: 14536, 2: 14518})
✅ Class distribution in y_test : Counter({0: 3646, 3: 3634, 1: 3634, 2: 3630})
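
The rubric note asks for a one-hot encoded target, and it also raises the question of whether balancing should touch the test set. The sketch below shows one possible variant, not what the cell above does: the quartile thresholds are learned from the training values only and then applied unchanged to the test values, and the resulting integer classes are one-hot encoded. The `child_poverty` array is random stand-in data, and `one_hot` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
child_poverty = rng.uniform(0, 100, size=1000)  # made-up stand-in values
train_vals, test_vals = child_poverty[:800], child_poverty[800:]

# Thresholds come from the training values only.
thresholds = np.quantile(train_vals, [0.25, 0.5, 0.75])

# right=True reproduces the "<=" comparisons of quant_poverty:
# class 0 if v <= t0, class 1 if t0 < v <= t1, and so on.
y_train_demo = np.digitize(train_vals, thresholds, right=True)
y_test_demo = np.digitize(test_vals, thresholds, right=True)

def one_hot(labels, n_classes=4):
    """Turn integer class labels into rows of a 0/1 indicator matrix."""
    return np.eye(n_classes, dtype=int)[labels]

Y_train_demo = one_hot(y_train_demo)
Y_test_demo = one_hot(y_test_demo)

print(np.bincount(y_train_demo))  # training classes are balanced by construction
print(np.bincount(y_test_demo, minlength=4))  # test classes follow the data
```

With this variant the training classes are balanced exactly, while the test split keeps whatever distribution the fixed thresholds produce.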


## Pre-processing and Initial Modeling (2.5 points total)

You will be using a two layer perceptron from class for the next few parts of the rubric. There are several versions of the two layer perceptron covered in class, with example code. When selecting an example two layer network from class be sure that you use: (1) vectorized gradient computation, (2) mini-batching, (3) cross entropy loss, and (4) proper Glorot initialization, at a minimum. There is no need to use momentum or learning rate reduction (assuming you choose a sufficiently small learning rate). It is recommended to use sigmoids throughout the network, but not required.
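
The four requirements listed above can be sketched in a few lines of NumPy. This is not the class example: it is a minimal stand-in that assumes a sigmoid hidden layer, a softmax output (so the cross entropy gradient at the output is simply prediction minus target), and the standard Glorot uniform limit. All function names, the toy data, and the hyperparameters are hypothetical.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform initialization for a (fan_in, fan_out) matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_two_layer(X, Y, n_hidden=8, epochs=30, eta=0.1, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    W1 = glorot_uniform(X.shape[1], n_hidden, rng)
    b1 = np.zeros(n_hidden)
    W2 = glorot_uniform(n_hidden, Y.shape[1], rng)
    b2 = np.zeros(Y.shape[1])
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle every epoch
        for start in range(0, len(X), batch_size):  # mini-batches
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            # forward pass
            A1 = sigmoid(Xb @ W1 + b1)
            A2 = softmax(A1 @ W2 + b2)
            # vectorized backward pass for softmax + cross entropy
            dZ2 = (A2 - Yb) / len(idx)
            dZ1 = (dZ2 @ W2.T) * A1 * (1.0 - A1)
            # plain gradient descent updates
            W2 -= eta * (A1.T @ dZ2)
            b2 -= eta * dZ2.sum(axis=0)
            W1 -= eta * (Xb.T @ dZ1)
            b1 -= eta * dZ1.sum(axis=0)
        # full-data cross entropy after each epoch, for the loss curve
        P = softmax(sigmoid(X @ W1 + b1) @ W2 + b2)
        losses.append(float(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))))
    return (W1, b1, W2, b2), losses

# Toy two-class data, only to exercise the code.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
Y_demo = np.eye(2)[np.repeat([0, 1], 100)]
params, losses = train_two_layer(X_demo, Y_demo)
print(f"loss after first epoch: {losses[0]:.3f}, after last epoch: {losses[-1]:.3f}")
```

Plotting `losses` against the epoch index gives the convergence graph the rubric asks for.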

* [.5 points] Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Do not normalize or one-hot encode the data (not yet). Be sure that training converges by graphing the loss function versus the number of epochs. 

* [.5 points] Now (1) normalize the continuous numeric feature data. Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Be sure that training converges by graphing the loss function versus the number of epochs.  

* [.5 points] Now (1) normalize the continuous numeric feature data AND (2) one hot encode the categorical data. Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Be sure that training converges by graphing the loss function versus the number of epochs. 

* [1 points] Compare the performance of the three models you just trained. Are there any meaningful differences in performance? Explain, in your own words, why these models have (or do not have) different performances.  
    * Use one-hot encoding and normalization on the dataset for the remainder of this lab assignment.
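
The normalization and one-hot steps above could be sketched with scikit-learn as follows. This is an illustration under assumptions: the tiny `train` and `test` frames are made up, "State" is treated as the only categorical column, and both transformers are fit on the training split only so that no test statistics leak into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature splits; State is already integer-encoded.
train = pd.DataFrame({
    "State": [0, 1, 2, 1],
    "TotalPop": [1000.0, 2000.0, 1500.0, 3000.0],
    "Income": [40000.0, 55000.0, 62000.0, 48000.0],
})
test = pd.DataFrame({
    "State": [2, 0],
    "TotalPop": [1200.0, 2500.0],
    "Income": [50000.0, 45000.0],
})

categorical = ["State"]               # assumed categorical columns
continuous = ["TotalPop", "Income"]   # assumed continuous columns

# Fit on the training split only, then reuse for the test split.
scaler = StandardScaler().fit(train[continuous])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[categorical])

def preprocess(frame):
    """Scale continuous columns and one-hot encode categorical ones."""
    scaled = scaler.transform(frame[continuous])
    onehot = encoder.transform(frame[categorical]).toarray()
    return np.hstack([scaled, onehot])

X_train_prep = preprocess(train)
X_test_prep = preprocess(test)
print(X_train_prep.shape, X_test_prep.shape)
```

For the normalization-only model, the same `scaler` can be applied while the categorical column is passed through unchanged.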

We chose a customer segmentation dataset with 8,068 entries. It includes details about customers, such as their gender, age, marital status, education, job, work experience, spending habits, and family size. The goal is to predict which of four customer groups (A, B, C, or D) a customer belongs to based on their information. This dataset would help businesses understand their customers and target audience better. Additionally, companies can use the results to send personalized offers, improve customer service, and make better marketing plans. This model would be deployed mostly for online use, because companies need to keep up with the trends and preferences of their target audience and general customers to make the right marketing choices and sustain sales. 

## Modeling (5 points total)

* [1 points] Add support for a third layer in the multi-layer perceptron. Add support for saving (and plotting after training is completed) the average magnitude of the gradient for each layer, for each epoch (like we did in the flipped module for back propagation). For magnitude calculation, you are free to use either the average absolute values or the L1/L2 norm.
    * Quantify the performance of the model and graph the magnitudes for each layer versus the number of epochs.

* [1 points] Repeat the previous step, adding support for a fourth layer.

* [1 points] Repeat the previous step, adding support for a fifth layer. 

* [2 points] Implement an adaptive learning technique that was discussed in lecture and use it on the five layer network (choose either RMSProp or AdaDelta). Discuss which adaptive method you chose. Compare the performance of your five layer model with and without the adaptive learning strategy. Do not use AdaM for the adaptive learning technique as it is part of the exceptional work.
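
Two pieces of the steps above lend themselves to a short sketch: recording the average gradient magnitude per layer, and the RMSProp update itself. The code below is a minimal illustration, not the lab's network. It assumes the per-layer weights and gradients are kept in dictionaries, uses the mean absolute value as the magnitude, and stands in a toy objective (0.5 times the squared norm of each weight matrix, whose gradient is the matrix itself) for real back propagation. All names and hyperparameters are hypothetical.

```python
import numpy as np

def rmsprop_step(params, grads, cache, eta=0.01, gamma=0.9, eps=1e-8):
    """One in-place RMSProp update for every layer."""
    for name in params:
        # running average of squared gradients
        cache[name] = gamma * cache[name] + (1.0 - gamma) * grads[name] ** 2
        # scale each step by the root of that running average
        params[name] -= eta * grads[name] / (np.sqrt(cache[name]) + eps)

def mean_abs_gradient(grads):
    """Average absolute gradient value per layer."""
    return {name: float(np.mean(np.abs(g))) for name, g in grads.items()}

rng = np.random.default_rng(0)
params = {"W1": rng.uniform(-1, 1, (3, 4)), "W2": rng.uniform(-1, 1, (4, 2))}
cache = {name: np.zeros_like(w) for name, w in params.items()}

history = []  # one dict of per-layer magnitudes per epoch
for epoch in range(200):
    # toy gradient: d/dW of 0.5 * ||W||^2 is W itself
    grads = {name: w.copy() for name, w in params.items()}
    history.append(mean_abs_gradient(grads))
    rmsprop_step(params, grads, cache)

print("first epoch:", history[0])
print("last epoch: ", history[-1])
```

In a real network the `grads` dictionary would come from back propagation, and plotting each layer's entry in `history` against the epoch index gives the requested magnitude curves.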

## Exceptional Work (1 points total)

5000 level student: You have free rein to provide additional analyses.
One idea (required for 7000 level students):  Implement adaptive momentum (AdaM) in the five layer neural network and quantify the performance compared to other methods.  
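
For reference, the AdaM update mentioned above can be sketched for a single parameter array as follows. This is a minimal illustration with commonly used default values for `beta1`, `beta2`, and `eps`, applied to a toy quadratic objective rather than to the five layer network. The function name and the step size are hypothetical choices.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0, 3.0])
initial_norm = float(np.linalg.norm(w))
m = np.zeros_like(w)
v = np.zeros_like(w)

for t in range(1, 201):
    grad = w  # gradient of the toy objective 0.5 * ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)

final_norm = float(np.linalg.norm(w))
print(f"norm before: {initial_norm:.3f}, after: {final_norm:.3f}")
```

In the network, one `m` and one `v` array would be kept per weight matrix and bias vector, with `t` counting mini-batch updates.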