# Lab Assignment Four: Multi-Layer Perceptron

Code by Miller Boyd

## Introduction
In this lab, you will build your own multi-layer perceptron implementations and compare their performance. This project is a crucial component of the course and counts for 10% of the final grade. Teams must submit a comprehensive report as a Jupyter notebook that includes all code, visualizations, and narrative. For visualizations that cannot be embedded directly, include screenshots. Make sure the results can be reproduced from the submitted notebook.


## Dataset Selection
This assignment uses the US Census dataset selected by the instructor. The dataset is available on Kaggle and can also be downloaded from [Dropbox](https://www.dropbox.com/s/bf7i7qjftk7cmzq/acs2017_census_tract_data.csv?dl=0). The task is to predict the child poverty rate of each census tract. Because the rate is continuous, quantize it into four levels to obtain a four-class classification problem.

The data is in the file `acs2017_census_tract_data.csv`.

### Load, Split, and Balance (1.5 points)
- **[0.5 points]** Load the data into a pandas DataFrame, remove missing data, encode strings as integers, and decide on the inclusion of the "county" variable with justification.
- **[0.5 points]** Balance the dataset ensuring an equal number of instances across classes. Explain your chosen method for balancing and whether it applies to both training and testing sets.
- **[0.5 points]** Split the dataset into an 80/20 train/test ratio, aiming for equal classification performance across classes. Only one-hot encode the target at this stage.

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Load the data
data_path = 'acs2017_census_tract_data.csv'
df = pd.read_csv(data_path)

# Remove missing data
df = df.dropna()

# Encode string columns (e.g. 'State' and 'County') as integers.
# A separate LabelEncoder is kept per column so each mapping can be inverted later.
encoders = {}
for col in df.select_dtypes(include=['object']).columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])


In [13]:
# Decide on the inclusion of the 'County' variable
# This is a placeholder for your analysis on whether to keep or remove 'County'.
# Example decision: remove 'County' (label-encoded above) by uncommenting the next line
# df = df.drop(columns=['County'])
df.columns
df.describe

<bound method NDFrame.describe of            TractId  State  County  TotalPop   Men  Women  Hispanic  White  \
0       1001020100      0      89      1845   899    946       2.4   86.3   
1       1001020200      0      89      2172  1167   1005       1.1   41.6   
2       1001020300      0      89      3385  1533   1852       8.0   61.4   
3       1001020400      0      89      4267  2001   2266       9.6   80.3   
4       1001020500      0      89      9965  5054   4911       0.9   77.5   
...            ...    ...     ...       ...   ...    ...       ...    ...   
73996  72153750501     39    1938      6011  3035   2976      99.7    0.3   
73997  72153750502     39    1938      2342   959   1383      99.1    0.9   
73998  72153750503     39    1938      2218  1001   1217      99.5    0.2   
73999  72153750601     39    1938      4380  1964   2416     100.0    0.0   
74000  72153750602     39    1938      3001  1343   1658      99.2    0.8   

       Black  Native  ...  Walk  OtherTra

In [14]:
# Balance the dataset
# First, quantize the 'ChildPoverty' variable into 4 classes
df['ChildPoverty_Quantized'] = pd.qcut(df['ChildPoverty'], 4, labels=False)

# Count the number of instances in each class to decide on balancing method
class_counts = df['ChildPoverty_Quantized'].value_counts()
print("Class distribution before balancing:", class_counts)

# Balance by undersampling every class down to the size of the smallest class.
# More sophisticated methods may suit a more skewed class distribution.
# random_state makes the sampling reproducible.
min_class_size = class_counts.min()
df_balanced = (
    df.groupby('ChildPoverty_Quantized')
      .sample(n=min_class_size, random_state=42)
      .reset_index(drop=True)
)


Class distribution before balancing: ChildPoverty_Quantized
0    18229
1    18171
3    18170
2    18148
Name: count, dtype: int64


In [15]:
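# Split the dataset into 80% training and 20% testing
# 'ChildPoverty_Quantized' is the target. The raw 'ChildPoverty' column is dropped
# from the features as well, because keeping it would leak the target into X.
X = df_balanced.drop(['ChildPoverty', 'ChildPoverty_Quantized'], axis=1)
y = df_balanced['ChildPoverty_Quantized']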

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])

Training set size: 58073
Testing set size: 14519
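
The first task asks for the target to be one-hot encoded once the split is done. A minimal sketch of one way to do that follows. The `y_demo` series is a hypothetical stand-in for `y_train`, and `n_classes = 4` assumes the four quantized poverty levels used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for y_train: integer class labels in 0..3
y_demo = pd.Series([0, 3, 1, 2, 1, 0], name="ChildPoverty_Quantized")

n_classes = 4  # assumption: four quantized poverty levels

# Indexing the identity matrix by label yields one one-hot row per sample
Y_onehot = np.eye(n_classes, dtype=int)[y_demo.to_numpy()]

print(Y_onehot.shape)
print(Y_onehot)
```

This layout has one row per sample. The `_encode_labels` helper in the class example works with the transposed layout (one column per sample), so `Y_onehot.T` would be the matching form there.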



### Pre-processing and Initial Modeling (2.5 points)
- **[0.5 points]** Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Do not normalize or one-hot encode the data (not yet). Be sure that training converges by graphing the loss function versus the number of epochs.

In [16]:
import numpy as np
import pandas as pd
from scipy.special import expit as sigmoid
import sys
from sklearn.metrics import accuracy_score

class TwoLayerPerceptronBase:
    def __init__(self, n_hidden=30, C=0.0, epochs=500, eta=0.001, random_state=None):
        np.random.seed(random_state)
        self.n_hidden = n_hidden
        self.l2_C = C
        self.epochs = epochs
        self.eta = eta

    @staticmethod
    def _encode_labels(y):
        # Cast to float so the arithmetic in the cost and gradient never runs on booleans
        onehot = pd.get_dummies(y).to_numpy(dtype=float).T
        return onehot

    def _initialize_weights(self):
        # Glorot initialization
        limit1 = np.sqrt(6 / (self.n_features_ + 1 + self.n_hidden))
        W1 = np.random.uniform(-limit1, limit1, size=(self.n_hidden, self.n_features_ + 1))
        
        limit2 = np.sqrt(6 / (self.n_hidden + 1 + self.n_output_))
        W2 = np.random.uniform(-limit2, limit2, size=(self.n_output_, self.n_hidden + 1))
        return W1, W2

    @staticmethod
    def _add_bias_unit(X, how='column'):
        if how == 'column':
            X_new = np.hstack((np.ones((X.shape[0], 1)), X))
        elif how == 'row':
            X_new = np.vstack((np.ones((1, X.shape[1])), X))
        else:
            raise ValueError("how must be 'column' or 'row'")
        return X_new

    @staticmethod
    def _L2_reg(lambda_, W1, W2):
        return (lambda_/2.0) * (np.sum(W1[:, 1:] ** 2) + np.sum(W2[:, 1:] ** 2))

    def _feedforward(self, X, W1, W2):
        A1 = self._add_bias_unit(X, how='column')
        Z1 = W1 @ A1.T
        A2 = sigmoid(Z1)
        A2 = self._add_bias_unit(A2, how='row')
        Z2 = W2 @ A2
        A3 = sigmoid(Z2)
        return A1, Z1, A2, Z2, A3

    def predict(self, X):
        _, _, _, _, A3 = self._feedforward(X, self.W1, self.W2)
        y_pred = np.argmax(A3, axis=0)
        return y_pred

class TLPMiniBatchCrossEntropy(TwoLayerPerceptronBase):
    def __init__(self, alpha=0.0, decrease_const=0.0, shuffle=True, minibatches=1, **kwds):
        super().__init__(**kwds)
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _cost(self, A3, Y_enc, W1, W2):
        cost = -np.mean(np.nan_to_num((Y_enc * np.log(A3) + (1 - Y_enc) * np.log(1 - A3))))
        L2_term = self._L2_reg(self.l2_C, W1, W2)
        return cost + L2_term

    def _get_gradient(self, A1, A2, A3, Z1, Z2, Y_enc, W1, W2):
        V2 = (A3 - Y_enc)
        V1 = A2 * (1 - A2) * (W2.T @ V2)
        grad2 = V2 @ A2.T
        grad1 = V1[1:, :] @ A1.T
        grad1[:, 1:] += W1[:, 1:] * self.l2_C
        grad2[:, 1:] += W2[:, 1:] * self.l2_C
        return grad1, grad2

    def fit(self, X, y, print_progress=False, XY_test=None):
        X_data, y_data = X.copy(), y.copy()
        Y_enc = self._encode_labels(y)
        self.n_features_ = X_data.shape[1]
        self.n_output_ = Y_enc.shape[0]
        self.W1, self.W2 = self._initialize_weights()
        self.cost_ = []
        self.score_ = []
        self.score_.append(accuracy_score(y_data, self.predict(X_data)))
        if XY_test is not None:
            X_test, y_test = XY_test
            self.val_score_ = [accuracy_score(y_test, self.predict(X_test))]
        for i in range(self.epochs):
            eta = self.eta / (1 + self.decrease_const * i)
            if print_progress and (i + 1) % print_progress == 0:
                print(f'\rEpoch: {i+1}/{self.epochs}', end='')
            if self.shuffle:
                idx = np.random.permutation(y_data.shape[0])
                X_data, Y_enc, y_data = X_data[idx], Y_enc[:, idx], y_data[idx]
            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:
                A1, Z1, A2, Z2, A3 = self._feedforward(X_data[idx], self.W1, self.W2)
                cost = self._cost(A3, Y_enc[:, idx], self.W1, self.W2)
                self.cost_.append(cost)
                grad1, grad2 = self._get_gradient(A1, A2, A3, Z1, Z2, Y_enc[:, idx], self.W1, self.W2)
                self.W1 -= eta * grad1
                self.W2 -= eta * grad2
            self.score_.append(accuracy_score(y_data, self.predict(X_data)))
            if XY_test is not None:
                self.val_score_.append(accuracy_score(y_test, self.predict(X_test)))
        return self


In [17]:
import matplotlib.pyplot as plt

# X_train, X_test, y_train, y_test come from the split in the previous cell.
# fit() indexes with integer position arrays (X_data[idx]), which raises a
# KeyError on a DataFrame, so convert everything to NumPy arrays first.
X_train_np = X_train.to_numpy(dtype=float)
X_test_np = X_test.to_numpy(dtype=float)
y_train_np = y_train.to_numpy()
y_test_np = y_test.to_numpy()

# Instantiate the model
tlp = TLPMiniBatchCrossEntropy(epochs=100, eta=0.001, minibatches=50, n_hidden=30, random_state=1)

# Fit the model to the training data
tlp.fit(X_train_np, y_train_np, print_progress=True)

# cost_ holds one value per mini-batch, so average within each epoch before plotting
epoch_cost = np.mean(np.array(tlp.cost_).reshape(tlp.epochs, -1), axis=1)
plt.plot(range(1, tlp.epochs + 1), epoch_cost, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')
plt.title('Cost vs. Epochs')
plt.show()

# Calculate accuracy on test data
test_accuracy = accuracy_score(y_test_np, tlp.predict(X_test_np))
print(f"Test accuracy: {test_accuracy*100:.2f}%")

- **[0.5 points]** Now normalize the continuous numeric feature data. Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Be sure that training converges by graphing the loss function versus the number of epochs.


- **[0.5 points]** Now (1) normalize the continuous numeric feature data AND (2) one-hot encode the categorical data. Use the example two-layer perceptron network from the class example and quantify performance using accuracy. Be sure that training converges by graphing the loss function versus the number of epochs.


- **[1 point]** Compare the performance of the three models you just trained. Are there any meaningful differences in performance? Explain, in your own words, why these models have (or do not have) different performances. Use one-hot encoding and normalization on the dataset for the remainder of this lab assignment.
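
The normalization and one-hot encoding steps named in the tasks above can be sketched with scikit-learn's `ColumnTransformer`. This is only an illustration under stated assumptions: the tiny `train` and `test` frames hold made-up values, and the split into `numeric_cols` and `categorical_cols` is hypothetical, so the real census columns would need their own lists. The transformer is fit on the training rows only so that no test statistics leak into the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up miniature frames standing in for X_train and X_test
train = pd.DataFrame({
    "State": [0, 1, 1, 2],
    "TotalPop": [1000.0, 2000.0, 3000.0, 4000.0],
    "Income": [10.0, 20.0, 30.0, 40.0],
})
test = pd.DataFrame({
    "State": [1, 3],  # category 3 never appears in the training rows
    "TotalPop": [2500.0, 5000.0],
    "Income": [25.0, 50.0],
})

numeric_cols = ["TotalPop", "Income"]  # assumption: continuous features
categorical_cols = ["State"]           # assumption: integer-coded categories

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        # Unseen categories at test time become an all-zero row
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

X_train_prep = preprocess.fit_transform(train)  # learn mean/std and categories
X_test_prep = preprocess.transform(test)        # reuse the training statistics

print(X_train_prep.shape, X_test_prep.shape)
```

The resulting arrays can be passed straight to the perceptron's `fit` and `predict`, since they are plain NumPy arrays.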


### Modeling (5 points)
- **[1 point]** Extend the perceptron model to include a third layer, incorporating gradient magnitude tracking for each layer per epoch. Quantify performance and graph gradient magnitudes.
- **[1 point]** Add a fourth layer to the model, repeating the performance quantification and gradient magnitude tracking.
- **[1 point]** Introduce a fifth layer, continuing with performance quantification and gradient tracking.
- **[2 points]** Implement an adaptive learning technique (excluding AdaM) for the five-layer network. Discuss your choice of technique, compare model performances with and without the adaptive strategy.
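
As a rough illustration of two ideas from the list above, the sketch below records a per-epoch gradient magnitude and applies one possible adaptive technique, RMSProp. RMSProp is chosen here only as an example and is not prescribed by the assignment. The helper names `grad_magnitude` and `rmsprop_step`, the hyperparameters, and the toy quadratic loss are all assumptions. In a real network the same two calls would be made once per layer inside the training loop.

```python
import numpy as np

def grad_magnitude(grad):
    """Mean absolute value of one layer's gradient (one scalar per layer)."""
    return float(np.mean(np.abs(grad)))

def rmsprop_step(W, grad, cache, eta=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update. Returns the new weights and the new cache."""
    # Exponential moving average of the squared gradient
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # Scale each weight's step by the root of its running squared gradient
    W = W - eta * grad / (np.sqrt(cache) + eps)
    return W, cache

# Toy problem: minimise 0.5 * ||W||^2, whose gradient is W itself
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
cache = np.zeros_like(W)
magnitudes = []  # one entry per epoch, suitable for plotting

for epoch in range(500):
    grad = W
    magnitudes.append(grad_magnitude(grad))
    W, cache = rmsprop_step(W, grad, cache)

print(f"first: {magnitudes[0]:.4f}  last: {magnitudes[-1]:.4f}")
```

Plotting one `magnitudes` list per layer on a shared axis makes it easy to see whether the gradients in the early layers shrink faster than those in the later layers as depth grows.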

### Exceptional Work (1 point)
- **5000 level students:** You are encouraged to explore additional analyses.
- **7000 level students (required):** Implement AdaM in the five-layer neural network and compare its performance with other models.
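
For orientation, a single AdaM update can be sketched as below. This is a minimal version under the commonly used default hyperparameters, shown on a toy quadratic loss. The function name `adam_step` and its argument layout are assumptions. A five-layer network would keep a separate `m` and `v` for each weight matrix.

```python
import numpy as np

def adam_step(W, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaM update at step t (t starts at 1). Returns W, m, v."""
    m = beta1 * m + (1.0 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# Toy problem: minimise 0.5 * ||W||^2, whose gradient is W itself
W0 = np.array([1.0, -2.0])
W = W0.copy()
m = np.zeros_like(W)
v = np.zeros_like(W)

for t in range(1, 1001):
    grad = W
    W, m, v = adam_step(W, grad, m, v, t, eta=0.01)

print("start:", W0, "end:", W)
```

Because of the bias correction, the very first step moves each weight by roughly `eta` in the direction opposite its gradient, regardless of the gradient's scale.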