
---


# **CS 4824/ECE 4424, Homework 2, 100 points, 15% Credit**
##**Due before 11:59 PM Thursday October 1, 2024**
---



**Instructions**:

1.   Honor code is enforced. This is an individual assignment. You should do your own work. Any evidence of copying will result in an immediate zero grade (0 point) and additional penalties/actions.
2.   Edits are only allowed at where 'TODO' tags exist. Edits made elsewhere will result in an immediate zero grade (0 point).
3.   Importing extra packages is forbidden. Any extra package import (including but not limited to numpy, scikit-learn, etc.) will result in an immediate zero grade (0 point).
4.   Please run each cell, including those that are collapsed, shown as `Show code`.
5.   Upon completion of this assignment, please download a '.ipynb' file through Taskbar > File > Download > .ipynb, then upload the file to Canvas.



### **1. Overview and Objective**
In this assignment, you will be training logistic regression via stochastic gradient-based optimization for predicting income from census data having various features such as age, employment information, education, marital status, occupation, race, sex, capital gain or loss, hours per week, country, etc.

Using the trained model, you will predict the probability that a person earns more than $50k per year. As such, this assignment involves end-to-end training and inference using logistic regression. You will get hands-on understanding of maximizing the conditional log likelihood while incorporating regularization (i.e., MAP estimation) for learning the parameters of logistic regression (i.e, model's weights). You will also get a chance to perform feature engineering and additional fine-tuning of your model.

### **2. Logistic Function [5 points]**###

To perform logistic regression, you have to be able to calculate the logistic function defined as follows:

$$a = \frac{1}{1+e^{-b}}$$

Fill in the `logistic` function:

In [1]:
from math import exp
import random
random.seed(1)

# TODO: Calculate logistic
def logistic(x):
    s = 1 / (1 + exp(-x)) # logistic function over x
    return s

In [2]:
def test_logistic():
    assert abs(logistic(1) - 0.7310585786300049) < 1e-7
    assert abs(logistic(2) - 0.8807970779778823) < 1e-7
    assert abs(logistic(-1) - 0.2689414213699951) < 1e-7
test_logistic()
print('Pass: Logistic Function [5 points]')

Pass: Logistic Function [5 points]


### **3. Dot Product [10 points]**###

The model you are training is just a bunch of numerical weights. To run your model on a data points you will need to compute the dot product of your weights and the features for that data point and run the result through your logistic function. The dot product of two vectors $\mathbf a = [a_1, a_2, ..., a_n]$ and $\mathbf b = [b_1, b_2, ..., b_n]$ is defined as:

$\mathbf a\mathbf \cdot \mathbf b = \sum_{i=1}^n a_i b_i = a_1b_1 + a_2 b_2 + ... + a_n b_n$

Fill in the `dot` function to compute the dot product of two vectors:

In [3]:
# TODO: Calculate dot product of two lists
def dot(x, y):

    Nx = len(x)
    Ny = len(y)

    if Nx != Ny: # vectors must be of the same size
        return None

    s = 0 # dot product over x and y
    for i in range(Nx):
      s = s + x[i] * y[i]

    return s

In [4]:
def test_dot():
    d = dot([1.1,2,3.5], [-1,0.1,.08])
    assert abs(d - (-.62)) < 1e-7
test_dot()
print('Pass: Dot Product [10 points]')

Pass: Dot Product [10 points]


### **4. Prediction [5 points]**###

Now that you can calculate the dot product, prediction task for new data points is straightforward given a model (i.e., the model's weights are available).

To predict for data new points, compute the dot product of your model's weights and the corresponding feature and run the result through your logistic function.

Fill in the `predict` function for prediction task for a new data point given a model. Take a look at `test_predict()` to see what the format for data points is.

In [5]:
# TODO: Calculate prediction based on model
def predict(model, point):
    feautres = point['features']
    d = dot(model, feautres)
    p = logistic(d) # prediction value returned from logistic function
    return p

In [6]:
def test_predict():
    model = [1,2,1,0,1]
    point = {'features':[.4,1,3,.01,.1], 'label': 1}
    p = predict(model, point)
    assert abs(p - 0.995929862284) < 1e-7
test_predict()
print('Pass: Prediction [5 points]')

Pass: Prediction [5 points]


### **5. Data Loading and Analysis [5 points]**###

Cells below load the dataset from a dataset file. `data` is an array consists of several data points.
We provide a list of print statements to help you understand the data format.
Each point is an `ordered Dict` type and has `15` features. Features include *age*, *type_employer* and so on (see the following cell to get an idea on how the data look like).

Performing basic data analysis may help you better understand the dataset and the features involved. As a rudimentary analysis, use the `age` attribute and report how many people fall into the age ranges of `(0, 20], (20, 40], (40, 60], (60, 80], (80, 100]`.
Fill in the `calculate_age_bins` functions and return a list of five values representing the amounts in each age range.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import csv

def load_csv(filename):
    lines = []
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for line in reader:
            lines.append(line)
    return lines

def load_adult_data(fn):
    return load_csv(fn)

# Note: Possibly use different data for training and validation to get a more accurate result,
# but remember that in the last part your model will be trained on the full training data
# load_adult_data() and be tested on a test dataset you don't have access to.
def load_adult_train_data(fn):
    return load_adult_data(fn)

def load_adult_valid_data(fn):
    return load_adult_data(fn)

In [9]:
#Append the directory to your python path using sys
import sys
import os
prefix = '/content/drive/My Drive/'
# modify "customized_path_to_your_homework" here to where you uploaded your homework
customized_path_to_your_homework = 'Colab Notebooks/'
sys_path = prefix + customized_path_to_your_homework
sys.path.append(sys_path)
# print(sys.path)

fp_data = os.path.join(sys_path, 'adult_balanced.data')
data = load_adult_train_data(fp_data)
print('Path to adult.data: {}'.format(fp_data))

Path to adult.data: /content/drive/My Drive/Colab Notebooks/adult_balanced.data


In [10]:
print(len(data))
print(data[0])
print(len(data[0]))
print(data[0].keys())
print(data[14].values())
print(data[0]['age'])

10000
{'age': '26', 'type_employer': 'Private', 'fnlwgt': '162302', 'education': 'Some-college', 'education_num': '10', 'marital': 'Never-married', 'occupation': 'Machine-op-inspct', 'relationship': 'Not-in-family', 'race': 'Asian-Pac-Islander', 'sex': 'Male', 'capital_gain': '0', 'capital_loss': '0', 'hr_per_week': '40', 'country': 'Philippines', 'income': '<=50K'}
15
dict_keys(['age', 'type_employer', 'fnlwgt', 'education', 'education_num', 'marital', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hr_per_week', 'country', 'income'])
dict_values(['34', 'Private', '45114', 'Bachelors', '13', 'Never-married', 'Sales', 'Own-child', 'Black', 'Female', '0', '0', '40', 'United-States', '<=50K'])
26


In [11]:
def calculate_age_bins(data):
    # TODO: return bins as a list of five values where each value represent the number of people in each range of `(0, 20], (20, 40], (40, 60], (60, 80], (80, 100]

    bins = [0]*5

    for i in range(len(data)):
      age = int(data[i]['age'])

      if age <= 20:
        bins[0] += 1
      elif age > 20 and age <= 40:
        bins[1] += 1
      elif age > 40 and age <= 60:
        bins[2] += 1
      elif age > 60 and age <= 80:
        bins[3] += 1
      elif age > 80 and age <= 100:
        bins[4] += 1

    return bins

In [12]:
def test_calculate_age_bins():
    bins = calculate_age_bins(data)
    assert sum(bins) == len(data)
    assert bins == [427, 4785, 4147, 624, 17]
test_calculate_age_bins()
print('Pass: Data Loading and Analysis [5 points]')

Pass: Data Loading and Analysis [5 points]


### **6. Evaluate Accuracy [10 points]**###

Before training your model, let's think about how can you evaluate the accuracy of your prediction. Generally speaking, accuracy quantifies how well your model is doing. As a standard convention, 0.5 is used as a threshold to classify predicted values, i.e., if the predicted real-valued output is greater or equal to 0.5, the output is considered `True`, and is considered `False` otherwise.

 Modify the `accuracy` function to calculate your accuracy on a dataset given a list of data points and the associated predictions.

In [13]:
def accuracy(data, predictions):
    # TODO: Calculate accuracy of predictions on data

    correct = 0 # number of correctedly predicted data points

    for i in range(len(data)):
      if predictions[i] >= 0.5 and data[i]['label'] == True:
        correct += 1
      elif predictions[i] < 0.5 and data[i]['label'] == False:
        correct += 1

    return float(correct)/len(data)

In [14]:
def extract_features(raw):
    data = []
    for r in raw:
        point = {}
        point["label"] = (r['income'] == '>50K')

        features = []
        features.append(1.)
        features.append(float(r['age'])/100)
        features.append(float(r['education_num'])/20)
        features.append(r['marital'] == 'Married-civ-spouse')
        point['features'] = features
        data.append(point)
    return data

In [15]:
def test_accuracy(fn):
    load_data = load_adult_train_data(fn)
    data = extract_features(load_data)
    #print(data)
    a = accuracy(data, [0]*len(data))
    assert abs(a - 0.5) < 1e-7
test_accuracy(fp_data)
print('Pass: Accuracy [10 points]')

Pass: Accuracy [10 points]


### **7. Train Your Model via Stochastic Gradient-based Optimization [30 points]**###
For training your logistic regression model, you need to implement stochastic gradient ascent. That is, you need to iteratively update model's weights by computing the gradient while incorporating regularization.

Stochastic version of gradient-based optimization is different from batch gradient-based optimization where you look at all of the data points before updating your model's weights. Stochastic gradient converges faster but can also be less stable because you have a noisy estimate of the gradient instead of the true gradient. In practice, it is often much more efficient to use stochastic gradient than full batch gradient.

Use the update rule from class to adjust the model's weights, but remember to only look at one point for updating the model since we are performing stochastic gradient ascent. The update rule from class is given below:

$$w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta \lambda w_i^{(t)} + \eta \sum_l X_i^l (Y^l -\hat P(Y^l=1| X^l, W))$$

The training should run for some number of `epochs`. An epoch refers to a full pass over the dataset. In practice it is easier (and more statistically valid) to sample randomly with replacement. Thus, an epoch just means examining `N` data points where `N` is the number of points in your training data.

Fill in the `train` function to train your model. You should use logistic regression with regularization where `rate` is the learning rate and `lam` is the regularization parameter.

To get a more accurate evaluation, you can modify `load_adult_train_data()` and `load_adult_valid_data()` to use different training and validation sets by splitting your data.

In [16]:
# TODO: Initialize model
def initialize_model(k):
    return [random.gauss(0, 1) for x in range(k)]

# TODO: Train model using training data
def train(data, epochs, rate, lam):
    N_features = len(data[0]['features'])

    model = initialize_model(N_features)

    #print(model)

    grad_d = [0] * N_features

    N_iter = 0
    N_epochs = 0

    while(N_epochs < epochs):
      N_iter += 1

      if N_iter == len(data):
        N_iter = 0
        N_epochs += 1

      i = random.randint(0, len(data) - 1)
      d = data[i]

      #print(d)

      y = d['label']
      x = d['features']

      #print(N_iter)
      #print(model)
      #print(d)
      #print()

      for i in range(N_features):
        p_hat = predict(model, d)
        grad_d[i] = x[i] * (y - p_hat)

        model[i] = model[i] - rate * lam * model[i] + rate * grad_d[i]

      #print(grad_d)

    #print(model)

    return model

### **8. Feature Engineering [20 points]**###

Feature engineering (or feature extraction) is the process of extracting "better" features (characteristics, properties, attributes) from raw data. Good feature engineering is often the key to making good machine learning models. The motivation is to use these extra features to improve the quality of results. Add more feature extraction rules to help improve your model's performance. By definition, this is very open ended and so be creative and experiment to find features that work well with your model.

Take a look at the feature extracting code in `extract_features`, and at the raw data in `adult.data`. Right now, your model is only considering age, education, and one possible marital status. In that sense, it is somewhat restrictive and thus "good" feature engineering can help improve the performance of your model.

In [22]:
def extract_features(raw):
    data = []
    for r in raw:
        point = {}
        point["label"] = (r['income'] == '>50K')

        features = []
        features.append(1.)
        features.append(float(r['age'])/100)
        features.append((float(r['age'])/100)**2)
        features.append(float(r['education_num'])/20)
        features.append((float(r['education_num'])/20)**2)
        features.append(r['marital'] == 'Married-civ-spouse')

        #TODO: Add more feature extraction rules here!
        #features.append(r['relationship'] == 'Husband')
        features.append(r['relationship'] == 'Not-in-family')
        features.append(r['marital'] == 'Never-married')
        #features.append(r['race'] == 'White')
        #features.append(r['sex'] == 'Male')
        #features.append(r['type_employer'] == 'Private')
        features.append(float(r['capital_gain'])/1000000)
        features.append(float(r['capital_loss'])/100000)
        #features.append(r['country'] == 'United-States')
        features.append(float(r['hr_per_week'])/168) # 168 hrs per week

        features.append(float(r['age'])/100 * float(r['education_num'])/20)

        features.append(float(r['capital_gain'])/1000000 * float(r['capital_loss'])/100000)

        point['features'] = features
        data.append(point)
    return data

### **9. Fine-tune Your Submission [15 points]**###

Fine-tune your `submission` function to train your final model. You should change your feature extraction and training code to produce the best model you can. Try different learning rates and regularization parameters and see how do they compare. Often it is good to start with a high learning rate and decrease it over time. This is known as learning rate annealing. The way learning rate evolves over time during optimization can be defined by a schedule. Feel free to try various learning rate annealing schedules and observe their effects to figure out what works best given your creative feature engineering. If so, you may need to modify the `train` function to implement this. Your `submission` function should finish execution in no more than 2 minutes, you will get zero points otherwise.

Your final model will be trained on the full training data and tested on an independent validation dataset that you don't have access to. Your grade for this section will be based on your performance relative to a baseline model we obtained during our in-house testing.

In [23]:
# TODO: Tune your parameters for final submission
def submission(data):
    random.seed(1)
    return train(data, 200, 4e-2, 1e-5)

In [24]:
def test_submission(fn):
    train_data = extract_features(load_adult_train_data(fn))
    valid_data = extract_features(load_adult_valid_data(fn))
    model = submission(train_data)
    predictions = [predict(model, p) for p in train_data]
    print("Training Accuracy:", accuracy(train_data, predictions))
    predictions = [predict(model, p) for p in valid_data]
    print("Validation Accuracy:", accuracy(valid_data, predictions))
    print()
test_submission(fp_data)

Training Accuracy: 0.8057
Validation Accuracy: 0.8057



### **10. Acknowledgments**###

The data file is adapted from the adult dataset originally from the UCI repository and later released <a href="https://www.cs.toronto.edu/~delve/data/adult/adultDetail.html">here</a>. We performed class balancing and downsampling for robust performance evaluation.