# DS 542 - 2025 Fall - Homework 5

In this homework, you will practice using PyTorch's frameworks for repeatable data management and reusable model designs.
You will also track gradient statistics during the fitting process.

When you are done writing code, make sure to run all the cells and then submit your notebook in [Gradescope](https://www.gradescope.com/courses/1071076).


## Problem 1 - Setup Dataset and DataLoader Objects

PyTorch provides several utilities to help manage large data sets.
In this problem, you will implement `Dataset` and `DataLoader` objects for the Pima Indians Diabetes data set.
This data set is small and easily fits in memory, but these objects will also help with randomization and batching for stochastic gradient descent.


Here is a link to PyTorch's [Datasets & DataLoaders tutorial](https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html).
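
As a rough illustration of the batching and shuffling behavior described above, here is a minimal sketch on made-up data. It uses `torch.utils.data.TensorDataset` as a stand-in for a custom `Dataset`, and the names `toy_dataset` and `toy_loader` are hypothetical; none of this is part of the assignment's required code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data (not the diabetes data): 10 rows, 3 features, binary target.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = (torch.arange(10) % 2).to(torch.float32).unsqueeze(1)

toy_dataset = TensorDataset(features, targets)

# shuffle=True draws a new random row order on every pass over the loader.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_shapes = []
for batch_features, batch_targets in toy_loader:
    batch_shapes.append((tuple(batch_features.shape), tuple(batch_targets.shape)))

# 10 rows with batch_size=4 give batches of 4, 4, and 2 rows.
print(len(toy_loader), batch_shapes)
```

The last batch is smaller because `drop_last` defaults to `False`, so a data set whose length is not a multiple of the batch size still yields every row once per pass.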

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import torch

In [None]:
df = pd.read_csv("https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/raw/refs/heads/master/diabetes.csv")
df.head()

In [None]:
len(df)

Finish the implementation of the `DiabetesDataset` class below by implementing the missing methods for `torch.utils.data.Dataset`.
The dataset should return pairs of tensors where the first tensor is the input row and the second tensor has the corresponding `Outcome` target.

In [None]:
class DiabetesDataset(torch.utils.data.Dataset):
    def __init__(self, dataframe):
        self.data = dataframe.drop('Outcome', axis=1).values
        self.targets = dataframe['Outcome'].values

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.tensor(self.data[idx], dtype=torch.float32), torch.tensor(self.targets[idx], dtype=torch.float32).unsqueeze(0)

Create a `DataLoader` object using an instance of your `DiabetesDataset` class and configure it to shuffle the data and return batches of 100 rows at a time.

In [None]:
dataset = DiabetesDataset(df)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=100, shuffle=True)

Test your data loader.

In [None]:
# DO NOT CHANGE

for batch_input, batch_output in dataloader:
    print("INPUT", batch_input)
    print("OUTPUT", batch_output)
    break

## Problem 2 - Use Adam to Optimize Logistic Regression

Write a training loop using PyTorch's [`torch.optim.Adam`](https://docs.pytorch.org/docs/stable/generated/torch.optim.Adam.html) to optimize logistic regression.
Use the following `LogisticRegression` class for the implementation of logistic regression and [`torch.nn.functional.binary_cross_entropy`](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy.html) for the loss function.
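
To make the loss function concrete, the sketch below compares `torch.nn.functional.binary_cross_entropy` against the formula written out by hand on a few made-up values. The tensors `probs` and `labels` are illustrative assumptions. Note that this function takes probabilities (sigmoid outputs), not raw logits, and expects the target to have the same shape as the input.

```python
import torch
import torch.nn.functional as F

# Made-up predicted probabilities and 0/1 targets, both shaped (batch, 1).
probs = torch.tensor([[0.9], [0.2], [0.6]])
labels = torch.tensor([[1.0], [0.0], [1.0]])

# Library call; the default reduction averages over all elements.
library_loss = F.binary_cross_entropy(probs, labels)

# The same quantity written out: -mean(y * log(p) + (1 - y) * log(1 - p)).
manual_loss = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()

print(library_loss.item(), manual_loss.item())
```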

Run the training loop for 10 epochs, printing the average training batch loss for each epoch.
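
The general shape of such a loop can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the random two-feature data, the hypothetical `TinyLogisticRegression` class, the batch size of 50, and the learning rate of 0.05 are not taken from the assignment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (not the diabetes data): 200 rows, 2 features,
# with the label set by the sign of the feature sum.
toy_inputs = torch.randn(200, 2)
toy_labels = (toy_inputs.sum(dim=1, keepdim=True) > 0).to(torch.float32)
toy_loader = DataLoader(TensorDataset(toy_inputs, toy_labels), batch_size=50, shuffle=True)


class TinyLogisticRegression(torch.nn.Module):
    """Hypothetical model with explicitly registered weights and bias."""

    def __init__(self, num_features):
        super().__init__()
        self.weights = torch.nn.Parameter(torch.zeros(num_features, 1))
        self.bias = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return torch.sigmoid(x @ self.weights + self.bias)


toy_model = TinyLogisticRegression(num_features=2)
# lr=0.05 is an arbitrary choice for this toy problem, not a recommendation.
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(10):
    total_loss = 0.0
    for batch_inputs, batch_labels in toy_loader:
        toy_optimizer.zero_grad()   # clear gradients from the previous step
        loss = torch.nn.functional.binary_cross_entropy(toy_model(batch_inputs), batch_labels)
        loss.backward()             # compute gradients of the batch loss
        toy_optimizer.step()        # apply the Adam update
        total_loss += loss.item()
    # Average of the per-batch losses seen during this epoch.
    epoch_losses.append(total_loss / len(toy_loader))
    print(f"Epoch {epoch + 1}: average batch loss {epoch_losses[-1]:.4f}")
```

Dividing by `len(toy_loader)` gives the mean over batches, which differs slightly from the mean over rows when the last batch is smaller than the others.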

In [None]:
class LogisticRegression(torch.nn.Module):
    def __init__(self):
        super().__init__()

        # use torch.nn.Parameter to register these as model parameters
        self.weights = torch.nn.Parameter(torch.zeros(len(df.columns)-1, 1))
        self.bias = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return torch.sigmoid(x @ self.weights + self.bias)

In [None]:
# YOUR CHANGES HERE

...

## Problem 3 - Track Training Statistics and Gradients

Copy your training loop from Problem 2 and modify it as follows.

1. Increase the number of epochs to 100.
2. Track the training loss of each batch.
3. Track the training accuracy of each batch.
4. Track the loss gradient of each batch for both the weights and bias of the logistic regression.
5. After the training loop is done, plot the data from 2-4. Use Matplotlib's subplot function to stack the charts vertically so they are aligned.
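
For step 4, the gradients are available on each parameter's `.grad` attribute once `loss.backward()` has run. The sketch below, on a made-up batch with hypothetical standalone `weights` and `bias` parameters, records them as plain Python values and checks them against the closed-form logistic regression gradient. It is an illustration under those assumptions, not the required solution.

```python
import torch

# Made-up batch: 4 rows, 2 features, binary targets shaped (4, 1).
x = torch.tensor([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 0.5]])
y = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

weights = torch.nn.Parameter(torch.zeros(2, 1))
bias = torch.nn.Parameter(torch.zeros(1))

probs = torch.sigmoid(x @ weights + bias)
loss = torch.nn.functional.binary_cross_entropy(probs, y)
loss.backward()

# Copy the gradients out as plain Python values so that later
# zero_grad() or backward() calls cannot change the recorded history.
weight_grad_record = weights.grad.flatten().tolist()  # one value per weight
bias_grad_record = bias.grad.item()                   # single scalar

# For mean binary cross-entropy on a sigmoid output, the gradients are
# X^T (p - y) / n for the weights and mean(p - y) for the bias.
residual = probs.detach() - y
expected_weight_grad = (x.T @ residual / len(x)).flatten().tolist()
expected_bias_grad = residual.mean().item()

print(weight_grad_record, bias_grad_record)
```

Storing one value per weight, rather than a single norm, keeps enough detail to plot a separate curve for each input feature.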

In [None]:
# Instantiate the model and optimizer
model = LogisticRegression()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.functional.binary_cross_entropy

# Training loop
num_epochs = 100

# Lists to store training statistics and gradients
train_losses = []
train_accuracies = []
weight_gradients = []
bias_gradients = []

for epoch in range(num_epochs):
    total_loss = 0
    correct_predictions = 0
    total_samples = 0
    for batch_input, batch_output in dataloader:
        # Forward pass
        outputs = model(batch_input)
        loss = criterion(outputs, batch_output)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track loss
        train_losses.append(loss.item())
        total_loss += loss.item()

        # Track accuracy of this batch (fraction of rows predicted correctly)
        predicted = (outputs > 0.5).float()
        batch_correct = (predicted == batch_output).sum().item()
        correct_predictions += batch_correct
        total_samples += batch_output.size(0)
        train_accuracies.append(batch_correct / batch_output.size(0))

        # Track gradients: one value per weight, plus the single bias gradient
        weight_gradients.append(model.weights.grad.flatten().tolist())
        bias_gradients.append(model.bias.grad.item())


    # Print average loss for the epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss / len(dataloader):.4f}')

In [None]:
plt.figure(figsize=(10, 8))

plt.subplot(3, 1, 1)
# plot training loss
plt.plot(train_losses)
plt.title("Training Loss")
plt.xlabel("Batch")
plt.ylabel("Loss")

plt.subplot(3, 1, 2)
# plot training accuracy
plt.plot(train_accuracies)
plt.title("Training Accuracy")
plt.xlabel("Batch")
plt.ylabel("Accuracy")

plt.subplot(3, 1, 3)
# plot weights gradient
plt.plot(bias_gradients, label="bias")
for i in range(len(model.weights.grad)):
    plt.plot([w[i] for w in weight_gradients], label=f"{df.columns[i]}") # Assuming order matches columns
plt.title("Gradients")
plt.xlabel("Batch")
plt.ylabel("Gradient")
plt.legend()


plt.subplots_adjust(hspace=1.0)
plt.show()
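
As an alternative way to keep stacked charts aligned, `plt.subplots` (a different function from `plt.subplot`) can create all three axes at once with a shared x-axis. The sketch below uses made-up numbers in place of the tracked statistics; the lists `toy_losses`, `toy_accuracies`, and `toy_gradients` are hypothetical.

```python
import matplotlib.pyplot as plt

# Made-up per-batch statistics standing in for the tracked lists.
toy_losses = [0.70, 0.65, 0.61, 0.58, 0.56]
toy_accuracies = [0.50, 0.58, 0.63, 0.66, 0.68]
toy_gradients = [0.40, 0.31, 0.25, 0.21, 0.18]

# sharex=True ties the batch axis of all three panels together,
# so zooming or rescaling one panel keeps the others aligned.
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(10, 8))
panels = zip(axes, [toy_losses, toy_accuracies, toy_gradients], ["Loss", "Accuracy", "Gradient"])
for ax, series, name in panels:
    ax.plot(series)
    ax.set_ylabel(name)

# Only the bottom panel needs the x-axis label when the axis is shared.
axes[-1].set_xlabel("Batch")
fig.tight_layout()
plt.show()
```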