# Deep Learning: Assignment #1
## Submission date: 03/12/2025, 23:59.
### Topics:
- Logistic Regression
- Feedforward Neural Networks
- Backpropagation
- Optimization
- Batch Normalization

**Submitted by:**

- **Student 1 (Name, ID)**
- **Student 2 (Name, ID)**  


**Assignment Instructions:**

· Submissions are in **pairs only**. Write both names + IDs at the top of the notebook.

· Keep your code **clean, concise, and readable**.

· You may work in your IDE, but you **must** paste the final code back into the **matching notebook cells** and run it there.  


· <font color='red'>Write your textual answers in red.</font>  
(e.g., `<span style="color:red">your answer here</span>`)

· All figures, printed results, and outputs should remain visible in the notebook.  
Run **all cells** before submitting and **do not clear outputs**.

· Use relative paths — **no absolute file paths** pointing to local machines.

· **Important:** Your submission must be entirely your own.  
Any form of plagiarism (including uncredited use of ChatGPT or AI tools) will result in **grade 0** and disciplinary action.


In [None]:
# --- Global Setup ---

# Import Libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import itertools
import random
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
import torch.optim as optim
import torch.nn.init as init

# Device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", DEVICE)

## Question 1 — From Classic ML to Deep Networks

In this question we embark on a short expedition from classic Machine Learning to Deep Learning.

We will compare the performance of a **multiclass logistic regression** to a **multi-layer perceptron (MLP)** on the **MNIST** dataset.


Start by running this cell, which includes the relevant setup and useful helper functions already implemented for you.

In [None]:
# --- Define Seed ---
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# --- Helper Functions To Use ---
def accuracy(logits: torch.Tensor, y: torch.Tensor):
    """Top-1 accuracy for logits [N,C] and labels [N]."""
    return (logits.argmax(dim=1) == y).float().mean().item()

def count_params(obj):
    """
    Count trainable parameters.
    - If obj is (W, b) tuple → counts elements.
    - If obj is a nn.Module → sums requires_grad params.
    """
    if isinstance(obj, tuple) and len(obj) == 2:
        W, b = obj
        return W.numel() + b.numel()
    if isinstance(obj, nn.Module):
        return sum(p.numel() for p in obj.parameters() if p.requires_grad)
    raise TypeError("count_params expects (W,b) or nn.Module.")


@torch.no_grad()
def evaluate_acc(W: torch.Tensor, b: torch.Tensor, loader):
    """Dataset-level accuracy for a linear softmax model parameterized by (W,b)."""
    total_acc, total_n = 0.0, 0
    for xb, yb in loader:
        xb = xb.to(DEVICE).view(xb.size(0), -1)
        yb = yb.to(DEVICE)
        logits = xb @ W + b
        batch_acc = accuracy(logits, yb)
        n = xb.size(0)
        total_acc += batch_acc * n         # weight by batch size
        total_n   += n
    return total_acc / total_n


# Use this function in the training loop for your nn.Module
@torch.no_grad()
def evaluate_module(model: nn.Module, loader):
    model.eval()
    total_acc, total_n = 0.0, 0
    for xb, yb in loader:
        xb, yb = xb.to(DEVICE), yb.to(DEVICE)
        logits = model(xb)
        batch_acc = accuracy(logits, yb)
        n = xb.size(0)
        total_acc += batch_acc * n
        total_n   += n
    return total_acc / total_n

### Load & Preprocess the Data:



We first load the MNIST dataset as provided by `torchvision`. All images are automatically converted to tensors in the range $[0,1]$.


In [None]:
# Load the raw MNIST dataset
transform = transforms.ToTensor()

train_full = datasets.MNIST(root="./data", train=True,  download=True, transform=transform)
test_set   = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

print(f"\n Train set: {len(train_full)} samples  |  Test set: {len(test_set)} samples")

Implement the following steps:

1. **Split the data:**
   - The MNIST test set is already provided.
   - From the full training set, create an 80/20 split into **train** and **validation** subsets.

2. **Construct `DataLoader` objects:**
   - `train_loader` → use shuffling.
   - `val_loader` → no shuffling.
   - `test_loader` → no shuffling.

3. **Print a quick sanity check:**
   - batch shapes `(batch_size = 128, 1, 28, 28)`
   - label range (0–9)
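
The steps above can be sketched on a synthetic stand-in dataset, so the snippet runs without downloading MNIST. The `TensorDataset` of random tensors, the `demo_` names, the batch size of 128, and the generator seed are assumptions for illustration; with the real data, `train_full` would take the place of `demo_full`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in for MNIST (assumption): 1000 "images" in [0, 1), labels 0..9.
demo_full = TensorDataset(torch.rand(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))

# 80/20 split; a seeded generator makes the split reproducible.
n_train = int(0.8 * len(demo_full))
n_val = len(demo_full) - n_train
demo_train, demo_val = random_split(
    demo_full, [n_train, n_val], generator=torch.Generator().manual_seed(42)
)

demo_train_loader = DataLoader(demo_train, batch_size=128, shuffle=True)
demo_val_loader = DataLoader(demo_val, batch_size=128, shuffle=False)

# Sanity check on one batch: shapes, pixel range, label range.
xb, yb = next(iter(demo_train_loader))
print("Batch:", xb.shape, yb.shape,
      f"| pixel range = ({xb.min().item():.1f}, {xb.max().item():.1f})",
      f"| labels in [{yb.min().item()}, {yb.max().item()}]")
```

Only the training loader shuffles; keeping the validation and test loaders unshuffled makes their evaluation order stable across runs.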

*Example format:*
```python
Batch: torch.Size([128, 1, 28, 28]) torch.Size([128]) | pixel range = (0.0, 1.0)
```

In [None]:
# TODO: Implement preprocessing

###  Multiclass Logistic Regression



In binary logistic regression, we model  
$$
P(y=1 \mid x) = \sigma(w^\top x + b)
$$
where  $\sigma$ is the sigmoid activation function.

For **multiclass** problems, we generalize this idea. MNIST contains images of the 10 digits, so in our case there are $C = 10$ classes.

Each class $c \in \{0, \dots, 9\}$ has its own weight vector  $w_c$ and bias  $b_c$.

We stack them into a **weight matrix**  $W \in \mathbb{R}^{d \times C}$
and a **bias vector**  $b \in \mathbb{R}^{C}$.

<br>

For an input $x_i \in \mathbb{R}^d$, the forward pass computes the logits:

$$
z_i = W^\top x_i + b
$$

where $W \in \mathbb{R}^{d \times C}$ is the weight matrix, $b \in \mathbb{R}^C$ is the bias vector, and $z_i \in \mathbb{R}^C$ holds one score (logit) per class.

<br>

We then obtain class probabilities via the softmax function:

$$
p_i(c) = \frac{e^{z_i(c)}}{\sum_{k=0}^{C-1} e^{z_i(k)}}
$$

The model is trained by minimizing the **cross-entropy loss** between the predicted probabilities and the true class labels:

$$
\mathcal{L}(W,b) = -\frac{1}{N} \sum_{i=1}^N \log p_i(y_i)
$$

and the parameters are updated using **gradient descent**:

$$
W \leftarrow W - \eta \, \nabla_W \mathcal{L}, \qquad
b \leftarrow b - \eta \, \nabla_b \mathcal{L}
$$
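
As a sanity check of the formulas above, the following sketch runs one forward pass and one gradient-descent step on a tiny synthetic problem and compares the closed-form gradients with autograd. The closed forms $\nabla_W \mathcal{L} = \frac{1}{N} X^\top (P - Y)$ and $\nabla_b \mathcal{L} = \frac{1}{N}\sum_i (p_i - y_i)$, with $Y$ the one-hot labels, are the standard result for softmax with cross-entropy. The sizes, seed, and learning rate are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d, C = 8, 5, 3                       # tiny synthetic problem (assumption)
X = torch.randn(N, d)
y = torch.randint(0, C, (N,))
W = torch.zeros(d, C, requires_grad=True)
b = torch.zeros(C, requires_grad=True)

# Forward pass: logits, then mean cross-entropy (softmax is applied inside).
logits = X @ W + b                      # row i holds z_i
loss = F.cross_entropy(logits, y)
loss.backward()                         # autograd gradients, used only for comparison

with torch.no_grad():
    P = torch.softmax(logits, dim=1)    # predicted probabilities p_i(c)
    Y = F.one_hot(y, C).float()         # one-hot labels
    grad_W = X.T @ (P - Y) / N          # closed-form gradient w.r.t. W
    grad_b = (P - Y).mean(dim=0)        # closed-form gradient w.r.t. b

    eta = 0.1                           # learning rate (assumption)
    W_new = W - eta * grad_W            # one gradient-descent step
    b_new = b - eta * grad_b

print(torch.allclose(grad_W, W.grad, atol=1e-6),
      torch.allclose(grad_b, b.grad, atol=1e-6))
```

With zero-initialized parameters every class gets probability $1/C$, so the initial loss equals $\log C$, which is a useful first check for any implementation.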


<br><br>
In this question, you will build a multiclass logistic regression model from scratch, train it using the cross-entropy loss, and optimize it with gradient descent.

<br>

You will train the model for 20 epochs and then evaluate its performance on the MNIST dataset.


In [None]:
def logistic_regression(train_loader, val_loader, epochs=20, lr=0.1, tol=1e-6):
    """
    Train a multiclass logistic regression model using gradient descent.
    - train_loader: yields (x, y) batches; x is flattened to [N, d], y is [N] with labels in {0,...,9}
    - val_loader: yields validation batches in the same format
    - epochs: number of passes over the training set
    - lr: learning rate
    - tol: tolerance for stopping early when the gradients converge

    Returns: (W, b), with W of shape [d, C] and b of shape [C]
    """

    # TODO: Implement logistic_regression



Now run the next cell to evaluate your multiclass logistic regression.

We use the `evaluate_acc` helper function to evaluate.


In [None]:
# TODO: Run
W, b = logistic_regression(train_loader, val_loader, epochs=20, lr=0.1)

model = (W, b)

# Evaluate
val_acc  = evaluate_acc(W, b, val_loader)
test_acc = evaluate_acc(W, b, test_loader)

print(f"\nNumber of Parameters: {count_params(model):,}")
print(f"Val. acc.: {val_acc:.4f}")
print(f"Test acc.: {test_acc:.4f}")

Answer the following questions:

1. Is a *logistic regression* a sufficient model for this task?  
   What **constraints** does this model impose on the decision boundaries and on what it can “represent”?  
   What **modification** might help address these constraints?


2. Give two examples of incorrectly classified digits. For each, include the image of the misclassified digit,
the predicted class, and the actual class in your write-up.

<br>

<font color='red'>Write your answers to the questions above here, with explanations as requested.</font>

### Feedforward Neural Network




As we learned in class, a **Feedforward Neural Network (FFNN)** maps an input — in our case, a flattened $28 \times 28$ image, $x \in \mathbb{R}^{784}$ — through a sequence of **linear layers** interleaved with **nonlinear activation functions**.

Formally, the computation proceeds as follows:

$$
y^{(1)} = \sigma^{(1)} (W^{(1)}x+b^{(1)})
$$
$$
y^{(\ell)} = \sigma^{(\ell)} (W^{(\ell)}y^{(\ell -1)}+b^{(\ell)}), \qquad \ell \geq 2
$$


Where:
- $y^{(i)}$ is the output of layer $i$.
- $W^{(i)}$ and $b^{(i)}$ are the weight matrix and bias vector for layer $i$, respectively.
- $\sigma ^{(i)}$ is the activation function used in layer $i$.
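
The recursion above can be sketched with plain tensors. The toy layer widths, the random weights, and the choice of ReLU for every layer are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
sizes = [4, 3, 2]                       # toy widths (assumption): input 4, hidden 3, output 2
x = torch.randn(sizes[0])

# One (W, b) pair per layer; W has shape [out, in] so that W @ y maps a vector forward.
params = [(torch.randn(n_out, n_in), torch.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

y = x
for W, b in params:
    y = torch.relu(W @ y + b)           # y_l = sigma(W_l y_{l-1} + b_l)

print(y.shape)
```

In a classifier, the last layer is usually left without an activation so that it outputs raw logits for the cross-entropy loss.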

In this section you will implement an FFNN from scratch and see how it fares on the MNIST dataset.

In the cell below, implement your `MLP` network: a Multi-Layer Perceptron with two hidden layers and no regularization, which classifies MNIST using the following architecture:

- Input layer: 784 nodes (the flattened MNIST image size)
- First hidden layer: 400 nodes
- Second hidden layer: 400 nodes
- Output layer: 10 nodes (the number of MNIST classes)


<br>

- Use `nn.ReLU()` as the activation function after each hidden layer.
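
As a quick arithmetic check for this architecture, the sketch below counts the parameters of the fully connected layers by hand; the result is the number `count_params` should report for a correct implementation. The variable names are illustrative.

```python
# Layer widths from the architecture above: input, two hidden layers, output.
widths = [784, 400, 400, 10]

# Each fully connected layer has n_in * n_out weights plus n_out biases.
per_layer = [n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:])]
total = sum(per_layer)

print(per_layer, total)   # [314000, 160400, 4010] 478410
```

For comparison, the logistic regression model has $784 \cdot 10 + 10 = 7850$ parameters.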

In [None]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # TODO: implement (define self.model, which forward() calls)

    def forward(self, x):
        return self.model(x)

After defining the `MLP` class, train it for **20 epochs** using **cross-entropy loss** and **Adam** as your optimizer.  

At the end of every epoch, evaluate the model both on the training set and on the validation set. You should print the loss, the training accuracy, and the validation accuracy so that you can monitor how the model improves over time and detect potential overfitting.

Once training is complete, produce a plot that shows how the training accuracy and the validation accuracy evolve across the 20 epochs. The plot should clearly reflect the learning dynamics of your model and allow you to compare its performance to the logistic regression model from the previous section.

> You already have the helpers `evaluate_module`, `accuracy`, and `count_params` in the setup cell; you may use them directly in your implementation.
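
The overall shape of such a training loop can be sketched on synthetic data. The stand-in model `toy_model`, the random dataset, the 5 epochs, and the learning rate are all assumptions for illustration; in the assignment, the per-epoch evaluation would use `evaluate_module` on the real loaders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data and a tiny stand-in model (both assumptions).
X = torch.randn(256, 20)
y = (X[:, 0] > 0).long()
toy_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
toy_model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))

criterion = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)

history = []                                        # mean training loss per epoch
for epoch in range(5):
    toy_model.train()
    running_loss, seen = 0.0, 0
    for xb, yb in toy_loader:
        toy_optimizer.zero_grad()                   # clear gradients from the last step
        loss = criterion(toy_model(xb), yb)
        loss.backward()
        toy_optimizer.step()
        running_loss += loss.item() * xb.size(0)    # weight by batch size
        seen += xb.size(0)
    history.append(running_loss / seen)
    print(f"epoch {epoch + 1}: loss = {history[-1]:.4f}")
```

Storing one value per epoch in a list such as `history` makes the later accuracy-versus-epoch plot a single `plt.plot` call per curve.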


In [None]:
# TODO: Implement

<font color="red">Answer the following questions:</font>

<ol>
  <li>
    Using your results, explain why this model outperforms the logistic-regression baseline.
    What key property of the architecture could be tied to the success?
    Cite specific evidence from your plots or errors.
    <br>
  </li>
<br>
  <li>
    Did you need all 20 epochs? If not, choose a stopping point from the validation curve, and justify it.
    Propose a concrete rule you would use next time to limit computation waste (spell out the metric and the trigger).
    <br>
  </li>


</ol>


### Exploring Initialization

Now, we’ll explore how **weight initialization** affects training dynamics and accuracy. In this section, you will experiment with different initialization methods, **changing only the initialization scheme** while keeping **everything else fixed**, including the model architecture, optimizer, data splits, number of epochs, batch size, etc.

<br>

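In the previous code, you used PyTorch’s default initialization for `nn.Linear`, which is a variant of **Kaiming/He initialization**. Standard He initialization scales the weights so that
$$
\mathrm{Var}(W_{ij}) = \frac{2}{\text{fan}_{in}}
$$
to help stabilize activations and gradients with ReLU layers. (The `nn.Linear` default calls `kaiming_uniform_` with `a=sqrt(5)`, which draws the weights from $U\left(-1/\sqrt{\text{fan}_{in}},\ 1/\sqrt{\text{fan}_{in}}\right)$ and therefore gives the smaller variance $\frac{1}{3\,\text{fan}_{in}}$.)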
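
The $2/\text{fan}_{in}$ variance target can be checked empirically with `torch.nn.init.kaiming_normal_`. This is a minimal sketch; the layer shape is taken from the first layer of the MLP above, and the seed is an assumption.

```python
import torch
import torch.nn.init as init

torch.manual_seed(0)
fan_in, fan_out = 784, 400
W = torch.empty(fan_out, fan_in)        # nn.Linear stores weights as [out_features, in_features]

# He initialization for ReLU: zero-mean normal with variance 2 / fan_in.
init.kaiming_normal_(W, mode="fan_in", nonlinearity="relu")

print(f"empirical var = {W.var().item():.5f}, 2 / fan_in = {2 / fan_in:.5f}")
```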

In this section, you'll compare this baseline to four alternatives:
1. Zero initialization
2. Uniform $[0,1]$
3. Standard Normal: $\mathcal{N}(0,1)$
4. Xavier initialization

For each scheme, reinitialize <strong>all</strong> linear layers (weights and biases), train for 20 epochs using the Adam optimizer, and record the validation accuracy per epoch and the final test accuracy. Then plot all validation-accuracy curves on one figure and report the final test accuracy for each method.

> Tip: Recreate a **fresh model** for each initialization; don’t re-use trained weights between runs.
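
One possible way to swap initialization schemes without touching the rest of the code is `model.apply` with a small factory function. The helper `make_init_fn`, the `schemes` dictionary, and the choice to zero the biases in every scheme are assumptions of this sketch, not requirements of the assignment.

```python
import torch
from torch import nn
import torch.nn.init as init

def make_init_fn(weight_init):
    """Return a function for model.apply that re-initializes every nn.Linear."""
    def init_fn(module):
        if isinstance(module, nn.Linear):
            weight_init(module.weight)
            init.zeros_(module.bias)    # bias handling is an assumption of this sketch
    return init_fn

# Hypothetical registry of schemes; the keys are illustrative names.
schemes = {
    "zeros": init.zeros_,
    "uniform01": lambda w: init.uniform_(w, 0.0, 1.0),
    "normal01": lambda w: init.normal_(w, 0.0, 1.0),
    "xavier": init.xavier_uniform_,
}

toy = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))  # stand-in model
toy.apply(make_init_fn(schemes["uniform01"]))
print(toy[0].weight.min().item() >= 0.0, toy[0].weight.max().item() <= 1.0)
```

Because `apply` visits every submodule, the same function covers all linear layers regardless of depth.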

In [None]:
# TODO: Implement

<font color="red">Answer the following question:</font>

Why does zero initialization cause training to fail, even though the optimizer still updates the weights?


### Exploring Optimization Methods


Finally, we’ll explore how different **optimization methods** affect training dynamics and accuracy. In this section, you will experiment with different optimizers, **changing only the optimization method** while keeping **everything else fixed**: the model architecture, the best initialization scheme you found earlier, the data splits, the number of epochs, the batch size, etc.

<br>

Up to this point, you used the **Adam** optimizer. Here, you will experiment with alternative optimizers and compare their performance under the same settings.

In this section, you'll compare this baseline to three other alternatives:
1. Stochastic Gradient Descent (SGD)
2. RMSProp
3. Adagrad


As in the initialization experiments, for each optimizer create a **fresh model** and apply the **same fixed initialization** to all `nn.Linear` layers (weights & biases). Train the model for **20 epochs**. Plot all **validation-accuracy vs. epoch** curves on one figure and compare the test accuracies.

> Tip: Recreate a **fresh model** for each optimizer; don’t re-use trained weights between runs.
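
A dictionary of optimizer factories is one way to keep the comparison loop uniform. The name `optimizer_factories`, the stand-in `nn.Linear` model, and the learning rates below are assumptions for illustration, not tuned or prescribed values.

```python
import torch
from torch import nn

# Hypothetical factories: each entry builds a fresh optimizer for a given model.
# Learning rates are illustrative assumptions.
optimizer_factories = {
    "Adam":    lambda params: torch.optim.Adam(params, lr=1e-3),
    "SGD":     lambda params: torch.optim.SGD(params, lr=1e-2),
    "RMSProp": lambda params: torch.optim.RMSprop(params, lr=1e-3),
    "Adagrad": lambda params: torch.optim.Adagrad(params, lr=1e-2),
}

built = {}
for name, make_optimizer in optimizer_factories.items():
    torch.manual_seed(0)                # same starting weights for every run
    toy = nn.Linear(4, 2)               # stand-in for a freshly built model
    built[name] = make_optimizer(toy.parameters())
    print(name, type(built[name]).__name__)
```

Re-seeding before building each model keeps the starting weights identical, so differences between the curves come from the optimizer alone.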

In [None]:
# TODO: Implement

#### Reflection

Consider the interaction between initialization and optimization:

- Combine the **best-performing initialization** with the **worst-performing optimizer**.
- Combine the **worst-performing initialization** with the **best-performing optimizer**.

Run both experiments and compare the learning dynamics.



In [None]:
# Implement

## Question 2: Multidimensional XOR and Backpropagation

Given $d$ bits $b=(b_1, b_2, \dots, b_d) \in \{0, 1\}^d$, we define $$\mathrm{XOR}(b) = \left(\sum_{i=1}^{d} b_i\right) \bmod 2$$

In this question, you will train a feedforward neural network to model the multidimensional XOR problem.
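
As a worked example of the definition, the sketch below enumerates all $2^d$ bit vectors for $d = 3$ and prints the parity of each. The names `X_demo` and `Y_demo` are illustrative.

```python
import itertools
import numpy as np

# All 2^d bit vectors and their parities for d = 3.
d = 3
X_demo = np.array(list(itertools.product([0, 1], repeat=d)))
Y_demo = X_demo.sum(axis=1) % 2

for bits, label in zip(X_demo, Y_demo):
    print(bits, "->", label)
```

The label is 1 exactly when the number of ones in the input is odd, so flipping any single bit flips the label.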




The following function returns the labeled XOR "dataset" for any given $d$ (`dim`):

In [None]:
def XORData(dim):
  X = np.array(list(itertools.product([0, 1], repeat=dim)))
  Y = X.sum(axis=1)%2
  return X, Y

Implement the following:

1. A linear layer class:`Linear(torch.nn.Module)`\
• You should inherit from `torch.nn.Module`\
• You may not use `torch.nn.Linear` or any other module implemented by PyTorch, only tensors. Wrap every parameter tensor with `torch.nn.Parameter` (so the optimizer can find it).\
• Use `torch.randn` to initialize the parameter tensors (weights matrix and biases vector)\
• Implement the `forward` method according to the linear model $y = Wx + b$\
• The number of parameters should be defined by `in_features` and `out_features` \
• Make sure to enable gradient computation!
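
To illustrate why the `torch.nn.Parameter` wrapper matters, the sketch below uses a hypothetical toy module, `Scale`, which is not part of the assignment. Only the wrapped tensor shows up in `parameters()`, and it receives a gradient after `backward()`.

```python
import torch
from torch import nn

class Scale(nn.Module):
    """Hypothetical toy module: multiplies its input by a learnable vector."""
    def __init__(self, size):
        super().__init__()
        self.gain = nn.Parameter(torch.randn(size))   # registered; requires_grad=True by default
        self.offset = torch.randn(size)               # plain tensor: NOT registered

    def forward(self, x):
        return x * self.gain + self.offset

layer = Scale(3)
names = [name for name, _ in layer.named_parameters()]
print(names)                                          # ['gain']

out = layer(torch.ones(3)).sum()
out.backward()
print(layer.gain.grad)                                # tensor([1., 1., 1.])
```

A tensor that is not wrapped never reaches the optimizer, so it would stay at its initial value during training.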

In [None]:
class Linear(nn.Module):
  def __init__(self, in_features, out_features):
    super().__init__()
    # Implement here

  def forward(self, x):
    # Implement here
    raise NotImplementedError

2. A feedforward neural network: `FFNet(torch.nn.Module)`\
• Use the `Linear` class you implemented to construct a network with `in_features` input size, `out_features` output size, and one hidden layer with `hidden_size` neurons.\
• Use the sigmoid activation function on top of the hidden layer neurons.

In [None]:
class FFNet(nn.Module):
  def __init__(self, in_features, out_features, hidden_size):
    super().__init__()
    # Implement here

  def forward(self, x):
    # Implement here
    raise NotImplementedError

### Exploring the hidden layer
In class we discussed the **Universal Approximation Theorem**. The following question will help us understand how hard it is, in practice, to find the weights that the theorem says exist.

You are given a basic training function:

In [None]:
loss_func = nn.MSELoss()

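def train(net, X, Y, epochs=300):
  # Note: relies on a global `optimizer` that was created for `net` before this call.
  steps = X.shape[0]
  for i in range(epochs):
      for j in range(steps):
          # Sample one random data point (stochastic updates with batch size 1)
          data_point = np.random.randint(X.shape[0])
          x_var = torch.Tensor(X[data_point])
          y_var = torch.Tensor([Y[data_point]])
          optimizer.zero_grad()
          y_hat = net(x_var)
          loss = loss_func(y_hat, y_var)
          loss.backward()
          optimizer.step()
      if i % 100 == 0:
          # Reports the loss of the last sampled point in this epoch
          print("Epoch:{}, Loss:{}".format(i, loss.detach()))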

Using the code above, for each of the dimensions: $d = 2, 3, 4, 5$:
1. Create the corresponding dataset using `XORData`
2. Create models with different hidden layer sizes ($1, 2, 3, \dots, d$, $d + \text{const}$, $\text{const} \cdot d$, etc.) and an optimizer for each one using: `optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9)`
3. Try to train the different networks on the dataset.
4. Plot the losses for different models on the same grid. Reflect on your findings.

Make sure to create a new `FFNet` object and a new optimizer object each time you call `train`.


In [None]:
# Implement here

<font color="red">**Answer the following questions:** </font>

1. For which dimensions were you able to reach a model that converges (the loss value gets close to 0)?

2. What is the number of hidden neurons in that network and how did you find it?

3. What results did you expect, and how do they differ from the results you achieved?




### Backpropagation

The following part focuses on the computation of gradients with respect to the model parameters, with the aim of understanding what `loss.backward()` does.

Implement the function `calc_gradients(net, x, y, y_hat, loss)` with the following instructions:

- Input: an instance of `FFNet` with input size 2, a hidden layer of size 2, and output size 1; the input vector `x`; the target `y`; the network output `y_hat`; and the `loss` value.

- Output: a one-dimensional tensor with the gradients of the loss w.r.t. each of the parameters, in the same order as the parameters returned by `net.parameters()` (with each parameter tensor flattened).

1. Compute (by hand) the derivatives w.r.t the second linear layer parameters.
2. Use the chain rule to compute (by hand) the derivatives w.r.t the first linear layer parameters.
3. Implement the final form computations per parameter.
4. Order the gradients as described above.
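
The same workflow (derive by hand, then compare with autograd) can be illustrated on a much simpler model than the one in this exercise: a single sigmoid neuron with squared-error loss. The model, the input values, and the seed are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.tensor([1.0, 0.0])
y = torch.tensor([1.0])
w = torch.randn(2, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Forward pass: z = w.x + b, y_hat = sigmoid(z), loss = (y_hat - y)^2.
y_hat = torch.sigmoid(w @ x + b)
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Chain rule by hand: dL/dz = 2 (y_hat - y) * y_hat (1 - y_hat); dz/dw = x, dz/db = 1.
with torch.no_grad():
    dz = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    manual = torch.cat([dz * x, dz])    # ordered as [w, b], each flattened

auto = torch.cat([w.grad.view(-1), b.grad.view(-1)])
print(torch.allclose(manual, auto, atol=1e-6))
```

The factor $\hat{y}(1 - \hat{y})$ is the derivative of the sigmoid, and it appears once for every sigmoid the gradient passes through.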


In [None]:
def calc_gradients(net, x, y, y_hat, loss):
  # Implement here
  raise NotImplementedError

Run the following cell to check the correctness of your computation:

In [None]:
def equal_gradients(net, x, y_var, y_hat, loss):
  grads = []
  for param in net.parameters():
    grads.append(param.grad.view(-1))
  grads = torch.cat(grads)
  # Compare element-wise with a tolerance: a rounded sum of differences can hide errors that cancel out
  return torch.allclose(grads, calc_gradients(net, x, y_var, y_hat, loss), atol=1e-4)

def train_and_compare(net, X, Y, epochs=100):
  steps = X.shape[0]
  for i in range(epochs):
      for j in range(steps):
          data_point = np.random.randint(X.shape[0])
          x_var = torch.Tensor(X[data_point])
          y_var = torch.Tensor([Y[data_point]])
          optimizer.zero_grad()
          y_hat = net(x_var)
          loss = loss_func(y_hat, y_var)
          loss.backward()
          if not equal_gradients(net, x_var, y_var, y_hat, loss.item()):
            print("Wrong gradients computation!")
            return
          optimizer.step()
  print("Correct gradients computation!")

model = FFNet(2, 1, 2)
optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
X, Y = XORData(2)


train_and_compare(model, X, Y)