# <u>Submission instructions</u>
### Submission must be in pairs, unless otherwise authorized.
#### Submit by 28/2/2024

<ul style="font-size: 17px">
<li> This notebook contains all the questions. You should follow the instructions below.</li>
<li> Solutions for both theoretical and practical parts should be written in this notebook</li>
</ul>

<h3> Moodle submission</h3>


<p style="font-size: 17px">
You should submit three files:
</p>
<ul style="font-size: 17px">
<li>IPYNB notebook:
  <ul>
  <li>All the wet and dry parts, including code, graphs, discussion, etc.</li>
  </ul>
</li>
<li>PDF file:
  <ul>
  <li>Export the notebook to PDF. Make sure that all the cells are visible.</li>
  </ul>
</li>
<li>Pickle file:
  <ul>
    <li>As requested in Q2.a</li>
  </ul>
</li>
</ul>
<p style="font-size: 17px">
All files should be in the following format: "HW1_ID1_ID2.file"
<br>
Good Luck!
</p>

<h1> Question 1</h1>

## I. Softmax Derivative (10pt)

<p style="font-size: 17px">
Derive the gradients of the softmax function and demonstrate how the expression can be reformulated solely by using the softmax function, i.e., in some expression where only $softmax(x)$, but not $x$, is present).
Recall that the softmax function is defined as follows:
$$softmax(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$


$$softmax(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$


### I. Softmax Derivative - Answer:
$$\frac{\partial softmax(x)_i}{\partial x_k} = \frac{\partial \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}}{\partial x_k}$$
$$when\ i = j$$
$$\frac{\partial softmax(x)_i}{\partial x_k} = softmax(x)_i \cdot (1 - softmax(x)_i)$$
$$when\ i \neq j$$
$$\frac{\partial softmax(x)_i}{\partial x_k} = -softmax(x)_i \cdot softmax(x)_k$$


## II. Cross-Entropy Gradient (10pt)
<p style="font-size: 17px">
Derive the gradient of cross-entropy loss with regard to the inputs of a softmax function. i.e., find the gradients with respect to the softmax input vector $\theta$, when the prediction is denoted by $\hat{y} = softmax(\theta)$. 


<p style="font-size: 17px">where $y$ is the one-hot label vector, and $\hat{y}$ is the predicted probability vector for all classes. 

$$\hat{y} = softmax(\theta)$$

Remember the cross entropy function is:
$$CE(y, \hat{y}) = -\sum_i y_i log(\hat{y_i})$$

### II. Cross-Entropy Gradient - Answer

<!--- write your answer -->
$$\frac{\partial CE(y, \hat{y})}{\partial\theta} = \frac{\partial CE(y, \hat{y})}{\partial\hat{y}}\frac{\partial\hat{y}}{\partial\theta} = \frac{\partial -\sum_i y_i log(\hat{y_i})}{\partial\theta}$$

\begin{align*}
\frac{\partial CE}{\partial \theta_k} &= \frac{\partial}{\partial \theta_k} \sum_{j=1}^n (-y_j \log(\sigma(\theta_j))) \\
&= - \sum_{j=1}^n y_j \frac{\partial}{\partial \theta_k} \log(\sigma(\theta_j)) & &\text{...addition rule, } -y_j \text{ is constant}\\
&= - \sum_{j=1}^n y_j \frac{1}{\sigma(\theta_j)}  \frac{\partial}{\partial \theta_k}\sigma(\theta_j) & &\text{...chain rule}\\
&= -y_k \frac{\sigma(\theta_k)(1-\sigma(\theta_k))}{\sigma(\theta_k)} + \sum_{j\neq k} y_j \frac{\sigma(\theta_k)\sigma(\theta_j)}{\sigma(\theta_j)} & &\text{...consider both } j=k \text{ and } j\neq k \\
&= -y_k (1-\sigma(\theta_k)) + \sum_{j\neq k} y_j \sigma(\theta_k) \\
&= -y_k + y_k\sigma(\theta_k) + \sum_{j\neq k} y_j \sigma(\theta_k) \\
&= -y_k + \sigma(\theta_k) \sum_j y_j. \\
\end{align*}



\begin{align*}
\Rightarrow \frac{\partial CE}{\partial \theta_k} &= \sigma(\theta_k) - y_k
\end{align*}

# Question 2

## I. Derivative Of Activation Functions (10pt)

<p style="font-size: 17px">
The following cell contains an implementation of some activation functions. Implement the corresponding derivatives.</p>

In [48]:
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def tanh(x):
    return torch.div(torch.exp(x) - torch.exp(-x), torch.exp(x) + torch.exp(-x))


def softmax(x):
    exp_x = torch.exp(x.T - torch.max(x, dim=-1).values).T  # Subtracting max(x) for numerical stability
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

In [49]:
def d_sigmoid(x):
    return sigmoid(x)*(1-sigmoid(x))


def d_tanh(x):
    return 1 - (tanh(x)**2)


def d_softmax(x):
    s = softmax(x)
    batch_size, n_classes = s.shape
    # Initialize the Jacobian matrix with zeros
    jacobian = torch.zeros((batch_size, n_classes))
    
    for i in range(batch_size):
        for j in range(batch_size):
                if j == i:
                    jacobian[i, j, k] = s[i, j] * (1 - s[i, j])
                else:
                    jacobian[i, j, k] = -s[i, j] * s[i, k]
    return jacobian

In [50]:
# Example usage
x = torch.randn(3, 2) 
print(x) # Example input vector
jacobian_matrix = d_softmax(x)
print("Jacobian matrix of the softmax:", jacobian_matrix)

tensor([[ 0.8434, -2.1867],
        [ 1.5041,  0.5034],
        [ 0.3367, -1.1663]])
Jacobian matrix of the softmax: tensor([[[ 0.0440, -0.0440],
         [-0.0440,  0.0440]],

        [[ 0.1965, -0.1965],
         [-0.1965,  0.1965]],

        [[ 0.1489, -0.1489],
         [-0.1489,  0.1489]]])


## II. Train a Fully Connected network on MNIST (30pt)

<p style="font-size: 17px">In the following exercise, you will create a classifier for the MNIST dataset.
You should write your own training and evaluation code and meet the following
constraints:
<ul>
<li> You are only allowed to use torch tensor manipulations.</li>
<li> You are NOT allowed to use:
  <ul>
  <li> Auto-differentiation - backward()</li>
  <li> Built-in loss functions</li>
  <li> Built-in activations</li>
  <li> Built-in optimization</li>
  <li> Built-in layers (torch.nn)</li>
  </ul>
  </li>
</ul>
</h4>


<p style="font-size: 17px">
 a) The required classifier class is defined.
<ul style="font-size: 17px">
<li> You should implement the backward pass of the model.
<li> Train the model and plot the model's accuracy and loss (both on train and test sets) as a function of the epochs.
<li> You should save the model's weights and biases. Change the student_ids to yours.
</ul>
<p style="font-size: 17px">In this section, you <b>must</b> use the "set_seed" function with the given seed and <b>sigmoid</b> as an activation function.
</p>

In [51]:
import torch
import torchvision
from torch.utils.data import DataLoader

import os
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()
import torch.nn.functional as F

# Constants
SEED = 42
EPOCHS = 16
BATCH_SIZE = 32
NUM_OF_CLASSES = 10

# Setting seed
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)


# Transformation for the data
transform = torchvision.transforms.Compose(
    [torchvision.transforms.ToTensor(),
     torch.flatten])

# Cross-Entropy loss implementation
def one_hot(y, num_of_classes=10):
    hot = torch.zeros((y.size()[0], num_of_classes))
    hot[torch.arange(y.size()[0]), y] = 1
    return hot

def cross_entropy(y, y_hat):
    return -torch.sum(one_hot(y) * torch.log(y_hat)) / y.size()[0]

def cross_entropy_builtin(y, y_hat_logits):
    return F.cross_entropy(y_hat_logits, y)

In [52]:
# Create dataloaders
train_dataset = torchvision.datasets.MNIST(root='./data', train=True,
                                            download=True, transform=transform)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE)


test_dataset = torchvision.datasets.MNIST(root='./data', train=False,
                                           download=True, transform=transform)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE)

In [53]:

class FullyConnectedNetwork:
    def __init__(self, input_size, output_size, hidden_size1, activation_func = sigmoid, lr=0.01):
        # parameters
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size1 = hidden_size1

        # activation function
        self.activation_func = activation_func

        # weights
        self.W1 = torch.randn(self.input_size, self.hidden_size1)
        self.b1 = torch.zeros(self.hidden_size1)

        self.W2 = torch.randn(self.hidden_size1, self.output_size)
        self.b2 = torch.zeros(self.output_size)

        self.lr = lr

    def forward(self, x):
        self.z1 = torch.matmul(x, self.W1) + self.b1
        self.h1 = self.activation_func(self.z1)
        self.z2 = torch.matmul(self.h1, self.W2) + self.b2
        self.y_hat = softmax(self.z2)
        return self.y_hat

    def backward(self, x, y, y_hat):
        lr = self.lr
        batch_size = y.size(0)
        
        dl_dy_hat = (1/batch_size)*((y_hat - y)/ (y_hat * (torch.ones(y_hat.shape[-1]) - y_hat))) 
        print(dl_dy_hat.shape)
        print(d_softmax(self.z2).shape)
        dl_dz2 =  dl_dy_hat * d_softmax(self.z2) 
        
        dl_dW2 = torch.matmul(torch.t(self.h), dl_dz2)
        dl_db2 = torch.matmul(torch.t(dl_dz2), torch.ones(batch_size))
        
        dl_dh = torch.matmul(dl_dz2, torch.t(self.W2)) 
        dl_dz1 = dl_dh * d_sigmoid(self.z1) 
        
        dl_dW1 = torch.matmul(torch.t(x), dl_dz1) 
        dl_db1 = torch.matmul(torch.t(dl_dz1), torch.ones(batch_size))
       

        #gradient step
        self.W1 -= lr*dl_dW1 
        self.b1 -= lr*dl_db1
        self.W2 -= lr*dl_dW2
        self.b2 -= lr*dl_db2

    def train(self, X, y):
        # forward + backward pass for trainig a model
        o = self.forward(X)
        self.backward(X, y, o)
            

In [54]:
set_seed(SEED)
model = FullyConnectedNetwork(784, 10, 128, sigmoid, lr=0.01)

In [55]:
# Initialize history lists for tracking progress
history = {
    'train_loss': [],
    'train_accuracy': [],
    'test_loss': [],
    'test_accuracy': []
}

# Function to calculate metrics for a given dataloader
def calculate_metrics(dataloader, mode='train'):
    total_loss, total_accuracy, total_samples = 0, 0, 0
    for X_batch, y_batch in dataloader:
        y_hat = model.forward(x=X_batch)
        loss = cross_entropy(y=y_batch, y_hat=y_hat)
        _, predicted = torch.max(y_hat, 1)

        # Calculate the accuracy
        accuracy = (predicted == y_batch).float().mean()

        
        # Accumulate batch results
        total_loss += loss * len(y_batch)
        total_accuracy += accuracy
        total_samples += len(y_batch)
        
        # Backpropagation for training mode
        if mode == 'train':
            model.backward(x=X_batch, y=y_batch, y_hat=y_hat.max(dim=1).values)
    
    # Calculate and store epoch metrics
    history[f'{mode}_loss'].append(total_loss / total_samples)
    history[f'{mode}_accuracy'].append(total_accuracy / total_samples)

# Function to plot the training and testing loss and accuracy
def plot_metrics(history):
    plt.figure(figsize=(12, 5))
    
    # Plot training and testing loss
    plt.subplot(1, 2, 1)
    plt.plot(history['train_loss'], label='Train Loss')
    plt.plot(history['test_loss'], label='Test Loss')
    plt.title('Loss Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    # Plot training and testing accuracy
    plt.subplot(1, 2, 2)
    plt.plot(history['train_accuracy'], label='Train Accuracy')
    plt.plot(history['test_accuracy'], label='Test Accuracy')
    plt.title('Accuracy Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    
    plt.tight_layout()
    plt.show()


# Training and testing the model
for epoch in range(EPOCHS):
    print(f'Epoch: {epoch+1} ({EPOCHS - (epoch+1)} to go)')
    
    calculate_metrics(train_dataloader, 'train')
    calculate_metrics(test_dataloader, 'test')
    print('\n')


plot_metrics(history)

Epoch: 1 (15 to go)
torch.Size([32])
torch.Size([32, 10, 10])


RuntimeError: The size of tensor a (32) must match the size of tensor b (10) at non-singleton dimension 2

In [None]:
students_ids = "12345789_987654321"
torch.save({"W1": model.W1, "W2": model.W2, "b1": model.b1, "b2": model.b2}, f"HW1_{students_ids}.pkl")

<p style="font-size: 17px"> b) Train the model with various learning rates (at least 3).
<ul style="font-size: 17px">
<li> Plot the model's accuracy and loss (both on train and test sets) as a function of the epochs.
<li>Discuss the differences in training with different learning rates. Support your answer with plots.

# Question 3

## I. Implement and Train a CNN (30pt)

<p style="font-size: 17px"> You are a data scientist at a supermarket. Your manager asked you to write a new image classifiaction algorithem for the self checkout cashiers. The images are of products from your grocery store (dataset files are attched in the Moodle).
<br>
Your code and meet the following constraints:
<ul style="font-size: 17px">
<li> Your classifier must be CNN based</li>
<li> You are not allowed to use any pre-trained model</li>
</ul>
<br>
<p style="font-size: 17px">
In order to satisfy your boss you have to reach 65% accuracy on the test set. You will get a bonus for your salary (and 10 points to your grade) if your model's number of paramters is less than 100K. You can reutilize code from the tutorials.

<ul style="font-size: 17px">
<li>Train the model and plot the model's accuracy and loss (both on train and validation sets) as a function of the epochs. </li>
<li>Report the test set accurecy.</li>
<li>Discus the progress you made and describe your final model.</li>

## II. Analyzing a Pre-trained CNN (Filters) (10pt)

In this part, you are going to analyze a (large) pre-trained model. Pre-trained models are quite popular these days, as big companies can train really large models on large datasets (something that personal users can't do as they lack the sufficient hardware). These pre-trained models can be used to fine-tune on other/small datasets or used as components in other tasks (like using a pre-trained classifier for object detection).

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].

You can use the following transform to normalize:

<code>normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])</code>
<a href="https://pytorch.org/vision/stable/models.html">Read more here</a>


1. Load a pre-trained VGG16 with PyTorch using torchvision.models.vgg16(pretrained=True, progress=True, **kwargs) (<a href="https://pytorch.org/vision/stable/models.html#classification">read more here</a>). Don't forget to use the model in evaluation mode (<code>model.eval()</code>).

2. Load the images in the 'birds' folder and display them.

3. Pre-process the images to fit VGG16's architecture. What steps did you take?

4. Feed the images (forward pass) to the model. What are the outputs?

5. Choose an image of a dog in the 'dogs' folder, display it and feed it to network. What are the outputs?

6. For the first 3 filters in the first layer of VGG16, plot the filters, and then plot their response (their output) for the image from question 5. Explain your observations.