# Libraries & Custom functions

In [1]:
import torch
import torch.nn.functional as F

# Ignore UserWarning
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
def BCE_Loss(y_true, y_hat):
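    # Guard for hard 0/1 labels: a prediction exactly equal to the label would give
    # 0 * log(0) = nan in the formula below, so the loss is returned as 0 directly
    if y_true == y_hat:
        return torch.tensor(0.)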
    
    return -1 * (y_true * torch.log(y_hat) + (1 - y_true) * torch.log(1 - y_hat))


def ForwardPass(X, W, b = 0):

    z1 = torch.matmul(X, W[0].T) + b
    a1 = F.relu(z1)

    z2 = torch.matmul(a1, W[1].T) + b
    a2 = F.leaky_relu(z2, negative_slope = 0.01)

    z3 = torch.matmul(a2, W[2].T) + b
    a3 = F.sigmoid(z3)

    return a3, a2, a1


def BackwardPass(X, a1, a2, a3, y_true, W):

    # Partial derivative of the loss function with respect to the prediction
    dL_da3 = -1 * (y_true / a3 - (1 - y_true) / (1 - a3)) if y_true != a3 else torch.tensor([0.])
    # Partial derivative of the loss function with respect to z3, using the sigmoid derivative
    dL_dz3 = dL_da3 * (a3 * (1 - a3))
    # Partial derivative of the loss function with respect to the weights of the connections between the second hidden layer and the output layer
    dL_dW3 = torch.matmul(dL_dz3.T, a2)

    # Partial derivative of the loss function with respect to the activation values from the second hidden layer
    dL_da2 = torch.matmul(dL_dz3, W[2])
    # Partial derivative of the loss function with respect to z2, using the Leaky ReLU derivative
    dL_dz2 = torch.where(a2 >= 0, dL_da2, 0.01 * dL_da2)
    # Partial derivative of the loss function with respect to the weights of the connections between the first and second hidden layers
    dL_dW2 = torch.matmul(dL_dz2.T, a1)

    # Partial derivative of the loss function with respect to the activation values from the first hidden layer
    dL_da1 = torch.matmul(dL_dz2, W[1])
    # Partial derivative of the loss function with respect to z1, using the ReLU derivative
    # (a1 > 0 exactly where z1 > 0; a1 >= 0 would always be true because ReLU outputs are never negative)
    dL_dz1 = torch.where(a1 > 0, dL_da1, torch.zeros_like(dL_da1))
    # Partial derivative of the loss function with respect to the weights of the connections between the input layer and the first hidden layer
    dL_dW1 = torch.matmul(dL_dz1.T, X)

    return dL_dW1, dL_dW2, dL_dW3


def StochasticGradintDescent(dL_dW1, dL_dW2, dL_dW3, W, lr):
    # Update weights using SGD
    W[0] -= lr * dL_dW1
    W[1] -= lr * dL_dW2
    W[2] -= lr * dL_dW3

    return W

# Exercise 1

## Exercise 1, a)

To optimize deep neural networks (DNNs), backpropagation and gradient descent are often used.
Distinguish between these two algorithms, if there is any distinction, and explain how they can be
used for DNN optimization.


**Backpropagation vs. Gradient Descent for DNN Optimization**

**Distinction**

Backpropagation is an algorithm for calculating the gradient of a loss function with respect to the parameters of a neural network. Gradient descent is an algorithm for optimizing a function by iteratively moving in the direction of the negative gradient.
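To make the distinction concrete, here is a minimal sketch of gradient descent on its own, with no neural network and no backpropagation. The function `f`, its hand-written derivative `grad_f`, the starting point, and the learning rate are all illustrative assumptions, not part of the exercise.

```python
# Gradient descent on f(x) = (x - 3)^2 using a hand-written derivative.
# The gradient comes from calculus here, not from backpropagation.

def f(x):
    return (x - 3.0) ** 2

def grad_f(x):
    # Analytic derivative: f'(x) = 2 * (x - 3)
    return 2.0 * (x - 3.0)

x = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size), chosen for illustration

for _ in range(100):
    x -= lr * grad_f(x)   # move against the gradient

print(x, f(x))  # x ends up very close to the minimizer at 3
```

Gradient descent only needs a gradient from somewhere. In a neural network, backpropagation is the procedure that supplies it.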

**How they can be used for DNN optimization**

Backpropagation is used to calculate the gradient of the loss function with respect to the weights of the neural network. This gradient is then used by gradient descent to update the weights in the direction of the negative gradient, which reduces the loss function.

**Example**

Consider a simple neural network with two inputs, one hidden layer, and one output. The loss function for this network could be the mean squared error between the predicted output and the ground truth output.

To use backpropagation to calculate the gradient of the loss function with respect to the weights of the network, we would start at the output layer and work our way backwards. At each layer, we would calculate the error at that layer and then propagate it back to the previous layer. This process would continue until we reached the input layer.

Once we have calculated the gradient of the loss function with respect to the weights of the network, we can use gradient descent to update the weights in the direction of the negative gradient. This will help the network to learn and reduce the loss function.
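The example above can be sketched in a few lines. This is only an illustration under assumed details: a hidden layer of 3 `tanh` units, a linear output, no biases, made-up input and target values, and a learning rate of 0.01. It computes the gradients by hand with the chain rule, checks them against PyTorch's autograd, and then takes one gradient descent step.

```python
import torch

torch.manual_seed(0)

# Assumed toy setup: 2 inputs -> 3 hidden units (tanh) -> 1 linear output
x = torch.tensor([[0.5, -1.0]])              # shape (1, 2)
y = torch.tensor([[1.0]])                    # shape (1, 1)
W1 = torch.randn(3, 2, requires_grad=True)   # input -> hidden weights
W2 = torch.randn(1, 3, requires_grad=True)   # hidden -> output weights

# Forward pass
z1 = x @ W1.T                    # (1, 3)
a1 = torch.tanh(z1)              # (1, 3)
y_hat = a1 @ W2.T                # (1, 1)
loss = ((y_hat - y) ** 2).mean() # mean squared error

# Backpropagation by hand: start at the output and apply the chain rule backwards
with torch.no_grad():
    dL_dyhat = 2.0 * (y_hat - y)         # derivative of the squared error
    dL_dW2 = dL_dyhat.T @ a1             # (1, 3)
    dL_da1 = dL_dyhat @ W2               # (1, 3)
    dL_dz1 = dL_da1 * (1.0 - a1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dL_dW1 = dL_dz1.T @ x                # (3, 2)

# The same gradients computed by autograd
loss.backward()
print(torch.allclose(dL_dW1, W1.grad), torch.allclose(dL_dW2, W2.grad))

# One gradient descent step using the hand-computed gradients
lr = 0.01
with torch.no_grad():
    W1_new = W1 - lr * dL_dW1
    W2_new = W2 - lr * dL_dW2
    new_loss = ((torch.tanh(x @ W1_new.T) @ W2_new.T - y) ** 2).mean()
print(loss.item(), new_loss.item())
```

Backpropagation produces `dL_dW1` and `dL_dW2`. Gradient descent is the separate update step that uses them.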

**Advantages and disadvantages**

**Backpropagation**

Advantages:
- Easy to implement
- Efficient for calculating the gradient of the loss function with respect to the weights of a neural network

Disadvantages:
- Can be computationally expensive for large neural networks

**Gradient descent**

Advantages:
- Simple to implement
- Efficient for optimizing a function by iteratively moving in the direction of the negative gradient

Disadvantages:
- Can be prone to getting stuck in local minima

**Conclusion**

Backpropagation and gradient descent are two essential algorithms for optimizing deep neural networks. Backpropagation is used to calculate the gradient of the loss function with respect to the weights of the network, while gradient descent is used to update the weights in the direction of the negative gradient. By combining these two algorithms, we can train deep neural networks to perform complex tasks.

#### References

 - https://www.analyticsvidhya.com/blog/2023/01/gradient-descent-vs-backpropagation-whats-the-difference/#:~:text=To%20put%20it%20plainly%2C%20gradient,gradient%20descent%20relies%20on%20backpropagation.

## Exercise 1, b)

In [3]:
input_values = torch.tensor([[5., 4., 1., 3., 2.]])

y_true = torch.tensor([1.])

w1 = torch.tensor([[2., 2., 2., 2., 2.],
                   [2., 2., 2., 2., 2.],
                   [2., 2., 2., 2., 2.]])

w2 = torch.tensor([[2., 2., 2.],
                   [2., 2., 2.]])

w3 = torch.tensor([[2., 2.]])

W = list([w1, w2, w3])

In [4]:
a3, a2, a1 = ForwardPass(input_values, W)
loss = BCE_Loss(y_true, a3)

print("Predicted Output (a3):", a3.item())
print("Expected Output (y_true):", y_true.item())
print("Binary Cross-Entropy Loss:", loss.item())

Predicted Output (a3): 1.0
Expected Output (y_true): 1.0
Binary Cross-Entropy Loss: 0.0


In [5]:
dL_dW1, dL_dW2, dL_dW3 = BackwardPass(input_values, a1, a2, a3, y_true, W)

print("dL_dW1:", dL_dW1)
print("dL_dW2:", dL_dW2)
print("dL_dW3:", dL_dW3)

dL_dW1: tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
dL_dW2: tensor([[0., 0., 0.],
        [0., 0., 0.]])
dL_dW3: tensor([[0., 0.]])


In [6]:
W = StochasticGradintDescent(dL_dW1, dL_dW2, dL_dW3, W, lr = 0.2)

print("Updated W1:", W[0])
print("Updated W2:", W[1])
print("Updated W3:", W[2])

Updated W1: tensor([[2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.]])
Updated W2: tensor([[2., 2., 2.],
        [2., 2., 2.]])
Updated W3: tensor([[2., 2.]])


<h2>After updating your weights what do you observe? Explain why.</h2>
<p>After computing the loss and running a backward pass, with every weight initialized to 2, none of the weights changed. Tracing back through the steps, every partial derivative of the loss with respect to the weights is 0, so the SGD update subtracts zero from each weight matrix and leaves it untouched. Each partial derivative measures how the loss would change under a small change in the corresponding weight, so a zero gradient means that, to first order, no small change in any weight would reduce the loss. Why are all of them zero? Because the computed loss is already at its minimum, zero: the predicted value and the expected value are both exactly 1, so the network matched the target. The prediction is exactly 1 only because of finite precision: with these inputs and weights, the pre-activation of the output neuron is z3 = 720, and the sigmoid of such a large value rounds to 1.0 in floating point. The output neuron is saturated. In BackwardPass, dL_da3 is set to 0 because the prediction equals the label, and the sigmoid derivative a3 * (1 - a3) is also 0, so every gradient that flows back through the output is 0. With no gradient signal there is nothing for gradient descent to update, which is why the weight matrices did not change.</p>
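One way to check why the prediction is exactly 1 is to look at the output neuron's pre-activation. The sketch below rebuilds the same 5-3-2-1 network with every weight set to a single value and no bias. The helper `forward` and the comparison value 0.1 are assumptions added for illustration. The last lines use the standard identity that, for a sigmoid output with binary cross-entropy, the gradient with respect to the pre-activation simplifies to `a3 - y`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[5., 4., 1., 3., 2.]])
y = torch.tensor([1.])

def forward(x, scale):
    # Same layer sizes and activations as ForwardPass, with every weight equal to `scale`
    w1 = torch.full((3, 5), scale)
    w2 = torch.full((2, 3), scale)
    w3 = torch.full((1, 2), scale)
    a1 = torch.relu(x @ w1.T)
    a2 = F.leaky_relu(a1 @ w2.T, negative_slope=0.01)
    z3 = a2 @ w3.T
    return z3, torch.sigmoid(z3)

# All weights equal to 2, as in the exercise: the output neuron saturates
z3, a3 = forward(x, 2.0)
print(z3.item(), a3.item())       # 720.0 1.0
print((a3 * (1 - a3)).item())     # 0.0, the sigmoid derivative vanishes

# All weights equal to 0.1 (arbitrary smaller value): the output is no longer saturated
z3_small, a3_small = forward(x, 0.1)
dL_dz3 = a3_small - y             # sigmoid + BCE gradient with respect to z3
print(a3_small.item(), dL_dz3.item())
```

With the smaller weights the gradient at the output is clearly non-zero, so a gradient descent step would change the weights.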

#### References

https://towardsdatascience.com/nothing-but-numpy-understanding-creating-binary-classification-neural-networks-with-e746423c8d5c

## Exercise 1, c)

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)  # Input size for Fashion-MNIST is 28x28=784
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)   # 10 output classes for Fashion-MNIST

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten the input
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # TODO: look into Leaky ReLU for the second hidden layer
        
        x = self.fc3(x)
        return x

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

trainset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# For MNIST
# trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
# testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

# Initialize the network, loss function, and optimizer
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

# Training loop
for epoch in range(10):  # You can adjust the number of epochs
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}")

# Testing the network
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy on test set: {100 * correct / total}%")

# Calculate Cross-Entropy on the training and test sets
train_loss = 0.0
test_loss = 0.0
with torch.no_grad():
    for data in trainloader:
        inputs, labels = data
        outputs = net(inputs)
        train_loss += criterion(outputs, labels).item()
    
    for data in testloader:
        inputs, labels = data
        outputs = net(inputs)
        test_loss += criterion(outputs, labels).item()

print(f"Cross-Entropy on training set: {train_loss / len(trainloader)}")
print(f"Cross-Entropy on test set: {test_loss / len(testloader)}")


Epoch 1, Loss: 0.961732804616377
Epoch 2, Loss: 0.5474599316430244
Epoch 3, Loss: 0.4815413070512987
Epoch 4, Loss: 0.44872575202412696
Epoch 5, Loss: 0.4265495993689433
Epoch 6, Loss: 0.4093995988051266
Epoch 7, Loss: 0.39465504908549
Epoch 8, Loss: 0.38299932659689045
Epoch 9, Loss: 0.3717110209754789
Epoch 10, Loss: 0.36276883485792544
Accuracy on test set: 84.47%
Cross-Entropy on training set: 0.37405235665058023
Cross-Entropy on test set: 0.4217475307215551


## Exercise 1, d)

To improve the previous feedforward neural network for the Fashion-MNIST classification task, we can make the following changes:

**Increase Network Complexity**: We can add more hidden layers and neurons to increase the network's capacity to learn complex patterns in the data.

**Use a Different Activation Function**: Instead of ReLU (Rectified Linear Unit), we can try Leaky ReLU, which keeps a small non-zero gradient for negative inputs and so avoids units that stop learning because their gradient is always zero.

**Regularization Techniques**: To prevent overfitting, we can add dropout layers and L2 regularization to the network.

**Batch Normalization**: Applying batch normalization can help stabilize and speed up training.
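Dropout and batch normalization both behave differently in training and evaluation mode, which matters when using them. The following sketch shows this on made-up tensors. The layer sizes, the dropout probability, and the input statistics are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout: random zeroing in training mode, identity in evaluation mode
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
out_train = drop(x)   # each entry is either 0 or scaled by 1 / (1 - p) = 2
drop.eval()
out_eval = drop(x)    # input passes through unchanged

print(out_train)
print(out_eval)

# Batch normalization: batch statistics in training mode, running statistics in evaluation mode
bn = nn.BatchNorm1d(4)
batch = torch.randn(16, 4) * 3.0 + 5.0   # features with mean near 5 and std near 3

bn.train()
normed = bn(batch)        # normalized with this batch's own mean and variance
print(normed.mean(dim=0), normed.std(dim=0, unbiased=False))

bn.eval()
normed_eval = bn(batch)   # normalized with the running estimates accumulated so far
print(normed_eval.mean(dim=0))
```

In training mode the normalized features have mean close to 0 and standard deviation close to 1. In evaluation mode the result depends on the running estimates, which after a single batch are still far from the data statistics.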

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Define a more complex neural network
class ComplexNet(nn.Module):
    def __init__(self):
        super(ComplexNet, self).__init__()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 10)
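        self.relu = nn.LeakyReLU(0.2)  # Leaky ReLU (negative slope 0.2), despite the attribute name
        self.dropout = nn.Dropout(0.5)  # NOTE: defined but never applied in forward()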
        self.batch_norm1 = nn.BatchNorm1d(512)
        self.batch_norm2 = nn.BatchNorm1d(256)
        self.batch_norm3 = nn.BatchNorm1d(128)

    def forward(self, x):
        x = x.view(-1, 784)
        x = self.relu(self.batch_norm1(self.fc1(x)))
        x = self.relu(self.batch_norm2(self.fc2(x)))
        x = self.relu(self.batch_norm3(self.fc3(x)))
        x = self.fc4(x)
        return x

# Data loading and preprocessing
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

# Initialize the network, loss function, and optimizer
net = ComplexNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001, weight_decay=1e-5)  # L2 regularization

# Training loop
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss / len(trainloader)}")

# Testing the network (net.eval() is never called, so BatchNorm still uses per-batch statistics here)
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy on test set: {100 * correct / total}%")

# Calculate Cross-Entropy on the training and test sets
train_loss = 0.0
test_loss = 0.0
with torch.no_grad():
    for data in trainloader:
        inputs, labels = data
        outputs = net(inputs)
        train_loss += criterion(outputs, labels).item()
    
    for data in testloader:
        inputs, labels = data
        outputs = net(inputs)
        test_loss += criterion(outputs, labels).item()

print(f"Cross-Entropy on training set: {train_loss / len(trainloader)}")
print(f"Cross-Entropy on test set: {test_loss / len(testloader)}")


Epoch 1, Loss: 0.45866134827896987
Epoch 2, Loss: 0.3540881048721164
Epoch 3, Loss: 0.32221466568169566
Epoch 4, Loss: 0.2985416015884134
Epoch 5, Loss: 0.28089767953416683
Epoch 6, Loss: 0.2671330156070845
Epoch 7, Loss: 0.25372612392946853
Epoch 8, Loss: 0.24080968309027045
Epoch 9, Loss: 0.23350627290239848
Epoch 10, Loss: 0.22304274401526208
Accuracy on test set: 88.66%
Cross-Entropy on training set: 0.19269801209222026
Cross-Entropy on test set: 0.320014977293789


### Explaining results

**Original Simple Feedforward Neural Network**

- After 10 epochs, the average training loss over the last epoch is 0.3628, and the accuracy on the test set is 84.47%.
- The Cross-Entropy on the training set is 0.3741, and on the test set it is 0.4217.

**Modified Complex Feedforward Neural Network**

- After 10 epochs, the average training loss over the last epoch is 0.2230, and the accuracy on the test set is 88.66%.
- The Cross-Entropy on the training set is 0.1927, and on the test set it is 0.3200.

**Comparison**

Training Loss: The modified complex network achieves a significantly lower training loss (0.2230) compared to the original simple network (0.3628). This indicates that the complex network learns the training data better.

Test Accuracy: The modified complex network achieves a higher test accuracy (88.66%) compared to the original simple network (84.47%). This suggests that the complex network generalizes better to unseen data.

Cross-Entropy: The Cross-Entropy on the training set is lower for the complex network (0.1927) compared to the original network (0.3741). Additionally, the Cross-Entropy on the test set is lower for the complex network (0.3200) compared to the original network (0.4217). Lower Cross-Entropy values indicate better model performance.
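As a reminder of what these Cross-Entropy numbers measure, the sketch below computes the loss by hand for a made-up batch of logits and compares it with PyTorch's built-in function. The logits and labels are invented for illustration only.

```python
import torch
import torch.nn.functional as F

# Made-up scores for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])   # both samples are predicted correctly

# Cross-entropy = mean negative log-probability assigned to the true class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()
builtin = F.cross_entropy(logits, labels)
print(manual.item(), builtin.item())

# The same logits scored against labels the model gets wrong give a larger loss
wrong = F.cross_entropy(logits, torch.tensor([2, 0]))
print(wrong.item())
```

The loss shrinks as the model puts more probability on the true class, which is why a lower value on the test set reflects better predictions.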

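In summary, the modified network, with more layers and neurons, Leaky ReLU activations, batch normalization, and L2 regularization (weight decay), trained with Adam instead of SGD, outperforms the original simple network: it reaches a lower training loss, a higher test accuracy, and a lower test Cross-Entropy. A few caveats apply. The dropout layer is defined in `ComplexNet` but never applied in `forward`, so dropout did not contribute to these results. The evaluation loops run without calling `net.eval()`, so batch normalization uses per-batch statistics at test time. Several things changed at once (architecture, activation, normalization, optimizer, and learning rate), so the improvement cannot be attributed to any single change. Finally, the gap between training and test Cross-Entropy is wider for the complex network (0.1927 vs. 0.3200) than for the simple one (0.3741 vs. 0.4217), which hints at more overfitting.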
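Because `ComplexNet` contains batch normalization, the mode the model is in during evaluation affects the reported numbers. Below is a minimal sketch of an evaluation helper that switches to evaluation mode first. The function `evaluate`, the small stand-in model, and the synthetic data are all assumptions for illustration. They are not the notebook's actual pipeline, and the numbers they print say nothing about Fashion-MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion):
    """Hypothetical helper: mean loss and accuracy (%) with the model in eval mode."""
    was_training = model.training
    model.eval()   # BatchNorm uses running statistics, Dropout is disabled
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs)
            # Weight each batch by its size so a smaller last batch is not over-counted
            total_loss += criterion(outputs, labels).item() * labels.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    model.train(was_training)   # restore the previous mode
    return total_loss / total, 100.0 * correct / total

# Synthetic stand-in data so the sketch runs without downloading a dataset
torch.manual_seed(0)
data = TensorDataset(torch.randn(100, 784), torch.randint(0, 10, (100,)))
loader = DataLoader(data, batch_size=32)

# Small stand-in model with the same kinds of layers as ComplexNet
model = nn.Sequential(
    nn.Linear(784, 64), nn.BatchNorm1d(64), nn.LeakyReLU(0.2),
    nn.Dropout(0.5), nn.Linear(64, 10),
)

loss, acc = evaluate(model, loader, nn.CrossEntropyLoss())
print(loss, acc)
```

Calling such a helper on both the training and test loaders after training would report both metrics under the same conditions.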

## Exercise 1, e)