# **Deep Learning Course**

## **Loss Functions and Multilayer Perceptrons (MLP)**

---

### **Student Information:**

- **Name:** *Radin Khayyam*
- **Student Number:** *99101579*

---

### **Assignment Overview**

In this notebook, we will explore various loss functions used in neural networks, with a specific focus on their role in training **Multilayer Perceptrons (MLPs)**. By the end of this notebook, you will have a deeper understanding of:
- Types of loss functions
- How loss functions affect the training process
- The relationship between loss functions and model optimization in MLPs

---

### **Table of Contents**

1. Introduction to Loss Functions
2. Types of Loss Functions
3. Multilayer Perceptrons (MLP)
4. Implementing Loss Functions in MLP
5. Conclusion

---



# 1. Introduction to Loss Functions

In deep learning, **loss functions** play a crucial role in training models by quantifying the difference between the predicted outputs and the actual targets. Selecting the appropriate loss function is essential for the success of your model. In this notebook, we will explore various loss functions available in PyTorch, understand their theoretical backgrounds, and provide you with a scaffolded class to experiment with these loss functions.

Before beginning, let's train a simple MLP model using the **L1Loss** function. We'll return to this model later to experiment with different loss functions. We'll start by importing the necessary libraries and defining the model architecture.

First things first, let's talk about **L1Loss**.

### 1. L1Loss (`torch.nn.L1Loss`)
- **Description:** Also known as Mean Absolute Error (MAE), L1Loss computes the average absolute difference between the predicted values and the target values.
- **Use Case:** Suitable for regression tasks where robustness to outliers is desired.

Here is the mathematical formulation of L1Loss:
\begin{equation}
\text{L1Loss} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}_i} - y_{\text{true}_i}|
\end{equation}
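
As a quick check of the formula, the following sketch computes the mean absolute error by hand and compares it with `torch.nn.L1Loss`. The tensor values are made up for illustration and are not taken from the dataset used later.

```python
import torch
import torch.nn as nn

# Made-up predictions and targets, for illustration only
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Mean of the absolute differences, following the formula above
manual_l1 = (y_pred - y_true).abs().mean()

# nn.L1Loss averages over all elements by default (reduction="mean")
builtin_l1 = nn.L1Loss()(y_pred, y_true)

print(manual_l1.item(), builtin_l1.item())  # both are 0.5
```

The absolute errors are 0.5, 0.5, 0 and 1, so the mean is 0.5 in both cases.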

Let's implement a simple MLP model using the L1Loss function.

In [15]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.optim import Adam
from tqdm import tqdm
# Don't be curious about Adam, it's just a fancy name for a fancy optimization algorithm

Here, we'll define a class called `SimpleMLP` that inherits from `nn.Module`. This class can have multiple layers, and we'll use the `nn.Sequential` module to define the layers of the model. The model will have the following architecture:

In [16]:
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1, last_layer_activation_fn=nn.ReLU):
        super(SimpleMLP, self).__init__()
        # TODO: Define the layers of the MLP
        layers = []
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())

        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())

        layers.append(nn.Linear(hidden_dim, output_dim))
        if last_layer_activation_fn is not None:
            layers.append(last_layer_activation_fn())

        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # TODO: Define the forward pass of the MLP
        return self.model(x)
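
To make the resulting architecture concrete, here is a minimal sketch of the layer stack that the class above would build for one particular configuration. The sizes (4 inputs, 16 hidden units, 2 hidden layers, 1 output with a sigmoid) are assumptions chosen to match the Titanic example later in the notebook, and the `nn.Sequential` is written out by hand rather than through `SimpleMLP` so that the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Hand-written equivalent of the layer list for:
# input_dim=4, hidden_dim=16, output_dim=1, num_hidden_layers=2, last activation = Sigmoid
mlp = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),     # input layer -> first hidden layer
    nn.Linear(16, 16), nn.ReLU(),    # second hidden layer
    nn.Linear(16, 1), nn.Sigmoid(),  # output layer squashed into (0, 1)
)

x = torch.randn(8, 4)  # a random batch of 8 samples with 4 features each
out = mlp(x)
print(out.shape)  # torch.Size([8, 1])
```

Each sample produces a single value between 0 and 1, which is what the threshold-based accuracy in the trainer relies on.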

Now, let's define a class called `SimpleMLPTrainer` that handles the training and evaluation loops:

In [182]:
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        #TODO: Implement the training loop
        #Note: You should also print the training loss at each epoch, use tqdm for progress bar
        #Note: You should return the training loss at each epoch

        training_loss = []

        for epoch in range(num_epochs):
            self.model.train()  # Set the model to training mode
            epoch_loss = 0.0

            progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}/{num_epochs}")
            for inputs, targets in progress_bar:
                inputs, targets = inputs.to(next(self.model.parameters()).device), targets.to(next(self.model.parameters()).device)
                # Forward pass
                predictions = self.model(inputs)
                loss = self.criterion(predictions, targets)

                # Backward pass
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                # Accumulate loss
                epoch_loss += loss.item()

            avg_loss = epoch_loss / len(train_loader)
            training_loss.append(avg_loss)
            print(f"Epoch {epoch + 1}/{num_epochs} - Training Loss: {avg_loss:.4f}")

        return training_loss

    def evaluate(self, val_loader):
        #TODO: Implement the evaluation loop
        #Note: You should return the validation loss and accuracy
        self.model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(next(self.model.parameters()).device), targets.to(next(self.model.parameters()).device)
                # Forward pass
                predictions = self.model(inputs)

                loss = self.criterion(predictions, targets)

                val_loss += loss.item()

                # Calculate accuracy
                predicted = (predictions > 0.5).float()
                correct += (predicted == targets).sum().item()
                total += targets.size(0)

        avg_val_loss = val_loss / len(val_loader)
        accuracy = correct / total * 100.0
        print(f"Validation Loss: {avg_val_loss:.4f} - Validation Accuracy: {accuracy:.2f}%")

        return avg_val_loss, accuracy
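
The accuracy computed in `evaluate` assumes a single output per sample that can be thresholded at 0.5, which fits a sigmoid output. The sketch below isolates that step on made-up values to show what is being counted.

```python
import torch

# Made-up sigmoid outputs and binary targets, both of shape (N, 1)
probs = torch.tensor([[0.9], [0.4], [0.6], [0.1]])
targets = torch.tensor([[1.0], [0.0], [0.0], [0.0]])

# Threshold at 0.5, then compare element-wise with the targets
predicted = (probs > 0.5).float()
correct = (predicted == targets).sum().item()
accuracy = correct / targets.size(0) * 100.0

print(accuracy)  # 75.0
```

Three of the four thresholded predictions match their targets. With an unbounded output such as ReLU, or with more than one output per sample, this threshold no longer has a probabilistic meaning, so the reported accuracy should be read with care in those cases.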

Next, let's test our model using the L1Loss function. You'll use the <span style="color:red">*Titanic Dataset*</span> to train the model.


In [148]:
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing (simple example)
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# TODO: Convert the data to PyTorch tensors and create a DataLoader
features = data[['Pclass', 'Sex', 'Age', 'Fare']].values
labels = data['Survived'].values
features_tensor = torch.tensor(features, dtype=torch.float32)
labels_tensor = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

# TODO: Split the data into training and validation sets
train_features, val_features, train_labels, val_labels = train_test_split(
    features_tensor, labels_tensor, test_size=0.2, random_state=42
)

# Wrap the tensors in datasets and create the data loaders
train_dataset = TensorDataset(train_features, train_labels)
val_dataset = TensorDataset(val_features, val_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)


<div style="text-align: center;"> <span style="color:red; font-size: 26px; font-weight: bold;">Let's train!</span> </div>

In [183]:
from torch.nn import L1Loss

# TODO: Train the model

input_dim = features.shape[1]
hidden_dim = 16
output_dim = 1
model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=nn.Sigmoid)

optimizer = Adam(model.parameters(), lr=0.001)

num_epochs = 20
trainer = SimpleMLPTrainer(model, criterion=L1Loss(), optimizer=optimizer)
training_loss = trainer.train(train_loader, num_epochs)
print("Training completed.")

# TODO: Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}")

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 516.59it/s]


Epoch 1/20 - Training Loss: 0.4638


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 534.02it/s]


Epoch 2/20 - Training Loss: 0.4242


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 399.80it/s]


Epoch 3/20 - Training Loss: 0.3917


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 530.18it/s]


Epoch 4/20 - Training Loss: 0.3718


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 509.68it/s]


Epoch 5/20 - Training Loss: 0.3569


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 403.61it/s]


Epoch 6/20 - Training Loss: 0.3438


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 500.79it/s]


Epoch 7/20 - Training Loss: 0.3358


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 490.21it/s]


Epoch 8/20 - Training Loss: 0.3278


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 530.70it/s]


Epoch 9/20 - Training Loss: 0.3230


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 474.58it/s]


Epoch 10/20 - Training Loss: 0.3185


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 447.89it/s]


Epoch 11/20 - Training Loss: 0.3160


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 459.86it/s]


Epoch 12/20 - Training Loss: 0.3123


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 409.74it/s]


Epoch 13/20 - Training Loss: 0.3102


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 474.61it/s]


Epoch 14/20 - Training Loss: 0.3087


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 523.61it/s]


Epoch 15/20 - Training Loss: 0.3073


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 509.67it/s]


Epoch 16/20 - Training Loss: 0.3069


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 484.38it/s]


Epoch 17/20 - Training Loss: 0.3048


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 480.60it/s]


Epoch 18/20 - Training Loss: 0.3035


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 517.13it/s]


Epoch 19/20 - Training Loss: 0.3041


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 445.07it/s]

Epoch 20/20 - Training Loss: 0.3024
Training completed.
Validation Loss: 0.3986 - Validation Accuracy: 62.94%
Validation Loss: 0.3986





---
# 2. Types of Loss Functions

PyTorch offers a variety of built-in loss functions tailored for different types of problems, such as regression, classification, and more. Below, we discuss several commonly used loss functions, their theoretical foundations, and typical use cases.

### 2. MSELoss (`torch.nn.MSELoss`)
- **Description:** Mean Squared Error (MSE) calculates the average of the squares of the differences between predicted and target values.
- **Use Case:** Commonly used in regression problems where larger errors are significantly penalized.

Here is the (boring) math for MSE:
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}
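
To see what "larger errors are significantly penalized" means in practice, this small sketch compares `L1Loss` and `MSELoss` on two made-up prediction vectors: one with small errors everywhere and one with a single large error.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_close = torch.tensor([1.5, 2.5, 3.5, 4.5])    # every prediction is off by 0.5
y_outlier = torch.tensor([1.0, 2.0, 3.0, 8.0])  # one prediction is off by 4.0

l1, mse = nn.L1Loss(), nn.MSELoss()

l1_close = l1(y_close, y_true).item()      # 0.5
mse_close = mse(y_close, y_true).item()    # 0.25
l1_out = l1(y_outlier, y_true).item()      # 1.0
mse_out = mse(y_outlier, y_true).item()    # 4.0

print(f"L1:  {l1_close:.2f} -> {l1_out:.2f}")
print(f"MSE: {mse_close:.2f} -> {mse_out:.2f}")
```

Going from the first case to the second, the L1 loss doubles while the MSE grows sixteenfold, because the single large error is squared.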

<span style="color:red; font-size: 18px; font-weight: bold;">Warning:</span> Don't forget to reinitialize the model before experimenting with different loss functions.

In [185]:
from torch.nn import MSELoss

input_dim = features.shape[1]
hidden_dim = 16
output_dim = 1
model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=nn.Sigmoid)

criterion = MSELoss()
optimizer = Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion=criterion, optimizer=optimizer)

# Train the model
num_epochs = 50
training_loss = trainer.train(train_loader, num_epochs)
print("Training completed.")

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}")

Epoch 1/50: 100%|██████████| 18/18 [00:00<00:00, 506.90it/s]


Epoch 1/50 - Training Loss: 0.2276


Epoch 2/50: 100%|██████████| 18/18 [00:00<00:00, 579.72it/s]


Epoch 2/50 - Training Loss: 0.2086


Epoch 3/50: 100%|██████████| 18/18 [00:00<00:00, 556.97it/s]


Epoch 3/50 - Training Loss: 0.2075


Epoch 4/50: 100%|██████████| 18/18 [00:00<00:00, 510.33it/s]


Epoch 4/50 - Training Loss: 0.2051


Epoch 5/50: 100%|██████████| 18/18 [00:00<00:00, 546.11it/s]


Epoch 5/50 - Training Loss: 0.2054


Epoch 6/50: 100%|██████████| 18/18 [00:00<00:00, 498.28it/s]


Epoch 6/50 - Training Loss: 0.2050


Epoch 7/50: 100%|██████████| 18/18 [00:00<00:00, 384.57it/s]


Epoch 7/50 - Training Loss: 0.2038


Epoch 8/50: 100%|██████████| 18/18 [00:00<00:00, 330.63it/s]


Epoch 8/50 - Training Loss: 0.2032


Epoch 9/50: 100%|██████████| 18/18 [00:00<00:00, 412.64it/s]


Epoch 9/50 - Training Loss: 0.2030


Epoch 10/50: 100%|██████████| 18/18 [00:00<00:00, 490.55it/s]


Epoch 10/50 - Training Loss: 0.2034


Epoch 11/50: 100%|██████████| 18/18 [00:00<00:00, 484.35it/s]


Epoch 11/50 - Training Loss: 0.2035


Epoch 12/50: 100%|██████████| 18/18 [00:00<00:00, 571.25it/s]


Epoch 12/50 - Training Loss: 0.2034


Epoch 13/50: 100%|██████████| 18/18 [00:00<00:00, 547.12it/s]


Epoch 13/50 - Training Loss: 0.2016


Epoch 14/50: 100%|██████████| 18/18 [00:00<00:00, 549.94it/s]


Epoch 14/50 - Training Loss: 0.2012


Epoch 15/50: 100%|██████████| 18/18 [00:00<00:00, 543.28it/s]


Epoch 15/50 - Training Loss: 0.2012


Epoch 16/50: 100%|██████████| 18/18 [00:00<00:00, 579.54it/s]


Epoch 16/50 - Training Loss: 0.2003


Epoch 17/50: 100%|██████████| 18/18 [00:00<00:00, 572.57it/s]


Epoch 17/50 - Training Loss: 0.1995


Epoch 18/50: 100%|██████████| 18/18 [00:00<00:00, 554.82it/s]


Epoch 18/50 - Training Loss: 0.1997


Epoch 19/50: 100%|██████████| 18/18 [00:00<00:00, 543.70it/s]


Epoch 19/50 - Training Loss: 0.1993


Epoch 20/50: 100%|██████████| 18/18 [00:00<00:00, 558.00it/s]


Epoch 20/50 - Training Loss: 0.1988


Epoch 21/50: 100%|██████████| 18/18 [00:00<00:00, 398.49it/s]


Epoch 21/50 - Training Loss: 0.1986


Epoch 22/50: 100%|██████████| 18/18 [00:00<00:00, 546.08it/s]


Epoch 22/50 - Training Loss: 0.1983


Epoch 23/50: 100%|██████████| 18/18 [00:00<00:00, 531.31it/s]


Epoch 23/50 - Training Loss: 0.2001


Epoch 24/50: 100%|██████████| 18/18 [00:00<00:00, 480.17it/s]


Epoch 24/50 - Training Loss: 0.1987


Epoch 25/50: 100%|██████████| 18/18 [00:00<00:00, 508.94it/s]


Epoch 25/50 - Training Loss: 0.1977


Epoch 26/50: 100%|██████████| 18/18 [00:00<00:00, 406.57it/s]


Epoch 26/50 - Training Loss: 0.1973


Epoch 27/50: 100%|██████████| 18/18 [00:00<00:00, 542.70it/s]


Epoch 27/50 - Training Loss: 0.1956


Epoch 28/50: 100%|██████████| 18/18 [00:00<00:00, 505.00it/s]


Epoch 28/50 - Training Loss: 0.1958


Epoch 29/50: 100%|██████████| 18/18 [00:00<00:00, 444.09it/s]


Epoch 29/50 - Training Loss: 0.1944


Epoch 30/50: 100%|██████████| 18/18 [00:00<00:00, 384.33it/s]


Epoch 30/50 - Training Loss: 0.1946


Epoch 31/50: 100%|██████████| 18/18 [00:00<00:00, 388.73it/s]


Epoch 31/50 - Training Loss: 0.1938


Epoch 32/50: 100%|██████████| 18/18 [00:00<00:00, 299.31it/s]


Epoch 32/50 - Training Loss: 0.1921


Epoch 33/50: 100%|██████████| 18/18 [00:00<00:00, 427.50it/s]


Epoch 33/50 - Training Loss: 0.1921


Epoch 34/50: 100%|██████████| 18/18 [00:00<00:00, 402.32it/s]


Epoch 34/50 - Training Loss: 0.1916


Epoch 35/50: 100%|██████████| 18/18 [00:00<00:00, 393.23it/s]


Epoch 35/50 - Training Loss: 0.1905


Epoch 36/50: 100%|██████████| 18/18 [00:00<00:00, 419.83it/s]


Epoch 36/50 - Training Loss: 0.1894


Epoch 37/50: 100%|██████████| 18/18 [00:00<00:00, 395.87it/s]


Epoch 37/50 - Training Loss: 0.1872


Epoch 38/50: 100%|██████████| 18/18 [00:00<00:00, 371.96it/s]


Epoch 38/50 - Training Loss: 0.1875


Epoch 39/50: 100%|██████████| 18/18 [00:00<00:00, 397.32it/s]


Epoch 39/50 - Training Loss: 0.1879


Epoch 40/50: 100%|██████████| 18/18 [00:00<00:00, 401.01it/s]


Epoch 40/50 - Training Loss: 0.1857


Epoch 41/50: 100%|██████████| 18/18 [00:00<00:00, 391.20it/s]


Epoch 41/50 - Training Loss: 0.1842


Epoch 42/50: 100%|██████████| 18/18 [00:00<00:00, 526.70it/s]


Epoch 42/50 - Training Loss: 0.1840


Epoch 43/50: 100%|██████████| 18/18 [00:00<00:00, 471.41it/s]


Epoch 43/50 - Training Loss: 0.1843


Epoch 44/50: 100%|██████████| 18/18 [00:00<00:00, 560.02it/s]


Epoch 44/50 - Training Loss: 0.1811


Epoch 45/50: 100%|██████████| 18/18 [00:00<00:00, 551.48it/s]


Epoch 45/50 - Training Loss: 0.1791


Epoch 46/50: 100%|██████████| 18/18 [00:00<00:00, 485.46it/s]


Epoch 46/50 - Training Loss: 0.1784


Epoch 47/50: 100%|██████████| 18/18 [00:00<00:00, 533.22it/s]


Epoch 47/50 - Training Loss: 0.1771


Epoch 48/50: 100%|██████████| 18/18 [00:00<00:00, 480.47it/s]


Epoch 48/50 - Training Loss: 0.1762


Epoch 49/50: 100%|██████████| 18/18 [00:00<00:00, 402.08it/s]


Epoch 49/50 - Training Loss: 0.1745


Epoch 50/50: 100%|██████████| 18/18 [00:00<00:00, 430.82it/s]


Epoch 50/50 - Training Loss: 0.1736
Training completed.
Validation Loss: 0.2055 - Validation Accuracy: 67.13%
Validation Loss: 0.2055


### 3. NLLLoss (`torch.nn.NLLLoss`)
- **Description:** Negative Log-Likelihood Loss measures the likelihood of the target class under the predicted probability distribution.
- **Use Case:** Typically used in multi-class classification tasks, especially when combined with `log_softmax` activation.

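Here is the mathematical formulation of NLLLoss:
\begin{equation}
\text{NLLLoss} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(\hat{p}_{i, y_i}\right)
\end{equation}
where \( \hat{p}_{i, y_i} \) is the probability the model assigns to the true class \( y_i \) of sample \( i \). Note that `torch.nn.NLLLoss` expects log-probabilities as input, so the logarithm is applied by the last layer (for example `LogSoftmax`), not by the loss itself.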

I hope you note the logarithm in the formula. It's important!

Why?

The logarithm in the formula for **NLLLoss** is important because it heavily penalizes low probabilities assigned to the correct class: the loss grows without bound as that probability approaches zero and reaches zero only when the probability is one. This encourages the model to make confident and accurate predictions. Working with log-probabilities also avoids the numerical underflow that comes from multiplying many small probabilities.
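
As an illustration of the input format `NLLLoss` expects, the sketch below uses made-up logits for three samples and two classes. The inputs are log-probabilities of shape `(N, C)` and the targets are integer class indices of shape `(N,)`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 3 samples and 2 classes
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [1.0, 1.0]])
targets = torch.tensor([0, 1, 1])  # class indices (int64), not one-hot vectors

# NLLLoss takes log-probabilities, so apply log_softmax over the class dimension
log_probs = F.log_softmax(logits, dim=1)
loss = nn.NLLLoss()(log_probs, targets)

# Same value by hand: pick the log-probability of the true class, negate, average
manual = -log_probs[torch.arange(3), targets].mean()

print(loss.item(), manual.item())
```

The two values agree, which shows that `NLLLoss` only selects and averages entries; it does not take the logarithm itself.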


In this part, run your training with ReLU as the last-layer activation. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [181]:
# Run with ReLU activation function
from torch.nn import NLLLoss

input_dim = features.shape[1]
hidden_dim = 16
output_dim = 1
model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=nn.ReLU)

criterion = NLLLoss()
optimizer = Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion=criterion, optimizer=optimizer)

# Train the model
num_epochs = 50
training_loss = trainer.train(train_loader, num_epochs)
print("Training completed.")

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}")

Epoch 1/50:   0%|          | 0/18 [00:00<?, ?it/s]


IndexError: Target 1 is out of bounds.

In [None]:
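# Run with LogSoftmax as the last layer
from functools import partial
from torch.nn import NLLLoss

# NLLLoss expects log-probabilities of shape (N, C) and integer class indices of shape (N,)
input_dim = features.shape[1]
hidden_dim = 16
output_dim = 2  # One output per class: 0 = did not survive, 1 = survived
model_log_softmax = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=partial(nn.LogSoftmax, dim=1))

# Rebuild the training loader with class-index targets (int64, shape (N,))
train_targets_cls = train_labels.squeeze(1).long()
val_targets_cls = val_labels.squeeze(1).long()
train_loader_cls = DataLoader(TensorDataset(train_features, train_targets_cls), batch_size=32, shuffle=True)

# Define the criterion and optimizer
criterion_log_softmax = NLLLoss()
optimizer_log_softmax = Adam(model_log_softmax.parameters(), lr=0.001)

# Train the model
trainer_log_softmax = SimpleMLPTrainer(model_log_softmax, criterion=criterion_log_softmax, optimizer=optimizer_log_softmax)

print("Training with Log-Softmax as the last layer (Expected to Succeed):")
training_loss_log_softmax = trainer_log_softmax.train(train_loader_cls, num_epochs=50)

# trainer.evaluate thresholds a single output at 0.5, so score the two-class output with argmax instead
model_log_softmax.eval()
with torch.no_grad():
    val_log_probs = model_log_softmax(val_features)
    val_loss_log_softmax = criterion_log_softmax(val_log_probs, val_targets_cls).item()
    val_accuracy_log_softmax = (val_log_probs.argmax(dim=1) == val_targets_cls).float().mean().item() * 100.0
print(f"Validation Loss: {val_loss_log_softmax:.4f} - Validation Accuracy: {val_accuracy_log_softmax:.2f}%")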


Your reason for your choice:

<div>
**Your answer here**
</div>


### 4. CrossEntropyLoss (`torch.nn.CrossEntropyLoss`)
- **Description:** Combines `LogSoftmax` and `NLLLoss` in one single class. It computes the cross-entropy loss between the target and the output logits.
- **Use Case:** Widely used for multi-class classification problems.

The mathematical formulation of CrossEntropyLoss is as follows:
\begin{equation}
  \text{CrossEntropy}(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log\left(\frac{e^{\hat{y}_i}}{\sum_{j=1}^{C} e^{\hat{y}_j}}\right)
\end{equation}
  where:
  - \( C \) is the number of classes,
  - \( y_i \) is the \( i \)-th entry of the one-hot encoded target vector (in PyTorch the target can also be given as a scalar class index),
  - \( \hat{y}_i \) is the logit (unnormalized model output) for class \( i \).

  In practice, `torch.nn.CrossEntropyLoss` expects raw logits as input and internally applies log-softmax to them, followed by the negative log-likelihood computation, so the model should not apply softmax itself.

- **Background:** Cross-entropy measures the difference between the true distribution \( y \) and the predicted distribution obtained by applying softmax to the logits \( \hat{y} \). The function minimizes the negative log-probability assigned to the correct class, effectively penalizing predictions that deviate from the true class, making it a standard choice for classification tasks in deep learning.
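
The statement that `CrossEntropyLoss` combines `LogSoftmax` and `NLLLoss` can be checked numerically. The sketch below uses random logits for four samples and three classes; the sizes and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw, unnormalized scores: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

# One step: CrossEntropyLoss applied directly to the logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Two steps: LogSoftmax over the class dimension, then NLLLoss
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(torch.allclose(ce, nll))  # True
```

Because the log-softmax happens inside the loss, adding a softmax or sigmoid layer in front of `CrossEntropyLoss` would normalize the outputs twice.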

Now, let's train the same MLP with `CrossEntropyLoss`:


In [189]:
from torch.nn import CrossEntropyLoss

input_dim = features.shape[1]
hidden_dim = 16
output_dim = 1
model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=nn.ReLU)

criterion = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion=criterion, optimizer=optimizer)

# Train the model
num_epochs = 50
training_loss = trainer.train(train_loader, num_epochs)
print("Training completed.")

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}")


Epoch 1/50: 100%|██████████| 18/18 [00:00<00:00, 73.34it/s]


Epoch 1/50 - Training Loss: 0.0000


Epoch 2/50: 100%|██████████| 18/18 [00:00<00:00, 100.41it/s]


Epoch 2/50 - Training Loss: 0.0000


Epoch 3/50: 100%|██████████| 18/18 [00:00<00:00, 94.46it/s] 


Epoch 3/50 - Training Loss: 0.0000


Epoch 4/50: 100%|██████████| 18/18 [00:00<00:00, 71.66it/s]


Epoch 4/50 - Training Loss: 0.0000


Epoch 5/50: 100%|██████████| 18/18 [00:00<00:00, 78.34it/s] 


Epoch 5/50 - Training Loss: 0.0000


Epoch 6/50: 100%|██████████| 18/18 [00:00<00:00, 106.87it/s]


Epoch 6/50 - Training Loss: 0.0000


Epoch 7/50: 100%|██████████| 18/18 [00:00<00:00, 197.56it/s]


Epoch 7/50 - Training Loss: 0.0000


Epoch 8/50: 100%|██████████| 18/18 [00:00<00:00, 324.35it/s]


Epoch 8/50 - Training Loss: 0.0000


Epoch 9/50: 100%|██████████| 18/18 [00:00<00:00, 402.60it/s]


Epoch 9/50 - Training Loss: 0.0000


Epoch 10/50: 100%|██████████| 18/18 [00:00<00:00, 518.92it/s]


Epoch 10/50 - Training Loss: 0.0000


Epoch 11/50: 100%|██████████| 18/18 [00:00<00:00, 535.40it/s]


Epoch 11/50 - Training Loss: 0.0000


Epoch 12/50: 100%|██████████| 18/18 [00:00<00:00, 517.59it/s]


Epoch 12/50 - Training Loss: 0.0000


Epoch 13/50: 100%|██████████| 18/18 [00:00<00:00, 524.94it/s]


Epoch 13/50 - Training Loss: 0.0000


Epoch 14/50: 100%|██████████| 18/18 [00:00<00:00, 530.15it/s]


Epoch 14/50 - Training Loss: 0.0000


Epoch 15/50: 100%|██████████| 18/18 [00:00<00:00, 510.58it/s]


Epoch 15/50 - Training Loss: 0.0000


Epoch 16/50: 100%|██████████| 18/18 [00:00<00:00, 379.20it/s]


Epoch 16/50 - Training Loss: 0.0000


Epoch 17/50: 100%|██████████| 18/18 [00:00<00:00, 498.33it/s]


Epoch 17/50 - Training Loss: 0.0000


Epoch 18/50: 100%|██████████| 18/18 [00:00<00:00, 523.55it/s]


Epoch 18/50 - Training Loss: 0.0000


Epoch 19/50: 100%|██████████| 18/18 [00:00<00:00, 509.69it/s]


Epoch 19/50 - Training Loss: 0.0000


Epoch 20/50: 100%|██████████| 18/18 [00:00<00:00, 406.15it/s]


Epoch 20/50 - Training Loss: 0.0000


Epoch 21/50: 100%|██████████| 18/18 [00:00<00:00, 356.59it/s]


Epoch 21/50 - Training Loss: 0.0000


Epoch 22/50: 100%|██████████| 18/18 [00:00<00:00, 381.85it/s]


Epoch 22/50 - Training Loss: 0.0000


Epoch 23/50: 100%|██████████| 18/18 [00:00<00:00, 367.06it/s]


Epoch 23/50 - Training Loss: 0.0000


Epoch 24/50: 100%|██████████| 18/18 [00:00<00:00, 376.65it/s]


Epoch 24/50 - Training Loss: 0.0000


Epoch 25/50: 100%|██████████| 18/18 [00:00<00:00, 408.07it/s]


Epoch 25/50 - Training Loss: 0.0000


Epoch 26/50: 100%|██████████| 18/18 [00:00<00:00, 383.33it/s]


Epoch 26/50 - Training Loss: 0.0000


Epoch 27/50: 100%|██████████| 18/18 [00:00<00:00, 402.19it/s]


Epoch 27/50 - Training Loss: 0.0000


Epoch 28/50: 100%|██████████| 18/18 [00:00<00:00, 378.95it/s]


Epoch 28/50 - Training Loss: 0.0000


Epoch 29/50: 100%|██████████| 18/18 [00:00<00:00, 437.58it/s]


Epoch 29/50 - Training Loss: 0.0000


Epoch 30/50: 100%|██████████| 18/18 [00:00<00:00, 532.27it/s]


Epoch 30/50 - Training Loss: 0.0000


Epoch 31/50: 100%|██████████| 18/18 [00:00<00:00, 499.41it/s]


Epoch 31/50 - Training Loss: 0.0000


Epoch 32/50: 100%|██████████| 18/18 [00:00<00:00, 501.08it/s]


Epoch 32/50 - Training Loss: 0.0000


Epoch 33/50: 100%|██████████| 18/18 [00:00<00:00, 516.17it/s]


Epoch 33/50 - Training Loss: 0.0000


Epoch 34/50: 100%|██████████| 18/18 [00:00<00:00, 485.58it/s]


Epoch 34/50 - Training Loss: 0.0000


Epoch 35/50: 100%|██████████| 18/18 [00:00<00:00, 478.36it/s]


Epoch 35/50 - Training Loss: 0.0000


Epoch 36/50: 100%|██████████| 18/18 [00:00<00:00, 342.28it/s]


Epoch 36/50 - Training Loss: 0.0000


Epoch 37/50: 100%|██████████| 18/18 [00:00<00:00, 491.88it/s]


Epoch 37/50 - Training Loss: 0.0000


Epoch 38/50: 100%|██████████| 18/18 [00:00<00:00, 528.41it/s]


Epoch 38/50 - Training Loss: 0.0000


Epoch 39/50: 100%|██████████| 18/18 [00:00<00:00, 512.59it/s]


Epoch 39/50 - Training Loss: 0.0000


Epoch 40/50: 100%|██████████| 18/18 [00:00<00:00, 379.24it/s]


Epoch 40/50 - Training Loss: 0.0000


Epoch 41/50: 100%|██████████| 18/18 [00:00<00:00, 366.01it/s]


Epoch 41/50 - Training Loss: 0.0000


Epoch 42/50: 100%|██████████| 18/18 [00:00<00:00, 376.96it/s]


Epoch 42/50 - Training Loss: 0.0000


Epoch 43/50: 100%|██████████| 18/18 [00:00<00:00, 392.36it/s]


Epoch 43/50 - Training Loss: 0.0000


Epoch 44/50: 100%|██████████| 18/18 [00:00<00:00, 401.60it/s]


Epoch 44/50 - Training Loss: 0.0000


Epoch 45/50: 100%|██████████| 18/18 [00:00<00:00, 393.58it/s]


Epoch 45/50 - Training Loss: 0.0000


Epoch 46/50: 100%|██████████| 18/18 [00:00<00:00, 393.37it/s]


Epoch 46/50 - Training Loss: 0.0000


Epoch 47/50: 100%|██████████| 18/18 [00:00<00:00, 550.83it/s]


Epoch 47/50 - Training Loss: 0.0000


Epoch 48/50: 100%|██████████| 18/18 [00:00<00:00, 353.65it/s]


Epoch 48/50 - Training Loss: 0.0000


Epoch 49/50: 100%|██████████| 18/18 [00:00<00:00, 349.23it/s]


Epoch 49/50 - Training Loss: 0.0000


Epoch 50/50: 100%|██████████| 18/18 [00:00<00:00, 355.17it/s]

Epoch 50/50 - Training Loss: 0.0000
Training completed.
Validation Loss: 0.0000 - Validation Accuracy: 60.84%
Validation Loss: 0.0000
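
One plausible reading of the constant 0.0000 loss above, offered as an interpretation rather than as the assignment's official answer: with `output_dim = 1`, the softmax over a single logit is always 1, so its logarithm is 0 whatever the network outputs. In recent PyTorch versions, float targets with the same shape as the input are treated as class probabilities, so no error is raised and the loss is zero from the first step. The sketch below reproduces this on made-up tensors and contrasts it with a two-logit setup that uses class-index targets.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Degenerate setup: one logit per sample and float targets of the same shape
logits_1 = torch.tensor([[0.3], [5.0], [0.0]])
targets_float = torch.tensor([[1.0], [0.0], [1.0]])  # interpreted as class probabilities
degenerate = criterion(logits_1, targets_float)

# Two-class setup: one logit per class and integer class indices
logits_2 = torch.tensor([[0.3, -0.3], [5.0, 1.0], [0.0, 2.0]])
targets_idx = torch.tensor([1, 0, 1])
proper = criterion(logits_2, targets_idx)

print(degenerate.item(), proper.item())
```

The first value is zero regardless of the logits, so the gradients are zero as well and the weights do not change. The second value is positive and depends on the logits, which is what training needs.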






### 5. KLDivLoss (`torch.nn.KLDivLoss`)
- **Description:** Kullback-Leibler Divergence Loss measures how one probability distribution diverges from a second, reference distribution. Unlike losses that compare a prediction against a single class label, KL divergence (also called relative entropy) compares two full probability distributions. It quantifies the information lost when the predicted distribution is used to approximate the true distribution.

- **Mathematical Function:**
\begin{equation}
  \text{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \left( \log P(i) - \log Q(i) \right)
\end{equation}
  where:
  - \( P \) is the target (true) probability distribution,
  - \( Q \) is the predicted distribution (passed to the loss as log-probabilities, typically the output of `log_softmax`),
  - \( C \) is the number of classes.

  KL divergence is always non-negative, and it equals zero if the two distributions are identical. The loss function expects the model's output to be in the form of log-probabilities (using `log_softmax`) and compares this against a target probability distribution, which is typically a normalized distribution (using softmax).

- **Use Case:** KLDivLoss is frequently used in:
  - **Variational Autoencoders (VAEs):** In VAEs, KL divergence is used to measure how much the learned latent space distribution deviates from a prior distribution (often Gaussian).
  - **Knowledge Distillation:** In teacher-student models, KL divergence is used to transfer the "soft" knowledge from a teacher model to a student model by comparing their output probability distributions.
  - **Reinforcement Learning:** It can be used to update policies while minimizing the divergence from a previous policy.

- **Background:** Kullback-Leibler divergence, a core concept in information theory, measures the inefficiency of assuming the predicted distribution \( Q \) when the true distribution is \( P \). It is asymmetric, meaning that \( KL(P \parallel Q) \neq KL(Q \parallel P) \), so the direction of the comparison matters.
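
The following sketch shows the input convention described above on made-up "student" and "teacher" logits (hypothetical names, in the spirit of the knowledge distillation use case). The prediction is passed as log-probabilities, the target as probabilities, and `reduction="batchmean"` is used so that the result matches the per-sample formula averaged over the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 2 samples and 3 classes
student_logits = torch.tensor([[1.0, 0.0, -1.0], [0.2, 0.2, 0.2]])
teacher_logits = torch.tensor([[2.0, 0.0, -2.0], [0.2, 0.2, 0.2]])

log_q = F.log_softmax(student_logits, dim=1)  # input: log-probabilities of Q
p = F.softmax(teacher_logits, dim=1)          # target: probabilities of P

kl = nn.KLDivLoss(reduction="batchmean")(log_q, p)

# Same quantity from the formula: sum over classes, then mean over the batch
manual = (p * (p.log() - log_q)).sum(dim=1).mean()

# Comparing a distribution with itself gives zero divergence
self_kl = nn.KLDivLoss(reduction="batchmean")(p.log(), p)

print(kl.item(), manual.item(), self_kl.item())
```

When both arguments are valid distributions in these formats, the result is non-negative. If the input is not a log-probability (for example a raw ReLU output), the value no longer corresponds to a KL divergence and can become negative.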

Again, in this part, run your training with ReLU as the last-layer activation. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [193]:
# Run with ReLU activation function
from torch.nn import KLDivLoss

input_dim = features.shape[1]
hidden_dim = 16
output_dim = 1
model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=nn.ReLU)

criterion = KLDivLoss()
optimizer = Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion=criterion, optimizer=optimizer)

# Train the model
num_epochs = 50
training_loss = trainer.train(train_loader, num_epochs)
print("Training completed.")

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}")



Epoch 1/50: 100%|██████████| 18/18 [00:00<00:00, 525.65it/s]


Epoch 1/50 - Training Loss: -1.4916


Epoch 2/50: 100%|██████████| 18/18 [00:00<00:00, 552.98it/s]


Epoch 2/50 - Training Loss: -3.2011


Epoch 3/50: 100%|██████████| 18/18 [00:00<00:00, 573.05it/s]


Epoch 3/50 - Training Loss: -5.3487


Epoch 4/50: 100%|██████████| 18/18 [00:00<00:00, 605.83it/s]


Epoch 4/50 - Training Loss: -8.4321


Epoch 5/50: 100%|██████████| 18/18 [00:00<00:00, 442.35it/s]


Epoch 5/50 - Training Loss: -12.6173


Epoch 6/50: 100%|██████████| 18/18 [00:00<00:00, 552.40it/s]


Epoch 6/50 - Training Loss: -19.3068


Epoch 7/50: 100%|██████████| 18/18 [00:00<00:00, 534.71it/s]


Epoch 7/50 - Training Loss: -28.6094


Epoch 8/50: 100%|██████████| 18/18 [00:00<00:00, 489.28it/s]


Epoch 8/50 - Training Loss: -41.5725


Epoch 9/50: 100%|██████████| 18/18 [00:00<00:00, 479.60it/s]


Epoch 9/50 - Training Loss: -59.2196


Epoch 10/50: 100%|██████████| 18/18 [00:00<00:00, 537.70it/s]


Epoch 10/50 - Training Loss: -82.5798


Epoch 11/50: 100%|██████████| 18/18 [00:00<00:00, 492.11it/s]


Epoch 11/50 - Training Loss: -114.9437


Epoch 12/50: 100%|██████████| 18/18 [00:00<00:00, 475.74it/s]


Epoch 12/50 - Training Loss: -153.9804


Epoch 13/50: 100%|██████████| 18/18 [00:00<00:00, 407.26it/s]


Epoch 13/50 - Training Loss: -204.2056


Epoch 14/50: 100%|██████████| 18/18 [00:00<00:00, 409.90it/s]


Epoch 14/50 - Training Loss: -265.6581


Epoch 15/50: 100%|██████████| 18/18 [00:00<00:00, 322.42it/s]


Epoch 15/50 - Training Loss: -339.8792


Epoch 16/50: 100%|██████████| 18/18 [00:00<00:00, 356.93it/s]


Epoch 16/50 - Training Loss: -428.3095


Epoch 17/50: 100%|██████████| 18/18 [00:00<00:00, 315.60it/s]


Epoch 17/50 - Training Loss: -534.0756


Epoch 18/50: 100%|██████████| 18/18 [00:00<00:00, 337.69it/s]


Epoch 18/50 - Training Loss: -657.3834


Epoch 19/50: 100%|██████████| 18/18 [00:00<00:00, 294.21it/s]


Epoch 19/50 - Training Loss: -791.0666


Epoch 20/50: 100%|██████████| 18/18 [00:00<00:00, 364.54it/s]


Epoch 20/50 - Training Loss: -935.7775


Epoch 21/50: 100%|██████████| 18/18 [00:00<00:00, 394.65it/s]


Epoch 21/50 - Training Loss: -1122.9293


Epoch 22/50: 100%|██████████| 18/18 [00:00<00:00, 373.72it/s]


Epoch 22/50 - Training Loss: -1301.0675


Epoch 23/50: 100%|██████████| 18/18 [00:00<00:00, 361.57it/s]


Epoch 23/50 - Training Loss: -1513.3036


Epoch 24/50: 100%|██████████| 18/18 [00:00<00:00, 378.20it/s]


Epoch 24/50 - Training Loss: -1739.6145


Epoch 25/50: 100%|██████████| 18/18 [00:00<00:00, 387.64it/s]


Epoch 25/50 - Training Loss: -2017.5180


Epoch 26/50: 100%|██████████| 18/18 [00:00<00:00, 389.67it/s]


Epoch 26/50 - Training Loss: -2283.7414


Epoch 27/50: 100%|██████████| 18/18 [00:00<00:00, 488.57it/s]


Epoch 27/50 - Training Loss: -2608.4124


Epoch 28/50: 100%|██████████| 18/18 [00:00<00:00, 554.82it/s]


Epoch 28/50 - Training Loss: -2887.2553


Epoch 29/50: 100%|██████████| 18/18 [00:00<00:00, 518.25it/s]


Epoch 29/50 - Training Loss: -3250.0684


Epoch 30/50: 100%|██████████| 18/18 [00:00<00:00, 516.52it/s]


Epoch 30/50 - Training Loss: -3622.3611


Epoch 31/50: 100%|██████████| 18/18 [00:00<00:00, 483.46it/s]


Epoch 31/50 - Training Loss: -4020.2953


Epoch 32/50: 100%|██████████| 18/18 [00:00<00:00, 491.56it/s]


Epoch 32/50 - Training Loss: -4444.4009


Epoch 33/50: 100%|██████████| 18/18 [00:00<00:00, 464.57it/s]


Epoch 33/50 - Training Loss: -4892.8080


Epoch 34/50: 100%|██████████| 18/18 [00:00<00:00, 371.89it/s]


Epoch 34/50 - Training Loss: -5403.5382


Epoch 35/50: 100%|██████████| 18/18 [00:00<00:00, 515.00it/s]


Epoch 35/50 - Training Loss: -5878.1046


Epoch 36/50: 100%|██████████| 18/18 [00:00<00:00, 391.02it/s]


Epoch 36/50 - Training Loss: -6418.0089


Epoch 37/50: 100%|██████████| 18/18 [00:00<00:00, 379.39it/s]


Epoch 37/50 - Training Loss: -7022.0129


Epoch 38/50: 100%|██████████| 18/18 [00:00<00:00, 390.21it/s]


Epoch 38/50 - Training Loss: -7587.3199


Epoch 39/50: 100%|██████████| 18/18 [00:00<00:00, 387.58it/s]


Epoch 39/50 - Training Loss: -8163.6414


Epoch 40/50: 100%|██████████| 18/18 [00:00<00:00, 381.47it/s]


Epoch 40/50 - Training Loss: -8900.5406


Epoch 41/50: 100%|██████████| 18/18 [00:00<00:00, 395.35it/s]


Epoch 41/50 - Training Loss: -9526.2815


Epoch 42/50: 100%|██████████| 18/18 [00:00<00:00, 529.51it/s]


Epoch 42/50 - Training Loss: -10252.9155


Epoch 43/50: 100%|██████████| 18/18 [00:00<00:00, 544.96it/s]


Epoch 43/50 - Training Loss: -11106.2716


Epoch 44/50: 100%|██████████| 18/18 [00:00<00:00, 373.13it/s]


Epoch 44/50 - Training Loss: -11813.1191


Epoch 45/50: 100%|██████████| 18/18 [00:00<00:00, 406.67it/s]


Epoch 45/50 - Training Loss: -12706.6388


Epoch 46/50: 100%|██████████| 18/18 [00:00<00:00, 394.90it/s]


Epoch 46/50 - Training Loss: -13474.4817


Epoch 47/50: 100%|██████████| 18/18 [00:00<00:00, 367.61it/s]


Epoch 47/50 - Training Loss: -14493.8312


Epoch 48/50: 100%|██████████| 18/18 [00:00<00:00, 415.77it/s]


Epoch 48/50 - Training Loss: -15376.7548


Epoch 49/50: 100%|██████████| 18/18 [00:00<00:00, 529.84it/s]


Epoch 49/50 - Training Loss: -16266.8846


Epoch 50/50: 100%|██████████| 18/18 [00:00<00:00, 524.97it/s]

Epoch 50/50 - Training Loss: -17299.0196
Training completed.
Validation Loss: -17352.2740 - Validation Accuracy: 39.16%
Validation Loss: -17352.2740





In [197]:
# Run with LogSoftmax activation function
from torch.nn import KLDivLoss

input_dim = features.shape[1]
hidden_dim = 16
output_dim = 1  # NOTE: LogSoftmax over a single output unit always returns log(1) = 0
model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=2, last_layer_activation_fn=nn.LogSoftmax)

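# KLDivLoss expects log-probabilities as input; the default reduction is 'mean'
# ('batchmean' is the reduction that matches the mathematical definition of KL divergence)
criterion = KLDivLoss()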
optimizer = Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion=criterion, optimizer=optimizer)

# Train the model
num_epochs = 50
training_loss = trainer.train(train_loader, num_epochs)
print("Training completed.")

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}")



Epoch 1/50: 100%|██████████| 18/18 [00:00<00:00, 510.86it/s]


Epoch 1/50 - Training Loss: 0.0000


Epoch 2/50: 100%|██████████| 18/18 [00:00<00:00, 537.64it/s]


Epoch 2/50 - Training Loss: 0.0000


Epoch 3/50: 100%|██████████| 18/18 [00:00<00:00, 573.03it/s]


Epoch 3/50 - Training Loss: 0.0000


Epoch 4/50: 100%|██████████| 18/18 [00:00<00:00, 567.28it/s]


Epoch 4/50 - Training Loss: 0.0000


Epoch 5/50: 100%|██████████| 18/18 [00:00<00:00, 554.71it/s]


Epoch 5/50 - Training Loss: 0.0000


Epoch 6/50: 100%|██████████| 18/18 [00:00<00:00, 521.60it/s]


Epoch 6/50 - Training Loss: 0.0000


Epoch 7/50: 100%|██████████| 18/18 [00:00<00:00, 525.31it/s]


Epoch 7/50 - Training Loss: 0.0000


Epoch 8/50: 100%|██████████| 18/18 [00:00<00:00, 529.07it/s]


Epoch 8/50 - Training Loss: 0.0000


Epoch 9/50: 100%|██████████| 18/18 [00:00<00:00, 452.31it/s]


Epoch 9/50 - Training Loss: 0.0000


Epoch 10/50: 100%|██████████| 18/18 [00:00<00:00, 543.79it/s]


Epoch 10/50 - Training Loss: 0.0000


Epoch 11/50: 100%|██████████| 18/18 [00:00<00:00, 519.65it/s]


Epoch 11/50 - Training Loss: 0.0000


Epoch 12/50: 100%|██████████| 18/18 [00:00<00:00, 508.65it/s]


Epoch 12/50 - Training Loss: 0.0000


Epoch 13/50: 100%|██████████| 18/18 [00:00<00:00, 530.75it/s]


Epoch 13/50 - Training Loss: 0.0000


Epoch 14/50: 100%|██████████| 18/18 [00:00<00:00, 486.53it/s]


Epoch 14/50 - Training Loss: 0.0000


Epoch 15/50: 100%|██████████| 18/18 [00:00<00:00, 506.79it/s]


Epoch 15/50 - Training Loss: 0.0000


Epoch 16/50: 100%|██████████| 18/18 [00:00<00:00, 522.33it/s]


Epoch 16/50 - Training Loss: 0.0000


Epoch 17/50: 100%|██████████| 18/18 [00:00<00:00, 350.49it/s]


Epoch 17/50 - Training Loss: 0.0000


Epoch 18/50: 100%|██████████| 18/18 [00:00<00:00, 281.25it/s]


Epoch 18/50 - Training Loss: 0.0000


Epoch 19/50: 100%|██████████| 18/18 [00:00<00:00, 362.20it/s]


Epoch 19/50 - Training Loss: 0.0000


Epoch 20/50: 100%|██████████| 18/18 [00:00<00:00, 395.74it/s]


Epoch 20/50 - Training Loss: 0.0000


Epoch 21/50: 100%|██████████| 18/18 [00:00<00:00, 413.81it/s]


Epoch 21/50 - Training Loss: 0.0000


Epoch 22/50: 100%|██████████| 18/18 [00:00<00:00, 418.81it/s]


Epoch 22/50 - Training Loss: 0.0000


Epoch 23/50: 100%|██████████| 18/18 [00:00<00:00, 420.06it/s]


Epoch 23/50 - Training Loss: 0.0000


Epoch 24/50: 100%|██████████| 18/18 [00:00<00:00, 416.04it/s]


Epoch 24/50 - Training Loss: 0.0000


Epoch 25/50: 100%|██████████| 18/18 [00:00<00:00, 385.94it/s]


Epoch 25/50 - Training Loss: 0.0000


Epoch 26/50: 100%|██████████| 18/18 [00:00<00:00, 530.82it/s]


Epoch 26/50 - Training Loss: 0.0000


Epoch 27/50: 100%|██████████| 18/18 [00:00<00:00, 491.45it/s]


Epoch 27/50 - Training Loss: 0.0000


Epoch 28/50: 100%|██████████| 18/18 [00:00<00:00, 506.76it/s]


Epoch 28/50 - Training Loss: 0.0000


Epoch 29/50: 100%|██████████| 18/18 [00:00<00:00, 468.78it/s]


Epoch 29/50 - Training Loss: 0.0000


Epoch 30/50: 100%|██████████| 18/18 [00:00<00:00, 496.94it/s]


Epoch 30/50 - Training Loss: 0.0000


Epoch 31/50: 100%|██████████| 18/18 [00:00<00:00, 517.60it/s]


Epoch 31/50 - Training Loss: 0.0000


Epoch 32/50: 100%|██████████| 18/18 [00:00<00:00, 530.26it/s]


Epoch 32/50 - Training Loss: 0.0000


Epoch 33/50: 100%|██████████| 18/18 [00:00<00:00, 515.48it/s]


Epoch 33/50 - Training Loss: 0.0000


Epoch 34/50: 100%|██████████| 18/18 [00:00<00:00, 482.51it/s]


Epoch 34/50 - Training Loss: 0.0000


Epoch 35/50: 100%|██████████| 18/18 [00:00<00:00, 520.75it/s]


Epoch 35/50 - Training Loss: 0.0000


Epoch 36/50: 100%|██████████| 18/18 [00:00<00:00, 497.54it/s]


Epoch 36/50 - Training Loss: 0.0000


Epoch 37/50: 100%|██████████| 18/18 [00:00<00:00, 407.91it/s]


Epoch 37/50 - Training Loss: 0.0000


Epoch 38/50: 100%|██████████| 18/18 [00:00<00:00, 350.10it/s]


Epoch 38/50 - Training Loss: 0.0000


Epoch 39/50: 100%|██████████| 18/18 [00:00<00:00, 340.42it/s]


Epoch 39/50 - Training Loss: 0.0000


Epoch 40/50: 100%|██████████| 18/18 [00:00<00:00, 409.09it/s]


Epoch 40/50 - Training Loss: 0.0000


Epoch 41/50: 100%|██████████| 18/18 [00:00<00:00, 407.14it/s]


Epoch 41/50 - Training Loss: 0.0000


Epoch 42/50: 100%|██████████| 18/18 [00:00<00:00, 411.58it/s]


Epoch 42/50 - Training Loss: 0.0000


Epoch 43/50: 100%|██████████| 18/18 [00:00<00:00, 401.24it/s]


Epoch 43/50 - Training Loss: 0.0000


Epoch 44/50: 100%|██████████| 18/18 [00:00<00:00, 436.18it/s]


Epoch 44/50 - Training Loss: 0.0000


Epoch 45/50: 100%|██████████| 18/18 [00:00<00:00, 554.57it/s]


Epoch 45/50 - Training Loss: 0.0000


Epoch 46/50: 100%|██████████| 18/18 [00:00<00:00, 536.20it/s]


Epoch 46/50 - Training Loss: 0.0000


Epoch 47/50: 100%|██████████| 18/18 [00:00<00:00, 537.12it/s]


Epoch 47/50 - Training Loss: 0.0000


Epoch 48/50: 100%|██████████| 18/18 [00:00<00:00, 493.55it/s]


Epoch 48/50 - Training Loss: 0.0000


Epoch 49/50: 100%|██████████| 18/18 [00:00<00:00, 453.96it/s]


Epoch 49/50 - Training Loss: 0.0000


Epoch 50/50: 100%|██████████| 18/18 [00:00<00:00, 478.63it/s]

Epoch 50/50 - Training Loss: 0.0000
Training completed.
Validation Loss: 0.0000 - Validation Accuracy: 60.84%
Validation Loss: 0.0000





Your reason for your choice:

<div>

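**ReLU activation:** Using ReLU in the last layer is invalid for KLDivLoss. KLDivLoss expects its input to be log-probabilities, but ReLU produces unbounded non-negative outputs that do not represent a valid probability distribution. The optimizer can then drive the loss toward negative infinity simply by making the outputs larger, which is what the steadily more negative training loss in the first run shows.

**LogSoftmax activation:** LogSoftmax ensures the model outputs are valid log-probabilities, which is what KLDivLoss requires, so it is the proper choice for this task.

Note, however, that with `output_dim = 1` the softmax of a single value is always 1, so the LogSoftmax output is always 0 and the loss is exactly 0 from the first epoch, as the second log shows. For the KL divergence to decrease as the predicted distribution approaches the target distribution, the model needs one output per class (for example `output_dim = 2` with one-hot targets).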
</div>
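
The two training logs above can be checked by hand. The NumPy sketch below is not the notebook's `SimpleMLP`; the helper names `log_softmax` and `kl_div_pointwise` and the example logits are made up for illustration. It evaluates the pointwise term that KLDivLoss is built on, `target * (log(target) - input)`, assuming binary targets and a single output unit as in the cells above.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def kl_div_pointwise(log_pred, target):
    """Pointwise KL term target * (log(target) - log_pred), with 0 * log(0) taken as 0."""
    safe_target = np.where(target > 0, target, 1.0)  # log(1) = 0 where target == 0
    return target * (np.log(safe_target) - log_pred)

targets = np.array([[0.0], [1.0], [1.0]])   # binary labels, one column (assumed)
logits = np.array([[-2.3], [0.7], [5.1]])   # arbitrary single-unit outputs (assumed)

# Case 1: LogSoftmax over a single unit. Softmax of one value is 1, so its log is 0.
log_probs = log_softmax(logits, axis=1)
loss_logsoftmax = kl_div_pointwise(log_probs, targets).mean()

# Case 2: ReLU outputs passed in as if they were log-probabilities.
relu_out = np.maximum(logits, 0.0)
loss_relu = kl_div_pointwise(relu_out, targets).mean()

# Scaling the ReLU outputs up makes the "loss" more negative, with no lower bound.
loss_relu_scaled = kl_div_pointwise(10 * relu_out, targets).mean()

print(loss_logsoftmax, loss_relu, loss_relu_scaled)
```

In this sketch the LogSoftmax loss is identically zero regardless of the logits, so it carries no gradient signal, while the ReLU loss keeps falling as the outputs grow. Both behaviours match the logs above.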

### 6. CosineEmbeddingLoss (`torch.nn.CosineEmbeddingLoss`)
- **Description:** Measures the cosine similarity between two input tensors, `x1` and `x2`, and computes the loss based on a label `y` that indicates whether the tensors should be similar (`y = 1`) or dissimilar (`y = -1`). Cosine similarity focuses on the angle between vectors, disregarding their magnitude.

- **Mathematical Function:**
\begin{equation}
  \text{CosineEmbeddingLoss}(x_1, x_2, y) =
  \begin{cases}
  1 - \cos(x_1, x_2), & \text{if } y = 1 \\
  \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
  \end{cases}
\end{equation}
  where $ \cos(x_1, x_2) $ is the cosine similarity between the two vectors, and `margin` is a threshold between -1 and 1 (0 by default in PyTorch): dissimilar pairs whose cosine similarity is already below `margin` contribute no loss.

- **Use Case:** Commonly used in tasks like face verification, image similarity, and other scenarios where the relative orientation of vectors (angle) is more important than their length, such as in embeddings and metric learning.

- **Background:** Cosine similarity compares the directional alignment of vectors, making it ideal for high-dimensional data where the magnitude may not be as informative. This loss is particularly useful when training models to learn meaningful embeddings that capture semantic similarity.

You'll become more familiar with this loss function later in the course.
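
To make the piecewise formula above concrete, here is a minimal NumPy sketch of it. It is not PyTorch's implementation: the function name and the example vectors are chosen for illustration, the batch is averaged (mirroring a mean reduction), and there is no guard against zero-length vectors.

```python
import numpy as np

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Average the per-pair loss: 1 - cos for y == 1, max(0, cos - margin) for y == -1."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    y = np.asarray(y)
    # Row-wise cosine similarity (assumes no row has zero norm)
    cos = (x1 * x2).sum(axis=1) / (
        np.linalg.norm(x1, axis=1) * np.linalg.norm(x2, axis=1)
    )
    similar = 1.0 - cos                          # penalty when the pair should match
    dissimilar = np.maximum(0.0, cos - margin)   # penalty when the pair should differ
    return np.where(y == 1, similar, dissimilar).mean()

# Three pairs: parallel (different lengths), orthogonal, and 45 degrees apart
x1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
x2 = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

loss_similar = cosine_embedding_loss(x1, x2, [1, 1, 1])
loss_dissimilar = cosine_embedding_loss(x1, x2, [-1, -1, -1])
loss_dissimilar_margin = cosine_embedding_loss(x1, x2, [-1, -1, -1], margin=0.5)

print(loss_similar, loss_dissimilar, loss_dissimilar_margin)
```

The first pair has different magnitudes but the same direction, so it costs nothing when labelled similar. Raising `margin` lowers the loss for pairs labelled dissimilar, because a larger cosine similarity is tolerated before any penalty applies.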

---

# Regularization in Machine Learning

## Introduction

Regularization is a fundamental technique in machine learning that helps prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, ensuring better generalization to unseen data. In this notebook, you will explore the concepts of regularization, understand different types of regularization techniques, and apply them using Python's popular libraries.

## What is Regularization?

Regularization involves adding a regularization term to the loss function used to train machine learning models. This term imposes a constraint on the model's coefficients, effectively reducing their magnitude. By doing so, regularization helps in:

- **Preventing Overfitting:** Ensures the model does not become too tailored to the training data.
- **Improving Generalization:** Enhances the model's performance on new, unseen data.
- **Feature Selection:** Especially in L1 regularization, it can drive some coefficients to zero, effectively selecting important features.
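
One common way to write the regularized objective described above is shown below. The symbols are notation introduced here for illustration: $\mathcal{L}_{\text{data}}$ is the original loss, $w$ denotes the model's coefficients, $\Omega(w)$ is the penalty term, and $\lambda \ge 0$ is the regularization strength.

\begin{equation}
\mathcal{L}_{\text{reg}}(w) = \mathcal{L}_{\text{data}}(w) + \lambda \, \Omega(w)
\end{equation}

Setting $\lambda = 0$ recovers the unregularized loss, while larger values of $\lambda$ trade training fit for smaller coefficients.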

## Types of Regularization

There are several types of regularization techniques, each imposing different constraints on the model's parameters:

### 1. L1 Regularization (Lasso)

L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the loss function. It can lead to sparse models in which some feature coefficients are exactly zero.

### 2. L2 Regularization (Ridge)

L2 regularization adds the sum of the squared coefficients as a penalty term to the loss function. It shrinks all coefficients toward zero, penalizing large coefficients most heavily, but it does not generally set them exactly to zero.

### 3. Elastic Net

Elastic Net combines both L1 and L2 regularization penalties. It balances the benefits of both Lasso and Ridge methods, allowing for feature selection and coefficient shrinkage.
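
The three penalties can be written out in a few lines of NumPy. This is a minimal sketch, not scikit-learn's implementation: the function names, the example weight vector, and the `l1_ratio` mixing convention are assumptions made for illustration. The two shrinkage helpers are the closed-form minimizers over `v` of `0.5 * (v - w)**2` plus `lam * |v|` (L1) or `(lam / 2) * v**2` (L2), which makes the contrast between exact zeros and proportional shrinkage concrete.

```python
import numpy as np

def l1_penalty(w):
    """Sum of absolute values of the weights."""
    return np.abs(w).sum()

def l2_penalty(w):
    """Sum of squared weights."""
    return np.square(w).sum()

def elastic_net_penalty(w, l1_ratio=0.5):
    """Weighted mix of the two penalties (hypothetical mixing convention)."""
    return l1_ratio * l1_penalty(w) + (1.0 - l1_ratio) * l2_penalty(w)

def soft_threshold(w, lam):
    """L1 shrinkage step: move each weight toward 0 by lam, stopping at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """L2 shrinkage step: scale every weight by the same factor 1 / (1 + lam)."""
    return w / (1.0 + lam)

w = np.array([3.0, -0.2, 0.05, -1.5])  # example weights (assumed)

w_l1 = soft_threshold(w, lam=0.5)
w_l2 = l2_shrink(w, lam=0.5)

print("L1 penalty:", l1_penalty(w))
print("L2 penalty:", l2_penalty(w))
print("Elastic Net penalty:", elastic_net_penalty(w))
print("after L1 step:", w_l1)   # small weights become exactly zero
print("after L2 step:", w_l2)   # all weights shrink, none become zero
```

In this example the L1 step zeroes the two smallest weights, which is the feature-selection effect, while the L2 step keeps all four weights nonzero and only scales them down.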

## Homework Time!
Import the Iris dataset from `sklearn.datasets` and apply ridge (L2) regularization with different `alpha` values. Then create a GIF that shows how the classification boundary changes as `alpha` varies.

Import the libs that you need and start coding!

In [225]:
import warnings
from io import BytesIO

import imageio
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

Load the Iris dataset and select Setosa and Versicolor classes

In [223]:
# 1. Load and Prepare the Iris Dataset

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Select only Setosa and Versicolor classes
mask = (y == 0) | (y == 1)  # 0 for Setosa, 1 for Versicolor
X = X[mask]
y = y[mask]

# Select only two features: Sepal Length (column 0) and Petal Length (column 2)
X = X[:, [0, 2]]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


Define Function to Plot Decision Boundary

In [224]:
import numpy as np
import matplotlib.pyplot as plt
from io import BytesIO
from PIL import Image

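def plot_decision_boundary(model, X, y, alpha):
    """Plot the model's decision regions over the 2-D points in X and return the frame as a PIL image."""
    # Define the grid (use meshgrid)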
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, 0.01),
        np.arange(y_min, y_max, 0.01)
    )

    # Predict over the grid
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(grid)
    Z = Z.reshape(xx.shape)

    # Create a figure
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, levels=[-0.1, 0.1, 1.1], colors=['blue', 'red'])

    # Scatter plot of the training data
    scatter = ax.scatter(
        X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=50
    )

    # Title and labels
    ax.set_title(f'MLP Decision Boundary (alpha={alpha})')
    ax.set_xlabel('Sepal Length (standardized)')
    ax.set_ylabel('Petal Length (standardized)')

    # Hide tick marks for a cleaner frame (the axis labels stay visible)
    ax.set_xticks([])
    ax.set_yticks([])

    # Tight layout
    fig.tight_layout()

    # Render the figure to an in-memory PNG and return it as a PIL image
    buf = BytesIO()
    fig.savefig(buf, format='png')
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf)


Train MLP with Varying Alpha Values and Collect Images

In [226]:
def create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons):
    """Train one MLP per alpha, render each decision boundary, and save the frames as a GIF."""
    # List to store images
    images = []

    for idx, alpha in enumerate(alpha_values):
        print(f"Processing alpha={alpha:.4f} ({idx + 1}/{len(alpha_values)})")

        # Create and train the MLP
        mlp = MLPClassifier(
            hidden_layer_sizes=(n_neurons,),
            alpha=alpha,
            max_iter=1000,
            random_state=42
        )
        mlp.fit(X_train, y_train)

        # Plot decision boundary and get the image
        img = plot_decision_boundary(mlp, X_train, y_train, alpha)
        images.append(img)

    # Save the images as a GIF
    gif_filename = 'mlp_classification_boundaries.gif'
    images[0].save(
        gif_filename,
        save_all=True,
        append_images=images[1:],
        duration=500,
        loop=0
    )

    print(f"GIF saved as '{gif_filename}'")

    # Return the GIF filename
    return gif_filename


## RUN

In [228]:
# Use np.logspace to generate alpha values, with at least 20 values
alpha_values = np.logspace(-3, 3, 20)

n_neurons = 10

# Create the decision boundary GIF
gif_dir = create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons)


Processing alpha=0.0010 (1/20)
Processing alpha=0.0021 (2/20)
Processing alpha=0.0043 (3/20)
Processing alpha=0.0089 (4/20)
Processing alpha=0.0183 (5/20)
Processing alpha=0.0379 (6/20)
Processing alpha=0.0785 (7/20)
Processing alpha=0.1624 (8/20)
Processing alpha=0.3360 (9/20)
Processing alpha=0.6952 (10/20)
Processing alpha=1.4384 (11/20)
Processing alpha=2.9764 (12/20)
Processing alpha=6.1585 (13/20)
Processing alpha=12.7427 (14/20)
Processing alpha=26.3665 (15/20)
Processing alpha=54.5559 (16/20)
Processing alpha=112.8838 (17/20)
Processing alpha=233.5721 (18/20)
Processing alpha=483.2930 (19/20)
Processing alpha=1000.0000 (20/20)
GIF saved as 'mlp_classification_boundaries.gif'
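
The alpha values printed above come from `np.logspace(-3, 3, 20)`, which spaces points evenly on a log scale rather than a linear one. The short check below is only an illustration of that spacing (the variable names `ratios` and `step` are introduced here): each alpha is the previous one multiplied by the same constant factor.

```python
import numpy as np

alpha_values = np.logspace(-3, 3, 20)  # 20 values from 10**-3 to 10**3

# Ratio between consecutive values; constant for a log-spaced grid
ratios = alpha_values[1:] / alpha_values[:-1]

# Six decades split into 19 equal multiplicative steps
step = 10 ** (6 / 19)

print(f"first={alpha_values[0]:.4f}, last={alpha_values[-1]:.4f}, step={step:.4f}")
print("constant ratio:", np.allclose(ratios, step))
```

A log-spaced grid is a natural choice here because the effect of the regularization strength tends to change over orders of magnitude, so a linear grid would spend most of its points in the heavily regularized range.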


Your gif should look like this:

<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries_example.gif)

*Figure 1: Demonstration of classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>

