# **Deep Learning Course**

## **Loss Functions and Multilayer Perceptrons (MLP)**

---

### **Student Information:**

- **Name:** *Ehsan Merrikhi*
- **Student Number:** *400101967*

---

### **Assignment Overview**

In this notebook, we will explore various loss functions used in neural networks, with a specific focus on their role in training **Multilayer Perceptrons (MLPs)**. By the end of this notebook, you will have a deeper understanding of:
- Types of loss functions
- How loss functions affect the training process
- The relationship between loss functions and model optimization in MLPs

---

### **Table of Contents**

1. Introduction to Loss Functions
2. Types of Loss Functions
3. Multilayer Perceptrons (MLP)
4. Implementing Loss Functions in MLP
5. Conclusion

---



# 1.Introduction to Loss Functions 

In deep learning, **loss functions** play a crucial role in training models by quantifying the difference between the predicted outputs and the actual targets. Selecting the appropriate loss function is essential for the success of your model. In this assay, we will explore various loss functions available in PyTorch, understand their theoretical backgrounds, and provide you with a scaffolded class to experiment with these loss functions.

Before begining, let's train a simle MLP model using the **L1Loss** function. We'll return to this model later to experiment with different loss functions. We'll start by importing the necessary libraries and defining the model architecture.

First things first, let's talk about **L1Loss**.

### 1. L1Loss (`torch.nn.L1Loss`)
- **Description:** Also known as Mean Absolute Error (MAE), L1Loss computes the average absolute difference between the predicted values and the target values.
- **Use Case:** Suitable for regression tasks where robustness to outliers is desired.

Here is the mathematical formulation of L1Loss:
\begin{equation}
\text{L1Loss} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}_i} - y_{\text{true}_i}|
\end{equation}

Let's implement a simple MLP model using the L1Loss function.

In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader, random_split
from sklearn.model_selection import train_test_split
from torch.optim import Adam
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
# Don't be courious about Adam, it's just a fancy name for a fancy optimization algorithm

Here, we'll define a class called `SimpleMLP` that inherits from `nn.Module`. This class can have multiple layers, and we'll use the `nn.Sequential` module to define the layers of the model. The model will have the following architecture:

In [11]:
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1, last_layer_activation_fn=nn.ReLU):
        super(SimpleMLP, self).__init__()
        # Define the first layer
        layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
        
        # Define additional hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        
        # Define the output layer
        layers.append(nn.Linear(hidden_dim, output_dim))
        
        # Apply the specified activation function to the output layer
        if last_layer_activation_fn is not None:
            layers.append(last_layer_activation_fn())
        
        # Combine layers into a sequential module
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # TODO: Define the forward pass of the MLP
        return self.model(x)

Now, let's define a class called `SimpleMLP_Loss` that has the following architecture:

In [13]:
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer, isFloat):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.isFloat = isFloat


    def train(self, train_loader, num_epochs):
        #TODO: Implement the training loop
        #Note: You should also print the training loss at each epoch, use tqdm for progress bar
        #Note: You should return the training loss at each epoch
        # Store the training losses for each epoch
        training_losses = []

        # Training loop
        for epoch in range(num_epochs):
            epoch_loss = 0.0
            self.model.train()  # Set model to training mode

            # Use tqdm for the progress bar
            with tqdm(train_loader, unit="batch") as tepoch:
                tepoch.set_description(f"Epoch {epoch+1}/{num_epochs}")

                for data, target in tepoch:
                    # Move data to the same device as the model
                    # data, target = data.to(self.model.device), target.to(self.model.device)
                    if self.isFloat:
                        target = target.float()
                    
                    # Zero the parameter gradients
                    self.optimizer.zero_grad()
                    
                    # Forward pass
                    output = self.model(data)
                    loss = self.criterion(output, target)
                    
                    # Backward pass and optimization
                    loss.backward()
                    self.optimizer.step()
                    
                    # Update the epoch loss
                    epoch_loss += loss.item()
                
                # Calculate average loss for the epoch
                avg_loss = epoch_loss / len(train_loader)
                training_losses.append(avg_loss)
                
                # Print the epoch loss
                tepoch.set_postfix(loss=avg_loss)
                print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

        return training_losses


    def evaluate(self, val_loader):
        #TODO: Implement the evaluation loop
        #Note: You should return the validation loss and accuracy
        self.model.eval()  # Set model to evaluation mode
        val_loss = 0.0
        correct = 0
        total = 0

        # Evaluation loop
        with torch.no_grad():  # No gradient computation during evaluation
            for data, target in val_loader:
                # Move data to the same device as the model
                # data, target = data.to(self.model.device), target.to(self.model.device)
                if self.isFloat:
                    target = target.float()
                
                # Forward pass
                output = self.model(data)
                loss = self.criterion(output, target)
                
                # Accumulate validation loss
                val_loss += loss.item()
                
                # Calculate accuracy
                _, predicted = torch.max(output, 1)
                correct += (predicted == target).sum().item()
                total += target.size(0)

        # Calculate average loss and accuracy
        avg_val_loss = val_loss / len(val_loader)
        accuracy = correct / total * 100  # Accuracy as percentage
        
        print(f"Validation Loss: {avg_val_loss:.4f}, Accuracy: {accuracy:.2f}%")
        
        return avg_val_loss, accuracy


Next, lets test our model using the L1Loss function. You'll use <span style="color:red">*Titanic Dataset*</span> to train the model.


In [4]:
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing (simple example)
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Separate features and target variable
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Convert the data to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

# Create a dataset and split it into training and validation sets
dataset = TensorDataset(X_tensor, y_tensor)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create DataLoaders for training and validation
batch_size = 2
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Define the model, criterion, and optimizer
input_dim = X.shape[1]         # Number of features (4 in this case)
hidden_dim = 16                # Hidden layer size (example value)
output_dim = 2                 # Output classes (survived or not)

optimizer = optim.Adam(model.parameters(), lr=0.001)

# # Move model to device if GPU is available
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model = model.to(device)

<div style="text-align: center;"> <span style="color:red; font-size: 26px; font-weight: bold;">Let's train!</span> </div>


In [14]:
from torch.nn import L1Loss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)

# Define criterion
criterion = L1Loss()

# Define the number of epochs
num_epochs = 20

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=False)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, val_accuracy = trainer.evaluate(val_loader)

print(f"Final Validation Loss: {val_loss:.4f}")
print(f"Final Validation Accuracy: {val_accuracy:.2f}%")

  return F.l1_loss(input, target, reduction=self.reduction)
  return F.l1_loss(input, target, reduction=self.reduction)
Epoch 1/20: 100%|██████████| 286/286 [00:00<00:00, 1390.76batch/s]


Epoch [1/20], Loss: 0.4673


Epoch 2/20: 100%|██████████| 286/286 [00:00<00:00, 1499.79batch/s]


Epoch [2/20], Loss: 0.4623


Epoch 3/20: 100%|██████████| 286/286 [00:00<00:00, 1491.40batch/s]


Epoch [3/20], Loss: 0.4678


Epoch 4/20: 100%|██████████| 286/286 [00:00<00:00, 1464.42batch/s]


Epoch [4/20], Loss: 0.4645


Epoch 5/20: 100%|██████████| 286/286 [00:00<00:00, 1194.01batch/s]


Epoch [5/20], Loss: 0.4620


Epoch 6/20: 100%|██████████| 286/286 [00:00<00:00, 1483.60batch/s]


Epoch [6/20], Loss: 0.4514


Epoch 7/20: 100%|██████████| 286/286 [00:00<00:00, 1455.78batch/s]


Epoch [7/20], Loss: 0.4586


Epoch 8/20: 100%|██████████| 286/286 [00:00<00:00, 1460.90batch/s]


Epoch [8/20], Loss: 0.4596


Epoch 9/20: 100%|██████████| 286/286 [00:00<00:00, 1459.63batch/s]


Epoch [9/20], Loss: 0.4689


Epoch 10/20: 100%|██████████| 286/286 [00:00<00:00, 1446.01batch/s]


Epoch [10/20], Loss: 0.4726


Epoch 11/20: 100%|██████████| 286/286 [00:00<00:00, 1440.59batch/s]


Epoch [11/20], Loss: 0.4701


Epoch 12/20: 100%|██████████| 286/286 [00:00<00:00, 1471.02batch/s]


Epoch [12/20], Loss: 0.4559


Epoch 13/20: 100%|██████████| 286/286 [00:00<00:00, 1464.40batch/s]


Epoch [13/20], Loss: 0.4732


Epoch 14/20: 100%|██████████| 286/286 [00:00<00:00, 1437.75batch/s]


Epoch [14/20], Loss: 0.4653


Epoch 15/20: 100%|██████████| 286/286 [00:00<00:00, 1468.71batch/s]


Epoch [15/20], Loss: 0.4581


Epoch 16/20: 100%|██████████| 286/286 [00:00<00:00, 1505.77batch/s]


Epoch [16/20], Loss: 0.4624


Epoch 17/20: 100%|██████████| 286/286 [00:00<00:00, 1488.09batch/s]


Epoch [17/20], Loss: 0.4472


Epoch 18/20: 100%|██████████| 286/286 [00:00<00:00, 1480.81batch/s]


Epoch [18/20], Loss: 0.4585


Epoch 19/20: 100%|██████████| 286/286 [00:00<00:00, 1454.66batch/s]


Epoch [19/20], Loss: 0.4479


Epoch 20/20: 100%|██████████| 286/286 [00:00<00:00, 1496.45batch/s]


Epoch [20/20], Loss: 0.4573
Validation Loss: 0.4162, Accuracy: 41.26%
Final Validation Loss: 0.4162
Final Validation Accuracy: 41.26%


---
# 2. Types of Loss Functions

PyTorch offers a variety of built-in loss functions tailored for different types of problems, such as regression, classification, and more. Below, we discuss several commonly used loss functions, their theoretical foundations, and typical use cases.

### 2. MSELoss (`torch.nn.MSELoss`)
- **Description:** Mean Squared Error (MSE) calculates the average of the squares of the differences between predicted and target values.
- **Use Case:** Commonly used in regression problems where larger errors are significantly penalized.

Here is boring math stuff for MSE:
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}

<span style="color:red; font-size: 18px; font-weight: bold;">Warning:</span> Don't forget to reinitialize the model before experimenting with different loss functions.

In [15]:
from torch.nn import MSELoss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)
# Initialize MSE loss function (suitable for regression tasks)
criterion = MSELoss()

# Set the number of epochs
num_epochs = 2

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=True)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)  # Only return validation loss since accuracy may not apply for regression

print(f"Final Validation Loss (MSE): {val_loss:.4f}")

  return F.mse_loss(input, target, reduction=self.reduction)
  return F.mse_loss(input, target, reduction=self.reduction)
Epoch 1/2: 100%|██████████| 286/286 [00:00<00:00, 1467.77batch/s]


Epoch [1/2], Loss: 0.3714


Epoch 2/2: 100%|██████████| 286/286 [00:00<00:00, 1491.35batch/s]


Epoch [2/2], Loss: 0.3627
Validation Loss: 0.3199, Accuracy: 32.17%
Final Validation Loss (MSE): 0.3199


### 3. NLLLoss (`torch.nn.NLLLoss`)
- **Description:** Negative Log-Likelihood Loss measures the likelihood of the target class under the predicted probability distribution.
- **Use Case:** Typically used in multi-class classification tasks, especially when combined with `log_softmax` activation.

Here is the mathematical formulation of NLLLoss:
\begin{equation}
\text{NLLLoss} = -\frac{1}{n} \sum_{i=1}^{n} \log(y_{i})
\end{equation}

I hope you note the logarithm in the formula. It's important! 

Why?

logarithm is an increasing positive and function and applying it in the formula doesnt change the meaning of cost 

but there are many benifits in using it such as having smaller values to work with and decreasing computational cost of multiplying


In this part, run your training with Relu at last layer. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [16]:
# Run with relu activation function
from torch.nn import NLLLoss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)
# Initialize MSE loss function (suitable for regression tasks)
criterion = NLLLoss()

# Set the number of epochs
num_epochs = 20

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=False)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)  # Only return validation loss since accuracy may not apply for regression

print(f"Final Validation Loss (MSE): {val_loss:.4f}")

Epoch 1/20: 100%|██████████| 286/286 [00:00<00:00, 1622.03batch/s]


Epoch [1/20], Loss: -0.0526


Epoch 2/20: 100%|██████████| 286/286 [00:00<00:00, 1534.54batch/s]


Epoch [2/20], Loss: -0.0525


Epoch 3/20: 100%|██████████| 286/286 [00:00<00:00, 1550.43batch/s]


Epoch [3/20], Loss: -0.0524


Epoch 4/20: 100%|██████████| 286/286 [00:00<00:00, 1672.37batch/s]


Epoch [4/20], Loss: -0.0524


Epoch 5/20: 100%|██████████| 286/286 [00:00<00:00, 1648.80batch/s]


Epoch [5/20], Loss: -0.0524


Epoch 6/20: 100%|██████████| 286/286 [00:00<00:00, 1669.33batch/s]


Epoch [6/20], Loss: -0.0524


Epoch 7/20: 100%|██████████| 286/286 [00:00<00:00, 1673.82batch/s]


Epoch [7/20], Loss: -0.0528


Epoch 8/20: 100%|██████████| 286/286 [00:00<00:00, 1661.00batch/s]


Epoch [8/20], Loss: -0.0525


Epoch 9/20: 100%|██████████| 286/286 [00:00<00:00, 1648.74batch/s]


Epoch [9/20], Loss: -0.0524


Epoch 10/20: 100%|██████████| 286/286 [00:00<00:00, 1656.94batch/s]


Epoch [10/20], Loss: -0.0527


Epoch 11/20: 100%|██████████| 286/286 [00:00<00:00, 1665.68batch/s]


Epoch [11/20], Loss: -0.0524


Epoch 12/20: 100%|██████████| 286/286 [00:00<00:00, 1602.98batch/s]


Epoch [12/20], Loss: -0.0524


Epoch 13/20: 100%|██████████| 286/286 [00:00<00:00, 1635.68batch/s]


Epoch [13/20], Loss: -0.0524


Epoch 14/20: 100%|██████████| 286/286 [00:00<00:00, 1623.17batch/s]


Epoch [14/20], Loss: -0.0524


Epoch 15/20: 100%|██████████| 286/286 [00:00<00:00, 1686.63batch/s]


Epoch [15/20], Loss: -0.0529


Epoch 16/20: 100%|██████████| 286/286 [00:00<00:00, 1608.85batch/s]


Epoch [16/20], Loss: -0.0527


Epoch 17/20: 100%|██████████| 286/286 [00:00<00:00, 1351.94batch/s]


Epoch [17/20], Loss: -0.0524


Epoch 18/20: 100%|██████████| 286/286 [00:00<00:00, 1591.65batch/s]


Epoch [18/20], Loss: -0.0527


Epoch 19/20: 100%|██████████| 286/286 [00:00<00:00, 1689.66batch/s]


Epoch [19/20], Loss: -0.0525


Epoch 20/20: 100%|██████████| 286/286 [00:00<00:00, 1660.59batch/s]

Epoch [20/20], Loss: -0.0525
Validation Loss: -0.0493, Accuracy: 37.76%
Final Validation Loss (MSE): -0.0493





In [18]:
# Run with --- activation function
from torch.nn import NLLLoss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, last_layer_activation_fn=nn.Tanh)

# Initialize MSE loss function (suitable for regression tasks)
criterion = NLLLoss()

# Set the number of epochs
num_epochs = 20

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=False)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)  # Only return validation loss since accuracy may not apply for regression

print(f"Final Validation Loss (MSE): {val_loss:.4f}")

Epoch 1/20: 100%|██████████| 286/286 [00:00<00:00, 1490.89batch/s]


Epoch [1/20], Loss: -0.2143


Epoch 2/20: 100%|██████████| 286/286 [00:00<00:00, 1474.50batch/s]


Epoch [2/20], Loss: -0.2148


Epoch 3/20: 100%|██████████| 286/286 [00:00<00:00, 1432.14batch/s]


Epoch [3/20], Loss: -0.2148


Epoch 4/20: 100%|██████████| 286/286 [00:00<00:00, 1573.19batch/s]


Epoch [4/20], Loss: -0.2146


Epoch 5/20: 100%|██████████| 286/286 [00:00<00:00, 1572.93batch/s]


Epoch [5/20], Loss: -0.2148


Epoch 6/20: 100%|██████████| 286/286 [00:00<00:00, 1590.32batch/s]


Epoch [6/20], Loss: -0.2147


Epoch 7/20: 100%|██████████| 286/286 [00:00<00:00, 1608.05batch/s]


Epoch [7/20], Loss: -0.2148


Epoch 8/20: 100%|██████████| 286/286 [00:00<00:00, 1574.99batch/s]


Epoch [8/20], Loss: -0.2143


Epoch 9/20: 100%|██████████| 286/286 [00:00<00:00, 1515.60batch/s]


Epoch [9/20], Loss: -0.2145


Epoch 10/20: 100%|██████████| 286/286 [00:00<00:00, 1609.08batch/s]


Epoch [10/20], Loss: -0.2146


Epoch 11/20: 100%|██████████| 286/286 [00:00<00:00, 1588.63batch/s]


Epoch [11/20], Loss: -0.2145


Epoch 12/20: 100%|██████████| 286/286 [00:00<00:00, 1559.90batch/s]


Epoch [12/20], Loss: -0.2144


Epoch 13/20: 100%|██████████| 286/286 [00:00<00:00, 1552.62batch/s]


Epoch [13/20], Loss: -0.2145


Epoch 14/20: 100%|██████████| 286/286 [00:00<00:00, 1581.81batch/s]


Epoch [14/20], Loss: -0.2146


Epoch 15/20: 100%|██████████| 286/286 [00:00<00:00, 1595.07batch/s]


Epoch [15/20], Loss: -0.2156


Epoch 16/20: 100%|██████████| 286/286 [00:00<00:00, 1606.47batch/s]


Epoch [16/20], Loss: -0.2148


Epoch 17/20: 100%|██████████| 286/286 [00:00<00:00, 1590.93batch/s]


Epoch [17/20], Loss: -0.2143


Epoch 18/20: 100%|██████████| 286/286 [00:00<00:00, 1615.72batch/s]


Epoch [18/20], Loss: -0.2147


Epoch 19/20: 100%|██████████| 286/286 [00:00<00:00, 1550.89batch/s]


Epoch [19/20], Loss: -0.2147


Epoch 20/20: 100%|██████████| 286/286 [00:00<00:00, 1534.42batch/s]


Epoch [20/20], Loss: -0.2145
Validation Loss: -0.2096, Accuracy: 62.94%
Final Validation Loss (MSE): -0.2096


Your reason for your choice:

<div>
Choosing the Tanh activation function over ReLU for the output layer in my model likely yielded better performance because Tanh aligns more closely with the requirements of output value ranges that include both negative and positive numbers. In contrast, ReLU restricts output to non-negative values, which could be unsuitable if your task requires a broader range of output.

The symmetric output range of Tanh, spanning from -1 to 1, makes it particularly effective for tasks where normalization of output around zero is crucial.

Moreover, the properties of Tanh help maintain a more stable and effective gradient flow during training, which is beneficial for the convergence and generalization of the model. Its gradients, which are steeper for values near zero and saturate towards the extremes, help in reducing the risk of vanishing gradients—a common issue with sigmoid functions—while still providing a normalized output, unlike ReLU.
</div>


### 4. CrossEntropyLoss (`torch.nn.CrossEntropyLoss`)
- **Description:** Combines `LogSoftmax` and `NLLLoss` in one single class. It computes the cross-entropy loss between the target and the output logits.
- **Use Case:** Widely used for multi-class classification problems.

The mathematical formulation of CrossEntropyLoss is as follows:
\begin{equation}
  \text{CrossEntropy}(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log\left(\frac{e^{\hat{y}_i}}{\sum_{j=1}^{C} e^{\hat{y}_j}}\right)
\end{equation}
  where:
  - \( C \) is the number of classes,
  - \( y_i \) is a one-hot encoded target vector (or a scalar class label),
  - \( \hat{y}_i \) represents the logits (unnormalized model outputs) for each class.
  
  In practice, `torch.nn.CrossEntropyLoss` expects raw logits as input and internally applies the softmax function to convert the logits into probabilities, followed by the negative log-likelihood computation.

- **Background:** Cross-entropy measures the difference between the true distribution \( y \) and the predicted distribution \( \hat{y} \). The function minimizes the negative log-probability assigned to the correct class, effectively penalizing predictions that deviate from the true class, making it a standard choice for classification tasks in deep learning.

Now, let's implement a class called `SimpleMLP_Loss` that has the following architecture:


In [20]:
# Run with relu activation function
from torch.nn import CrossEntropyLoss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)
# Initialize MSE loss function (suitable for regression tasks)
criterion = CrossEntropyLoss()

# Set the number of epochs
num_epochs = 20

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=False)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)  # Only return validation loss since accuracy may not apply for regression

print(f"Final Validation Loss (MSE): {val_loss:.4f}")

Epoch 1/20: 100%|██████████| 286/286 [00:00<00:00, 1443.08batch/s]


Epoch [1/20], Loss: 0.7405


Epoch 2/20: 100%|██████████| 286/286 [00:00<00:00, 1346.91batch/s]


Epoch [2/20], Loss: 0.7404


Epoch 3/20: 100%|██████████| 286/286 [00:00<00:00, 1519.34batch/s]


Epoch [3/20], Loss: 0.7402


Epoch 4/20: 100%|██████████| 286/286 [00:00<00:00, 1552.39batch/s]


Epoch [4/20], Loss: 0.7404


Epoch 5/20: 100%|██████████| 286/286 [00:00<00:00, 1550.49batch/s]


Epoch [5/20], Loss: 0.7402


Epoch 6/20: 100%|██████████| 286/286 [00:00<00:00, 1333.33batch/s]


Epoch [6/20], Loss: 0.7402


Epoch 7/20: 100%|██████████| 286/286 [00:00<00:00, 1578.52batch/s]


Epoch [7/20], Loss: 0.7401


Epoch 8/20: 100%|██████████| 286/286 [00:00<00:00, 1551.21batch/s]


Epoch [8/20], Loss: 0.7404


Epoch 9/20: 100%|██████████| 286/286 [00:00<00:00, 1589.72batch/s]


Epoch [9/20], Loss: 0.7400


Epoch 10/20: 100%|██████████| 286/286 [00:00<00:00, 1587.18batch/s]


Epoch [10/20], Loss: 0.7404


Epoch 11/20: 100%|██████████| 286/286 [00:00<00:00, 1542.96batch/s]


Epoch [11/20], Loss: 0.7403


Epoch 12/20: 100%|██████████| 286/286 [00:00<00:00, 1551.85batch/s]


Epoch [12/20], Loss: 0.7405


Epoch 13/20: 100%|██████████| 286/286 [00:00<00:00, 1578.04batch/s]


Epoch [13/20], Loss: 0.7404


Epoch 14/20: 100%|██████████| 286/286 [00:00<00:00, 1549.66batch/s]


Epoch [14/20], Loss: 0.7405


Epoch 15/20: 100%|██████████| 286/286 [00:00<00:00, 1593.93batch/s]


Epoch [15/20], Loss: 0.7404


Epoch 16/20: 100%|██████████| 286/286 [00:00<00:00, 1565.02batch/s]


Epoch [16/20], Loss: 0.7402


Epoch 17/20: 100%|██████████| 286/286 [00:00<00:00, 1580.86batch/s]


Epoch [17/20], Loss: 0.7404


Epoch 18/20: 100%|██████████| 286/286 [00:00<00:00, 1538.20batch/s]


Epoch [18/20], Loss: 0.7406


Epoch 19/20: 100%|██████████| 286/286 [00:00<00:00, 1527.72batch/s]


Epoch [19/20], Loss: 0.7404


Epoch 20/20: 100%|██████████| 286/286 [00:00<00:00, 1542.76batch/s]


Epoch [20/20], Loss: 0.7405
Validation Loss: 0.7409, Accuracy: 23.78%
Final Validation Loss (MSE): 0.7409



### 5. KLDivLoss (`torch.nn.KLDivLoss`)
- **Description:** Kullback-Leibler Divergence Loss measures how one probability distribution diverges from a second, reference distribution. Unlike other loss functions that focus on classification, KL divergence specifically compares the relative entropy between two distributions. It quantifies the information loss when using the predicted distribution to approximate the true distribution. 

- **Mathematical Function:**
\begin{equation}
  \text{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \left( \log P(i) - \log Q(i) \right)
\end{equation}
  where:
  - \( P \) is the target (true) probability distribution,
  - \( Q \) is the predicted distribution (often the output of `log_softmax`),
  - \( C \) is the number of classes.

  KL divergence is always non-negative, and it equals zero if the two distributions are identical. The loss function expects the model's output to be in the form of log-probabilities (using `log_softmax`) and compares this against a target probability distribution, which is typically a normalized distribution (using softmax).

- **Use Case:** KLDivLoss is frequently used in:
  - **Variational Autoencoders (VAEs):** In VAEs, KL divergence is used to measure how much the learned latent space distribution deviates from a prior distribution (often Gaussian).
  - **Knowledge Distillation:** In teacher-student models, KL divergence is used to transfer the "soft" knowledge from a teacher model to a student model by comparing their output probability distributions.
  - **Reinforcement Learning:** It can be used to update policies while minimizing the divergence from a previous policy.

- **Background:** Kullback-Leibler divergence, a core concept in information theory, measures the inefficiency of assuming the predicted distribution \( Q \) when the true distribution is \( P \). It is asymmetric, meaning that \( KL(P \parallel Q) \neq KL(Q \parallel P) \), so the direction of the comparison matters.

Again, in this part, run your training with Relu at last layer. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [21]:
# Run with relu activation function
from torch.nn import KLDivLoss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)
# Initialize KLDiv loss function
criterion = KLDivLoss(reduction='batchmean')  # 'batchmean' is a commonly used reduction for KLDiv

# Set the number of epochs
num_epochs = 20

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=True)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)  # Only return validation loss since accuracy may not apply for KLDivLoss

print(f"Final Validation Loss (KLDiv): {val_loss:.4f}")


Epoch 1/20: 100%|██████████| 286/286 [00:00<00:00, 1431.31batch/s]


Epoch [1/20], Loss: -0.0503


Epoch 2/20: 100%|██████████| 286/286 [00:00<00:00, 1533.66batch/s]


Epoch [2/20], Loss: -0.0521


Epoch 3/20: 100%|██████████| 286/286 [00:00<00:00, 1470.16batch/s]


Epoch [3/20], Loss: -0.0419


Epoch 4/20: 100%|██████████| 286/286 [00:00<00:00, 1546.43batch/s]


Epoch [4/20], Loss: -0.0472


Epoch 5/20: 100%|██████████| 286/286 [00:00<00:00, 1514.83batch/s]


Epoch [5/20], Loss: -0.0502


Epoch 6/20: 100%|██████████| 286/286 [00:00<00:00, 1504.81batch/s]


Epoch [6/20], Loss: -0.0486


Epoch 7/20: 100%|██████████| 286/286 [00:00<00:00, 1548.25batch/s]


Epoch [7/20], Loss: -0.0492


Epoch 8/20: 100%|██████████| 286/286 [00:00<00:00, 1516.14batch/s]


Epoch [8/20], Loss: -0.0437


Epoch 9/20: 100%|██████████| 286/286 [00:00<00:00, 1478.69batch/s]


Epoch [9/20], Loss: -0.0576


Epoch 10/20: 100%|██████████| 286/286 [00:00<00:00, 1472.48batch/s]


Epoch [10/20], Loss: -0.0484


Epoch 11/20: 100%|██████████| 286/286 [00:00<00:00, 1492.76batch/s]


Epoch [11/20], Loss: -0.0469


Epoch 12/20: 100%|██████████| 286/286 [00:00<00:00, 1507.02batch/s]


Epoch [12/20], Loss: -0.0466


Epoch 13/20: 100%|██████████| 286/286 [00:00<00:00, 1471.83batch/s]


Epoch [13/20], Loss: -0.0480


Epoch 14/20: 100%|██████████| 286/286 [00:00<00:00, 1520.56batch/s]


Epoch [14/20], Loss: -0.0476


Epoch 15/20: 100%|██████████| 286/286 [00:00<00:00, 1358.64batch/s]


Epoch [15/20], Loss: -0.0516


Epoch 16/20: 100%|██████████| 286/286 [00:00<00:00, 1455.13batch/s]


Epoch [16/20], Loss: -0.0457


Epoch 17/20: 100%|██████████| 286/286 [00:00<00:00, 1510.74batch/s]


Epoch [17/20], Loss: -0.0454


Epoch 18/20: 100%|██████████| 286/286 [00:00<00:00, 1488.31batch/s]


Epoch [18/20], Loss: -0.0458


Epoch 19/20: 100%|██████████| 286/286 [00:00<00:00, 1523.75batch/s]


Epoch [19/20], Loss: -0.0449


Epoch 20/20: 100%|██████████| 286/286 [00:00<00:00, 1516.82batch/s]


Epoch [20/20], Loss: -0.0445
Validation Loss: -0.0537, Accuracy: 70.63%
Final Validation Loss (KLDiv): -0.0537


In [22]:
# Run with --- activation function
from torch.nn import KLDivLoss

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,last_layer_activation_fn= nn.Tanh)
# Initialize KLDiv loss function
criterion = KLDivLoss(reduction='batchmean')  # 'batchmean' is a commonly used reduction for KLDiv

# Set the number of epochs
num_epochs = 20

# Initialize the trainer with the model, criterion, and optimizer
trainer = SimpleMLPTrainer(model, criterion, optimizer, isFloat=True)

# Train the model
training_losses = trainer.train(train_loader, num_epochs)

# Evaluate the model
val_loss, _ = trainer.evaluate(val_loader)  # Only return validation loss since accuracy may not apply for KLDivLoss

print(f"Final Validation Loss (KLDiv): {val_loss:.4f}")


Epoch 1/20: 100%|██████████| 286/286 [00:00<00:00, 1470.05batch/s]


Epoch [1/20], Loss: -0.1162


Epoch 2/20: 100%|██████████| 286/286 [00:00<00:00, 1446.36batch/s]


Epoch [2/20], Loss: -0.1075


Epoch 3/20: 100%|██████████| 286/286 [00:00<00:00, 1468.26batch/s]


Epoch [3/20], Loss: -0.1093


Epoch 4/20: 100%|██████████| 286/286 [00:00<00:00, 1485.32batch/s]


Epoch [4/20], Loss: -0.1064


Epoch 5/20: 100%|██████████| 286/286 [00:00<00:00, 1437.46batch/s]


Epoch [5/20], Loss: -0.1103


Epoch 6/20: 100%|██████████| 286/286 [00:00<00:00, 1406.47batch/s]


Epoch [6/20], Loss: -0.1117


Epoch 7/20: 100%|██████████| 286/286 [00:00<00:00, 1316.86batch/s]


Epoch [7/20], Loss: -0.1075


Epoch 8/20: 100%|██████████| 286/286 [00:00<00:00, 1486.10batch/s]


Epoch [8/20], Loss: -0.1143


Epoch 9/20: 100%|██████████| 286/286 [00:00<00:00, 1461.37batch/s]


Epoch [9/20], Loss: -0.1091


Epoch 10/20: 100%|██████████| 286/286 [00:00<00:00, 1453.86batch/s]


Epoch [10/20], Loss: -0.0956


Epoch 11/20: 100%|██████████| 286/286 [00:00<00:00, 1466.02batch/s]


Epoch [11/20], Loss: -0.1244


Epoch 12/20: 100%|██████████| 286/286 [00:00<00:00, 1471.10batch/s]


Epoch [12/20], Loss: -0.1021


Epoch 13/20: 100%|██████████| 286/286 [00:00<00:00, 1453.99batch/s]


Epoch [13/20], Loss: -0.1180


Epoch 14/20: 100%|██████████| 286/286 [00:00<00:00, 1409.37batch/s]


Epoch [14/20], Loss: -0.0967


Epoch 15/20: 100%|██████████| 286/286 [00:00<00:00, 1448.23batch/s]


Epoch [15/20], Loss: -0.1115


Epoch 16/20: 100%|██████████| 286/286 [00:00<00:00, 1484.19batch/s]


Epoch [16/20], Loss: -0.1117


Epoch 17/20: 100%|██████████| 286/286 [00:00<00:00, 1470.23batch/s]


Epoch [17/20], Loss: -0.1123


Epoch 18/20: 100%|██████████| 286/286 [00:00<00:00, 1479.42batch/s]


Epoch [18/20], Loss: -0.0872


Epoch 19/20: 100%|██████████| 286/286 [00:00<00:00, 1461.86batch/s]


Epoch [19/20], Loss: -0.1199


Epoch 20/20: 100%|██████████| 286/286 [00:00<00:00, 1455.27batch/s]


Epoch [20/20], Loss: -0.1004
Validation Loss: -0.1532, Accuracy: 64.34%
Final Validation Loss (KLDiv): -0.1532


Your reason for your choice:

<div>
**Your answer here**
</div>

### 6. CosineEmbeddingLoss (`torch.nn.CosineEmbeddingLoss`)
- **Description:** Measures the cosine similarity between two input tensors, `x1` and `x2`, and computes the loss based on a label `y` that indicates whether the tensors should be similar (`y = 1`) or dissimilar (`y = -1`). Cosine similarity focuses on the angle between vectors, disregarding their magnitude.

- **Mathematical Function:** 
\begin{equation}
  \text{CosineEmbeddingLoss}(x1, x2, y) = 
  \begin{cases} 
  1 - \cos(x_1, x_2), & \text{if } y = 1 \\
  \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
  \end{cases}
\end{equation}
  where $ \cos(x_1, x_2) $ is the cosine similarity between the two vectors, and `margin` is a threshold that determines how dissimilar the vectors should be.

- **Use Case:** Commonly used in tasks like face verification, image similarity, and other scenarios where the relative orientation of vectors (angle) is more important than their length, such as in embeddings and metric learning.

- **Background:** Cosine similarity compares the directional alignment of vectors, making it ideal for high-dimensional data where the magnitude may not be as informative. This loss is particularly useful when training models to learn meaningful embeddings that capture semantic similarity.

You'll become more fimiliar with this loss function in future.

---

# Regularization in Machine Learning

## Introduction

Regularization is a fundamental technique in machine learning that helps prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, ensuring better generalization to unseen data. In this notebook, you will explore the concepts of regularization, understand different types of regularization techniques, and apply them using Python's popular libraries.

## What is Regularization?

Regularization involves adding a regularization term to the loss function used to train machine learning models. This term imposes a constraint on the model's coefficients, effectively reducing their magnitude. By doing so, regularization helps in:

- **Preventing Overfitting:** Ensures the model does not become too tailored to the training data.
- **Improving Generalization:** Enhances the model's performance on new, unseen data.
- **Feature Selection:** Especially in L1 regularization, it can drive some coefficients to zero, effectively selecting important features.

## Types of Regularization

There are several types of regularization techniques, each imposing different constraints on the model's parameters:

### 1. L1 Regularization (Lasso)

L1 regularization adds the absolute value of the magnitude of coefficients as a penalty term to the loss function. It can lead to sparse models where some feature coefficients are exactly zero.

### 2. L2 Regularization (Ridge)

L2 regularization adds the squared magnitude of coefficients as a penalty term to the loss function. It tends to shrink the coefficients evenly but does not set them to zero.

### 3. Elastic Net

Elastic Net combines both L1 and L2 regularization penalties. It balances the benefits of both Lasso and Ridge methods, allowing for feature selection and coefficient shrinkage.

## Homework Time!
Import Iris dataset from sklearn.datasets and apply ridge regression with different alpha values. Then, create a gif that shows the changes of the classification boundary with respect to alpha values.

Import the libs that you need and start coding!

In [25]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from PIL import Image
from io import BytesIO
import imageio
import warnings


# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

Load the Iris dataset and select Setosa and Versicolor classes

In [36]:
# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Select only two classes for binary classification (Setosa and Versicolor)
# Note: Setosa is labeled as 0 and Versicolor as 1 in the dataset
class_indices = y < 2
X = X[class_indices]
y = y[class_indices]

# 3. Select two features for 2D visualization (Sepal Length and Petal Length)
# Feature indices: Sepal Length (0), Petal Length (2)
X = X[:, [0, 2]]

# 4. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Define Function to Plot Decision Boundary

In [37]:
def plot_decision_boundary(model, X, y, alpha):
    ##############################################
    # Define the grid (use meshgrid)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    
    # Predict over the grid
    grid = np.c_[xx.ravel(), yy.ravel()]
    grid_tensor = torch.tensor(grid, dtype=torch.float32)
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        predictions = model(grid_tensor).argmax(dim=1)
    Z = predictions.numpy().reshape(xx.shape)
    ##############################################

    # Create a figure
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, levels=[-0.1, 0.1, 1.1], colors=['blue', 'red'])

    # Scatter plot of the training data
    scatter = ax.scatter(
        X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=50
    )

    # Title and labels
    ax.set_title(f'MLP Decision Boundary (alpha={alpha})')
    ax.set_xlabel('Sepal Length (standardized)')
    ax.set_ylabel('Petal Length (standardized)')

    # Remove axes for clarity
    ax.set_xticks([])
    ax.set_yticks([])

    # Tight layout
    plt.tight_layout()

    # Save the plot to a BytesIO object
    buf = BytesIO()
    plt.savefig(buf, format='png')
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf)

Train MLP with Varying Alpha Values and Collect Images

In [42]:
def create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons):

    # List to store images
    images = []

    for idx, alpha in enumerate(alpha_values):
        print(f"Processing alpha={alpha:.4f} ({idx + 1}/{len(alpha_values)})")

        # Create and train the MLP
        ####################################
        model = SimpleMLP(input_dim=2, hidden_dim=n_neurons, output_dim=2)  # Assuming no activation on output
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=alpha)

        # Fit the model
        X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
        y_train_tensor = torch.tensor(y_train, dtype=torch.long)
        model.train()
        for epoch in range(1000):
            optimizer.zero_grad()
            outputs = model(X_train_tensor)
            loss = criterion(outputs, y_train_tensor)
            loss.backward()
            optimizer.step()
        #####################################

        # Plot decision boundary and get the image
        img = plot_decision_boundary(model, X_train, y_train, alpha)
        images.append(img)

    # Save the images as a GIF
    gif_filename = 'mlp_classification_boundaries.gif'
    images[0].save(
        gif_filename,
        save_all=True,
        append_images=images[1:],
        duration=500,
        loop=0
    )

    print(f"GIF saved as '{gif_filename}'")

    # return the gif
    return gif_filename

## RUN

In [46]:

# Use np.logspace to generate alpha values, with at least 20 values
alpha_values = np.logspace(-3, 3, 20)
# Define the number of neurons in the hidden layer
n_neurons =  10

# Create the decision boundary GIF
gif_dir = create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons)

Processing alpha=0.0010 (1/20)
Processing alpha=0.0021 (2/20)
Processing alpha=0.0043 (3/20)
Processing alpha=0.0089 (4/20)
Processing alpha=0.0183 (5/20)
Processing alpha=0.0379 (6/20)
Processing alpha=0.0785 (7/20)
Processing alpha=0.1624 (8/20)
Processing alpha=0.3360 (9/20)
Processing alpha=0.6952 (10/20)
Processing alpha=1.4384 (11/20)
Processing alpha=2.9764 (12/20)
Processing alpha=6.1585 (13/20)
Processing alpha=12.7427 (14/20)
Processing alpha=26.3665 (15/20)
Processing alpha=54.5559 (16/20)
Processing alpha=112.8838 (17/20)
Processing alpha=233.5721 (18/20)
Processing alpha=483.2930 (19/20)
Processing alpha=1000.0000 (20/20)
GIF saved as 'mlp_classification_boundaries.gif'


Your gif should look like this:

<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries_example.gif)

*Figure 1: Demonstration of classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>

