# **Deep Learning Course**

## **Loss Functions and Multilayer Perceptrons (MLP)**

---

### **Student Information:**

- **Name:** *Pantea Amoie*
- **Student Number:** *400101656*

---

### **Assignment Overview**

In this notebook, we will explore various loss functions used in neural networks, with a specific focus on their role in training **Multilayer Perceptrons (MLPs)**. By the end of this notebook, you will have a deeper understanding of:
- Types of loss functions
- How loss functions affect the training process
- The relationship between loss functions and model optimization in MLPs

---

### **Table of Contents**

1. Introduction to Loss Functions
2. Types of Loss Functions
3. Multilayer Perceptrons (MLP)
4. Implementing Loss Functions in MLP
5. Conclusion

---



# 1. Introduction to Loss Functions

In deep learning, **loss functions** play a crucial role in training models by quantifying the difference between the predicted outputs and the actual targets. Selecting the appropriate loss function is essential for the success of your model. In this assignment, we will explore various loss functions available in PyTorch, understand their theoretical backgrounds, and provide you with a scaffolded class to experiment with these loss functions.

Before beginning, let's train a simple MLP model using the **L1Loss** function. We'll return to this model later to experiment with different loss functions. We'll start by importing the necessary libraries and defining the model architecture.

First things first, let's talk about **L1Loss**.

### 1. L1Loss (`torch.nn.L1Loss`)
- **Description:** Also known as Mean Absolute Error (MAE), L1Loss computes the average absolute difference between the predicted values and the target values.
- **Use Case:** Suitable for regression tasks where robustness to outliers is desired.

Here is the mathematical formulation of L1Loss:
\begin{equation}
\text{L1Loss} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}_i} - y_{\text{true}_i}|
\end{equation}
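
To make the formula concrete, here is a minimal pure-Python sketch of the same average. The helper name `mean_absolute_error` and the sample values are made up for illustration; with its default `reduction='mean'`, `torch.nn.L1Loss` computes this quantity over all elements.

```python
def mean_absolute_error(y_pred, y_true):
    """Average absolute difference between predictions and targets."""
    assert len(y_pred) == len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

# Illustrative values: sigmoid-like predictions against binary targets.
y_pred = [0.9, 0.2, 0.6, 0.4]
y_true = [1.0, 0.0, 1.0, 0.0]

mae = mean_absolute_error(y_pred, y_true)
print(f"MAE = {mae:.3f}")  # the absolute errors are 0.1, 0.2, 0.4, 0.4
```

Each sample contributes its error linearly, so one badly wrong prediction shifts the average no more than several mildly wrong ones with the same total error.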

Let's implement a simple MLP model using the L1Loss function.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader, random_split
from sklearn.model_selection import train_test_split
from torch.optim import Adam
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
# Don't be curious about Adam, it's just a fancy name for a fancy optimization algorithm

Here, we'll define a class called `SimpleMLP` that inherits from `nn.Module`. This class can have multiple layers, and we'll use the `nn.Sequential` module to define the layers of the model. The model will have the following architecture:

In [None]:
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1, last_layer_activation_fn=nn.ReLU):
        super(SimpleMLP, self).__init__()
        # Define the layers of the MLP
        layers = []
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(hidden_dim, output_dim))
        if last_layer_activation_fn is not None:
            layers.append(last_layer_activation_fn())
        self.model = nn.Sequential(*layers)


    def forward(self, x):
        # Define the forward pass of the MLP
        return self.model(x)
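
As a quick sanity check on the architecture, the following sketch lists the `(in_features, out_features)` pair of each `nn.Linear` layer that the constructor above would create, and counts the trainable parameters. It is plain Python and does not build a model; the helper names `mlp_layer_shapes` and `count_parameters` are hypothetical, and the dimensions are the ones used for the Titanic experiments in this notebook (4 input features, 64 hidden units, 2 hidden layers).

```python
def mlp_layer_shapes(input_dim, hidden_dim, output_dim, num_hidden_layers=1):
    """(in_features, out_features) of each Linear layer, in order."""
    shapes = [(input_dim, hidden_dim)]
    shapes += [(hidden_dim, hidden_dim)] * (num_hidden_layers - 1)
    shapes.append((hidden_dim, output_dim))
    return shapes

def count_parameters(shapes):
    """Weights (in * out) plus biases (out), summed over all Linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

shapes = mlp_layer_shapes(input_dim=4, hidden_dim=64, output_dim=1, num_hidden_layers=2)
print(shapes)                    # [(4, 64), (64, 64), (64, 1)]
print(count_parameters(shapes))  # 320 + 4160 + 65 = 4545
```

The activation layers (`nn.ReLU`, `nn.Sigmoid`) add no parameters, so they do not appear in the count.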

Now, let's define a class called `SimpleMLPTrainer` that wraps the training and evaluation loops:

In [None]:
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        # Implement the training loop
        # Note: You should also print the training loss at each epoch, use tqdm for progress bar
        # Note: You should return the training loss at each epoch

        self.model.train()
        losses = []
        for epoch in range(num_epochs):
            current_loss = 0
            for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
                inputs, targets = batch
                self.optimizer.zero_grad()

                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
                loss.backward()
                self.optimizer.step()
                current_loss += loss.item()

            avg_loss = current_loss / len(train_loader)
            losses.append(avg_loss)
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

        return losses


    def evaluate(self, val_loader):
        # Implement the evaluation loop
        # Note: You should return the validation loss and accuracy

        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0

        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                outputs = self.model(X_batch)
                loss = self.criterion(outputs, y_batch)
                total_loss += loss.item()
                # For the third part of the question (when the number of output neurons is 2)
                if outputs.shape[1] == 2:
                    predictions = torch.argmax(outputs, dim=1)
                    y_batch = torch.argmax(y_batch, dim=1) if y_batch.dim() > 1 else y_batch
                else:
                    # Single sigmoid output: threshold at 0.5. reshape(-1) keeps a 1-D
                    # tensor even when a batch holds a single sample
                    predictions = (outputs >= 0.5).float().reshape(-1)
                    y_batch = y_batch.reshape(-1)
                correct += (predictions == y_batch).sum().item()
                total += y_batch.size(0)

        val_loss = total_loss / len(val_loader)
        accuracy = correct / total
        return val_loss, accuracy
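
Both `train` and `evaluate` report the mean of the per-batch losses (`total / len(loader)`). When the last batch is smaller than the others, this differs slightly from the mean over all samples. The sketch below illustrates the effect with made-up per-sample losses, assuming a mean-reduced criterion; `batch_means` is a hypothetical helper, not part of the trainer.

```python
def batch_means(values, batch_size):
    """Mean loss of each consecutive batch, as a mean-reduced criterion would return."""
    batches = [values[i:i + batch_size] for i in range(0, len(values), batch_size)]
    return [sum(b) / len(b) for b in batches]

# Hypothetical per-sample losses: 5 samples, batch size 2 -> batches of 2, 2 and 1.
sample_losses = [0.2, 0.4, 0.1, 0.3, 1.0]

per_batch = batch_means(sample_losses, batch_size=2)
mean_of_batch_means = sum(per_batch) / len(per_batch)       # what the trainer reports
mean_over_samples = sum(sample_losses) / len(sample_losses)

print(mean_of_batch_means, mean_over_samples)
```

Here the single-sample final batch counts as much as a full batch, which pulls the reported value up. With 18 batches of up to 32 samples the gap is small; weighting each batch loss by its batch size would remove it.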


Next, let's test our model using the L1Loss function. We'll use the <span style="color:red">*Titanic Dataset*</span> to train the model.


In [None]:
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing (simple example)
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Convert the data to PyTorch tensors and create a DataLoader
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

# We can scale the data
# Without scaling we got an accuracy of around 62%
# But after scaling, we observe around 15% increase!
scaler = StandardScaler()
X = scaler.fit_transform(X)

X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).unsqueeze(1)


# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Define the model, criterion, and optimizer
input_dim = X.shape[1]
hidden_dim = 64
output_dim = 1
num_hidden_layers = 2


model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=nn.Sigmoid)
optimizer = Adam(model.parameters(), lr=0.001)
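
The scaling step matters because the features live on very different scales (passenger class is 1 to 3, while fares can reach the hundreds). `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. A minimal pure-Python sketch of that transformation for one column, with made-up fare values and a hypothetical helper name `standardize`:

```python
import math

def standardize(column):
    """Z-score a list of numbers using the population standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in column]

# Made-up fares, on a much larger scale than a 1-3 passenger class.
fares = [7.25, 71.28, 8.05, 53.10]
scaled = standardize(fares)

print([round(v, 3) for v in scaled])
```

Note that the cell above fits the scaler on the full dataset before the train/validation split. Fitting it on the training split only is the usual way to keep validation statistics out of training; this sketch does not change that.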


<div style="text-align: center;"> <span style="color:red; font-size: 26px; font-weight: bold;">Let's train!</span> </div>

In [None]:
from torch.nn import L1Loss

# Train the model
criterion = L1Loss()
trainer = SimpleMLPTrainer(model, criterion, optimizer)
trainer.train(train_loader, num_epochs=20)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 201.03it/s]


Epoch [1/20], Loss: 0.4700


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 136.37it/s]


Epoch [2/20], Loss: 0.3922


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 239.62it/s]


Epoch [3/20], Loss: 0.3055


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 315.43it/s]


Epoch [4/20], Loss: 0.2471


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 86.31it/s] 


Epoch [5/20], Loss: 0.2234


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 213.79it/s]


Epoch [6/20], Loss: 0.2127


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 301.48it/s]


Epoch [7/20], Loss: 0.2076


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 290.13it/s]


Epoch [8/20], Loss: 0.2008


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 190.72it/s]


Epoch [9/20], Loss: 0.1970


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 218.56it/s]


Epoch [10/20], Loss: 0.1943


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 139.44it/s]


Epoch [11/20], Loss: 0.1904


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 116.96it/s]


Epoch [12/20], Loss: 0.1874


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 161.19it/s]


Epoch [13/20], Loss: 0.1849


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 299.49it/s]


Epoch [14/20], Loss: 0.1853


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 280.01it/s]


Epoch [15/20], Loss: 0.1829


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 121.59it/s]


Epoch [16/20], Loss: 0.1824


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 180.12it/s]


Epoch [17/20], Loss: 0.1801


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 341.04it/s]


Epoch [18/20], Loss: 0.1813


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 160.88it/s]


Epoch [19/20], Loss: 0.1787


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 93.43it/s]

Epoch [20/20], Loss: 0.1783





[0.47002209226290387,
 0.3921545810169644,
 0.30547556032737094,
 0.24706178489658567,
 0.22343006605903307,
 0.21271600988176134,
 0.2076160336534182,
 0.20078275062971646,
 0.1970085096028116,
 0.1942973170015547,
 0.1904323668115669,
 0.18742423670159447,
 0.18488306303819022,
 0.18528479689525235,
 0.1829002220183611,
 0.1823544195956654,
 0.18006748002436426,
 0.1813151886065801,
 0.17866701053248513,
 0.17833352709809938]

In [None]:
# Evaluate the model
val_loss, accuracy = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}, Accuracy: {accuracy:.2f}")

Validation Loss: 0.2474, Accuracy: 0.74


---
# 2. Types of Loss Functions

PyTorch offers a variety of built-in loss functions tailored for different types of problems, such as regression, classification, and more. Below, we discuss several commonly used loss functions, their theoretical foundations, and typical use cases.

### 2. MSELoss (`torch.nn.MSELoss`)
- **Description:** Mean Squared Error (MSE) calculates the average of the squares of the differences between predicted and target values.
- **Use Case:** Commonly used in regression problems where larger errors are significantly penalized.

Here is boring math stuff for MSE:
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}
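
The claim that MSE penalizes larger errors more heavily can be checked with a small worked example. The sketch below uses made-up numbers and plain Python; `mse` and `mae` are illustrative helpers that mirror the two formulas in this notebook.

```python
def mse(y_pred, y_true):
    """Mean of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def mae(y_pred, y_true):
    """Mean of absolute differences."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
close = [1.5, 2.5, 3.5, 4.5]    # every prediction is off by 0.5
outlier = [1.0, 2.0, 3.0, 6.0]  # one prediction is off by 2.0, the rest are exact

print(mae(close, y_true), mse(close, y_true))      # 0.5 0.25
print(mae(outlier, y_true), mse(outlier, y_true))  # 0.5 1.0
```

Both prediction sets have the same MAE, but the single large error makes the MSE four times larger, because squaring weights each error by its own size.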

<span style="color:red; font-size: 18px; font-weight: bold;">Warning:</span> Don't forget to reinitialize the model before experimenting with different loss functions.

In [None]:
from torch.nn import MSELoss

# Train the model
model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=nn.Sigmoid)
criterion = MSELoss()
optimizer = Adam(model.parameters(), lr=0.001)
trainer = SimpleMLPTrainer(model, criterion, optimizer)
trainer.train(train_loader, num_epochs=20)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 533.05it/s]


Epoch [1/20], Loss: 0.2305


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 472.96it/s]


Epoch [2/20], Loss: 0.1890


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 537.29it/s]


Epoch [3/20], Loss: 0.1590


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 516.18it/s]


Epoch [4/20], Loss: 0.1460


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 527.37it/s]


Epoch [5/20], Loss: 0.1417


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 515.65it/s]


Epoch [6/20], Loss: 0.1377


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 477.96it/s]


Epoch [7/20], Loss: 0.1362


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 408.19it/s]


Epoch [8/20], Loss: 0.1342


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 247.25it/s]


Epoch [9/20], Loss: 0.1326


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 328.16it/s]


Epoch [10/20], Loss: 0.1332


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 313.10it/s]


Epoch [11/20], Loss: 0.1322


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 348.46it/s]


Epoch [12/20], Loss: 0.1301


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 244.50it/s]


Epoch [13/20], Loss: 0.1302


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 250.24it/s]


Epoch [14/20], Loss: 0.1292


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 310.68it/s]


Epoch [15/20], Loss: 0.1278


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 261.04it/s]


Epoch [16/20], Loss: 0.1270


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 239.57it/s]


Epoch [17/20], Loss: 0.1266


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 268.04it/s]


Epoch [18/20], Loss: 0.1266


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 292.64it/s]


Epoch [19/20], Loss: 0.1257


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 299.28it/s]

Epoch [20/20], Loss: 0.1261





[0.2305005606677797,
 0.18900182429287168,
 0.15895136280192268,
 0.1459977970355087,
 0.1417025596731239,
 0.13765139629443487,
 0.1361824195418093,
 0.13419797437058556,
 0.13257200312283304,
 0.13318195111221737,
 0.1321989618655708,
 0.130135970397128,
 0.13023176582323182,
 0.12921922239992353,
 0.1277505618830522,
 0.12695572152733803,
 0.12661020333568254,
 0.1266499672912889,
 0.12571345248983967,
 0.12613738452394804]

In [None]:
# Evaluate the model
val_loss, accuracy = trainer.evaluate(val_loader)
print(f"Validation Loss: {val_loss:.4f}, Accuracy: {accuracy:.2f}")

Validation Loss: 0.1590, Accuracy: 0.75


### 3. NLLLoss (`torch.nn.NLLLoss`)
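- **Description:** Negative Log-Likelihood Loss is the negative log-probability that the model assigns to the target class, averaged over the batch. Note that `torch.nn.NLLLoss` does not take the logarithm itself; it expects log-probabilities as input.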
- **Use Case:** Typically used in multi-class classification tasks, especially when combined with `log_softmax` activation.

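Here is the mathematical formulation of NLLLoss, where \( \hat{p}_{i, c} \) is the probability the model assigns to class \( c \) for sample \( i \), and \( y_i \) is the true class index of that sample:
\begin{equation}
\text{NLLLoss} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(\hat{p}_{i, y_i}\right)
\end{equation}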

I hope you note the logarithm in the formula. It's important!

Why?
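
One way to approach this question is to look at how the negative log behaves as the probability of the true class shrinks. The sketch below is plain Python with made-up probabilities; `nll` is an illustrative helper that takes probabilities and applies the logarithm itself, unlike `torch.nn.NLLLoss`, which expects its input to already be log-probabilities.

```python
import math

def nll(probs_of_true_class):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p) for p in probs_of_true_class) / len(probs_of_true_class)

for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p(true class) = {p:<4}  ->  -log p = {-math.log(p):.3f}")

confident_right = nll([0.9, 0.9, 0.9])
one_confident_mistake = nll([0.9, 0.9, 0.01])
print(f"{confident_right:.3f} vs {one_confident_mistake:.3f}")
```

The penalty is close to zero when the true class gets high probability and grows without bound as that probability approaches zero, so a single confidently wrong prediction dominates the average.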

In this part, run your training with ReLU at the last layer. <span style="color:red; font-weight: bold;">Discuss</span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [None]:
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing (simple example)
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Convert the data to PyTorch tensors and create a DataLoader
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

# We can scale the data
# Without scaling we got an accuracy of around 62%
# But after scaling, we observe around 15% increase!
scaler = StandardScaler()
X = scaler.fit_transform(X)

X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long).squeeze()


# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Define the model, criterion, and optimizer
input_dim = X.shape[1]
hidden_dim = 64
output_dim = 2
num_hidden_layers = 2


In [None]:
# Run with relu activation function
from torch.nn import NLLLoss

# Train the model
relu_model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=nn.ReLU)
relu_criterion = NLLLoss()
relu_optimizer = Adam(relu_model.parameters(), lr=0.001)
relu_trainer = SimpleMLPTrainer(relu_model, relu_criterion, relu_optimizer)
relu_trainer.train(train_loader, num_epochs=20)


Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 454.92it/s]


Epoch [1/20], Loss: -0.3812


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 524.29it/s]


Epoch [2/20], Loss: -1.0876


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 512.92it/s]


Epoch [3/20], Loss: -2.4118


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 475.93it/s]


Epoch [4/20], Loss: -4.6004


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 457.57it/s]


Epoch [5/20], Loss: -8.1313


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 447.61it/s]


Epoch [6/20], Loss: -13.5618


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 525.15it/s]


Epoch [7/20], Loss: -21.3654


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 434.47it/s]


Epoch [8/20], Loss: -32.2438


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 498.88it/s]


Epoch [9/20], Loss: -46.7161


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 485.80it/s]


Epoch [10/20], Loss: -65.9813


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 541.22it/s]


Epoch [11/20], Loss: -91.3280


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 499.14it/s]


Epoch [12/20], Loss: -122.4030


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 444.61it/s]


Epoch [13/20], Loss: -160.5675


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 484.23it/s]


Epoch [14/20], Loss: -205.4757


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 335.19it/s]


Epoch [15/20], Loss: -261.0701


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 358.93it/s]


Epoch [16/20], Loss: -324.1818


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 399.56it/s]


Epoch [17/20], Loss: -397.8480


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 352.83it/s]


Epoch [18/20], Loss: -484.4124


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 367.64it/s]


Epoch [19/20], Loss: -582.2153


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 384.90it/s]

Epoch [20/20], Loss: -690.6703





[-0.3811810256706344,
 -1.087620993455251,
 -2.411794937319226,
 -4.600350989235772,
 -8.131326384014553,
 -13.561817328135172,
 -21.36537233988444,
 -32.24382336934408,
 -46.71612601810031,
 -65.98131815592448,
 -91.32799127366808,
 -122.403015560574,
 -160.5674663119846,
 -205.47572326660156,
 -261.0700624254015,
 -324.1818440755208,
 -397.84795633951825,
 -484.4124281141493,
 -582.2153371175131,
 -690.6703389485677]

In [None]:
# Evaluate the model

relu_val_loss, relu_val_acc = relu_trainer.evaluate(val_loader)
print(f"Validation Loss: {relu_val_loss:.4f}, Accuracy: {relu_val_acc:.2f}")

Validation Loss: -754.2925, Accuracy: 0.61


In [None]:
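from torch.nn import LogSoftmax

# SimpleMLP calls last_layer_activation_fn() with no arguments, so wrap LogSoftmax
# in a lambda to pass dim=1 (the class dimension) explicitly
logsoftmax_model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers,
                             last_layer_activation_fn=lambda: LogSoftmax(dim=1))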
logsoftmax_criterion = NLLLoss()
logsoftmax_optimizer = Adam(logsoftmax_model.parameters(), lr=0.001)
logsoftmax_trainer = SimpleMLPTrainer(logsoftmax_model, logsoftmax_criterion, logsoftmax_optimizer)


logsoftmax_trainer.train(train_loader, num_epochs=20)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 395.74it/s]


Epoch [1/20], Loss: 0.6360


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 519.87it/s]


Epoch [2/20], Loss: 0.5148


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 547.32it/s]


Epoch [3/20], Loss: 0.4497


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 528.93it/s]


Epoch [4/20], Loss: 0.4309


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 522.72it/s]


Epoch [5/20], Loss: 0.4198


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 497.14it/s]


Epoch [6/20], Loss: 0.4147


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 481.18it/s]


Epoch [7/20], Loss: 0.4149


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 439.27it/s]


Epoch [8/20], Loss: 0.4100


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 463.89it/s]


Epoch [9/20], Loss: 0.4078


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 546.79it/s]


Epoch [10/20], Loss: 0.4057


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 554.62it/s]


Epoch [11/20], Loss: 0.4049


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 490.55it/s]


Epoch [12/20], Loss: 0.4013


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 444.09it/s]


Epoch [13/20], Loss: 0.4019


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 441.03it/s]


Epoch [14/20], Loss: 0.4002


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 436.01it/s]


Epoch [15/20], Loss: 0.3968


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 404.56it/s]


Epoch [16/20], Loss: 0.3956


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 388.15it/s]


Epoch [17/20], Loss: 0.3969


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 502.98it/s]


Epoch [18/20], Loss: 0.3957


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 478.78it/s]


Epoch [19/20], Loss: 0.3961


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 395.01it/s]


Epoch [20/20], Loss: 0.4013


[0.6359966331058078,
 0.5147975318961673,
 0.4497317141956753,
 0.4308544993400574,
 0.4198215852181117,
 0.41474545664257473,
 0.41493046780427295,
 0.410004648897383,
 0.4078483581542969,
 0.4056989981068505,
 0.4048559640844663,
 0.4012915649347835,
 0.401866614818573,
 0.4002491450972027,
 0.3967904663748211,
 0.39564831886026597,
 0.3968635035885705,
 0.39569133188989425,
 0.3961053142944972,
 0.40127600563897026]

In [None]:
logsoftmax_val_loss, logsoftmax_val_acc = logsoftmax_trainer.evaluate(val_loader)
print(f"Validation Loss: {logsoftmax_val_loss:.4f}, Accuracy: {logsoftmax_val_acc:.2f}")

Validation Loss: 0.4816, Accuracy: 0.76


First, we used ReLU as the activation function of the last layer. We had to change `output_dim` from 1 to 2: the two output nodes hold one score per class (Survived = 0 or 1), which is the layout NLLLoss expects. With a single output node, the model would produce one scalar per sample, which NLLLoss cannot index by class.\
We also had to change the data type and shape of the `y` tensor. NLLLoss requires the target to be a 1D tensor of class indices (0 or 1) with dtype `torch.long`, so we create `y` as `torch.long` and call `.squeeze()` to keep it 1D. Previously, `y` was a 2D `torch.float32` tensor (via `.unsqueeze(1)`), which suits regression-style losses on a single output but is incompatible with NLLLoss.\
With these changes the model trains without errors, but it produces a large, steadily decreasing negative loss and low validation accuracy (0.61). The cause is the combination of NLLLoss with a ReLU output layer:
- NLLLoss expects log-probabilities as input (typically produced by a LogSoftmax layer) and simply returns the negated input entry at the target index. Log-probabilities are never positive, so a valid loss is never negative.
- ReLU outputs are non-negative and unbounded, so they are not log-probabilities. The optimizer can lower the loss without limit just by inflating the outputs, which is why the training loss reaches -690.67 after 20 epochs while accuracy stays poor.
- Setting `output_dim=2` and converting `y` to long integers only fixed the shape and dtype errors. It did not address the underlying mismatch between ReLU and NLLLoss, which is why this model does not perform well.
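
The runaway negative loss can be reproduced by hand. With mean reduction and no class weights, NLLLoss returns the negated mean of the input entries at the target indices. The sketch below applies that rule in plain Python to made-up ReLU-style outputs; `nll_from_inputs` is an illustrative helper, not a PyTorch function.

```python
def nll_from_inputs(inputs, targets):
    """-mean(inputs[i][targets[i]]): the mean-reduced NLL of whatever it is given."""
    picked = [row[t] for row, t in zip(inputs, targets)]
    return -sum(picked) / len(picked)

targets = [0, 1, 1]
# Hypothetical ReLU outputs: non-negative scores, not log-probabilities.
relu_outputs = [[2.0, 0.0], [0.5, 3.0], [1.0, 1.0]]

loss = nll_from_inputs(relu_outputs, targets)
scaled = [[10 * v for v in row] for row in relu_outputs]
scaled_loss = nll_from_inputs(scaled, targets)

print(loss, scaled_loss)  # -2.0 -20.0
```

Multiplying every output by 10 makes the loss ten times more negative without changing a single predicted class, which matches the pattern in the training log above.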


**Your reason for your choice:**\
To make this configuration compatible with NLLLoss, we replace the ReLU activation on the last layer with a LogSoftmax layer, which converts the two output scores into log-probabilities. NLLLoss then computes a valid, non-negative loss, and the results improve accordingly: the training loss falls from 0.6360 to about 0.40, and the model reaches a validation loss of 0.4816 with 0.76 accuracy, compared with 0.61 for the ReLU model.
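
For reference, here is a minimal pure-Python sketch of what a log-softmax layer computes for one row of scores, using the usual log-sum-exp trick for numerical stability. The function name and input values are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one row of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

row = [2.0, 0.0]  # hypothetical raw scores for the two classes
log_probs = log_softmax(row)
probs = [math.exp(lp) for lp in log_probs]

print([round(lp, 4) for lp in log_probs])  # [-0.1269, -2.1269]
print(round(sum(probs), 6))                # 1.0
```

Every entry is at most zero and the exponentiated entries sum to one, so the negated entry at the target index is always a non-negative loss.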



### 4. CrossEntropyLoss (`torch.nn.CrossEntropyLoss`)
- **Description:** Combines `LogSoftmax` and `NLLLoss` in one single class. It computes the cross-entropy loss between the target and the output logits.
- **Use Case:** Widely used for multi-class classification problems.

The mathematical formulation of CrossEntropyLoss is as follows:
\begin{equation}
  \text{CrossEntropy}(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log\left(\frac{e^{\hat{y}_i}}{\sum_{j=1}^{C} e^{\hat{y}_j}}\right)
\end{equation}
  where:
  - \( C \) is the number of classes,
  - \( y_i \) is the \( i \)-th entry of the one-hot encoded target vector (in practice, PyTorch is usually given the class index instead),
  - \( \hat{y}_i \) represents the logits (unnormalized model outputs) for each class.
  
  In practice, `torch.nn.CrossEntropyLoss` expects raw logits as input. It internally applies `log_softmax` to turn the logits into log-probabilities and then computes the negative log-likelihood, so the model should not apply softmax itself.

- **Background:** Cross-entropy measures the difference between the true distribution \( y \) and the predicted distribution \( \text{softmax}(\hat{y}) \). Minimizing it minimizes the negative log-probability assigned to the correct class, which penalizes confident wrong predictions heavily and makes it a standard choice for classification tasks in deep learning.
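
With a one-hot target, the sum in the cross-entropy formula above collapses to a single term: the negative log of the softmax probability at the target index. This is why the loss can be described as log-softmax followed by NLL. The sketch below checks this by hand for one sample in plain Python; the logits and helper names are made up for illustration.

```python
import math

def softmax(logits):
    """Softmax of one row of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_target, logits):
    """-sum_i y_i * log(softmax(logits)_i), term by term as in the formula."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_target, probs))

logits = [1.0, 3.0]  # hypothetical raw model outputs for C = 2 classes
target_index = 1
one_hot = [1.0 if i == target_index else 0.0 for i in range(len(logits))]

full_sum = cross_entropy(one_hot, logits)
shortcut = -math.log(softmax(logits)[target_index])
print(round(full_sum, 4), round(shortcut, 4))  # 0.1269 0.1269
```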

Now, let's train the same `SimpleMLP` with `CrossEntropyLoss`:


In [None]:
from torch.nn import CrossEntropyLoss

# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Convert the data to PyTorch tensors and create a DataLoader
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long).squeeze()  # Long for classification labels

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

input_dim = X.shape[1]
hidden_dim = 64
output_dim = 2
num_hidden_layers = 2

# In this part we did not apply any activation function to the last layer
# because CrossEntropyLoss internally applies both the softmax activation and the negative log-likelihood computation
ce_model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=None)
ce_criterion = nn.CrossEntropyLoss()
ce_optimizer = Adam(ce_model.parameters(), lr=0.001)
ce_trainer = SimpleMLPTrainer(ce_model, ce_criterion, ce_optimizer)
ce_trainer.train(train_loader, num_epochs=20)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 546.88it/s]


Epoch [1/20], Loss: 0.6154


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 535.22it/s]


Epoch [2/20], Loss: 0.5011


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 544.68it/s]


Epoch [3/20], Loss: 0.4501


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 496.65it/s]


Epoch [4/20], Loss: 0.4318


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 540.00it/s]


Epoch [5/20], Loss: 0.4230


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 499.60it/s]


Epoch [6/20], Loss: 0.4169


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 458.62it/s]


Epoch [7/20], Loss: 0.4147


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 482.26it/s]


Epoch [8/20], Loss: 0.4149


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 501.97it/s]


Epoch [9/20], Loss: 0.4053


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 369.83it/s]


Epoch [10/20], Loss: 0.4060


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 389.42it/s]


Epoch [11/20], Loss: 0.4019


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 396.62it/s]


Epoch [12/20], Loss: 0.4033


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 318.44it/s]


Epoch [13/20], Loss: 0.4017


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 378.42it/s]


Epoch [14/20], Loss: 0.3990


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 361.48it/s]


Epoch [15/20], Loss: 0.3986


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 419.50it/s]


Epoch [16/20], Loss: 0.3965


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 403.15it/s]


Epoch [17/20], Loss: 0.3956


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 419.73it/s]


Epoch [18/20], Loss: 0.3961


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 579.79it/s]


Epoch [19/20], Loss: 0.3923


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 535.40it/s]

Epoch [20/20], Loss: 0.3919





[0.6154282457298703,
 0.5010942932632234,
 0.4501282705201043,
 0.43184634877575767,
 0.42301610277758706,
 0.4169138918320338,
 0.4147384473019176,
 0.4149402495887544,
 0.40525172154108685,
 0.4060488889614741,
 0.40193768342336017,
 0.4032685160636902,
 0.4017352991633945,
 0.3990241040786107,
 0.3986136598719491,
 0.3964868701166577,
 0.39562932319111294,
 0.3961096637778812,
 0.3922592310441865,
 0.39192621989382637]

In [None]:
ce_loss, ce_acc = ce_trainer.evaluate(val_loader)
print(f"Validation Loss: {ce_loss:.4f}, Accuracy: {ce_acc:.2f}")

Validation Loss: 0.4740, Accuracy: 0.76



### 5. KLDivLoss (`torch.nn.KLDivLoss`)
- **Description:** Kullback-Leibler Divergence Loss measures how one probability distribution diverges from a second, reference distribution. Unlike losses that compare predictions with a single class label, KL divergence (also called relative entropy) compares two full probability distributions. It quantifies the information lost when the predicted distribution is used to approximate the true distribution.

- **Mathematical Function:**
\begin{equation}
  \text{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \left( \log P(i) - \log Q(i) \right)
\end{equation}
  where:
  - \( P \) is the target (true) probability distribution,
  - \( Q \) is the predicted distribution (`KLDivLoss` expects it as log-probabilities, often the output of `log_softmax`),
  - \( C \) is the number of classes.

  KL divergence is always non-negative, and it equals zero if the two distributions are identical. The loss function expects the model's output to be in the form of log-probabilities (using `log_softmax`) and compares this against a target probability distribution, which is typically a normalized distribution (using softmax).

- **Use Case:** KLDivLoss is frequently used in:
  - **Variational Autoencoders (VAEs):** In VAEs, KL divergence is used to measure how much the learned latent space distribution deviates from a prior distribution (often Gaussian).
  - **Knowledge Distillation:** In teacher-student models, KL divergence is used to transfer the "soft" knowledge from a teacher model to a student model by comparing their output probability distributions.
  - **Reinforcement Learning:** It can be used to update policies while minimizing the divergence from a previous policy.

- **Background:** Kullback-Leibler divergence, a core concept in information theory, measures the inefficiency of assuming the predicted distribution \( Q \) when the true distribution is \( P \). It is asymmetric, meaning that \( KL(P \parallel Q) \neq KL(Q \parallel P) \), so the direction of the comparison matters.
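
The properties listed above (non-negativity, zero for identical distributions, asymmetry) can be verified on a small example. The sketch below is plain Python with made-up two-class distributions; `kl_divergence` is an illustrative helper that follows the formula directly and takes probabilities, not the log-probabilities that `KLDivLoss` expects as input.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * (log P(i) - log Q(i)); terms with P(i) = 0 add 0."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.2]
q = [0.5, 0.5]

print(round(kl_divergence(p, q), 4), round(kl_divergence(q, p), 4))  # asymmetric
print(kl_divergence(p, p))                                           # identical -> 0.0

# With a one-hot target, the entropy of P is zero, so KL reduces to -log Q(true class).
one_hot = [0.0, 1.0]
print(round(kl_divergence(one_hot, q), 4), round(-math.log(q[1]), 4))
```

The last line is relevant to this notebook: because the targets are one-hot encoded, minimizing KL divergence against them is equivalent to minimizing cross-entropy.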

Again, in this part, run your training with ReLU at the last layer. <span style="color:red; font-weight: bold;">Discuss</span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [None]:
import torch.nn.functional as F
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing (simple example)
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Convert the data to PyTorch tensors and create a DataLoader
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

# We can scale the data
# Without scaling we got an accuracy of around 62%
# But after scaling, we observe around 15% increase!
scaler = StandardScaler()
X = scaler.fit_transform(X)

X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)
y = F.one_hot(y, num_classes=2).float()



# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Define the model, criterion, and optimizer
input_dim = X.shape[1]
hidden_dim = 64
output_dim = 2
num_hidden_layers = 2

In [None]:
# Run with relu activation function
from torch.nn import KLDivLoss

relu_model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=nn.ReLU)
relu_criterion = KLDivLoss(reduction='batchmean')
relu_optimizer = Adam(relu_model.parameters(), lr=0.001)

relu_trainer = SimpleMLPTrainer(relu_model, relu_criterion, relu_optimizer)
relu_trainer.train(train_loader, num_epochs=20)

Epoch [1/20], Loss: -0.3056
Epoch [2/20], Loss: -1.0055
Epoch [3/20], Loss: -2.4310
Epoch [4/20], Loss: -5.1360
Epoch [5/20], Loss: -9.8650
Epoch [6/20], Loss: -17.5313
Epoch [7/20], Loss: -29.3701
Epoch [8/20], Loss: -46.2668
Epoch [9/20], Loss: -69.5050
Epoch [10/20], Loss: -100.2279
Epoch [11/20], Loss: -139.4075
Epoch [12/20], Loss: -187.9260
Epoch [13/20], Loss: -248.6599
Epoch [14/20], Loss: -320.7862
Epoch [15/20], Loss: -406.8640
Epoch [16/20], Loss: -507.8099
Epoch [17/20], Loss: -623.8188
Epoch [18/20], Loss: -756.4103
Epoch [19/20], Loss: -906.5984
Epoch [20/20], Loss: -1077.3354





[-0.3056059243778388,
 -1.0054754780398474,
 -2.4310453004307218,
 -5.13602344195048,
 -9.865025361378988,
 -17.531318134731716,
 -29.370090378655327,
 -46.26676517062717,
 -69.50498729281955,
 -100.22786670260959,
 -139.40747451782227,
 -187.92595418294272,
 -248.65987565782336,
 -320.7862074110243,
 -406.8639899359809,
 -507.80985683865015,
 -623.818834092882,
 -756.4102952745226,
 -906.598375108507,
 -1077.3354187011719]

In [None]:
relu_val_loss, relu_val_acc = relu_trainer.evaluate(val_loader)
# Accuracy is returned as a fraction in [0, 1], so it is printed without a percent sign
print(f"ReLU Model - Validation Loss: {relu_val_loss:.4f}, Accuracy: {relu_val_acc:.2f}")

ReLU Model - Validation Loss: -1190.2690, Accuracy: 0.61


In [None]:
# Run with log-softmax activation function
from torch.nn import LogSoftmax, KLDivLoss

logsoftmax_model = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers, last_layer_activation_fn=LogSoftmax)
logsoftmax_criterion = KLDivLoss(reduction='batchmean')
logsoftmax_optimizer = Adam(logsoftmax_model.parameters(), lr=0.001)

logsoftmax_trainer = SimpleMLPTrainer(logsoftmax_model, logsoftmax_criterion, logsoftmax_optimizer)
logsoftmax_trainer.train(train_loader, num_epochs=20)

Epoch [1/20], Loss: 0.6475
Epoch [2/20], Loss: 0.5340
Epoch [3/20], Loss: 0.4636
Epoch [4/20], Loss: 0.4354
Epoch [5/20], Loss: 0.4249
Epoch [6/20], Loss: 0.4197
Epoch [7/20], Loss: 0.4156
Epoch [8/20], Loss: 0.4112
Epoch [9/20], Loss: 0.4088
Epoch [10/20], Loss: 0.4113
Epoch [11/20], Loss: 0.4045
Epoch [12/20], Loss: 0.4053
Epoch [13/20], Loss: 0.4002
Epoch [14/20], Loss: 0.3979
Epoch [15/20], Loss: 0.3957
Epoch [16/20], Loss: 0.3953
Epoch [17/20], Loss: 0.3966
Epoch [18/20], Loss: 0.3911
Epoch [19/20], Loss: 0.3913
Epoch [20/20], Loss: 0.3901





[0.6474657754103342,
 0.5339570790529251,
 0.4635663777589798,
 0.4353876925177044,
 0.42491303715440965,
 0.41970115154981613,
 0.41560972068044877,
 0.4111771550443437,
 0.40882642898294663,
 0.41129814916186863,
 0.4045010606447856,
 0.405258693628841,
 0.40015405664841336,
 0.3978937218586604,
 0.39567310197485817,
 0.39528197960721123,
 0.39655165870984393,
 0.3910642655359374,
 0.3912660694784588,
 0.3901420657833417]

In [None]:
logsoftmax_val_loss, logsoftmax_val_acc = logsoftmax_trainer.evaluate(val_loader)
# Accuracy is returned as a fraction in [0, 1], so it is printed without a percent sign
print(f"LogSoftmax Model - Validation Loss: {logsoftmax_val_loss:.4f}, Accuracy: {logsoftmax_val_acc:.2f}")

LogSoftmax Model - Validation Loss: 0.4742, Accuracy: 0.76


**Your reason for your choice:**

KLDivLoss requires the output from the model to be in the form of log-probabilities (achieved using LogSoftmax), and the target labels to be probability distributions rather than integer class labels. Initially, we faced two main issues:

- ReLU activation on the final layer did not produce the log-probabilities required by KLDivLoss, causing the loss function to interpret outputs incorrectly.
- Integer class labels were incompatible with KLDivLoss, which expects probability distributions.

To resolve these issues, we implemented the following steps:

- Replaced ReLU with LogSoftmax on the last layer, which transformed the model's outputs into log-probabilities. This made the output suitable for KLDivLoss.
- One-hot encoded the target labels into probability distributions (e.g., [1, 0] or [0, 1] for binary classes), aligning them with KLDivLoss’s expectation for target distribution inputs.

The results of training with ReLU vs. LogSoftmax activation in the final layer were starkly different:

**ReLU Model:**

The ReLU model produced a large negative loss that kept decreasing (training loss of -1077.34 after 20 epochs, validation loss of -1190.27), even though a true KL divergence can never be negative. `KLDivLoss` treats its input as log-probabilities, and with a one-hot target the loss for each sample reduces to the negative of the model output at the true class. ReLU outputs are non-negative and unbounded, so the optimizer can lower the loss indefinitely just by inflating the outputs, which says little about how well the classes are separated. Accuracy stayed at about 61%.

**LogSoftmax Model:**

With LogSoftmax as the final-layer activation, the training loss decreased from 0.65 to 0.39 and the validation loss was 0.4742, a small non-negative value as expected for a KL divergence. Because the outputs were valid log-probabilities, `KLDivLoss` computed the divergence correctly, and validation accuracy rose to about 76%.
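
The negative loss can be reproduced on a tiny batch. The sketch below uses hypothetical raw scores (not outputs of the notebook's `SimpleMLP`) and compares feeding `KLDivLoss` a ReLU output with feeding it a `log_softmax` output, using one-hot targets as in the preprocessing cell.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.KLDivLoss(reduction="batchmean")

# One-hot targets for two samples whose true classes are 0 and 1.
labels = torch.tensor([0, 1])
target = F.one_hot(labels, num_classes=2).float()

# Hypothetical raw scores from a final linear layer (assumption for illustration).
scores = torch.tensor([[3.0, -1.0],
                       [0.5, 2.0]])

# ReLU output is non-negative and unbounded, so it is not a log-probability.
relu_out = F.relu(scores)
loss_relu = criterion(relu_out, target)
# Scaling the outputs up makes the loss more negative without any better separation.
loss_relu_scaled = criterion(10 * relu_out, target)

# log_softmax output is a valid log-probability, so the loss is non-negative.
log_probs = F.log_softmax(scores, dim=1)
loss_logsoftmax = criterion(log_probs, target)

print("ReLU input:           ", loss_relu.item())
print("ReLU input, scaled x10:", loss_relu_scaled.item())
print("log_softmax input:    ", loss_logsoftmax.item())
print("cross-entropy:        ", F.cross_entropy(scores, labels).item())
```

With one-hot targets, the `log_softmax` variant coincides with the ordinary cross-entropy loss, which is one way to sanity-check the fix.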


### 6. CosineEmbeddingLoss (`torch.nn.CosineEmbeddingLoss`)
- **Description:** Measures the cosine similarity between two input tensors, `x1` and `x2`, and computes the loss based on a label `y` that indicates whether the tensors should be similar (`y = 1`) or dissimilar (`y = -1`). Cosine similarity focuses on the angle between vectors, disregarding their magnitude.

- **Mathematical Function:**
\begin{equation}
  \text{CosineEmbeddingLoss}(x_1, x_2, y) =
  \begin{cases}
  1 - \cos(x_1, x_2), & \text{if } y = 1 \\
  \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
  \end{cases}
\end{equation}
  where $ \cos(x_1, x_2) $ is the cosine similarity between the two vectors, and `margin` is a threshold: a dissimilar pair (`y = -1`) is penalized only while its cosine similarity exceeds `margin`.

- **Use Case:** Commonly used in tasks like face verification, image similarity, and other scenarios where the relative orientation of vectors (angle) is more important than their length, such as in embeddings and metric learning.

- **Background:** Cosine similarity compares the directional alignment of vectors, making it ideal for high-dimensional data where the magnitude may not be as informative. This loss is particularly useful when training models to learn meaningful embeddings that capture semantic similarity.
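
As a quick check of the piecewise definition above, the following sketch evaluates `nn.CosineEmbeddingLoss` on two hand-picked pairs and compares it with a direct implementation of the two cases. The vectors and the margin of 0.5 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two pairs of 2-D vectors (illustrative values).
x1 = torch.tensor([[1.0, 0.0],
                   [1.0, 1.0]])
x2 = torch.tensor([[2.0, 0.0],
                   [1.0, 0.5]])
y = torch.tensor([1.0, -1.0])  # first pair should be similar, second dissimilar
margin = 0.5                   # assumed margin for this example

# Built-in loss, kept per-sample so both cases are visible.
loss_builtin = nn.CosineEmbeddingLoss(margin=margin, reduction="none")(x1, x2, y)

# Manual version of the two cases in the formula.
cos = F.cosine_similarity(x1, x2, dim=1)
loss_manual = torch.where(y == 1, 1 - cos, torch.clamp(cos - margin, min=0))

print("cosine similarity:", cos)
print("built-in loss:    ", loss_builtin)
print("manual loss:      ", loss_manual)
```

The first pair points in the same direction, so its loss is zero even though the vectors differ in length. The second pair is labeled dissimilar but is still closely aligned, so it is penalized by the amount its similarity exceeds the margin.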

You'll become more familiar with this loss function in the future.

---

# Regularization in Machine Learning

## Introduction

Regularization is a fundamental technique in machine learning that helps prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, ensuring better generalization to unseen data. In this notebook, you will explore the concepts of regularization, understand different types of regularization techniques, and apply them using Python's popular libraries.

## What is Regularization?

Regularization involves adding a regularization term to the loss function used to train machine learning models. This term imposes a constraint on the model's coefficients, effectively reducing their magnitude. By doing so, regularization helps in:

- **Preventing Overfitting:** Ensures the model does not become too tailored to the training data.
- **Improving Generalization:** Enhances the model's performance on new, unseen data.
- **Feature Selection:** Especially in L1 regularization, it can drive some coefficients to zero, effectively selecting important features.
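
The feature-selection effect can be seen on synthetic data. In this sketch (an illustration using scikit-learn's `Lasso` and `Ridge`, with made-up data in which only the first two of five features matter), the L1 penalty should zero out the irrelevant coefficients while the L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data (assumption): only features 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Same regularization strength for both models, chosen for illustration.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Zero coefficients (Lasso):", int(np.sum(lasso.coef_ == 0)))
print("Zero coefficients (Ridge):", int(np.sum(ridge.coef_ == 0)))
```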

## Types of Regularization

There are several types of regularization techniques, each imposing different constraints on the model's parameters:

### 1. L1 Regularization (Lasso)

L1 regularization adds the absolute value of the magnitude of coefficients as a penalty term to the loss function. It can lead to sparse models where some feature coefficients are exactly zero.

### 2. L2 Regularization (Ridge)

L2 regularization adds the squared magnitude of coefficients as a penalty term to the loss function. It shrinks all coefficients toward zero but, unlike L1, typically does not set any of them exactly to zero.

### 3. Elastic Net

Elastic Net combines both L1 and L2 regularization penalties. It balances the benefits of both Lasso and Ridge methods, allowing for feature selection and coefficient shrinkage.
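
The three penalties can be written out in a few lines of NumPy. This is a minimal sketch for a hypothetical weight vector; `lam` and `l1_ratio` are illustrative names, and libraries differ in their scaling conventions (scikit-learn's `ElasticNet`, for example, places an extra factor of 1/2 on the L2 term).

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])  # hypothetical weight vector
lam = 0.1                             # hypothetical regularization strength
l1_ratio = 0.5                        # hypothetical mix between L1 and L2

# L1 (Lasso) penalty: lam * sum of absolute values.
l1_penalty = lam * np.sum(np.abs(w))

# L2 (Ridge) penalty: lam * sum of squares.
l2_penalty = lam * np.sum(w ** 2)

# Elastic Net: a convex combination of the two penalties.
elastic_penalty = l1_ratio * l1_penalty + (1 - l1_ratio) * l2_penalty

print(f"L1 penalty:          {l1_penalty:.4f}")
print(f"L2 penalty:          {l2_penalty:.4f}")
print(f"Elastic Net penalty: {elastic_penalty:.4f}")
```

Whichever penalty is used, it is added to the data loss before the gradient is taken, so larger weights cost more during training.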

## Homework Time!
Import the Iris dataset from `sklearn.datasets` and apply ridge (L2) regularization with different alpha values. Then create a GIF that shows how the classification boundary changes with alpha.

Import the libs that you need and start coding!

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from PIL import Image
from io import BytesIO
import imageio
import warnings


# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

Load the Iris dataset and select Setosa and Versicolor classes

In [None]:
# 1. Load and Prepare the Iris Dataset
iris = load_iris()

# Keep only two classes for binary classification (Setosa = 0, Versicolor = 1)
binary_mask = iris.target != 2

# Select two features for 2D visualization (Sepal Length and Petal Length)
X = iris.data[binary_mask][:, [0, 2]]
y = iris.target[binary_mask]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


Define Function to Plot Decision Boundary

In [None]:
def plot_decision_boundary(model, X, y, alpha):
    # Define the grid (use meshgrid)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))


    # Predict over the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Create a figure
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, levels=[-0.1, 0.1, 1.1], colors=['blue', 'red'])

    # Scatter plot of the training data
    scatter = ax.scatter(
        X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=50
    )

    # Title and labels
    ax.set_title(f'MLP Decision Boundary (alpha={alpha})')
    ax.set_xlabel('Sepal Length (standardized)')
    ax.set_ylabel('Petal Length (standardized)')

    # Hide the tick marks for a cleaner frame (axis labels are kept)
    ax.set_xticks([])
    ax.set_yticks([])

    # Tight layout
    fig.tight_layout()

    # Save this figure to an in-memory PNG buffer, then free the figure
    buf = BytesIO()
    fig.savefig(buf, format='png')
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf)

Train MLP with Varying Alpha Values and Collect Images

In [None]:
def create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons):

    # List to store images
    images = []

    for idx, alpha in enumerate(alpha_values):
        print(f"Processing alpha={alpha:.4f} ({idx + 1}/{len(alpha_values)})")

        # Create and train the MLP
        mlp = MLPClassifier(
            hidden_layer_sizes=(n_neurons,),
            alpha=alpha,
            max_iter=1000,
            random_state=42
        )
        mlp.fit(X_train, y_train)


        # Plot decision boundary and get the image
        img = plot_decision_boundary(mlp, X_train, y_train, alpha)
        images.append(img)

    # Save the images as a GIF
    gif_filename = 'mlp_classification_boundaries.gif'
    images[0].save(
        gif_filename,
        save_all=True,
        append_images=images[1:],
        duration=500,
        loop=0
    )

    print(f"GIF saved as '{gif_filename}'")

    # Return the filename of the saved GIF
    return gif_filename

## RUN

In [None]:
from sklearn.neural_network import MLPClassifier
# Use np.logspace to generate alpha values, with at least 20 values
alpha_values = np.logspace(-3, 3, 20)  # alpha from 0.001 to 1000

# Define the number of neurons in the hidden layer
n_neurons = 10

# Create the decision boundary GIF
gif_dir = create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons)

Processing alpha=0.0010 (1/20)
Processing alpha=0.0021 (2/20)
Processing alpha=0.0043 (3/20)
Processing alpha=0.0089 (4/20)
Processing alpha=0.0183 (5/20)
Processing alpha=0.0379 (6/20)
Processing alpha=0.0785 (7/20)
Processing alpha=0.1624 (8/20)
Processing alpha=0.3360 (9/20)
Processing alpha=0.6952 (10/20)
Processing alpha=1.4384 (11/20)
Processing alpha=2.9764 (12/20)
Processing alpha=6.1585 (13/20)
Processing alpha=12.7427 (14/20)
Processing alpha=26.3665 (15/20)
Processing alpha=54.5559 (16/20)
Processing alpha=112.8838 (17/20)
Processing alpha=233.5721 (18/20)
Processing alpha=483.2930 (19/20)
Processing alpha=1000.0000 (20/20)
GIF saved as 'mlp_classification_boundaries.gif'


Your gif should look like this:

<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries_example.gif)

*Figure 1: Demonstration of classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>
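
To complement the visual comparison, the effect of `alpha` can also be measured numerically. The sketch below is an illustration only: it retrains the same kind of `MLPClassifier` on the two-class, two-feature Iris subset for three alpha values and reports the sum of squared weights, which should shrink as the L2 penalty grows. The helper name `squared_weight_norm` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Same subset as above: Setosa vs. Versicolor, sepal length and petal length.
iris = load_iris()
mask = iris.target != 2
X = StandardScaler().fit_transform(iris.data[mask][:, [0, 2]])
y = iris.target[mask]


def squared_weight_norm(alpha):
    """Train an MLP with the given L2 strength and return the sum of squared weights."""
    mlp = MLPClassifier(
        hidden_layer_sizes=(10,),
        alpha=alpha,
        max_iter=1000,
        random_state=42,
    )
    mlp.fit(X, y)
    return float(sum(np.sum(W ** 2) for W in mlp.coefs_))


norms = {alpha: squared_weight_norm(alpha) for alpha in (1e-3, 1.0, 1e3)}
for alpha, norm in norms.items():
    print(f"alpha={alpha:g}: sum of squared weights = {norm:.4f}")
```

When the weights are pushed close to zero, the network output barely depends on the input, which would explain a decision boundary that flattens or disappears at the largest alpha values.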

