# Batch Norm / Learning Rate / Normal Equation vs Gradient Descent

## Normalization

In this exercise, you'll learn about normalization, a crucial technique in machine learning for standardizing input data. You'll implement normalization using two methods: manual normalization with NumPy and BatchNorm1d in PyTorch. This will help you understand the concept of normalization and how to apply it in different frameworks.

Your tasks are to:
1. Create a 2D NumPy array
2. Perform manual normalization along the first axis
3. Convert the NumPy array to a PyTorch tensor
4. Apply BatchNorm1d in PyTorch
5. Compare the results of both methods

This exercise will demonstrate how normalization works and show you the convenience of using built-in PyTorch functions for common operations in deep learning.
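
As a reference for the steps above, per-column standardization can be written out as follows. The notation is assumed here rather than taken from the exercise: $x_{ij}$ is the entry in row $i$ and column $j$ of an $N \times C$ array, and $\varepsilon$, $\gamma_j$, $\beta_j$ are the stabilizing constant and the learnable scale and shift that BatchNorm1d adds on top of the manual version (by default $\varepsilon = 10^{-5}$, and at initialization $\gamma_j = 1$, $\beta_j = 0$).

$$
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)^2
$$

$$
\hat{x}_{ij}^{\text{manual}} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\hat{x}_{ij}^{\text{BN}} = \gamma_j \, \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \varepsilon}} + \beta_j
$$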

In [4]:
# Import necessary libraries
import numpy as np
import torch
import torch.nn as nn

# Set a random seed for reproducibility
np.random.seed(42)

# Create a 2D NumPy array with shape (4, 3) and random values
data = np.random.rand(4, 3)

# Perform manual normalization along the first axis (axis=0)
# Use np.mean() and np.std() to get one mean and one standard deviation per column
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
normalized_numpy = (data - mean) / std

# Convert NumPy array to PyTorch tensor
tensor_data = torch.tensor(data, dtype=torch.float32)

# Create a BatchNorm1d layer and apply it to the tensor
batch_norm = nn.BatchNorm1d(num_features=3)
normalized_torch = batch_norm(tensor_data)

# Print and compare results
print("Original data:")
print(data)

print("\nNumPy normalized data:")
print(normalized_numpy)

print("\nPyTorch BatchNorm1d normalized data:")
print(normalized_torch)

# Compare the results and explain any differences you observe

'''
The manual normalization and the PyTorch BatchNorm1d layer give the same result to about four decimal places.
Both compute the mean and standard deviation per column (feature) across the rows (the batch dimension, axis=0):
for an input of shape (N, C), BatchNorm1d normalizes each of the C features over the N samples.
The small remaining differences (e.g. -1.5159 vs. -1.5158) come from BatchNorm1d adding eps=1e-5 to the
variance before taking the square root, and from float32 precision. No transpose is needed.
'''

Original data:
[[0.37454012 0.95071431 0.73199394]
 [0.59865848 0.15601864 0.15599452]
 [0.05808361 0.86617615 0.60111501]
 [0.70807258 0.02058449 0.96990985]]

NumPy normalized data:
[[-0.2426183   1.09277335  0.39604745]
 [ 0.65914784 -0.82706681 -1.5497212 ]
 [-1.51591764  0.88854453 -0.04607125]
 [ 1.0993881  -1.15425107  1.199745  ]]

PyTorch BatchNorm1d normalized data:
tensor([[-0.2426,  1.0927,  0.3960],
        [ 0.6591, -0.8270, -1.5496],
        [-1.5158,  0.8885, -0.0461],
        [ 1.0993, -1.1542,  1.1997]], grad_fn=<NativeBatchNormBackward0>)
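
The two outputs above agree to about four decimals, and the small gap can be checked by hand. In training mode, BatchNorm1d normalizes each feature with the biased batch variance and adds a small `eps` inside the square root. The sketch below is a NumPy-only check, assuming the default `eps=1e-5` and the default initial scale and shift of 1 and 0; the arrays are copied from the printed outputs, and the names `plain`, `bn_like`, and `printed_torch` are illustrative.

```python
import numpy as np

# Values copied from the "Original data" output above
data = np.array([
    [0.37454012, 0.95071431, 0.73199394],
    [0.59865848, 0.15601864, 0.15599452],
    [0.05808361, 0.86617615, 0.60111501],
    [0.70807258, 0.02058449, 0.96990985],
])

# Values copied from the "PyTorch BatchNorm1d normalized data" output above
printed_torch = np.array([
    [-0.2426,  1.0927,  0.3960],
    [ 0.6591, -0.8270, -1.5496],
    [-1.5158,  0.8885, -0.0461],
    [ 1.0993, -1.1542,  1.1997],
])

eps = 1e-5  # assumption: the default eps of nn.BatchNorm1d

mean = data.mean(axis=0)  # one mean per column
var = data.var(axis=0)    # biased variance (ddof=0), one per column

plain = (data - mean) / np.sqrt(var)          # manual normalization
bn_like = (data - mean) / np.sqrt(var + eps)  # BatchNorm-style normalization

max_gap_plain = np.abs(plain - printed_torch).max()
max_gap_bn = np.abs(bn_like - printed_torch).max()
print("max |manual - printed|:  ", max_gap_plain)
print("max |with eps - printed|:", max_gap_bn)
```

Adding `eps` brings the manual result to within the rounding of the printed tensor, which suggests that `eps` accounts for most of the visible difference.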



## Learning Rate
In this exercise, you'll explore the concept of learning rate in optimization algorithms, which is crucial in training machine learning models. The learning rate determines the step size at each iteration while moving toward a minimum of the loss function.

You'll complete two tasks:

1. Implement a simple gradient descent algorithm in pure Python to see how different learning rates affect the optimization process.
2. Use PyTorch to train a simple linear model and observe how changing the learning rate impacts the training process.

This exercise will help you understand why choosing an appropriate learning rate is important and how it affects model training.
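
For the quadratic $f(x) = x^2$ that Part 1 minimizes, the update rule can be worked out in closed form. The symbols $\eta$ (learning rate) and $x_k$ (the value after $k$ updates) are notation introduced here for illustration.

$$
x_{k+1} = x_k - \eta f'(x_k) = x_k - 2\eta x_k = (1 - 2\eta)\,x_k
\quad\Longrightarrow\quad
x_k = (1 - 2\eta)^k \, x_0
$$

The iterates shrink toward the minimum at 0 exactly when $|1 - 2\eta| < 1$, that is, when $0 < \eta < 1$. With $\eta = 0.1$ each step multiplies $x$ by 0.8; with $\eta = 0.01$ by 0.98, which is slow; and with $\eta = 0.5$ by 0, which lands on the minimum in a single step. That last case is a special property of this quadratic, not of gradient descent in general.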

In [2]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Part 1: Simple gradient descent in pure Python

print("\n--- Part 1: Simple Gradient Descent ---")

# Define a simple quadratic function f(x) = x^2 and its derivative df(x) = 2x
def f(x):
    """A simple quadratic function f(x) = x^2"""
    return x**2

def df(x):
    """Derivative of f(x)"""
    return 2*x

# Implement the gradient descent algorithm
def gradient_descent(start, learn_rate, num_iterations):
    x = start
    for i in range(num_iterations):
        # Implement the gradient descent update rule
        x = x - learn_rate * df(x)
        if i % 5 == 0:
            print(f"Iteration {i}: x = {x:.4f}, f(x) = {f(x):.4f}")
    return x

# Run gradient descent with different learning rates
# Use start=5.0, num_iterations=25, and learning rates: 0.1, 0.01, 0.5
start = 5.0
num_iter = 25
learning_rate_01 = 0.1
result = gradient_descent(start, learning_rate_01, num_iter)
print("Learning rate 0.1: x =", result)
print("")

learning_rate_001 = 0.01
result = gradient_descent(start, learning_rate_001, num_iter)
print("Learning rate 0.01: x =", result)
print("")

learning_rate_05 = 0.5
result = gradient_descent(start, learning_rate_05, num_iter)
print("Learning rate 0.5: x =", result)
print("")
# Print the results and compare them

print("\n--- Part 2: PyTorch Implementation ---")

# Generate some sample data
torch.manual_seed(42)
X = torch.rand(100, 1) * 10
y = 2 * X + 1 + torch.randn(100, 1)

# Define a simple linear model
class SimpleLinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        out = self.linear(x)
        return out

# Define the loss function
criterion = nn.MSELoss()

# Training loop
def train(learning_rate, num_epochs):
    # Create a fresh model and optimizer for each run so that every learning
    # rate starts from newly initialized weights instead of continuing from
    # the previous run
    model = SimpleLinearModel()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        # Implement the training loop
        # 1. Compute the model output
        # 2. Compute the loss
        # 3. Backpropagate the loss
        # 4. Update the model parameters

        optimizer.zero_grad()
        y_pred = model(X) # Compute the model output
        loss = criterion(y_pred, y) # Compute the loss
        loss.backward() # Backpropagate the loss
        optimizer.step() # Update the model parameters

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

# Train the model with different learning rates (e.g., 0.01, 0.1, 0.5)
# Print the results and compare them
learning_rate_001 = 0.01
train(learning_rate_001, 100)
print("")

learning_rate_01 = 0.1
train(learning_rate_01, 100)
print("")

learning_rate_05 = 0.5
train(learning_rate_05, 100)
print("")


--- Part 1: Simple Gradient Descent ---
Iteration 0: x = 4.0000, f(x) = 16.0000
Iteration 5: x = 1.3107, f(x) = 1.7180
Iteration 10: x = 0.4295, f(x) = 0.1845
Iteration 15: x = 0.1407, f(x) = 0.0198
Iteration 20: x = 0.0461, f(x) = 0.0021
Learning rate 0.1: x = 0.018889465931478583

Iteration 0: x = 4.9000, f(x) = 24.0100
Iteration 5: x = 4.4292, f(x) = 19.6179
Iteration 10: x = 4.0037, f(x) = 16.0293
Iteration 15: x = 3.6190, f(x) = 13.0971
Iteration 20: x = 3.2713, f(x) = 10.7013
Learning rate 0.01: x = 3.0173236488944846

Iteration 0: x = 0.0000, f(x) = 0.0000
Iteration 5: x = 0.0000, f(x) = 0.0000
Iteration 10: x = 0.0000, f(x) = 0.0000
Iteration 15: x = 0.0000, f(x) = 0.0000
Iteration 20: x = 0.0000, f(x) = 0.0000
Learning rate 0.5: x = 0.0


--- Part 2: PyTorch Implementation ---
Epoch 0, Loss: 145.8912
Epoch 10, Loss: 0.9808
Epoch 20, Loss: 0.9518
Epoch 30, Loss: 0.9254
Epoch 40, Loss: 0.9015
Epoch 50, Loss: 0.8797
Epoch 60, Loss: 0.8599
Epoch 70, Loss: 0.8420
Epoch 80, Loss: 0
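
For full-batch gradient descent on a mean-squared-error linear model like the one in Part 2, the largest stable learning rate is set by the curvature of the loss. The iteration converges only if the learning rate is below 2 / lambda_max, where lambda_max is the largest eigenvalue of the Hessian (2/N) * A^T A and A is the input matrix with a column of ones appended for the bias. The sketch below estimates that bound with NumPy on freshly generated data of the same form (x uniform on [0, 10)). It does not reuse the seeded tensors above, and the names `A`, `hessian`, and `max_stable_lr` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for the Part 2 inputs: 100 samples, x uniform on [0, 10)
x = rng.uniform(0.0, 10.0, size=(100, 1))

# Append a column of ones so the bias is treated as a second parameter
A = np.hstack([x, np.ones_like(x)])

# Hessian of the loss mean((A @ params - y)**2) with respect to params
hessian = 2.0 * A.T @ A / len(A)

# Full-batch gradient descent on this quadratic loss converges only if
# lr < 2 / (largest Hessian eigenvalue)
lambda_max = np.linalg.eigvalsh(hessian).max()
max_stable_lr = 2.0 / lambda_max

print(f"largest Hessian eigenvalue: {lambda_max:.2f}")
print(f"largest stable learning rate: {max_stable_lr:.4f}")
for lr in (0.01, 0.1, 0.5):
    status = "stable" if lr < max_stable_lr else "expected to diverge"
    print(f"lr = {lr}: {status}")
```

On inputs of this scale the bound comes out at roughly 0.03, which suggests that 0.01 is safe while 0.1 and 0.5 overshoot. Rescaling the inputs, for example with the normalization from the first exercise, lowers the curvature and raises the bound.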

## Analytical Solution vs. Iterative Solution
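In this exercise, you'll explore two methods for fitting the linear model X*theta ≈ Y, where X is the input matrix, Y is the target vector, and theta is the parameter vector we want to determine. Because Y contains noise, the system generally has no exact solution, so both methods look for the theta that minimizes the mean squared error. This is the core problem in linear regression and appears in many other machine learning tasks.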

You'll implement two approaches in PyTorch:

1. Analytical solution: theta = (X^T * X)^(-1) * (X^T * Y)
2. Iterative solution using gradient descent
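
Both approaches minimize the same least-squares objective. As a short derivation, writing $N$ for the number of rows of $X$ and $\eta$ for the learning rate (notation assumed here):

$$
L(\theta) = \frac{1}{N}\,\lVert X\theta - Y \rVert^2, \qquad
\nabla_\theta L = \frac{2}{N}\, X^\top (X\theta - Y)
$$

$$
\nabla_\theta L = 0
\;\Longrightarrow\; X^\top X\,\theta = X^\top Y
\;\Longrightarrow\; \theta = (X^\top X)^{-1} X^\top Y
$$

The last step assumes $X^\top X$ is invertible. Gradient descent instead repeats $\theta \leftarrow \theta - \eta\,\nabla_\theta L$ until the gradient is close to zero, so with a suitable learning rate both routes arrive at the same $\theta$.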

In [1]:
import torch

# Generate some sample data
torch.manual_seed(42)
X = torch.rand(100, 3)
true_theta = torch.tensor([1.5, -0.8, 2.0])
Y = torch.matmul(X, true_theta) + torch.randn(100) * 0.1

# Implement the analytical solution
def analytical_solution(X, Y):
    # Hint: Use torch.matmul() for matrix multiplication and torch.inverse() for matrix inversion
    XTX = torch.matmul(X.T, X)
    XTY = torch.matmul(X.T, Y)
    theta = torch.matmul(torch.inverse(XTX), XTY)
    return theta

# Implement the gradient descent solution
def gradient_descent_solution(X, Y, learning_rate, num_iterations):
    theta = torch.randn(X.shape[1])  # Initialize theta with random values, one per feature
    for i in range(num_iterations):
        # Implement the gradient descent update rule
        # 1. Compute the predicted Y
        # 2. Compute the loss (mean squared error)
        # 3. Compute the gradient
        # 4. Update theta

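        Y_pred = torch.matmul(X, theta)  # 1. Predicted Y
        loss = torch.mean(torch.square(Y_pred - Y))  # 2. Mean squared error
        # 3. Gradient of the *summed* squared error, i.e. N times the gradient
        #    of the mean loss above, so the effective step size on the MSE is
        #    learning_rate * N
        gradient = 2 * torch.matmul(X.T, Y_pred - Y)
        theta = theta - learning_rate * gradient  # 4. Update theta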

        if i % 1000 == 0:
            print(f"Iteration {i}, Loss: {loss.item():.4f}")

    return theta

# Call both functions and compare the results
analytical_theta = analytical_solution(X, Y)
print("Analytical solution theta:", analytical_theta)
print("")

gradient_descent_theta = gradient_descent_solution(X, Y, 0.01, 10000)
print("")
print("Gradient descent solution theta:", gradient_descent_theta)

print("True theta:", true_theta)
print("")

# Compare the mean squared error for both solutions
# Hint: Use torch.mean() and torch.square()

analytical_Y_pred = torch.matmul(X, analytical_theta)
analytical_loss = torch.mean(torch.square(analytical_Y_pred - Y))
print("Analytical solution loss:", analytical_loss.item())

gradient_descent_Y_pred = torch.matmul(X, gradient_descent_theta)
gradient_descent_loss = torch.mean(torch.square(gradient_descent_Y_pred - Y))
print("Gradient descent solution loss:", gradient_descent_loss.item())

Analytical solution theta: tensor([ 1.4982, -0.8054,  2.0325])

Iteration 0, Loss: 2.2463
Iteration 1000, Loss: 0.0089
Iteration 2000, Loss: 0.0089
Iteration 3000, Loss: 0.0089
Iteration 4000, Loss: 0.0089
Iteration 5000, Loss: 0.0089
Iteration 6000, Loss: 0.0089
Iteration 7000, Loss: 0.0089
Iteration 8000, Loss: 0.0089
Iteration 9000, Loss: 0.0089
Gradient descent solution theta: tensor([ 1.4982, -0.8054,  2.0325])
True theta: tensor([ 1.5000, -0.8000,  2.0000])

Analytical solution loss: 0.008850558660924435
Gradient descent solution loss: 0.008850560523569584
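
Forming an explicit inverse is fine for a small, well-conditioned problem like this one, but solving the linear system or calling a least-squares routine is usually preferred numerically. The following is a minimal NumPy sketch on freshly generated data of the same shape. It does not reuse the seeded tensors above, and the variable names are illustrative. It shows the three routes agreeing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in data with the same structure as the exercise
X = rng.uniform(0.0, 1.0, size=(100, 3))
true_theta = np.array([1.5, -0.8, 2.0])
Y = X @ true_theta + 0.1 * rng.standard_normal(100)

# 1. Explicit inverse, as in the exercise
theta_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# 2. Solve the normal equations X^T X theta = X^T Y without forming the inverse
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# 3. Least-squares solver applied to X directly
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("inverse:", theta_inv)
print("solve:  ", theta_solve)
print("lstsq:  ", theta_lstsq)
```

PyTorch offers `torch.linalg.solve` and `torch.linalg.lstsq` for the same purpose.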
