### Optimizer: Adam

In this notebook I will attempt to observe the values for the optimal alpha given the quadratic approximation of the loss for a mini batch 
$L(\theta) \approx q(\theta):=\frac{1}{2}\left(\theta-\theta_0\right)^{\top} H\left(\theta-\theta_0\right)+\left(\theta-\theta_0\right)^{\top} \cdot g+c$  
 
 
along a direction d  

$h(\alpha)=q\left(\theta_0+\alpha \cdot d\right)$  
 
 
which is
$\alpha^*=\frac{-d^{\top} g}{d^{\top} H d}$


First we define a simple model:

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create a simple dataset
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)

# Initialize the model and optimizer
model = SimpleNet()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
n_epochs = 100
batch_size = 32

Then we train using Adam and try to observe d from adam:

In [7]:
# Initialize an empty list to store flattened directions and gradients
d_list = []
g_list = []

for epoch in range(n_epochs):
    for i in range(0, len(X), batch_size):
        # Get mini-batch
        X_batch = X[i:i+batch_size]
        y_batch = y[i:i+batch_size]

        # Forward pass
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch)

        # Backward pass
        optimizer.zero_grad()  # Reset gradients
        loss.backward()  # Compute gradients

        # Clear lists for directions and gradients
        d_list.clear()
        g_list.clear()

        for param in model.parameters():
            if param.grad is not None:
                # Flatten and store the gradient for this parameter
                g_list.append(param.grad.view(-1))

                if param not in optimizer.state:
                    # Initialize state if it doesn't exist
                    optimizer.state[param]['exp_avg'] = torch.zeros_like(param.data)
                    optimizer.state[param]['exp_avg_sq'] = torch.zeros_like(param.data)
                    optimizer.state[param]['step'] = torch.tensor(0)  # Initialize as tensor

                # Ensure 'step' is a tensor
                if isinstance(optimizer.state[param]['step'], int):
                    optimizer.state[param]['step'] = torch.tensor(optimizer.state[param]['step'])

                # Get Adam's internal state (first and second moments)
                m_t = optimizer.state[param]['exp_avg']        # First moment (moving average of gradients)
                v_t = optimizer.state[param]['exp_avg_sq']     # Second moment (moving average of squared gradients)

                # Bias correction for moments
                beta1, beta2 = optimizer.defaults['betas']
                optimizer.state[param]['step'] += 1  # Increment step as tensor
                t = optimizer.state[param]['step'].item()  # Convert step tensor to int

                m_t_hat = m_t / (1 - beta1**t)  # Bias-corrected first moment
                v_t_hat = v_t / (1 - beta2**t)  # Bias-corrected second moment

                # Compute the direction
                d_t = -m_t_hat / (torch.sqrt(v_t_hat) + optimizer.defaults['eps'])  # Adam's update direction

                # Flatten the direction d_t and append it to the list
                d_list.append(d_t.view(-1))

        # Concatenate all flattened directions into a single vector of size (1, p)
        d_vector = torch.cat(d_list).view(1, -1)  # Shape (1, p)

        # Concatenate all flattened gradients into a single vector of size (p, 1)
        g_vector = torch.cat(g_list).view(-1, 1)  # Shape (p, 1)

        # Optimizer step (updates the parameters)
        optimizer.step()

        # Print the direction vector and gradient for the first mini-batch (for demonstration)
        if i == 0:
            print(f"Epoch {epoch}, Batch 0:")
            print(f"Direction vector d (norm): {d_vector.norm()}")  # Print norm of the vector
            print(f"First 10 values of d: {d_vector[0, :10]}")  # First 10 values of d
            print(f"Gradient vector g (norm): {g_vector.norm()}")  # Print norm of the gradient vector
            print(f"First 10 values of g: {g_vector[:10]}")  # First 10 values of the gradient

    # Print epoch loss every 10 epochs
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item()}")

print("Training complete!")


Epoch 0, Batch 0:
Direction vector d (norm): 1.389596939086914
First 10 values of d: tensor([ 0.1085,  0.3498,  0.1337,  0.2082,  0.0716, -0.0683,  0.1831, -0.2486,
        -0.0767,  0.0833])
Gradient vector g (norm): 1.140217661857605
First 10 values of g: tensor([[ 0.0349],
        [ 0.1747],
        [-0.0589],
        [ 0.0860],
        [-0.0356],
        [-0.0953],
        [ 0.0052],
        [-0.0552],
        [ 0.1047],
        [-0.0160]])
Epoch 0, Loss: 0.4204362630844116
Epoch 1, Batch 0:
Direction vector d (norm): 1.399430751800537
First 10 values of d: tensor([ 0.1054,  0.3331,  0.1077,  0.2249,  0.0634, -0.0377,  0.1939, -0.2573,
        -0.1060,  0.0826])
Gradient vector g (norm): 1.1469032764434814
First 10 values of g: tensor([[ 0.0350],
        [ 0.1748],
        [-0.0589],
        [ 0.0863],
        [-0.0356],
        [-0.0954],
        [ 0.0051],
        [-0.0552],
        [ 0.1047],
        [-0.0160]])
Epoch 2, Batch 0:
Direction vector d (norm): 1.397036075592041
Firs

Epoch 25, Batch 0:
Direction vector d (norm): 1.4115558862686157
First 10 values of d: tensor([ 0.1021,  0.3292,  0.1160,  0.2333,  0.0601, -0.0371,  0.2017, -0.2273,
        -0.1097,  0.0839])
Gradient vector g (norm): 1.1589876413345337
First 10 values of g: tensor([[ 0.0347],
        [ 0.1753],
        [-0.0595],
        [ 0.0880],
        [-0.0367],
        [-0.0953],
        [ 0.0052],
        [-0.0555],
        [ 0.1038],
        [-0.0164]])
Epoch 26, Batch 0:
Direction vector d (norm): 1.394295334815979
First 10 values of d: tensor([ 0.0834,  0.3321,  0.1604,  0.2104,  0.0596, -0.0549,  0.1409, -0.2668,
        -0.1722,  0.1094])
Gradient vector g (norm): 1.1599968671798706
First 10 values of g: tensor([[ 0.0346],
        [ 0.1754],
        [-0.0596],
        [ 0.0882],
        [-0.0368],
        [-0.0954],
        [ 0.0049],
        [-0.0557],
        [ 0.1037],
        [-0.0164]])
Epoch 27, Batch 0:
Direction vector d (norm): 1.401181697845459
First 10 values of d: tensor([ 0.

Epoch 52, Batch 0:
Direction vector d (norm): 1.4002947807312012
First 10 values of d: tensor([ 0.0804,  0.3341,  0.1633,  0.2067,  0.0537, -0.0469,  0.1466, -0.2650,
        -0.1727,  0.1064])
Gradient vector g (norm): 1.1651958227157593
First 10 values of g: tensor([[ 0.0328],
        [ 0.1751],
        [-0.0602],
        [ 0.0906],
        [-0.0371],
        [-0.0963],
        [ 0.0054],
        [-0.0560],
        [ 0.1032],
        [-0.0152]])
Epoch 53, Batch 0:
Direction vector d (norm): 1.3697893619537354
First 10 values of d: tensor([ 0.0994,  0.3214,  0.1351,  0.1924,  0.0985, -0.0891,  0.1213, -0.2475,
        -0.1206,  0.1111])
Gradient vector g (norm): 1.1729422807693481
First 10 values of g: tensor([[ 0.0326],
        [ 0.1751],
        [-0.0601],
        [ 0.0906],
        [-0.0371],
        [-0.0965],
        [ 0.0053],
        [-0.0562],
        [ 0.1032],
        [-0.0151]])
Epoch 54, Batch 0:
Direction vector d (norm): 1.3906012773513794
First 10 values of d: tensor([ 

Epoch 76, Batch 0:
Direction vector d (norm): 1.403216004371643
First 10 values of d: tensor([ 0.0913,  0.3560,  0.1862,  0.1844,  0.0552, -0.0819,  0.1312, -0.2567,
        -0.1466,  0.1036])
Gradient vector g (norm): 1.1771109104156494
First 10 values of g: tensor([[ 0.0307],
        [ 0.1742],
        [-0.0606],
        [ 0.0915],
        [-0.0376],
        [-0.0963],
        [ 0.0058],
        [-0.0573],
        [ 0.1030],
        [-0.0145]])
Epoch 77, Batch 0:
Direction vector d (norm): 1.416673183441162
First 10 values of d: tensor([ 0.0585,  0.2664,  0.1013,  0.2130,  0.0585, -0.0846,  0.3887, -0.1556,
         0.0086,  0.1536])
Gradient vector g (norm): 1.1706690788269043
First 10 values of g: tensor([[ 0.0309],
        [ 0.1742],
        [-0.0607],
        [ 0.0916],
        [-0.0375],
        [-0.0963],
        [ 0.0057],
        [-0.0571],
        [ 0.1029],
        [-0.0145]])
Epoch 78, Batch 0:
Direction vector d (norm): 1.3786702156066895
First 10 values of d: tensor([ 0.