# Optimizer

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mitchell-Mirano/sorix/blob/qa/docs/learn/optimizers/06-Optimizer.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in-GitHub-black?logo=github)](https://github.com/Mitchell-Mirano/sorix/blob/qa/docs/learn/optimizers/06-Optimizer.ipynb)

In Sorix, the `Optimizer` base class serves as the foundation for all optimization algorithms (like SGD, Adam, or RMSprop). It provides common utilities for updating model parameters but leaves the actual update logic to its subclasses.

By inheriting from `Optimizer`, you can easily implement your own optimization logic for research or specialized use cases.

## Anatomy of the Optimizer Class

Every optimizer in Sorix must follow a simple contract:

1. **`__init__(self, parameters, lr)`**: Receives a list of `Tensor` objects to optimize and a learning rate.
2. **`zero_grad()`**: Clears the `.grad` attribute of all parameters.
3. **`step()`**: The heart of the optimizer, where you define how to modify `parameter.data` using `parameter.grad`.
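
The contract above can be sketched in plain Python. `ToyParam`, `ToyOptimizer`, and `PlainSGD` are hypothetical stand-ins, not Sorix classes, and clearing gradients by setting them to `None` is an assumption based on the `is not None` check used in the examples on this page. The real base class may differ.

```python
class ToyParam:
    """Hypothetical stand-in for a Sorix Tensor: a value and its gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = None


class ToyOptimizer:
    """Illustrative base class; not Sorix's actual implementation."""
    def __init__(self, parameters, lr):
        self.parameters = list(parameters)
        self.lr = lr

    def zero_grad(self):
        # Assumption: "clearing" a gradient means resetting it to None
        for p in self.parameters:
            p.grad = None

    def step(self):
        # The update rule is left to subclasses
        raise NotImplementedError


class PlainSGD(ToyOptimizer):
    def step(self):
        for p in self.parameters:
            if p.grad is not None:
                p.data -= self.lr * p.grad


p = ToyParam(3.0)
opt = PlainSGD([p], lr=0.5)
p.grad = 2.0      # pretend a backward pass filled this in
opt.step()        # 3.0 - 0.5 * 2.0
opt.zero_grad()
print(p.data, p.grad)  # 2.0 None
```

The base class owns the bookkeeping (`parameters`, `lr`, `zero_grad()`), so a subclass only has to supply `step()`.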

## 1. Creating a Custom Optimizer

Let's implement a **Sign Gradient Descent** optimizer. Instead of scaling the step by the gradient's magnitude, it uses only the sign (direction) of each gradient component of the loss $L$ and moves every component by a fixed step size $\eta$:

$$w \leftarrow w - \eta \cdot \text{sign}(\nabla_w L)$$

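Because the update ignores the gradient's magnitude, a single very large or very small gradient cannot blow up or stall the step, which can make optimization more robust when gradients are noisy.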
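
To make the difference concrete, here is a small plain-Python comparison of a standard gradient step and a sign step on three scalar weights. The `sign` helper and the sample gradients are made up for illustration and are not part of Sorix.

```python
def sign(x):
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


eta = 0.5
weights = [1.0, 1.0, 1.0]
grads = [0.001, -250.0, 0.0]   # tiny, huge, and zero gradients

# Standard gradient descent: the step scales with the gradient magnitude
sgd_update = [w - eta * g for w, g in zip(weights, grads)]

# Sign descent: every non-zero gradient moves the weight by exactly eta
sign_update = [w - eta * sign(g) for w, g in zip(weights, grads)]

print(sgd_update)
print(sign_update)  # [0.5, 1.5, 1.0]
```

The huge gradient sends the standard update far from its starting point, while the sign update moves each weight by at most `eta`. A zero gradient produces no movement in either case.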

In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@qa'

In [2]:
import numpy as np
from sorix.optim import Optimizer
from sorix import tensor

class SignSGD(Optimizer):
    def __init__(self, parameters, lr=0.01):
        # Initialize using the base class
        super().__init__(parameters, lr)
        
    def step(self):
        for param in self.parameters:
            if param.grad is not None:
                # Only update based on the sign of the gradient
                # We use self.xp to correctly handle CPU or GPU
                param.data -= self.lr * self.xp.sign(param.grad)

# Create a simple model
w = tensor([10.0], requires_grad=True)
optim = SignSGD([w], lr=2.0)

print(f"Initial value: {w.item():.2f}")
for i in range(5):
    loss = (w - 2.0)**2
    loss.backward()
    optim.step()
    optim.zero_grad()
    print(f"Step {i+1} | Value: {w.item():.2f}")

Initial value: 10.00
Step 1 | Value: 8.00
Step 2 | Value: 6.00
Step 3 | Value: 4.00
Step 4 | Value: 2.00
Step 5 | Value: 2.00

Each of the first four steps moves `w` by exactly `lr = 2.0`. At step 5 the parameter already sits at the minimum $w = 2$, where the gradient is zero, so $\text{sign}(0) = 0$ and the value does not change.


## 2. Managing Internal State

Many optimizers (like Adam or SGD with Momentum) need to track additional state for each parameter across steps (e.g., a running average of past gradients).

Sorix prefers plain Python **lists** for this state, with one entry per parameter in the same order as `self.parameters`. A list is indexed directly by position, which avoids the key hashing that a dictionary lookup requires on every access.

In [3]:
class MovingAverageSGD(Optimizer):
    def __init__(self, parameters, lr=0.01, beta=0.9):
        super().__init__(parameters, lr)
        self.beta = beta
        # Pre-allocate a list of buffers (one for each parameter)
        self.m = [self.xp.zeros_like(p.data) for p in self.parameters]
        
    def step(self):
        for i, param in enumerate(self.parameters):
            if param.grad is None:
                continue
            
            # Update moving average: m = beta*m + (1-beta)*grad
            self.m[i] = self.beta * self.m[i] + (1 - self.beta) * param.grad
            
            # Perform weight update
            param.data -= self.lr * self.m[i]

print("MovingAverageSGD created successfully with List state management!")

MovingAverageSGD created successfully with List state management!
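
The moving-average rule used in `MovingAverageSGD` can be traced by hand. The sketch below uses plain Python floats instead of tensors, with made-up constant gradients and `beta = 0.5` chosen so the arithmetic is easy to follow.

```python
beta = 0.5
grads = [1.0, -2.0]   # pretend each parameter sees the same gradient every step

# One buffer per parameter, stored in a list and addressed by index
m = [0.0, 0.0]

history = []
for step in range(3):
    for i, g in enumerate(grads):
        # m = beta*m + (1-beta)*grad
        m[i] = beta * m[i] + (1 - beta) * g
    history.append(list(m))

for row in history:
    print(row)
# [0.5, -1.0]
# [0.75, -1.5]
# [0.875, -1.75]
```

Each buffer starts at zero and moves toward its parameter's gradient, closing half of the remaining gap per step when `beta = 0.5`. A larger `beta` makes the buffer react more slowly.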


## 3. Training Proof

Let's check that our `SignSGD` optimizer can train a small model. As long as the gradient signs point in the right direction, the loss should decrease regardless of the gradient magnitude, although with a fixed step size each weight can only settle to within about `lr` of its optimum.

In [4]:
from sorix.nn import Linear, MSELoss

model = Linear(5, 1)
optimizer = SignSGD(model.parameters(), lr=0.01)
criterion = MSELoss()

# Simple linear target: y = sum(x)
X = tensor(np.random.randn(100, 5))
y = tensor(np.sum(X.numpy(), axis=1, keepdims=True))

print("Training model with custom SignSGD...")
for epoch in range(101):
    y_pred = model(X)
    loss = criterion(y_pred, y)
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | Loss: {loss.item():.6f}")

Training model with custom SignSGD...
Epoch   0 | Loss: 5.047922
Epoch  20 | Loss: 3.456228
Epoch  40 | Loss: 2.279963
Epoch  60 | Loss: 1.405424
Epoch  80 | Loss: 0.754761
Epoch 100 | Loss: 0.327485
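
The loss is still falling after 100 epochs, but a fixed-size sign step cannot shrink as the weights approach the optimum. The plain-Python sketch below illustrates this on a one-dimensional quadratic. The `sign_descent` helper and its `decay` parameter are hypothetical and not part of Sorix, and shrinking the step each iteration is only one possible remedy.

```python
def sign_descent(w, target, lr, steps, decay=1.0):
    """Minimize (w - target)**2 with sign steps; return the trajectory."""
    path = [w]
    for _ in range(steps):
        grad = 2.0 * (w - target)
        direction = (grad > 0) - (grad < 0)   # sign of the gradient
        w -= lr * direction
        lr *= decay                           # decay=1.0 keeps the step fixed
        path.append(w)
    return path


fixed = sign_descent(w=1.0, target=0.3, lr=0.5, steps=6)
decayed = sign_descent(w=1.0, target=0.3, lr=0.5, steps=6, decay=0.5)

print(fixed)    # [1.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0]
print(decayed)  # [1.0, 0.5, 0.25, 0.375, 0.3125, 0.28125, 0.296875]
```

With a fixed step the parameter bounces between `0.0` and `0.5` around the target `0.3`. Halving the step each iteration lets it close in on the target instead.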


## Conclusion

The `Optimizer` base class makes it straightforward to experiment with new learning algorithms. Subclass it and implement `step()` to update each parameter's data from its gradient. The base class supplies the rest: `zero_grad()` for clearing gradients between iterations and `self.xp` for hardware-specific array operations.