### Optimization and Loss functions in PyTorch

Before diving into Neural Networks, we first look at how model optimization works. For that, PyTorch offers `torch.optim`.

In `torch.optim` we have a base class called `torch.optim.Optimizer` from which all optimizers inherit. Using this class, we can implement our own optimization algorithms. PyTorch also offers a plethora of already implemented algorithms; a list can be found here: https://pytorch.org/docs/stable/optim.html.
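As a small illustration of the built-in optimizers, the sketch below takes a single `optim.SGD` step on a tiny linear model. The model size, learning rate, and hand-written squared-error loss are assumptions chosen only for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Linear(3, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Every built-in optimizer derives from the common base class.
print(isinstance(optimizer, optim.Optimizer))

x = torch.arange(3, dtype=torch.float32)
target = torch.zeros(2)

# Loss before the update (mean squared error written out by hand).
loss_before = ((model(x) - target) ** 2).mean()

optimizer.zero_grad()    # clear old gradients
loss_before.backward()   # compute gradients of the loss
optimizer.step()         # update the parameters in place

# Loss after one gradient step on the same input.
loss_after = ((model(x) - target) ** 2).mean()
print(loss_before.item(), loss_after.item())
```

The three calls `zero_grad()`, `backward()`, `step()` are the same pattern used in the training loops later in this section.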

For loss functions, PyTorch offers `torch.nn`, which provides the building blocks for Neural Networks. The built-in loss functions are subclasses of `torch.nn.Module`, the base class from which all parts of our Neural Network, including the loss functions, inherit. Their gradients are computed by autograd. We can also write custom loss functions, either as an `nn.Module` or, if we want to define the backward pass ourselves, using `torch.autograd.Function`.
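A minimal sketch of a built-in loss in action: `nn.MSELoss` is itself an `nn.Module`, and autograd supplies the gradient with respect to the prediction. The tensors are made-up values for illustration.

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
print(isinstance(loss_fn, nn.Module))  # built-in losses are modules

prediction = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 0.0, 0.0])

# Mean of the squared differences: (0 + 4 + 9) / 3
loss = loss_fn(prediction, target)
loss.backward()

# Gradient of the mean squared error: 2 * (prediction - target) / 3
print(loss.item())
print(prediction.grad)
```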

In `Toy Optimization` we saw some basic examples of what this looks like.

### Custom Optimization Algorithms and Loss functions in PyTorch

Through `torch.optim` and `torch.nn` we can implement our own loss functions and optimization algorithms.

This can be useful if we want to experiment with our own loss functions or optimization algorithms, or if we want to modify existing ones.

For the optimizer, we need to implement `__init__()` for initialization and `step()`, which takes one optimization step.
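To make the two required methods concrete, here is a minimal sketch of a plain gradient-descent optimizer. The class name `PlainSGD` and the test problem are hypothetical and only serve as an illustration; like the rest of this section, it skips the optional `closure` argument of `step()`.

```python
import torch
import torch.optim as optim


class PlainSGD(optim.Optimizer):  # hypothetical name, for illustration only
    def __init__(self, params, lr=0.1):
        # `defaults` holds the hyperparameters shared by all parameter groups.
        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    @torch.no_grad()  # the update itself must not be tracked by autograd
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # p <- p - lr * grad
                p.add_(p.grad, alpha=-group["lr"])


w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = PlainSGD([w], lr=0.1)

loss = (w ** 2).sum()  # gradient is 2 * w
opt.zero_grad()
loss.backward()
opt.step()

# One step gives w - 0.1 * 2 * w = 0.8 * w
print(w)
```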

For the loss function, what we need to implement depends on whether we inherit from `torch.nn.Module` or from `torch.autograd.Function`. The former is more common: aside from `__init__()`, we only need to implement the `forward` method, since the backward pass is done automatically by autograd. With `torch.autograd.Function` we define a new differentiable operation and implement both `forward` and `backward` ourselves as static methods; autograd then calls our `backward` whenever gradients are propagated through the operation, which is what happens during training.
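The `torch.autograd.Function` route can be sketched as follows: a mean squared error with a hand-written backward pass, checked against the gradient autograd computes for the same expression. The class name `SquaredErrorFn` and the example tensors are assumptions made for this illustration.

```python
import torch


class SquaredErrorFn(torch.autograd.Function):  # hypothetical name
    @staticmethod
    def forward(ctx, prediction, target):
        diff = prediction - target
        ctx.save_for_backward(diff)  # keep what backward needs
        return (diff ** 2).mean()

    @staticmethod
    def backward(ctx, grad_output):
        (diff,) = ctx.saved_tensors
        # d/dprediction mean(diff^2) = 2 * diff / N, scaled by the incoming gradient
        grad_prediction = grad_output * 2.0 * diff / diff.numel()
        # One return value per forward input; no gradient for the target
        return grad_prediction, None


prediction = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.zeros(3)

loss = SquaredErrorFn.apply(prediction, target)
loss.backward()

# Reference: let autograd differentiate the same expression.
reference = prediction.detach().clone().requires_grad_(True)
((reference - target) ** 2).mean().backward()

print(prediction.grad, reference.grad)
```

Note that a custom `Function` is called through `.apply(...)`, not by instantiating the class.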

In [1]:
import torch 
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

#### Custom AdamW Algorithm 

Below we give an example of a custom optimizer implementation in PyTorch. It is inspired by the PyTorch docs, https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html. The official implementation takes many more considerations into account, which we skip here for the sake of simplicity.

Here, `params` is an iterable of all the things (i.e. weights, biases etc.) we want to optimize over.

Through the framework of the `Optimizer` class from which we inherit, we can also define different hyperparameters, like learning rate or decay, for different groups of parameters (imagine you want to learn the weights of one layer faster than the weights of another).

We do this through parameter groups. By passing `defaults` to the `Optimizer` class, we ensure that every hyperparameter is accessible in each group dictionary, even if it was not specified for that group.
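The group mechanism can be seen with a built-in optimizer. In the sketch below (a two-layer model and learning rates chosen arbitrarily for illustration), the second group overrides `lr`, while everything it does not specify is filled in from the defaults.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(3, 4), nn.Linear(4, 2))

optimizer = optim.SGD(
    [
        {"params": model[0].parameters()},             # uses the default lr
        {"params": model[1].parameters(), "lr": 0.1},  # overrides lr for this group
    ],
    lr=0.01,
    momentum=0.9,
)

# Each group is a dict holding its parameters plus all hyperparameters.
for i, group in enumerate(optimizer.param_groups):
    print(i, group["lr"], group["momentum"])
```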

In `step`, all information relevant to the optimization of a parameter (the step count and the moment estimates) is stored in the `state` dictionary. We also allow `maximize`, which lets the user employ this version of AdamW for maximizing instead of minimizing, which might be useful.
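For reference, the update that the `step` method of `customAdamW` is meant to perform for a parameter $\theta$ can be written as below. The symbols are notation assumed here: $\alpha$ is `lr`, $\lambda$ is `decay`, $g_t$ is the gradient at step $t$ (negated if `maximize` is set), and $m_0 = v_0 = 0$.

```latex
\begin{aligned}
\tilde{\theta}_t &= \theta_{t-1} - \alpha \lambda \, \theta_{t-1} \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \tilde{\theta}_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

The first line is the decoupled weight decay, and the divisions by $1 - \beta^t$ correct the bias that comes from initializing the moments at zero.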

In [5]:
# Let us see what a custom optimization algorithm looks like in PyTorch
class customAdamW(optim.Optimizer):
    def __init__(
        self, params, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7, decay=0.0, maximize=False):
        # Use the constructor arguments (not hard-coded values) as per-group defaults
        defaults = dict(lr=lr, beta1=beta1, beta2=beta2, eps=eps, decay=decay, maximize=maximize)
        super().__init__(params, defaults)

    @torch.no_grad()  # parameter updates must not be recorded by autograd
    def step(self):
        for group in self.param_groups:
            # get the hyperparameters of this group
            eps = group["eps"]
            a = group["lr"]
            weight_decay = group["decay"]
            beta1 = group["beta1"]
            beta2 = group["beta2"]

            for p in group["params"]:
                # skip parameters for which no gradient has been computed
                if p.grad is None:
                    continue

                # flip the sign of the gradient if we want to maximize
                gradient = -p.grad if group["maximize"] else p.grad

                state = self.state[p]  # get the state of the current parameter

                # init the state if we are in the first iteration
                if "step" not in state:
                    state["step"] = 0
                    # init the moments
                    state["moment1"] = torch.zeros_like(p)
                    state["moment2"] = torch.zeros_like(p)

                # regularization (decoupled weight decay)
                p.mul_(1 - a * weight_decay)

                # update the step count and the moment estimates
                state["step"] += 1
                state["moment1"] = state["moment1"] * beta1 + (1 - beta1) * gradient
                state["moment2"] = state["moment2"] * beta2 + (1 - beta2) * gradient * gradient

                # bias-corrected moment estimates
                moment1_hat = state["moment1"] / (1 - beta1 ** state["step"])
                moment2_hat = state["moment2"] / (1 - beta2 ** state["step"])

                p.sub_(a * moment1_hat / (torch.sqrt(moment2_hat) + eps))

        return None

In [None]:
# Let's see how our optimizer works on the optimization problem from ToyOptimization
n = 3
m = 2
toy_model = nn.Linear(n, m)
W = toy_model.weight
b = toy_model.bias
print(W, b)
inputs = torch.arange(n, dtype=torch.float32)
labels = torch.zeros(m)
print(f"Before Training we have: {toy_model(inputs)}")
lossfunction = nn.MSELoss()
toy_optimizer = customAdamW(toy_model.parameters(), lr=0.01)
for i in range(100):
    toy_optimizer.zero_grad()
    output = toy_model(inputs)
    loss = lossfunction(output, labels)  # argument order: (prediction, target)
    loss.backward()
    toy_optimizer.step()

print(f"After Training we have: {toy_model(inputs)}")

Parameter containing:
tensor([[-0.2771, -0.4816,  0.2808],
        [-0.4333,  0.0365,  0.5661]], requires_grad=True) Parameter containing:
tensor([-0.5408, -0.3892], requires_grad=True)
Before Training we have: tensor([-0.4608,  0.7794], grad_fn=<ViewBackward0>)
After Training we have: tensor([ 0.0192, -0.0206], grad_fn=<ViewBackward0>)


#### It works :) 

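Now let's see how we can implement our own loss functions. Here we inherit from `nn.Module` and implement the classic MSE loss plus a regularization term. Note that the term used here penalizes the L2 norm of the predictions, weighted by `lambd`, not the norm of the model weights.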

In [9]:
# Let us see what a custom loss function looks like in PyTorch

class RegularizedMSELoss(nn.Module):
    def __init__(self, lambd=0.01):
        super().__init__()
        self.lambd = lambd

    def forward(self, predictions, labels):
        # MSE plus a penalty on the L2 norm of the predictions
        loss = torch.mean((predictions - labels) ** 2) + self.lambd * torch.norm(predictions, p=2)
        return loss

# Let's see how our optimizer combined with the above loss function works on the problem from ToyOptimization
n = 3
m = 2
toy_model = nn.Linear(n, m)
inputs = torch.arange(n, dtype=torch.float32)
labels = torch.zeros(m)
print(f"Before Training we have: {toy_model(inputs)}")
lossfunction = RegularizedMSELoss(lambd=0.001)
toy_optimizer = customAdamW(toy_model.parameters(), lr=0.01)
for i in range(100):
    toy_optimizer.zero_grad()
    output = toy_model(inputs)
    # predictions first, labels second, so the penalty acts on the predictions
    loss = lossfunction(output, labels)
    loss.backward()
    toy_optimizer.step()

print(f"After Training we have: {toy_model(inputs)}")

Before Training we have: tensor([-1.4156,  0.3178], grad_fn=<ViewBackward0>)
After Training we have: tensor([ 0.0244, -0.0023], grad_fn=<ViewBackward0>)
