<a href="https://colab.research.google.com/github/EddyGiusepe/Overfitting_and_Regularization/blob/main/Regularization_L2_weight_decay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align="center">Weight Decay in Neural Networks with PyTorch (L2 Regularization)</h2>



Data Scientist: Dr. Eddy Giusepe Chirinos Isidro

Weight decay is a regularization technique that adds a small penalty to the loss function, usually the (squared) `L2 norm` of all the model's weights.

![](https://www.oreilly.com/library/view/hands-on-machine-learning/9781788393485/assets/320843d0-3683-4422-80b2-c2913f8d02d4.png)
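
To make the penalty concrete, here is a sketch of the regularized loss and the gradient-descent step it produces. The symbols $\lambda$ (decay coefficient), $\eta$ (learning rate), and $L_{\text{data}}$ (unregularized loss) are notation introduced here for illustration. The factor $\tfrac{1}{2}$ is a convention chosen so that the penalty's gradient is exactly $\lambda w$, which is the term `torch.optim.SGD` adds to the gradient when `weight_decay` is set.

$$
L(w) = L_{\text{data}}(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2
$$

$$
w \leftarrow w - \eta\left(\nabla_w L_{\text{data}}(w) + \lambda w\right) = (1 - \eta\lambda)\,w - \eta\,\nabla_w L_{\text{data}}(w)
$$

The second form shows where the name comes from: at every step the weights are first shrunk by the factor $(1 - \eta\lambda)$ and then updated as usual.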

# How do we use weight decay?

To use weight decay, we simply set the `weight_decay` parameter of the `torch.optim.SGD` or `torch.optim.Adam` optimizer. Here we use `1e-4` as our `weight_decay` value (the PyTorch default for both optimizers is 0).

For example:

```
optimizer = torch.optim.SGD(my_model.parameters(), lr=1e-3, weight_decay=1e-4)


optimizer = torch.optim.Adam(my_model.parameters(), lr=1e-3, weight_decay=1e-4)
```
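
A minimal sketch of what the `weight_decay` argument does in `torch.optim.SGD`: one step with `weight_decay=wd` should match one step with no weight decay on a loss that has the penalty `(wd / 2) * ||w||^2` added by hand. The tensors `w_opt`, `w_manual`, and `x`, and the toy loss, are hypothetical and exist only for this check.

```python
import torch

torch.manual_seed(0)
wd, lr = 1e-4, 1e-3

# Two identical copies of a small, hypothetical weight tensor
w_opt = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
w_manual = w_opt.detach().clone().requires_grad_(True)
x = torch.randn(5, 4, dtype=torch.float64)

# (a) Let the optimizer apply the weight decay
opt_a = torch.optim.SGD([w_opt], lr=lr, weight_decay=wd)
opt_a.zero_grad()
(x @ w_opt).pow(2).mean().backward()
opt_a.step()

# (b) Add the L2 penalty to the loss by hand and disable weight decay
opt_b = torch.optim.SGD([w_manual], lr=lr, weight_decay=0)
opt_b.zero_grad()
loss_b = (x @ w_manual).pow(2).mean() + 0.5 * wd * w_manual.pow(2).sum()
loss_b.backward()
opt_b.step()

# Both routes should land on the same weights
same_update = torch.allclose(w_opt, w_manual)
print(same_update)
```

This equivalence is shown here for plain SGD only; adaptive optimizers rescale the gradient, so the penalty term is rescaled along with it.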

## Why do we use weight decay?

* To prevent overfitting.

* To keep the weights small, which can help avoid exploding gradients.



In [1]:
# Import our libraries

import torch
from torch import nn
import torchvision
from torchvision import models

import numpy as np

# Getting the parameter names

In [2]:
# Our model

# my_model = models.resnet50(pretrained=False)  # `pretrained` is deprecated
my_model = models.resnet50()

In [3]:
for name, parameter in my_model.named_parameters():
  print(name)

conv1.weight
bn1.weight
bn1.bias
layer1.0.conv1.weight
layer1.0.bn1.weight
layer1.0.bn1.bias
layer1.0.conv2.weight
layer1.0.bn2.weight
layer1.0.bn2.bias
layer1.0.conv3.weight
layer1.0.bn3.weight
layer1.0.bn3.bias
layer1.0.downsample.0.weight
layer1.0.downsample.1.weight
layer1.0.downsample.1.bias
layer1.1.conv1.weight
layer1.1.bn1.weight
layer1.1.bn1.bias
layer1.1.conv2.weight
layer1.1.bn2.weight
layer1.1.bn2.bias
layer1.1.conv3.weight
layer1.1.bn3.weight
layer1.1.bn3.bias
layer1.2.conv1.weight
layer1.2.bn1.weight
layer1.2.bn1.bias
layer1.2.conv2.weight
layer1.2.bn2.weight
layer1.2.bn2.bias
layer1.2.conv3.weight
layer1.2.bn3.weight
layer1.2.bn3.bias
layer2.0.conv1.weight
layer2.0.bn1.weight
layer2.0.bn1.bias
layer2.0.conv2.weight
layer2.0.bn2.weight
layer2.0.bn2.bias
layer2.0.conv3.weight
layer2.0.bn3.weight
layer2.0.bn3.bias
layer2.0.downsample.0.weight
layer2.0.downsample.1.weight
layer2.0.downsample.1.bias
layer2.1.conv1.weight
layer2.1.bn1.weight
layer2.1.bn1.bias
layer2.1.conv2.we

# Disable `weight decay` for some layers, or set different values for different layers

In [4]:
def custom_weight_decay(net, l2_value, skip_list=()):
  """Split the parameters into two optimizer groups: without and with weight decay."""
  decay, no_decay = [], []

  for name, param in net.named_parameters():
    if not param.requires_grad:
      continue  # Skip frozen weights
    # 1-D parameters (biases, BatchNorm scales) and names in skip_list get no decay
    if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
      no_decay.append(param)
    else:
      decay.append(param)

  return [{'params': no_decay, 'weight_decay': 0}, {'params': decay, 'weight_decay': l2_value}]



# The returned list is then passed to the optimizer:
params = custom_weight_decay(my_model, 2e-5)

sgd = torch.optim.SGD(params, lr=0.05)
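
One way to check the split is to inspect `param_groups` on the optimizer. The sketch below uses a small hypothetical model (`tiny_model`) instead of ResNet-50 so it runs quickly, and it repeats a corrected copy of `custom_weight_decay` (with the `if` inside the loop) so the block runs on its own.

```python
import torch
from torch import nn

def custom_weight_decay(net, l2_value, skip_list=()):
    # Same logic as the function above, repeated so this block is self-contained
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # Skip frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0},
            {'params': decay, 'weight_decay': l2_value}]

# Hypothetical toy model: only the parameter shapes matter here
tiny_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.Flatten(),
    nn.Linear(8, 2),
)

optimizer = torch.optim.SGD(custom_weight_decay(tiny_model, 2e-5), lr=0.05)

# Each group keeps its own weight_decay; lr is filled in from the optimizer default
group_summary = [(g['weight_decay'], len(g['params'])) for g in optimizer.param_groups]
for weight_decay, n_params in group_summary:
    print(f"weight_decay={weight_decay}: {n_params} parameter tensors")
```

In this toy model the conv and linear weight matrices end up in the decayed group, while the two biases and the BatchNorm scale and shift end up in the group with `weight_decay` set to 0.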


# Checking that the weights are smaller when weight decay is applied

In [5]:
np.random.seed(123)
np.set_printoptions(8, suppress=True)

x_np = np.random.random((3, 4)).astype(np.double)
weights_np = np.random.random((4, 5)).astype(np.double)

x_torch = torch.tensor(x_np, requires_grad=True)
weights_torch = torch.tensor(weights_np, requires_grad=True)

print('Original weights', weights_torch)

Original weights tensor([[0.4386, 0.0597, 0.3980, 0.7380, 0.1825],
        [0.1755, 0.5316, 0.5318, 0.6344, 0.8494],
        [0.7245, 0.6110, 0.7224, 0.3230, 0.3618],
        [0.2283, 0.2937, 0.6310, 0.0921, 0.4337]], dtype=torch.float64,
       requires_grad=True)


In [6]:
################ 0 weight decay  ##################


lr = 0.1
sgd = torch.optim.SGD([weights_torch], lr=lr, weight_decay=0)

y_torch = torch.matmul(x_torch, weights_torch)
loss = y_torch.sum()

sgd.zero_grad()
loss.backward()
sgd.step()

w_grad = weights_torch.grad.data.numpy()
print('weight decay equal to 0', weights_torch)

weight decay equal to 0 tensor([[ 0.2489, -0.1300,  0.2084,  0.5483, -0.0072],
        [ 0.0653,  0.4214,  0.4217,  0.5243,  0.7393],
        [ 0.5694,  0.4559,  0.5674,  0.1679,  0.2067],
        [ 0.0317,  0.0972,  0.4345, -0.1044,  0.2372]], dtype=torch.float64,
       requires_grad=True)


In [7]:
################ NOW 1 weight decay ######################

weights_torch = torch.tensor(weights_np, requires_grad=True)

print('Reset original weights', weights_torch)

sgd = torch.optim.SGD([weights_torch], lr=lr, weight_decay=1)

y_torch = torch.matmul(x_torch, weights_torch)
loss = y_torch.sum()

sgd.zero_grad()
loss.backward()
sgd.step()

w_grad = weights_torch.grad.data.numpy()
print('weight decay equal to 1', weights_torch)

Reset original weights tensor([[0.4386, 0.0597, 0.3980, 0.7380, 0.1825],
        [0.1755, 0.5316, 0.5318, 0.6344, 0.8494],
        [0.7245, 0.6110, 0.7224, 0.3230, 0.3618],
        [0.2283, 0.2937, 0.6310, 0.0921, 0.4337]], dtype=torch.float64,
       requires_grad=True)
weight decay equal to 1 tensor([[ 0.2050, -0.1360,  0.1686,  0.4745, -0.0254],
        [ 0.0478,  0.3683,  0.3685,  0.4608,  0.6544],
        [ 0.4969,  0.3948,  0.4951,  0.1356,  0.1705],
        [ 0.0089,  0.0678,  0.3714, -0.1136,  0.1938]], dtype=torch.float64,
       requires_grad=True)
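
The two results above differ by exactly `lr * weight_decay * w`, that is, `0.1 * w` here. For the first entry: `0.2489 - 0.1 * 0.4386 ≈ 0.2050`. The sketch below repeats the experiment and compares the optimizer's result with the closed-form update `(1 - lr * wd) * w - lr * grad`. The names `w0`, `data_grad`, and `expected` are introduced here for illustration.

```python
import numpy as np
import torch

np.random.seed(123)
x = torch.tensor(np.random.random((3, 4)))   # float64, same draws as above
w0 = torch.tensor(np.random.random((4, 5)))  # original weights
lr, wd = 0.1, 1.0

# One SGD step with weight decay
w = w0.clone().requires_grad_(True)
sgd = torch.optim.SGD([w], lr=lr, weight_decay=wd)
sgd.zero_grad()
torch.matmul(x, w).sum().backward()
sgd.step()

# Gradient of sum(x @ w) w.r.t. w: row k holds the sum of column k of x
data_grad = x.sum(dim=0).unsqueeze(1).expand_as(w0)

# Closed-form update with and without the decay term
expected = (1 - lr * wd) * w0 - lr * data_grad
no_decay = w0 - lr * data_grad

matches = torch.allclose(w.detach(), expected)
all_smaller = bool((w.detach() < no_decay).all())
print(matches, all_smaller)
```

Because every original weight is positive in this example, the decay term lowers every entry relative to the run without weight decay.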
