# DLO-JZ Optimizers and Large Batch - Day 3

The cells in this *notebook* are not meant to be modified, except for rare exceptions indicated in the comments.

*Notebook written by the IDRIS AI assistance team*
* Translation to English, June 2024
* Adaptation to the new format, February 2024
* modification of lrfinder & cifar10, October 2023
* addition of Lion, July 2023
* creation, October 2022


------------------------

This notebook is intended to be run from a front-end machine of Jean-Zay. The *hostname* should be jean-zay[1-5] or jean-zay-srv2.

In [None]:
!hostname

In [None]:
%load_ext autoreload
%autoreload 2

We import the *python* functions developed by IDRIS for SLURM queue management, along with the functions dedicated to the DLO-JZ training.

The environment module for the *jobs*, the account and the number of GPUs are set for this *notebook*.

**TODO:** choose a *nickname* (maximum 5 characters) to differentiate yourself in the SLURM queue and in collaborative tools during the training and competition.

In [None]:
from idr_pytools import display_slurm_queue, gpu_jobs_submitter, search_log
from dlojz_tools_tp import plot_accuracy, lrfind_plot, plot_accuracy_lr, model_vizu, update_vizu
MODULE = 'pytorch-gpu/py3/2.3.0'
account = 'for@a100'
n_gpu = 1

name = 'pseudo'  #TODO#

In [None]:
!mkdir -p checkpoints

------------------------------------

## Dataset and Model

For this lab, we will use the CIFAR 10 dataset and the Resnet-18 model so that complete trainings run in a reasonable amount of time. The lab will be conducted by modifying the [cifar10.py](cifar10.py) code.

### CIFAR 10

#### Train set


In [None]:
import os
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
import torch
import numpy as np
import matplotlib.pyplot as plt

transform = transforms.Compose([ 
        transforms.RandomHorizontalFlip(),   # Horizontal Flip - Data Augmentation
        transforms.RandomCrop(32, padding=4),
        transforms.ToTensor()                # Convert the PIL Image to a tensor
        ])
    
    
train_dataset = torchvision.datasets.CIFAR10(root=os.environ['ALL_CCFRSCRATCH']+'/CIFAR_10',
                                             train=True, download=False, transform=transform)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,    
                                           batch_size=4,
                                           shuffle=True)
train_dataset

In [None]:
batch = next(iter(train_loader))
print('X train batch, shape: {}, data type: {}, Memory usage: {} bytes'
      .format(batch[0].shape, batch[0].dtype, batch[0].element_size()*batch[0].nelement()))
print('Y train batch, shape: {}, data type: {}, Memory usage: {} bytes'
      .format(batch[1].shape, batch[1].dtype, batch[1].element_size()*batch[1].nelement()))


img = batch[0][0].numpy().transpose((1,2,0))
fig, ax = plt.subplots(figsize=(2,2))
plt.imshow(img)
plt.axis('off')
_ = plt.title('label class: {}'.format(batch[1][0].numpy()))

#### Validation set

In [None]:
val_transform = transforms.Compose([transforms.ToTensor()])  # Convert the PIL Image to a tensor

val_dataset = torchvision.datasets.CIFAR10(root=os.environ['ALL_CCFRSCRATCH']+'/CIFAR_10',
                                           train=False, download=False, transform=val_transform)
val_dataset

### Resnet-18

In [None]:
model = models.resnet18(num_classes=10)
print('number of total parameters: {}'.format(sum([p.numel() for p in model.parameters()])))
print('number of trainable parameters: {}'.format(sum([p.numel() for p in model.parameters() if p.requires_grad])))

-----------

## Description

We will study 5 optimizers (SGD, AdamW, LAMB, LARS, and Lion).

Each time, we will look at the case of **Small Batch** learning and **Large Batch** learning.

 * **Small Batch**: *Global Batch Size* of **256** on 1 GPU over **30** *epochs* (or **5880** *training steps*)
 * **Large Batch**: *Global Batch Size* of **8192** on 1 GPU over **75** *epochs* (or **525** *training steps*)
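
The step counts above can be checked from the size of the CIFAR 10 train set (50,000 images). This sketch assumes the last, incomplete batch of each epoch is kept, which is an assumption about [cifar10.py](cifar10.py); the helper name `training_steps` is hypothetical.

```python
import math

TRAIN_SET_SIZE = 50_000  # number of images in the CIFAR 10 train set


def training_steps(global_batch_size, epochs, n_samples=TRAIN_SET_SIZE):
    """Total number of optimizer steps, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(n_samples / global_batch_size)
    return steps_per_epoch * epochs


small_batch_steps = training_steps(global_batch_size=256, epochs=30)
large_batch_steps = training_steps(global_batch_size=8192, epochs=75)
print(small_batch_steps, large_batch_steps)
```

With roughly 11 times fewer steps, the Large Batch run has far fewer weight updates available, which is why the choice of optimizer and *learning rate* matters more in that case.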

**Note**:

The *wrapped_optimizer* parameter is present in the code due to the specific implementation of LARS. This is because the *LR scheduler* must take the base *SGD optimizer* as input, not the *wrapped* one.

For the other optimizers, it serves no purpose, but this trick makes it easy to switch from one type of *optimizer* to another.

------------------------------------

## TP_opti_0: Constant *Learning Rate* (reference)

### Learning Rate Finder (SGD)

Whether with a constant *learning rate* or with a *Cycle Scheduler*, we first need to find the range of *learning rate* values that will have a positive effect on the model's learning.

Before the lab, we ran a script performing a few iterations for various *learning rates*. If you're curious, you can look at the *lrfinder_cifar10.py* script.

Below, you can observe the loss curves as a function of the *learning rate* for two batch sizes, knowing that the optimizer is *SGD*.

From these curves, you should be able to identify a scale of *learning rates* that perform better than others.
When two *learning rates* give "similar" losses in the *lr finder*, the smaller one will probably lead to a more stable training.
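
This selection rule can be sketched in a few lines. The `(learning rate, loss)` pairs below are invented for illustration and are not read from the plots; `pick_lr` and its `tolerance` parameter are hypothetical names, not part of `lrfinder_cifar10.py`.

```python
def pick_lr(results, tolerance=0.05):
    """Return the smallest learning rate whose loss is within a relative
    `tolerance` of the best loss found by the sweep."""
    best_loss = min(loss for _, loss in results)
    threshold = best_loss * (1 + tolerance)
    candidates = [lr for lr, loss in results if loss <= threshold]
    return min(candidates)


# Made-up sweep results: (learning rate, loss after a few iterations)
sweep = [(1e-3, 2.10), (1e-2, 1.75), (5e-2, 1.42), (1e-1, 1.40), (5e-1, 1.95), (1.0, 2.60)]

chosen_lr = pick_lr(sweep)
print(chosen_lr)
```

Here `0.1` has the lowest loss, but `0.05` is within 5% of it, so the smaller value is preferred.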

#### Small Batch

![lr_decorelated_SGD_256](images_lrfinder/lr_decorelated_SGD_256.png)

#### Large Batch

![lr_decorelated_SGD_8192](images_lrfinder/lr_decorelated_SGD_8192.png)

### Training (SGD)

We will launch a reference training with a constant *learning rate*.

Job submission. **Be careful, you are requesting the compute nodes at this time**.

To submit the job, please switch the following cell from `Raw NBConvert` mode to `Code` mode.

**TODO: replace XXX with the chosen *learning rate* value**

In [None]:
display_slurm_queue(name)

In [None]:
#jobids_ref = ['400133', '400134']

#### Small Batch

In [None]:
jobids=jobids_ref[0]
plot_accuracy_lr(jobids)

#### Large Batch

In [None]:
jobids=jobids_ref[1]
plot_accuracy_lr(jobids)

## TP_opti_1: *One Cycle Learning Rate*


We will now modify the code to replace the constant *learning rate* with a *One Cycle Scheduler* and compare the result with the reference training.

**TODO**: in the script [cifar10.py](cifar10.py):
* Find the line declaring the *Learning Rate Scheduler*:

```python
scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer, factor=1, total_iters=5)
```
* Replace it with a *One Cycle Scheduler*:

```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=args.lr, steps_per_epoch=N_batch, epochs=args.epochs)
```

__Note__: PyTorch's *OneCycleLR* automatically calculates a minimum *learning rate* from the given maximum value.
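
As a sketch of that calculation, assuming the documented defaults of `OneCycleLR` are left unchanged (`div_factor=25` and `final_div_factor=1e4`): the schedule starts at `max_lr / div_factor` and ends at that initial value divided by `final_div_factor`. The helper name `one_cycle_bounds` is hypothetical and the `max_lr` value is only a placeholder.

```python
def one_cycle_bounds(max_lr, div_factor=25.0, final_div_factor=1e4):
    """Initial and final learning rates derived from max_lr by OneCycleLR,
    using the default divisors documented by PyTorch."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    return initial_lr, min_lr


initial_lr, min_lr = one_cycle_bounds(max_lr=0.1)  # placeholder max_lr
print(f"initial lr: {initial_lr:.2e}, final lr: {min_lr:.2e}")
```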

Job submission. **Be careful, you are requesting the compute nodes at this time**.

To submit the job, please switch the following cell from `Raw NBConvert` mode to `Code` mode.

**TODO: replace XXX with the chosen maximum *learning rate* values (Small Batch and Large Batch) using the *learning rate finder***


In [None]:
display_slurm_queue(name)

In [None]:
#jobid_sgd = ['153761', '153762', '153763', '153764']

**Please note that we have run trainings with and without weight decay.** This lets you estimate the effect of *weight decay* + *one cycle scheduler* on the *training* curve.

Then, you can compare the *test accuracy* and *learning rate* curves with the reference training above.

#### Small Batch

In [None]:
jobids=[jobid_sgd[2],jobid_sgd[0]]
plot_accuracy_lr(jobids)

#### Large Batch

In [None]:
jobids=[jobid_sgd[3],jobid_sgd[1]]
plot_accuracy_lr(jobids)

------------------------------------

## TP_opti_2: *AdamW* Optimizer

We will now modify the optimizer to use *AdamW*.

**TODO**: in the script [cifar10.py](cifar10.py):
* Find the line declaring the *Stochastic Gradient Descent* optimizer:

```python
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.mom, weight_decay=args.wd)
```
* Replace it with the *AdamW* optimizer:

```python
optimizer = torch.optim.AdamW(model.parameters(), args.lr, betas=(args.mom, 0.999), weight_decay=args.wd)
```

### Learning Rate Finder (AdamW)

Since the optimizer has changed, we need to determine the maximum *learning rate* value to provide as a parameter.

As with the previous optimizer, we suggest the following *learning rate* exploration:

#### Small Batch

![lr_decorelated_AdamW_256](images_lrfinder/lr_decorelated_AdamW_256.png)

#### Large Batch

![lr_decorelated_AdamW_8192](images_lrfinder/lr_decorelated_AdamW_8192.png)

### Training (AdamW)

You can now launch the training with the *AdamW* optimizer.

Job submission. **Be careful, you are requesting the compute nodes at this time**.

To submit the job, please switch the following cell from `Raw NBConvert` mode to `Code` mode.

**TODO: replace XXX with the chosen maximum *learning rate* values (Small Batch and Large Batch)**

In [None]:
display_slurm_queue(name)

In [None]:
#jobid_adamw = ['400755', '400757']

You can compare the *test accuracy* and *train accuracy* curves with the previous trainings.

#### Small Batch

In [None]:
plot_accuracy([jobid_sgd[0], jobid_adamw[0]])

#### Large Batch

In [None]:
plot_accuracy([jobid_sgd[1], jobid_adamw[1]])

------------------------------------

## TP_opti_3: *LAMB* Optimizer

We will now modify the optimizer to use *LAMB*.

**TODO**: in the script [cifar10.py](cifar10.py):
* Replace the *AdamW* optimizer with the *LAMB* optimizer:

```python
optimizer = apex.optimizers.FusedLAMB(model.parameters(), args.lr, betas=(args.mom, 0.999), weight_decay=args.wd)
```

### Learning Rate Finder (LAMB)

We now need to find the maximum *learning rate* value to provide as a parameter for this optimizer:

#### Small Batch

![lr_decorelated_FusedLAMB_256](images_lrfinder/lr_decorelated_FusedLAMB_256.png)

#### Large Batch

![lr_decorelated_FusedLAMB_8192](images_lrfinder/lr_decorelated_FusedLAMB_8192.png)

### Training (LAMB)

You can now launch the training with the LAMB optimizer.

Job submission. **Be careful, you are requesting the compute nodes at this time**.

To submit the job, please switch the following cell from `Raw NBConvert` mode to `Code` mode.

**TODO: replace XXX with the chosen maximum *learning rate* values (Small Batch and Large Batch)**

In [None]:
display_slurm_queue(name)

In [None]:
#jobid_lamb = ['401241', '401282']

You can compare the *test accuracy* and *train accuracy* curves with the previous trainings.

#### Small Batch

In [None]:
plot_accuracy([jobid_sgd[0], jobid_adamw[0], jobid_lamb[0]])

#### Large Batch

In [None]:
plot_accuracy([jobid_sgd[1], jobid_adamw[1], jobid_lamb[1]])

------------------------------------

## TP_opti_4: *LARS* Optimizer


We will try a large batch training with the LARS optimizer, using LARC (the Apex implementation of LARS).

**TODO**: in the script [cifar10.py](cifar10.py):
* Replace the *LAMB* optimizer with the *LARC* optimizer:

```python
optimizer = ...

wrapped_optimizer = optimizer
```
with

```python
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.mom, weight_decay=args.wd)
    
wrapped_optimizer = LARC(optimizer)
```
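
For intuition, the layer-wise scaling at the heart of LARS can be sketched in plain Python. This is an illustration of the idea only, not the Apex `LARC` code: the function names are hypothetical, and the `trust_coefficient` of 0.02 is an illustrative value.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coefficient=0.02, weight_decay=0.0, eps=1e-8):
    """Layer-wise learning rate: proportional to ||w|| / ||g||, so that the
    relative size of the update is similar for every layer."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    return trust_coefficient * w_norm / (g_norm + weight_decay * w_norm + eps)


# Same weights, gradients of very different magnitude
layer_small_grad = lars_local_lr([3.0, 4.0], [0.03, 0.04])
layer_large_grad = lars_local_lr([3.0, 4.0], [3.0, 4.0])
print(layer_small_grad, layer_large_grad)
```

A layer with small gradients gets a larger local *learning rate* than a layer with large gradients, which is what helps keep large batch training stable.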

### Learning Rate Finder (LARS)

We now need to find the maximum *learning rate* value to provide as a parameter for this optimizer:

#### Small Batch

![lr_decorelated_LARC_256](images_lrfinder/lr_decorelated_LARC_256.png)

#### Large Batch

![lr_decorelated_LARC_8192](images_lrfinder/lr_decorelated_LARC_8192.png)

### Training (LARS)


You can now launch the training with the LARS optimizer.

Job submission. **Be careful, you are requesting the compute nodes at this time**.

To submit the job, please switch the following cell from `Raw NBConvert` mode to `Code` mode.

**TODO: replace XXX with the chosen maximum *learning rate* values (Small Batch and Large Batch)**

In [None]:
display_slurm_queue(name)

In [None]:
#jobid_lars = ['403084', '403085']

You can compare the *test accuracy* and *train accuracy* curves with the previous trainings.

#### Small Batch

In [None]:
plot_accuracy([jobid_sgd[0], jobid_adamw[0], jobid_lamb[0], jobid_lars[0]])

#### Large Batch

In [None]:
plot_accuracy([jobid_sgd[1], jobid_adamw[1], jobid_lamb[1], jobid_lars[1]])

------------------------------------

## TP_opti_5: *LION* Optimizer (experimental ⚠)

In this section, we will test an optimizer from early 2023: LION.

It is an optimizer from the publication: Symbolic Discovery of Optimization Algorithms, https://arxiv.org/abs/2302.06675
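
For intuition, the update rule described in the paper can be sketched on a single scalar parameter. This is an illustration only, not the `Lion` class used in [cifar10.py](cifar10.py); the names `sign` and `lion_step` are hypothetical, and the default betas (0.9, 0.99) are those reported in the paper.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(param, grad, momentum, lr, weight_decay=0.0, beta1=0.9, beta2=0.99):
    """One Lion update on a scalar parameter."""
    # The update direction is only the sign of an interpolation
    # between the momentum and the current gradient.
    update = sign(beta1 * momentum + (1 - beta1) * grad)
    # Decoupled weight decay: the decay term is scaled by lr as well.
    new_param = param - lr * (update + weight_decay * param)
    # The momentum is refreshed with a second interpolation factor.
    new_momentum = beta2 * momentum + (1 - beta2) * grad
    return new_param, new_momentum


param, momentum = 1.0, 0.0
param, momentum = lion_step(param, grad=0.5, momentum=momentum, lr=1e-4, weight_decay=1.0)
print(param, momentum)
```

Because the update is a sign, every parameter moves by the same magnitude `lr` at each step, regardless of the gradient scale.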

**TODO**: in the script [cifar10.py](cifar10.py):
* Replace the *LARC* optimizer with the *LION* optimizer:

```python
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.mom, weight_decay=args.wd)

wrapped_optimizer = LARC(optimizer)
```
with

```python
optimizer = Lion(model.parameters(), lr=args.lr, weight_decay=args.wd)
    
wrapped_optimizer = optimizer
```


### Learning Rate Finder (LION)

We now need to find the maximum *learning rate* value to provide as a parameter for this optimizer.

**In addition to the *learning rate finder*, here are some indications from the authors of the paper:**
> *Based on our experience, a suitable learning rate for Lion is typically 3-10x smaller than that for AdamW. Since the effective weight decay is lr * λ, the value of decoupled weight decay λ used for Lion is 3-10x larger than that for AdamW in order to maintain a similar strength.*
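
The authors' rule of thumb can be turned into a starting range as sketched below. The AdamW values used here are placeholders, not recommendations, and `lion_range_from_adamw` is a hypothetical helper.

```python
def lion_range_from_adamw(adamw_lr, adamw_wd, factors=(3, 10)):
    """Candidate Lion settings: lr divided and weight decay multiplied by the
    same factor, so that the effective decay lr * wd stays unchanged."""
    return [
        {"factor": f, "lr": adamw_lr / f, "weight_decay": adamw_wd * f}
        for f in factors
    ]


# Placeholder AdamW settings, to be replaced by your own values
candidates = lion_range_from_adamw(adamw_lr=1e-3, adamw_wd=5e-2)
for c in candidates:
    print(c, "lr * wd =", c["lr"] * c["weight_decay"])
```

The two candidates bound the range suggested by the authors; the *learning rate finder* curves remain the main guide for choosing a value inside it.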


#### Small Batch

![lr_decorelated_LARC_256](images_lrfinder/lr_decorelated_LARC_256.png)

#### Large Batch

![lr_decorelated_LARC_8192](images_lrfinder/lr_decorelated_LARC_8192.png)

### Training (LION)


You can now launch the training with the LION optimizer.

Job submission. **Be careful, you are requesting the compute nodes at this time**.

To submit the job, please switch the following cell from `Raw NBConvert` mode to `Code` mode.

**TODO: replace XXX with the chosen maximum *learning rate* values (Small Batch and Large Batch). 
You will also need to define the *weight decay* values for each batch size.**

**For relatively small batches, you should be able to achieve good performance using the learning rate finder and the authors' recommendations.**

**Large batches are less straightforward: in our tests, we could not identify parameters that were clearly more effective.**

**You can start with the same *learning rate scheduler* as before, but we strongly encourage you to experiment with changes to it.**

**TODO**: in the script [cifar10.py](cifar10.py):
* Change the learning rate scheduler to a CosineAnnealingLR, with which LION's behavior is more consistent.

```python
scheduler = ...
```
with

```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 
                                                       T_max=N_batch*args.epochs, 
                                                       eta_min=args.lr/5)
```
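
To see what this schedule does, the closed-form cosine annealing expression given in the PyTorch documentation can be evaluated by hand. This is a sketch: the base *learning rate* is a placeholder, `cosine_annealing_lr` is a hypothetical helper, and `196` is assumed to be `N_batch` for the Small Batch case (5880 steps over 30 epochs).

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min):
    """Closed-form cosine annealing from base_lr (step 0) to eta_min (step t_max)."""
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


base_lr = 1e-3          # placeholder value, not a recommendation
t_max = 196 * 30        # N_batch * epochs, assumed Small Batch case
eta_min = base_lr / 5   # same floor as in the snippet above

lr_start = cosine_annealing_lr(0, base_lr, t_max, eta_min)
lr_mid = cosine_annealing_lr(t_max // 2, base_lr, t_max, eta_min)
lr_end = cosine_annealing_lr(t_max, base_lr, t_max, eta_min)
print(lr_start, lr_mid, lr_end)
```

Unlike the *One Cycle Scheduler*, there is no warm-up phase: the *learning rate* starts at its maximum and decays smoothly to one fifth of it.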

In [None]:
display_slurm_queue(name)

In [None]:
#jobid_lion = ['404951', '404955']


You can compare the *test accuracy* and *train accuracy* curves with the previous trainings.

#### Small Batch

In [None]:
plot_accuracy([jobid_sgd[0], jobid_adamw[0], jobid_lamb[0], jobid_lars[0], jobid_lion[0]])

#### Large Batch

In [None]:
plot_accuracy([jobid_sgd[1], jobid_adamw[1], jobid_lamb[1], jobid_lars[1], jobid_lion[1]])

--------------

## Appendix (Bonus)

With the optimizer of your choice (you will need to modify the code accordingly), you can perform additional tests by adjusting the following parameters:
* Number of epochs
* The value of *weight decay*
* The value of *learning rate*
* Batch size. Note: the job runs **on 2 GPUs**, so the *batch* size you give will be **multiplied by 2**.

In [None]:
n_epoch = 
weight_decay = 
lr = 
batch_size = 
command = f'cifar10.py -b {batch_size} -e {n_epoch} --wd {weight_decay} --lr {lr}'

#### LR finder (Optional)

In [None]:
display_slurm_queue(name)

In [None]:
#jobid_test_lrf =

In [None]:
lrfind_plot(jobid_test_lrf)

#### Training

In [None]:
display_slurm_queue(name)

In [None]:
#jobid_test = ['428']

In [None]:
plot_accuracy(jobid_test)

--------------