# Checkpoint Manager Tutorial

**This tutorial was tested with version `0.0.1-beta0` of NeuroTorch.**

The checkpoint manager is a callback, so it plugs directly into NeuroTorch's training pipeline and saves checkpoints as training progresses. You can still use the traditional PyTorch save and load methods; see the [PyTorch save and load tutorial](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for more information.

As usual, the first step is to import NeuroTorch:

In [2]:
import neurotorch as nt

First, create a <code>CheckpointManager</code> object with the name of your desired folder as an argument:

In [2]:
checkpoint_folder = "./checkpoints/network"
checkpoint_manager = nt.CheckpointManager(checkpoint_folder)
print(checkpoint_manager)

CheckpointManager<0>: (priority=100, save_state=False, load_state=False, )


The same <code>checkpoint_folder</code> must also be given to <code>Sequential</code> so that the parameters of the network are saved there during training.

In [3]:
network = nt.Sequential(layers=[nt.Linear(10, 10)], checkpoint_folder=checkpoint_folder).build()
print(network)

Sequential(
  (_to_device_transform): ToDevice(cuda, async=True)
  (input_layers): ModuleDict()
  (hidden_layers): ModuleList()
  (output_layers): ModuleDict(
    (output_0): Linear<output_0>(10->10)@cuda
  )
  (input_transform): ModuleDict(
    (output_0): Sequential(
      (0): CallableToModuleWrapper(Compose(
          ToTensor()
      ))
      (1): ToDevice(cuda, async=True)
    )
  )
  (output_transform): ModuleDict(
    (output_0): IdentityTransform()
  )
)


## What is in the checkpoint folder?

After training, you will obtain three different types of files.

### Training parameters

These are your <code>.pth</code> files. They contain the parameters of your model at a given point in training (at a certain iteration, for instance). These are the files you might want to give to a colleague so they can reproduce your results.

### Network checkpoints $\Rightarrow$ JSON summary

A JSON file is also generated. It lists the names of the saved parameter files, and the best one is labelled so that you can easily access it later. This JSON file is the bridge between your code and the <code>.pth</code> files.
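To see what this summary contains for your own run, you can open it with Python's standard `json` module. The sketch below makes no assumption about the internal layout of the file, which may differ between NeuroTorch versions. The helper names `read_checkpoints_meta` and `print_top_level_keys` are hypothetical and are not part of NeuroTorch.

```python
import json
from pathlib import Path


def read_checkpoints_meta(meta_path):
    """Parse a checkpoint summary file and return its content.

    Hypothetical helper, not part of NeuroTorch.
    """
    with Path(meta_path).open("r", encoding="utf-8") as f:
        return json.load(f)


def print_top_level_keys(meta):
    """Print each top-level key together with the type of its value."""
    for key, value in meta.items():
        print(f"{key}: {type(value).__name__}")
```

Once a summary file exists, calling `print_top_level_keys(read_checkpoints_meta(path))` with the path of that file gives a quick overview of its structure.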

### Training history figure

A training history figure is also generated to summarize the performance of your training. It can give insight into how the loss evolves with the iterations or the learning rate. It is a great tool for comparing results obtained with different hyperparameters.

Here is an example of the JSON file and a <code>.pth</code> file:

In [4]:
checkpoint_manager.checkpoints_meta_path

'./checkpoints/network/network-checkpoints.json'

In [5]:
checkpoint_manager.save_checkpoint(itr=0, itr_metrics={}, state_dict=network.state_dict())

'./checkpoints/network\\network-itr0.pth'

In [6]:
network.get_layer().forward_weights.data.zero_()

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')

Once your parameters are saved, you can load them back in order to "play" with your data. In the previous cell, the weights were set to zero, so a successful load should restore the saved values. Simply call <code>load_checkpoint</code> and give it a <code>load_checkpoint_mode</code>, which determines which of your <code>.pth</code> files will be loaded. For instance, you might want to load the last checkpoint (<code>nt.LoadCheckpointMode.LAST_ITR</code>) or the best one (<code>nt.LoadCheckpointMode.BEST_ITR</code>). Here is an example:

In [7]:
network.load_checkpoint(checkpoints_meta_path=checkpoint_manager.checkpoints_meta_path, load_checkpoint_mode=nt.LoadCheckpointMode.LAST_ITR)

{'itr': 0,
 'model_state_dict': OrderedDict([('output_layers.output_0.bias_weights',
               tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')),
              ('output_layers.output_0._forward_weights',
               tensor([[-7.3540e-01,  4.3947e-01,  4.3454e-04, -1.5365e-01,  2.7009e-01,
                         1.1282e-01,  4.2071e-02, -2.2711e-01,  3.6099e-01,  1.5909e-01],
                       [-4.6935e-02, -1.7355e-01,  2.8171e-01,  3.0744e-01, -3.5100e-01,
                         1.7603e-02,  6.6100e-02,  1.3925e-01, -7.1727e-02, -9.7251e-03],
                       [-7.4916e-02, -3.1412e-02,  1.8288e-01,  4.4510e-01,  9.3525e-02,
                         2.9784e-02,  1.2521e-01, -7.2491e-02, -4.5314e-01,  3.9042e-01],
                       [-1.9225e-01,  1.7739e-01,  2.8031e-01,  2.1685e-01,  1.9586e-01,
                        -1.3653e-01,  7.9019e-02,  1.3299e-01,  3.2456e-01, -4.5222e-01],
                       [-8.3174e-02, -3.9672e-01,  2.9031e

In [8]:
network.get_layer().forward_weights

Parameter containing:
tensor([[-7.3540e-01,  4.3947e-01,  4.3454e-04, -1.5365e-01,  2.7009e-01,
          1.1282e-01,  4.2071e-02, -2.2711e-01,  3.6099e-01,  1.5909e-01],
        [-4.6935e-02, -1.7355e-01,  2.8171e-01,  3.0744e-01, -3.5100e-01,
          1.7603e-02,  6.6100e-02,  1.3925e-01, -7.1727e-02, -9.7251e-03],
        [-7.4916e-02, -3.1412e-02,  1.8288e-01,  4.4510e-01,  9.3525e-02,
          2.9784e-02,  1.2521e-01, -7.2491e-02, -4.5314e-01,  3.9042e-01],
        [-1.9225e-01,  1.7739e-01,  2.8031e-01,  2.1685e-01,  1.9586e-01,
         -1.3653e-01,  7.9019e-02,  1.3299e-01,  3.2456e-01, -4.5222e-01],
        [-8.3174e-02, -3.9672e-01,  2.9031e-01, -1.6875e-01,  4.6306e-02,
          4.1986e-01, -1.4235e-01,  2.3946e-02,  2.3502e-01, -4.7609e-01],
        [ 1.9356e-02, -2.1739e-02,  1.7779e-01, -3.4789e-02,  4.4785e-02,
          8.8605e-02,  2.9104e-01,  1.5570e-01, -7.0589e-01,  2.3611e-01],
        [-3.1387e-01, -4.6288e-01,  3.2183e-02,  5.0033e-02,  1.3759e-02,
         -
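The difference between the two load modes can be pictured with a small stand-alone sketch. It is only an illustration of the idea, not NeuroTorch's implementation: the `LoadMode` enum is a hypothetical stand-in for <code>nt.LoadCheckpointMode</code>, and `pick_iteration` is an invented helper that assumes the saved iterations and the best iteration are already known.

```python
from enum import Enum


class LoadMode(Enum):
    """Hypothetical stand-in for nt.LoadCheckpointMode."""
    BEST_ITR = "best"
    LAST_ITR = "last"


def pick_iteration(saved_iterations, best_iteration, mode):
    """Return the iteration whose checkpoint should be loaded.

    saved_iterations: iterable of iteration numbers that were saved.
    best_iteration: the iteration that was labelled as the best one.
    """
    saved_iterations = list(saved_iterations)
    if not saved_iterations:
        raise ValueError("no checkpoint has been saved")
    if mode is LoadMode.BEST_ITR:
        return best_iteration
    # LAST_ITR: the most recent saved iteration
    return max(saved_iterations)


print(pick_iteration([0, 50, 100], best_iteration=50, mode=LoadMode.LAST_ITR))  # 100
print(pick_iteration([0, 50, 100], best_iteration=50, mode=LoadMode.BEST_ITR))  # 50
```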

### A few more words on the checkpoint manager

The checkpoint manager built into NeuroTorch lets you specify **when** the training parameters are saved. This matters because saving the parameters at every step can make training slow. Also, the first iterations are generally not the interesting ones (the last ones are usually what you want)! With the checkpoint manager, you can save at a certain frequency or save only the last iteration, for example.
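As a rough picture of what such a schedule looks like, here is a minimal sketch in plain Python. The function `should_save` and its parameter `save_every` are hypothetical names chosen for this illustration; the rule shown is a simplified assumption and not the one NeuroTorch applies internally.

```python
def should_save(itr, n_iterations, save_every=None):
    """Return True if iteration `itr` (0-based) should be checkpointed.

    Illustrative rule only:
    - the final iteration is always kept;
    - with save_every=None nothing else is kept;
    - with a positive integer, every `save_every`-th iteration is kept too.
    """
    if itr == n_iterations - 1:
        return True
    if save_every is None:
        return False
    return (itr + 1) % save_every == 0


n_iterations = 1000
periodic = [i for i in range(n_iterations) if should_save(i, n_iterations, save_every=250)]
last_only = [i for i in range(n_iterations) if should_save(i, n_iterations)]
print(periodic)   # [249, 499, 749, 999]
print(last_only)  # [999]
```

With a schedule like this, 1000 iterations produce four files instead of a thousand, and the final parameters are always kept.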

### Example from our tutorial *time_series_forecasting_wilson_cowan*

In this tutorial (which we highly recommend!), the checkpoint manager is a powerful tool during training. If you look closely at <code>main.py</code> of this tutorial, you will find the following <code>CheckpointManager</code>:

In [5]:
checkpoint_folder = "./checkpoints/network"
n_iterations = 1000
checkpoint_manager = nt.CheckpointManager(
	checkpoint_folder,
	metric="train_loss",
	minimise_metric=False,
	save_freq=-1,
	save_best_only=True,
	start_save_at=int(0.98 * n_iterations),
)
print(checkpoint_manager)

CheckpointManager<1>: (priority=100, save_state=False, load_state=False, )


Let's look at every argument to make sure we understand what is happening here.
- <code>checkpoint_folder</code> : First, we give the name of our checkpoint folder.
- <code>metric</code> : The name of the metric used to select the best checkpoint.
- <code>minimise_metric</code> : In this example, we want to maximise the metric, so it is set to False.
- <code>save_freq</code> : Here, we absolutely want to save the last iteration. By specifying $-1$, we tell the checkpoint manager to save the last iteration no matter what.
- <code>save_best_only</code> : Not only do we want to save the last iteration, we also want to save the best one! This argument is therefore **True**.
- <code>start_save_at</code> : We also want to save the iterations near the end of our training. Here, we start saving once 98% of the training is done.
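
To make the role of <code>minimise_metric</code> more concrete, the sketch below tracks the best value of a metric in plain Python. The class `BestMetricTracker` is a hypothetical helper written for this illustration and the metric values are made up; it is not how NeuroTorch implements the selection, it only shows how such a flag changes which iteration counts as the best.

```python
class BestMetricTracker:
    """Remember the best metric value seen so far (hypothetical helper)."""

    def __init__(self, minimise_metric=True):
        self.minimise_metric = minimise_metric
        self.best_value = None
        self.best_itr = None

    def update(self, itr, value):
        """Record `value` for iteration `itr`; return True if it is a new best."""
        if self.best_value is None:
            improved = True
        elif self.minimise_metric:
            improved = value < self.best_value
        else:
            improved = value > self.best_value
        if improved:
            self.best_value, self.best_itr = value, itr
        return improved


# Made-up metric values, one per iteration.
metric_values = [0.2, 0.5, 0.4, 0.7]

tracker = BestMetricTracker(minimise_metric=False)  # higher is better
for itr, value in enumerate(metric_values):
    if tracker.update(itr, value):
        print(f"new best at iteration {itr}: {value}")
```

With `minimise_metric=False` the tracker ends on iteration 3 (value 0.7); with `minimise_metric=True` the same values would select iteration 0 (value 0.2).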

**Feel free to explore the different tutorials since most of them use the checkpoint manager!**