# Training demo

This notebook illustrates two ways of training the network:
- in a single Jupyter notebook, using the resources of a single node
- via a Slurm command, using GPUs from multiple nodes

Examples with OG, DA and GA are included. The code also supports in-training canonicalization, but this gets computationally expensive very quickly.


Problem-specific inputs:
- a config file in the `config` folder, which specifies the physical system and symmetries to use
- a config file in the `optim_config` folder, which specifies the training setup, e.g. the batch size and the symmetrisation strategy (OG, DA, GA or canon)

The config object includes all specifications of the physical system and training setup. As detailed in `mpatch_load_cfg` in `utils/loader.py`, the training config is initialized by `utils/base_config.py`, before being processed by the user-specified `config` and `optim_config`.
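The layering described above can be pictured with a minimal sketch. It assumes that "processed by" means the later files override values set earlier; plain dictionaries stand in for the repository's actual config object, and `merge_config` and all field names are hypothetical, not taken from `utils/loader.py`.

```python
from copy import deepcopy


def merge_config(base, override):
    """Return a copy of `base` with `override` applied recursively (hypothetical helper)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Otherwise the later value replaces the earlier one.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for utils/base_config.py, config/*.py and optim_config/*.py.
base_cfg = {"system": {"name": None}, "optim": {"batch_size": 100, "symmetrisation": "OG"}}
system_cfg = {"system": {"name": "graphene"}}
optim_cfg = {"optim": {"batch_size": 90, "symmetrisation": "DA"}}

# Base defaults first, then the physical system, then the training setup.
cfg = merge_config(merge_config(base_cfg, system_cfg), optim_cfg)
print(cfg)
```

Applying the files in a fixed order keeps the defaults in one place, so each problem-specific file only needs to list the values it changes.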

# Single node

The `resume` option restarts training from the last saved checkpoint. When using `resume`, do not supply `dist`, `base_cfg_str`, `cfg_str` or `optim_cfg_str`, as these parameters are inherited from the last checkpoint.
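The rule above can be written as a small guard. This is only an illustrative sketch: `check_resume_args` is a hypothetical helper, and whether or how `train.train` performs such a check is not shown in this notebook.

```python
def check_resume_args(resume=False, dist=None, base_cfg_str=None,
                      cfg_str=None, optim_cfg_str=None):
    """Reject arguments that a resumed run inherits from its checkpoint (hypothetical)."""
    inherited = {
        "dist": dist,
        "base_cfg_str": base_cfg_str,
        "cfg_str": cfg_str,
        "optim_cfg_str": optim_cfg_str,
    }
    # None is used here to mean "not supplied by the caller".
    supplied = [name for name, value in inherited.items() if value is not None]
    if resume and supplied:
        raise ValueError(
            "resume=True inherits these from the checkpoint, do not pass: "
            + ", ".join(supplied)
        )
    return supplied


# A resumed run passes none of the inherited arguments.
print(check_resume_args(resume=True))
# A fresh run supplies them explicitly.
print(check_resume_args(dist=False, cfg_str="graphene_1.py", optim_cfg_str="OG_batch1000.py"))
```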

In [None]:
import train

# Resume the OG run from its last saved checkpoint.
train.train(
    log_dir='_log_graphene_OG_test/',
    # Path to the CUDA libraries. Keep this path when running inside the
    # Singularity container; otherwise change it to the corresponding path.
    libcu_lib_path='/opt/conda/envs/deepsolid/lib/',
    # The options below are left out because a resumed run inherits them
    # from the checkpoint.
    # dist=False,
    # cfg_str='graphene_1.py',
    # optim_cfg_str='OG_batch1000.py',
    resume=True,
    x64=True,
)

In [None]:
import train

train.train(
    log_dir='_log_graphene_DA_test/',
    # Path to the CUDA libraries. Keep this path when running inside the
    # Singularity container; otherwise change it to the corresponding path.
    libcu_lib_path='/opt/conda/envs/deepsolid/lib/',
    # Whether training is distributed over multiple nodes; False for a
    # single-node setup.
    dist=False,
    cfg_str='graphene_1.py',
    optim_cfg_str='DA12_batch90.py',
    # resume=True,
    x64=True,
)

In [None]:
import train

train.train(
    log_dir='_log_graphene_GA_test/',
    # Path to the CUDA libraries. Keep this path when running inside the
    # Singularity container; otherwise change it to the corresponding path.
    libcu_lib_path='/opt/conda/envs/deepsolid/lib/',
    # Whether training is distributed over multiple nodes; False for a
    # single-node setup.
    dist=False,
    cfg_str='graphene_1.py',
    optim_cfg_str='GA12_batch90.py',
    # resume=True,
    x64=True,
)

# Multiple nodes using singularity container

Some comments on the commands below:
- <span style='color:red'>**IMPORTANT**</span>: Run `export SCRATCH=/YOUR/SCRATCH/FOLDER` first to specify the folder containing your Singularity image.
- `SINGULARITY_CMD` holds the `singularity exec` prefix that runs the training command inside the container.
- Use `./slurm_dist.sh --help` to see the Slurm options.
- Use `python train.py --help` to see the training script options.
- Certain flags need to be set according to your Slurm setup, e.g. `-A`, `--partition` and `--mail-user`.
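The three submission commands in this section differ only in the port, the log folder, the job name and the `optim_cfg` file. As a rough sketch, the `./slurm_dist.sh` part can be generated from those four values; `build_submit_command` is a hypothetical helper, not part of the repository, and the flags are copied from the commands in this notebook rather than from the script's documentation.

```python
def build_submit_command(tag, optim_cfg, port, cfg="graphene_1.py", num_nodes=2):
    """Assemble the ./slurm_dist.sh call used in this notebook (hypothetical helper)."""
    # Command executed inside the container by $SINGULARITY_CMD.
    py_cmd = (
        "source /opt/conda/bin/activate deepsolid && "
        f"python train.py --dist --x64 --cfg={cfg} --optim_cfg={optim_cfg} "
        "--libcu_lib_path=/opt/conda/envs/deepsolid/lib/"
    )
    return (
        f"./slurm_dist.sh --mem=10G --num-nodes={num_nodes} --port={port} --timeout=1000 "
        '-A YOUR_ACCOUNT --partition="YOUR_PARTITION" --gres="gpu:1" '
        '--extra="-t 2-00:00:00 --mail-type=END,FAIL --mail-user=YOUR_EMAIL" '
        f"--log='_log_graphene_{tag}_test_multi' --name=\"{tag}graphene\" "
        f"--py-cmd=\"$SINGULARITY_CMD '{py_cmd}'\""
    )


# (tag, optim_cfg file, port) for the three runs; each run uses its own port.
runs = [
    ("OG", "OG_batch1000.py", 8001),
    ("DA", "DA12_batch90.py", 8002),
    ("GA", "GA12_batch90.py", 8003),
]
commands = [build_submit_command(*run) for run in runs]
for command in commands:
    print(command)
```

The printed strings still rely on `SINGULARITY_CMD` being exported in the shell that runs them, as in the cells that follow.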

In [None]:
export SINGULARITY_CMD="singularity exec --no-home --nv --bind .:/home/invariant-schrodinger --pwd /home/invariant-schrodinger $SCRATCH/inv-ds.sif /bin/bash -c " && ./slurm_dist.sh --mem=10G --num-nodes=2 --port=8001 --timeout=1000 -A YOUR_ACCOUNT --partition="YOUR_PARTITION" --gres="gpu:1" --extra="-t 2-00:00:00 --mail-type=END,FAIL --mail-user=YOUR_EMAIL" --log='_log_graphene_OG_test_multi' --name="OGgraphene" --py-cmd="$SINGULARITY_CMD 'source /opt/conda/bin/activate deepsolid && python train.py --dist --x64 --cfg=graphene_1.py --optim_cfg=OG_batch1000.py --libcu_lib_path=/opt/conda/envs/deepsolid/lib/'"

In [None]:
export SINGULARITY_CMD="singularity exec --no-home --nv --bind .:/home/invariant-schrodinger --pwd /home/invariant-schrodinger $SCRATCH/inv-ds.sif /bin/bash -c " && ./slurm_dist.sh --mem=10G --num-nodes=2 --port=8002 --timeout=1000 -A YOUR_ACCOUNT --partition="YOUR_PARTITION" --gres="gpu:1" --extra="-t 2-00:00:00 --mail-type=END,FAIL --mail-user=YOUR_EMAIL" --log='_log_graphene_DA_test_multi' --name="DAgraphene" --py-cmd="$SINGULARITY_CMD 'source /opt/conda/bin/activate deepsolid && python train.py --dist --x64 --cfg=graphene_1.py --optim_cfg=DA12_batch90.py --libcu_lib_path=/opt/conda/envs/deepsolid/lib/'"

In [None]:
export SINGULARITY_CMD="singularity exec --no-home --nv --bind .:/home/invariant-schrodinger --pwd /home/invariant-schrodinger $SCRATCH/inv-ds.sif /bin/bash -c " && ./slurm_dist.sh --mem=10G --num-nodes=2 --port=8003 --timeout=1000 -A YOUR_ACCOUNT --partition="YOUR_PARTITION" --gres="gpu:1" --extra="-t 2-00:00:00 --mail-type=END,FAIL --mail-user=YOUR_EMAIL" --log='_log_graphene_GA_test_multi' --name="GAgraphene" --py-cmd="$SINGULARITY_CMD 'source /opt/conda/bin/activate deepsolid && python train.py --dist --x64 --cfg=graphene_1.py --optim_cfg=GA12_batch90.py --libcu_lib_path=/opt/conda/envs/deepsolid/lib/'"