## Hugging Face Accelerate

In this notebook, we write a Python script and a SLURM script directly from notebook cells and then submit the SLURM script.
Since we only have limited resources, we use a small dataset in combination with a small model. However, this still demonstrates how to use Hugging Face's [Accelerate](https://huggingface.co/docs/accelerate/index) library to perform distributed training on multiple GPUs across multiple nodes. In our case, we use 2 nodes with 2 NVIDIA A100 GPUs each.
The nodes on VSC5 are connected via InfiniBand.

### Python script
Let's go through the Python script, so you know what we are launching here:

In [1]:
%%writefile ./examples/accelerate_example.py
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
from accelerate import Accelerator
import torch

def main():
    # Initialize Accelerator
    accelerator = Accelerator()
    
    # Load dataset
    dataset = load_dataset("ag_news")
    
    # Load model and tokenizer
    model_name = "distilbert-base-uncased"
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
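    # Tokenize dataset; pad every example to the same fixed length so that
    # the default DataLoader collate function can stack them into one tensor
    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)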

    tokenized_dataset = dataset.map(tokenize, batched=True)
    tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
    tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
    
    # Prepare data loaders
    train_dataloader = torch.utils.data.DataLoader(tokenized_dataset["train"], batch_size=8, shuffle=True)
    eval_dataloader = torch.utils.data.DataLoader(tokenized_dataset["test"], batch_size=8)

    # Prepare optimizer and learning rate scheduler
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # Prepare everything using accelerator
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    # Training loop
    model.train()
    for epoch in range(3):
        for batch in train_dataloader:
            optimizer.zero_grad()
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()

    # Print only once (on the main process) instead of once per GPU process
    accelerator.print("Training completed!")

if __name__ == "__main__":
    main()

Overwriting ./examples/accelerate_example.py
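
To get a feel for how the work is split once `accelerator.prepare` has wrapped the data loader, the following back-of-the-envelope sketch computes the global batch size and the number of optimizer steps each process runs per epoch. It assumes 2 nodes with 2 GPUs each, the per-device batch size of 8 from the script above, and the 120,000 training examples of `ag_news`. The helper name `steps_per_epoch` is made up for illustration, and Accelerate's exact handling of a final incomplete batch may differ from this simple ceiling.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_processes):
    """Optimizer steps each process runs per epoch if the data is sharded evenly."""
    global_batch_size = per_device_batch_size * num_processes
    return math.ceil(num_examples / global_batch_size)


# Assumptions: 2 nodes x 2 GPUs, one process per GPU, batch_size=8 per process
num_processes = 2 * 2
global_batch = 8 * num_processes
steps = steps_per_epoch(120_000, 8, num_processes)

print(f"processes={num_processes}, global batch={global_batch}, steps/epoch={steps}")
# prints: processes=4, global batch=32, steps/epoch=3750
```

In other words, every optimizer step processes 32 examples across the four GPUs, even though each process only sees batches of 8.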


### SLURM Script
Now, let's go through the SLURM script:

In [2]:
%%writefile ./tooling/slurm_scripts/accelerate_example.sh
#!/bin/bash

#SBATCH --job-name=training_example
#SBATCH --account=p71550
##SBATCH --account=p70824 # training account, please uncomment for training
#SBATCH --nodes=2                    # Number of nodes
#SBATCH --ntasks-per-node=1          # Number of tasks per node
#SBATCH --cpus-per-task=256          # Number of CPU cores per task (including hyperthreading if needed)
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=admin
##SBATCH --qos=zen3_0512_a100x2 # qos for training
#SBATCH --gres=gpu:2                 # Number of GPUs per node
#SBATCH --output=../../output/%x-%j.out  # Output file
##SBATCH --reservation=

######################
### Set Environment ###
######################
module load miniconda3
eval "$(conda shell.bash hook)"
source /opt/sw/jupyterhub/envs/conda/vsc5/jupyterhub-huggingface-v2/modules  # Activate the conda environment

######################
#### Set Network #####
######################
# Get the hostname of the first node, which acts as the main (head) node
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
node_0=${nodes_array[0]}

NUM_PROCESSES=$(( SLURM_NNODES * SLURM_GPUS_ON_NODE ))

export MASTER_ADDR=$node_0
export MASTER_PORT=29500

######################
#### Prepare Launch ###
######################
# Configure Accelerate launch command

export LAUNCHER="accelerate launch \
    --config_file ../config/accelerate_default_config.yaml \
    --machine_rank \$SLURM_PROCID \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --num_processes $NUM_PROCESSES \
    --num_machines $SLURM_NNODES \
    "
export PROGRAM="../../examples/accelerate_example.py"

START=$(date +%s.%N)
echo "START TIME: $(date)"

export SRUN_ARGS="--cpus-per-task $SLURM_CPUS_PER_TASK --jobid $SLURM_JOBID"
export OMP_NUM_THREADS=256
export CMD="$LAUNCHER $PROGRAM"

# Execute the command with srun to run on multiple nodes
srun $SRUN_ARGS ../start_train.sh "$CMD"

echo "END TIME: $(date)"
END=$(date +%s.%N)
RUNTIME=$(echo "$END - $START" | bc -l)
echo "Runtime: $RUNTIME"

Overwriting ./tooling/slurm_scripts/accelerate_example.sh
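
The "Set Network" and "Prepare Launch" parts of the script can be easier to follow when written out as a small function. The sketch below mirrors that logic in Python: the first hostname becomes the main process address, and the total number of processes is the number of nodes times the GPUs per node. The function name `build_launch_args` and the hostnames `node-a` and `node-b` are hypothetical; only the `accelerate launch` flags and SLURM variable names are taken from the script above.

```python
def build_launch_args(env, hostnames, machine_rank, port=29500):
    """Assemble the accelerate launch arguments for one node of the job."""
    num_machines = int(env["SLURM_NNODES"])
    gpus_per_node = int(env["SLURM_GPUS_ON_NODE"])
    return [
        "accelerate", "launch",
        "--config_file", "../config/accelerate_default_config.yaml",
        "--machine_rank", str(machine_rank),       # differs per node
        "--main_process_ip", hostnames[0],         # first node is the main node
        "--main_process_port", str(port),
        "--num_processes", str(num_machines * gpus_per_node),  # total over all nodes
        "--num_machines", str(num_machines),
    ]


# Hypothetical values standing in for what SLURM would provide inside the job
example_env = {"SLURM_NNODES": "2", "SLURM_GPUS_ON_NODE": "2"}
example_hosts = ["node-a", "node-b"]

for rank in range(len(example_hosts)):
    print(" ".join(build_launch_args(example_env, example_hosts, rank)))
```

Every node runs the same command; only `--machine_rank` changes, which is why the SLURM script escapes `\$SLURM_PROCID` so that it is expanded per task rather than once at submission.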


### Unload Env Variables
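Before we can submit this SLURM script from within our Jupyter notebook, we need to source a script that unloads all JupyterHub-related environment variables, functions, etc. Each `!` command in a notebook runs in its own subshell, so the script has to be sourced in the same command that calls `sbatch`. You do not need this step when submitting from outside JupyterHub!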

`!source ./tooling/unload_jupyter_env.sh`
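
The contents of `unload_jupyter_env.sh` are not shown in this notebook, so the following is only an illustration of the general idea: start from the current environment and drop the variables that belong to the notebook server before handing the environment to a child process. The function name `strip_prefixed_vars` and the prefixes `JUPYTER` and `JPY_` are assumptions chosen for the example; the real script may remove a different set of variables.

```python
def strip_prefixed_vars(env, prefixes=("JUPYTER", "JPY_")):
    """Return a copy of env without variables whose names start with a given prefix."""
    return {
        name: value
        for name, value in env.items()
        if not name.startswith(tuple(prefixes))
    }


# Hypothetical environment for demonstration purposes
example_env = {
    "HOME": "/home/user",
    "JUPYTERHUB_USER": "user",
    "JPY_API_TOKEN": "secret",
    "PATH": "/usr/bin",
}

clean_env = strip_prefixed_vars(example_env)
print(sorted(clean_env))
# prints: ['HOME', 'PATH']
```

A dictionary cleaned this way could be passed as the `env` argument of `subprocess.run` when submitting the job from Python instead of from a `!` cell.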

### Submit Job

In [None]:
!source ./tooling/unload_jupyter_env.sh && sbatch ./tooling/slurm_scripts/accelerate_example.sh

In [None]:
!squeue -u $USER
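
Once the job is submitted, `sbatch` reports the job ID in a line of the form `Submitted batch job <id>`. Combined with the `--output=../../output/%x-%j.out` setting from the SLURM script, where `%x` is the job name and `%j` the job ID, this is enough to work out which log file to look at. The sketch below shows one way to do that; the function names and the job ID `123456` are made up for illustration.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from the text printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"No job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def log_file_name(job_name, job_id):
    """Expand the %x-%j.out pattern used in the SLURM script."""
    return f"{job_name}-{job_id}.out"


# Hypothetical sbatch output; the job name comes from --job-name in the script
job_id = parse_job_id("Submitted batch job 123456")
print(log_file_name("training_example", job_id))
# prints: training_example-123456.out
```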