# Tutorial: Using Hugging Face `accelerate` with `nbdistributed`
> Showcasing how to use `nbdistributed` to build a more interactive distributed-training tutorial in Jupyter

This notebook is based on the [official `accelerate` notebook](https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb), modified for this tutorial to use `nbdistributed`.

First we enable the plugin:

In [1]:
%load_ext nbdistributed

Then we define the topology: two worker processes, one on each of GPUs 3 and 4:

In [2]:
%dist_init --num-processes 2 --gpu-ids 3,4

Using GPU IDs: [3, 4]
Starting 2 distributed workers...
✓ Successfully started 2 workers
  Rank 0 -> GPU 3
  Rank 1 -> GPU 4
Available commands:
  %%distributed - Execute code on all ranks (explicit)
  %%rank [0,n] - Execute code on specific ranks
  %sync - Synchronize all ranks
  %dist_status - Show worker status
  %dist_mode - Toggle automatic distributed mode
  %dist_shutdown - Shutdown workers

🚀 Distributed mode active: All cells will now execute on workers automatically!
   Magic commands (%, %%) will still execute locally as normal.

🐍 Below are auto-imported and special variables auto-generated into the namespace to use
  `torch`
  `dist`: `torch.distributed` import alias
  `rank` (`int`): The local rank
  `world_size` (`int`): The global world size
  `gpu_id` (`int`): The specific GPU ID assigned to this worker
  `device` (`torch.device`): The current PyTorch device object (e.g. `cuda:1`)


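The rank-to-GPU assignment printed above can be pictured as a simple positional mapping. The sketch below is illustrative only: `assign_gpus` is a hypothetical helper, not part of `nbdistributed`, and it assumes ranks are paired with the `--gpu-ids` list in order, which is consistent with the output shown.

```python
from typing import Dict, List


def assign_gpus(num_processes: int, gpu_ids: List[int]) -> Dict[int, int]:
    """Pair each rank with one GPU ID, in order (hypothetical helper)."""
    if len(gpu_ids) < num_processes:
        raise ValueError("need at least one GPU ID per process")
    return {rank: gpu_ids[rank] for rank in range(num_processes)}


# Mirrors `%dist_init --num-processes 2 --gpu-ids 3,4`
mapping = assign_gpus(num_processes=2, gpu_ids=[3, 4])
for rank, gpu in mapping.items():
    print(f"Rank {rank} -> GPU {gpu}")
```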

## Imports and model

Next let's bring in the imports we will use:

In [3]:
import torch
from torch.utils.data import DataLoader

from accelerate import Accelerator, DistributedType
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    set_seed,
)
from evaluate import load as load_metric
from torch.optim import AdamW


from tqdm.auto import tqdm

import datasets
import transformers


In [4]:
set_seed(42)


As in the original tutorial, we'll train a smol model:

In [5]:
model_checkpoint = "HuggingFaceTB/SmolLM2-135M"


## Load the data

We'll use `load_dataset` from Hugging Face `datasets` to download and cache the dataset:

In [6]:
raw_datasets = load_dataset("glue", "mrpc")


In [7]:
%%rank [0]
raw_datasets


🔹 Rank 0:
  DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})



In [9]:
%%rank [0]
raw_datasets["train"][0]


🔹 Rank 0:
  {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}



Now we can preprocess the data:

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token


In [16]:
%%rank [0]
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")


🔹 Rank 0:
  {'input_ids': [19556, 28, 451, 582, 6330, 17, 3528, 451, 6330, 3935, 351, 357, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}



In [17]:
def tokenize_function(examples):
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length", max_length=128)
    return outputs

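To make `truncation=True, padding="max_length"` concrete, here is a minimal pure-Python sketch of what those options do to a single sequence of token IDs. `pad_or_truncate` is a hypothetical helper written for illustration, not the tokenizer's actual implementation; it ignores how sentence pairs are truncated, and the default `pad_id=0` is an assumption chosen to match the zeros in the batches printed later in this notebook.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then right-pad it up to max_length."""
    kept = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # number of padding positions still needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }


short = pad_or_truncate([19556, 28, 451], max_length=5)
long = pad_or_truncate(list(range(10)), max_length=5)
print(short)
print(long)
```

Because every example ends up with the same length, the default `DataLoader` collation can stack them into one tensor per column.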

In [18]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["idx", "sentence1", "sentence2"])


The `map` call above already dropped the columns we don't need (`idx`, `sentence1`, `sentence2`). Lastly, we rename `label` to `labels`, the name the model expects:

In [19]:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")


In [20]:
%%rank [0]
tokenized_datasets["train"].features


🔹 Rank 0:
  {'labels': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}


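The rename matters because the training loop later calls `model(**batch)`, which passes every column as a keyword argument. The toy function below is a hypothetical stand-in for a forward pass (not the real model signature); it only shows how keyword unpacking reacts to a column name the function does not declare.

```python
def toy_forward(input_ids, attention_mask, labels=None):
    """Hypothetical stand-in for a forward pass that accepts `labels`."""
    return {"has_loss": labels is not None}


good_batch = {"input_ids": [[1, 2]], "attention_mask": [[1, 1]], "labels": [1]}
bad_batch = {"input_ids": [[1, 2]], "attention_mask": [[1, 1]], "label": [1]}

print(toy_forward(**good_batch))

try:
    toy_forward(**bad_batch)  # `label` is not a parameter of toy_forward
    rename_needed = False
except TypeError as err:
    rename_needed = True
    print("TypeError:", err)
```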

In [21]:
tokenized_datasets.set_format("torch")


## Getting the training chunks ready

Since we're already inside a distributed process, we can declare the model and create the dataloaders directly:

> With `accelerate` alone, we would need to wrap these steps in separate functions, since they initialize CUDA.

In [22]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)


In [23]:
model.config.pad_token_id = tokenizer.pad_token_id


In [24]:
optimizer = AdamW(params=model.parameters(), lr=2e-5)


In [25]:
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=16
)


In [26]:
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], shuffle=False, batch_size=32
)


In [27]:
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=len(train_dataloader) * 3
)

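For intuition, the schedule configured above ramps the learning rate linearly from 0 to its peak over the warmup steps, then decays it linearly back to 0. The sketch below reimplements that multiplier in plain Python as an illustration; `linear_warmup_factor` is a hypothetical name, not the `transformers` API, and the step counts assume the 3,668-row training split with `batch_size=16` on a dataloader that has not been sharded.

```python
import math


def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch = math.ceil(3668 / 16)  # batches per epoch
total_steps = steps_per_epoch * 3       # same `* 3` as in the cell above
base_lr = 2e-5

for step in (0, 50, 100, 395, total_steps):
    lr = base_lr * linear_warmup_factor(step, 100, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```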

Now we can (safely) examine a batch of data:

In [28]:
for batch in train_dataloader:
    print({k:v.shape for k,v in batch.items()})
    outputs = model(**batch)
    break


🔹 Rank 0:
  {'labels': torch.Size([16]), 'input_ids': torch.Size([16, 128]), 'attention_mask': torch.Size([16, 128])}

🔹 Rank 1:
  {'labels': torch.Size([16]), 'input_ids': torch.Size([16, 128]), 'attention_mask': torch.Size([16, 128])}



In [29]:
metric = load_metric("glue", "mrpc")


In [30]:
predictions = outputs.logits.detach().argmax(dim=-1)
metric.compute(predictions=predictions, references=batch["labels"])


🔹 Rank 0:
  {'accuracy': 0.25, 'f1': 0.0}

🔹 Rank 1:
  {'accuracy': 0.25, 'f1': 0.0}


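The GLUE MRPC metric reports accuracy together with F1 on the positive (`equivalent`) class. The plain-Python sketch below shows how those two numbers are computed; `accuracy_and_f1` is a hypothetical helper, and the example labels are made up to illustrate one way an untrained classification head could produce the `0.25` / `0.0` pair seen above (the actual predictions are not shown in the output).

```python
from typing import Dict, Sequence


def accuracy_and_f1(predictions: Sequence[int], references: Sequence[int]) -> Dict[str, float]:
    """Accuracy and binary F1 (positive class = 1) for two label sequences."""
    pairs = list(zip(predictions, references))
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up batch of 16: the model predicts class 0 for everything
references = [1] * 12 + [0] * 4
predictions = [0] * 16
scores = accuracy_and_f1(predictions, references)
print(scores)
```

F1 is zero here because no positive example is ever predicted, even though a quarter of the predictions are correct.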

## Setup training loop

Now we can fine-tune the model:

In [31]:
accelerator = Accelerator()
if accelerator.is_main_process:
    datasets.utils.logging.set_verbosity_warning()
    transformers.utils.logging.set_verbosity_info()
else:
    datasets.utils.logging.set_verbosity_error()
    transformers.utils.logging.set_verbosity_error()


In [32]:
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)


In [34]:
for batch in train_dataloader:
    print({k:v for k,v in batch.items()})
    break


🔹 Rank 0:
  {'labels': tensor([1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1], device='cuda:0'), 'input_ids': tensor([[19246,  6950, 11041,  ...,     0,     0,     0],
        [   59, 11052,   523,  ...,     0,     0,     0],
        [  504, 24078,   553,  ...,     0,     0,     0],
        ...,
        [  504,   426, 12373,  ...,     0,     0,     0],
        [  504,  4699,  3586,  ...,     0,     0,     0],
        [19318,  4385,  3297,  ...,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}

🔹 Rank 1:
  {'labels': tensor([1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0], device='cuda:1'), 'input_ids': tensor([[   18,  3513,  1740,  ...,     0,     0,     0],
        [   56,   507,   537,  ...,     0,     0,     0],
        [11952,   216,    

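After `accelerator.prepare`, the two ranks print different labels because each rank now iterates over its own share of the batches. A simplified way to picture this is a round-robin split of the batch sequence. The sketch below is an illustration under that assumption: `shard_batches` is a hypothetical helper, and the real sharding in `accelerate` also deals with shuffling and uneven batch counts.

```python
from typing import List


def shard_batches(num_batches: int, rank: int, world_size: int) -> List[int]:
    """Indices of the batches a given rank would see under a round-robin split."""
    return list(range(rank, num_batches, world_size))


num_batches = 230  # ceil(3668 / 16) batches in the training split
rank0 = shard_batches(num_batches, rank=0, world_size=2)
rank1 = shard_batches(num_batches, rank=1, world_size=2)

print(len(rank0), len(rank1))  # batches handled by each rank
print(rank0[:3], rank1[:3])    # first few batch indices per rank
```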

In [35]:
model.train()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()

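In a data-parallel setup like this one, each rank computes gradients on its own batch, and the gradients are then averaged across ranks so that every copy of the model takes the same optimizer step. The plain-Python sketch below illustrates only that averaging; `average_gradients` is a hypothetical helper and the numbers are made up.

```python
from typing import List


def average_gradients(per_rank_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of the gradient vectors held by each rank."""
    world_size = len(per_rank_grads)
    return [sum(values) / world_size for values in zip(*per_rank_grads)]


rank0_grads = [0.5, -1.0, 2.0]  # made-up gradients from rank 0's batch
rank1_grads = [1.5, 1.0, 0.0]   # made-up gradients from rank 1's batch
synced = average_gradients([rank0_grads, rank1_grads])
print(synced)
```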

In [36]:
model.eval()
all_predictions = []
all_labels = []

for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)

    all_predictions.append(accelerator.gather_for_metrics(predictions))
    all_labels.append(accelerator.gather_for_metrics(batch["labels"]))

all_predictions = torch.cat(all_predictions)
all_labels = torch.cat(all_labels)

eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

accelerator.print("Epoch 0: ", eval_metric)


🔹 Rank 0:
  Epoch 0:
  {'accuracy': 0.7450980392156863, 'f1': 0.8317152103559871}


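`gather_for_metrics` collects each rank's predictions so the metric is computed over the whole validation set, and it drops samples that were duplicated to keep the ranks in step. The sketch below imitates that idea in plain Python under simplifying assumptions: `gather_and_trim` is a hypothetical helper, ranks are concatenated batch by batch in rank order, and any duplicates are assumed to sit at the end of the gathered sequence.

```python
from typing import List


def gather_and_trim(per_rank_batches: List[List[List[int]]], dataset_size: int) -> List[int]:
    """Concatenate per-rank batches step by step, then drop trailing duplicates."""
    gathered: List[int] = []
    for step_batches in zip(*per_rank_batches):  # one batch per rank at each step
        for batch in step_batches:               # concatenate in rank order
            gathered.extend(batch)
    return gathered[:dataset_size]               # keep only the real samples


# Toy setup: 10 samples, batch size 2, 2 ranks.
# Rank 1 has no third batch of its own, so it reuses samples 0 and 1 as filler.
rank0 = [[0, 1], [4, 5], [8, 9]]
rank1 = [[2, 3], [6, 7], [0, 1]]
merged = gather_and_trim([rank0, rank1], dataset_size=10)
print(merged)
```

Without the trimming step, the filler samples would be counted twice and skew the metric slightly.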