# Distributed Training with DeepSpeed

DeepSpeed is an open-source deep learning optimization library developed by Microsoft, designed to make training large models more efficient and scalable. It offers the Zero Redundancy Optimizer (ZeRO) for memory-efficient training, mixed precision training, and integration with PyTorch and Hugging Face. DeepSpeed is most useful when training models with billions of parameters, where it can shorten training time and reduce per-device memory consumption.

## ZeRO Optimization:

DeepSpeed’s ZeRO (Zero Redundancy Optimizer) strategy removes memory redundancy by partitioning key model states (parameters, gradients, and optimizer states) across GPUs. Here’s a step‐by‐step breakdown of its three primary stages:

![ZeRO](https://production-media.paperswithcode.com/methods/Screen_Shot_2021-07-26_at_3.17.43_PM_3oyU7Qb.png)

1. Stage 1 – Optimizer State Partitioning:
In this stage, the optimizer’s internal states (for example, moment estimates in Adam) are split among the GPUs. Instead of every GPU storing a complete copy of these states, each holds only a fraction. This immediately cuts down memory usage without affecting the forward or backward computations.


2. Stage 2 – Gradient Partitioning:
Building on Stage 1, ZeRO Stage 2 partitions the gradients as well as the optimizer states. During the backward pass each GPU still computes gradients for its local batch, but a reduce-scatter operation (rather than a full all-reduce) sums them across GPUs so that each GPU keeps only the gradient shard that matches its optimizer-state partition. The remaining gradients are freed, which keeps the memory footprint much lower than storing full gradients on every device.


3. Stage 3 – Parameter Partitioning:
The most aggressive stage, Stage 3, partitions the model’s parameters themselves. Now, each GPU holds only a slice of the entire model’s weights, along with its corresponding gradients and optimizer states. When a forward pass is performed, the necessary parameters are gathered on the fly. This stage enables training extremely large models that would not fit in the memory of a single GPU, though it may introduce additional communication overhead.

In practice, during training the forward pass uses the available partitions (or gathers full parameters as needed in Stage 3), the backward pass computes local gradients that are later synchronized, and the optimizer updates are applied based only on the locally stored states. This stepwise, partitioned approach is critical when working with models of billions (or even trillions) of parameters, as it enables efficient distributed training by dramatically reducing memory consumption on each device.
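
The memory effect of each stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision training with Adam, where each parameter costs about 2 bytes for the fp16 weight, 2 bytes for the fp16 gradient, and 12 bytes of optimizer state (fp32 master weight, momentum, and variance). The function name `zero_memory_gb` and the example sizes (7.5B parameters, 64 GPUs) are illustrative choices, and activations, buffers, and fragmentation are not counted.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Rough per-GPU memory (GB) for model states under a given ZeRO stage."""
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:            # Stage 1: shard optimizer states
        optim /= num_gpus
    if stage >= 2:            # Stage 2: also shard gradients
        grads /= num_gpus
    if stage >= 3:            # Stage 3: also shard parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB")
# stage 0: 120.0 GB
# stage 1: 31.4 GB
# stage 2: 16.6 GB
# stage 3: 1.9 GB
```

Stage 0 here means plain data parallelism, where every GPU holds a full copy of all three model states.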

## Getting Started with DeepSpeed and transformers
The transformers library supports DeepSpeed natively in its `Trainer` classes through the accelerate library, which wraps DeepSpeed alongside other parallelism strategies such as data parallelism (see the library documentation for details).
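
As a rough illustration of that integration, the sketch below writes a small DeepSpeed config file that a `Trainer` can consume. The helper name `build_ds_config` and the file name `ds_config.json` are made up for this example, and this is not necessarily how the `ift.py` script used later is set up. The `"auto"` values are placeholders that the transformers integration fills in from its own training arguments.

```python
import json


def build_ds_config(zero_stage=2):
    """Return a minimal DeepSpeed config dict for use with the HF Trainer."""
    return {
        # "auto" lets the Trainer fill these in from TrainingArguments
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "fp16": {"enabled": "auto"},
        "zero_optimization": {"stage": zero_stage},
    }


with open("ds_config.json", "w") as f:
    json.dump(build_ds_config(zero_stage=2), f, indent=2)

# In a training script (requires transformers, accelerate and deepspeed):
# from transformers import TrainingArguments
# training_args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")
```

The script is then started with a distributed launcher such as `accelerate launch` or `deepspeed` so that one process is created per GPU.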

### Installing DeepSpeed

```bash
DS_BUILD_CPU_ADAM=1 pip install deepspeed==0.15.4
```

## Training with and without DeepSpeed

### Installing necessary libraries

```bash
pip install transformers datasets accelerate
```

### Without DeepSpeed
Restrict the run to a single GPU. This model (gpt2) is small enough to train on one device:

```bash
export USE_DEEPSPEED=false && export CUDA_VISIBLE_DEVICES=0 && accelerate launch ift.py --model_name gpt2
```

To check GPU usage:

```bash
nvidia-smi 
```
Example Output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   42C    P0   282W / 300W |   9591MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   30C    P0    45W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0    39W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     25677      C   ...da3/envs/madar/bin/python     9588MiB |
+-----------------------------------------------------------------------------+
```

### With DeepSpeed

Check how many GPUs the machine has:

```bash
nvidia-smi -L | wc -l
```

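Create the DeepSpeed configuration. `accelerate config` asks a series of questions (choose DeepSpeed as the distributed type and pick a ZeRO stage) and saves the answers as a YAML file. The launch command below reads `./default_config.yaml`, so write the file there:

```bash
accelerate config --config_file ./default_config.yaml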
```

```bash
export USE_DEEPSPEED=true && export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 && accelerate launch --config_file ./default_config.yaml ift.py --model_name gpt2 
```

To check GPU usage:

```bash
watch -n0.1 nvidia-smi
```
Example Output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   35C    P0    70W / 300W |   6731MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   37C    P0    72W / 300W |   6691MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   34C    P0    72W / 300W |   6595MiB / 32768MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    66W / 300W |   6507MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   36C    P0    69W / 300W |   6651MiB / 32768MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0    71W / 300W |   6779MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   35C    P0    69W / 300W |   6587MiB / 32768MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   36C    P0    70W / 300W |   6635MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     34717      C   ...da3/envs/madar/bin/python     6728MiB |
|    1   N/A  N/A     34718      C   ...da3/envs/madar/bin/python     6688MiB |
|    2   N/A  N/A     34719      C   ...da3/envs/madar/bin/python     6592MiB |
|    3   N/A  N/A     34720      C   ...da3/envs/madar/bin/python     6504MiB |
|    4   N/A  N/A     34721      C   ...da3/envs/madar/bin/python     6648MiB |
|    5   N/A  N/A     34722      C   ...da3/envs/madar/bin/python     6776MiB |
|    6   N/A  N/A     34723      C   ...da3/envs/madar/bin/python     6584MiB |
|    7   N/A  N/A     34724      C   ...da3/envs/madar/bin/python     6632MiB |
+-----------------------------------------------------------------------------+
```
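
To compare the two runs numerically rather than by eye, the memory column can be queried in CSV form and summed. The sketch below is one way to do that. The function names are hypothetical, while `--query-gpu` and `--format=csv,noheader,nounits` are standard `nvidia-smi` options. The sample strings reuse the per-GPU values from the two example outputs above.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse lines of 'index, memory.used' into {gpu_index: MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def query_memory_used():
    """Ask nvidia-smi for the current per-GPU memory usage (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)


# Values copied from the example outputs (GPU 0 only for the single-GPU run)
single = parse_memory_used("0, 9591")
multi = parse_memory_used(
    "0, 6731\n1, 6691\n2, 6595\n3, 6507\n4, 6651\n5, 6779\n6, 6587\n7, 6635"
)
print("single GPU total (MiB):", sum(single.values()))   # 9591
print("8 GPU total (MiB):", sum(multi.values()))          # 53176
print("8 GPU average (MiB):", sum(multi.values()) / len(multi))  # 6647.0
```

On a machine with GPUs, calling `query_memory_used()` during training returns the live values in the same format.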

## Questions:

1. Does it make sense to use DeepSpeed here?
2. Which setting trains faster? Why?
3. If you have more than one device, try a bigger model. First run it without DeepSpeed, choosing bigger and bigger models until you run out of memory, then try again with DeepSpeed.

## Example with PyTorch

The transformers integration sets up DeepSpeed automatically. What if we have a training loop written from scratch?

### Training Loop (PyTorch):

Save the following script as `sft_train_manual.py`:

```python
import argparse
import os

import deepspeed
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from transformers import AutoModelForCausalLM, AutoTokenizer


def parse_args():
    parser = argparse.ArgumentParser(description="Manual SFT Training with optional DeepSpeed")
    parser.add_argument("--model_name_or_path", type=str, required=True,
                        help="Hugging Face model name or path (e.g. 'distilgpt2')")
    parser.add_argument("--output_dir", type=str, default="./sft_manual")
    parser.add_argument("--use_deepspeed", action="store_true", help="Enable DeepSpeed")
    parser.add_argument("--max_length", type=int, default=512)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=8,
                        help="Batch size when running without DeepSpeed")
    parser.add_argument("--micro_batch", type=int, default=1,
                        help="Per-GPU micro-batch size when running with DeepSpeed")
    # The deepspeed launcher passes --local_rank to every process it starts
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)
    return parser.parse_args()


def preprocess_function(example, tokenizer, max_length=512):
    # Assume each example has 'instruction', optional 'input', and 'output'
    instruction = example.get("instruction", "")
    user_input = example.get("input", "")
    output = example.get("output", "")

    # Build a prompt string
    if user_input:
        prompt = f"Instruction: {instruction}\nInput: {user_input}\nResponse:"
    else:
        prompt = f"Instruction: {instruction}\nResponse:"
    # Append EOS so the model can learn where a response ends
    full_text = prompt + " " + output + tokenizer.eos_token

    # Tokenize the full text to a fixed length with padding
    tokenized = tokenizer(full_text, truncation=True, max_length=max_length, padding="max_length")
    # Tokenize the prompt alone (no padding) to find how many tokens it spans
    prompt_ids = tokenizer(prompt, truncation=True, max_length=max_length,
                           add_special_tokens=False)["input_ids"]
    prompt_len = len(prompt_ids)

    # Labels start as a copy of input_ids; the model shifts them internally
    # for next-token prediction. Positions set to -100 are ignored by the loss.
    labels = list(tokenized["input_ids"])
    for i, mask in enumerate(tokenized["attention_mask"]):
        # Mask the prompt, and mask padding (pad == EOS here, so the
        # attention mask is the only way to tell padding apart)
        if i < prompt_len or mask == 0:
            labels[i] = -100
    tokenized["labels"] = labels
    return tokenized


def main():
    args = parse_args()

    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path)
    # GPT-2 style tokenizers have no pad token, so reuse EOS for padding
    tokenizer.pad_token = tokenizer.eos_token

    # Load a sample instruction dataset (here, a cleaned Alpaca dataset)
    dataset = load_dataset("yahma/alpaca-cleaned", split="train")
    # Select 10k examples if dataset is larger
    if len(dataset) > 10000:
        dataset = dataset.shuffle(seed=42).select(range(10000))

    # Preprocess each example (build full text and mask prompt tokens)
    dataset = dataset.map(lambda x: preprocess_function(x, tokenizer, max_length=args.max_length))
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    if args.use_deepspeed:
        optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
        # A simple DeepSpeed config: fp16 training with ZeRO Stage 3.
        # The global batch size is micro_batch * accumulation steps * number of GPUs.
        ds_config = {
            "train_micro_batch_size_per_gpu": args.micro_batch,
            "gradient_accumulation_steps": 1,
            "fp16": {"enabled": True},
            "zero_optimization": {"stage": 3},
            "zero_allow_untested_optimizer": True,
        }
        model_engine, optimizer, _, _ = deepspeed.initialize(
            args=args,
            model=model,
            optimizer=optimizer,
            config=ds_config,
            model_parameters=model.parameters(),
        )
        # Each process drives the GPU that matches its local rank
        device = torch.device("cuda", model_engine.local_rank)
        # Give every process a different shard of the dataset
        sampler = DistributedSampler(dataset, shuffle=True)
        dataloader = DataLoader(dataset, batch_size=args.micro_batch, sampler=sampler)
        print("Training with DeepSpeed enabled.")
    else:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model_engine = model.to(device)
        optimizer = torch.optim.AdamW(model_engine.parameters(), lr=5e-5)
        sampler = None
        dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True)
        print("Training without DeepSpeed.")

    model_engine.train()
    for epoch in range(args.epochs):
        if sampler is not None:
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
        total_loss = 0.0
        for step, batch in enumerate(dataloader):
            # Move batch tensors to device
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model_engine(**batch)
            loss = outputs.loss
            if args.use_deepspeed:
                # The engine handles loss scaling, gradient reduction,
                # the optimizer step, and zeroing the gradients
                model_engine.backward(loss)
                model_engine.step()
            else:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item()
            if step % 50 == 0:
                print(f"Epoch {epoch+1} Step {step} Loss {loss.item():.4f}")
        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1} Average Loss: {avg_loss:.4f}")

    os.makedirs(args.output_dir, exist_ok=True)
    if args.use_deepspeed:
        # Every rank must call save_checkpoint: each one writes its own shard
        model_engine.save_checkpoint(args.output_dir)
    else:
        model.save_pretrained(args.output_dir)
    print("Training complete. Model saved to", args.output_dir)


if __name__ == "__main__":
    main()
```

Run it on a single GPU with plain Python, or across 8 GPUs with the `deepspeed` launcher:

```bash
python sft_train_manual.py --model_name_or_path gpt2
deepspeed --num_gpus=8 sft_train_manual.py --model_name_or_path gpt2 --use_deepspeed
```
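
The label masking in `preprocess_function` is easy to get wrong, so it can help to check the idea on a toy example. The sketch below uses a made-up helper, `mask_labels`, and made-up token IDs. It masks the prompt positions and, as an extra precaution that matters when the pad token is the EOS token, the padding positions too. Labels equal to -100 are skipped by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value the loss ignores


def mask_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids into labels, hiding prompt and padding positions."""
    labels = list(input_ids)
    for i, mask in enumerate(attention_mask):
        if i < prompt_len or mask == 0:
            labels[i] = IGNORE_INDEX
    return labels


# Three prompt tokens, two response tokens, two padding tokens
input_ids = [11, 12, 13, 21, 22, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]
print(mask_labels(input_ids, attention_mask, prompt_len=3))
# [-100, -100, -100, 21, 22, -100, -100]
```

Only the two response tokens contribute to the loss in this example.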

## Questions:
1. Can we train on more than one node?
2. What is the difference between micro_batch_size and batch_size? Does it matter?
3. If I want to do something fancy, do I need to implement everything from scratch?