## Running on multiple GPUs using Hugging Face Transformers

Naive pipeline parallelism is supported out of the box. To use it, load the model with device_map="auto", which automatically places the model's layers on the available GPUs.
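
To make "placing layers on the available GPUs" concrete, here is a toy sketch of sequential placement: fill the first GPU, then spill onto the next. This is only an illustration, not the library's actual placement logic; the function name `assign_layers`, the layer sizes, and the GPU capacities are all hypothetical.

```python
def assign_layers(layer_sizes_gib, gpu_capacities_gib):
    """Toy sequential placement: map each layer index to a GPU index.

    Hypothetical helper for illustration only; it does not reproduce
    how the real library decides on a device map.
    """
    placement = {}
    gpu = 0
    used = 0.0
    for idx, size in enumerate(layer_sizes_gib):
        # Move on to the next GPU while the layer does not fit on this one.
        while gpu < len(gpu_capacities_gib) and used + size > gpu_capacities_gib[gpu]:
            gpu += 1
            used = 0.0
        if gpu == len(gpu_capacities_gib):
            raise MemoryError(f"layer {idx} does not fit on any remaining GPU")
        placement[idx] = gpu
        used += size
    return placement


# 32 equally sized layers, with a deliberately small budget on GPU 0.
placement = assign_layers([0.5] * 32, [8.0, 24.0])
layers_per_gpu = {g: sum(1 for v in placement.values() if v == g) for g in (0, 1)}
print(layers_per_gpu)
```

During a forward pass the activations then travel from GPU to GPU in layer order, so only one GPU is busy at a time. That is why this scheme is called naive.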

Your task:

1. Create a pod with two 24GB GPUs.

2. Try to run the model with device="auto" and see how much VRAM is used. Also try device_map="auto", the `from_pretrained` argument that automatically places the different layers on the available GPUs and gives more flexibility in how the model is distributed across them.

model_path = "/ssdshare/share/Meta-Llama-3-8B-Instruct/"
# TODO(Your Task): Load the model to multiple GPUs and check the GPU memory usage

import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

flush()
# one GPU
model1 = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bfloat16 stores each parameter in 2 bytes
    attn_implementation="flash_attention_2",
    device_map="cuda:0"
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

pipe = pipeline("text-generation", model=model1, tokenizer=tokenizer)

prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

result = pipe(prompt, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"][len(prompt):]

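def bytes_to_giga_bytes(num_bytes):
    # Convert a byte count to GiB (1 GiB = 1024**3 bytes).
    return num_bytes / (1024**3)

# Called without an argument, this reports the peak on the current device (cuda:0 by default).
print(bytes_to_giga_bytes(torch.cuda.max_memory_allocated()))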

del pipe
del model1
flush()

# 2 GPUs
model2 = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bfloat16 stores each parameter in 2 bytes
    attn_implementation="flash_attention_2",
    device_map="auto"
)
pipe = pipeline("text-generation", model=model2, tokenizer=tokenizer)

prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

result = pipe(prompt, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"][len(prompt):]

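# Called without an argument, this reports only the current device (cuda:0), not the total across GPUs.
print(bytes_to_giga_bytes(torch.cuda.max_memory_allocated()))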

del pipe
del model2
flush()

Device set to use cuda:0


15.010313987731934


Device set to use cuda:0


6.697180271148682


The GPU memory usage of loading the model to only one GPU is \_\_\_**15.01 GiB**\_\_\_\_\_.
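
The single-GPU figure can be sanity-checked with back-of-the-envelope arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 8.03 billion parameters for this model (an approximate figure, not taken from this notebook) and 2 bytes per parameter for bfloat16. The helper name `estimate_weight_gib` is hypothetical.

```python
PARAM_COUNT = 8.03e9   # assumed, approximate parameter count of the 8B model
BYTES_PER_PARAM = 2    # bfloat16 uses 2 bytes per parameter


def estimate_weight_gib(param_count, bytes_per_param):
    """Estimate the memory taken by the weights alone, in GiB."""
    return param_count * bytes_per_param / 1024**3


estimate = estimate_weight_gib(PARAM_COUNT, BYTES_PER_PARAM)
print(f"{estimate:.2f} GiB")
```

Under these assumptions the weights alone come out just under 15 GiB, which is close to the measured peak. The small remainder would plausibly be activations and the key-value cache built up during generation.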

The GPU memory usage of loading the model with device="auto" is \_\_\_\_\_\_\_\_ (not measured; the code above only uses device_map="auto"). The GPU memory usage of loading the model with device_map="auto" is \_\_\_**6.70 GiB**\_\_\_\_\_ (peak on the current device, cuda:0).

The number of GPUs you used is \_\_\_\_**1**\_\_\_\_.

Do the numbers above make sense?

No, because 1 GPU is enough and the data transfer between the GPUs will cause overhead.
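
One caveat when reading these numbers: `torch.cuda.max_memory_allocated()` called without an argument reports only the current device (cuda:0 by default), so the 6.70 figure may be just one GPU's share of a model that device_map="auto" split across both. Below is a minimal sketch for reading the peak on every visible GPU. The helper name `peak_memory_per_device` and its injectable arguments are hypothetical, added so the function can be tried without a GPU.

```python
def peak_memory_per_device(device_count=None, peak_fn=None):
    """Return {device_index: peak allocated memory in GiB}.

    By default this queries torch.cuda; both arguments can be overridden
    (for example with fake values) to try the function without a GPU.
    """
    if device_count is None or peak_fn is None:
        import torch  # imported lazily so the helper also works without torch
        device_count = torch.cuda.device_count()
        peak_fn = torch.cuda.max_memory_allocated
    return {i: peak_fn(i) / 1024**3 for i in range(device_count)}


# Dry run with fake numbers standing in for two GPUs (1 GiB and 2 GiB).
fake_peaks = peak_memory_per_device(2, lambda i: (i + 1) * 1024**3)
total_gib = sum(fake_peaks.values())
print(fake_peaks, total_gib)
```

On a real two-GPU pod, calling `peak_memory_per_device()` with no arguments right after generation would show whether the second GPU holds the rest of the weights. Note that `torch.cuda.reset_peak_memory_stats()` also acts on the current device only, so each device would need its own reset between runs.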