## Running on multiple GPUs using Hugging Face Transformers

Naive pipeline parallelism is supported out of the box. To use it, load the model with `device_map="auto"`, which automatically places the model's layers across the available GPUs.
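As a rough illustration of what "placing layers on GPUs" means, the sketch below assigns layers to GPUs in order and moves on to the next GPU once the current one is full. It is only a toy model: `assign_layers`, the layer sizes, and the memory budgets are hypothetical, and this is not the algorithm that Transformers or Accelerate actually uses.

```python
def assign_layers(layer_sizes_mb, gpu_budgets_mb):
    """Greedily place layers on GPUs in order.

    Hypothetical helper for illustration only; not the real placement logic.
    """
    device_map = {}
    gpu = 0
    used = 0.0
    for name, size in layer_sizes_mb.items():
        # Advance to the next GPU while the layer does not fit in the current budget.
        while gpu < len(gpu_budgets_mb) and used + size > gpu_budgets_mb[gpu]:
            gpu += 1
            used = 0.0
        if gpu == len(gpu_budgets_mb):
            raise MemoryError(f"{name} does not fit on any remaining GPU")
        device_map[name] = gpu
        used += size
    return device_map


# Toy example: six equally sized layers, two GPUs that each hold three of them.
layers = {f"layers.{i}": 100.0 for i in range(6)}
toy_map = assign_layers(layers, gpu_budgets_mb=[300.0, 300.0])
print(toy_map)
```

With naive pipeline parallelism, a forward pass runs the layers on the first GPU and then hands the activations to the next one, so the GPUs work one after another rather than at the same time.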

Your task:

1. Create a pod with two 24GB GPUs.

2. Try to load the model with `device="auto"` and check how much VRAM is used. Also try loading it with `device_map="auto"`, which automatically places the layers across the available GPUs. This is a more advanced form of pipeline parallelism that allows more flexibility in how the model is distributed across GPUs.

In [1]:
model_path = "/ssdshare/share/Meta-Llama-3-8B-Instruct/"
# TODO(Your Task): Load the model to multiple GPUs and check the GPU memory usage
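import torch
import pandas as pd
from IPython.display import display  # built in to notebooks; imported so the cell also runs as a script
from transformers import AutoModelForCausalLM
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)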

def get_gpu_memory():
    """Return the memory usage of each GPU as a DataFrame."""
    nvmlInit()
    gpu_info = []
    device_count = nvmlDeviceGetCount()
    
    for i in range(device_count):
        handle = nvmlDeviceGetHandleByIndex(i)
        info = nvmlDeviceGetMemoryInfo(handle)
        gpu_info.append({
            'id': i,
            'total': info.total / 1024**2,  # MB
            'used': info.used / 1024**2,    # MB
            'free': info.free / 1024**2     # MB
        })
    
    nvmlShutdown()
    return pd.DataFrame(gpu_info)

# Check the initial GPU memory state
print("Initial GPU memory state:")
initial_gpu_memory = get_gpu_memory()
display(initial_gpu_memory)

# 1. Load the model onto a single GPU
print("\nLoading the model onto a single GPU (device=0):")
torch.cuda.empty_cache()
single_gpu_model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    torch_dtype=torch.bfloat16, 
    device_map="cuda:0"
)
single_gpu_memory = get_gpu_memory()
display(single_gpu_memory)
single_gpu_usage = single_gpu_memory['used'] - initial_gpu_memory['used']
display(pd.DataFrame({'GPU ID': single_gpu_memory['id'], 'Memory Used (MB)': single_gpu_usage}))
del single_gpu_model
torch.cuda.empty_cache()

# 2. Load the model with device_map="balanced" (as a substitute for device="auto")
# print("\nLoading the model with device_map=\"balanced\":")
# torch.cuda.empty_cache()
# balanced_model = AutoModelForCausalLM.from_pretrained(
#     model_path, 
#     torch_dtype=torch.bfloat16, 
#     device_map="balanced"
# )
# balanced_memory = get_gpu_memory()
# display(balanced_memory)
# balanced_usage = balanced_memory['used'] - initial_gpu_memory['used']
# display(pd.DataFrame({'GPU ID': balanced_memory['id'], 'Memory Used (MB)': balanced_usage}))
# print("\nModel placement across devices (balanced):")
# print(balanced_model.hf_device_map)
# del balanced_model
# torch.cuda.empty_cache()

# 3. Load the model with device_map="auto"
print("\nLoading the model with device_map=\"auto\":")
torch.cuda.empty_cache()
auto_map_model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
auto_map_memory = get_gpu_memory()
display(auto_map_memory)
auto_map_usage = auto_map_memory['used'] - initial_gpu_memory['used']
display(pd.DataFrame({'GPU ID': auto_map_memory['id'], 'Memory Used (MB)': auto_map_usage}))

# Show which device each module was placed on
print("\nModel placement across devices (auto):")
print(auto_map_model.hf_device_map)

print("\nSummary:")
print(f"Single-GPU load memory usage: GPU 0: {single_gpu_usage[0]:.2f} MB")
for i in range(len(auto_map_usage)):
    print(f"device_map=\"auto\" memory usage: GPU {i}: {auto_map_usage[i]:.2f} MB")
print(f"Number of GPUs used: {torch.cuda.device_count()}")

torch.cuda.empty_cache()

Initial GPU memory state:


|   | id | total | used | free |
|---|---|---|---|---|
| 0 | 0 | 24564.0 | 453.125 | 24110.875 |
| 1 | 1 | 24564.0 | 453.125 | 24110.875 |



Loading the model onto a single GPU (device=0):



|   | id | total | used | free |
|---|---|---|---|---|
| 0 | 0 | 24564.0 | 16165.3125 | 8398.6875 |
| 1 | 1 | 24564.0 | 456.0 | 24108.0 |


|   | GPU ID | Memory Used (MB) |
|---|---|---|
| 0 | 0 | 15712.1875 |
| 1 | 1 | 2.875 |



Loading the model with device_map="auto":



|   | id | total | used | free |
|---|---|---|---|---|
| 0 | 0 | 24564.0 | 7675.3125 | 16888.6875 |
| 1 | 1 | 24564.0 | 9339.3125 | 15224.6875 |


|   | GPU ID | Memory Used (MB) |
|---|---|---|
| 0 | 0 | 7222.1875 |
| 1 | 1 | 8886.1875 |



Model placement across devices (auto):
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}
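The printed device map is long, so it can help to summarize it. The following is a minimal sketch that counts how many entries land on each device. The dictionary `example_map` is rebuilt here by hand to mirror the output above, and `modules_per_device` is a hypothetical helper name; in the notebook you would pass `auto_map_model.hf_device_map` instead.

```python
from collections import Counter


def modules_per_device(device_map):
    """Count how many entries of a device map are assigned to each device."""
    return dict(Counter(device_map.values()))


# Rebuild a device map with the same shape as the printed output.
example_map = {"model.embed_tokens": 0}
example_map.update({f"model.layers.{i}": 0 for i in range(14)})
example_map.update({f"model.layers.{i}": 1 for i in range(14, 32)})
example_map.update({"model.norm": 1, "model.rotary_emb": 1, "lm_head": 1})

counts = modules_per_device(example_map)

# Count only the decoder layers, which hold most of the weights.
layer_counts = Counter(
    device for name, device in example_map.items()
    if name.startswith("model.layers.")
)

print("entries per device:", counts)
print("decoder layers per device:", dict(layer_counts))
```

GPU 1 receives more decoder layers than GPU 0 in this map, which matches its higher memory usage in the table above.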

Summary:
Single-GPU load memory usage: GPU 0: 15712.19 MB
device_map="auto" memory usage: GPU 0: 7222.19 MB
device_map="auto" memory usage: GPU 1: 8886.19 MB
Number of GPUs used: 2


Loading the model onto a single GPU uses 15712.19 MB of GPU memory.

The GPU memory usage with `device="auto"` could not be measured, because loading the model that way always raised an error. With `device_map="auto"`, the usage is 7222.19 MB on GPU 0 and 8886.19 MB on GPU 1, which is 16108.38 MB in total (roughly 16 GB).

The number of GPUs you used is 2.

Do the numbers above make sense?

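Yes, the numbers make sense. With `device_map="auto"`, the model's layers are split across the two GPUs, so each GPU holds only part of the weights: 7222.19 MB on GPU 0 and 8886.19 MB on GPU 1. Their sum, 16108.38 MB, is close to the 15712.19 MB used when the whole model sits on one GPU. GPU 1 uses more memory because the device map assigns it 18 of the 32 decoder layers plus `model.norm` and `lm_head`, while GPU 0 holds the embeddings and the other 14 layers.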
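A quick back-of-the-envelope check supports this. The sketch below compares the measured values with the expected size of the weights, assuming roughly 8.03 billion parameters for Llama 3 8B at 2 bytes each in bfloat16. The parameter count is an assumption that is not stated in this notebook, and attributing the leftover memory to per-GPU runtime overhead (such as the CUDA context) is a plausible guess rather than something measured here.

```python
MIB = 1024 ** 2

# Measured NVML deltas copied from the tables above (computed as bytes / 1024**2).
single_gpu_mb = 15712.1875
auto_map_mb = [7222.1875, 8886.1875]

# Assumption: about 8.03e9 parameters, 2 bytes per parameter in bfloat16.
approx_params = 8.03e9
expected_weights_mb = approx_params * 2 / MIB

total_auto_mb = sum(auto_map_mb)
extra_mb = total_auto_mb - single_gpu_mb

print(f"expected weight size:       {expected_weights_mb:.0f} MB")
print(f"single-GPU measurement:     {single_gpu_mb:.2f} MB")
print(f"two-GPU total:              {total_auto_mb:.2f} MB")
print(f"extra when using two GPUs:  {extra_mb:.2f} MB")
```

Under these assumptions the weights alone account for most of the single-GPU measurement, and splitting the model across two GPUs costs only a few hundred MB more in total.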