<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Memory-efficient Model Weight Loading

- This notebook provides tips for loading larger pretrained or finetuned models when GPU (or CPU) memory is limited

- Specifically, it focuses on cases where you saved the model using `torch.save(model.state_dict(), "model.pth")` (for example, in chapters 5-7) and want to load it in a new session later for continued pretraining or additional finetuning

- While the example uses an LLM, the methods explained in this notebook are general and apply to loading any PyTorch model, not just LLMs

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/memory-efficient-loading/memory-efficient-loading.webp" width="800px">

In [1]:
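# Import the helper used to look up installed package versions
from importlib.metadata import version

# Packages whose versions should be reported
pkgs = [
    "torch",  # PyTorch deep learning framework
]

# Print the installed version of each package
for p in pkgs:
    print(f"{p} version: {version(p)}")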

torch version: 2.4.1+cu121


&nbsp;
## 1. Benchmark utilities

- First, let's define some utility code to track VRAM (GPU memory)
- Later, we will also introduce a tool to track the main system RAM (CPU memory)
- The purpose of these functions will become clear when we apply them later

In [2]:
import gc
import time
import torch


def start_memory_tracking():
    """Initialize GPU memory tracking."""
    if torch.cuda.is_available():
        # Reset the peak memory statistics so later readings start from zero
        torch.cuda.reset_peak_memory_stats()
    else:
        print("This notebook is intended for CUDA GPUs but CUDA is not available.")

def print_memory_usage():
    # Peak GPU memory allocated since the last reset, converted from bytes to GB
    max_gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"Maximum GPU memory allocated: {max_gpu_memory:.1f} GB")

def cleanup():
    # Run garbage collection and release cached GPU memory
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(3)  # some buffer time to allow memory to clear
    # Reset the peak statistics, then report the peak after cleanup
    torch.cuda.reset_peak_memory_stats()
    # Query the current CUDA device so this function does not depend on a global `device`
    max_memory_allocated = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"Maximum GPU memory allocated: {max_memory_allocated:.1f} GB")

&nbsp;
## 2. Model setup

- This code section sets up the model itself
- Here, we use the largest GPT-2 configuration, "gpt2-xl (1558M)", to make things more interesting (you may use "gpt2-small (124M)" to lower the memory requirements and execution time of this notebook)

In [3]:
# Import the GPTModel class from the previous chapters
from previous_chapters import GPTModel


# Settings shared by all model sizes
BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

# Size-specific settings for the GPT-2 variants
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Select the model size to use
CHOOSE_MODEL = "gpt2-xl (1558M)"

# Merge the selected size-specific settings into the base configuration
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

- Now, let's see the GPU memory functions in action:

In [4]:
# Start tracking GPU memory
start_memory_tracking()

# Instantiate the model and move it to the GPU
model = GPTModel(BASE_CONFIG)
device = torch.device("cuda")
model.to(device)

# Report the peak GPU memory so far
print_memory_usage()

Maximum GPU memory allocated: 6.4 GB
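
- As a rough sanity check on the 6.4 GB figure, the sketch below estimates the float32 footprint from the configuration alone. It assumes a GPT-2-style layout that is not shown in this notebook (untied token-embedding and output-head matrices, a 4x feed-forward expansion, two LayerNorms per block, and one `context_length x context_length` float32 causal-mask buffer per block). `estimate_param_count` is a hypothetical helper; the true numbers depend on the `GPTModel` implementation in `previous_chapters`.

```python
GIB = 1024 ** 3

def estimate_param_count(cfg):
    """Estimate the parameter count for an ASSUMED GPT-2-style layout (see the note above)."""
    d, n_layers = cfg["emb_dim"], cfg["n_layers"]
    qkv_bias = d if cfg["qkv_bias"] else 0
    attn = 3 * (d * d + qkv_bias) + (d * d + d)   # Q, K, V projections plus output projection
    ff = (d * 4 * d + 4 * d) + (4 * d * d + d)    # two linear layers with 4x expansion
    norms = 2 * 2 * d                             # two LayerNorms (scale and shift) per block
    embeddings = cfg["vocab_size"] * d + cfg["context_length"] * d
    out_head = d * cfg["vocab_size"]              # assumed untied and without bias
    final_norm = 2 * d
    return n_layers * (attn + ff + norms) + embeddings + out_head + final_norm

# The "gpt2-xl (1558M)" settings used in this notebook
cfg = {"vocab_size": 50257, "context_length": 1024, "qkv_bias": True,
       "emb_dim": 1600, "n_layers": 48}

n_params = estimate_param_count(cfg)
param_gib = n_params * 4 / GIB  # float32 uses 4 bytes per value
mask_gib = cfg["n_layers"] * cfg["context_length"] ** 2 * 4 / GIB  # assumed mask buffers

print(f"{n_params:,} parameters")
print(f"about {param_gib:.1f} GB of weights plus {mask_gib:.2f} GB of mask buffers")
```

- Under these assumptions the estimate lands at roughly 6.3 GB, in the same range as the measured 6.4 GB; the small gap may come from allocation rounding or from details this sketch ignores.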


- Additionally, let's make sure that the model runs okay by passing in an example tensor

In [5]:
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)  # Example input on the GPU
model.eval()  # Switch to evaluation mode

with torch.no_grad():  # Disable gradient tracking
    model(test_input)  # Run a forward pass

- Next, imagine we were pretraining the model and saving it for later use
- We skip the actual pretraining here for simplicity and just save the initialized model (but the same concept applies)

In [6]:
# Training code would go here...

# Switch back to training mode
model.train()

# Save the model's state_dict to disk
torch.save(model.state_dict(), "model.pth")

- Lastly, we delete the model and example tensor in the Python session to reset the GPU memory

In [7]:
# Delete the model and test input to free memory
del model, test_input
# Release cached memory and reset the peak statistics
cleanup()

Maximum GPU memory allocated: 0.0 GB


&nbsp;
## 3. Weight loading

- Now begins the interesting part where we load the pretrained model weights
- Let's see how much GPU memory is required to load the previously saved model

In [8]:
# Then load pretrained weights

# Start tracking GPU memory
start_memory_tracking()

# Create a fresh model instance and move it to the device
model = GPTModel(BASE_CONFIG)
model.to(device)

# Load the saved weights from disk and copy them into the model
model.load_state_dict(
    torch.load("model.pth", map_location=device, weights_only=True)
)
# Make sure the model is on the target device
model.to(device)
# Switch to evaluation mode
model.eval();

# Report the peak GPU memory
print_memory_usage()

Maximum GPU memory allocated: 12.8 GB


- Notice that the peak memory (12.8 GB) is twice as large as in the previous section (6.4 GB)
- This is because we have the same model in memory twice, for a short period of time:
  - The first time via `model.to(device)`
  - The second time via the code line `model.load_state_dict(torch.load("model.pth", map_location=device, weights_only=True))`; eventually, the loaded model weights will be copied into the model, and the `state_dict` will be discarded, but for a brief amount of time, we have both the main model and the loaded `state_dict` in memory
- The remaining sections focus on addressing this
- But first, let's test the model and reset the GPU memory


In [9]:
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)

# Delete the model and test input to free memory
del model, test_input
cleanup()

Maximum GPU memory allocated: 0.0 GB


&nbsp;
## 4. Loading weights sequentially

- One workaround for the problem of having the model weights in GPU memory twice, as highlighted in the previous section, is to load the model sequentially
- Below, we:
  - first load the model into GPU memory
  - then load the model weights into CPU memory
  - and finally copy each parameter one by one into GPU memory


In [10]:
# Start tracking GPU memory
start_memory_tracking()

# Create the model and move it to the device
model = GPTModel(BASE_CONFIG).to(device)

# Load the saved weights into CPU memory
state_dict = torch.load("model.pth", map_location="cpu", weights_only=True)

# Report the peak GPU memory before copying
print_memory_usage()

# Sequentially copy weights to the model's parameters
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in state_dict:
            # Move one tensor to the device and copy it into the parameter in place
            param.copy_(state_dict[name].to(device))
        else:
            print(f"Warning: {name} not found in state_dict.")

# Report the peak GPU memory after copying
print_memory_usage()

Maximum GPU memory allocated: 6.4 GB
Maximum GPU memory allocated: 6.7 GB


- As we can see above, the memory usage is much lower than before (a peak of 6.7 GB instead of 12.8 GB)
- Notice that the memory increases from 6.4 to 6.7 GB because initially, we only have the model in memory, and then we have the model plus 1 parameter tensor in memory (we temporarily move each parameter tensor to the GPU via `.to(device)` so that we can copy it into the model with `copy_`)
- Overall, this is a significant improvement
- Again, let's briefly test the model and then reset the GPU memory for the next section

In [11]:
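# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

# Run a forward pass without gradient tracking
with torch.no_grad():
    model(test_input)

# Delete the variables that are no longer needed to free memory
del model, test_input, state_dict, param
cleanup()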

Maximum GPU memory allocated: 0.0 GB
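
- The two peaks seen so far can be reproduced with simple bookkeeping. The sketch below is a toy model, not a measurement: `simulate_peak_bytes` is a hypothetical helper that assumes the model is already resident on the GPU and that each temporary tensor is freed once it has been copied. The tensor sizes are made up for illustration.

```python
def simulate_peak_bytes(tensor_bytes, sequential):
    """Toy estimate of peak GPU bytes while loading weights into a resident model."""
    model_bytes = sum(tensor_bytes)    # the instantiated model already on the GPU
    if sequential:
        extra = max(tensor_bytes)      # only one temporary tensor is alive at a time
    else:
        extra = sum(tensor_bytes)      # the whole state_dict sits next to the model
    return model_bytes + extra

# Hypothetical tensor sizes in bytes, chosen only for illustration
sizes = [400, 100, 100, 200]
print(simulate_peak_bytes(sizes, sequential=False))  # 1600: two full copies
print(simulate_peak_bytes(sizes, sequential=True))   # 1200: one copy plus the largest tensor

# Largest single tensor IF the token embedding is 50257 x 1600 in float32 (an assumption)
largest_gb = 50257 * 1600 * 4 / 1024 ** 3
print(f"{largest_gb:.2f} GB")
```

- Under that assumption the largest temporary tensor is about 0.3 GB, which is consistent with the step from 6.4 to 6.7 GB reported in this section.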


&nbsp;
## 5. Loading the model with low CPU memory

- In the previous section, we reduced GPU memory use by loading the weights (`state_dict`) into CPU memory first before copying them one-by-one into the model
- However, what do we do if we have limited CPU memory?
- This section uses PyTorch's so-called `"meta"` device approach to load a model on machines with large GPU memory but small CPU memory
- But first, let's define a convenience function to monitor CPU memory

In [12]:
import os
import psutil
from threading import Thread


def memory_usage_in_gb(func, *args, **kwargs):
    # Handle to the current Python process
    process = psutil.Process(os.getpid())

    # Measure the baseline memory usage before running the function
    baseline_mem = process.memory_info().rss / 1024 ** 3  # in GB

    # Start monitoring memory in a separate thread
    mem_usage = []
    done = False

    def monitor_memory():
        # Keep sampling until the main thread sets `done` to True
        while not done:
            mem_usage.append(process.memory_info().rss / 1024 ** 3)  # Convert to GB
            time.sleep(0.1)  # Sample every 0.1 seconds

    t = Thread(target=monitor_memory)
    t.start()

    # Run the function
    func(*args, **kwargs)

    # Stop monitoring
    done = True
    t.join()

    # Peak memory usage relative to the baseline
    peak_mem_usage_gb = max(mem_usage) - baseline_mem
    return peak_mem_usage_gb
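
- The sampling pattern used by `memory_usage_in_gb` can be tried out without `psutil`. The sketch below is a variation rather than the notebook's implementation: `peak_increase` is a hypothetical helper that accepts any zero-argument `sample` callable, uses a `threading.Event` in place of the boolean flag, seeds the readings with the baseline so that `max()` never sees an empty list, and stops the monitor even if the workload raises.

```python
import time
from threading import Event, Thread

def peak_increase(func, sample, interval=0.01):
    """Run func() while polling sample() in a background thread; return peak minus baseline."""
    baseline = sample()
    readings = [baseline]          # seeded so max() never sees an empty list
    stop = Event()

    def monitor():
        while not stop.is_set():
            readings.append(sample())
            stop.wait(interval)    # sleeps, but wakes early once stop is set

    t = Thread(target=monitor)
    t.start()
    try:
        func()
    finally:
        stop.set()                 # stop the monitor even if func raises
        t.join()
    return max(readings) - baseline

# Stand-in for "memory": the length of a list that grows and shrinks during the workload
held = []

def workload():
    held.extend(range(5))
    time.sleep(0.2)                # long enough for several samples
    held.clear()

print(peak_increase(workload, sample=lambda: len(held)))
```

- As with any polling approach, a spike shorter than the sampling interval can be missed, so the reported peak is a lower bound.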


- To start with, let's track the CPU memory of the sequential weight loading approach from the previous section

In [13]:
def load_sequentially():
    # Start tracking GPU memory
    start_memory_tracking()

    # Create the model and move it to the device
    model = GPTModel(BASE_CONFIG).to(device)

    # Load the saved weights into CPU memory
    state_dict = torch.load("model.pth", map_location="cpu", weights_only=True)

    # Report the peak GPU memory before copying
    print_memory_usage()

    # Sequentially copy weights to the model's parameters
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                # Move one tensor to the device and copy it into the parameter in place
                param.copy_(state_dict[name].to(device))
            else:
                print(f"Warning: {name} not found in state_dict.")

    # Report the peak GPU memory after copying
    print_memory_usage()


# Measure the peak CPU memory used while the function runs
peak_memory_used = memory_usage_in_gb(load_sequentially)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 6.4 GB
Maximum GPU memory allocated: 6.7 GB
-> Maximum CPU memory allocated: 6.3 GB


- Now, suppose we have a machine with low CPU memory but large GPU memory

- We can trade off CPU memory and GPU memory usage by introducing PyTorch's so-called "meta" device

- PyTorch's meta device is a special device type that allows you to create tensors without allocating actual memory for their data, effectively creating "meta" tensors

- This is useful for tasks like model analysis or architecture definition, where you need tensor shapes and types without the overhead of memory allocation
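
- To make the idea concrete, here is a minimal sketch on a single `nn.Linear` layer (a stand-in chosen for illustration, not the GPT model). It assumes a PyTorch version that supports `torch.device` as a context manager (2.0 or newer), and it targets `"cpu"` in `to_empty` only so that it runs without a GPU.

```python
import torch
import torch.nn as nn

# Build a layer on the meta device: shapes and dtypes exist, the data does not
with torch.device("meta"):
    layer = nn.Linear(1024, 1024)

print(layer.weight.is_meta)   # True
print(layer.weight.shape)     # torch.Size([1024, 1024])

# Allocate real but uninitialized storage on a concrete device
layer = layer.to_empty(device="cpu")
print(layer.weight.is_meta)   # False; the values are arbitrary until weights are loaded
```

- Because `to_empty` leaves the storage uninitialized, the model is only usable after the saved weights have been copied in.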

In [14]:
def load_sequentially_with_meta():
    # Start tracking GPU memory
    start_memory_tracking()

    # Create the model on the meta device, which allocates no memory for the weights
    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    # Allocate uninitialized storage for the model on the target device
    model = model.to_empty(device=device)

    # Load the saved weights directly onto the target device
    state_dict = torch.load("model.pth", map_location=device, weights_only=True)

    # Report the peak GPU memory before copying
    print_memory_usage()

    # Sequentially copy weights to the model's parameters
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                # Both tensors are already on the device, so this is a device-local copy
                param.copy_(state_dict[name])
            else:
                print(f"Warning: {name} not found in state_dict.")

    # Report the peak GPU memory after copying
    print_memory_usage()

# Measure the peak CPU memory used while the function runs
peak_memory_used = memory_usage_in_gb(load_sequentially_with_meta)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 12.8 GB
Maximum GPU memory allocated: 12.8 GB
-> Maximum CPU memory allocated: 1.3 GB


- As we can see above, by creating the model on the meta device and loading the weights directly into GPU memory, we effectively reduced the CPU memory requirements (1.3 GB instead of 6.3 GB), at the cost of a higher GPU peak (12.8 GB)
- One might ask: "Is the sequential weight loading still necessary then, and how does that compare to the original approach?"
- Let's check the simple PyTorch weight loading approach for comparison (from the first weight loading section in this notebook):

In [15]:
def baseline():
    # Start tracking GPU memory
    start_memory_tracking()

    # Create the model and move it to the device
    model = GPTModel(BASE_CONFIG)
    model.to(device)

    # Load the saved weights onto the device and copy them into the model
    model.load_state_dict(torch.load("model.pth", map_location=device, weights_only=True))
    model.to(device)
    # Switch to evaluation mode
    model.eval();

    # Report the peak GPU memory
    print_memory_usage()

# Measure the peak CPU memory used while the function runs
peak_memory_used = memory_usage_in_gb(baseline)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 12.8 GB
-> Maximum CPU memory allocated: 4.4 GB


- As we can see above, the "simple" weight loading without the meta device uses more CPU memory (4.4 GB compared to 1.3 GB)
- In other words, if you have a machine with limited CPU memory, you can use the meta device approach to directly load the model weights into GPU memory to reduce peak CPU memory usage

&nbsp;
## 6. Using `mmap=True` (recommended)

- As an intermediate or advanced `torch.load` user, you may wonder how these approaches compare to the `mmap=True` setting in PyTorch
- The `mmap=True` setting in PyTorch enables memory-mapped file I/O, which allows the tensor to access data directly from disk storage, thus reducing memory usage by not loading the entire file into RAM if RAM is limited
- Also, see the helpful comment by [mikaylagawarecki](https://github.com/rasbt/LLMs-from-scratch/issues/402)
- At first glance, it may look less efficient than the sequential approaches above:

In [37]:
def best_practices():
    # Create the model on the meta device, which allocates no memory for the weights
    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    # Load the weights via memory mapping and assign them to the model
    model.load_state_dict(
        torch.load("model.pth", map_location=device, weights_only=True, mmap=True),
        assign=True
    )

    # Report the peak GPU memory
    print_memory_usage()

# Measure the peak CPU memory used while the function runs
peak_memory_used = memory_usage_in_gb(best_practices)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 6.4 GB
-> Maximum CPU memory allocated: 5.9 GB


- The reason why the CPU RAM usage is so high is that there's enough CPU RAM available on this machine
- However, if you were to run this on a machine with limited CPU RAM, the `mmap` approach would use less memory

&nbsp;
## 7. Other methods

- This notebook is focused on simple, built-in methods for loading weights in PyTorch
- The recommended approach for limited CPU memory cases is the `mmap=True` approach explained above
- Alternatively, one other option is a brute-force approach that saves and loads each weight tensor separately:

In [13]:
# Create a model instance
model = GPTModel(BASE_CONFIG)
# Assume `model` is your trained model
state_dict = model.state_dict()

# Create a directory to store individual parameter files
os.makedirs("model_parameters", exist_ok=True)

# Save each tensor in the state_dict to its own file
for name, param in state_dict.items():
    torch.save(param.cpu(), f"model_parameters/{name}.pt")

# Delete the model to free memory
del model

In [16]:
def load_individual_weights():

    # Start tracking GPU memory
    start_memory_tracking()

    # Create the model on the meta device, which allocates no memory for the weights
    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    # Allocate uninitialized storage for the model on the target device
    model = model.to_empty(device=device)

    # Report the peak GPU memory before loading
    print_memory_usage()
    # Directory that holds the per-parameter files
    param_dir = "model_parameters"

    # Load the weights one file at a time without gradient tracking
    with torch.no_grad():
        for name, param in model.named_parameters():
            # Path of the file that stores this parameter
            weight_path = os.path.join(param_dir, f"{name}.pt")
            if os.path.exists(weight_path):
                # Load a single tensor into CPU memory and copy it into the parameter
                param_data = torch.load(weight_path, map_location="cpu", weights_only=True)
                param.copy_(param_data)
                del param_data  # Free memory
            else:
                print(f"Warning: {name} not found in {param_dir}.")

    # Report the peak GPU memory after loading
    print_memory_usage()


# Measure the peak CPU memory used while the function runs
peak_memory_used = memory_usage_in_gb(load_individual_weights)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 6.4 GB
Maximum GPU memory allocated: 6.4 GB
-> Maximum CPU memory allocated: 0.3 GB
