<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# FLOPS Analysis

- FLOPs (floating-point operations) measure the computational complexity of neural network models by counting the number of floating-point operations executed; this count is distinct from FLOPS or FLOP/s (floating-point operations per second), which describes hardware throughput
- Higher FLOPs indicate more intensive computation and higher energy consumption
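To make the counting concrete, here is a minimal sketch for a single bias-free linear layer. The helper name `linear_layer_flops` and the layer sizes are illustrative assumptions, not part of the notebook; the sketch uses the same convention as the cells below, where one multiply-accumulate (MAC) is counted as two FLOPs.

```python
def linear_layer_flops(num_tokens: int, in_features: int, out_features: int) -> int:
    """Forward-pass FLOPs of a bias-free linear layer applied to `num_tokens` vectors.

    Hypothetical helper for illustration only.
    """
    # Each output element needs `in_features` multiply-accumulate operations
    macs = num_tokens * in_features * out_features
    # One MAC is counted as two FLOPs (one multiply and one add)
    return 2 * macs


# Illustrative sizes: 1024 tokens through a 768 -> 3072 projection
flops = linear_layer_flops(num_tokens=1024, in_features=768, out_features=3072)
print(f"{flops:.1e} FLOPs")
```

Doubling the number of tokens doubles the count, which is why the FLOPs reported below grow linearly with the batch size.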

In [None]:
# pip install -r requirements-extra.txt

In [None]:
# Import the version function from importlib.metadata
from importlib.metadata import version

# Packages whose versions should be checked
pkgs = [
    "thop",    # Package for counting model FLOPs
    "torch",   # PyTorch deep learning framework
]

# Print the version of each package
for p in pkgs:
    print(f"{p} version: {version(p)}")

thop version: 0.1.1-2209072238
torch version: 2.4.1+cu121


&nbsp;
# Simple benchmark with fixed batch size

- forward pass only

In [None]:
# Import PyTorch
import torch
# Import the profile function used to count FLOPs
from thop import profile
# Import the GPT model from the previous chapters
from previous_chapters import GPTModel


# Base configuration shared by all model sizes
BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

# Configurations for the different GPT model sizes
model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Use the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Fixed batch size
batch_size = 2
# Random token IDs as input, moved to the selected device
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)

# Iterate over the model sizes
for size in model_configs:
    # Update the base configuration with the size-specific settings
    BASE_CONFIG.update(model_configs[size])

    # Initialize the model in bfloat16
    model = GPTModel(BASE_CONFIG).bfloat16()
    # Move the model to the selected device
    model.to(device)

    # MACS = multiply-accumulate operations
    # Each MAC is typically counted as two FLOPs (one multiply and one accumulate)
    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    # Total FLOPs of the forward pass
    flops = 2*macs
    # Print the model size and its FLOPs
    print(f"{size:18}: {flops:.1e} FLOPS")

    # Delete the model to free memory
    del model
    # Clear the CUDA cache
    torch.cuda.empty_cache()

gpt-small (124M)  : 5.1e+11 FLOPS
gpt-medium (355M) : 1.4e+12 FLOPS
gpt-large (774M)  : 3.2e+12 FLOPS
gpt-xl (1558M)    : 6.4e+12 FLOPS
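As a sanity check, the printed values can be compared against a back-of-the-envelope estimate. The sketch below is an approximation under stated assumptions, not a description of how `thop` counts internally: it assumes a GPT-2-style layout in which each transformer block holds about 12 × `emb_dim`² matmul weights (4 × `emb_dim`² for the attention projections and 8 × `emb_dim`² for a feed-forward module with 4× expansion), plus an untied `emb_dim` × `vocab_size` output layer, and it ignores biases, normalization layers, and the attention-score matmuls. The helper name `matmul_weight_count` is hypothetical.

```python
VOCAB_SIZE = 50257
TOKENS = 2 * 1024  # batch_size * context_length from the cell above

configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48},
}


def matmul_weight_count(emb_dim, n_layers, vocab_size=VOCAB_SIZE):
    """Approximate number of weights used in matrix multiplications (assumed layout)."""
    # Assumption: 4*d^2 for attention projections + 8*d^2 for the feed-forward module
    per_block = 12 * emb_dim ** 2
    # Assumption: separate (untied) output projection of size d * vocab_size
    return n_layers * per_block + emb_dim * vocab_size


estimates = {}
for name, cfg in configs.items():
    # Rule of thumb: 2 FLOPs per matmul weight per token in the forward pass
    estimates[name] = 2 * matmul_weight_count(**cfg) * TOKENS
    print(f"{name:18}: {estimates[name]:.1e} FLOPS (estimate)")
```

Under these assumptions, the estimates round to the same two significant digits as the measured values, which suggests that the forward-pass cost here is dominated by the linear layers.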


&nbsp;
# Simple benchmark with automatic batch size finding

- forward pass only

In [None]:
# Iterate over the model sizes
for size in model_configs:
    # Print the model size currently being processed
    print(f"\nProcessing {size}")
    # Copy the base configuration
    config = BASE_CONFIG.copy()
    # Apply the size-specific settings
    config.update(model_configs[size])

    # Lower bound of the batch size search
    min_batch_size = 1
    # Largest batch size that has succeeded so far
    max_batch_size = None
    # Upper bound of the batch size search
    max_possible_batch_size = 4096

    # Binary search: continue while the search range is non-empty
    while min_batch_size <= max_possible_batch_size:
        # Try the midpoint of the current range
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            # Random token IDs as input
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )

            # Initialize the model in bfloat16 and move it to the device
            model = GPTModel(config).bfloat16().to(device)

            # MACS = multiply-accumulate operations
            # Each MAC is typically counted as two FLOPs (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            # Total FLOPs of the forward pass
            flops = 2 * macs
            # Print the current batch size and its FLOPs
            print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")

            # Success: try a larger batch size next
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Free memory
            del model, input_tensor
            torch.cuda.empty_cache()

        except RuntimeError as e:
            # Out-of-memory error
            if "out of memory" in str(e).lower():
                # Try a smaller batch size next
                max_possible_batch_size = batch_size - 1

                # Free memory
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                # Re-raise any other error
                raise e


Processing gpt-small (124M)
  Batch size 256: 6.5e+13 FLOPS
  Batch size 384: 9.7e+13 FLOPS
  Batch size 388: 9.8e+13 FLOPS
  Batch size 389: 9.8e+13 FLOPS

Processing gpt-medium (355M)
  Batch size 256: 1.9e+14 FLOPS
  Batch size 260: 1.9e+14 FLOPS
  Batch size 262: 1.9e+14 FLOPS
  Batch size 263: 1.9e+14 FLOPS

Processing gpt-large (774M)
  Batch size 256: 4.0e+14 FLOPS

Processing gpt-xl (1558M)
  Batch size 128: 4.1e+14 FLOPS
  Batch size 136: 4.3e+14 FLOPS
  Batch size 140: 4.5e+14 FLOPS
  Batch size 142: 4.5e+14 FLOPS
  Batch size 143: 4.6e+14 FLOPS
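The search pattern used above can be isolated from the GPU specifics. The following is a minimal sketch in which a hypothetical predicate `fits_in_memory` stands in for "the forward pass ran without an out-of-memory error"; the memory model and its numbers are made up for illustration.

```python
def find_max_feasible(fits, low=1, high=4096):
    """Binary search for the largest value in [low, high] for which `fits` is True.

    Assumes feasibility is monotonic: if a batch size fits, all smaller ones fit too.
    Returns None if no value in the range fits.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid       # mid works, so remember it and try larger values
            low = mid + 1
        else:
            high = mid - 1   # mid fails, so only smaller values can work
    return best


# Hypothetical memory model: fixed overhead plus a per-sample cost
MEMORY_BUDGET = 1000


def fits_in_memory(batch_size):
    return 50 + 7 * batch_size <= MEMORY_BUDGET


max_bs = find_max_feasible(fits_in_memory)
print("Largest feasible batch size:", max_bs)
```

Only the probes that succeed are printed in the output above, which is why each model lists an increasing sequence of batch sizes that ends at the largest one that fits.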


&nbsp;
# Benchmark with automatic batch size finding and Model FLOPs Utilization (MFU)

- Model FLOPs Utilization (MFU) explanation from the [PaLM paper](https://arxiv.org/abs/2204.02311)

> We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization.

$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$

where

$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$

and

$$\text{Observed Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$
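The three formulas above can be combined into one small function. This is a minimal sketch; the function name `model_flops_utilization` and all input numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def model_flops_utilization(batch_size, seq_len, total_time_s,
                            flops_per_token, peak_flops_per_second):
    """Compute MFU from the three formulas above (hypothetical helper)."""
    # Observed tokens per second = (batch size * sequence length) / total time
    observed_tokens_per_second = batch_size * seq_len / total_time_s
    # Theoretical max tokens per second = max FLOPs per second / total FLOPs per token
    theoretical_max_tokens_per_second = peak_flops_per_second / flops_per_token
    return observed_tokens_per_second / theoretical_max_tokens_per_second


# Illustrative values only, not measurements
mfu = model_flops_utilization(
    batch_size=8,
    seq_len=1024,
    total_time_s=0.5,
    flops_per_token=6e8,          # forward + backward FLOPs per token
    peak_flops_per_second=50e12,  # vendor-reported peak for the chosen dtype
)
print(f"MFU: {mfu:.4f}")
```

Equivalently, MFU is the achieved FLOPs per second (observed tokens per second times FLOPs per token) divided by the hardware peak.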

- forward and backward pass

In [None]:
# Theoretical max FLOPs per second provided by the GPU manufacturer

# FLOPs per second for different GPU models and precisions
flops_per_second = {
    "H100": {
        torch.float32: 51.22e12,  # 51.22 TFLOPs for FP32
        torch.float16: 204.9e12,  # 204.9 TFLOPs for FP16
        torch.bfloat16: 204.9e12  # 204.9 TFLOPs for BF16
    },
    "L4": {
        torch.float32: 30.29e12,  # 30.29 TFLOPs for FP32
        torch.float16: 30.29e12,  # 30.29 TFLOPs for FP16
        torch.bfloat16: 30.29e12  # 30.29 TFLOPs for BF16
    },
    "T4": {
        torch.float32: 8.1e12,    # 8.1 TFLOPs for FP32
        torch.float16: 65.13e12,  # 65.13 TFLOPs for FP16
        torch.bfloat16: 65.13e12  # 65.13 TFLOPs for BF16
    },
    "A10G": {
        torch.float32: 31.52e12,  # 31.52 TFLOPs for FP32
        torch.float16: 31.52e12,  # 31.52 TFLOPs for FP16
        torch.bfloat16: 31.52e12  # 31.52 TFLOPs for BF16
    },
    "A100": {
        torch.float32: 19.49e12,  # 19.49 TFLOPs for FP32
        torch.float16: 77.97e12,  # 77.97 TFLOPs for FP16
        torch.bfloat16: 77.97e12  # 77.97 TFLOPs for BF16
    },
    "RTX_3080": {
        torch.float32: 29.77e12,  # 29.77 TFLOPs for FP32
        torch.float16: 29.77e12,  # 29.77 TFLOPs for FP16
        torch.bfloat16: 29.77e12  # 29.77 TFLOPs for BF16
    },
    "RTX_3090": {
        torch.float32: 35.58e12,  # 35.58 TFLOPs for FP32
        torch.float16: 35.58e12,  # 35.58 TFLOPs for FP16
        torch.bfloat16: 35.58e12  # 35.58 TFLOPs for BF16
    }
}


In [None]:
# Import the time module for timing
import time

# Look up the GPU model among the known entries
def get_gpu_model(flops_per_second_dict):
    # Name of the current GPU device
    device_name = torch.cuda.get_device_name(0)
    # Iterate over the known GPU models
    for model in flops_per_second_dict.keys():
        # Return the first model whose name appears in the device name
        if model in device_name:
            return model
    return "Unknown"  # Default if no matching model is found


# Get the current GPU model
gpu_model = get_gpu_model(flops_per_second)
# Print the GPU model
print("GPU Model:", gpu_model)

# Only continue if the GPU model is known
if gpu_model != "Unknown":

    # Iterate over the model sizes
    for size in model_configs:
        print(f"\nProcessing {size}")
        # Copy the base configuration
        config = BASE_CONFIG.copy()
        # Apply the size-specific settings
        config.update(model_configs[size])

        # Initialize the batch size search range
        min_batch_size = 1
        max_batch_size = None
        max_possible_batch_size = 4096

        # Binary search for the largest batch size that fits in memory
        while min_batch_size <= max_possible_batch_size:
            # Try the midpoint of the current range
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                # Random token IDs as input
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                # Create the model in bfloat16 precision
                model = GPTModel(config).bfloat16().to(device)
                # Switch to training mode
                model.train()

                # Start timing
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward and backward pass
                output = model(input_tensor)
                loss = output.sum()  # Compute a dummy loss
                loss.backward()

                # End timing
                torch.cuda.synchronize()
                end_time = time.time()

                # Total elapsed time
                total_time_seconds = end_time - start_time

                # Calculate FLOPs for the forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate FLOPs for the backward pass (typically 2x the forward FLOPs)
                flops_backward = 2 * flops_forward

                # Total FLOPs for the forward + backward passes
                total_flops = flops_forward + flops_backward

                # Model dtype and the matching theoretical max FLOPs per second
                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Compute tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # Compute FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Compute the theoretical max tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Compute MFU (model FLOPs utilization)
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                # Print the performance metrics for the current batch size
                print(f"  Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # Success: try a larger batch size next
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Free memory
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                # Out-of-memory error
                if "out of memory" in str(e).lower():
                    # Try a smaller batch size next
                    max_possible_batch_size = batch_size - 1

                    # Free memory
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    # Re-raise any other error
                    raise e

else:
    # The GPU model is unknown, so no peak FLOPs value is available
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")

GPU Model: A100

Processing gpt-small (124M)
  Batch size 16: Tokens/sec: 34248.82, MFU: 0.3256
  Batch size 24: Tokens/sec: 62568.34, MFU: 0.5948

Processing gpt-medium (355M)
  Batch size 4: Tokens/sec: 20159.93, MFU: 0.5483
  Batch size 6: Tokens/sec: 21717.66, MFU: 0.5907
  Batch size 7: Tokens/sec: 22536.25, MFU: 0.6130

Processing gpt-large (774M)
  Batch size 8: Tokens/sec: 12465.21, MFU: 0.7406

Processing gpt-xl (1558M)
  Batch size 4: Tokens/sec: 6779.92, MFU: 0.8113


- A value of 1.0 is best (equal to 100%)
- Note that the batch sizes are smaller than previously because we also carry out the backward pass here, which is more memory-intensive
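The reported MFU can be roughly cross-checked by hand from the tokens-per-second figure. The sketch below does this for the gpt-small run at batch size 24. It relies on an assumption that is not stated in the notebook: the forward FLOPs per token are approximated as twice the number of matmul weights of a GPT-2-style model (12 × `emb_dim`² per block plus an untied `emb_dim` × `vocab_size` output layer). The 3× factor for forward plus backward and the A100 BF16 peak value are taken from the cells above.

```python
# Assumed GPT-2-style layout for gpt-small (approximation, see lead-in)
emb_dim, n_layers, vocab_size = 768, 12, 50257
matmul_weights = n_layers * 12 * emb_dim ** 2 + emb_dim * vocab_size

# Forward pass: about 2 FLOPs per matmul weight per token
flops_forward_per_token = 2 * matmul_weights
# Forward + backward, with the backward pass estimated as 2x the forward pass
flops_per_token = 3 * flops_forward_per_token

peak_flops_per_second = 77.97e12  # A100 BF16 entry from the dictionary above
tokens_per_second = 62568.34      # Reported for gpt-small at batch size 24

# MFU = observed tokens/sec divided by theoretical max tokens/sec
mfu = tokens_per_second / (peak_flops_per_second / flops_per_token)
print(f"MFU: {mfu:.4f}")
```

Under these assumptions the result lands close to the reported 0.5948, which is a useful check that the tokens-per-second and MFU columns are consistent with each other.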