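# FLOPS Analysis

- FLOPs (floating-point operations) measure the computational complexity of a neural network model by counting the number of floating-point operations it executes; this is distinct from FLOPS, which denotes floating-point operations per second
- A high FLOP count indicates more intensive computation and higher energy consumption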
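
As a small illustration of what "counting floating-point operations" means, the sketch below counts the forward-pass FLOPs of a single bias-free linear layer. The function name `linear_layer_flops` and the example sizes are hypothetical and chosen only for illustration; the count assumes each multiply-accumulate is two FLOPs.

```python
def linear_layer_flops(num_tokens: int, in_features: int, out_features: int) -> int:
    """Approximate forward FLOPs of y = x @ W for `num_tokens` input rows (no bias)."""
    # One multiply-accumulate (MAC) per weight for every input row
    macs = num_tokens * in_features * out_features
    # Count each MAC as one multiplication plus one addition
    return 2 * macs


# Example: 1024 tokens through a 768 -> 3072 projection
flops = linear_layer_flops(1024, 768, 3072)
print(f"{flops:.2e} FLOPs")
```

Summing such counts over all layers of a model is, in essence, what a FLOP profiler automates.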

In [1]:
# pip install -r requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = [
    "thop",
    "torch",
]
for p in pkgs:
    print(f"{p} version:{version(p)}")

thop version:0.1.1-2209072238
torch version:2.9.1+cu130


&nbsp;
# Simple benchmark with fixed batch size

- forward pass only

In [3]:
import torch
from thop import profile
# pip install llms-from-scratch
from llms_from_scratch.ch04 import GPTModel

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True,        # Query-key-value bias
}

model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-midium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (744M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# print(device)
batch_size = 2
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)

for size in model_configs:
    BASE_CONFIG.update(model_configs[size])

    model = GPTModel(BASE_CONFIG).bfloat16()
    model.to(device)

    # MACs = multiply-accumulate operations
    # Each MAC is typically counted as two FLOPs (one multiply and one accumulate)

    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    flops = 2 * macs
    print(f"{size:18}:{flops:.1e} FLOPS")

    del model
    torch.cuda.empty_cache()

gpt-small (124M)  :5.1e+11 FLOPS
gpt-midium (355M) :1.4e+12 FLOPS
gpt-large (744M)  :3.2e+12 FLOPS
gpt-xl (1558M)    :6.4e+12 FLOPS
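
As a sanity check, the printed values can be approximated by hand. The sketch below is a back-of-the-envelope estimate, not a description of how `thop` works internally: it assumes that only the weight-matrix multiplications matter (4·d² weights for the attention projections and 8·d² for a feed-forward network with a 4× hidden size per block, plus the output head) and ignores biases, normalization layers, and the attention-score matmuls. The names `configs` and `estimate_forward_flops` are hypothetical.

```python
VOCAB_SIZE = 50257
TOKENS = 2 * 1024  # batch_size * context_length, as in the run above

# (emb_dim, n_layers) for each model size
configs = {
    "small": (768, 12),
    "medium": (1024, 24),
    "large": (1280, 36),
    "xl": (1600, 48),
}


def estimate_forward_flops(emb_dim: int, n_layers: int, tokens: int) -> int:
    # Per transformer block: 4*d^2 weights in attention (Q, K, V, output projection)
    # plus 8*d^2 in the feed-forward network (d -> 4d -> d)
    block_weights = 12 * emb_dim**2
    head_weights = VOCAB_SIZE * emb_dim  # final projection onto the vocabulary
    macs_per_token = n_layers * block_weights + head_weights
    return 2 * macs_per_token * tokens  # 2 FLOPs per MAC


estimates = {
    name: estimate_forward_flops(d, n, TOKENS) for name, (d, n) in configs.items()
}
for name, est in estimates.items():
    print(f"{name:8}: {est:.1e} FLOPs")
```

Under these assumptions the estimates round to the same values as the profiler output above, which suggests that the linear layers dominate the reported count.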


&nbsp;
# Simple benchmark with automatic batch size finding

- forward pass only

In [4]:
for size in model_configs:
    print(f"\nProcessing {size}...")
    config = BASE_CONFIG.copy()
    config.update(model_configs[size])

    min_batch_size = 1
    max_batch_size = None
    max_possible_batch_size = 4096
    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )
            model = GPTModel(config).bfloat16().to(device)
            # MACs = multiply-accumulate operations
            # Each MAC is typically counted as two FLOPs (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f"Batch size {batch_size}: {flops:.1e} FLOPS")

            # If successful, try a larger batch size
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Clean up
            del model, input_tensor
            torch.cuda.empty_cache()
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                # Try a smaller batch size
                max_possible_batch_size = batch_size - 1

                # Clean up
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e


Processing gpt-small (124M)...
Batch size 256: 6.5e+13 FLOPS
Batch size 384: 9.7e+13 FLOPS
Batch size 416: 1.1e+14 FLOPS
Batch size 432: 1.1e+14 FLOPS
Batch size 440: 1.1e+14 FLOPS
Batch size 444: 1.1e+14 FLOPS
Batch size 446: 1.1e+14 FLOPS
Batch size 447: 1.1e+14 FLOPS

Processing gpt-midium (355M)...
Batch size 256: 1.9e+14 FLOPS
Batch size 384: 2.8e+14 FLOPS

Processing gpt-large (744M)...
Batch size 256: 4.0e+14 FLOPS
Batch size 264: 4.2e+14 FLOPS
Batch size 268: 4.2e+14 FLOPS
Batch size 270: 4.3e+14 FLOPS
Batch size 271: 4.3e+14 FLOPS

Processing gpt-xl (1558M)...
Batch size 256: 8.2e+14 FLOPS
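
The batch-size search above is a binary search over the range 1 to 4096. The sketch below isolates that logic from the GPU code so it can be run anywhere. `find_max_batch_size` and `fake_fits` are hypothetical names, and the memory limit of 447 is a stand-in for "the largest batch that does not run out of memory"; the search also assumes that if a batch size fits, every smaller one fits too.

```python
from typing import Callable, Optional


def find_max_batch_size(
    fits: Callable[[int], bool], low: int = 1, high: int = 4096
) -> Optional[int]:
    """Return the largest batch size in [low, high] for which `fits` is True, or None."""
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid       # success: remember it and search the upper half
            low = mid + 1
        else:
            high = mid - 1   # failure (e.g., out of memory): search the lower half
    return best


# Hypothetical stand-in for "does this batch fit in GPU memory?"
tried = []


def fake_fits(batch_size: int) -> bool:
    tried.append(batch_size)
    return batch_size <= 447


max_batch = find_max_batch_size(fake_fits)
print("Largest fitting batch size:", max_batch)
print("Successful attempts:", [b for b in tried if b <= 447])
```

With this limit, the successful attempts follow the same sequence as the gpt-small log above (256, 384, 416, and so on up to 447), which is what one would expect from halving the search interval at each step.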


## GPU benchmark results: RTX 5070 Ti (16 GB)

| Model      | Parameters | Max batch size | Peak FLOPs | Status        |
|------------|------------|----------------|------------|---------------|
| GPT-Small  | 124M       | 462            | 1.2e+14    | ✅ Stable     |
| GPT-Medium | 355M       | ≥384           | 2.8e+14    | ✅ Good       |
| GPT-Large  | 744M       | -              | -          | ⚠️ Borderline |

&nbsp;
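# Benchmark with automatic batch size finding and Model FLOP Utilization (MFU)

- Model FLOPs Utilization (MFU) explanation from the [PaLM paper](https://arxiv.org/abs/2204.02311)

> We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization.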


$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$

where

$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$

and

$$\text{Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$
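
To make the formulas concrete, here is a minimal worked example. All inputs are illustrative assumptions: a model with roughly 123.5 million multiply-accumulates per token, a forward pass counted as 2 FLOPs per MAC, a backward pass estimated at twice the forward cost (a common rule of thumb), a measured throughput of 901.45 tokens per second, and a quoted hardware peak of 43.94 TFLOPs per second. The function name `model_flops_utilization` is hypothetical.

```python
def model_flops_utilization(
    tokens_per_second: float, flops_per_token: float, peak_flops_per_second: float
) -> float:
    """MFU = observed tokens/s divided by the theoretical maximum tokens/s."""
    theoretical_max_tokens_per_second = peak_flops_per_second / flops_per_token
    return tokens_per_second / theoretical_max_tokens_per_second


# Illustrative inputs (assumptions for this example)
macs_per_token = 123_532_032                # weight multiplications per token
flops_forward = 2 * macs_per_token          # 2 FLOPs per MAC
flops_per_token = flops_forward + 2 * flops_forward  # forward + estimated backward
peak_flops_per_second = 43.94e12            # quoted hardware peak

mfu = model_flops_utilization(901.45, flops_per_token, peak_flops_per_second)
print(f"MFU: {mfu:.4f}")
```

With these inputs the MFU comes out at about 0.015, meaning the hardware would be doing useful model computation for roughly 1.5% of its quoted peak.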

- forward and backward pass

In [5]:
# Theoretical max flops per second provided by the GPU manufacturer

flops_per_second = {
    # https://www.techpowerup.com/gpu-specs/h100-pcie-80-gb.c3899
    "H100": {
        torch.float32: 51.22e12,  # 51.22 TFLOPs for FP32 on NVIDIA H100
        torch.float16: 204.9e12,  # 204.9 TFLOPs for FP16 on NVIDIA H100
        torch.bfloat16: 204.9e12
    },
    # https://www.techpowerup.com/gpu-specs/l4.c4091
    "L4": {
        torch.float32: 30.29e12,  # 30.29 TFLOPs for FP32 on NVIDIA L4
        torch.float16: 30.29e12,  # 30.29 TFLOPs for FP16 on NVIDIA L4
        torch.bfloat16: 30.29e12
    },
    # https://www.techpowerup.com/gpu-specs/tesla-t4.c3316
    "T4": {
        torch.float32: 8.1e12,  # 8.1 TFLOPs for FP32 on NVIDIA T4
        torch.float16: 65.13e12,  # 65.13 TFLOPs for FP16 on NVIDIA T4
        torch.bfloat16: 65.13e12
    },
    # https://www.techpowerup.com/gpu-specs/a10g.c3798
    "A10G": {
        torch.float32: 31.52e12,  # 31.52 TFLOPs for FP32 on NVIDIA A10G
        torch.float16: 31.52e12,  # 31.52 TFLOPs for FP16 on NVIDIA A10G
        torch.bfloat16: 31.52e12
    },
    # https://www.techpowerup.com/gpu-specs/a100-pcie-40-gb.c3623
    "A100": {
        torch.float32: 19.49e12,  # 19.49 TFLOPs for FP32 on NVIDIA A100
        torch.float16: 77.97e12,  # 77.97 TFLOPs for FP16 on NVIDIA A100
        torch.bfloat16: 77.97e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621
    "RTX 3080": {
        torch.float32: 29.77e12,  # 29.77 TFLOPs for FP32 on NVIDIA RTX 3080
        torch.float16: 29.77e12,  # 29.77 TFLOPs for FP16 on NVIDIA RTX 3080
        torch.bfloat16: 29.77e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
    "RTX 3090": {
        torch.float32: 35.58e12,  # 35.58 TFLOPs for FP32 on NVIDIA RTX 3090
        torch.float16: 35.58e12,  # 35.58 TFLOPs for FP16 on NVIDIA RTX 3090
        torch.bfloat16: 35.58e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-5070-ti.c4243
    "RTX 5070 Ti": {
        torch.float32: 43.94e12,  # 43.94 TFLOPs for FP32 on NVIDIA RTX 5070 Ti
        torch.float16: 43.94e12,  # 43.94 TFLOPs for FP16 on NVIDIA RTX 5070 Ti
        torch.bfloat16: 43.94e12
    }
}


In [6]:
import time

def get_gpu_model(flops_per_second_dict):
    device_name = torch.cuda.get_device_name(0)
    # print(f"Device name: {device_name}")
    for model in flops_per_second_dict.keys():
        if model in device_name:
            return model
    return "Unknown"  # Default if no matching model is found


gpu_model = get_gpu_model(flops_per_second)
print("GPU Model:", gpu_model)

if gpu_model != "Unknown":

    for size in model_configs:
        print(f"\nProcessing {size}...")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])

        min_batch_size = 1
        max_batch_size = None
        max_possible_batch_size = 4096

        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                model = GPTModel(config).bfloat16().to(device)
                model.train()

                # Start timing
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward and backward pass
                output = model(input_tensor)
                loss = output.sum()  # Compute a dummy loss
                loss.backward()

                # End timing
                torch.cuda.synchronize()
                end_time = time.time()

                total_time_seconds = end_time - start_time

                # Calculate FLOPs for the forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate FLOPs for the backward pass (typically 2x the forward FLOPs)
                flops_backward = 2 * flops_forward

                # Total FLOPs for forward + backward passes
                total_flops = flops_forward + flops_backward  # Or total_flops = flops_forward * 3

                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Compute tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # Compute FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Compute theoretical max tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Compute MFU
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                print(f"  Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # If successful, try a larger batch size
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Clean up
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Try a smaller batch size
                    max_possible_batch_size = batch_size - 1

                    # Clean up
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    raise e

else:
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")

GPU Model: RTX 5070 Ti

Processing gpt-small (124M)...
  Batch size 32: Tokens/sec: 901.45, MFU: 0.0152

Processing gpt-midium (355M)...
  Batch size 1: Tokens/sec: 293.08, MFU: 0.0141

Processing gpt-large (744M)...
  Batch size 8: Tokens/sec: 176.39, MFU: 0.0186
  Batch size 10: Tokens/sec: 121.51, MFU: 0.0128

Processing gpt-xl (1558M)...
  Batch size 4: Tokens/sec: 67.54, MFU: 0.0143
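
The loop above times a single forward and backward pass on a freshly created model, so any one-off start-up cost of the first call may be included in the measurement. One possible refinement, sketched below under that assumption, is to run a few untimed warm-up iterations and average over several timed ones. The helper `time_step` and its parameters are hypothetical; on a GPU, `torch.cuda.synchronize` could be passed as `sync`, and `step` would wrap the forward and backward pass.

```python
import time
from typing import Callable


def time_step(
    step: Callable[[], None],
    sync: Callable[[], None] = lambda: None,
    warmup: int = 2,
    iters: int = 5,
) -> float:
    """Average wall-clock seconds per call of `step`, after `warmup` untimed calls."""
    for _ in range(warmup):
        step()
    sync()  # make sure pending work has finished before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # wait for the timed work to finish before stopping the clock
    return (time.perf_counter() - start) / iters


# CPU-only demonstration with a trivial step function
calls = []
avg = time_step(lambda: calls.append(1))
print(f"{len(calls)} calls in total, {avg:.2e} s per timed call")
```

The returned average could then replace `total_time_seconds` in the tokens-per-second calculation; whether this changes the reported MFU noticeably depends on the hardware and would need to be measured.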
