This notebook shows how to use optimum-benchmark to benchmark LLMs. It focuses on benchmarking quantization algorithms for Mistral 7B.

We need to install the following packages. bitsandbytes, auto-gptq and autoawq are only necessary if you benchmark models quantized with these algorithms.

In [None]:
!python -m pip install git+https://github.com/huggingface/optimum-benchmark.git

Define the configuration for optimum-benchmark.

Here we benchmark inference for Mistral 7B loaded in fp16, using different batch sizes.
If you run this notebook on Google Colab, only this part needs an A100. The benchmarks that follow can run on a T4.
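
As a rough sanity check on the hardware note above, the memory taken by the weights alone can be estimated from the parameter count and the number of bits per parameter. The sketch below assumes a nominal 7 billion parameters for a "7B" model and ignores activations, the KV cache and framework overhead, so real usage will be higher. `weight_memory_gib` is a hypothetical helper written for this illustration, not part of optimum-benchmark.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal parameter count of a "7B" model.
NOMINAL_PARAMS = 7e9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB")
```

Under these assumptions the fp16 weights alone come to about 13 GiB, which leaves little headroom on a 16 GB T4 once the KV cache and activations are added, while 4-bit weights need roughly a quarter of that.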

In [1]:
!nvidia-smi

Tue Jun 11 00:46:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      9MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [12]:
from optimum_benchmark import Benchmark, BenchmarkConfig, TorchrunConfig, InferenceConfig, PyTorchConfig  
from optimum_benchmark.logging_utils import setup_logging  
  
# setup_logging(level="INFO", handlers=["console"])  
if __name__ == "__main__":  
    launcher_config = TorchrunConfig(nproc_per_node=1)  
      
    # Set up the inference scenario, including batch size and sequence length
    scenario_config = InferenceConfig(
        latency=True,
        memory=True,
        warmup_runs=5,
        input_shapes={
            "batch_size": 1,  # batch size
            "sequence_length": 512  # sequence length
        },
        generate_kwargs={
            "max_new_tokens": 100,  # maximum number of new tokens to generate
            "min_new_tokens": 100,  # minimum number of new tokens to generate
            "num_beams": 1,  # number of beams for beam search
            "temperature": 1.0,  # sampling temperature
            "top_k": 50,  # top-k sampling
            "top_p": 0.9,  # top-p sampling
            "do_sample": True  # enable sampling
        }
    ) 
      
    # Define the PyTorch backend configuration
    backend_config = PyTorchConfig(
        model="microsoft/Phi-3-mini-4k-instruct",
        device="cuda",
        device_ids="0",  # make sure the device ID is a string
        no_weights=True
    )

    # Define the benchmark configuration
    benchmark_config = BenchmarkConfig(  
        name="pytorch_Phi-3-mini-4k-instruct",  
        scenario=scenario_config,  
        launcher=launcher_config,  
        backend=backend_config,  
    )  
      
    # Run the benchmark
    benchmark_report = Benchmark.launch(benchmark_config)

    # Log the benchmark results in the terminal
    benchmark_report.log()  # or print(benchmark_report)

    # Convert artifacts to a dictionary or dataframe
    benchmark_config.to_dict()  # or benchmark_config.to_dataframe()

    # Save artifacts to disk as JSON or CSV files
    benchmark_report.save_csv("benchmark_report_pytorch_Phi-3-mini-4k-instruct.csv")  # save as a CSV file
    benchmark_report.save_json("benchmark_report_pytorch_Phi-3-mini-4k-instruct.json")  # save as a JSON file

    # Or merge them into a single artifact
    benchmark = Benchmark(config=benchmark_config, report=benchmark_report)
    benchmark.save_json("benchmark_pytorch_Phi-3-mini-4k-instruct.json")  # save as a JSON file
    benchmark.save_csv("benchmark_pytorch_Phi-3-mini-4k-instruct.csv")  # save as a CSV file


[ISOLATED-PROCESS][2024-06-11 01:49:16,776][torchrun][INFO] - 	+ Starting benchmark in isolated process
[RANK-PROCESS-0][2024-06-11 01:49:19,569][torchrun][INFO] - 	+ Setting torch.distributed cuda device to 0
[RANK-PROCESS-0][2024-06-11 01:49:19,571][torchrun][INFO] - 	+ Initializing torch.distributed process group
[RANK-PROCESS-0][2024-06-11 01:49:19,596][datasets][INFO] - PyTorch version 2.3.0 available.
[RANK-PROCESS-0][2024-06-11 01:49:20,210][pytorch][INFO] - Allocating pytorch backend
[RANK-PROCESS-0][2024-06-11 01:49:20,210][pytorch][INFO] - 	+ Benchmarking a Transformers model
[RANK-PROCESS-0][2024-06-11 01:49:21,859][pytorch][INFO] - 	+ Using automodel class AutoModelForCausalLM
[RANK-PROCESS-0][2024-06-11 01:49:21,859][pytorch][INFO] - 	+ Seeding pytorch backend with seed 42
[RANK-PROCESS-0][2024-06-11 01:49:21,860][pytorch][INFO] - 	+ Using AutoModel AutoModelForCausalLM
[RANK-PROCESS-0][2024-06-11 01:49:21,860][pytorch][INFO] - 	+ Creating backend temporary directory
[RANK-PROCESS-0][2024-06-11 01:49:21,861][pytorch][INFO] - 	+ Loading model with random weights
[RANK-PROCESS-0][2024-06-11 01:49:21,861][pytorch][INFO] - 	+ Creating no weights model directory
[RANK-PROCESS-0][2024-06-11 01:49:21,861][pytorch][INFO] - 	+ Creating no weights model state dict
[RANK-PROCESS-0][2024-06-11 01:49:21,862][pytorch][INFO] - 	+
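
The cell above runs a single configuration with `batch_size` set to 1. To cover several batch sizes, one option is to build one `input_shapes` dictionary per batch size and launch a benchmark for each. The sketch below reuses only the classes and arguments that appear in the cell above; `make_input_shapes` and `run_batch_size_sweep` are hypothetical helpers introduced for illustration, and the sweep itself needs optimum-benchmark and a CUDA GPU.

```python
def make_input_shapes(batch_sizes, sequence_length=512):
    """Build one input_shapes dict per batch size."""
    return [
        {"batch_size": bs, "sequence_length": sequence_length}
        for bs in batch_sizes
    ]


def run_batch_size_sweep(batch_sizes, model="microsoft/Phi-3-mini-4k-instruct"):
    """Launch one benchmark per batch size and return {batch_size: report}."""
    # Imported lazily so make_input_shapes works without optimum-benchmark installed.
    from optimum_benchmark import (
        Benchmark,
        BenchmarkConfig,
        InferenceConfig,
        PyTorchConfig,
        TorchrunConfig,
    )

    reports = {}
    for shapes in make_input_shapes(batch_sizes):
        scenario_config = InferenceConfig(
            latency=True,
            memory=True,
            warmup_runs=5,
            input_shapes=shapes,
            generate_kwargs={"max_new_tokens": 100, "min_new_tokens": 100},
        )
        backend_config = PyTorchConfig(
            model=model, device="cuda", device_ids="0", no_weights=True
        )
        benchmark_config = BenchmarkConfig(
            name=f"pytorch_bs{shapes['batch_size']}",
            scenario=scenario_config,
            launcher=TorchrunConfig(nproc_per_node=1),
            backend=backend_config,
        )
        reports[shapes["batch_size"]] = Benchmark.launch(benchmark_config)
    return reports


if __name__ == "__main__":
    print(make_input_shapes([1, 4, 16]))
    # reports = run_batch_size_sweep([1, 4, 16])  # requires a CUDA GPU
```

Each run gets its own benchmark name, so the reports can be saved to separate files in the same way as in the cell above.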

In [11]:
from optimum_benchmark import Benchmark, BenchmarkConfig, TorchrunConfig, InferenceConfig, VLLMConfig  
from optimum_benchmark.logging_utils import setup_logging  
  
# setup_logging(level="INFO", handlers=["console"])  
if __name__ == "__main__":  
    launcher_config = TorchrunConfig(nproc_per_node=1)  
      
    # Set up the inference scenario, including batch size and sequence length
    scenario_config = InferenceConfig(
        latency=True,
        memory=True,
        warmup_runs=5,
        input_shapes={
            "batch_size": 1,  # batch size
            "sequence_length": 512  # sequence length
        },
        generate_kwargs={
            "max_new_tokens": 50,  # maximum number of new tokens to generate
            "min_new_tokens": 50,  # minimum number of new tokens to generate
            "num_beams": 1,  # number of beams for beam search
            "temperature": 1.0,  # sampling temperature
            "top_k": 50,  # top-k sampling
            "top_p": 0.9,  # top-p sampling
            "do_sample": True  # enable sampling
        }
    )  
      
    # Define the vLLM backend configuration
    vllm_config = VLLMConfig(
        model="microsoft/Phi-3-mini-4k-instruct",
        device="cuda",
        device_ids="0",  # make sure the device ID is a string
        no_weights=True  
    )  
      
    benchmark_config = BenchmarkConfig(  
        name="vllm_phi3_mini_4k_instruct",  
        scenario=scenario_config,  
        launcher=launcher_config,  
        backend=vllm_config,  
    )  
      
    benchmark_report = Benchmark.launch(benchmark_config)  
      
    # log the benchmark in terminal  
    benchmark_report.log()  # or print(benchmark_report)  
      
    # convert artifacts to a dictionary or dataframe  
    benchmark_config.to_dict()  # or benchmark_config.to_dataframe()  
      
    # save artifacts to disk as json or csv files  
    benchmark_report.save_csv("benchmark_reportvllm.Phi-3-mini-4k-instruct.csv")  # save as a CSV file
    benchmark_report.save_json("benchmark_reportvllm.Phi-3-mini-4k-instruct.json")  # save as a JSON file

    # or merge them into a single artifact
    benchmark = Benchmark(config=benchmark_config, report=benchmark_report)
    benchmark.save_json("benchmarkvllm.Phi-3-mini-4k-instruct.json")  # save as a JSON file
    benchmark.save_csv("benchmarkvllm.Phi-3-mini-4k-instruct.csv")  # save as a CSV file


[ISOLATED-PROCESS][2024-06-11 01:47:27,886][torchrun][INFO] - 	+ Starting benchmark in isolated process
[RANK-PROCESS-0][2024-06-11 01:47:30,675][torchrun][INFO] - 	+ Setting torch.distributed cuda device to 0
[RANK-PROCESS-0][2024-06-11 01:47:30,677][torchrun][INFO] - 	+ Initializing torch.distributed process group
[RANK-PROCESS-0][2024-06-11 01:47:30,966][datasets][INFO] - PyTorch version 2.3.0 available.
INFO 06-11 01:47:31 base.py:41] Allocating vllm backend
INFO 06-11 01:47:31 base.py:60] 	+ Benchmarking a Transformers model
INFO 06-11 01:47:32 base.py:74] 	+ Using automodel class AutoModelForCausalLM
INFO 06-11 01:47:32 base.py:78] 	+ Seeding vllm backend with seed 42
INFO 06-11 01:47:32 backend.py:23] 	+ Creating backend temporary directory
INFO 06-11 01:47:32 backend.py:27] 	+ Loading no weights model
INFO 06-11 01:47:32 backend.py:90] 	+ Creating no weights model
INFO 06-11 01:47:32 backend.py:59] 	+ Creating no weights model directory
INFO 06-11 01:47:32 backend.py:61] 	+ Creating no weights model state dict
INFO 06-11 01:47:32 backend.py:63] 	+ Saving no weights model safetensors
INFO 06-11 01:47:32 backend.py:66] 	+ Saving no weights model pretrained config
INFO 06-11 01:47:32 backend.py:68] 	+ Saving no weights model pretrained processor
INFO 06-11 01:47:32 backend.py:71] 	+ Loading no weights model from /tmp/tmpsag0weo1/no_weights_model
[RANK-PROCESS-0][2024-06-11 01:47:33,235][accelerate.utils.modeling][INFO] - We will use 90% of the memory on device 0 for storing the m
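
Both cells save their report as a JSON file, so the PyTorch and vLLM runs can be compared afterwards. The sketch below makes no assumption about the report's schema: it flattens whatever nested dictionaries it finds into dotted keys, keeps only numeric leaves (lists are skipped), and pairs up the keys the two files share. `flatten` and `compare_reports` are hypothetical helpers written for this illustration, not part of optimum-benchmark.

```python
import json
from pathlib import Path


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value}, keeping only numeric leaves."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        flat[prefix[:-1]] = obj  # drop the trailing dot
    return flat


def compare_reports(path_a, path_b):
    """Return {key: (value_a, value_b)} for numeric keys present in both files."""
    a = flatten(json.loads(Path(path_a).read_text()))
    b = flatten(json.loads(Path(path_b).read_text()))
    return {key: (a[key], b[key]) for key in sorted(a.keys() & b.keys())}


if __name__ == "__main__":
    # File names as written by the two cells above.
    pytorch_path = Path("benchmark_report_pytorch_Phi-3-mini-4k-instruct.json")
    vllm_path = Path("benchmark_reportvllm.Phi-3-mini-4k-instruct.json")
    if pytorch_path.exists() and vllm_path.exists():
        for key, (pt, vl) in compare_reports(pytorch_path, vllm_path).items():
            print(f"{key}: pytorch={pt} vllm={vl}")
```

Note that the two cells use different `max_new_tokens` values (100 and 50), so decode-related numbers are only comparable once both runs use the same generation settings.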