31 changes: 21 additions & 10 deletions examples/models/core/deepseek_v3/README.md
# DeepSeek‑V3, DeepSeek-R1, and DeepSeek-V3.2-Exp

This guide walks you through the examples to run the DeepSeek‑V3/DeepSeek-R1/DeepSeek-V3.2-Exp models using NVIDIA's TensorRT LLM framework with the PyTorch backend.
**DeepSeek-R1 and DeepSeek-V3 share the exact same model architecture, differing only in their weights, and share the same code path in TensorRT LLM. DeepSeek-V3.2-Exp features DeepSeek Sparse Attention (DSA), but otherwise shares the same code as DeepSeek-R1 and DeepSeek-V3 in TensorRT LLM. For brevity we provide only one model example; the example commands can be used interchangeably by simply replacing the model name.**


To benchmark the model with best configurations, refer to [DeepSeek R1 benchmarking blog](../../../../docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md).

## Table of Contents


- [DeepSeek‑V3, DeepSeek-R1, and DeepSeek-V3.2-Exp](#deepseekv3-deepseek-r1-and-deepseekv32-exp)
- [Table of Contents](#table-of-contents)
- [Hardware Requirements](#hardware-requirements)
- [Downloading the Model Weights](#downloading-the-model-weights)
## Hardware Requirements

DeepSeek-V3 has 671B parameters, which require about 671 GB of GPU memory for the FP8 weights, plus additional memory for activation tensors and the KV cache.
The minimum hardware requirements for running DeepSeek V3/R1/V3.2-Exp at FP8/FP4/W4A8 are listed as follows.

| GPU | DeepSeek-V3/R1/V3.2-Exp FP8 | DeepSeek-V3/R1/V3.2-Exp FP4 | DeepSeek-V3/R1 W4A8 |
| -------- | ------- | -- | -- |
| H100 80GB | 16 | N/A | 8 |
| H20 141GB | 8 | N/A | 4 |
| H20 96GB | 8 | N/A | 4 |
| H200 | 8 | N/A | 4 |
| B200/GB200 | 8 | 4 (8 GPUs recommended for best performance) | Not supported yet, WIP |

Ampere architecture (SM80 & SM86) is not supported.

To quickly run DeepSeek-V3, use `examples/llm-api/quickstart_advanced.py`:
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8
```
Please include `--tokens_per_block 64` when running DeepSeek-V3.2-Exp, as this model uses the `deep_gemm.fp8_paged_mqa_logits` kernel, which requires a KV cache block size of 64.
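For instance, a full DeepSeek-V3.2-Exp invocation on 8 GPUs might look like the following sketch (the model directory placeholder and GPU count are illustrative):
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_DEEPSEEK_V3_2_EXP_DIR> --tp_size 8 --tokens_per_block 64
```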

The model will be run by the PyTorch backend and print the generated outputs.

### Multi-Token Prediction (MTP)

To run with MTP speculative decoding:

```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N
```

`N` is the number of MTP modules. When `N` equals `0` (the default), MTP is not used; when `N` is greater than `0`, `N` MTP modules are enabled. In the current implementation, the weights of all MTP modules are shared. Please include `--tokens_per_block 64` when running DeepSeek-V3.2-Exp.
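As an illustration, MTP can be combined with the DeepSeek-V3.2-Exp block-size requirement in a single command; the draft length of 3 and the GPU count below are only example values:
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --spec_decode_algo MTP --spec_decode_max_draft_len 3 --tokens_per_block 64
```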

#### Relaxed acceptance
**NOTE: This feature can only be used for DeepSeek R1.**

### FlashMLA
TensorRT LLM has integrated FlashMLA in the PyTorch backend; it is enabled automatically when running DeepSeek-V3/R1. When running DeepSeek-V3.2-Exp on Hopper, FlashMLA is the default backend for sparse MLA.

### FP8 KV Cache and MLA

FP8 KV Cache and MLA quantization can be enabled, which delivers two key performance advantages:
- Compression of the latent KV cache enables larger batch sizes, resulting in higher throughput;
- The MLA kernel in the generation phase is accelerated by FP8 arithmetic and reduced KV cache memory access.

FP8 KV Cache and MLA are supported on Hopper and Blackwell for DeepSeek-V3 and DeepSeek-R1, but only on Blackwell for DeepSeek-V3.2-Exp. The accuracy loss is small, with a GPQA accuracy drop of less than 1%.
- On Hopper we use the [FP8 FlashMLA kernel](https://github.com/deepseek-ai/FlashMLA/pull/54) from the community.
- On Blackwell we use kernels generated by an internal code-generation solution called `trtllm-gen`.
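As a minimal sketch, FP8 KV cache can typically be requested through the quickstart script; this assumes `quickstart_advanced.py` exposes a `--kv_cache_dtype` option (adjust to your build if it differs):
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --kv_cache_dtype fp8
```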

- **GPU Memory:** Adjust `--max_batch_size` and `--max_num_tokens` if you encounter out-of-memory errors (see the example after this list).
- **Logs:** Check `/workspace/trt_bench.log` for detailed performance information and troubleshooting messages.
- **Configuration Files:** Verify that the configuration files are correctly formatted to avoid runtime issues.
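For example, a memory-constrained run might lower both limits; the values below are illustrative starting points, and this assumes the quickstart script accepts these flags:
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --max_batch_size 8 --max_num_tokens 2048
```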

## Known Issues
- Support for KV Cache Reuse and Chunked Prefill in DeepSeek-V3.2-Exp is currently under development. When running `quickstart_advanced.py`, please include `--disable_kv_cache_reuse` to disable KV Cache Reuse. When using `trtllm-eval`/`trtllm-serve`/`trtllm-bench`, please include the following configuration in the extra llm_api options:
```
kv_cache_config:
  enable_block_reuse: false
  tokens_per_block: 64
enable_chunked_prefill: false
```
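As a usage sketch, the options above would be written to a YAML file and passed to the serving command; the `--extra_llm_api_options` flag name and the model path placeholder are assumptions to adapt to your setup:
```
cat > dsv32_options.yml <<'EOF'
kv_cache_config:
  enable_block_reuse: false
  tokens_per_block: 64
enable_chunked_prefill: false
EOF

trtllm-serve <YOUR_MODEL_DIR> --tp_size 8 --extra_llm_api_options dsv32_options.yml
```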