31 changes: 21 additions & 10 deletions examples/models/core/deepseek_v3/README.md
# DeepSeek‑V3, DeepSeek-R1, and DeepSeek-V3.2-Exp

This guide walks you through the examples to run the DeepSeek‑V3/DeepSeek-R1/DeepSeek-V3.2-Exp models using NVIDIA's TensorRT LLM framework with the PyTorch backend.
**DeepSeek-R1 and DeepSeek-V3 share the exact same model architecture, differing only in their weights, and share the same code path in TensorRT LLM. DeepSeek-V3.2-Exp features DeepSeek Sparse Attention (DSA), but otherwise shares the same code as DeepSeek-R1 and DeepSeek-V3 in TensorRT LLM. For brevity we provide only one model example; the example commands can be used interchangeably by simply replacing the model name.**


To benchmark the model with best configurations, refer to [DeepSeek R1 benchmarking blog](../../../../docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md).

## Table of Contents


- [DeepSeek‑V3, DeepSeek-R1, and DeepSeek-V3.2-Exp](#deepseekv3-deepseek-r1-and-deepseekv32-exp)
- [Table of Contents](#table-of-contents)
- [Hardware Requirements](#hardware-requirements)
- [Downloading the Model Weights](#downloading-the-model-weights)
## Hardware Requirements

DeepSeek-V3 has 671B parameters, which require about 671 GB of GPU memory for the FP8 weights, plus additional memory for activation tensors and the KV cache.
The minimum hardware requirements for running DeepSeek V3/R1/V3.2-Exp at FP8/FP4/W4A8 are listed as follows.

| GPU | DeepSeek-V3/R1/V3.2-Exp FP8 | DeepSeek-V3/R1/V3.2-Exp FP4 | DeepSeek-V3/R1 W4A8 |
| -------- | ------- | -- | -- |
| H100 80GB | 16 | N/A | 8 |
| H20 141GB | 8 | N/A | 4 |
| H20 96GB | 8 | N/A | 4 |
| H200 | 8 | N/A | 4 |
| B200/GB200 | 8 | 4 (8 GPUs recommended for best performance) | Not supported yet, WIP |

Ampere architecture (SM80 & SM86) is not supported.

To quickly run DeepSeek-V3, use `examples/llm-api/quickstart_advanced.py`:
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8
```
Please include `--tokens_per_block 64` when running DeepSeek-V3.2-Exp, as this model uses the `deep_gemm.fp8_paged_mqa_logits` kernel, which requires a KV cache block size of 64.
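For instance, a full DeepSeek-V3.2-Exp invocation on 8 GPUs might look like the following sketch (the model directory placeholder and GPU count are illustrative):
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_DEEPSEEK_V3_2_EXP_DIR> --tp_size 8 --tokens_per_block 64
```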

The model will be run by the PyTorch backend and print the generated outputs.

### Multi-Token Prediction (MTP)

To run with MTP speculative decoding:

```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N
```

`N` is the number of MTP modules. When `N` equals `0` (the default), MTP is not used; when `N` is greater than `0`, `N` MTP modules are enabled. In the current implementation, the weights of all MTP modules are shared. Please include `--tokens_per_block 64` when running DeepSeek-V3.2-Exp.
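As an illustration, MTP can be combined with the DeepSeek-V3.2-Exp block-size requirement in a single command; the draft length of 3 and the GPU count below are only example values:
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --spec_decode_algo MTP --spec_decode_max_draft_len 3 --tokens_per_block 64
```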

#### Relaxed acceptance
**NOTE: This feature can only be used for DeepSeek R1.**

### FlashMLA
TensorRT LLM has integrated FlashMLA in the PyTorch backend; it is enabled automatically when running DeepSeek-V3/R1. When running DeepSeek-V3.2-Exp on Hopper, FlashMLA is the default backend for sparse MLA.

### FP8 KV Cache and MLA

FP8 KV Cache and MLA quantization can be enabled, which delivers two key performance advantages:
- Compression of the latent KV cache enables larger batch sizes, resulting in higher throughput;
- The MLA kernel in the generation phase is accelerated by FP8 arithmetic and reduced KV cache memory access.

FP8 KV Cache and MLA are supported on Hopper and Blackwell for DeepSeek-V3 and DeepSeek-R1, but only on Blackwell for DeepSeek-V3.2-Exp. The accuracy loss is small, with a GPQA accuracy drop of less than 1%.
- On Hopper we use the [FP8 FlashMLA kernel](https://github.com/deepseek-ai/FlashMLA/pull/54) from the community.
- On Blackwell we use kernels generated by an internal code-generation solution called `trtllm-gen`.
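As a minimal sketch, FP8 KV cache can typically be requested through the quickstart script; this assumes `quickstart_advanced.py` exposes a `--kv_cache_dtype` option (adjust to your build if it differs):
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --kv_cache_dtype fp8
```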

- **GPU Memory:** Adjust `--max_batch_size` and `--max_num_tokens` if you encounter out-of-memory errors (see the example after this list).
- **Logs:** Check `/workspace/trt_bench.log` for detailed performance information and troubleshooting messages.
- **Configuration Files:** Verify that the configuration files are correctly formatted to avoid runtime issues.
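For example, a memory-constrained run might lower both limits; the values below are illustrative starting points, and this assumes the quickstart script accepts these flags:
```
cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --max_batch_size 8 --max_num_tokens 2048
```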

## Known Issues
- Support for KV Cache Reuse and Chunked Prefill in DeepSeek-V3.2-Exp is currently under development. When running `quickstart_advanced.py`, please include `--disable_kv_cache_reuse` to disable KV Cache Reuse. When using `trtllm-eval`/`trtllm-serve`/`trtllm-bench`, please include the following configuration in the extra llm_api options:
```
kv_cache_config:
  enable_block_reuse: false
  tokens_per_block: 64
enable_chunked_prefill: false
```
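As a usage sketch, the options above would be written to a YAML file and passed to the serving command; the `--extra_llm_api_options` flag name and the model path placeholder are assumptions to adapt to your setup:
```
cat > dsv32_options.yml <<'EOF'
kv_cache_config:
  enable_block_reuse: false
  tokens_per_block: 64
enable_chunked_prefill: false
EOF

trtllm-serve <YOUR_MODEL_DIR> --tp_size 8 --extra_llm_api_options dsv32_options.yml
```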