# SpecForge: Training Speculative Decoding Models with SGLang

Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference. It runs two models together: a target model (the main LLM that serves the application) and a draft model (a smaller, lightweight LLM that proposes tokens for the target model to verify). The draft model is critical: how often the target model accepts the tokens it predicts largely determines whether a speculative decoding deployment succeeds. [SpecForge](https://github.com/sgl-project/SpecForge), a purpose-built ecosystem for training draft models that integrate natively with SGLang, is open source and listed as a flagship project in LMSYS. This tutorial demonstrates how to train a draft model with SpecForge and then run SGLang speculative decoding inference with the trained draft model on an AMD Instinct MI300X GPU node.
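
To make the draft-then-verify idea concrete, here is a minimal toy sketch of greedy speculative decoding. The two "models" (`target_model`, `draft_model`) are plain Python functions invented for illustration, and the loop is a generic simplification under those assumptions. It is not the algorithm used by SGLang or SpecForge, which draft with EAGLE-style hidden states and token trees.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # toy greedy "model": context -> next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new_tokens: int, num_draft_tokens: int = 4
                       ) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop. Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft model proposes a short continuation, one token at a time.
        proposal: List[Token] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))

        # 2) The target model checks each proposed position. A real engine scores
        #    all positions in one batched forward pass; this loop is for clarity.
        target_passes += 1
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)      # the target's own token is always kept
            if guess != expected:          # first mismatch ends this round
                break
        else:
            # Every guess matched, so the same pass also yields one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees with it except right after the token 2.
def target_model(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10


def draft_model(context: List[Token]) -> Token:
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


tokens, passes = speculative_decode(target_model, draft_model, [0], max_new_tokens=8)
print(tokens, passes)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 2
```

In this toy run the output matches what the target would produce on its own, but it takes 2 verification passes instead of 8 single-token steps. A draft model that mismatches more often would need more passes, which is why draft quality matters.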

## Prerequisites
This tutorial was developed and tested using the following setup.

### Operating system
* **Ubuntu 22.04/24.04**: Ensure your system is running Ubuntu version 22.04/24.04.

### Hardware
* **AMD GPUs**: This tutorial was tested on an AMD Instinct MI300X GPU node with 8 GPUs. Ensure you are using an AMD Instinct GPU node with 8 GPUs or compatible hardware with ROCm support and that your system meets [the official requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html).

### Software
* **ROCm 6.3**: This tutorial requires ROCm 6.3 or later. Install and verify ROCm by following [the ROCm install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html). After installation, confirm your setup using the `rocm-smi` command. AMD also provides pre-built ROCm Docker images, for example the [ROCm PyTorch image](https://hub.docker.com/r/rocm/pytorch), the [ROCm Ubuntu 22.04 image](https://hub.docker.com/r/rocm/dev-ubuntu-22.04), and the [ROCm Ubuntu 24.04 image](https://hub.docker.com/r/rocm/dev-ubuntu-24.04). These images reduce the effort of setting up a ROCm environment.

* **Docker**: Ensure Docker is installed and configured correctly. Follow the Docker installation guide for your operating system.

   **Note**: Ensure the Docker permissions are correctly configured. To configure permissions to allow non-root access, run the following commands:

   ``` bash
   sudo usermod -aG docker $USER
   newgrp docker
   ```

   Verify Docker is working correctly with:

   ``` bash
   docker run hello-world
   ```
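
As a quick preflight check, the sketch below reports whether the `rocm-smi` and `docker` executables are visible on the current `PATH`. The helper name `find_tools` is made up for this example. A "not found" result only means the tool is not on `PATH` (ROCm tools are sometimes installed in a directory that is not on it), not that it is missing from the system.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str] = ("rocm-smi", "docker")) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```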

### Hugging Face API access

* Obtain an API token from [Hugging Face](https://huggingface.co) for downloading models.
* Ensure the Hugging Face API token has the necessary permissions and approval to access the [Meta Llama checkpoints](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) used in this tutorial.

## Set up SpecForge environment 
In this tutorial, we use a pre-built ROCm Docker image as the example. You can also use other ROCm images to suit your needs.

### Step 1: Launch the docker image 
Launch the Docker container. Replace `/path/to/SpecForge_Project` with the full path to the directory on your host machine where the SpecForge code and model files are stored. This tutorial uses the [SGLang ROCm Docker images](https://hub.docker.com/r/lmsysorg/sglang/tags?name=rocm) to demonstrate both SpecForge draft model training and SGLang speculative decoding inference. For better training performance, you can also try the [ROCm PyTorch Training Docker image](https://hub.docker.com/r/rocm/pytorch-training/).

``` bash
docker run -it --rm \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --shm-size 32G \
  -v /path/to/SpecForge_Project:/SpecForge \
  -w /SpecForge/ \
  lmsysorg/sglang:v0.4.10.post2-rocm630-mi30x
```
**Note**: This command mounts `/path/to/SpecForge_Project` on the host to the `/SpecForge` directory in the container. Ensure the notebook file is either copied to that host directory before running the Docker command or uploaded into the Jupyter environment after it starts. When you start the Jupyter server in the next step, save the token or URL printed in the terminal output so you can access the notebook from your web browser. You can download this notebook from the [AI Developer Hub GitHub repository](https://github.com/ROCm/gpuaidev).

### Step 2: Install and launch Jupyter

Inside the Docker container, install Jupyter using the following command:

``` bash
pip install jupyter
```

Start the Jupyter server:

``` bash
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```

**Note**: Ensure port `8888` is not already in use on your system before running the above command. If it is, you can specify a different port by replacing `--port=8888` with another port number, for example, `--port=8890`.
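
One way to check the port before starting Jupyter is to try binding to it. The sketch below uses only the Python standard library; `is_port_free` is a name chosen for this example. Because the container is started with `--network=host`, it shares the host's ports, so the check can be run inside the container.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


print("port 8888 free:", is_port_free(8888))
```

The result is only a snapshot: another process could still take the port between this check and the moment Jupyter starts.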

### Step 3: Install SpecForge
Installing the [official SpecForge code](https://github.com/sgl-project/SpecForge) directly also pulls in GPU libraries that are not intended for AMD platforms, which is not recommended. To avoid this, this tutorial uses a [modified copy](https://github.com/zhangnju/SpecForge) that is based on a suitable upstream commit with some changes applied. Install it from source by running the following commands in the Jupyter notebook inside the Docker container:

In [None]:
%%bash
git clone https://github.com/zhangnju/SpecForge.git
pip install -v ./SpecForge

### Step 4: Provide your Hugging Face token

You'll require a Hugging Face API token to access Llama-3. Generate your token at [Hugging Face Tokens](https://huggingface.co/settings/tokens) and request access for [Llama-3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Tokens typically start with "hf_". 

Run the following interactive block in your Jupyter notebook to set up the token:

**Note**: Uncheck the "Add token as Git credential" option.

In [None]:
from huggingface_hub import notebook_login

# Prompt the user to log in
notebook_login()

In [None]:
from huggingface_hub import HfApi

try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")

## Speculative decoding draft model training

SpecForge is a framework for training speculative decoding draft models that can be ported directly to the SGLang serving framework to speed up inference. It offers two ways to train the draft model: online training and offline training. In online training, the target model is frozen and runs alongside draft model training, generating the auxiliary hidden states on the fly; this needs multiple GPUs for good performance. In offline training, the target model first generates and saves the hidden states, and the draft model is then trained in a separate process. Offline training can run on as little as one GPU, because only the draft model has to fit in GPU memory, but it needs a large amount of disk space: for example, the ultrachat and sharegpt datasets need 12 TB of storage. Given the disk capacity of the AMD MI300X machine used here, this tutorial uses online training.
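
To see why saved hidden states take so much disk space, the back-of-the-envelope sketch below multiplies token count, hidden size, number of saved states per token, and bytes per value. Only the hidden size (4096 for Llama-3 8B) is a known quantity. The token count is a placeholder, and the figures of four saved states per token and 2 bytes per value (bfloat16) are assumptions for illustration. The result is not meant to reproduce the 12 TB figure above.

```python
def hidden_state_storage_bytes(num_tokens: int, hidden_size: int,
                               states_per_token: int, bytes_per_value: int = 2) -> int:
    """Raw size of saved hidden states, ignoring file format overhead and compression."""
    return num_tokens * hidden_size * states_per_token * bytes_per_value


# Illustrative inputs only: 100 million tokens is a placeholder, not a dataset statistic,
# and 4 saved states per token is an assumption about what gets stored.
estimate = hidden_state_storage_bytes(
    num_tokens=100_000_000,
    hidden_size=4096,        # hidden size of Llama-3 8B
    states_per_token=4,
    bytes_per_value=2,       # bfloat16
)
print(f"{estimate / 1e12:.2f} TB")  # 3.28 TB
```

The estimate grows linearly with every factor, so doubling the number of tokens or saved states doubles the storage needed.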

### Data Preparation 

SpecForge provides a script to prepare the ultrachat (200k) and sharegpt (120k) datasets for draft model training. Running the commands below processes the datasets into `jsonl` files, which are the raw datasets ready for online training, and places them in the `cache/dataset/<dataset_name>` directory. To train the model on your own data, prepare the dataset in `jsonl` format with the following schema:

```json
{
    "id": "xxxx",
    "conversations": [
        {
            "role": "user | assistant",
            "content": "The message content"
        }
    ]
}
```
In this tutorial, we use the sharegpt and ultrachat datasets as examples.

In [None]:
%%bash
# ultrachat
python SpecForge/scripts/prepare_data.py --dataset ultrachat

# sharegpt
python SpecForge/scripts/prepare_data.py --dataset sharegpt
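
If you bring your own data instead, the sketch below validates records against the schema shown earlier and writes them as one JSON object per line. The helper names (`validate_record`, `write_jsonl`) and the output file name in the usage comment are made up for this example. Requiring a non-empty `conversations` list is an assumption; the schema itself does not say whether an empty list is allowed.

```python
import json
from typing import Any, Dict, Iterable

ALLOWED_ROLES = {"user", "assistant"}


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if the record does not follow the id/conversations schema."""
    if not isinstance(record.get("id"), str):
        raise ValueError("'id' must be a string")
    conversations = record.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        raise ValueError("'conversations' must be a non-empty list")
    for index, turn in enumerate(conversations):
        if not isinstance(turn, dict):
            raise ValueError(f"turn {index} must be an object")
        if turn.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"turn {index} has an unsupported role: {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError(f"turn {index} 'content' must be a string")


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Validate each record and write it as one JSON object per line. Returns the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            validate_record(record)
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


example = {
    "id": "example-0001",
    "conversations": [
        {"role": "user", "content": "What is speculative decoding?"},
        {"role": "assistant", "content": "A way to speed up LLM inference with a draft model."},
    ],
}
validate_record(example)  # passes silently for a well-formed record
# write_jsonl([example], "my_dataset.jsonl")  # hypothetical output file name
```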

### Draft model Online Training 
SpecForge provides sample scripts for training draft models for Llama 3/4, Qwen3, and other popular LLMs, which you can use as references for your own work. This tutorial uses the Llama-3 8B model to show how to run online training with SpecForge. Because model training takes a long time, we run only one epoch; the resulting draft model may not be very accurate and is intended for reference only. You can change the training options to suit your needs.

In [None]:
%%bash
# Llama-3 8B draft model training using 8 GPUs 
torchrun \
    --standalone \
    --nproc_per_node 8 \
    SpecForge/scripts/train_eagle3_online.py \
    --target-model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --draft-model-config SpecForge/configs/llama3-8B-eagle3.json \
    --train-data-path SpecForge/cache/dataset/sharegpt.jsonl \
    --output-dir outputs/llama3-8b-eagle3 \
    --num-epochs 1 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --max-length 2048 \
    --chat-template llama3 \
    --cache-dir cache

Once training is done, the trained draft model files are in the directory given by `--output-dir` (`outputs/llama3-8b-eagle3` in the command above).
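
The server command later in this tutorial loads the draft model from an `epoch_0` subdirectory of the output directory. Assuming checkpoints are saved as `epoch_<N>` subdirectories in general (an inference from that one path, not something verified against the training script), the sketch below finds the most recent one. The helper name `latest_epoch_dir` is made up for this example.

```python
import re
from pathlib import Path
from typing import Optional

EPOCH_DIR_PATTERN = re.compile(r"epoch_(\d+)")  # assumed checkpoint naming scheme


def latest_epoch_dir(output_dir: str) -> Optional[Path]:
    """Return the epoch_<N> subdirectory with the largest N, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = EPOCH_DIR_PATTERN.fullmatch(child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by the numeric epoch so that epoch_10 sorts after epoch_2.
    return max(candidates, key=lambda item: item[0])[1]


print(latest_epoch_dir("outputs/llama3-8b-eagle3"))
```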

## [optional] SGLang Speculative decoding inference 
This section shows how to test the trained draft model with the SGLang framework. If you are already familiar with SGLang speculative decoding inference, you can skip this section.

Note: Run the commands in this section from a terminal, not from notebook code cells. In JupyterLab, open a terminal using Launcher → Terminal (or File → New → Terminal).

### SGLang Speculative decoding server

The SGLang inference server supports EAGLE speculative decoding through the following options:
1) `--speculative-draft-model-path`: Specifies the draft model. This parameter is required.
2) `--speculative-num-steps`: Depth of auto-regressive drafting. Larger values increase the speculation range but risk rejection cascades. Default is 5.
3) `--speculative-eagle-topk`: Branching factor per step. Larger values improve candidate diversity and can raise the acceptance rate, but also increase memory and compute consumption. Default is 4.
4) `--speculative-num-draft-tokens`: Maximum parallel verification capacity. Larger values allow deeper tree evaluation but increase GPU memory usage. Default is 8.

In this tutorial, the values chosen for these options in the SGLang server command are only for a functional test of speculative decoding, not for performance. You can search for the best combination of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).

In [None]:
%%bash
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct \
         --speculative-algorithm EAGLE3 --speculative-draft-model-path ./outputs/llama3-8b-eagle3/epoch_0 \
         --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
         --mem-fraction-static 0.75 --cuda-graph-max-bs 2 --tp 8 --context-length 8192 --trust-remote-code \
         --host 0.0.0.0 --port 30000 --dtype bfloat16 --attention-backend triton
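
Once the server reports that it is ready, a single request is a simple functional check. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:30000` and exposes the OpenAI-compatible `/v1/chat/completions` route. The helper names (`build_chat_request`, `chat`) are made up for this example.

```python
import json
import urllib.request
from typing import Any, Dict


def build_chat_request(prompt: str, model: str, max_tokens: int = 64) -> Dict[str, Any]:
    """Build an OpenAI-style chat completion payload with greedy sampling."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def chat(prompt: str,
         base_url: str = "http://localhost:30000",   # assumed host and port
         model: str = "meta-llama/Meta-Llama-3-8B-Instruct",
         timeout: float = 120.0) -> str:
    """Send one chat request to the running server and return the generated text."""
    payload = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is up:
# print(chat("Explain speculative decoding in one sentence."))
```

A successful reply only shows that the server and draft model load and generate text. It says nothing about speedup, which the benchmark scripts in the next section measure.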

### SGLang Speculative decoding benchmarking 

SpecForge provides multiple benchmark scripts for speculative decoding. After the SGLang inference server above has launched successfully, run these scripts to test speculative decoding performance on different test datasets.

In [None]:
%%bash
# GSM8K
python SpecForge/benchmarks/run_gsm8k.py

# MATH-500
python SpecForge/benchmarks/run_math500.py

# MTBench
python SpecForge/benchmarks/run_mtbench.py

# HumanEval
python SpecForge/benchmarks/run_humaneval.py


## Summary 
Speculative decoding has emerged as a breakthrough for accelerating LLM inference. SpecForge is designed for training speculative decoding draft models and is tightly integrated with the SGLang inference engine. This tutorial has shown how to run SpecForge draft model training and test the resulting model with SGLang on AMD GPUs. SpecForge is constantly evolving, and AMD is working closely with the open-source community to enhance this work on AMD platforms. We hope this tutorial encourages you to try, test, and contribute to SpecForge on AMD GPUs, and to help us shape the future of AI acceleration.