This repository provides an autolabeling pipeline for autonomous-driving scenarios. It derives meta-actions, which are high-level categorical descriptions of ego motion, and chain-of-causation labels, which connect causal factors in the scene to the ego vehicle's intended behavior.
- Workflow Overview
- Paper
- Runtime Requirements
- Dependencies (Build Image from Dockerfile)
- Prepare the Data (Physical-AI AV Dataset)
- Run CoC Autolabeling
- Troubleshooting
- Extend with Other Model Clients
- Disclaimer
- License
- Citation
- Step 1: Generate Meta-Actions: produce per-clip high-level motion labels from trajectory data.
- Step 2: Identify Keyframes: select frames where ego meta-actions change, since these transitions are likely to contain decision-making context.
- Step 3: Generate CoC Labels: run the VLM pipeline on selected keyframes to produce chain-of-causation labels.
This autolabeling pipeline is related to the Chain-of-Causation reasoning pipeline described in Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail.
The minimum hardware requirement depends on the model backend and dataset size.
A GPU is not required when CoC labels are generated with a hosted model API such
as gpt5 or gpt5.5; in that setup, local compute is mainly used for data
loading, meta-action generation, keyframe selection, and video extraction.
Local Qwen inference requires GPU capacity sufficient for the selected Qwen model. The released Qwen model examples have been tested on NVIDIA A100 and H100 GPUs with the CUDA 12.8 Docker image; other GPU setups may work depending on the selected Qwen model, available VRAM, batch size, driver compatibility, and worker settings. CPU and host-memory usage also scale with the number of workers used for trajectory-data caching and meta-action labeling.
Use nvidia-smi on the host to confirm the NVIDIA driver before running local
Qwen inference. The Docker image uses CUDA 12.8 and vLLM 0.17.1.
| GPU generation | Example GPUs | Minimum host NVIDIA driver | Validation status |
|---|---|---|---|
| Ampere | A100 | >=535 | Tested |
| Hopper | H100 | >=545 | Tested |
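
The driver check against the table above can be scripted. The following is a minimal sketch, not part of this repository: the helper names `driver_major` and `driver_meets_minimum` are hypothetical, and the minimums are copied from the table. The version string is assumed to be in the form printed by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
# Minimum host driver major versions, copied from the validation table.
MIN_DRIVER_MAJOR = {"Ampere": 535, "Hopper": 545}


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '550.54.15'."""
    return int(version.strip().split(".")[0])


def driver_meets_minimum(version: str, gpu_generation: str) -> bool:
    """Compare a driver version string against the validated minimum."""
    return driver_major(version) >= MIN_DRIVER_MAJOR[gpu_generation]


# Illustrative version strings; they are not measurements from a real host.
print(driver_meets_minimum("550.54.15", "Hopper"))
print(driver_meets_minimum("535.129.03", "Hopper"))
```

Only the major version is compared here, which matches the granularity of the table. Other driver and GPU combinations may still work but are not listed as tested.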
For a small smoke test of about 100 clips with a hosted model API for CoC generation, the following setup is sufficient:
- 8 CPU cores
- 8 GB memory
- no GPU, unless running a local Qwen model
A standalone image can be built from this repository's Dockerfile.
Build:
docker build -t coc_auto_labeling:latest .

Run:
docker run --gpus all -it --ipc=host \
-v path/to/coc_label_oss:/workspace/coc_auto_labeling \
coc_auto_labeling:latest

The default in-container project path is /workspace/coc_auto_labeling.
For local Qwen inference, verify the vLLM import before starting a labeling run:
docker run --gpus all coc_auto_labeling:latest \
python -c 'from vllm import LLM; print("ok")'

To standardize trajectory data formats and support reindexing and interpolation, this pipeline leverages trajdata for data formatting under the hood.
Download the Physical-AI AV dataset once from
https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles/tree/main.
The same dataset root is used for meta-action labeling, keyframe selection,
video extraction, and CoC labeling.
Run meta-action autolabeling to produce per-clip ego-motion labels:
meta-action-autolabel \
--dataset_name pai \
--meta_action_names all_ego \
--data_dir /path/to/physical_ai_data \
--cache_dir /path/to/trajdata_cache \
--save_dir /path/to/meta_action/resultdir \
--num_workers 8

Common options:
- `--meta_action_names`: meta-action types to generate. The CLI default is `go_straight`; the example uses `all_ego` to generate the default ego-action set.
- `--scene_list`: optional path to a text file with one clip ID per line. Use this for small subset or sample-eval runs.
- `--num_workers`: worker count for data loading and clip processing. The CLI default is `32`; the example uses `8` as a conservative sample-run setting.
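
A `--scene_list` file for a subset run can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the helpers `write_scene_list` and `read_scene_list` are hypothetical and not part of this repository, and whether the CLI tolerates duplicates or a trailing newline is not documented here, so the sketch drops duplicates and blank entries to stay on the safe side.

```python
import tempfile
from pathlib import Path


def write_scene_list(clip_ids, path):
    """Write unique, non-empty clip IDs to `path`, one per line."""
    unique_ids = list(dict.fromkeys(c.strip() for c in clip_ids if c.strip()))
    Path(path).write_text("".join(f"{c}\n" for c in unique_ids), encoding="utf-8")
    return unique_ids


def read_scene_list(path):
    """Read clip IDs back, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Clip IDs taken from the example video listing in this README.
sample_ids = [
    "01d55181-c15d-49f2-8b52-0ddf141375d0",
    "5b530101-f63b-4c61-aeac-178ad1626774",
    "01d55181-c15d-49f2-8b52-0ddf141375d0",  # duplicate, dropped on write
]

with tempfile.TemporaryDirectory() as tmp_dir:
    scene_list_path = Path(tmp_dir) / "scene_list.txt"
    write_scene_list(sample_ids, scene_list_path)
    round_trip = read_scene_list(scene_list_path)

print(round_trip)
```

The resulting file path is what would be passed to `--scene_list`.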
Key outputs:
- `--cache_dir`: formatted trajectory data cache.
- `--save_dir`: final per-clip meta-action text outputs. Use this path as `meta_action_dir` in later steps, for example `/path/to/meta_action/resultdir/final_outputs`.
For details on running meta-action autolabeling, including dataset input
layout, output format, worker guidance, and visualization commands, see
docs/meta_action_autolabel.md.
Use meta-action transitions to generate relative keyframe timestamps:
python -m coc_labeling.keyframe_auto_select \
--meta_action_dir /path/to/meta_action/resultdir/final_outputs \
--output_dir ./experiments/keyframes

Arguments:
- `--meta_action_dir`: folder containing meta-action text outputs from Step 1. This should point to the `final_outputs` directory.
- `--min_duration`: minimum action span duration in frames. The default is `10`, which keeps brief maneuvers such as short strong-deceleration events in sample-eval runs.
- `--target_count`: maximum number of segments to keep per action type after balancing. The default is `500000` for large-scale experiments, so small samples usually keep every matching segment.
- `--output_dir`: folder where the generated keyframe index files are written.
This script automatically generates relative keyframe timestamps from meta-action outputs by selecting frames where ego meta-actions change, as these transitions are more likely to indicate ego decision-making moments.
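
The idea of selecting transition frames can be illustrated with a short sketch. It is not the implementation in `coc_labeling.keyframe_auto_select`. It assumes one meta-action label per frame (with `None` for unlabeled frames), treats `event_end_frame` as inclusive, and does not model the `--target_count` balancing step. The function name `find_action_segments` and the label `placeholder_action` are hypothetical.

```python
from itertools import groupby


def find_action_segments(clip_id, frame_labels, min_duration=10):
    """Group consecutive identical labels into segments keyed by meta-action.

    A new segment starts at every frame where the label changes; runs
    shorter than `min_duration` frames are dropped.
    """
    segments = {}
    frame = 0
    for action, run in groupby(frame_labels):
        duration = sum(1 for _ in run)
        if action is not None and duration >= min_duration:
            segments.setdefault(action, []).append(
                {
                    "meta_action": action,
                    "clip_id": clip_id,
                    "event_start_frame": frame,
                    "event_end_frame": frame + duration - 1,  # inclusive (assumption)
                    "duration": duration,
                }
            )
        frame += duration
    return segments


# 12 frames of go_straight, a 4-frame run that is too short, then 10 more frames.
labels = ["go_straight"] * 12 + ["placeholder_action"] * 4 + ["go_straight"] * 10
segments = find_action_segments("clip-a", labels, min_duration=10)
for entry in segments["go_straight"]:
    print(entry["event_start_frame"], entry["event_end_frame"], entry["duration"])
```

Each `event_start_frame` marks a point where the ego label changed, which is the moment the later CoC labeling step is meant to explain.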
The output keyframes will be stored at:
./experiments/keyframes/segments_relative_timestamp_sampled.json
The structure is:
{
"<meta_action_1>": [
{
"meta_action": "<meta_action_1>",
"clip_id": "<clip_uuid>",
"event_start_frame": <start_frame_index>,
"event_end_frame": <end_frame_index>,
"duration": <num_frames>
},
"... additional entries ..."
],
"<meta_action_2>": [
{
"meta_action": "<meta_action_2>",
"clip_id": "<clip_uuid>",
"event_start_frame": <start_frame_index>,
"event_end_frame": <end_frame_index>,
"duration": <num_frames>
}
],
"... additional meta_action keys ...": [
"... additional entries ..."
]
}

Before running CoC labeling, confirm the following inputs. Configure dataset
paths in src/coc_labeling/config/data/base.yaml.
- `data_dir`: root folder that contains parquet clip data, for example `/path/to/physical_ai_data`.
- `cache_dir`: formatted trajectory data cache.
- `meta_action_dir`: meta-action outputs, for example `/path/to/meta_action/resultdir/final_outputs`.
- Keyframe input: set `segment_config_path` in `src/coc_labeling/config/data/keyframe_rel_ts.yaml`, and set `segment_generator_type` and `meta_action_filter` in `src/coc_labeling/config/base_config_vlm_rel_ts.yaml`.
- `video_dir`: root folder that contains a `camera` subfolder for raw AV videos, for example `/path/to/extracted_pai_videos`.
If video_dir is not already populated, you can extract only the clips
referenced by your keyframe/index JSON. Set EXTRACTED_VIDEO_ROOT to the path
you want to use for video_dir; the extracted MP4 files will be saved under
its camera subfolder. VIDEO_ZIP_DIR should point to the folder containing
the PAI camera chunk zips, for example camera_front_wide_120fov.chunk_0000.zip,
camera_front_wide_120fov.chunk_0001.zip, and so on.
export SEGMENT_INDEX=./experiments/keyframes/segments_relative_timestamp_sampled.json
export VIDEO_ZIP_DIR=/path/to/physical_ai_data/camera/camera_front_wide_120fov
export EXTRACTED_VIDEO_ROOT=/path/to/extracted_pai_videos
python scripts/extract_pai_videos_from_index.py \
--index-file "${SEGMENT_INDEX}" \
--video-zip-dir "${VIDEO_ZIP_DIR}" \
--output-video-root "${EXTRACTED_VIDEO_ROOT}"

Variables and arguments:
- `SEGMENT_INDEX`: keyframe/index JSON generated in the previous step.
- `VIDEO_ZIP_DIR`: folder containing the PAI camera chunk zip files.
- `EXTRACTED_VIDEO_ROOT`: output root used later as `video_dir`.
- `--index-file`: path to the keyframe/index JSON that lists clips to extract.
- `--video-zip-dir`: path to the source camera zip folder.
- `--output-video-root`: root directory where extracted videos are written under a `camera` subfolder.
- `--meta-action-filter`: optional filter that extracts only clips matching the requested meta-action type.
Without --meta-action-filter, the script extracts all clips referenced by the
index file. If your run uses a meta-action filter, pass the same filter to the
extraction script, for example --meta-action-filter go_straight.
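
To preview which clips an index references, with or without a filter, the JSON can be inspected directly. This is a minimal sketch and not the logic of `scripts/extract_pai_videos_from_index.py`. The function `clips_to_extract`, the clip IDs, and the `placeholder_action` key are hypothetical, and the filter is assumed to be a single meta-action name.

```python
import json

# Tiny index in the documented layout; all values are illustrative.
EXAMPLE_INDEX_JSON = """
{
  "go_straight": [
    {"meta_action": "go_straight", "clip_id": "clip-a",
     "event_start_frame": 0, "event_end_frame": 11, "duration": 12},
    {"meta_action": "go_straight", "clip_id": "clip-b",
     "event_start_frame": 30, "event_end_frame": 49, "duration": 20}
  ],
  "placeholder_action": [
    {"meta_action": "placeholder_action", "clip_id": "clip-b",
     "event_start_frame": 60, "event_end_frame": 74, "duration": 15},
    {"meta_action": "placeholder_action", "clip_id": "clip-c",
     "event_start_frame": 5, "event_end_frame": 19, "duration": 15}
  ]
}
"""


def clips_to_extract(index, meta_action_filter=None):
    """Return sorted unique clip IDs, optionally limited to one meta-action."""
    clip_ids = set()
    for meta_action, entries in index.items():
        if meta_action_filter is not None and meta_action != meta_action_filter:
            continue
        for entry in entries:
            if isinstance(entry, dict):  # skip any non-entry placeholders
                clip_ids.add(entry["clip_id"])
    return sorted(clip_ids)


index = json.loads(EXAMPLE_INDEX_JSON)
print(clips_to_extract(index))
print(clips_to_extract(index, meta_action_filter="go_straight"))
```

A clip that appears under several meta-actions, such as `clip-b` here, only needs to be extracted once.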
Example video structure:
/path/to/extracted_pai_videos/camera
├── 01d55181-c15d-49f2-8b52-0ddf141375d0.camera_front_wide_120fov.mp4
├── 5b530101-f63b-4c61-aeac-178ad1626774.camera_front_wide_120fov.mp4
└── ...
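
Before launching a labeling run, it can help to confirm that every referenced clip has a video in this layout. The sketch below assumes the naming pattern shown in the example listing, `<clip_id>.camera_front_wide_120fov.mp4` under a `camera` subfolder. The helpers `expected_video_path` and `missing_videos` are hypothetical and not part of this repository.

```python
import tempfile
from pathlib import Path

CAMERA_NAME = "camera_front_wide_120fov"


def expected_video_path(video_dir, clip_id, camera=CAMERA_NAME):
    """Build the path where a clip's extracted video is expected."""
    return Path(video_dir) / "camera" / f"{clip_id}.{camera}.mp4"


def missing_videos(video_dir, clip_ids, camera=CAMERA_NAME):
    """Return the clip IDs whose expected video file does not exist."""
    return [
        clip_id
        for clip_id in clip_ids
        if not expected_video_path(video_dir, clip_id, camera).is_file()
    ]


with tempfile.TemporaryDirectory() as video_dir:
    present = "01d55181-c15d-49f2-8b52-0ddf141375d0"
    absent = "5b530101-f63b-4c61-aeac-178ad1626774"
    target = expected_video_path(video_dir, present)
    target.parent.mkdir(parents=True)
    target.touch()  # empty placeholder standing in for a real MP4
    still_missing = missing_videos(video_dir, [present, absent])

print(still_missing)
```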
The released framework currently supports a VLM labeling agent.
Supported model_name values in the current release:
- `qwen3_vl_235b_awq`
- `qwen3.5_35b`
- `qwen3.5_397b_fp8`
- `gpt5`
- `gpt5.5`
For label quality, the recommended model_name values are gpt5.5 and
qwen3.5_397b_fp8. For local-only runs, start with qwen3.5_397b_fp8 when
the machine has enough GPU memory; otherwise use qwen3.5_35b. Use
qwen3.5_35b for local smoke tests because it is faster to download and load
than the larger local Qwen variants.
For Qwen models, authenticate with Hugging Face:
export HF_TOKEN="hf_yourtoken"
hf auth login --token "$HF_TOKEN"

This is required because the Qwen models are hosted on Hugging Face. HF_TOKEN
is your Hugging Face access token, and --token passes it to the login command.
For gpt5 or gpt5.5, configure one of the following credential sets.
Standard OpenAI API credentials:
export OPENAI_API_KEY="sk-your-openai-api-key"

NVIDIA inference credentials from inference.nvidia.com/build.nvidia.com:
export NVIDIA_API_KEY="nvapi-your-nvidia-api-key"

NVIDIA-hosted Azure OpenAI credentials:
export NVHOST_OAI_CLIENT_ID="your_client_id"
export NVHOST_OAI_CLIENT_SECRET="your_client_secret"

Credential precedence is NVIDIA-hosted Azure OpenAI first, then NVIDIA inference, then standard OpenAI.
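
The precedence order can be sketched as a simple lookup over environment variables. This is an illustration only, not the code in this repository: the function `select_credential_source` and its return labels are hypothetical, and the sketch assumes the NVIDIA-hosted Azure OpenAI path needs both the client ID and the client secret.

```python
def select_credential_source(env):
    """Pick a credential set using the documented precedence order."""
    # 1. NVIDIA-hosted Azure OpenAI (assumed to need both values).
    if env.get("NVHOST_OAI_CLIENT_ID") and env.get("NVHOST_OAI_CLIENT_SECRET"):
        return "nvidia_hosted_azure_openai"
    # 2. NVIDIA inference.
    if env.get("NVIDIA_API_KEY"):
        return "nvidia_inference"
    # 3. Standard OpenAI.
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


# Placeholder values; with both keys set, NVIDIA inference wins over OpenAI.
example_env = {
    "OPENAI_API_KEY": "sk-your-openai-api-key",
    "NVIDIA_API_KEY": "nvapi-your-nvidia-api-key",
}
print(select_credential_source(example_env))
```

In a shell session, passing `os.environ` instead of the example dictionary shows which credential set the current environment would favor under these assumptions.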
For Hugging Face local cache, optionally set:
export MODEL_CACHE_DIR=/path/to/hf-cache  # for example, /workspace/hf-cache
# or
export HF_HOME=/path/to/hf-cache

Run CoC autolabeling. For example:
export MODEL_CACHE_DIR=/workspace/hf-cache
python -m coc_labeling.data_labeling \
--config-name=base_config_vlm_rel_ts \
model_name=qwen3.5_35b \
resume_exp_dir=null \
exp_name=qwen3.5_35b_test \
data_loader.keyframe.meta_action_filter='[go_straight]' \
data_loader.video.save_segment_videos=false

Variables and arguments:
- `MODEL_CACHE_DIR`: local Hugging Face model cache used by local Qwen model loading.
- `--config-name`: Hydra config preset. `base_config_vlm_rel_ts` runs VLM labeling with relative keyframe timestamps.
- `model_name`: model backend to use for CoC generation. Use one of the supported `model_name` values listed above.
- `vlm_agent.temperature`, `vlm_agent.top_p`, `vlm_agent.repetition_penalty`: local Qwen sampling parameters. Defaults are `0.0`, `1.0`, and `1.0`.
- `resume_exp_dir`: existing experiment directory to resume from. Use `null` to create a new experiment directory.
- `exp_name`: readable suffix for the output experiment folder.
- `data_loader.keyframe.meta_action_filter`: list of meta-action types to label. Use `null` to include all available meta-action types.
- `data_loader.video.save_segment_videos`: whether to save extracted segment videos alongside CoC outputs.
You can configure meta_action_filter, save_segment_videos, model_name, and related settings in:
src/coc_labeling/config/base_config_vlm_rel_ts.yaml
Another example of running CoC autolabeling:
python -m coc_labeling.data_labeling \
--config-name=base_config_vlm_rel_ts \
model_name=qwen3.5_397b_fp8 \
resume_exp_dir=null \
exp_name=qwen3.5_397b_fp8_test \
data_loader.keyframe.meta_action_filter=null \
data_loader.video.save_segment_videos=false

This example uses the same arguments as above. It switches to
qwen3.5_397b_fp8 and sets data_loader.keyframe.meta_action_filter=null to
process all available meta-action types.
For local model inference, if you run into out-of-memory errors, use a GPU or machine with more available GPU memory, or reduce parallelism.
Experiment outputs are written under:
path/to/coc_label_oss/experiments/<run_id>_<exp_name>/
Example:
path/to/coc_label_oss/experiments/20260308_002355_qwen3.5_35b_test
├── b6700354-ab89-45ec-8b47-7d6dbfe16b1a/
│ ├── cot_12600000.yaml
│ ├── ...
└── ...
cot_<keyframe_timestamp>.yaml uses the keyframe timestamp in the filename.
In this example, 12600000 is the keyframe's relative timestamp.
Inside each YAML file, the results structure looks like:
results:
  event_start_frame: 126
  event_start_timestamp: 12600000
  final_content:
    ego_behavior_schema:
      effect_on_ego_behavior: "Keep distance to the lead vehicle by decelerating."
  # prompt: [...]  # full model input (system/user text + sampled video frame references)

Field meanings:
- `event_start_timestamp`: relative keyframe timestamp in microseconds (also used in the `cot_<timestamp>.yaml` filename).
- `event_start_frame`: relative frame index in the sampled clip timeline. With the default `10` FPS, `126` means `12.6` s.
- `final_content`: model output payload.
- `final_content.ego_behavior_schema.effect_on_ego_behavior`: free-form CoC text.
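
The relationship between the frame index, the microsecond timestamp, and the output filename can be checked with a little arithmetic. This is a minimal sketch assuming the default 10 FPS stated above; the helpers `frame_to_timestamp_us` and `output_filename` are hypothetical and not part of this repository.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def frame_to_timestamp_us(frame_index, fps=10):
    """Convert a relative frame index to a relative timestamp in microseconds."""
    # Integer arithmetic avoids floating-point rounding in the filename.
    return frame_index * MICROSECONDS_PER_SECOND // fps


def output_filename(frame_index, fps=10):
    """Build the cot_<keyframe_timestamp>.yaml name for a keyframe."""
    return f"cot_{frame_to_timestamp_us(frame_index, fps)}.yaml"


print(frame_to_timestamp_us(126))  # 12600000 microseconds, i.e. 12.6 s
print(output_filename(126))        # cot_12600000.yaml
```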
If data_loader.video.save_segment_videos is set to true, the segment videos are saved under the experiments/video_segment folder by default.
If the vLLM import check fails or PyTorch cannot see the GPU inside the container, PyTorch likely resolved to a CPU wheel, or the image was built without the CUDA 12.8 wheel index. Rebuild from this Dockerfile and run:
docker run --gpus all coc_auto_labeling:latest \
python -c 'from vllm import LLM; print("ok")'

If local Qwen loading fails with cudaErrorUnsupportedPtxVersion or another
CUDA/PTX compatibility error, check the host NVIDIA driver with nvidia-smi.
The Docker image uses CUDA 12.8 and vLLM 0.17.1, so the host driver must satisfy
the validated configuration table above.
The hosted GPT endpoint may reject explicit sampling parameters. Use the
current wrapper defaults for gpt5 or gpt5.5; sampling overrides documented
in this README are for local Qwen. Non-retryable 4xx errors fail fast.
Use this section to add new model clients beyond built-in options (for example, future Qwen variants or other VLM providers).
If the new model uses a provider flow that is already implemented, reuse the
existing wrapper. For example, new Qwen aliases should use QwenWrapper, and
new OpenAI-compatible aliases should use OpenAIWrapper.
If the new provider needs different request or response handling, create a wrapper under:
src/coc_labeling/model_clients/vlm_wrappers/
Your wrapper should implement the same interface used by existing wrappers:
- `infer(...)`
- `add_message(...)`
See qwen.py, cloud.py, and dummy.py as reference implementations.
Edit:
src/coc_labeling/model_clients/vlm_wrappers/factory.py
Import the wrapper class if it is new, then add your model key to
MODEL_WRAPPER_REGISTRY, mapping to the wrapper class. This key is the value
passed via model_name=... when launching
coc_labeling.data_labeling.
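
The registration pattern can be pictured with a self-contained stand-in. Everything in this sketch is hypothetical: the real `infer(...)` and `add_message(...)` signatures are not documented in this README, so the parameters shown are assumptions, `EchoWrapper` and `build_wrapper` do not exist in the repository, and the dictionary only mimics the role of `MODEL_WRAPPER_REGISTRY`. Check `qwen.py`, `cloud.py`, and `dummy.py` for the actual interface before writing a wrapper.

```python
class EchoWrapper:
    """Hypothetical wrapper that echoes the last user message."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.messages = []

    def add_message(self, role, content):
        # Assumed signature: accumulate chat-style messages before inference.
        self.messages.append({"role": role, "content": content})

    def infer(self):
        # Assumed signature: return the model's text response.
        last_user = next(
            (m["content"] for m in reversed(self.messages) if m["role"] == "user"),
            "",
        )
        return f"[{self.model_name}] {last_user}"


# Local stand-in for the registry; the key is what model_name=... would select.
MODEL_WRAPPER_REGISTRY = {"echo_demo": EchoWrapper}


def build_wrapper(model_name):
    """Look up a wrapper class by model key and instantiate it."""
    try:
        wrapper_cls = MODEL_WRAPPER_REGISTRY[model_name]
    except KeyError:
        raise ValueError(f"Unsupported model_name: {model_name!r}") from None
    return wrapper_cls(model_name)


wrapper = build_wrapper("echo_demo")
wrapper.add_message("system", "Describe the ego behavior.")
wrapper.add_message("user", "Why does the ego vehicle decelerate?")
print(wrapper.infer())
```

The point of the sketch is the mapping from a public model key to a wrapper class; the request and response handling inside a real wrapper depends on the provider.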
For a new Qwen model alias, also add its cache folder, Hugging Face model ID,
and quantization setting to QWEN_MODEL_SPECS in qwen.py.
For a new OpenAI-compatible alias, also update model_name_map in cloud.py
when the public model_name key should map to a different provider model ID.
If you want public import coverage from the wrapper package, update:
src/coc_labeling/model_clients/vlm_wrappers/__init__.py
Update default configs if you want the new model as a runnable preset:
- src/coc_labeling/config/base_config_vlm.yaml
- src/coc_labeling/config/base_config_vlm_rel_ts.yaml
Run a short test on a small clip subset:
python -m coc_labeling.data_labeling \
--config-name=base_config_vlm_rel_ts \
model_name=<your_model_key> \
resume_exp_dir=null \
exp_name=smoke_<your_model_key>

Use the same CoC labeling arguments described above. Replace
<your_model_key> with the model registry key you added, and set exp_name to
a short name that identifies the smoke run.
If your wrapper loads local model weights, make sure cache folder names and Hugging Face model IDs are both correctly mapped in the wrapper implementation.
This autolabeling tool is provided for research and development in the autonomous vehicle (AV) domain. It is intended as a foundation and a starting point for building custom VLA applications and is not a production-ready system.
Because this pipeline relies on VLMs, generated CoC outputs may contain errors, including incorrect maneuver attribution (for example, right vs. left lane change), hallucinated objects, or inaccurate temporal-causal reasoning about surrounding agents and ego behavior.
To improve the CoC quality, use human auditing and/or a hybrid post-processing workflow that combines review with heuristic checks. Example safeguards include but are not limited to validating outputs against object-detection signals (for example, lead-vehicle or pedestrian presence), planner/behavior signals (for example, nudging or yielding), and human correction loops. These safeguards are outside the scope of the current release.
By using this tool, you acknowledge that it is intended to support scientific inquiry, benchmarking, and exploration, and is not a substitute for a validated or certified AV stack. Developers and contributors disclaim responsibility and liability for use of the model and its outputs.
This project is licensed under the Apache-2.0 License.
If you use this autolabeling pipeline in your research, please cite:
@article{nvidia2025alpamayo,
title={{Alpamayo 1}: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail},
author={NVIDIA and Yan Wang and Wenjie Luo and Junjie Bai and Yulong Cao and Tong Che and Ke Chen and Yuxiao Chen and Jenna Diamond and Yifan Ding and Wenhao Ding and Liang Feng and Greg Heinrich and Jack Huang and Peter Karkus and Boyi Li and Pinyi Li and Tsung-Yi Lin and Dongran Liu and Ming-Yu Liu and Langechuan Liu and Zhijian Liu and Jason Lu and Yunxiang Mao and Pavlo Molchanov and Lindsey Pavao and Zhenghao Peng and Mike Ranzinger and Ed Schmerling and Shida Shen and Yunfei Shi and Sarah Tariq and Ran Tian and Tilman Wekel and Xinshuo Weng and Tianjun Xiao and Eric Yang and Xiaodong Yang and Yurong You and Xiaohui Zeng and Wenyuan Zhang and Boris Ivanovic and Marco Pavone},
year={2025},
journal={arXiv preprint arXiv:2511.00088},
}