122 changes: 122 additions & 0 deletions .agents/skills/cosmos3-codebase-nav/SKILL.md
@@ -0,0 +1,122 @@
---
name: cosmos3-codebase-nav
description: >
Navigate the Cosmos3 package codebase to find where parameters, configs, defaults,
scripts, and documentation live. Use when the user asks "where is X in cosmos3",
"how do I find the config for Y", "where are the defaults", "where do I change a
parameter", or any question about locating files, modules, or settings. Also use
when the user opens or edits files and needs orientation.
---

# Cosmos3 Codebase Navigation

## When to use this skill

- Use this skill when an agent is navigating the Cosmos3 package
- Use this skill to answer "where is X", "how do I find the config for Y", or any file-location question
- Use this skill when the user opens or edits cosmos3 files and needs orientation

## Path convention

Paths in the tables below are relative to this file's location (`.agents/skills/cosmos3-codebase-nav/`), so `../../../` points at the repo root. The layout summary that follows uses repo-root paths:

- `cosmos_framework/` — main training package (data, model, trainer, callbacks, checkpoint, utils, …).
- `cosmos_framework/configs/base/experiment/` — vfm (generator) experiment SKUs referenced by `[train.train_policy].experiment` in the recipe TOMLs.
- `cosmos_framework/configs/base/vlm/experiment/` — vlm (reasoner) experiment SKUs.
- `cosmos_framework/inference/` — inference subpackage (args, model, inference engine, defaults, Ray serving, common helpers).
- `cosmos_framework/scripts/` — top-level entry-point scripts (train, inference, eval, export_model, convert_model_to_dcp, upsample_prompts, caption_from_video, captions_to_sft_jsonl, action_policy_server, …). Invoked as `python -m cosmos_framework.scripts.<name>`.
- `examples/toml/sft_config/<recipe>.toml` + `examples/launch_sft_<recipe>.sh` — paired SFT recipes (training entry-point input). The shell sources `examples/_sft_launcher_common.sh`, which forwards into `cosmos_framework.scripts.train --sft-toml=...`.
- `cosmos_framework/configs/toml_config/` — pydantic schemas (`sft_config.py`) and helpers that validate the recipe TOML at load time.
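
As a minimal sketch of the path convention, the snippet below converts a path written relative to this skill directory into a repo-root path. The helper name `to_repo_path` is hypothetical and exists only for illustration; it is not part of `cosmos_framework`.

```python
import posixpath

# Location of this SKILL.md relative to the repo root (from the path convention above).
SKILL_DIR = ".agents/skills/cosmos3-codebase-nav"


def to_repo_path(skill_relative: str) -> str:
    """Convert a skill-relative path (hypothetical helper) into a repo-root path."""
    # Join onto the skill directory, then collapse the "../" segments.
    return posixpath.normpath(posixpath.join(SKILL_DIR, skill_relative))


print(to_repo_path("../../../cosmos_framework/inference/args.py"))
```

Each `../` removes one of the three directory levels in `.agents/skills/cosmos3-codebase-nav`, which is why every table entry starts with `../../../`.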

## Quick Reference

### Where parameters and defaults live

| What you're looking for | File |
| --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| Sampling params (num_steps, guidance, shift, fps, etc.) | `../../../cosmos_framework/inference/args.py` → `SamplingArgs`, `SamplingOverrides` |
| Per-modality default values | `../../../cosmos_framework/inference/defaults/<mode>/sample_args.json` |
| Setup params (parallelism, checkpoints, model path) | `../../../cosmos_framework/inference/args.py` → `OmniSetupArgs`, `OmniSetupOverrides` |
| Common args base classes | `../../../cosmos_framework/inference/common/args.py` → `ArgsBase`, `OverridesBase` |
| Ray serving parallelism presets | `../../../cosmos_framework/inference/ray/configs/latency.yaml`, `../../../cosmos_framework/inference/ray/configs/throughput.yaml` |
| Feature flags | `../../../cosmos_framework/utils/flags.py` |
| Prompt upsampler system prompt | `../../../cosmos_framework/inference/defaults/prompt_upsampler.txt` |
| Video captioner system prompt | `../../../cosmos_framework/inference/defaults/video_captioner.txt` |
| SFT recipe TOMLs (paired with `examples/launch_sft_*.sh`) | `../../../examples/toml/sft_config/<recipe>.toml` |
| SFT pydantic schema (validates the recipe TOML) | `../../../cosmos_framework/configs/toml_config/sft_config.py` |
| Training experiment SKUs (vfm) | `../../../cosmos_framework/configs/base/experiment/` |
| Training experiment SKUs (vlm / reasoner) | `../../../cosmos_framework/configs/base/vlm/experiment/` |
| Example inputs | `../../../inputs/omni/t2i.json`, `../../../inputs/omni/t2v.json`, `../../../inputs/omni/i2v.json`, … |

Available modality modes for defaults: `text2image`, `text2video`, `image2video`, `image2image`, `video2video`, `forward_dynamics`, `inverse_dynamics`, `policy`.

### Config defaults resolution chain

When a user runs inference, default parameter values are resolved in this order:

```
cosmos_framework/inference/defaults/<mode>/sample_args.json # 1. Per-modality JSON defaults (num_steps, guidance, shift, fps, etc.)
_load_modality_defaults() in cosmos_framework/inference/args.py # 2. Loaded and cached at import time
SamplingArgs / SamplingOverrides # 3. Pydantic models with field-level validation
OmniSampleOverrides.build_sample() # 4. Merges user overrides → final resolved args
_RESOLUTION_SHIFT_DEFAULTS[model_size, resolution] # 5. Model+resolution shift override (if user didn't set shift)
CLI flags (--guidance, --shift, etc.) # 6. User overrides from command line
```

The `_RESOLUTION_SHIFT_DEFAULTS` table in `../../../cosmos_framework/inference/args.py` (on `OmniSampleOverrides`) overrides the default `shift` based on model size and resolution, unless the user explicitly specified `--shift`.
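
The precedence described above can be sketched roughly as follows. This is not the code in `args.py`: the function `resolve`, the table `SHIFT_BY_MODEL_AND_RESOLUTION`, and its single entry are hypothetical placeholders, and plain dicts stand in for the pydantic models. Only the three `text2image` values come from this document.

```python
from typing import Any, Dict, Tuple

# Key defaults this document lists for text2image.
MODALITY_DEFAULTS: Dict[str, Any] = {"num_frames": 1, "guidance": 6.0, "shift": 10.0}

# HYPOTHETICAL stand-in for _RESOLUTION_SHIFT_DEFAULTS: the key and value
# below are placeholders, not the real table contents.
SHIFT_BY_MODEL_AND_RESOLUTION: Dict[Tuple[str, str], float] = {("small", "480p"): 5.0}


def resolve(
    defaults: Dict[str, Any],
    user_overrides: Dict[str, Any],
    model_size: str,
    resolution: str,
) -> Dict[str, Any]:
    """Layer per-modality defaults, the shift table, and user overrides."""
    resolved = dict(defaults)
    # The shift table only applies when the user did not set shift explicitly.
    if "shift" not in user_overrides:
        table_shift = SHIFT_BY_MODEL_AND_RESOLUTION.get((model_size, resolution))
        if table_shift is not None:
            resolved["shift"] = table_shift
    # Explicit user values (for example from --guidance or --shift) win last.
    resolved.update(user_overrides)
    return resolved


print(resolve(MODALITY_DEFAULTS, {"guidance": 4.0}, "small", "480p"))
```

The point of the sketch is the ordering: an explicit `shift` from the user always survives, while an unset `shift` can be replaced by the model and resolution table.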

| Mode | Default file | Key defaults |
| ------------- | --------------------------------------------------------------------------- | ---------------------------------------------- |
| `text2image` | `../../../cosmos_framework/inference/defaults/text2image/sample_args.json` | `num_frames=1`, `guidance=6.0`, `shift=10.0` |
| `text2video` | `../../../cosmos_framework/inference/defaults/text2video/sample_args.json` | `num_frames=189`, `guidance=6.0`, `shift=10.0` |
| `image2video` | `../../../cosmos_framework/inference/defaults/image2video/sample_args.json` | `num_frames=189`, `guidance=6.0`, `shift=10.0` |

The remaining modes also have defaults under `cosmos_framework/inference/defaults/{image2image,video2video,forward_dynamics,inverse_dynamics,policy}/sample_args.json`.

Users can also supply a custom defaults file per-request via the `defaults_file` field in sample arguments (see `../../../docs/inference.md`).

### Where to make changes

| Task | Edit |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| Change a built-in default value | `../../../cosmos_framework/inference/defaults/<mode>/sample_args.json` |
| Add a new CLI parameter | `SamplingArgs` + `SamplingOverrides` in `../../../cosmos_framework/inference/args.py`, then add to each `sample_args.json` |
| Change parallelism presets | `../../../cosmos_framework/inference/ray/configs/latency.yaml` or `throughput.yaml` |
| Add a new script | `../../../cosmos_framework/scripts/` — follow `inference.py` as the pattern |
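
For the "add to each `sample_args.json`" step, a small script can avoid missing a mode. The sketch below assumes each `sample_args.json` is a flat JSON object, which this document does not confirm; the function name `add_default_everywhere` and the parameter name `my_new_param` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def add_default_everywhere(defaults_root: Path, key: str, value: Any) -> List[Path]:
    """Add `key` to every <mode>/sample_args.json that lacks it; return the files changed."""
    changed = []
    for path in sorted(defaults_root.glob("*/sample_args.json")):
        data = json.loads(path.read_text())
        if key in data:
            continue  # keep a value that a mode already defines
        data[key] = value
        path.write_text(json.dumps(data, indent=2) + "\n")
        changed.append(path)
    return changed


# Example call from the repo root (hypothetical parameter name):
# add_default_everywhere(Path("cosmos_framework/inference/defaults"), "my_new_param", 0.5)
```

The matching fields on `SamplingArgs` and `SamplingOverrides` still have to be added by hand; the script only covers the JSON side.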

### Key entry points

| Entry point | How to run |
| -------------------- | -------------------------------------------------------------------------------------------- |
| Batch inference | `python -m cosmos_framework.scripts.inference` |
| Training | `python -m cosmos_framework.scripts.train --sft-toml=examples/toml/sft_config/<recipe>.toml` |
| Action evaluation | `python -m cosmos_framework.scripts.eval` |
| Online serving (Ray) | `python -m cosmos_framework.inference.ray.serve` |
| Submit to Ray server | `python -m cosmos_framework.inference.ray.submit` |
| Gradio UI | `python -m cosmos_framework.inference.ray.gradio` |
| Prompt upsampling | `python -m cosmos_framework.scripts.upsample_prompts` |
| Model export (HF) | `python -m cosmos_framework.scripts.export_model` |
| DCP conversion | `python -m cosmos_framework.scripts.convert_model_to_dcp` |
| Diffusers conversion | `python -m cosmos_framework.scripts.convert_model_to_diffusers` |
| Video captioning | `python -m cosmos_framework.scripts.caption_from_video` |
| Captions → SFT JSONL | `python -m cosmos_framework.scripts.captions_to_sft_jsonl` |
| Action policy server | `python -m cosmos_framework.scripts.action_policy_server` |
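
When an agent needs to launch one of these entry points programmatically, the command lines in the table can be assembled as argument lists. This is only a sketch: `module_command` and `train_command` are hypothetical helpers, and `my_recipe` is a placeholder recipe name. The module names and the `--sft-toml=` flag are taken from the table above.

```python
import shlex
import sys
from typing import List


def module_command(module: str, *args: str) -> List[str]:
    """Build `python -m <module> <args...>` using the current interpreter."""
    return [sys.executable, "-m", module, *args]


def train_command(recipe: str) -> List[str]:
    """Build the training command for a recipe TOML under examples/toml/sft_config/."""
    toml_path = f"examples/toml/sft_config/{recipe}.toml"
    return module_command("cosmos_framework.scripts.train", f"--sft-toml={toml_path}")


# "my_recipe" is a placeholder, not a recipe that ships with the repo.
print(shlex.join(train_command("my_recipe")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.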

### Documentation

| Doc | Covers |
| ----------------------------------- | ---------------------------------------------------------- |
| `../../../AGENTS.md` | Commands, rules, key file locations (read this first) |
| `../../../README.md` | Overview, quickstart, examples |
| `../../../docs/setup.md` | Installation, environment, checkpoints |
| `../../../docs/code_structure.md` | Repo layout and per-subpackage tour of `cosmos_framework/` |
| `../../../docs/inference.md` | Sample args, default values, custom defaults |
| `../../../docs/inference_online.md` | Ray Serve and Gradio |
| `../../../docs/prompting.md` | Prompt engineering, upsampling |
| `../../../docs/training.md` | SFT / post-training workflow |
| `../../../docs/faq.md` | FAQ, tips, and troubleshooting |
106 changes: 106 additions & 0 deletions .agents/skills/cosmos3-env-troubleshoot/SKILL.md
@@ -0,0 +1,106 @@
---
name: cosmos3-env-troubleshoot
description: >
Diagnose and fix Cosmos3 environment, installation, and runtime errors.
Use when the user encounters an ImportError, ModuleNotFoundError, CUDA error,
Docker error, checkpoint download failure, or any traceback during setup or inference.
---

# Cosmos3 Environment Troubleshooting

## When to use this skill

- Use when a user hits an error during installation, environment setup, or first run
- Use when a traceback mentions torch, CUDA, missing modules, or shared libraries
- Use when Docker or container setup fails
- Use when checkpoint downloads fail or HuggingFace auth errors appear

## Path convention

All paths below are relative to this file's location (`.agents/skills/cosmos3-env-troubleshoot/`).

## Step 1: Match against known errors

Check the error message against the table below. Each row links to the canonical fix in the docs.

| Error signature | Cause | Fix location |
| ----------------------------------------------------------------------------- | ------------------------------------ | -------------------------------------------------------------------------------------------------------- |
| `ImportError: cannot import name '_functionalization' from 'torch._C'` | NGC container library conflict | `../../../docs/setup.md` § PyTorch Import Issue — run `export LD_LIBRARY_PATH=''` |
| `ModuleNotFoundError: No module named 'cosmos_framework'` | Package not installed | `../../../docs/setup.md` § Dependency Issue — run `uv sync --all-extras --group=cu130-train --reinstall` |
| `ModuleNotFoundError: No module named <other>` | Dependency missing | `../../../docs/setup.md` § Dependency Issue — reinstall venv |
| `fatal error: Python.h: No such file or directory` | Broken Python / uv install | `../../../docs/setup.md` § Python Issue — reinstall uv + venv from scratch |
| `OSError: <lib>: cannot open shared object file` | CUDA version mismatch | `../../../docs/setup.md` § CUDA Issue — install matching `cuda-toolkit-<major>` |
| `docker: Error response from daemon: unknown or invalid runtime name: nvidia` | Docker nvidia runtime not configured | `../../../docs/setup.md` § Docker Container — run `sudo nvidia-ctk runtime configure --runtime=docker` |
| HuggingFace 401 / download failures | Auth or license not accepted | `../../../docs/setup.md` § Downloading Base Checkpoints — check `HF_TOKEN`, accept license agreement |
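
The lookup in Step 1 can be approximated in code. In the sketch below the section names come from the table above, but the regular expressions are this example's own approximations of the error signatures (the `401` pattern in particular is a guess at how an auth failure appears in output), and `match_known_error` is a hypothetical helper.

```python
import re
from typing import Optional

# (pattern, section of docs/setup.md) pairs transcribed from the table above.
# The patterns are approximations written for this sketch.
KNOWN_ERRORS = [
    (r"cannot import name '_functionalization' from 'torch\._C'", "PyTorch Import Issue"),
    (r"ModuleNotFoundError: No module named", "Dependency Issue"),
    (r"Python\.h: No such file or directory", "Python Issue"),
    (r"cannot open shared object file", "CUDA Issue"),
    (r"unknown or invalid runtime name: nvidia", "Docker Container"),
    (r"\b401\b", "Downloading Base Checkpoints"),
]


def match_known_error(traceback_text: str) -> Optional[str]:
    """Return the docs/setup.md section for the first matching signature, else None."""
    for pattern, section in KNOWN_ERRORS:
        if re.search(pattern, traceback_text):
            return section
    return None


print(match_known_error("ModuleNotFoundError: No module named 'cosmos_framework'"))
```

A `None` result means no documented fix matched, which is the cue to move on to Step 2.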

## Step 2: If no documented fix matches, try common remediation

Run these diagnostic commands to collect information, then attempt fixes in order:

### Diagnostic commands

```shell
# System
uname -a
head -5 /etc/os-release

# Python
python --version
which python

# CUDA
nvidia-smi
python -c "import torch; print(f'torch={torch.__version__}, cuda={torch.version.cuda}')"

# Package
uv pip list | head -20
```
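
If it is easier to gather the same information from Python (for example to paste into a bug report), something along these lines may work. `collect_diagnostics` is a hypothetical helper, not part of the repo; it tolerates a missing or broken `torch` and a missing `nvidia-smi` so that it still runs in a damaged environment.

```python
import platform
import shutil
import subprocess
import sys
from typing import Dict


def collect_diagnostics() -> Dict[str, str]:
    """Collect OS, Python, torch, and nvidia-smi details without raising."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "python_path": sys.executable,
    }
    try:
        import torch  # may be missing or broken, which is itself useful to know

        info["torch"] = torch.__version__
        info["torch_cuda"] = str(torch.version.cuda)
    except Exception as exc:  # broad on purpose: import can fail in many ways
        info["torch"] = f"import failed: {exc!r}"

    if shutil.which("nvidia-smi") is None:
        info["nvidia_smi"] = "nvidia-smi not found on PATH"
    else:
        try:
            result = subprocess.run(
                ["nvidia-smi"], capture_output=True, text=True, check=False, timeout=30
            )
            info["nvidia_smi"] = result.stdout or result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            info["nvidia_smi"] = f"nvidia-smi failed: {exc!r}"
    return info


for name, value in collect_diagnostics().items():
    print(f"{name}: {value}")
```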

### Remediation ladder (try in order)

1. **Clear library path**: `export LD_LIBRARY_PATH=''`
2. **Reinstall venv**: `uv sync --all-extras --group=cu130-train --reinstall` (or `cu128-train` on older drivers; drop `-train` only if you intentionally want the inference-only group)
3. **Reinstall uv + venv from scratch**:

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install --reinstall
rm -rf .venv
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate
```

4. **Check CUDA version alignment**: the major CUDA version reported by `nvidia-smi` (the highest version the installed driver supports) must match the major version of `torch.version.cuda`
5. **Try Docker**: if the host environment is too broken, fall back to the Docker container (see `../../../docs/setup.md`)
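
The CUDA alignment check in step 4 compares major versions only. A minimal sketch, assuming both versions are available as strings such as `"13.0"` (the helper names `cuda_major` and `compare_cuda` are hypothetical):

```python
def cuda_major(version: str) -> int:
    """Extract the major component from a CUDA version string such as '13.0'."""
    return int(version.strip().split(".")[0])


def compare_cuda(driver_cuda: str, torch_cuda: str) -> str:
    """Classify the driver CUDA version (from nvidia-smi) against torch.version.cuda."""
    driver_major = cuda_major(driver_cuda)
    torch_major = cuda_major(torch_cuda)
    if driver_major == torch_major:
        return "match"
    return "driver_newer" if driver_major > torch_major else "driver_older"


print(compare_cuda("12.8", "13.0"))
```

Any result other than `match` is what step 4 is meant to flag. A `driver_older` result suggests the installed dependency group targets a newer CUDA than the driver supports, which is the situation the `cu128-train` note in step 2 addresses.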

## Step 3: If still unresolved, generate a bug report

If none of the above resolves the issue, collect environment information and present the user with a pre-filled bug report they can submit as a GitHub issue.

Fill in the template below by running the diagnostic commands and inserting the results:

````markdown
## Environment

- **OS**: <output of `uname -a`>
- **Python**: <output of `python --version`>
- **CUDA (system)**: <output of `nvidia-smi` — first line with driver/CUDA version>
- **CUDA (torch)**: <output of `python -c "import torch; print(torch.version.cuda)"`>
- **torch version**: <output of `python -c "import torch; print(torch.__version__)"`>
- **cosmos_framework version**: <output of `python -c "import cosmos_framework; print(cosmos_framework.__version__)"` or "not installed">
- **Installation method**: <uv sync / uv pip / Docker / NGC container>

## Error

```
<full traceback>
```

## What was tried

1. <list each remediation step attempted and its result>

## Additional context

<any other relevant details — multi-GPU setup, custom CUDA install, etc.>
````