trajlens

The quality and synthesis layer for the open robot-learning data ecosystem.

ruff for robot data — lint, fix, and generate clean LeRobotDataset datasets.

Status

Pre-v0.1 (0.1.0.dev0), under active development. Not yet on PyPI.

lint is implemented and audited against the public Hub (see Real-world audit of the Hub below). fix and web are stubs reserved for the v0.2 milestone.

Install (dev)

git clone https://github.com/<your-username>/trajlens
cd trajlens
uv venv
source .venv/bin/activate
uv pip install -e ".[dev,hub]"

The [hub] extra pulls in huggingface_hub; it's only required to lint datasets by Hub repo id rather than local path.

Usage

trajlens lint <path-or-org/dataset>          # human-readable terminal report
trajlens lint <path-or-org/dataset> --json   # machine-readable JSON report
trajlens lint <path-or-org/dataset> --report out.html
trajlens lint <path-or-org/dataset> --sarif out.sarif   # SARIF 2.1.0, for CI annotations
trajlens lint <path-or-org/dataset> --deep   # also decode video and verify per-frame stats

Exit codes follow lint-tool convention: 0 = clean, 1 = WARN present, 2 = FAIL or load ERROR — so trajlens lint composes directly into CI gates.
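As a minimal sketch of how these exit codes could drive a CI gate, the wrapper below maps a code to a label and decides whether the job should fail. The helpers describe_exit_code, should_block, and gate are hypothetical names for illustration, not part of trajlens; only the 0/1/2 meanings come from this README.

```python
import subprocess
import sys

# Exit-code meanings as documented above; the label strings are illustrative.
EXIT_LABELS = {0: "clean", 1: "warn", 2: "fail-or-load-error"}


def describe_exit_code(code: int) -> str:
    """Map a `trajlens lint` exit code to a short label."""
    return EXIT_LABELS.get(code, "unexpected")


def should_block(code: int, strict: bool = False) -> bool:
    """Decide whether a CI job should fail for this exit code.

    By default only exit code 2 (or any unexpected code) blocks;
    with strict=True, warnings (exit code 1) block as well.
    """
    if code == 0:
        return False
    if code == 1:
        return strict
    return True


def gate(target: str, strict: bool = False) -> int:
    """Run `trajlens lint <target>` and return the status the CI job should exit with."""
    result = subprocess.run(["trajlens", "lint", target], check=False)
    label = describe_exit_code(result.returncode)
    print(f"trajlens lint {target}: {label}", file=sys.stderr)
    return 1 if should_block(result.returncode, strict) else 0
```

A CI step could then call sys.exit(gate("org/dataset", strict=True)) to treat warnings as blocking.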

By default, checks that require materializing a lot of data over the network (full video decode, per-frame stats reconciliation) are skipped for Hub datasets and reported as INFO/skipped rather than run. Pass --deep to force them; expect this to be significantly slower and to fetch the full dataset.

Architecture

graph TD
  subgraph Interfaces
    CLI[CLI - typer]
    WEB[Web dashboard - FastAPI + React, optional]
    SDK[Python SDK / import]
  end

  subgraph Core
    LOADER[Dataset Source Layer<br/>local + Hub, version-aware]
    MODEL[Canonical Dataset Model<br/>typed in-memory view]
    REGISTRY[Check Registry<br/>pluggable rules]
    ENGINE[Check Engine<br/>runs checks, bounded]
    REPORT[Report Builder<br/>terminal / json / html / sarif]
    REPAIR[Repair Engine<br/>dry-run, diff, opt-in]
  end

  subgraph Synthesis [Pillar 3, later]
    SIMBK[Sim Backend Protocol<br/>MuJoCo default]
    AUG[Trajectory Augmenter<br/>MimicGen-style]
    DR[Domain Randomizer]
    WRITER[LeRobotDataset Writer]
  end

  CLI --> LOADER
  WEB --> LOADER
  SDK --> LOADER
  LOADER --> MODEL
  MODEL --> ENGINE
  REGISTRY --> ENGINE
  ENGINE --> REPORT
  MODEL --> REPAIR
  REPAIR --> WRITER
  SIMBK --> AUG --> DR --> WRITER
  WRITER --> MODEL
  REPORT --> WEB
  HUB[(Hugging Face Hub)] <--> LOADER
  HUB <--> WRITER

What it checks

trajlens validates a LeRobotDataset (v2.0, v2.1, or v3.0) against its own declared metadata, independent of any particular consumer's assumptions. Checks are grouped by category, and the check engine runs them against each dataset:

Category	Check	Severity	What it catches
STRUCTURAL	`VERSION_DETECTED`	INFO	Reports the detected `codebase_version`.
STRUCTURAL	`SCHEMA_CONSISTENCY`	FAIL	Parquet column dtypes/widths disagree with `info.json`'s declared feature shapes.
STRUCTURAL	`INDEX_CONTINUITY`	FAIL	Gaps or duplicates in `frame_index`/`episode_index`/global `index` columns.
STRUCTURAL	`METADATA_DATA_AGREEMENT`	FAIL	Declared episode lengths/`from`-`to` boundaries disagree with actual Parquet row counts (catches #2401-class corruption).
STRUCTURAL	`PATH_TEMPLATE_RESOLVES`	FAIL	A declared shard path (data or video) doesn't resolve to a readable file.
SEMANTIC	`FEATURE_DIMENSIONALITY`	FAIL	A feature's actual column width doesn't match its declared `shape`.
SEMANTIC	`TASK_INTEGRITY`	FAIL	A `task_index` reference has no corresponding, non-empty task description.
SEMANTIC	`LANGUAGE_PRESENT`	WARN	An episode has no non-empty language/task description.
SEMANTIC	`CAMERA_INTRINSICS_PLAUSIBLE`	INFO	Advisory; skipped where the LeRobot format carries no intrinsics field.
TEMPORAL	`TIMESTAMP_MONOTONIC`	FAIL	Timestamps are not strictly increasing within an episode.
TEMPORAL	`TIMESTAMP_SPACING`	WARN	Timestamp spacing is inconsistent with declared `fps` beyond decoder tolerance.
STATISTICAL	`STATS_MATCH_DATA`	FAIL	Recomputed global Welford stats diverge from `meta/stats.json`. Skipped over Hub HTTP by default because it is too slow; pass `--deep` to run it.
STATISTICAL	`PER_EPISODE_STATS_MATCH`	WARN	Same, per-episode. Skipped over Hub HTTP by default.
STATISTICAL	`VALUE_SANITY`	WARN	Out-of-range or NaN/Inf values in numeric features. Skipped over Hub HTTP by default.
VIDEO	`DECODABLE_SPOTCHECK`	FAIL	A sampled video segment fails to decode.
KNOWNBUG	`TIMESTAMP_DRIFT`	FAIL	Cumulative timestamp drift matching the known lerobot #3177 bug pattern.
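
To make the TEMPORAL rows concrete, here is a minimal sketch of the two ideas behind TIMESTAMP_MONOTONIC and TIMESTAMP_SPACING. It is not trajlens's implementation: the function names and the tolerance_s default are assumptions chosen for illustration, and the real checks may use a different tolerance derived from the decoder.

```python
from typing import List, Sequence


def is_strictly_increasing(timestamps: Sequence[float]) -> bool:
    """True if every timestamp is strictly greater than the previous one."""
    return all(b > a for a, b in zip(timestamps, timestamps[1:]))


def spacing_violations(
    timestamps: Sequence[float],
    fps: float,
    tolerance_s: float = 1e-4,  # assumed tolerance, not taken from trajlens
) -> List[int]:
    """Return indices of frames whose gap to the previous frame
    deviates from 1/fps by more than tolerance_s."""
    expected = 1.0 / fps
    return [
        i + 1
        for i, (a, b) in enumerate(zip(timestamps, timestamps[1:]))
        if abs((b - a) - expected) > tolerance_s
    ]
```

For one episode recorded at 10 fps, a repeated timestamp fails the monotonic test and is also reported as a spacing violation, together with the oversized gap that follows it.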

Every check's full result — message, severity, and structured details — is included in the JSON/HTML/SARIF report; the table above is the summary.
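
The STATS_MATCH_DATA row refers to recomputed Welford stats. As a hedged illustration of that technique, the sketch below runs Welford's online algorithm over a stream of values and compares the result with declared statistics. The RunningStats class, the stats_match helper, the tolerances, and the use of population standard deviation are all assumptions for this example rather than trajlens's actual code.

```python
import math
from typing import Iterable


class RunningStats:
    """Welford's online algorithm for mean and variance in one pass."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation (ddof=0); an assumption for this sketch.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def stats_match(
    values: Iterable[float],
    declared_mean: float,
    declared_std: float,
    rel_tol: float = 1e-4,  # assumed tolerance
    abs_tol: float = 1e-6,  # assumed tolerance
) -> bool:
    """Recompute mean/std in a single pass and compare with declared values."""
    stats = RunningStats()
    for value in values:
        stats.update(value)
    return math.isclose(
        stats.mean, declared_mean, rel_tol=rel_tol, abs_tol=abs_tol
    ) and math.isclose(stats.std, declared_std, rel_tol=rel_tol, abs_tol=abs_tol)
```

Because the algorithm needs only one pass and constant memory, the same loop works whether values come from a local Parquet file or are streamed shard by shard.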

Real-world audit of the Hub

scripts/audit_hub.py runs trajlens lint --json against a random sample of public Hub datasets tagged lerobot, each in an isolated subprocess with a 60s timeout, and aggregates the results. It's how this project validates itself against the actual long tail of community datasets rather than only its own fixtures.
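
The isolation-plus-timeout pattern described above can be sketched as follows. This is not the contents of scripts/audit_hub.py: run_isolated, audit, and the status labels are hypothetical, and because exit code 2 covers both FAIL and load ERROR, this sketch does not separate the two (the real script presumably uses the JSON report for that).

```python
import subprocess
import sys
from collections import Counter
from typing import Iterable, Sequence

# 0/1/2 follow the exit codes documented under Usage; labels are illustrative.
STATUS_BY_EXIT_CODE = {0: "PASS", 1: "WARN", 2: "FAIL_OR_ERROR"}


def run_isolated(cmd: Sequence[str], timeout_s: float = 60.0) -> str:
    """Run one command in its own subprocess and classify the outcome."""
    try:
        proc = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "TIMEOUT"
    return STATUS_BY_EXIT_CODE.get(proc.returncode, "ERROR")


def audit(commands: Iterable[Sequence[str]], timeout_s: float = 60.0) -> Counter:
    """Run every command in isolation and count the resulting statuses."""
    return Counter(run_isolated(cmd, timeout_s) for cmd in commands)
```

In a real audit each command would be something like ["trajlens", "lint", repo_id, "--json"]; running each in a separate process means a crash or hang in one dataset cannot take down the whole run.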

A 100-dataset run (2026-06-24) produced:

Status	Count	Meaning
PASS	19	No issues found.
WARN	0	—
FAIL	13	A real check fired — schema mismatch, metadata/data disagreement, missing language, etc.
ERROR	47	Dataset failed to load (unsupported v2.x Hub streaming, malformed/missing `meta/`, mistagged or deleted repos) — never reached the check engine.
TIMEOUT	21	Exceeded the 60s per-dataset budget.

These figures are from a single 100-dataset random sample (raw results: see the v0.1.0 release assets); audit_hub.py samples a fresh random subset of lerobot-tagged Hub datasets on each run, so rerunning it will produce a similarly-shaped but not identical distribution.

Of the 47 load-time ERRORs, none are trajlens bugs: about half (24) are the documented v0.1 limitation that v2.x Hub datasets can't be lazily streamed (shard paths are implicit and require a local filesystem to glob), and the rest are dead/mistagged Hub references, repos that aren't actually LeRobotDatasets (no meta/ directory at the repo root), or genuinely malformed meta/info.json (wrong dtype, missing required fields) on the dataset's side.

TIMEOUTs were investigated as a possible performance bug rather than accepted as an inherent network ceiling. Profiling two small datasets that had previously timed out (abdul004/so101_multi_task_v1, 125 episodes; Elvinky/pick_green_block_into_box, 102 episodes) showed that loading a dataset's metadata over Hub HTTP issued dozens of small, individually latency-bound reads per Parquet shard and downloaded the meta/ file tree one file at a time. Fixing both (a single whole-shard fetch instead of scattered reads, and a parallelized meta/ download) brought those two datasets from 60s+ timeouts down to 33s and 11s respectively, and cut the audit's overall TIMEOUT count and mean per-dataset duration by roughly a third in before/after sampling. The remaining TIMEOUTs are concentrated in genuinely large multi-thousand-episode shards, where 60s is a real infra ceiling rather than a fixable inefficiency.
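
The parallelized meta/ download can be illustrated with a small thread-pool sketch. It assumes a caller-supplied fetch_one function that retrieves one file (for example a thin wrapper around an HTTP client); fetch_all, fetch_one, and max_workers are hypothetical names, and this is not the code trajlens ships.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(
    paths: Iterable[str],
    fetch_one: Callable[[str], bytes],
    max_workers: int = 8,  # assumed default; tune to the remote's rate limits
) -> Dict[str, bytes]:
    """Fetch many small files concurrently instead of one at a time.

    Each fetch is latency-bound rather than CPU-bound, so threads overlap
    the network round trips even under the GIL.
    """
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first worker exception.
        contents = list(pool.map(fetch_one, path_list))
    return dict(zip(path_list, contents))
```

With N files and a round-trip latency of t seconds each, the sequential cost is about N × t, while the pooled version approaches (N / max_workers) × t, which is why many tiny metadata files benefit most.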

Launch audit findings

Of the 32 datasets that reached a grade (19 PASS + 13 FAIL; ERROR and TIMEOUT are excluded because no check ever ran), two known upstream lerobot bugs accounted for a meaningful share of the failures:

Known bug	Prevalence (of successfully-linted datasets)
`KNOWNBUG.TIMESTAMP_DRIFT` (#3177)	3.1%
`STRUCTURAL.METADATA_DATA_AGREEMENT` (#2401)	18.8%

Because audit_hub.py resamples on every run, these percentages are not a fixed, reproducible distribution; a rerun will return similarly-shaped but not identical numbers. Raw per-dataset results behind these specific numbers are attached to the v0.1.0 GitHub release as audit_results_100.json and audit_summary_100.txt.

Performance note: Hub vs. local

Linting a 100-episode dataset locally takes under 30 seconds.

Linting a Hub dataset directly (trajlens lint org/dataset) streams metadata and data shards over HTTP. It will inherently be slower than a local copy — typically under a minute for small-to-medium datasets, more for very large ones — because of unavoidable network round trips. For repeated linting, downloading the dataset locally first is still faster.

License

Apache-2.0
