ObjectForesight

Predicting future 3D object trajectories from human videos.

ObjectForesight is a 3D object-centric dynamics model: given a single egocentric observation (a scene point cloud and an object's recent 6-DoF pose), it predicts the object's H future 6-DoF poses. This repo contains the model code (training, evaluation, inference). The data-curation pipeline that produces the training data lives in a separate repo, RustinS/ObjectForesight-Data; the extracted dataset and pretrained weights are on Hugging Face.

The model, PoserV1, pairs a PointTransformer V3 scene encoder (via Sonata) with a DiT diffusion temporal head. Each predicted pose is a 9-D token [t_x, t_y, t_z, rot6d(6)]: three translation components followed by a 6-D rotation, which is mapped to SO(3) via Gram–Schmidt.
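To make the 9-D token concrete, here is a minimal sketch of decoding one token into a translation and a rotation matrix with Gram–Schmidt. It is an illustration, not the repo's code: the function names are hypothetical, and it assumes the two 3-vectors in the 6-D part become the first two columns of the matrix (some libraries use rows instead).

```python
import math


def _normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("cannot normalize a near-zero vector")
    return [c / n for c in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(r6):
    """Map a 6-D rotation vector to a 3x3 rotation matrix (nested lists).

    Assumption: r6 = [a1 (3 values), a2 (3 values)] and the orthonormal
    basis b1, b2, b3 forms the matrix COLUMNS.
    """
    a1, a2 = list(r6[:3]), list(r6[3:6])
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    # Remove the component of a2 along b1, then normalize.
    b2 = _normalize([a2[i] - proj * b1[i] for i in range(3)])
    # The third axis is fixed by the right-hand rule.
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def split_pose_token(token):
    """Split a 9-D token into (translation, rotation matrix)."""
    return list(token[:3]), rot6d_to_matrix(token[3:9])
```

Because the construction only normalizes and orthogonalizes, any 6-D output with two non-parallel, non-zero halves yields a valid rotation, which is why this parameterization is convenient for a network output.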


Component	Details
Encoder	PTv3, `embed_dim=768`, `in_channels=6` (camera-xyz ⊕ object-centric-xyz), `attn_obj` pooling
Temporal head	DiT, 12 layers / 768-d / 12 heads, `adaln_zero` conditioning, cosine β-schedule, v-prediction, 50 DDIM steps
I/O	scene point cloud `[N,3]` + `context_len` past poses → `[H, 9]` future poses
Params	~183 M
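The temporal head's sampler settings (cosine β-schedule, v-prediction, 50 DDIM steps) can be read as the following scalar sketch. It uses the standard cosine schedule and the standard v-parameterization; the function names and the training length `T = 1000` are assumptions for illustration, not values taken from this repo. The same arithmetic would apply elementwise to each 9-D pose token.

```python
import math


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) under the cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def v_target(x0, eps, ab):
    """v-prediction training target: v = sqrt(ab)*eps - sqrt(1-ab)*x0."""
    return math.sqrt(ab) * eps - math.sqrt(1 - ab) * x0


def x0_eps_from_v(x_t, v, ab):
    """Recover the clean sample and the noise from a predicted v."""
    a, s = math.sqrt(ab), math.sqrt(1 - ab)
    return a * x_t - s * v, s * x_t + a * v


def ddim_step(x_t, v, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0) from level ab_t to ab_prev."""
    x0, eps = x0_eps_from_v(x_t, v, ab_t)
    return math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * eps


def ddim_timesteps(T, n_steps):
    """Evenly spaced timesteps from T-1 down to 0."""
    return [round(i * (T - 1) / (n_steps - 1)) for i in range(n_steps - 1, -1, -1)]


if __name__ == "__main__":
    T = 1000  # hypothetical number of training timesteps
    steps = ddim_timesteps(T, 50)  # 50 DDIM steps, as in the table above
    print(len(steps), steps[0], steps[-1])
```

With v-prediction the network output stays well scaled at both ends of the schedule, which is one common reason to prefer it over predicting the noise directly.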

Results (EPIC-KITCHENS-100)

6-DoF trajectory metrics from the paper (lower is better). ADE/FDE = average/final translation error (m); ARE/FRE = average/final rotation error (°).

Model	ADE (m) ↓	FDE (m) ↓	ARE (°) ↓	FRE (°) ↓
ObjectForesight-DiT (this model)	0.019	0.035	7.98	13.93
ObjectForesight-AR (baseline)	0.067	0.074	9.48	12.58

See the paper for the full table (DES/RES error-growth slopes, HOT3D, and the video-generation comparison).
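As a rough guide to how the four headline metrics are defined, the sketch below computes them for one predicted trajectory. It assumes translation error is the Euclidean distance per step and rotation error is the geodesic angle between rotation matrices; the paper's exact evaluation protocol (filtering, sampling, averaging across windows) is not reproduced here, and the function names are hypothetical.

```python
import math


def translation_errors(pred_t, gt_t):
    """Return (ADE, FDE) in the input units: mean and last-step distance."""
    d = [math.dist(p, g) for p, g in zip(pred_t, gt_t)]
    return sum(d) / len(d), d[-1]


def geodesic_deg(Ra, Rb):
    """Angle in degrees of the relative rotation Ra^T Rb (3x3 nested lists)."""
    # trace(Ra^T Rb) equals the elementwise (Frobenius) inner product.
    tr = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(c))


def rotation_errors(pred_R, gt_R):
    """Return (ARE, FRE) in degrees: mean and last-step geodesic angle."""
    a = [geodesic_deg(p, g) for p, g in zip(pred_R, gt_R)]
    return sum(a) / len(a), a[-1]
```

ADE and ARE average over all H predicted steps, while FDE and FRE look only at the last step, so a model can rank differently on the two kinds of metric.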

Setup

Requires Python 3.11, CUDA 12.x (with nvcc on PATH), and a GPU — the PTv3 encoder depends on spconv + torch-scatter, which are compiled from source.

# 1. install uv (https://github.com/astral-sh/uv) if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. one-command setup (run on a GPU node; ~20–30 min, builds CUDA packages from source)
./scripts/setup.sh                 # H100/H200 (sm_90) by default
./scripts/setup.sh --cuda-arch 89  # e.g. RTX 40-series (Ada)
./scripts/setup.sh --skip-gpu      # CPU-only (no spconv/flash-attn; for editing/CI)

setup.sh creates .venv (uv, Python 3.11), runs uv sync for the base deps, then builds torch-scatter, flash-attn (optional — the code falls back to PyTorch SDPA if absent), pytorch3d, and cumm/spconv. Compiled kernels are JIT-cached in ~/.cumm after the first run.

Run anything with uv run (no manual activation needed):

uv run python -c "import torch, spconv, torch_scatter, src; print('env OK', torch.cuda.is_available())"
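When that one-liner fails, it stops at the first missing import. A slightly more forgiving check, sketched below, reports every package and treats flash-attn as optional, in line with the SDPA fallback described above. The module names are the usual import names for these packages; the script itself is hypothetical and is not part of the repo.

```python
import importlib.util

# Packages the setup script is described as building or installing.
COMPILED = ("torch", "spconv", "torch_scatter", "pytorch3d")
# Described as optional: the code falls back to PyTorch SDPA if absent.
OPTIONAL = ("flash_attn",)


def probe(names):
    """Map each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def attention_backend(found):
    """Label the attention path implied by what was found."""
    return "flash-attn" if found.get("flash_attn") else "PyTorch SDPA fallback"


if __name__ == "__main__":
    found = probe(COMPILED + OPTIONAL)
    for name, ok in found.items():
        print(f"{name:14s} {'ok' if ok else 'missing'}")
    print("attention:", attention_backend(found))
```

Saved under a name of your choice (for example `check_env.py`), it can be run with `uv run python check_env.py`. It only checks that the modules can be located, not that their CUDA kernels load.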

Manual install (advanced): uv venv --python 3.11 && uv sync, then install torch-scatter and flash-attn (--no-build-isolation) and build cumm/spconv matching your CUDA/PyTorch — see scripts/setup.sh for the exact, patched build steps.

Pretrained weights

The main EPIC-KITCHENS model (ObjectForesight-DiT) is on Hugging Face:

huggingface-cli download raivn/ObjectForesight-EPIC-DiT --local-dir checkpoints/of-epic-dit
# -> best.pt (repo-native) and model.safetensors (pickle-free)
uv run python -m src.eval_main --config-name epic_eval eval.ckpt=checkpoints/of-epic-dit/best.pt

Data

The extracted trajectories are released as the gated dataset raivn/ObjectForesight-EPIC:

huggingface-cli download raivn/ObjectForesight-EPIC --repo-type dataset --local-dir of-epic
cd of-epic && python examples/prepare.py    # untar shards -> ./manip_data

Point the loader at the extracted data with data.dataset_root=/path/to/manip_data (default: ./manip_data). The dataset release ships its own windowing/filtering loader; this repo's src/data/ performs the same trajectory construction at train time.
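For intuition, windowing a pose track into training samples can be sketched as below: each sample is `context_len` past poses followed by `H` future poses. This is an illustrative sliding window only; the function name and the `stride` parameter are assumptions, and the dataset's actual filtering rules are not shown.

```python
def make_windows(poses, context_len, horizon, stride=1):
    """Slice one trajectory into (context, future) pairs.

    poses: a sequence of per-frame poses (any element type).
    Returns a list of (context, future) tuples, where context has
    context_len items and future has horizon items.
    """
    windows = []
    last_start = len(poses) - (context_len + horizon)
    for start in range(0, last_start + 1, stride):
        split = start + context_len
        windows.append((poses[start:split], poses[split:split + horizon]))
    return windows


if __name__ == "__main__":
    track = list(range(10))  # stand-in for 10 frames of poses
    for ctx, fut in make_windows(track, context_len=2, horizon=3, stride=2):
        print(ctx, "->", fut)
```

Tracks shorter than `context_len + horizon` produce no windows, so very short object interactions drop out without a separate check.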

Usage

All runs are configured with Hydra (conf/epic.yaml is the primary config). Override any field on the command line.

# Train (single GPU)
uv run python -m src.train_main data.dataset_root=/path/to/manip_data

# Train (multi-GPU, single node via torchrun)
uv run torchrun --standalone --nproc_per_node=8 -m src.train_main
# Train (Slurm submission)
bash scripts/submit.sh --nodes 1 --gpus-per-node 8

# Evaluate (paper-style filtered eval) / infer / visualize with a checkpoint
uv run python -m src.eval_main  --config-name epic_eval eval.ckpt=checkpoints/of-epic-dit/best.pt
uv run python -m src.infer_main infer.ckpt=checkpoints/of-epic-dit/best.pt
uv run python -m src.viz_main   viz.save_dir=outputs/overlays

# Quick smoke test (synthetic data, no dataset needed)
uv run python -m src.train_main data.dataset_name=synth data.use_synthetic=true \
  train.tiny_overfit=true train.tiny_n=8 train.epochs=1

Configuration highlights

Section	Key	Meaning
`data`	`H`, `context_len`, `n_points`	horizon, # context frames, points sampled from the scene
`model`	`temporal_kind`	`dit` (default) or `ar_transformer`
`model.temporal_dit`	`conditioning`, `ddim_steps`	`adaln_zero`/`film`, # sampling steps
`train`	`batch_size`, `lr`, `amp`, `ema`	standard training knobs
`eval`	`eval_mode`, `steps`, `prefer_ema`	sampler vs loss eval, DDIM steps, whether to prefer EMA weights
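Command-line overrides use dotted paths into these sections, for example `model.temporal_dit.ddim_steps=25`. The toy function below illustrates how such a path addresses a nested config; it is not Hydra's parser and handles only plain `key=value` items. The base values shown are the ones stated elsewhere in this README.

```python
import copy


def parse_value(text):
    """Best-effort conversion of an override value to bool, int, or float."""
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(cfg, overrides):
    """Return a copy of cfg with dotted key=value overrides applied.

    Unknown keys raise KeyError instead of being created silently.
    """
    out = copy.deepcopy(cfg)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = out
        for part in parents:
            node = node[part]
        if leaf not in node:
            raise KeyError(key)
        node[leaf] = parse_value(raw)
    return out


if __name__ == "__main__":
    base = {
        "data": {"dataset_root": "./manip_data"},
        "model": {
            "temporal_kind": "dit",
            "temporal_dit": {"conditioning": "adaln_zero", "ddim_steps": 50},
        },
    }
    print(apply_overrides(base, ["model.temporal_dit.ddim_steps=25"]))
```

Rejecting unknown keys means a typo such as `model.temporal_dit.ddim_step=25` fails loudly rather than adding an unused field.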

Repository structure

src/
├── models/poser_v1/   # PoserV1 (PTv3 encoder + DiT/AR temporal head)
├── encoders/          # PointTransformer V3 adapter + serialization
├── temporal/          # DiT diffusion (DDIM) and AR transformer
├── data/              # dataset loaders, windowing, point-cloud / pose IO
├── geom/              # SE(3) ops, 6-D rotation, pose canonicalization
├── dist/              # DDP / FSDP launch
└── utils/             # config adapter, normalization, logging
conf/                  # Hydra configs (epic.yaml [primary], default.yaml, epic_eval.yaml)
scripts/               # setup.sh, submit.sh, preprocessing utilities
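The encoder's 6-channel input (camera-xyz ⊕ object-centric-xyz) relies on the kind of SE(3) operation that src/geom/ is described as holding. A minimal sketch of building those features follows. It assumes the object pose (R, t) maps object coordinates to camera coordinates, so the inverse is p_obj = Rᵀ(p_cam − t); which pose the repo uses for this, and the function names, are assumptions.

```python
def to_object_frame(p_cam, R, t):
    """Express a camera-frame point in the object frame: R^T (p_cam - t)."""
    d = [p_cam[i] - t[i] for i in range(3)]
    # Row i of R^T is column i of R.
    return [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]


def point_features(points_cam, R, t):
    """Concatenate camera-frame xyz with object-centric xyz per point."""
    return [list(p) + to_object_frame(p, R, t) for p in points_cam]


if __name__ == "__main__":
    # Object rotated 90 degrees about z and placed at (1, 2, 3) in the camera frame.
    R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
    t = [1, 2, 3]
    for row in point_features([[1, 3, 3], [1, 2, 3]], R, t):
        print(row)
```

Feeding both coordinate systems gives the encoder the scene layout as seen by the camera and each point's position relative to the object being tracked.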

Citation

@article{soraki2026objectforesight,
  title   = {ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
  author  = {Soraki, Rustin and Bharadhwaj, Homanga and Farhadi, Ali and Mottaghi, Roozbeh},
  journal = {arXiv preprint arXiv:2601.05237},
  year    = {2026}
}

License & acknowledgments

Code is released for non-commercial research use. The dataset and weights are derived from EPIC-KITCHENS-100 (CC BY-NC 4.0); when using them, cite EPIC-KITCHENS-100 and comply with its terms.

Built on PointTransformer V3 / Sonata, Hydra, and PyTorch.
