Project | Paper | Hardware | Dataset
OpenTouch is an egocentric in-the-wild dataset and cross-modal learning framework for visual (RGB), tactile (pressure), and hand-pose modalities.
The OpenTouch data is organized as synchronized multimodal recordings:
- egocentric RGB video streams
- full hand tactile pressure maps
- hand pose
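The three streams are captured at different native rates, so downstream code has to align them by timestamp. The sketch below shows one common way to do this, nearest-timestamp alignment; the rates and array names here are illustrative assumptions, not the dataset's actual layout:

```python
import numpy as np

def nearest_indices(ref_ts, other_ts):
    """For each reference timestamp, index of the closest timestamp in other_ts."""
    pos = np.searchsorted(other_ts, ref_ts)      # insertion points in sorted other_ts
    pos = np.clip(pos, 1, len(other_ts) - 1)
    left, right = other_ts[pos - 1], other_ts[pos]
    # pick whichever neighbor is closer in time to the reference frame
    return np.where(ref_ts - left <= right - ref_ts, pos - 1, pos)

# Illustrative streams: 30 Hz RGB, 100 Hz tactile
rgb_ts = np.arange(0.0, 1.0, 1 / 30)
tactile_ts = np.arange(0.0, 1.0, 1 / 100)
idx = nearest_indices(rgb_ts, tactile_ts)
# tactile_frames[idx] would now be frame-aligned with the RGB stream
```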
The dataset is hosted on Google Drive. We use `gdown` to download all files:

```bash
pip install gdown
bash scripts/download_data.sh
cd data && unzip final_annotations.zip && cd ..
```

See `scripts/download_data.sh` for the full list of Google Drive file IDs.
```bash
conda create -n opentouch python=3.10
conda activate opentouch
pip install -e .
```

The rendering scripts require extra dependencies:

```bash
git submodule update --init --recursive
pip install -e ".[rendering]"
cd EasyMocap && pip install -e . && cd ..
```

You also need the MANO hand model:

- Download `MANO_RIGHT.pkl` from the MANO project
- Place it in `preprocess/scratch/MANO_RIGHT.pkl`
```bash
# Generate a synchronized visualization from an HDF5 recording:
python preprocess/build_demo.py \
--hdf5 data/fablab_ml_p1.hdf5 \
--demo-id demo_05 \
--fps 30
```

Example outputs: a simple RGB+tactile view, and a tri-view with MANO/pose rendering.
Output path: `data/<dataset_name>/<demo_id>/combined.mp4`
```bash
# Retrieval dataset
python build_retrieval_data.py \
--input-dir data \
--output-dir preprocessed_data/train_dataset

# Classification dataset
python build_label_data.py \
--input-dir data \
--output-dir preprocessed_data/classification_peak \
--label-mapping-path final_annotations \
--label-column action \
--frame-index-column peak_idx \
--temporal-radius 10
```

The default visual backbone is DINOv3 ViT-B/16 (`facebook/dinov3-vitb16-pretrain-lvd1689m`).
Access to this model may require approval from Meta. Please refer to DINOv3 for more details.
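The `--frame-index-column peak_idx` and `--temporal-radius 10` flags above suggest that each classification sample is a window of frames centered on an annotated peak frame. A hypothetical sketch of that windowing (the boundary-clamping behavior is an assumption):

```python
def peak_window(num_frames: int, peak_idx: int, radius: int = 10) -> list[int]:
    """Frame indices in [peak_idx - radius, peak_idx + radius], clipped to the clip."""
    start = max(0, peak_idx - radius)
    end = min(num_frames, peak_idx + radius + 1)
    return list(range(start, end))

print(peak_window(num_frames=300, peak_idx=150))  # 21 indices: 140..160
print(peak_window(num_frames=300, peak_idx=3))    # truncated at the start: 0..13
```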
```bash
bash scripts/train.sh
```

Or run directly:

```bash
CUDA_VISIBLE_DEVICES=0 python -m opentouch_train.main \
--train-data preprocessed_data/train_dataset \
--model OpenTouch-DINOv3-B16-Retrieval \
--task-type v2t \
--batch-size 128 \
--lr 1e-4 \
--epochs 300 \
--precision amp \
--workers 8 \
--sequence-length 20
```

To train with multiple GPUs, use distributed data parallel (DDP); see `scripts/train_multigpu.sh` for the full reference configuration.
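Since the codebase builds on OpenCLIP, the retrieval objective is presumably a CLIP-style symmetric contrastive (InfoNCE) loss between paired embeddings of the two modalities. A numpy sketch under that assumption (batch size, dimensions, and temperature are illustrative):

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs a[i] <-> b[i] are the positives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature             # cosine similarities, scaled
    i = np.arange(len(a))
    loss_a2b = -log_softmax(logits, axis=1)[i, i].mean()  # e.g. visual -> tactile
    loss_b2a = -log_softmax(logits, axis=0)[i, i].mean()  # e.g. tactile -> visual
    return (loss_a2b + loss_b2a) / 2

rng = np.random.default_rng(0)
vis = rng.normal(size=(128, 512))
tac = rng.normal(size=(128, 512))
print(clip_loss(vis, tac))  # close to ln(128) ~ 4.85 for unaligned random embeddings
```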
Set `--task-type` to choose the retrieval task:

| Task Type | Description |
|---|---|
| `v2t` | Visual → tactile |
| `p2t` | Pose → tactile |
| `v2p` | Visual → pose |
| `vp2t` | Visual + pose → tactile |
| `tp2v` | Tactile + pose → visual |
| `vt2p` | Visual + tactile → pose |
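The task-type strings follow a simple `<inputs>2<target>` convention (v = visual, t = tactile, p = pose). A small helper like the following, written here for illustration rather than taken from the codebase, makes the mapping explicit:

```python
MODALITIES = {"v": "visual", "t": "tactile", "p": "pose"}

def parse_task_type(task: str) -> tuple[list[str], list[str]]:
    """Split e.g. 'vp2t' into (['visual', 'pose'], ['tactile'])."""
    src, dst = task.split("2")
    return [MODALITIES[c] for c in src], [MODALITIES[c] for c in dst]

print(parse_task_type("v2t"))   # (['visual'], ['tactile'])
print(parse_task_type("vp2t"))  # (['visual', 'pose'], ['tactile'])
```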
Train action or grip classifiers on top of the same encoders:
```bash
bash scripts/train_classifier.sh
```

Or run directly:

```bash
CUDA_VISIBLE_DEVICES=0 python -m opentouch_train.classification_main \
--train-data preprocessed_data/classification_peak \
--model OpenTouch-DINOv3-B16-Classify \
--task action \
--modalities visual tactile \
--batch-size 64 \
--lr 3e-3 \
--epochs 500 \
--precision amp
```

| Flag | Description |
|---|---|
| `--task` | Classification task: `action` or `grip` |
| `--modalities` | Input modalities: `visual`, `tactile`, `pose` (any combination) |
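With multiple `--modalities`, one natural design is to concatenate the per-modality embeddings and apply a linear classification head on top. The sketch below illustrates that shape bookkeeping only; it is an assumption about the architecture, not a description of the actual classifier, and all dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim, num_actions = 64, 512, 10

# Illustrative per-modality embeddings from the upstream encoders
visual_emb = rng.normal(size=(batch, dim))
tactile_emb = rng.normal(size=(batch, dim))

fused = np.concatenate([visual_emb, tactile_emb], axis=1)  # (64, 1024)
W = rng.normal(size=(2 * dim, num_actions)) * 0.01         # linear head weights
logits = fused @ W
pred = logits.argmax(axis=1)                               # predicted action ids
print(fused.shape, logits.shape)
```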
Model name, task type, and modalities are auto-detected from the checkpoint or `params.txt`.

```bash
bash scripts/eval.sh logs/<run_name>/checkpoints/epoch_<N>.pt
bash scripts/eval_classifier.sh logs/<run_name>/checkpoints/epoch_<N>.pt
```

If you find this work helpful, please consider citing:
```bibtex
@article{song2025opentouch,
  title={OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction},
  author={Song, Yuxin Ray and Li, Jinzhou and Fu, Rao and Murphy, Devin and Zhou, Kaichen and Shiv, Rishi and Li, Yaqi and Xiong, Haoyu and Owens, Crystal Elaine and Du, Yilun and others},
  journal={arXiv preprint arXiv:2512.16842},
  year={2025}
}
```

This codebase builds on OpenCLIP.
