| 2023.10 |
GIT |
MimicPlay |
Long-Horizon Imitation Learning by Watching Human Play |
 |
 |
Human: hand; robot: grippers |
Robot: head and wrist cams; human: third-person-view video |
NA |
Simple BC models |
Stage 1: train a high-level planner on human data only; Stage 2: train a visuomotor policy on robot data only |
|
|
| 2024.10 |
GIT |
EgoMimic |
Scaling Imitation Learning via Egocentric Video |
 |
 |
Human: hand; robot: grippers |
Human: head cam, hand pose (estimated); Robot: head and wrist cams, proprioceptive EEF poses, joint positions |
Hardware: 1) identical Aria glasses, 2) teleop device similar to the human upper body; Data processing: 1) unify action frames, 2) align action distributions, 3) mask out robot and human arms |
Improved ACT |
Co-train (1 hr human data + 2 hrs robot data) |
generalize to new objects/scenes/tasks seen only in human data |
Worth a close read; the drawback is the relatively small amount of data |
| 2025.09 |
GIT |
ImMimic |
Cross-Domain Imitation from Human Videos via Mapping and Interpolation |
 |
 |
Human: hand; robot: grippers or hands; retargeting required |
Human: head cam, hand pose (estimated algorithmically); Robot: head and wrist cams, proprioception |
Head-view observation alignment |
DP |
Co-train on robot data + interpolated human data |
|
|
| 2025.09 |
GIT |
EgoBridge |
Domain Adaptation for Generalizable Imitation from Egocentric Human Data |
 |
|
Same as EgoMimic |
Same as EgoMimic |
align latent representations from human and robot domains |
transformer-based design |
co-training with OT loss |
generalize to new objects/scenes/tasks seen only in human data |
Builds on EgoMimic by adding latent representation alignment and the corresponding co-training improvements |
| 2025.12 |
GIT |
EMMA |
Scaling Mobile Manipulation via Egocentric Human Data |
 |
|
Same as EgoMimic |
Same as EgoMimic, plus navigation |
optimization-based retargeting for navigation and coordinate-space alignment for manipulation |
decoder-only transformer |
co-train human full-body motion data with static robot data |
1) direct transfer of navigation skills from human data to robot; 2) co-training scales up full mobile manipulation policy performance |
Builds on EgoMimic by introducing a mobile base; learns robot navigation from human movement data |
| 2025.12 |
GIT, PI |
Human2robo |
Emergence of Human to Robot Transfer in VLAs |
 |
|
|
|
|
pi0.5 |
|
|
|
| 2026.02 |
GIT, Nvidia |
EgoScale |
Scaling Human Video to Unlock Dexterous Robot Intelligence |
 |
|
|
|
|
|
|
|
|
| 2025.07 |
UCSD |
EgoVLA |
Learning Vision-Language-Action Models from Egocentric Human Videos |
 |
 |
Human: hand; robot: hand |
Human: head cam; robot: head cam |
Unified action space + fine-tuning on robot data |
VLM + action head |
Pretrain on human data only in a unified action space, then fine-tune on robot data |
|
|
| 2025.11 |
UCSD |
In-N-On |
Scaling Egocentric Manipulation with in-the-wild and on-task Data |
 |
 |
Human: hand; robot: hand |
|
|
VLM + action head |
Pretrain on human and robot data in a unified action space |
|
Adversarial domain adaptation |
| 2025.08 |
Tsinghua |
MotionTrans |
Human VR Data Enable Motion-Level Learning for Robotic Manipulation Policies |
|
 |
|
|
|
|
|
|
|
| 2026.02 |
Microsoft |
VITRA |
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos |
 |
 |
|
|
|
|
|
|
|