Large multimodal models (LMMs) have made remarkable progress in video comprehension, yet their performance remains limited in first-person scenarios. The interactive nature of egocentric video is critical for applications such as embodied intelligence, but it introduces complex visual contexts that conventional models struggle to capture. To bridge this gap, we introduce OpenMMEgo, which innovates along three dimensions: data, model, and training strategy. To provide rich spatiotemporal visual knowledge, we curate OME10M, a large-scale, high-quality dataset of over 8.2M egocentric video QA pairs synthesized from the Ego4D series. We also establish OMEBench, a comprehensive benchmark for rigorous assessment of egocentric understanding. To cope with the frequent viewpoint shifts inherent in egocentric video, we apply semantic-aware visual token compression, complemented by a curriculum learning strategy that fosters stable learning across data of varying complexity. OpenMMEgo consistently improves the performance of LMMs on egocentric benchmarks without sacrificing general video understanding. Notably, Qwen2.5-VL tuned with OpenMMEgo substantially outperforms other models of the same size on egocentric video understanding.
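
To give a rough sense of what semantic-aware visual token compression can look like, here is a minimal, purely illustrative sketch. It is not the OpenMMEgo implementation (which is not yet released); it assumes a simple similarity-based scheme in which tokens that are semantically redundant with the previously kept frame are dropped. The function name `compress_video_tokens` and the `keep_ratio` parameter are hypothetical.

```python
# Illustrative sketch only, not the OpenMMEgo implementation.
# Idea: drop each frame's tokens that are already well explained by the
# previously kept frame, so rapid egocentric viewpoint shifts do not
# inflate the visual token budget.

import torch
import torch.nn.functional as F


def compress_video_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> list[torch.Tensor]:
    """Keep the least redundant `keep_ratio` fraction of each frame's visual tokens.

    tokens: [T, N, D] patch tokens for T frames, N tokens per frame.
    Returns a list of per-frame tensors; frames after the first have fewer tokens.
    """
    T, N, _ = tokens.shape
    keep = max(1, int(N * keep_ratio))
    compressed = [tokens[0]]                              # first frame is kept intact
    for t in range(1, T):
        prev = F.normalize(compressed[-1], dim=-1)        # [M, D] previously kept tokens
        cur = F.normalize(tokens[t], dim=-1)              # [N, D] current frame tokens
        redundancy = (cur @ prev.T).max(dim=-1).values    # best cosine match per token
        idx = redundancy.argsort()[:keep]                 # least redundant tokens survive
        compressed.append(tokens[t][idx.sort().values])   # preserve original token order
    return compressed


if __name__ == "__main__":
    dummy = torch.randn(8, 196, 1024)                     # 8 frames of ViT patch tokens
    out = compress_video_tokens(dummy, keep_ratio=0.25)
    print([tuple(f.shape) for f in out])                  # first frame full, rest reduced
```

In a scheme like this, keeping the least redundant tokens preserves the new visual content that appears as the wearer's viewpoint changes while discarding repeated background; the actual OpenMMEgo compression may differ in its details.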
We will release our code and data soon.
If you find our work useful, please consider citing us!
@inproceedings{hao2025openmmego,
  title={Open{MME}go: Enhancing Egocentric Understanding for {LMM}s with Open Weights and Data},
  author={Hao, Luo and Zihao, Yue and Wanpeng, Zhang and Yicheng, Feng and Sipeng, Zheng and Deheng, Ye and Zongqing, Lu},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}