Stars
moojink / openvla-oft
Forked from openvla/openvla
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Baseline model for "GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping" (CVPR 2020)
The repo for "Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator"
Curated list of recent visual autoregressive (VAR) modeling works
Latest Advances on System-2 Reasoning
🧑‍🚀 Summary of the world's best LLM resources (data processing, model training, model deployment, o1 models, MCP, small language models, vision-language models).
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
Solve Visual Understanding with Reinforced VLMs
[ICLR2025] Official code implementation of Video-UTR: Unhackable Temporal Rewarding for Scalable Video MLLMs
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
DeepTimber Robotics Talent Call | DeepTimber Community Embodied AI Talent Board | A list of Embodied AI / Robotics jobs (PhD, RA, intern, full-time, etc.)
Witness the aha moment of VLM with less than $3.
ManiBox: Enhancing Spatial Grasping Generalization via Scalable Simulation Data Generation
"Open-Source LLM Usage Guide": a tutorial, tailored for Chinese beginners, on quickly fine-tuning (full-parameter / LoRA) and deploying domestic and international open-source large language models (LLMs) and multimodal large language models (MLLMs) in a Linux environment
🦄 Record your terminal and generate animated GIF images or share a web player
Awesome-LLM-3D: a curated list of resources on multi-modal large language models in the 3D world
Janus-Series: Unified Multimodal Understanding and Generation Models
Re-implementation of pi0 vision-language-action (VLA) model from Physical Intelligence
A grumpy AI that unexpectedly tells the truth after DPO fine-tuning on Qwen-2.5-1.5B
[Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Official code implementation of Slow Perception: Let's Perceive Geometric Figures Step-by-step
Cosmos is a world model development platform consisting of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
ICLR 2024 Spotlight: curation/training code, metadata, distribution, and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering