- ShanghaiTech University
- Shanghai
- https://tr3e.github.io/
Stars
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
[NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward
SpatialLM: Large Language Model for Spatial Understanding
Code of LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds
A community-driven AI automation framework that builds upon the incredible work of the open source community. Our goal is to combine language models with specialized tools for tasks like web search…
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
[CVPR 2025] VGGT: Visual Geometry Grounded Transformer
MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.
⏰ AI conference deadline countdowns
Open Source Deep Research Alternative to Reason and Search on Private Data. Written in Python.
Wan: Open and Advanced Large-Scale Video Generative Models
Solve Visual Understanding with Reinforced VLMs
[CVPR 2025] A Hierarchical Movie Level Dataset for Long Video Generation
The first open autoregressive foundational video AI model.
[IJCV 2024] LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models
SkyReels V1: The first and most advanced open-source human-centric video foundation model
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.
An open source deep research clone. AI Agent that reasons large amounts of web data extracted with Firecrawl
🤗 smolagents: a barebones library for agents that think in Python code.
Magic to turn Cursor/Windsurf as 90% of Devin
Clean, minimal, accessible reproduction of DeepSeek R1-Zero