Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first work to systematically explore R1 for video]

Python 205 4 Updated Mar 29, 2025

A curated list of Awesome Personalized Large Multimodal Models resources

16 Updated Mar 27, 2025

CUDA Python: Performance meets Productivity

Python 1,253 102 Updated Mar 27, 2025

Code for "How far can we go with ImageNet for Text-to-Image generation?" paper

Python 76 Updated Mar 18, 2025

DataComp for Language Models

HTML 1,266 119 Updated Mar 19, 2025

Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"

Jupyter Notebook 179 7 Updated Mar 21, 2025

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python 1,719 181 Updated Mar 26, 2025
Python 28 4 Updated Mar 29, 2025

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Python 1,176 61 Updated Jul 17, 2024

GenEval: An object-focused framework for evaluating text-to-image alignment

HTML 207 11 Updated Mar 3, 2025

A Unified Tokenizer for Visual Generation and Understanding

Python 217 5 Updated Mar 3, 2025

Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Python 92 1 Updated Mar 2, 2025

[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Python 250 7 Updated Jan 22, 2025

High-Resolution 3D Assets Generation with Large Scale Hunyuan3D Diffusion Models.

Python 8,076 660 Updated Mar 28, 2025
Python 30 Updated Jan 17, 2025

Cosmos is a world model development platform that consists of world foundation models, tokenizers and video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…

Jupyter Notebook 7,841 503 Updated Mar 28, 2025

A suite of image and video neural tokenizers

Jupyter Notebook 1,589 74 Updated Feb 11, 2025

High-performance Image Tokenizers for VAR and AR

Python 228 5 Updated Mar 25, 2025
Python 121 8 Updated Jun 28, 2024

SEED-Voken: A Series of Powerful Visual Tokenizers

Python 856 31 Updated Feb 19, 2025

[ICLR 2025][arXiv:2406.07548] Image and Video Tokenization with Binary Spherical Quantization

Python 139 Updated Jun 12, 2024

SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Python 230 7 Updated Dec 29, 2024

[CVPR 2025] Official code of "DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation"

Python 245 5 Updated Mar 17, 2025

Official PyTorch implementation for LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (ICLR 2025 Oral).

Python 59 Updated Feb 11, 2025

[CVPR'25] Official PyTorch implementation of Lumos: Learning Visual Generative Priors without Text

Python 34 Updated Mar 16, 2025

[CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".

Python 296 1 Updated Mar 5, 2025

This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"

Python 23 1 Updated Dec 23, 2024

[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling

Python 1,086 82 Updated Mar 2, 2025

📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models.

442 18 Updated Mar 14, 2025

[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.

Python 1,294 56 Updated Mar 24, 2025