The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

VideoGen-Eval 1.0

To observe and compare the video quality of recent video generative models!
Ailing Zeng1* · Yuhang Yang2* · Weidong Chen1 · Wei Liu1

1 Tencent AI Lab, 2 USTC. *Equal contribution

🔥 Project Updates

  • News: 2024/11/01: We have updated the text-to-video results of Mochi1, generated with cfg=6.0, the same setting used on their website.
  • News: 2024/10/19: We have updated 1k text-to-video results of Meta-MovieGen (prompts are from MovieGenVideoBench); please check here. In addition, the PyPI package VGenEval is now available, so you can obtain all input prompts (text, image, video) corresponding to any ID with just one line of code.
  • News: 2024/10/14: We have updated the results of MiniMax image-to-video generation; please check here.
  • News: 2024/10/08: VideoGen-Eval-1.0 is available; please check the Project Page and Technical Report for more details.

Table of Contents
  1. About The Project
  2. Assets
  3. Job List
  4. Contributing
  5. License
  6. Contact
  7. Citation

💡 About The Project

High-quality video generation, including text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, holds considerable significance for content creation and world simulation. Models like SORA have advanced video generation toward higher resolution, more natural motion, better vision-language alignment, and greater controllability, particularly for long video sequences. These improvements have been driven by the evolution of model architectures, shifting from UNet to more scalable and parameter-rich DiT models, along with large-scale data expansion and refined training strategies. However, despite the emergence of several DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations is still lacking. Additionally, existing evaluation metrics often fail to align with human preferences.

This report (v1.0) studies a series of SORA-like T2V, I2V, and V2V models to bridge the gap between academic research and industry practice and to provide a deeper analysis of recent advances in video generation. This is achieved by demonstrating and comparing over 8,000 generated video cases from ten closed-source models (Kling 1.0, Kling 1.5, Gen-3, Luma 1.0, Luma 1.6, Vidu, Qingying, MiniMax Hailuo, Tongyi Wanxiang, Pika 1.5) and several open-source models on our 700 critical prompts. Seeing is believing. We encourage readers to visit our Website to browse these results online. Our study systematically examines four core aspects:

  • Impacts on vertical-domain application models, such as human-centric animation and robotics;
  • Key objective capabilities, such as text alignment, motion diversity, composition, stability, etc.;
  • Video generation across ten real-life application scenarios;
  • In-depth discussions on potential usage scenarios and tasks, challenges, and future work.

We assign an ID to each case. The input text and the names of the input images and videos correspond to this ID, and the results generated by different models are named model_name+id.mp4; please refer to the prompts. All results are publicly accessible, and we will continuously update them as new models are released and existing ones receive version updates.
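
As an illustration only (not part of the release), the sketch below shows how the naming convention above might map a model name and case ID to an expected result file; the underscore separator and zero padding are our assumptions, so check the downloaded files or the standardized save names returned by the VGenEval package for the exact format.

# Hypothetical helper illustrating the "model_name + id -> .mp4" naming
# convention described above. The underscore separator and four-digit zero
# padding are assumptions; adjust them to match the released files.
def result_filename(model_name: str, case_id: int) -> str:
    return f"{model_name}_{case_id:04d}.mp4"

for model in ["Kling1.5", "Gen-3", "MiniMax-Hailuo"]:
    print(result_filename(model, 42))
# -> Kling1.5_0042.mp4, Gen-3_0042.mp4, MiniMax-Hailuo_0042.mp4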

🎞️ Assets

The inputs we introduce, including the text, images, and videos, together with the generated results of all models, are available for download from Google Drive and Baidu. You can also visit our Website to browse these results online.

Get VideoGen-Eval prompts:

pip install VGenEval

# Example: fetch the prompts for a list of case IDs and a model name.
# id_list: the case IDs, e.g. [0, 1, 2, 3]
# model_name: the model name, e.g. 'SORA'
from VGenEval import load_prompt

id_list = [0, 1, 2, 3]
model_name = 'SORA'
results = load_prompt.get_prompts(id_list, model_name)

# results is a dict:
# {
#   'text prompt':   [],  # text prompts for the requested IDs
#   'visual prompt': [],  # URLs of the input image or video
#   'save name':     [],  # standardized save names for the outputs
# }
# Note: for samples that take the first and last frames as input,
# 'visual prompt' returns the URLs of both frames.
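
As a small usage sketch (assuming the three lists above are aligned by index, which is our reading of the returned structure), you can pair each text prompt with its visual prompt URL and standardized save name:

from VGenEval import load_prompt

# Fetch prompts for a few case IDs of one model and walk the aligned lists.
results = load_prompt.get_prompts([0, 1, 2, 3], 'SORA')
for text, visual, name in zip(results['text prompt'],
                              results['visual prompt'],
                              results['save name']):
    # 'name' is the standardized save name; 'visual' is the input image/video URL.
    print(f"{name}: {text[:60]!r} (input: {visual})")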

🦉 Job List

  • VideoGen-Eval-1.0 released
  • Add results of Seaweed, PixelDance, and MiracleVision.
  • Build an arena for video generation models.

💗 Contributing

All contributions are welcome! If you have a suggestion that would improve this project, please fork the repo and create a pull request. You can also open an issue with the tag "enhancement". Don't forget to give the project a star. Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some change')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

🏄 Top contributors:

[contrib.rocks contributor image]

✏️ License

Distributed under the MIT License. See LICENSE.txt for more information.

📢 Contact

Ailing Zeng - ailingzengzzz@gmail.com

Yuhang Yang - yyuhang@mail.ustc.edu.cn

💌 Citation

@article{zeng2024dawn,
  title={The Dawn of Video Generation: Preliminary Explorations with SORA-like Models},
  author={Zeng, Ailing and Yang, Yuhang and Chen, Weidong and Liu, Wei},
  journal={arXiv preprint arXiv:2410.05227},
  year={2024}
}