
MMAD: Multi-modal Movie Audio Description

We propose MMAD, a novel automated pipeline for precise audio description (AD) generation. MMAD introduces ambient music alongside the visual and linguistic modalities, enhancing the model's multimodal representation learning through modality encoders and alignment.

If you like our project, please give us a star ⭐ on GitHub to stay up to date.

📰 News

[2024.2.20] Our MMAD has been accepted at COLING 2024! Watch 👀 this repository for the latest updates.

😮 Highlights

MMAD exhibits remarkable AD generation capabilities for movies by utilizing multiple input modalities.
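As a rough, purely illustrative sketch (not the MMAD implementation), the snippet below shows one way per-modality encoders and a shared projection can align video, text, and ambient-audio features before fusion. All module names and dimensions here are hypothetical.

```python
# Illustrative sketch only -- not the MMAD code.
# Each modality gets its own projection head; the aligned embeddings
# interact through a small Transformer layer before being pooled.
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, audio_dim=512, shared_dim=256):
        super().__init__()
        # Hypothetical per-modality projections ("modality encoders").
        self.video_proj = nn.Linear(video_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        # Simple cross-modal fusion over the aligned embeddings.
        self.fuse = nn.TransformerEncoderLayer(d_model=shared_dim, nhead=4, batch_first=True)

    def forward(self, video_feat, text_feat, audio_feat):
        # Align each modality into the shared embedding space.
        tokens = torch.stack(
            [self.video_proj(video_feat),
             self.text_proj(text_feat),
             self.audio_proj(audio_feat)],
            dim=1,
        )  # (batch, 3 modalities, shared_dim)
        fused = self.fuse(tokens)   # cross-modal interaction
        return fused.mean(dim=1)    # one joint representation per clip

# Example with random features standing in for real encoder outputs.
model = MultiModalFusion()
rep = model(torch.randn(2, 1024), torch.randn(2, 768), torch.randn(2, 512))
print(rep.shape)  # torch.Size([2, 256])
```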

🎥 Demo

Example ADs generated for the four demo clips:

- The talented pianist, 1900, mesmerized the audience with his virtuosic performance of "Christmas Eve" while wearing a pristine white tuxedo and bow tie.
- Chris Gardner, a man with a box in his hand, runs frantically through the city, dodging people and cars while being chased by a taxi driver who is honking.
- Dancing in the rain, Don Lockwood twirls with joy, umbrella in hand, amidst city streets.
- Alice fled through the mushroom forest, her heart racing as the Bandersnatch's ominous hisses and growls echoed behind her.

🛠️ Installation

1. Install the dependencies. If you have conda installed, you can run:

   ```bash
   git clone https://github.com/Daria8976/MMAD.git
   cd MMAD
   bash environment.sh
   ```

2. Download the pretrained weights:

   - `checkpoint_step_50000.pth` into the `checkpoint` folder
   - `base.pth` into the `AudioEnhancing/configs` folder
   - `LanguageBind/Video-LLaVA-7B` into the `VideoCaption` folder

3. Prepare your `REPLICATE_API_TOKEN` in `llama.py`.

4. Prepare the demo data (we provide four demo videos here):

   - Put `demo.mp4` under `Example/Video`
   - Put the character photos (each photo should be named after the corresponding character) under `Example/ActorCandidate`

   A quick sanity check for these paths and the token is sketched after this list.
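Before running inference, it may help to confirm that the downloaded weights and demo data are where the pipeline expects them. The snippet below is a small optional check, not part of the repository; the exact paths are assumptions based on the folder names listed above, and the environment-variable check assumes you also export `REPLICATE_API_TOKEN` in addition to configuring it in `llama.py`.

```python
# Optional sanity check (not part of MMAD) for the files listed above.
import os
from pathlib import Path

EXPECTED = [
    "checkpoint/checkpoint_step_50000.pth",
    "AudioEnhancing/configs/base.pth",
    "VideoCaption/Video-LLaVA-7B",   # LanguageBind/Video-LLaVA-7B weights
    "Example/Video/demo.mp4",
    "Example/ActorCandidate",        # character photos named after each character
]

missing = [p for p in EXPECTED if not Path(p).exists()]
if missing:
    print("Missing files/folders:")
    for p in missing:
        print("  -", p)
else:
    print("All expected weights and demo data found.")

# The Replicate-backed LLaMA step needs a token; the README says to set it
# in llama.py, and exporting REPLICATE_API_TOKEN is a common alternative.
if not os.environ.get("REPLICATE_API_TOKEN"):
    print("Note: REPLICATE_API_TOKEN is not set in the environment "
          "(make sure it is configured in llama.py).")
```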

💡 Inference

```bash
python infer.py
```

🚀 Main Results

Finally, for human evaluation we recruited 10 sighted volunteers and 10 blind and visually impaired (BVI) participants (3 totally blind and 7 partially sighted) to rate the generated ADs on a Likert scale, and we merged the aggregated statistics into the results table of the paper.
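As a purely illustrative sketch of how per-group Likert ratings can be summarized (this is not the evaluation script used for the paper, and the scores below are hypothetical placeholders), one might aggregate the ratings like this:

```python
# Illustrative aggregation of Likert ratings (hypothetical placeholder scores).
# Ratings are on a 1-5 scale; groups follow the README: sighted vs. BVI raters.
from statistics import mean, stdev

ratings = {
    "sighted": [5, 4, 4, 5, 3, 4, 5, 4, 4, 5],  # one hypothetical score per volunteer
    "BVI":     [4, 5, 4, 4, 5, 3, 4, 5, 4, 4],
}

for group, scores in ratings.items():
    print(f"{group}: mean={mean(scores):.2f}, std={stdev(scores):.2f}, n={len(scores)}")
```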

Acknowledgements

Here are some great resources we benefit from or build upon:
