# HuggingFace Test: Audio-to-video

This demo uses two models:
- `openai/whisper-small`: to extract the text from the audio (speech-to-text)
- `damo-vilab/text-to-video-ms-1.7b`: to generate a 4s video from a text prompt (text-to-video)

We first upload the audio file (this step is tightly bound to Google Colab), then transcribe it, generate a video from the transcript, and finally display the video inline using Matplotlib.

In [1]:
!pip install diffusers transformers accelerate torch > /dev/null

In [2]:
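from google.colab import files

# Open Colab's upload widget; `uploaded` maps each saved filename to its bytes
uploaded = files.upload()
filename = list(uploaded.keys())[0]  # keep the first uploaded file

# Colab saves uploads in its working directory, /content by default
filepath = f'/content/{filename}'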

Saving 20240124_123615.aac to 20240124_123615 (1).aac
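The upload cell above only works inside Google Colab. As a rough sketch of how the same step could be made portable, the helper below prefers a local file when one is given and only falls back to the Colab widget otherwise. The name `resolve_audio_path` and its `local_path` argument are hypothetical and not part of the original demo.

```python
from pathlib import Path


def resolve_audio_path(local_path=None):
    """Return the path of the audio file to transcribe.

    Hypothetical helper: uses `local_path` when provided, otherwise
    falls back to the Google Colab upload widget.
    """
    if local_path is not None:
        path = Path(local_path)
        if not path.is_file():
            raise FileNotFoundError(f"audio file not found: {path}")
        return str(path)

    # No local path given: this import only succeeds inside Colab.
    from google.colab import files

    uploaded = files.upload()
    filename = next(iter(uploaded))  # first uploaded file
    return f"/content/{filename}"
```

Outside Colab, `filepath = resolve_audio_path("recording.aac")` would then replace the upload cell, where `recording.aac` is a placeholder file name.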


In [3]:
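from transformers import pipeline

# Build a speech-to-text pipeline backed by the Whisper small checkpoint
transcriber = pipeline(task="automatic-speech-recognition",
                       model="openai/whisper-small")
# Transcribe the uploaded file; the result is a dict with a 'text' key
result = transcriber(filepath)

print(result)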

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'text': ' He was working on the street by night. The rain was so heavy and a storm was hitting.'}
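The transcript above starts with a leading space, and a longer recording could yield far more text than a video prompt needs. The sketch below shows one possible way to tidy the text before passing it on. `clean_prompt` and the `max_words` cap are hypothetical, and the default of 50 words is an arbitrary illustrative value rather than a limit taken from either model.

```python
def clean_prompt(asr_result, max_words=50):
    """Normalise whitespace in an ASR result and cap its length in words.

    `asr_result` is assumed to be a dict with a 'text' key, like the
    output printed above.
    """
    # split() with no argument drops leading, trailing and repeated whitespace
    words = asr_result["text"].split()
    return " ".join(words[:max_words])


example = {'text': ' He was working on the street by night. The rain was so heavy and a storm was hitting.'}
print(clean_prompt(example))
```

With this helper, the prompt in the next cell could be built as `prompt = clean_prompt(result)` instead of reading `result['text']` directly.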


In [4]:
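import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# Load the text-to-video pipeline with half-precision (fp16) weights
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Keep sub-models on the CPU and move each one to the GPU only while it runs
pipe.enable_model_cpu_offload()

# Use the transcript as the prompt
prompt = result['text']
# Note: newer diffusers releases may return a batch of videos here, in which
# case the frames of the first video are `.frames[0]`
video_frames = pipe(prompt, num_inference_steps=25, num_frames=100).frames
video_path = export_to_video(video_frames, output_video_path='/content/output.mp4')

print(f"video successfully created in '{video_path}'!")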

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/25 [00:00<?, ?it/s]

video successfully created in '/content/output.mp4'!


In [5]:
import imageio
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML

def display_video(video):
    """Build a Matplotlib animation from a sequence of frames."""
    fig = plt.figure(figsize=(3, 3))

    # ArtistAnimation expects one list of artists per frame
    mov = [[plt.imshow(frame, animated=True)] for frame in video]

    # Axis and layout settings only need to be applied once
    plt.axis('off')
    plt.tight_layout()

    # Animation creation: 50 ms per frame, 1 s pause before repeating
    anime = animation.ArtistAnimation(fig, mov, interval=50, repeat_delay=1000)

    # Close the figure so only the HTML5 video is shown, not a static plot
    plt.close()
    return anime

video = imageio.mimread(video_path)  # Load the video frames
HTML(display_video(video).to_html5_video())  # Inline video display in HTML5
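The length of the inline animation follows directly from the number of frames and the `interval` passed to `ArtistAnimation`. The arithmetic can be sketched as below, assuming all 100 generated frames are read back from the file. `playback_seconds` is a hypothetical helper, and the frame rate of the saved MP4 itself is decided by `export_to_video` and is not covered here.

```python
def playback_seconds(num_frames, interval_ms):
    """Duration of one pass of the animation, in seconds."""
    return num_frames * interval_ms / 1000


# 100 frames shown for 50 ms each
print(playback_seconds(100, 50))  # 5.0
```

Lowering `interval` speeds playback up and raising it slows playback down, without changing the generated frames.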