# Generate assets for scaling up video generation

We have seen how to generate videos from audio files. To scale up production, we need more audio files and new faces.

First, let's fetch some sample audio recordings from the [Common Voice dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) by the Mozilla Foundation.

In [1]:
import os

from datasets import load_dataset
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")
CACHE_DIR = os.getenv("CACHE_DIR")
login(token=HF_TOKEN)

common_voice_en = load_dataset(
    "mozilla-foundation/common_voice_17_0", "en",
    split="test",
    cache_dir=CACHE_DIR,
    streaming=True,
    trust_remote_code=True
)



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/ubuntu/.cache/huggingface/token
Login successful


Downloading builder script: 100%|██████████| 8.19k/8.19k [00:00<00:00, 14.3MB/s]
Downloading readme: 100%|██████████| 12.7k/12.7k [00:00<00:00, 15.4MB/s]
Downloading extra modules: 100%|██████████| 3.92k/3.92k [00:00<00:00, 16.6MB/s]
Downloading extra modules: 100%|██████████| 132k/132k [00:00<00:00, 25.7MB/s]


ValueError: Loading mozilla-foundation/common_voice_17_0 requires you to execute the dataset script in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.

In [None]:
def generate_audio():
    """Yield (waveform, sampling rate, "male" or "female") for samples with a usable gender tag."""
    for sample in common_voice_en:
        if sample["gender"] in ["male_masculine", "female_feminine"]:
            yield (
                sample["audio"]["array"],
                sample["audio"]["sampling_rate"],
                "male" if sample["gender"] == "male_masculine" else "female"
            )


In [None]:
from IPython.display import Audio

sample_audio, sampling_rate, gender = next(generate_audio())
Audio(sample_audio, rate=sampling_rate)
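The cell above takes only the first matching sample. To scale up, the same filter can feed a bounded batch instead. The following is a minimal sketch, not the notebook's own method: `filter_by_gender`, `take`, and `fake_stream` are hypothetical names, and the stand-in records only mimic the fields used above so the example runs without downloading the dataset.

```python
from itertools import islice


def filter_by_gender(samples):
    """Yield (array, sampling_rate, label) for samples with a usable gender tag."""
    labels = {"male_masculine": "male", "female_feminine": "female"}
    for sample in samples:
        label = labels.get(sample["gender"])
        if label is not None:
            yield sample["audio"]["array"], sample["audio"]["sampling_rate"], label


def take(samples, n):
    """Return at most n filtered samples without consuming the rest of the stream."""
    return list(islice(filter_by_gender(samples), n))


# Hypothetical stand-in records that mimic the fields used above.
fake_stream = [
    {"gender": "", "audio": {"array": [0.0], "sampling_rate": 48000}},
    {"gender": "female_feminine", "audio": {"array": [0.1], "sampling_rate": 48000}},
    {"gender": "male_masculine", "audio": {"array": [0.2], "sampling_rate": 48000}},
    {"gender": "male_masculine", "audio": {"array": [0.3], "sampling_rate": 48000}},
]

batch = take(fake_stream, 2)
print([label for _, _, label in batch])  # ['female', 'male']
```

With the real streaming dataset, `take(common_voice_en, 10)` would stop reading as soon as ten matching samples have been found.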

Now let's write the audio to a file so it can be used for video generation.

In [None]:
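from scipy.io import wavfile

# Create the output folder first; wavfile.write does not create missing directories.
os.makedirs("./audio", exist_ok=True)
wavfile.write(f"./audio/sample_audio_{gender}.wav", sampling_rate, sample_audio)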
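`wavfile.write` chooses the WAV sample format from the array's dtype, so a floating-point waveform is stored as a floating-point WAV. If the downstream video tool expects 16-bit PCM (an assumption; the notebook does not say), the samples can be converted first. The sketch below uses only the standard library, and `write_pcm16_wav` is a hypothetical helper, not part of scipy.

```python
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, float(value)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Demo on a throwaway file; the last value (2.0) is clipped to full scale.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_pcm16_wav(demo_path, [0.0, 1.0, -1.0, 2.0], 16000)
with wave.open(demo_path, "rb") as wav_file:
    print(wav_file.getsampwidth(), wav_file.getframerate(), wav_file.getnframes())
    # 2 16000 4
```

The per-sample loop is slow for long recordings; it is written this way only to keep the example dependency-free.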

Now we need to generate a new face. We will use a leading open-source text-to-image model: Stable Diffusion 3 Medium.



In [None]:
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
    cache_dir=CACHE_DIR
)
pipe.to("cuda")

In [None]:
# Parenthesize the conditional so the shared suffix is appended for both genders.
prompt = ("man" if gender == "male" else "woman") + \
" looking at the camera, casually dressed, flat background, uniform lighting"
image = pipe(
    prompt=prompt,
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]
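A conditional expression binds more loosely than `+`, so the subject has to be chosen before the shared description is appended. A minimal sketch of that rule follows; `build_prompt` and `SUFFIX` are hypothetical names introduced only for illustration.

```python
SUFFIX = " looking at the camera, casually dressed, flat background, uniform lighting"


def build_prompt(gender):
    """Return the full prompt; any label other than 'male' is treated as 'woman'."""
    subject = "man" if gender == "male" else "woman"
    return subject + SUFFIX


# Without parentheses, the suffix attaches only to the else branch.
unparenthesized = "man" if True else "woman" + SUFFIX
print(repr(unparenthesized))  # 'man'
print(build_prompt("male"))
```

Keeping the subject and the suffix in separate steps makes it easy to add more subject variants later without touching the shared description.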


In [None]:
from IPython.display import display

display(image)

In [None]:

os.makedirs("./img", exist_ok=True)  # image.save does not create missing folders
image.save(f"./img/sample_{gender}.jpg")
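Because the file name depends only on `gender`, each new sample of the same gender overwrites the previous one. When generating many assets, one option is to number the outputs. The sketch below is an assumption about how that could be done; `next_output_path` is a hypothetical helper, and the demo writes to a temporary folder rather than `./img`.

```python
import tempfile
from pathlib import Path


def next_output_path(directory, gender, extension):
    """Return the first unused path of the form sample_<gender>_<index><extension>."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    index = 0
    while True:
        candidate = folder / f"sample_{gender}_{index:04d}{extension}"
        if not candidate.exists():
            return candidate
        index += 1


# Demo in a throwaway directory: the second call skips the name already taken.
demo_dir = tempfile.mkdtemp()
first = next_output_path(demo_dir, "female", ".jpg")
first.touch()
second = next_output_path(demo_dir, "female", ".jpg")
print(first.name, second.name)  # sample_female_0000.jpg sample_female_0001.jpg
```

In the notebook this could be used as `image.save(next_output_path("./img", gender, ".jpg"))`, and the same helper would work for the WAV files with a `.wav` extension.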