## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH05/ch05_latte.ipynb) |

# About this Notebook

This notebook demonstrates **text-to-video generation** with **Latte-1**, a diffusion-based video generation model published under the `maxin-cn` account on the Hugging Face Hub. Loaded through Hugging Face's `diffusers` library, **Latte-1** synthesizes short video clips from natural language prompts and can produce either a single image or a temporally consistent sequence of frames.

### Steps Included:

1. **Model Setup**:
   The notebook loads the `LattePipeline` from the `maxin-cn/Latte-1` repository on the Hugging Face Hub. It then loads the **temporal VAE decoder** (`AutoencoderKLTemporalDecoder`) from the same repository and assigns it to the pipeline, so that decoding takes temporal dependencies across frames into account.

2. **Device Configuration**:
   The pipeline is loaded in `float16` precision and moved to the GPU when one is available; otherwise it falls back to the CPU.

3. **Prompt Definition and Generation**:
   A sample prompt (e.g., *"Slow pan upward of blazing oak fire in an indoor fireplace."*) is passed to the pipeline. The pipeline generates a 16-frame video clip that visually interprets the textual prompt with smooth temporal coherence.

4. **Exporting the Result**:
   The generated frames are compiled into a video using the `export_to_video` utility from `diffusers.utils`. The output is saved as an `.mp4` file with customizable frame rate and quality settings.
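Step 2 loads the weights in `float16` whatever the device. Half precision is mainly intended for GPU inference and can be slow or unsupported on CPU, depending on the PyTorch version. The following is a minimal sketch of choosing the precision from the device. The helper name `pick_device_and_dtype` is hypothetical and not part of `diffusers`. It returns plain strings so that it runs without `torch` installed.

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Return a (device, dtype_name) pair for inference.

    Assumption: half precision is only worth using on a CUDA GPU,
    and full precision is the safer default on CPU.
    """
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


# In the notebook this could be wired up roughly as:
#   device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
#   torch_dtype = getattr(torch, dtype_name)
print(pick_device_and_dtype(True))
print(pick_device_and_dtype(False))
```

Returning the dtype as a name keeps the decision logic testable on machines without a GPU or without PyTorch.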

### Key Features:

* **Latte-1** supports both **text-to-image** and **text-to-video** generation, controlled via the `video_length` parameter (`1` for a single image, `16` for a video clip).
* Swapping in the **temporal VAE decoder** lets decoding take neighboring frames into account, which helps keep motion smooth and coherent across frames.
* Output is easily exportable to common video formats for playback, sharing, or downstream use.
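Because `video_length` fixes the number of frames and the export step sets the frame rate, the playback length of a clip follows from a simple division. A minimal sketch, using a hypothetical helper `clip_duration_seconds` (not part of `diffusers`) and the values used in this notebook, 16 frames at 10 frames per second:

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Playback length in seconds of a clip with num_frames frames shown at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# 16 generated frames exported at 10 frames per second
print(clip_duration_seconds(16, 10))  # 1.6
```

Lowering `fps` at export time stretches the same 16 frames over a longer clip, at the cost of choppier motion.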

This notebook is a compact starting point for experimenting with **natural language-driven video generation**, with possible uses in creative media production, prototyping animation tools, and research on video synthesis models.


# Installs

# Imports

In [None]:
import torch
from diffusers import LattePipeline
from diffusers.models import AutoencoderKLTemporalDecoder
from torchvision.utils import save_image
import imageio


# Load Model

In [1]:

torch.manual_seed(0)  # Fix the global seed so repeated runs start from the same random state

device = "cuda" if torch.cuda.is_available() else "cpu"
video_length = 16  # 1 (text-to-image) or 16 (text-to-video)

# Load the pipeline in half precision and move it to the selected device
pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to(device)

# Replace the pipeline's VAE with the temporal decoder stored in the same repository
vae = AutoencoderKLTemporalDecoder.from_pretrained("maxin-cn/Latte-1", subfolder="vae_temporal_decoder", torch_dtype=torch.float16).to(device)
pipe.vae = vae




Some weights of the model checkpoint at /root/.cache/huggingface/hub/models--maxin-cn--Latte-1/snapshots/0653024365272f061fc44d1078134df22842b687/transformer were not used when initializing LatteTransformer3DModel: 
 ['caption_projection.y_embedding']



# Generate Video

In [8]:
prompt = "Slow pan upward of blazing oak fire in an indoor fireplace."

In [9]:
video = pipe(prompt, video_length=video_length).frames[0]
#export_to_gif(video, "latte.gif")


Setting `clean_caption=True` requires the ftfy library but it was not found in your environment. Checkout the instructions on the
installation section: https://github.com/rspeer/python-ftfy/tree/master#installing and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.

Setting `clean_caption` to False...
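The warning above shows that the pipeline fell back to `clean_caption=False` because `ftfy` was not installed. A minimal sketch of checking for such an optional dependency before generation, using only the standard library; the helper name `has_module` is hypothetical:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Hypothetical pre-flight check before calling the pipeline
if not has_module("ftfy"):
    print("ftfy not found: the pipeline will fall back to clean_caption=False")
```

Installing `ftfy` and restarting the runtime, as the warning suggests, should let caption cleaning run.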


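# Export as GIF

In [None]:
# Optional: uncomment this import, then call export_to_gif(video, "latte.gif") to save an animated GIF
#from diffusers.utils import export_to_gif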

# Export as Video

In [10]:
from diffusers.utils import export_to_video

# Example usage
export_to_video(
    video_frames=video,                 # Your list of frames (NumPy arrays or PIL Images)
    output_video_path="latte.mp4",      # Output path and filename
    fps=10,                             # Frames per second
    quality=5.0,                        # Variable bitrate quality (0–10)
    bitrate=None,                       # Optional: use a fixed bitrate instead
    macro_block_size=16                 # Optional: typically 16; can be 4, 8, or 1 to disable
)


'latte.mp4'
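The `macro_block_size` argument concerns frame dimensions: many video codecs expect width and height to be divisible by the block size, and frames that are not are typically scaled up to the next multiple. A minimal sketch of that rounding follows. The helper `round_up_to_multiple` is hypothetical, and the 512-pixel frame size is an assumption about the pipeline's default output resolution.

```python
def round_up_to_multiple(value: int, block: int) -> int:
    """Round value up to the nearest multiple of block (a block of 1 or less disables rounding)."""
    if block <= 1:
        return value
    return ((value + block - 1) // block) * block


# Assumed 512-pixel frames are already divisible by 16, so nothing changes
print(round_up_to_multiple(512, 16))  # 512
# A 500-pixel side would be scaled up to the next multiple of 16
print(round_up_to_multiple(500, 16))  # 512
```

If the frame size is already a multiple of the block size, leaving `macro_block_size=16` has no effect on the output dimensions.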