## Open notebook in:

| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH05/ch05_tora.ipynb) |

# About this Notebook

This notebook demonstrates how to generate **trajectory-guided videos** using **Tora**, a trajectory-oriented diffusion framework developed by Alibaba's Research & Intelligence Computing team. Tora extends conventional text-to-video models by introducing **spatial control via 2D trajectory prompts**, enabling users to influence the motion path of generated objects in video clips.

### Steps Included:

1. **Repository Setup and Environment Configuration**:
   The Tora GitHub repository is cloned and the working directory is changed to the `sat/` subdirectory where the codebase resides. Required dependencies are installed, including `huggingface_hub` with fast transfer enabled to streamline model checkpoint downloads.

2. **Model Checkpoint Download**:
   The pretrained **Tora** weights, which build on **CogVideoX-5B** and add sparse trajectory conditioning, are downloaded directly from the Hugging Face Hub into the local `ckpts` directory. The text-to-video checkpoint used for inference is loaded from `ckpts/tora/t2v`.

3. **Inference with Trajectory Guidance**:
   The main generation script `sample_video.py` is executed using `torchrun` with one GPU. It reads:

   * A **text prompt** file (`assets/text/t2v/examples.txt`) that defines the visual concept (e.g., "a drone flying over a forest").
   * A **trajectory file** (`trajs/coaster.txt`) that defines a 2D motion path.

   The script generates videos consistent with both the text prompt and the trajectory and stores them in the `samples` directory.

4. **Interactive App Launch**:
   A Gradio-based web interface is started using `app.py`, allowing users to upload their own prompts and trajectories to interactively generate videos without modifying the source code.
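
To experiment with a path other than `trajs/coaster.txt` in step 3, a trajectory can be generated programmatically. The sketch below is illustrative only: the function names, the canvas size, the number of points, and above all the one-`x,y`-pair-per-line file layout are assumptions that this notebook does not confirm. Inspect `trajs/coaster.txt` in the cloned repository and match its format before passing a generated file to `--point_path`.

```python
import math
from pathlib import Path


def make_arc_trajectory(num_points, width, height):
    """Return (x, y) pixel points tracing a left-to-right arc (hypothetical helper)."""
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)  # progress along the path, from 0.0 to 1.0
        x = round(0.1 * width + 0.8 * width * t)
        # sin(pi * t) is 0 at both ends and 1 in the middle, so the path
        # moves up toward the middle of the clip and back down again
        # (image y grows downward, hence the subtraction).
        y = round(0.7 * height - 0.4 * height * math.sin(math.pi * t))
        points.append((x, y))
    return points


def write_trajectory(points, path):
    """Write one 'x,y' pair per line.

    ASSUMPTION: this layout is a guess; compare with trajs/coaster.txt first.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{x},{y}\n" for x, y in points))
    return path


# Placeholder canvas size and point count, chosen only for illustration.
points = make_arc_trajectory(num_points=16, width=256, height=256)
print(points[:3])
# write_trajectory(points, "trajs/my_arc.txt")  # hypothetical output file
```

Keeping the path generation separate from the file writing makes it easy to swap in a different curve, or a different file layout once the real format has been checked.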

### Highlights:

* Combines **language-based semantics** with **trajectory control**, making video generation more precise and expressive.
* Enables experimentation with **sparse trajectory inputs**, which impose fewer constraints on the generated motion than dense trajectory supervision.
* Supports **multi-modal input**, integrating language and spatial motion cues into the video generation process.


# Installs

In [None]:
# Clone this repository.
!git clone https://github.com/Nicolepcx/Tora
%cd Tora

Cloning into 'Tora'...
remote: Enumerating objects: 710, done.
remote: Counting objects: 100% (710/710), done.
remote: Compressing objects: 100% (536/536), done.
remote: Total 710 (delta 204), reused 624 (delta 143), pack-reused 0 (from 0)
Receiving objects: 100% (710/710), 4.85 MiB | 20.67 MiB/s, done.
Resolving deltas: 100% (204/204), done.
/content/Tora


In [None]:
!ls

CogVideoX_LICENSE  LICENSE  pyproject.toml  sat
diffusers-version  modules  README.md


In [None]:
%cd sat/

/content/Tora/sat


In [None]:
!pip install -r requirements.txt -q

Collecting SwissArmyTransformer==0.4.12 (from -r requirements.txt (line 1))
  Downloading SwissArmyTransformer-0.4.12-py3-none-any.whl.metadata (9.6 kB)
Collecting pytorch_lightning==2.3.3 (from -r requirements.txt (line 2))
  Downloading pytorch_lightning-2.3.3-py3-none-any.whl.metadata (21 kB)
Collecting kornia==0.7.3 (from -r requirements.txt (line 3))
  Downloading kornia-0.7.3-py2.py3-none-any.whl.metadata (7.7 kB)
Collecting beartype==0.18.5 (from -r requirements.txt (line 4))
  Downloading beartype-0.18.5-py3-none-any.whl.metadata (30 kB)
Collecting fsspec==2024.5.0 (from -r requirements.txt (line 5))
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Collecting decord==0.6.0 (from -r requirements.txt (line 6))
  Downloading decord-0.6.0-py3-none-manylinux2010_x86_64.whl.metadata (422 bytes)
Collecting deepspeed==0.15.1 (from -r requirements.txt (line 7))
  Downloading deepspeed-0.15.1.tar.gz (1.4 MB)

In [None]:
#!pip install modelscope -q
!pip install "huggingface_hub[hf_transfer]"
!HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Alibaba-Research-Intelligence-Computing/Tora --local-dir ckpts

Downloading '.gitattributes' to 'ckpts/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.a6344aac8c09253b3b630fb776ae94478aa0275b.incomplete'
.gitattributes: 100% 1.52k/1.52k [00:00<00:00, 11.9MB/s]
Download complete. Moving file to ckpts/.gitattributes
Downloading 'CogVideoX_LICENSE' to 'ckpts/.cache/huggingface/download/k5CF3exemdNS-lSlKN72gCTS5nA=.188f301bad0622302715b6caac6ea934414e77fa.incomplete'
CogVideoX_LICENSE: 100% 5.70k/5.70k [00:00<00:00, 29.8MB/s]
Download complete. Moving file to ckpts/CogVideoX_LICENSE
Downloading 'LICENSE' to 'ckpts/.cache/huggingface/download/DhCjcNQuMpl4FL346qr3tvNUCgY=.261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64.incomplete'
LICENSE: 100% 11.4k/11.4k [00:00<00:00, 48.7MB/s]
Download complete. Moving file to ckpts/LICENSE
Downloading 'README.md' to 'ckpts/.cache/huggingface/download/Xn7B-BWUGOee2Y6hCZtEhtFu4BE=.b945edd8c6d118ef9f1691414b9beb5a7fc8be6a.incomplete'
README.md: 100% 9.20k/9.20k [00:00<00:00, 36.9MB/s]
Download complete. Moving file to
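
Before moving on to inference, it can help to confirm that the download and the repository layout match what the later commands expect. The following is a minimal sketch, assuming the working directory is still `sat/`. The helper name `find_missing` is hypothetical, and the path list is simply collected from the commands used in this notebook.

```python
from pathlib import Path

# Paths referenced by the inference and Gradio commands in this notebook,
# all relative to the sat/ directory.
REQUIRED_PATHS = [
    "ckpts/tora/t2v",
    "configs/tora/model/cogvideox_5b_tora.yaml",
    "configs/tora/inference_sparse.yaml",
    "trajs/coaster.txt",
    "assets/text/t2v/examples.txt",
]


def find_missing(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


missing = find_missing(REQUIRED_PATHS)
if missing:
    print("Missing before inference:")
    for p in missing:
        print("  ", p)
else:
    print("All expected files and directories are present.")
```

A check like this fails fast with a readable message instead of a stack trace several minutes into model loading.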

# Run inference

In [None]:
!N_GPU=1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=1 sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt

[2025-05-30 03:58:01,747] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2025-05-30 03:58:09.518104: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-30 03:58:10.226102: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748577490.552699    5950 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748577490.637620    5950 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-30 03:58:11.138
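
The one-line shell command above is hard to read and awkward to vary. As a hedged alternative, the same invocation can be assembled in Python so that the trajectory and prompt files become parameters. The function names `build_inference_command` and `build_env` are hypothetical, and letting a single `n_gpu` value drive both `N_GPU` and `--nproc_per_node` is an assumption. Every flag and path is copied from the command above.

```python
import os
import shlex


def build_inference_command(point_path, input_file, output_dir="samples",
                            load_dir="ckpts/tora/t2v", n_gpu=1):
    """Assemble the torchrun argument list used in this notebook."""
    return [
        "torchrun", "--standalone", f"--nproc_per_node={n_gpu}",
        "sample_video.py",
        "--base",
        "configs/tora/model/cogvideox_5b_tora.yaml",
        "configs/tora/inference_sparse.yaml",
        "--load", load_dir,
        "--output-dir", output_dir,
        "--point_path", point_path,
        "--input-file", input_file,
    ]


def build_env(n_gpu=1):
    """Copy the current environment and add the variables set in the shell command."""
    env = dict(os.environ)
    env["N_GPU"] = str(n_gpu)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env


cmd = build_inference_command("trajs/coaster.txt", "assets/text/t2v/examples.txt")
print(shlex.join(cmd))  # inspect the command before launching it

# To actually run it (requires a GPU and the downloaded checkpoints):
# import subprocess
# subprocess.run(cmd, env=build_env(), check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when a path contains spaces.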

# Start Gradio app

In [None]:
!python app.py --load ckpts/tora/t2v

  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
2025-05-30 04:29:43.568671: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-30 04:29:43.588541: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748579383.610215   14161 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748579383.616727   14161 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-30 04:29:43.640112: I tensorflow/core/platform/cpu_feature_guard.cc:210] This T