## CogVideoX (Text-to-Video)

This notebook runs CogVideoX with the Diffusers library on a free-tier Colab GPU.

Average time to generate one video on the free GPU, per model:
* [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b): about 20 min per video
* [CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b): about 1 h per video

Each generation takes a while, but running the code yourself is worth it because you are not limited by a daily generation quota (as on other AI platforms). The only limit is overall GPU usage.
 * Note: Colab provides several hours of continuous GPU usage per day, and the free quota resets daily. Once it runs out, you can wait for the reset, pay for more, or run the code below locally on your own computer, where these limits do not apply.


## Install


In [None]:
!pip install diffusers==0.30.1 transformers hf_transfer
# !pip install git+https://github.com/huggingface/accelerate
!pip install accelerate==0.33.0

Collecting diffusers==0.30.1
  Downloading diffusers-0.30.1-py3-none-any.whl.metadata (18 kB)
Collecting hf_transfer
  Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Downloading diffusers-0.30.1-py3-none-any.whl (2.6 MB)
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: hf_transfer, diffusers
  Attempting uninstall: diffusers
    Found existing installation: diffusers 0.32.1
    Uninstalling diffusers-0.32.1:
      Successfully uninstalled diffusers-0.32.1
Successfully installed diffusers-0.30.1 hf_transfer-0.1.9
Collecting accelerate==0.33.0
  Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
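Because the install cell pins `diffusers==0.30.1` and `accelerate==0.33.0`, it can be useful to confirm which versions are actually active before moving on. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version (or None)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


print(installed_versions(["diffusers", "transformers", "accelerate", "hf_transfer"]))
```

If a package was already imported in the session before being reinstalled, restarting the runtime is usually needed for the new version to take effect.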

## Imports

In [None]:
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from transformers import T5EncoderModel

In [None]:
#@title <font size="3">Function to display video directly in Colab</font>

import base64
from IPython.display import HTML
def show_video(file_name, width=500, height=320):
  # Read the file and embed it as a base64 data URI so it plays inline
  with open(file_name, 'rb') as f:
    video_encoded = base64.b64encode(f.read())
  return HTML(data = '''<video width="{0}" height="{1}" alt="Video" controls>
                          <source src="data:video/mp4;base64,{2}" type="video/mp4" />
                        </video>'''.format(width, height, video_encoded.decode('ascii')))

## Generate!

Enter the prompt in the field below and run the cell to start inference. Wait for processing to finish, then move to the next block to display the result.  
* You can customize the parameters of the `pipe` call, such as the guidance scale and the seed (a fixed seed is set here so results are reproducible; use a random value to get a different result each time, even with the same prompt)
 * See the [Documentation](https://huggingface.co/docs/diffusers/en/api/pipelines/cogvideox) for further explanations of each parameter
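To switch between a fixed and a random seed, one option is a small helper like the one sketched here. The name `resolve_seed` is hypothetical; it only chooses the integer, which would then be handed to `torch.Generator(...).manual_seed(...)` as in the generation cell.

```python
import random


def resolve_seed(seed=None):
    """Return the given seed, or draw a random 32-bit seed when none is given."""
    if seed is None:
        seed = random.randint(0, 2**32 - 1)
    return seed


seed = resolve_seed()  # random on each run; pass 42 to reproduce the fixed-seed result
print(f"Using seed {seed}")

# In the notebook, the value would then be used as:
# generator = torch.Generator(device="cuda").manual_seed(seed)
```

Printing the seed makes it possible to reproduce a result you like later by passing the same value back in.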

In [None]:
prompt = "a beautiful waterfall during a sunny day"  #@param {type:"string"}

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),  # fixed seed for reproducibility
).frames[0]
export_to_video(video, "output.mp4", fps=8)
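The length of the exported clip follows from the `num_frames` and `fps` values used above. As a quick worked example (the helper name `clip_duration_seconds` is hypothetical):

```python
def clip_duration_seconds(num_frames, fps):
    """Duration of a clip when num_frames frames are played back at fps frames per second."""
    return num_frames / fps


duration = clip_duration_seconds(num_frames=49, fps=8)
print(f"{duration:.3f} s")  # prints "6.125 s"
```

So with these settings each generation produces a clip of roughly six seconds.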

## Show result

In [None]:
show_video("output.mp4")

## (Optional) Download video

You can download the video to your computer or save it to Google Drive, either from the side menu of this page in Colab or by running the code below.

In [None]:
from google.colab import files
files.download("output.mp4")

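For the Google Drive option, a minimal sketch is to mount Drive and copy the file into it. The helper `copy_video` and the destination folder name `cogvideox` are hypothetical; `/content/drive/MyDrive` assumes the usual Colab mount point.

```python
import shutil
from pathlib import Path


def copy_video(src, dest_dir):
    """Copy src into dest_dir (created if missing) and return the new file's path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir))


# In Colab, mount Drive first, then copy into a folder of your choice:
# from google.colab import drive
# drive.mount("/content/drive")
# copy_video("output.mp4", "/content/drive/MyDrive/cogvideox")
```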

---

# CogVideoX-5b

This model is slower but can produce better results.

Note: With a paid plan (on Colab or another cloud provider), you can access better GPUs with more VRAM. That makes generation faster and also lets you run larger models.

## Settings

`[ ! ]` If you have restarted your session or are starting a new one, re-run the import cell (at the beginning of this Colab) before running the code below

In [None]:
model_id = "THUDM/CogVideoX-5b"

transformer = CogVideoXTransformer3DModel.from_pretrained("camenduru/cogvideox-5b-float16", subfolder="transformer", torch_dtype=torch.float16)
text_encoder = T5EncoderModel.from_pretrained("camenduru/cogvideox-5b-float16", subfolder="text_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float16)

* Note: The checkpoints hosted by github.com/camenduru are used instead of the originals because they were exported with a `max_shard_size` of "5GB" when saving the model with `.save_pretrained`. The original converted model was saved with a max shard size of "10GB", which leaves Colab with insufficient CPU RAM and leads to an OOM (out of memory) error on the CPU.
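For reference, re-saving a checkpoint with smaller shards might look like the sketch below. This is an assumption about the general approach, not a record of how the hosted checkpoints were actually produced; the function name `reshard_transformer` and the output directory are hypothetical. It needs a machine with enough CPU RAM to load the original shards in the first place.

```python
def reshard_transformer(src_repo, out_dir, max_shard_size="5GB"):
    """Load the transformer weights and re-save them split into smaller shards."""
    # Imported lazily so the function can be defined before the heavy libraries load.
    import torch
    from diffusers import CogVideoXTransformer3DModel

    model = CogVideoXTransformer3DModel.from_pretrained(
        src_repo, subfolder="transformer", torch_dtype=torch.float16
    )
    # Smaller shards mean each file needs less CPU RAM while it is being loaded later.
    model.save_pretrained(out_dir, max_shard_size=max_shard_size)
    return out_dir


# Example call (not run here, since it downloads the full model):
# reshard_transformer("THUDM/CogVideoX-5b", "cogvideox-5b-transformer-5gb")
```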

## Pipeline and optimizations

> Create the pipeline and enable memory optimizations

In [None]:
pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)

### Enable memory optimizations
pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()


* Note regarding memory optimizations: sequential CPU offloading is necessary to run the model on Turing or older architectures. It keeps everything on the CPU and moves only the currently executing `nn.Module` to the GPU. This saves a lot of VRAM but adds significant inference overhead, making generation extremely slow (1 hour+). Unfortunately, this is the only way to run the model on Colab until efficient kernels are supported.
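The trade-off described in this note can be illustrated with a toy accounting model. It is a simplification that counts weights only and ignores activations and transfer time; the submodule sizes are made-up illustrative numbers, not measurements of CogVideoX, and the function name is hypothetical.

```python
def peak_gpu_memory_gb(module_sizes_gb, sequential_offload):
    """Toy estimate of peak GPU memory used by weights for a list of submodule sizes."""
    if not sequential_offload:
        # All submodules are resident on the GPU at the same time.
        return sum(module_sizes_gb)
    # Only the currently executing submodule is on the GPU, so the peak is the largest one.
    return max(module_sizes_gb)


sizes = [2.0, 3.0, 1.5, 3.5]  # illustrative sizes in GB
print(peak_gpu_memory_gb(sizes, sequential_offload=False))  # prints 10.0
print(peak_gpu_memory_gb(sizes, sequential_offload=True))   # prints 3.5
```

The lower peak comes at the cost of moving every submodule between CPU and GPU on each forward pass, which is where the extra inference time comes from.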

## Generate!

In [None]:
prompt = "a beautiful waterfall during a sunny day"  #@param {type:"string"}

video = pipe(prompt=prompt, guidance_scale=6, use_dynamic_cfg=True, num_inference_steps=50).frames[0]
export_to_video(video, "output2.mp4", fps=8)

'output2.mp4'

In [None]:
show_video("output2.mp4")

(Show GPU usage)

In [None]:
!nvidia-smi

Fri Jan 10 01:49:05 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0              33W /  70W |   7417MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
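To turn the `Memory-Usage` field of this output into a percentage, a small parser such as the following sketch can be used. The helper name `memory_usage` is hypothetical, and the sample row is copied from the output above.

```python
import re


def memory_usage(line):
    """Extract (used_mib, total_mib) from an nvidia-smi row such as '7417MiB / 15360MiB'."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    if match is None:
        raise ValueError("no 'usedMiB / totalMiB' field found")
    return int(match.group(1)), int(match.group(2))


row = "| N/A   73C    P0              33W /  70W |   7417MiB / 15360MiB |      0%      Default |"
used, total = memory_usage(row)
print(f"{used} MiB of {total} MiB used ({used / total:.1%})")
# prints "7417 MiB of 15360 MiB used (48.3%)"
```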
                                                                    