<a href="https://colab.research.google.com/github/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION/blob/main/Source%20Code/Zero_Shot_Video_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">🎬 Zero-Shot Video Generation</h1>
<h3 align="center"><i>Text-to-Video Synthesis via Temporal Latent Warping & Cross-Frame Attention</i></h3>

<div align="center">

| **Author** | **Profiles** |
|:---:|:---|
| **Amey Thakur** | [![GitHub](https://img.shields.io/badge/GitHub-Amey--Thakur-181717?logo=github)](https://github.com/Amey-Thakur) [![ORCID](https://img.shields.io/badge/ORCID-0000--0001--5644--1575-A6CE39?logo=orcid)](https://orcid.org/0000-0001-5644-1575) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Amey_Thakur-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=0inooPgAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Amey_Thakur-20BEFF?logo=kaggle)](https://www.kaggle.com/ameythakur20) |

---

**Research Foundation:** Based on [Text2Video-Zero](https://arxiv.org/abs/2303.13439) by the Picsart AI Research (PAIR) team.

🚀 **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/AmeyThakur/ZERO-SHOT-VIDEO-GENERATION) | 🎬 **Video Demo:** [YouTube](https://youtu.be/za9hId6UPoY) | 💻 **Repository:** [GitHub](https://github.com/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION)

</div>

## 📖 Introduction

> **Zero-shot video generation enables creating temporally consistent videos from text prompts without requiring any video-specific training.**

This notebook implements the **Text2Video-Zero** framework, which transforms a pre-trained Stable Diffusion model into a video generator through:
1.  **Temporal Latent Warping**: Ensuring geometric consistency by warping latents along a motion field.
2.  **Global Cross-Frame Attention**: Synchronizing object appearance by making all frames attend to the first frame's appearance.
3.  **Background Smoothing**: (Optional) Applying masks to separate foreground motion from background stability.
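
The first two ideas can be illustrated with a minimal NumPy sketch. It is not the code used by this notebook's pipeline: the function names are hypothetical, the warp uses a whole-pixel `np.roll` where a real implementation would typically resample the latent on a grid, and the attention omits heads and learned projections. It only shows the shape of the computation: a translation that grows with the frame index, and attention whose keys and values come from the first frame.

```python
import numpy as np

def translation_offsets(num_frames, strength_x, strength_y):
    """Global translation per frame: frame k is shifted by k * strength (frame 0 is unmoved)."""
    return [(k * strength_x, k * strength_y) for k in range(num_frames)]

def warp_latent(latent, dx, dy):
    """Shift a (C, H, W) latent by whole pixels. Simplification: np.roll wraps at the borders."""
    return np.roll(latent, shift=(int(round(dy)), int(round(dx))), axis=(1, 2))

def cross_frame_attention(q, k, v):
    """q, k, v have shape (F, N, D). Every frame's queries attend to frame 0's keys and values."""
    k0 = np.broadcast_to(k[:1], k.shape)   # reuse first-frame keys for all frames
    v0 = np.broadcast_to(v[:1], v.shape)   # reuse first-frame values for all frames
    scores = q @ k0.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v0

# Toy usage: 3 frames, 4 tokens per frame, 8 channels
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(3, 4, 8)) for _ in range(3))
out = cross_frame_attention(q, k, v)
print(out.shape)
```

Because only the first frame's keys and values are used, changing the keys or values of later frames leaves the output unchanged, which is what ties object appearance to the first frame.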

## ☁️ Cloud Environment Setup
Execute the following cell to configure your environment. The script is designed to be **resilient** and **platform-agnostic**: each step falls back to an alternative source when the preferred one is unavailable.

### Features:
1.  **Runtime Detection**: Automatically configures paths for **Kaggle**, **Colab**, or **Local** execution.
2.  **Robust Cloning**: Attempts GitHub first, with a fallback to a **Hugging Face Mirror** if GitHub is unreachable.
3.  **LFS Defense**: Automatically handles Git LFS budget issues by falling back to **Kagglehub** for model checkpoints.
4.  **Tiered Asset Retrieval**: Prioritizes local mounts, then Kaggle datasets, then cloud downloads.
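
For the LFS case specifically, a content check can complement a size check. The sketch below is an illustration, not part of the setup cell: the helper names are hypothetical, and it relies only on the fact that a Git LFS pointer file begins with a fixed `version` line.

```python
import os

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"

def looks_like_lfs_pointer(head: bytes) -> bool:
    """True if the leading bytes of a file match the LFS pointer signature."""
    return head.startswith(LFS_SIGNATURE)

def is_lfs_pointer(path: str) -> bool:
    """True if `path` exists and looks like an LFS pointer stub rather than real content."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return looks_like_lfs_pointer(f.read(len(LFS_SIGNATURE)))

# Example: a pointer stub versus a real JSON config
print(looks_like_lfs_pointer(b"version https://git-lfs.github.com/spec/v1\noid sha256:..."))
print(looks_like_lfs_pointer(b'{"_class_name": "StableDiffusionPipeline"}'))
```

A check like this could be applied to any file expected to hold model data before deciding whether to fall back to another source.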

In [None]:
import os
import sys
import shutil
import subprocess

# ── Detect Environment ─────────────────────────────────────────────────
try:
    shell = get_ipython()
    IS_COLAB = 'google.colab' in str(shell)
    # Kaggle sets KAGGLE_KERNEL_RUN_TYPE (e.g. "Interactive" or "Batch"), so test for its presence
    IS_KAGGLE = bool(os.environ.get("KAGGLE_KERNEL_RUN_TYPE"))
except NameError:
    IS_COLAB = IS_KAGGLE = False

PROJECT_NAME = "ZERO-SHOT-VIDEO-GENERATION"
print(f"🌍 Environment: {'Google Colab' if IS_COLAB else ('Kaggle' if IS_KAGGLE else 'Local/Custom')}")

def run_setup():
    if IS_COLAB or IS_KAGGLE:
        WORKDIR = "/content" if IS_COLAB else "/kaggle/working"
        os.chdir(WORKDIR)
        
        # 1. Clone Repository (with Fallback)
        if not os.path.exists(PROJECT_NAME):
            print(f"⬇️ Cloning {PROJECT_NAME} from GitHub...")
            res = os.system(f"git clone https://github.com/Amey-Thakur/{PROJECT_NAME}")

            if res != 0 or not os.path.exists(PROJECT_NAME):
                print("⚠️ GitHub Clone Failed. Falling back to Hugging Face Mirror...")
                os.system(f"git clone https://huggingface.co/spaces/AmeyThakur/{PROJECT_NAME}")
        
        os.chdir(os.path.join(WORKDIR, PROJECT_NAME, "Source Code"))
        
        # 2. Dependency Installation
        print("🛠️ Installing Dependencies...")
        os.system("pip install -q diffusers transformers accelerate einops kornia imageio imageio-ffmpeg moviepy tomesd decord safetensors kagglehub")
        
        # 3. Model Fallback Check (Kagglehub)
        # Check if local models directory exists and has content
        models_trigger = "models/dreamlike-photoreal-2.0/model_index.json"
        if not os.path.exists(models_trigger) or os.path.getsize(models_trigger) < 100:
            print("📦 Local models missing or LFS pointers detected. Using Kaggle Fallback...")
            import kagglehub
            try:
                # IMPORTANT: Replace with your actual public dataset handle when available
                k_path = kagglehub.dataset_download("ameythakur20/zero-shot-video-gen")
                print(f"✅ Assets downloaded to {k_path}")
                
                # Link models folder
                k_models = os.path.join(k_path, "models")
                if os.path.exists(k_models):
                    # Replace any existing "models" entry; check islink first so a
                    # dangling symlink is also removed before re-linking
                    if os.path.islink("models"):
                        os.unlink("models")
                    elif os.path.exists("models"):
                        shutil.rmtree("models")
                    os.symlink(k_models, "models")
                    print("🔗 Models linked successfully.")
            except Exception as e:
                print(f"⚠️ Kagglehub download failed: {e}. Models will be downloaded from HF Hub on-demand.")

    print("✅ Environment Setup Complete.")

run_setup()

# Add Source Code to path for module imports
current_path = os.getcwd()
if current_path not in sys.path:
    sys.path.append(current_path)
print(f"📍 Python Path: {current_path}")

## 1️⃣ Hardware & Model Initialization
We initialize the Text2Video pipeline and verify GPU availability. For the best experience, a **GPU with at least 15GB VRAM** (like Colab's T4) is recommended.

In [None]:
import torch
import gc
from diffusers import DDIMScheduler
from model import Model, ModelType
from text_to_video_pipeline import TextToVideoPipeline
from utils import CrossFrameAttnProcessor

# ── Hardware Diagnostics ───────────────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print(f"🎯 Computation Device: {device}")
print(f"💎 Precision Mode:    {dtype}")

if device == "cuda":
    print(f"📟 GPU:               {torch.cuda.get_device_name(0)}")
    # The device-properties attribute is `total_memory` (in bytes); convert to GiB
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"📊 Total VRAM:        {vram:.2f} GB")

# ── Pipeline Loading ───────────────────────────────────────────────────
print("⏳ Loading Neural Networks (Stable Diffusion + T2V-Zero)...")

model_id = "dreamlike-art/dreamlike-photoreal-2.0"

# Check for local weights
local_path = os.path.abspath(os.path.join(os.getcwd(), "models", "dreamlike-photoreal-2.0"))
load_path = local_path if os.path.exists(local_path) else model_id

if os.path.exists(local_path):
    print(f"🟢 Using local weights from {local_path}")
else:
    print(f"🌐 Downloading weights from HuggingFace: {model_id}")

try:
    # Load Model Wrapper
    model = Model(device=device, dtype=dtype)
    
    # Configure Pipeline
    # We initialize the Text2VideoPipeline directly for more control in the notebook
    pipe = TextToVideoPipeline.from_pretrained(
        load_path, 
        torch_dtype=dtype
    ).to(device)
    
    # Set Scheduler
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
    
    # Apply Temporal Consistency (Cross-Frame Attention)
    attn_proc = CrossFrameAttnProcessor(unet_chunk_size=2)
    pipe.unet.set_attn_processor(processor=attn_proc)

    print("✅ Pipeline Operational. All components synchronized.")
except Exception as e:
    print(f"❌ Initialization Error: {e}")

def cleanup():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

## 2️⃣ Video Generation Interface
Enter a descriptive prompt below. The `generate` function appends **quality modifiers** to your prompt and supplies a negative prompt, which tends to produce cleaner, more cinematic frames.

In [None]:
import numpy as np
import torchvision
import imageio
from IPython.display import HTML, display
import base64

def play_video(path, width=512):
    with open(path, "rb") as f:
        data = f.read()
    b64 = base64.b64encode(data).decode()
    return HTML(f'<video width="{width}" controls autoplay loop><source src="data:video/mp4;base64,{b64}" type="video/mp4"></video>')

def generate(
    prompt,
    video_length=8,
    resolution=512,
    seed=42,
    steps=50,
    motion_strength=12.0,
    fps=4
):
    cleanup()
    generator = torch.Generator(device=device).manual_seed(seed)

    print(f"🎬 Generating: \"{prompt}\"")
    
    # Quality modifiers
    added_prompt = "high quality, HD, 8K, trending on artstation, high focus, dramatic lighting"
    negative_prompt = "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, deformed body, bloated, ugly, unrealistic"
    
    full_prompt = [f"{prompt}, {added_prompt}"] * video_length
    neg_prompt = [negative_prompt] * video_length
    
    result = pipe(
        prompt=full_prompt,
        negative_prompt=neg_prompt,
        video_length=video_length,
        height=resolution,
        width=resolution,
        num_inference_steps=steps,
        guidance_scale=7.5,
        motion_field_strength_x=motion_strength,
        motion_field_strength_y=motion_strength,
        generator=generator,
        output_type="numpy",
        frame_ids=list(range(video_length))
    )
    
    # Create Video
    frames = result.images
    video_frames = []
    for frame in frames:
        # With output_type="numpy" the frames are expected as float arrays in [0, 1],
        # stacked as [F, H, W, C]; clip before the 8-bit conversion to avoid wrap-around.
        img = (np.clip(frame, 0.0, 1.0) * 255).astype(np.uint8)
        video_frames.append(img)
    
    output_path = "output.mp4"
    imageio.mimsave(output_path, video_frames, fps=fps)
    return output_path

# ── Configuration ──────────────────────────────────────────────────────
PROMPT = "an astronaut waving the arm on the moon"
SEED = 42
FRAMES = 8

video_file = generate(PROMPT, video_length=FRAMES, seed=SEED)
display(play_video(video_file))

## 3️⃣ Advanced Visualization
Analyze the temporal consistency by viewing the frames in a grid. Note the consistent identity of objects across the sequence.
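
Visual inspection can be paired with a rough number. The sketch below is a simple proxy and not a metric from the Text2Video-Zero paper: it assumes frames are `uint8` RGB arrays of identical shape, and the function name is hypothetical. It reports the mean absolute pixel change between consecutive frames, so intended motion raises the value as well as flicker.

```python
import numpy as np

def frame_to_frame_change(frames):
    """Mean absolute pixel difference between consecutive frames, scaled to [0, 1].

    `frames` is a sequence of uint8 arrays shaped (H, W, C); returns one value per
    consecutive pair, i.e. an array of length len(frames) - 1.
    """
    stack = np.asarray(frames, dtype=np.float32) / 255.0
    if stack.shape[0] < 2:
        return np.zeros(0, dtype=np.float32)
    return np.abs(stack[1:] - stack[:-1]).mean(axis=(1, 2, 3))

# Toy usage: three identical black frames show no change at all
static = np.zeros((3, 2, 2, 3), dtype=np.uint8)
print(frame_to_frame_change(static))
```

Applied to the RGB frames read back from `output.mp4`, a sudden spike between two frames would point to a visible jump worth checking in the grid.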

In [None]:
import matplotlib.pyplot as plt
import cv2

def show_frames(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret: break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    
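    # squeeze=False keeps `axes` two-dimensional even when the video has a single frame
    fig, axes = plt.subplots(1, len(frames), figsize=(20, 5), squeeze=False)
    for i, f in enumerate(frames):
        axes[0, i].imshow(f)
        axes[0, i].axis('off')
        axes[0, i].set_title(f"Frame {i+1}")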
    plt.show()

show_frames(video_file)

---
**Clean Environment Memory**

In [None]:
cleanup()
print("🧹 Memory cleared. System ready for next generation.")

## 📚 References
1. **Text2Video-Zero**: Khachatryan et al. [arXiv:2303.13439](https://arxiv.org/abs/2303.13439)
2. **Dreamlike Photoreal 2.0**: [Hugging Face](https://huggingface.co/dreamlike-art/dreamlike-photoreal-2.0)
3. **Diffusers Library**: [GitHub](https://github.com/huggingface/diffusers)