<a href="https://colab.research.google.com/github/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION/blob/main/Source%20Code/ZERO-SHOT-VIDEO-GENERATION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#
<h1 align="center">🎬 Zero-Shot Video Generation</h1>
<h3 align="center"><i>Text-to-Video Synthesis via Temporal Latent Warping & Cross-Frame Attention</i></h3>

<div align="center">

| **Author** | **Profiles** |
|:---:|:---|
| **Amey Thakur** | [![GitHub](https://img.shields.io/badge/GitHub-Amey--Thakur-181717?logo=github)](https://github.com/Amey-Thakur) [![ORCID](https://img.shields.io/badge/ORCID-0000--0001--5644--1575-A6CE39?logo=orcid)](https://orcid.org/0000-0001-5644-1575) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Amey_Thakur-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=0inooPgAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Amey_Thakur-20BEFF?logo=kaggle)](https://www.kaggle.com/ameythakur20) |

---

**Research Foundation:** Based on [Text2Video-Zero](https://arxiv.org/abs/2303.13439) by the Picsart AI Research (PAIR) team.

🚀 **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/AmeyThakur/ZERO-SHOT-VIDEO-GENERATION) | 🎬 **Video Demo:** [YouTube](https://youtu.be/za9hId6UPoY) | 💻 **Repository:** [GitHub](https://github.com/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION)

</div>

## 📖 Introduction

> **Zero-shot video generation enables creating temporally consistent videos from text prompts without requiring any video-specific training.**

This implementation uses the **Text2Video-Zero** framework, which repurposes pre-trained text-to-image diffusion models for video synthesis. By applying temporal latent warping and global cross-frame attention, the pipeline keeps structure and appearance consistent across generated frames without fine-tuning on video datasets.
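
To make the cross-frame attention idea concrete, here is a minimal pure-Python sketch. It is not the pipeline's implementation (which operates on batched tensors inside the diffusion UNet); the function names `softmax`, `attend`, and `cross_frame_attention` are hypothetical and chosen only for illustration. The point it shows is that every frame keeps its own queries but reads the keys and values of the first frame.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for one frame (lists of token vectors)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        # Similarity of this query token to every key token
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        # Weighted sum of the value vectors
        output.append([sum(w * v[d] for w, v in zip(weights, values))
                       for d in range(len(values[0]))])
    return output

def cross_frame_attention(frames_q, frames_k, frames_v):
    """Each frame attends to the keys and values of frame 0 instead of its own."""
    return [attend(q, frames_k[0], frames_v[0]) for q in frames_q]

# Toy example: two frames, two tokens per frame, two channels per token
q = [[1.0, 0.0], [0.0, 1.0]]
k0 = [[1.0, 0.0], [0.0, 1.0]]
v0 = [[1.0, 2.0], [3.0, 4.0]]
frames = cross_frame_attention([q, q], [k0, k0], [v0, v0])
print(frames[1])
```

Because the keys and values always come from frame 0, the appearance information each frame draws on is shared, which is the intuition behind the identity and background preservation described here.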

### Core Methodology
1.  **Temporal Latent Warping**: Ensures consistent motion dynamics by warping latent representations according to defined motion fields.
2.  **Cross-Frame Attention**: Replaces standard self-attention with cross-frame attention, so each frame attends to the keys and values of the first frame and preserves identity and background details.
3.  **Background Smoothing**: Optionally detects salient objects and applies specialized warping to the background to reduce temporal flickering.
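
The warping step in item 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's code: it assumes a global translation whose magnitude grows linearly with the frame index, and it shifts a small 2D grid by whole cells with border clamping, whereas a real implementation would typically resample latent tensors with interpolation. The names `translation_offsets` and `shift_grid` are hypothetical.

```python
def translation_offsets(num_frames, strength_x, strength_y):
    """Per-frame (dx, dy) shifts that grow linearly with the frame index.

    Frame 0 is the reference and receives no shift.
    """
    return [(strength_x * k, strength_y * k) for k in range(num_frames)]

def shift_grid(grid, dx, dy):
    """Shift a 2D grid by whole cells (integer dx, dy), clamping reads at the border."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[min(max(y - dy, 0), height - 1)][min(max(x - dx, 0), width - 1)]
         for x in range(width)]
        for y in range(height)
    ]

# Offsets for three frames with the same strength on both axes
print(translation_offsets(3, 12.0, 12.0))

# Shifting a toy 3x3 "latent" one cell to the right
latent = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
print(shift_grid(latent, 1, 0))
```

Applying a progressively larger shift to the first frame's latent gives later frames a shared starting structure with coherent global motion, which is the role the motion field plays in the list above.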

## ☁️ Cloud Environment Setup
Run the following cell to prepare the environment. On **Google Colab** and **Kaggle** it sets the working directory, clones the repository, installs dependencies, and downloads model weights; in any other environment it leaves the workspace untouched.

### Automated Procedures
1.  **Repository Synchronization**: Clones the core project and navigates to the source directory.
2.  **Dependency Management**: Installs required libraries including `diffusers`, `transformers`, `einops`, and `kornia`.
3.  **Asset Acquisition**: Retrieves auxiliary annotator weights (body pose, hand pose, MiDaS depth, and UperNet segmentation) from Hugging Face for advanced inference modes.
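
The environment decision behind these steps can be isolated as a small pure function, which makes it easy to check outside a notebook. The sketch below is illustrative only: `detect_environment`, `workspace_root`, and `WORKSPACE_ROOTS` are hypothetical names, and the detection signals (the `google.colab` module being loaded, the `KAGGLE_KERNEL_RUN_TYPE` variable being set) are assumptions about how those platforms usually identify themselves.

```python
import os
import sys

# Workspace roots used by the setup cell for each cloud provider
WORKSPACE_ROOTS = {"colab": "/content", "kaggle": "/kaggle/working"}

def detect_environment(environ=None, loaded_modules=None):
    """Return "colab", "kaggle", or "local" based on simple platform signals.

    Both arguments are injectable so the logic can be exercised without
    actually running on either platform.
    """
    environ = os.environ if environ is None else environ
    loaded_modules = sys.modules if loaded_modules is None else loaded_modules
    if "google.colab" in loaded_modules:
        return "colab"
    if environ.get("KAGGLE_KERNEL_RUN_TYPE"):
        return "kaggle"
    return "local"

def workspace_root(environment, fallback="."):
    """Map an environment name to its working directory."""
    return WORKSPACE_ROOTS.get(environment, fallback)

print(detect_environment(), workspace_root(detect_environment()))
```

Passing the environment mapping explicitly, for example `detect_environment({"KAGGLE_KERNEL_RUN_TYPE": "Batch"}, {})`, lets the Kaggle branch be verified on a local machine.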

In [None]:
import os
import sys
import shutil
import subprocess

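# Environment detection for automated setup
try:
    shell = get_ipython()
    IS_COLAB = 'google.colab' in str(shell)
except NameError:
    # Not running under IPython (plain Python interpreter)
    IS_COLAB = False
# Kaggle kernels set this variable (e.g. "Interactive" or "Batch"), so test for its presence
IS_KAGGLE = bool(os.environ.get("KAGGLE_KERNEL_RUN_TYPE"))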

PROJECT_NAME = "ZERO-SHOT-VIDEO-GENERATION"
print(f"🌍 Detected Environment: {'Google Colab' if IS_COLAB else ('Kaggle' if IS_KAGGLE else 'Local/Custom')}")

def initialize_environment():
    """
    Performs workspace initialization and dependency resolution.
    """
    if IS_COLAB or IS_KAGGLE:
        # Establish workspace root based on cloud provider
        WORKDIR = "/content" if IS_COLAB else "/kaggle/working"
        os.chdir(WORKDIR)
        
        # Clone repository if not present
        if not os.path.exists(PROJECT_NAME):
            print(f"⬇️ Cloning {PROJECT_NAME}...")
            # check=True raises if the clone fails instead of continuing silently
            subprocess.run(["git", "clone", f"https://github.com/Amey-Thakur/{PROJECT_NAME}"], check=True)
        
        # Transition to source code directory
        os.chdir(os.path.join(WORKDIR, PROJECT_NAME, "Source Code"))
        
        print("🛠️ Installing required neural engine dependencies...")
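        # Install via the running interpreter so packages land in this kernel's environment
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q",
             "diffusers", "transformers", "accelerate", "einops", "kornia",
             "imageio", "imageio-ffmpeg", "moviepy", "tomesd", "decord",
             "safetensors", "huggingface_hub", "ipywidgets"],
            check=True,
        )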
        
        # Neural asset management (Annotators/CKPTS)
        from huggingface_hub import hf_hub_download
        annotators = {
            "body_pose_model.pth": "lllyasviel/Annotators",
            "hand_pose_model.pth": "lllyasviel/Annotators",
            "dpt_hybrid-midas-501f0c75.pt": "lllyasviel/Annotators",
            "upernet_global_small.pth": "lllyasviel/Annotators"
        }
        os.makedirs("annotator/ckpts", exist_ok=True)
        for f, repo in annotators.items():
            target = os.path.join("annotator/ckpts", f)
            if not os.path.exists(target):
                print(f"⬇️ Downloading neural weight: {f}")
                path = hf_hub_download(repo_id=repo, filename=f)
                shutil.copy(path, target)
                
    print("✅ Environment workspace established.")

initialize_environment()
if os.getcwd() not in sys.path: 
    sys.path.append(os.getcwd())

## 1️⃣ Framework Initialization

This section initializes the primary diffusion model and configures hardware acceleration. The model is loaded in **half precision (FP16)** when CUDA is available, which reduces VRAM use and speeds up inference; on CPU it falls back to FP32.
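
The memory effect of the precision choice is simple arithmetic: FP16 stores each parameter in 2 bytes, FP32 in 4. The sketch below estimates weight storage only (activations and framework overhead are ignored). The parameter count is an assumed, approximate figure for a Stable Diffusion 1.x style UNet, used purely for illustration, and `weight_memory_gib` is a hypothetical helper.

```python
# Storage size per parameter for each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gib(num_params, dtype_name):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1024**3

# Assumed, approximate parameter count (illustrative, not measured from this project)
unet_params = 860_000_000

for name in ("float32", "float16"):
    print(f"{name}: {weight_memory_gib(unet_params, name):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why FP16 is the default whenever a CUDA device is present.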

In [None]:
import torch
import gc
import warnings
import transformers
import diffusers
import importlib

# Automatic module reloading to synchronize local disk changes with the active kernel
import text_to_video_pipeline
import model
importlib.reload(text_to_video_pipeline)
importlib.reload(model)
from model import Model, ModelType

# Suppress library warnings and info logs to keep cell output readable
warnings.filterwarnings("ignore", category=UserWarning, module="diffusers")
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")
transformers.logging.set_verbosity_error()
diffusers.logging.set_verbosity_error()

# Hardware Discovery and Precision Orchestration
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

print(f"🎯 Computation Device: {device}")
print(f"⚙️ Mathematical Precision: {dtype}")

if device == "cuda":
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"📟 GPU Engine: {torch.cuda.get_device_name(0)} ({vram:.2f} GB Total)")

# Initialization of the Zero-Shot Video Pipeline
print("⏳ Instantiating neural pipeline...")
video_model = model.Model(device=device, dtype=dtype)

def clear_hardware_cache():
    """
    Performs garbage collection and clears CUDA memory buffers.
    """
    gc.collect()
    if torch.cuda.is_available(): 
        torch.cuda.empty_cache()

print("✅ Pipeline operational.")

## 2️⃣ Video Generation Studio

The interface below facilitates the generation of temporally consistent video sequences. Users can select from curated presets or define custom prompts. Higher frame counts and resolutions will increase computation time and memory requirements.
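
A rough sense of how cost scales with the two sliders can be had from the latent dimensions. The sketch below assumes a Stable Diffusion 1.x style model, where the autoencoder downsamples each side by 8 and the latent has 4 channels; the helper names are hypothetical, and the numbers describe tensor sizes rather than measured runtime.

```python
def latent_shape(num_frames, resolution, channels=4, downscale=8):
    """Shape of the latent batch: one latent per frame."""
    side = resolution // downscale
    return (num_frames, channels, side, side)

def attention_pairs(resolution, downscale=8):
    """Token pairs compared per frame in full-resolution attention (quadratic in tokens)."""
    tokens = (resolution // downscale) ** 2
    return tokens * tokens

def clip_seconds(num_frames, fps=4):
    """Playback duration of the generated clip."""
    return num_frames / fps

print(latent_shape(8, 512))
print(attention_pairs(768) / attention_pairs(512))
print(clip_seconds(8))
```

Frame count scales the latent batch linearly, while resolution affects attention cost much more steeply, so raising the resolution is usually the more expensive change.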

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import base64

# Curated Prompt Presets for structural validation
EXPERIMENTAL_PRESETS = [
    "an astronaut waving the arm on the moon",
    "a sloth surfing on a wakeboard",
    "a cute cat walking on grass",
    "a horse is galloping on a street",
    "a gorilla dancing on times square"
]

# UI component instantiation
presets = widgets.Dropdown(
    options=[("Select a curated prompt...", "")] + [(p, p) for p in EXPERIMENTAL_PRESETS], 
    description='Presets:', 
    layout={'width': '600px'}
)

prompt_textarea = widgets.Textarea(
    value='an astronaut waving the arm on the moon', 
    description='Prompt:', 
    layout={'width': '600px', 'height': '60px'}
)

frame_slider = widgets.IntSlider(
    value=8, min=4, max=24, 
    description='Frames:', 
    layout={'width': '300px'}
)

resolution_dropdown = widgets.Dropdown(
    options=[256, 512, 768], 
    value=512, 
    description='Resolution:', 
    layout={'width': '300px'}
)

generate_button = widgets.Button(
    description='🎬 Generate Video', 
    button_style='primary', 
    layout={'width': '200px'}
)

output_widget = widgets.Output()

def handle_preset_selection(change):
    """
    Synchronizes the prompt text area with selected preset.
    """
    if change['new']:
        prompt_textarea.value = change['new']
presets.observe(handle_preset_selection, names='value')

def initiate_synthesis(b):
    """
    Triggers the text-to-video inference engine.
    """
    generate_button.disabled = True
    generate_button.description = "Synthesizing..."
    
    with output_widget: 
        clear_output()
        print("⏳ Processing neural frames...")
        try:
            # Execute inference via the Model wrapper
            final_path = video_model.process_text2video(
                prompt=prompt_textarea.value, 
                video_length=frame_slider.value, 
                resolution=resolution_dropdown.value, 
                motion_field_strength_x=12.0, 
                motion_field_strength_y=12.0, 
                seed=42, fps=4, path="output.mp4"
            )
            
            # Base64-encode the video so it can be embedded inline in the notebook output
            if final_path and os.path.exists(final_path):
                with open(final_path, "rb") as video_file: 
                    encoded_data = base64.b64encode(video_file.read()).decode()
                
                display(HTML(f'''
                    <div align="center" style="margin-top: 20px;">
                        <video width="{resolution_dropdown.value}" controls autoplay loop style="border-radius: 12px; border: 1px solid #ddd; box-shadow: 0 4px 15px rgba(0,0,0,0.1);">
                            <source src="data:video/mp4;base64,{encoded_data}" type="video/mp4">
                        </video>
                    </div>
                '''))
            else:
                print("⚠️ Output file generation failed. Check terminal logs.")
        
        except Exception as synthesis_error: 
            print(f"❌ Synthesis Error: {synthesis_error}")
            
    generate_button.disabled = False
    generate_button.description = "🎬 Generate Video"
    clear_hardware_cache()

generate_button.on_click(initiate_synthesis)

# UI Layout Assembly
header_header = widgets.HTML("<h3 style='text-align: center; width: 600px; margin-bottom: 20px; font-weight: 800;'>🎬 Zero-Shot Video Studio</h3>")
interactive_elements = widgets.VBox([
    header_header,
    presets,
    prompt_textarea,
    widgets.HBox([frame_slider, resolution_dropdown], layout={'width': '600px', 'justify_content': 'space-between', 'margin_bottom': '10px'}),
    widgets.HBox([generate_button], layout={'width': '600px', 'justify_content': 'center', 'margin_top': '10px'})
], layout={'align_items': 'center'})

studio_frame = widgets.VBox([interactive_elements], layout=widgets.Layout(
    border='1px solid #e0e0e0', 
    padding='25px', 
    border_radius='15px', 
    margin='10px 0',
    background_color='#ffffff',
    box_shadow='0 10px 30px rgba(0,0,0,0.05)'
))

display(studio_frame)
display(output_widget)

## 📚 Technical References
1. **Text2Video-Zero**: [Picsart AI Research (PAIR) - arXiv:2303.13439](https://arxiv.org/abs/2303.13439)
2. **Diffusers Framework**: [Hugging Face Documentation](https://huggingface.co/docs/diffusers/index)
3. **Dreamlike Photoreal 2.0**: [Stable Diffusion Checkpoint](https://huggingface.co/dreamlike-art/dreamlike-photoreal-2.0)

---
*Research Laboratory for Neural Video Synthesis*