<a href="https://colab.research.google.com/github/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION/blob/main/Source%20Code/Zero_Shot_Video_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">🎬 Zero-Shot Video Generation</h1>
<h3 align="center"><i>Text-to-Video Synthesis via Temporal Latent Warping & Cross-Frame Attention</i></h3>

<div align="center">

| **Author** | **Profiles** |
|:---:|:---|
| **Amey Thakur** | [![GitHub](https://img.shields.io/badge/GitHub-Amey--Thakur-181717?logo=github)](https://github.com/Amey-Thakur) [![ORCID](https://img.shields.io/badge/ORCID-0000--0001--5644--1575-A6CE39?logo=orcid)](https://orcid.org/0000-0001-5644-1575) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Amey_Thakur-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=0inooPgAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Amey_Thakur-20BEFF?logo=kaggle)](https://www.kaggle.com/ameythakur20) |

---

**Research Foundation:** Based on [Text2Video-Zero](https://arxiv.org/abs/2303.13439) by the Picsart AI Research (PAIR) team.

🚀 **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/AmeyThakur/ZERO-SHOT-VIDEO-GENERATION) | 🎬 **Video Demo:** [YouTube](https://youtu.be/za9hId6UPoY) | 💻 **Repository:** [GitHub](https://github.com/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION)

</div>

## 📖 Introduction

> **Zero-shot video generation synthesizes temporally consistent video from a text prompt without any video-specific training.**

This research implementation demonstrates **Text2Video-Zero**, a framework that turns pre-trained Text-to-Image (T2I) diffusion models into zero-shot video generators. By enforcing temporal consistency through latent manipulation and attention synchronization, it synthesizes videos without any video-specific fine-tuning.

### Core Innovations:
1.  **Temporal Latent Warping**: Enforces geometric consistency by warping latents along a motion field.
2.  **Global Cross-Frame Attention**: Synchronizes object identity by making all frames attend to the appearance of the first frame.
3.  **Background Smoothing**: Separates foreground dynamics from background stability to reduce pixel flickering.
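
The first two ideas can be illustrated with a small NumPy sketch. It is not the repository's implementation: the function names `warp_latents` and `cross_frame_attention`, the array shapes, and the use of wrap-around integer shifts in place of a continuous warp are all assumptions made for illustration.

```python
import numpy as np


def warp_latents(first_latent, num_frames, dx, dy, strength):
    """Build per-frame latents by translating the first frame's latent.

    first_latent has shape (channels, height, width). Frame i is shifted by
    round(strength * i * (dx, dy)) latent pixels. np.roll wraps around at the
    border, which is a simplification of a real interpolating warp.
    """
    frames = []
    for i in range(num_frames):
        shift_y = int(round(strength * i * dy))
        shift_x = int(round(strength * i * dx))
        frames.append(np.roll(first_latent, shift=(shift_y, shift_x), axis=(-2, -1)))
    return np.stack(frames)


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(q, k, v):
    """q, k, v have shape (frames, tokens, dim).

    Queries stay per-frame, but keys and values are taken from frame 0,
    so every frame attends to the appearance of the first frame.
    """
    k0 = np.broadcast_to(k[:1], k.shape)
    v0 = np.broadcast_to(v[:1], v.shape)
    scores = q @ k0.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v0


rng = np.random.default_rng(0)
latents = warp_latents(rng.normal(size=(4, 8, 8)), num_frames=3, dx=1.0, dy=0.0, strength=1.0)
q, k, v = (rng.normal(size=(3, 16, 32)) for _ in range(3))
out = cross_frame_attention(q, k, v)
print(latents.shape, out.shape)
```

Because the keys and values of later frames are ignored, changing them leaves the output untouched, which is what ties object appearance to the first frame in this sketch.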

## ☁️ Cloud Environment Setup
Execute the following cell to configure your environment. The script falls back through several sources so that setup can still complete when one of them is unavailable.

### Operational Workflow:
1.  **Environment Detection**: Automatically detects **Kaggle**, **Colab**, or **Local** execution.
2.  **Robust Cloning**: Attempts GitHub first, falling back to **Hugging Face Spaces** if the Git LFS budget is exceeded.
3.  **Asset Synchronization**: Automatically downloads missing neural weights (model checkpoints and annotators) from the Hugging Face Hub if they aren't provided in the local filesystem.
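
The first two steps of this workflow could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's actual setup script: the helper names `detect_environment` and `clone_with_fallback` are hypothetical, Kaggle is assumed to be identifiable by the `KAGGLE_KERNEL_RUN_TYPE` environment variable, and Colab by the presence of the `google.colab` module. Weight download from the Hugging Face Hub is not covered here.

```python
import os
import shutil
import subprocess
import sys

GITHUB_URL = "https://github.com/Amey-Thakur/ZERO-SHOT-VIDEO-GENERATION"
HF_SPACE_URL = "https://huggingface.co/spaces/AmeyThakur/ZERO-SHOT-VIDEO-GENERATION"


def detect_environment(environ=None, modules=None):
    """Return "kaggle", "colab", or "local" (heuristics are assumptions)."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    if "KAGGLE_KERNEL_RUN_TYPE" in environ:
        return "kaggle"
    if "google.colab" in modules:
        return "colab"
    return "local"


def clone_with_fallback(dest, urls=(GITHUB_URL, HF_SPACE_URL), run=subprocess.run):
    """Try each source in order and return the URL that cloned successfully."""
    for url in urls:
        result = run(["git", "clone", "--depth", "1", url, dest])
        if result.returncode == 0:
            return url
        # Remove any partial checkout so the next attempt starts clean.
        shutil.rmtree(dest, ignore_errors=True)
    raise RuntimeError("All clone sources failed: " + ", ".join(urls))


print(f"Detected environment: {detect_environment()}")
```

The `run` parameter exists only so the clone logic can be exercised without network access; in normal use it stays at its `subprocess.run` default.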

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import numpy as np
import base64

# ── Presets ──────────────────────────────────────────────────────────────────
EXAMPLES = [
    "an astronaut waving the arm on the moon",
    "a sloth surfing on a wakeboard",
    "a cute cat walking on grass",
    "a horse is galloping on a street",
    "a gorilla dancing on times square"
]

# ── UI Style (Premium Scholarly Theme) ───────────────────────────────────────
style = HTML("""
<style>
    .studio-container {
        padding: 30px;
        background-color: #1e1e2e;
        border-radius: 15px;
        border: 1px solid #313244;
        box-shadow: 0 10px 30px rgba(0,0,0,0.5);
        color: #cdd6f4;
        font-family: 'Inter', system-ui, -apple-system, sans-serif;
    }
    .widget-label { font-weight: 600; color: #a6adc8; margin-bottom: 5px; }
    .custom-button {
        background: linear-gradient(135deg, #89b4fa 0%, #74c7ec 100%) !important;
        border-radius: 8px !important;
        color: #11111b !important;
        font-weight: 700 !important;
        transition: all 0.3s ease !important;
        height: 45px !important;
        border: none !important;
    }
    .custom-button:hover {
        transform: translateY(-2px);
        box-shadow: 0 5px 15px rgba(137, 180, 250, 0.4);
    }
    .widget-dropdown > select, .widget-textarea > textarea, .widget-text > input {
        background-color: #313244 !important;
        color: #cdd6f4 !important;
        border: 1px solid #45475a !important;
        border-radius: 6px !important;
        padding: 8px !important;
    }
    .jupyter-widgets.widget-slider .ui-slider-range { background: #89b4fa !important; }
    .section-title {
        font-size: 1.5rem;
        font-weight: 800;
        margin-bottom: 20px;
        color: #f5e0dc;
        display: flex;
        align-items: center;
        gap: 10px;
    }
</style>
""")

# ── UI Components ────────────────────────────────────────────────────────────
header = widgets.HTML("<div class='section-title'>🎬 Video Generation Studio</div>")

preset_dropdown = widgets.Dropdown(
    options=[("Select a creative preset...", "")] + [(p, p) for p in EXAMPLES],
    description='<b>Style Presets</b>',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='100%', margin='0 0 15px 0')
)

prompt_input = widgets.Textarea(
    value='an astronaut waving the arm on the moon',
    placeholder='Describe your cinematic vision...',
    description='<b>Prompt</b>',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='100%', height='100px', margin='0 0 20px 0')
)

# Sliders and Number Inputs
slider_style = {'description_width': '140px'}
length_slider = widgets.IntSlider(value=8, min=4, max=24, step=1, description='Video Length', style=slider_style, layout=widgets.Layout(width='48%'))
res_dropdown = widgets.Dropdown(options=[256, 512, 768], value=512, description='Resolution', style=slider_style, layout=widgets.Layout(width='48%'))

motion_slider = widgets.FloatSlider(value=12.0, min=0.0, max=30.0, step=0.5, description='Motion Dynamics', style=slider_style, layout=widgets.Layout(width='48%'))
steps_slider = widgets.IntSlider(value=50, min=10, max=100, description='Inference Steps', style=slider_style, layout=widgets.Layout(width='48%'))

fps_slider = widgets.IntSlider(value=4, min=1, max=12, description='Frames Per Second', style=slider_style, layout=widgets.Layout(width='48%'))
seed_input = widgets.IntText(value=42, description='Random Seed', style=slider_style, layout=widgets.Layout(width='48%'))

process_btn = widgets.Button(
    description='🚀 GENERATE CINEMATIC VIDEO',
    layout=widgets.Layout(width='100%', height='50px', margin='20px 0 0 0')
)
process_btn.add_class('custom-button')

output_video = widgets.Output(layout=widgets.Layout(margin='20px 0 0 0'))

# ── Logic ────────────────────────────────────────────────────────────────────
def on_preset_change(change):
    if change['new']:
        prompt_input.value = change['new']

preset_dropdown.observe(on_preset_change, names='value')

def run_generation(b):
    process_btn.disabled = True
    process_btn.description = "⏳ SYNTHESIZING NEURONS..."
    
    with output_video:
        clear_output()
        print("\n⚡ Initializing Text-to-Video Pipeline...")
        
        output_path = "output.mp4"
        try:
            # Note: the value of steps_slider is not forwarded to the pipeline in this call.
            video_path = model.process_text2video(
                prompt=prompt_input.value,
                video_length=length_slider.value,
                resolution=res_dropdown.value,
                motion_field_strength_x=motion_slider.value,
                motion_field_strength_y=motion_slider.value,
                seed=seed_input.value,
                fps=fps_slider.value,
                path=output_path
            )
            
            with open(video_path, "rb") as f:
                data = f.read()
            b64 = base64.b64encode(data).decode()
            html = f'''
            <div style="background: #181825; padding: 20px; border-radius: 12px; border: 1px solid #313244; text-align: center;">
                <h4 style="color: #a6e3a1; margin-bottom: 15px;">✨ Synthesis Successful</h4>
                <video width="100%" style="border-radius: 8px; box-shadow: 0 4px 20px rgba(0,0,0,0.4);" controls autoplay loop>
                    <source src="data:video/mp4;base64,{b64}" type="video/mp4">
                </video>
            </div>
            '''
            display(HTML(html))
            
        except Exception as e:
            print(f"❌ Error: {e}")
        finally:
            # Always restore the button and release cached GPU memory, even on failure.
            process_btn.disabled = False
            process_btn.description = "🚀 GENERATE CINEMATIC VIDEO"
            cleanup_vram()

process_btn.on_click(run_generation)

# ── Final Assembly ───────────────────────────────────────────────────────────
ui_form = widgets.VBox([
    header,
    preset_dropdown,
    prompt_input,
    widgets.HBox([length_slider, res_dropdown], layout=widgets.Layout(justify_content='space-between')),
    widgets.HBox([motion_slider, steps_slider], layout=widgets.Layout(justify_content='space-between')),
    widgets.HBox([fps_slider, seed_input], layout=widgets.Layout(justify_content='space-between')),
    process_btn
], layout=widgets.Layout(width='700px', padding='30px'))

ui_container = widgets.VBox([style, ui_form])
ui_container.add_class('studio-container')

display(ui_container)
display(output_video)

## 1️⃣ Pipeline & HW Initialization
This cell loads the Text2Video-Zero pipeline and reports the available hardware. A **GPU with at least 15 GB of VRAM** is recommended for high-resolution output.

In [None]:
import torch
import gc
import warnings
from diffusers import DDIMScheduler
from model import Model, ModelType

# Intended to silence the "Flax classes are deprecated" warning (this notebook uses PyTorch).
# Note that this filter hides every UserWarning raised from diffusers modules.
warnings.filterwarnings("ignore", category=UserWarning, module="diffusers")

# ── Hardware Diagnostics ─────────────────────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print(f"🎯 Computation Device: {device}")
print(f"💎 Precision Mode:    {dtype}")

if device == "cuda":
    print(f"📟 GPU:               {torch.cuda.get_device_name(0)}")
    # total_memory is reported in bytes; convert it to GiB for display
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"📊 Total VRAM:        {vram:.2f} GB")

# ── Architecture Initialization ──────────────────────────────────────────────
print("⏳ Loading Neural Networks (Zero-Shot Pipeline)...")
model = Model(device=device, dtype=dtype)

def cleanup_vram():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("✅ Pipeline Operational.")

## 2️⃣ Synthesis Studio
Configure your parameters below. This interactive studio allows you to fine-tune the motion dynamics and temporal consistency of the generated video.

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import numpy as np
import base64

# ── Presets ──────────────────────────────────────────────────────────────────
EXAMPLES = [
    "an astronaut waving the arm on the moon",
    "a sloth surfing on a wakeboard",
    "a cute cat walking on grass",
    "a horse is galloping on a street",
    "a gorilla dancing on times square"
]

# ── UI Components ────────────────────────────────────────────────────────────
header = widgets.HTML("<h3>🎬 Video Generation Controls</h3>")

preset_dropdown = widgets.Dropdown(
    options=[("Select a preset...", "")] + [(p, p) for p in EXAMPLES],
    description='<b>Presets:</b>',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

prompt_input = widgets.Textarea(
    value='an astronaut waving the arm on the moon',
    placeholder='Type your creative prompt here...',
    description='<b>Prompt:</b>',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px', height='80px')
)

length_slider = widgets.IntSlider(value=8, min=4, max=24, step=1, description='Video Length (Frames)')
res_dropdown = widgets.Dropdown(options=[256, 512, 768], value=512, description='Resolution')
motion_slider = widgets.FloatSlider(value=12.0, min=0.0, max=30.0, step=0.5, description='Motion Strength')
steps_slider = widgets.IntSlider(value=50, min=10, max=100, description='Inference Steps')
fps_slider = widgets.IntSlider(value=4, min=1, max=12, description='Playback FPS')
seed_input = widgets.IntText(value=42, description='Seed')

process_btn = widgets.Button(
    description='🚀 Generate Video',
    button_style='primary',
    layout=widgets.Layout(width='500px', height='40px')
)

output_vram = widgets.Output()
output_video = widgets.Output()

# ── Logic ────────────────────────────────────────────────────────────────────
def on_preset_change(change):
    if change['new']:
        prompt_input.value = change['new']

preset_dropdown.observe(on_preset_change, names='value')

def run_generation(b):
    process_btn.disabled = True
    process_btn.description = "⏳ Processing... Please wait..."
    
    with output_video:
        clear_output()
        print("\n🚀 Initiating generation sequence...")
        
        output_path = "output.mp4"
        try:
            # Note: the value of steps_slider is not forwarded to the pipeline in this call.
            video_path = model.process_text2video(
                prompt=prompt_input.value,
                video_length=length_slider.value,
                resolution=res_dropdown.value,
                motion_field_strength_x=motion_slider.value,
                motion_field_strength_y=motion_slider.value,
                seed=seed_input.value,
                fps=fps_slider.value,
                path=output_path
            )
            
            # Display Video
            with open(video_path, "rb") as f:
                data = f.read()
            b64 = base64.b64encode(data).decode()
            html = f'''
            <div align="center">
                <br>
                <h4>✨ Generation Complete!</h4>
                <video width="{res_dropdown.value}" controls autoplay loop>
                    <source src="data:video/mp4;base64,{b64}" type="video/mp4">
                </video>
            </div>
            '''
            display(HTML(html))
            
        except Exception as e:
            print(f"❌ Error during synthesis: {e}")
        finally:
            # Always restore the button and release cached GPU memory, even on failure.
            process_btn.disabled = False
            process_btn.description = "🚀 Generate Video"
            cleanup_vram()

process_btn.on_click(run_generation)

# ── Layout ───────────────────────────────────────────────────────────────────
ui = widgets.VBox([
    header,
    preset_dropdown,
    prompt_input,
    widgets.HBox([length_slider, res_dropdown]),
    widgets.HBox([motion_slider, steps_slider]),
    widgets.HBox([fps_slider, seed_input]),
    process_btn,
    output_vram,
    output_video
])

display(ui)
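
The video display step inside `run_generation` can be isolated into a small helper, which makes it easier to reuse or test without the widgets. The sketch below mirrors the embedding approach used above; the function name `video_to_html` and its `width` default are hypothetical and not part of the repository.

```python
import base64


def video_to_html(data: bytes, width: int = 512) -> str:
    """Wrap raw MP4 bytes in an HTML <video> tag using a base64 data URI."""
    b64 = base64.b64encode(data).decode("ascii")
    return (
        f'<video width="{width}" controls autoplay loop>'
        f'<source src="data:video/mp4;base64,{b64}" type="video/mp4">'
        "</video>"
    )


# Placeholder bytes stand in for the contents of a real output.mp4 file.
snippet = video_to_html(b"abc", width=256)
print(snippet)
```

In a notebook the result would be passed to `display(HTML(...))`. Base64 encoding enlarges the payload by roughly a third, so embedding long or high-resolution clips this way increases the size of the saved notebook accordingly.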

## 📚 References
1. **Text2Video-Zero**: Khachatryan, L. et al. (2023). *Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators.* [arXiv:2303.13439](https://arxiv.org/abs/2303.13439)
2. **Dreamlike Photoreal 2.0**: [HuggingFace](https://huggingface.co/dreamlike-art/dreamlike-photoreal-2.0)
3. **Diffusers Framework**: [GitHub](https://github.com/huggingface/diffusers)

---

*Research Project | University of Windsor | Author: Amey Thakur*