# 🎬 Text-to-Video Fine-Tuning with RL

## Complete Setup Using Hugging Face Models & Datasets

**Goal:** Fine-tune a text-to-video generation model with RL (GRPO/DPO)

**Your Setup:**
- ✅ 205GB VRAM - Perfect for large video models
- ✅ Unsloth available - Fast training
- ✅ ROCm GPU - AMD optimized

**Models Available:**
- Stable Video Diffusion (Image → Video)
- AnimateDiff (Text → Video)
- ModelScope Video (Text → Video)

**Datasets Available:**
- WebVid-2M (video + captions)
- MSR-VTT (video + descriptions)
- Custom video datasets

Let's build a complete fine-tuning pipeline!


In [None]:
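# Step 1: Install Dependencies
import importlib.util
import subprocess
import sys

print("📦 Installing video generation libraries...\n")

# pip name -> import name (they differ for opencv-python and pillow)
packages = {
    "diffusers": "diffusers",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "peft": "peft",
    "trl": "trl",
    "imageio": "imageio",
    "imageio-ffmpeg": "imageio_ffmpeg",  # backend imageio needs to write .mp4 files
    "opencv-python": "cv2",
    "pillow": "PIL",
}

for pkg, module in packages.items():
    # find_spec checks availability without importing, so unsloth can still be imported first later
    if importlib.util.find_spec(module) is not None:
        print(f"✅ {pkg}: Already installed")
    else:
        print(f"📦 Installing {pkg}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg, "-q"])
        print(f"✅ {pkg} installed")

# Install Unsloth
if importlib.util.find_spec("unsloth") is not None:
    print("✅ unsloth: Already installed")
else:
    print("📦 Installing unsloth...")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git", "-q"
    ])
    print("✅ unsloth installed")

print("\n✅ All libraries ready!")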


In [None]:
# Step 2: Find Text-to-Video Models on Hugging Face
from huggingface_hub import list_models

print("🔍 Searching Hugging Face for text-to-video models...\n")

# Search for video diffusion models
video_models = []

try:
    # Filter by the "text-to-video" task tag instead of matching the name substring
    models = list_models(
        filter="text-to-video",
        sort="downloads",
        direction=-1,
        limit=10
    )

    print("Top Text-to-Video Models:")
    for i, model in enumerate(models, 1):
        print(f"\n{i}. {model.id}")
        # downloads/likes can be None, so fall back to 0 before formatting
        print(f"   Downloads: {(model.downloads or 0):,}")
        print(f"   Likes: {model.likes or 0}")
        video_models.append(model.id)

except Exception as e:
    print(f"⚠️ Search error: {e}")
    print("\n💡 Manual list:")
    print("   - stabilityai/stable-video-diffusion-img2vid-xt")
    print("   - guoyww/animatediff-motion-adapter-v1-5-2")
    print("   - damo-vilab/text-to-video-ms-1.7b")
    print("   - THUDM/CogVideoX-5b")


In [None]:
# Step 3: Find Video Datasets on Hugging Face
from huggingface_hub import list_datasets

print("🔍 Searching Hugging Face for video datasets...\n")

try:
    # Named video_datasets so it does not shadow the `datasets` library used in Step 4
    video_datasets = list_datasets(
        search="video text",
        sort="downloads",
        direction=-1,
        limit=10
    )

    print("Top Video-Text Datasets:")
    for i, ds in enumerate(video_datasets, 1):
        print(f"\n{i}. {ds.id}")
        # downloads/likes can be None, so fall back to 0 before formatting
        print(f"   Downloads: {(ds.downloads or 0):,}")
        print(f"   Likes: {ds.likes or 0}")

except Exception as e:
    print(f"⚠️ Search error: {e}")
    print("\n💡 Known datasets:")
    print("   - mrm8488/webvid-2M-subset (2M video-text pairs)")
    print("   - jameseese/msr-vtt (10K videos)")
    print("   - lmms-lab/LLaVA-Video-178K (178K pairs)")
    print("   - ActivityNet/ActivityNetCaptions (20K videos)")


In [None]:
# Step 4: Load Video Dataset
from datasets import load_dataset

print("📹 Loading video dataset...\n")

# Try WebVid subset (smaller, faster)
try:
    dataset = load_dataset("mrm8488/webvid-2M-subset", split="train[:100]")
    print(f"✅ Dataset loaded: {len(dataset)} examples")

    # Show example
    example = dataset[0]
    print("\n📝 Example:")
    print(f"   Keys: {list(example.keys())}")
    if 'text' in example:
        print(f"   Text: {example['text'][:100]}...")
    if 'video' in example:
        print(f"   Video: {type(example['video'])}")

except Exception as e:
    print(f"⚠️ Dataset error: {e}")
    print("\n💡 Alternative: Create custom dataset")
    print("   Format: {'prompt': [...], 'video_path': [...]}")
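
The custom format named in the fallback message can be kept as a JSON Lines manifest, one record per clip. The sketch below is a minimal, hypothetical parser: the helper name `parse_video_manifest` and the file name `manifest.jsonl` are illustrative assumptions, not part of any library. It only checks that every record carries a non-empty `prompt` and `video_path`.

```python
import json

REQUIRED_KEYS = ("prompt", "video_path")  # field names from the custom format above


def parse_video_manifest(lines):
    """Turn JSON Lines text into a list of {'prompt', 'video_path'} dicts.

    `lines` is any iterable of strings, for example an open file object.
    Hypothetical helper for illustration; raises ValueError on malformed records.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value:
                raise ValueError(f"line {line_no}: missing or empty '{key}'")
        # Keep only the two fields the pipeline needs.
        records.append({key: record[key] for key in REQUIRED_KEYS})
    return records


# Usage (assumes a manifest file exists at this illustrative path):
# with open("manifest.jsonl", encoding="utf-8") as f:
#     records = parse_video_manifest(f)
```

Passing the file object rather than a path keeps the parser easy to test with in-memory strings.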


In [None]:
# Step 5: Setup Text-to-Video Model
import unsloth  # IMPORT FIRST!
import torch
from diffusers import StableVideoDiffusionPipeline, StableDiffusionXLPipeline
from PIL import Image

print("🎬 Loading text-to-video models...\n")

# Model 1: Image generator (for image-to-video pipeline)
print("1. Loading Stable Diffusion XL (image generator)...")
try:
    # SDXL checkpoints must be loaded with StableDiffusionXLPipeline
    pipe_img = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.bfloat16,
    )
    # ROCm builds of PyTorch also expose the GPU under the "cuda" device name
    pipe_img = pipe_img.to("cuda")
    print("   ✅ Image generator loaded!")
except Exception as e:
    print(f"   ⚠️ Error: {e}")
    pipe_img = None

# Model 2: Video generator (image → video)
print("\n2. Loading Stable Video Diffusion...")
try:
    pipe_video = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.bfloat16,
    )
    pipe_video = pipe_video.to("cuda")
    print("   ✅ Video generator loaded!")
except Exception as e:
    print(f"   ⚠️ Error: {e}")
    print("   💡 May need to download model weights first")
    pipe_video = None

# Only report success when both pipelines actually loaded
if pipe_img is not None and pipe_video is not None:
    print("\n✅ Models ready for fine-tuning!")
else:
    print("\n⚠️ At least one model failed to load. See the errors above.")


In [None]:
# Step 6: Test Video Generation
import imageio
import numpy as np

if pipe_img is not None and pipe_video is not None:
    print("🎬 Testing text-to-video generation...\n")

    # Step 1: Generate image from text
    prompt = "a futuristic city at night, neon lights, cyberpunk style"
    print(f"📝 Prompt: {prompt}")
    print("🎨 Generating image...")

    # Generate at 1024x576, the resolution Stable Video Diffusion was trained on
    image = pipe_img(prompt, num_inference_steps=20, height=576, width=1024).images[0]
    image.save("test_base_image.png")
    print("   ✅ Image generated!")

    # Step 2: Generate video from image
    print("\n🎥 Generating video from image...")
    video_frames = pipe_video(
        image,
        num_frames=14,
        decode_chunk_size=4,
    ).frames[0]

    print(f"   ✅ Generated {len(video_frames)} frames!")

    # Save video: convert the PIL frames to arrays before handing them to imageio
    # (writing .mp4 requires the imageio-ffmpeg backend)
    imageio.mimwrite("test_video.mp4", [np.asarray(frame) for frame in video_frames], fps=7)
    print("   ✅ Video saved to test_video.mp4")

    print("\n🎉 Text-to-video pipeline working!")
else:
    print("⚠️ Models not loaded. Re-run Step 5 and check its error messages.")


## 🎯 RL Fine-Tuning for Video Generation

### Challenge: Video RL Fine-Tuning

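**Problem:** Standard GRPO/DPO trainers (such as TRL's `GRPOTrainer` and `DPOTrainer`) are built for language models that emit token sequences, so they cannot score or optimize video frames directly.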

**Solutions:**

1. **Two-Stage Approach** (Recommended)
   - Stage 1: SFT on video datasets (standard fine-tuning)
   - Stage 2: RL on video quality metrics (custom rewards)

2. **Video-to-Text Model** (Easier)
   - Fine-tune a video understanding model (e.g., Qwen2.5-VL)
   - Use RL on its text outputs
   - Generate videos separately

3. **Custom Video RL Trainer** (Advanced)
   - Modify GRPOTrainer for video outputs
   - Use video quality metrics as rewards: SSIM and PSNR when a reference video exists, CLIP score for prompt alignment
   - Requires custom implementation
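
SSIM and PSNR compare a generated clip against a ground-truth clip, so they only apply to prompts that have a reference video. As a minimal sketch of how such a metric could be reduced to one scalar per clip, the function below averages per-frame PSNR, assuming both clips are arrays of identical shape `(frames, height, width, channels)` with pixel values in `[0, 255]`. The function name `video_psnr` and the array layout are assumptions made for illustration.

```python
import numpy as np


def video_psnr(generated, reference, max_value=255.0):
    """Mean per-frame PSNR in dB between two clips shaped (frames, H, W, C)."""
    gen = np.asarray(generated, dtype=np.float64)
    ref = np.asarray(reference, dtype=np.float64)
    if gen.shape != ref.shape:
        raise ValueError(f"shape mismatch: {gen.shape} vs {ref.shape}")
    # Mean squared error per frame: average over every axis except the frame axis.
    mse = ((gen - ref) ** 2).reshape(gen.shape[0], -1).mean(axis=1)
    # Identical frames have zero error; they are reported as infinite PSNR.
    with np.errstate(divide="ignore"):
        psnr = 10.0 * np.log10(max_value ** 2 / mse)
    return float(psnr.mean())
```

Higher values mean the generated frames are closer to the reference. A CLIP-based score would need a separate text-image model and is not covered by this sketch.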

### Next Steps:

1. **Collect Video Dataset**
   - Text prompts + videos
   - Format: `{"prompt": "...", "video_path": "..."}`

2. **Fine-Tune Generation** (SFT)
   - Train Stable Video Diffusion on your dataset
   - Use standard diffusion training

3. **Add RL** (Advanced)
   - Custom reward function for video quality
   - Modify GRPO trainer for video outputs
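
The core GRPO step does not depend on the output modality: sample several outputs for the same prompt, score each one with the reward function, and normalize the rewards within that group to get advantages. The sketch below covers only that normalization in plain Python, assuming a reward has already been computed for each generated clip. `group_relative_advantages` is a hypothetical helper name, not a TRL function, and the use of the population standard deviation is an assumption of this sketch.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's group of rewards to zero mean and roughly unit scale."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two samples to compare")
    mean = fmean(rewards)
    std = pstdev(rewards, mu=mean)  # population standard deviation of the group
    # eps keeps the division finite when every sample received the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Clips with a positive advantage are reinforced and clips with a negative advantage are discouraged. How that advantage enters a diffusion model's loss is the part that needs the custom trainer.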

**Your 205GB VRAM:** Perfect for this! 🚀
