# 🎬 Text-to-Video RL Fine-Tuning (GRPO Only)

**Goal:** RL fine-tuning of text-to-video model

**Model:** `ali-vilab/text-to-video-ms-1.7b` ✅ Working
**Dataset:** `Rapidata/text-2-video-human-preferences` ✅ Loaded
**Method:** GRPO (Group Relative Policy Optimization)

**Focus:** RL fine-tuning ONLY - no extra stuff

In [None]:
# Step 1: Load Model
import torch
from diffusers import DiffusionPipeline
import warnings
warnings.filterwarnings("ignore")

print("🎬 Loading ModelScope Text-to-Video Model...\n")

pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe = pipe.to("cuda")

print(f"✅ Model on {pipe.device}")
print(f"✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Test
test_video = pipe("A cat walking", num_inference_steps=25).frames[0]
print(f"✅ Test: {len(test_video)} frames generated")

In [None]:
# Step 2: Load Human Preference Dataset
from datasets import load_dataset

print("📹 Loading Human Preference Dataset...\n")

dataset = load_dataset(
    "Rapidata/text-2-video-human-preferences",
    split="train[:1000]"
)

print(f"✅ Dataset: {len(dataset)} examples")
print(f"✅ Keys: {dataset[0].keys()}")

# Show example
ex = dataset[0]
print(f"\n📝 Example:")
for key in list(ex.keys())[:5]:
    val = ex[key]
    if isinstance(val, str):
        print(f"   {key}: {val[:80]}...")
    else:
        print(f"   {key}: {type(val)}")

In [None]:
# Step 3: Video Quality Reward Function
# Scores generated videos for GRPO

def video_quality_reward(*args, **kwargs):
    """Reward function for video generation quality"""
    prompts = kwargs.get('prompts') or kwargs.get('inputs') or (args[0] if args else [])
    videos = kwargs.get('responses') or kwargs.get('completions') or (args[1] if len(args) > 1 else [])
    
    rewards = []
    
    for prompt, video in zip(prompts, videos):
        reward = 0.0
        
        if isinstance(video, list) and len(video) > 0:
            num_frames = len(video)
            
            # Reward frame count
            if num_frames >= 14:
                reward += 3.0
            elif num_frames >= 7:
                reward += 1.5
            elif num_frames >= 3:
                reward += 0.5
            else:
                reward -= 1.0
            
            # Reward consistency
            if num_frames > 1:
                try:
                    sizes = [f.size for f in video if hasattr(f, 'size')]
                    if sizes:
                        diffs = [abs(sizes[i][0] - sizes[i+1][0]) for i in range(len(sizes)-1)]
                        avg_diff = sum(diffs) / len(diffs) if diffs else 0
                        if avg_diff < 5:
                            reward += 2.0
                        elif avg_diff < 10:
                            reward += 1.0
                except:
                    pass
            
            reward += 1.0  # Base reward
        
        reward = max(-5.0, min(10.0, reward))
        rewards.append(reward)
    
    return rewards

print("✅ Video reward function created!")

In [None]:
# Step 4: Format Dataset for GRPO
from datasets import Dataset

def format_grpo_video(examples):
    """Format for GRPO training"""
    prompts = []
    for prompt in examples.get('prompt', examples.get('text', [])):
        formatted = f"Generate a video: {prompt}"
        prompts.append(formatted)
    return {"prompt": prompts}

grpo_dataset = dataset.map(format_grpo_video, batched=True)

print(f"✅ GRPO dataset: {len(grpo_dataset)} examples")
print(f"✅ Example: {grpo_dataset[0]['prompt'][:100]}...")

In [None]:
# Step 5: GRPO Configuration
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="./text-to-video-grpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_steps=500,
    warmup_steps=50,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    logging_steps=10,
    save_steps=100,
    num_generations=4,  # Generate 4 videos per prompt
    optim="adamw_torch",
)

print("✅ GRPO Config:")
print(f"   Batch: {grpo_config.per_device_train_batch_size}")
print(f"   Generations: {grpo_config.num_generations}")
print(f"   Steps: {grpo_config.max_steps}")

In [None]:
# Step 6: RL Fine-Tuning Setup
print("🚀 RL Fine-Tuning Setup\n")

print("⚠️ Challenge: Standard GRPOTrainer expects text, not video")
print("💡 Solution: Custom training loop needed\n")

print("Custom Video RL Training Loop:")
print("-" * 50)
print("""
For each prompt:
  1. Generate 4 videos (using pipe)
  2. Score each video (reward function)
  3. Rank videos by reward
  4. Update model to prefer high-reward videos

Implementation: Custom trainer class
""")

print("\n✅ Reward function ready!")
print("✅ Dataset formatted!")
print("✅ Config ready!")
print("\n💡 Next: Implement custom video RL trainer")

## ✅ RL Fine-Tuning Ready!

**Setup Complete:**
- ✅ ModelScope model loaded
- ✅ Human preference dataset loaded
- ✅ Video reward function created
- ✅ GRPO config ready

**Next:** Implement custom video RL training loop

**205GB VRAM:** Perfect for video RL! 🚀