# Video Model Evaluation Metrics
## Comprehensive Quality Assessment for Fine-tuned Text-to-Video Model

**This notebook evaluates your generated videos using 7 metrics:**

1. Temporal Consistency - Frame-to-frame coherence
2. Sharpness - Image clarity and detail
3. Contrast - Tonal range
4. Brightness - Proper exposure
5. Motion Smoothness - Fluid movement
6. CLIP Score - Text-video alignment
7. Inception Score - Diversity and realism

---

**Expected Results for 10K Model:**
- Temporal Consistency: 0.75 - 0.85
- Sharpness: 150 - 300
- Motion Smoothness: 0.55 - 0.70
- CLIP Score: 0.27 - 0.33
- Inception Score: 2.5 - 3.5

## Step 1: Install Required Libraries

In [None]:
# Install evaluation libraries
print("Installing evaluation libraries...")
!pip install -q lpips pytorch-fid torchmetrics scikit-image scikit-learn
!pip install -q git+https://github.com/openai/CLIP.git
!pip install -q opencv-python imageio matplotlib seaborn

print("All libraries installed successfully")

Installing evaluation libraries...
  Preparing metadata (setup.py) ... [?25l[?25hdone
All libraries installed successfully


In [None]:
# Install torch-fidelity for InceptionScore
print("Installing torch-fidelity...")
!pip install -q torch-fidelity
print("torch-fidelity installed successfully")

Installing torch-fidelity...
torch-fidelity installed successfully


## Step 2: Import Libraries

In [None]:
# Core libraries
import torch
import torch.nn as nn
import numpy as np
import os
import json
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
import imageio
import warnings
warnings.filterwarnings('ignore')

# Evaluation libraries
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# CLIP for text-video alignment
try:
    import clip
    CLIP_AVAILABLE = True
    print("CLIP loaded successfully")
except:
    CLIP_AVAILABLE = False
    print("Warning: CLIP not available, text-video alignment score will be skipped")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print("All libraries imported successfully")

CLIP loaded successfully
Using device: cuda
All libraries imported successfully


## Step 3: Initialize Evaluation Models

In [None]:
# Initialize evaluation models
print("Initializing evaluation models...")

# LPIPS for perceptual similarity
lpips_model = lpips.LPIPS(net='alex').to(device).eval()
print("LPIPS model loaded")

# CLIP for text-video alignment
if CLIP_AVAILABLE:
    clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
    print("CLIP model loaded")

# Inception Score calculator
inception_score_calculator = InceptionScore(normalize=True).to(device)
print("Inception Score calculator loaded")

print("All evaluation models initialized successfully")

Initializing evaluation models...
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
Loading model from: /usr/local/lib/python3.12/dist-packages/lpips/weights/v0.1/alex.pth
LPIPS model loaded
CLIP model loaded
Inception Score calculator loaded
All evaluation models initialized successfully


## Step 4: Define Evaluation Functions

In [None]:
# Metric 1: Temporal Consistency
def calculate_temporal_consistency(video_frames):
    """
    Measures frame-to-frame consistency using LPIPS perceptual similarity.

    Args:
        video_frames: Tensor of shape [T, C, H, W] normalized to [0, 1]

    Returns:
        float: Temporal consistency score (higher is better)
               Range: [0, 1], Good: >0.75
    """
    if len(video_frames) < 2:
        return 0.0

    consistencies = []

    for i in range(len(video_frames) - 1):
        frame1 = video_frames[i].unsqueeze(0).to(device)
        frame2 = video_frames[i + 1].unsqueeze(0).to(device)

        with torch.no_grad():
            # LPIPS gives distance (lower = more similar)
            distance = lpips_model(frame1, frame2).item()
            # Convert to similarity score
            similarity = max(0, 1.0 - distance)
            consistencies.append(similarity)

    avg_consistency = np.mean(consistencies)
    return avg_consistency


# Metric 2: Frame Quality (Sharpness, Contrast, Brightness)
def calculate_frame_quality(video_frames):
    """
    Measures visual quality of individual frames.

    Args:
        video_frames: Tensor of shape [T, C, H, W] normalized to [0, 1]

    Returns:
        dict: Dictionary with sharpness, contrast, and brightness scores
              Sharpness - Good: >150
              Contrast - Good: >50
              Brightness - Good: 100-150
    """
    sharpness_scores = []
    contrast_scores = []
    brightness_scores = []

    for frame in video_frames:
        # Convert to numpy
        frame_np = frame.permute(1, 2, 0).cpu().numpy()
        frame_np = (frame_np * 255).clip(0, 255).astype(np.uint8)

        # Sharpness (Laplacian variance)
        gray = cv2.cvtColor(frame_np, cv2.COLOR_RGB2GRAY)
        laplacian = cv2.Laplacian(gray, cv2.CV_64F)
        sharpness = laplacian.var()
        sharpness_scores.append(sharpness)

        # Contrast (standard deviation)
        contrast = frame_np.std()
        contrast_scores.append(contrast)

        # Brightness (mean)
        brightness = frame_np.mean()
        brightness_scores.append(brightness)

    return {
        'sharpness': np.mean(sharpness_scores),
        'contrast': np.mean(contrast_scores),
        'brightness': np.mean(brightness_scores)
    }


# Metric 3: Motion Smoothness
def calculate_motion_smoothness(video_frames):
    """
    Measures smoothness of motion using optical flow.

    Args:
        video_frames: Tensor of shape [T, C, H, W] normalized to [0, 1]

    Returns:
        float: Motion smoothness score (higher is better)
               Range: [0, 1], Good: >0.55
    """
    if len(video_frames) < 2:
        return 0.0

    flow_magnitudes = []

    for i in range(len(video_frames) - 1):
        # Convert to grayscale numpy
        frame1_np = (video_frames[i].permute(1, 2, 0).cpu().numpy() * 255).astype(np.uint8)
        frame2_np = (video_frames[i + 1].permute(1, 2, 0).cpu().numpy() * 255).astype(np.uint8)

        gray1 = cv2.cvtColor(frame1_np, cv2.COLOR_RGB2GRAY)
        gray2 = cv2.cvtColor(frame2_np, cv2.COLOR_RGB2GRAY)

        # Calculate optical flow
        flow = cv2.calcOpticalFlowFarneback(
            gray1, gray2, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )

        # Flow magnitude
        magnitude = np.sqrt(flow[..., 0]**2 + flow[..., 1]**2)
        flow_magnitudes.append(np.mean(magnitude))

    # Smoothness = inverse of variance in flow
    if len(flow_magnitudes) > 1:
        smoothness = 1.0 / (1.0 + np.var(flow_magnitudes))
    else:
        smoothness = 1.0

    return smoothness


# Metric 4: CLIP Score
def calculate_clip_score(video_frames, text_prompt):
    """
    Measures how well video matches text prompt using CLIP.

    Args:
        video_frames: Tensor of shape [T, C, H, W] normalized to [0, 1]
        text_prompt: str, text description of video

    Returns:
        float: CLIP score (higher is better)
               Range: [0, 1], Good: >0.25
    """
    if not CLIP_AVAILABLE:
        return None

    # Encode text
    text_tokens = clip.tokenize([text_prompt]).to(device)
    with torch.no_grad():
        text_features = clip_model.encode_text(text_tokens)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Encode each frame
    clip_scores = []
    for frame in video_frames:
        # Convert to PIL
        frame_np = (frame.permute(1, 2, 0).cpu().numpy() * 255).clip(0, 255).astype(np.uint8)
        frame_pil = Image.fromarray(frame_np)

        # Preprocess and encode
        frame_input = clip_preprocess(frame_pil).unsqueeze(0).to(device)

        with torch.no_grad():
            image_features = clip_model.encode_image(frame_input)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

            # Cosine similarity
            similarity = (image_features @ text_features.T).item()
            clip_scores.append(similarity)

    return np.mean(clip_scores)


# Metric 5: Inception Score
def calculate_inception_score(video_frames):
    """
    Measures diversity and quality using Inception network.

    Args:
        video_frames: Tensor of shape [T, C, H, W] normalized to [0, 1]

    Returns:
        tuple: (mean, std) of inception score
               Good: mean >2.5
    """
    inception_score_calculator.reset()

    for frame in video_frames:
        # Convert to uint8 [0, 255]
        frame_uint8 = (frame * 255).clip(0, 255).byte()
        inception_score_calculator.update(frame_uint8.unsqueeze(0).to(device))

    is_mean, is_std = inception_score_calculator.compute()
    return is_mean.item(), is_std.item()


print("All evaluation functions defined successfully")

All evaluation functions defined successfully


## Step 5: Comprehensive Evaluation Function

In [None]:
def evaluate_video_comprehensive(video_frames, text_prompt=None, verbose=True):
    """
    Evaluate a video with all metrics.

    Args:
        video_frames: Tensor [T, C, H, W] normalized to [0, 1]
        text_prompt: Optional text prompt for CLIP score
        verbose: Print progress

    Returns:
        Dictionary with all metrics
    """
    if verbose:
        print("Evaluating video...")

    results = {}

    # Temporal consistency
    if verbose:
        print("  Calculating temporal consistency...")
    results['temporal_consistency'] = calculate_temporal_consistency(video_frames)

    # Frame quality
    if verbose:
        print("  Calculating frame quality...")
    quality = calculate_frame_quality(video_frames)
    results['sharpness'] = quality['sharpness']
    results['contrast'] = quality['contrast']
    results['brightness'] = quality['brightness']

    # Motion smoothness
    if verbose:
        print("  Calculating motion smoothness...")
    results['motion_smoothness'] = calculate_motion_smoothness(video_frames)

    # CLIP score
    if text_prompt and CLIP_AVAILABLE:
        if verbose:
            print("  Calculating CLIP score...")
        results['clip_score'] = calculate_clip_score(video_frames, text_prompt)

    # Inception score
    if verbose:
        print("  Calculating Inception score...")
    is_mean, is_std = calculate_inception_score(video_frames)
    results['inception_score'] = is_mean
    results['inception_score_std'] = is_std

    if verbose:
        print("Evaluation complete")

    return results


def print_evaluation_results(metrics, prompt=None):
    """
    Print evaluation results in a formatted way.

    Args:
        metrics: Dictionary of evaluation metrics
        prompt: Optional text prompt
    """
    print("="*70)
    print("EVALUATION RESULTS")
    print("="*70)

    if prompt:
        print(f"Prompt: {prompt}")
        print("-"*70)

    # Temporal consistency
    tc = metrics['temporal_consistency']
    tc_status = "Excellent" if tc > 0.8 else "Good" if tc > 0.7 else "Fair" if tc > 0.6 else "Poor"
    print(f"Temporal Consistency: {tc:.4f} [{tc_status}]")
    print(f"  Target: >0.75, Higher = more consistent motion")

    # Sharpness
    sharp = metrics['sharpness']
    sharp_status = "Excellent" if sharp > 300 else "Good" if sharp > 150 else "Fair" if sharp > 50 else "Poor"
    print(f"\nSharpness: {sharp:.2f} [{sharp_status}]")
    print(f"  Target: >150, Higher = sharper image")

    # Contrast
    contrast = metrics['contrast']
    contrast_status = "Excellent" if contrast > 70 else "Good" if contrast > 50 else "Fair" if contrast > 30 else "Poor"
    print(f"\nContrast: {contrast:.2f} [{contrast_status}]")
    print(f"  Target: >50, Higher = better contrast")

    # Brightness
    brightness = metrics['brightness']
    brightness_status = "Good" if 100 <= brightness <= 150 else "Too Bright" if brightness > 150 else "Too Dark"
    print(f"\nBrightness: {brightness:.2f} [{brightness_status}]")
    print(f"  Target: 100-150, Proper exposure")

    # Motion smoothness
    motion = metrics['motion_smoothness']
    motion_status = "Excellent" if motion > 0.7 else "Good" if motion > 0.5 else "Fair" if motion > 0.3 else "Poor"
    print(f"\nMotion Smoothness: {motion:.4f} [{motion_status}]")
    print(f"  Target: >0.55, Higher = smoother motion")

    # CLIP score
    if 'clip_score' in metrics:
        clip = metrics['clip_score']
        clip_status = "Excellent" if clip > 0.3 else "Good" if clip > 0.25 else "Fair" if clip > 0.2 else "Poor"
        print(f"\nCLIP Score: {clip:.4f} [{clip_status}]")
        print(f"  Target: >0.25, Higher = better text alignment")

    # Inception score
    inception = metrics['inception_score']
    inception_status = "Excellent" if inception > 3.5 else "Good" if inception > 2.5 else "Fair" if inception > 1.5 else "Poor"
    print(f"\nInception Score: {inception:.4f} +/- {metrics['inception_score_std']:.4f} [{inception_status}]")
    print(f"  Target: >2.5, Higher = more realistic")

    # Overall assessment
    scores = [tc, motion, sharp/500, contrast/80]
    if 'clip_score' in metrics:
        scores.append(metrics['clip_score'] * 2.5)
    overall = np.mean(scores)

    print("\n" + "="*70)
    print("OVERALL ASSESSMENT")
    print("="*70)
    print(f"Overall Score: {overall:.3f}")

    if overall > 0.8:
        assessment = "EXCELLENT - Production ready"
    elif overall > 0.7:
        assessment = "GOOD - High quality"
    elif overall > 0.6:
        assessment = "FAIR - Acceptable quality"
    else:
        assessment = "NEEDS IMPROVEMENT"

    print(f"Assessment: {assessment}")
    print("="*70)


print("Comprehensive evaluation function ready")

Comprehensive evaluation function ready


## Step 6: Batch Evaluation Function

In [None]:
def evaluate_multiple_videos(video_list, prompts=None, output_dir="evaluation_results"):
    """
    Evaluate multiple videos and generate comparative analysis.

    Args:
        video_list: List of video frame tensors
        prompts: Optional list of text prompts
        output_dir: Directory to save results

    Returns:
        Dictionary with individual and aggregate results
    """
    os.makedirs(output_dir, exist_ok=True)

    print("="*70)
    print(f"EVALUATING {len(video_list)} VIDEOS")
    print("="*70)

    all_results = []

    # Evaluate each video
    for i, video_frames in enumerate(video_list):
        prompt = prompts[i] if prompts and i < len(prompts) else None

        print(f"\nVideo {i+1}/{len(video_list)}")
        if prompt:
            print(f"Prompt: {prompt}")
        print("-"*70)

        # Evaluate
        metrics = evaluate_video_comprehensive(video_frames, text_prompt=prompt, verbose=False)

        # Print key metrics
        print(f"Temporal Consistency: {metrics['temporal_consistency']:.4f}")
        print(f"Sharpness: {metrics['sharpness']:.2f}")
        print(f"Motion Smoothness: {metrics['motion_smoothness']:.4f}")
        if 'clip_score' in metrics:
            print(f"CLIP Score: {metrics['clip_score']:.4f}")
        print(f"Inception Score: {metrics['inception_score']:.4f}")

        result = {
            'video_index': i,
            'prompt': prompt,
            'metrics': metrics
        }
        all_results.append(result)

    # Calculate aggregate statistics
    print("\n" + "="*70)
    print("AGGREGATE STATISTICS")
    print("="*70)

    metric_names = ['temporal_consistency', 'sharpness', 'contrast', 'brightness',
                   'motion_smoothness', 'clip_score', 'inception_score']

    aggregate = {}
    for metric_name in metric_names:
        values = [r['metrics'][metric_name] for r in all_results if metric_name in r['metrics']]
        if values:
            aggregate[metric_name] = {
                'mean': np.mean(values),
                'std': np.std(values),
                'min': np.min(values),
                'max': np.max(values)
            }

            print(f"\n{metric_name.replace('_', ' ').title()}:")
            print(f"  Mean: {aggregate[metric_name]['mean']:.4f} +/- {aggregate[metric_name]['std']:.4f}")
            print(f"  Range: [{aggregate[metric_name]['min']:.4f}, {aggregate[metric_name]['max']:.4f}]")

    # Create visualization
    print("\nCreating visualization...")

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Video Evaluation Results', fontsize=16)

    metrics_to_plot = ['temporal_consistency', 'sharpness', 'contrast',
                      'motion_smoothness', 'clip_score', 'inception_score']

    for idx, metric_name in enumerate(metrics_to_plot):
        if idx >= 6:
            break

        row = idx // 3
        col = idx % 3
        ax = axes[row, col]

        values = [r['metrics'].get(metric_name, 0) for r in all_results]

        bars = ax.bar(range(len(values)), values, color='steelblue', alpha=0.7)
        ax.axhline(y=np.mean(values), color='red', linestyle='--',
                   label=f'Mean: {np.mean(values):.3f}', linewidth=2)

        ax.set_xlabel('Video', fontsize=10)
        ax.set_ylabel('Score', fontsize=10)
        ax.set_title(metric_name.replace('_', ' ').title(), fontsize=11)
        ax.set_xticks(range(len(values)))
        ax.set_xticklabels([f"V{i+1}" for i in range(len(values))])
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    plot_path = os.path.join(output_dir, 'evaluation_results.png')
    plt.savefig(plot_path, dpi=150, bbox_inches='tight')
    print(f"Visualization saved: {plot_path}")

    # Save results to JSON
    results_json = {
        'individual_results': [
            {
                'video_index': r['video_index'],
                'prompt': r['prompt'],
                'metrics': r['metrics']
            }
            for r in all_results
        ],
        'aggregate_statistics': aggregate
    }

    json_path = os.path.join(output_dir, 'evaluation_results.json')
    with open(json_path, 'w') as f:
        json.dump(results_json, f, indent=2)

    print(f"Results saved: {json_path}")
    print("\nEvaluation complete")
    print("="*70)

    return results_json


print("Batch evaluation function ready")

Batch evaluation function ready


In [None]:
def load_video_from_file(video_path):
    """
    Load video file and convert to tensor format.

    Args:
        video_path: Path to video file (.mp4)

    Returns:
        Tensor of shape [T, C, H, W] normalized to [0, 1]
    """
    import imageio

    print(f"Loading video from: {video_path}")

    # Read video
    reader = imageio.get_reader(video_path)
    frames = []

    for frame in reader:
        # Convert to tensor [H, W, C] -> [C, H, W]
        frame_tensor = torch.from_numpy(frame).permute(2, 0, 1).float()
        # Normalize to [0, 1]
        frame_tensor = frame_tensor / 255.0
        frames.append(frame_tensor)

    reader.close()

    # Stack to [T, C, H, W]
    video_tensor = torch.stack(frames)

    print(f"Loaded {len(frames)} frames, shape: {video_tensor.shape}")

    return video_tensor

## Step 7: Example Usage

Below is an example of how to use the evaluation functions. Replace `video_frames` with your actual generated video tensor.

In [None]:
# Example: Evaluate a single video from file

# Path to your generated video
video_path = "production_video_3.mp4"
text_prompt = "pouring water into glass"

# Load video frames from file
video_frames = load_video_from_file(video_path)

# Evaluate
metrics = evaluate_video_comprehensive(video_frames, text_prompt=text_prompt)

# Print results
print_evaluation_results(metrics, prompt=text_prompt)
# print("Example usage code ready")
# print("Uncomment the lines above and replace with your actual video data")

Loading video from: production_video_3.mp4
Loaded 20 frames, shape: torch.Size([20, 3, 320, 320])
Evaluating video...
  Calculating temporal consistency...
  Calculating frame quality...
  Calculating motion smoothness...
  Calculating CLIP score...
  Calculating Inception score...
Evaluation complete
EVALUATION RESULTS
Prompt: pouring water into glass
----------------------------------------------------------------------
Temporal Consistency: 0.8947 [Excellent]
  Target: >0.75, Higher = more consistent motion

Sharpness: 133.00 [Fair]
  Target: >150, Higher = sharper image

Contrast: 15.14 [Poor]
  Target: >50, Higher = better contrast

Brightness: 121.70 [Good]
  Target: 100-150, Proper exposure

Motion Smoothness: 0.7165 [Excellent]
  Target: >0.55, Higher = smoother motion

CLIP Score: 0.3292 [Excellent]
  Target: >0.25, Higher = better text alignment

Inception Score: 1.2334 +/- 0.1114 [Poor]
  Target: >2.5, Higher = more realistic

OVERALL ASSESSMENT
Overall Score: 0.578
Assessme

In [20]:
# Example: Evaluate a single video from file

# Path to your generated video
video_path = "production_video_4.mp4"
text_prompt = "putting spoon on table"

# Load video frames from file
video_frames = load_video_from_file(video_path)

# Evaluate
metrics = evaluate_video_comprehensive(video_frames, text_prompt=text_prompt)

# Print results
print_evaluation_results(metrics, prompt=text_prompt)
# print("Example usage code ready")
# print("Uncomment the lines above and replace with your actual video data")

Loading video from: production_video_4.mp4
Loaded 20 frames, shape: torch.Size([20, 3, 320, 320])
Evaluating video...
  Calculating temporal consistency...
  Calculating frame quality...
  Calculating motion smoothness...
  Calculating CLIP score...
  Calculating Inception score...
Evaluation complete
EVALUATION RESULTS
Prompt: putting spoon on table
----------------------------------------------------------------------
Temporal Consistency: 0.8820 [Excellent]
  Target: >0.75, Higher = more consistent motion

Sharpness: 66.92 [Fair]
  Target: >150, Higher = sharper image

Contrast: 16.71 [Poor]
  Target: >50, Higher = better contrast

Brightness: 120.36 [Good]
  Target: 100-150, Proper exposure

Motion Smoothness: 0.5418 [Good]
  Target: >0.55, Higher = smoother motion

CLIP Score: 0.3010 [Excellent]
  Target: >0.25, Higher = better text alignment

Inception Score: 1.4980 +/- 0.2023 [Poor]
  Target: >2.5, Higher = more realistic

OVERALL ASSESSMENT
Overall Score: 0.504
Assessment: NEED

## Step 8: Batch Evaluation Example

Example of evaluating multiple videos at once.

In [None]:
# Example: Evaluate multiple videos
# Assuming you have a list of video tensors

# Example setup (replace with your actual videos)
# video_list = [video1, video2, video3]  # Each shape: [T, C, H, W]
# prompts = [
#     "placing bottle on table",
#     "pouring water into glass",
#     "person placing objects"
# ]

# Evaluate all videos
# results = evaluate_multiple_videos(
#     video_list=video_list,
#     prompts=prompts,
#     output_dir="evaluation_results"
# )

# Access results
# print(results['aggregate_statistics'])

print("Batch evaluation example ready")
print("Uncomment the lines above and replace with your actual video data")

## Summary

This notebook provides comprehensive evaluation metrics for your text-to-video model:

**Functions available:**
- `calculate_temporal_consistency(video_frames)` - Frame-to-frame coherence
- `calculate_frame_quality(video_frames)` - Sharpness, contrast, brightness
- `calculate_motion_smoothness(video_frames)` - Optical flow smoothness
- `calculate_clip_score(video_frames, text_prompt)` - Text-video alignment
- `calculate_inception_score(video_frames)` - Diversity and realism
- `evaluate_video_comprehensive(video_frames, text_prompt)` - All metrics at once
- `evaluate_multiple_videos(video_list, prompts)` - Batch evaluation

**Expected Performance (10K Model):**
- Temporal Consistency: 0.75 - 0.85
- Sharpness: 150 - 300
- Motion Smoothness: 0.55 - 0.70
- CLIP Score: 0.27 - 0.33
- Inception Score: 2.5 - 3.5

**To use:**
1. Generate your video with the model
2. Pass the video tensor to `evaluate_video_comprehensive()`
3. Review the metrics
4. Use `print_evaluation_results()` for formatted output