In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Data Curation Pipeline: Splitting and Transcoding 

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fmultimodal-data-curation%2Fsplitting_and_transcoding.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/splitting_and_transcoding.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>

| Author(s) |
| --- |
| [Noa Ben-Efraim](https://github.com/noabenefraim) |

## Overview

Video data presents a unique challenge and opportunity for multi-modal learning. Its sheer scale and temporal redundancy necessitate careful consideration when building datasets for Video Language Models (VLMs). Unlike static images, videos often contain lengthy sequences with minimal informational variance or semantic significance. Training foundation models on such raw data can lead to computational bottlenecks and hinder the model's ability to learn meaningful associations between visual and linguistic cues.

This notebook explores video splitting and transcoding, two critical steps in constructing effective data curation pipelines for VLM pre-training. By extracting the most information-dense and semantically relevant segments, we can improve computational efficiency, optimize memory utilization during training, and help models capture more fine-grained temporal relationships. We will use a subset of VidGen-1M to illustrate these steps in the data curation pipeline. The following sections outline practical approaches to building a VLM data curation pipeline, covering aspects from initial video processing to advanced techniques for identifying and isolating key visual narratives:

+ Data overview
+ Why video splitting?
+ Video splitting strategies 
+ Transcoding overview

## Get started

### Install Google Gen AI SDK and other required packages


In [None]:
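# google-cloud-storage, google-cloud-video-transcoder, and scenedetect are imported later in this notebook.
# The ffmpeg and ffprobe binaries must also be available on the system PATH.
%pip install --upgrade --quiet google-genai google-cloud-storage google-cloud-video-transcoder "scenedetect[opencv]"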

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
# Use the environment variable if the user doesn't provide Project ID.
import os

from google import genai
from google.cloud import storage

PROJECT_ID = (
    ""  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
)
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

bucket = "vidgen-1m"
storage_client = storage.Client()

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)



### Import libraries

In [None]:
import glob
import io
import json
import os
import shutil
import subprocess
import tempfile

from google.api_core.exceptions import NotFound
from google.cloud import storage
from google.cloud.transcoder_v1.services.transcoder_service import (
    TranscoderServiceClient,
)
from scenedetect import SceneManager, open_video
from scenedetect.detectors import ContentDetector



## Metadata Extraction and Filtering

Metadata extraction retrieves each video's essential properties (duration, resolution, format) using FFmpeg's `ffprobe` tool or a similar library. We can then filter out videos that do not meet the criteria for VLM training.

* Example: Minimum duration of 10 seconds.
* Example: Minimum resolution of 240p.

This notebook relies on __FFmpeg__, a command-line tool and library suite for handling multimedia data. Its general uses include transcoding between various video/audio formats, resizing and scaling resolutions, extracting components like audio streams or individual frames, and performing basic editing operations such as cutting or concatenating video segments. In a video data curation pipeline, it plays two critical roles: robust transcoding, which converts raw video files into consistent, compatible formats, and precise splitting and manipulation, which must happen before quality filtering and captioning can begin.

### Load Video Data

This notebook will use a subset of videos from the [VidGen-1M dataset](https://arxiv.org/abs/2408.02629).

First we will load the videos from the GCS bucket.

In [2]:
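def list_videos_in_bucket(bucket_name, prefix):
    """Lists video files (.mp4) in a Google Cloud Storage bucket.

    Args:
        bucket_name (str): The name of the Google Cloud Storage bucket.
        prefix (str): Only blobs whose names start with this prefix are listed.

    Returns:
        list: Names of the matching .mp4 blobs (empty if none are found).
    """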
    blobs = storage_client.bucket(bucket_name).list_blobs(prefix=prefix)
    video_files = [blob.name for blob in blobs if blob.name.lower().endswith(".mp4")]

    if not video_files:
        print("No video files found in the specified bucket (or with the prefix).")
        return []
    else:
        print(f"Found {len(video_files)} video file(s).")
        return video_files

In [None]:
# TODO larger video files
video_files = list_videos_in_bucket(bucket_name="vidgen-1m", prefix="VidGen_video_0")

Found 492 video file(s).


Now that we have our video files, we perform the following steps:

1. Metadata Extraction: The get_video_metadata function uses ffprobe to extract essential information from video files stored in Google Cloud Storage, such as duration, format, and stream details (codec, resolution). 

2. Video Filtering: The filter_videos function takes the extracted metadata and filters the video files based on user-defined criteria, specifically minimum duration and resolution.
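
To make the shape of the extracted metadata concrete, here is a minimal sketch that parses ffprobe-style JSON and pulls out the fields used for filtering. The sample JSON is hand-written for illustration (not taken from the dataset), and `summarize_probe` is a hypothetical helper, not part of this notebook's pipeline.

```python
import json

# Hypothetical, hand-written sample that mimics the structure produced by
# `ffprobe -show_format -show_streams -print_format json`.
SAMPLE_PROBE_JSON = """
{
  "streams": [
    {"codec_type": "video", "codec_name": "h264", "width": 1280, "height": 720},
    {"codec_type": "audio", "codec_name": "aac"}
  ],
  "format": {"format_name": "mov,mp4,m4a,3gp,3g2,mj2", "duration": "12.500000"}
}
"""


def summarize_probe(probe_json):
    """Extract the duration (seconds) and the first video stream's codec and size."""
    probe = json.loads(probe_json)
    # ffprobe reports the duration as a string, so convert it explicitly.
    raw_duration = probe.get("format", {}).get("duration")
    duration = float(raw_duration) if raw_duration is not None else None
    # Pick the first stream whose codec_type is "video", if any.
    video = next(
        (s for s in probe.get("streams", []) if s.get("codec_type") == "video"),
        None,
    )
    return {
        "duration": duration,
        "codec": video.get("codec_name") if video else None,
        "width": video.get("width") if video else None,
        "height": video.get("height") if video else None,
    }


summary = summarize_probe(SAMPLE_PROBE_JSON)
print(summary)
```

A video passes the example criteria above when `summary["duration"] >= 10` and `summary["height"] >= 240`.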

In [None]:
def get_video_metadata(bucket_name, blob_name):
    """
    Retrieves video metadata from a video file stored in Google Cloud Storage.

    Args:
        bucket_name (str): The name of the Google Cloud Storage bucket.
        blob_name (str): The name of the video file (blob) within the bucket.

    Returns:
        dict: A dictionary containing video metadata, including:
            - 'duration' (float): The length of the video in seconds.
            - 'streams' (list): A list of stream dictionaries, containing information
              about video, audio, and other streams.  Useful for codec info,
              resolution, etc.
            - 'format' (dict):  Contains overall format information about the
              video file.
            Returns None if any error occurs.
    """
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    try:
        video_data = blob.download_as_bytes()
        video_stream = io.BytesIO(video_data)

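        command = [
            "ffprobe",
            # Only log real errors. By default ffprobe writes its banner and
            # stream info to stderr, which would trip the `if error:` check below.
            "-v",
            "error",
            "-i",
            "pipe:0",
            "-analyzeduration",
            "2",
            "-probesize",
            "32",
            "-show_format",
            "-show_streams",
            "-print_format",
            "json",
        ]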

        process = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        output, error = process.communicate(input=video_stream.getvalue())

        if error:
            error_string = error.decode("utf-8")
            if "partial file" in error_string:
                print(
                    f"ffprobe error (partial file): {error_string}.  Attempting to extract partial metadata."
                )
            else:
                print(f"ffprobe error: {error_string}")
                return None

        try:
            metadata = json.loads(output.decode("utf-8"))
            duration = None
            if "format" in metadata and "duration" in metadata["format"]:
                duration = float(metadata["format"]["duration"])
            metadata_dict = {
                "duration": duration,
                "streams": metadata.get("streams", []),
                "format": metadata.get("format", {}),
                "filename": blob_name,
            }
            return metadata_dict
        except json.JSONDecodeError:
            print(f"Error decoding ffprobe output for {blob_name}.  Skipping file.")
            return None

    except NotFound:
        print(f"File not found: {blob_name} in bucket {bucket_name}")
        return None
    except Exception as e:
        print(f"Error processing {blob_name}: {e}")
        return None


def filter_videos(video_files, bucket_name, min_duration=10, min_resolution_height=240):
    """
    Filters video files based on duration and resolution, using metadata obtained
    from get_video_metadata.

    Args:
        video_files (list): A list of video file names (blobs) in the bucket.
        bucket_name (str): The name of the Google Cloud Storage bucket.
        min_duration (int, optional): The minimum duration in seconds.
        min_resolution_height (int, optional): The minimum resolution height
            in pixels.

    Returns:
        list: A list of video file names that meet the criteria.
    """
    filtered_videos = []
    for video_file in video_files:
        metadata = get_video_metadata(bucket_name, video_file)
        if metadata:
            duration = metadata.get("duration")
            streams = metadata.get("streams", [])

            if duration is not None and duration >= min_duration:
                meets_resolution = False
                for stream in streams:
                    if stream.get("codec_type") == "video":
                        height = stream.get("height")
                        if height is not None and height >= min_resolution_height:
                            meets_resolution = True
                            break

                if meets_resolution:
                    filtered_videos.append(video_file)
                else:
                    print(
                        f"Video {video_file} does not meet minimum resolution criteria."
                    )
            else:
                print(f"Video {video_file} does not meet minimum duration criteria.")
        else:
            print(f"Skipping {video_file} due to metadata extraction failure.")
    return filtered_videos

In [None]:
video_files_small = video_files[:10]

filtered_video_files = filter_videos(
    video_files_small, bucket, min_duration=10, min_resolution_height=240
)

if filtered_video_files:
    print("Filtered Videos:")
    for video_file in filtered_video_files:
        print(video_file)
else:
    print("No videos found that meet the criteria.")

## Defining Split Points

The process of defining split points is crucial for segmenting video data effectively for subsequent analysis, processing, or storage. The "optimal" split point strategy is highly dependent on the intended use case of the segmented data. Different approaches offer varying levels of granularity and rely on distinct characteristics of the video content. Choosing the right method can significantly impact the efficiency and relevance of downstream tasks.

__Fixed-Length Segments__
This is perhaps the simplest method, involving splitting a video into clips of a predetermined, fixed duration (e.g., every 5 seconds). This approach is computationally inexpensive and easy to implement. It's suitable for scenarios where a consistent time-based segmentation is required, such as creating uniform chunks for distributed processing or generating short previews. However, it disregards the video's content, potentially cutting through crucial actions, dialogue, or scene transitions, which can lead to semantically incoherent segments. The fixed length must be chosen carefully based on the expected content and the goals of the data curation.

__Scene Detection Algorithms__
Scene detection algorithms aim to identify transitions between distinct scenes within a video. These algorithms typically analyze visual differences between consecutive frames, such as changes in color histograms, pixel intensity, or edge detection. More advanced methods can also consider temporal information and machine learning to improve accuracy and differentiate between abrupt cuts, fades, dissolves, and other types of transitions. Splitting by scene creates semantically meaningful segments that represent coherent narrative units, which is beneficial for content-based indexing, retrieval, and analysis. However, the accuracy of these algorithms can vary depending on the video quality, editing style, and the complexity of the visual content.

__Content-Aware Methods__
Content-aware methods go beyond simple visual differences to understand the underlying content and structure of the video. This category includes techniques like shot boundary detection and keyframe analysis. Shot boundary detection is a more refined form of scene detection, focusing on identifying the immediate transitions between individual camera shots. Keyframe analysis involves identifying the most representative frames within a shot or a longer segment, often based on visual saliency, information content, or diversity. Splitting or annotating based on these methods allows for a more granular and semantically rich representation of the video content, facilitating tasks such as video summarization, activity recognition, and object tracking. These methods often require more computational resources and can be more complex to implement than fixed-length splitting.
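
The frame-differencing idea behind scene detection can be illustrated with a toy example. The sketch below compares intensity histograms of consecutive frames and flags a cut when they differ sharply. It is a simplified illustration under stated assumptions (frames as flat lists of 0-255 intensities, an arbitrary threshold of 0.5), not the algorithm used by any particular library; `intensity_histogram` and `detect_cuts` are hypothetical names.

```python
def intensity_histogram(frame, bins=8, max_value=256):
    """Normalized histogram of a frame given as a flat list of 0-255 intensities."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def detect_cuts(frames, threshold=0.5):
    """Return indices i where frame i differs sharply from frame i - 1."""
    if not frames:
        return []
    cuts = []
    previous = intensity_histogram(frames[0])
    for index in range(1, len(frames)):
        current = intensity_histogram(frames[index])
        # L1 distance between two normalized histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Toy "video": three dark frames followed by three bright frames.
dark = [10, 20, 30, 25]
bright = [200, 220, 240, 230]
frames = [dark, dark, dark, bright, bright, bright]
print(detect_cuts(frames))  # prints [3]
```

Raising the threshold makes the detector less sensitive, which mirrors the tuning trade-off that real scene detectors expose.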


#### Fixed Length Segments

This method offers a straightforward, time-based strategy for segmenting video, commonly implemented using Python libraries like MoviePy which abstract low-level details. The process involves loading the video clip, mathematically determining segment start and end times based on a fixed duration, and programmatically extracting each subclip. The result is a series of video files of uniform length, providing simple partitioning at the cost of disregarding the video's inherent content or scene structure.
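
The start and end times for fixed-length segments follow directly from the total duration. A minimal sketch, assuming durations in seconds (`fixed_length_boundaries` is a hypothetical helper, not a MoviePy or FFmpeg function):

```python
import math


def fixed_length_boundaries(total_duration, segment_duration):
    """Return (start, end) pairs in seconds that cover [0, total_duration]."""
    if segment_duration <= 0:
        raise ValueError("segment_duration must be positive")
    segment_count = math.ceil(total_duration / segment_duration)
    boundaries = []
    for index in range(segment_count):
        start = index * segment_duration
        # The final segment may be shorter than segment_duration.
        end = min(start + segment_duration, total_duration)
        boundaries.append((start, end))
    return boundaries


print(fixed_length_boundaries(12.0, 5.0))
# prints [(0.0, 5.0), (5.0, 10.0), (10.0, 12.0)]
```

These are the nominal boundaries. When segments are produced with stream copying rather than re-encoding, cuts can only land on keyframes, so actual segment durations may deviate from the nominal values.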

In our experience, calling FFmpeg directly via ```subprocess``` proved more robust for fixed-length splitting, particularly when handling the final, potentially short segment. It relies on FFmpeg's purpose-built segment muxer and efficient stream copying (```-c copy```). MoviePy offers a more Pythonic and user-friendly API, but it encountered a bug when re-encoding short segments, which made it less dependable for this task. For data curation pipelines that prioritize reliability and performance in splitting, the direct FFmpeg method therefore holds an advantage despite its less abstract command-line interface.

We therefore use FFmpeg to demonstrate fixed-length splitting.

In [None]:
bucket_name = "vidgen-1m"  # Insert your GCS bucket name
source_blob_name = "YOUR_BLOB_NAME"  # Replace with the name of a video blob in the bucket

temp_dir = tempfile.mkdtemp()
downloaded_file_path = os.path.join(temp_dir, os.path.basename(source_blob_name))

segment_duration_seconds = 5
output_directory = "ffmpeg_segments_output"
output_filename_pattern = "segment_%04d.mp4"


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket to a local file using authenticated client."""
    try:
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(source_blob_name)

        print(
            f"Attempting authenticated download gs://{bucket_name}/{source_blob_name} to {destination_file_name}"
        )

        if not blob.exists():
            print(
                f"Error: Blob gs://{bucket_name}/{source_blob_name} does not exist or is not accessible with current credentials."
            )
            return False

        blob.download_to_filename(destination_file_name)
        print(f"Download complete: {destination_file_name}")
        return True

    except Exception as e:
        print(f"An error occurred during authenticated download: {e}")
        return False


def get_video_duration(file_path):
    """Gets video duration in seconds using ffprobe."""
    ffprobe_command = [
        "ffprobe",
        "-v",
        "error",
        "-show_entries",
        "format=duration",
        "-of",
        "default=noprint_wrappers=1:nokey=1",
        file_path,
    ]
    try:
        result = subprocess.run(
            ffprobe_command, check=True, capture_output=True, text=True
        )
        duration_str = result.stdout.strip()
        return float(duration_str)
    # Exception already covers a missing ffprobe binary, a non-zero exit code,
    # and unparsable output, so a single clause is sufficient.
    except Exception as e:
        print(f"Warning: Could not get duration for {file_path} using ffprobe: {e}")
        return None


def ffmpeg_call():
    """Constructs and executes ffmpeg command for fixed length video segmenting."""

    ffmpeg_command = [
        "ffmpeg",
        "-i",
        downloaded_file_path,
        "-map",
        "0",
        "-c",
        "copy",
        "-f",
        "segment",
        "-segment_time",
        str(segment_duration_seconds),
        "-reset_timestamps",
        "1",
        "-loglevel",
        "error",
        os.path.join(output_directory, output_filename_pattern),
    ]

    print("Executing FFmpeg command...")
    subprocess.run(ffmpeg_command, check=True, capture_output=True, text=True)

    print("FFmpeg command executed successfully.")

    print("\nVerifying output segment files:")
    try:
        expected_prefix = output_filename_pattern.split("%")[0]
        expected_suffix = "." + output_filename_pattern.split(".")[-1]

        output_items = os.listdir(output_directory)
        segment_files = [
            item
            for item in output_items
            if item.startswith(expected_prefix)
            and item.endswith(expected_suffix)
            and os.path.isfile(os.path.join(output_directory, item))
        ]

        if segment_files:
            print(
                f"Detected {len(segment_files)} segment file(s) written to '{output_directory}':"
            )

            for seg_file in sorted(segment_files):
                full_file_path = os.path.join(output_directory, seg_file)

                segment_duration = get_video_duration(full_file_path)
                duration_display = (
                    f"{segment_duration:.2f}s"
                    if segment_duration is not None
                    else "N/A (ffprobe failed)"
                )
                print(f"- {seg_file} (Duration: {duration_display})")

        else:
            print(
                f"No files matching the pattern '{expected_prefix}*{expected_suffix}' found in '{output_directory}'."
            )
            print(
                "This might indicate FFmpeg ran but produced no output, or the output pattern is incorrect."
            )

    except Exception as e:
        print(f"Error listing or verifying output files: {e}")


temp_dir_to_cleanup = temp_dir
input_video_duration = None

try:
    # 1. Download the video from GCS (This step requires authentication)
    download_success = download_blob(
        bucket_name, source_blob_name, downloaded_file_path
    )

    if not download_success:
        # Raise instead of calling exit(), which can shut down a notebook kernel.
        raise RuntimeError("Authenticated video download failed.")

    input_video_duration = get_video_duration(downloaded_file_path)
    if input_video_duration is None:
        raise RuntimeError(
            f"Could not get duration of input video {downloaded_file_path}. Cannot proceed."
        )
    print(f"\nInput video duration: {input_video_duration:.2f} seconds.")

    # 2. Ensure the output directory for segments exists
    os.makedirs(output_directory, exist_ok=True)
    print(f"\nOutput directory '{output_directory}' ensured.")

    # 3. Construct and Execute FFmpeg Command for Fixed-Length Splitting
    print(
        f"\nStarting FFmpeg fixed-length splitting ({segment_duration_seconds}s segments) for {downloaded_file_path}..."
    )
    ffmpeg_call()


except FileNotFoundError:
    print("Error: FFmpeg or ffprobe command not found.")
    print(
        "Please ensure FFmpeg (which includes ffprobe) is installed and in your system's PATH."
    )
except subprocess.CalledProcessError as e:
    print("Error during FFmpeg execution:")
    print(f"Command: {' '.join(e.cmd)}")
    print(f"Return Code: {e.returncode}")
    print(f"stdout:\n{e.stdout}")
    print(f"stderr:\n{e.stderr}")
except Exception as e:
    print(f"\nAn unexpected error occurred during processing: {e}")


finally:
    # Clean up the temporary local file and the temporary directory
    print("\n--- Cleaning up temporary files ---")
    if os.path.exists(downloaded_file_path):
        try:
            os.remove(downloaded_file_path)
            print(f"Removed temporary file: {downloaded_file_path}")
        except OSError as e:
            print(f"Error removing file {downloaded_file_path}: {e}")

    # This removes the temporary directory the input file was in (e.g., /tmp/tmp...)
    if os.path.exists(temp_dir_to_cleanup):
        try:
            shutil.rmtree(temp_dir_to_cleanup)
            print(f"Removed temporary directory: {temp_dir_to_cleanup}")
        except OSError as e:
            print(f"Error removing temporary directory {temp_dir_to_cleanup}: {e}.")


print("\nFixed-length splitting process finished.")

#### Scene Detection

For this approach, we are using PySceneDetect. This is a command-line tool and a Python library specifically designed for detecting scene changes in videos. The process utilizes the `open_video` function to correctly load the video source, preparing it for analysis by the scene detection algorithms. A `SceneManager` is then configured with specific detectors, such as the `ContentDetector`, which analyzes frame-to-frame visual changes to pinpoint potential scene transitions. For the `ContentDetector`, the `threshold` parameter is critical as it determines the sensitivity to visual changes, requiring careful tuning based on the video's characteristics. Finally, executing the detection process on the video object yields a structured list of detected scene boundaries, providing precise split points for data curation.

The following script performs scene-detection-based splitting on one video from the dataset.

In [None]:
bucket_name = "vidgen-1m"
source_blob_name = "VidGen_video_0/-EGK4gGtz44-Scene-0398.mp4"


# Create a temporary directory to store the downloaded file
temp_dir = tempfile.mkdtemp()

destination_file_name = os.path.join(temp_dir, os.path.basename(source_blob_name))

video_object = None
temp_dir_to_cleanup = temp_dir

try:
    # 1. Download the video from GCS (Requires Authentication)
    download_success = download_blob(
        bucket_name, source_blob_name, destination_file_name
    )

    if not download_success:
        # Raise instead of calling exit(), which can shut down a notebook kernel.
        raise RuntimeError("Authenticated video download failed.")

    video_source_path = destination_file_name
    print(f"\nProcessing local temporary file: {video_source_path}")

    # 2. Open the video source using the recommended open_video function
    video_object = open_video(video_source_path)
    print("Video source opened successfully with open_video.")

    # 3. Create SceneManager and add detectors
    scene_manager = SceneManager(
        stats_manager=None
    )  # Pass stats_manager=None if not using
    scene_manager.add_detector(
        ContentDetector(threshold=20)
    )  # Adjust threshold as needed

    # 4. Perform scene detection, passing the video object directly
    print("Starting scene detection...")
    scene_manager.detect_scenes(video_object)  # Pass the result of open_video here
    print("Scene detection finished.")

    # 5. Retrieve list of scenes
    scene_list = scene_manager.get_scene_list()

    print(f"\nDetected {len(scene_list)} scenes for {source_blob_name}.")

    # 6. Print scene timestamps
    for i, scene in enumerate(scene_list):
        print(
            f"  Scene {i+1}: Start {scene[0].get_timecode()} - End {scene[1].get_timecode()}"
        )


except Exception as e:
    print(f"\nAn error occurred during scene processing: {e}")

finally:

    # Clean up the temporary local file and the temporary directory
    print("\n--- Cleaning up temporary files ---")
    if "destination_file_name" in locals() and os.path.exists(destination_file_name):
        try:
            os.remove(destination_file_name)
            print(f"Removed temporary file: {destination_file_name}")
        except OSError as e:
            print(f"Error removing file {destination_file_name}: {e}")

    if "temp_dir_to_cleanup" in locals() and os.path.exists(temp_dir_to_cleanup):
        try:
            shutil.rmtree(temp_dir_to_cleanup)
            print(f"Removed temporary directory: {temp_dir_to_cleanup}")
        except OSError as e:
            print(
                f"Error removing temporary directory {temp_dir_to_cleanup}: {e}. It might not be empty."
            )

Attempting authenticated download gs://vidgen-1m/VidGen_video_0/-EGK4gGtz44-Scene-0398.mp4 to /tmp/tmpws4ba8nm/-EGK4gGtz44-Scene-0398.mp4
Download complete: /tmp/tmpws4ba8nm/-EGK4gGtz44-Scene-0398.mp4

Processing local temporary file: /tmp/tmpws4ba8nm/-EGK4gGtz44-Scene-0398.mp4
Video source opened successfully with open_video.
Starting scene detection...
Scene detection finished.

Detected 2 scenes for VidGen_video_0/-EGK4gGtz44-Scene-0398.mp4.
  Scene 1: Start 00:00:00.000 - End 00:00:04.200
  Scene 2: Start 00:00:04.200 - End 00:00:04.367

--- Cleaning up temporary files ---
Removed temporary file: /tmp/tmpws4ba8nm/-EGK4gGtz44-Scene-0398.mp4
Removed temporary directory: /tmp/tmpws4ba8nm
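
Once scene boundaries are known, they can be turned into split points for FFmpeg. The sketch below only builds the command lists and does not run them. It assumes the boundaries are available as `(start, end)` pairs in seconds (for PySceneDetect, these could be obtained with something like `scene[0].get_seconds()`), and it re-encodes with `libx264` and `aac` so that cuts are not restricted to keyframes. The function name, the output naming pattern, and `input.mp4` are hypothetical.

```python
def build_scene_split_commands(input_path, scenes, output_pattern="scene_{:03d}.mp4"):
    """Build one ffmpeg command (as an argument list) per (start, end) scene."""
    commands = []
    for index, (start, end) in enumerate(scenes, start=1):
        duration = end - start
        if duration <= 0:
            continue  # Skip empty or inverted ranges.
        commands.append(
            [
                "ffmpeg",
                "-y",  # Overwrite existing output files.
                "-ss", f"{start:.3f}",  # Seek to the scene start.
                "-i", input_path,
                "-t", f"{duration:.3f}",  # Keep only the scene's duration.
                "-c:v", "libx264",  # Re-encode video for accurate cut points.
                "-c:a", "aac",
                output_pattern.format(index),
            ]
        )
    return commands


# Boundaries matching the two scenes detected in the output above.
scenes = [(0.0, 4.2), (4.2, 4.367)]
commands = build_scene_split_commands("input.mp4", scenes)
for command in commands:
    print(" ".join(command))
```

Each command can then be executed with `subprocess.run(command, check=True)`. Very short scenes, such as the 0.167-second second scene here, may be worth discarding before splitting.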


## Transcoding 

Now that you have your video split into segments, the next step in a data curation pipeline is to transcode them. Transcoding is the process of converting a video file from one format (codec, container, resolution, bitrate, etc.) to another. This is often done to standardize the video format for analysis, reduce file size, or ensure compatibility with specific tools or platforms.   

Similar to splitting, the most common and powerful tool for transcoding is FFmpeg. You can perform transcoding on each segment by calling FFmpeg via Python's ```subprocess``` module.   

Determining the "best" target transcoding parameters for a video curation pipeline depends heavily on the specific goals and constraints of that pipeline, such as the required quality for analysis, storage limitations, compatibility with downstream tools, and processing speed requirements. However, here are some common and recommended parameters that strike a good balance for many curation purposes:

1. Container Format (e.g., MP4): MP4 (.mp4) is the most widely compatible container format. It's well-supported across various operating systems, software libraries, and hardware, making it an excellent choice for ensuring your curated data is easily accessible and usable by different tools in the pipeline.   

2. Video Codec (e.g., H.264 or H.265):

    H.264 (AVC) (libx264 in FFmpeg): This is the most compatible and still offers a very good balance between compression efficiency and quality. It's the safest default as almost all video software and hardware can decode it efficiently.   

    H.265 (HEVC) (libx265 in FFmpeg): Offers significantly better compression than H.264 for the same quality, resulting in smaller file sizes. Choose this if storage or bandwidth is a major concern, but be aware it requires more processing power to encode/decode and has less universal compatibility than H.264.   

3. Video Quality/Bitrate (e.g., CRF 18-23):

    Constant Rate Factor (CRF) for H.264/H.265: Using CRF is often preferred over specifying a fixed bitrate. It tells the encoder to maintain a consistent perceptual quality throughout the video, adjusting the bitrate as needed for complex scenes. A CRF value between 18 and 23 typically provides a good balance, where 18 is often considered visually near-lossless and 23 offers significant file size reduction with minimal noticeable quality loss for many purposes.

    Fixed Bitrate (-b:v): Use this if you have strict file size targets, but quality will vary depending on the complexity of the video content.

4. Resolution: The ideal resolution depends on the detail needed for downstream analysis or display.

    Keep Original: If full detail is essential.

    Downscale: Common choices are standard HD resolutions like 1920x1080 (1080p) or 1280x720 (720p). Downscaling reduces file size and standardizes dimensions, which can be beneficial for consistent input to models or analysis tools. Use FFmpeg's scale video filter (-vf scale=1280:720 or -vf scale=-2:720 to maintain aspect ratio).

5. Frame Rate (FPS):

    Keep Original: Generally recommended unless standardization of temporal sampling is required.

    Standardize: Choosing a common frame rate like 25 or 30 fps can be useful if your analysis or models require consistent time steps. Use FFmpeg's -r option.

6. Audio Codec and Bitrate (e.g., AAC 128k-192k):

    Audio Codec (e.g., AAC): AAC (aac in FFmpeg) is the standard audio codec for MP4 containers and offers good compression with decent quality.

    Audio Bitrate (-b:a): A bitrate between 128k and 192k is typically sufficient for most curation purposes, providing clear audio without excessively increasing file size.
    

A solid starting point for a general-purpose curation pipeline, balancing compatibility, quality, and manageability, would be:

+ Container: MP4 (.mp4)
+ Video Codec: H.264 (libx264)
+ Video Quality: CRF 21-23 (Adjust based on acceptable quality vs. size)
+ Resolution: 1920x1080 or 1280x720 (if downscaling is acceptable, choose based on need)
+ Frame Rate: Keep original (omit -r) or standardize to 25/30.
+ Audio Codec: AAC (aac)
+ Audio Bitrate: 128k or 192k (-b:a 128k)

Always test your chosen parameters on a sample of your video data to ensure the output quality, file size, and processing time meet your pipeline's requirements.
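The starting point above can be sketched as a small helper that builds a CRF-based FFmpeg command. This is an illustrative sketch only: the function name `build_crf_transcode_command`, its parameters, and the example file names are assumptions made for this example, not part of the notebook's pipeline. It uses the standard FFmpeg options `-crf`, `-preset`, `-vf scale`, `-r`, `-c:a`, and `-b:a`.

```python
from typing import List, Optional


def build_crf_transcode_command(
    input_path: str,
    output_path: str,
    crf: int = 23,
    preset: str = "medium",
    height: Optional[int] = 720,
    fps: Optional[int] = None,
    audio_bitrate: str = "128k",
) -> List[str]:
    """Return an FFmpeg argument list for a CRF-based H.264/AAC MP4 transcode.

    Hypothetical helper for illustration; pass height=None or fps=None to keep
    the original resolution or frame rate.
    """
    command = [
        "ffmpeg", "-i", input_path,
        "-c:v", "libx264",
        "-crf", str(crf),      # Quality target: lower means higher quality
        "-preset", preset,     # Encoding speed vs. compression trade-off
    ]
    if height is not None:
        # -2 lets FFmpeg pick an even width that preserves the aspect ratio.
        command += ["-vf", f"scale=-2:{height}"]
    if fps is not None:
        command += ["-r", str(fps)]
    command += [
        "-c:a", "aac",
        "-b:a", audio_bitrate,
        "-y",
        "-loglevel", "error",
        output_path,
    ]
    return command


# Example: CRF 22, downscale to 720p, keep the original frame rate.
example_command = build_crf_transcode_command(
    "segment_0000.mp4", "segment_0000_crf.mp4", crf=22
)
print(" ".join(example_command))
```

The resulting list can be passed to `subprocess.run(..., check=True)` in the same way as the bitrate-based command used in the next cell. Because CRF lets the bitrate float, expect file sizes to vary with content complexity.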


In [None]:
input_directory = "ffmpeg_segments_output"
transcoding_output_directory = "transcoded_segments"

# Define target transcoding parameters (adjust as needed)
target_video_codec = "libx264"  # Example: H.264
target_audio_codec = "aac"  # Example: AAC
# Passed to FFmpeg's scale filter as scale=<value>: use "WIDTHxHEIGHT" or
# "WIDTH:HEIGHT" (e.g., "-2:720" to keep the aspect ratio), or None to keep original.
target_resolution = "1280x720"
target_video_bitrate = "1500k"  # Example bitrate or None
target_audio_bitrate = "128k"  # Example bitrate or None
target_fps = "25"  # Example frame rate or None to keep original

# Raise instead of calling exit(), which would shut down the notebook kernel.
if not os.path.isdir(input_directory):
    raise FileNotFoundError(f"Input directory not found at {input_directory}")

segment_files_to_transcode = glob.glob(os.path.join(input_directory, "*.mp4"))

if not segment_files_to_transcode:
    raise FileNotFoundError(
        f"No MP4 files found in the input directory: {input_directory}"
    )

# Ensure the output directory exists
os.makedirs(transcoding_output_directory, exist_ok=True)
print(f"Transcoded output directory '{transcoding_output_directory}' ensured.")

print(
    f"\nStarting transcoding for {len(segment_files_to_transcode)} files in {input_directory}..."
)


def ffmpeg_transcode_segment(input_segment_path):
    """Transcodes one segment with FFmpeg using the target parameters defined above."""

    input_segment_filename = os.path.basename(input_segment_path)
    base_filename, _ = os.path.splitext(input_segment_filename)
    output_extension = ".mp4"
    output_transcoded_filename = f"{base_filename}_transcoded{output_extension}"
    output_transcoded_path = os.path.join(
        transcoding_output_directory, output_transcoded_filename
    )

    print(f"\nProcessing {input_segment_filename} -> {output_transcoded_filename}")

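    # Build the command step by step so that any optional parameter set to
    # None is skipped instead of being passed to FFmpeg as the string "None".
    ffmpeg_command = ["ffmpeg", "-i", input_segment_path]  # Input file
    ffmpeg_command += ["-c:v", target_video_codec]  # Set video codec
    if target_video_bitrate:
        ffmpeg_command += ["-b:v", target_video_bitrate]  # Set video bitrate
    ffmpeg_command += ["-c:a", target_audio_codec]  # Set audio codec
    if target_audio_bitrate:
        ffmpeg_command += ["-b:a", target_audio_bitrate]  # Set audio bitrate
    if target_resolution:
        ffmpeg_command += ["-vf", f"scale={target_resolution}"]  # Scale video filter
    if target_fps:
        ffmpeg_command += ["-r", target_fps]  # Set frame rate
    ffmpeg_command += [
        "-y",  # Overwrite output file without asking
        "-loglevel",
        "error",  # Suppress verbose FFmpeg output, show errors only
        output_transcoded_path,  # Output file
    ]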

    try:
        print(f"  Executing FFmpeg command for {input_segment_filename}...")
        subprocess.run(ffmpeg_command, check=True, capture_output=True, text=True)
        print(f"  Transcoding successful for {input_segment_filename}.")

    except FileNotFoundError:
        print(f"  Error: FFmpeg command not found. Skipping {input_segment_filename}.")
    except subprocess.CalledProcessError as e:
        print(f"  Error during FFmpeg execution for {input_segment_filename}:")
        print(f"  Command: {' '.join(e.cmd)}")
        print(f"  Return Code: {e.returncode}")
        print(f"  stdout:\n{e.stdout}")
        print(f"  stderr:\n{e.stderr}")
    except Exception as e:
        print(f"  An unexpected error occurred for {input_segment_filename}: {e}")


# Sort the files for consistent processing order
for input_segment_path in sorted(segment_files_to_transcode):
    ffmpeg_transcode_segment(input_segment_path)

print("\nTranscoding process finished for all files.")

Transcoded output directory 'transcoded_segments' ensured.

Starting transcoding for 2 files in ffmpeg_segments_output...

Processing segment_0000.mp4 -> segment_0000_transcoded.mp4
  Executing FFmpeg command for segment_0000.mp4...
  Transcoding successful for segment_0000.mp4.

Processing segment_0001.mp4 -> segment_0001_transcoded.mp4
  Executing FFmpeg command for segment_0001.mp4...
  Transcoding successful for segment_0001.mp4.

Transcoding process finished for all files.
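
One way to check that the transcoded segments match the target parameters is to inspect them with `ffprobe`, which ships with FFmpeg. The sketch below is an assumption-laden illustration rather than part of the original pipeline: the helper names `parse_probe_output` and `probe_video_stream` are hypothetical, and the sample JSON only mimics the shape of `ffprobe -of json` output for demonstration.

```python
import json
import subprocess
from typing import Any, Dict


def parse_probe_output(ffprobe_json: str) -> Dict[str, Any]:
    """Extract codec, dimensions, and frame rate from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        raise ValueError("No video stream found in ffprobe output")
    stream = streams[0]
    # ffprobe reports frame rates as a fraction string such as "25/1".
    numerator, denominator = stream["avg_frame_rate"].split("/")
    fps = float(numerator) / float(denominator) if float(denominator) else 0.0
    return {
        "codec": stream["codec_name"],
        "width": stream["width"],
        "height": stream["height"],
        "fps": fps,
    }


def probe_video_stream(path: str) -> Dict[str, Any]:
    """Run ffprobe on a file and return the first video stream's properties."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",  # First video stream only
            "-show_entries", "stream=codec_name,width,height,avg_frame_rate",
            "-of", "json",
            path,
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_probe_output(result.stdout)


# Illustrative sample with the same shape as ffprobe's JSON output.
sample_output = json.dumps(
    {
        "streams": [
            {
                "codec_name": "h264",
                "width": 1280,
                "height": 720,
                "avg_frame_rate": "25/1",
            }
        ]
    }
)
print(parse_probe_output(sample_output))

# To check real files (requires ffprobe on the PATH), something like:
# for path in sorted(glob.glob("transcoded_segments/*.mp4")):
#     print(path, probe_video_stream(path))
```

Comparing the reported codec, resolution, and frame rate against the target values is a cheap sanity check before passing segments to later pipeline stages.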


### Transcoding with the Google Cloud Transcoder API

The implementation above uses `ffmpeg`, the industry-standard tool, but it requires building command-line arguments by hand and running the encoding on your own machines. The Google Cloud Transcoder API is a fully managed alternative that removes the need to operate and maintain your own encoding infrastructure. It can produce adaptive bitrate formats such as HLS and MPEG-DASH, and it supports content manipulation beyond simple transcoding, including stitching multiple clips together, overlaying watermarks, and embedding captions. Its asynchronous, job-based workflow lets an application submit a processing request and receive a notification on completion instead of blocking while the video is encoded.

The example below uses the Transcoder API to extract a 10-second clip from a video in Cloud Storage and transcode it to a 720p H.264/AAC MP4 file.

In [None]:
gcs_input_uri = "gs://your-input-bucket/path/to/your-video.mp4"
gcs_output_uri_prefix = "gs://your-output-bucket/path/for-transcoded-segments/"

# Define the start and end times for the clip you want to extract.
# This example will create a 10-second clip starting at the 5-second mark.
start_time_seconds = 5.0
end_time_seconds = 15.0

# Instantiate the Transcoder API client
transcoder_client = TranscoderServiceClient()

# The parent resource for the API call
parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"

# Define a unique name for the output segment
output_segment_filename = (
    f"video_segment_{int(start_time_seconds)}s_to_{int(end_time_seconds)}s.mp4"
)
gcs_output_uri = f"{gcs_output_uri_prefix}{output_segment_filename}"

# Create the job configuration dictionary. This structure tells the Transcoder API
# exactly what to do.
job = {
    # 'output_uri' is the Cloud Storage folder (ending in "/") for the results.
    # The output file name is set by 'file_name' in 'mux_streams' below.
    "output_uri": gcs_output_uri_prefix,
    "config": {
        # 'inputs' declares the source files and gives each one a key.
        "inputs": [{"key": "input0", "uri": gcs_input_uri}],
        # 'edit_list' is used to define splits or stitch together clips.
        # Here we define one "atom" that refers to our input video by its key
        # and specifies the start and end times.
        "edit_list": [
            {
                "key": "atom0",
                "inputs": ["input0"],
                "start_time_offset": {"seconds": int(start_time_seconds)},
                "end_time_offset": {"seconds": int(end_time_seconds)},
            }
        ],
        # 'elementary_streams' define the video and audio encoding settings.
        "elementary_streams": [
            {
                "key": "video-stream0",
                "video_stream": {
                    "h264": {
                        "height_pixels": 720,
                        "width_pixels": 1280,
                        "bitrate_bps": 2500000,
                        "frame_rate": 30,
                    }
                },
            },
            {
                "key": "audio-stream0",
                "audio_stream": {"codec": "aac", "bitrate_bps": 128000},
            },
        ],
        # 'mux_streams' combine the elementary streams into an output file container.
        "mux_streams": [
            {
                "key": "output-mp4",
                "file_name": output_segment_filename,
                "container": "mp4",
                "elementary_streams": ["video-stream0", "audio-stream0"],
            }
        ],
    },
}

try:
    print(f"Submitting transcoding job to split '{gcs_input_uri}'...")

    # Use the client to create the job
    response = transcoder_client.create_job(parent=parent, job=job)

    print("\nAPI Response:")
    print(f"Job Name: {response.name}")
    print(f"Creation Time: {response.create_time}")
    print(f"Status: {response.state}")
    print("\nThe job has been submitted and is processing asynchronously.")
    print(f"Check the GCS output path for results: {gcs_output_uri}")

except Exception as e:
    print(f"\nAn error occurred: {e}")
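
Because the job runs asynchronously, the output file is not available as soon as `create_job` returns. A minimal polling sketch is shown below, assuming the client exposes `get_job(name=...)` and that the returned job's `state` is an enum whose name is one of `PENDING`, `RUNNING`, `SUCCEEDED`, or `FAILED`. The helper name `wait_for_transcoder_job` and its default intervals are hypothetical choices for this example.

```python
import time


def wait_for_transcoder_job(client, job_name, poll_seconds=10.0, timeout_seconds=600.0):
    """Poll a Transcoder job until it succeeds or fails, or raise on timeout.

    `client` is expected to behave like TranscoderServiceClient: it must offer
    get_job(name=...) returning an object with a `state` enum attribute.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_job(name=job_name)
        state_name = job.state.name  # e.g. "PENDING", "RUNNING", "SUCCEEDED", "FAILED"
        if state_name in ("SUCCEEDED", "FAILED"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_name} still {state_name} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Usage with the objects from the cell above (not executed here):
# finished_job = wait_for_transcoder_job(transcoder_client, response.name)
# print(f"Final state: {finished_job.state.name}")
```

Polling is the simplest approach for a notebook. For production pipelines, a completion notification is usually preferable to a busy loop.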

## Conclusion

This notebook demonstrates foundational steps in a video data curation pipeline, essential for transforming raw video sources into a format suitable for analysis or machine learning tasks. A key initial process is video splitting, which breaks down large video files into smaller, more manageable segments based on time or content boundaries. Following splitting, transcoding standardizes the format, resolution, and other technical parameters of these segments, ensuring compatibility and efficiency for consistent downstream processing. By completing these crucial splitting and transcoding stages, the video dataset becomes a standardized collection of clips, prepared for subsequent quality filtering and in-depth analysis.
