In [50]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Data Curation Pipeline: Semantic Deduplication

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/semantic-deduplication.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fmultimodal-data-curation%2Fsemantic-deduplication.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/multimodal-data-curation/semantic-deduplication.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/multimodal-data-curation/semantic-deduplication.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| Author(s) |
| --- |
| [John Semerdjian](https://github.com/semerj) |

## Overview

Data deduplication is a critical step in any data curation pipeline. Even after splitting, filtering, and captioning video clips, our dataset will still contain redundant information, especially since many of the clips are derived from the same source material. Imagine that you're building a specialized foundation model for sports videos: do you really need thousands of clips of athletes shooting free throws? While more high-quality data usually leads to better downstream modeling performance, training on near-identical clips is likely not an efficient use of our compute budget. Redundant examples in particular do little to enhance a model's ability to generalize to new tasks, and instead waste precious FLOPs. Depending on the dataset, some researchers have shown that [eliminating 50% of the training data using straightforward deduplication techniques can lead to a model of similar performance at half the cost and twice the speed](https://arxiv.org/pdf/2303.09540).

In this post we'll show one simple approach for detecting semantic duplicates using [multimodal embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings#video-modes) and [Approximate Nearest Neighbors using BigQuery Vector Search](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods). There are m

## Get started

### Install required packages


In [None]:
%pip install --upgrade --quiet datasets google-cloud-aiplatform google-cloud-bigquery google-cloud-storage sentencepiece pandas-gbq av transformers

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
import os

PROJECT_ID = "genai-scratchpad"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
# Fall back to the environment variable if the user doesn't provide a project ID.
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

### Import libraries

In [3]:
from concurrent.futures import ThreadPoolExecutor, as_completed
import io
import threading

from PIL import Image
import av
from google.cloud import aiplatform, bigquery, storage
import pandas as pd
import pandas_gbq
import torch
from torch import nn
from tqdm import tqdm
from transformers import AutoModel, AutoProcessor
import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel, Video

vertexai.init(project=PROJECT_ID, location=LOCATION)
aiplatform.init(project=PROJECT_ID, location=LOCATION)

bq_client = bigquery.Client()
storage_client = storage.Client()

## Video Embedding Generation

### Load Video Data

This notebook will use a subset of videos from the [VidGen-1M dataset](https://arxiv.org/abs/2408.02629). The authors of this dataset have already done some preliminary deduplication, but let's see if we can identify some new duplicates with a small sample of records.

In order to use the Multimodal Embeddings API the videos must be stored on Google Cloud Storage. You can download the compressed files and transfer them to a bucket here: https://huggingface.co/datasets/Fudan-FUXI/VIDGEN-1M/tree/main

Alternatively, you can also load a subset of the dataset into memory using the `datasets` library:

```
from datasets import load_dataset

data = load_dataset("Fudan-FUXI/VIDGEN-1M", data_files="VidGen_video_0.zip")
```

We'll read the clips directly from Cloud Storage below:

In [4]:
def load_video_paths(
    bucket: str,
    num_videos: int,
) -> list[str]:
    video_paths = []
    for i, blob in enumerate(storage_client.list_blobs(bucket)):
        if i >= num_videos:
            break
        video_paths.append(f"gs://{bucket}/" + blob.name)
    return video_paths


video_paths = load_video_paths("vidgen-1m", 5000)
len(video_paths)

5000

### Vertex AI Multimodal Embeddings

We will generate embeddings using the Vertex AI multimodal embedding API. To use this API, our videos must be stored on Cloud Storage. We will parallelize the calls to the API (you may need to request a quota increase to get better throughput). Video embeddings from this model have 1408 dimensions that encode rich semantic information about the content. The API supports common video formats, e.g. mp4, webm, mov, and more. Only 2 minutes of content can be analyzed at a time, but there is no maximum video length. Our clips are already quite short, so we don't need to worry about splitting them into segments. Audio data is not considered in the embeddings.
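For longer videos, the 2-minute limit means the content would have to be covered in consecutive windows. The sketch below only computes the window boundaries; `segment_offsets` and `max_window_sec` are hypothetical names used for illustration and are not part of the Vertex AI SDK, and passing the offsets to the embedding API is not shown here.

```python
def segment_offsets(
    duration_sec: float, max_window_sec: float = 120.0
) -> list[tuple[float, float]]:
    """Split [0, duration_sec) into consecutive windows of at most max_window_sec.

    Hypothetical helper: the 120-second default mirrors the 2-minute limit
    mentioned above.
    """
    duration_sec = float(duration_sec)
    if duration_sec <= 0 or max_window_sec <= 0:
        return []

    windows = []
    start = 0.0
    while start < duration_sec:
        # The last window is shortened so it never runs past the end of the video.
        end = min(start + max_window_sec, duration_sec)
        windows.append((start, end))
        start = end
    return windows


# A 5-minute video needs three windows; a 45-second clip needs only one.
long_video_windows = segment_offsets(300)
short_clip_windows = segment_offsets(45)
print(long_video_windows)
print(short_clip_windows)
```

Since the clips in this notebook are short, each one fits in a single window and this step can be skipped.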

In [5]:
def get_embedding_for_single_video(
    model: MultiModalEmbeddingModel, video_file: str
) -> dict[str, str | list[float] | float | None]:
    """Get the embedding of the first segment of a video file.

    Args:
        model: MultiModalEmbeddingModel instance.
        video_file: URI for video on Cloud Storage.

    Returns:
        Dictionary containing the URI, start and end offsets, and embeddings.

    """
    try:
        video = Video.load_from_file(video_file)
        embeddings = model.get_embeddings(video=video)
        return {
            "uri": video_file,
            "embedding": embeddings.video_embeddings[0].embedding,
            "start_offset_sec": embeddings.video_embeddings[0].start_offset_sec,
            "end_offset_sec": embeddings.video_embeddings[0].end_offset_sec,
        }
    except Exception as e:
        print(f"Error processing {video_file}: {e}")
        return {
            "uri": video_file,
            "embedding": None,
            "error": str(e),
        }

In [6]:
def get_embeddings_with_semaphore(
    video_files: list[str],
    model_name: str = "multimodalembedding@001",
    max_workers: int = 2,
) -> list[dict[str, str | list[float] | float | None]]:
    """Get embeddings for a list of video files.

    Args:
        video_files: List of URIs for video files on Cloud Storage.
        model_name: Name of the Vertex AI Multimodal EmbeddingModel to use.
        max_workers: The maximum number of concurrent requests.

    Returns:
        A list of dictionaries containing the URI and the embedding response or an error.

    """
    model = MultiModalEmbeddingModel.from_pretrained(model_name)
    all_embeddings = []
    semaphore = threading.Semaphore(max_workers)

    def rate_limited_embedding_task(video_file: str):
        """Acquires the semaphore before running the embedding task."""
        with semaphore:
            return get_embedding_for_single_video(model, video_file)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(rate_limited_embedding_task, video_file)
            for video_file in video_files
        ]

        for future in tqdm(
            as_completed(futures), total=len(video_files), desc="Processing videos"
        ):
            all_embeddings.append(future.result())

    return all_embeddings

In [7]:
embeddings = get_embeddings_with_semaphore(
    video_paths,
    model_name="multimodalembedding@001",
    max_workers=30,
)

Processing videos: 100%|██████████| 5000/5000 [13:25<00:00,  6.21it/s]
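`get_embedding_for_single_video` returns a record with `"embedding": None` and an `"error"` message when a request fails, so it can be useful to separate those records before storing anything. The following is a minimal sketch; `split_results` is a hypothetical helper, and the sample records are made up for illustration.

```python
def split_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate successful embedding records from failed ones.

    Hypothetical helper: a record counts as failed when its "embedding"
    value is missing or None, matching the error path shown above.
    """
    succeeded = [r for r in results if r.get("embedding") is not None]
    failed = [r for r in results if r.get("embedding") is None]
    return succeeded, failed


# Made-up records shaped like the output of the embedding function.
sample_results = [
    {"uri": "gs://bucket/a.mp4", "embedding": [0.1, 0.2]},
    {"uri": "gs://bucket/b.mp4", "embedding": None, "error": "quota exceeded"},
]
succeeded, failed = split_results(sample_results)
print(len(succeeded), "succeeded;", len(failed), "failed")
for record in failed:
    print(record["uri"], "->", record["error"])
```

The failed URIs could then be retried with a lower `max_workers` value if the errors are quota related.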


### Side Quest: Open Source Embeddings

We can also try out open source embedding models designed for images. We can sample N frames from each video, pool their embeddings, and perform the same deduplication steps. We will use average pooling of frame embeddings as a robust and computationally efficient baseline, but other approaches can also be used. A drawback of average pooling is that it ignores the temporal order of the video frames. Depending on the approach, the number of sampled frames may not be exact. Since (1) each video has a different duration and (2) we're pooling the embeddings anyway, sampling exactly the same number of frames per video is less important.
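To make the pooling step and its drawback concrete, here is a small pure-Python sketch. `mean_pool` is a hypothetical helper and the 2-dimensional "frame embeddings" are toy values; the notebook itself pools real model outputs with `outputs.mean(dim=0)`.

```python
def mean_pool(frame_embeddings: list[list[float]]) -> list[float]:
    """Average a list of equal-length frame embeddings into one vector."""
    if not frame_embeddings:
        raise ValueError("Need at least one frame embedding to pool.")
    dim = len(frame_embeddings[0])
    if any(len(e) != dim for e in frame_embeddings):
        raise ValueError("All frame embeddings must have the same dimension.")
    count = len(frame_embeddings)
    return [sum(e[i] for e in frame_embeddings) / count for i in range(dim)]


# Toy embeddings for four frames of one video.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
pooled = mean_pool(frames)

# Playing the same frames backwards gives the same pooled vector,
# which is exactly the temporal information that average pooling discards.
pooled_reversed = mean_pool(frames[::-1])
print(pooled, pooled_reversed)
```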

In [11]:
def sample_frames(video_path: str, num_frames: int = 10) -> list[Image.Image]:
    """Samples n frames from a video stored at a Cloud Storage URI.

    Args:
        video_path: Cloud storage URI.
        num_frames: Number of frames to sample.

    Returns:
        A list of PIL Image objects.

    """
    pil_images: list[Image.Image] = []
    video_source_for_av = None
    # Initialize so the `finally` block is safe if the download or open fails.
    container = None
    total_frames_reported_by_stream = 0

    try:
        bucket_name, blob_name = video_path.replace("gs://", "").split("/", 1)
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_name)

        video_bytes = blob.download_as_bytes()
        video_source_for_av = io.BytesIO(video_bytes)
        container = av.open(video_source_for_av)

        if not container.streams.video:
            print(f"No video streams found in {video_path}")
            return []

        video_stream = container.streams.video[0]
        # video_stream.frames comes from the container metadata. It is usually
        # reliable, but it can be 0 or inaccurate for some files.
        total_frames_reported_by_stream = video_stream.frames

        if total_frames_reported_by_stream == 0 or num_frames == 0:
            return []

        target_indices_to_sample = []
        if num_frames == 1:
            # Sample the middle frame index
            target_indices_to_sample.append(
                round((total_frames_reported_by_stream - 1) / 2)
            )
        elif num_frames >= total_frames_reported_by_stream:
            # If more or equal frames are requested than available, sample all
            target_indices_to_sample = list(range(total_frames_reported_by_stream))
        else:
            # Sample num_frames evenly distributed, aiming to include first and last.
            # Using a set to ensure uniqueness if rounding collapses indices, then sort.
            _indices = set()
            for i in range(num_frames):
                idx = round(
                    i * (total_frames_reported_by_stream - 1) / (num_frames - 1)
                )
                _indices.add(int(idx))
            target_indices_to_sample = sorted(list(_indices))

        # Ensure the list is not empty if logic somehow failed or num_frames was valid
        if (
            not target_indices_to_sample
            and num_frames > 0
            and total_frames_reported_by_stream > 0
        ):
            # This case should ideally not be hit if calculations are correct
            # Default to sampling just the first frame if something went wrong with index calculation
            target_indices_to_sample.append(0)

        current_frame_idx = 0
        # Iterate through frames and pick the ones at target_indices_to_sample
        for frame in container.decode(video_stream):
            if not target_indices_to_sample:
                break

            if current_frame_idx == target_indices_to_sample[0]:
                pil_images.append(frame.to_image())
                target_indices_to_sample.pop(0)

            current_frame_idx += 1

    except Exception as e:
        print(f"General error processing video '{video_path}': {e}")
        raise
    finally:
        if container:
            try:
                container.close()
            except Exception as ce:
                print(f"Error closing video container for '{video_path}': {ce}")
        if isinstance(video_source_for_av, io.BytesIO):
            video_source_for_av.close()

    if not pil_images and num_frames > 0 and total_frames_reported_by_stream > 0:
        print(
            f"Warning: No frames were sampled from '{video_path}'. "
            "Check video integrity, stream content, or frame selection logic."
        )

    return pil_images
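The index-selection logic inside `sample_frames` can be isolated as a pure function, which makes it easier to reason about without downloading any video. This is a sketch that mirrors the branches above; `evenly_spaced_indices` is a hypothetical name, not a function defined elsewhere in this notebook.

```python
def evenly_spaced_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick up to num_frames frame indices spread evenly over a video."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames == 1:
        # A single sample is taken from the middle of the video.
        return [round((total_frames - 1) / 2)]
    if num_frames >= total_frames:
        # More frames requested than available: take all of them.
        return list(range(total_frames))
    # Spread the samples from the first frame to the last frame. The set
    # removes any indices that collapse to the same value after rounding.
    indices = {
        round(i * (total_frames - 1) / (num_frames - 1)) for i in range(num_frames)
    }
    return sorted(indices)


print(evenly_spaced_indices(10, 4))
print(evenly_spaced_indices(5, 10))
print(evenly_spaced_indices(9, 1))
```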

In [24]:
sampled_frames = sample_frames(video_paths[0])

Here is the first sample frame for the first video:

In [None]:
sampled_frames[0]

and the last sample frame:

In [None]:
sampled_frames[-1]

In [43]:
class SiglipWithProjection(nn.Module):
    """Frozen SigLIP image encoder followed by a linear projection layer."""

    def __init__(self, model_name, target_dim):
        super().__init__()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = AutoModel.from_pretrained(model_name).to(self.device)
        # Freeze the pretrained encoder.
        for param in self.model.parameters():
            param.requires_grad = False
        original_dim = self.model.config.vision_config.hidden_size
        # The projection is randomly initialized (untrained). Keep it on the
        # same device as the encoder so the forward pass also works on a GPU.
        self.projection = nn.Linear(original_dim, target_dim).to(self.device)

    def forward(self, **inputs):
        """Apply a projection layer to the image features."""
        image_features = self.model.get_image_features(**inputs)
        low_dim_features = self.projection(image_features)
        return low_dim_features


def get_video_embeddings(
    video_paths: list[str],
    model_name: str,
) -> list[dict[str, str | list[float]]]:
    """Get the pooled image embeddings of videos from a list of video paths.

    Args:
        video_paths: Cloud storage video URIs.
        model_name: The name of the model to use.

    Returns:
        A list of dictionaries, each containing a video URI and the average
        of the image embeddings of that video's sampled frames.

    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = SiglipWithProjection(model_name, target_dim=256)
    model.eval()
    processor = AutoProcessor.from_pretrained(model_name, use_fast=False)

    embeddings = []
    for video_path in video_paths:
        sampled_frames = sample_frames(video_path)
        if not sampled_frames:
            # Nothing to embed (no video stream or no decodable frames).
            continue
        inputs = processor(images=sampled_frames, return_tensors="pt").to(device)
        # Inference only: skip gradient tracking to save memory.
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append(
            {
                "uri": video_path,
                # Average pooling over the frame dimension.
                "embedding": outputs.mean(dim=0).tolist(),
            },
        )
    return embeddings

In [None]:
# "laion/CLIP-ViT-B-16-laion2B-s34B-b88K"

In [None]:
embeddings = get_video_embeddings(video_paths, "google/siglip2-base-patch16-512")

## Storing Embeddings in BigQuery

We'll store the embeddings in a BigQuery table. We'll load them into a pandas DataFrame and use `pandas_gbq` to write it to the table.

In [8]:
dataset_id = "video_data_curation"
table_id = "clip_embeddings"

df_embedding = pd.DataFrame(embeddings)
df_embedding.shape

(5000, 4)

In [9]:
pandas_gbq.to_gbq(
    df_embedding,
    f"{PROJECT_ID}.{dataset_id}.{table_id}",
    if_exists="replace",
    project_id=PROJECT_ID,
)

100%|██████████| 1/1 [00:00<00:00, 5645.09it/s]


## Deduplication in BigQuery

Once the embeddings are in BigQuery, we will use its Vector Search functionality to perform the deduplication step. While there are more sophisticated strategies that run K-means clustering on the embeddings, perform pairwise comparisons within each cluster, and prune based on similarity, this approach can be done entirely within BigQuery and is easy to understand and execute.

First, let's build the fully qualified ID of the table that holds the embeddings.

In [10]:
full_table_id = f"{PROJECT_ID}.{dataset_id}.{table_id}"
print(full_table_id)

genai-scratchpad.video_data_curation.clip_embeddings


We will now create a vector index to accelerate the vector search query. This step is particularly important if the table is large. Since our table is small, a brute-force search would also work.

In [11]:
embedding_column = "embedding"

index_job = bq_client.query(
    f"""
CREATE VECTOR INDEX my_index ON `{full_table_id}`({embedding_column})
OPTIONS(index_type='TREE_AH', distance_type='COSINE');
"""
)

The vector index creation takes a few minutes. You can monitor the vector index status using the following query. Once the coverage is 100, we can proceed to the next step.
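Rather than re-running the status query by hand, the check can be wrapped in a small polling loop. The sketch below is independent of BigQuery: `wait_for_coverage` and its parameters are hypothetical, and `fetch_coverage` stands for any callable that returns the current `coverage_percentage` (for example, a function that runs the status query in the next cell and reads the value from the result).

```python
import time
from typing import Callable


def wait_for_coverage(
    fetch_coverage: Callable[[], int],
    target: int = 100,
    poll_interval_sec: float = 30.0,
    timeout_sec: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch_coverage() reaches target; return False on timeout."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if fetch_coverage() >= target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_sec)


# Demo with canned readings instead of a real BigQuery call.
readings = iter([0, 40, 100])
reached = wait_for_coverage(lambda: next(readings), sleep=lambda _: None)
print("Index ready:", reached)
```

A timeout matters here: in the run recorded in this notebook, the status query reports an `ACTIVE` index with a coverage of 0, so a loop without a deadline would never finish.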

In [12]:
index_status_job = bq_client.query(
    f"""
SELECT table_name, index_status, coverage_percentage
FROM `{PROJECT_ID}.{dataset_id}.INFORMATION_SCHEMA.VECTOR_INDEXES`
WHERE table_name = "{table_id}";
"""
)

In [16]:
index_status_job.result().to_arrow()

pyarrow.Table
table_name: string
index_status: string
coverage_percentage: int64
----
table_name: [["clip_embeddings"]]
index_status: [["ACTIVE"]]
coverage_percentage: [[0]]

We can now perform the deduplication step. For each embedding, we perform an approximate nearest neighbor (ANN) search and retrieve its `top_k` closest embeddings along with their cosine distances. Pairs whose distance falls below the provided threshold, which should be tuned for each dataset and use case, are treated as semantic duplicates, and we remove one record from each pair. While this approach is simple, it comes with some tradeoffs: we ignore any transitive links between the embeddings (e.g. if record A is near both record B and record C, then B and C may be nearby as well), and we don't intelligently decide which embedding to keep or discard within a match. Nevertheless, this is a straightforward deduplication approach that leverages the foundational power of the [ScaNN algorithm](https://research.google/blog/announcing-scann-efficient-vector-similarity-search/) for semantic search.
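The rule applied by the query can be sketched in plain Python on a toy dataset. This is an exhaustive comparison of all pairs rather than an approximate `top_k` search, so it illustrates the thresholding logic only, not BigQuery's implementation. `cosine_distance` and `find_dupes` are hypothetical helpers, the vectors are made up, and embeddings are assumed to be non-zero.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_dupes(records: dict[str, list[float]], threshold: float) -> set[str]:
    """Return the ids to drop.

    For every pair closer than threshold, the id that sorts later is
    dropped, mirroring the `query.uri > base.uri` condition in the SQL.
    """
    ids = sorted(records)
    dupes = set()
    for i, base_id in enumerate(ids):
        for query_id in ids[i + 1 :]:
            if cosine_distance(records[query_id], records[base_id]) < threshold:
                dupes.add(query_id)
    return dupes


# Toy embeddings: "b" is nearly parallel to "a", while "c" is orthogonal.
toy_records = {"a": [1.0, 0.0], "b": [0.999, 0.01], "c": [0.0, 1.0]}
to_drop = find_dupes(toy_records, threshold=0.05)
kept = sorted(set(toy_records) - to_drop)
print("dropped:", sorted(to_drop), "kept:", kept)
```

Keeping the id that sorts first is an arbitrary tie-break; a production pipeline might instead prefer the clip with the higher resolution or quality score.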

In [17]:
index_column = "uri"
top_k = 10
distance_threshold = 0.05

dedupe_job = bq_client.query(
    f"""
CREATE OR REPLACE TABLE `{full_table_id}_dedupe` AS
WITH dupes AS (
    SELECT
        DISTINCT query.{index_column}
    FROM VECTOR_SEARCH(
        TABLE `{full_table_id}`, "embedding",
        TABLE `{full_table_id}`, top_k => {top_k})
    WHERE
        distance < {distance_threshold} AND query.{index_column} > base.{index_column}
)
SELECT * FROM `{full_table_id}`
WHERE {index_column} NOT IN (SELECT {index_column} FROM dupes);
"""
)

In [18]:
bq_client.query(f"SELECT COUNT(*) FROM `{full_table_id}_dedupe`").result().to_arrow()

pyarrow.Table
f0_: int64
----
f0_: [[4989]]

In [19]:
bq_client.query(f"SELECT COUNT(*) FROM `{full_table_id}`").result().to_arrow()

pyarrow.Table
f0_: int64
----
f0_: [[5000]]

Let's query the duplicate pairs themselves:

In [20]:
df_dupes = pandas_gbq.read_gbq(
    f"""
SELECT
    DISTINCT query.uri,
    base.uri,
    distance
FROM VECTOR_SEARCH(
    TABLE `{full_table_id}`, "embedding",
    TABLE `{full_table_id}`, top_k => 10)
WHERE
    query.uri > base.uri
    AND distance < .05
ORDER BY
    distance DESC;
"""
)

Downloading: 100%|██████████|


In [21]:
df_dupes["uri"].tolist()

['gs://vidgen-1m/VidGen_video_1002/1L2Ib7XbFl0-Scene-0002.mp4',
 'gs://vidgen-1m/VidGen_video_1001/7BdLCNVP3vc-Scene-0018.mp4',
 'gs://vidgen-1m/VidGen_video_1003/EumrLe0lv2o-Scene-0055.mp4',
 'gs://vidgen-1m/VidGen_video_1003/26ez1C6FHTU-Scene-0029.mp4',
 'gs://vidgen-1m/VidGen_video_1005/fGgj5Xwhca0-Scene-0220.mp4',
 'gs://vidgen-1m/VidGen_video_1004/BytDnUzySCc-Scene-0047.mp4',
 'gs://vidgen-1m/VidGen_video_1003/frcOHC7TLdA-Scene-0034.mp4',
 'gs://vidgen-1m/VidGen_video_1001/9yMovThU5Hg-Scene-1171.mp4',
 'gs://vidgen-1m/VidGen_video_1005/CVT7j05IHN4-Scene-0061.mp4',
 'gs://vidgen-1m/VidGen_video_1004/8gDxAwsHZzI-Scene-0151.mp4',
 'gs://vidgen-1m/VidGen_video_0/Copy of -Beq3x4K-xA-Scene-0001.mp4']

In [23]:
df_dupes["uri_1"].tolist()

['gs://vidgen-1m/VidGen_video_1001/LVnVfhcFL3g-Scene-0005.mp4',
 'gs://vidgen-1m/VidGen_video_10/7BdLCNVP3vc-Scene-0030.mp4',
 'gs://vidgen-1m/VidGen_video_1001/8LAfQyTauYo-Scene-0053.mp4',
 'gs://vidgen-1m/VidGen_video_1002/OQWeq_TUMDE-Scene-0012.mp4',
 'gs://vidgen-1m/VidGen_video_1000/ekKb8hixxFM-Scene-0010.mp4',
 'gs://vidgen-1m/VidGen_video_1001/doPS18DtqLU-Scene-0390.mp4',
 'gs://vidgen-1m/VidGen_video_1/xzn_rmla6yU-Scene-0200.mp4',
 'gs://vidgen-1m/VidGen_video_1000/-_agFJmVJXk-Scene-0068.mp4',
 'gs://vidgen-1m/VidGen_video_1/JfkzqohuutQ-Scene-0178.mp4',
 'gs://vidgen-1m/VidGen_video_1003/06dEjljE80Q-Scene-0840.mp4',
 'gs://vidgen-1m/VidGen_video_0/-Beq3x4K-xA-Scene-0001.mp4']
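The pairs listed above are independent matches. One way to account for the transitive links mentioned earlier is to merge the pairs into groups with a union-find structure, so that A, B, and C end up in a single group when A matches B and B matches C. This is a sketch of that idea, not something the notebook runs; `group_duplicates` is a hypothetical helper and the pairs below are toy values.

```python
def group_duplicates(pairs: list[tuple[str, str]]) -> list[list[str]]:
    """Merge duplicate pairs into connected groups using union-find."""
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            # Path halving keeps the trees shallow.
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Use the id that sorts first as the group representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for a, b in pairs:
        union(a, b)

    groups: dict[str, list[str]] = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


# Toy pairs: b~a and c~b form one group of three; e~d forms another.
toy_pairs = [("b", "a"), ("c", "b"), ("e", "d")]
duplicate_groups = group_duplicates(toy_pairs)
print(duplicate_groups)
```

With the DataFrame above, the pairs could be built with `list(zip(df_dupes["uri"], df_dupes["uri_1"]))`, after which one record per group would be kept.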