In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Scaled Video Analysis with the YouTube Data API, Gemini, and Batch Prediction

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-analysis/video_analysis_with_youtube_data_api_and_batch_prediction.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fvideo-analysis%2Fvideo_analysis_with_youtube_data_api_and_batch_prediction.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/video-analysis/video_analysis_with_youtube_data_api_and_batch_prediction.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-analysis/video_analysis_with_youtube_data_api_and_batch_prediction.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/bigquery/v1/32px.svg" alt="BigQuery Studio logo"><br> Open in BigQuery Studio
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-analysis/video_analysis_with_youtube_data_api_and_batch_prediction.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| | |
|-|-|
| Author(s) | [Alok Pattani](https://github.com/alokpattani/) |

## Overview

In this notebook, you'll explore how to search and analyze publicly available [YouTube](https://www.youtube.com/) videos at scale, using the [YouTube Data API](https://developers.google.com/youtube/v3/getting-started), [Vertex AI Batch Prediction with Gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini), and [BigQuery](https://cloud.google.com/bigquery).

You will complete the following tasks:
- Use the YouTube Data API to find videos of interest, including by search query and channel.
- Summarize those YouTube videos from a specific query or channel using Gemini, with both online and batch prediction.
- Use batch prediction to extract a specific set of structured outputs from a larger set of YouTube videos.
- Get information about and extract insights from those videos by aggregating Gemini's extracted results in BigQuery.

## Get started

### Install Vertex AI SDK and other required packages


In [2]:
%pip install --upgrade --user --quiet google-cloud-aiplatform google-api-python-client google-auth-oauthlib google-auth-httplib2

Note: you may need to restart the kernel to use updated packages.


### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string", isTemplate: true}
if PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

API_ENDPOINT = f"{LOCATION}-aiplatform.googleapis.com"

vertexai.init(project=PROJECT_ID, location=LOCATION)

## Set up libraries, YouTube Data API, and Gemini models

### Import libraries

In [None]:
import json
import time

from IPython.display import HTML, Markdown, display
from google.cloud import bigquery
import googleapiclient.discovery
import googleapiclient.errors
import pandas as pd
from vertexai.batch_prediction import BatchPredictionJob
from vertexai.generative_models import GenerativeModel, Part

### Set YouTube Data API v3 Info

You need an API key to access the YouTube Data API for use in this notebook. Please follow [these instructions](https://developers.google.com/youtube/v3/getting-started#before-you-start) to perform the appropriate setup and get your key if you don't already have one.
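
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the helper below falls back to an environment variable when the placeholder is left unchanged. The environment variable name `YOUTUBE_DATA_API_KEY` and the helper `resolve_youtube_api_key` are assumptions made for illustration; neither is part of this notebook or of the YouTube Data API client.

```python
import os

# Placeholder value used in the notebook form field.
PLACEHOLDER_KEY = "[your-youtube-data-api-key]"


def resolve_youtube_api_key(explicit_key, env=None):
    """Return the explicit key, or fall back to an environment variable.

    The environment variable name is a hypothetical convention for this sketch.
    """
    env = os.environ if env is None else env

    # Prefer a key typed into the notebook, unless it is still the placeholder.
    if explicit_key and explicit_key != PLACEHOLDER_KEY:
        return explicit_key

    key = env.get("YOUTUBE_DATA_API_KEY")
    if not key:
        raise ValueError("No YouTube Data API key provided or found in the environment.")
    return key
```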

In [4]:
# Enter your YouTube data API key here
YOUTUBE_DATA_API_KEY = "[your-youtube-data-api-key]"  # @param {type:"string"}

### Load models
Pick from the various Gemini models that support batch prediction [listed here](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini#models_that_support_batch_predictions).

In [5]:
# Set Gemini Flash and Pro models to be used in this notebook
GEMINI_FLASH_MODEL_ID = "gemini-1.5-flash-002"  # @param {type:"string"}
GEMINI_PRO_MODEL_ID = "gemini-1.5-pro-002"  # @param {type:"string"}

gemini_flash_model = GenerativeModel(GEMINI_FLASH_MODEL_ID)
gemini_pro_model = GenerativeModel(GEMINI_PRO_MODEL_ID)

### Set up BigQuery client, dataset, and tables

In [6]:
# Create BQ client
BQ_CLIENT = bigquery.Client(project=PROJECT_ID)

# Function to run BQ query using created client and return results as data frame


def get_bq_query_results_as_df(query_text):
    bq_results_table = BQ_CLIENT.query(query_text).to_dataframe()

    return bq_results_table


# Names of BQ dataset and tables to be created/used in this notebook
BQ_DATASET = "youtube_video_analysis"  # @param {type:"string"}

BATCH_PREDICTION_REQUESTS_TABLE = (
    "video_analysis_batch_requests"  # @param {type:"string"}
)

BATCH_PREDICTION_RESULTS_TABLE = (
    "video_analysis_batch_results"  # @param {type:"string"}
)

# Create BQ dataset if it doesn't already exist
create_dataset_if_nec_query = f"""
    CREATE SCHEMA IF NOT EXISTS `{BQ_DATASET}`
    OPTIONS(
      location='{LOCATION}'
    );
    """

get_bq_query_results_as_df(create_dataset_if_nec_query)

## Search for videos from YouTube Data API and summarize them with Gemini

### Use YouTube Data API to find videos by query

Set various parameters in the cell below to conduct a search on YouTube using [the Data API's search functionality](https://developers.google.com/youtube/v3/docs/search/list).

The default parameters in this notebook create a search for 3 short (<4-minute) videos of ice cream flavor reviews from the last 30 days, sorted by relevance.
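
The "last 30 days" window reaches the API through the `publishedAfter` parameter, which expects an RFC 3339 timestamp. The sketch below shows that conversion with only the standard library. The helper name `published_after_timestamp` is hypothetical, and the notebook's own search function uses pandas for the same step.

```python
from datetime import datetime, timedelta, timezone


def published_after_timestamp(max_num_days_ago, now=None):
    """Return an RFC 3339 UTC timestamp for `max_num_days_ago` days before `now`.

    `now` must be timezone-aware when provided; it defaults to the current UTC time.
    """
    now = datetime.now(timezone.utc) if now is None else now.astimezone(timezone.utc)
    cutoff = now - timedelta(days=max_num_days_ago)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Working in UTC from the start avoids shifting the window by the local UTC offset.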

In [7]:
# Set various parameters to search YouTube for relevant videos
search_query = "ice cream flavor reviews"  # @param {type:"string"}

video_duration_type = (
    "short"  # @param {type:"string"}['any', 'long', 'medium', 'short']
)

# To get newer/fresher videos, modify to lower # of days
published_within_last_X_days = 30  # @param {type:"integer"}

# Different ways to order results
order_criteria = "relevance"  # @param {type:"string"}['date', 'rating', 'relevance', 'title', 'viewCount']

# # of results to be returned - max is 50 results on 1 API call
num_results = 3  # @param {type:"integer"}

# Function to get response from YouTube API given specific query & various other parameters


def get_yt_data_api_response_for_search_query(
    query, video_duration, max_num_days_ago, channel_id, video_order, num_video_results
):
    api_service_name = "youtube"
    api_version = "v3"
    developer_key = YOUTUBE_DATA_API_KEY

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=developer_key
    )

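    # Build the cutoff from the current UTC time; localizing a naive local
    # timestamp as UTC would shift the window by the local UTC offset.
    published_after_timestamp = (
        pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=max_num_days_ago)
    ).isoformat()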

    # Using Search:list - https://developers.google.com/youtube/v3/docs/search/list
    yt_data_api_request = youtube.search().list(
        part="id,snippet",
        type="video",
        q=query,
        videoDuration=video_duration,
        maxResults=num_video_results,
        publishedAfter=published_after_timestamp,
        channelId=channel_id,
        order=video_order,
    )

    yt_data_api_response = yt_data_api_request.execute()

    return yt_data_api_response


yt_data_api_results = get_yt_data_api_response_for_search_query(
    query=search_query,
    video_duration=video_duration_type,
    max_num_days_ago=published_within_last_X_days,
    channel_id=None,
    video_order=order_criteria,
    num_video_results=num_results,
)

print(yt_data_api_results)

{'kind': 'youtube#searchListResponse', 'etag': '-8SX2xhpIww5lrEK_5NvfbEa9TQ', 'nextPageToken': 'CAMQAA', 'regionCode': 'US', 'pageInfo': {'totalResults': 26207, 'resultsPerPage': 3}, 'items': [{'kind': 'youtube#searchResult', 'etag': 'beJoEcBaUkM8RJ5SFKkbm9U2Fmc', 'id': {'kind': 'youtube#video', 'videoId': 'mmkkJ2xyXzI'}, 'snippet': {'publishedAt': '2024-11-30T17:00:49Z', 'channelId': 'UCpb-2N1mSSaiiIsFxaM4KcQ', 'title': 'NEW Salt &amp; Straw Holiday Flavors Review 🍨🎅🏽 | Dessert Review #icecream #tastetest', 'description': "morganchomps #holidayfood #dessert #saltandstraw #christmascookies #eggnog Trying Salt & Straw's NEW Holiday flavor ...", 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/mmkkJ2xyXzI/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/mmkkJ2xyXzI/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/mmkkJ2xyXzI/hqdefault.jpg', 'width': 480, 'height': 360}}, 'channelTitle': 'Morgan Chomps', 'li

In [8]:
# Function to convert YouTube API response into data frame w/ specific schema


def convert_yt_data_api_response_to_df(yt_data_api_response):

    # Convert API response into data frame for further analysis
    yt_data_api_response_items_df = pd.json_normalize(yt_data_api_response["items"])

    yt_data_api_response_df = yt_data_api_response_items_df.assign(
        videoURL="https://www.youtube.com/watch?v="
        + yt_data_api_response_items_df["id.videoId"]
    )[
        [
            "id.videoId",
            "videoURL",
            "snippet.title",
            "snippet.description",
            "snippet.channelId",
            "snippet.channelTitle",
            "snippet.publishedAt",
            "snippet.thumbnails.default.url",
        ]
    ].rename(
        columns={
            "id.videoId": "videoId",
            "snippet.title": "videoTitle",
            "snippet.description": "videoDescription",
            "snippet.channelId": "channelId",
            "snippet.channelTitle": "channelTitle",
            "snippet.publishedAt": "publishedAt",
            "snippet.thumbnails.default.url": "thumbnailURL",
        }
    )

    return yt_data_api_response_df


yt_data_api_results_df = convert_yt_data_api_response_to_df(yt_data_api_results)

display(yt_data_api_results_df.head())

Unnamed: 0,videoId,videoURL,videoTitle,videoDescription,channelId,channelTitle,publishedAt,thumbnailURL
0,mmkkJ2xyXzI,https://www.youtube.com/watch?v=mmkkJ2xyXzI,NEW Salt &amp; Straw Holiday Flavors Review 🍨🎅...,morganchomps #holidayfood #dessert #saltandstr...,UCpb-2N1mSSaiiIsFxaM4KcQ,Morgan Chomps,2024-11-30T17:00:49Z,https://i.ytimg.com/vi/mmkkJ2xyXzI/default.jpg
1,3tKn3EclDpA,https://www.youtube.com/watch?v=3tKn3EclDpA,I Ate The Weirdest Ice Cream Flavors,,UCuuLy9wn33DAE7qNt7oVGww,DanCookedIt,2024-12-03T01:10:41Z,https://i.ytimg.com/vi/3tKn3EclDpA/default.jpg
2,0M3CkYFclgw,https://www.youtube.com/watch?v=0M3CkYFclgw,Alec’s Ice Cream Review #icecream #peanutbutte...,,UCep1MVEsZM7gCFLypGkf63A,Max Leventer,2024-11-14T23:16:26Z,https://i.ytimg.com/vi/0M3CkYFclgw/default.jpg


### Get summary from Gemini for each video

Next, we can ask Gemini to summarize each video using the URLs created from the results of our YouTube search, then display 1 of the videos with its summary for inspection (if desired).
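
Because the model is called once per row, a single transient error (for example, a quota or timeout error) would stop the whole run. One way to hedge against that is a small retry wrapper with exponential backoff. This is a generic sketch, not part of the Vertex AI SDK; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay_seconds=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff.

    The final failure is re-raised so that errors are never silently dropped.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

It could wrap the per-video summary function, for example `call_with_retries(get_gemini_summary_from_youtube_video_url, video_url)`. Catching every exception is a simplification; narrowing it to retryable error types would be a sensible refinement.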

In [9]:
# Get summary from Gemini for each video in data frame


def get_gemini_summary_from_youtube_video_url(video_url):
    video_summary_prompt = "Summarize this video."

    # Gemini Pro for highest quality (can change to Flash if latency/cost are of concern)
    video_summary_response = gemini_pro_model.generate_content(
        [video_summary_prompt, Part.from_uri(mime_type="video/webm", uri=video_url)]
    )

    summary_text = video_summary_response.text

    return summary_text


yt_data_api_results_df["geminiVideoSummary"] = yt_data_api_results_df["videoURL"].apply(
    get_gemini_summary_from_youtube_video_url
)

yt_data_api_results_df

Unnamed: 0,videoId,videoURL,videoTitle,videoDescription,channelId,channelTitle,publishedAt,thumbnailURL,geminiVideoSummary
0,mmkkJ2xyXzI,https://www.youtube.com/watch?v=mmkkJ2xyXzI,NEW Salt &amp; Straw Holiday Flavors Review 🍨🎅...,morganchomps #holidayfood #dessert #saltandstr...,UCpb-2N1mSSaiiIsFxaM4KcQ,Morgan Chomps,2024-11-30T17:00:49Z,https://i.ytimg.com/vi/mmkkJ2xyXzI/default.jpg,"In this video, the creator presents Salt & Str..."
1,3tKn3EclDpA,https://www.youtube.com/watch?v=3tKn3EclDpA,I Ate The Weirdest Ice Cream Flavors,,UCuuLy9wn33DAE7qNt7oVGww,DanCookedIt,2024-12-03T01:10:41Z,https://i.ytimg.com/vi/3tKn3EclDpA/default.jpg,This video shows a man trying various flavors ...
2,0M3CkYFclgw,https://www.youtube.com/watch?v=0M3CkYFclgw,Alec’s Ice Cream Review #icecream #peanutbutte...,,UCep1MVEsZM7gCFLypGkf63A,Max Leventer,2024-11-14T23:16:26Z,https://i.ytimg.com/vi/0M3CkYFclgw/default.jpg,The presenter reviews two flavors of Alec’s ic...


In [10]:
# Pick 1 video above to display video and its summary together
sample_video = yt_data_api_results_df.sample(1).iloc[0].to_dict()

sample_video_embed_url = sample_video["videoURL"].replace("/watch?v=", "/embed/")

# Create HTML code to directly embed video
sample_video_embed_html_code = f"""
<iframe width="560" height="315" src="{sample_video_embed_url}"
title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; 
clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
</iframe>
"""

# Display embedded YouTube video
display(HTML(sample_video_embed_html_code))

display(
    Markdown(
        f"<b>Summary of Video from Gemini:</b><br>{sample_video['geminiVideoSummary']}"
    )
)

<b>Summary of Video from Gemini:</b><br>The presenter reviews two flavors of Alec’s ice cream. He used coupons from Alec’s for two flavors that looked interesting to him: Peanut Butter Fudge Honeycomb and Nutty Butter Brittle. Both flavors are bought at Whole Foods.

The Peanut Butter Fudge Honeycomb is similar in flavor to Ben and Jerry’s Tonight Dough. The presenter rates the ice cream 8.3 out of 10, but wishes it had a bit more crunch.

The Nutty Butter Brittle flavor is also good, and described as unique. The presenter gave it an 8.1 out of 10, noting the ice cream is not as creamy or as rich as they would like. 

The presenter recommends viewers try both flavors.

## Get a larger set of videos using YouTube Data API, prepare to analyze them in batch

In this section, we'll search for and set up to analyze a larger set of videos from YouTube.  

The default parameters in this notebook search the [Major League Baseball (MLB) YouTube channel](https://www.youtube.com/@MLB), channel ID `UCoLrcjPV5PbUrUyXq5mjc_A`, for medium-length videos from the last 365 days, returning the top 50 videos (the max from 1 API call) by view count.
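
A single `search.list` call returns at most 50 results, and the response carries a `nextPageToken` when more are available. To go beyond 50 videos, calls can be chained by passing that token back as `pageToken`. The sketch below keeps the paging logic separate from the API client so it can be exercised without network access; `collect_search_items` and `fetch_page` are hypothetical names, not part of this notebook.

```python
def collect_search_items(fetch_page, max_items):
    """Collect up to `max_items` result items across pages.

    `fetch_page` is any callable that takes a page token (None for the first
    page) and returns a response dict shaped like a search.list response.
    """
    items = []
    page_token = None

    while len(items) < max_items:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))

        # No token means the last page has been reached.
        page_token = response.get("nextPageToken")
        if not page_token:
            break

    return items[:max_items]
```

With the API client, `fetch_page` would be a function that issues the same `youtube.search().list(...)` request with `pageToken=token` added and calls `.execute()`. Each extra page is a separate API call and counts against the API quota.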

In [11]:
# Intentionally leaving default empty to search for all videos w/in a channel
search_query = ""  # @param {type:"string"}

video_duration_type = (
    "medium"  # @param {type:"string"}['any', 'long', 'medium', 'short']
)

published_within_last_X_days = 365  # @param {type:"integer"}

# Default value of 'UCoLrcjPV5PbUrUyXq5mjc_A' is for specific MLB channel
channel_id = "UCoLrcjPV5PbUrUyXq5mjc_A"  # @param {type:"string"}

order_criteria = "viewCount"  # @param {type:"string"}['date', 'rating', 'relevance', 'title', 'viewCount']

# Max is 50 results on 1 API call
num_results = 50  # @param {type:"integer"}

yt_data_api_channel_results = get_yt_data_api_response_for_search_query(
    query=search_query,
    video_duration=video_duration_type,
    max_num_days_ago=published_within_last_X_days,
    channel_id=channel_id,
    video_order=order_criteria,
    num_video_results=num_results,
)

yt_data_api_channel_results_df = convert_yt_data_api_response_to_df(
    yt_data_api_channel_results
)

display(yt_data_api_channel_results_df.head())

Unnamed: 0,videoId,videoURL,videoTitle,videoDescription,channelId,channelTitle,publishedAt,thumbnailURL
0,-jYfC4YYXIw,https://www.youtube.com/watch?v=-jYfC4YYXIw,GREATEST GAME EVER?!? Shohei Ohtani goes 6-FOR...,Is this the greatest game in baseball history?...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-09-20T00:01:31Z,https://i.ytimg.com/vi/-jYfC4YYXIw/default.jpg
1,j3ykZoQMJLI,https://www.youtube.com/watch?v=j3ykZoQMJLI,Dodgers vs. Yankees World Series Game 5 Highli...,Dodgers vs. Yankees World Series Game 5 full g...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-31T05:24:46Z,https://i.ytimg.com/vi/j3ykZoQMJLI/default.jpg
2,jSgk6wjhY7Q,https://www.youtube.com/watch?v=jSgk6wjhY7Q,Yankees vs. Dodgers World Series Game 1 Highli...,Yankees vs. Dodgers World Series Game 1 full g...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-26T05:34:00Z,https://i.ytimg.com/vi/jSgk6wjhY7Q/default.jpg
3,jrDacPNWBGA,https://www.youtube.com/watch?v=jrDacPNWBGA,FREDDIE FREEMAN HITS A WALK-OFF GRAND SLAM TO ...,I DON'T BELIEVE WHAT I JUST SAW! Don't forget ...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-26T03:52:45Z,https://i.ytimg.com/vi/jrDacPNWBGA/default.jpg
4,CbiRxl4OMPc,https://www.youtube.com/watch?v=CbiRxl4OMPc,Dodgers vs. Yankees World Series Game 3 Highli...,Dodgers vs. Yankees World Series Game 3 full g...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-29T04:52:50Z,https://i.ytimg.com/vi/CbiRxl4OMPc/default.jpg


The next cell has the various specifics for our Gemini video extraction task to be applied to all the video results from YouTube: system instruction, prompt, and response schema for [controlled generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output) (i.e. creating structured outputs for further analysis).

The defaults in this notebook are tailored to the use case of counting the number of appearances of various people and teams in videos from the MLB channel (based on the default YouTube Data API search results above). Be sure to modify each of these pieces for your specific use case.

Using those pieces, we create a single Gemini request per row (1 for each YouTube video), serialized as a JSON string, in order to set up for batch prediction.
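
To make the target shape concrete, the sketch below parses a payload that conforms to the response schema defined in the next cell and flattens it into one row per referenced entity. The payload is hand-written for illustration (it is not model output), and `flatten_references` is a hypothetical helper.

```python
import json

# Hand-written example that matches the response schema; not real model output.
example_payload = json.dumps(
    [
        {
            "summary": "Placeholder summary text.",
            "references": [
                {"entity_name": "Shohei Ohtani", "entity_type": "athlete"},
                {"entity_name": "Los Angeles Dodgers", "entity_type": "team"},
            ],
        }
    ]
)


def flatten_references(payload_text):
    """Return (entity_name, entity_type) tuples, one per referenced entity."""
    rows = []
    for result in json.loads(payload_text):
        for reference in result.get("references", []):
            rows.append((reference.get("entity_name"), reference.get("entity_type")))
    return rows
```

Keeping one entity per array element is what later makes counting appearances a simple aggregation.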

In [13]:
# Set up pieces (system instruction, prompt, response schema, config) for Gemini video extraction API calls

video_extraction_system_instruction = """You are a video analyst that carefully looks 
    through all frames of provided videos, extracting out the pieces necessary to respond to
    user prompts. Make sure to look through and listen to the whole video, start to finish.
    Only reference information in the video itself in your response."""

video_extraction_prompt = """Provide a 2-3 sentence summary of the key themes from this video,
    and also provide a list of each athlete, manager/coach, and team that is referenced or
    shown. Use full names for people and teams - e.g. "Shohei Ohtani" instead of just "Ohtani"
    and "Los Angeles Dodgers" instead of just "Dodgers." Make sure to count only those involved
    in the actual baseball in the video, and output only 1 entity per row."""

video_extraction_response_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "references": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "entity_name": {"type": "string"},
                        "entity_type": {
                            "type": "string",
                            "enum": ["athlete", "manager or coach", "team"],
                        },
                    },
                },
            },
        },
    },
}

video_extraction_generation_config = {
    "temperature": 0.0,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json",
    "response_schema": video_extraction_response_schema,
}

# Function to build CURL request for given YT link, using pieces above


def get_video_extraction_curl_request_for_yt_video_link(youtube_video_link):
    video_extraction_curl_request_dict = {
        "system_instruction": {
            "parts": [{"text": video_extraction_system_instruction}]
        },
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": video_extraction_prompt},
                    {
                        "file_data": {
                            "mimeType": "video/*",
                            "fileUri": youtube_video_link,
                        }
                    },
                ],
            }
        ],
        "generation_config": video_extraction_generation_config,
    }

    video_extraction_curl_request = json.dumps(video_extraction_curl_request_dict)

    return video_extraction_curl_request


# Create Gemini API CURL request for each YT video
yt_data_api_channel_results_df["request"] = yt_data_api_channel_results_df.apply(
    lambda row: get_video_extraction_curl_request_for_yt_video_link(row["videoURL"]),
    axis=1,
)

display(yt_data_api_channel_results_df.head())

Unnamed: 0,videoId,videoURL,videoTitle,videoDescription,channelId,channelTitle,publishedAt,thumbnailURL,request
0,-jYfC4YYXIw,https://www.youtube.com/watch?v=-jYfC4YYXIw,GREATEST GAME EVER?!? Shohei Ohtani goes 6-FOR...,Is this the greatest game in baseball history?...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-09-20T00:01:31Z,https://i.ytimg.com/vi/-jYfC4YYXIw/default.jpg,"{""system_instruction"": {""parts"": [{""text"": ""Yo..."
1,j3ykZoQMJLI,https://www.youtube.com/watch?v=j3ykZoQMJLI,Dodgers vs. Yankees World Series Game 5 Highli...,Dodgers vs. Yankees World Series Game 5 full g...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-31T05:24:46Z,https://i.ytimg.com/vi/j3ykZoQMJLI/default.jpg,"{""system_instruction"": {""parts"": [{""text"": ""Yo..."
2,jSgk6wjhY7Q,https://www.youtube.com/watch?v=jSgk6wjhY7Q,Yankees vs. Dodgers World Series Game 1 Highli...,Yankees vs. Dodgers World Series Game 1 full g...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-26T05:34:00Z,https://i.ytimg.com/vi/jSgk6wjhY7Q/default.jpg,"{""system_instruction"": {""parts"": [{""text"": ""Yo..."
3,jrDacPNWBGA,https://www.youtube.com/watch?v=jrDacPNWBGA,FREDDIE FREEMAN HITS A WALK-OFF GRAND SLAM TO ...,I DON'T BELIEVE WHAT I JUST SAW! Don't forget ...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-26T03:52:45Z,https://i.ytimg.com/vi/jrDacPNWBGA/default.jpg,"{""system_instruction"": {""parts"": [{""text"": ""Yo..."
4,CbiRxl4OMPc,https://www.youtube.com/watch?v=CbiRxl4OMPc,Dodgers vs. Yankees World Series Game 3 Highli...,Dodgers vs. Yankees World Series Game 3 full g...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-29T04:52:50Z,https://i.ytimg.com/vi/CbiRxl4OMPc/default.jpg,"{""system_instruction"": {""parts"": [{""text"": ""Yo..."


In [14]:
# Output table with YouTube API results and corresponding Gemini requests to BigQuery

yt_api_results_with_bp_requests_table_load_job = BQ_CLIENT.load_table_from_dataframe(
    yt_data_api_channel_results_df,
    f"{BQ_DATASET}.{BATCH_PREDICTION_REQUESTS_TABLE}",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
)

# Wait for the load job to complete
yt_api_results_with_bp_requests_table_load_job.result()

LoadJob<project=gcp-data-science-demo, location=us-central1, id=307c3e4e-dbf2-4ef9-b4ca-2c05c54c89b6>

## Submit batch prediction job to analyze multiple YouTube videos at once

### Set parameters for batch prediction request

You create a batch prediction job using the `BatchPredictionJob.submit()` method. To make a batch prediction request, you specify a source model ID, an input source, and an output location (either Cloud Storage or BigQuery) where Vertex AI stores the batch prediction results.

To learn more, see the [batch prediction API page](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/batch-prediction-api).

In the default case in this notebook, we'll use the BigQuery table with requests created in the previous section as input, and output results to another BigQuery table for further analysis.
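
Both BigQuery URIs follow the `bq://PROJECT_ID.DATASET.TABLE` form, and a quick local check can catch typos before the job is submitted. The helper below is a hypothetical sketch. Its pattern is deliberately loose: it assumes a project ID without dots and does not attempt to mirror BigQuery's full naming rules.

```python
import re

# Three dot-separated, non-empty components after the bq:// scheme.
_BQ_URI_PATTERN = re.compile(r"^bq://([^.\s]+)\.([^.\s]+)\.([^.\s]+)$")


def parse_bq_uri(uri):
    """Split a bq:// URI into (project, dataset, table) or raise ValueError."""
    match = _BQ_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Expected bq://PROJECT_ID.DATASET.TABLE, got: {uri!r}")
    return match.groups()
```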

In [15]:
# BQ URI of input table in form bq://PROJECT_ID.DATASET.TABLE
# or Cloud Storage bucket URI
INPUT_URI = f"bq://{PROJECT_ID}.{BQ_DATASET}.{BATCH_PREDICTION_REQUESTS_TABLE}"

# BQ URI of target output table in form bq://PROJECT_ID.DATASET.TABLE
# If the table doesn't already exist, then it is created for you
OUTPUT_URI = f"bq://{PROJECT_ID}.{BQ_DATASET}.{BATCH_PREDICTION_RESULTS_TABLE}"

# Pick which Gemini model to use here (default Flash)
MODEL_ID = GEMINI_FLASH_MODEL_ID  # @param {type:"raw"} ['GEMINI_FLASH_MODEL_ID', 'GEMINI_PRO_MODEL_ID']

### Submit batch prediction request using Vertex AI SDK

In [None]:
# Submit batch prediction request using Vertex AI SDK
batch_prediction_job = BatchPredictionJob.submit(
    source_model=MODEL_ID, input_dataset=INPUT_URI, output_uri_prefix=OUTPUT_URI
)

If the batch prediction job goes through, the output above should contain a link you can use to monitor the job in [the Vertex AI Batch predictions page in the Google Cloud console](https://console.cloud.google.com/vertex-ai/batch-predictions).

You can also print out the job status and other properties, as shown below.

In [None]:
print(f"Job resource name: {batch_prediction_job.resource_name}")
print(f"Model resource name: {batch_prediction_job.model_name}")
print(f"Job state: {batch_prediction_job.state.name}")

Depending on the number of input items that you submitted, a batch prediction job can take some time to complete. You can use the following code to check the job status and wait for the job to complete.

In [18]:
# Refresh batch prediction job until complete
while not batch_prediction_job.has_ended:
    time.sleep(5)
    batch_prediction_job.refresh()

# Check if the job succeeds
if batch_prediction_job.has_succeeded:
    print("Job succeeded!")
else:
    print(f"Job failed: {batch_prediction_job.error}")

Job succeeded!
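
The loop above polls every 5 seconds with no upper bound. For longer jobs, a variant with a longer interval and a timeout may be preferable. The sketch below relies only on the `has_ended` property and `refresh()` method used above; `wait_for_job` and its parameters are hypothetical names.

```python
import time


def wait_for_job(
    job,
    poll_seconds=30.0,
    timeout_seconds=3600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Refresh `job` until it has ended, raising TimeoutError past the deadline."""
    deadline = clock() + timeout_seconds

    while not job.has_ended:
        if clock() >= deadline:
            raise TimeoutError("Batch prediction job did not finish before the timeout.")
        sleep(poll_seconds)
        job.refresh()

    return job
```

The `sleep` and `clock` arguments exist only so the function can be tested without waiting; the defaults are what you would use in the notebook.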


### Check sample of results in BigQuery

Once the batch prediction job has finished successfully, we can run the following cell to check a sample of our results.

In [19]:
# Pick sampling % and # of results for check of BQ results table - can leave
# 100% & total # of results for small tables, likely sample down for larger ones

sampling_percentage = 100  # @param {type:"number"}

num_results = 50  # @param {type:"integer"}

batch_prediction_results_sample_query = f"""
    SELECT * 
    FROM `{BQ_DATASET}.{BATCH_PREDICTION_RESULTS_TABLE}`
    TABLESAMPLE SYSTEM ({sampling_percentage} PERCENT)
    LIMIT {num_results}
    """

bq_results_table = get_bq_query_results_as_df(batch_prediction_results_sample_query)

display(Markdown("<b>Batch Prediction BigQuery Results Table</b>"))

display(bq_results_table.head())

<b>Batch Prediction BigQuery Results Table</b>

Unnamed: 0,videoId,videoURL,videoTitle,videoDescription,channelId,channelTitle,publishedAt,thumbnailURL,status,processed_time,request,response
0,0VjYP_73TkY,https://www.youtube.com/watch?v=0VjYP_73TkY,FULL INNING: Dodgers win Game 1 after Freeman ...,Freddie Freeman wins Game 1 for the Dodgers. S...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-26T04:02:23Z,https://i.ytimg.com/vi/0VjYP_73TkY/default.jpg,,2024-12-10 10:49:03.802000+00:00,"{""contents"":[{""parts"":[{""text"":""Provide a 2-3 ...","{""candidates"":[{""avgLogprobs"":-0.0343309741422..."
1,cwZ8fI8T4Kc,https://www.youtube.com/watch?v=cwZ8fI8T4Kc,Royals vs. Yankees ALDS Game 1 Highlights (10/...,Royals vs. Yankees ALDS Game 1 full game highl...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-06T04:09:41Z,https://i.ytimg.com/vi/cwZ8fI8T4Kc/default.jpg,,2024-12-10 10:50:31.271000+00:00,"{""contents"":[{""parts"":[{""text"":""Provide a 2-3 ...","{""candidates"":[{""avgLogprobs"":-0.0297037133877..."
2,-jYfC4YYXIw,https://www.youtube.com/watch?v=-jYfC4YYXIw,GREATEST GAME EVER?!? Shohei Ohtani goes 6-FOR...,Is this the greatest game in baseball history?...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-09-20T00:01:31Z,https://i.ytimg.com/vi/-jYfC4YYXIw/default.jpg,,2024-12-10 10:50:03.239000+00:00,"{""contents"":[{""parts"":[{""text"":""Provide a 2-3 ...","{""candidates"":[{""avgLogprobs"":-0.0680019415698..."
3,2ABGYpVo41k,https://www.youtube.com/watch?v=2ABGYpVo41k,FULL INNING: Dodgers TAKE THE LEAD for the fir...,The Los Angeles Dodgers take the lead vs. the ...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-31T03:37:04Z,https://i.ytimg.com/vi/2ABGYpVo41k/default.jpg,,2024-12-10 10:51:08.335000+00:00,"{""contents"":[{""parts"":[{""text"":""Provide a 2-3 ...","{""candidates"":[{""avgLogprobs"":-0.0415774981180..."
4,Jr5AiJ0K9Og,https://www.youtube.com/watch?v=Jr5AiJ0K9Og,Yankees vs. Guardians ALCS Game 4 Highlights (...,Yankees vs. Guardians ALCS Game 4 full game hi...,UCoLrcjPV5PbUrUyXq5mjc_A,MLB,2024-10-19T05:27:40Z,https://i.ytimg.com/vi/Jr5AiJ0K9Og/default.jpg,,2024-12-10 10:49:03.594000+00:00,"{""contents"":[{""parts"":[{""text"":""Provide a 2-3 ...","{""candidates"":[{""avgLogprobs"":-0.0225860974827..."


The results above should show three new fields, `status`, `processed_time`, and `response`, that come from batch prediction with Gemini. The last of these, `response`, holds the results we want to extract.

## Further analysis of Gemini video extraction results

With our results from Gemini video extraction in BigQuery, we can pull out various pieces that might interest us. It's possible to do this further analysis in Python or directly in BigQuery - we'll choose the latter here since the results are already there, and [BigQuery's native JSON functionality](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions) provides convenient ways to pull out relevant outputs at scale.
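
For comparison, the Python route could look roughly like the sketch below. It assumes the `response` column holds a JSON string shaped the way the queries in this section expect: the text at `candidates[0].content.parts[0].text` is itself a JSON array whose first element carries `summary` and `references`. The helper name `parse_gemini_response` and the sample payload are hypothetical and exist only for illustration.

```python
import json
from typing import Any


def parse_gemini_response(response_json: str) -> dict[str, Any]:
    """Return the first structured result embedded in a batch prediction response.

    Assumes the model's text output is a JSON array of objects, mirroring the
    JSON path used in the BigQuery queries. Returns an empty dict when the
    response is missing or does not have that shape.
    """
    try:
        response = json.loads(response_json)
        text = response["candidates"][0]["content"]["parts"][0]["text"]
        results = json.loads(text)
    except (TypeError, KeyError, IndexError, json.JSONDecodeError):
        return {}
    if isinstance(results, list) and results and isinstance(results[0], dict):
        return results[0]
    return {}


# Hypothetical payload for illustration; real responses contain more fields.
model_text = json.dumps(
    [
        {
            "summary": "A short example summary.",
            "references": [{"entity_name": "Example Team", "entity_type": "team"}],
        }
    ]
)
sample_response = json.dumps(
    {"candidates": [{"content": {"parts": [{"text": model_text}]}}]}
)

parsed = parse_gemini_response(sample_response)
print(parsed.get("summary"))
```

Returning an empty dict for malformed rows keeps a row-by-row loop from stopping on a single bad response, at the cost of hiding the failure reason. A stricter version might log or re-raise instead.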

### Extract summaries for each YouTube video

In [20]:
# Query to extract summary for each video from JSON Gemini API response
video_summaries_query = f"""
    SELECT
      videoUrl AS url,
      videoTitle AS title,
      videoDescription AS description,

      JSON_EXTRACT_SCALAR(
        JSON_EXTRACT_ARRAY(
          JSON_VALUE(response, '$.candidates[0].content.parts[0].text')
          )[OFFSET(0)],
        '$.summary'
        ) AS geminiSummary,

      publishedAt

    FROM
      `{BQ_DATASET}.{BATCH_PREDICTION_RESULTS_TABLE}`

    ORDER BY
      publishedAt DESC
    """

video_summaries = get_bq_query_results_as_df(video_summaries_query)

# Change column width to be able to read all summary text for each row
pd.set_option("display.max_colwidth", 500)

# Display results
display(Markdown("<b>Batch YouTube Video Analysis Summary Results</b>"))

display(video_summaries.head())

<b>Batch YouTube Video Analysis Summary Results</b>

Unnamed: 0,url,title,description,geminiSummary,publishedAt
0,https://www.youtube.com/watch?v=j3ykZoQMJLI,Dodgers vs. Yankees World Series Game 5 Highlights (10/30/24) | MLB Highlights,"Dodgers vs. Yankees World Series Game 5 full game highlights from 10/30/24, presented by @evanwilliamsbourbon Don't forget ...","The Los Angeles Dodgers defeated the New York Yankees in game 5 of the 2024 World Series. The Yankees had a 5-0 lead early in the game, but the Dodgers came back to win 7-6. This win secured the World Series title for the Dodgers.",2024-10-31T05:24:46Z
1,https://www.youtube.com/watch?v=YezVM5dbSN0,THE DODGERS ARE 2024 WORLD CHAMPIONS! (FULL FINAL INNING OF THEIR CLINCH!),The Dodgers' 5-run comeback is the largest ever in a #WorldSeries clinching victory. Don't forget to subscribe!,"The Los Angeles Dodgers won the World Series in a 7-6 victory over the New York Yankees. This win was notable for being the largest comeback in a clinching game in World Series history, overcoming a 5-0 deficit. The video focuses on the performance of Walker Buehler, who pitched in relief despite limited rest.",2024-10-31T04:26:28Z
2,https://www.youtube.com/watch?v=2ABGYpVo41k,FULL INNING: Dodgers TAKE THE LEAD for the first time in the 8th inning of Game 5!,The Los Angeles Dodgers take the lead vs. the New York Yankees in Game 5 of the 2024 World Series. Don't forget to subscribe!,"This video shows the end of game 5 of the World Series between the Los Angeles Dodgers and the New York Yankees. The Dodgers came back from being down 5-0 to win 7-6, and the video highlights the tension and excitement of the game.",2024-10-31T03:37:04Z
3,https://www.youtube.com/watch?v=9vZVDUjWerI,Dodgers vs. Yankees World Series Game 4 Highlights (10/29/24) | MLB Highlights,"Dodgers vs. Yankees World Series Game 4 full game highlights from 10/29/24, presented by @evanwilliamsbourbon Don't forget ...","The Los Angeles Dodgers defeated the New York Yankees in game four of the 2024 World Series. The Yankees had a strong performance, highlighted by a grand slam from Anthony Volpe, but ultimately fell short against the Dodgers' powerful offense.",2024-10-30T05:26:52Z
4,https://www.youtube.com/watch?v=GGzMqkjvq_Q,GRAND SLAM!! Anthony Volpe gives the Yankees THE LEAD in World Series Game 4!,Anthony Volpe hits a World Series grand slam! Don't forget to subscribe! https://www.youtube.com/mlb Follow us elsewhere too: ...,"The video focuses on the 2024 World Series game between the Los Angeles Dodgers and the New York Yankees. The commentary discusses the Yankees' offensive strategy and the Dodgers' bullpen performance. A key moment is Anthony Volpe's grand slam home run, which gives the Yankees their first lead since game one.",2024-10-30T01:21:20Z


### Find most frequently appearing entities across videos

This final step completes the path from unstructured videos to structured data: we'll use BigQuery to count the references to each entity across videos and return the entities that appear most frequently.

In the default case in this notebook, this counts the number of appearances for each athlete, manager/coach, or team in the specific videos we pulled from the MLB channel in the steps above, and shows the ones with the most appearances across those videos. Think of this as a list of "who mattered most" in Major League Baseball - at least based on content on the league's own YouTube channel - over the last year.
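
The counting logic itself is small enough to sketch in plain Python, which may help when reading the SQL. This is a minimal sketch, assuming each video's parsed result is a dict with a `references` list of `entity_name` / `entity_type` pairs (the fields the query reads). The function name `count_entity_references` and the sample data are hypothetical.

```python
from collections import Counter


def count_entity_references(parsed_results):
    """Count (entity_name, lowercased entity_type) pairs across parsed results."""
    counts = Counter()
    for result in parsed_results:
        # Treat a missing or null "references" field as an empty list.
        for ref in result.get("references") or []:
            name = ref.get("entity_name")
            entity_type = (ref.get("entity_type") or "").lower()
            if name:
                counts[(name, entity_type)] += 1
    return counts


# Hypothetical parsed results for two videos, for illustration only.
sample_results = [
    {
        "references": [
            {"entity_name": "Los Angeles Dodgers", "entity_type": "Team"},
            {"entity_name": "Mookie Betts", "entity_type": "athlete"},
        ]
    },
    {
        "references": [
            {"entity_name": "Los Angeles Dodgers", "entity_type": "team"},
        ]
    },
]

counts = count_entity_references(sample_results)

# Sort by count (descending), then by name, like the ORDER BY in the query.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0][0]))
for (name, entity_type), num_references in ranked:
    print(name, entity_type, num_references)
```

Like `COUNT(*)` over the unnested references, this counts every reference, so an entity listed more than once for the same video is counted more than once.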

In [21]:
# Query to extract entity references from Gemini results, count most frequently appearing
most_referenced_entities_query = f"""
    WITH
    ExtractedText AS
    (
      SELECT
        *,
        JSON_EXTRACT_ARRAY(JSON_VALUE(response, '$.candidates[0].content.parts[0].text'))[OFFSET(0)]
          AS extracted_text

      FROM
        `{BQ_DATASET}.{BATCH_PREDICTION_RESULTS_TABLE}`
    ),

    ExtractedRows AS
    (
      SELECT
        ARRAY(
          SELECT AS STRUCT 
            JSON_EXTRACT_SCALAR(references, '$.entity_name') AS entity_name,
            JSON_EXTRACT_SCALAR(references, '$.entity_type') AS entity_type

          FROM 
            UNNEST(JSON_EXTRACT_ARRAY(extracted_text, '$.references')) AS references
          ) AS reference,

      FROM
        ExtractedText
    )

    SELECT
      References.entity_name AS name,
      LOWER(References.entity_type) AS type,
      COUNT(*) AS num_videos

    FROM
      ExtractedRows,
      UNNEST(ExtractedRows.reference) AS References

    -- Group on the lowercased type so case variants of the same type merge
    GROUP BY
      name, type

    ORDER BY
      num_videos DESC,
      name
    """

most_referenced_entities = get_bq_query_results_as_df(most_referenced_entities_query)

# Display results
display(Markdown("<b>Most Referenced Entities in Videos Analyzed</b>"))

display(most_referenced_entities.head(25))

<b>Most Referenced Entities in Videos Analyzed</b>

Unnamed: 0,name,type,num_videos
0,Los Angeles Dodgers,team,37
1,Mookie Betts,athlete,31
2,Shohei Ohtani,athlete,28
3,Caleb Ferguson,athlete,26
4,New York Yankees,team,26
5,Alex Vesia,athlete,24
6,Evan Phillips,athlete,24
7,Brusdar Graterol,athlete,22
8,Will Smith,athlete,22
9,Aaron Judge,athlete,20
