## YouTube Video Data Exploration
SCOPE: This project aims to build a Retrieval-Augmented Generation (RAG) chatbot that answers user questions based on transcribed content from ServiceNow YouTube videos using both text and audio inputs.

Steps Performed:
* Load YouTube video metadata (22 videos in total)
    * Files Used: SNOW_YT_Videos.csv
    * Created File: ServiceNow_Youtube_Metadata_Clean.csv
* Add transcripts to the metadata
    * Files Used: ServiceNow_Youtube_Metadata_Clean.csv
    * Created File: video_metadata_with_transcripts.csv

## Step 1: Import Libraries and Load CSV File

In [1]:
%pip install yt-dlp
%pip install pandas numpy --quiet
%pip install openai-whisper --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import numpy as np
import os
import yt_dlp
import whisper

In [7]:
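# Make ffmpeg (required by Whisper for audio decoding) visible to this process.
# The path is specific to the author's machine.
os.environ["PATH"] += os.pathsep + r"C:\ffmpeg-7.1.1-essentials_build\bin"
# The source file is semicolon-delimited.
df = pd.read_csv("../Data/SNOW_YT_Videos.csv", sep=";")
print(df.head())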

   Number                                 Youtube_link  \
0       1  https://www.youtube.com/watch?v=tOaMRG8DX3U   
1       2  https://www.youtube.com/watch?v=vteLoWpNw8Q   
2       3  https://www.youtube.com/watch?v=7WJ6lmxa1WQ   
3       4  https://www.youtube.com/watch?v=fqB-NcZmqXo   
4       5  https://www.youtube.com/watch?v=ZYJqkxGrNiI   

                                             Subject  
0  An AI Agent that knows everything about your P...  
1          What Is Agentic AI and Why Should I Care?  
2                     Agentic AI workflows for AIOps  
3  ServiceNow's agentic AI framework explained: W...  
4  AI and Business Agility: Enhancing Human Intel...  
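
Before sending the links to yt-dlp, it can help to check that each entry in `Youtube_link` is a well-formed watch URL and to pull out its video ID, since yt-dlp error messages identify videos by ID rather than by row number. The following is a minimal sketch using only the standard library; the helper name `extract_video_id` and the list of accepted hostnames are assumptions for illustration and are not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

# Hostnames accepted by this sketch (an assumption; extend as needed).
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}

def extract_video_id(url):
    """Return the `v` query parameter of a YouTube watch URL, or None."""
    parsed = urlparse(str(url).strip())
    if parsed.hostname not in YOUTUBE_HOSTS:
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None

# Two links taken from the CSV preview above, plus one malformed entry.
links = [
    "https://www.youtube.com/watch?v=tOaMRG8DX3U",
    "https://www.youtube.com/watch?v=vteLoWpNw8Q",
    "not a link",
]
video_ids = [extract_video_id(link) for link in links]
print(video_ids)
```

With the DataFrame loaded above, the same helper could be applied as `df["Youtube_link"].map(extract_video_id)` to add an ID column.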


## Step 2: Extract Video Metadata

In [4]:
# Create the folder that the CSV below is written to.
os.makedirs("../Data", exist_ok=True)

def get_metadata_yt_dlp(video_url):
    """Fetch metadata for one video without downloading it."""
    ydl_opts = {'quiet': True, 'skip_download': True}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        try:
            info = ydl.extract_info(video_url, download=False)
            return {
                "title": info.get("title"),
                "channel": info.get("uploader"),
                # The description can be None, so fall back to "" before slicing.
                "description": (info.get("description") or "")[:200],
                "length": info.get("duration"),  # seconds
                "publish_date": info.get("upload_date"),  # YYYYMMDD string
                "views": info.get("view_count")
            }
        except Exception as e:
            # Private or removed videos end up here; the message is kept in an "error" column.
            return {"error": str(e)}

# One metadata dict per link, in the same order as df.
metadata_list = [get_metadata_yt_dlp(link) for link in df["Youtube_link"]]
metadata_df = pd.DataFrame(metadata_list)
final_df = pd.concat([df, metadata_df], axis=1)
final_df.to_csv("../Data/ServiceNow_Youtube_Metadata_Clean.csv", index=False)

ERROR: [youtube] VFGAvNxaK4Q: Private video. Sign in if you've been granted access to this video. Use --cookies-from-browser or --cookies for the authentication. See  https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp  for how to manually pass cookies. Also see  https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies  for tips on effectively exporting YouTube cookies
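
The error above means that at least one link returned only an `error` entry, so that row has empty metadata fields in the saved CSV. One way to handle this is to separate failed lookups from successful ones and to convert yt-dlp's `YYYYMMDD` upload date into an ISO date at the same time. The sketch below uses plain dictionaries shaped like the return value of `get_metadata_yt_dlp`; the helper names `split_results` and `normalize_upload_date` and all sample values are made up for illustration.

```python
from datetime import datetime

def split_results(links, results):
    """Separate successful metadata dicts from failed lookups."""
    ok, failed = [], []
    for link, result in zip(links, results):
        if "error" in result:
            failed.append({"Youtube_link": link, "error": result["error"]})
        else:
            ok.append({"Youtube_link": link, **result})
    return ok, failed

def normalize_upload_date(value):
    """Convert a 'YYYYMMDD' string to 'YYYY-MM-DD'; return None if it cannot be parsed."""
    try:
        return datetime.strptime(str(value), "%Y%m%d").date().isoformat()
    except ValueError:
        return None

# Hypothetical results: one successful lookup and one private video.
sample_links = ["https://www.youtube.com/watch?v=AAA", "https://www.youtube.com/watch?v=BBB"]
sample_results = [
    {"title": "Example title", "publish_date": "20240115"},
    {"error": "Private video"},
]

ok, failed = split_results(sample_links, sample_results)
for row in ok:
    row["publish_date"] = normalize_upload_date(row.get("publish_date"))

print(len(ok), "succeeded;", len(failed), "failed")
```

Keeping the failed links in their own list makes it easy to report or retry them without leaving half-empty rows in the metadata file.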


## Step 3: Add Transcripts to the Metadata

In [6]:
import warnings
warnings.filterwarnings("ignore", message="FP16 is not supported on CPU; using FP32 instead")
# The metadata file was written by to_csv with the default comma separator.
df = pd.read_csv("../Data/ServiceNow_Youtube_Metadata_Clean.csv")
output_path = "../Data/video_metadata_with_transcripts.csv"

# Explicit flag, so a value left over from an earlier run of this cell cannot trigger an update.
run_update = False

if os.path.exists(output_path):
    existing_df = pd.read_csv(output_path)
    if "transcript" in existing_df.columns and not existing_df["transcript"].isnull().all():
        print("✅ Transcripts already exist. Skipping update.")
        final_df = existing_df
    else:
        print("⚠️ Existing file found but missing or empty transcripts. Using metadata to update.")
        final_df = df.copy()
        run_update = True
else:
    print("📂 File not found. Creating transcript file from metadata.")
    final_df = df.copy()
    run_update = True

if run_update:
    if "transcript" not in final_df.columns:
        print("⚙️ No 'transcript' column found in metadata — creating empty column.")
        final_df["transcript"] = ""

    final_df.to_csv(output_path, index=False)
    print(f"✅ Metadata with transcripts saved to: {output_path}")

✅ Transcripts already exist. Skipping update.
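
The cell above only creates an empty `transcript` column; the step that produces the transcript text is not shown in this section. The sketch below is one possible way to fill that column, not necessarily the method used for the existing file. It assumes the audio for each video has already been saved locally (the file names are hypothetical) and uses the `whisper.load_model` and `model.transcribe` calls from openai-whisper. The transcriber is passed in as a function, so the fill logic can be tried with a stand-in before running the real model.

```python
def make_whisper_transcriber(model_name="base"):
    """Build a function that maps an audio file path to text using openai-whisper."""
    import whisper  # imported lazily so fill_transcripts works without it installed
    model = whisper.load_model(model_name)
    return lambda audio_path: model.transcribe(audio_path)["text"].strip()

def is_empty(value):
    """True for None, NaN, or a blank string."""
    return value is None or value != value or str(value).strip() == ""

def fill_transcripts(rows, audio_paths, transcribe):
    """Return copies of rows with 'transcript' filled in where it was empty."""
    updated = []
    for row, audio_path in zip(rows, audio_paths):
        row = dict(row)
        if is_empty(row.get("transcript")) and audio_path:
            try:
                row["transcript"] = transcribe(audio_path)
            except Exception as exc:
                # Keep going if one file fails; record the reason instead.
                row["transcript"] = ""
                row["transcript_error"] = str(exc)
        if is_empty(row.get("transcript")):
            row["transcript"] = ""
        updated.append(row)
    return updated

# Stand-in transcriber for a dry run; swap in make_whisper_transcriber() for real audio.
def fake_transcribe(audio_path):
    return f"transcript of {audio_path}"

# Hypothetical rows and audio file names.
rows = [
    {"title": "Video A", "transcript": ""},
    {"title": "Video B", "transcript": "already transcribed"},
    {"title": "Video C"},
]
audio_paths = ["audio/video_a.mp3", "audio/video_b.mp3", None]

for row in fill_transcripts(rows, audio_paths, fake_transcribe):
    print(row["title"], "->", row["transcript"])
```

With the DataFrame from the cell above, the rows could come from `final_df.to_dict("records")` and the result could be turned back into a DataFrame with `pd.DataFrame(...)` before saving.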
