# Describe YouTube Shorts with HuggingChat LLM

This script combines the previous Jupyter notebooks into a single program that extracts YouTube Shorts by search term and generates LLM descriptions for them.

## Usage

1. Set up authentication. See [Authentication](#authentication).
2. Specify the search terms used to extract the YouTube Shorts IDs in the file `csv/input/youtube_shorts_search_terms.csv`.
3. Run the program by executing all cells. The last cell calls `main()`, which starts the program. `main()` takes one integer parameter: the maximum number of results per search term.
4. Output is saved in `csv/output/youtube_shorts_with_chatbot_summary.csv`.
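
The program reads the input file with `pd.read_csv` and iterates over a column named `Search Terms`, so the file needs a header row with exactly that name. The following is a minimal sketch for creating it; the helper `write_search_terms` and the two example terms are hypothetical and not part of the original notebook.

```python
import csv
import os

def write_search_terms(path, terms):
    """Write one search term per row under the 'Search Terms' header."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Search Terms"])  # column name expected by get_short_ids()
        for term in terms:
            writer.writerow([term])

# Hypothetical example terms; replace them with your own.
write_search_terms("csv/input/youtube_shorts_search_terms.csv",
                   ["cooking hacks", "street football"])
```

Each term becomes one query, and `main()` limits how many results are collected per term.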

## <a id="authentication"></a> Authentication:

Edit `client_secrets.json` to contain all relevant authentication information in the following format:

```json
{
    "api_key":"<YOUR_YOUTUBE_API_KEY>", 
    "huggingLogin":"<YOUR_EMAIL>",
    "huggingPassword":"<YOUR_PASSWORD>"
}
```

Create your YouTube API Key here: <https://console.cloud.google.com/apis/credentials>

Create your HuggingChat credentials here: <https://huggingface.co/chat/>
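
Since a missing or unedited key only surfaces later as an API or login error, it can help to check the file up front. The sketch below assumes the three keys shown above; `load_secrets` and `REQUIRED_KEYS` are hypothetical names introduced for illustration.

```python
import json

# Keys taken from the client_secrets.json format shown above.
REQUIRED_KEYS = ("api_key", "huggingLogin", "huggingPassword")

def load_secrets(path="client_secrets.json"):
    """Load the secrets file and fail early if a value is missing or still a placeholder."""
    with open(path, "r", encoding="utf-8") as f:
        secrets = json.load(f)
    problems = [
        key for key in REQUIRED_KEYS
        if not secrets.get(key) or str(secrets[key]).startswith("<YOUR_")
    ]
    if problems:
        raise ValueError(f"client_secrets.json is missing values for: {', '.join(problems)}")
    return secrets
```

Calling `load_secrets()` before building the API client would raise a `ValueError` naming the offending keys instead of failing during the first request.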

## CSV File Descriptions:

`SHORT_ID_SEARCH_CSV`: CSV file which contains a list with all search queries. Before use, provide your search terms in this file.

`USED_SEARCH_QUERIES`: CSV file which contains all search queries that have already been used. You don't need to touch this file.

`NEW_SHORT_IDS`: CSV file which contains the new short IDs that will be processed during the Shorts extraction phase of the program. You don't need to touch this file.

`YOUTUBE_SHORTS_INFO`: CSV file which contains the extracted YouTube Shorts information. You don't need to touch this file.

`YOUTUBE_SHORTS_WITH_CHATBOT_SUMMARY`: CSV file which contains the final output. Your results will be merged with the previous results. You don't need to edit this file.
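
The program reads `USED_SEARCH_QUERIES` before it writes to it, so on a first run the file presumably has to exist already with the `Query` and `Video_IDs` columns. A minimal sketch for creating a header-only file is shown below; `ensure_csv` is a hypothetical helper, and whether your checkout already ships this file is an assumption to verify.

```python
import csv
import os

def ensure_csv(path, header):
    """Create `path` with only a header row if it does not exist yet.

    Returns True if the file was created, False if it was already there.
    """
    if os.path.exists(path):
        return False
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(header)
    return True

# Column names as used by get_short_ids() and get_video_list().
ensure_csv("csv/tmp/used_search_queries.csv", ["Query", "Video_IDs"])
```

An existing file is left untouched, so earlier results are not overwritten.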

In [1]:
# globals
SHORT_ID_SEARCH_CSV = 'csv/input/youtube_shorts_search_terms.csv'
USED_SEARCH_QUERIES = 'csv/tmp/used_search_queries.csv'
NEW_SHORT_IDS = 'csv/tmp/new_short_ids_temp.csv'
YOUTUBE_SHORTS_INFO = "csv/tmp/youtube_shorts_description.csv"
YOUTUBE_SHORTS_WITH_CHATBOT_SUMMARY = "csv/output/youtube_shorts_with_chatbot_summary.csv"  # final output path, as stated under Usage

API_VERSION = "v3"
API_SERVICE_NAME = "youtube"

In [2]:
# imports
import json
import csv
import time
import pandas as pd
from tqdm import tqdm
from hugchat import hugchat
from hugchat.login import Login
import googleapiclient.discovery
import googleapiclient.errors
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi
from ast import literal_eval


In [3]:
# credentials
with open('client_secrets.json', 'r') as file:
    secrets = json.load(file)

# Set up the API key and YouTube API client
api_key = secrets['api_key']  
youtube = build('youtube', 'v3', developerKey=api_key)
login = secrets['huggingLogin']
password = secrets['huggingPassword']

### Get Short IDs

In [4]:
# get shorts
def run_query(query, max_results):
    video_ids = []
    next_page_token = None

    # Add a keyword to the query to search specifically for Shorts
    modified_query = query + " #shorts"

    while len(video_ids) < max_results:
        # Fetch search results
        request = youtube.search().list(
            part="id",
            q=modified_query,
            type="video",
            maxResults=50,  # Adjust as needed (max 50 per request)
            pageToken=next_page_token
        )
        response = request.execute()

        # Extract video IDs
        for item in response.get('items', []):
            video_ids.append(item['id']['videoId'])
            if len(video_ids) >= max_results:
                break

        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break

    # Create a DataFrame
    df = pd.DataFrame({'Query': [query], 'Video_IDs': [video_ids]})
    return df

def get_short_ids(max_results_per_search_term):
    search_terms_df = pd.read_csv(SHORT_ID_SEARCH_CSV)
    combined_results_df = pd.DataFrame()
    for search_term in tqdm(search_terms_df['Search Terms'], desc="[+] Processing Search Terms"):
        result_df = run_query(search_term, max_results_per_search_term)
    
        # Combine the result with the combined DataFrame
        combined_results_df = pd.concat([combined_results_df, result_df], ignore_index=True)

    # only merge video ids that have not already been scanned
    curr_video_ids_list = get_video_list(USED_SEARCH_QUERIES)
    clean_result_df = pd.DataFrame()

    for index, row in combined_results_df.iterrows():
        clean_list = []
        for video_id in row['Video_IDs']:  # avoid shadowing the built-in id()
            if video_id not in curr_video_ids_list:
                clean_list.append(video_id)
        clean_result_df = pd.concat([clean_result_df, pd.DataFrame({'Query': [row['Query']],'Video_IDs': [clean_list]})], ignore_index=True)

    curr_video_ids = pd.read_csv(USED_SEARCH_QUERIES)    
    df_merged = pd.concat([curr_video_ids, clean_result_df], ignore_index=True)

    # Save to CSV file
    df_merged.to_csv(USED_SEARCH_QUERIES, index=False)
    clean_result_df.to_csv(NEW_SHORT_IDS, index=False)
    return
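
The filtering step in `get_short_ids` keeps only video IDs that do not appear in the previously used results. The same idea can be sketched without pandas, using a set for faster membership checks; `drop_seen_ids` is a hypothetical helper and not part of the notebook.

```python
def drop_seen_ids(results, seen_ids):
    """Return (query, ids) pairs with already-processed IDs removed.

    `results` is a list of (query, list_of_video_ids) pairs. The order of
    queries and of the remaining IDs is preserved.
    """
    seen = set(seen_ids)  # set lookup instead of scanning a list each time
    cleaned = []
    for query, video_ids in results:
        fresh = [video_id for video_id in video_ids if video_id not in seen]
        cleaned.append((query, fresh))
    return cleaned

new_results = [("cats", ["a1", "b2"]), ("dogs", ["b2", "c3"])]
print(drop_seen_ids(new_results, ["b2"]))
```

Like the loop above, this sketch only compares against earlier runs; an ID returned by two queries in the same run is kept for both.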

### Extract Short Info

In [5]:
class YTVideo:
    def __init__(self, videoId=""):
        self.youtubeClient = self.getYoutubeAPICLient()
        
        self.videoId = videoId
        self.transcript, self.duration = self.extractTranscript()
        if self.transcript is None and self.duration is None:
            # no transcript available: skip the remaining API calls
            return
       
        videoInfo = self.getVideoInfo()
        # keep text fields as str; the CSV writer handles the UTF-8 encoding
        self.title = videoInfo['snippet']['title']
        self.description = videoInfo['snippet']['description']
        self.channelTitle = videoInfo['snippet']['channelTitle']
        self.publishedAt = videoInfo['snippet']['publishedAt']
        self.views = videoInfo['statistics']['viewCount']

        if 'likeCount' in videoInfo['statistics']:
            self.likes = videoInfo['statistics']['likeCount']
        else:
            self.likes = 0

        if 'commentCount' in videoInfo['statistics']:
            self.commentCount = videoInfo['statistics']['commentCount']
            try:
                self.top10comments = self.getTopComments()
            except Exception:  # comments could not be fetched
                self.top10comments = []
        else:
            self.commentCount = 0
            self.top10comments = []
        
        self.category = self.getCategoryByID(videoInfo['snippet']['categoryId'])
        return

    def getYoutubeAPICLient(self):
        return googleapiclient.discovery.build(API_SERVICE_NAME, API_VERSION, developerKey=api_key)
    
    # extract transcript and transform it to string
    def extractTranscript(self):
        transcript = ""
        duration = 0
        try:
            transcriptList = YouTubeTranscriptApi.get_transcript(self.videoId)
        except Exception:  # no transcript could be fetched for this video
            return None, None
        for t in transcriptList:
            transcript += f"{t['text']} "
            duration += t['duration']
        return transcript, duration

    # extract video info from YT API
    def getVideoInfo(self):
        response = self.youtubeClient.videos().list(part="snippet,contentDetails,statistics", id=self.videoId).execute()
        return response['items'][0]
        
    def getCategoryByID(self, categoryID):
        response = self.youtubeClient.videoCategories().list(part="snippet", id=categoryID).execute()
        return response['items'][0]['snippet']['title']

    def getTopComments(self):
        response = self.youtubeClient.commentThreads().list(part="snippet", order="relevance", maxResults=10, videoId=self.videoId).execute()
        comment_list = []
        for comment in response['items']:
            comment_list.append(comment['snippet']['topLevelComment']['snippet']['textDisplay'])
        return comment_list

# reads the video IDs from NEW_SHORT_IDS and writes the available information about each video to a CSV file
def generate_csv():
    vidList = get_video_list(NEW_SHORT_IDS)
    # open csv and create csv writer
    with open(YOUTUBE_SHORTS_INFO, 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)

        # extract info from YT API and write to csv file
        writer.writerow(["Video ID", "Video Title", "Channel Title", "Transcript", "Duration", "Words per Second", "Number of Comments", "Top10 Comments", "Category", "Views", "Likes"])
        for vidID in tqdm(vidList, desc="[+] Extracting Information from YouTube API"):
            video = YTVideo(vidID)
            if video.transcript is None:
                continue
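            # split() ignores the trailing space that extractTranscript leaves behind
            words_per_second = len(video.transcript.split()) / video.duration if video.duration else 0
            writer.writerow([video.videoId, video.title, video.channelTitle, video.transcript, video.duration, words_per_second, video.commentCount, video.top10comments, video.category, video.views, video.likes])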
    return
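
The "Words per Second" column divides the number of words in the transcript by the summed segment durations. A standalone sketch of that calculation is shown below; `words_per_second` is a hypothetical helper, and returning `0.0` for a non-positive duration is an assumption about the desired behavior.

```python
def words_per_second(transcript, duration_seconds):
    """Return the speaking rate, or 0.0 when the duration is not positive."""
    if duration_seconds <= 0:
        return 0.0
    # str.split() without arguments ignores repeated and trailing whitespace
    return len(transcript.split()) / duration_seconds

print(words_per_second("hello world ", 4.0))
```

Splitting on whitespace in general, rather than on a single space, keeps a trailing space in the transcript from being counted as an extra word.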

### HuggingChat

In [6]:
def get_llm_descriptions():
    sign = Login(login, password)
    cookies = sign.login()

    # Save cookies to the local directory
    cookie_path_dir = "./cookies_snapshot"
    sign.saveCookiesToDir(cookie_path_dir)

    chatbot = hugchat.ChatBot(cookies=cookies.get_dict())  # or cookie_path="usercookies/<email>.json"

    # load shorts information
    df_shorts = pd.read_csv(YOUTUBE_SHORTS_INFO, sep=";")

    # Apply the function to each row
    df_shorts['Custom Query'] = df_shorts.apply(create_custom_query, axis=1)

    # Process the DataFrame in batches of 10 and write to the CSV file
    process_in_batches(df_shorts, 10, YOUTUBE_SHORTS_WITH_CHATBOT_SUMMARY, chatbot)
    
    return

def create_custom_query(row):
    return (
        "You are a copywriter, create a 100 word summary of what this Youtube Short is about. "
        "Provide a neutral description. The summary should describe the overall atmosphere "
        "and pace of the video. It should also highlight important events from the video. "
        "Do not include any statements about a viewers response to the content or the overall "
        "viewing experience. Output the raw summary text.\n"
        "Title: {}\n"
        "Channel: {}\n"
        "Transcript: {}\n"
        "Comments: {}\n"
        "Category: {}\n"
    ).format(
        row['Video Title'], 
        row['Channel Title'], 
        row['Transcript'], 
        row['Top10 Comments'], 
        row['Category']
    )

def get_chatbot_summary(row, chatbot):
    # pause between requests, then cast to str because query() returns a Message object
    time.sleep(10)
    return str(chatbot.query(row['Custom Query']))

# Process the DataFrame in batches and update the DataFrame
def process_in_batches(dataframe, batch_size, output_csv, chatbot):
    for start in range(0, len(dataframe), batch_size):
        end = min(start + batch_size, len(dataframe))
        batch = dataframe.iloc[start:end]

        with tqdm(total=len(batch), desc=f"Processing Batch {start}-{end}") as pbar:
            try:
                # Process each row and update the DataFrame
                for i, row in batch.iterrows():
                    dataframe.at[i, 'LLM Summary'] = get_chatbot_summary(row, chatbot)
                    pbar.update(1)  # Update the batch progress bar

                # Overwrite the CSV file with the current state of the DataFrame
                dataframe.iloc[:end].to_csv(output_csv, index=False)

                print(f"Batch {start} to {end} processed successfully")
            
            except Exception as e:
                print(f"Error processing batch {start} to {end}: {e}")
                time.sleep(60)  # Sleep timer for rate limiting
                process_in_batches(dataframe.iloc[start:], batch_size, output_csv, chatbot)  # Restart from the current batch
                break
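
The error handling above waits and then restarts from the failed batch. An alternative way to express "wait and try again" is a bounded retry loop around a single call, sketched below. This is not the notebook's approach: `call_with_retry` and its parameters are hypothetical, and the attempt limit is an assumption.

```python
import time

def call_with_retry(func, *args, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call func(*args); on failure wait and retry, up to max_attempts calls in total."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as error:  # broad on purpose: the failure type is not known here
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)  # back off before the next attempt
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

A call such as `call_with_retry(get_chatbot_summary, row, chatbot)` would retry one row instead of a whole batch, and the attempt limit prevents endless retries when the service stays unavailable.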


In [7]:
# utils
def get_video_list(filename):
    ret_list = []
    df = pd.read_csv(filename)
    df.Video_IDs = df.Video_IDs.apply(literal_eval)
    for row in df.Video_IDs:
        ret_list += row
    return ret_list
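
`get_video_list` relies on `literal_eval` because a Python list stored in a CSV cell comes back as a string such as `"['abc', 'def']"`. The round trip can be sketched without pandas as follows; `flatten_id_cells` and the example IDs are hypothetical.

```python
from ast import literal_eval

stored = str(["abc123", "def456"])  # the text form a list takes inside a CSV cell
restored = literal_eval(stored)     # parsed back into a real Python list

def flatten_id_cells(cells):
    """Parse each stringified list and concatenate the results into one flat list."""
    ids = []
    for cell in cells:
        ids.extend(literal_eval(cell))
    return ids

print(restored)
print(flatten_id_cells(["['a', 'b']", "[]", "['c']"]))
```

`literal_eval` only accepts Python literals, so it is a safer choice than `eval` for parsing file contents.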

In [8]:
def main(max_results_per_search_term=50):
    # get youtube short IDs
    print('1. STEP: Getting YouTube Short IDs')
    get_short_ids(max_results_per_search_term)

    # extract short information
    print('2. STEP: Extract Youtube Short Info')
    generate_csv()

    # generate LLM descriptions
    print('3. STEP: Generate LLM Descriptions')
    get_llm_descriptions()
    return

In [9]:
main(100)

3. STEP: Generate LLM Descriptions
Processing Batch 0-10: 100%|██████████| 10/10 [02:52<00:00, 17.26s/it]
Batch 0 to 10 processed successfully
Processing Batch 10-20: 100%|██████████| 10/10 [03:03<00:00, 18.36s/it]
Batch 10 to 20 processed successfully
Processing Batch 20-30: 100%|██████████| 10/10 [03:10<00:00, 19.09s/it]
Batch 20 to 30 processed successfully
Processing Batch 30-40: 100%|██████████| 10/10 [03:19<00:00, 19.96s/it]
Batch 30 to 40 processed successfully
Processing Batch 40-50: 100%|██████████| 10/10 [03:15<00:00, 19.53s/it]
Batch 40 to 50 processed successfully
Processing Batch 50-60: 100%|██████████| 10/10 [03:08<00:00, 18.85s/it]
Batch 50 to 60 processed successfully
Processing Batch 60-70: 100%|██████████| 10/10 [03:08<00:00, 18.89s/it]
Batch 60 to 70 processed successfully
Processing Batch 70-80: 100%|██████████| 10/10 [03:20<00:00, 20.02s/it]
Batch 70 to 80 processed successfully
Processing Batch 80-90: 100%|██████████| 10/10 [03:12<00:00, 19.26s/it]
Batch 80 to 90 processed successfully
Processing Batch 90-100: 100%|██████████| 10/10 [03:05<00:00, 18.55s/it]
Batch 90 to 100 processed successfully
Processing Batch 100-110: 100%|██████████| 10/10 [03:13<00:00, 19.31s/it]
Batch 100 to 110 processed successfully
Processing Batch 110-120: 100%|██████████| 10/10 [03:17<00:00, 19.75s/it]
Batch 110 to 120 processed successfully
Processing Batch 120-130: 100%|██████████| 10/10 [03:13<00:00, 19.37s/it]
Batch 120 to 130 processed successfully
Processing Batch 130-140: 100%|██████████| 10/10 [03:11<00:00, 19.17s/it]
Batch 130 to 140 processed successfully
Processing Batch 140-150: 100%|██████████| 10/10 [03:11<00:00, 19.17s/it]
Batch 140 to 150 processed successfully
Processing Batch 150-160: 100%|██████████| 10/10 [03:06<00:00, 18.63s/it]
Batch 150 to 160 processed successfully
Processing Batch 160-170: 100%|██████████| 10/10 [03:04<00:00, 18.44s/it]
Batch 160 to 170 processed successfully
Processing Batch 170-179: 100%|██████████| 9/9 [02:44<00:00, 18.27s/it]
Batch 170 to 179 processed successfully
