### Video-level metadata

We build a video-level metadata df from all potentially interesting metadata available in the scraped info.jsons. Selected columns of this df can then be joined with the matched extracted-recommendations df as needed; the metadata df itself contains no extraction-related info.
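
As a minimal sketch of such a join, assuming both dfs share a `video_id` key: the toy frames below, including the `recommendation` column, are made-up stand-ins and not the actual extraction output.

```python
import pandas as pd

# Toy stand-in for the video-level metadata df built in this section.
video_metadata = pd.DataFrame({
    "video_id": ["a1", "b2", "c3"],
    "uploader_id": ["u1", "u1", "u2"],
    "view_count": [100, 250, 40],
    "like_count": [10, 30, 2],
})

# Hypothetical recommendations df: one row per extracted recommendation.
recommendations = pd.DataFrame({
    "video_id": ["a1", "a1", "c3"],
    "recommendation": ["item X", "item Y", "item Z"],
})

# Keep only the metadata columns of interest, then left-join onto the recommendations.
selected = video_metadata[["video_id", "view_count", "like_count"]]
joined = recommendations.merge(
    selected,
    on="video_id",
    how="left",
    validate="many_to_one",  # raises if video_id is duplicated in the metadata
)
print(joined)
```

A left join keeps every recommendation row even if its video has no metadata, and `validate="many_to_one"` guards against accidentally duplicating rows.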

We use the ``add_infojson_fields()`` function defined in ``scraping/scraping_utils.py`` to extract data from the info.jsons.
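
The implementation of that function is not shown here. As a rough illustration of the per-file step only, the sketch below assumes each info.json is a plain JSON object with the requested fields as top-level keys. The helper `read_infojson_fields` and the demo file are hypothetical.

```python
import json
import os
import tempfile


def read_infojson_fields(infojson_path, fields):
    """Return {field: value} for one info.json, using None for absent fields."""
    with open(infojson_path, encoding="utf-8") as f:
        info = json.load(f)
    return {field: info.get(field) for field in fields}


# Demo on a throwaway file (file name and content are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.info.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"like_count": 12, "uploader": "Some Channel"}, f)
    extracted = read_infojson_fields(
        demo_path, ["like_count", "comment_count", "uploader"]
    )

print(extracted)
```

Returning `None` for absent keys keeps every row the same shape, so the results can be collected into lists and added as columns afterwards.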

In [1]:
import sys
import os

sys.path.append(os.path.join(os.path.dirname(os.path.abspath('')), '..', 'scraping'))
from scraping_utils import add_infojson_fields

sys.path.append(os.path.join(os.path.dirname(os.path.abspath('')), '..', 'LLM_information_extraction'))
from data_prep_utils import clean_text

import pandas as pd

In [2]:
# read in already existing metadata df from scraping section (which we will expand upon)
df = pd.read_csv("../../scraping/6_filtered_videos_final/filtered_metadata.csv", sep=";")
print(df.columns)

Index(['uploader_id', 'video_id', 'upload_date', 'yt_video_type', 'view_count',
       'duration', 'language', 'title', 'description', 'yt_auto_categories',
       'tags', 'first_three_tags'],
      dtype='object')


In [3]:
infojsons_dir = "../../scraping/5_transcripts_and_metadata/infojsons"
# define fields to be added from info jsons (if available)
fields_to_add = ["like_count", 
                 "comment_count", 
                 "age_limit", 
                 "chapters", 
                 "uploader", # channel name (not id)
                 ]

# add fields
df = add_infojson_fields(df, fields_to_add, infojsons_dir=infojsons_dir, print_missing_fields=False)

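# add has_chapters field (True if the video has a non-empty chapters entry)
# note: bool(float("nan")) is True, so this assumes missing chapters are None or empty rather than NaN
if "chapters" in df.columns and "has_chapters" not in df.columns:
    df["has_chapters"] = df["chapters"].apply(bool)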

------------------------------------------------------------
Adding fields from info jsons to df (nrows: 45968, fields: ['like_count', 'comment_count', 'age_limit', 'chapters', 'uploader'])...
2500/45968 rows processed
5000/45968 rows processed
7500/45968 rows processed
10000/45968 rows processed
12500/45968 rows processed
15000/45968 rows processed
17500/45968 rows processed
20000/45968 rows processed
22500/45968 rows processed
25000/45968 rows processed
27500/45968 rows processed
30000/45968 rows processed
32500/45968 rows processed
35000/45968 rows processed
37500/45968 rows processed
40000/45968 rows processed
42500/45968 rows processed
45000/45968 rows processed
adding intermediate lists to df...
postprocessing...


In [4]:
# add transcripts field (very memory intensive!)
# clean_text is already imported from data_prep_utils in the first cell
transcripts_path = "../../scraping/5_transcripts_and_metadata/transcripts_csvs"

transcripts_list = []
for n_done, (_, row) in enumerate(df.iterrows(), start=1):
    uploader_id = row["uploader_id"]
    video_id = row["video_id"]
    transcript_file = f"{transcripts_path}/{uploader_id}_{video_id}.csv"

    if not os.path.exists(transcript_file):
        print(f"Transcript for {uploader_id}_{video_id} does not exist.")
        transcripts_list.append(None)
    else:
        # load transcript csv
        transcript_csv = pd.read_csv(transcript_file, sep=";")
        # convert transcript to a single string, skipping empty (NaN) lines
        transcript_text = " ".join(line for line in transcript_csv["text"] if isinstance(line, str))
        # clean
        transcript_text = clean_text(transcript_text)

        # store empty transcripts as None
        if transcript_text == "":
            transcript_text = None
        transcripts_list.append(transcript_text)

    # progress (counts processed rows, independent of the df index)
    if n_done % 2500 == 0:
        print(f"Processed {n_done} transcripts.")

print("Finished processing transcripts.")
# add column to df
df["transcript"] = transcripts_list


Processed 2500 transcripts.
Processed 5000 transcripts.
Processed 7500 transcripts.
Processed 10000 transcripts.
Processed 12500 transcripts.
Processed 15000 transcripts.
Processed 17500 transcripts.
Processed 20000 transcripts.
Processed 22500 transcripts.
Processed 25000 transcripts.
Processed 27500 transcripts.
Processed 30000 transcripts.
Processed 32500 transcripts.
Processed 35000 transcripts.
Processed 37500 transcripts.
Processed 40000 transcripts.
Processed 42500 transcripts.
Processed 45000 transcripts.
Finished processing transcripts.
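
To illustrate the flattening step in isolation, here is a small sketch on a made-up transcript in the same `;`-separated layout. The `start` column is an assumption, and the project's `clean_text` is deliberately left out because its behavior is not shown here.

```python
import io

import pandas as pd

# Toy transcript; the second row has an empty text cell.
toy_csv = io.StringIO("start;text\n0.0;hello there\n1.5;\n3.0;welcome back\n")
toy_transcript = pd.read_csv(toy_csv, sep=";")

# pandas reads the empty cell as NaN (a float), so the isinstance check drops it.
joined_text = " ".join(
    line for line in toy_transcript["text"] if isinstance(line, str)
)
print(joined_text)
```

Without the `isinstance` filter, `" ".join` would raise a `TypeError` on the NaN value.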


In [5]:
# save (only if file doesn't exist yet)
save_path = "video_metadata.csv"
if not os.path.exists(save_path):
    df.to_csv(save_path, sep=";", index=False)
else:
    print(f"File {save_path} already exists. Not saving.")

In [8]:
# load df
#df_loaded = pd.read_csv(save_path, sep=";")
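
Because the transcript column makes the saved file large, it can help to reload only the columns needed for a given analysis. A minimal sketch using the `usecols` parameter of `pd.read_csv` follows. The toy file and its contents are made up, and only the `;` separator matches the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the saved metadata file.
toy = pd.DataFrame({
    "video_id": ["a1", "b2"],
    "like_count": [10, 30],
    "transcript": ["long text ...", "more long text ..."],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_metadata.csv")
    toy.to_csv(toy_path, sep=";", index=False)

    # Load only the lightweight columns; the transcript column is never parsed into memory.
    light = pd.read_csv(toy_path, sep=";", usecols=["video_id", "like_count"])

print(light)
```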