
# YouTube Video Metadata Processing Pipeline

## Overview
This Jupyter notebook processes YouTube video metadata from downloaded JSON files. Its main component is the `load_YT_info` function, which reads every video info file in a folder and builds a structured DataFrame with one row per video.

### Key Features
- Processes multiple video info JSON files
- Extracts relevant metadata fields like:
  - Video details (ID, title, description)
  - Channel information
  - Engagement metrics (views, likes, comments)
  - Technical details (codecs, formats)
- Progress tracking during processing
- Optional CSV export functionality

### Prerequisites
The main pipeline needs only the standard-library `os` and `json` modules plus `pandas`.


In [None]:
import os
import json
import pandas as pd



### Usage
The pipeline expects one metadata JSON file per video (for example `_FHGkHjALyU.info.json`) placed directly inside a single folder; subfolders are not searched. Every file ending in `.json` is read, the fields of interest are extracted, and the resulting DataFrame can optionally be saved as CSV for further analysis.

In [None]:
def load_YT_info(folder_path, save_as=None):
    """
    Build a DataFrame from the info.json files in a downloads folder.

    --- args ---
    folder_path: string  # folder containing the video info files (.json)

    --- kwargs ---
    save_as: string (optional)  # path of the output CSV file | default: None (nothing is saved)

    --- output ---
    video_data: pandas.DataFrame  # one row per file, with the renamed columns listed below

    If save_as is given, the DataFrame is also written to that path as CSV.
    """
    
    # Define the variables of interest and their new names
    variables_of_interest = [
        'id', 'title', 'description', 'channel_id', 'channel', 'channel_is_verified', 'duration', 'duration_string',
        'view_count', 'comment_count', 'like_count', 'channel_follower_count', 'categories', 'tags', 'upload_date',
        'was_live', 'language', 'vcodec', 'acodec', 'video_ext', 'audio_ext', 'format',
    ]
    
    names = [
        'video_id', 'title', 'description_raw', 'channel_id', 'channel_name', 'channel_is_verified', 'duration_seconds',
        'duration_string', 'view_count', 'comment_count', 'like_count', 'subscriber_count', 'categories', 'tags', 'upload_date',
        'was_live', 'language', 'vcodec', 'acodec', 'video_ext', 'audio_ext', 'format'
    ]
    
    # Get the list of JSON files
    files = [file for file in os.listdir(folder_path) if file.endswith(".json")]
    total_files = len(files)
    
    def load_json(file_path, i):
        print(f"\rProcessing file {i + 1}/{total_files}", end="")
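        # Load JSON data from the file; read as UTF-8 explicitly so titles with
        # emoji or other non-ASCII text do not depend on the platform default
        with open(file_path, "r", encoding="utf-8") as file: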
            data = json.load(file)
        
        # Extract only the variables of interest and handle missing keys
        row = {}
        for var, name in zip(variables_of_interest, names):
            row[name] = data.get(var, None)  # Use None if key is missing
        
        return row
    
    # Use a generator to process files and create rows for the DataFrame
    data_rows = (load_json(os.path.join(folder_path, file), i) for i, file in enumerate(files))

    # Convert the rows into a DataFrame
    video_data = pd.DataFrame(data_rows)
    
    
    # Save to CSV if needed
    if save_as:
        video_data.to_csv(save_as, index=False)

    print("\nProcessing complete.")
    return video_data
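
To see how the field selection behaves on a single record, the sketch below applies the same `data.get(var, None)` idea to a hand-written dictionary. The helper name `extract_fields`, the shortened mapping `FIELD_MAP`, and the sample record are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical, shortened source-key -> column-name mapping
# (the notebook's function uses 22 fields).
FIELD_MAP = {
    "id": "video_id",
    "title": "title",
    "duration": "duration_seconds",
    "channel_follower_count": "subscriber_count",
}

def extract_fields(record, field_map=FIELD_MAP):
    """Return one row with renamed keys; keys absent from the record become None."""
    return {new: record.get(old) for old, new in field_map.items()}

# Made-up record that lacks 'channel_follower_count' and has one extra key.
sample = {"id": "abc123", "title": "Example video", "duration": 32.0, "extra": "ignored"}
row = extract_fields(sample)

# A list of such rows becomes a DataFrame with one row per record.
demo_df = pd.DataFrame([row])
print(demo_df)
```

Keeping the mapping in a single dictionary, rather than two parallel lists, makes it harder for source keys and column names to drift out of step.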

In [3]:
path = '../../YouTube_Downloader/Complete_Downloads'
video_df = load_YT_info(path, save_as='video_metadata1.csv')

Processing file 20180/20180
Processing complete.
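
One thing to keep in mind when reloading the exported CSV: list-valued columns such as `categories` and `tags` are written as their string representation, so they come back as plain strings. Below is a minimal sketch of restoring them with `ast.literal_eval`, using an in-memory buffer and made-up rows rather than the real `video_metadata1.csv`.

```python
import ast
import io

import pandas as pd

# Made-up rows standing in for the real metadata.
original = pd.DataFrame({
    "video_id": ["abc123", "def456"],
    "tags": [["zwift", "cycling"], []],
})

# Round-trip through CSV text instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the lists are strings such as "['zwift', 'cycling']".
print(type(reloaded.loc[0, "tags"]))

# Parse the string representation back into Python lists.
reloaded["tags"] = reloaded["tags"].apply(ast.literal_eval)
print(type(reloaded.loc[0, "tags"]))
```

Rows whose value is missing would reach `ast.literal_eval` as NaN and raise an error, so real data probably needs a guard such as `pd.notna` before parsing.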


In [4]:
video_df

Unnamed: 0,video_id,title,description_raw,channel_id,channel_name,channel_is_verified,duration_seconds,duration_string,view_count,comment_count,...,categories,tags,upload_date,was_live,language,vcodec,acodec,video_ext,audio_ext,format
0,qjXZWjqkpGQ,Mixing drinks colors in eye #art #artwork #sat...,,UCgeZd65CusXsksnmDLFmz_w,Livalaina,True,32.0,32,103109291,7800.0,...,[Howto & Style],[],20230818,False,,avc1.42001E,mp4a.40.2,mp4,none,"18 - 360x640 (360p, WEB)"
1,wLgEuiP30BY,brooklyn nine nine being relatable for six min...,"""i'm gonna go cry in the bathroom. peace out h...",UCPp4TDFWY_keLuuTiBsblzQ,LMthedream,,371.0,6:11,802132,614.0,...,[People & Blogs],"[brooklyn nine nine, b99, brooklyn nine nine b...",20201024,False,en,avc1.42001E,mp4a.40.2,mp4,none,"18 - 640x360 (360p, WEB)"
2,jn2mgWDC7Q0,BARBIE GIRL #shorts,Barbie Girl #shorts,UCmrn1ElAUe77KQG0SYoSZHg,Bruna Barbie,True,14.0,14,44314588,3100.0,...,[People & Blogs],"[Barbie, Barbie Girl, Shorts, Pink house, casa...",20211115,False,,avc1.42001E,mp4a.40.2,mp4,none,"18 - 360x640 (360p, WEB)"
3,ipJ28h_ySpU,“When he’s in uniform 🤭”,,UCSiMTQtZkgJz8-kUxpStMXg,Aaron Paulsen,True,25.0,25,2576051,1000.0,...,[Comedy],[],20231002,False,en,avc1.42001E,mp4a.40.2,mp4,none,"18 - 360x640 (360p, WEB)"
4,02yZJDuh8tY,One World: Together at Home,"Global Citizens, now is your chance to fight t...",UCg3_C7BwcV0kBlJbBFHTPJQ,Global Citizen,True,61.0,1:01,554835,,...,[Nonprofits & Activism],"[Global Citizen, Global Citizenship, Music Fes...",20200407,False,,avc1.42001E,mp4a.40.2,mp4,none,"18 - 360x360 (360p, WEB)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20175,MXJuJNGX8m4,Avengers as zodiac signs,avengers as zodiac signs (just for fun) Its ge...,UCLJdfFYMNOr1S9VLc1xe_jg,Sarah Richy,,376.0,6:16,1060162,,...,[People & Blogs],[],20200113,False,,avc1.42001E,mp4a.40.2,mp4,none,"18 - 640x360 (360p, WEB)"
20176,4gspvBymj28,Indoor Cycling. Outdone.,Zwift has taken your boring indoor cycling rou...,UCeOCqLG5Wy65aiENWfuFzUQ,Zwift,,100.0,1:40,254246,41.0,...,[Sports],"[zwift, Zwift, Cycling (Interest)]",20151126,False,en,avc1.42001E,mp4a.40.2,mp4,none,"18 - 640x360 (360p, WEB)"
20177,-518j79hFxc,How to run cluster analysis in Excel,A step by step guide of how to run k-means clu...,UCqVk1NO2dx2OiG2GJxGfWsw,Marketing Study Guide,,675.0,11:15,219412,60.0,...,[Education],"[cluster analysis, marketing, Excel, segmentat...",20160131,False,en,avc1.42001E,mp4a.40.2,mp4,none,"18 - 640x318 (360p, WEB)"
20178,1d6AtUddCz0,Uncle Roger HATE Jamie Oliver Butter Chicken,Go to https://www.expressvpn.com/uncleroger an...,UCVjlpEjEY9GpksqbEesJnNA,mrnigelng,True,661.0,11:01,11651371,21000.0,...,[Comedy],"[nigel ng, uncle roger, nigel ng comedy]",20220109,False,en,vp09.00.21.08,mp4a.40.2,,,"605 - 640x360 (IOS)+140 - audio only (medium, ..."
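
The preview shows `upload_date` as compact `YYYYMMDD` values and `comment_count` as floats with gaps. Here is a small sketch of tidying those types, assuming `upload_date` is stored as a string and using made-up rows:

```python
import pandas as pd

# Made-up rows; None marks a missing value.
demo = pd.DataFrame({
    "upload_date": ["20230818", "20201024", None],
    "comment_count": [7800.0, None, 614.0],
})

# Parse YYYYMMDD strings into timestamps; missing values become NaT.
demo["upload_date"] = pd.to_datetime(demo["upload_date"], format="%Y%m%d")

# The nullable integer dtype keeps missing counts as <NA> instead of forcing floats.
demo["comment_count"] = demo["comment_count"].astype("Int64")

print(demo.dtypes)
```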


In [9]:
video_df['language'].value_counts()

language
en       10655
da         757
es         189
de         118
ko          92
nl          75
ru          64
pl          62
fr          61
vi          60
hi          59
id          54
ar          54
th          35
pt          34
ja          34
it          32
ro          28
tr          27
sv          13
no          11
el           9
cs           8
bg           3
iw           3
uk           3
lt           2
sk           1
kn           1
ta           1
hu           1
fil          1
fi           1
en-US        1
pa           1
lv           1
Name: count, dtype: int64
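
`value_counts()` skips missing values by default, so the counts above cover only videos that carry a language tag. The sketch below, on made-up values, includes missing entries via `dropna=False` and folds regional variants such as `en-US` into their base code. Whether that folding suits the analysis is an assumption.

```python
import pandas as pd

# Made-up language tags; None marks a video without a tag.
langs = pd.Series(["en", "en-US", "da", None, "en", None], name="language")

# Count missing tags as their own category.
with_missing = langs.value_counts(dropna=False)

# Keep only the part before the first hyphen, e.g. 'en-US' -> 'en'.
base = langs.str.split("-").str[0]
base_counts = base.value_counts()

print(with_missing)
print(base_counts)
```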

In [1]:
import pandas as pd
import os
import json
import random

def load_random_json_to_df(path):
    """Load one randomly chosen .json file from `path` into a DataFrame for inspection."""
    files = [f for f in os.listdir(path) if f.endswith('.json')]
    if not files:
        raise FileNotFoundError("No JSON files found in the specified path.")
    
    random_file = random.choice(files)
    file_path = os.path.join(path, random_file)

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"Loaded {random_file}, JSON type: {type(data)}")
    
    if isinstance(data, list):
        return pd.DataFrame(data)
    elif isinstance(data, dict):
        return pd.DataFrame.from_dict(data, orient='index')
    else:
        raise ValueError("Unexpected JSON structure")

# Example usage
df = load_random_json_to_df("../../YouTube_Downloader/Complete_Downloads")


Loaded _FHGkHjALyU.info.json, JSON type: <class 'dict'>


In [3]:
print(df)

                                                              0
id                                                  _FHGkHjALyU
title          Can you get rich with just *dirt* in The Sims 4?
formats       [{'format_id': 'sb2', 'format_note': 'storyboa...
thumbnails    [{'url': 'https://i.ytimg.com/vi/_FHGkHjALyU/3...
thumbnail     https://i.ytimg.com/vi_webp/_FHGkHjALyU/maxres...
...                                                         ...
aspect_ratio                                               1.78
http_headers  {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; ...
format                                 18 - 640x360 (360p, WEB)
_type                                                     video
_version      {'version': '2025.02.10.232934', 'release_git_...

[69 rows x 1 columns]


In [8]:
print(df.index)

Index(['id', 'title', 'formats', 'thumbnails', 'thumbnail', 'description',
       'channel_id', 'channel_url', 'duration', 'view_count', 'age_limit',
       'webpage_url', 'categories', 'tags', 'playable_in_embed', 'live_status',
       '_format_sort_fields', 'automatic_captions', 'subtitles',
       'comment_count', 'heatmap', 'like_count', 'channel',
       'channel_follower_count', 'channel_is_verified', 'uploader',
       'uploader_id', 'uploader_url', 'upload_date', 'timestamp',
       'availability', 'webpage_url_basename', 'webpage_url_domain',
       'extractor', 'extractor_key', 'display_id', 'fulltitle',
       'duration_string', 'is_live', 'was_live', 'epoch', 'asr', 'format_id',
       'format_note', 'source_preference', 'fps', 'audio_channels', 'height',
       'quality', 'has_drm', 'tbr', 'filesize_approx', 'url', 'width',
       'language_preference', 'ext', 'vcodec', 'acodec', 'dynamic_range',
       'downloader_options', 'protocol', 'video_ext', 'audio_ext',
       're

In [23]:
df.loc['format_note']

0    360p, WEB
Name: format_note, dtype: object
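
Because the record is stored as a one-column frame, `df.loc['format_note']` returns a Series for that row, not the bare value. To get the value itself, one can select the column as well, or hold the record as a `pandas.Series` from the start. Here is a sketch with a made-up dictionary standing in for the loaded JSON:

```python
import pandas as pd

# Made-up record with scalar values only.
record = {"id": "abc123", "format_note": "360p, WEB", "fps": 25}

# A Series keyed by field name gives direct scalar access.
as_series = pd.Series(record)
value = as_series["format_note"]

# The one-column layout used in the notebook, with column label 0.
as_frame = as_series.to_frame(name=0)
row = as_frame.loc["format_note"]      # Series of length 1
cell = as_frame.loc["format_note", 0]  # bare value

print(value, cell)
```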