# Transcribing YouTube Videos

## Workflow Overview

1. Downloading YouTube videos and converting to MP3 format using yt-dlp
2. Setting up Gradio as a server on a remote GPU computer to run Whisper AI
3. Using Gradio client to send audio data to the remote server for processing
4. Processing the audio files with Whisper AI for transcription
5. Retrieving and storing the transcripts with checkpointing for reliability

#### Installing dependencies

In [None]:
%pip install moviepy
%pip install -U openai-whisper
%pip install -U yt-dlp
%pip install gradio_client
%pip install gradio
# pickle and json are part of Python's standard library and need no install

In [None]:
# Importing
from moviepy import *
import json
from os import listdir, path
import yt_dlp
from gradio_client import Client
from gradio_client.utils import handle_file
import pickle

Set the notebook's working directory

In [None]:
%cd "/Users/DIRECTORY/PATH"

### Data

In [None]:
# Load the JSON file into a list of dicts (one dict per video)
JSON_PATH = "data/dataset_youtube_vids.json"
with open(JSON_PATH, "r") as file:
    df = json.load(file)

### Structure

Step 01 <br>
TODO: Loop through the dataset <br>
TODO: Add a new column `index`, starting at 0, for each row <br>

Step 02 <br>
TODO: Loop through the dataset <br>
TODO: Get the URL for each video <br>
TODO: Get the index <br>
TODO: Name the audio file {index}-audio <br>
TODO: Download the audio files

Step 03 <br>
TODO: Transcribe with Whisper AI on a different computer with a GPU <br>
TODO: ON SERVER -> Function to receive an audio file and transcribe it with Whisper <br>
TODO: ON SERVER -> Get the Gradio Interface API for the local client <br>
TODO: LOCAL: Run the client <br>
TODO: LOCAL: Load the files <br>
TODO: LOCAL: Loop through the files and send them to Gradio <br>
TODO: Store each transcript in a new field called `transcript`
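
The server side of Step 03 is not shown in this notebook. A minimal sketch of what it could look like follows, assuming `openai-whisper` and `gradio` are installed on the GPU machine. The names `transcribe` and `launch_server` and the `"base"` model size are illustrative assumptions, not the server code that was actually used.

```python
def transcribe(audio_path, model):
    """Transcribe one audio file and return the plain transcript text."""
    # Whisper's transcribe() returns a dict; the full transcript is under "text".
    result = model.transcribe(audio_path)
    return result["text"].strip()


def launch_server(model_name="base"):
    """Load Whisper once and expose it through a Gradio interface."""
    # Imported here so transcribe() stays usable without these packages.
    import gradio as gr
    import whisper

    model = whisper.load_model(model_name)  # model size is an assumption

    def predict(audio):
        # The parameter is named `audio` so the client can call predict(audio=...).
        return transcribe(audio, model)

    demo = gr.Interface(
        fn=predict,
        inputs=gr.Audio(type="filepath"),  # Gradio hands the upload over as a file path
        outputs="text",
    )
    # share=True prints a temporary public *.gradio.live URL for the client.
    demo.launch(share=True)
```

Running `launch_server()` on the GPU machine prints a share URL that the local notebook can pass to `Client(...)`.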

In [None]:
# Step 01
# Loop through the dataset and
# add a new field "index", starting at 0, to each row
for idx, row in enumerate(df):
    row["index"] = idx

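#### Download YouTube videos and convert to audio
The `ydl_opts` dictionary tells yt-dlp to pick the best audio stream and convert it to MP3. The conversion step requires FFmpeg to be installed.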

In [None]:
# Step 02: loop through the dataset and download each video's audio track
for row in df:
    url = row["url"]      # video URL
    index = row["index"]  # index added in Step 01 (renamed to avoid shadowing the built-in id)
    print(index, url)
    # Name the output file {index}-audio.mp3
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': f'audio_files/{index}-audio.%(ext)s',
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
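
A download run over many videos can be interrupted. One way to avoid redoing finished work on a rerun is to filter out rows whose MP3 already exists before looping. The sketch below assumes the `audio_files/{index}-audio.mp3` naming used above; the helper names `audio_path_for` and `rows_to_download` are hypothetical.

```python
from os import path


def audio_path_for(index, out_dir="audio_files"):
    """Return the MP3 path expected for a given row index."""
    return path.join(out_dir, f"{index}-audio.mp3")


def rows_to_download(rows, out_dir="audio_files"):
    """Keep only the rows whose MP3 file does not exist yet."""
    return [
        row for row in rows
        if not path.exists(audio_path_for(row["index"], out_dir))
    ]
```

Looping over `rows_to_download(df)` instead of `df` would then skip videos that were already converted.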

### Checking the pkl file for interrupted sessions
The computers are not very fast and there are 1014 audio files to transcribe, so the transcription runs over multiple days. Loading the last checkpoint lets an interrupted session resume where it stopped.
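
Because a session can stop at any moment, it may help to write checkpoints atomically, so that a crash in the middle of a save cannot corrupt the existing file. The sketch below is one possible approach using only the standard library; `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of the notebook.

```python
import os
import pickle
import tempfile


def save_checkpoint(data, checkpoint_path):
    """Write the data to a temp file, then swap it into place."""
    folder = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(data, f)
        # os.replace swaps the file in one step, so checkpoint_path always
        # holds either the previous or the new complete checkpoint.
        os.replace(tmp_path, checkpoint_path)
    except BaseException:
        # Clean up the temp file if the write or the swap failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(checkpoint_path, default):
    """Return the saved data, or `default` when no checkpoint exists yet."""
    if not os.path.exists(checkpoint_path):
        return default
    with open(checkpoint_path, "rb") as f:
        return pickle.load(f)
```

With these helpers, a first run could start from `load_checkpoint('export_transcribed_data.pkl', df)` and later runs would pick up the saved progress.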

In [None]:
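# Load the latest exported checkpoint, or start from the dataset on the first run
CHECKPOINT_PATH = 'export_transcribed_data.pkl'
if path.exists(CHECKPOINT_PATH):
    with open(CHECKPOINT_PATH, 'rb') as f:
        df_copy = pickle.load(f)
else:
    df_copy = df  # no checkpoint yet: work on the rows loaded from JSON
transcribed_count = sum(1 for item in df_copy if item.get('transcript'))
print(f"{transcribed_count} files already transcribed")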

### Using gradio_client to reach a Gradio server on a different computer that runs Whisper AI on its GPU

In [None]:
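# Share links of the Gradio servers.
# Each assignment overwrites the previous one, so only the last client (#4)
# is used by the transcription loop; keep only the line for the server you want.
# 1
client = Client("https://b95cbd9371e8e6844b.gradio.live/")
# 2
client = Client("https://b15ac7efb347bf7f1a.gradio.live/")
# 3
client = Client("https://debe7dedf66c60de09.gradio.live/")
# 4
client = Client("https://c571097d3ed33d1fb0.gradio.live/")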
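
The cell above ends up with a single `client`, so one server does all the work. If the goal is to spread files over several servers, one possible approach (an assumption, not something this notebook does) is to deal the file list into one batch per server and run each batch in its own session. The helper name `split_round_robin` is hypothetical.

```python
def split_round_robin(items, n_workers):
    """Deal items into n_workers lists, one item at a time."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    batches = [[] for _ in range(n_workers)]
    for position, item in enumerate(items):
        # Item 0 goes to batch 0, item 1 to batch 1, and so on, wrapping around.
        batches[position % n_workers].append(item)
    return batches
```

For example, `split_round_robin(list(range(10)), 4)` yields `[[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]`, so each server would get a similar share of the files.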

In [None]:
audio_folder = "audio_files/"
audio_files = [f for f in listdir(audio_folder) if f.endswith(".mp3")]

# Track progress
total_files = len(audio_files)
processed = 0
skipped = 0

# Loop through all audio files
for fname in audio_files:
    file_path = path.join(audio_folder, fname)
    number = int(fname.split("-")[0])
    
    # Find matching item in df_copy
    matching_item = next((item for item in df_copy if item["index"] == number), None)
    
    # Skip if already transcribed
    if matching_item and "transcript" in matching_item and matching_item["transcript"]:
        print(f"Skipping {fname} - already transcribed")
        skipped += 1
        continue
        
    if matching_item:
        print(f"Processing {fname} ({processed+1}/{total_files})")
        try:
            result = client.predict(
                audio=handle_file(file_path),
                api_name="/predict"
            )
            matching_item["transcript"] = result
            print(f"Transcribed: {result[:100]}...")  # Print start of transcript
            
            # Save checkpoint every 10 files (or adjust as needed for larger datasets)
            processed += 1
            if processed % 10 == 0:
                with open('transcribed_data.pkl', 'wb') as f:
                    pickle.dump(df_copy, f)
                print(f"Checkpoint saved ({processed}/{total_files})")
                
        except Exception as e:
            print(f"Error processing {fname}: {e}")
            # Save on error to prevent losing progress
            with open('transcribed_data_error.pkl', 'wb') as f:
                pickle.dump(df_copy, f)
            print("Progress saved after error")
    else:
        print(f"No matching index found for {fname}")

print(f"Finished processing. Transcribed: {processed}, Skipped: {skipped}")

# Save final results
with open('transcribed_data.pkl', 'wb') as f:
    pickle.dump(df_copy, f)

with open('transcribed_data.json', 'w') as f:
    json.dump(df_copy, f, indent=2)
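
The loop above scans `df_copy` once per file to find the matching row. With about a thousand files that is fast enough, but a dictionary keyed by index makes each lookup a single step and keeps the loop body shorter. A minimal sketch, with the hypothetical helper names `build_index_lookup` and `index_from_filename`:

```python
def build_index_lookup(items):
    """Map each row's "index" value to the row itself."""
    return {item["index"]: item for item in items}


def index_from_filename(fname):
    """Extract the leading number from names like '12-audio.mp3'."""
    return int(fname.split("-")[0])
```

Inside the loop, `lookup.get(index_from_filename(fname))` would then return the matching row, or `None` when there is none, where `lookup = build_index_lookup(df_copy)` is built once before the loop.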

### Saving the checkpoints

Double checking just in case!

In [None]:
with open('export_transcribed_data.pkl', 'wb') as f:
    pickle.dump(df_copy, f)
    
with open('export_transcribed_data.json', 'w') as f:
    json.dump(df_copy, f, indent=2)