<a href="https://colab.research.google.com/github/Adylitto/Adylonfleek/blob/main/WhisperYouTube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're looking at this on GitHub and new to Python Notebooks or Colab, click the Google Colab badge above 👆


# **Creating YouTube transcripts with OpenAI's Whisper model**

📺 Getting started video: https://youtu.be/kENRf82_RQs

*Colab beginner notes:*
<br>
1. These files are loaded on a virtual machine in the cloud. Nothing is downloaded to your computer except the transcript, and only when you click to download it. When you close this session, the instance is erased.
<br>
2. The run button is visible when you move your mouse close to the left edge of the code block. It looks kind of like this: ▶️ ...but round...and white on black...so nothing like this. You'll know it when you see it.

### **Note: For faster performance set your runtime to "GPU"**
*Click on "Runtime" in the menu and click "Change runtime type". Select "GPU".*
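
If you want to double-check that the runtime switch took effect, here is a minimal sketch that looks for the `nvidia-smi` tool. It assumes a GPU runtime exposes `nvidia-smi` on the PATH, and `gpu_available` is a made-up helper name, not something this notebook defines.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")  # assumption: present on GPU runtimes
    if exe is None:
        return False
    try:
        out = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return out.returncode == 0 and "GPU" in out.stdout


print("GPU runtime" if gpu_available() else "CPU runtime")
```

If it prints "CPU runtime", the notebook should still work, just more slowly.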


**Step 1.** Follow the instructions in each block and select the options you want
<br>
**Step 2.** Get the URL of the video you want to transcribe
<br>
**Step 3.** Refresh the folder on the left and download your transcript
<br>
**Step 4.** Go to your YouTube account and upload the transcript to the video it came from and use "autosync."

That's it!

Have a question? Hit me up on Twitter: [@AndrewMayne](https://twitter.com/andrewmayne)

<br>



---


**What is this?**
<br>
This is a Python notebook that creates a transcript from a YouTube URL using OpenAI's Whisper transcription model. You can then upload the transcript to YouTube and use the autosync feature to create captions.
<br>  
**What is OpenAI's Whisper model?**
<br>
Whisper is an automatic speech recognition (ASR) neural net created by OpenAI. OpenAI describes it as approaching human-level robustness and accuracy on English speech recognition.
<br>
<br>
**Why use this?**
<br>
The quality of the OpenAI Whisper model is amazing (I am slightly biased, but seriously, check it out.) You can also use it to transcribe in other languages.
<br>
<br>
**What do the different model sizes do?**
<br>
Each step up in model size improves transcription quality, especially for languages other than English. I've found that for a YouTube video with clear speech, the base model works really well. If you see transcription errors, you can try a larger model.
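
As a rough sketch of how the model names fit together: the five sizes used in this notebook are `tiny`, `base`, `small`, `medium`, and `large`, and the English-only variants add `.en` (there is no `large.en`). `whisper_model_name` below is a hypothetical helper for illustration, not part of Whisper.

```python
# The five sizes offered in this notebook; English-only ".en" checkpoints
# exist for every size except "large".
SIZES = ("tiny", "base", "small", "medium", "large")


def whisper_model_name(size: str = "base", english_only: bool = False) -> str:
    """Build the string you would pass to whisper.load_model()."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if english_only and size != "large":
        return f"{size}.en"
    return size


print(whisper_model_name("small", english_only=True))
```

Starting with `base` and moving up one size at a time is a reasonable way to trade speed for accuracy.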
<br>
<br>
**Do I need timestamps?**
<br>
Nope. YouTube's autosync function matches the text to the spoken words and syncs up really well. All you need is each spoken sentence in a .txt file.
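
If your transcript comes out as one long paragraph, something like this could put each sentence on its own line before you save the .txt file. It's a rough sketch: `one_sentence_per_line` is a hypothetical helper, and the split is naive, so abbreviations like "Dr." will cause extra line breaks.

```python
import re


def one_sentence_per_line(text: str) -> str:
    """Break text after '.', '?' or '!' when whitespace follows."""
    # The lookbehind keeps the punctuation attached to its sentence.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    return "\n".join(s for s in sentences if s)


sample = "Welcome back. Ready to start? Let's go!"
print(one_sentence_per_line(sample))
```

With a Whisper result you would pass in `result["text"]` and write the returned string to your transcript file.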
<br>
<br>
**How do I do this?**
<br>
Just follow each step. If you've never used Colab or a Python notebook, don't panic. It's super easy and runs in the cloud.
<br>
<br>
**Does this cost anything to use?**
<br>
Nope. You can use Colab for free and Whisper is an open source model.
<br>
<br>
[Tips for creating a YouTube transcript file](https://support.google.com/youtube/answer/2734799?hl=en)
<br>
[Information on OpenAI's Whisper model](https://openai.com/blog/whisper/)
<br>
[OpenAI's Whisper GitHub page](https://github.com/openai/whisper)
<br>












In [1]:
"""
1. Click the start button in the upper left side of this block to load the necessary libraries

You will need to run this every time you reload this notebook.
"""

!pip install youtube_dl
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install -y ffmpeg
!pip install librosa

import whisper
import time
import librosa
import re
import youtube_dl

Collecting youtube_dl
  Downloading youtube_dl-2021.12.17-py2.py3-none-any.whl.metadata (1.5 kB)
Downloading youtube_dl-2021.12.17-py2.py3-none-any.whl (1.9 MB)
Installing collected packages: youtube_dl
Successfully installed youtube_dl-2021.12.17
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-oy_o54h1
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-oy_o54h1
  Resolved https://github.com/openai/whisper.git to commit 517a43ecd132a2089d85f4ebc044728a71d49f6e
  Installing build dependencies ... [

In [2]:
"""
2. Select the model you want to use.

Base works really well so it's the default.

(For multilingual, remove ".en" from the model name.)

Click the run button after you've made your choice (or left it at default.)
"""

# model = whisper.load_model("tiny.en")
model = whisper.load_model("base")
# model = whisper.load_model("small.en")
# model = whisper.load_model("medium.en")
# model = whisper.load_model("large")

100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 109MiB/s]
  checkpoint = torch.load(fp, map_location=device)


In [6]:
!pip install yt-dlp

Collecting yt-dlp
  Downloading yt_dlp-2025.2.19-py3-none-any.whl.metadata (171 kB)
Downloading yt_dlp-2025.2.19-py3-none-any.whl (3.2 MB)
Installing collected packages: yt-dlp
Successfully installed yt-dlp-2025.2.19
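
The batch cell below splits each transcript into overlapping chunks in its `chunk_text` function. Here is a stripped-down sketch of just the overlap idea, leaving out the sentence-boundary search and the timestamps. `split_with_overlap` is a hypothetical name, and the explicit `break` on the last chunk is my assumption about how to make sure the loop finishes.

```python
def split_with_overlap(text, max_chunk_size=1000, overlap=100):
    """Split text into chunks where each chunk repeats the tail of the previous one."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than max_chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # last chunk emitted; stepping back by `overlap` would loop forever
        start = end - overlap  # the next chunk re-reads the last `overlap` characters
    return chunks


print(split_with_overlap("abcdefghij", max_chunk_size=4, overlap=1))
```

Because `overlap` is smaller than `max_chunk_size`, every pass moves `start` forward, so the loop always terminates.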


In [None]:
# Install required packages
!pip install yt-dlp tqdm

# If you haven't already installed these
# !pip install git+https://github.com/openai/whisper.git
# !sudo apt update && sudo apt install ffmpeg
# !pip install librosa

import os
import re
import time
import json
import concurrent.futures
import whisper
import librosa
from tqdm.notebook import tqdm
import yt_dlp

# Configure logging
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("transcription.log"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Load Whisper model
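def load_model(model_size="medium", device=None):
    """Load Whisper model with specified size and device."""
    # Fall back to CPU when no GPU runtime is attached instead of failing on "cuda"
    if device is None:
        import torch  # installed as a Whisper dependency
        device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading Whisper {model_size} model on {device}...")
    model = whisper.load_model(model_size, device=device)
    print("Model loaded successfully!")
    return model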

def get_channel_videos(channel_url, max_videos=None):
    """Get all video URLs from a YouTube channel using yt-dlp."""
    ydl_opts = {
        'extract_flat': True,
        'skip_download': True,
        'ignoreerrors': True,
        'quiet': True,
        'no_warnings': True,
    }

    video_urls = []
    print(f"Extracting videos from channel: {channel_url}")

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        try:
            result = ydl.extract_info(channel_url, download=False)

            if result and 'entries' in result:
                for entry in result['entries']:
                    if entry:
                        video_urls.append(f"https://www.youtube.com/watch?v={entry['id']}")
                        if max_videos and len(video_urls) >= max_videos:
                            break

            print(f"Found {len(video_urls)} videos")
        except Exception as e:
            print(f"Error extracting channel videos: {str(e)}")

    return video_urls

def chunk_text(text, segments, max_chunk_size=1000, overlap=100):
    """
    Split transcript into semantic chunks with overlapping text.
    This improves vector retrieval by maintaining context across chunks.
    """
    chunks = []

    # First, create a mapping of start and end times for the transcript
    time_map = []
    for segment in segments:
        segment_text = segment["text"]
        start_time = segment["start"]
        end_time = segment["end"]
        time_map.append({
            "text": segment_text,
            "start": start_time,
            "end": end_time
        })

    # Now chunk the full text with overlap
    full_text = text

    # If the text is short enough, just return it as a single chunk
    if len(full_text) <= max_chunk_size:
        # Find the time range
        start_time = time_map[0]["start"] if time_map else 0
        end_time = time_map[-1]["end"] if time_map else 0

        return [{
            "text": full_text,
            "start_time": start_time,
            "end_time": end_time
        }]

    # Otherwise, split it into overlapping chunks
    start_idx = 0

    while start_idx < len(full_text):
        # Find the end index for this chunk
        end_idx = start_idx + max_chunk_size

        # If we're at the end of the text, just use the rest
        if end_idx >= len(full_text):
            end_idx = len(full_text)
        else:
            # Try to find a good breaking point (period, question mark, etc.)
            # Look for these punctuation marks within the last 20% of the chunk
            breaking_point = end_idx
            search_start = max(start_idx, end_idx - int(max_chunk_size * 0.2))

            # Find the last sentence break in the search range
            last_period = full_text.rfind(". ", search_start, end_idx)
            last_question = full_text.rfind("? ", search_start, end_idx)
            last_exclamation = full_text.rfind("! ", search_start, end_idx)

            # Use the latest of these breaking points
            candidates = [p for p in [last_period, last_question, last_exclamation] if p != -1]
            if candidates:
                breaking_point = max(candidates) + 2  # +2 to include the punctuation and space

            end_idx = breaking_point

        chunk_text = full_text[start_idx:end_idx].strip()

        # Find the time range that corresponds to this chunk
        # This is approximate and could be improved
        start_time = None
        end_time = None

        # Calculate the approximate position of this chunk in the overall text
        chunk_start_ratio = start_idx / len(full_text)
        chunk_end_ratio = end_idx / len(full_text)

        # Find corresponding timestamps based on position ratios
        if time_map:
            # Interpolate to find approximate start and end times
            start_time = time_map[int(chunk_start_ratio * len(time_map))]["start"]
            end_time = time_map[min(int(chunk_end_ratio * len(time_map)), len(time_map) - 1)]["end"]

        chunks.append({
            "text": chunk_text,
            "start_time": start_time if start_time is not None else 0,
            "end_time": end_time if end_time is not None else 0
        })

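        # Stop once the final chunk is emitted; otherwise start_idx would keep
        # resetting to len(full_text) - overlap and the loop would never end
        if end_idx >= len(full_text):
            break

        # Move the start index for the next chunk, accounting for overlap,
        # while always advancing by at least one character
        start_idx = max(end_idx - overlap, start_idx + 1)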

    return chunks

def download_and_transcribe(url, model, output_dir="transcripts", chunk_size=1000, overlap=100):
    """Download and transcribe a single YouTube video with semantic chunking for vector databases."""
    # Create output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Create a yt-dlp options dictionary
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': '%(title)s.%(ext)s',
        'quiet': False,
        'no_warnings': True,
    }

    # Download the video and extract the audio
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=True)
            video_title = info.get('title', 'Unknown Title')
            video_id = info.get('id', 'Unknown ID')
            channel = info.get('channel', 'Unknown Channel')
            upload_date = info.get('upload_date', 'Unknown Date')
            duration = info.get('duration', 0)
            file_path = ydl.prepare_filename(info)

    except Exception as e:
        print(f"Error downloading video: {str(e)}")
        raise

    # Get correct file path after post-processing: FFmpegExtractAudio writes an
    # .mp3 next to the download, whatever the source extension was
    file_path = os.path.splitext(file_path)[0] + '.mp3'

    # Get the duration (librosa >= 0.10 takes `path`; `filename` is deprecated)
    audio_duration = librosa.get_duration(path=file_path)
    print(f"Processing: {video_title}")
    print(f"Video length: {audio_duration:.2f} seconds")

    # Transcribe with word-level timestamps
    start = time.time()
    result = model.transcribe(file_path, word_timestamps=True)
    end = time.time()
    seconds = end - start
    print(f"Transcription time: {seconds:.2f} seconds")

    # Create output file paths
    # Replace characters that are unsafe in file names, including backslashes
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', video_title)
    base_name = os.path.join(output_dir, safe_title)
    json_path = f"{base_name}.json"
    txt_path = f"{base_name}.txt"
    word_timing_path = f"{base_name}_word_timestamps.txt"
    vector_db_path = f"{base_name}_vector_chunks.json"

    # Save raw JSON results (with word timestamps)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2)

    # Save plain text transcript
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(result["text"])

    # Save word-level timestamps in a readable format
    with open(word_timing_path, "w", encoding="utf-8") as f:
        f.write(f"Transcript with word-level timestamps for: {video_title}\n\n")
        for segment in result["segments"]:
            f.write(f"[{format_time(segment['start'])} --> {format_time(segment['end'])}]\n")

            # Write each word with its timestamp
            for word in segment.get("words", []):
                word_text = word["word"]
                word_start = format_time(word["start"])
                word_end = format_time(word["end"])
                f.write(f"{word_text} [{word_start}-{word_end}]\n")

            f.write("\n")

    # Create semantic chunks optimized for vector database
    chunks = chunk_text(result["text"], result["segments"], chunk_size, overlap)

    # Add metadata to chunks for vector database
    vector_chunks = []
    for i, chunk in enumerate(chunks):
        chunk_id = f"{video_id}_{i}"
        vector_chunks.append({
            "id": chunk_id,
            "text": chunk["text"],
            "metadata": {
                "video_id": video_id,
                "video_title": video_title,
                "channel": channel,
                "upload_date": upload_date,
                "start_time": chunk["start_time"],
                "end_time": chunk["end_time"],
                "duration": duration,
                # Build the link from the video ID so it stays valid even when
                # the input URL had no query string (e.g. a youtu.be short link)
                "url": f"https://www.youtube.com/watch?v={video_id}&t={int(chunk['start_time'])}s",
                "chunk_index": i,
                "total_chunks": len(chunks)
            }
        })

    # Save vector database chunks
    with open(vector_db_path, "w", encoding="utf-8") as f:
        json.dump(vector_chunks, f, indent=2)

    # Clean up the audio file
    try:
        os.remove(file_path)
    except Exception as e:
        print(f"Couldn't delete temporary audio file: {str(e)}")

    print(f"Saved transcripts to: {json_path}, {txt_path}, and {word_timing_path}")
    print(f"Saved vector database chunks to: {vector_db_path}")
    return json_path, txt_path, word_timing_path, vector_db_path

def format_time(seconds):
    """Format timestamp as MM:SS.mmm."""
    minutes = int(seconds / 60)
    secs = seconds % 60
    return f"{minutes:02d}:{secs:06.3f}"

def process_video_worker(args):
    """Worker function for parallel processing."""
    url, model, output_dir, chunk_size, overlap = args
    try:
        return download_and_transcribe(url, model, output_dir, chunk_size, overlap)
    except Exception as e:
        print(f"Error processing {url}: {str(e)}")
        return None

def batch_process_parallel(video_urls, model, output_dir="transcripts", max_workers=2, chunk_size=1000, overlap=100):
    """Process multiple videos in parallel."""
    print(f"Processing {len(video_urls)} videos with {max_workers} parallel workers")

    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Create arguments for each worker
        worker_args = [(url, model, output_dir, chunk_size, overlap) for url in video_urls]

        # Process videos in parallel with progress bar. Futures finish in
        # arbitrary order, so remember which URL each one belongs to.
        future_to_url = {
            executor.submit(process_video_worker, arg): arg[0] for arg in worker_args
        }

        for future in tqdm(concurrent.futures.as_completed(future_to_url), total=len(future_to_url)):
            url = future_to_url[future]
            result = future.result()
            if result:
                json_path, txt_path, word_timing_path, vector_db_path = result
                results.append({
                    "url": url,
                    "json_path": json_path,
                    "txt_path": txt_path,
                    "word_timing_path": word_timing_path,
                    "vector_db_path": vector_db_path,
                    "status": "success"
                })
            else:
                results.append({
                    "url": url,
                    "status": "error"
                })

    # Save a summary of all processed videos
    summary_path = os.path.join(output_dir, "batch_summary.json")
    with open(summary_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

    print(f"Completed batch processing. Summary saved to {summary_path}")
    return results

# Run this cell to load the model (do this only once)
model = load_model("medium")  # Options: "tiny", "base", "small", "medium", "large"

# Cell for processing a single video
def process_single_video():
    video_url = input("Enter YouTube video URL: ")
    output_dir = input("Output directory (default: 'transcripts'): ") or "transcripts"

    download_and_transcribe(video_url, model, output_dir)

# Cell for processing a YouTube channel
def process_channel():
    channel_url = input("Enter YouTube channel URL: ")
    max_videos_input = input("Maximum number of videos to process (leave blank for all): ")
    max_videos = int(max_videos_input) if max_videos_input.strip() else None
    output_dir = input("Output directory (default: 'transcripts'): ") or "transcripts"
    max_workers_input = input("Number of parallel workers (default: 2, recommended max: 3 for T4): ") or "2"
    max_workers = int(max_workers_input)

    video_urls = get_channel_videos(channel_url, max_videos)
    batch_process_parallel(video_urls, model, output_dir, max_workers)

# Cell for processing videos from a text file
def process_from_file():
    file_path = input("Enter the path to the text file with URLs (one per line): ")
    output_dir = input("Output directory (default: 'transcripts'): ") or "transcripts"
    max_workers_input = input("Number of parallel workers (default: 2, recommended max: 3 for T4): ") or "2"
    max_workers = int(max_workers_input)

    with open(file_path, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]

    batch_process_parallel(urls, model, output_dir, max_workers)

# Simple menu for Colab
def show_menu():
    print("\nYouTube Batch Transcription with Vector DB Output")
    print("-" * 50)
    print("1. Process a single video")
    print("2. Process a YouTube channel")
    print("3. Process videos from a text file")
    print("4. Exit")

    choice = input("\nEnter your choice (1-4): ")

    if choice == "1":
        process_single_video()
    elif choice == "2":
        process_channel()
    elif choice == "3":
        process_from_file()
    elif choice == "4":
        print("Exiting...")
    else:
        print("Invalid choice. Please try again.")
        show_menu()

# Run this cell to show the menu
show_menu()

Loading Whisper medium model on cuda...


  checkpoint = torch.load(fp, map_location=device)


Model loaded successfully!

YouTube Batch Transcription with Vector DB Output
--------------------------------------------------
1. Process a single video
2. Process a YouTube channel
3. Process videos from a text file
4. Exit

Enter your choice (1-4): 1
Enter YouTube video URL: https://www.youtube.com/watch?v=7tPI3PdJhwQ
Output directory (default: 'transcripts'): 
[youtube] Extracting URL: https://www.youtube.com/watch?v=7tPI3PdJhwQ
[youtube] 7tPI3PdJhwQ: Downloading webpage
[youtube] 7tPI3PdJhwQ: Downloading tv client config
[youtube] 7tPI3PdJhwQ: Downloading player 91201489
[youtube] 7tPI3PdJhwQ: Downloading tv player API JSON
[youtube] 7tPI3PdJhwQ: Downloading ios player API JSON
[youtube] 7tPI3PdJhwQ: Downloading m3u8 information
[info] 7tPI3PdJhwQ: Downloading 1 format(s): 251
[download] Destination: ＊LIVE＊ Comment établir une véritable DISCIPLINE ？.webm
[download] 100% of   65.06MiB in 00:00:14 at 4.55MiB/s   
[ExtractAudio] Destination: ＊LIVE＊ Comment établir une véritable DISC