<a href="https://colab.research.google.com/github/Amartya1911/YTSummariser/blob/main/YTSummariser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [26]:
!pip install -U youtube-transcript-api transformers accelerate sentencepiece



In [27]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
import re

def extract_video_id(url):
    """Return the 11-character video ID from a watch or youtu.be URL, or None if no ID is found."""
    # Match the 11-character ID that follows 'v=' or 'youtu.be/'
    match = re.search(r"(?:v=|youtu\.be/)([a-zA-Z0-9_-]{11})", url)
    return match.group(1) if match else None

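def get_transcript(video_id):
    """Fetch the transcript as one string; on failure, return a message starting with 'Error:'."""
    try:
        # Instance-based API: create a client, then fetch by video ID
        api = YouTubeTranscriptApi()
        # fetch() returns the transcript as a sequence of snippets, each with a .text field
        transcript = api.fetch(video_id)
        # Join the snippet texts into a single space-separated string
        return " ".join(t.text for t in transcript)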

    except TranscriptsDisabled:
        return "Error: Transcripts are disabled for this video."
    except NoTranscriptFound:
        return "Error: No transcript found for this video."
    except Exception as e:
        return f"Error: {str(e)}"
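
The regex in `extract_video_id` only recognizes `v=` and `youtu.be/` links, so Shorts and embed URLs return `None`. The sketch below is one possible extension, not part of the original notebook: the name `extract_video_id_extended` and the extra `/shorts/` and `/embed/` patterns are assumptions added for illustration.

```python
import re

# Hypothetical variant of extract_video_id (not in the original notebook).
# Assumption: Shorts and embed links carry the same 11-character ID.
_ID_PATTERN = re.compile(
    r"(?:[?&]v=|youtu\.be/|/shorts/|/embed/)([a-zA-Z0-9_-]{11})"
)

def extract_video_id_extended(url):
    """Return the 11-character video ID, or None if the URL has no known form."""
    match = _ID_PATTERN.search(url)
    return match.group(1) if match else None

for sample in [
    "https://www.youtube.com/watch?v=GFuDk8PB1KU&t=10s",
    "https://youtu.be/GFuDk8PB1KU?si=oGwTjfLP3IMZV-K6",
    "https://www.youtube.com/shorts/GFuDk8PB1KU",
    "https://www.youtube.com/embed/GFuDk8PB1KU",
    "https://example.com/",
]:
    print(sample, "->", extract_video_id_extended(sample))
```

Requiring `?` or `&` before `v=` keeps the pattern from matching other query parameters that merely end in `v`.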

In [28]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Check if we have a GPU (CUDA) available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Use BART-Large-CNN, a BART checkpoint fine-tuned for summarization on CNN/DailyMail
model_name = "facebook/bart-large-cnn"

print(f"Loading {model_name} on {device}... this might take a minute.")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print("Model loaded successfully!")

Loading facebook/bart-large-cnn on cuda... this might take a minute.



Model loaded successfully!


In [29]:
def summarize_chunk(text_chunk, max_len=150, min_len=40):
    """Summarize one chunk of text; max_len and min_len are token counts."""
    # Tokenize the chunk; input beyond 1024 tokens is truncated
    inputs = tokenizer(
        text_chunk,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)

    # Generate the summary within the caller-supplied length limits
    summary_ids = model.generate(
        **inputs,
        max_length=max_len,     # Upper limit on summary length (tokens)
        min_length=min_len,     # Lower limit on summary length (tokens)
        length_penalty=2.0,     # Positive values favor longer beam candidates
        num_beams=4,            # Beam search over 4 candidate sequences
        no_repeat_ngram_size=3  # Never repeat the same 3-token sequence
    )

    # Decode back to text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
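
Because `truncation=True` silently drops everything past `max_length=1024` tokens, it can be useful to check how long a chunk is before summarizing it. The sketch below is a minimal illustration under stated assumptions: `truncation_report` and `fake_encode` are hypothetical names, and the stand-in encoder counts one token per word. In the notebook one would presumably pass `tokenizer.encode` instead.

```python
from typing import Callable, Sequence

def truncation_report(text: str,
                      encode: Callable[[str], Sequence[int]],
                      limit: int = 1024) -> dict:
    """Report how many tokens `text` encodes to and how many exceed `limit`."""
    n_tokens = len(encode(text))
    return {
        "tokens": n_tokens,
        "limit": limit,
        "dropped": max(0, n_tokens - limit),  # tokens that truncation would discard
    }

def fake_encode(text: str) -> list:
    """Stand-in encoder for illustration only: one 'token' per whitespace-separated word."""
    return list(range(len(text.split())))

report = truncation_report("word " * 1500, fake_encode)
print(report)  # {'tokens': 1500, 'limit': 1024, 'dropped': 476}
```

A real subword tokenizer typically produces more tokens than words, so the word-based count here understates the true length.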

In [30]:
def chunk_text(text, chunk_words=600, overlap=50):
    """Split text into chunks of up to chunk_words words; each chunk repeats the last `overlap` words of the previous one for context."""
    # Guard against a non-positive step, which would loop forever
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")

    words = text.split()
    chunks = []

    if len(words) == 0:
        return chunks

    i = 0
    while i < len(words):
        # Take the next chunk_words words (fewer at the end of the text)
        chunk = " ".join(words[i:i + chunk_words])
        chunks.append(chunk)

        # Advance by less than a full chunk so consecutive chunks share `overlap` words
        i += (chunk_words - overlap)

    return chunks
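
The window arithmetic in `chunk_text` is easier to see with small numbers. The sketch below is a standalone illustration, not the notebook's function: `sliding_windows` is a hypothetical helper that follows the same stride rule (`size - overlap`) on a list of ten placeholder words.

```python
def sliding_windows(words, size, overlap):
    """Return overlapping slices of `words`, advancing by (size - overlap) each time."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")
    stride = size - overlap
    return [words[i:i + size] for i in range(0, len(words), stride)]

words = [f"w{i}" for i in range(10)]
for window in sliding_windows(words, size=4, overlap=1):
    print(window)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
# ['w9']
```

Each window starts with the last word of the previous one. The final window can be very short and may consist only of words the previous window already covered, as `['w9']` does here.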

In [31]:
def generate_video_notes(video_url):
    print(f"\n🎬 Processing video: {video_url}")

    video_id = extract_video_id(video_url)
    if not video_id:
        print("Invalid YouTube URL.")
        return

    print("🎧 Fetching transcript...")
    transcript = get_transcript(video_id)

    # get_transcript signals failure with a message starting with "Error:"
    if transcript.startswith("Error:"):
        print(transcript)
        return

    # Split into 600-word chunks with a 50-word overlap (the chunk_text defaults)
    print("🔪 Chunking transcript...")
    chunks = chunk_text(transcript, chunk_words=600, overlap=50)
    print(f"   -> {len(chunks)} chunks created.")

    print("🧠 Generating initial notes (Map)...")
    initial_notes = []

    # Step 1: Summarize the individual chunks (keep these short and punchy)
    for i, chunk in enumerate(chunks):
        print(f"   Summarizing chunk {i+1}/{len(chunks)}...")
        summary = summarize_chunk(chunk, max_len=100, min_len=30)
        initial_notes.append(summary)

    print("✨ Generating Master Summary (Reduce)...")

    # Step 2: Combine the mini-summaries
    combined_notes = " ".join(initial_notes)

    # Step 3: Chunk the combined notes into medium-sized sections.
    # By chunking at 300 words here, we force the model to write a detailed
    # paragraph for every ~300 words of initial notes, giving us a longer total output.
    final_chunks = chunk_text(combined_notes, chunk_words=300, overlap=0)

    final_summary = []
    for i, chunk in enumerate(final_chunks):
        print(f"   Writing final section {i+1}/{len(final_chunks)}...")
        # Ask for a longer, detailed summary for the final output
        final_summary.append(summarize_chunk(chunk, max_len=250, min_len=80))

    print("\n" + "="*50)
    print("📝 AI GENERATED MASTER SUMMARY")
    print("="*50)

    # Print out nicely formatted paragraphs
    for i, paragraph in enumerate(final_summary):
        print(f"\nPart {i+1}:\n{paragraph}")


if __name__ == "__main__":
    url = input("Paste YouTube URL: ")
    generate_video_notes(url)

Paste YouTube URL: https://youtu.be/GFuDk8PB1KU?si=oGwTjfLP3IMZV-K6

🎬 Processing video: https://youtu.be/GFuDk8PB1KU?si=oGwTjfLP3IMZV-K6
🎧 Fetching transcript...
🔪 Chunking transcript...
   -> 3 chunks created.
🧠 Generating initial notes (Map)...
   Summarizing chunk 1/3...
   Summarizing chunk 2/3...
   Summarizing chunk 3/3...
✨ Generating Master Summary (Reduce)...
   Writing final section 1/1...

📝 AI GENERATED MASTER SUMMARY

Part 1:
Robert Moldun, game warden, easel nubla. park, documented the feeding process for velociaptor anteropus. The big one and her two remaining minions were tranquilized so that our vest could enter the enclosure and inspect the raptor's injuries. She sustained thirdderee burns to all four limbs and lacerations to her mouth before being hurled 40 ft away from the electric fence.
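
The Map and Reduce stages in `generate_video_notes` can be tried without loading the model by passing in a stub summarizer. The sketch below is a simplified illustration under stated assumptions: `split_words`, `map_reduce_summary`, and `first_five_words` are hypothetical names, the splitting uses no overlap, and the stub simply keeps the first five words of each chunk.

```python
from typing import Callable, List

def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def map_reduce_summary(text: str,
                       summarize: Callable[[str], str],
                       map_size: int = 600,
                       reduce_size: int = 300) -> List[str]:
    """Summarize each chunk (Map), then summarize the joined notes in sections (Reduce)."""
    # Map: one short note per chunk of the source text
    notes = [summarize(chunk) for chunk in split_words(text, map_size)]
    # Reduce: join the notes and summarize them again, section by section
    combined = " ".join(notes)
    return [summarize(section) for section in split_words(combined, reduce_size)]

def first_five_words(chunk: str) -> str:
    """Stub summarizer for illustration only: keep the first five words."""
    return " ".join(chunk.split()[:5])

text = " ".join(f"w{i}" for i in range(40))
sections = map_reduce_summary(text, first_five_words, map_size=10, reduce_size=10)
print(sections)  # ['w0 w1 w2 w3 w4', 'w20 w21 w22 w23 w24']
```

Swapping the stub for a function like `summarize_chunk` gives the same two-stage flow, with the number of final sections determined by how many words the Map stage produces.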
