# <span style="color:orange">YouTube Transcript Summarizer Using BART + NLTK</span>

This notebook extracts a YouTube video's transcript, cleans it, splits it into sentence-aware chunks using NLTK, and summarizes it using the BART model (`facebook/bart-large-cnn`).

# <span style="color:orange">Install Dependencies</span>

In [1]:
!pip install youtube-transcript-api



In [2]:
pip show youtube-transcript-api

Name: youtube-transcript-api
Version: 1.2.3
Summary: This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles and it does not require a headless browser, like other selenium based solutions do!
Home-page: https://github.com/jdepoix/youtube-transcript-api
Author: Jonas Depoix
Author-email: jonas.depoix@web.de
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: defusedxml, requests
Required-by: 


In [3]:
!pip install transformers==4.52.4 torch



In [4]:
!pip install sentencepiece



In [5]:
!pip install nltk



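- `youtube-transcript-api` → fetches YouTube captions  
- `transformers` + `torch` → load and run the BART model  
- `sentencepiece` → tokenizer backend used by some Transformers models; BART's own tokenizer is byte-level BPE, so this install is optional here  
- `nltk` → sentence tokenization (better chunking)  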

# <span style="color:orange">Imports</span>

In [6]:
from urllib.parse import urlparse, parse_qs
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import BartTokenizer, BartForConditionalGeneration
import nltk
from nltk.tokenize import sent_tokenize
import torch
import re

- URL parsing for video ID  
- YouTube transcript extraction  
- BART tokenizer + model  
- NLTK for sentence tokenization  
- Regex for cleaning

# <span style="color:orange">Download NLTK Model</span>

In [17]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

We download NLTK's `punkt_tab` data, which `sent_tokenize` uses to split text into sentences before chunking.

# <span style="color:orange">Extracting the Video ID</span>

In [8]:
def extract_video_id(url: str) -> str | None:
    """
    Extract the video ID from a standard (youtube.com/watch?v=...) or
    shortened (youtu.be/...) YouTube URL. Returns None for anything else.
    """
    parsed = urlparse(url)

    # Standard YouTube link
    if parsed.hostname in ["www.youtube.com", "youtube.com"]:
        return parse_qs(parsed.query).get("v", [None])[0]

    # Shortened youtu.be link
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")

    return None

### 🔍 Extract Video ID
Accepts both:
- `https://youtube.com/watch?v=ID`
- `https://youtu.be/ID`

Returns only the video ID.
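
Other link shapes (mobile, embed, Shorts) fall through to `None` with the function above. The following is a minimal sketch of how more formats could be covered. The name `extract_video_id_extended` and the `/embed/` and `/shorts/` path handling are assumptions added for illustration, not part of the notebook.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


def extract_video_id_extended(url: str) -> Optional[str]:
    """Hypothetical variant of extract_video_id covering a few more URL shapes."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.strip("/").split("/")[0] or None

    if host in {"youtube.com", "www.youtube.com", "m.youtube.com"}:
        # Standard watch links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        # Path-based links (assumed shapes): /embed/<id> and /shorts/<id>
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in {"embed", "shorts"}:
            return parts[1]

    # Anything else is treated as "not a YouTube video link"
    return None


print(extract_video_id_extended("https://www.youtube.com/watch?v=IzQ2siryQrM"))
print(extract_video_id_extended("https://youtu.be/IzQ2siryQrM?t=10"))
print(extract_video_id_extended("https://example.com/watch?v=IzQ2siryQrM"))
```

The first two calls print the same ID and the third prints `None`, since the host is not a YouTube domain.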

# <span style="color:orange">Fetch the YouTube Transcript</span>

In [9]:
def fetch_transcript(video_id: str) -> str | None:
    """
    Fetch the transcript as a single string using the instance-based API
    (youtube-transcript-api 1.x). Returns None if fetching fails.
    """
    try:
        ytt_api = YouTubeTranscriptApi()  # create instance
        fetched = ytt_api.fetch(video_id)  # fetch transcript
        raw = fetched.to_raw_data()  # convert to list of dicts
        text = " ".join([entry["text"] for entry in raw])
        return text
    except Exception as e:
        print(f"Error fetching transcript: {e}")
        return None

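### 📜 Fetch Transcript
`YouTubeTranscriptApi().fetch(video_id)` returns the caption snippets, `to_raw_data()` converts them to a list of dicts, and we join their `text` fields into a single string. If anything fails, the function prints the error and returns `None`.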
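
To see the merging step without a network call, here is a small sketch that joins caption entries by hand. The helper name `join_transcript_entries` and the sample entries are made up for illustration; the dict layout (`text`, `start`, `duration`) is assumed to match what `to_raw_data()` returns.

```python
def join_transcript_entries(entries: list[dict]) -> str:
    """Merge caption snippets (dicts with a "text" key) into one string."""
    return " ".join(entry["text"].strip() for entry in entries if entry.get("text"))


# Made-up sample in the assumed list-of-dicts shape
sample = [
    {"text": "Adults need sleep.", "start": 0.0, "duration": 2.5},
    {"text": "[Music]", "start": 2.5, "duration": 1.0},
    {"text": " It helps memory. ", "start": 3.5, "duration": 2.0},
]

print(join_transcript_entries(sample))
```

Bracketed tags such as `[Music]` survive this step on purpose; they are removed later by the cleaning function.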


# <span style="color:orange">Clean Transcript Text</span>

In [10]:
def clean_text(text: str) -> str:
    """
    Clean transcript: remove extra spaces, newlines, and bracketed tags.
    """
    text = re.sub(r'\[[^\]]*\]', ' ', text)  # remove tags like [Music] first
    text = re.sub(r'\s+', ' ', text)         # then collapse spaces and newlines
    return text.strip()

### 🧹 Cleaning Transcript
Removes:
- Extra spaces  
- Tags like `[Music]`, `[Applause]`  
- Newline artifacts  

Creates a clean, smooth transcript for summarization.
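
As a quick check of the cleaning steps, the sketch below runs them on a made-up string. It removes bracketed tags before collapsing whitespace, so that a removed tag never leaves a double space behind; the sample text is invented for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Remove bracketed tags, then collapse all whitespace runs to single spaces."""
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop tags such as [Music]
    text = re.sub(r"\s+", " ", text)         # collapse spaces and newlines
    return text.strip()


raw = "Sleep matters.\n[Music]  It helps\nmemory [Applause] and mood."
print(clean_text(raw))
# Sleep matters. It helps memory and mood.
```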


# <span style="color:orange">Sentence-Aware Chunking (NLTK)</span>


In [11]:
def chunk_text(text, max_words=900):
    """
    Split transcript into chunks using NLTK sentence tokenization.
    Prevents mid-sentence cuts and improves summary coherence.
    """
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_count = 0

    for sentence in sentences:
        words = sentence.split()
        count = len(words)

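        # Start a new chunk only if the current one already has sentences,
        # so an overlong first sentence cannot produce an empty chunk.
        if current_chunk and current_count + count > max_words: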
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_count = 0

        current_chunk.append(sentence)
        current_count += count

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

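### ✂️ Sentence-Aware Chunking (NLTK)
This version uses **NLTK sentence tokenization** so chunk boundaries always fall between sentences. Note that `max_words` counts whitespace-separated words, not tokens: a 900-word chunk can exceed BART's 1024-token input limit, in which case the tokenizer truncates the end of that chunk.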

Benefits:
- Better coherence  
- Higher BART summary quality  
- Less model confusion  
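
The greedy packing logic can be tried on its own, without NLTK, by passing in sentences that are already split. This is a sketch under assumptions: `chunk_sentences` and its `count_units` hook are hypothetical names, and the default counter uses whitespace-separated words just like `chunk_text`.

```python
from typing import Callable, List


def chunk_sentences(
    sentences: List[str],
    max_units: int,
    count_units: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_units each.

    count_units is a hypothetical hook: it counts words by default, but a
    tokenizer-based counter could be passed instead.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        size = count_units(sentence)
        # Close the current chunk only if it already holds something,
        # so an oversized sentence never yields an empty chunk.
        if current and used + size > max_units:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += size
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
print(chunk_sentences(sentences, max_units=5))
# ['One two three. Four five.', 'Six seven eight nine. Ten.']
```

To budget by tokens rather than words, one option would be to pass something like `count_units=lambda s: len(tokenizer.encode(s, add_special_tokens=False))` with a limit somewhat below 1024, so that chunks are less likely to be truncated by the tokenizer.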

# <span style="color:orange">Load BART Model</span>

[Model link](https://huggingface.co/facebook/bart-large-cnn)

In [12]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")


### 🤖 Load BART Model
Loads `facebook/bart-large-cnn`, a BART checkpoint fine-tuned for summarization on the CNN/DailyMail news dataset.

# <span style="color:orange">Summarize One Chunk</span>

In [13]:
# from transformers import pipeline, AutoTokenizer

In [14]:
def summarize_chunk(text: str) -> str:
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)

    summary_ids = model.generate(
        inputs,
        max_length=200,
        min_length=60,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

### 📝 Summarizing One Chunk
Uses BART with beam search for a high-quality summary:
- Beam search (4 beams)  
- Minimum length enforced  
- Early stopping enabled  

This function:

1. Tokenizes the chunk of text

2. Truncates it to fit the model's maximum sequence length

3. Generates a summary using BART

4. Decodes the summary tokens back into readable text

1. **Tokenizer Parameters (`tokenizer.encode`)**

`text`

- The text chunk to summarize.

`return_tensors="pt"`

- Returns the tokenized output as PyTorch tensors.

- Required because the model expects tensor input.

`max_length=1024`

- BART-large has a maximum input length of 1024 tokens.

- This prevents overflow errors.

- If the chunk is longer, it will be truncated.

`truncation=True`

- Ensures that any text exceeding 1024 tokens is safely cut off.

- Prevents crashes and overflows.

2. **Model Generation Parameters (`model.generate`)**

`max_length=200`

- The maximum number of tokens allowed in the generated summary.

- Higher value → longer summary

- Lower value → shorter summary

`min_length=60`

- The minimum number of tokens the model must generate.

- Prevents overly short or meaningless summaries.

`num_beams=4`

- Enables beam search instead of greedy decoding.

- At each step the model keeps the 4 highest-scoring partial summaries (beams) rather than a single one.

- Higher values improve quality but increase compute time.

- Common values: 3-6

- 4 is a good balance between quality and speed.

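`length_penalty=2.0`

- Exponent applied to the sequence length when beam scores are normalized: the summed log-probability is divided by `length ** length_penalty`.

- Because log-probabilities are negative, values > 0 favor longer sequences and values < 0 favor shorter ones.

Why 2.0?
A positive exponent keeps longer candidates from losing simply because they accumulate more negative log-probability, so beam search is less biased toward very short summaries.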

`early_stopping=True`

- Stops beam search as soon as `num_beams` complete candidate summaries have been found, instead of continuing to search for better ones.

- Makes generation faster and more predictable.
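
To make the effect of `length_penalty` concrete, the sketch below scores two made-up finished beams using the normalization `sum_logprobs / length ** length_penalty`. The log-probabilities and lengths are invented for illustration; only the formula reflects how beam scores are length-normalized.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: total log-probability / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: (total log-probability, length in tokens)
short_hyp = (-6.0, 10)
long_hyp = (-9.0, 20)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f} -> {winner}")
```

With no normalization (`0.0`) the shorter hypothesis wins because it has the higher raw log-probability; at `1.0` and `2.0` the longer one scores higher, which is why larger positive values favor longer outputs.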

3. **Decoding Parameters (`tokenizer.decode`)**

`summary_ids[0]`

- The generated sequence of token IDs.

`skip_special_tokens=True`

- Removes tokens like `<s>`, `</s>`, `<pad>`, `<unk>`.

- Ensures the output is clean, readable text.

# <span style="color:orange">Full Pipeline Function</span>


In [15]:
def summarize_youtube_video(url: str):
    video_id = extract_video_id(url)

    if not video_id:
        return "❌ Invalid YouTube URL."

    text = fetch_transcript(video_id)
    if not text:
        return "❌ Transcript unavailable."

    cleaned = clean_text(text)
    chunks = chunk_text(cleaned)

    print(f"Total words: {len(cleaned.split())}")
    print(f"Total chunks: {len(chunks)}\n")

    summaries = []
    for i, chunk in enumerate(chunks, 1):
        print(f"Summarizing chunk {i}/{len(chunks)}...")
        summaries.append(summarize_chunk(chunk))

    final_summary = "\n\n".join(summaries)
    return final_summary

### 🚀 Full Summarization Pipeline
1. Extract video ID  
2. Download transcript  
3. Clean the text  
4. Sentence-based chunking (NLTK)  
5. Summarize each chunk individually  
6. Combine all summaries into a final result  
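
Steps 5 and 6 can be exercised without loading BART by injecting the summarizer as a function. This is only a sketch: `summarize_chunks` and the stand-in `first_sentence` summarizer are hypothetical helpers, useful mainly for testing the plumbing offline.

```python
from typing import Callable, Iterable


def summarize_chunks(chunks: Iterable[str], summarize_fn: Callable[[str], str]) -> str:
    """Summarize each non-empty chunk independently and join with blank lines."""
    summaries = [summarize_fn(chunk) for chunk in chunks if chunk.strip()]
    return "\n\n".join(summaries)


def first_sentence(chunk: str) -> str:
    """Stand-in summarizer (hypothetical): keep only the first sentence."""
    return chunk.split(". ")[0].rstrip(".") + "."


chunks = [
    "Sleep aids memory. It also lifts mood.",
    "Aim for seven hours. Keep a routine.",
]
print(summarize_chunks(chunks, first_sentence))
```

In the notebook, the same call would take the real function: `summarize_chunks(chunks, summarize_chunk)`.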

# <span style="color:orange">Execute Summarizer</span>


Video link: https://www.youtube.com/watch?v=IzQ2siryQrM

In [20]:
video_url = input("Enter YouTube video URL: ")
summary = summarize_youtube_video(video_url)

print("\n===== FINAL SUMMARY =====\n")
print(summary)

Enter YouTube video URL: https://www.youtube.com/watch?v=IzQ2siryQrM
Total words: 261
Total chunks: 1

Summarizing chunk 1/1...

===== FINAL SUMMARY =====

Adults typically need seven to nine hours of sleep for maximum brain performance. Too little sleep negatively affects your ability to remember and concentrate. It can also make you moodier and more irritable and increase the risk of anxiety and depression. To ensure you're getting enough sleep, practice good sleep hygiene.


### ▶️ Run the Summarizer
Paste any YouTube link to generate a coherent summary powered by BART and NLTK.




---






## Why Not Use `pipeline` as in the Intern and the Older Video?

You can use:
```
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
```

but it comes with trade-offs:

👍 Pros

- Very simple

- One-line model loader

- Clean syntax

👎 Cons

- Can be slower for repeated summarization because of the extra wrapper layer

- Less direct control over model behavior

- More internal “hidden rules” (defaults applied silently)

- Easier to get inconsistent summaries with long text chunks

For a multi-chunk, long-document summarizer, the direct model method is the better engineering choice.

### Summary Table

| Feature                   | `BartForConditionalGeneration` | `pipeline("summarization")` |
| ------------------------- | ------------------------------ | --------------------------- |
| Speed (multiple calls)    | ⭐ Faster                       | ⚠️ Slower                   |
| Control over parameters   | ⭐ Full control                 | ⚠️ Limited                  |
| Good for long transcripts | ⭐ Yes                          | ⚠️ Not ideal                |
| Hidden defaults           | Fewer (set explicitly)         | Many                        |
| Best for this project     | ✅ Yes                          | Optional                    |


The `pipeline("summarization")` approach hides many generation settings behind defaults.
For example:

- `max_length`

- `min_length`

- `num_beams`

- `length_penalty`

- `early_stopping`

When summarizing long transcripts in multiple chunks, having full control over these settings gives better quality summaries.

Using:
```
model.generate(...)
```

ensures the model follows the exact parameters we specify.
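
One way to keep those parameters explicit and consistent across chunks is to store them in a single dictionary. The sketch below assumes the `model` and `tokenizer` objects loaded earlier in the notebook; the names `GENERATION_KWARGS` and `summarize_with` are hypothetical and introduced only for illustration.

```python
# Generation settings used in this notebook, kept in one place.
GENERATION_KWARGS = {
    "max_length": 200,
    "min_length": 60,
    "num_beams": 4,
    "length_penalty": 2.0,
    "early_stopping": True,
}


def summarize_with(model, tokenizer, text: str, **overrides) -> str:
    """Summarize text with explicit settings; overrides replace individual values."""
    settings = {**GENERATION_KWARGS, **overrides}
    # Truncate the input to BART's 1024-token limit, as in summarize_chunk
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs, **settings)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

A call such as `summarize_with(model, tokenizer, chunk, num_beams=2)` would then change a single setting for one experiment while leaving the rest untouched.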