### Build a Podcast Summarizer Using Python

- Download the podcast audio from a video URL
- Convert the audio and save it as MP3
- Transcribe the audio to text
- Summarize the text

### Installation

In [1]:
# !pip install yt_dlp pydub transformers datasets torch
# !pip install python-dotenv

### Importing Packages

In [2]:
import yt_dlp
from pydub import AudioSegment
from transformers import pipeline


### Loading environment variables into the runtime

In [3]:
from dotenv import load_dotenv
import os

load_dotenv()

True

### Connecting to Hugging Face

In [4]:
from huggingface_hub import login

# If running in a notebook environment
# from huggingface_hub import notebook_login
# notebook_login()

your_token = os.getenv("HUGGINGFACE_API_KEY")  # Get your token from environment variables
login(token=your_token)

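With the cell above, a missing or empty `HUGGINGFACE_API_KEY` only surfaces when `login` is called. The following is a minimal sketch of failing earlier with a clearer message; `require_env` is a hypothetical helper, not part of `huggingface_hub` or `python-dotenv`.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    `require_env` is an illustrative helper; `env` defaults to the process
    environment and can be replaced by a plain dict for testing.
    """
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Possible usage in the notebook, after load_dotenv():
# login(token=require_env("HUGGINGFACE_API_KEY"))
```
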
### Podcast URL

In [None]:
url = "https://www.youtube.com/watch?v=9kB14zbtLOk"  # Hitler's first big mistake in World War II | Norman Ohler and Lex Fridman


### Downloading the Podcast

In [6]:
opts = {"format": "bestaudio/best", "outtmpl": "podcast.%(ext)s"}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download([url])

[youtube] Extracting URL: https://www.youtube.com/watch?v=9kB14zbtLOk
[youtube] 9kB14zbtLOk: Downloading webpage
[youtube] 9kB14zbtLOk: Downloading tv client config
[youtube] 9kB14zbtLOk: Downloading tv player API JSON
[youtube] 9kB14zbtLOk: Downloading web safari player API JSON
[youtube] 9kB14zbtLOk: Downloading m3u8 information
[info] 9kB14zbtLOk: Downloading 1 format(s): 251
[download] podcast.webm has already been downloaded
[download] 100% of    8.55MiB

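As an alternative to converting the file in a separate step, yt-dlp can hand the download to FFmpeg through its `FFmpegExtractAudio` postprocessor. The sketch below only builds the options dictionary; `build_ydl_opts` is a hypothetical helper name, the `192` quality value is an arbitrary choice, and FFmpeg is assumed to be installed (pydub needs it as well).

```python
def build_ydl_opts(basename: str = "podcast", codec: str = "mp3") -> dict:
    """Build yt-dlp options that download the best audio stream and
    convert it to `codec` with FFmpeg after the download finishes."""
    return {
        "format": "bestaudio/best",
        "outtmpl": f"{basename}.%(ext)s",
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",  # yt-dlp's audio extraction postprocessor
                "preferredcodec": codec,
                "preferredquality": "192",    # assumed bitrate in kbit/s
            }
        ],
    }


opts = build_ydl_opts()

# Possible usage in the notebook:
# with yt_dlp.YoutubeDL(opts) as ydl:
#     ydl.download([url])
```

With these options the result should be written as `podcast.mp3`, so the pydub conversion step would no longer be needed.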

### Extracting Audio Segment from Podcast

In [7]:
# Step 2: Convert to MP3
audio = AudioSegment.from_file("podcast.webm")
audio.export("podcast.mp3", format="mp3")

<_io.BufferedRandom name='podcast.mp3'>

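The conversion cell hardcodes `podcast.webm`, but the `outtmpl` value `podcast.%(ext)s` means the extension depends on the format yt-dlp selects. A minimal sketch of locating the downloaded file regardless of extension follows; `find_download` and its `skip` list are assumptions for illustration.

```python
from pathlib import Path


def find_download(basename: str = "podcast", directory: str = ".",
                  skip=(".mp3", ".part", ".ytdl")) -> Path:
    """Return the first file named `<basename>.<ext>` in `directory`,
    ignoring the converted MP3 and yt-dlp's temporary files."""
    candidates = sorted(
        p for p in Path(directory).glob(f"{basename}.*")
        if p.is_file() and p.suffix.lower() not in skip
    )
    if not candidates:
        raise FileNotFoundError(f"no downloaded file matching {basename}.* in {directory}")
    return candidates[0]


# Possible usage in the notebook:
# audio = AudioSegment.from_file(str(find_download()))
# audio.export("podcast.mp3", format="mp3")
```
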
### Transcribing text from Podcast

In [8]:
# Step 3: Transcribe with open-source model
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small", return_timestamps=True)
transcript = asr("podcast.mp3")["text"]

Device set to use cuda:0
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


In [None]:
# Delete the ASR pipeline so the memory held by the Whisper model can be reclaimed
del asr

In [10]:
transcript

" I see most of the officers of the Wehrmacht as not necessarily Nazis in the way that they would shout Heil Hitler all the time. They were highly intelligent, highly trained, super professionals that ran a very effective war machine. And at one point more and more of these generals realized that the orders that Hitler were were given were not really helping, you know, and they have their men dying because of it. So that creates a lot of tension. And that that led to the mistake that Hitler did in Dunkirk, basically. What Churchill called the sickle cut, which was the idea to storm through the Anden mountains and kind of cut off the British and French troops who were still in the north of Belgium trying to figure out what was going on. the Germans are behind them so that they kind of cut as a sickle into enemy territory, the sickle cut. That was so successful that basically the campaign was won already. So then the Germans invaded, like occupied all the cities on the canal back to Engl

### Clearing GPU memory

In [11]:
import gc
import torch
# Release PyTorch's cached GPU memory, then run Python garbage collection
torch.cuda.empty_cache()
gc.collect()


210

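A variation on the cleanup cell, offered as a sketch rather than a requirement: running the garbage collector first lets `torch.cuda.empty_cache()` also release blocks held by objects that were only freed during collection, and the CUDA call is skipped when no GPU is present. `free_memory` is a hypothetical helper name, and the lazy `torch` import is there only so the snippet runs on machines without PyTorch.

```python
import gc


def free_memory() -> int:
    """Run garbage collection, then release PyTorch's cached GPU memory
    if PyTorch and a CUDA device are available.

    Returns the number of unreachable objects found by the collector.
    """
    collected = gc.collect()
    try:
        import torch  # imported lazily; the notebook already has it loaded
    except ImportError:
        return collected
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


# Possible usage in the notebook, after `del asr`:
# free_memory()
```
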
### Summarize the transcribed text

In [13]:
# # Step 4: Summarize with open-source model
# summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device='cpu')
# summary = summarizer(transcript, max_length=150, min_length=40, do_sample=False)

# print(summary[0]["summary_text"])

In [14]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
import math

# --- Configuration for DistilBART ---
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
MAX_INPUT_LENGTH = 1024  # Max tokens DistilBART can handle
OVERLAP = 200            # Overlap tokens between chunks
DEVICE = "cpu"           # or: "cuda" if torch.cuda.is_available() else "cpu"

# Half precision (FP16) is highly recommended on a GPU with 8GB VRAM;
# on CPU, full precision (FP32) is the safer choice
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32

# Load tokenizer and model
# Note: DistilBART uses the standard BART tokenizer
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)

model = BartForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    torch_dtype=DTYPE,
).to(DEVICE)


def chunk_and_summarize(text: str, model, tokenizer, max_len: int = MAX_INPUT_LENGTH, overlap: int = OVERLAP) -> str:
    """
    Splits text into overlapping chunks, summarizes each chunk, and returns
    the combined summaries as a single string (the 'Meta-Document').
    """
    
    # 1. Tokenize the input text
    input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to(DEVICE)
    
    # Calculate the step size and number of chunks
    input_size = input_ids.size(1)
    step = max_len - overlap
    
    # Determine the number of chunks needed (at least one, even when the
    # input is shorter than the overlap)
    num_chunks = max(1, math.ceil((input_size - overlap) / step))

    chunk_summaries = []

    print(f"Transcript size: {input_size} tokens. Creating {num_chunks} chunks.")

    # 2. Iterate through chunks and summarize (MAP phase)
    for i in range(num_chunks):
        start = i * step
        end = min(start + max_len, input_size)
        
        chunk_input_ids = input_ids[:, start:end]
        
        # Perform generation (summarization) for the chunk
        summary_ids = model.generate(
            chunk_input_ids,
            num_beams=4,
            max_length=200, # Max length for individual chunk summaries
            min_length=50,
            early_stopping=True
        )
        
        summary_text = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
        chunk_summaries.append(summary_text)

        if end >= input_size:
            break

    # 3. Combine chunk summaries
    combined_summaries = " ".join(chunk_summaries)
    
    return combined_summaries

def final_summarize(text: str, model, tokenizer) -> str:
    """
    Performs the final summarization on the combined chunk summaries (REDUCE phase).
    """
    print("\n--- Performing Final Reduction Summary ---")
    
    # Encode the combined summaries (Truncate if the combined text is still too long)
    input_ids = tokenizer.encode(text, return_tensors="pt", max_length=MAX_INPUT_LENGTH, truncation=True).to(DEVICE)
    
    # Generate the final summary
    final_summary_ids = model.generate(
        input_ids,
        num_beams=4,
        max_length=500, # Max length for the final podcast summary
        min_length=100,
        length_penalty=2.0,
        early_stopping=True
    )
    
    final_summary = tokenizer.decode(final_summary_ids.squeeze(), skip_special_tokens=True)
    return final_summary


# --- Example Execution ---
# Use the Whisper transcript produced in the transcription step
podcast_transcript = transcript

# 1. Map: Chunk and Summarize
meta_document = chunk_and_summarize(podcast_transcript, model, tokenizer)

# 2. Reduce: Final Summary
final_summary_text = final_summarize(meta_document, model, tokenizer)

print("\n--- Final Podcast Summary ---")
print(final_summary_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (1761 > 1024). Running this sequence through the model will result in indexing errors


Transcript size: 1761 tokens. Creating 2 chunks.

--- Performing Final Reduction Summary ---

--- Final Podcast Summary ---
 I see most of the officers of the Wehrmacht as not necessarily Nazis in the way that they would shout Heil Hitler all the time . They were highly intelligent, highly trained, super professionals that ran a very effective war machine . But at one point more and more of these generals realized that the orders that Hitler were given were not really helping, you know, and they have their men dying because of it . The movie Dunkirk by Christopher Nolan is not good because he doesn't describe what happened on the German side .

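The chunk boundaries used in the map phase can be checked without loading a model. The sketch below reproduces only the index arithmetic of `chunk_and_summarize` in plain Python; `chunk_spans` is a hypothetical helper, and the defaults mirror `MAX_INPUT_LENGTH` and `OVERLAP` from the configuration cell.

```python
def chunk_spans(n_tokens: int, max_len: int = 1024, overlap: int = 200) -> list:
    """Return (start, end) token index pairs for overlapping chunks.

    Each chunk holds at most `max_len` tokens and starts `max_len - overlap`
    tokens after the previous one, so neighbours share `overlap` tokens.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if n_tokens <= 0:
        return []
    step = max_len - overlap
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return spans


# The 1761-token transcript from the run above yields two chunks
print(chunk_spans(1761))  # [(0, 1024), (824, 1761)]
```

The second chunk starts at token 824 (1024 minus the 200-token overlap), which matches the "Creating 2 chunks" message in the output.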