# Podcast Transcript Summarizer and Recommendation Engine

This notebook covers:
1. Fetching podcast transcripts.
2. Summarizing transcripts using an LLM.
3. Keyword search within transcripts.
4. Content recommendation engine using embeddings.


## 1. Install Dependencies

Install required libraries.

In [1]:
!pip install requests bs4 transformers sentence-transformers scikit-learn datasets ipywidgets
!git clone https://github.com/FelipeGRK/theamericanlifepodcast.git



## 2. Import Libraries

Import necessary libraries and modules.

In [2]:
import os
import requests
from getpass import getpass

# === Data Handling & Processing ===
from datasets import Dataset
from bs4 import BeautifulSoup
from sklearn.metrics.pairwise import cosine_similarity

# === Transformers & Hugging Face Utilities ===
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from huggingface_hub import InferenceClient, login

# === Interactive Widgets & Display ===
import ipywidgets as widgets
from IPython.display import display, clear_output



## 3. Authenticate with Hugging Face

Sign in to Hugging Face with an API key (access token).

In [8]:
hf_api_key = getpass("Please enter your Hugging Face API key: ")
login(token=hf_api_key)

Please enter your Hugging Face API key: ··········
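
For unattended runs, the interactive prompt can be skipped when a token is already available in the environment. The sketch below is one way to do that, assuming the token is stored in an environment variable named `HF_TOKEN`; the helper name `get_hf_token` and its `env` and `prompt` parameters are hypothetical and exist only to make the logic easy to test.

```python
import os
from getpass import getpass

def get_hf_token(env=None, prompt=getpass):
    """Return a token from HF_TOKEN if set, otherwise ask the user for one."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    # Fall back to the same interactive prompt used in the cell above.
    return prompt("Please enter your Hugging Face API key: ").strip()
```

The returned value can be passed to `login(token=...)` exactly as in the cell above.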


## 5. Fetch Podcast Transcripts

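Download each transcript page and extract its text. If a page has no `div` with class `transcript`, the function falls back to the text of the whole page, which includes navigation and banner text.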

In [3]:
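def fetch_transcript(url):
    """Return the transcript text at `url`, or None if the request fails."""
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return None
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    # Prefer the transcript container; fall back to the whole page text.
    transcript_div = soup.find("div", class_="transcript")
    if transcript_div:
        transcript_text = transcript_div.get_text(separator="\n")
    else:
        transcript_text = soup.get_text(separator="\n")
    return transcript_text.strip()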

transcript_urls = [
    "https://www.thisamericanlife.org/1/transcript",
    "https://www.thisamericanlife.org/2/transcript",
    "https://www.thisamericanlife.org/3/transcript",
    "https://www.thisamericanlife.org/4/transcript",
    "https://www.thisamericanlife.org/5/transcript",
    "https://www.thisamericanlife.org/6/transcript",
    "https://www.thisamericanlife.org/7/transcript",
    "https://www.thisamericanlife.org/8/transcript",
    "https://www.thisamericanlife.org/9/transcript",
    "https://www.thisamericanlife.org/10/transcript",
]

transcripts = []
for url in transcript_urls:
    transcript_text = fetch_transcript(url)
    if transcript_text:
        transcripts.append(transcript_text)
        print(f"Transcript fetched from {url}")
    else:
        print(f"Failed to retrieve transcript from {url}")

if transcripts:
    print("Transcripts fetched successfully.")
else:
    print("Failed to fetch transcripts.")

Transcript fetched from https://www.thisamericanlife.org/1/transcript
Transcript fetched from https://www.thisamericanlife.org/2/transcript
Transcript fetched from https://www.thisamericanlife.org/3/transcript
Transcript fetched from https://www.thisamericanlife.org/4/transcript
Transcript fetched from https://www.thisamericanlife.org/5/transcript
Transcript fetched from https://www.thisamericanlife.org/6/transcript
Transcript fetched from https://www.thisamericanlife.org/7/transcript
Transcript fetched from https://www.thisamericanlife.org/8/transcript
Transcript fetched from https://www.thisamericanlife.org/9/transcript
Transcript fetched from https://www.thisamericanlife.org/10/transcript
Transcripts fetched successfully.
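
When the whole-page fallback is used, the extracted text tends to contain long runs of blank lines and stray whitespace. A minimal cleanup sketch follows; `clean_extracted_text` is a hypothetical helper, not part of the notebook, and it only normalizes whitespace (it does not remove navigation or banner text).

```python
import re

def clean_extracted_text(text):
    """Collapse whitespace debris left by HTML-to-text extraction."""
    # Strip each line, treating non-breaking spaces as ordinary spaces.
    lines = [line.replace("\xa0", " ").strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()
```

It could be applied to the value returned by `fetch_transcript` before the text is appended to `transcripts`.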


## 7. Define a Custom Prompt

Create a function to generate custom prompts for summarization.

In [4]:
def generate_prompt(transcript, episode, title, date):
    prompt = f"""
You are an assistant specialized in summarizing podcast episodes.
From the following transcript, generate a concise and informative summary that includes:
- Episode number: {episode}
- Title: {title}
- Publication date: {date}
- Main topics discussed in the episode
- Names of speakers and guests mentioned

Transcript:
{transcript}

Respond with a clear and structured summary.
"""
    return prompt

## 8. Create Prompts for the First 10 Episodes

Construct prompts using the fetched transcripts and metadata.

In [5]:
metadata = [
    {"episode": "001", "title": "New Beginnings", "date": "95-11-17"},
    {"episode": "002", "title": "Small Scale Sin", "date": "95-11-24"},
    {"episode": "003", "title": "A Violent Utopia", "date": "95-12-01"},
    {"episode": "004", "title": "Animals", "date": "95-12-08"},
    {"episode": "005", "title": "Anger and Forgiveness", "date": "95-12-15"},
    {"episode": "006", "title": "Poultry Slam 1995", "date": "95-12-22"},
    {"episode": "007", "title": "Quitting", "date": "96-01-05"},
    {"episode": "008", "title": "On Work", "date": "96-01-12"},
    {"episode": "009", "title": "Julia Sweeney", "date": "96-01-19"},
    {"episode": "010", "title": "Double Lives", "date": "96-01-26"},
]

prompts = []
for i, transcript in enumerate(transcripts):
    prompt_custom = generate_prompt(transcript, metadata[i]["episode"], metadata[i]["title"], metadata[i]["date"])
    prompts.append(prompt_custom)
    print(f"Custom prompt created for episode {metadata[i]['episode']}.")

Custom prompt created for episode 001.
Custom prompt created for episode 002.
Custom prompt created for episode 003.
Custom prompt created for episode 004.
Custom prompt created for episode 005.
Custom prompt created for episode 006.
Custom prompt created for episode 007.
Custom prompt created for episode 008.
Custom prompt created for episode 009.
Custom prompt created for episode 010.
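
The loop above looks up `metadata[i]` by the position of each entry in `transcripts`. Because the fetching cell appends only successful downloads, a single failed request would shift every later transcript onto the wrong metadata entry. One way to avoid this, sketched here with a hypothetical helper `pair_with_metadata`, is to keep each transcript in the same record as its metadata:

```python
def pair_with_metadata(urls, metadata, fetch):
    """Fetch each URL and keep the text together with its own metadata."""
    episodes = []
    for url, meta in zip(urls, metadata):
        text = fetch(url)
        if text:  # skip failed downloads without shifting later episodes
            episodes.append({**meta, "url": url, "transcript": text})
    return episodes

# Usage with a stand-in fetcher in which the second URL fails:
fake_pages = {"u1": "text one", "u3": "text three"}
meta = [{"episode": "001"}, {"episode": "002"}, {"episode": "003"}]
episodes = pair_with_metadata(["u1", "u2", "u3"], meta, fake_pages.get)
print([e["episode"] for e in episodes])  # ['001', '003']
```

In the notebook, `fetch` would be `fetch_transcript` and `urls` would be `transcript_urls`.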


## 6. Select Podcast Transcript

Let the user choose a transcript from the list and print it.


In [6]:
def display_transcript_options(transcripts, metadata):
    # List the available episodes, numbered from 1.
    for i, meta in enumerate(metadata):
        print(f"{i+1}. Title: {meta['title']} - Publication date: {meta['date']}")

    raw = input(f"Select the transcript you want to see (1-{len(transcripts)}): ")
    try:
        selection = int(raw) - 1
    except ValueError:
        # Non-numeric input would otherwise raise and abort the cell.
        print("Invalid selection.")
        return
    if 0 <= selection < len(transcripts):
        print(f"Showing transcript for episode: {metadata[selection]['title']}")
        print(transcripts[selection])
    else:
        print("Invalid selection.")

display_transcript_options(transcripts, metadata)

1. Title: New Beginnings - Publication date: 95-11-17
2. Title: Small Scale Sin - Publication date: 95-11-24
3. Title: A Violent Utopia - Publication date: 95-12-01
4. Title: Animals - Publication date: 95-12-08
5. Title: Anger and Forgiveness - Publication date: 95-12-15
6. Title: Poultry Slam 1995 - Publication date: 95-12-22
7. Title: Quitting - Publication date: 96-01-05
8. Title: On Work - Publication date: 96-01-12
9. Title: Julia Sweeney - Publication date: 96-01-19
10. Title: Double Lives - Publication date: 96-01-26
Select the transcript you want to see (1-10): 2
Showing transcript for episode: Small Scale Sin
2: Small Scale Sin - This American Life


## 9. Summarize the Transcripts

Load the summarization model and generate a summary for a transcript file selected by the user.


In [16]:
import os
from datasets import Dataset
from transformers import pipeline

# Initialize the summarization pipeline (using facebook/bart-large-cnn)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Function to split text into manageable chunks for summarization.
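def chunk_text(text, max_chunk_chars=1024):
    """Split text on whitespace into chunks of roughly max_chunk_chars characters."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        # Start a new chunk when adding this word would exceed the limit;
        # the current_chunk check avoids emitting an empty first chunk.
        if current_chunk and current_length + len(word) + 1 > max_chunk_chars:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word) + 1
        else:
            current_chunk.append(word)
            current_length += len(word) + 1
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks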

# Function to summarize a transcript.
def summarize_transcript(transcript):
    summaries = []
    for chunk in chunk_text(transcript, max_chunk_chars=1024):
        # Scale the length limits to the chunk's word count (a rough proxy for
        # tokens), keeping min_len below max_len so very short chunks do not fail.
        input_length = len(chunk.split())
        max_len = max(2, min(200, int(input_length * 0.8)))
        min_len = min(50, int(input_length * 0.4), max_len - 1)

        generated = summarizer(chunk, max_length=max_len, min_length=min_len, do_sample=False)
        summaries.append(generated[0]['summary_text'])
    return " ".join(summaries)

# Path to the folder containing transcripts
folder_path = "/content/theamericanlifepodcast/transcript-text"

# List all files in the folder
transcript_files = [f for f in os.listdir(folder_path) if f.endswith(".txt")]
print("Available transcripts:")
for file in transcript_files:
    print(file)

# Allow the user to select which transcript to summarize
print("\nEnter the filename of the transcript you want to summarize (e.g., 1.txt):")
selected_filename = input().strip()

# Check if the selected file exists in the folder
selected_file_path = os.path.join(folder_path, selected_filename)
if not os.path.exists(selected_file_path):
    print(f"Error: The file '{selected_filename}' does not exist in the folder.")
else:
    # Read the content of the selected file
    with open(selected_file_path, 'r') as file:
        transcript_text = file.read()

    # Create a dataset-like structure for processing
    data = {"filename": [selected_filename], "transcript": [transcript_text]}
    dataset = Dataset.from_dict(data)

    # Summarize the selected transcript
    def process_batch(batch):
        batch["summary"] = summarize_transcript(batch["transcript"])
        return batch

    dataset = dataset.map(process_batch)

    # Print the summary for the selected file
    print("\nFilename:", dataset[0]["filename"])
    print("Generated Summary:\n", dataset[0]["summary"])

Device set to use cuda:0


Available transcripts:
1.txt
3.txt
5.txt
8.txt
7.txt
6.txt
9.txt
10.txt
4.txt
2.txt

Enter the filename of the transcript you want to summarize (e.g., 1.txt):
4.txt




Filename: 4.txt
Generated Summary:
 This American Life is produced for the ear and designed to be heard. If you are able, we strongly encourage you to listen to the audio, which includes emotion and emphasis that's not on the page. Transcripts are generated using a combination of speech recognition software and human transcribers and may contain errors. If an American family can't get along in paradise, what hope is there? Ira Glass and Sandra Tsing Loh and David Sedaris tackle that team. Today's program, Nightmare Vacations, includes stories by Sandra Loh. My parents brought their favorite breakfast cereals with them, 10,000 miles to Hawaii. They like the familiar comforts of home, like many people do. In all sorts of stressful situations, what my parents do is that they make themselves comfortable by creating a comfortable, personal space. When my mom got breast cancer five years ago, they decided to start a major rehab on the house. And in Hawaii, with the stress of having to deal 
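
`chunk_text` splits on whitespace, so a chunk boundary can fall in the middle of a sentence. As an alternative sketch, chunks can be packed from whole sentences instead. The helper `chunk_by_sentences` below is hypothetical, and its regex splitter is deliberately naive: it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations will be split incorrectly.

```python
import re

def chunk_by_sentences(text, max_chunk_chars=1024):
    """Greedily pack whole sentences into chunks of up to max_chunk_chars.

    A single sentence longer than the limit is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_chars or not current:
            # Fits, or is an over-long sentence that must stand alone.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

It takes the same arguments as `chunk_text`, so it could be swapped into `summarize_transcript` to compare the resulting summaries.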

## 10. Keyword Search Engine

Search a transcript for a keyword using a case-insensitive, line-by-line substring match.

In [None]:
def keyword_search(transcript, keyword):
    lines = transcript.split('\n')
    results = [line for line in lines if keyword.lower() in line.lower()]
    return results

if 'transcripts' in globals():
    keyword = "betray"
    search_results = keyword_search(transcripts[1], keyword)

    print(f"Search results for '{keyword}':\n")
    for result in search_results:
        print(result)
else:
    print("Transcripts are not defined. Please run the fetching cell.")

Search results for 'betray':

[,Full episodeToggle Audio and Transcript SyncTranscript2: Small Scale SinNote: This American Life is produced for the ear and designed to be heard. If you are able, we strongly encourage you to listen to the audio, which includes emotion and emphasis that's not on the page. Transcripts are generated using a combination of speech recognition software and human transcribers, and may contain errors. Please check the corresponding audio before quoting in print.PrologueIra GlassOK, three boys, aged 13, 15, and 16. All three chose to appear with fake names on this radio program. And the fake names they chose, you ready? K-Rad, Mr. Warez, and Fred. Those first two names come from the world of computer hacking and software piracy. Mr. Warez, for example, that's "warez," as in "wares," as in "softwares," as in pirated softwares, illegal softwares. And as for Fred--FredWhy Fred? For no reason, man. There's got to be someone else named Fred out there.Ira GlassYou se
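
Because `keyword_search` returns whole lines, a transcript with few line breaks yields very long results like the one above. A possible refinement, sketched here with a hypothetical helper `keyword_in_context`, returns a short window of text around each match instead; the `width` parameter (characters kept on each side) is an assumption chosen for illustration.

```python
import re

def keyword_in_context(text, keyword, width=40):
    """Return one short snippet per case-insensitive match of keyword."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end].strip())
    return snippets
```

For example, `keyword_in_context(transcripts[1], "betray")` would return one snippet of at most `2 * width + len(keyword)` characters per occurrence.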

## 11. Semantic Search with Embeddings

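Embed each transcript with a sentence-transformer model and rank episodes by cosine similarity to a query. Note that `all-MiniLM-L6-v2` truncates input longer than 256 word pieces by default, so each embedding reflects only the beginning of a transcript.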

In [17]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
transcript_embeddings = model.encode(transcripts)

def semantic_search(query, embeddings, top_k=5):
    query_embedding = model.encode([query])[0]
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    similar_indices = similarities.argsort()[-top_k:][::-1]
    return similar_indices, similarities

query = "economic impact"
similar_indices, similarities = semantic_search(query, transcript_embeddings)

print(f'Semantic search results for query "{query}":\n')
for idx in similar_indices:
    print(f"Episode {metadata[idx]['episode']}, Similarity: {similarities[idx]:.4f}")

Semantic search results for query "economic impact":

Episode 001, Similarity: -0.0536
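
To make the ranking step concrete, here is a dependency-free sketch of what `cosine_similarity` followed by `argsort()[-top_k:][::-1]` computes. The function names `cosine` and `top_k_similar` are hypothetical, and the plain-list vectors stand in for the model's embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, vectors, top_k=5):
    """Return (index, score) pairs for the top_k most similar vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

When fewer than `top_k` vectors are available, the slice returns all of them rather than failing. That behavior is consistent with the single result printed above, which suggests only one transcript was embedded in that run.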


## 12. Content Recommendation Engine

Recommend the episodes whose transcript embeddings are most similar to a chosen episode's embedding.

In [27]:
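def recommend_episodes(transcript_embedding, all_embeddings, top_k=5, exclude_idx=None):
    """Return indices of the top_k most similar episodes and all similarity scores."""
    similarities = cosine_similarity([transcript_embedding], all_embeddings)[0]
    # Rank from most to least similar, skipping the source episode itself,
    # which would otherwise always rank first with a similarity of 1.0.
    ranked = [i for i in similarities.argsort()[::-1] if i != exclude_idx]
    return ranked[:top_k], similarities

source_idx = 1  # second entry in `transcripts`
if source_idx >= len(transcript_embeddings):
    # Guard against an embedding array shorter than expected,
    # e.g. when only one transcript was loaded before encoding.
    print(f"Only {len(transcript_embeddings)} transcript(s) embedded; re-run the fetching and embedding cells.")
else:
    similar_indices, similarities = recommend_episodes(
        transcript_embeddings[source_idx], transcript_embeddings, exclude_idx=source_idx
    )
    print("Recommended episodes based on similarity:\n")
    for idx in similar_indices:
        print(f"Episode {metadata[idx]['episode']}, Similarity: {similarities[idx]:.4f}")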