# Podcast Transcript Summarizer and Recommendation System

In this notebook, I will:
- Clone a Git repository containing the following folders/files:
  - `episode-html` – HTML files for each episode (main episode pages)
  - `transcript-text` – Text files with transcript content for each episode
  - `transcript-html` – HTML files of the full transcript pages
  - `episodes.html` – The main page with episodes listings
  - `episodes_info.csv` – Episode metadata (columns: ep_num, ep_name, pub_date, host, contributers, num_acts, last_timestamp, url_suffix)
- Load and parse the data.
- Perform exploratory data analysis (EDA) on the transcripts and metadata.
- Use prompt engineering with a Hugging Face LLM (e.g., T5-base) to generate short summaries for each transcript so that users can preview the episode content.
- (Optionally) Set up a basic recommendation engine using article metadata and keywords.

**Solution Space Overview:**

- **Keyword Search Engine**  
  *Pros:* Simple implementation.  
  *Cons:* Limited by exact-match limitations.

- **LLM-Based Summarization**  
  *Pros:* Quickly generates episode summaries for fast comprehension.  
  *Cons:* May need fine-tuning to avoid inaccuracies.

- **Semantic Search with Embeddings**  
  *Pros:* Supports natural language queries for better search relevance.  
  *Cons:* Requires careful tuning and compute resources.

- **Content Recommendation Engine**  
  *Pros:* Enhances exploration by suggesting similar episodes.  
  *Cons:* Risks recommending irrelevant content if embeddings aren’t well-calibrated.

In this notebook, I focus on LLM-based summarization (and later, recommendation) using my existing data.




## 1. Clone the Git Repository and List Files

In this cell, I clone the Git repository that contains all the folders and files.


In [51]:
# Clone your Git repository (adjust the URL to your repo)
!git clone https://github.com/FelipeGRK/theamericanlifepodcast.git

# Change directory into the repository
%cd yourrepo

# List the directory structure to confirm that folders/files are present
!find . -maxdepth=2 | sort

fatal: destination path 'theamericanlifepodcast' already exists and is not an empty directory.
[Errno 2] No such file or directory: 'yourrepo'
/content
find: unknown predicate `-maxdepth=2'


## 2. Load Episode Metadata and Transcripts

Here, I load the episode metadata from `episodes_info.csv` and read the transcripts from the `transcript-text` folder. The metadata file should include columns such as:  
- ep_num, ep_name, pub_date, host, contributers, num_acts, last_timestamp, url_suffix




In [50]:
import os
import pandas as pd

# Define the paths to your local folders/files
transcript_folder = "transcript-text"   # Folder with text transcripts (e.g., "1.txt", "2.txt", etc.)
episode_info_file = "episodes_info.csv"   # CSV file with episode metadata

# Load the episode metadata
episode_info = pd.read_csv(episode_info_file)
print("Episode Metadata:")
display(episode_info.head())

# Function to load a transcript for a given episode number
def load_transcript(ep_num):
    file_path = os.path.join(transcript_folder, f"{ep_num}.txt")
    if os.path.exists(file_path):
        with open(file_path, encoding="utf-8") as f:
            return f.read().strip()
    else:
        return None

# Add a transcript column to the metadata DataFrame
episode_info['transcript'] = episode_info['ep_num'].apply(lambda x: load_transcript(x))

# Display a preview of transcript for Episode 2
ep2 = episode_info[episode_info['ep_num'] == 2]
if not ep2.empty:
    print("Transcript for Episode 2 (preview):")
    print(ep2['transcript'].values[0][:500])
else:
    print("Episode 2 transcript not found.")


FileNotFoundError: [Errno 2] No such file or directory: 'episodes_info.csv'

## 3. Exploratory Data Analysis (EDA)

Now I analyze the transcripts' characteristics, such as transcript length distribution.


In [36]:
import matplotlib.pyplot as plt

# Calculate transcript lengths (in words)
episode_info['transcript_length'] = episode_info['transcript'].apply(lambda x: len(x.split()) if x else 0)

# Plot a histogram of transcript lengths
plt.figure(figsize=(8, 4))
plt.hist(episode_info['transcript_length'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Transcript Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()


Transcript fetched successfully. Preview:
2: Small Scale Sin - This American Life










































Skip to main content
























Hi. We love you. Be our Life Partner.


Support the show to get ad-free listening, bonus content, and our new Greatest Hits Archive.








Learn more
















 








































 




















 






00:00


00:00






Transcript


Share


























 


 






 














This American Life








Life Partners...



## 4. Generate Short Summaries Using LLM and Prompt Engineering

In this section, I build a custom prompt that combines the transcript with episode metadata (episode number, title, publication date, host, and contributors). This prompt is then used with a Hugging Face model (e.g., T5-base) to generate a short summary.

If you require an API key, you can provide it as shown.



In [53]:
# Install transformers if not installed (already done above if necessary)
!pip install transformers

from transformers import pipeline

# Option 1: If you need to use your Hugging Face API key, uncomment and update:
# summarizer = pipeline("text2text-generation", model="t5-base", use_auth_token="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")

# Option 2: For public models:
summarizer = pipeline("text2text-generation", model="t5-base", use_auth_token="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")





Device set to use cpu


### 4.1 Define the Prompt Function

This function constructs a prompt that instructs the model to summarize the transcript and include the episode metadata.



In [54]:
def generate_prompt(transcript, ep_num, ep_name, pub_date, host, contribuidores):
    prompt = f"""
You are an assistant specialized in summarizing podcast episodes.
Based on the following transcript, generate a concise and informative summary that includes:
- Episode Number: {ep_num}
- Title: {ep_name}
- Publication Date: {pub_date}
- Host: {host}
- Contributors: {contribuidores}
- Main topics discussed, speakers, and guests mentioned

Transcript:
{transcript}

Please respond with a clear and structured summary.
"""
    return prompt



### 4.2 Build the Custom Prompt for Episode 2

Using the metadata and transcript for Episode 2, I create the custom prompt.


In [55]:
# Retrieve Episode 2 data
ep2_info = episode_info[episode_info['ep_num'] == 2].iloc[0]
transcript_ep2 = ep2_info['transcript']

if transcript_ep2:
    custom_prompt = generate_prompt(
        transcript=transcript_ep2,
        ep_num=ep2_info['ep_num'],
        ep_name=ep2_info['ep_name'],
        pub_date=ep2_info['pub_date'],
        host=ep2_info['host'],
        contribuidores=ep2_info['contributers']
    )
    print("Custom Prompt for Episode 2 (preview):\n")
    print(custom_prompt[:500], "...\n")
else:
    print("Transcript for Episode 2 not found. Cannot generate prompt.")


NameError: name 'episode_info' is not defined

## 5. (Optional) Recommend Related Episodes

For a simple content recommendation, I can use the episode metadata (such as keywords in the title) and similarity metrics.  
Below is an example of a simple recommendation function based on matching keywords between episodes.


In [56]:
def recommend_episodes(target_ep_num, metadata_df, top_n=3):
    """
    Recommend related episodes based on similar keywords in the title.
    This is a simple approach: it compares lowercase words in the episode titles.
    """
    target_title = metadata_df[metadata_df['ep_num'] == target_ep_num]['ep_name'].values[0].lower()
    target_keywords = set(target_title.split())

    recommendations = []
    for _, row in metadata_df.iterrows():
        if row['ep_num'] == target_ep_num:
            continue
        title = row['ep_name'].lower()
        keywords = set(title.split())
        common = target_keywords.intersection(keywords)
        score = len(common)
        recommendations.append((row['ep_num'], row['ep_name'], score))

    # Sort recommendations by score in descending order
    recommendations = sorted(recommendations, key=lambda x: x[2], reverse=True)
    return recommendations[:top_n]

# Example: Recommend episodes related to Episode 2
recommended = recommend_episodes(2, episode_info, top_n=3)
print("Recommended Episodes for Episode 2:")
for rec in recommended:
    print(f"Episode {rec[0]} - {rec[1]} (Score: {rec[2]})")


NameError: name 'episode_info' is not defined