# Proposed Solution Overview

In this project, I am building a podcast discovery system that combines several components:

- **LLM-Based Summarization:** Quickly generates short summaries for each transcript so users can preview the episode content. _Pros:_ Fast and informative summaries. _Cons:_ May need prompt engineering or fine-tuning.

- **Content Recommendation Engine:** Uses episode metadata (episode number, title, publication date, etc.) combined with keywords to recommend related episodes. _Pros:_ Encourages exploration by suggesting similar content. _Cons:_ Requires calibration to avoid irrelevant recommendations.

- **Keyword Search Engine:** A simple search over transcripts based on literal keyword matching. _Pros:_ Easy to implement. _Cons:_ Misses synonyms and paraphrases because it only finds the literal query text.

Below, I implement each of these components against my local data, along with a small helper that generates URL suffixes from episode names.

In [None]:
# -----------------------------
# 1. Load Episode Metadata and Transcripts
# -----------------------------

import os
import pandas as pd

# Define paths
transcript_folder = "transcript-txt"  # Folder with transcript text files (e.g., "1.txt", "2.txt", etc.)
episode_info_file = "episodes_info.csv"  # CSV file with episode metadata

# Load episode metadata
episode_info = pd.read_csv(episode_info_file)
print("Episode Metadata:")
display(episode_info.head())

# Function to load a transcript given an episode number
def load_transcript(ep_num):
    file_path = os.path.join(transcript_folder, f"{ep_num}.txt")
    if os.path.exists(file_path):
        with open(file_path, encoding="utf-8") as f:
            return f.read().strip()
    else:
        return None

# Add a transcript column to the metadata DataFrame (None where the file is missing)
episode_info['transcript'] = episode_info['ep_num'].apply(load_transcript)

# Preview transcript for Episode 2, guarding against a missing row or missing file
ep2 = episode_info[episode_info['ep_num'] == 2]
if not ep2.empty and isinstance(ep2['transcript'].iloc[0], str):
    print("Transcript for Episode 2 (preview):")
    print(ep2['transcript'].iloc[0][:500])
else:
    print("Episode 2 transcript not found.")
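
Because `load_transcript` returns `None` for a missing file, it can help to check transcript coverage before running the later steps. The following is a minimal sketch on a small stand-in DataFrame; `demo_info` and its rows are hypothetical and only mirror the `ep_num`, `ep_name`, and `transcript` columns used above.

```python
import pandas as pd

# Hypothetical stand-in for episode_info after the transcript column is added
demo_info = pd.DataFrame({
    "ep_num": [1, 2, 3],
    "ep_name": ["Pilot", "Second Show", "Third Show"],
    "transcript": ["hello world", None, "some text here"],
})

# Episode numbers whose transcript file was not found
missing = demo_info.loc[demo_info["transcript"].isna(), "ep_num"].tolist()

# Rough transcript length in words, 0 where the transcript is missing
word_counts = demo_info["transcript"].fillna("").str.split().str.len()

print(f"Missing transcripts: {missing}")
print(f"Word counts: {word_counts.tolist()}")
```

The word counts also give a first indication of which transcripts are likely to exceed a model's input limit.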

## LLM-Based Summarization

This section uses prompt engineering to generate a short summary for an episode. The custom prompt will include the transcript along with key metadata (episode number, title, publication date, host, contributors) so users can preview the content before listening.

In [None]:
from transformers import pipeline

# Initialize the summarization pipeline.
# t5-base is a public checkpoint, so no Hugging Face token is required.
# For a gated or private model, pass your token instead, e.g.:
# summarizer = pipeline("text2text-generation", model="t5-base", token="YOUR_API_KEY")
# (older transformers releases used use_auth_token for the same purpose)
summarizer = pipeline("text2text-generation", model="t5-base")

# Define a function to build a custom prompt
def generate_prompt(transcript, ep_num, ep_name, pub_date, host, contribuidores):
    prompt = f"""
You are an assistant specialized in summarizing podcast episodes.
Based on the following transcript, generate a concise and informative summary that includes:
- Episode Number: {ep_num}
- Title: {ep_name}
- Publication Date: {pub_date}
- Host: {host}
- Contributors: {contribuidores}
- Main topics discussed, speakers, and guests mentioned

Transcript:
{transcript}

Please respond with a clear and structured summary.
"""
    return prompt

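# Retrieve data for Episode 2 (guarding against a missing metadata row)
ep2_rows = episode_info[episode_info['ep_num'] == 2]
ep2_info = ep2_rows.iloc[0] if not ep2_rows.empty else None
transcript_ep2 = ep2_info['transcript'] if ep2_info is not None else None

if isinstance(transcript_ep2, str) and transcript_ep2: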
    custom_prompt = generate_prompt(
        transcript=transcript_ep2,
        ep_num=ep2_info['ep_num'],
        ep_name=ep2_info['ep_name'],
        pub_date=ep2_info['pub_date'],
        host=ep2_info['host'],
        contribuidores=ep2_info['contributers']
    )
    print("Custom Prompt for Episode 2 (preview):\n")
    print(custom_prompt[:500], "...\n")
    
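    # Generate summary using the LLM.
    # Note: truncation=True cuts the prompt to the model's maximum input length,
    # so for long transcripts the tail of the text (and the closing instruction) is dropped.
    summary_output = summarizer(custom_prompt, max_length=200, truncation=True)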
    print("Generated Summary for Episode 2:\n")
    print(summary_output[0]['generated_text'])
else:
    print("Transcript for Episode 2 not found. Cannot generate summary.")
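
One limitation of the cell above is input length: `t5-base` is generally used with inputs of roughly 512 tokens, so a full transcript is unlikely to fit in a single prompt. A common workaround is to split the transcript into overlapping chunks and summarize each one. The sketch below shows only the chunking step; `chunk_words` is a hypothetical helper, and counting words is a rough proxy for counting tokens.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into overlapping word windows (hypothetical helper)."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks

# Tiny demonstration with ten placeholder words
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo_text, max_words=4, overlap=1)
print(demo_chunks)
```

Each chunk could then be passed to `summarizer` (T5 checkpoints are usually prompted with a `summarize: ` prefix) and the partial summaries joined or summarized once more. Whether that improves quality on these transcripts would need to be checked.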

## Content Recommendation Engine

This section experiments with recommending related episodes based on keyword overlap in the episode titles. This simple approach lowercases each title, splits it on whitespace into keywords, and scores every other episode by the number of keywords it shares with the target episode.

In [None]:
def recommend_episodes(target_ep_num, metadata_df, top_n=3):
    # Get the title of the target episode; return no recommendations if it is missing
    target_rows = metadata_df[metadata_df['ep_num'] == target_ep_num]
    if target_rows.empty:
        return []
    target_keywords = set(target_rows['ep_name'].iloc[0].lower().split())
    
    recommendations = []
    for _, row in metadata_df.iterrows():
        if row['ep_num'] == target_ep_num:
            continue
        title = row['ep_name'].lower()
        keywords = set(title.split())
        common = target_keywords.intersection(keywords)
        score = len(common)
        recommendations.append((row['ep_num'], row['ep_name'], score))
    
    # Sort recommendations by score in descending order
    recommendations = sorted(recommendations, key=lambda x: x[2], reverse=True)
    return recommendations[:top_n]

# Example: Recommend episodes related to Episode 2
recommended = recommend_episodes(2, episode_info, top_n=3)
print("Recommended Episodes for Episode 2:")
for rec in recommended:
    print(f"Episode {rec[0]} - {rec[1]} (Score: {rec[2]})")
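
A raw count of shared words tends to favor long titles and common words such as "the". One possible refinement, sketched below, is to strip punctuation, drop stopwords, and score with Jaccard similarity (shared keywords divided by all distinct keywords). The `STOPWORDS` list, the helper names, and the "Part Two" title are illustrative assumptions, not part of the project data.

```python
import string

# Small illustrative stopword list (an assumption, not taken from the project)
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "we", "with"}

def title_keywords(title):
    """Lowercase a title, strip ASCII punctuation, and drop stopwords."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    return {word for word in cleaned.split() if word not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two keyword sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# The second title is hypothetical and used only for illustration
k1 = title_keywords("The Problem We All Live With: Part One")
k2 = title_keywords("The Problem We All Live With: Part Two")
print(round(jaccard(k1, k2), 2))
```

Because the score is normalized to the range 0 to 1, a threshold can be applied to suppress recommendations that share almost nothing with the target.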

## Keyword Search Engine

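As a simple baseline, this section implements a keyword search that scans the transcript texts for a given query. The match is a case-insensitive substring test, so a query like "sin" also hits words that merely contain it, such as "since". Despite that limitation, it serves as a point of comparison for more advanced semantic search methods.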

In [None]:
def keyword_search(query, metadata_df):
    query = query.lower()
    results = []
    for _, row in metadata_df.iterrows():
        transcript = row['transcript']
        # Skip rows with a missing transcript (None or NaN)
        if isinstance(transcript, str) and query in transcript.lower():
            results.append((row['ep_num'], row['ep_name']))
    return results

# Example: Search for the keyword "sin" in transcripts
search_results = keyword_search("sin", episode_info)
print("Keyword Search Results for 'sin':")
for res in search_results:
    print(f"Episode {res[0]} - {res[1]}")
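
Since `query in transcript.lower()` is a substring test, searching for "sin" also matches words like "since". If whole-word matches are wanted, a regular expression with word boundaries is one option. The helper name `whole_word_match` and the sample sentences below are hypothetical.

```python
import re

def whole_word_match(query, text):
    """Return True if query appears in text as a whole word, ignoring case."""
    # re.escape keeps characters such as "." or "?" in the query literal
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(whole_word_match("sin", "Ever since the pilot episode"))
print(whole_word_match("sin", "A story about sin and forgiveness"))
```

The first call prints `False` and the second prints `True`. Swapping this check into `keyword_search` would keep the rest of the function unchanged.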

## URL Suffix Generation

This cell defines a function to scrub punctuation from a name, remove extra whitespace, and replace inner whitespace with hyphens to generate a URL suffix.

In [None]:
import re
import string

# Strip punctuation from a name, collapse whitespace, and join the words with hyphens
def get_url_suffix(name):
    # Remove ASCII punctuation plus curly quotes and the ellipsis character
    translator = str.maketrans("", "", string.punctuation + "’‘…")
    clean = name.translate(translator)
    # Em-dashes separate words, so turn them into spaces rather than deleting them
    clean = clean.replace("—", " ")
    # Transliterate "ä" so the suffix stays ASCII
    clean = clean.replace("ä", "a")
    # Collapse runs of spaces, trim the ends, and hyphenate
    return re.sub(r" +", " ", clean).strip().replace(" ", "-")

# Example usage
suffix = get_url_suffix("The Problem We All Live With Part One")
print(suffix)
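
The function above transliterates only "ä". If other accented letters turn up in episode names, Unicode normalization could handle them more generally. This is a sketch under that assumption; `strip_accents` is a hypothetical helper and the sample string is made up.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Händel café"))
```

This prints `Handel cafe`. Letters without a Unicode decomposition, such as "ø", pass through unchanged and would still need an explicit replacement.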