# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

## Instructions:

* Follow the steps outlined below to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [10]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import torch
from transformers import BertTokenizer, BertModel
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

In [2]:
# Load the dataset from the local file
podcast_data = pd.read_csv('podcastdata_dataset.csv')
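
Before moving on, it can help to confirm that the loaded frame has the columns the later steps read (`id`, `guest`, `title`, `text`, as used in the display code in Step 7). The following is a minimal sketch; `check_columns` and the one-row `sample` frame are illustrative helpers, not part of the workshop code.

```python
import pandas as pd

# Columns that the later steps read from each row
REQUIRED_COLUMNS = ["id", "guest", "title", "text"]

def check_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are missing from df."""
    return [col for col in required if col not in df.columns]

# Placeholder frame standing in for the real CSV (hypothetical values)
sample = pd.DataFrame({"id": [1], "guest": ["A"], "title": ["T"], "text": ["hello"]})
missing = check_columns(sample)
print(missing)  # an empty list means every required column is present
```

In the notebook, the same check would be `check_columns(podcast_data)`.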


### Step 3: Text Preprocessing

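Normalize each transcript before vectorization: convert it to lowercase, remove digits and punctuation, collapse repeated whitespace, and drop English stop words.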

In [3]:
# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace
    word_tokens = word_tokenize(text)
    filtered_text = [w for w in word_tokens if w not in STOP_WORDS]
    return ' '.join(filtered_text)

# Apply preprocessing to the text column
podcast_data['cleaned_text'] = podcast_data['text'].apply(preprocess_text)
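
To see what the regex steps do on their own, the sketch below applies only the lowercase, digit, punctuation, and whitespace steps to a made-up sentence. It deliberately skips tokenization and stop-word removal so that it runs without the NLTK data; `normalize` is a hypothetical helper name.

```python
import re

def normalize(text):
    """Apply only the regex steps (no tokenization or stop-word removal)."""
    text = text.lower()                       # lowercase
    text = re.sub(r"\d+", "", text)           # remove digits
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

sample = "Episode 42: GPT-2 and AI!"
cleaned = normalize(sample)
print(cleaned)  # episode gpt and ai
```

Note that removing digits and punctuation turns `GPT-2` into `gpt`, so queries that depend on numbers or hyphenated names lose information at this stage.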


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [4]:
# Create TF-IDF vector representations
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(podcast_data['cleaned_text'])
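
The same two calls can be tried on a toy corpus to see how a query is scored. This is a small illustration with invented documents, not the podcast data: `fit_transform` learns the vocabulary and returns one sparse row per document, and `transform` maps a query into the same space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented three-document corpus for illustration
docs = [
    "neural networks learn representations",
    "robots explore mars surface",
    "deep neural networks power speech recognition",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# The query must be transformed with the already-fitted vectorizer
query_vec = vectorizer.transform(["mars robots"])
scores = cosine_similarity(query_vec, matrix).flatten()
best = int(scores.argmax())
print(best)  # 1: the only document that shares terms with the query
```

Documents that share no term with the query score exactly zero, which is the exact-match behavior discussed later in the comparison.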


### Step 5: Vector Space Representation - BERT

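Create BERT vector representations of the transcripts using a pre-trained BERT model. Because the tokenizer is called with `truncation=True` and `max_length=512`, only the first 512 tokens of each transcript contribute to its embedding.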

In [5]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

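def embed_bert(text):
    # Truncate to BERT's 512-token limit; anything beyond it is dropped
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # Inference only, so skip gradient tracking
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()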

# Create BERT vector representations
bert_embeddings = np.array([embed_bert(text) for text in podcast_data['cleaned_text']])
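
Since each transcript is cut off at 512 tokens, one possible improvement is to split a transcript into overlapping windows and embed each window separately. The sketch below only shows the splitting step. `chunk_words` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for BERT tokens, which is an assumption (BERT's WordPiece tokenizer usually produces more tokens than words).

```python
def chunk_words(text, size=200, overlap=50):
    """Split text into overlapping windows of at most `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Ten "words" split into windows of 4 with an overlap of 1
demo = chunk_words(" ".join(str(i) for i in range(10)), size=4, overlap=1)
print(demo)  # ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk could then be passed to `embed_bert` and the resulting vectors averaged (for example with `np.mean(..., axis=0)`) to get one vector per episode. Whether this improves retrieval on this dataset would need to be tested.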


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

In [6]:
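# Preprocess the query and score it against every transcript
def process_query(query, vectorizer, tfidf_matrix, bert_model, bert_tokenizer):
    # Note: bert_model and bert_tokenizer are not used in the body;
    # embed_bert and bert_embeddings are read from the notebook's global scope.
    query_cleaned = preprocess_text(query)

    # TF-IDF: project the query into the fitted vocabulary, then compare
    query_tfidf = vectorizer.transform([query_cleaned])
    tfidf_scores = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

    # BERT: embed the query the same way as the transcripts, then compare
    query_bert = embed_bert(query_cleaned)
    bert_scores = cosine_similarity([query_bert], bert_embeddings).flatten()

    return tfidf_scores, bert_scores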
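
Both branches rank episodes by cosine similarity, which is the dot product of two vectors divided by the product of their lengths. The sketch below spells out that arithmetic with NumPy on made-up two-dimensional vectors. `cosine_scores` is an illustrative helper, not a replacement for scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ query_vec
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows with zero length get a score of 0 instead of a division error
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
scores = cosine_scores([1.0, 0.0], docs)
print(scores)  # approximately [1.0, 0.7071, 0.0]
```

Because the score depends only on direction, a long transcript is not favored over a short one merely for containing more words.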


### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [11]:
# Retrieve the top_n highest-scoring episodes (5 by default)
def retrieve_top_results(scores, podcast_data, top_n=5):
    top_indices = np.argsort(scores)[::-1][:top_n]  # Indices sorted by descending score
    top_results = podcast_data.iloc[top_indices]
    return top_results

# Display the results for one method (TF-IDF or BERT)
def display_results(results, method):
    print(f"Top results using {method}:")
    for _, row in results.iterrows():
        print(f"Episode ID: {row['id']}, Guest: {row['guest']}, Title: {row['title']}")
        print(f"Transcript: {row['text'][:100]}...")  # Display the first 100 characters
        print()
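
The ranking step relies on `np.argsort`, which returns indices in ascending order of score. Reversing that order and slicing gives the positions of the highest scores. A small example with invented scores:

```python
import numpy as np

scores = np.array([0.10, 0.80, 0.05, 0.45, 0.30])  # invented similarity scores
top_n = 3

ascending = np.argsort(scores)          # indices from lowest to highest score
top_indices = ascending[::-1][:top_n]   # reverse, then keep the first top_n
print(top_indices.tolist())  # [1, 3, 4]
```

These are positional indices, which is why `retrieve_top_results` uses `.iloc` rather than `.loc` to select rows.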


### Step 8: Test the IR System

Test the system with a sample query.

In [8]:
query = "Artificial Intelligence"
tfidf_scores, bert_scores = process_query(query, tfidf_vectorizer, tfidf_matrix, model, tokenizer)

# Retrieve and display the top results
tfidf_results = retrieve_top_results(tfidf_scores, podcast_data)
bert_results = retrieve_top_results(bert_scores, podcast_data)

display_results(tfidf_results, "TF-IDF")
display_results(bert_results, "BERT")


Top results using TF-IDF:
Episode ID: 3, Guest: Steven Pinker, Title: AI in the Age of Reason
Transcript: You've studied the human mind, cognition, language, vision, evolution, psychology, from child to adu...

Episode ID: 61, Guest: Melanie Mitchell, Title: Concepts, Analogies, Common Sense & Future of AI
Transcript: The following is a conversation with Melanie Mitchell. She's a professor of computer science at Port...

Episode ID: 120, Guest: François Chollet, Title: Measures of Intelligence
Transcript: The following is a conversation with Francois Chollet, his second time on the podcast. He's both a w...

Episode ID: 38, Guest: François Chollet, Title: Keras, Deep Learning, and the Progress of AI
Transcript: The following is a conversation with Francois Chollet. He's the creator of Keras, which is an open s...

Episode ID: 302, Guest: Richard Haier, Title: IQ Tests, Human Intelligence, and Group Differences
Transcript: Let me ask you to this question, whether it's bell curve or any 

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

In [9]:
# Analyze and compare the results
def compare_results(tfidf_results, bert_results):
    print("TF-IDF Results:")
    display_results(tfidf_results, "TF-IDF")
    print("\nBERT Results:")
    display_results(bert_results, "BERT")

compare_results(tfidf_results, bert_results)

TF-IDF Results:
Top results using TF-IDF:
Episode ID: 3, Guest: Steven Pinker, Title: AI in the Age of Reason
Transcript: You've studied the human mind, cognition, language, vision, evolution, psychology, from child to adu...

Episode ID: 61, Guest: Melanie Mitchell, Title: Concepts, Analogies, Common Sense & Future of AI
Transcript: The following is a conversation with Melanie Mitchell. She's a professor of computer science at Port...

Episode ID: 120, Guest: François Chollet, Title: Measures of Intelligence
Transcript: The following is a conversation with Francois Chollet, his second time on the podcast. He's both a w...

Episode ID: 38, Guest: François Chollet, Title: Keras, Deep Learning, and the Progress of AI
Transcript: The following is a conversation with Francois Chollet. He's the creator of Keras, which is an open s...

Episode ID: 302, Guest: Richard Haier, Title: IQ Tests, Human Intelligence, and Group Differences
Transcript: Let me ask you to this question, whether it's be
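
Printing both lists side by side makes it hard to judge how much the two methods agree. One simple way to quantify it is the Jaccard overlap of the returned episode IDs: the number of shared IDs divided by the number of distinct IDs across both lists. The sketch below uses placeholder ID lists, not results from this dataset, and `jaccard_overlap` is a hypothetical helper.

```python
def jaccard_overlap(ids_a, ids_b):
    """Share of distinct IDs that appear in both result lists (0.0 to 1.0)."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists have no overlap to measure
    return len(set_a & set_b) / len(union)

# Placeholder ID lists for illustration only
ids_method_a = [1, 2, 3, 4, 5]
ids_method_b = [3, 4, 5, 6, 7]
overlap = jaccard_overlap(ids_method_a, ids_method_b)
print(round(overlap, 3))  # 0.429 (3 shared IDs out of 7 distinct)
```

In the notebook, the inputs would be `tfidf_results['id']` and `bert_results['id']`. A low overlap is not a verdict on either method; it only shows that they surface different episodes, which then need to be inspected for relevance.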

### Discuss the differences, strengths, and weaknesses of each method based on the retrieval results
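- **TF-IDF:**
  - **Strengths:**
    - Fast to build and query: fitting the vectorizer and scoring a query are sparse matrix operations that run on a CPU.
    - Needs far less memory and compute than BERT, and uses the full transcript rather than a truncated prefix.
  - **Weaknesses:**
    - Matches only on shared terms, so an episode that covers the topic in different words (for example, "machine learning" for the query "Artificial Intelligence") can score low.
    - Ignores word order and context, so it cannot tell different senses of the same word apart.

- **BERT:**
  - **Strengths:**
    - Can relate a query to transcripts that use related wording rather than the exact query terms.
    - Produces contextual embeddings, so the representation of a word depends on the words around it.
  - **Weaknesses:**
    - Computationally expensive: every transcript and every query requires a forward pass through the model.
    - Slower than TF-IDF, especially on large datasets, and with `max_length=512` only the beginning of each transcript is embedded.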
