## Workshop: Building an Information Retrieval System for Podcast Episodes

**Objective:**
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

**Instructions:**

**Step 1: Import Libraries**
Import necessary libraries for data handling, text processing, and machine learning.


In [1]:
import re
import os
import nltk
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**Step 2: Load the Dataset**
Load the dataset of podcast transcripts.

Find the dataset at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript
Download the dataset in CSV format and save it in the `data` folder.

In [2]:
df=pd.read_csv("data/podcastdata_dataset.csv")
df

Unnamed: 0,id,guest,title,text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...
...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ..."
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ..."
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...


**Step 3: Text Preprocessing**

- Lowercase the text and remove punctuation
- Remove stop words


In [3]:
corpus=df['text']
corpus.head(10)

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
5    The following is a conversation with Guido van...
6    The following is a conversation with Jeff Atwo...
7    The following is a conversation with Eric Schm...
8    The following is a conversation with Stuart Ru...
9    The following is a conversation with Peter Abb...
Name: text, dtype: object

In [4]:
# First, lowercase each transcript and remove punctuation
punct_table = str.maketrans('', '', string.punctuation)  # built once, reused for every document
corpus_nopunct = [doc.lower().translate(punct_table) for doc in corpus]


In [None]:
print(corpus_nopunct[:10])

In [5]:
# Add the text without punctuation as a new column
df['text_nopunct'] = corpus_nopunct

In [6]:
# Requires the NLTK stop word list; run nltk.download('stopwords') once if it is missing
stop_words = set(stopwords.words('english'))

In [7]:
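# Note: punctuation was removed first, so contractions such as "you've" are now
# "youve" and no longer match the NLTK stop word entries; they stay in the text.
corpus_nostopw = [
    ' '.join(word for word in doc.split() if word not in stop_words)
    for doc in corpus_nopunct
]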

In [None]:
corpus_nopunct[300]

In [8]:
df['text_nostopw'] = corpus_nostopw

In [9]:
df.head(10)

Unnamed: 0,id,guest,title,text,text_nopunct,text_nostopw
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",as part of mit course 6s099 artificial general...,part mit course 6s099 artificial general intel...
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,as part of mit course 6s099 on artificial gene...,part mit course 6s099 artificial general intel...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",youve studied the human mind cognition languag...,youve studied human mind cognition language vi...
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,what difference between biological neural netw...,difference biological neural networks artifici...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,the following is a conversation with vladimir ...,following conversation vladimir vapnik hes co ...
5,6,Guido van Rossum,Python,The following is a conversation with Guido van...,the following is a conversation with guido van...,following conversation guido van rossum creato...
6,7,Jeff Atwood,Stack Overflow and Coding Horror,The following is a conversation with Jeff Atwo...,the following is a conversation with jeff atwo...,following conversation jeff atwood cofounder s...
7,8,Eric Schmidt,Google,The following is a conversation with Eric Schm...,the following is a conversation with eric schm...,following conversation eric schmidt ceo google...
8,9,Stuart Russell,Long-Term Future of AI,The following is a conversation with Stuart Ru...,the following is a conversation with stuart ru...,following conversation stuart russell hes prof...
9,10,Pieter Abbeel,Deep Reinforcement Learning,The following is a conversation with Peter Abb...,the following is a conversation with peter abb...,following conversation peter abbeel hes profes...
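
The two preprocessing steps can also be combined into a single helper. The sketch below is a minimal, self-contained illustration rather than the notebook's own code: `preprocess` and `DEMO_STOP_WORDS` are hypothetical names, and the small set only stands in for NLTK's English stop word list so that the example runs without any download. It strips punctuation from the stop words as well, so that a contraction such as "you've" (which becomes "youve" once the apostrophe is removed) is still filtered out.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
DEMO_STOP_WORDS = {"the", "is", "a", "of", "as", "with", "you've"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(doc, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words from one document."""
    # Normalise the stop words the same way as the text, so "you've" matches "youve".
    normalized_stops = {w.translate(_PUNCT_TABLE) for w in stop_words}
    tokens = doc.lower().translate(_PUNCT_TABLE).split()
    return " ".join(t for t in tokens if t not in normalized_stops)


print(preprocess("You've studied the human mind, cognition, language."))
# prints: studied human mind cognition language
```

With the full NLTK list, the same idea would apply by passing `stop_words=set(stopwords.words('english'))`.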


**Step 4: Vector Space Representation - TF-IDF**

Create TF-IDF vector representations of the transcripts.

In [10]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(df['text_nostopw'])
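
To make the weighting concrete, here is a small pure-Python sketch of a TF-IDF computation. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The function name `tfidf_vectors` is hypothetical, and tokenisation is simplified to a whitespace split, whereas the real vectorizer uses a regular expression that also drops single-character tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors) for docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        vectors.append([x / norm for x in raw])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["deep learning", "deep space physics"])
for term, weight in zip(vocab, vectors[1]):
    print(f"{term}: {weight:.3f}")
```

In the second document, "deep" receives a lower weight than "space" or "physics" because it occurs in both documents, which is the behaviour that lets rare query terms dominate the ranking.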

**Step 5: Vector Space Representation - BERT**

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [11]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

2024-08-05 17:50:39.842811: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-08-05 17:50:41.461069: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-08-05 17:50:42.385205: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-05 17:50:43.183483: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-05 17:50:43.185984: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-05 17:50:44.372973: I tensorflow/core/platform/cpu_feature_guard.cc:

In [12]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [13]:
# Load the pre-trained BERT model (the tokenizer was loaded in the previous cell)

model = TFBertModel.from_pretrained('bert-base-uncased')

2024-08-05 17:51:33.980645: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2024-08-05 17:51:34.254614: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2024-08-05 17:51:34.364332: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2024-08-05 17:51:37.434766: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from

In [14]:
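def generate_bert_embeddings(texts):
    """Return an array of shape (n_texts, hidden_size) with one [CLS] vector per text."""
    embeddings = []
    for text in texts:
        # truncation=True cuts each text to the model's maximum length (512 tokens
        # for bert-base-uncased), so only the start of a long transcript is encoded.
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        # Use the [CLS] token representation; convert the TF tensor to NumPy.
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())
    return np.vstack(embeddings)

corpus_bert = generate_bert_embeddings(corpus)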

2024-08-05 17:51:50.387048: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 12582912 exceeds 10% of free system memory.
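
Because `truncation=True` caps the input at the model's maximum length (512 tokens for `bert-base-uncased`), each embedding above reflects only the opening of a transcript. One possible workaround, which is not part of the original notebook, is to split every transcript into overlapping windows, embed each window, and average the results. The sketch below shows only the chunking and averaging logic: `chunk_words`, `mean_vector`, `embed_long_text`, and `toy_embed` are hypothetical names, the window is measured in words rather than WordPiece tokens (200 words is assumed to stay below the token limit), and `toy_embed` stands in for a real encoder call.

```python
def chunk_words(text, size=200, overlap=50):
    """Split text into overlapping windows of at most `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_long_text(text, embed_fn, size=200, overlap=50):
    """Embed each chunk with embed_fn and average the chunk vectors."""
    chunks = chunk_words(text, size, overlap)
    return mean_vector([embed_fn(chunk) for chunk in chunks])


def toy_embed(chunk):
    """Toy stand-in for a real encoder: (word count, character count)."""
    return [float(len(chunk.split())), float(len(chunk))]


print(chunk_words("a b c d e f g", size=4, overlap=2))
# prints: ['a b c d', 'c d e f', 'e f g']
print(embed_long_text("a b c d e f g", toy_embed, size=4, overlap=2))
```

In the notebook, `embed_fn` could wrap `generate_bert_embeddings([chunk])[0]`, at the cost of many more forward passes per episode.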


**Step 6: Query Processing**

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.


In [31]:
def queryProcessing(query):
    query_tovec=tfidf_vectorizer.transform([query])
    tfidf_similarities=cosine_similarity(tfidf_matrix,query_tovec)
    df_similarities = pd.DataFrame(tfidf_similarities, columns=['Similaridad'])
    df_similarities['episodes']=df['title']
    
    query_to_bert=[f'{query}']
    print(query_to_bert)
    query_bert = generate_bert_embeddings(query_to_bert)
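    # Infer the shapes from the data instead of hard-coding 319 episodes and 768 dimensions
    similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))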
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = df['title']
    return df_similarities, similarities_df
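
Both rankings rest on cosine similarity between the query vector and each episode vector. As a minimal illustration of the formula on plain Python lists, the sketch below defines a hypothetical helper `cosine`; the vectors and the `ep_a`, `ep_b`, `ep_c` labels are made-up toy values, not data from the podcast corpus.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


query = [1.0, 0.0, 1.0]
docs = {"ep_a": [2.0, 0.0, 2.0], "ep_b": [0.0, 3.0, 0.0], "ep_c": [1.0, 1.0, 0.0]}
for name, vec in docs.items():
    print(name, round(cosine(query, vec), 3))
# prints: ep_a 1.0, ep_b 0.0, ep_c 0.5 (one per line)
```

The score depends only on direction, not length: `ep_a` points the same way as the query and scores 1.0 even though it is twice as long.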
    
    

**Step 7: Retrieve and Compare Results**

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [32]:
def retrievalTop(query):
    df_resultsTFIDF,df_resultsBERT=queryProcessing(query)
    df_resultsTFIDFSorted= df_resultsTFIDF.sort_values(by='Similaridad', ascending=False)
    df_resultsBERTSorted=df_resultsBERT.sort_values(by='sim',ascending=False)
    return df_resultsBERTSorted,df_resultsTFIDFSorted
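
When only the first few hits are needed, sorting the whole result set is not strictly required. As a sketch of the same retrieval idea outside pandas, the hypothetical helper `top_k` below uses the standard library's `heapq.nlargest` to pick the best-scoring episodes; the scores and titles are illustrative values only.

```python
import heapq


def top_k(scores, titles, k=10):
    """Return the k (score, title) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, titles), key=lambda pair: pair[0])


scores = [0.10, 0.72, 0.03, 0.55]          # illustrative values only
titles = ["ep_a", "ep_b", "ep_c", "ep_d"]  # hypothetical episode titles
print(top_k(scores, titles, k=2))
# prints: [(0.72, 'ep_b'), (0.55, 'ep_d')]
```

For a corpus of a few hundred episodes the difference from a full sort is negligible, so this is mainly a matter of clarity.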
    
    
    

**Step 8: Test the IR System**

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

In [34]:
def print_results(dfBERT, dfTFIDF):
    print("\nTop results using BERT:")
    print(dfBERT.head(10))
    
    print("\nTop results using TFIDF:")
    print(dfTFIDF.head(10))

def interactive_console():
    while True:
        query = input("Insert the query: ")
        
        df_resultsBERTSorted, df_resultsTFIDFSorted = retrievalTop(query)
        
        print_results(df_resultsBERTSorted, df_resultsTFIDFSorted)
        
        print("\nOptions:")
        print("1. New Query")
        print("2. Exit")
        
        user_choice = input("Select an option (1 or 2): ").strip()
        
        if user_choice == '2':
            print("Exiting the system. Goodbye!")
            break
        elif user_choice == '1':
            continue
        else:
            print("Invalid option. Please select '1' for New Query or '2' for Exit.")

interactive_console()

['gpt']

Top results using BERT:
          sim                                                 ep
216  0.709173  Virtual Reality, Social Media & the Future of ...
49   0.703856    Neuralink, AI, Autopilot, and the Pale Blue Dot
199  0.669967                        Totalitarianism and Anarchy
133  0.666933  On the Nature of Good and Evil, Genius and Mad...
39   0.660287                                             iRobot
153  0.659555  Aliens, Black Holes, and the Mystery of the Ou...
96   0.657686           Going Big in Business, Investing, and AI
163  0.654897  Sleep, Dreams, Creativity & the Limits of the ...
34   0.654667        Machines Who Think and the Early Days of AI
273  0.654259        Bitcoin, Inflation, and the Future of Money

Top results using TFIDF:
     Similaridad                                           episodes
213     0.099371  OpenAI Codex, GPT-3, Robotics, and the Future ...
17      0.032536                                     OpenAI and AGI
94      0.028676      

**Step 9: Compare Results**

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

Comparing the BERT and TF-IDF results across queries shows clear differences in behaviour. For the query `gpt`, TF-IDF ranks *"OpenAI Codex, GPT-3, Robotics, and the Future ..."* first, followed by *"OpenAI and AGI"*. Both episodes are directly about the query term, although the absolute scores are low (about 0.099 and 0.033), as can be expected when a one-word query is compared with a long transcript. BERT returns much higher scores in a narrow band (about 0.65 to 0.71), and its top results mix loosely related episodes such as *"Neuralink, AI, Autopilot, and the Pale Blue Dot"* and *"Machines Who Think and the Early Days of AI"* with episodes that have no obvious link to GPT, such as *"Totalitarianism and Anarchy"* and *"Bitcoin, Inflation, and the Future of Money"*. For this query, TF-IDF is the more precise of the two.

For the query `virtuality`, BERT returns episodes on reality, science, and philosophy, such as *"Revolutionary Ideas in Science, Math, and Society"* and *"Nature of Reality, Dreams, and Consciousness"*, which are thematically close to the query. The TF-IDF results are less clearly related: episodes such as *"Python and the Source Code of Humans, Computer ..."* and *"Life 3.0"* do not match the theme of the query as well.

For the query `physics`, TF-IDF returns episodes that are directly about the term, such as *"String Theory"* and *"Physics View of the Mind and Neurobiology"*. The BERT results are weaker here: *"Totalitarianism and Anarchy"* and *"Virtual Reality, Social Media & the Future of ..."* are not about physics, and both also appear near the top of the BERT ranking for `gpt`, which suggests that the BERT scores in this setup depend only weakly on the query.

Overall, the two representations have complementary strengths. TF-IDF works well when the query term appears literally in the relevant transcripts, as with `gpt` and `physics`, but it cannot match synonyms or related concepts. BERT embeddings can in principle capture semantic relationships and context, which helps with abstract queries such as `virtuality`, but the results in this workshop are noisy. Two likely contributing factors are that the tokenizer truncates each transcript to the model's 512-token limit, so only the opening of each episode is embedded, and that the raw `[CLS]` vector of `bert-base-uncased` is not trained as a sentence-similarity representation.
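
The comparison above is qualitative. One way to make it quantitative, assuming a small set of hand-labelled relevant episodes per query is available (the notebook does not include one), is to compute precision at k for each method and the overlap between the two top-k lists. The sketch below uses hypothetical helper names, rankings, and labels purely for illustration.

```python
def precision_at_k(ranked_titles, relevant_titles, k=10):
    """Fraction of the top-k ranked titles that are labelled relevant."""
    top = ranked_titles[:k]
    if not top:
        return 0.0
    return sum(title in relevant_titles for title in top) / len(top)


def jaccard_overlap(ranking_a, ranking_b, k=10):
    """Jaccard similarity between the top-k sets of two rankings."""
    set_a, set_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical rankings and relevance labels, for illustration only.
bert_ranking = ["ep_1", "ep_2", "ep_3", "ep_4"]
tfidf_ranking = ["ep_3", "ep_5", "ep_1", "ep_6"]
relevant = {"ep_3", "ep_5"}

print(precision_at_k(bert_ranking, relevant, k=4))        # prints: 0.25
print(precision_at_k(tfidf_ranking, relevant, k=4))       # prints: 0.5
print(round(jaccard_overlap(bert_ranking, tfidf_ranking, k=4), 3))  # prints: 0.333
```

In the notebook, the rankings could be taken from the `ep` and `episodes` columns of the sorted DataFrames returned by `retrievalTop`.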
