## Workshop: Building an Information Retrieval System for Podcast Episodes

**Objective:**
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

**Instructions:**

**Step 1: Import Libraries**
Import necessary libraries for data handling, text processing, and machine learning.


In [1]:
import re
import os
import nltk
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**Step 2: Load the Dataset**
Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript
Download it in CSV format and save it locally (for example in a `data` folder), then adjust the path in the next cell to match where the file was saved.

In [2]:
df=pd.read_csv("/content/podcastdata_dataset.csv")
df

Unnamed: 0,id,guest,title,text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...
...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ..."
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ..."
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...


**Step 3: Text Preprocessing**

- Lowercase the text and remove punctuation
- Remove stop words


In [3]:
corpus=df['text']
corpus.head(10)

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
5    The following is a conversation with Guido van...
6    The following is a conversation with Jeff Atwo...
7    The following is a conversation with Eric Schm...
8    The following is a conversation with Stuart Ru...
9    The following is a conversation with Peter Abb...
Name: text, dtype: object

In [4]:
# First, we delete punctuation

corpus_nopunct = []
for doc in corpus:
    corpus_nopunct.append(doc.lower().translate(str.maketrans('', '', string.punctuation)))


In [5]:
print(corpus_nopunct[:10])



In [6]:
#Add the text without punctuation
df['text_nopunct'] = corpus_nopunct

In [7]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
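# split() without arguments splits on any whitespace (spaces, tabs, newlines),
# so words next to a line break are also checked against the stop-word list
corpus_nostopw = [
    ' '.join(word for word in doc.split() if word not in stop_words)
    for doc in corpus_nopunct
]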
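
One side effect of removing punctuation first is that contractions lose their apostrophe (`you've` becomes `youve`), so they no longer match stop-word entries that are spelled with an apostrophe. The `df.head(10)` output shows `youve` surviving in `text_nostopw`. A minimal sketch of one possible fix, assuming the stop-word list contains apostrophe forms (the tiny hand-written set below is a stand-in for the NLTK list, and `remove_stop_words` is a hypothetical helper, not part of the notebook):

```python
import string

# Assumption: a small hand-written set stands in for stopwords.words('english')
stop_words = {"you've", "the", "is", "don't", "a"}

# Strip punctuation from the stop words the same way it was stripped from the text
strip_punct = str.maketrans('', '', string.punctuation)
stop_words_nopunct = {w.translate(strip_punct) for w in stop_words}

def remove_stop_words(doc, stops):
    # split() with no argument splits on any run of whitespace, including newlines
    return ' '.join(w for w in doc.split() if w not in stops)

doc = "youve studied the human mind\nthe brain is complex"
cleaned = remove_stop_words(doc, stop_words_nopunct)
print(cleaned)
```

With the normalized set, `youve` is treated as a stop word just like `the` and `is`.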

In [9]:
corpus_nopunct[300]

'the following is a conversation with brian armstrong cofounder and ceo of coinbase the largest cryptocurrency exchange platform with 98 million users in 100 countries listing bitcoin ethereum cardano and over 100 popular cryptocurrencies i recorded this conversation with brian before this weeks sec probe into whether some of the crypto listings are securities and thus need to be regulated as such as always with conversations that involve cryptocurrency i try to make it timeless so that the price soaring high or crashing down low doesnt distract from the fundamental technological economic social and philosophical ideas underlying this new form of money energy and information our world runs on money the exchange and store of value and cryptocurrency seeks to build the next chapter of how money works and what it can do coinbase and brian are trying to do this by working together with regulators and governments which is a long and difficult road bureaucracies resist change for better and 

In [10]:
df['text_nostopw'] = corpus_nostopw

In [11]:
df.head(10)

Unnamed: 0,id,guest,title,text,text_nopunct,text_nostopw
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",as part of mit course 6s099 artificial general...,part mit course 6s099 artificial general intel...
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,as part of mit course 6s099 on artificial gene...,part mit course 6s099 artificial general intel...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",youve studied the human mind cognition languag...,youve studied human mind cognition language vi...
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,what difference between biological neural netw...,difference biological neural networks artifici...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,the following is a conversation with vladimir ...,following conversation vladimir vapnik hes co ...
5,6,Guido van Rossum,Python,The following is a conversation with Guido van...,the following is a conversation with guido van...,following conversation guido van rossum creato...
6,7,Jeff Atwood,Stack Overflow and Coding Horror,The following is a conversation with Jeff Atwo...,the following is a conversation with jeff atwo...,following conversation jeff atwood cofounder s...
7,8,Eric Schmidt,Google,The following is a conversation with Eric Schm...,the following is a conversation with eric schm...,following conversation eric schmidt ceo google...
8,9,Stuart Russell,Long-Term Future of AI,The following is a conversation with Stuart Ru...,the following is a conversation with stuart ru...,following conversation stuart russell hes prof...
9,10,Pieter Abbeel,Deep Reinforcement Learning,The following is a conversation with Peter Abb...,the following is a conversation with peter abb...,following conversation peter abbeel hes profes...


**Step 4: Vector Space Representation - TF-IDF**

Create TF-IDF vector representations of the transcripts.

In [12]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(df['text_nostopw'])
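
To make the TF-IDF step concrete, here is a small sketch on a toy corpus (the three documents are made up for illustration and are not taken from the dataset). It fits a `TfidfVectorizer`, projects a one-word query into the same vocabulary, and ranks the documents by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up toy documents, standing in for the preprocessed transcripts
toy_docs = [
    "quantum physics string theory",
    "python programming language design",
    "physics mind neurobiology",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(toy_docs)   # sparse matrix, one row per document

# The query must be transformed with the already fitted vectorizer
query_vec = vectorizer.transform(["physics"])
scores = cosine_similarity(doc_matrix, query_vec).ravel()

# Document indices ordered from most to least similar
ranking = scores.argsort()[::-1]
print(scores)
print(ranking)
```

Documents that do not contain the query term get a score of exactly zero, which is the behaviour to keep in mind when comparing TF-IDF with BERT later on.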

**Step 5: Vector Space Representation - BERT**

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [13]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

In [14]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [15]:
# Load the pre-trained BERT model (the tokenizer was loaded in the previous cell)

model = TFBertModel.from_pretrained('bert-base-uncased')


In [16]:
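import numpy as np

def generate_bert_embeddings(texts):
    """Return an array of shape (len(texts), 768) with one [CLS] vector per text."""
    embeddings = []
    for text in texts:
        # bert-base-uncased accepts at most 512 tokens, so longer texts are truncated
        inputs = tokenizer(text, return_tensors='tf', truncation=True, max_length=512)
        outputs = model(**inputs)
        # [CLS] token representation, shape (1, 768)
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())
    # Stack the (1, 768) rows into a single 2-D array
    return np.vstack(embeddings)

corpus_bert = generate_bert_embeddings(corpus)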
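
Because `bert-base-uncased` accepts at most 512 tokens, calling the tokenizer with `truncation=True` means only the beginning of each transcript contributes to its embedding. One common workaround, sketched below, is to split the text into overlapping windows, embed each window, and average the vectors. This is an illustrative sketch and not the notebook's method: `chunk_tokens`, `embed_long_text`, and the window sizes are hypothetical, whitespace words are used as a rough proxy for BERT tokens, and `fake_embed` is a placeholder for a real embedding function.

```python
import numpy as np

def chunk_tokens(tokens, window=200, stride=150):
    """Split a token list into overlapping windows of at most `window` tokens."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

def embed_long_text(text, embed_fn, window=200, stride=150):
    """Embed each window with `embed_fn` and return the mean vector."""
    chunks = chunk_tokens(text.split(), window, stride)
    vectors = np.vstack([embed_fn(' '.join(chunk)) for chunk in chunks])
    return vectors.mean(axis=0)

# Placeholder embedder (assumption): returns a 2-D feature vector per chunk.
# In the notebook, a function that returns one BERT vector per text would go here.
def fake_embed(chunk_text):
    words = chunk_text.split()
    return np.array([len(words), len(set(words))], dtype=float)

windows = chunk_tokens(list(range(500)))
print(len(windows))                         # number of windows for 500 tokens
print(embed_long_text("a b c", fake_embed))
```

Averaging window vectors is only one design choice. Taking the maximum similarity over windows at query time is another option that tends to favour episodes with a short but focused passage on the topic.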

**Step 6: Query Processing**

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.


In [17]:
def queryProcessing(query):
    # TF-IDF: project the query into the fitted vocabulary and compare it with every episode
    query_tovec=tfidf_vectorizer.transform([query])
    tfidf_similarities=cosine_similarity(tfidf_matrix,query_tovec)
    df_similarities = pd.DataFrame(tfidf_similarities, columns=['Similaridad'])
    df_similarities['episodes']=df['title']

    # BERT: embed the query and compare it with the precomputed episode embeddings
    query_to_bert=[query]
    print(query_to_bert)
    query_bert = generate_bert_embeddings(query_to_bert)
    # Flatten to 2-D without hard-coding the number of episodes or the embedding size
    similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = df['title']
    return df_similarities, similarities_df



**Step 7: Retrieve and Compare Results**

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [18]:
def retrievalTop(query):
    df_resultsTFIDF,df_resultsBERT=queryProcessing(query)
    df_resultsTFIDFSorted= df_resultsTFIDF.sort_values(by='Similaridad', ascending=False)
    df_resultsBERTSorted=df_resultsBERT.sort_values(by='sim',ascending=False)
    return df_resultsBERTSorted,df_resultsTFIDFSorted
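
The retrieval step boils down to "score every episode against the query, then keep the highest scores". A compact sketch of that idea, independent of the representation used, is shown below. The helper `top_k` and the toy vectors are hypothetical and only illustrate the pattern:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def top_k(doc_vectors, query_vector, titles, k=3):
    """Return the k titles whose vectors are most similar to the query vector."""
    # Force 2-D shapes so the function works for any number of documents
    doc_vectors = np.asarray(doc_vectors).reshape(len(titles), -1)
    query_vector = np.asarray(query_vector).reshape(1, -1)
    sims = cosine_similarity(doc_vectors, query_vector).ravel()
    result = pd.DataFrame({'title': titles, 'similarity': sims})
    return result.nlargest(k, 'similarity').reset_index(drop=True)

# Toy 2-D "embeddings" (made up) for three episodes
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_titles = ['A', 'B', 'C']

top2 = top_k(toy_vectors, np.array([1.0, 0.0]), toy_titles, k=2)
print(top2)
```

Using `nlargest` avoids sorting the full table when only the first few rows are needed.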




**Step 8: Test the IR System**

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

In [19]:
def print_results(dfBERT, dfTFIDF):
    print("\nTop results using BERT:")
    print(dfBERT.head(10))

    print("\nTop results using TFIDF:")
    print(dfTFIDF.head(10))

def interactive_console():
    while True:
        query = input("Insert the query: ")

        df_resultsBERTSorted, df_resultsTFIDFSorted = retrievalTop(query)

        print_results(df_resultsBERTSorted, df_resultsTFIDFSorted)

        print("\nOptions:")
        print("1. New Query")
        print("2. Exit")

        user_choice = input("Select an option (1 or 2): ").strip()

        if user_choice == '2':
            print("Exiting the system. Goodbye!")
            break
        elif user_choice == '1':
            continue
        else:
            print("Invalid option. Please select '1' for New Query or '2' for Exit.")

interactive_console()

Insert the query: Revolutionary Ideas in Science
['Revolutionary Ideas in Science']

Top results using BERT:
          sim                                                 ep
199  0.765462                        Totalitarianism and Anarchy
16   0.759926  Revolutionary Ideas in Science, Math, and Society
273  0.757006        Bitcoin, Inflation, and the Future of Money
133  0.745524  On the Nature of Good and Evil, Genius and Mad...
210  0.741545       Nature of Reality, Dreams, and Consciousness
161  0.740891  The Future of Computing, AI, Life, and Conscio...
295  0.739649  IQ Tests, Human Intelligence, and Group Differ...
5    0.736096                                             Python
306  0.736014              Life, Death, Power, Fame, and Meaning
164  0.733208  Philosophy of Violence, Power, and the Martial...

Top results using TFIDF:
     Similaridad                                           episodes
87      0.029435     Evolution, Intelligence, Simulation, and Memes
78      0.0261

**Step 9: Compare Results**

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

When analyzing the results from BERT and TF-IDF for different queries, clear differences in their behavior become apparent. For the query `['gpt']`, BERT returns episodes about artificial intelligence and technology in general. The top results include titles like *"Virtual Reality, Social Media & the Future of ..."* and *"Neuralink, AI, Autopilot, and the Pale Blue Dot"*, which are topically close to "GPT" even though the titles do not contain the term. In contrast, TF-IDF yields results that are less contextually aligned with the query. Although it includes the directly relevant episode *"OpenAI Codex, GPT-3, Robotics, and the Future ..."*, the overall relevance of its list is lower than that of BERT’s results.

For the query `['virtuality']`, BERT again demonstrates its strength by returning results that are closely related to the themes of reality, science, and philosophy. Titles like *"Revolutionary Ideas in Science, Math, and Society"* and *"Nature of Reality, Dreams, and Consciousness"* show that BERT can effectively interpret and match the semantic context of "virtuality." On the other hand, TF-IDF results are less specific and often less relevant. For instance, results such as *"Python and the Source Code of Humans, Computer ..."* and *"Life 3.0"* do not align as well with the thematic essence of the query.

When the query is `['physics']`, BERT favors conceptual over literal matches. Results such as *"Totalitarianism and Anarchy"* (not directly relevant, although it could be interpreted within a broader context) and *"Virtual Reality, Social Media & the Future of ..."* show that its ranking follows loose thematic associations with "physics" rather than the term itself. Conversely, TF-IDF provides results that are directly related to the specific term "physics," such as *"String Theory"* and *"Physics View of the Mind and Neurobiology."* These results are more precise in terms of keyword matching, although they may miss broader contextual relevance.

Overall, BERT excels in understanding and capturing the context and semantic relationships of queries, making it more effective for complex and abstract searches. It provides results that reflect a deeper comprehension of the query’s intent. In contrast, TF-IDF is better suited for searches where exact term matching is crucial. Although it may offer more straightforward results for specific keywords, it often lacks the ability to grasp the deeper contextual meaning of the query.
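
The comparison above is qualitative. One simple way to put a number on how much the two methods agree is to measure the overlap between their top-k lists. The sketch below computes the shared items and the Jaccard index for two rankings. `topk_overlap` is a hypothetical helper and the two lists are made-up placeholders, not results from the notebook:

```python
def topk_overlap(ranked_a, ranked_b, k=10):
    """Return the items shared by two top-k lists and their Jaccard index."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Placeholder rankings (assumption): episode identifiers ordered by similarity
bert_ranking = ['ep_a', 'ep_b', 'ep_c']
tfidf_ranking = ['ep_d', 'ep_b', 'ep_e']

shared, jaccard = topk_overlap(bert_ranking, tfidf_ranking, k=3)
print(shared, jaccard)
```

A Jaccard index near 0 means the two representations retrieve almost disjoint sets of episodes, while a value near 1 means they largely agree. It says nothing about which list is more relevant, so it complements a manual review of the titles rather than replacing it.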
