# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [33]:
%pip install transformers
import pandas as pd
import string
import numpy as np
import time
from transformers import BertTokenizer, TFBertModel



### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [34]:
df = pd.read_csv('podcastdata_dataset.csv')
df

Unnamed: 0,id,guest,title,text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...
...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ..."
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ..."
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...


In [35]:
corpus = df['text']
corpus

0      As part of MIT course 6S099, Artificial Genera...
1      As part of MIT course 6S099 on artificial gene...
2      You've studied the human mind, cognition, lang...
3      What difference between biological neural netw...
4      The following is a conversation with Vladimir ...
                             ...                        
314    By the time he gets to 2045, we'll be able to ...
315    there's a broader question here, right? As we ...
316    Once this whole thing falls apart and we are c...
317    you could be the seventh best player in the wh...
318    turns out that if you train a planarian and th...
Name: text, Length: 319, dtype: object


### Step 3: Text Preprocessing

Lowercase the transcripts, strip punctuation, and remove stopwords.

In [36]:
# Lowercase each transcript and delete punctuation.
# The translation table is built once instead of once per document.
punct_table = str.maketrans('', '', string.punctuation)
corpus_nopunct = [doc.lower().translate(punct_table) for doc in corpus]

In [37]:
print(corpus_nopunct[:1])

['as part of mit course 6s099 artificial general intelligence ive gotten the chance to sit down with max tegmark he is a professor here at mit hes a physicist spent a large part of his career studying the mysteries of our cosmological universe but hes also studied and delved into the beneficial possibilities and the existential risks of artificial intelligence amongst many other things he is the cofounder of the future of life institute author of two books both of which i highly recommend first our mathematical universe second is life 30 hes truly an out of the box thinker and a fun personality so i really enjoy talking to him if youd like to see more of these videos in the future please subscribe and also click the little bell icon to make sure you dont miss any videos also twitter linkedin agimitedu if you wanna watch other lectures or conversations like this one better yet go read maxs book life 30 chapter seven on goals is my favorite its really where philosophy and engineering com
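To make the effect of this step concrete, here is a minimal sketch on a single made-up sentence. The helper name `strip_punctuation` and the sample sentence are illustrative assumptions, not part of the notebook. It shows the same side effect visible in the output above, where "Life 3.0" becomes "life 30" and contractions lose their apostrophes.

```python
import string


def strip_punctuation(text: str) -> str:
    """Lowercase the text and delete every ASCII punctuation character."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table)


# Hypothetical sample sentence, chosen to show what gets lost
sample = "He's the author of Life 3.0, isn't he?"
print(strip_punctuation(sample))
# prints: hes the author of life 30 isnt he
```

Deleting punctuation outright is simple, but it merges tokens such as "3.0" into "30"; replacing punctuation with a space would be one alternative if that matters for a query.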

In [38]:
df['text_nopunct'] = corpus_nopunct
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


### Stopwords

In [39]:
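import nltk
from nltk.corpus import stopwords

# Fetch the stopword list if it is not installed yet
nltk.download('stopwords', quiet=True)

stopw = set(stopwords.words('english'))
print(len(stopw))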

179


In [40]:
text_no_stopwords = []

for doc in corpus_nopunct:
    # split() with no argument also drops the empty tokens left by repeated spaces
    clean_doc = [word for word in doc.split() if word not in stopw]
    text_no_stopwords.append(' '.join(clean_doc))

In [42]:
print(len(text_no_stopwords))

319


In [43]:
df['text_no_stopwords'] = text_no_stopwords
df

Unnamed: 0,id,guest,title,text,text_nopunct,text_no_stopwords
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",as part of mit course 6s099 artificial general...,part mit course 6s099 artificial general intel...
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,as part of mit course 6s099 on artificial gene...,part mit course 6s099 artificial general intel...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",youve studied the human mind cognition languag...,youve studied human mind cognition language vi...
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,what difference between biological neural netw...,difference biological neural networks artifici...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,the following is a conversation with vladimir ...,following conversation vladimir vapnik hes co ...
...,...,...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ...",by the time he gets to 2045 well be able to mu...,time gets 2045 well able multiply intelligence...
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ...",theres a broader question here right as we bui...,theres broader question right build socially e...
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...,once this whole thing falls apart and we are c...,whole thing falls apart climbing kudzu vines s...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...,you could be the seventh best player in the wh...,could seventh best player whole world like lit...
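The filtering step can be sketched on one transcript opening without NLTK. The stopword set below is a tiny hypothetical stand-in for the real list, and `remove_stopwords` is an illustrative name. Note that tokens such as `hes`, `youve`, and `theres` are still present in the `text_no_stopwords` column above: punctuation was stripped before filtering, and the apostrophe-less forms are apparently not matched by the list.

```python
# Hypothetical mini stopword list, only for illustration
SAMPLE_STOPWORDS = {"the", "is", "a", "of", "with", "he"}


def remove_stopwords(text: str, stopwords: set) -> str:
    """Drop every whitespace-separated token that appears in `stopwords`."""
    return " ".join(word for word in text.split() if word not in stopwords)


doc = "the following is a conversation with vladimir vapnik hes co author"
print(remove_stopwords(doc, SAMPLE_STOPWORDS))
# prints: following conversation vladimir vapnik hes co author
```

If contractions should be removed as well, one option is to filter stopwords before stripping punctuation, so that the original spelling is still intact when it is compared against the list.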


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['text_no_stopwords'])
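To see what `TfidfVectorizer` computes with its default settings, the sketch below reimplements the weighting in plain Python on three toy documents: raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. This is an approximation for illustration only: it tokenises on whitespace, whereas scikit-learn uses its own token pattern, and the function name `tfidf_vectors` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


vocab, rows = tfidf_vectors(["deep learning", "deep work", "poker"])
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the first toy document, "learning" receives a higher weight than "deep" because "deep" also occurs in the second document. The same effect is what lets rare, topical terms dominate a transcript's vector.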

In [45]:
query = "Computer Science"

In [46]:
query_vector = vectorizer.transform([query])

In [47]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tfidf_matrix, query_vector)
similarities

array([[0.04507994],
       [0.0727281 ],
       [0.01451431],
       [0.05681495],
       [0.02340847],
       [0.05354464],
       [0.02846494],
       [0.03237011],
       [0.0169459 ],
       [0.00614624],
       [0.02421822],
       [0.00593825],
       [0.07372364],
       [0.01775848],
       [0.01775848],
       [0.07208534],
       [0.01616015],
       [0.0054441 ],
       [0.04327091],
       [0.00503707],
       [0.00667508],
       [0.01006973],
       [0.        ],
       [0.02116541],
       [0.10470165],
       [0.01085762],
       [0.06837741],
       [0.01049236],
       [0.00295929],
       [0.00373142],
       [0.00665419],
       [0.01236771],
       [0.02354847],
       [0.        ],
       [0.06939767],
       [0.00935121],
       [0.02700536],
       [0.037525  ],
       [0.07715644],
       [0.03107165],
       [0.07983962],
       [0.08368498],
       [0.02933174],
       [0.03290134],
       [0.02826904],
       [0.04788385],
       [0.01707148],
       [0.014
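The scores above are cosine similarities between each transcript vector and the query vector. As a minimal plain-Python sketch (the function name `cosine` and the toy vectors are illustrative), the measure is the dot product divided by the product of the vector lengths. A score of exactly 0, as for two of the episodes above, means the transcript shares no weighted term with the query.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = [1.0, 1.0, 0.0]        # toy query using terms 0 and 1
doc_match = [2.0, 2.0, 0.0]    # same direction, different length
doc_other = [0.0, 0.0, 3.0]    # no shared term

print(round(cosine(query, doc_match), 6))
print(cosine(query, doc_other))
```

Because the measure ignores vector length, a long transcript is not favoured merely for containing more words.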

In [48]:
similarities_df = pd.DataFrame(similarities, columns=["similarity"])
similarities_df['title'] = df['title']

In [49]:
similarities_df

Unnamed: 0,similarity,title
0,0.045080,Life 3.0
1,0.072728,Consciousness
2,0.014514,AI in the Age of Reason
3,0.056815,Deep Learning
4,0.023408,Statistical Learning
...,...,...
314,0.036157,"Singularity, Superintelligence, and Immortality"
315,0.018635,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.000945,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.003397,Poker


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [50]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [51]:
def generate_bert_embeddings(texts):
    """Return one [CLS] embedding per text, shaped (len(texts), 768, 1).

    `texts` must be an iterable of strings; pass a single query as a one-element list.
    """
    embeddings = []
    start_time = time.time()
    for i, text in enumerate(texts):
        if i == 10:  # Measure the time after 10 texts as a sample
            elapsed_time = time.time() - start_time
            estimated_time = (elapsed_time / 10) * len(texts)
            print(f"Estimated time to process the full corpus: {estimated_time / 60:.2f} minutes")

        # truncation=True keeps only the first 512 tokens of each text (the model's input limit)
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    return np.array(embeddings).transpose(0, 2, 1)

In [52]:
corpus_bert = generate_bert_embeddings(corpus)

Estimated time to process the full corpus: 3.00 minutes


In [61]:
corpus.shape

(319,)

In [62]:
query = ["Computer Science"]
query_bert = generate_bert_embeddings(query)

In [63]:
query_bert.shape

(1, 768, 1)
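Because the tokenizer call uses `truncation=True`, only the first 512 tokens of each transcript reach the model. One common workaround, which this notebook does not use, is to split the token ids into overlapping windows, embed each window, and average the window embeddings. The helper below sketches only the windowing step. The names `chunk_token_ids`, `window`, and `overlap` are hypothetical, and the default of 510 assumes two positions are reserved for the [CLS] and [SEP] tokens.

```python
def chunk_token_ids(token_ids, window=510, overlap=50):
    """Split a list of token ids into overlapping windows of at most `window` ids."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the text
    return chunks


# Stand-in for the token ids of a 1200-token transcript
fake_ids = list(range(1200))
print([len(chunk) for chunk in chunk_token_ids(fake_ids)])
# prints: [510, 510, 280]
```

Each window would then be embedded separately, which multiplies the processing time by the number of windows per transcript.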

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

In [55]:
def retrieve_tfidf(query):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(tfidf_matrix, query_vector)
    similarities_df = pd.DataFrame(similarities, columns=["similarity"])
    similarities_df['title'] = df['title']
    return similarities_df

In [56]:
retrieve_tfidf("Artificial Intelligence")

Unnamed: 0,similarity,title
0,0.101050,Life 3.0
1,0.096279,Consciousness
2,0.205319,AI in the Age of Reason
3,0.069249,Deep Learning
4,0.063585,Statistical Learning
...,...,...
314,0.024883,"Singularity, Superintelligence, and Immortality"
315,0.018644,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.004331,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.001274,Poker


In [57]:
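def retrieve_bert(query):
    # `query` must be a one-element list of strings, e.g. ["Artificial Intelligence"]
    query_bert = generate_bert_embeddings(query)
    # Flatten the (n, 768, 1) embeddings to (n, 768) without hard-coding the corpus size
    corpus_matrix = corpus_bert.reshape(len(corpus_bert), -1)
    query_matrix = query_bert.reshape(1, -1)
    similarities = cosine_similarity(corpus_matrix, query_matrix)
    similarities_df = pd.DataFrame(similarities, columns=["similarity"])
    similarities_df['title'] = df['title']
    return similarities_df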

In [64]:
retrieve_bert(["Artificial Intelligence"])

Unnamed: 0,similarity,title
0,0.678347,Life 3.0
1,0.711803,Consciousness
2,0.688676,AI in the Age of Reason
3,0.629812,Deep Learning
4,0.712927,Statistical Learning
...,...,...
314,0.655902,"Singularity, Superintelligence, and Immortality"
315,0.621022,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.682716,"Comedy, MADtv, AI, Friendship, Madness, and Pro Wrestling"
317,0.692973,Poker



### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [59]:
def sort_by_similarity(results_df):
    # Highest similarity first
    return results_df.sort_values(by='similarity', ascending=False)
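When only the top few results are needed, a full sort is not required. The sketch below picks the best `k` entries from (similarity, title) pairs using the standard library; `top_k` is an illustrative name, and the five pairs are the first rows of the TF-IDF output for "Artificial Intelligence" shown in Step 6.

```python
import heapq


def top_k(scored_titles, k=3):
    """Return the k (similarity, title) pairs with the highest similarity."""
    return heapq.nlargest(k, scored_titles, key=lambda pair: pair[0])


scored = [
    (0.101050, "Life 3.0"),
    (0.096279, "Consciousness"),
    (0.205319, "AI in the Age of Reason"),
    (0.069249, "Deep Learning"),
    (0.063585, "Statistical Learning"),
]

for score, title in top_k(scored):
    print(f"{score:.6f}  {title}")
```

On a DataFrame like the ones returned by the retrieval functions, `DataFrame.nlargest(10, 'similarity')` gives the equivalent result in one call.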

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

#### Top Results TF-IDF

In [60]:
tfidf_result_df = retrieve_tfidf("Artificial Intelligence")

sorted_tfidf_df = sort_by_similarity(tfidf_result_df)
print(sorted_tfidf_df[:10])

     similarity                                                title
2      0.205319                              AI in the Age of Reason
61     0.160961     Concepts, Analogies, Common Sense & Future of AI
119    0.149631                             Measures of Intelligence
38     0.142291         Keras, Deep Learning, and the Progress of AI
295    0.136549  IQ Tests, Human Intelligence, and Group Differences
12     0.132376                          Brains, Minds, and Machines
91     0.101213  Square, Cryptocurrency, and Artificial Intelligence
0      0.101050                                             Life 3.0
1      0.096279                                        Consciousness
75     0.093354     Universal Artificial Intelligence, AIXI, and AGI


#### Top Results BERT

In [65]:
bert_results_df = retrieve_bert(["Artificial Intelligence"])

sorted_bert_df = sort_by_similarity(bert_results_df)
print(sorted_bert_df[:10])

     similarity                                                        title
199    0.747838                                  Totalitarianism and Anarchy
295    0.744217          IQ Tests, Human Intelligence, and Group Differences
292    0.735387                                                     DeepMind
273    0.735202                  Bitcoin, Inflation, and the Future of Money
165    0.734717      Deep Work, Focus, Productivity, Email, and Social Media
216    0.732198  Virtual Reality, Social Media & the Future of Humans and AI
133    0.731452           On the Nature of Good and Evil, Genius and Madness
210    0.728670                 Nature of Reality, Dreams, and Consciousness
306    0.727041                        Life, Death, Power, Fame, and Meaning
12     0.726703                                  Brains, Minds, and Machines


### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
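One simple way to quantify how much the two methods agree is to measure the overlap between their top-10 lists. The sketch below uses the row indices printed in Step 8 for the query "Artificial Intelligence"; the function name `top_k_overlap` is an assumption, and the Jaccard index is just one possible agreement measure.

```python
def top_k_overlap(ranking_a, ranking_b):
    """Return the shared items and the Jaccard index of two result lists."""
    set_a, set_b = set(ranking_a), set(ranking_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Row indices of the top-10 results printed in Step 8
tfidf_top10 = [2, 61, 119, 38, 295, 12, 91, 0, 1, 75]
bert_top10 = [199, 295, 292, 273, 165, 216, 133, 210, 306, 12]

shared, jaccard = top_k_overlap(tfidf_top10, bert_top10)
print(shared, round(jaccard, 3))
# prints: [12, 295] 0.111
```

Only two episodes ("Brains, Minds, and Machines" and "IQ Tests, Human Intelligence, and Group Differences") appear in both lists, so the two representations rank this corpus very differently.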


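In this exercise we observed that TF-IDF retrieves better results than BERT for this query, because the episodes retrieved through TF-IDF are more closely related to the query entered. Most of the TF-IDF top-10 titles mention AI or intelligence explicitly, whereas the BERT top 10 is led by "Totalitarianism and Anarchy", and the BERT scores shown fall in a narrow band (roughly 0.62 to 0.75), so they separate relevant from irrelevant episodes poorly. A likely contributing factor is that each transcript is truncated to BERT's input limit and represented by a single [CLS] vector from a model that was not fine-tuned for similarity search.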