# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:

Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

## Instructions:
### Step 1: Import Libraries

Import necessary libraries for data handling, text processing, and machine learning.

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

Find the dataset in: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

### Step 3: Text Preprocessing

You know what to do ;)

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

## Instructions:


*   Follow the steps outlined above to implement the IR system.
*   Run the provided code snippets to understand how each part of the system works.
*   Test the system with various queries to observe the results from both TF-IDF and BERT representations.
*   Compare and analyze the results. Discuss the pros and cons of each method.
*   Document your findings and any improvements you make to the system.


### Step 1: Import Libraries

Import necessary libraries for data handling, text processing, and machine learning.

In [2]:
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity


### Step 2: Load the Dataset

Load the dataset of podcast transcripts.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv("/content/drive/MyDrive/bases_de_datos_para_colab/podcastdata_dataset.csv", delimiter=",")
print(df.head())
print(df.shape)

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  
(319, 4)


### Step 3: Text Preprocessing


*   Delete punctuation
*   Delete stop words



In [5]:
corpus = df['text']
print(corpus.head())

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


Delete punctuation

In [6]:
corpus_nopunct = []
for doc in corpus:
    corpus_nopunct.append(doc.lower().translate(str.maketrans('', '', string.punctuation)))


In [7]:
corpus_nopunct[:4]

['as part of mit course 6s099 artificial general intelligence ive gotten the chance to sit down with max tegmark he is a professor here at mit hes a physicist spent a large part of his career studying the mysteries of our cosmological universe but hes also studied and delved into the beneficial possibilities and the existential risks of artificial intelligence amongst many other things he is the cofounder of the future of life institute author of two books both of which i highly recommend first our mathematical universe second is life 30 hes truly an out of the box thinker and a fun personality so i really enjoy talking to him if youd like to see more of these videos in the future please subscribe and also click the little bell icon to make sure you dont miss any videos also twitter linkedin agimitedu if you wanna watch other lectures or conversations like this one better yet go read maxs book life 30 chapter seven on goals is my favorite its really where philosophy and engineering com

In [8]:
df['text_nopunct'] = corpus_nopunct
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


Delete stop words

In [9]:
nltk.download('stopwords')  # Download the NLTK stop-word lists
stopw = set(stopwords.words('english'))  # Load the English stop words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
len(stopw)

179

In [11]:
corpus_nostopw = []
for doc in corpus_nopunct:
    clean_doc = []
    # split() without an argument also collapses repeated whitespace
    doc_array = doc.split()
    for word in doc_array:
        if word not in stopw:
            clean_doc.append(word)
    corpus_nostopw.append(' '.join(clean_doc))
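
The two preprocessing steps above can be combined into a single helper. The following is a minimal sketch, not the notebook's own code: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for the NLTK list so that the example runs without a download.

```python
import string

# Tiny illustrative stop-word set (an assumption for this sketch);
# the notebook uses the full NLTK English list instead.
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split() with no argument collapses runs of whitespace, so no empty tokens survive
    tokens = [word for word in no_punct.split() if word not in stop_words]
    return " ".join(tokens)

print(preprocess("The future of AI, and the limits of computing!"))
```

Calling `split()` without an argument avoids the empty tokens that `split(' ')` produces when a text contains consecutive spaces. Note also that removing punctuation before filtering turns contractions into forms such as `doesnt`, which survive in the sample output shown in this step, presumably because the stop-word list stores them with apostrophes.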

In [12]:
len(corpus_nostopw)

319

In [13]:
corpus_nostopw[300]

'following conversation brian armstrong cofounder ceo coinbase largest cryptocurrency exchange platform 98 million users 100 countries listing bitcoin ethereum cardano 100 popular cryptocurrencies recorded conversation brian weeks sec probe whether crypto listings securities thus need regulated always conversations involve cryptocurrency try make timeless price soaring high crashing low doesnt distract fundamental technological economic social philosophical ideas underlying new form money energy information world runs money exchange store value cryptocurrency seeks build next chapter money works coinbase brian trying working together regulators governments long difficult road bureaucracies resist change better worse latest sec probe good representation serious attempt limit fraud one also runs risk limiting innovation limiting financial freedom individuals complicated mess applaud everyone involved trying work hope end interest individual wins decentralization hedge corrupting nature c

In [14]:
df['text_nostopw'] = corpus_nostopw
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  \
0  as part of mit course 6s099 artificial general...   
1  as part of mit course 6s099 on artificial gene...   
2  youve studied the human mind cognition languag...   
3  what difference between biological neural netw...   
4  the following is a conversation with vladimir ...   

                   


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.


In [15]:
vectorizer = TfidfVectorizer()
tfidf_mtx = vectorizer.fit_transform(df['text_nostopw'])

In [16]:
query = 'Computer Science'

In [17]:
query_vector = vectorizer.transform([query])

In [18]:
similarities = cosine_similarity(tfidf_mtx, query_vector)
type(similarities)

numpy.ndarray

In [19]:
similarities_df = pd.DataFrame(similarities, columns=['sim'])
similarities_df['ep'] = df['title']
print(similarities_df.head())

        sim                       ep
0  0.045080                 Life 3.0
1  0.072728            Consciousness
2  0.014514  AI in the Age of Reason
3  0.056815            Deep Learning
4  0.023408     Statistical Learning


In [20]:
similarities_df

          sim                                                 ep
0    0.045080                                           Life 3.0
1    0.072728                                      Consciousness
2    0.014514                            AI in the Age of Reason
3    0.056815                                      Deep Learning
4    0.023408                               Statistical Learning
..        ...                                                ...
314  0.036157    Singularity, Superintelligence, and Immortality
315  0.018635   Emotion AI, Social Robots, and Self-Driving Cars
316  0.000945  Comedy, MADtv, AI, Friendship, Madness, and Pr...
317  0.003397                                              Poker
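
To make the scores above less opaque, here is a small from-scratch sketch of TF-IDF weighting and cosine similarity on a toy corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, so it is not meant to reproduce the notebook's numbers. The function names and the three toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + freq)) + 1
            for term, freq in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalised TF-IDF weights as a sparse dict; unseen terms are ignored."""
    counts = Counter(tok for tok in tokens if tok in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (a plain dot product)."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

# Toy corpus (hypothetical, already lowercased and free of punctuation)
toy_docs = [
    "deep learning neural networks",
    "bitcoin money cryptocurrency",
    "computer science algorithms learning",
]
tokenized = [doc.split() for doc in toy_docs]
idf = fit_idf(tokenized)
doc_vectors = [tfidf_vector(doc, idf) for doc in tokenized]
query_vector = tfidf_vector("computer science".split(), idf)
scores = [cosine(vec, query_vector) for vec in doc_vectors]
print(scores)
```

Only the third toy document shares terms with the query, so it is the only one with a non-zero score. This illustrates why TF-IDF scores drop to zero whenever a query term never appears literally in a document.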



### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.


In [21]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [22]:
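def generate_bert_embeddings(texts):
    """Embed each text as the final-layer [CLS] vector of bert-base-uncased."""
    embeddings = []
    for text in texts:
        # truncation=True cuts each input to the model's 512-token limit,
        # so only the opening of a long transcript is embedded
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # [CLS] vector, shape (1, 768)
    # Stack to (n, 1, 768), then swap the last two axes to get (n, 768, 1)
    return np.array(embeddings).transpose(0, 2, 1)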

In [23]:
corpus_bert = generate_bert_embeddings(corpus) #Full corpus

In [24]:
corpus_bert.shape

(319, 768, 1)

In [25]:
query = ['Computer Science']
query_bert = generate_bert_embeddings(query)

In [28]:
query_bert.shape

(1, 768, 1)

In [30]:
similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))
type(similarities)

numpy.ndarray
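
Because `truncation=True` limits each input to BERT's 512-token maximum, a long transcript is represented by its opening only. One possible workaround, sketched below under stated assumptions, is to split a transcript into overlapping word windows, embed each window, and average the resulting vectors. `chunk_words` and `mean_vector` are hypothetical helpers, and the 200-word window is an assumed size: word counts only approximate WordPiece token counts, so the tokenizer's truncation should stay enabled as a safety net.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

def mean_vector(vectors):
    """Element-wise average of equal-length vectors (one vector per chunk)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Small demonstration with ten "words" and a window of four
demo_text = " ".join(str(i) for i in range(10))
print(chunk_words(demo_text, chunk_size=4, overlap=2))
print(mean_vector([[1.0, 2.0], [3.0, 4.0]]))
```

In the notebook, each chunk would be embedded with `generate_bert_embeddings` and the per-chunk vectors averaged into one vector per episode. Averaging is only one of several pooling choices and is used here for simplicity.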


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.


BERT

Embed the query with the BERT tokenizer and model, then build a DataFrame that holds each episode's cosine similarity and title.

In [33]:
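def retrieve_bert(query):
    """Score every episode against `query` (a one-element list of strings)."""
    query_bert = generate_bert_embeddings(query)
    # Flatten the (n, 768, 1) embeddings to (n, 768) before computing cosine similarity
    similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = df['title']
    return similarities_df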

In [34]:
bert_similarity_query=retrieve_bert(['gpt'])

TF-IDF

Vectorize the query with the fitted TF-IDF vectorizer, then build a DataFrame that holds each episode's cosine similarity and title.

In [35]:
def retrieve_tfidf(query):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(tfidf_mtx, query_vector)
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = df['title']
    return similarities_df

In [36]:
tfidf_similarity_query=retrieve_tfidf('gpt')

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [37]:
def ordenar_df(similarities_df):
    """Sort a similarity DataFrame from most to least similar."""
    return similarities_df.sort_values(by='sim', ascending=False)
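
`ordenar_df` sorts the whole DataFrame. When only the best few matches are needed, a heap-based selection avoids a full sort. The sketch below is an illustrative alternative, not part of the notebook: `top_k` is a hypothetical helper, and the scores and titles are the first five rows of the TF-IDF output for the query 'Computer Science' shown in Step 4.

```python
import heapq

def top_k(scores, titles, k=10):
    """Return the k (score, title) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, titles), key=lambda pair: pair[0])

# First five similarity rows from the TF-IDF output for 'Computer Science'
sample_scores = [0.045080, 0.072728, 0.014514, 0.056815, 0.023408]
sample_titles = [
    "Life 3.0",
    "Consciousness",
    "AI in the Age of Reason",
    "Deep Learning",
    "Statistical Learning",
]

for score, title in top_k(sample_scores, sample_titles, k=2):
    print(f"{score:.6f}  {title}")
```

For 319 episodes the difference in cost is negligible, so sorting the DataFrame as the notebook does is a reasonable choice at this scale.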

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

TF-IDF

In [38]:
idf_top_df = retrieve_tfidf('gpt')
top_idf = ordenar_df(idf_top_df)
print(top_idf[:10])

          sim                                                 ep
213  0.099371  OpenAI Codex, GPT-3, Robotics, and the Future ...
17   0.032536                                     OpenAI and AGI
94   0.028676                                      Deep Learning
120  0.028510                    Friendship with an AI Companion
117  0.025214  Math, Manim, Neural Networks & Teaching with 3...
119  0.011263                           Measures of Intelligence
130  0.011053  The Future of Computing and Programming Languages
276  0.007757                         Sara Walker and Lee Cronin
35   0.007033         fast.ai Deep Learning Courses and Research
266  0.006228  Origin of Life, Aliens, Complexity, and Consci...


BERT

In [40]:
bert_top_df = retrieve_bert(['gpt'])
bert_top = ordenar_df(bert_top_df)
print(bert_top[:10])

          sim                                                 ep
216  0.709173  Virtual Reality, Social Media & the Future of ...
49   0.703856    Neuralink, AI, Autopilot, and the Pale Blue Dot
199  0.669967                        Totalitarianism and Anarchy
133  0.666933  On the Nature of Good and Evil, Genius and Mad...
39   0.660287                                             iRobot
153  0.659555  Aliens, Black Holes, and the Mystery of the Ou...
96   0.657686           Going Big in Business, Investing, and AI
163  0.654897  Sleep, Dreams, Creativity & the Limits of the ...
34   0.654668        Machines Who Think and the Early Days of AI
273  0.654259        Bitcoin, Inflation, and the Future of Money


### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

**Conclusion**

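For the sample query 'gpt', the two representations rank the episodes very differently. TF-IDF returns low absolute scores (at most 0.099), but its top hits are clearly on topic: "OpenAI Codex, GPT-3, Robotics, and the Future ...", "OpenAI and AGI", and "Deep Learning". BERT returns much higher scores (0.65 to 0.71), but they are tightly clustered, and its top episodes ("Virtual Reality, Social Media & the Future of ...", "Totalitarianism and Anarchy", "Bitcoin, Inflation, and the Future of Money") have no obvious connection to the query. A higher cosine score therefore does not imply a more relevant result, and scores from the two methods should not be compared directly.

In terms of strengths and weaknesses, TF-IDF is easy to use, cheap to compute, and effective when the query term appears literally in the transcripts, but it cannot capture context or match synonyms and paraphrases. BERT can in principle model meaning and context, which should help with more complex queries, but it is more resource-intensive and complex to use. In this notebook its results are also limited by two implementation choices: `truncation=True` restricts each transcript to BERT's 512-token input limit, so only the opening of each episode is embedded, and the raw `[CLS]` vector of `bert-base-uncased` has not been fine-tuned for similarity search. In conclusion, TF-IDF gives the more relevant results for this keyword query, and the BERT pipeline would likely need chunking of long transcripts and a model trained for sentence similarity before its contextual understanding translates into better retrieval.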
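
One simple way to quantify how much the two rankings agree is the overlap of their top-k lists. The sketch below is an illustrative addition: `overlap_at_k` is a hypothetical helper, and the ID lists are the DataFrame indices copied from the two top-10 outputs for 'gpt' in Step 8.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Return the shared items and the Jaccard overlap of two top-k lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# DataFrame indices of the top-10 episodes for the query 'gpt' (from Step 8)
tfidf_ids = [213, 17, 94, 120, 117, 119, 130, 276, 35, 266]
bert_ids = [216, 49, 199, 133, 39, 153, 96, 163, 34, 273]

shared, jaccard = overlap_at_k(tfidf_ids, bert_ids)
print(sorted(shared), jaccard)  # → [] 0.0
```

The two top-10 lists share no episode, so the Jaccard overlap is 0. Repeating the measurement over several queries would show whether this disagreement is specific to 'gpt' or systematic.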