# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

## Steps:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

Find the dataset at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

### Step 3: Text Preprocessing

You know what to do ;)

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
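
As a rough illustration of the similarity step, the sketch below computes cosine similarity between one query vector and a matrix of document vectors using only NumPy. The function name `cosine_scores` and the toy vectors are assumptions made for this example; the workshop itself can rely on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine_scores(doc_matrix, query_vector):
    """Cosine similarity between each row of doc_matrix and query_vector."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float)
    dots = doc_matrix @ query_vector
    # Product of the vector lengths; zero when either vector is all zeros.
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vector)
    # Documents sharing no terms with the query (or empty vectors) score 0.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Toy 2-dimensional vectors, purely illustrative.
toy_docs = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_query = [1, 0]
print(cosine_scores(toy_docs, toy_query))
```

Because the score depends only on the angle between vectors, a long transcript is not favoured over a short one simply for containing more words.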

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
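
One simple way to put a number on the comparison is the overlap between the two top-k lists. The sketch below computes a Jaccard overlap; the helper name `overlap_at_k` and the example rankings are hypothetical, and this is only one of several reasonable comparison measures.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k items of two rankings (0.0 to 1.0)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Hypothetical rankings of episode ids, best match first.
tfidf_ranking = [29, 272, 192, 157, 216]
bert_ranking = [272, 29, 10, 11, 12]
print(overlap_at_k(tfidf_ranking, bert_ranking, k=5))
```

A value near 1 means both methods surface mostly the same episodes; a value near 0 suggests they capture different notions of relevance and is worth inspecting by hand.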

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [1]:
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, TFBertModel
import numpy as np
import multiprocessing
from tqdm import tqdm

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

Find the dataset at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [2]:

df = pd.read_csv("data/podcastdata_dataset.csv")

### Step 3: Text Preprocessing

- Lowercase the text and remove punctuation
- Remove stopwords

In [3]:
corpus = df["text"]
print(corpus)

0      As part of MIT course 6S099, Artificial Genera...
1      As part of MIT course 6S099 on artificial gene...
2      You've studied the human mind, cognition, lang...
3      What difference between biological neural netw...
4      The following is a conversation with Vladimir ...
                             ...                        
314    By the time he gets to 2045, we'll be able to ...
315    there's a broader question here, right? As we ...
316    Once this whole thing falls apart and we are c...
317    you could be the seventh best player in the wh...
318    turns out that if you train a planarian and th...
Name: text, Length: 319, dtype: object


In [4]:
# Build the translation table once, then lowercase and strip punctuation from every transcript
punct_table = str.maketrans("", "", string.punctuation)
corpus_nopunct = [doc.lower().translate(punct_table) for doc in corpus]

In [5]:
df["text_nopunct"] = corpus_nopunct
print(df.head())    

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


In [6]:
stopw = set(stopwords.words("english"))

In [7]:
print(len(stopw))

179


In [8]:
corpus_nostw = []
for doc in corpus_nopunct:
    clean_doc = []
    doc_array = doc.split(" ")
    for word in doc_array:
        if word not in stopw:
            clean_doc.append(word)
    corpus_nostw.append(" ".join(clean_doc))

In [9]:
print(len(corpus_nostw[0]))

44733


In [10]:
print(corpus_nostw[0])

part mit course 6s099 artificial general intelligence ive gotten chance sit max tegmark professor mit hes physicist spent large part career studying mysteries cosmological universe hes also studied delved beneficial possibilities existential risks artificial intelligence amongst many things cofounder future life institute author two books highly recommend first mathematical universe second life 30 hes truly box thinker fun personality really enjoy talking youd like see videos future please subscribe also click little bell icon make sure dont miss videos also twitter linkedin agimitedu wanna watch lectures conversations like one better yet go read maxs book life 30 chapter seven goals favorite really philosophy engineering come together opens quote dostoevsky mystery human existence lies staying alive finding something live lastly believe every failure rewards us opportunity learn sense ive fortunate fail many new exciting ways conversation different ive learned something called radio f

In [11]:
df["text_stopw"] = corpus_nostw
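
The cleaning steps above can also be wrapped in a single helper, so that a query can be cleaned exactly like the transcripts. This is a minimal sketch: `preprocess` and `demo_stop_words` are hypothetical names, and the tiny stopword set stands in for the NLTK list used in the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, stop_words):
    """Lowercase, strip punctuation, and drop stopwords from one document."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

# Tiny hypothetical stopword set, only for this demo.
demo_stop_words = {"the", "is", "a"}
print(preprocess("The host is asking: what is a neural network?", demo_stop_words))
# prints "host asking what neural network"
```

Note that this sketch uses `split()` instead of `split(" ")`, so whitespace runs and newlines do not leave empty tokens, and character counts may differ slightly from the outputs shown above. Also, stripping punctuation before filtering turns contractions such as "don't" into "dont", which no longer matches an apostrophe-bearing stopword entry; this would explain why tokens like `ive`, `hes`, and `dont` survive in the cleaned transcript.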

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.
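
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults; treat it as an illustration rather than a reimplementation of scikit-learn. The function name `tfidf_vectors` and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency: rare terms get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Toy corpus, already tokenized.
docs = [
    "music streaming music".split(),
    "computer science".split(),
    "computer music".split(),
]
vocab, vectors = tfidf_vectors(docs)
for tok, weight in zip(vocab, vectors[0]):
    print(f"{tok:10s} {weight:.3f}")
```

In the second toy document, "science" receives a higher weight than "computer" because it appears in fewer documents, which is the behaviour that lets TF-IDF favour distinctive terms.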


In [12]:
vectorizer = TfidfVectorizer()
tfidf_mtx = vectorizer.fit_transform(df["text_stopw"])

In [13]:
tfidf_mtx

<319x49728 sparse matrix of type '<class 'numpy.float64'>'
	with 754029 stored elements in Compressed Sparse Row format>

In [14]:
query = "Computer science"

In [15]:
query_vector = vectorizer.transform([query])

In [16]:
similarities = cosine_similarity(tfidf_mtx,query_vector)

In [17]:
similarities

array([[0.04507994],
       [0.0727281 ],
       [0.01451431],
       [0.05681495],
       [0.02340847],
       [0.05354464],
       [0.02846494],
       [0.03237011],
       [0.0169459 ],
       [0.00614624],
       [0.02421822],
       [0.00593825],
       [0.07372364],
       [0.01775848],
       [0.01775848],
       [0.07208534],
       [0.01616015],
       [0.0054441 ],
       [0.04327091],
       [0.00503707],
       [0.00667508],
       [0.01006973],
       [0.        ],
       [0.02116541],
       [0.10470165],
       [0.01085762],
       [0.06837741],
       [0.01049236],
       [0.00295929],
       [0.00373142],
       [0.00665419],
       [0.01236771],
       [0.02354847],
       [0.        ],
       [0.06939767],
       [0.00935121],
       [0.02700536],
       [0.037525  ],
       [0.07715644],
       [0.03107165],
       [0.07983962],
       [0.08368498],
       [0.02933174],
       [0.03290134],
       [0.02826904],
       [0.04788385],
       [0.01707148],
       [0.014

In [18]:
similarities_df = pd.DataFrame(similarities, columns=["Sim"])
similarities_df["Episodio"] = df["title"]
similarities_df

Unnamed: 0,Sim,Episodio
0,0.045080,Life 3.0
1,0.072728,Consciousness
2,0.014514,AI in the Age of Reason
3,0.056815,Deep Learning
4,0.023408,Statistical Learning
...,...,...
314,0.036157,"Singularity, Superintelligence, and Immortality"
315,0.018635,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.000945,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.003397,Poker


In [19]:
def retrieve (query):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(tfidf_mtx,query_vector)
    similarities_df = pd.DataFrame(similarities,columns=["Sim"])
    similarities_df["Episodio"] = df["title"]
    return similarities_df.sort_values(by="Sim", ascending=False)

In [20]:
retrieve("music")

Unnamed: 0,Sim,Episodio
29,0.383218,Spotify
272,0.100674,Legendary Music Producer
192,0.091897,The Existential Threat of Engineered Viruses a...
157,0.086500,The Next Generation of Big Ideas and Brave Minds
216,0.068726,"Virtual Reality, Social Media & the Future of ..."
...,...,...
83,0.000000,Simulation and Superintelligence
212,0.000000,Isaac Newton and the Philosophy of Science
213,0.000000,"OpenAI Codex, GPT-3, Robotics, and the Future ..."
214,0.000000,Viruses and Vaccines
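
The ranking above lists every episode, including those with a similarity of zero. A small helper that keeps only the best k matches with a non-zero score may be easier to read. The sketch below is one possible version; `top_k` is a hypothetical name, and the example scores are the ones printed above for the query "music".

```python
import numpy as np

def top_k(scores, titles, k=5):
    """Return up to k (title, score) pairs, best first, skipping zero scores."""
    scores = np.asarray(scores, dtype=float).ravel()  # accepts (n,) or (n, 1)
    order = np.argsort(-scores, kind="stable")[:k]
    return [(titles[i], float(scores[i])) for i in order if scores[i] > 0]

titles = ["Simulation and Superintelligence", "Spotify", "Legendary Music Producer"]
scores = [[0.0], [0.383218], [0.100674]]  # column shape, like cosine_similarity output
for title, score in top_k(scores, titles, k=5):
    print(f"{score:.6f}  {title}")
```

Dropping zero scores makes it obvious when a query matches fewer than k episodes, which happens whenever a query term is missing from most transcripts.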


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.
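
A caveat worth keeping in mind: `bert-base-uncased` accepts at most 512 tokens per input, so with truncation enabled only the beginning of each transcript is embedded (the first cleaned transcript alone is 44,733 characters long). One common workaround is to split the token sequence into windows, embed each window, and average the results. The sketch below shows only the windowing and averaging logic. `embed_long_text` is a hypothetical helper, `embed_fn` is a placeholder for a real BERT call, and `fake_embed` is a stand-in used so the example runs without downloading a model.

```python
import numpy as np

def embed_long_text(tokens, embed_fn, window=510):
    """Average the embeddings of consecutive token windows.

    window=510 leaves room for the [CLS] and [SEP] tokens within BERT's 512 limit.
    """
    if not tokens:
        raise ValueError("cannot embed an empty token list")
    chunks = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    chunk_vectors = np.stack([embed_fn(chunk) for chunk in chunks])
    return chunk_vectors.mean(axis=0)

def fake_embed(chunk):
    """Stand-in embedder (assumption): returns a 2-d vector instead of BERT output."""
    return np.array([float(len(chunk)), 1.0])

# 1200 dummy tokens are split into windows of 510, 510, and 180 tokens.
vec = embed_long_text(list(range(1200)), fake_embed)
print(vec)
```

Plain averaging weights a short final window as much as a full one; weighting by window length is a possible refinement.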

In [21]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [22]:
df_bert = pd.DataFrame(columns=["Indices","bert_embeding","Text"])
df_bert["Indices"] = df["title"]
df_bert["Text"] = df["text_stopw"]
df_bert["bert_embeding"] = np.zeros(len(df), dtype=int)  # placeholder column, one zero per episode
df_bert

Unnamed: 0,Indices,bert_embeding,Text
0,Life 3.0,0,part mit course 6s099 artificial general intel...
1,Consciousness,0,part mit course 6s099 artificial general intel...
2,AI in the Age of Reason,0,youve studied human mind cognition language vi...
3,Deep Learning,0,difference biological neural networks artifici...
4,Statistical Learning,0,following conversation vladimir vapnik hes co ...
...,...,...,...
314,"Singularity, Superintelligence, and Immortality",0,time gets 2045 well able multiply intelligence...
315,"Emotion AI, Social Robots, and Self-Driving Cars",0,theres broader question right build socially e...
316,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",0,whole thing falls apart climbing kudzu vines s...
317,Poker,0,could seventh best player whole world like lit...


In [23]:
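def generate_bert_embeddings(texts):
    """Return an array of shape (len(texts), 768) with one [CLS] vector per text."""
    if isinstance(texts, str):
        texts = [texts]  # a bare string would otherwise be iterated character by character
    embeddings = []
    for text in texts:
        # BERT accepts at most 512 tokens, so longer transcripts are truncated.
        inputs = tokenizer(text, return_tensors='tf', truncation=True, max_length=512)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())  # Use [CLS] token representation
    return np.vstack(embeddings)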

In [24]:
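# Pass a list of documents. A bare string is iterated character by character
# (one BERT forward pass per character), which is why the original call,
# generate_bert_embeddings(corpus_nostw[0]), had to be interrupted.
res = generate_bert_embeddings([corpus_nostw[0]])
res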

In [25]:
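# Embed every transcript one at a time with a progress bar. Sharing a TensorFlow
# model across multiprocessing workers can hang or fail, so a plain loop is used.
bert_embeddings = np.vstack(
    [np.asarray(generate_bert_embeddings([doc])).reshape(1, -1) for doc in tqdm(corpus_nostw)]
)

print(bert_embeddings.shape)  # one row per episode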

In [None]:
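def retrieve_bert(query):
    # Embed the query with the same BERT model used for the transcripts.
    query_bert = np.asarray(generate_bert_embeddings([query])).reshape(1, -1)
    # bert_embeddings is expected to hold one row per episode, shape (n_episodes, 768).
    similarities_bert = cosine_similarity(bert_embeddings, query_bert)
    similarities_df = pd.DataFrame(similarities_bert, columns=["Sim"])
    similarities_df["Episodio"] = df["title"]
    return similarities_df.sort_values(by="Sim", ascending=False)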

In [None]:
retrieve_bert("Computer Science")

Unnamed: 0,Sim,Episodio
0,0.647408,Life 3.0
