## Workshop: Building an Information Retrieval System for Podcast Episodes

**Objective:**
Build an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes in which the host and guest discuss the query topic. Represent the transcripts in vector space with both TF-IDF and BERT, and compare the results of the two approaches.

**Instructions:**

**Step 1: Import Libraries**
Import necessary libraries for data handling, text processing, and machine learning.


In [10]:
import re
import os
import nltk
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**Step 2: Load the Dataset**
Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

**Step 3: Text Preprocessing**

You know what to do ;)

In [14]:
df=pd.read_csv("data/podcastdata_dataset.csv")
df

Unnamed: 0,id,guest,title,text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...
...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ..."
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ..."
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...


In [15]:
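# Lemmatizer is instantiated here but not applied in the cells below
lemmatizer = WordNetLemmatizer()
# Uncomment on first run to download the NLTK stop-word list
#nltk.download('stopwords')
stop_words = set(stopwords.words('english'))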

In [16]:
corpus=df['text']

In [17]:
# Lowercase each transcript and strip all punctuation
punct_table = str.maketrans('', '', string.punctuation)
corpus_nopunct = [doc.lower().translate(punct_table) for doc in corpus]

In [18]:
df['text_nopunct'] = corpus_nopunct
df.head()

Unnamed: 0,id,guest,title,text,text_nopunct
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",as part of mit course 6s099 artificial general...
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,as part of mit course 6s099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",youve studied the human mind cognition languag...
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,what difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,the following is a conversation with vladimir ...


In [19]:
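# Remove English stop words from each punctuation-free transcript
corpus_nost3 = []
for doc in corpus_nopunct:
    doc_array = doc.split(' ')
    # Keep only the words that are not in the stop-word set
    kept_words = [word for word in doc_array if word not in stop_words]
    corpus_nost3.append(" ".join(kept_words))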

In [21]:
df['text_nostopw']=corpus_nost3
df.head()

Unnamed: 0,id,guest,title,text,text_nopunct,text_nostopw
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",as part of mit course 6s099 artificial general...,part mit course 6s099 artificial general intel...
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,as part of mit course 6s099 on artificial gene...,part mit course 6s099 artificial general intel...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",youve studied the human mind cognition languag...,youve studied human mind cognition language vi...
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,what difference between biological neural netw...,difference biological neural networks artifici...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,the following is a conversation with vladimir ...,following conversation vladimir vapnik hes co ...


**Step 4: Vector Space Representation - TF-IDF**

Create TF-IDF vector representations of the transcripts.

In [22]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(df['text_nostopw'])
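
To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a toy corpus, using the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that `TfidfVectorizer` applies with its default settings. The toy documents and the helper name `tfidf_by_hand` are illustrative assumptions, and the whitespace tokenisation is a simplification of the vectorizer's own tokeniser.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalise so every document vector has unit length
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus (hypothetical, not taken from the podcast dataset)
toy_docs = [
    "deep learning neural networks",
    "quantum computing",
    "deep quantum physics",
]
vocab, rows = tfidf_by_hand(toy_docs)
```

In the first toy document, `learning` appears in only one document while `deep` appears in two, so `learning` receives the larger weight. The same effect is what pushes rare, topic-specific words to the top in the real transcripts.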

**Step 5: Vector Space Representation - BERT**

Create BERT vector representations of the transcripts using a pre-trained BERT model.
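
BERT models accept a limited input length (512 tokens for the standard pre-trained checkpoints), which is far shorter than a full episode transcript. One possible way around this, sketched below, is to split each transcript into overlapping word windows, embed every window, and average the window vectors into a single document vector. The helper names, the window sizes, and the `toy_encoder` stand-in are all assumptions made for illustration; in practice `encode_chunks` would be replaced by a real model call, for example the `encode` method of a `sentence-transformers` model.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

def embed_document(text, encode_chunks, chunk_size=200, overlap=50):
    """Mean-pool the chunk embeddings into one document vector."""
    chunks = chunk_words(text, chunk_size, overlap)
    if not chunks:
        raise ValueError("cannot embed an empty document")
    vectors = encode_chunks(chunks)  # one vector per chunk
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def toy_encoder(chunks):
    # Hypothetical stand-in for a real model: [word count, character count]
    return [[float(len(c.split())), float(len(c))] for c in chunks]

doc_vector = embed_document("a b c d e f g h i j", toy_encoder,
                            chunk_size=4, overlap=2)
```

Word counts are only a rough proxy for BERT's subword token counts, so the window size should be chosen conservatively. Mean pooling is one simple aggregation choice; keeping the per-chunk vectors and scoring each chunk against the query is another.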

**Step 6: Query Processing**

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
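
A query should go through the same cleaning as the transcripts before it is vectorised, otherwise its tokens may not line up with the document vocabulary. The sketch below mirrors the preprocessing used earlier (lowercasing, punctuation stripping, stop-word removal) and adds a plain cosine similarity. The function names and the tiny `toy_stop_words` set are illustrative assumptions; in the workshop the NLTK `stop_words` set would be passed instead.

```python
import math
import string

def preprocess(text, stop_words):
    """Lowercase, strip punctuation, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans('', '', string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in stop_words)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction, so treat its similarity as 0
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical stop-word set, much smaller than the NLTK list
toy_stop_words = {"the", "of", "is", "what"}
clean_query = preprocess("What is the future of Computer Science?", toy_stop_words)
```

With the toy stop-word set, `clean_query` is `"future computer science"`. The cleaned query can then be passed to whichever vectoriser produced the document vectors.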


**Step 7: Retrieve and Compare Results**

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [23]:
def retrieval(query):
    # Project the query into the fitted TF-IDF space
    query_tovec = tfidf_vectorizer.transform([query])
    # Cosine similarity between every transcript and the query, shape (n_episodes, 1)
    similarities = cosine_similarity(tfidf_matrix, query_tovec)
    df_similarities = pd.DataFrame(similarities, columns=['Similaridad'])
    df_similarities['episodes'] = df['title']
    # Most similar episodes first
    df_similarities_sorted = df_similarities.sort_values(by='Similaridad', ascending=False)
    return df_similarities_sorted


In [24]:
df_results = retrieval('Computer Science')

In [26]:
df_results.head(10)

Unnamed: 0,Similaridad,episodes
109,0.110994,Computer Vision
70,0.108095,"Moore’s Law, Microprocessors, Abstractions, an..."
236,0.105548,National Institutes of Health (NIH)
24,0.104702,"Affective Computing, Emotion, Privacy, and Health"
78,0.101648,"Cosmos, Carl Sagan, Voyager, and the Beauty of..."
217,0.100617,"Programming, Algorithms, Hard Problems & the G..."
72,0.097796,Quantum Computing
87,0.088561,"Evolution, Intelligence, Simulation, and Memes"
62,0.087663,"Algorithms, TeX, Life, and The Art of Computer..."
41,0.083685,"Quantum Mechanics, String Theory, and Black Holes"
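
One simple way to compare two rankings of the same episodes, for instance one from TF-IDF and one from BERT, is to take the top-k titles from each and measure how much the two sets overlap. The sketch below uses the Jaccard index for that. The helper names, the episode labels, and the scores are made up for illustration and are not results from the dataset.

```python
def top_k(scores, titles, k=10):
    """Return the k titles with the highest scores, best first."""
    ranked = sorted(zip(scores, titles), key=lambda pair: pair[0], reverse=True)
    return [title for _, title in ranked[:k]]

def overlap_at_k(results_a, results_b):
    """Jaccard overlap between two top-k result lists."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Synthetic scores for four placeholder episodes
titles = ["ep A", "ep B", "ep C", "ep D"]
scores_first = [0.9, 0.1, 0.5, 0.3]
scores_second = [0.2, 0.8, 0.7, 0.1]

top_first = top_k(scores_first, titles, k=2)
top_second = top_k(scores_second, titles, k=2)
agreement = overlap_at_k(top_first, top_second)
```

An overlap near 1 means the two representations retrieve largely the same episodes, while a low overlap is a cue to inspect the differing titles by hand and judge which ranking matches the query better.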
