# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

Instructions:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [24]:
import pandas as pd
import string
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [2]:
csv_file_path = 'data/podcastdata_dataset.csv'

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file_path)

In [3]:
df

Unnamed: 0,id,guest,title,text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...
...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ..."
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ..."
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...


### Step 3: Text Preprocessing

Apply the usual cleaning steps: lowercase the text, strip punctuation, remove stop words, and lemmatize with spaCy.

In [16]:
stop_words_file = 'data/stopwords.txt'

with open(stop_words_file, 'r') as file:
    stop_words = set(file.read().split())

nlp = spacy.load('en_core_web_sm')

In [17]:
def preprocess_and_lemmatize_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize on whitespace
    words = text.split()

    # Remove stop words
    words = [word for word in words if word not in stop_words]

    # Join the remaining words back into a single string
    text = ' '.join(words)

    # Lemmatize the text with spaCy
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ for token in doc])
    
    return lemmatized_text


In [18]:
df['cleaned_text'] = df['text'].apply(preprocess_and_lemmatize_text)

In [19]:
df

Unnamed: 0,id,guest,title,text,cleaned_text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",part mit 6s099 artificial general intelligence...
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,part mit 6s099 artificial general intelligence...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",you ve study human mind cognition language vis...
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,difference biological neural network artificia...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,conversation vladimir vapnik he s inventor sup...
...,...,...,...,...,...
314,321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ...",time 2045 multiply intelligence million fold h...
315,322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ...",broad question build socially emotionally inte...
316,323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...,thing fall climb kudzu vine spiral sears tower...
317,324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...,seventh player world literally seventh player ...


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [25]:
# Function to vectorize texts with TF-IDF
def TF_IDF(texts):
    # Fit the vectorizer on the corpus and transform it in one pass
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(texts)
    
    return X_tfidf, tfidf_vectorizer

In [26]:
# Vectorize the documents
X_tfidf_lemmatized, tfidf_vectorizer_lemmatized = TF_IDF(df['cleaned_text'])

# Inspect the TF-IDF results
print("TF-IDF Lemmatized:")
print(X_tfidf_lemmatized.toarray())
print("TF-IDF Lemmatized features:", tfidf_vectorizer_lemmatized.get_feature_names_out())

TF-IDF Lemmatized:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
TF-IDF Lemmatized features: ['00' '000' '00000' ... 'это' 'этот' '들어가']


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.
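
One possible way to approach this step is sketched below. It rests on several assumptions that the workshop does not specify: the `bert-base-uncased` checkpoint loaded through Hugging Face `transformers`, mean pooling over token vectors, and splitting each transcript into word chunks whose embeddings are averaged (BERT accepts at most 512 tokens, and the transcripts are far longer). The names `chunk_words`, `make_bert_encoder`, and `embed_documents` are hypothetical helpers, not part of any library.

```python
import numpy as np


def chunk_words(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def make_bert_encoder(model_name='bert-base-uncased', batch_size=16):
    """Return a function mapping a list of strings to mean-pooled BERT vectors.

    Assumption: torch and transformers are installed; the checkpoint name
    is an illustrative choice, not one prescribed by the workshop.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def encode(texts):
        vectors = []
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True,
                               max_length=512, return_tensors='pt')
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state
            # Mean-pool token vectors, ignoring padding positions.
            mask = inputs['attention_mask'].unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            vectors.append(pooled.numpy())
        return np.vstack(vectors)

    return encode


def embed_documents(texts, encode, chunk_size=200):
    """Embed each document as the mean of its chunk embeddings."""
    doc_vectors = []
    for text in texts:
        # An empty document still gets one (empty) chunk so shapes stay aligned.
        chunks = chunk_words(text, chunk_size) or ['']
        doc_vectors.append(encode(chunks).mean(axis=0))
    return np.vstack(doc_vectors)
```

In the notebook this could be called as `X_bert = embed_documents(df['text'].tolist(), make_bert_encoder())`. It may be worth trying the raw `text` column as well as `cleaned_text`, since BERT was pre-trained on natural sentences rather than on lemmatized, stop-word-free text. Encoding every transcript on CPU can take a long time.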

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
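
A minimal sketch of this step follows, assuming cosine similarity as the scoring function (the workshop does not name one). The key point is that the query goes through the same preprocessing and the same fitted vectorizer or encoder as the documents. The function names and the `preprocess` and `encode` parameters are hypothetical; `encode` stands for any callable that maps a list of strings to a 2D array of embeddings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_query_scores(query, vectorizer, doc_matrix, preprocess=None):
    """Cosine similarity between a query and every TF-IDF document vector."""
    if preprocess is not None:
        query = preprocess(query)
    # transform (not fit_transform) keeps the vocabulary learned from the corpus
    query_vec = vectorizer.transform([query])
    return cosine_similarity(query_vec, doc_matrix).ravel()


def embedding_query_scores(query, encode, doc_embeddings):
    """Cosine similarity between a query embedding and document embeddings."""
    query_vec = np.asarray(encode([query])).reshape(1, -1)
    return cosine_similarity(query_vec, doc_embeddings).ravel()


# Toy corpus standing in for df['cleaned_text'] (illustration only)
docs = ['neural network deep learning',
        'poker bluff tournament',
        'consciousness brain neural']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
scores = tfidf_query_scores('Poker', vectorizer, X, preprocess=str.lower)
```

With the notebook's objects, the TF-IDF call would look like `tfidf_query_scores(query, tfidf_vectorizer_lemmatized, X_tfidf_lemmatized, preprocess=preprocess_and_lemmatize_text)`.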

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.
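
As a rough sketch, retrieval can be reduced to sorting the similarity scores and keeping the best `k`. The helper name `top_k` and its parameters are hypothetical; the same function works for scores from either representation.

```python
import numpy as np


def top_k(scores, titles, k=5):
    """Return the k highest-scoring (title, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, len(scores))
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [(titles[i], float(scores[i])) for i in order]


results = top_k([0.1, 0.7, 0.0, 0.4], ['A', 'B', 'C', 'D'], k=2)
# results == [('B', 0.7), ('D', 0.4)]
```

When passing a DataFrame column, converting it first with `df['title'].tolist()` keeps the positional indexing unambiguous.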

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.
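
One way to display the two result lists together is sketched here. It assumes each retrieval method has already been wrapped in a callable that takes a query string and returns `(title, score)` pairs sorted best first; `show_side_by_side` and the `searchers` dictionary are hypothetical names.

```python
def show_side_by_side(query, searchers, k=5):
    """Print and return the top-k results of each retrieval method for one query.

    searchers maps a method name to a callable: query -> list of
    (title, score) pairs, sorted best first.
    """
    report = {}
    print(f'Query: {query!r}')
    for name, search in searchers.items():
        hits = search(query)[:k]
        report[name] = hits
        print(f'  {name}')
        for rank, (title, score) in enumerate(hits, start=1):
            print(f'    {rank}. {title} ({score:.3f})')
    return report
```

Running it over a handful of queries, some keyword-like ("poker") and some more conceptual ("what makes a machine conscious"), tends to make differences between the two methods easier to spot.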

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
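
To support the discussion with numbers, a simple agreement measure can help. The sketch below uses the Jaccard overlap of the two top-k lists and also lists the episodes each method returns alone. Both helpers are hypothetical. Because the dataset has no relevance labels, these figures only show how much the methods agree, not which one is better.

```python
def overlap_at_k(ranking_a, ranking_b, k=5):
    """Jaccard overlap of the top-k items in two ranked lists (1.0 = same set)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def disagreements(ranking_a, ranking_b, k=5):
    """Items that appear in only one of the two top-k lists, in rank order."""
    top_a, top_b = ranking_a[:k], ranking_b[:k]
    only_a = [item for item in top_a if item not in top_b]
    only_b = [item for item in top_b if item not in top_a]
    return only_a, only_b
```

The episodes returned by `disagreements` are good candidates for manual inspection: reading them shows whether a method matched on exact wording or on related meaning.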

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.