# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.


In [1]:
import pandas as pd  # Data handling
import string  # String operations (punctuation list)
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF vector representation
from sklearn.metrics.pairwise import cosine_similarity  # Cosine similarity between vectors
import tensorflow as tf  # Machine learning backend used by the BERT model
from transformers import BertTokenizer, TFBertModel  # Pre-trained BERT tokenizer and model
import numpy as np  # Array operations
from nltk.corpus import stopwords  # English stop words (requires a one-time nltk.download('stopwords'))




### Step 2: Load the Dataset
Load the dataset of podcast transcripts.

Find the dataset in: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript


In [2]:
df = pd.read_csv('data/podcastdata_dataset.csv')  # Load the podcast transcript dataset
print(df.head())  # Show the first rows to inspect the structure of the data


   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


### Step 3: Text Preprocessing
* Lowercase the text and remove punctuation
* Remove stop words
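
As a rough illustration of these two steps on a single sentence, here is a minimal sketch. The `preprocess` helper and the tiny `TOY_STOPWORDS` set are assumptions made for this example only; the workshop cell itself uses NLTK's full English stop-word list.

```python
import string

# Hypothetical mini stop-word list, for illustration only.
# The workshop uses stopwords.words('english') from NLTK instead.
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "with"}

def preprocess(text, stop_words=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in stop_words)

example = preprocess("The following is a conversation with Vladimir Vapnik.")
print(example)
```

Note that stripping punctuation also removes apostrophes, so a contraction such as "You've" becomes "youve", which is visible in the preprocessed column printed below.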

In [3]:
corpus = df['text']  # Extract the text column from the DataFrame

# Remove punctuation
corpus_nopunct = [doc.lower().translate(str.maketrans('', '', string.punctuation)) for doc in corpus]  # Lowercase and strip punctuation

# Remove stop words
stopw = set(stopwords.words('english'))  # Set of English stop words
corpus_nostopw = [' '.join([word for word in doc.split() if word not in stopw]) for doc in corpus_nopunct]  # Drop stop words

df['text_nostopw'] = corpus_nostopw  # Add the preprocessed text as a new column
print(df.head())  # Show the first rows to verify the changes


   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nostopw  
0  part mit course 6s099 artificial general intel...  
1  part mit course 6s099 artificial general intel...  
2  youve studied human mind cognition language vi...  
3  difference biological neural networks artifici...  
4  following conversation vladimir vapnik hes co ...  


### Step 4: Vector Space Representation - TF-IDF
Create TF-IDF vector representations of the transcripts.
This step transforms the preprocessed text into numerical vectors: each term is weighted by how often it occurs in a transcript and down-weighted if it occurs in many transcripts.
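
To make the weighting concrete, the following is a small pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults, and it uses plain whitespace tokenization, which is a simplification of the library's tokenizer. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed to match smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # Avoid division by zero
        rows.append([x / norm for x in raw])
    return vocab, rows

toy_docs = ["deep learning neural networks", "quantum computing", "deep space"]
vocab, rows = tfidf_vectors(toy_docs)

# "deep" appears in two documents, "learning" in one, so "learning" weighs more.
deep_weight = rows[0][vocab.index("deep")]
learning_weight = rows[0][vocab.index("learning")]
print(deep_weight < learning_weight)
```

The workshop cell relies on `TfidfVectorizer` to do all of this, and stores the result as a sparse matrix rather than nested lists.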


In [4]:
vectorizer = TfidfVectorizer()  # Create the TF-IDF vectorizer
tfidf_mtx = vectorizer.fit_transform(df['text_nostopw'])  # Fit on the preprocessed text and transform it into a TF-IDF matrix


### Step 5: Vector Space Representation - BERT
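Create BERT vector representations of the transcripts using a pre-trained BERT model.
This step transforms the text into contextual embeddings, using the hidden state of the [CLS] token as the vector for each transcript. Because `truncation=True` is set, each transcript is cut to the model's maximum input length (512 tokens for `bert-base-uncased`), and only the first 50 transcripts are embedded.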


In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # Load the pre-trained BERT tokenizer
model = TFBertModel.from_pretrained('bert-base-uncased')  # Load the pre-trained BERT model

def generate_bert_embeddings(texts):
    """Generate BERT embeddings for a list of texts."""
    embeddings = []  # List that collects one embedding per text
    for text in texts:
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)  # Tokenize the text (truncated to the model's maximum length)
        outputs = model(**inputs)  # Run the BERT model
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Keep the representation of the [CLS] token
    return np.array(embeddings).squeeze()  # Return the embeddings as a NumPy array

corpus_bert = generate_bert_embeddings(corpus_nostopw[:50])  # Generate BERT embeddings for the first 50 documents of the corpus


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### Step 6: Query Processing
Define one function per representation. Each function converts the query into the same vector space as the transcripts (TF-IDF or BERT), computes the cosine similarity between the query and every transcript, and returns the scores alongside the episode titles.
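
For intuition, the scoring and ranking idea can be sketched without any libraries. The `cosine` and `rank` helpers and the toy vectors below are hypothetical and exist only for this example; the workshop cells use `cosine_similarity` from scikit-learn on the real TF-IDF and BERT vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs, titles, k=3):
    """Return the k (score, title) pairs with the highest similarity to the query."""
    scored = [(cosine(query_vec, d), t) for d, t in zip(doc_vecs, titles)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy document vectors in a three-term vocabulary.
toy_docs = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_titles = ["A", "B", "C"]
top = rank([1.0, 0.0, 0.0], toy_docs, toy_titles, k=2)
print(top)
```

Cosine similarity depends only on the angle between vectors, not on their length, which is why a short query can be compared directly with a long transcript.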


In [6]:
def retrieve_tfidf(query):
    """Retrieve results based on TF-IDF similarity."""
    query_vector = vectorizer.transform([query])  # Transform the query into a TF-IDF vector
    similarities = cosine_similarity(tfidf_mtx, query_vector)  # Cosine similarity between the query and every document
    similarities_df = pd.DataFrame(similarities, columns=['sim'])  # Build a DataFrame with the similarities
    similarities_df['ep'] = df['title']  # Add the episode titles
    return similarities_df  # Return the results DataFrame


In [7]:
def retrieve_bert(query):
    """Retrieve results based on BERT similarity. `query` must be a list containing one string."""
    query_bert = generate_bert_embeddings(query)  # Generate the BERT embedding for the query
    # Reshape to 2-D without hardcoding the number of documents or the embedding size
    similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))  # Cosine similarity between the query and the embedded documents
    similarities_df = pd.DataFrame(similarities, columns=['sim'])  # Build a DataFrame with the similarities
    similarities_df['ep'] = df['title']  # Add the episode titles (aligned by index to the embedded documents)
    return similarities_df  # Return the results DataFrame

### Step 7: Retrieve and Compare Results
Run a query through both retrieval functions, sort each result by similarity, and print the top 10 episodes for TF-IDF and for BERT side by side.
Note that the BERT ranking only covers the 50 transcripts that were embedded, while the TF-IDF ranking covers the whole dataset.


In [8]:
query = 'Computer Science'  # Define an example query
tfidf_results = retrieve_tfidf(query)  # Retrieve results using TF-IDF
bert_results = retrieve_bert([query])  # Retrieve results using BERT

# Show the top results for both representations
print("TF-IDF Results:\n", tfidf_results.sort_values(by='sim', ascending=False).head(10))
print("BERT Results:\n", bert_results.sort_values(by='sim', ascending=False).head(10))


TF-IDF Results:
           sim                                                 ep
109  0.110994                                    Computer Vision
70   0.108095  Moore’s Law, Microprocessors, Abstractions, an...
236  0.105548                National Institutes of Health (NIH)
24   0.104702  Affective Computing, Emotion, Privacy, and Health
78   0.101648  Cosmos, Carl Sagan, Voyager, and the Beauty of...
217  0.100617  Programming, Algorithms, Hard Problems & the G...
72   0.097796                                  Quantum Computing
87   0.088561     Evolution, Intelligence, Simulation, and Memes
62   0.087663  Algorithms, TeX, Life, and The Art of Computer...
41   0.083685  Quantum Mechanics, String Theory, and Black Holes
BERT Results:
          sim                                                 ep
15  0.604827     Reinforcement Learning, Planning, and Robotics
11  0.583524                              Poker and Game Theory
42  0.579720         Artificial Intelligence: A Modern Approa

### Step 8: Test the IR System
Test the system with further sample queries.
For each query, retrieve and display the top results using both the TF-IDF and BERT representations.


In [9]:
sample_query = 'Artificial Intelligence'  # Define a sample query
tfidf_results_sample = retrieve_tfidf(sample_query)  # Retrieve results using TF-IDF
bert_results_sample = retrieve_bert([sample_query])  # Retrieve results using BERT

# Show the top results for both representations
print("TF-IDF Sample Query Results:\n", tfidf_results_sample.sort_values(by='sim', ascending=False).head(10))
print("BERT Sample Query Results:\n", bert_results_sample.sort_values(by='sim', ascending=False).head(10))


TF-IDF Sample Query Results:
           sim                                                 ep
2    0.205319                            AI in the Age of Reason
61   0.160961   Concepts, Analogies, Common Sense & Future of AI
119  0.149631                           Measures of Intelligence
38   0.142291       Keras, Deep Learning, and the Progress of AI
295  0.136549  IQ Tests, Human Intelligence, and Group Differ...
12   0.132376                        Brains, Minds, and Machines
91   0.101213  Square, Cryptocurrency, and Artificial Intelli...
0    0.101050                                           Life 3.0
1    0.096279                                      Consciousness
75   0.093354   Universal Artificial Intelligence, AIXI, and AGI
BERT Sample Query Results:
          sim                                                 ep
15  0.670569     Reinforcement Learning, Planning, and Robotics
3   0.664713                                      Deep Learning
42  0.663018         Artificial Int

In [10]:
sample_query = 'Spotify'  # Define a sample query
tfidf_results_sample = retrieve_tfidf(sample_query)  # Retrieve results using TF-IDF
bert_results_sample = retrieve_bert([sample_query])  # Retrieve results using BERT

# Show the top results for both representations
print("TF-IDF Sample Query Results:\n", tfidf_results_sample.sort_values(by='sim', ascending=False).head(10))
print("BERT Sample Query Results:\n", bert_results_sample.sort_values(by='sim', ascending=False).head(10))


TF-IDF Sample Query Results:
           sim                                                 ep
29   0.280275                                            Spotify
135  0.063249                                   Hardcore History
192  0.048562  The Existential Threat of Engineered Viruses a...
150  0.035882              Speech Recognition with AI and Humans
261  0.021342                                         Big Pharma
126  0.016533  Conversations, Ideas, Love, Freedom & The Joe ...
278  0.015965              Music, AI, and the Future of Humanity
38   0.015540       Keras, Deep Learning, and the Progress of AI
133  0.013715  On the Nature of Good and Evil, Genius and Mad...
65   0.013439      Thinking Fast and Slow, Deep Learning, and AI
BERT Sample Query Results:
          sim                                              ep
42  0.660279      Artificial Intelligence: A Modern Approach
11  0.650013                           Poker and Game Theory
15  0.649782  Reinforcement Learning, Planni

### Step 9: Compare Results
Analyze and compare the results obtained from the TF-IDF and BERT representations.
Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
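
One simple way to put a number on how much the two rankings agree is the overlap between their top-k titles. The sketch below is an assumption about how such a comparison could be done, not part of the original workshop; `overlap_at_k` is a hypothetical helper, and the example lists reuse the first two titles printed for the 'Spotify' query above.

```python
def overlap_at_k(ranked_a, ranked_b, k=10):
    """Jaccard overlap between the top-k items of two ranked lists."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# First two titles from each ranking printed for the 'Spotify' query.
tfidf_top = ["Spotify", "Hardcore History"]
bert_top = ["Artificial Intelligence: A Modern Approach", "Poker and Game Theory"]

spotify_overlap = overlap_at_k(tfidf_top, bert_top, k=2)
print(spotify_overlap)
```

A value of 0 means the two methods share no episodes in their top-k, and 1 means they return the same set. Because the BERT index here covers only 50 transcripts, low overlap may partly reflect that difference in coverage rather than the representations themselves.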


TF-IDF results can be less accurate than BERT's for complex queries because its word representations carry no context.
BERT, on the other hand, provides contextual representations that can better capture the meaning of queries and documents, although it requires more computational resources: in this workshop, generating the BERT embeddings took longer than computing the TF-IDF representation.

In general, BERT tends to give better results on information retrieval tasks because of its ability to understand the context of words in large corpora.
