# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.







## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.


### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.


In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re



  from .autonotebook import tqdm as notebook_tqdm



### Step 2: Load the Dataset

Load the dataset of podcast transcripts.



In [2]:
# Ruta al archivo .csv
file_path = 'D:\\U\\7. Septimo\\RI\\ir24a\\week11\\data\\podcastdata_dataset.csv'

# Cargar el archivo .csv en un DataFrame
df = pd.read_csv(file_path, index_col=0)
print(df.head())

              guest                    title  \
id                                             
1       Max Tegmark                 Life 3.0   
2     Christof Koch            Consciousness   
3     Steven Pinker  AI in the Age of Reason   
4     Yoshua Bengio            Deep Learning   
5   Vladimir Vapnik     Statistical Learning   

                                                 text  
id                                                     
1   As part of MIT course 6S099, Artificial Genera...  
2   As part of MIT course 6S099 on artificial gene...  
3   You've studied the human mind, cognition, lang...  
4   What difference between biological neural netw...  
5   The following is a conversation with Vladimir ...  



### Step 3: Text Preprocessing

You know what to do 

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\matte\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\matte\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# Pre-procesamiento de texto
def preprocess_text(text):
    # Convertir a minúsculas
    text = text.lower()
    # Eliminar puntuación
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenizar
    words = word_tokenize(text)
    # Eliminar palabras vacías
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    # Unir palabras procesadas en una sola cadena
    processed_text = ' '.join(words)
    return processed_text

In [5]:
# Aplicar el preprocesamiento a una columna específica del DataFrame
# Supongamos que la columna de interés se llama 'text'
df['processed_text'] = df['text'].apply(preprocess_text)

In [6]:
# Mostrar las primeras filas del DataFrame procesado
print(df[['text', 'processed_text']].head())

                                                 text  \
id                                                      
1   As part of MIT course 6S099, Artificial Genera...   
2   As part of MIT course 6S099 on artificial gene...   
3   You've studied the human mind, cognition, lang...   
4   What difference between biological neural netw...   
5   The following is a conversation with Vladimir ...   

                                       processed_text  
id                                                     
1   part mit cours 6s099 artifici gener intellig i...  
2   part mit cours 6s099 artifici gener intellig g...  
3   youv studi human mind cognit languag vision ev...  
4   differ biolog neural network artifici neural n...  
5   follow convers vladimir vapnik he co inventor ...  



###  Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [7]:
# Crear representaciones TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['processed_text'])

# Convertir la matriz TF-IDF a un DataFrame para su visualización
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

In [8]:
# Mostrar las primeras filas del DataFrame TF-IDF
print(tfidf_df.head())

    00  000  00000  0000001  00001  000073  0001  001  0011  002  ...  целом  \
0  0.0  0.0    0.0      0.0    0.0     0.0   0.0  0.0   0.0  0.0  ...    0.0   
1  0.0  0.0    0.0      0.0    0.0     0.0   0.0  0.0   0.0  0.0  ...    0.0   
2  0.0  0.0    0.0      0.0    0.0     0.0   0.0  0.0   0.0  0.0  ...    0.0   
3  0.0  0.0    0.0      0.0    0.0     0.0   0.0  0.0   0.0  0.0  ...    0.0   
4  0.0  0.0    0.0      0.0    0.0     0.0   0.0  0.0   0.0  0.0  ...    0.0   

   часа  четыре  чрезвычайно  что  чтобы  шесть  это  этот  들어가  
0   0.0     0.0          0.0  0.0    0.0    0.0  0.0   0.0  0.0  
1   0.0     0.0          0.0  0.0    0.0    0.0  0.0   0.0  0.0  
2   0.0     0.0          0.0  0.0    0.0    0.0  0.0   0.0  0.0  
3   0.0     0.0          0.0  0.0    0.0    0.0  0.0   0.0  0.0  
4   0.0     0.0          0.0  0.0    0.0    0.0  0.0   0.0  0.0  

[5 rows x 30301 columns]



### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [9]:
# Cargar el modelo preentrenado de BERT y el tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [10]:
# Función para obtener las representaciones de BERT
def get_bert_embeddings(text):
    # Tokenización y codificación
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # Obtener las representaciones del modelo BERT
    with torch.no_grad():
        outputs = model(**inputs)
    # Usar el embedding de la primera palabra ([CLS])
    embeddings = outputs.last_hidden_state[:, 0, :]
    return embeddings.squeeze().numpy()

In [11]:
# Aplicar la función de embeddings a la columna 'processed_text'
df['bert_embeddings'] = df['processed_text'].apply(get_bert_embeddings)

# Mostrar las primeras filas del DataFrame con los embeddings BERT
print(df[['text', 'bert_embeddings']].head())

# Convertir las representaciones de BERT a una matriz
bert_matrix = np.vstack(df['bert_embeddings'].values)

                                                 text  \
id                                                      
1   As part of MIT course 6S099, Artificial Genera...   
2   As part of MIT course 6S099 on artificial gene...   
3   You've studied the human mind, cognition, lang...   
4   What difference between biological neural netw...   
5   The following is a conversation with Vladimir ...   

                                      bert_embeddings  
id                                                     
1   [-0.30752316, 0.062645406, 0.42602414, 0.17253...  
2   [-0.081817366, 0.03131118, 0.30538267, 0.11819...  
3   [-0.17911603, 0.09844374, 0.4419198, 0.0657610...  
4   [-0.27540696, -0.06768431, 0.2799198, 0.155715...  
5   [-0.31522432, 0.081350125, 0.23485659, 0.12684...  


In [12]:
# Mostrar las dimensiones de la matriz de embeddings BERT
#print(bert_matrix.shape)
print("BERT Embeddings:", bert_matrix)
print("BERT Shape:", bert_matrix.shape)


BERT Embeddings: [[-0.30752316  0.06264541  0.42602414 ... -0.24010745  0.23031083
   0.04865966]
 [-0.08181737  0.03131118  0.30538267 ... -0.16805509  0.21031383
   0.17329094]
 [-0.17911603  0.09844374  0.4419198  ... -0.29068568  0.29972103
   0.02240052]
 ...
 [-0.05981405  0.02161749  0.5930199  ... -0.20664145  0.23309557
  -0.10616215]
 [-0.18685055  0.07436559  0.36574128 ... -0.19735147  0.10025676
   0.16720584]
 [-0.10755596 -0.00942336  0.42898262 ... -0.15812056  0.25815174
   0.20240471]]
BERT Shape: (319, 768)



### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.


In [13]:
# Definir la función clean_query
def clean_query(query):
    # Preprocesar la consulta de la misma forma que los textos del dataset
    return preprocess_text(query)

In [14]:
query = "machine learning"

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

def process_query(query, tfidf_vectorizer, tfidf_matrix, bert_model, bert_tokenizer, bert_matrix):
    cleaned_query = clean_query(query)
    query_tfidf = tfidf_vectorizer.transform([cleaned_query])
    tfidf_cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()
    
    query_tokens = bert_tokenizer.encode(cleaned_query, add_special_tokens=True)
    query_embedding = bert_model(torch.tensor([query_tokens]))[0].mean(dim=1).detach().numpy()
    bert_cosine_similarities = cosine_similarity(query_embedding, bert_matrix).flatten()
    
    return tfidf_cosine_similarities, bert_cosine_similarities



### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.


In [16]:
def retrieve_top_results(similarities, df, top_n=5):
    top_indices = similarities.argsort()[-top_n:][::-1]
    return df.iloc[top_indices]


### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.


In [17]:
# Procesar la consulta
tfidf_similarities, bert_similarities = process_query(query, vectorizer, tfidf_matrix, model, tokenizer, bert_matrix)

# Recuperar los resultados principales usando TF-IDF
top_tfidf_results = retrieve_top_results(tfidf_similarities, df, top_n=5)
print("Top TF-IDF Results:")
print(top_tfidf_results[['text', 'processed_text']])

# Recuperar los resultados principales usando BERT
top_bert_results = retrieve_top_results(bert_similarities, df, top_n=5)
print("Top BERT Results:")
print(top_bert_results[['text', 'processed_text']])


Top TF-IDF Results:
                                                 text  \
id                                                      
73  The following is a conversation with Andrew Ng...   
36  The following is a conversation with Yann LeCu...   
4   What difference between biological neural netw...   
93  The following is a conversation with Daphne Ko...   
43  The following is a conversation with Gary Marc...   

                                       processed_text  
id                                                     
73  follow convers andrew ng one impact educ resea...  
36  follow convers yann lecun he consid one father...  
4   differ biolog neural network artifici neural n...  
93  follow convers daphn koller professor comput s...  
43  follow convers gari marcu he professor emeritu...  
Top BERT Results:
                                                  text  \
id                                                       
317  The universe doesn't care about your personal ...


### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.


In [18]:
# Analizar y comparar los resultados obtenidos por TF-IDF y BERT
def compare_results(query, tfidf_results, bert_results):
    print(f"Query: {query}")
    print("\nTF-IDF Results:")
    for idx, row in tfidf_results.iterrows():
        print(f"Text: {row['text']}\nProcessed Text: {row['processed_text']}\n")

    print("\nBERT Results:")
    for idx, row in bert_results.iterrows():
        print(f"Text: {row['text']}\nProcessed Text: {row['processed_text']}\n")

# Comparar los resultados de una consulta de ejemplo
compare_results(query, top_tfidf_results, top_bert_results)


Query: machine learning

TF-IDF Results:
Text: The following is a conversation with Andrew Ng, one of the most impactful educators, researchers, innovators, and leaders in artificial intelligence and technology space in general. He cofounded Coursera and Google Brain, launched Deep Learning AI, Landing AI, and the AI Fund, and was the chief scientist at Baidu. As a Stanford professor and with Coursera and Deep Learning AI, he has helped educate and inspire millions of students, including me. This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, give it five stars on Apple Podcast, support it on Patreon, or simply connect with me on Twitter at Lex Friedman, spelled F R I D M A N. As usual, I'll do one or two minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience. This show is presented by Cash App, the number one finance app in the App Store. When yo