# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.







## Instructions:

* Follow the steps outlined below to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.


### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.


In [3]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import torch
from transformers import BertTokenizer, BertModel


### Step 2: Load the Dataset

Load the dataset of podcast transcripts.



In [4]:
# Path to the .csv file
file_path = 'D:\\U\\7. Septimo\\RI\\ir24a\\week11\\data\\podcastdata_dataset.csv'

# Load the .csv file into a DataFrame
df = pd.read_csv(file_path)
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  



### Step 3: Text Preprocessing

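Normalize each transcript before vectorizing: lowercase it, strip punctuation, tokenize, remove English stop words, and stem the remaining tokens.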

In [5]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\matte\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\matte\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# Text preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (contractions collapse, e.g. "you've" -> "youve")
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    # Join the processed words back into a single string
    processed_text = ' '.join(words)
    return processed_text

In [7]:
# Apply the preprocessing to a specific column of the DataFrame
# This assumes the column of interest is named 'text'
df['processed_text'] = df['text'].apply(preprocess_text)

In [8]:
# Show the first rows of the processed DataFrame
print(df[['text', 'processed_text']].head())

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                      processed_text  
0  part mit cours 6s099 artifici gener intellig i...  
1  part mit cours 6s099 artifici gener intellig g...  
2  youv studi human mind cognit languag vision ev...  
3  differ biolog neural network artifici neural n...  
4  follow convers vladimir vapnik he co inventor ...  



### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [None]:
# Create the TF-IDF representations
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['processed_text'])

# Convert the TF-IDF matrix to a DataFrame for inspection
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
# Show the first rows of the TF-IDF DataFrame
print(tfidf_df.head())


### Step 5: Vector Space Representation - BERT

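Create BERT vector representations of the transcripts using a pre-trained BERT model. Note that `bert-base-uncased` accepts at most 512 tokens per input, so with `truncation=True` only the opening of each transcript contributes to its embedding.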

In [None]:
# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
# Function to obtain the BERT representation of a text
def get_bert_embeddings(text):
    # Tokenize and encode (inputs longer than 512 tokens are truncated)
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # Run the BERT model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the embedding of the first token ([CLS])
    embeddings = outputs.last_hidden_state[:, 0, :]
    return embeddings.squeeze().numpy()

In [None]:
# Apply the embedding function to the 'processed_text' column
df['bert_embeddings'] = df['processed_text'].apply(get_bert_embeddings)

# Show the first rows of the DataFrame with the BERT embeddings
print(df[['text', 'bert_embeddings']].head())

# Stack the BERT representations into a single matrix (requires numpy imported as np)
bert_matrix = np.vstack(df['bert_embeddings'].values)

In [None]:
# Show the dimensions of the BERT embedding matrix
print(bert_matrix.shape)


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
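
One way to sketch this step, assuming cosine similarity as the scoring function (the workshop text does not name one). The helper names `tfidf_query_scores` and `dense_query_scores` are hypothetical, and the toy corpus at the end only exists so the block runs without the podcast dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_query_scores(query, vectorizer, doc_matrix, preprocess=str.lower):
    """Score every document against the query in TF-IDF space.

    `preprocess` should be the same function applied to the transcripts
    (preprocess_text in this workshop); str.lower is only a stand-in default.
    """
    # transform (not fit_transform) keeps the vocabulary learned from the corpus
    query_vec = vectorizer.transform([preprocess(query)])
    return cosine_similarity(query_vec, doc_matrix).ravel()


def dense_query_scores(query_embedding, doc_matrix):
    """Score every document against an already-computed dense query embedding."""
    query_vec = np.asarray(query_embedding, dtype=float).reshape(1, -1)
    return cosine_similarity(query_vec, doc_matrix).ravel()


# Toy corpus so the sketch runs on its own (not the podcast dataset)
toy_docs = ["deep learning neural network",
            "consciousness brain mind",
            "statistical learning theory"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
print(tfidf_query_scores("Neural network", toy_vectorizer, toy_matrix))
```

With the objects from Steps 3 to 5, the calls would presumably be `tfidf_query_scores(query, vectorizer, tfidf_matrix, preprocess_text)` and `dense_query_scores(get_bert_embeddings(preprocess_text(query)), bert_matrix)`, so that the query goes through the same pipeline as the transcripts.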



### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.
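
A minimal sketch of such a retrieval function, assuming the scores arrive as one value per episode in DataFrame row order and that the frame has the `id`, `guest`, and `title` columns shown in Step 2. The name `top_results` and the toy scores are illustrative, not part of the workshop code.

```python
import numpy as np
import pandas as pd


def top_results(scores, episodes, k=5):
    """Return the k highest-scoring episodes with their similarity scores."""
    scores = np.asarray(scores, dtype=float)
    # Negate so argsort yields descending order; a stable sort keeps ties in corpus order
    order = np.argsort(-scores, kind="stable")[:k]
    result = episodes.iloc[order][["id", "guest", "title"]].copy()
    result["score"] = scores[order]
    return result.reset_index(drop=True)


# Toy frame mirroring the columns of the podcast dataset; the scores are made up
toy_episodes = pd.DataFrame({
    "id": [1, 2, 3],
    "guest": ["Max Tegmark", "Christof Koch", "Steven Pinker"],
    "title": ["Life 3.0", "Consciousness", "AI in the Age of Reason"],
})
print(top_results([0.10, 0.90, 0.50], toy_episodes, k=2))
```

Because the function only needs a score vector, the same call works for both representations: pass it the TF-IDF scores and then the BERT scores to get two rankings of the same episodes.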



### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.
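
A minimal end-to-end run might look like the following. It uses a three-row toy corpus and only the TF-IDF path so that it executes without the dataset or a model download; the toy texts, the query, and the `search_tfidf` helper are all made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in for df; a real run would use the podcast DataFrame
toy_df = pd.DataFrame({
    "guest": ["Guest A", "Guest B", "Guest C"],
    "title": ["Deep Learning", "Consciousness", "Statistical Learning"],
    "processed_text": [
        "neural network train gradient deep learn",
        "conscious brain experi mind",
        "statist learn theori gener bound",
    ],
})

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_df["processed_text"])


def search_tfidf(query, k=2):
    """Rank the toy episodes for an already-preprocessed query string."""
    scores = cosine_similarity(toy_vectorizer.transform([query]), toy_tfidf).ravel()
    ranked = toy_df.assign(score=scores).sort_values("score", ascending=False)
    return ranked[["guest", "title", "score"]].head(k)


print(search_tfidf("neural network"))
```

On the real data the query should first go through `preprocess_text`, and the same query can then be scored against `bert_matrix` to produce the second ranking for comparison.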



### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
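
One simple, quantitative way to start the comparison is to measure how much the two top-k lists agree. The sketch below is an assumption about how to do that (the workshop does not prescribe a metric); `compare_rankings` is a hypothetical helper, and the episode ids in the example are made up.

```python
def compare_rankings(tfidf_ids, bert_ids, k=5):
    """Summarize agreement between two ranked lists of episode ids at cutoff k."""
    top_tfidf, top_bert = set(tfidf_ids[:k]), set(bert_ids[:k])
    shared = top_tfidf & top_bert
    union = top_tfidf | top_bert
    return {
        "shared": sorted(shared),
        "only_tfidf": sorted(top_tfidf - top_bert),
        "only_bert": sorted(top_bert - top_tfidf),
        # Jaccard index: 1.0 means identical top-k sets, 0.0 means disjoint
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


print(compare_rankings([4, 1, 5, 2], [1, 3, 4, 9], k=3))
```

A low overlap is not a verdict on either method. It mainly points to the episodes worth reading by hand: those retrieved only by TF-IDF tend to share literal query terms, while those retrieved only by BERT may be related without exact word matches.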
