# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

Instructions:

## Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [17]:
import tensorflow as tf
import gensim.downloader as api
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
import numpy as np


### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

In [4]:
df = pd.read_csv('podcastdata_dataset.csv')
print(df.head()) #mostrar las 5 primeras filas del dataframe

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


In [5]:
print(df.shape) #antes 6

(319, 4)


### Step 3: Text Preprocessing
* Delete punctuation
* Delete stop words

In [6]:
#mostrar 5 primeros textos en un dataframe
corpus = df['text']
print(corpus.head())

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


#### Preprocesar los datos


In [7]:
#convertir el texto del documento a minusculas y eliminar caracteres de puntuacion
corpus_nopunct = []
for doc in corpus:
    corpus_nopunct.append(doc.lower().translate(str.maketrans('', '', string.punctuation)))

df['text_nopunct'] = corpus_nopunct #crear una lista vacia para almacenar los documentos procesados
print(df.head()) #muestra los 5 primero documentos procesados

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


In [11]:
#cargar los stopwords y eliminar del corpus
nltk.download('stopwords')
stopw = set(stopwords.words('english'))
corpus_nostopw = []
for doc in corpus_nopunct:
    clean_doc = []
    #tokenizar el documento en palabras
    doc_array = doc.split(' ')
    for word in doc_array:
       #verifica si la palabra no esta en la lista de stopwords y la almacena
        if word not in stopw:
           clean_doc.append(word)
    corpus_nostopw.append(' '.join(clean_doc))

df['text_nostopw'] = corpus_nostopw
print(df.head()) #muestra los 5 primero documentos procesados

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  \
0  as part of mit course 6s099 artificial general...   
1  as part of mit course 6s099 on artificial gene...   
2  youve studied the human mind cognition languag...   
3  what difference between biological neural netw...   
4  the following is a conversation with vladimir ...   

                   

###  Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [14]:
vectorizer = TfidfVectorizer()
tfidf_mtx = vectorizer.fit_transform(df['text_nostopw'])

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [16]:
# Cargar el tokenizador y el modelo Bert
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

#### Generar embeddings con BERT

In [36]:
#funcion para generar embeddings con Bert
def generate_bert_embeddings(texts):
    embeddings = []
    for text in texts:
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True) #tokenizar el texto y convertirlo en tensores usando TensorFlow
        outputs = model(**inputs) #pasar los inputs al modelo Bert para obtener las representaciones
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    return np.array(embeddings).transpose(0,2,1)

corpus_bert = generate_bert_embeddings(corpus)

#### Definir una Query

In [22]:
query = ['Information retrieval']
query_bert = generate_bert_embeddings(query) #generar embeddings von Bert para la consulta

In [27]:
corpus_bert.shape

(319, 768, 1)

In [29]:
similarities = cosine_similarity(corpus_bert.reshape(319,768), query_bert.reshape(1, 768)) #calcula la similitud coseno entre corpus_bert y query_bert
similarities

array([[0.62841034],
       [0.63740784],
       [0.6265833 ],
       [0.5593694 ],
       [0.6410234 ],
       [0.6785215 ],
       [0.62680125],
       [0.5873474 ],
       [0.583053  ],
       [0.6040446 ],
       [0.5997988 ],
       [0.61576426],
       [0.67719537],
       [0.6239811 ],
       [0.6239811 ],
       [0.65643275],
       [0.6717957 ],
       [0.65059465],
       [0.52325785],
       [0.6173468 ],
       [0.66820467],
       [0.5522302 ],
       [0.6098412 ],
       [0.619785  ],
       [0.5773812 ],
       [0.65587807],
       [0.65060943],
       [0.5833802 ],
       [0.5894332 ],
       [0.61623174],
       [0.6216509 ],
       [0.64528203],
       [0.65114635],
       [0.67166   ],
       [0.6919246 ],
       [0.6500761 ],
       [0.5727416 ],
       [0.6445159 ],
       [0.65447164],
       [0.6750872 ],
       [0.6342424 ],
       [0.610653  ],
       [0.61181414],
       [0.64926624],
       [0.67141396],
       [0.62601966],
       [0.5912664 ],
       [0.566

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

In [31]:
def retrieve_tfidf(query):
    query_vector = vectorizer.transform([query]) #transforma la consulta usando vectorizer (TF-IDF)
    similarities = cosine_similarity(tfidf_mtx, query_vector) #calcula la similitud coseno entre tfidf_mtx y la consulta
    similarities_df = pd.DataFrame(similarities, columns=['sim']) #crea un dataframe con las similitudes
    similarities_df['ep'] = df['title'] #agrega los titulos de los documentos al dataframe
    return similarities_df

retrieve_tfidf('Information retrieval')

Unnamed: 0,sim,ep
0,0.011714,Life 3.0
1,0.003381,Consciousness
2,0.000000,AI in the Age of Reason
3,0.006230,Deep Learning
4,0.012452,Statistical Learning
...,...,...
314,0.008666,"Singularity, Superintelligence, and Immortality"
315,0.001396,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.001805,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.002163,Poker


In [37]:
def retrieve_bert(query):
    query_bert = generate_bert_embeddings([query]) #genera embeddings Bert para la consulta
    similarities = cosine_similarity(corpus_bert.reshape(319, 768), query_bert.reshape(1, 768)) #calcula la similitud coseno entre corpus_bert y la consulta Bert
    similarities_df = pd.DataFrame(similarities, columns=['similarity']) #crea un dataframe con las similitudes
    similarities_df['document'] = df['title'] #agrega los titulos de los episodios al DataFrame
    return similarities_df

retrieve_bert('Information retrieval')

Unnamed: 0,similarity,document
0,0.628410,Life 3.0
1,0.637408,Consciousness
2,0.626583,AI in the Age of Reason
3,0.559369,Deep Learning
4,0.641023,Statistical Learning
...,...,...
314,0.591515,"Singularity, Superintelligence, and Immortality"
315,0.595819,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.644438,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.643001,Poker


### Step 7: Retrieve and Compare Results
Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [38]:
def process_query(query, tfidf_vectorizer, tfidf_matrix, bert_embeddings, tokenizer, model):
    # TF-IDF
    query_vector = tfidf_vectorizer.transform([query]) #transformar la consulta usando el vectorizador TF-IDF
    tfidf_similarity = cosine_similarity(tfidf_matrix, query_vector) #calcular la similitud coseno entre tfidf_matrix y la consulta transformada

    # BERT
    query_embedding = generate_bert_embeddings([query]) #genera embeddings Ber para la consulta
    bert_similarity = cosine_similarity(bert_embeddings.reshape(len(bert_embeddings), -1), query_embedding.reshape(1, -1)) #calcular la similitud coseno entre los embeddings Bert y la consulta Bert

    return tfidf_similarity, bert_similarity

### Step 8: Test the IR System
Test the system with a sample query

#### Definir funciones para el Test

In [39]:
#funcion para recuperar los mejores resultados
def retrieve_top_results(similarity_scores, top_n=5):
    top_indices = similarity_scores.argsort()[0][-top_n:][::-1] #obtener los indices de los documentos con las mayores similitudes
    return top_indices

#funcion para mostrar los resultados
def display_results(indices, df, similarities):
    for index in indices:
        print(f"Episode: {df.iloc[index]['title']}")
        print(f"Similarity: {similarities[0][index]:.4f}")

        print(f"Transcript: {df.iloc[index]['text'][:200]}...")  #mostrar los primeros 200 caracteres del texto del episodio
        print("\n")

#### Realizar prueba con una Query

In [43]:
query = "The artificial intelligence is magnifict"
# Procesando la query con TF-IDF y BERT
tfidf_similarity, bert_similarity = process_query(query, vectorizer, tfidf_mtx, corpus_bert, tokenizer, model)

#### Recuperar los resultados principales

In [44]:
tfidf_top_indices = retrieve_top_results(tfidf_similarity)
bert_top_indices = retrieve_top_results(bert_similarity)

#### Mostrar los resultados

In [45]:
print("TF-IDF Results:")
display_results(tfidf_top_indices, df, tfidf_similarity)

print("BERT Results:")
display_results(bert_top_indices, df, bert_similarity)

TF-IDF Results:
Episode: Life 3.0
Similarity: 0.0211
Transcript: As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist, spent a large part of his career stu...


BERT Results:
Episode: Life 3.0
Similarity: 0.6978
Transcript: As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist, spent a large part of his career stu...




### Step 9: Compare Results
Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

<br>

Resultados del analisis y comparacion

TFIDF

Similitud: 0.0211

Transcripción: Episodio "Life 3.0"

Observación: La similitud obtenida es bastante baja. Esto se debe a que TF-IDF se basa en la coincidencia exacta de palabras, lo que limita su capacidad para capturar el contexto o el significado semántico de la consulta en relación con el texto del episodio.

<br>

BERT

Similitud: 0.6978

Transcripción: Episodio "Life 3.0"

Observación: La similitud es significativamente mayor, lo que demuestra que BERT es capaz de captar mejor el contexto y las relaciones semánticas entre las palabras, ofreciendo una mayor precisión en la recuperación de información.

<br>

#### Se concluye que Bert es significativamente más efectivo que TF-IDF para la recuperación de información relevante. Mientras que TF-IDF arroja una similitud muy baja debido a su enfoque en la coincidencia exacta de palabras, BERT, con una similitud mucho mayor, comprende mejor el contexto y el significado semántico, resultando en una mayor precisión en los resultados.