# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

# Step 1: Import libraries



In [12]:
import tensorflow as tf
import gensim.downloader as api
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity


# Step 2: Load the Dataset

In [13]:
from google.colab import drive
import pandas as pd
import string
file_path = '/content/podcastdata_dataset.csv'
df = pd.read_csv(file_path)
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


In [14]:
print(df.shape)

(319, 4)


# Step 3: Text Preprocessing

Eliminar caracteres especiales y convierte los textos a minúsculas, almacenando el resultado en una nueva lista.

In [15]:
corpus = df['text']
print(corpus.head())

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


In [16]:
corpus_nopunct = []
for doc in corpus:
  corpus_nopunct.append(doc.lower().translate(str.maketrans('','',string.punctuation)))

In [17]:
df['text_nopunct'] = corpus_nopunct
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


Eliminar stopwords del corpus

In [18]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stopw = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
print(stopw)
print(len(stopw))

{"you'll", 'these', "needn't", 'further', 'me', 'and', "she's", "hadn't", 'down', 'against', 'didn', "shouldn't", 'at', 'up', 'if', 'from', 'mightn', 'while', 'because', 'wasn', 'him', "it's", "you'd", "didn't", 'shan', 'themselves', 'or', 'o', 'then', 'most', 'd', 'have', "mightn't", "haven't", 'all', 'below', 'just', 'wouldn', "hasn't", 'll', "that'll", 'them', "wouldn't", 'out', 'hasn', 'until', 'yourself', 'did', 'ourselves', "you're", 'will', 'you', 'she', 'herself', 'having', 'itself', 'each', 'those', 'into', 'once', 'not', 'this', 'for', "should've", 'being', 'who', 'where', 'are', 'than', 're', "shan't", 'it', 'was', 'myself', 'above', 'hers', 'couldn', 'yourselves', 'there', 'your', 'what', 'as', 'ain', 'too', 'so', "you've", "weren't", 'other', 'ma', 'over', 'himself', 'how', 'we', 'm', 'only', 'any', 'same', 'can', 'won', 'does', 'been', 'her', 'their', 'the', 'weren', 'but', 'haven', 'y', 'our', 'were', 'in', 'don', "mustn't", 'very', 've', 'a', "isn't", 'has', 'be', 'my',

In [20]:
corpus_nostopw = []

In [21]:
for doc in corpus_nopunct:
    clean_doc = []
    doc_array = doc.split(' ')
    for word in doc_array:
        if word not in stopw:
           clean_doc.append(word)
    corpus_nostopw.append(' '.join(clean_doc))

### Datos limpios mostrados en un DataFrame

In [22]:
df['text_nostopw'] = corpus_nostopw
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  \
0  as part of mit course 6s099 artificial general...   
1  as part of mit course 6s099 on artificial gene...   
2  youve studied the human mind cognition languag...   
3  what difference between biological neural netw...   
4  the following is a conversation with vladimir ...   

                   

In [23]:
print(len(corpus_nostopw[0]))

44733


In [24]:
len(df['text'].iloc[0].split(' '))

13424

# Step 4 : Vector Space Representation - TF-IDF

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_mtx = vectorizer.fit_transform(df['text_nostopw'])

# Step 5: Vector Space Representation  BERT

In [26]:
import numpy as np
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Definir una función para generar embeddings utilizando un modelo BERT, y luego utiliza esta función para generar embeddings tanto para un corpus de textos como para una consulta específica.

In [27]:
def generate_bert_embeddings(texts):
    embeddings = []
    for text in texts:
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])
    return np.array(embeddings).transpose(0,2,1)
corpus_bert = generate_bert_embeddings(corpus)

In [28]:
corpus_bert.shape


(319, 768, 1)

In [29]:
query = ['Consciousness']
query_bert = generate_bert_embeddings(query)

In [30]:
query_bert.shape

(1, 768, 1)

In [31]:
similarities = cosine_similarity(corpus_bert.reshape(319,768),query_bert.reshape(1,768))
similarities


array([[0.65497744],
       [0.6906209 ],
       [0.6776754 ],
       [0.5498574 ],
       [0.67551464],
       [0.71004367],
       [0.65272295],
       [0.62607   ],
       [0.61705166],
       [0.60445213],
       [0.6143959 ],
       [0.63732296],
       [0.69855064],
       [0.6300031 ],
       [0.6300031 ],
       [0.65894747],
       [0.7478554 ],
       [0.6550164 ],
       [0.44903553],
       [0.6066952 ],
       [0.6945493 ],
       [0.5381412 ],
       [0.59196186],
       [0.6645098 ],
       [0.56909204],
       [0.642003  ],
       [0.67623436],
       [0.59563726],
       [0.55421007],
       [0.6249443 ],
       [0.60414517],
       [0.66532856],
       [0.6671777 ],
       [0.639228  ],
       [0.67805624],
       [0.6868213 ],
       [0.58272296],
       [0.6818521 ],
       [0.659326  ],
       [0.6763057 ],
       [0.6389482 ],
       [0.6055414 ],
       [0.61016077],
       [0.6579503 ],
       [0.6978295 ],
       [0.60649675],
       [0.60146976],
       [0.580

# Step 6: Query processing

procesa una consulta de texto utilizando la técnica TF-IDF para calcular similitudes entre la consulta y una colección de documentos. Luego, se ejecuta esta función con una consulta específica y se devuelve un DataFrame que contiene las similitudes y los títulos de los documentos


### Funcion retrieve TFIDF

In [32]:
def retrieve_tfidf(query):
  query_vector = vectorizer.transform([query])
  similarities = cosine_similarity(tfidf_mtx, query_vector)
  similarities_df = pd.DataFrame(similarities, columns=['similarity'])
  similarities_df['document'] = df['title']
  return similarities_df

In [33]:
retrieve_tfidf('Consciousness')

Unnamed: 0,similarity,document
0,0.104010,Life 3.0
1,0.314104,Consciousness
2,0.039268,AI in the Age of Reason
3,0.014467,Deep Learning
4,0.000000,Statistical Learning
...,...,...
314,0.046445,"Singularity, Superintelligence, and Immortality"
315,0.008106,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.005240,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.000000,Poker


In [34]:
def retrieve_bert(query):
  query_bert = generate_bert_embeddings(query)
  similarities = cosine_similarity(corpus_bert.reshape(319 ,768),query_bert.reshape(1,768))
  similarities_df = pd.DataFrame(similarities, columns=['similarity'])
  similarities_df['document'] = df['title']
  return similarities_df

In [35]:
retrieve_bert(['Consciousness'])

Unnamed: 0,similarity,document
0,0.654977,Life 3.0
1,0.690621,Consciousness
2,0.677675,AI in the Age of Reason
3,0.549857,Deep Learning
4,0.675515,Statistical Learning
...,...,...
314,0.591253,"Singularity, Superintelligence, and Immortality"
315,0.609999,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.671616,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.671452,Poker


# Step 7: Retrieve and Compare Results

procesa una consulta de texto para calcular las similitudes entre la consulta y una colección de documentos utilizando dos métodos diferentes: TF-IDF y BERT. Luego, la función devuelve las similitudes calculadas para ambos métodos.

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

def process_query(query, tfidf_vectorizer, tfidf_matrix, bert_embeddings, tokenizer, model):
    query_vector = tfidf_vectorizer.transform([query])
    tfidf_similarity = cosine_similarity(tfidf_matrix, query_vector)
    query_embedding = generate_bert_embeddings([query])
    bert_similarity = cosine_similarity(bert_embeddings.reshape(len(bert_embeddings), -1), query_embedding.reshape(1, -1))
    return tfidf_similarity, bert_similarity

# Step 8: Test the IR System

Test the system with a sample query

In [37]:
def retrieve_top_results(similarity_scores, top_n=5):
    top_indices = similarity_scores.argsort()[0][-top_n:][::-1]
    return top_indices

In [38]:
def display_results(indices, df, similarities):
    for index in indices:
        print(f"Episode: {df.iloc[index]['title']}")
        print(f"Similarity: {similarities[0][index]:.4f}")
        print(f"Transcript: {df.iloc[index]['text'][:200]}...")
        print("\n")


In [39]:
query = "So the army of the good in terms of the development of technology is large"

In [40]:
tfidf_similarity, bert_similarity = process_query(query, vectorizer, tfidf_mtx, corpus_bert, tokenizer, model)


In [41]:
tfidf_top_indices = retrieve_top_results(tfidf_similarity)
bert_top_indices = retrieve_top_results(bert_similarity)

In [42]:
print("TF-IDF Results:")
display_results(tfidf_top_indices, df, tfidf_similarity)

print("BERT Results:")
display_results(bert_top_indices, df, bert_similarity)

TF-IDF Results:
Episode: Life 3.0
Similarity: 0.0044
Transcript: As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist, spent a large part of his career stu...


BERT Results:
Episode: Life 3.0
Similarity: 0.7204
Transcript: As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist, spent a large part of his career stu...




Retrieve and display the top results using both TF-IDF and BERT representations.

# Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

## Resultados de TF-IDF
### Documento más similar:
* Episodio: Life 3.0
* Similitud: 0.0044
* Transcripción: "As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist..."
* El resultado muestra una baja similitud (0.0044) con el texto de la consulta, lo que indica que la representación TF-IDF no encuentra una fuerte relación entre la consulta y el contenido del documento.

## Resultados de BERT
### Documento más similar:
* Episodio: Life 3.0
* Similitud: 0.7204
* Transcripción: "As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist..."
* En este caso, la similitud es mucho mayor (0.7204), lo que sugiere que la representación BERT encuentra una relación más fuerte y contextual entre la consulta y el documento.
Comparación y Discusión
# TF-IDF:
* Fortalezas:
Simplicidad y eficiencia computacional.
Fácil de entender e implementar.
* Debilidades:
* No captura el contexto y las relaciones semánticas entre las palabras.
* La similitud basada en TF-IDF se basa únicamente en la frecuencia de términos y no en el significado subyacente.
# BERT:
* Fortalezas:
Captura el contexto y las relaciones semánticas de las palabras en la oración.
Utiliza el contexto bidireccional, lo que mejora la comprensión del significado en el texto.
* Debilidades:
Computacionalmente más costoso y requiere más recursos.
Más complejo de entender e implementar en comparación con TF-IDF.

## Conclusiones
### Precisión y Contexto:

* BERT es claramente superior en términos de capturar el contexto y la relación semántica entre la consulta y el documento. Esto se refleja en la similitud mucho mayor obtenida con BERT en comparación con TF-IDF.
Aplicabilidad y Recursos:

* Mientras que TF-IDF puede ser útil para tareas simples de recuperación de información debido a su eficiencia y simplicidad, BERT es preferible para aplicaciones que requieren una comprensión más profunda del texto, a pesar de su mayor costo computacional.