# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

# Step 1: Import libraries



In [None]:
import tensorflow as tf
import gensim.downloader as api
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics.pairwise import cosine_similarity


# STEP 2: Load the Dataset

In [None]:
from google.colab import drive
import pandas as pd
import string

# Montar Google Drive
drive.mount('/content/drive')

# Especificar la ruta completa al archivo CSV
file_path = '/content/drive/My Drive/Week10RI/podcastdata_dataset.csv'

# Leer el archivo CSV
df = pd.read_csv(file_path)

# Mostrar las primeras filas del DataFrame
print(df.head())

# Mostrar las primeras filas

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


In [None]:
print(df.shape)

(319, 4)


# Step 3: Text Preprocessing

- Delete puntuaciones
- Delete stop words

In [None]:
# Mostrar los textos en un  DataFrame
corpus = df['text']
print(corpus.head())

0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


### Eliminamos los caracteres especiales, y transformamos Mayusculas a minusculas

In [None]:
# Mostrar los textos en un  DataFrame
corpus_nopunct = []
#Iterar sobre cada documento del corpus
for doc in corpus:
   # Convertir el texto del documento a minúsculas y eliminar todos los caracteres de puntuación
  corpus_nopunct.append(doc.lower().translate(str.maketrans('','',string.punctuation)))

In [None]:
# Crear una lista vacía para almacenar los documentos procesados
df['text_nopunct'] = corpus_nopunct
#muestra los docuemntos procesados una ves que se ha realizado el preprocesamiento
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  
0  as part of mit course 6s099 artificial general...  
1  as part of mit course 6s099 on artificial gene...  
2  youve studied the human mind cognition languag...  
3  what difference between biological neural netw...  
4  the following is a conversation with vladimir ...  


# Cargamos los stopwords para proceder a eliminarlos de nuestro corpus

In [None]:
import nltk

# Download the stopwords corpus
nltk.download('stopwords')

from nltk.corpus import stopwords

#muestra los docuemntos procesados una ves que se ha realizado el preprocesamiento
stopw = set(stopwords.words('english'))




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:

#Mostrar los stopwords
print(stopw)
#Mostar el tamano del arreglo de stopwords
print(len(stopw))

{'itself', "aren't", 'its', 'myself', 'don', 'while', "couldn't", 'on', 'is', "hadn't", 'for', 'you', 'doing', 'her', 'how', 'our', 'are', 'about', 'where', 'such', 'himself', 'at', 'only', "doesn't", 'i', 'shan', 'isn', 'here', 'needn', 'any', 'some', "should've", "she's", 'when', 'these', 'few', 'a', 'being', 'again', 'y', 'both', 'aren', "won't", 'same', 'than', 'all', 'it', 'below', 'we', 'under', 'his', 'and', 'was', 'now', 'from', 'over', 'to', 'ours', 'who', 'hadn', 'won', 'the', 'll', 'nor', 'very', 'do', "wasn't", 'am', "you're", 'just', 'theirs', 'through', 'ma', "that'll", 'in', 'weren', 'out', 'why', "you've", 've', "wouldn't", 'doesn', 'after', "it's", 'once', 'if', 'so', 'they', 'd', 'into', "haven't", 'by', 'most', 'have', 'she', 'this', 'but', "don't", 'own', 'couldn', 'wasn', 'didn', 'or', "you'd", 'ourselves', "weren't", 'had', "hasn't", "shouldn't", 'what', 'whom', 'further', 'herself', "didn't", 'haven', 'wouldn', 'o', 'm', 'hasn', 'mustn', 'has', 'until', "mustn't"

In [None]:
# Inicializa una lista vacía para almacenar los documentos sin stopwords
corpus_nostopw = []

In [None]:
 # Itera sobre todos los documentos
for doc in corpus_nopunct:
    # Inicializa una lista vacía para almacenar las palabras sin stopwords
    clean_doc = []
    # Tokenizar el documento en palabras
    doc_array = doc.split(' ')
    # Iterar sobre cada palabra en la lista
    for word in doc_array:
       # Verifica si la palabra no está en la lista de stopwords
        if word not in stopw:
        # Si la palabra no está en la lista de stopwords, la agrega a la lista de palabras sin stopwords
           clean_doc.append(word)
      # Une las palabras en clean_doc con un espacio y agrega el documento procesado a corpus_nostopw
    corpus_nostopw.append(' '.join(clean_doc))

### Datos limpios mostrados en un DataFrame

In [None]:
# Crear una lista vacía para almacenar los documentos procesados
df['text_nostopw'] = corpus_nostopw
#Impresion de las primeras filas
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  \
0  As part of MIT course 6S099, Artificial Genera...   
1  As part of MIT course 6S099 on artificial gene...   
2  You've studied the human mind, cognition, lang...   
3  What difference between biological neural netw...   
4  The following is a conversation with Vladimir ...   

                                        text_nopunct  \
0  as part of mit course 6s099 artificial general...   
1  as part of mit course 6s099 on artificial gene...   
2  youve studied the human mind cognition languag...   
3  what difference between biological neural netw...   
4  the following is a conversation with vladimir ...   

                   

In [None]:
#Muestra el tamaño del corpus procesados
print(len(corpus_nostopw[0]))
#rint(len(df['text'][0].split()))

44733


In [None]:
len(df['text'].iloc[0].split(' '))

13424

# Step 4 : Vector Space Representation - TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Impresion de las primeras filas
vectorizer = TfidfVectorizer()
# Aplica el vectorizador para ajustar y transformar los documentos limpios en una matriz TF-IDF
tfidf_mtx = vectorizer.fit_transform(df['text_nostopw'])

# Step 5: Vector Space Representation  BERT

In [None]:
import numpy as np
from transformers import BertTokenizer, BertModel

# Cargar el tokenizador y el modelo BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### Funcion generador de embeddings BERT

In [None]:
def generate_bert_embeddings(texts):
    embeddings = []
    for text in texts:
      # Tokenizar el texto y convertirlo en tensores usando TensorFlow
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
      #Pasar los inputs al modelo BERT para obtener las representaciones
        outputs = model(**inputs)
        # Usar la representación de la [CLS] token o la media de todas las representaciones
        embeddings.append(outputs.last_hidden_state[:, 0, :])
 # Convertir la lista de embeddings en un numpy array y transponer las dimensiones
    return np.array(embeddings).transpose(0,2,1)
# Generar embeddings BERT para el corpus de textos
corpus_bert = generate_bert_embeddings(corpus)

In [None]:
# Devuelve la forma del array corpus_bert
corpus_bert.shape


(319, 768, 1)

In [158]:
# Define una lista con la consulta 'Computer Science'
query = ['Artificial General']
# Genera embeddings BERT para la consulta
query_bert = generate_bert_embeddings(query)



In [159]:
# Devuelve la forma del array query_bert
query_bert.shape

(1, 768, 1)

In [160]:
# Calcula la similitud coseno entre corpus_bert y query_bert
similarities = cosine_similarity(corpus_bert.reshape(319,768),query_bert.reshape(1,768))
# Muestra las similitudes calculadas
similarities


array([[0.66573834],
       [0.68808705],
       [0.6628542 ],
       [0.5722449 ],
       [0.6859926 ],
       [0.71603954],
       [0.6740874 ],
       [0.6259572 ],
       [0.63718104],
       [0.6339799 ],
       [0.6250173 ],
       [0.66667867],
       [0.6957108 ],
       [0.64849   ],
       [0.64849   ],
       [0.6591395 ],
       [0.7493383 ],
       [0.6533493 ],
       [0.47185946],
       [0.6283725 ],
       [0.70002836],
       [0.5759804 ],
       [0.5983999 ],
       [0.6757225 ],
       [0.5966219 ],
       [0.63746464],
       [0.68690896],
       [0.6173605 ],
       [0.5717934 ],
       [0.6350664 ],
       [0.6329876 ],
       [0.6906017 ],
       [0.6739389 ],
       [0.6484343 ],
       [0.67641956],
       [0.69708866],
       [0.6202234 ],
       [0.69530123],
       [0.6773106 ],
       [0.68360865],
       [0.64828914],
       [0.61575466],
       [0.6287988 ],
       [0.6648556 ],
       [0.69731116],
       [0.62786067],
       [0.6176902 ],
       [0.598

#Step 6: Query processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

### Funcion retrieve TFIDF

In [None]:
def retrieve_tfidf(query):
  # Transforma la consulta usando vectorizer (TF-IDF)
  query_vector = vectorizer.transform([query])

  # Calcula la similitud coseno entre tfidf_mtx y la consulta
  similarities = cosine_similarity(tfidf_mtx, query_vector)

  # Crea un DataFrame con las similitudes
  similarities_df = pd.DataFrame(similarities, columns=['similarity'])

  # Agrega los títulos de los documentos al DataFrame
  similarities_df['document'] = df['title']

  # Devuelve el DataFrame con las similitudes calculadas
  return similarities_df



In [163]:
# Ejecuta la función retrieve_tfidf con la consulta 'image'
retrieve_tfidf('Artificial General')

Unnamed: 0,similarity,document
0,0.033378,Life 3.0
1,0.052715,Consciousness
2,0.131379,AI in the Age of Reason
3,0.069172,Deep Learning
4,0.041889,Statistical Learning
...,...,...
314,0.003836,"Singularity, Superintelligence, and Immortality"
315,0.013679,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.002737,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.003851,Poker


### Funcion retrieve BERT

In [None]:
def retrieve_bert(query):
  # Genera embeddings BERT para la consulta
  query_bert = generate_bert_embeddings(query)

  # Calcula la similitud coseno entre corpus_bert y la consulta BERT
  similarities = cosine_similarity(corpus_bert.reshape(319 ,768),query_bert.reshape(1,768))

# Crea un DataFrame con las similitudes
  similarities_df = pd.DataFrame(similarities, columns=['similarity'])

  # Agrega los títulos de los episodios al DataFrame
  similarities_df['document'] = df['title']

  # Devuelve el DataFrame con las similitudes calculadas
  return similarities_df

In [162]:
# Ejecuta la función retrieve_bert con la consulta ['Computer Science']
retrieve_bert(['Artificial General'])

Unnamed: 0,similarity,document
0,0.665738,Life 3.0
1,0.688087,Consciousness
2,0.662854,AI in the Age of Reason
3,0.572245,Deep Learning
4,0.685993,Statistical Learning
...,...,...
314,0.608068,"Singularity, Superintelligence, and Immortality"
315,0.622087,"Emotion AI, Social Robots, and Self-Driving Cars"
316,0.677050,"Comedy, MADtv, AI, Friendship, Madness, and Pr..."
317,0.672401,Poker


# Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def process_query(query, tfidf_vectorizer, tfidf_matrix, bert_embeddings, tokenizer, model):

    # TF-IDF
    # Transforma la consulta usando el vectorizador TF-IDF
    query_vector = tfidf_vectorizer.transform([query])

    # Calcula la similitud coseno entre tfidf_matrix y la consulta transformada
    tfidf_similarity = cosine_similarity(tfidf_matrix, query_vector)

    # BERT
    # Genera embeddings BERT para la consulta
    query_embedding = generate_bert_embeddings([query])

     # Calcula la similitud coseno entre los embeddings BERT y la consulta BERT
    bert_similarity = cosine_similarity(bert_embeddings.reshape(len(bert_embeddings), -1), query_embedding.reshape(1, -1))


   # Devuelve las similitudes calculadas para TF-IDF y BERT
    return tfidf_similarity, bert_similarity



#Step 8: Test the IR System

Test the system with a sample query

In [None]:
def retrieve_top_results(similarity_scores, top_n=5):

   # Obtiene los índices de los documentos con las mayores similitudes
    top_indices = similarity_scores.argsort()[0][-top_n:][::-1]

    # Devuelve los índices de los documentos más similares
    return top_indices

In [None]:
def display_results(indices, df, similarities):
    for index in indices:

      # Imprime el título del episodio
        print(f"Episode: {df.iloc[index]['title']}")

        # Imprime la similitud del episodio
        print(f"Similarity: {similarities[0][index]:.4f}")

         # Imprime los primeros 200 caracteres del texto del episodio
        print(f"Transcript: {df.iloc[index]['text'][:200]}...")  # Display the first 200 characters
        print("\n")


In [164]:
#Definiendo la query
query = "So the army of the good in terms of the development of technology is large"

In [165]:
# Procesando la query con TF-IDF y BERT
tfidf_similarity, bert_similarity = process_query(query, vectorizer, tfidf_mtx, corpus_bert, tokenizer, model)


In [166]:
# Recuperar los resultados principales
tfidf_top_indices = retrieve_top_results(tfidf_similarity)
bert_top_indices = retrieve_top_results(bert_similarity)



In [167]:
# Mostrar los resultados
print("TF-IDF Results:")
display_results(tfidf_top_indices, df, tfidf_similarity)

print("BERT Results:")
display_results(bert_top_indices, df, bert_similarity)


TF-IDF Results:
Episode: Life 3.0
Similarity: 0.0044
Transcript: As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist, spent a large part of his career stu...


BERT Results:
Episode: Life 3.0
Similarity: 0.7204
Transcript: As part of MIT course 6S099, Artificial General Intelligence, I've gotten the chance to sit down with Max Tegmark. He is a professor here at MIT. He's a physicist, spent a large part of his career stu...




Retrieve and display the top results using both TF-IDF and BERT representations.

#Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

- Resultados del analisis y comparacion
  - TFIDF
    - Similitud: 0.0044
    - Transcripción: Episodio "Life 3.0"
    - Observación: La similitud obtenida es bastante baja. Esto se debe a que TF-IDF se basa en la coincidencia exacta de palabras clave, lo que no captura adecuadamente el contexto o el significado semántico de la consulta en relación con el texto del episodio.

  - BERT
    - Similitud: 0.7204
    - Transcripción: Episodio "Life 3.0"
    - Observación: La similitud es significativamente mayor. Esto indica que BERT puede capturar mejor el contexto y las relaciones semánticas entre las palabras, proporcionando una mayor precisión en la recuperación de información.

En conclusión, BERT es  mejor que TF-IDF para recuperar la información relevante. Mientras TF-IDF tiene una similitud muy baja (0.0044) porque solo busca coincidencias exactas de palabras, BERT, con una similitud mucho mayor (0.7204), entiende mejor el contexto y el significado, haciendo que los resultados sean  más precisos



