# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

Instructions:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [26]:
import pandas as pd
import string
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, TFBertModel
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [2]:
csv_file_path = 'data/podcastdata_dataset.csv'

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file_path, index_col=0)

In [3]:
df

id,guest,title,text
1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...
...,...,...,...
321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ..."
322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ..."
323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...
324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...


### Step 3: Text Preprocessing

Clean each transcript before vectorizing: lowercase it, strip punctuation, remove stop words, and lemmatize with spaCy.

In [4]:
stop_words_file = 'data/stopwords.txt'

with open(stop_words_file, 'r') as file:
    stop_words = set(file.read().split())

nlp = spacy.load('en_core_web_sm')

In [5]:
def preprocess_and_lemmatize_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text on whitespace
    words = text.split()

    # Remove stop words
    words = [word for word in words if word not in stop_words]

    # Join the processed words back into a single string
    text = ' '.join(words)

    # Lemmatize the text with spaCy
    doc = nlp(text)
    lemmatized_text = ' '.join([token.lemma_ for token in doc])

    return lemmatized_text


In [6]:
df['cleaned_text'] = df['text'].apply(preprocess_and_lemmatize_text)

In [7]:
df

id,guest,title,text,cleaned_text
1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",part mit 6s099 artificial general intelligence...
2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,part mit 6s099 artificial general intelligence...
3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",you ve study human mind cognition language vis...
4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,difference biological neural network artificia...
5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,conversation vladimir vapnik he s inventor sup...
...,...,...,...,...
321,Ray Kurzweil,"Singularity, Superintelligence, and Immortality","By the time he gets to 2045, we'll be able to ...",time 2045 multiply intelligence million fold h...
322,Rana el Kaliouby,"Emotion AI, Social Robots, and Self-Driving Cars","there's a broader question here, right? As we ...",broad question build socially emotionally inte...
323,Will Sasso,"Comedy, MADtv, AI, Friendship, Madness, and Pr...",Once this whole thing falls apart and we are c...,thing fall climb kudzu vine spiral sears tower...
324,Daniel Negreanu,Poker,you could be the seventh best player in the wh...,seventh player world literally seventh player ...


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [8]:
# Vectorize a collection of texts with TF-IDF
def TF_IDF(texts):
    # Fit the vectorizer on the texts and build the document-term matrix
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(texts)

    # Return the fitted vectorizer too, so queries can be transformed later
    return X_tfidf, tfidf_vectorizer

In [9]:
# Vectorize the documents
X_tfidf_lemmatized, tfidf_vectorizer_lemmatized = TF_IDF(df['cleaned_text'])

# Inspect the TF-IDF results
print("TF-IDF Lemmatized:")
print(X_tfidf_lemmatized.toarray())
print("TF-IDF Lemmatized features:", tfidf_vectorizer_lemmatized.get_feature_names_out())

TF-IDF Lemmatized:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
TF-IDF Lemmatized features: ['00' '000' '00000' ... 'это' 'этот' '들어가']
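
To make the TF-IDF step easier to inspect, here is a minimal sketch on three invented toy documents (the names `toy_docs`, `toy_vectorizer`, and `toy_query` are illustrative and not part of the workshop data). It shows the two points that matter for retrieval: the document matrix has one row per document and one column per vocabulary term, and a query has to be transformed with the vectorizer that was already fitted on the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, already "cleaned" in the spirit of Step 3 (hypothetical data)
toy_docs = [
    "neural network learn representation",
    "poker player bluff tournament",
    "neural network play poker",
]

# Fit on the documents: one row per document, one column per term
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Transform (do not re-fit) the query so it shares the same vocabulary
toy_query = toy_vectorizer.transform(["poker bluff"])

# Cosine similarity between the query and every document
toy_scores = cosine_similarity(toy_query, toy_matrix).ravel()
best = int(toy_scores.argmax())

print("Matrix shape:", toy_matrix.shape)
print("Best matching document:", toy_docs[best])
```

The matrix stays sparse throughout; calling `.toarray()` is convenient for printing but can use a lot of memory on a vocabulary as large as the one in this dataset.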


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [12]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [15]:
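def generate_bert_embeddings(texts):
    # texts must be an iterable of strings; one embedding is produced per item
    embeddings = []
    for text in texts:
        # truncation=True cuts each input at the model's maximum length
        # (512 tokens for bert-base-uncased), so only the start of a long
        # transcript is embedded
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    # Stacked shape is (n_texts, 1, 768); the transpose yields (n_texts, 768, 1)
    return np.array(embeddings).transpose(0,2,1)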

bert_embeddings = generate_bert_embeddings(df['cleaned_text'])
print("BERT Embeddings:", bert_embeddings)
print("BERT Shape:", bert_embeddings.shape)

BERT Embeddings: [[[ 0.06086778]
  [ 0.26899973]
  [ 0.37482524]
  ...
  [-0.24746999]
  [ 0.11418366]
  [ 0.12912685]]

 [[-0.03892427]
  [ 0.1501886 ]
  [ 0.18922803]
  ...
  [-0.33711198]
  [ 0.13962337]
  [ 0.08770353]]

 [[-0.15095787]
  [ 0.06957125]
  [ 0.15182185]
  ...
  [-0.3761814 ]
  [ 0.16042688]
  [ 0.12493543]]

 ...

 [[-0.01123998]
  [ 0.02063517]
  [ 0.4532022 ]
  ...
  [-0.1953187 ]
  [ 0.19546221]
  [-0.07874274]]

 [[-0.02113011]
  [ 0.28063643]
  [ 0.16844061]
  ...
  [-0.25958553]
  [ 0.19384375]
  [ 0.12385748]]

 [[-0.03289397]
  [ 0.21554986]
  [ 0.25829577]
  ...
  [-0.28611344]
  [ 0.07276228]
  [ 0.19506222]]]
BERT Shape: (319, 768, 1)
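
Because `truncation=True` limits each input to 512 tokens for `bert-base-uncased`, the embeddings above reflect only the opening of each transcript. One possible improvement, sketched here under stated assumptions, is to split the token ids into windows and average the per-window vectors. The helper names `chunk_token_ids` and `mean_pooled_embedding` are hypothetical, and `embed_chunk` stands for any callable you write that turns one window of token ids into a 1D vector (for example by adding the special tokens and taking the [CLS] output). The window size of 510 is an assumption that leaves room for [CLS] and [SEP].

```python
import numpy as np

def chunk_token_ids(token_ids, window=510):
    """Split a list of token ids into consecutive windows of at most `window` ids."""
    if window <= 0:
        raise ValueError("window must be positive")
    chunks = [token_ids[i:i + window] for i in range(0, len(token_ids), window)]
    # Keep one (empty) chunk for empty input so callers always get a vector back
    return chunks or [[]]

def mean_pooled_embedding(token_ids, embed_chunk, window=510):
    """Embed every window with `embed_chunk` and average the resulting vectors."""
    chunks = chunk_token_ids(token_ids, window)
    vectors = np.stack([np.asarray(embed_chunk(chunk), dtype=float) for chunk in chunks])
    return vectors.mean(axis=0)

# Stand-in embedder for demonstration only: returns [chunk length, 1.0]
def fake_embed(chunk):
    return [len(chunk), 1.0]

demo_ids = list(range(1200))
demo_chunks = chunk_token_ids(demo_ids)
demo_vector = mean_pooled_embedding(demo_ids, fake_embed)
print([len(c) for c in demo_chunks], demo_vector)
```

Averaging treats every window equally; weighting by window length or using overlapping windows are alternatives worth trying when comparing results in Step 9.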


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

In [19]:
query = 'Stack Overflow and Coding Horror'

In [21]:
def clean_query(query):
    # Clean the query: lowercase it, drop stop words, strip non-alphanumeric characters
    query = query.lower()
    stop_words_file = 'data/stopwords.txt'
    with open(stop_words_file, 'r', encoding='utf-8') as file:
        stop_words = set(file.read().split())

    cleaned_query = ' '.join([word for word in query.split() if word not in stop_words])
    cleaned_query = re.sub(r'[^A-Za-z0-9\s]', '', cleaned_query)
    return cleaned_query

In [23]:
cleaned_query = clean_query(query)
# Lemmatize the query with the same function used for the documents (no stemming is applied)

lemmatized_query = preprocess_and_lemmatize_text(cleaned_query)

In [24]:
# Vectorize the query with the TF-IDF vectorizer fitted on the lemmatized documents

query_tfidf_lemmatized = tfidf_vectorizer_lemmatized.transform([lemmatized_query])

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [38]:
def top_n_documents(scores, n, filenames):

    # Make sure the scores are a 1D array
    scores = scores.flatten()

    # Get the indices of the n highest scores, best first
    top_indices = np.argsort(scores)[-n:][::-1]

    # Look up the title for each index by position
    # (filenames is a pandas Series such as df['title'])
    top_titles = [filenames.iloc[idx] for idx in top_indices]

    top_scores = scores[top_indices]

    # Print the top n titles with their similarity scores
    for i, (title, score) in enumerate(zip(top_titles, top_scores), 1):

        print(f"Top {i}: {title} - Similarity: {score:.4f}")
        print()

In [39]:
cosine_scores_lemmatized = cosine_similarity(query_tfidf_lemmatized, X_tfidf_lemmatized)

top_n_documents(cosine_scores_lemmatized, 5, df['title'])

In [42]:
cosine_scores_lemmatized

array([[1.05121997e-03, 1.14090344e-03, 1.80429485e-03, 0.00000000e+00,
        1.07825069e-03, 2.16699391e-02, 3.34643049e-01, 0.00000000e+00,
        1.05902109e-03, 1.43089768e-03, 8.62788215e-03, 2.31556940e-03,
        0.00000000e+00, 1.06194520e-02, 1.06194520e-02, 1.08095882e-03,
        5.34781287e-03, 1.54863843e-03, 3.78983961e-03, 3.54187641e-03,
        2.54097295e-03, 5.08895533e-02, 3.78679696e-03, 3.93808673e-03,
        0.00000000e+00, 1.80996223e-03, 0.00000000e+00, 9.61147734e-04,
        0.00000000e+00, 7.22093153e-03, 3.66884015e-03, 3.14403967e-02,
        6.53510324e-03, 7.58434337e-04, 0.00000000e+00, 1.10269175e-02,
        5.56657524e-03, 0.00000000e+00, 7.44890564e-03, 0.00000000e+00,
        1.06385618e-03, 0.00000000e+00, 2.04376890e-02, 8.87238662e-03,
        1.12602945e-03, 0.00000000e+00, 0.00000000e+00, 1.68912727e-03,
        8.88285587e-02, 0.00000000e+00, 1.73784058e-03, 3.42098844e-03,
        2.24740084e-03, 4.13562372e-03, 1.79240811e-03, 1.400990

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

In [43]:
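# Note: generate_bert_embeddings iterates over its argument, so a bare string is
# embedded character by character (32 characters here, hence 32 rows below).
# Pass [query] instead to obtain a single embedding for the whole query.
bert_embeddings_query = generate_bert_embeddings(query)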
print("BERT Query Embeddings:", bert_embeddings_query)
print("BERT Query Shape:", bert_embeddings_query.shape)

BERT Query Embeddings: [[[-0.45446125]
  [ 0.05358833]
  [ 0.12024155]
  ...
  [ 0.07028826]
  [ 0.24653077]
  [ 0.41181576]]

 [[-0.25475633]
  [-0.07631426]
  [ 0.37731266]
  ...
  [-0.24108677]
  [ 0.2728721 ]
  [ 0.6000002 ]]

 [[-0.44223228]
  [-0.0525881 ]
  [-0.00906704]
  ...
  [ 0.09634119]
  [ 0.16249628]
  [ 0.30312762]]

 ...

 [[-0.6948799 ]
  [-0.00157599]
  [ 0.01014499]
  ...
  [-0.3159717 ]
  [ 0.15906921]
  [ 0.46990094]]

 [[-0.26669952]
  [ 0.02476714]
  [-0.10877821]
  ...
  [-0.21664155]
  [ 0.5348886 ]
  [ 0.57949764]]

 [[-0.6948799 ]
  [-0.00157599]
  [ 0.01014499]
  ...
  [-0.3159717 ]
  [ 0.15906921]
  [ 0.46990094]]]
BERT Query Shape: (32, 768, 1)
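
The 32 rows in the query shape above match the number of characters in the query string, since a `for` loop over a Python string visits one character at a time. A small guard like the following sketch avoids that; `as_text_list` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
def as_text_list(texts):
    """Return a list of strings, wrapping a single string instead of splitting it."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)

example_query = 'Stack Overflow and Coding Horror'

# Items a loop would visit when given the bare string vs. the wrapped string
items_bare = len(list(example_query))
items_wrapped = len(as_text_list(example_query))
print(items_bare, items_wrapped)
```

Calling `as_text_list(texts)` at the top of `generate_bert_embeddings` would let the function accept either a single query or a collection of documents.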


In [45]:
# Cosine similarity between BERT embeddings
bert_similarity_query = cosine_similarity(bert_embeddings_query.reshape(32,768))
print("BERT Query Cosine Similarity:\n", bert_similarity_query)

BERT Query Cosine Similarity:
 [[1.0000002  0.923278   0.9624414  ... 0.92472047 0.9155723  0.92472047]
 [0.923278   0.99999976 0.89562345 ... 0.8880375  0.9121578  0.8880375 ]
 [0.9624414  0.89562345 0.99999994 ... 0.92312145 0.91849136 0.92312145]
 ...
 [0.92472047 0.8880375  0.92312145 ... 0.9999999  0.90264225 0.9999999 ]
 [0.9155723  0.9121578  0.91849136 ... 0.90264225 0.9999999  0.90264225]
 [0.92472047 0.8880375  0.92312145 ... 0.9999999  0.90264225 0.9999999 ]]


In [47]:
# Compute the cosine similarity between each document and each query embedding
similarities_Bert = cosine_similarity(bert_embeddings.reshape(319,768), bert_embeddings_query.reshape(32,768))

# Average the similarities to get one score per document
avg_similarities_Bert = similarities_Bert.mean(axis=1)

In [48]:
# Rank the documents by average similarity, highest first
ranking_Bert = np.argsort(-avg_similarities_Bert)

# Print the document ranking
print("Document ranking (indices):", ranking_Bert)
print("Document similarities:", avg_similarities_Bert[ranking_Bert])


Document ranking (indices): [119 172  16 108 219 128 233  82 237 179  34 310 162 252 120 296  57 106
 185 316 147 159 290   5  46  61  62 117 137 184  14  13 114 251  99   3
 236 134  84 158 182 271  53  48 225 112 311  42   7  37  11  28 300 209
  31  23 165 163 125  87 215 260 289 173 109 194 131  59 189  67 177  43
 297 144 268  55   1 245  65 217 266 301 107  20 261 122 267 223  35 295
  63 307  54 243 207  98   8 314 101 213 103 302  15 284  52 259 299 118
  30  94  24  58 256 239 166 135 133 248 208 281 168  44  80 111 298 202
 254  86 171  75 313  27  78  39 258 116 110 105 228   2 102 127  12 274
 143 234  83 201 285 187 123 292 247 224 257 221 282 167 129  40 175 222
  19 276 249 178 220  70 241  69  60  29 204 309 306  38 206  18   0 286
 170 150 138  56  76 199 317 279  92   4 273 265 113 205  45 148 227 255
 315 145  89 140  10 104 149 214 169 139 115  81 156  91  85  66  79 176
 246 132 235 210   9 141 216  51 124   6 121 157 287 130 154  25 272  26
  33  41  32 174 2
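
As an alternative to averaging over many query rows, the ranking can be computed against a single query vector. The sketch below uses only NumPy and invented toy vectors; `rank_by_cosine` is a hypothetical helper, and the small constant in the denominator is an assumption added to avoid division by zero for all-zero vectors.

```python
import numpy as np

def rank_by_cosine(doc_vectors, query_vector, top_k=5):
    """Return (indices, scores) of the top_k documents by cosine similarity."""
    docs = np.asarray(doc_vectors, dtype=float)
    q = np.asarray(query_vector, dtype=float).reshape(-1)

    # Cosine similarity = dot product divided by the product of the norms
    denom = np.maximum(np.linalg.norm(docs, axis=1) * np.linalg.norm(q), 1e-12)
    scores = docs @ q / denom

    # Sort descending and keep the first top_k positions
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 2D vectors standing in for 768-dimensional embeddings
toy_doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_order, toy_top_scores = rank_by_cosine(toy_doc_vectors, [1.0, 0.0], top_k=3)
print(toy_order, toy_top_scores)
```

The returned indices are positions in the embedding array, so mapping them back to episodes needs a positional lookup such as `df['title'].iloc[toy_order]`.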

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
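
One simple way to put a number on the comparison is the overlap between the two top-k lists. The following sketch computes a Jaccard overlap; `overlap_at_k` is a hypothetical helper and the rankings shown are invented, so substitute the index lists produced by your own TF-IDF and BERT runs.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k items of two rankings (0.0 to 1.0)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)

# Invented document indices for illustration only
example_tfidf_ranking = [1, 2, 3, 4]
example_bert_ranking = [3, 4, 5, 6]
example_overlap = overlap_at_k(example_tfidf_ranking, example_bert_ranking, k=4)
print(f"Overlap@4: {example_overlap:.3f}")
```

A low overlap does not say which method is better; it only flags queries where the two representations disagree and are worth inspecting by hand.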

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.