
In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

Instructions:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.


In [1]:
import os
import re
import nltk
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import tensorflow as tf
import gensim.downloader as api
from transformers import BertTokenizer, TFBertModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
nltk.download('punkt')
nltk.download('stopwords')
stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
stopwords_set = set(stopwords.words('english'))

In [4]:
print(len(stopwords_set))

179


### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

Find the dataset in: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [34]:
file_path = '/content/drive/MyDrive/Colab Notebooks/week11/data/podcastdata_dataset.csv'
podcast_df = pd.read_csv(file_path, index_col=0)
podcast_df.head()

Unnamed: 0_level_0,guest,title,text
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...


In [8]:
podcast_df.shape

(319, 3)

### Step 3: Text Preprocessing

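Normalize the transcripts before indexing: replace non-word characters with spaces, lowercase, tokenize, remove English stopwords, and apply Porter stemming.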

In [9]:
# Function to preprocess the text
def preprocesar_texto(texto):
    # Remove unwanted characters and normalize
    texto = re.sub(r'\W', ' ', texto)
    texto = re.sub(r'\s+', ' ', texto)
    texto = texto.lower()
    # Tokenize the text
    tokens = word_tokenize(texto)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stopwords_set]
    # Apply stemming
    tokens = [stemmer.stem(word) for word in tokens]
    # Join the processed words back into a single string
    texto_procesado = ' '.join(tokens)
    return texto_procesado

In [10]:
# Create a new DataFrame with the preprocessed 'text' column
podcast_pre_df = podcast_df.copy()
podcast_pre_df['text_pre'] = podcast_pre_df['text'].apply(preprocesar_texto)

In [11]:
podcast_pre_df.head()

Unnamed: 0_level_0,guest,title,text,text_pre
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera...",part mit cours 6s099 artifici gener intellig g...
2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...,part mit cours 6s099 artifici gener intellig g...
3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang...",studi human mind cognit languag vision evolut ...
4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...,differ biolog neural network artifici neural n...
5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...,follow convers vladimir vapnik co inventor sup...


In [12]:
podcast_pre_df.shape

(319, 4)

In [13]:
# Save the results to a CSV file
podcast_pre_df.to_csv('/content/drive/My Drive/Colab Notebooks/week11/data/podcast_pre_df.csv', index=False)

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [14]:
def create_tfidf_representation(corpus_df):
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(corpus_df['text_pre'])
    tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
    return tfidf_df,tfidf_vectorizer

In [15]:
print(podcast_pre_df.columns)


Index(['guest', 'title', 'text', 'text_pre'], dtype='object')


In [16]:
# Create the TF-IDF representation
podcast_tfidf_df, vectorizer = create_tfidf_representation(podcast_pre_df)

In [17]:
print("\nTF-IDF representation:")
podcast_tfidf_df


TF-IDF representation:


Unnamed: 0,00,000,0000,0000001,000073,0001,000hour,000th,000x,001,...,целом,часа,четыре,чрезвычайно,что,чтобы,шесть,это,этот,들어가
0,0.0,0.009230,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.007667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.024692,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.007691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
314,0.0,0.015218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
315,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
316,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
317,0.0,0.036824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
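
Most entries in the matrix above are zero, so converting it to a dense DataFrame mainly costs memory. As a possible alternative (an assumption, not what this notebook does), the sparse matrix returned by `fit_transform` can be passed to `cosine_similarity` directly. The toy documents and names below are hypothetical stand-ins for the `text_pre` column.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy, already-preprocessed documents (hypothetical stand-ins for 'text_pre')
toy_docs = [
    "neural network train deep learn",
    "conscious brain mind experi",
    "rocket space mar engin",
]

toy_vectorizer = TfidfVectorizer()
X_sparse = toy_vectorizer.fit_transform(toy_docs)  # SciPy sparse matrix, no .toarray()

query_vec = toy_vectorizer.transform(["neural network"])
scores = cosine_similarity(query_vec, X_sparse).ravel()  # dense 1-D array, one score per doc

print(issparse(X_sparse), X_sparse.shape, scores.argmax())
```

Keeping the matrix sparse trades the convenience of labeled DataFrame columns for a much smaller memory footprint; the feature names remain available from `get_feature_names_out()`.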


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [18]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [47]:
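def generate_bert_embeddings(texts):
    """Return one [CLS] embedding per input string, shaped (n_texts, 768, 1)."""
    embeddings = []
    for text in texts:
        # truncation=True cuts each input at the model's maximum length
        # (512 tokens for bert-base-uncased), so long texts are only partly embedded
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    return np.array(embeddings).transpose(0,2,1)

# Caution: iterating over a DataFrame yields its column labels, so this call embeds the
# four strings 'guest', 'title', 'text', 'text_pre' rather than the 319 transcripts.
# To embed the transcripts, pass a column instead, e.g. podcast_pre_df['text'].
bert_embeddings = generate_bert_embeddings(podcast_pre_df)
print("BERT Embeddings:", bert_embeddings)
print("BERT Shape:", bert_embeddings.shape)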

BERT Embeddings: [[[-0.18896306]
  [ 0.2624014 ]
  [-0.15489869]
  ...
  [-0.12260249]
  [ 0.24914369]
  [ 0.24250938]]

 [[-0.41557804]
  [ 0.15562901]
  [ 0.09742901]
  ...
  [-0.30488974]
  [-0.05800811]
  [ 0.26350507]]

 [[-0.40360537]
  [ 0.16065195]
  [-0.16231938]
  ...
  [-0.21682927]
  [ 0.20362087]
  [ 0.5632472 ]]

 [[-0.18125309]
  [ 0.03536531]
  [-0.19476674]
  ...
  [-0.24290958]
  [ 0.14357528]
  [ 0.57741135]]]
BERT Shape: (4, 768, 1)
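
The shape reported above has 4 rows, which matches the number of columns in `podcast_pre_df` rather than its 319 rows. This is consistent with how pandas iterates: looping over a DataFrame yields column labels, while looping over a single column yields its values. The small frame below is hypothetical and only illustrates that behavior.

```python
import pandas as pd

# Hypothetical three-row frame with the same kind of columns as podcast_pre_df
toy_df = pd.DataFrame({
    "guest": ["Guest One", "Guest Two", "Guest Three"],
    "title": ["Title One", "Title Two", "Title Three"],
    "text": ["first transcript", "second transcript", "third transcript"],
})

from_frame = [item for item in toy_df]           # iterates over column labels
from_column = [item for item in toy_df["text"]]  # iterates over the values of one column

print(from_frame)
print(from_column)
```

Passing a column such as `podcast_pre_df['text']` to `generate_bert_embeddings` would therefore produce one embedding per episode.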


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
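
As a reminder of what the similarity score measures, here is a minimal NumPy sketch of cosine similarity between one query vector and a small document matrix. The vectors are invented for illustration, and `cosine_sim` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def cosine_sim(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

# Toy vectors (hypothetical): three documents in a three-term vocabulary
docs = np.array([
    [1.0, 0.0, 1.0],   # same direction as the query
    [0.0, 2.0, 0.0],   # shares no terms with the query
    [1.0, 1.0, 1.0],   # partial overlap
])
query = np.array([1.0, 0.0, 1.0])

scores = cosine_sim(query, docs)
ranking = np.argsort(-scores)   # document indices, best match first
print(scores, ranking)
```

The score depends only on the angle between vectors, not their length, so a long transcript is not favored merely for containing more words. This sketch assumes no all-zero vectors; `sklearn.metrics.pairwise.cosine_similarity` handles that case by returning 0.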

In [20]:
def retrieve_tfidf(query, vectorizer, tfidf_df, podcast_pre_df):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, tfidf_df)
    similarities_df = pd.DataFrame(similarities.T, columns=['sim'])
    # Reset index of podcast_pre_df to ensure unique index for alignment
    similarities_df['ep'] = podcast_pre_df.reset_index(drop=True)['title']  # Reset index of podcast_pre_df
    return similarities_df.sort_values(by='sim', ascending=False)

# Try an example query
query = 'gpt'
tfidf_results = retrieve_tfidf(query, vectorizer, podcast_tfidf_df, podcast_pre_df)
tfidf_results

Unnamed: 0,sim,ep
213,0.097006,"OpenAI Codex, GPT-3, Robotics, and the Future ..."
17,0.031047,OpenAI and AGI
120,0.027697,Friendship with an AI Companion
94,0.027369,Deep Learning
117,0.024672,"Math, Manim, Neural Networks & Teaching with 3..."
...,...,...
103,0.000000,Computer Architecture and Data Storage
102,0.000000,Artificial General Intelligence
101,0.000000,The War of Art
100,0.000000,Artificial Consciousness and the Nature of Rea...
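
`retrieve_tfidf` passes the raw query to `vectorizer.transform`, while the vocabulary was built from the stemmed `text_pre` column. The sketch below shows, on a toy corpus, why a query usually needs the same normalization as the documents. It is an illustration under assumptions: `normalize` is a hypothetical helper that uses a regular expression instead of `word_tokenize` and skips stopword removal, and the two sentences are invented.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def normalize(text):
    """Lowercase, keep word characters only, and stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(stemmer.stem(tok) for tok in tokens)

# Toy corpus (hypothetical sentences), indexed in stemmed form like 'text_pre'
toy_docs = [
    "Machine learning models learn patterns from data",
    "The universe is expanding and galaxies drift apart",
]
toy_vectorizer = TfidfVectorizer()
doc_matrix = toy_vectorizer.fit_transform([normalize(d) for d in toy_docs])

raw_query = "machine learning"

# Unstemmed query: 'machine' and 'learning' are not in the stemmed vocabulary
raw_scores = cosine_similarity(toy_vectorizer.transform([raw_query]), doc_matrix)[0]

# Query normalized like the corpus: 'machin' and 'learn' match the index
stemmed_scores = cosine_similarity(
    toy_vectorizer.transform([normalize(raw_query)]), doc_matrix
)[0]

print(raw_scores, stemmed_scores)
```

In this notebook the equivalent change would be to call `preprocesar_texto(query)` before `vectorizer.transform`, so that the query and the index share one vocabulary.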


In [21]:
def retrieve_bert(query, bert_embeddings, podcast_pre_df):
    query_embedding = generate_bert_embeddings([query]) # Pass query as a single-element list
    # Reshape embeddings to 2D
    query_embedding = query_embedding.reshape(query_embedding.shape[0], -1)
    bert_embeddings_2d = bert_embeddings.reshape(bert_embeddings.shape[0], -1)
    similarities = cosine_similarity(query_embedding, bert_embeddings_2d)
    similarities_df = pd.DataFrame(similarities.T, columns=['sim'])
    # Reset index of podcast_pre_df to ensure unique index for alignment
    similarities_df['ep'] = podcast_pre_df.reset_index(drop=True)['title']  # Reset index of podcast_pre_df
    return similarities_df.sort_values(by='sim', ascending=False)

# Try an example query
bert_results = retrieve_bert(query, bert_embeddings, podcast_pre_df)
bert_results

Unnamed: 0,sim,ep
2,0.750658,AI in the Age of Reason
0,0.729068,Life 3.0
1,0.7155,Consciousness
3,0.670224,Deep Learning


### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.


In [27]:
def get_top_results(similarities_df, top_n=5):
    return similarities_df.head(top_n)

top_n = 5


In [28]:
# Retrieve top results for TF-IDF
top_tfidf_results = get_top_results(tfidf_results, top_n)
print("Top TF-IDF Results:")
top_tfidf_results

Top TF-IDF Results:


Unnamed: 0,sim,ep
213,0.097006,"OpenAI Codex, GPT-3, Robotics, and the Future ..."
17,0.031047,OpenAI and AGI
120,0.027697,Friendship with an AI Companion
94,0.027369,Deep Learning
117,0.024672,"Math, Manim, Neural Networks & Teaching with 3..."


In [29]:
# Retrieve top results for BERT
top_bert_results = get_top_results(bert_results, top_n)
print("\nTop BERT Results:")
top_bert_results


Top BERT Results:


Unnamed: 0,sim,ep
2,0.750658,AI in the Age of Reason
0,0.729068,Life 3.0
1,0.7155,Consciousness
3,0.670224,Deep Learning


### Step 8: Test the IR System
* Test the system with a sample query.
* Retrieve and display the top results using both TF-IDF and BERT representations.

In [35]:
# Test query
sample_query = 'I think, reminding ourselves that the reason we try to solve problems'

In [36]:
# Retrieve results using TF-IDF
tfidf_results = retrieve_tfidf(sample_query, vectorizer, podcast_tfidf_df, podcast_pre_df)
top_tfidf_results = get_top_results(tfidf_results, top_n)
print("Top TF-IDF Results for query '{}':".format(sample_query))
top_tfidf_results


Top TF-IDF Results for query 'I think, reminding ourselves that the reason we try to solve problems':


Unnamed: 0,sim,ep
157,0.074143,The Next Generation of Big Ideas and Brave Minds
17,0.071229,OpenAI and AGI
15,0.065721,"Reinforcement Learning, Planning, and Robotics"
44,0.059648,"IBM Watson, Jeopardy & Deep Conversations with AI"
292,0.058319,DeepMind


In [37]:
# Retrieve results using BERT
bert_results = retrieve_bert(sample_query, bert_embeddings, podcast_pre_df)
top_bert_results = get_top_results(bert_results, top_n)
print("\nTop BERT Results for query '{}':".format(sample_query))
top_bert_results


Top BERT Results for query 'I think, reminding ourselves that the reason we try to solve problems':


Unnamed: 0,sim,ep
2,0.864473,AI in the Age of Reason
0,0.852647,Life 3.0
1,0.84282,Consciousness
3,0.801064,Deep Learning


### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
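
One simple way to put a number on how much the two rankings agree is the Jaccard overlap of their top-k titles. The helper `top_k_overlap` below is a hypothetical addition, and the two result frames are invented; they only mimic the `sim` and `ep` columns returned by `retrieve_tfidf` and `retrieve_bert`.

```python
import pandas as pd

def top_k_overlap(results_a, results_b, k=5):
    """Jaccard overlap between the top-k episode titles of two ranked result frames."""
    top_a = set(results_a.head(k)["ep"])
    top_b = set(results_b.head(k)["ep"])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Hypothetical ranked results, already sorted by descending similarity
results_a = pd.DataFrame({"sim": [0.9, 0.5, 0.1], "ep": ["A", "B", "C"]})
results_b = pd.DataFrame({"sim": [0.8, 0.7, 0.2], "ep": ["B", "D", "A"]})

overlap = top_k_overlap(results_a, results_b, k=2)
print(overlap)
```

An overlap near 0 means the methods surface different episodes for the same query; it says nothing about which ranking is better, which still requires judging relevance.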


Example #1

In [44]:
# Test query
sample_query = 'machine learning'
# Retrieve results using TF-IDF
tfidf_results = retrieve_tfidf(sample_query, vectorizer, podcast_tfidf_df, podcast_pre_df)
top_tfidf_results = get_top_results(tfidf_results, top_n)
print("Top TF-IDF Results for query '{}':".format(sample_query))
print(top_tfidf_results)
# Retrieve results using BERT
bert_results = retrieve_bert(sample_query, bert_embeddings, podcast_pre_df)
top_bert_results = get_top_results(bert_results, top_n)
print("\nTop BERT Results for query '{}':".format(sample_query))
print(top_bert_results)


Top TF-IDF Results for query 'machine learning':
     sim                                                 ep
318  0.0  Biology, Life, Aliens, Evolution, Embryogenesi...
0    0.0                                           Life 3.0
1    0.0                                      Consciousness
2    0.0                            AI in the Age of Reason
3    0.0                                      Deep Learning

Top BERT Results for query 'machine learning':
        sim                       ep
2  0.839966  AI in the Age of Reason
0  0.812407                 Life 3.0
1  0.798970            Consciousness
3  0.785222            Deep Learning


Example #2

In [45]:
# Test query
sample_query = 'artificial intelligence'
# Retrieve results using TF-IDF
tfidf_results = retrieve_tfidf(sample_query, vectorizer, podcast_tfidf_df, podcast_pre_df)
top_tfidf_results = get_top_results(tfidf_results, top_n)
print("Top TF-IDF Results for query '{}':".format(sample_query))
print(top_tfidf_results)
# Retrieve results using BERT
bert_results = retrieve_bert(sample_query, bert_embeddings, podcast_pre_df)
top_bert_results = get_top_results(bert_results, top_n)
print("\nTop BERT Results for query '{}':".format(sample_query))
print(top_bert_results)


Top TF-IDF Results for query 'artificial intelligence':
     sim                                                 ep
318  0.0  Biology, Life, Aliens, Evolution, Embryogenesi...
0    0.0                                           Life 3.0
1    0.0                                      Consciousness
2    0.0                            AI in the Age of Reason
3    0.0                                      Deep Learning

Top BERT Results for query 'artificial intelligence':
        sim                       ep
2  0.853677  AI in the Age of Reason
0  0.832118                 Life 3.0
1  0.816438            Consciousness
3  0.777315            Deep Learning


In [46]:
# Test query
sample_query = 'GPT'
# Retrieve results using TF-IDF
tfidf_results = retrieve_tfidf(sample_query, vectorizer, podcast_tfidf_df, podcast_pre_df)
top_tfidf_results = get_top_results(tfidf_results, top_n)
print("Top TF-IDF Results for query '{}':".format(sample_query))
print(top_tfidf_results)
# Retrieve results using BERT
bert_results = retrieve_bert(sample_query, bert_embeddings, podcast_pre_df)
top_bert_results = get_top_results(bert_results, top_n)
print("\nTop BERT Results for query '{}':".format(sample_query))
print(top_bert_results)


Top TF-IDF Results for query 'GPT':
          sim                                                 ep
213  0.097006  OpenAI Codex, GPT-3, Robotics, and the Future ...
17   0.031047                                     OpenAI and AGI
120  0.027697                    Friendship with an AI Companion
94   0.027369                                      Deep Learning
117  0.024672  Math, Manim, Neural Networks & Teaching with 3...

Top BERT Results for query 'GPT':
        sim                       ep
2  0.750658  AI in the Age of Reason
0  0.729068                 Life 3.0
1  0.715500            Consciousness
3  0.670224            Deep Learning


We retrieve and print the top results for each query using both the TF-IDF and BERT representations, so the two rankings can be compared directly. TF-IDF scores an episode by weighted keyword overlap with the query, while BERT embeddings are intended to capture the contextual meaning of the text.
The recorded outputs need careful reading. For "GPT", TF-IDF ranks "OpenAI Codex, GPT-3, Robotics, and the Future ..." first (similarity 0.097), which is a relevant hit. For "machine learning" and "artificial intelligence", every TF-IDF score is 0.0. The most likely cause is that the index was built from stemmed text (for example `artifici gener intellig`), while the queries are passed to the vectorizer without the same preprocessing, so their unstemmed tokens match nothing in the vocabulary. The BERT scores are higher (roughly 0.67 to 0.86), but the embedding array has shape (4, 768, 1), so only four vectors are ranked and every query returns the same four titles in the same order: "AI in the Age of Reason," "Life 3.0," "Consciousness," and "Deep Learning." These runs therefore do not show that BERT retrieves more relevant episodes than TF-IDF. A fair comparison requires embedding all 319 transcripts and preprocessing each query the same way as the corpus.

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.
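
When documenting findings, a small metric can make the comparison less subjective. The sketch below computes precision at k against a hand-labeled set of relevant episodes. Both the helper `precision_at_k` and the labels are hypothetical: the dataset does not ship relevance judgments, so they would have to be created manually for each query.

```python
def precision_at_k(ranked_titles, relevant_titles, k=5):
    """Fraction of the first k ranked titles that appear in the relevant set."""
    top_k = list(ranked_titles)[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for title in top_k if title in relevant_titles)
    # Divides by the number of results actually returned, which may be fewer than k
    return hits / len(top_k)

# Hypothetical ranking and hand-made relevance labels, for illustration only
ranked = ["Episode A", "Episode B", "Episode C", "Episode D"]
relevant = {"Episode A", "Episode C"}

p_at_2 = precision_at_k(ranked, relevant, k=2)
p_at_4 = precision_at_k(ranked, relevant, k=4)
print(p_at_2, p_at_4)
```

With the notebook's result frames, the ranked titles would come from the `ep` column, for example `top_tfidf_results['ep']`.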