# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

## Instructions:

* Follow the steps outlined below to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

In [2]:
#Necessary libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity




### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

In [17]:
#Load data
podcast_df = pd.read_csv('../WEEK11/archive/podcastdata_dataset.csv')
# Show the column names
print(podcast_df.columns)
# Extract the fields used in this workshop: episode title and transcript text
titles = podcast_df['title']
text = podcast_df['text']

# Preview the transcripts
print(text.head())

Index(['id', 'guest', 'title', 'text'], dtype='object')
0    As part of MIT course 6S099, Artificial Genera...
1    As part of MIT course 6S099 on artificial gene...
2    You've studied the human mind, cognition, lang...
3    What difference between biological neural netw...
4    The following is a conversation with Vladimir ...
Name: text, dtype: object


In [25]:
# Download necessary resources from NLTK for preprocessing
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Karen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Karen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Karen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Step 3: Text Preprocessing

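Normalize each transcript before vectorizing it: lowercase the text, replace non-word characters and digits with spaces, tokenize, and remove English stopwords.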

In [18]:
# Preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace special characters and digits with spaces
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d+', ' ', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Join the tokens back into a single string
    return ' '.join(tokens)

# Apply preprocessing to the transcripts
podcast_df['processed_text'] = podcast_df['text'].apply(preprocess_text)

# Show some processed texts
print(podcast_df['processed_text'].head())


0    part mit course artificial general intelligenc...
1    part mit course artificial general intelligenc...
2    studied human mind cognition language vision e...
3    difference biological neural networks artifici...
4    following conversation vladimir vapnik co inve...
Name: processed_text, dtype: object
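
To see what the cleaning steps do without downloading the NLTK resources, here is a minimal sketch on a single made-up sentence. It assumes a tiny hand-picked stopword set (`DEMO_STOPWORDS`) and plain whitespace splitting in place of `stopwords.words('english')` and `word_tokenize`, so it only approximates `preprocess_text`; the helper name `preprocess_demo` is hypothetical.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: a tiny subset).
DEMO_STOPWORDS = {"the", "is", "a", "with", "of", "and", "in"}

def preprocess_demo(text):
    """Approximate preprocess_text using only the standard library."""
    text = text.lower()                 # lowercase
    text = re.sub(r"\W", " ", text)     # non-word characters -> spaces
    text = re.sub(r"\d+", " ", text)    # digit runs -> spaces
    tokens = text.split()               # whitespace split instead of word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

# Made-up sentence, not taken from the dataset.
sample = "In 2019, the host & guest discussed AI, robots, and the future of work!"
cleaned = preprocess_demo(sample)
print(cleaned)
```

Punctuation and the year disappear, and only content words remain, which is the form the vectorizers in the next steps receive.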


### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [19]:

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the processed transcripts
tfidf_matrix = tfidf_vectorizer.fit_transform(podcast_df['processed_text'])

# Show the shape of the TF-IDF matrix
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)

Shape of TF-IDF matrix: (319, 47172)


In [20]:
tfidf_matrix

<319x47172 sparse matrix of type '<class 'numpy.float64'>'
	with 728207 stored elements in Compressed Sparse Row format>
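
To make the numbers stored in that matrix less opaque, the sketch below computes TF-IDF weights by hand for a three-document toy corpus. It assumes the defaults documented for scikit-learn's `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) but uses simple whitespace tokenization, so it is an illustration rather than a drop-in replacement. The corpus and the helper name `tfidf_rows` are made up.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Made-up toy corpus for illustration only.
toy_corpus = [
    "artificial intelligence research",
    "artificial neural networks",
    "cryptocurrency markets",
]
vocab, rows = tfidf_rows(toy_corpus)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first document, "artificial" receives a lower weight than "intelligence" because it also occurs in the second document; the same effect explains why very common words contribute little to the similarity scores.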

In [4]:
%pip install transformers torch

Note: you may need to restart the kernel to use updated packages.


### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [21]:

# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Compute a BERT embedding for a text by mean-pooling the last hidden layer.
# Note: truncation=True with max_length=512 keeps only the first 512 tokens,
# so long transcripts are represented by their opening section only.
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

# Apply the BERT embedding function to the processed transcripts
podcast_df['bert_embeddings'] = podcast_df['processed_text'].apply(get_bert_embeddings)

# Show the shape of the BERT embeddings
print("Shape of BERT embeddings:", podcast_df['bert_embeddings'].iloc[0].shape)

Shape of BERT embeddings: (768,)
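
Because the tokenizer call truncates at 512 tokens, each 768-dimensional vector above describes only the start of an episode. One possible improvement, not part of this notebook, is to split the token ids into windows, embed each window, and average the window vectors. The sketch below shows only the windowing and averaging logic in plain Python; `embed_window` is a hypothetical callable standing in for a function that runs BERT on one window, and the window size of 510 assumes two positions are reserved for the `[CLS]` and `[SEP]` tokens.

```python
def split_into_windows(token_ids, window_size=510):
    """Split a list of token ids into consecutive, non-overlapping windows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [token_ids[i:i + window_size]
            for i in range(0, len(token_ids), window_size)]

def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_long_text(token_ids, embed_window, window_size=510):
    """Embed every window with embed_window and average the results."""
    windows = split_into_windows(token_ids, window_size)
    return average_vectors([embed_window(w) for w in windows])

# Demo with fake token ids and a fake embedder (assumption: for illustration only).
fake_ids = list(range(1200))
window_lengths = [len(w) for w in split_into_windows(fake_ids)]
print(window_lengths)

demo_vector = embed_long_text(list(range(4)), lambda w: [float(len(w))], window_size=2)
print(demo_vector)
```

Averaging treats every window equally; weighting windows by length or keeping one vector per window are alternative designs worth testing.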


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

In [22]:
# Process a query and score every episode against it
def process_query(query, podcast_df, tfidf_vectorizer, tfidf_matrix):
    # Preprocess the query
    query_processed = preprocess_text(query)

    # Vectorize the query with TF-IDF
    query_tfidf = tfidf_vectorizer.transform([query_processed])

    # Compute cosine similarity for TF-IDF
    tfidf_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

    # Vectorize the query with BERT
    query_bert_embedding = get_bert_embeddings(query_processed).reshape(1, -1)

    # Compute cosine similarity for BERT
    bert_embeddings = np.vstack(podcast_df['bert_embeddings'].values)
    bert_similarities = cosine_similarity(query_bert_embedding, bert_embeddings).flatten()

    # Build a DataFrame with the results
    results_df = podcast_df.copy()
    results_df['tfidf_similarity'] = tfidf_similarities
    results_df['bert_similarity'] = bert_similarities

    # Sort by TF-IDF similarity, using BERT similarity to break ties
    results_df = results_df.sort_values(by=['tfidf_similarity', 'bert_similarity'], ascending=False)

    return results_df[['title', 'tfidf_similarity', 'bert_similarity', 'text']]
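
Both scoring paths rely on cosine similarity, which compares the direction of two vectors and ignores their length. The following minimal sketch computes it by hand and ranks a few made-up document vectors against a query vector. The helper names `cosine` and `rank_by_similarity` are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, score) pairs, highest score first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Made-up vectors: document 0 points in the same direction as the query,
# document 1 is orthogonal to it, document 2 overlaps partially.
query_vec = [1.0, 1.0, 0.0]
doc_vecs = [[2.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query_vec, doc_vecs)
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

Document 0 ranks first even though it is twice as long as the query vector, which is why transcript length alone does not change a cosine score.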

In [24]:
# Prompt the user for a query
query = input("Enter your query: ")

# Process the user's query
results = process_query(query, podcast_df, tfidf_vectorizer, tfidf_matrix)

# Show the results
print("Query is:",query)
results.head()

Query is: artificial


| | title | tfidf_similarity | bert_similarity | text |
|---|---|---|---|---|
| 2 | AI in the Age of Reason | 0.146613 | 0.1558 | You've studied the human mind, cognition, lang... |
| 91 | Square, Cryptocurrency, and Artificial Intelli... | 0.074839 | 0.129934 | The following is a conversation with Jack Dors... |
| 3 | Deep Learning | 0.072252 | 0.136753 | What difference between biological neural netw... |
| 12 | Brains, Minds, and Machines | 0.059004 | 0.142168 | The following is a conversation with Tommaso P... |
| 7 | Google | 0.052326 | 0.138043 | The following is a conversation with Eric Schm... |


### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

In [25]:
# Retrieve and compare the top results
def retrieve_and_compare_results(query, podcast_df, tfidf_vectorizer, tfidf_matrix, top_n=5):
    # Process the query
    results_df = process_query(query, podcast_df, tfidf_vectorizer, tfidf_matrix)

    # Sort by TF-IDF similarity and keep the top N
    top_tfidf_results = results_df.sort_values(by='tfidf_similarity', ascending=False).head(top_n)

    # Sort by BERT similarity and keep the top N
    top_bert_results = results_df.sort_values(by='bert_similarity', ascending=False).head(top_n)

    return top_tfidf_results, top_bert_results

# Retrieve and compare the top results for the user's query
top_tfidf_results, top_bert_results = retrieve_and_compare_results(query, podcast_df, tfidf_vectorizer, tfidf_matrix, top_n=5)

print("Query is:",query)
# Show the top TF-IDF results
print("\nTop TF-IDF Results:")
for idx, row in top_tfidf_results.iterrows():
    print(f"Title: {row['title']}")
    print(f"TF-IDF Similarity: {row['tfidf_similarity']}")
    print(f"Transcript Excerpt: {row['text'][:200]}...")  # First 200 characters of the transcript
    print("\n")

# Show the top BERT results
print("\nTop BERT Results:")
for idx, row in top_bert_results.iterrows():
    print(f"Title: {row['title']}")
    print(f"BERT Similarity: {row['bert_similarity']}")
    print(f"Transcript Excerpt: {row['text'][:200]}...")  # First 200 characters of the transcript
    print("\n")

Query is: artificial

Top TF-IDF Results:
Title: AI in the Age of Reason
TF-IDF Similarity: 0.14661302183592007
Transcript Excerpt: You've studied the human mind, cognition, language, vision, evolution, psychology, from child to adult, from the level of individual to the level of our entire civilization. So I feel like I can start...


Title: Square, Cryptocurrency, and Artificial Intelligence
TF-IDF Similarity: 0.07483942890201313
Transcript Excerpt: The following is a conversation with Jack Dorsey, co founder and CEO of Twitter and founder and CEO of Square. Given the happenings at the time related to Twitter leadership and the very limited time ...


Title: Deep Learning
TF-IDF Similarity: 0.07225237743107563
Transcript Excerpt: What difference between biological neural networks and artificial neural networks is most mysterious, captivating, and profound for you? First of all, there's so much we don't know about biological ne...


Title: Brains, Minds, and Machines
TF-IDF Similarity

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

In [26]:
# Define the test query
test_query = "artificial intelligence"

# Retrieve and compare the top results for the test query
top_tfidf_results, top_bert_results = retrieve_and_compare_results(test_query, podcast_df, tfidf_vectorizer, tfidf_matrix, top_n=5)

print("Query test is:",test_query)
# Show the top TF-IDF results
print("\nTop TF-IDF Results:")
for idx, row in top_tfidf_results.iterrows():
    print(f"Title: {row['title']}")
    print(f"TF-IDF Similarity: {row['tfidf_similarity']}")
    print(f"Transcript Excerpt: {row['text'][:200]}...")  # First 200 characters of the transcript
    print("\n")

# Show the top BERT results
print("\nTop BERT Results:")
for idx, row in top_bert_results.iterrows():
    print(f"Title: {row['title']}")
    print(f"BERT Similarity: {row['bert_similarity']}")
    print(f"Transcript Excerpt: {row['text'][:200]}...")  # First 200 characters of the transcript
    print("\n")


Query test is: artificial intelligence

Top TF-IDF Results:
Title: AI in the Age of Reason
TF-IDF Similarity: 0.21138929368299314
Transcript Excerpt: You've studied the human mind, cognition, language, vision, evolution, psychology, from child to adult, from the level of individual to the level of our entire civilization. So I feel like I can start...


Title: Concepts, Analogies, Common Sense & Future of AI
TF-IDF Similarity: 0.1700409854657989
Transcript Excerpt: The following is a conversation with Melanie Mitchell. She's a professor of computer science at Portland State University and an external professor at Santa Fe Institute. She has worked on and written...


Title: Measures of Intelligence
TF-IDF Similarity: 0.15418131660687703
Transcript Excerpt: The following is a conversation with Francois Chollet, his second time on the podcast. He's both a world class engineer and a philosopher in the realm of deep learning and artificial intelligence. Thi...


Title: Keras, Deep Learning

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
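
One simple way to put a number on how much the two methods agree is to measure the overlap between their top-N title lists. The sketch below is one possible approach, not something the notebook prescribes: it returns the shared titles and the Jaccard index (shared titles divided by all distinct titles). The helper name `top_n_overlap` and the episode titles are placeholders.

```python
def top_n_overlap(titles_a, titles_b):
    """Return the titles both lists share and their Jaccard index."""
    set_a, set_b = set(titles_a), set(titles_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Placeholder titles (assumption: not real retrieval results).
tfidf_titles = ["Episode A", "Episode B", "Episode C"]
bert_titles = ["Episode B", "Episode D", "Episode A"]

shared, jaccard = top_n_overlap(tfidf_titles, bert_titles)
print(sorted(shared), jaccard)
```

With the DataFrames from Step 7, the call would look like `top_n_overlap(top_tfidf_results['title'], top_bert_results['title'])`. A low Jaccard index signals that the two representations rank episodes very differently and that the disagreeing titles deserve a manual look.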

TF-IDF

Advantage: Results tend to be relevant when the query keywords appear verbatim in the text.
Disadvantage: It can return less relevant results when the keywords are common across different contexts.

BERT

Advantage: Captures context and semantics better, which tends to yield more precise and relevant results.
Disadvantage: Results can vary and be less predictable, although they usually reflect a deeper understanding of the topic. In this notebook each transcript is also truncated to 512 tokens before embedding, so only the opening of an episode is represented.


Computational Efficiency


TF-IDF: Faster and computationally cheaper. Suitable for systems with limited resources.

BERT: Requires more processing power and memory, but can offer higher retrieval accuracy.

Analysis of Results

TF-IDF: Ignores the context in which words appear. For example, the query "artificial intelligence" can match any mention of those words, regardless of context.

BERT: Models context and can identify synonyms and related phrases. For example, "artificial intelligence" can match discussions of "AI" and "machine learning" more accurately.