# Integrating Embeddings with Queries in an Information Retrieval System

## Objective

In this exercise, we will learn how to integrate embeddings with a query to enhance an Information Retrieval (IR) system. We will use both static and contextual embeddings to generate representations of queries and documents, compute their similarities, and rank the documents based on relevance to the query.

---

## Stages Covered

1. **Introduction to Pre-trained Transformer Models**
   - Load and use BERT for contextual embeddings.
   - Load and use Word2Vec for static embeddings.

2. **Generating Text Embeddings**
   - Generate embeddings for queries and documents using BERT.
   - Generate embeddings for queries and documents using Word2Vec.

3. **Computing Similarity Between Embeddings**
   - Compute cosine similarity between query and document embeddings.
   - Rank documents based on similarity scores.

4. **Integrating Embeddings with Queries**
   - Practical implementation of embedding-based retrieval for a given text corpus.

---

## Prerequisites

- TensorFlow
- Hugging Face's Transformers library
- Gensim library
- Scikit-learn library
- A text corpus in the `../data` folder

---

## Exercise

Follow the steps below to integrate embeddings with a query and enhance your IR system.



Step 0: Verify requirements:

* tensorflow
* transformers
* gensim
* scikit-learn
* pandas
* numpy
* matplotlib
* seaborn

Step 1: Download dataset from Kaggle

URL: https://www.kaggle.com/datasets/zynicide/wine-reviews

In [None]:
import pandas as pd

# Replace 'data/winemag-data_first150k.csv' with the actual path to your downloaded CSV file
wine_df = pd.read_csv('data/winemag-data_first150k.csv')

# Now you can work with the DataFrame as usual
print(wine_df.head())
corpus = wine_df['description']


Step 2: Load a Pre-trained Transformer Model

Use the BERT model for generating contextual embeddings and Word2Vec for static embeddings.

In [None]:
import tensorflow as tf
import gensim.downloader as api
from transformers import BertTokenizer, TFBertModel

# Load pre-trained Word2Vec model
word2vec_model = api.load('word2vec-google-news-300')

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')


Step 3: Generate Text Embeddings

Static Embeddings with Word2Vec

In [None]:
import concurrent.futures
import numpy as np
import tensorflow as tf

# Generate a Word2Vec embedding for one text by averaging its word vectors
def generate_word2vec_embedding(text):
    tokens = text.lower().split()
    word_vectors = [word2vec_model[word] for word in tokens if word in word2vec_model]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        # No token is in the vocabulary: fall back to a zero vector
        return np.zeros(word2vec_model.vector_size)

# Generate embeddings for a list of texts
def generate_word2vec_embeddings(texts):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map returns results in input order, so row i matches texts[i]
        embeddings = list(executor.map(generate_word2vec_embedding, texts))
    return np.array(embeddings)

# Example usage
word2vec_embeddings = generate_word2vec_embeddings(corpus)
print("Word2Vec Embeddings:", word2vec_embeddings)
print("Word2Vec Shape:", word2vec_embeddings.shape)
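
To make the averaging step concrete, here is a minimal sketch that swaps the real Word2Vec model for a tiny hand-written vocabulary. The `toy_vectors` dictionary and the `average_embedding` helper are hypothetical names introduced only for illustration. The logic follows the same idea as the cell above: average the vectors of known tokens and fall back to zeros when none are known.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors"; the Google News model uses 300 dimensions.
toy_vectors = {
    "dry": np.array([1.0, 0.0, 0.0]),
    "red": np.array([0.0, 1.0, 0.0]),
    "wine": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(text, vectors, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    tokens = text.lower().split()
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

print(average_embedding("Dry red wine", toy_vectors))        # all three tokens are known
print(average_embedding("A dry wine, indeed", toy_vectors))  # "wine," keeps its comma and is skipped
print(average_embedding("zzz", toy_vectors))                 # no known token: zero vector
```

The second call shows a side effect of whitespace splitting: punctuation stays attached to tokens, so stripping it before the lookup would likely increase vocabulary coverage.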


Contextual Embeddings with BERT

In [None]:
# Generate a BERT embedding for one text
def generate_bert_embedding(text):
    # Tokenize the text; truncation=True cuts inputs longer than the model's maximum length
    inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
    # Run the model to obtain contextual token embeddings
    outputs = model(**inputs)
    # Return the embedding of the first token ([CLS]), shape (1, hidden_size)
    return outputs.last_hidden_state[:, 0, :]

# Generate embeddings for a list of texts
def generate_bert_embeddings(texts):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map returns results in input order, so row i matches texts[i]
        embeddings = list(executor.map(generate_bert_embedding, texts))
    return np.array(embeddings)

# Example usage
bert_embeddings = generate_bert_embeddings(corpus)
print("BERT Embeddings:", bert_embeddings)
print("BERT Shape:", bert_embeddings.shape)
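
Later steps map similarity scores back to documents by row position, so row `i` of an embedding matrix has to correspond to document `i`. The sketch below illustrates how two common ways of collecting results from a `ThreadPoolExecutor` differ in ordering. `slow_square` is a hypothetical stand-in for an embedding function, and the sleep times are chosen only to make tasks finish out of order.

```python
import concurrent.futures
import time

def slow_square(x):
    # Larger inputs sleep less, so they tend to finish before smaller ones.
    time.sleep((3 - x) * 0.05)
    return x * x

inputs = [0, 1, 2]

# as_completed yields futures as they finish, not in submission order.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_square, x) for x in inputs]
    completed_order = [f.result() for f in concurrent.futures.as_completed(futures)]

# executor.map yields results in the order of the inputs.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    mapped_order = list(executor.map(slow_square, inputs))

print("as_completed:", completed_order)  # order depends on which task finishes first
print("map:", mapped_order)              # always aligned with `inputs`
```

When results must stay aligned with the input list, `executor.map` (or a plain loop) is the safer choice; `as_completed` fits cases where order does not matter.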


Step 4: Compute Similarity Between Embeddings

Use `cosine_similarity` from scikit-learn to compute the cosine similarity between every pair of embeddings.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between Word2Vec embeddings
word2vec_embeddings = word2vec_embeddings.squeeze()
word2vec_similarity = cosine_similarity(word2vec_embeddings)
print("Word2Vec Cosine Similarity:\n", word2vec_similarity)

# Cosine similarity between BERT embeddings
bert_embeddings = bert_embeddings.squeeze()
bert_similarity = cosine_similarity(bert_embeddings)
print("BERT Cosine Similarity:\n", bert_similarity)
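
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch of that definition in plain Python. The `cosine` helper is a hypothetical name, and returning `0.0` for an all-zero vector is a convention chosen here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the score depends only on direction, a long and a short document that use similar vocabulary can still score close to 1.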


Step 5: Compare Contextual and Static Embeddings

Analyze and compare the similarity results from both BERT and Word2Vec embeddings.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_similarity_matrix(matrix, title, figsize=(8, 6), annotation=True):
    plt.figure(figsize=figsize)
    sns.heatmap(matrix, annot=annotation, cmap='coolwarm', fmt='.2f')
    plt.title(title)
    plt.show()

plot_similarity_matrix(word2vec_similarity, "Word2Vec Cosine Similarity")
plot_similarity_matrix(bert_similarity, "BERT Cosine Similarity")
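
Heatmaps give a visual impression; a simple number can complement them. One possible approach, sketched below under the assumption that both similarity matrices cover the same documents in the same order, is to check how often the two models agree on each document's nearest neighbor. The matrices `sim_a` and `sim_b` are hypothetical toy data, and `nearest_neighbors` is a helper introduced for this example.

```python
import numpy as np

def nearest_neighbors(similarity):
    """Index of the most similar *other* document for each row."""
    sim = np.array(similarity, dtype=float)  # copy so the input is not modified
    np.fill_diagonal(sim, -np.inf)           # ignore self-similarity
    return sim.argmax(axis=1)

# Hypothetical 3x3 similarity matrices standing in for the Word2Vec and BERT results.
sim_a = np.array([[1.0, 0.9, 0.2],
                  [0.9, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
sim_b = np.array([[1.0, 0.3, 0.8],
                  [0.3, 1.0, 0.2],
                  [0.8, 0.2, 1.0]])

nn_a = nearest_neighbors(sim_a)
nn_b = nearest_neighbors(sim_b)
agreement = np.mean(nn_a == nn_b)

print("Nearest neighbors (A):", nn_a)
print("Nearest neighbors (B):", nn_b)
print("Fraction of documents with the same nearest neighbor:", agreement)
```

A low agreement value suggests the two embedding types organize the corpus differently, which is worth inspecting by reading a few of the disagreeing pairs.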

Step 6: Applying to Corpus

In [None]:
# Generate embeddings for the corpus
corpus_word2vec_embeddings = generate_word2vec_embeddings(corpus)
corpus_bert_embeddings = generate_bert_embeddings(corpus)

# Make sure the embeddings have the expected shape
corpus_word2vec_embeddings = corpus_word2vec_embeddings.squeeze()
corpus_bert_embeddings = corpus_bert_embeddings.squeeze()

# Compute the cosine similarity for the corpus
corpus_word2vec_similarity = cosine_similarity(corpus_word2vec_embeddings)
corpus_bert_similarity = cosine_similarity(corpus_bert_embeddings)

# Show the similarity matrices
plot_similarity_matrix(corpus_word2vec_similarity, "Corpus Word2Vec Cosine Similarity", figsize=(16, 12), annotation=False)
plot_similarity_matrix(corpus_bert_similarity, "Corpus BERT Cosine Similarity", figsize=(16, 12), annotation=False)
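
A dense pairwise similarity matrix grows quadratically with the number of documents, which matters for a corpus of this size. The sketch below estimates the memory needed, assuming 8-byte floats. The figure of 150,000 documents is a round number suggested by the dataset file name, not a measured row count, and `pairwise_matrix_gib` is a hypothetical helper.

```python
def pairwise_matrix_gib(n_docs, bytes_per_value=8):
    """Memory needed for a dense n_docs x n_docs similarity matrix, in GiB."""
    return n_docs * n_docs * bytes_per_value / 2**30

for n in (1_000, 10_000, 150_000):
    print(f"{n:>7} documents -> {pairwise_matrix_gib(n):10.3f} GiB")
```

If the full matrix does not fit in memory, computing and plotting it for a small subset of the corpus is usually enough to compare the two embedding types.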


Step 7: Generate Embeddings for the Query

Generate embeddings for the query using the same model used for the documents.

In [None]:
# Define the query
query = "Sample query text for information retrieval"

# Generate the query embedding using Word2Vec
query_word2vec_embedding = generate_word2vec_embedding(query).reshape(1, -1)
print("Query Word2Vec Embedding Shape:", query_word2vec_embedding.shape)

# Generate the query embedding using BERT
query_bert_embedding = generate_bert_embedding(query).numpy().reshape(1, -1)
print("Query BERT Embedding Shape:", query_bert_embedding.shape)


Step 8: Compute Similarity Between Query and Documents

Compute the similarity between the query embedding and each document embedding.

In [None]:
# Compute the cosine similarity between the query and the documents (Word2Vec)
query_word2vec_similarity = cosine_similarity(query_word2vec_embedding, corpus_word2vec_embeddings)
print("Query-Document Word2Vec Cosine Similarity:", query_word2vec_similarity)

# Compute the cosine similarity between the query and the documents (BERT)
query_bert_similarity = cosine_similarity(query_bert_embedding, corpus_bert_embeddings)
print("Query-Document BERT Cosine Similarity:", query_bert_similarity)
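
Query-to-document cosine similarity can also be written as a matrix-vector product over unit-length vectors, which makes the shapes involved explicit. This is a minimal NumPy sketch, not a replacement for the scikit-learn call. `query_similarities`, `toy_docs`, and `toy_query` are hypothetical names, and the small `eps` guard against zero-length vectors is an assumption of this example.

```python
import numpy as np

def query_similarities(query_vec, doc_matrix, eps=1e-12):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    q = np.asarray(query_vec, dtype=float).ravel()   # shape (d,)
    d = np.asarray(doc_matrix, dtype=float)          # shape (n_docs, d)
    q_unit = q / max(np.linalg.norm(q), eps)
    d_norms = np.maximum(np.linalg.norm(d, axis=1, keepdims=True), eps)
    return (d / d_norms) @ q_unit                    # shape (n_docs,)

toy_docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.0, 0.0]])
toy_query = np.array([[1.0, 0.0]])
print(query_similarities(toy_query, toy_docs))
```

The result has one score per document, in the same order as the rows of the document matrix, which is what the ranking step relies on.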


Step 9: Retrieve and Rank Documents Based on Similarity Scores

Retrieve and rank the documents based on their similarity scores to the query.

In [None]:
# Retrieve and rank documents by similarity (Word2Vec)
word2vec_sorted_indices = np.argsort(-query_word2vec_similarity[0])
# Use .iloc so the sorted positions are treated as row positions, not index labels
sorted_word2vec_docs = [corpus.iloc[i] for i in word2vec_sorted_indices]

# Retrieve and rank documents by similarity (BERT)
bert_sorted_indices = np.argsort(-query_bert_similarity[0])
sorted_bert_docs = [corpus.iloc[i] for i in bert_sorted_indices]

# Show the ranked documents
print("Top documents based on Word2Vec similarity:")
for doc in sorted_word2vec_docs[:5]:
    print(doc)

print("\nTop documents based on BERT similarity:")
for doc in sorted_bert_docs[:5]:
    print(doc)
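
When only the top few results are needed, sorting every score is more work than necessary, and printing the score next to each document makes the ranking easier to judge. The sketch below shows one way to do this with NumPy. `top_k`, `toy_scores`, and `toy_docs` are hypothetical names used only in this example.

```python
import numpy as np

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    scores = np.asarray(scores, dtype=float).ravel()
    k = min(k, scores.size)
    if k <= 0:
        return []
    # argpartition selects the k largest scores without sorting the whole array;
    # only those k candidates are then sorted.
    candidates = np.argpartition(-scores, k - 1)[:k]
    ordered = candidates[np.argsort(-scores[candidates])]
    return [(int(i), float(scores[i])) for i in ordered]

toy_scores = np.array([[0.12, 0.87, 0.45, 0.91, 0.30]])  # shape (1, n_docs), like cosine_similarity output
toy_docs = ["doc A", "doc B", "doc C", "doc D", "doc E"]

for idx, score in top_k(toy_scores, 3):
    print(f"{score:.2f}  {toy_docs[idx]}")
```

The returned indices are row positions, so they should be used with positional lookup into the corpus.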
