[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/deep-learning/rnn/semantic_search.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Semantic search

In this notebook, we use sentence embeddings to perform [semantic search](https://en.wikipedia.org/wiki/Semantic_search). Semantic search retrieves the contexts most relevant to a query based on meaning, even when the query and the contexts do not share the same words. This differs from keyword-based search, where the search engine looks for exact matches of the keywords in the query.

We use the [Universal Sentence Encoder](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder) (USE) model by Google to encode both questions and contexts (paragraphs) as embeddings. USE is a pre-trained model that generates semantic embeddings for small to medium-sized text inputs. Then, using a question/answer dataset, we perform semantic search: we compare the embedding of the query with the embeddings of either the questions or the contexts, and return the most similar entries.

*Note*: USE is a relatively lightweight model that produces good embeddings quickly. Transformer-based models like [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) (Bidirectional Encoder Representations from Transformers) can provide better embeddings, at the cost of slower inference.


We use [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (Stanford Question Answering Dataset), one of the most widely used datasets for question-answering tasks. It consists of questions posed by crowdworkers on Wikipedia articles, where the answer to each question is a text span within the corresponding article.

In [1]:
# make sure the required packages are installed
%pip install pandas numpy seaborn matplotlib scikit-learn keras tensorflow tensorflow-hub datasets --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/deep-learning/rnn'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/data/* data/.
    !cp {directory}/img/* img/.

from datasets import load_dataset
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

Note: you may need to restart the kernel to use updated packages.


## Load the dataset and the model

We load the SQuAD question-answering dataset and the Universal Sentence Encoder (USE) model. 

In [2]:
# Load the SQuAD dataset (using only the 'train' split for demonstration)
squad_dataset = load_dataset("squad", split="train")

# Load the USE model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

We extract the contexts (paragraphs) and questions from the dataset. We then display the first few entries for reference.

In [8]:
# Extract contexts (paragraphs) and corresponding questions
contexts = squad_dataset["context"]
questions = squad_dataset["question"]

# Display the first few entries for reference
for i in range(5):
    print(f"Question: {questions[i]}\n")    
    print(f"Context: {contexts[i]}")
    print("-" * 50)

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
--------------------------------------------------
Question: What is in front of the Notre Dame Main Building?

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is
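
The output above suggests that consecutive questions can share the same context paragraph, so the `contexts` list contains repeated entries. The following is a minimal sketch of one way to group questions by distinct context; the helper `group_questions_by_context` and the toy data are hypothetical and are not part of the original notebook.

```python
from collections import defaultdict
from typing import Iterable


def group_questions_by_context(contexts: Iterable[str],
                               questions: Iterable[str]) -> dict[str, list[str]]:
    """Map each distinct context to the list of questions asked about it."""
    grouped = defaultdict(list)
    for context, question in zip(contexts, questions):
        grouped[context].append(question)
    # A plain dict keeps the order in which contexts were first seen
    return dict(grouped)


# Toy data (hypothetical, mimicking the repeated contexts seen in the output above)
toy_contexts = ["Paragraph A", "Paragraph A", "Paragraph B"]
toy_questions = ["Question 1", "Question 2", "Question 3"]

grouped = group_questions_by_context(toy_contexts, toy_questions)
unique_contexts = list(grouped)
print(len(toy_contexts), "rows,", len(unique_contexts), "distinct contexts")
```

Working with distinct contexts would mean fewer paragraphs to encode and no repeated paragraphs in a context-based result list. The rest of this notebook keeps the original one-row-per-question layout.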

## Semantic search function

The following `semantic_search` function retrieves the entries most relevant to a query (`query_p`). If the embeddings of the dataset questions are provided, the function finds the most similar questions by comparing their embeddings with the embedding of the query. Otherwise, it finds the most similar contexts, again by comparing their embeddings with the embedding of the query. Similarity between embeddings is measured with cosine similarity, and the `top_k` most similar results are returned.

In [9]:
from typing import Optional, Sequence


def semantic_search(query_p: str, contexts_p: Sequence[str], questions_p: Sequence[str],
                    top_k: int, questions_embeddings_p: Optional[np.ndarray],
                    context_embeddings_p: Optional[np.ndarray]) -> list[dict[str, object]]:
    """
    This function performs semantic search to retrieve the most relevant contexts based on a query.
    If questions_embeddings_p is provided, it searches by questions.
    Otherwise, it searches by contexts.
    :param query_p: the query to be semantically searched
    :param contexts_p: the contexts (paragraphs) to search from
    :param questions_p: the questions to search from
    :param top_k: how many results to return
    :param questions_embeddings_p: the embeddings of the questions (if searching by questions; otherwise None)
    :param context_embeddings_p: the embeddings of the contexts (if searching by contexts; otherwise None)
    :return: a list of dictionaries containing the most relevant contexts ('context'),
    sample questions ('sample_question'), and similarity scores ('similarity')
    """
    # Encode the query
    query_embedding = embed([query_p])
    # This function allows searching by question or by context
    semantic_embeddings = questions_embeddings_p if questions_embeddings_p is not None else context_embeddings_p
    if semantic_embeddings is None:
        raise ValueError("Provide either questions_embeddings_p or context_embeddings_p")
    # Compute cosine similarities between the query and all the selected embeddings
    similarities = cosine_similarity(query_embedding, semantic_embeddings).flatten()
    # Get the indices of the top_k most similar contexts
    # argsort returns the indices that would sort an array ascending, [-top_k:] gets the top_k largest values,
    # and [::-1] reverses the order
    top_k_indices = similarities.argsort()[-top_k:][::-1]
    # Retrieve the most similar contexts and their corresponding questions
    results = []
    for idx in top_k_indices:
        results.append({
            'context': contexts_p[idx],
            'sample_question': questions_p[idx],  # A sample question related to this context
            'similarity': similarities[idx]
        })
    return results
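
The ranking step inside `semantic_search` can be tried in isolation. The following is a minimal sketch using only NumPy, with hand-made 2-D vectors standing in for USE embeddings (an assumption for illustration; the values are not produced by USE). The helper names `cosine_similarities` and `top_k_indices` are hypothetical.

```python
import numpy as np


def cosine_similarities(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of corpus."""
    query_unit = query / np.linalg.norm(query)
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus_unit @ query_unit


def top_k_indices(similarities: np.ndarray, top_k: int) -> np.ndarray:
    """Indices of the top_k largest similarities, most similar first."""
    return similarities.argsort()[-top_k:][::-1]


# Toy "embeddings" (hypothetical values, not produced by USE)
corpus = np.array([[1.0, 0.0],    # same direction as the query
                   [0.0, 1.0],    # orthogonal to the query
                   [1.0, 1.0],    # 45 degrees from the query
                   [-1.0, 0.0]])  # opposite direction
query = np.array([2.0, 0.0])

similarities = cosine_similarities(query, corpus)
best = top_k_indices(similarities, top_k=2)
print(best, similarities[best])
```

Because cosine similarity ignores vector length, the query `[2, 0]` matches `[1, 0]` perfectly. Note that the slice `[-top_k:]` returns the whole array when `top_k` is 0, so a guard would be needed if zero were a possible value.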

## Queries

We define some example queries to test the semantic search function. We then encode the contexts (`contexts_embeddings`) and questions (`questions_embeddings`) using the Universal Sentence Encoder (USE). 

In [10]:
# Example queries
queries = ['What is the capital of the United States of America?',
           'What language is spoken in Andorra?',
           'Who was Martin Luther King?']

# Encode the contexts (paragraphs) and questions. This may take some time.
questions_embeddings = embed(questions)
contexts_embeddings = embed(contexts)
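
Passing every question or context to `embed` in a single call may require a lot of memory on small machines. One possible workaround, sketched below, is to encode the texts in batches and stack the results. The helper `embed_in_batches`, the default `batch_size`, and the `fake_embed` stand-in (used so the example runs without downloading USE) are assumptions, not part of the original notebook.

```python
from typing import Callable, Sequence

import numpy as np


def embed_in_batches(embed_fn: Callable[[list[str]], np.ndarray],
                     texts: Sequence[str], batch_size: int = 1000) -> np.ndarray:
    """Embed texts in chunks of batch_size and stack the results row-wise."""
    batches = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # np.asarray turns array-like results into NumPy arrays
        batches.append(np.asarray(embed_fn(batch)))
    return np.vstack(batches)


def fake_embed(batch: list[str]) -> np.ndarray:
    """Hypothetical stand-in for USE: maps each text to a 2-D vector."""
    return np.array([[len(text), text.count(" ")] for text in batch], dtype=float)


texts = ["a b", "hello", "one two three", "x", "semantic search"]
embeddings = embed_in_batches(fake_embed, texts, batch_size=2)
print(embeddings.shape)
```

In this notebook, the same helper could presumably be called as `embed_in_batches(embed, questions)`, producing one row per question in the original order.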

## Show the results

We show the results by comparing the query with the questions and the contexts. We display the top 3 most relevant results for each query.

The top-ranked results are reasonably good, considering the simplicity of the USE model, although lower-ranked results can be off-topic. For more complex tasks, transformer-based models like [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) can be used, but they are slower and require more computational resources.

In [11]:
def show_results(search_results_p: list[dict[str, object]]) -> None:
    """
    Display the search results
    :param search_results_p: the search results to display
    """
    for idx, result in enumerate(search_results_p):
        print(f"\tResult {idx + 1}")
        print(f"\tContext: {result['context']}")
        print(f"\tSample Question: {result['sample_question']}")
        print(f"\tSimilarity Score: {result['similarity']:.4f}", "\n")


for query in queries:
    print(f"Query: {query}")
    print("Question-based search:")
    # Get top 3 most relevant
    search_results = semantic_search(query, contexts, questions, 3, questions_embeddings, None)
    show_results(search_results)
    print("Context-based search:")
    # Get top 3 most relevant
    search_results = semantic_search(query, contexts, questions, 3, None, contexts_embeddings)
    # Display the results
    show_results(search_results)
    print("-" * 50)

Query: What is the capital of the United States of America?
Question-based search:
	Result 1
	Context: The capital city, Washington, District of Columbia, is a federal district located on land donated by the state of Maryland. (Virginia had also donated land, but it was returned in 1849.) The United States also has overseas territories with varying levels of independence and organization: in the Caribbean the territories of Puerto Rico and the U.S. Virgin Islands, and in the Pacific the inhabited territories of Guam, American Samoa, and the Northern Mariana Islands, along with a number of uninhabited island territories.
	Sample Question: What is the capital city of the US?
	Similarity Score: 0.9184 

	Result 2
	Context: From 1981 to 2010, the average annual precipitation measured at Seattle–Tacoma International Airport was 37.49 inches (952 mm). Annual precipitation has ranged from 23.78 in (604 mm) in 1952 to 55.14 in (1,401 mm) in 1950; for water year (October 1 – September 30) preci