## Embedding queries and fetching the most similar results

Embedding-based search can be used to find the text most relevant to a user's question. Embeddings are simple to implement and work especially well with questions, as questions often don't lexically overlap with their answers. Embeddings measure the relatedness of text strings and are commonly used for:
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
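
As a minimal sketch of how this relatedness can be computed, the snippet below implements cosine similarity and the corresponding cosine distance in plain Python. The three-dimensional vectors are made up for illustration; real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form: 0 for the same direction, up to 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors, invented for this example
query = [0.9, 0.1, 0.0]
close_doc = [0.8, 0.2, 0.1]
far_doc = [0.0, 0.1, 0.9]

print(cosine_distance(query, close_doc))  # small distance: high relatedness
print(cosine_distance(query, far_doc))  # large distance: low relatedness
```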

How a user query is embedded depends on the model used to encode it. Pretrained language models usually provide a way to encode a query into the same vector space as the documents, so that the relatedness of the query to the documents can be measured effectively. Commonly used models include:
- BERT by Google
- Llama 2 by Meta AI
- DistilBERT by Hugging Face
- GPT-3 by OpenAI

Here is an example of using the OpenAI API to embed a user query and search a dataset of documents for relevant answers. Note that closed models charge for API calls, whereas open-source models like BERT and DistilBERT can be used for free. The following code is for illustrative purposes only. To retrieve the most relevant documents, we compute the cosine similarity between the embedding vector of the query and that of each document, and return the highest-scoring documents.

In [None]:
import ast

import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# datafile_path must point to a CSV file; this example assumes it has a
# precomputed "ada_embedding" column
df = pd.read_csv(datafile_path)
# Embeddings stored in a CSV are read back as strings, so parse them into arrays
df["ada_embedding"] = df.ada_embedding.apply(ast.literal_eval).apply(np.array)


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# As seen here, the API provided by the model developers can be used to create embeddings
def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


# Use the cosine similarity to find the top n most similar embeddings
def search(df, description, n=3):
    embedding = get_embedding(description, model="text-embedding-ada-002")
    df["similarities"] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
    res = df.sort_values("similarities", ascending=False).head(n)
    return res


# embed query and search for top 3 results in the dataframe
res = search(df, "query", n=3)
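
The ranking step does not depend on the embedding provider. The following sketch reproduces it without any API call, using a small hand-written corpus. The document names, the vectors, and the `top_n` helper are hypothetical and only stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n=3):
    """Return the n (doc_id, score) pairs with the highest cosine similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical precomputed document embeddings
docs = {
    "bridge_facts": [0.9, 0.1, 0.0],
    "bridge_history": [0.7, 0.3, 0.1],
    "cooking_tips": [0.0, 0.2, 0.9],
    "tax_forms": [0.1, 0.9, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]

results = top_n(query_embedding, docs, n=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Sorting every score is fine for small datasets; for large collections a vector index is usually preferable to a full scan.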

We might also have to tokenize the query before creating embeddings. This step is required for models that use subword tokenization, such as BERT. After that, we can apply the language model to generate embeddings for the individual tokens. The token embeddings are then aggregated into a single vector using a common method such as mean pooling, max pooling, or taking the embedding of the [CLS] token, which BERT-style models often use as a sentence-level representation. To keep the embeddings on a comparable scale, we can also normalize them to unit length. This matters when relatedness is measured with a dot product or Euclidean distance; for unit-length vectors, the dot product equals the cosine similarity.
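
To make the aggregation and normalization steps concrete, here is a small sketch of mean pooling followed by L2 normalization in plain Python. The token vectors and the attention mask are invented; in practice they would come from the tokenizer and the model's output. The helper names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Three real tokens and one padding token, two dimensions each (made-up values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

Excluding the padding position keeps the zero vector from pulling the average toward the origin.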

Here is another illustrative example using BERT and the transformers library by Hugging Face. It extracts an answer span from a reference text:

In [None]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Note: "bert-base-uncased" has no fine-tuned question-answering head, so its
# answers are not meaningful; in practice, use a checkpoint fine-tuned on a QA dataset.
model_name = "bert-base-uncased"
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
model.eval()

reference_text = "reference text goes here."


# example function to answer questions
def answer_question(question):
    # Tokenize and encode the question together with the reference text,
    # using the same tokenizer used for the language model.
    encoded = tokenizer(
        question, reference_text, add_special_tokens=True, return_tensors="pt"
    )

    # Inference only, so no gradients are needed
    with torch.no_grad():
        outputs = model(**encoded)

    # Most likely start and end positions of the answer span
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)

    # convert the tokens back to strings
    input_ids = encoded["input_ids"]
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[0][start_idx : end_idx + 1])
    )

    return answer

Below is a basic example of calling this function with a question.

In [None]:
question = "How tall is the Golden Gate Bridge?"

response = answer_question(question=question)
print("Answer:", response)

To summarize, embedding a question involves the following steps:
- Tokenization: Break down the user question into individual tokens using the tokenizer provided with the LLM.
- Word Embeddings: Represent each token as a vector of numerical values. This is typically done using a pretrained embedding matrix, which maps tokens to their corresponding vectors.
- Contextual Embeddings: Apply the LLM's contextualizing layers to the token vectors. These layers take into account the surrounding context of each token, providing more nuanced representations.
- Question Embedding: Aggregate the contextualized token embeddings to form a single vector representation for the entire question.
- Normalization: Normalize the question embedding vector so that it has unit length. This standardizes the comparison of embeddings and lets a dot product be used as the cosine similarity.

## Elasticsearch

With Elasticsearch, we can use the "More Like This" (MLT) query to find documents similar to a given text. Elasticsearch also provides a k-nearest neighbor (kNN) search, which finds the documents whose vectors are closest to a given query vector. This feature is useful for tasks such as semantic search.

To use kNN search, we first need to index our documents together with their vector representations, stored in a `dense_vector` field. The vectors can be produced by text embedding models such as Word2Vec, GloVe, or ELMo. Once the documents are indexed, we can pass a `knn` option in the search request to find the top-k most similar documents to a given query vector. For example, the following request finds the 5 documents most similar to the query vector [0.321, 0.432, 0.287, 0.123] in the index my_index, using the field vector. Here `k` is the number of nearest neighbors to return and `num_candidates` is the number of candidates considered on each shard. The syntax shown assumes Elasticsearch 8.x.

In [None]:
GET /my_index/_search
{
    "knn": {
        "field": "vector",
        "query_vector": [0.321, 0.432, 0.287, 0.123],
        "k": 5,
        "num_candidates": 50
    }
}
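
When the request is built programmatically, a small helper can assemble the body. The sketch below assumes the top-level `knn` search option of Elasticsearch 8.x; `build_knn_search` is a hypothetical helper, not part of any client library, and it only builds the JSON body without contacting a cluster.

```python
import json


def build_knn_search(field, query_vector, k=5, num_candidates=50):
    """Build the request body for an approximate kNN search."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


body = build_knn_search("vector", [0.321, 0.432, 0.287, 0.123], k=5)
print(json.dumps(body, indent=2))
```

The resulting body can be sent to the `_search` endpoint of the index with any HTTP or Elasticsearch client.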