## Semantic Search

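Semantic search aims to improve search accuracy by matching on the meaning of the query and of the corpus entries rather than on exact keywords. The query and the corpus are embedded into the same vector space, and the corpus entries whose embeddings lie closest to the query embedding are returned.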
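
As a rough illustration of the idea, the sketch below ranks a toy corpus against a query using cosine similarity. The 3-dimensional vectors and the `cosine_similarity` and `rank` helpers are made up for this example; in practice the embeddings come from a model such as the one loaded in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_corpus = {
    "A man is eating food.": [0.9, 0.1, 0.0],
    "A man is riding a horse.": [0.2, 0.9, 0.1],
    "A monkey is playing drums.": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # stands in for "A man is eating pasta."

def rank(query_vec, corpus_vecs):
    """Return (sentence, score) pairs sorted by descending similarity."""
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in corpus_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for sentence, score in rank(toy_query, toy_corpus):
    print(f"{sentence} (Score: {score:.4f})")
```

With these toy vectors, the sentence about eating ranks first because its vector points in nearly the same direction as the query vector.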


In [1]:
from sentence_transformers import SentenceTransformer, util
import torch



In [2]:
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')



### Manual Implementation

In [3]:
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

In [4]:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]

In [5]:
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

In [6]:
# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(3, len(corpus))
for query, query_embedding in zip(queries, query_embeddings):
    # Reuse the query embeddings computed above instead of encoding each query again.
    # We use cosine-similarity and torch.topk to find the highest top_k scores
    similarity_scores = model.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(similarity_scores, k=top_k)

    print("\nQuery:", query)
    print(f"Top {top_k} most similar sentences in corpus:")

    for score, idx in zip(scores, indices):
        print(corpus[idx], f"(Score: {score:.4f})")


Query: A man is eating pasta.
Top 3 most similar sentences in corpus:
A man is eating food. (Score: 0.8385)
A man is eating a piece of bread. (Score: 0.7468)
A man is riding a horse. (Score: 0.5328)

Query: Someone in a gorilla costume is playing a set of drums.
Top 3 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.7613)
The girl is carrying a baby. (Score: 0.3815)
A man is riding a white horse on an enclosed ground. (Score: 0.3685)

Query: A cheetah chases prey on across a field.
Top 3 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8704)
A man is riding a white horse on an enclosed ground. (Score: 0.3741)
A monkey is playing drums. (Score: 0.3468)
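
The selection step above relies on `torch.topk`, which returns the `k` largest values together with their positions. For readers unfamiliar with it, the same idea can be sketched in plain Python with `heapq.nlargest`. The `top_k_indices` helper and the scores below are made up for illustration and are not taken from the model.

```python
import heapq

def top_k_indices(scores, k):
    """Return (score, index) pairs for the k highest scores, best first."""
    k = min(k, len(scores))  # never ask for more items than exist
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))

# Made-up similarity scores for a five-sentence corpus.
example_scores = [0.12, 0.84, 0.05, 0.53, 0.75]
for score, idx in top_k_indices(example_scores, k=3):
    print(idx, f"(Score: {score:.4f})")
```

Clamping `k` to the corpus size mirrors the `min(3, len(corpus))` guard in the notebook cell.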


### Optimized Implementation

In [7]:
# Normalize embeddings to unit length so that the dot product equals cosine similarity and is cheaper to compute
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
query_embeddings = util.normalize_embeddings(query_embeddings)
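
Normalizing first is what allows a plain dot product to stand in for cosine similarity. A small sketch in plain Python checks this on two arbitrary vectors; `normalize`, `dot`, and `cosine` are illustrative helpers, not part of `sentence_transformers`.

```python
import math

def normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]  # arbitrary example vectors

# After normalization the dot product matches the cosine similarity.
print(math.isclose(dot(normalize(a), normalize(b)), cosine(a, b)))
```

Because the norms are divided out once up front, each later comparison needs only multiplications and additions.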

In [8]:
hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score, top_k=3)

In [9]:
hits

[[{'corpus_id': 0, 'score': 0.8384666442871094},
  {'corpus_id': 1, 'score': 0.7468274831771851},
  {'corpus_id': 3, 'score': 0.5328127145767212}],
 [{'corpus_id': 7, 'score': 0.7612733840942383},
  {'corpus_id': 2, 'score': 0.3815287947654724},
  {'corpus_id': 6, 'score': 0.36845868825912476}],
 [{'corpus_id': 8, 'score': 0.8703994750976562},
  {'corpus_id': 6, 'score': 0.37411704659461975},
  {'corpus_id': 7, 'score': 0.3468022346496582}]]

In [10]:
for query, query_hits in zip(queries, hits):
    for hit in query_hits:
        corpus_id = hit['corpus_id']  # index into the corpus list
        score = hit['score']

        print(query, "<>", corpus[corpus_id], "(Score: {:.4f})".format(score))

    print()

A man is eating pasta. <> A man is eating food. (Score: 0.8385)
A man is eating pasta. <> A man is eating a piece of bread. (Score: 0.7468)
A man is eating pasta. <> A man is riding a horse. (Score: 0.5328)

Someone in a gorilla costume is playing a set of drums. <> A monkey is playing drums. (Score: 0.7613)
Someone in a gorilla costume is playing a set of drums. <> The girl is carrying a baby. (Score: 0.3815)
Someone in a gorilla costume is playing a set of drums. <> A man is riding a white horse on an enclosed ground. (Score: 0.3685)

A cheetah chases prey on across a field. <> A cheetah is running behind its prey. (Score: 0.8704)
A cheetah chases prey on across a field. <> A man is riding a white horse on an enclosed ground. (Score: 0.3741)
A cheetah chases prey on across a field. <> A monkey is playing drums. (Score: 0.3468)
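
Because `top_k=3` always returns three hits per query, weak matches such as the 0.38 score for the gorilla-costume query are included too. One possible follow-up, sketched below, is to drop hits under a score cutoff. The `filter_hits` helper and the 0.5 threshold are assumptions chosen for illustration; the scores are the rounded values from the output above.

```python
def filter_hits(hits, min_score):
    """Keep only hits whose score is at least min_score, per query."""
    return [[h for h in query_hits if h["score"] >= min_score] for query_hits in hits]

# Scores copied (rounded to four decimals) from the output above.
example_hits = [
    [{"corpus_id": 0, "score": 0.8385}, {"corpus_id": 1, "score": 0.7468}, {"corpus_id": 3, "score": 0.5328}],
    [{"corpus_id": 7, "score": 0.7613}, {"corpus_id": 2, "score": 0.3815}, {"corpus_id": 6, "score": 0.3685}],
    [{"corpus_id": 8, "score": 0.8704}, {"corpus_id": 6, "score": 0.3741}, {"corpus_id": 7, "score": 0.3468}],
]

# 0.5 is an arbitrary cutoff; a suitable value depends on the model and data.
for query_hits in filter_hits(example_hits, min_score=0.5):
    print([h["corpus_id"] for h in query_hits])
```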

