<a href="https://www.kaggle.com/code/aisuko/semantic-search?scriptVersionId=162021554" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Semantic search seeks to improve search accuracy by understanding the context of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.


# Background

The idea behind semantic search is to embed all entries in our corpus, whether they be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from our corpus are found. These entries should have a high semantic overlap with the query.

<div style="text-align: center"><img src="https://hostux.social/system/media_attachments/files/111/888/433/770/763/125/original/dafea983f4905b4b.png" width="60%" height="60%" alt="Semantic Search"></div>
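
The idea can be sketched without any model at all. The snippet below is a minimal illustration, assuming tiny hand-made 3-dimensional vectors in place of real embeddings; `cosine_similarity`, `search`, `toy_corpus`, and `toy_query` are hypothetical names used only for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional "embeddings"; a real model produces hundreds of dimensions.
toy_corpus = [
    [0.9, 0.1, 0.0],  # entry 0
    [0.0, 1.0, 0.2],  # entry 1
    [0.1, 0.0, 1.0],  # entry 2
]
toy_query = [0.8, 0.2, 0.1]

results = search(toy_query, toy_corpus)
print(results)  # entry 0 points in nearly the same direction as the query, so it ranks first
```

The rest of this notebook follows the same pattern, with a sentence embedding model producing the vectors and library helpers doing the scoring.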


# Symmetric vs. Asymmetric Semantic Search

A **critical distinction** for your setup is symmetric vs asymmetric semantic search:

**Symmetric semantic search**

Our query and the entries in our corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: our query could be "How to learn Python online?" and we want to find an entry like "How to learn Python on the web?". For symmetric tasks, we could potentially flip the query and the entries in our corpus.


**Asymmetric semantic search**

We usually have a **short query** (like a question or some keywords) and we want to find a longer paragraph answering the query. An example would be a query like "What is Python" for which we want to find the paragraph "Python is an interpreted, high-level and general-purpose programming language. Python...". For asymmetric tasks, flipping the query and the entries in our corpus usually does not make sense.

So, it is critical that we choose the right model for our type of task.

In [1]:
!pip install sentence-transformers==2.3.1
!pip install datasets==2.15.0


# Embeddings

If a GPU is available, the model is automatically executed on it. We can also specify the device for the model explicitly, for example `cpu`, `cuda`, or `cuda:0`.
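
Since the cells in this notebook request `'cuda'` explicitly, a small fallback can help when no GPU is present. This is a minimal sketch: `pick_device` is a hypothetical helper, not part of sentence-transformers, and it only relies on `torch.cuda.is_available()`.

```python
import importlib.util

def pick_device(preferred="cuda"):
    """Return `preferred` if PyTorch reports a usable GPU, otherwise 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so only the CPU is an option
    import torch
    return preferred if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The returned string could then be passed as the `device` argument when constructing the model, instead of a hard-coded `'cuda'`.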

For Transformer models like BERT, RoBERTa, and DistilBERT, the runtime and the memory requirement grow quadratically with the input length. This limits transformers to inputs of certain lengths. A common value for BERT & Co. is 512 word pieces, which corresponds to about 300-400 words (for English). Texts longer than this are truncated to the first x word pieces. By default, the provided methods use a limit of 128 word pieces; longer inputs will be truncated. We can get and set the maximal sequence length through the `max_seq_length` attribute.

Note: You cannot increase the length beyond the maximum supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good.
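
To make the truncation and the quadratic growth concrete, here is a rough sketch. It assumes whitespace-separated words as a stand-in for real word pieces (an actual tokenizer splits text differently and also reserves room for special tokens); `truncate` and `attention_pairs` are hypothetical helpers for illustration only.

```python
def truncate(word_pieces, max_seq_length):
    """Keep only the first `max_seq_length` word pieces; the rest is dropped."""
    return word_pieces[:max_seq_length]

def attention_pairs(num_tokens):
    """Number of token pairs self-attention compares: the square of the length."""
    return num_tokens * num_tokens

# Assumption: whitespace-separated words stand in for real word pieces.
text = "melbourne " * 300
pieces = text.split()

kept = truncate(pieces, max_seq_length=128)
print(len(pieces), "->", len(kept))                 # 300 -> 128
print(attention_pairs(128), attention_pairs(256))   # 16384 65536
```

Doubling the sequence length quadruples the number of pairs, which is why the limit matters for both runtime and memory.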

In [2]:
from sentence_transformers import SentenceTransformer

# the list of sentences to encode
melbourne_info = [
    "Melbourne is the capital city of the Australian state of Victoria.",
    "It is known for its diverse and vibrant cultural scene.",
    "The city is famous for its coffee culture, with numerous cafes scattered throughout.",
    "Melbourne is home to iconic landmarks like the Royal Exhibition Building and Flinders Street Station.",
    "The Yarra River runs through the heart of the city, providing a picturesque setting.",
    "The Melbourne Cricket Ground (MCG) is a historic sports venue and a key part of the city's identity.",
    "The city hosts various events, including the Australian Open, Melbourne Fashion Week, and the Melbourne International Comedy Festival.",
    "Melbourne's street art scene is renowned, with vibrant murals adorning many laneways.",
    "The Queen Victoria Market is a popular spot for fresh produce, local crafts, and diverse international cuisines.",
    "Melbourne is often considered one of the most livable cities globally, offering a high quality of life."
]

model=SentenceTransformer('all-MiniLM-L6-v2', device='cuda')
# cap inputs at 200 word pieces; longer inputs are truncated
model.max_seq_length=200


In [3]:
from sentence_transformers import util

embedding_1=model.encode(melbourne_info[0], convert_to_tensor=True)
embedding_2=model.encode(melbourne_info[1], convert_to_tensor=True)

util.pytorch_cos_sim(embedding_1, embedding_2)

tensor([[0.1413]], device='cuda:0')

In [4]:
embeddings=model.encode(melbourne_info, convert_to_tensor=True).to('cuda')
embeddings

tensor([[ 0.1098,  0.0037,  0.0023,  ..., -0.0178,  0.0163,  0.0190],
        [ 0.1096,  0.0234, -0.0461,  ..., -0.0273,  0.0157,  0.0221],
        [ 0.1075,  0.0198, -0.0181,  ...,  0.0867,  0.0412, -0.0741],
        ...,
        [ 0.0818, -0.0181, -0.0052,  ...,  0.0749, -0.0412, -0.0409],
        [ 0.1064, -0.0994,  0.0054,  ..., -0.0467, -0.0271,  0.0051],
        [ 0.1075, -0.0649,  0.0414,  ..., -0.0095, -0.0006,  0.1152]],
       device='cuda:0')

# Semantic Search (custom implementation)

Here we use a model that has been specifically trained for **Semantic Search**. Given a question or search query, such models are able to find relevant text passages.

In [5]:
import torch
from sentence_transformers import util

# Query sentences
melbourne_questions = [
    "What iconic landmarks can be found in Melbourne?",
    "How is Melbourne's coffee culture described?",
    "Which river runs through the heart of the city?"
]

# find the closest 5 sentences of the corpus for each query sentence
# based on cosine similarity
top_k=min(5, len(melbourne_info))
for query in melbourne_questions:
    query_embedding=model.encode(query, convert_to_tensor=True).to('cuda')
    
    # we use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores=util.cos_sim(query_embedding, embeddings)[0]
    top_results=torch.topk(cos_scores, k=top_k)
    
    print('\n\n===============\n\n')
    print('Query:',query)
    print('\nTop 5 most similar sentences:')
    
    for score, idx in zip(top_results[0], top_results[1]):
        print(melbourne_info[idx],'(Score: {:.4f})'.format(score))

Query: What iconic landmarks can be found in Melbourne?

Top 5 most similar sentences:
Melbourne is home to iconic landmarks like the Royal Exhibition Building and Flinders Street Station. (Score: 0.7592)
Melbourne is the capital city of the Australian state of Victoria. (Score: 0.5906)
Melbourne's street art scene is renowned, with vibrant murals adorning many laneways. (Score: 0.5419)
The city hosts various events, including the Australian Open, Melbourne Fashion Week, and the Melbourne International Comedy Festival. (Score: 0.5302)
Melbourne is often considered one of the most livable cities globally, offering a high quality of life. (Score: 0.5137)


Query: How is Melbourne's coffee culture described?

Top 5 most similar sentences:
The city is famous for its coffee culture, with numerous cafes scattered throughout. (Score: 0.6358)
Melbourne is the capital city of the Australian state of Victoria. (Score: 0.5399)
Melbourne is often considered one of the most livable cities globally, offering a high quality of life. (Score: 0.4549)
The city hosts various events, including the Australian Open, Melbourne Fashion Week, and the Melbourne International Comedy Festival. (Score: 0.4469)
The Melbourne Cricket Ground (MCG) is a historic sports venue and a key part of the city's identity. (Score: 0.3950)


Query: Which river runs through the heart of the city?

Top 5 most similar sentences:
The Yarra River runs through the heart of the city, providing a picturesque setting. (Score: 0.7078)
The city is famous for its coffee culture, with numerous cafes scattered throughout. (Score: 0.2728)
Melbourne's street art scene is renowned, with vibrant murals adorning many laneways. (Score: 0.2725)
The city hosts various events, including the Australian Open, Melbourne Fashion Week, and the Melbourne International Comedy Festival. (Score: 0.2243)
It is known for its diverse and vibrant cultural scene. (Score: 0.1911)


# Semantic Search (with `semantic_search`)

The score list should be similar to the result above.

In [6]:
from sentence_transformers.util import semantic_search

query_embeddings=model.encode(melbourne_questions, convert_to_tensor=True).to('cuda')

score_list=semantic_search(
    query_embeddings,
    embeddings,
    query_chunk_size=100,
    corpus_chunk_size=500000,
    top_k=top_k
)
score_list

[[{'corpus_id': 3, 'score': 0.7591984868049622},
  {'corpus_id': 0, 'score': 0.5905877351760864},
  {'corpus_id': 7, 'score': 0.541886568069458},
  {'corpus_id': 6, 'score': 0.5301910638809204},
  {'corpus_id': 9, 'score': 0.5137180089950562}],
 [{'corpus_id': 2, 'score': 0.6358156204223633},
  {'corpus_id': 0, 'score': 0.5399420857429504},
  {'corpus_id': 9, 'score': 0.45487603545188904},
  {'corpus_id': 6, 'score': 0.44690436124801636},
  {'corpus_id': 5, 'score': 0.39503976702690125}],
 [{'corpus_id': 4, 'score': 0.7078141570091248},
  {'corpus_id': 2, 'score': 0.27283042669296265},
  {'corpus_id': 7, 'score': 0.27252620458602905},
  {'corpus_id': 6, 'score': 0.22430312633514404},
  {'corpus_id': 1, 'score': 0.19112762808799744}]]
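
Each hit only carries a `corpus_id` and a `score`, so the ids still need to be mapped back to sentences. The sketch below shows one way to do that; `format_hits` is a hypothetical helper, and the two sentences and scores are copied from the outputs above so the block runs without a model. In the notebook, `melbourne_info` and `score_list[0]` could be passed in instead.

```python
def format_hits(hits, corpus):
    """Turn one query's hits (dicts with 'corpus_id' and 'score') into readable lines."""
    return ["{} (Score: {:.4f})".format(corpus[h["corpus_id"]], h["score"]) for h in hits]

# Corpus entries keyed by id, copied from the notebook for illustration.
corpus = {
    3: "Melbourne is home to iconic landmarks like the Royal Exhibition Building and Flinders Street Station.",
    0: "Melbourne is the capital city of the Australian state of Victoria.",
}
# The first two hits for the first query, copied from the output above.
first_query_hits = [
    {"corpus_id": 3, "score": 0.7591984868049622},
    {"corpus_id": 0, "score": 0.5905877351760864},
]

lines = format_hits(first_query_hits, corpus)
for line in lines:
    print(line)
```

The printed lines match the first two results of the custom loop in the previous section, which is a quick way to confirm that both approaches agree.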

# Converting embeddings to a `.csv` file

We can also store & load embeddings by using `pickle`. See https://www.sbert.net/examples/applications/computing-embeddings/README.html

In [7]:
import pandas as pd

embeddings_data=pd.DataFrame(embeddings.cpu())

embeddings_data.to_csv('sentences_of_Melbourne.csv', index=False)
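
For the `pickle` route mentioned above, a minimal round trip might look like the following. The toy `sentences` and `vectors` are placeholders; in the notebook they would presumably be `melbourne_info` and `embeddings.cpu()`. The file name and dictionary keys are assumptions made for this sketch.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real sentences and their embeddings.
sentences = ["first sentence", "second sentence"]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")

# Store sentences and embeddings together so they cannot drift apart.
with open(path, "wb") as f_out:
    pickle.dump({"sentences": sentences, "embeddings": vectors}, f_out,
                protocol=pickle.HIGHEST_PROTOCOL)

# Load them back and confirm nothing changed.
with open(path, "rb") as f_in:
    stored = pickle.load(f_in)

print(stored["sentences"] == sentences, stored["embeddings"] == vectors)  # True True
```

Unlike the CSV export above, which keeps only the numbers, this keeps each sentence next to its vector.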

# Loading the embeddings from a Hugging Face dataset

In [8]:
from datasets import load_dataset

embeddings_ds=load_dataset('aisuko/sentences_of_Melbourne')

embeddings_ds


DatasetDict({
    train: Dataset({
        features: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150',

Let's convert the embeddings from the dataset into a `torch.float` tensor.

In [9]:
dataset_embeddings=torch.from_numpy(embeddings_ds['train'].to_pandas().to_numpy()).to(torch.float)

melbourne_questions2 = [
    "Which sports venue is a historic landmark in Melbourne?",
    "What are some of the events hosted in Melbourne throughout the year?"
]

query_embedding2=model.encode(melbourne_questions2, convert_to_tensor=True).to('cuda')

hits=semantic_search(query_embedding2, dataset_embeddings, top_k=top_k)
hits

[[{'corpus_id': 3, 'score': 0.7171748876571655},
  {'corpus_id': 5, 'score': 0.663895845413208},
  {'corpus_id': 6, 'score': 0.5955651998519897},
  {'corpus_id': 0, 'score': 0.5639505982398987},
  {'corpus_id': 7, 'score': 0.43221524357795715}],
 [{'corpus_id': 6, 'score': 0.7750500440597534},
  {'corpus_id': 0, 'score': 0.5321306586265564},
  {'corpus_id': 3, 'score': 0.5310368537902832},
  {'corpus_id': 5, 'score': 0.5011793971061707},
  {'corpus_id': 7, 'score': 0.4073331654071808}]]

# Reference

* https://www.sbert.net/examples/applications/semantic-search/README.html
* https://huggingface.co/blog/getting-started-with-embeddings#2-host-embeddings-for-free-on-the-hugging-face-hub
* https://huggingface.co/tasks/sentence-similarity