# Semantic Search (Sentence Transformer)

In natural language processing (NLP), semantic search refers to a search engine's ability to understand the intent behind a user's query and to return results that are semantically related to it. This contrasts with keyword-based search, which returns results that contain the exact words the user entered, regardless of their intended meaning. Classical approaches relied on techniques such as word sense disambiguation and part-of-speech tagging to interpret the words in a query and identify related terms. The approach used in this notebook instead encodes the query and every candidate sentence as a dense embedding and ranks the candidates by cosine similarity. Matching on meaning rather than surface form can improve the relevance of search results and make it easier for users to find what they are looking for.

Courtesy: ChatGPT
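
To make the contrast with keyword-based search concrete, here is a minimal sketch of a purely lexical scorer. The function name `keyword_score` and the example sentences are hypothetical and chosen only for illustration; Jaccard word overlap stands in for keyword search in general, which real engines implement with more elaborate scoring.

```python
def keyword_score(query: str, document: str) -> float:
    """Jaccard overlap between the lowercase word sets of query and document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


# The two sentences mean roughly the same thing but share no words,
# so a lexical scorer sees no relation between them.
paraphrase = keyword_score("That film was fantastic", "The new movie is awesome")

# Sentences that reuse the same words score high, whatever they mean.
shared_words = keyword_score("The new movie is so great", "The new movie is awesome")

print(paraphrase, shared_words)
```

A paraphrase with no shared vocabulary gets a score of zero here, which is the gap that embedding-based search is meant to close.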

In [170]:
import torch
import numpy as np
import functools
from typing import Tuple, List
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
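def mean_pooling(model_output, attention_mask):
    """Average the token embeddings of each sentence, ignoring padding tokens."""
    # The first element of the model output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the embedding dimension
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Masked sum divided by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)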
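
The effect of the attention mask in `mean_pooling` is easiest to see on a tiny hand-made input. The sketch below re-implements the same masked average in NumPy purely for illustration; the name `masked_mean` and the toy values are assumptions, not part of the notebook.

```python
import numpy as np


def masked_mean(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the positions where the mask is 1."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the embedding axis
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clamp the token count so an all-padding row does not divide by zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# One sentence, three token slots, 2-dimensional embeddings.
# The third slot is padding and carries a deliberately large value.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(tokens, mask)
print(pooled)  # [[2. 3.]]
```

Without the mask, the padding vector would pull the average far away from the two real tokens; with it, only the real tokens contribute.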

In [125]:
"""
Sentence-Transformers is a Python library that provides pre-trained sentence embedding models.
These models are built on transformer encoders such as BERT ("Bidirectional Encoder Representations from Transformers") and MPNet,
which use attention mechanisms to learn contextual relationships between the words in a sentence.
The resulting sentence embeddings can be used for downstream NLP tasks such as text classification, information retrieval, and semantic search.
The library includes a variety of pre-trained models that can also be fine-tuned on specific tasks or datasets.
Here the all-mpnet-base-v2 checkpoint is loaded directly through Hugging Face Transformers rather than through the Sentence-Transformers API.
"""

MODEL_NAME = 'sentence-transformers/all-mpnet-base-v2'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

In [187]:
def semantic_search(query: List[str], corpus: List[str], top_k: int = 5) -> List[Tuple[str, float]]:
    
    #tokenize the sentences
    corpus_input = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt')
    query_input = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
    
    with torch.no_grad():
        corpus_output = model(**corpus_input)
        query_output = model(**query_input)
      
    #Mean pool on output embeddings
    corpus_embeddings = mean_pooling(corpus_output, corpus_input['attention_mask'])
    query_embeddings = mean_pooling(query_output, query_input['attention_mask'])
     
    #Normalize the embeddings
    corpus_embeddings = F.normalize(corpus_embeddings, p=2, dim=1)
    query_embeddings = F.normalize(query_embeddings, p=2, dim=1)

    corpus_embeddings = corpus_embeddings.detach().cpu().numpy()
    query_embeddings = query_embeddings.detach().cpu().numpy()

    #Calculate the cosine similarity between the first query and every corpus sentence
    sim = cosine_similarity(query_embeddings, corpus_embeddings)[0]

    #Corpus indices sorted from most to least similar
    sorted_index = np.argsort(-sim)

    #Pair each of the top_k sentences with its similarity score
    similar_sentences = [(corpus[idx], sim[idx]) for idx in sorted_index[:top_k]]
    
    return similar_sentences
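
The ranking step at the end of `semantic_search` can be checked on toy vectors without loading a model. The sketch below is a NumPy-only illustration under stated assumptions: `top_k_by_cosine` and the 2-dimensional vectors are hypothetical, and the dot product of unit-length vectors is used as the cosine similarity.

```python
import numpy as np


def top_k_by_cosine(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int):
    """Return (corpus index, cosine similarity) pairs for the k best matches."""
    # L2-normalize so that a plain dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    # argsort is ascending, so negate the scores to rank from most to least similar
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


corpus_vecs = np.array([[1.0, 0.0],   # same direction as the query
                        [0.0, 1.0],   # orthogonal to the query
                        [1.0, 1.0]])  # 45 degrees from the query
query_vec = np.array([2.0, 0.0])

ranked = top_k_by_cosine(query_vec, corpus_vecs, k=2)
print(ranked)
```

Because cosine similarity ignores vector length, the query `[2, 0]` matches `[1, 0]` perfectly, and the 45-degree vector comes second with a score of about 0.707.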

In [188]:
corpus = ['The cat sits outside',
          'A man is playing guitar',
          'I love pasta',
          'The new movie is awesome',
          'The cat plays in the garden',
          'A woman watches TV',
          'The new movie is so great',
          'Do you like pizza?']


query = ['The cat plays in the garden',]

In [190]:
#Top 4 Similar Sentences
semantic_search(query, corpus, top_k=4)

[('The cat plays in the garden', 1.0000001),
 ('The cat sits outside', 0.6738439),
 ('I love pasta', 0.11078529),
 ('Do you like pizza?', 0.09111777)]