# Sentence BERT Models
https://sbert.net/

This notebook shows how to use SBERT to:
* Encode text, i.e., generate embeddings
* Calculate similarity scores using the cosine metric
* Perform a classification task using embeddings and the cosine metric
* Mine paraphrases using the SBERT utility
* Run semantic search using the SBERT utility and PyTorch

**Note:**
* Running the code in this notebook downloads the model to the local cache

In [1]:
# !pip install -U sentence-transformers transformers

## Create Model instance

Create a pre-trained SBERT model.

In [2]:
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer, util

# Model name 
model_name = "all-MiniLM-L6-v2"

# Model creation
model = SentenceTransformer(model_name)

## 1. Generate embeddings and cosine scores

### Generate embeddings

Embeddings are generated using the **model.encode** function.

https://www.sbert.net/docs/package_reference/SentenceTransformer.html


In [3]:
# A single list of words
words = [
    "cat",
    "dog",
    "pasta",
    "puppy",
    "kitten",
    "car",
    "motorbike",
    "pizza",
]

# Compute embeddings for the entire list of words
embeddings = model.encode(words)

# Embedding dimension (length of one vector)
print(len(embeddings[0]))

384


### Calculate the score between test word and the list of words

In [4]:
# Change the word to test
test_word = "truck"

test_word_embedding = model.encode(test_word)

results = []
for i, embedding in enumerate(embeddings):
    
    # calculate the score
    score = util.cos_sim(test_word_embedding, embedding)
    
    # convert tensor to a scalar and round to 2 decimal places
    score = round(score.item(),2)

    results.append((score, words[i]))

results.sort(reverse=True)

print("Test word = ", test_word)

# Iterate through the results
for result in results:
    print(result)

Test word =  truck
(0.69, 'car')
(0.51, 'motorbike')
(0.45, 'dog')
(0.45, 'cat')
(0.39, 'kitten')
(0.38, 'puppy')
(0.32, 'pasta')
(0.31, 'pizza')


### Compute the cosine similarity score
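
The scores printed above come from `util.cos_sim`, which computes the cosine similarity: the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula. The function name `cosine_similarity` and the toy vectors are hypothetical and stand in for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical, not real model embeddings)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])

print(round(same_direction, 2), round(orthogonal, 2), round(opposite, 2))
```

The score depends only on the direction of the vectors, not on their length, which is why scaling a vector leaves the score unchanged.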

## 2. Classification

1. Multiple sentences are given in 2 categories [Sports, History]
2. Given a new sentence:
    * Classify it as [Sports, History]
    * Find the sentence closest to it in the chosen category

**Plan of action:**
1. Calculate the embeddings for the given sentences (corpus)
2. For the query:
    * calculate its embedding
    * calculate the average cosine score between the query embedding and each of the 2 categories
    * find the corpus sentence in each category that has the maximum score
3. Use the average score to classify the query
4. Report the sentence with the highest score within the chosen category
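
The plan above can be sketched without the model, using toy 2-dimensional vectors in place of real embeddings. The names `classify` and `toy_categories` are hypothetical and only illustrate the flow. The notebook's own implementation follows in the cells below.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query, categories):
    """Return (best_category, index_of_closest_item_in_that_category).

    The category is chosen by the highest average cosine score.
    """
    best_name, best_average, best_index = None, -math.inf, -1
    for name, vectors in categories.items():
        scores = [cosine_similarity(query, v) for v in vectors]
        average = sum(scores) / len(scores)
        if average > best_average:
            best_name, best_average = name, average
            best_index = scores.index(max(scores))
    return best_name, best_index


# Toy vectors (assumption): "sports" points along x, "history" along y
toy_categories = {
    "sports": [[1.0, 0.1], [0.9, 0.2]],
    "history": [[0.1, 1.0], [0.3, 0.9]],
}

category, closest_index = classify([0.2, 0.8], toy_categories)
print(category, closest_index)
```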

In [5]:
# Category : Sports
sports_facts = [
    "Football, also known as soccer in some countries, is the most popular sport in the world, with billions of fans worldwide.",
    "Basketball was invented in 1891 by Dr. James Naismith, a Canadian physical education instructor, as an indoor game to keep his students active during the winter months.",
    "Tennis is a highly competitive sport that originated in the 19th century and is played by millions of people around the world on various surfaces such as grass, clay, and hardcourt.",
    "Golf is a precision club-and-ball sport in which players use various clubs to hit balls into a series of holes on a course in as few strokes as possible."
]

# Category : History
history_facts = [
    "The Renaissance was a period of cultural rebirth that emerged in Europe during the 14th to 17th centuries, marking a transition from the Middle Ages to modernity.",
    "The Industrial Revolution, which began in Britain in the late 18th century, transformed society by introducing mechanized manufacturing processes and urbanization.",
    "The Cold War, spanning from the late 1940s to the early 1990s, was a geopolitical conflict between the United States and the Soviet Union, characterized by ideological, economic, and military competition.",
    "The French Revolution, which erupted in 1789, was a watershed moment in European history, leading to the overthrow of the monarchy and the rise of democratic principles."
]

### Use model.encode() to generate embeddings

In [6]:
sports_facts_embeddings = model.encode(sports_facts)
history_facts_embeddings = model.encode(history_facts)

# optional step
# print("Dimension = ", len(sports_facts_embeddings[0]))

### Calculate Cosine Score

For ease of use, create a utility function that calculates the cosine score between the query embedding and each embedding in a given category.

The function returns a tuple:

* Scores for each item in the list
* Average score
* Index of the item with the maximum score

In [7]:
def calculate_cosine_score(test_embedding, embeddings):
    # holds the score for each embedding, as plain floats
    scores = []

    # holds the info on max score; start below any possible cosine score
    # so that an all-negative list still reports its true maximum
    max_score = float("-inf")
    max_score_index = 0

    # loop through the list to calculate the score and track the max
    for i, embedding in enumerate(embeddings):
        # util.cos_sim returns a 1x1 tensor; convert it to a scalar
        score = util.cos_sim(test_embedding, embedding).item()
        scores.append(score)
        if score > max_score:
            max_score_index = i
            max_score = score

    # average score across the category
    average_score = sum(scores) / len(scores)

    return scores, average_score, max_score_index


### Test sentence & its embedding

In [8]:
test_sentences = [
    "I like putting",
    "two strong armies came face to face",
    "hoops on the two ends of the court",
    "steam engine changed the world",
    "arts, and self expression was the highlight"
]

test_sentence = test_sentences[4]

test_sentence_embedding = model.encode(test_sentence)

# Calculate the scores
scores_sports, average_score_sports, max_score_index_sports = calculate_cosine_score(test_sentence_embedding, sports_facts_embeddings)
scores_history, average_score_history, max_score_index_history = calculate_cosine_score(test_sentence_embedding, history_facts_embeddings)

### Check category
category = "sports"
if average_score_history > average_score_sports:
    category = "history"

print(test_sentence," - Belongs to category : ", category)

# Get the closest sentence
if category == "sports":
    print("Closest sentence : ", sports_facts[max_score_index_sports])
else:
    print("Closest sentence : ", history_facts[max_score_index_history])

arts, and self expression was the highlight  - Belongs to category :  history
Closest sentence :  The Renaissance was a period of cultural rebirth that emerged in Europe during the 14th to 17th centuries, marking a transition from the Middle Ages to modernity.


## 3. Paraphrase mining

Given a list of sentences or texts, **util.paraphrase_mining** compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score. Each entry in the list holds the score followed by the corpus indices of the two sentences.
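
To make the all-pairs comparison concrete, here is a brute-force sketch on toy vectors. The helper `mine_pairs` is hypothetical and is not how the library implements the function; it only mirrors the `[score, index 1, index 2]` shape of the results used in the cell below.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_pairs(vectors):
    """Score every unordered pair; return [score, i, j] sorted best first."""
    pairs = [
        [cosine_similarity(vectors[i], vectors[j]), i, j]
        for i, j in combinations(range(len(vectors)), 2)
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


# Toy vectors (assumption): items 0 and 1 point in nearly the same direction
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_pairs = mine_pairs(toy_vectors)

for score, i, j in toy_pairs:
    print(round(score, 2), i, j)
```

A brute-force loop like this scores n(n-1)/2 pairs, so it is only practical for small lists.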

In [9]:
from sentence_transformers import util

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

mining_result = util.paraphrase_mining(model, corpus)
for result in mining_result:
    print("------ Score: ",round(result[0],2),"  -------")
    print(corpus[result[1]])
    print(corpus[result[2]])

------ Score:  0.76   -------
A man is eating food.
A man is eating a piece of bread.
------ Score:  0.74   -------
A man is riding a horse.
A man is riding a white horse on an enclosed ground.
------ Score:  0.25   -------
A man is eating food.
A man is riding a horse.
------ Score:  0.2   -------
A woman is playing violin.
A monkey is playing drums.
------ Score:  0.17   -------
A man is eating food.
A man is riding a white horse on an enclosed ground.
------ Score:  0.14   -------
A man is eating a piece of bread.
A man is riding a horse.
------ Score:  0.14   -------
A man is eating food.
A cheetah is running behind its prey.
------ Score:  0.12   -------
A monkey is playing drums.
A cheetah is running behind its prey.
------ Score:  0.12   -------
A man is eating a piece of bread.
A man is riding a white horse on an enclosed ground.
------ Score:  0.08   -------
A man is riding a horse.
A monkey is playing drums.
------ Score:  0.08   -------
A man is riding a white horse on an en

## 4. Semantic search
In the classification example we had to build our own search logic to find the nearest matching sentence. Semantic search offers this capability out of the box, so you don't have to build a search algorithm yourself. The sentence-transformers library offers a simple utility function to carry out semantic search.

https://sbert.net/examples/applications/semantic-search/README.html
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search.py

#### 2 ways to do it
1. Use the sentence-transformers **util.semantic_search(...)** function
2. Use cosine similarity and the PyTorch **torch.topk** function to select the top k results
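
Both approaches reduce to the same step: given one similarity score per corpus sentence, keep the k highest. A minimal pure-Python sketch of that selection follows. The helper `top_k_hits` and the toy scores are hypothetical; the dictionary keys mirror the `corpus_id` and `score` fields read from the search hits later in this section.

```python
import heapq


def top_k_hits(scores, k):
    """Return the k highest-scoring corpus positions as dicts, best first."""
    k = min(k, len(scores))
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [{"corpus_id": idx, "score": score} for idx, score in best]


# Toy similarity scores (assumption), one per corpus sentence
toy_scores = [0.12, 0.87, 0.45, 0.91, 0.03]
toy_hits = top_k_hits(toy_scores, k=3)

for hit in toy_hits:
    print(hit["corpus_id"], hit["score"])
```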


### Define the corpus

In this example the corpus is a list of sentences.

In [None]:
import torch

# Corpus with example sentences
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

### Test queries

In [None]:
# Query sentences:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]

### 1. Top-K queries using sentence-transformers util

Search corpus for each of the test queries


In [None]:
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    print("\n\n======================\n\n")
    print("Query:", query)
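    print("\nTop", top_k, "most similar sentences in corpus:")

    # Pass top_k (not a hard-coded 5) so the corpus-size cap above is respected
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]      # Get the hits for the first (and only) query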
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))


### 2. Top-K queries using PyTorch topk

Search corpus for each of the test queries

In [None]:
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop", top_k, "most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """