<h2>Sentence Transformers</h2>


**Resources:**

- https://www.sbert.net/index.html
- https://www.sbert.net/docs/pretrained_models.html


**Use cases:**

- Sentence Embedding
- Sentence Similarity
- Semantic Search
- Clustering


**Generate Embeddings**


In [1]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

In [2]:
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
]


embeddings = model.encode(sentences)


# Print each sentence followed by its embedding vector
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173794e-02 -4.28515188e-02 -1.56286024e-02  1.40537471e-02
  3.95537578e-02  1.21796258e-01  2.94333752e-02 -3.17524299e-02
  3.54959741e-02 -7.93140009e-02  1.75878536e-02 -4.04369831e-02
  4.97259162e-02  2.54912134e-02 -7.18700588e-02  8.14968869e-02
  1.47070922e-03  4.79627177e-02 -4.50336002e-02 -9.92175043e-02
 -2.81769410e-02  6.45046607e-02  4.44670245e-02 -4.76216860e-02
 -3.52952480e-02  4.38671857e-02 -5.28566130e-02  4.33045177e-04
  1.01921521e-01  1.64072365e-02  3.26996408e-02 -3.45986858e-02
  1.21339420e-02  7.94870928e-02  4.58342116e-03  1.57778263e-02
 -9.68204346e-03  2.87625827e-02 -5.05806208e-02 -1.55793950e-02
 -2.87906528e-02 -9.62281693e-03  3.15556750e-02  2.27348786e-02
  8.71449560e-02 -3.85027267e-02 -8.84718895e-02 -8.75498448e-03
 -2.12343037e-02  2.08923481e-02 -9.02077556e-02 -5.25732450e-02
 -1.05638849e-02  2.88310833e-02 -1.61454957e-02  6.17841491e-03
 -1.23234

**Cosine-Similarity**


In [3]:
emb1 = model.encode("I am eating Apple")
emb2 = model.encode("I like fruits")
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.5398]])
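
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below reproduces that calculation with NumPy on made-up 3-dimensional vectors. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of sentence-transformers; `util.cos_sim` is expected to give the same value when applied to real embeddings.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for real sentence embeddings (illustrative only).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score close to 0, which is why the measure suits comparing embeddings of different magnitude.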


**Compute cosine similarity between all pairs**


In [4]:
# Compute cosine similarity between all pairs

sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "Someone in a gorilla costume is playing a set of drums.",
]

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# cos_sim

In [5]:
# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append((cos_sim[i][j], i, j))
# all_sentence_combinations

In [6]:
# Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(
    all_sentence_combinations, key=lambda x: x[0], reverse=True
)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], score))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474
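
The nested loops above visit each unordered pair exactly once. The same enumeration can be sketched with `itertools.combinations`. The 3x3 matrix here is made-up data standing in for `cos_sim`, and `top_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def top_pairs(sim_matrix, k):
    """Return the k highest-scoring (score, i, j) tuples with i < j."""
    n = len(sim_matrix)
    pairs = [(sim_matrix[i][j], i, j) for i, j in combinations(range(n), 2)]
    return sorted(pairs, key=lambda p: p[0], reverse=True)[:k]


# Made-up symmetric similarity matrix (illustrative only).
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

best = top_pairs(toy_sim, k=2)
print(best)
```

Skipping the diagonal and the lower triangle avoids reporting each sentence as most similar to itself and avoids listing every pair twice.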


**Semantic search**


In [7]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clips/mfaq")



In [8]:
question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

query_embedding = model.encode(question)
corpus_embeddings = model.encode([answer_1, answer_2, answer_3])

print(util.semantic_search(query_embedding, corpus_embeddings))

[[{'corpus_id': 0, 'score': 0.5646326541900635}, {'corpus_id': 2, 'score': 0.5142341256141663}, {'corpus_id': 1, 'score': 0.47300395369529724}]]
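
Conceptually, `util.semantic_search` scores every corpus embedding against the query (cosine similarity by default) and returns the hits sorted by score. Below is a minimal NumPy sketch of that idea, using made-up 2-dimensional vectors and a hypothetical helper `rank_corpus`. It is not the library's implementation, only an illustration of the ranking step.

```python
import numpy as np


def rank_corpus(query, corpus, top_k=3):
    """Rank corpus rows by cosine similarity to the query vector."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Normalize so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # Indices of the highest scores first, truncated to top_k.
    order = np.argsort(-scores)[:top_k]
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]


# Toy query and corpus vectors (illustrative only).
toy_query = [1.0, 0.0]
toy_corpus = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]

hits = rank_corpus(toy_query, toy_corpus, top_k=2)
print(hits)
```

The returned list mirrors the shape of the output above: one dictionary per hit with a `corpus_id` and a `score`, best match first.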


In [9]:
from transformers import pipeline

In [10]:
qa_model = pipeline("question-answering")
question = "How many models can I host on HuggingFace?"
context = "All plans come with unlimited private models and datasets."
qa_model(question=question, context=context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.7017181515693665, 'start': 20, 'end': 29, 'answer': 'unlimited'}



**Clustering**


In [12]:
from sklearn.cluster import KMeans
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Corpus with example sentences
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "Horse is eating grass.",
    "A man is eating pasta.",
    "A Woman is eating Biryani.",
    "The girl is carrying a baby.",
    "The baby is carried by the woman",
    "A man is riding a horse.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah is running behind its prey.",
    "A cheetah chases prey on across a field.",
    "The cheetah is chasing a man who is riding the horse.",
    "man and women with their baby are watching cheetah in zoo",
]
corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings / np.linalg.norm(
    corpus_embeddings, axis=1, keepdims=True
)
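
A likely reason for normalizing before K-means: for unit-length vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so clustering by Euclidean distance ranks neighbors the same way cosine similarity would. The quick numeric check below uses random vectors as stand-ins (made-up data, not the embeddings above).

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8))  # random stand-ins for embeddings

# Same normalization as in the cell above.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise cosine similarity of the unit vectors.
cosine = unit @ unit.T
# Pairwise squared Euclidean distances between the unit vectors.
diff = unit[:, None, :] - unit[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

identity_holds = np.allclose(sq_dist, 2 - 2 * cosine)
print(identity_holds)
```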

In [13]:
# corpus_embeddings[0]

In [14]:
# source: https://stackoverflow.com/questions/55619176/how-to-cluster-similar-sentences-using-bert

# Cluster into 4 groups; no random_state is set, so label numbering can differ between runs
clustering_model = KMeans(n_clusters=4)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

[2 2 1 2 2 0 0 1 1 1 1 1 1 1 3]


In [15]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])
clustered_sentences

{2: ['A man is eating food.',
  'A man is eating a piece of bread.',
  'A man is eating pasta.',
  'A Woman is eating Biryani.'],
 1: ['Horse is eating grass.',
  'A man is riding a horse.',
  'A man is riding a white horse on an enclosed ground.',
  'A monkey is playing drums.',
  'Someone in a gorilla costume is playing a set of drums.',
  'A cheetah is running behind its prey.',
  'A cheetah chases prey on across a field.',
  'The cheetah is chasing a man who is riding the horse.'],
 0: ['The girl is carrying a baby.', 'The baby is carried by the woman'],
 3: ['man and women with their baby are watching cheetah in zoo']}
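
The grouping loop above can also be written with `collections.defaultdict`, which removes the explicit membership check. Below is a small sketch with made-up labels and items; `group_by_cluster` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_cluster(labels, items):
    """Map each cluster label to the list of items assigned to it."""
    groups = defaultdict(list)
    for label, item in zip(labels, items):
        # int() turns NumPy integer labels into plain Python keys.
        groups[int(label)].append(item)
    return dict(groups)


# Toy labels and items (illustrative only).
toy_labels = [1, 0, 1]
toy_items = ["a", "b", "c"]

grouped = group_by_cluster(toy_labels, toy_items)
print(grouped)
```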
