# Sentence-BERT

### Sentence similarity
https://www.sbert.net/docs/sentence_transformer/usage/semantic_textual_similarity.html  
"For Semantic Textual Similarity (STS), we want to produce embeddings for all texts involved and calculate the similarities between them. The text pairs with the highest similarity score are most semantically similar."

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-5.1.1-py3-none-any.whl.metadata (16 kB)
Downloading sentence_transformers-5.1.1-py3-none-any.whl (486 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 5.1.0
    Uninstalling sentence-transformers-5.1.0:
      Successfully uninstalled sentence-transformers-5.1.0
Successfully installed sentence-transformers-5.1.1


In [2]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cuda')




In [3]:
# an embedding is a vector encoding information from a document
sentences = ["That is a happy person", "That is a very happy person", "That is a sad person", "Today is a sunny day"]
embeddings = model.encode(sentences)
print(embeddings)

[[-0.03387689  0.09194157  0.04870133 ... -0.01439257 -0.02754978
   0.0447583 ]
 [-0.00248319  0.09151711  0.04838626 ... -0.02641123 -0.07529833
   0.0280321 ]
 [ 0.02025572  0.09257689  0.06799116 ... -0.08203211 -0.01837798
   0.03276783]
 [-0.01629126  0.10406607  0.09740782 ...  0.00676724 -0.0878846
   0.03404385]]


In [9]:
# how many dimensions does an SBERT embedding have? (384 for all-MiniLM-L6-v2)
embeddings[0], embeddings[1], len(embeddings[0])

(array([-3.38768885e-02,  9.19415653e-02,  4.87013273e-02, -3.48835699e-02,
        -6.48292005e-02, -2.66857166e-02,  1.34293362e-01, -6.91503007e-03,
         6.44350797e-02, -5.82767278e-03,  8.87372568e-02, -1.62496027e-02,
        -2.54945345e-02,  4.83909110e-03,  6.14907779e-03,  1.55435819e-02,
        -5.95201775e-02, -3.20248120e-02,  1.41185485e-02,  2.05608155e-03,
        -1.00310877e-01, -2.04246351e-03, -2.08596904e-02,  9.96047258e-03,
        -1.69836283e-02, -1.64659843e-02,  4.00910787e-02, -2.72044679e-03,
         8.66091028e-02,  6.33227602e-02, -2.68440880e-02, -2.35456210e-02,
         1.09181754e-01,  2.25531720e-02, -3.85773405e-02,  1.94851141e-02,
        -3.15519683e-02,  1.68709159e-02, -9.62976553e-03,  2.02890281e-02,
        -1.82442181e-02,  1.77636929e-02,  1.86448377e-02,  1.22921569e-02,
        -2.05459236e-03, -3.49595100e-02,  6.22535385e-02, -4.34291251e-02,
         7.87903890e-02, -2.45035253e-02, -1.76689494e-02,  2.36276668e-02,
        -5.7

In [10]:
# cosine similarity between two embeddings
float(model.similarity(embeddings[0], embeddings[1]))

0.9429150223731995
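
For this model, `model.similarity` returns cosine similarity: the dot product of two vectors divided by the product of their lengths. Below is a minimal pure-Python sketch of that formula. The vectors are made-up 3-dimensional examples rather than real embeddings, and the helper name `cosine_similarity` is illustrative, not part of the library.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical, chosen so the results are easy to check by hand).
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, so similarity is close to 1
w = [-3.0, 0.0, 1.0]  # orthogonal to u, so similarity is close to 0

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Because only the direction of the vectors matters, scaling an embedding does not change its cosine similarity to any other embedding.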

In [11]:
sentences

['That is a happy person',
 'That is a very happy person',
 'That is a sad person',
 'Today is a sunny day']

In [12]:
embeddings

array([[-0.03387689,  0.09194157,  0.04870133, ..., -0.01439257,
        -0.02754978,  0.0447583 ],
       [-0.00248319,  0.09151711,  0.04838626, ..., -0.02641123,
        -0.07529833,  0.0280321 ],
       [ 0.02025572,  0.09257689,  0.06799116, ..., -0.08203211,
        -0.01837798,  0.03276783],
       [-0.01629126,  0.10406607,  0.09740782, ...,  0.00676724,
        -0.0878846 ,  0.03404385]], dtype=float32)

In [13]:
# pairwise similarity matrix: entry [i, j] compares sentence i with sentence j
similarities = model.similarity(embeddings, embeddings)
similarities

tensor([[1.0000, 0.9429, 0.6561, 0.2569],
        [0.9429, 1.0000, 0.6407, 0.2106],
        [0.6561, 0.6407, 1.0000, 0.1681],
        [0.2569, 0.2106, 0.1681, 1.0000]])

In [17]:
# which sentence is most similar to sentence 0?
model.similarity(embeddings[0], embeddings[1:4])

tensor([[0.9429, 0.6561, 0.2569]])
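
To answer the question in the cell above programmatically, the scores can be paired with their sentences and sorted. This is a small sketch that hard-codes the three scores printed above instead of calling the model, so it runs without a GPU or any download.

```python
sentences = [
    "That is a happy person",
    "That is a very happy person",
    "That is a sad person",
    "Today is a sunny day",
]

# Similarity of sentence 0 to sentences 1..3, copied from the output above.
scores_vs_first = [0.9429, 0.6561, 0.2569]

# Pair each candidate sentence with its score and sort from most to least similar.
ranked = sorted(
    zip(sentences[1:], scores_vs_first),
    key=lambda pair: pair[1],
    reverse=True,
)
best_sentence, best_score = ranked[0]

for sentence, score in ranked:
    print(f"{score:.4f}\t{sentence}")
```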

### Reranking
https://www.sbert.net/docs/cross_encoder/usage/usage.html  
"A reranking model — also known as a cross-encoder — is a type of model that, given a query and document pair, will output a similarity score. We use this score to reorder the documents by relevance to our query." (https://www.pinecone.io/learn/series/rag/rerankers/)

In [32]:
from sentence_transformers import CrossEncoder

# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device='cuda')


In [33]:
# 2. Predict scores for (query, passage) pairs
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
print(scores)

[ 8.607139  -4.3200774]
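
The two scores above are unbounded and appear to be raw logits rather than probabilities. If a value between 0 and 1 is easier to interpret, one common option is to pass the logits through a logistic sigmoid. This is an assumption about how one might post-process the output, not something the notebook itself does.

```python
import math

def sigmoid(x):
    """Map a real-valued logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Raw cross-encoder scores copied from the output above.
raw_scores = [8.607139, -4.3200774]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"logit {raw:+.2f} -> {prob:.4f}")
```

The sigmoid is monotonic, so it changes the scale of the scores but never the order in which passages are ranked.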


In [34]:
# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")

Query: How many people live in Berlin?
8.92	The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61	Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24	An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60	In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35	In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42	Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45	In 2015, the total labour force in Berlin was 1.85 million.
0.33	Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24	The city of Paris had a population of 2,165,423 people within its administrative city limits as of Jan

In [None]:
# Let's try one more model
!pip install -U FlagEmbedding

In [35]:
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True)


In [36]:
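# bge-m3 produces three kinds of representation for each text:
#   dense:   one vector per text, capturing 'semantic' similarity
#   sparse:  per-token lexical weights, rewarding matching words
#   colbert: one vector per token (multi-vector), matched token by token between query and passage
# compute_score returns one score per mode plus weighted combinations of them;
# weights_for_different_modes is ordered [dense, sparse, colbert] according to the FlagEmbedding README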
score_dict = model.compute_score([[query, passage] for passage in passages],
                          max_passage_length=128, # a smaller max length leads to a lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4])
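
The combined scores appear to be weighted averages of the per-mode scores, normalised by the sum of the weights involved. The sketch below checks this against the values that `score_dict` reports for the first passage in this notebook. The weight order `[dense, sparse, colbert]` is taken from the FlagEmbedding README and should be treated as an assumption here; with these particular weights the two 0.4 entries make the result the same either way.

```python
# Per-mode scores for the first passage, copied from the score_dict output in this notebook.
dense, sparse, colbert = 0.751953125, 0.1807861328125, 0.7814275026321411

# Assumed order of weights_for_different_modes: [dense, sparse, colbert].
w_dense, w_sparse, w_colbert = 0.4, 0.2, 0.4

# Weighted average over the modes that take part in each combination.
sparse_dense = (w_dense * dense + w_sparse * sparse) / (w_dense + w_sparse)
all_three = (
    w_dense * dense + w_sparse * sparse + w_colbert * colbert
) / (w_dense + w_sparse + w_colbert)

print(f"sparse+dense:         {sparse_dense:.4f}")
print(f"colbert+sparse+dense: {all_three:.4f}")
```

Small differences from the library's own numbers in later decimal places are expected, since the model runs in half precision (`use_fp16=True`).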

In [37]:
score_dict

{'colbert': [0.7814275026321411,
  0.6263712644577026,
  0.6892518401145935,
  0.7843672037124634,
  0.5183130502700806,
  0.7526565790176392,
  0.6591866612434387,
  0.6817111968994141,
  0.6112527847290039,
  0.7038821578025818],
 'sparse': [0.1807861328125,
  0.1295166015625,
  0.17578125,
  0.1925048828125,
  0.0462646484375,
  0.1409912109375,
  0.1322021484375,
  0.1488037109375,
  0.0189208984375,
  0.1275634765625],
 'dense': [0.751953125,
  0.61328125,
  0.67822265625,
  0.7509765625,
  0.50244140625,
  0.71142578125,
  0.619140625,
  0.67236328125,
  0.62060546875,
  0.6611328125],
 'sparse+dense': [0.5615640878677368,
  0.4520263671875,
  0.5107421875,
  0.5648193359375,
  0.3503825068473816,
  0.5212808847427368,
  0.4568278193473816,
  0.4978434145450592,
  0.4200439453125,
  0.4832763969898224],
 'colbert+sparse+dense': [0.6495094895362854,
  0.52176433801651,
  0.5821460485458374,
  0.6526384949684143,
  0.41755473613739014,
  0.6138311624526978,
  0.5377713441848755,
  

### Natural Language Inference/Textual Entailment (NLI/TE)
https://www.sbert.net/docs/cross_encoder/pretrained_models.html  
"Given two sentences, are these contradicting each other, entailing one the other or are these neutral? The following models were trained on the SNLI and MultiNLI datasets."

In [38]:
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/nli-deberta-v3-base", device='cuda')
scores = model.predict([
    ("That is a happy person", "That is a very happy person"),
    ("That is a happy person", "That is a sad person"),
    ("That is a happy person", "Today is a sunny day"),
])

# Convert scores to labels
label_mapping = ["contradiction", "entailment", "neutral"]
labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)]
# => ['entailment', 'contradiction', 'neutral']


In [39]:
labels

['entailment', 'contradiction', 'neutral']
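
The `argmax` step above keeps only the winning class. To see how confident the model is, the three scores for a pair can be turned into probabilities with a softmax. The sketch below uses hypothetical logits (the notebook does not print the real ones) together with the same `label_mapping` order as the cell above.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_mapping = ["contradiction", "entailment", "neutral"]

# Hypothetical logits for one sentence pair, in label_mapping order.
logits = [-2.0, 4.0, 0.5]
probs = softmax(logits)

# Index of the largest probability, equivalent to argmax over the logits.
best = max(range(len(probs)), key=probs.__getitem__)
label = label_mapping[best]

for name, p in zip(label_mapping, probs):
    print(f"{name:>13}: {p:.3f}")
print("predicted:", label)
```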

# Keyword Extraction
https://github.com/MaartenGr/KeyBERT  
"First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document."

In [40]:
!pip install keybert

Collecting keybert
  Downloading keybert-0.9.0-py3-none-any.whl.metadata (15 kB)
Downloading keybert-0.9.0-py3-none-any.whl (41 kB)
Installing collected packages: keybert
Successfully installed keybert-0.9.0


In [41]:
from keybert import KeyBERT
kw_model = KeyBERT()

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

In [42]:
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, highlight=True)
print(keywords)

[('supervised', 0.6676), ('labeled', 0.4896), ('learning', 0.4813), ('training', 0.4134), ('labels', 0.3947)]


In [43]:
keywords_bigram = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
print(keywords_bigram)

[('supervised learning', 0.6779), ('supervised', 0.6676), ('signal supervised', 0.6152), ('in supervised', 0.6124), ('labeled training', 0.6013)]
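
Candidates such as 'signal supervised' look odd because they span a sentence boundary. A plausible explanation, assuming KeyBERT builds its candidates with scikit-learn's `CountVectorizer` and its default token pattern (tokens of two or more word characters), is that punctuation and the one-letter word "A" are dropped before n-grams are formed. The helper `ngram_candidates` below is a hypothetical re-implementation of that step for illustration, not KeyBERT's own code.

```python
import re

def ngram_candidates(text, ngram_range=(1, 2)):
    """Collect word n-grams after a CountVectorizer-style tokenisation."""
    # Keep lowercase tokens of at least two word characters; punctuation
    # and one-letter words such as "a" disappear at this point.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    shortest, longest = ngram_range
    candidates = set()
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[i:i + n]))
    return candidates

# Fragment of the document above, around the sentence boundary.
snippet = "(also called the supervisory signal). A supervised learning algorithm"
cands = ngram_candidates(snippet)

print(sorted(cands))
print("signal supervised" in cands)
```

Passing `stop_words="english"` instead of `None` to `extract_keywords` is one way to filter out candidates such as 'in supervised'.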


# Named-Entity Recognition (NER)
https://huggingface.co/dslim/bert-base-NER  
"bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PER) and Miscellaneous (MISC)."

In [44]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [45]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

Device set to use cuda:0


[{'entity': 'B-PER', 'score': np.float32(0.9990139), 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': np.float32(0.999645), 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]


In [46]:
token = ner_results[0]
example[token["start"] : token["end"]]

'Wolfgang'
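
The `B-` and `I-` prefixes in the entity labels mark the beginning and the continuation of an entity, so a multi-word name arrives as several separate entries. The sketch below merges such entries into spans. The input list is hypothetical (shaped like the pipeline output above, but not produced by the model), and `group_entities` is an illustrative helper. The pipeline's own `aggregation_strategy` argument offers similar grouping.

```python
def group_entities(tokens):
    """Merge B-/I- tagged tokens into entity spans with character offsets."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1]["label"] == label:
            # Continuation of the previous entity: extend its end offset.
            spans[-1]["end"] = tok["end"]
        else:
            # A B- tag (or a stray I- tag) starts a new entity span.
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return spans

text = "Angela Merkel visited New York"

# Hypothetical token-level results for the text above.
tokens = [
    {"entity": "B-PER", "start": 0, "end": 6},
    {"entity": "I-PER", "start": 7, "end": 13},
    {"entity": "B-LOC", "start": 22, "end": 25},
    {"entity": "I-LOC", "start": 26, "end": 30},
]

spans = group_entities(tokens)
for span in spans:
    print(span["label"], text[span["start"]:span["end"]])
```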

# Emotion Detection
https://huggingface.co/j-hartmann/emotion-english-distilroberta-base  
"With this model, you can classify emotions in English text data. The model was trained on 6 diverse datasets (see Appendix below) and predicts Ekman's 6 basic emotions, plus a neutral class."

In [47]:
from transformers import pipeline
classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True)


Device set to use cuda:0


In [48]:
classifier("I love this!")

[[{'label': 'anger', 'score': 0.004419785924255848},
  {'label': 'disgust', 'score': 0.001611991785466671},
  {'label': 'fear', 'score': 0.00041385178337804973},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764591973274946},
  {'label': 'sadness', 'score': 0.0020923891570419073},
  {'label': 'surprise', 'score': 0.00852868054062128}]]
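
When only the dominant emotion is needed, the list of label/score dictionaries can be reduced to its highest-scoring entry. The sketch below does this for the scores printed above, copied in by hand so that it runs without loading the model.

```python
# Scores for "I love this!", copied from the output above.
scores = [
    {"label": "anger", "score": 0.004419785924255848},
    {"label": "disgust", "score": 0.001611991785466671},
    {"label": "fear", "score": 0.00041385178337804973},
    {"label": "joy", "score": 0.9771687984466553},
    {"label": "neutral", "score": 0.005764591973274946},
    {"label": "sadness", "score": 0.0020923891570419073},
    {"label": "surprise", "score": 0.00852868054062128},
]

# Pick the entry with the largest score.
top = max(scores, key=lambda item: item["score"])
print(top["label"], round(top["score"], 4))
```

Applied to many texts, the same reduction gives one label per text, which can then be counted to summarise a dataset.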

# On Your Own
- Choose two tasks from this notebook that would benefit your project and carry them out. Run your model on at least 100 data points. What did you find?
- Go to https://sbert.net/index.html and click on 'pretrained models'. Do you find any additional interesting models?