## Introduction to sentence embeddings

A pre-trained BERT model does not, out of the box, produce sentence embeddings that are
both efficient to compute and meaningful on their own; it is usually fine-tuned end to end
in a supervised setting. One way to see this is to think of a pre-trained BERT model as an
indivisible whole in which semantic information is spread across all layers, not just the
final one. Without fine-tuning, using its internal representations independently may be
ineffective. This also makes BERT hard to apply to unsupervised tasks such as clustering,
topic modeling, information retrieval, or semantic search. In clustering, for instance,
every pair of sentences has to be passed through the model together, so the number of
forward passes grows quadratically with the number of sentences and causes massive
computational overhead.
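
To get a feel for that overhead, the following sketch counts model forward passes under two assumptions: a pairwise setup scores every unordered sentence pair jointly, while an embedding-based setup encodes each sentence once and compares the vectors afterwards. The function names are hypothetical and no model is involved; this is only the arithmetic.

```python
from math import comb

def pairwise_forward_passes(n_sentences: int) -> int:
    """Forward passes when every unordered sentence pair is fed to the model together."""
    return comb(n_sentences, 2)

def embedding_forward_passes(n_sentences: int) -> int:
    """Forward passes when each sentence is embedded once, independently of the others."""
    return n_sentences

for n in (100, 1_000, 10_000):
    print(n, pairwise_forward_passes(n), embedding_forward_passes(n))
```

For 10,000 sentences the pairwise count is 49,995,000 against 10,000 independent encodings, which is why independent sentence embeddings matter for clustering and search.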

Fortunately, several modifications of the original BERT model, such as Sentence-BERT
(SBERT), have been proposed to derive semantically meaningful and independent sentence
embeddings. We will talk about these approaches in a moment. In the NLP literature, many
neural sentence embedding methods have been proposed for mapping a single sentence to a
common feature space (vector space model), in which cosine similarity (or the dot product)
is usually used to measure similarity and the Euclidean distance to measure dissimilarity.
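
As a small illustration of these measures, here is a sketch in plain Python. The three vectors are made-up 3-dimensional stand-ins for sentence embeddings (real embeddings have hundreds of dimensions), and the helper names are chosen only for this example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean_distance(u, v):
    """Straight-line distance between u and v; sensitive to vector length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy "embeddings"
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice as long
c = [-3.0, 0.0, 1.0]   # orthogonal to a

print("cos(a, b) =", cosine_similarity(a, b))
print("cos(a, c) =", cosine_similarity(a, c))
print("dist(a, b) =", euclidean_distance(a, b))
```

Vectors `a` and `b` point in the same direction, so their cosine similarity is (up to rounding) 1 even though their Euclidean distance is not zero. This is one reason cosine similarity is a common default when only the direction of an embedding is meaningful.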

The following are some applications that can be efficiently solved with sentence embeddings:

• Sentence-pair tasks

• Information retrieval

• Question answering

• Duplicate question detection

• Paraphrase detection

• Document clustering

• Topic modeling
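
Several of the applications listed above (information retrieval, duplicate question detection, semantic search) reduce to the same operation: rank stored vectors by their similarity to a query vector. The sketch below shows that step only. The corpus texts, the 3-dimensional vectors, and the function names are hypothetical; in practice the vectors would come from a sentence embedding model.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (score, index) pairs sorted from most to least similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    return sorted(scored, reverse=True)

# Hypothetical corpus with made-up embeddings
corpus = [
    "How do I reset my password?",
    "Best pasta recipes",
    "Password recovery steps",
]
corpus_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of a password-related query

for score, idx in rank_by_similarity(query_vec, corpus_vecs):
    print(f"{score:.3f}  {corpus[idx]}")
```

Because the corpus vectors do not depend on the query, they can be computed once and reused for every search.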

## Benchmarking sentence similarity models

In [4]:
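from datasets import load_metric, load_dataset
# Note: newer releases of datasets deprecate load_metric in favor of the evaluate library
metric = load_metric('glue', 'mrpc')   # accuracy and F1 for the MRPC task
mrpc = load_dataset('glue', 'mrpc')    # Microsoft Research Paraphrase Corpus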

In [3]:
metric = load_metric('glue', 'stsb')
metric.compute(predictions=[1,2,3],references=[5,2,2])

{'pearson': -0.8660254037844388, 'spearmanr': -0.8660254037844387}
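
The two numbers returned by the STS-B metric are the Pearson correlation of the raw values and the Spearman correlation, which is the Pearson correlation of the ranks. The sketch below recomputes both by hand for the same toy inputs. It is a plain-Python illustration with helper names chosen for this example, not the code the metric itself runs.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

predictions = [1, 2, 3]
references = [5, 2, 2]
print("pearson  :", pearson(predictions, references))
print("spearmanr:", spearman(predictions, references))
```

Both values come out at about -0.866: the predictions increase while the references decrease, so the toy "model" is negatively correlated with the labels.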

In [7]:
!pip install tensorflow-hub
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.0/86.0 kB 1.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125925 sha256=fbd7de60e4c879e32aff99d6bcb112db24305c74b2e2edefb65d60ea00cfaca1
  Stored in directory: /home/nitiz/.cache/pip/wheels/ff/27/bf/ffba8b318b02d7f691a57084ee154e26ed24d012b0c7805881
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2


In [8]:
from datasets import load_metric, load_dataset
stsb_metric = load_metric('glue', 'stsb')
stsb = load_dataset('glue', 'stsb')

In [9]:
import tensorflow_hub as hub
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
from sentence_transformers import SentenceTransformer
distilroberta = SentenceTransformer('stsb-distilroberta-base-v2')

In [12]:
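import tensorflow as tf
import math
def use_sts_benchmark(batch):
    # Embed both sentence columns with USE and scale every vector to unit length
    sts_encode1 = tf.nn.l2_normalize(use_model(tf.constant(batch['sentence1'])), axis=1)
    sts_encode2 = tf.nn.l2_normalize(use_model(tf.constant(batch['sentence2'])), axis=1)
    # For unit vectors, the row-wise dot product is the cosine similarity
    cosine_similarities = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)
    # Clip to [-1, 1] so that rounding errors cannot push acos outside its domain
    clip_cosine_similarities = tf.clip_by_value(cosine_similarities, -1.0, 1.0)
    # Angular similarity: 1 for identical directions, 0 for opposite directions
    scores = 1.0 - tf.acos(clip_cosine_similarities) / math.pi
    return scores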

In [13]:
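def roberta_sts_benchmark(batch):
    # Same scoring as use_sts_benchmark, with embeddings from the DistilRoBERTa model
    # (relies on the tf and math imports from the previous cell)
    sts_encode1 = tf.nn.l2_normalize(distilroberta.encode(batch['sentence1']), axis=1)
    sts_encode2 = tf.nn.l2_normalize(distilroberta.encode(batch['sentence2']), axis=1)
    # Row-wise dot product of unit vectors = cosine similarity
    cosine_similarities = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)
    clip_cosine_similarities = tf.clip_by_value(cosine_similarities, -1.0, 1.0)
    # Map the angle between the two embeddings to a score in [0, 1]
    scores = 1.0 - tf.acos(clip_cosine_similarities) / math.pi
    return scores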
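
Both benchmark functions above end with the same transformation, `1.0 - acos(cosine) / pi`, which turns a cosine similarity into an angular similarity between 0 and 1. The plain-Python sketch below applies that formula to a few cosine values to show how it behaves. The function name is chosen for this example only.

```python
import math

def angular_similarity(cosine: float) -> float:
    """Map a cosine similarity in [-1, 1] to 1 - angle/pi, which lies in [0, 1]."""
    clipped = max(-1.0, min(1.0, cosine))  # guard against values such as 1.0000001
    return 1.0 - math.acos(clipped) / math.pi

for cosine in (1.0, 0.5, 0.0, -1.0):
    print(f"cosine={cosine:+.1f}  angular similarity={angular_similarity(cosine):.4f}")
```

Identical directions score 1, orthogonal vectors 0.5, and opposite directions 0. Since the mapping is monotonic, it preserves the ranking of sentence pairs and therefore the Spearman correlation; only the Pearson correlation can change relative to using the raw cosine.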

In [14]:
use_results = use_sts_benchmark(stsb['validation'])
distilroberta_results = roberta_sts_benchmark(stsb['validation'])

In [16]:
references = [item['label'] for item in stsb['validation']]

In [17]:
results = {
    "USE":stsb_metric.compute(
        predictions=use_results,
        references=references),

    "DistilRoBERTa":stsb_metric.compute(
        predictions=distilroberta_results,
        references=references)
}

In [18]:
import pandas as pd
pd.DataFrame(results)

| metric | USE | DistilRoBERTa |
|---|---|---|
| pearson | 0.810301 | 0.888461 |
| spearmanr | 0.808917 | 0.889246 |


## Using BART for zero-shot learning

In [19]:
from transformers import pipeline
import pandas as pd
classifier = pipeline("zero-shot-classification",
                        model="facebook/bart-large-mnli")
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel',
                    'cooking',
                    'dancing',
                    'exploration']
result = classifier(sequence_to_classify, candidate_labels)
pd.DataFrame(result)

| | sequence | labels | scores |
|---|---|---|---|
| 0 | one day I will see the world | travel | 0.795756 |
| 1 | one day I will see the world | exploration | 0.199332 |
| 2 | one day I will see the world | dancing | 0.002621 |
| 3 | one day I will see the world | cooking | 0.002291 |
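
The zero-shot pipeline can do this without any task-specific training because it reframes classification as natural language inference: the input text is the premise, and each candidate label is inserted into a hypothesis sentence that the NLI model then scores. The sketch below shows only that reframing step, assuming the pipeline's default `hypothesis_template` of `"This example is {}."`. The helper `build_nli_pairs` is hypothetical and calls no model.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the input text (premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

pairs = build_nli_pairs(
    "one day I will see the world",
    ["travel", "cooking", "dancing", "exploration"],
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r}  hypothesis: {hypothesis!r}")
```

A different template can be passed to the real pipeline through its `hypothesis_template` argument when the default wording does not suit the labels.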


In [21]:
result = classifier(sequence_to_classify,
                    candidate_labels,
                    multi_label=True)
pd.DataFrame(result)

| | sequence | labels | scores |
|---|---|---|---|
| 0 | one day I will see the world | travel | 0.994511 |
| 1 | one day I will see the world | exploration | 0.938388 |
| 2 | one day I will see the world | dancing | 0.005706 |
| 3 | one day I will see the world | cooking | 0.001819 |
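
The two result tables differ because the scores are normalized differently. In the default single-label mode, the entailment logits of all candidate labels are put through one softmax, so the scores compete and sum to 1. With `multi_label=True`, each label gets its own softmax over its contradiction and entailment logits, so every score is an independent probability. The sketch below reproduces that difference in plain Python; the logits are invented for illustration and are not the values the BART model produced.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical (contradiction, entailment) logits for each candidate label
logits = {
    "travel": (-2.0, 3.0),
    "exploration": (-1.0, 1.5),
    "dancing": (2.5, -2.5),
    "cooking": (3.0, -3.0),
}
labels = list(logits)

# Single-label mode: one softmax across the entailment logits of all labels
single_label = dict(zip(labels, softmax([logits[l][1] for l in labels])))

# Multi-label mode: a separate softmax per label over (contradiction, entailment)
multi_label = {l: softmax(list(logits[l]))[1] for l in labels}

for l in labels:
    print(f"{l:12s} single={single_label[l]:.4f}  multi={multi_label[l]:.4f}")
print("sum single:", sum(single_label.values()))
print("sum multi :", sum(multi_label.values()))
```

This matches the pattern in the outputs above: the single-label scores add up to 1, while in multi-label mode both `travel` and `exploration` can score above 0.9 at the same time.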
