<a href="https://colab.research.google.com/github/JacopoMangiavacchi/SBERT-ZSC/blob/main/SBERT-Cosine-Similarity-Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
sentences = [
    'This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.'
]

# Test Sentence BERT with Hugginface Transformer 

In [2]:
!pip install transformers

from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F
from scipy import spatial

tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
model = AutoModel.from_pretrained('deepset/sentence_bert')



Some weights of the model checkpoint at deepset/sentence_bert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
# run inputs through model and mean-pool over the sequence
# dimension to get sequence-level representations
inputs = tokenizer.batch_encode_plus(sentences,
                                     return_tensors='pt',
                                     pad_to_max_length=True)



In [4]:
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output.mean(dim=1)

In [5]:
# find the highest cosine similarities between sentences
print(F.cosine_similarity(sentence_rep[0], sentence_rep[0], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[1], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[2], dim=0))
print(F.cosine_similarity(sentence_rep[1], sentence_rep[2], dim=0))

tensor(1., grad_fn=<DivBackward0>)
tensor(0.5969, grad_fn=<DivBackward0>)
tensor(-0.1929, grad_fn=<DivBackward0>)
tensor(-0.0781, grad_fn=<DivBackward0>)


In [6]:
embeddings = sentence_rep.detach().numpy()

print(1 - spatial.distance.cosine(embeddings[0], embeddings[0]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[1]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[2]))
print(1 - spatial.distance.cosine(embeddings[1], embeddings[2]))

1.0
0.5969038605690002
-0.19288042187690735
-0.07813181728124619


# Test Sentence BERT with Sentence-Transformer

In [7]:
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

Collecting sentence-transformers
  Downloading sentence-transformers-2.0.0.tar.gz (85 kB)
[?25l[K     |███▉                            | 10 kB 20.1 MB/s eta 0:00:01[K     |███████▋                        | 20 kB 25.9 MB/s eta 0:00:01[K     |███████████▌                    | 30 kB 29.5 MB/s eta 0:00:01[K     |███████████████▎                | 40 kB 32.1 MB/s eta 0:00:01[K     |███████████████████▏            | 51 kB 29.6 MB/s eta 0:00:01[K     |███████████████████████         | 61 kB 29.3 MB/s eta 0:00:01[K     |██████████████████████████▉     | 71 kB 27.8 MB/s eta 0:00:01[K     |██████████████████████████████▋ | 81 kB 29.6 MB/s eta 0:00:01[K     |████████████████████████████████| 85 kB 4.1 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 26.1 MB/s 
Building wheels for collected packages: sentence-transformers
  Building wheel for sente

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [8]:
print("Max Sequence Length:", model.max_seq_length)

embeddings = model.encode(sentences)

Max Sequence Length: 256


In [9]:
print(1 - spatial.distance.cosine(embeddings[0], embeddings[0]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[1]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[2]))
print(1 - spatial.distance.cosine(embeddings[1], embeddings[2]))

1.0
0.5380793213844299
0.11805637180805206
0.10358978062868118
