<a href="https://colab.research.google.com/github/JacopoMangiavacchi/SBERT-ZSC/blob/main/SBERT_Cosine_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.16-py3-none-any.whl (50 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installati

In [2]:
from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F
from scipy import spatial
import torch

In [3]:
sentences = [
    'This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.'
]

# Test BERT Sentence Embedding with standard BERT model CLS output embedding

In [4]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # per-token hidden states: [batch, seq_len, hidden_size]
pooled_output = outputs[1]  # pooled [CLS] representation: [batch, hidden_size]
last_hidden_states.shape, pooled_output.shape

(torch.Size([1, 8, 768]), torch.Size([1, 768]))

In [6]:
# tokenize the sentences as one batch, padding each one
# to the length of the longest sentence
inputs = tokenizer.batch_encode_plus(sentences,
                                     return_tensors='pt',
                                     padding=True)



In [7]:
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]

In [8]:
output.shape

torch.Size([3, 13, 768])

In [9]:
# cosine similarity between the [CLS] token embeddings (position 0) of each sentence pair
print(F.cosine_similarity(output[0][0], output[0][0], dim=0))
print(F.cosine_similarity(output[0][0], output[1][0], dim=0))
print(F.cosine_similarity(output[0][0], output[2][0], dim=0))
print(F.cosine_similarity(output[1][0], output[2][0], dim=0))

tensor(1., grad_fn=<DivBackward0>)
tensor(0.8859, grad_fn=<DivBackward0>)
tensor(0.7584, grad_fn=<DivBackward0>)
tensor(0.7234, grad_fn=<DivBackward0>)
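
The scores above are cosine similarities: the dot product of two vectors divided by the product of their norms. As a minimal sketch in plain Python, the quantity `F.cosine_similarity` computes for a pair of 1-D vectors can be written as follows (ignoring the small epsilon PyTorch uses to avoid division by zero). The helper name `cosine_similarity` and the toy vectors are illustrative and not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# toy vectors, not real BERT embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The value depends only on the angle between the vectors, not on their lengths, which is why a sentence compared with itself always scores 1.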


# Test BERT Sentence Embedding with standard BERT model averaging output embeddings

In [10]:
# average over the token dimension (padding positions are included)
sentence_rep = output.mean(dim=1)

In [11]:
# cosine similarity between the mean-pooled embeddings of each sentence pair
print(F.cosine_similarity(sentence_rep[0], sentence_rep[0], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[1], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[2], dim=0))
print(F.cosine_similarity(sentence_rep[1], sentence_rep[2], dim=0))

tensor(1., grad_fn=<DivBackward0>)
tensor(0.8120, grad_fn=<DivBackward0>)
tensor(0.5005, grad_fn=<DivBackward0>)
tensor(0.4590, grad_fn=<DivBackward0>)
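
`output.mean(dim=1)` averages over all 13 token positions, including the padding added to the shorter sentences. A common alternative weights each position by the attention mask so that padding does not contribute. The sketch below shows that arithmetic in plain Python on toy data. The function name `masked_mean_pool` and the inputs are illustrative assumptions, and this variant was not used to produce the numbers above.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average each sentence's token vectors, skipping positions where mask == 0.

    token_embeddings: nested list shaped [batch][seq_len][hidden]
    attention_mask:   nested list shaped [batch][seq_len] with 0/1 entries
    Assumes every sentence has at least one unmasked token.
    """
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, m in zip(tokens, mask) if m == 1]
        hidden = len(tokens[0])
        pooled.append([sum(vec[i] for vec in kept) / len(kept)
                       for i in range(hidden)])
    return pooled


# toy batch: one sentence with two real tokens and one padding position
token_embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]

print(masked_mean_pool(token_embeddings, attention_mask))  # [[2.0, 3.0]]
```

In PyTorch the same idea is usually written as `(output * mask).sum(dim=1) / mask.sum(dim=1)` with `mask = attention_mask.unsqueeze(-1).float()`.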


# Test BERT Sentence Embedding with Hugging Face Sentence_Bert

In [12]:
tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
model = AutoModel.from_pretrained('deepset/sentence_bert')

Some weights of the model checkpoint at deepset/sentence_bert were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
# tokenize the sentences as one batch, padding each one
# to the length of the longest sentence
inputs = tokenizer.batch_encode_plus(sentences,
                                     return_tensors='pt',
                                     padding=True)



In [14]:
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output.mean(dim=1)

In [15]:
# cosine similarity between the mean-pooled embeddings of each sentence pair
print(F.cosine_similarity(sentence_rep[0], sentence_rep[0], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[1], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[2], dim=0))
print(F.cosine_similarity(sentence_rep[1], sentence_rep[2], dim=0))

tensor(1., grad_fn=<DivBackward0>)
tensor(0.5969, grad_fn=<DivBackward0>)
tensor(-0.1929, grad_fn=<DivBackward0>)
tensor(-0.0781, grad_fn=<DivBackward0>)


In [16]:
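# detach from the autograd graph and convert to NumPy for SciPy
embeddings = sentence_rep.detach().numpy()

# SciPy returns the cosine distance, so similarity is 1 - distance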
print(1 - spatial.distance.cosine(embeddings[0], embeddings[0]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[1]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[2]))
print(1 - spatial.distance.cosine(embeddings[1], embeddings[2]))

1.0
0.5969038605690002
-0.19288042187690735
-0.07813181728124619


# Test BERT Sentence Embedding with Sentence-Transformer

In [17]:
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

Collecting sentence-transformers
  Downloading sentence-transformers-2.0.0.tar.gz (85 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence

In [18]:
print("Max Sequence Length:", model.max_seq_length)

embeddings = model.encode(sentences)

Max Sequence Length: 256


In [19]:
print(1 - spatial.distance.cosine(embeddings[0], embeddings[0]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[1]))
print(1 - spatial.distance.cosine(embeddings[0], embeddings[2]))
print(1 - spatial.distance.cosine(embeddings[1], embeddings[2]))

1.0
0.5380793213844299
0.11805637180805206
0.10358978062868118
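
The `print` calls above compare a fixed set of pairs by hand. With more sentences, a loop over all pairs is easier to maintain. A minimal sketch, assuming a list of equal-length vectors: `pairwise_cosine` and `cosine_similarity` are hypothetical helpers, and the toy vectors stand in for real embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def pairwise_cosine(vectors):
    """Return {(i, j): similarity} for every index pair with i < j."""
    return {(i, j): cosine_similarity(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}


# toy vectors, not real sentence embeddings
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
for (i, j), score in pairwise_cosine(toy).items():
    print(i, j, round(score, 4))
```

Keying the result by index pair keeps each score traceable to the sentences it compares.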


# Test Hugging Face Sentence-Bert with the standard 'bert-base-uncased' tokenizer

In [20]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('deepset/sentence_bert')

Some weights of the model checkpoint at deepset/sentence_bert were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
# tokenize with the bert-base-uncased tokenizer, padding to the longest sentence
inputs = tokenizer.batch_encode_plus(sentences,
                                     return_tensors='pt',
                                     padding=True)

input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
output = model(input_ids, attention_mask=attention_mask)[0]
sentence_rep = output.mean(dim=1)



In [22]:
print(F.cosine_similarity(sentence_rep[0], sentence_rep[0], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[1], dim=0))
print(F.cosine_similarity(sentence_rep[0], sentence_rep[2], dim=0))
print(F.cosine_similarity(sentence_rep[1], sentence_rep[2], dim=0))

tensor(1., grad_fn=<DivBackward0>)
tensor(0.5969, grad_fn=<DivBackward0>)
tensor(-0.1929, grad_fn=<DivBackward0>)
tensor(-0.0781, grad_fn=<DivBackward0>)
