# Testing the GTE-base embeddings

Much of the code below was borrowed from [the GTE-base page on HuggingFace](https://huggingface.co/thenlper/gte-base).

We first show that the model can encode a handful of short texts, then run some cosine-similarity experiments to see how much of a difference chunking makes.



In [3]:
!pip install torch
!pip install transformers

Collecting transformers
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [4]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

In [16]:
from scipy.spatial.distance import cosine
import pandas as pd

In [5]:
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")

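To make it clearer what `average_pool` is doing, here is a dependency-free sketch of the same idea on a toy batch: average the token vectors of each sequence, but leave out the positions the attention mask marks as padding. The function name `masked_mean` and the toy numbers are ours, purely for illustration; the real function works on PyTorch tensors.

```python
def masked_mean(hidden, mask):
    """Average the token vectors of each sequence, skipping padded positions."""
    pooled = []
    for token_vectors, token_mask in zip(hidden, mask):
        # Keep only the vectors whose mask entry is 1 (real tokens).
        kept = [vec for vec, keep in zip(token_vectors, token_mask) if keep]
        dim = len(token_vectors[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# Toy batch: 2 sequences, 3 token positions, 2 hidden dimensions.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]

pooled = masked_mean(hidden, mask)
print(pooled)  # [[2.0, 3.0], [2.0, 2.0]]
```

The padded vector `[100.0, 100.0]` has no effect on the first result, which is the point of the mask: short texts in a batch are not dragged around by their padding.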

In [74]:
input_texts = ["What is the largest satellite of Jupiter and indeed in the solar system?",
               "Beijing.",
               "Ganymede is composed of approximately equal amounts of silicate rock and water. It has an iron-rich, liquid core, and an internal ocean that may contain more water than all of Earth's oceans combined.",
               "What is the capital of China?"]


In [75]:
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
raw_embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

In [76]:
embeddings = F.normalize(raw_embeddings, p=2, dim=1)

In [77]:
embeddings

tensor([[-0.0561,  0.0076, -0.0014,  ...,  0.0266,  0.0480,  0.0357],
        [ 0.0037, -0.0045, -0.0036,  ...,  0.0160,  0.0324,  0.0128],
        [-0.0273,  0.0046,  0.0051,  ...,  0.0246,  0.0476,  0.0303],
        [-0.0051,  0.0029, -0.0053,  ...,  0.0152,  0.0496, -0.0020]],
       grad_fn=<DivBackward0>)

In [78]:
raw_embeddings

tensor([[-0.9026,  0.1228, -0.0232,  ...,  0.4279,  0.7722,  0.5740],
        [ 0.0597, -0.0723, -0.0577,  ...,  0.2578,  0.5228,  0.2062],
        [-0.4440,  0.0753,  0.0829,  ...,  0.4004,  0.7740,  0.4925],
        [-0.0830,  0.0463, -0.0860,  ...,  0.2453,  0.8025, -0.0330]],
       grad_fn=<DivBackward0>)
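The two tensors above differ only in scale: `F.normalize(..., p=2, dim=1)` divides each row by its Euclidean length. Since cosine similarity already divides by both lengths, it should come out the same for raw and normalized vectors. A small sketch with made-up two-dimensional vectors (the helper names are ours, not part of any library):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_sim = cosine_similarity(a, b)                  # computed on the raw vectors
unit_sim = dot(l2_normalize(a), l2_normalize(b))   # plain dot product of unit vectors

print(round(raw_sim, 6), round(unit_sim, 6))  # 0.96 0.96
```

So for unit-length embeddings a plain dot product is enough, and the similarities computed below would be the same with `raw_embeddings` as with `embeddings`.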

In [81]:
embedlen = len(embeddings)

row_data = [[0] * embedlen for _ in range(embedlen)]

for i in range(0, len(embeddings)):
  e = embeddings[i].detach().numpy()
  print(i)
  for j in range(i+1, embedlen):
    e2 = embeddings[j].detach().numpy()
    cosinesim = 1 - cosine (e, e2)
    print(i, j, cosinesim)
    row_data[i][j] = cosinesim
    row_data[j][i] = cosinesim
    # print(row_data)

0
0 1 0.7389065623283386
0 2 0.7952800393104553
0 3 0.701913058757782
1
1 2 0.7219271063804626
1 3 0.8907479047775269
2
2 3 0.6992931365966797
3


In [80]:
table = pd.DataFrame(row_data, columns = [x for x in range(embedlen)])
table

|   | 0        | 1        | 2        | 3        |
|---|----------|----------|----------|----------|
| 0 | 0.0      | 0.738907 | 0.79528  | 0.701913 |
| 1 | 0.738907 | 0.0      | 0.721927 | 0.890748 |
| 2 | 0.79528  | 0.721927 | 0.0      | 0.699293 |
| 3 | 0.701913 | 0.890748 | 0.699293 | 0.0      |


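Notice that each question is most similar to its own answer: text 0 (the Jupiter question) pairs with text 2 (the Ganymede passage) at 0.795, and text 1 ("Beijing.") pairs with text 3 (the China question) at 0.891. The zeros on the diagonal are only the initial fill value, since the loop never compares a text with itself; a vector's cosine similarity with itself is 1.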
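The loop in the cell above fills only the off-diagonal entries. As a rough sketch of an alternative, the code below builds the full matrix (diagonal included) and reads off each row's nearest neighbour. The four three-dimensional vectors are invented stand-ins arranged to mimic the question/answer pairing, not GTE output, and the helper names are ours.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_matrix(vectors):
    """Full n x n matrix of pairwise cosine similarities."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def nearest_neighbour(matrix, i):
    """Index of the row most similar to row i, excluding i itself."""
    return max((j for j in range(len(matrix)) if j != i), key=lambda j: matrix[i][j])


# Hypothetical vectors: 0 and 2 point one way, 1 and 3 another.
toy_vectors = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.9, 0.3, 0.0],
    [0.1, 0.9, 0.0],
]

sims = similarity_matrix(toy_vectors)
nearest = [nearest_neighbour(sims, i) for i in range(len(sims))]
print(nearest)  # [2, 3, 0, 1]
```

Passing `sims` to `pd.DataFrame` would give a table like the one above, but with ones (up to rounding) on the diagonal.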

# Embedding paragraphs versus embedding sentences and then averaging

In [88]:
paragraph1 = ["I was born in the year 1632, in the city of York, of a good family, though not of that country, my father being a foreigner of Bremen, who settled first at Hull.",
              "He got a good estate by merchandise, and leaving off his trade, lived afterwards at York, from whence he had married my mother, whose relations were named Robinson, and from whom I was called Robinson Kreutznaer; but, by the usual corruption of words in England, we are now called—nay we call ourselves and write our name—Crusoe; and so my companions always called me."]

paragraph2 = ["I had two elder brothers, one of whom was lieutenant-colonel to an English regiment of foot in Flanders, formerly commanded by the famous Colonel Lockhart, and was killed at the battle near Dunkirk against the Spaniards.",
              "What became of my second brother I never knew, any more than my father or mother knew what became of me."]


In [95]:
sentences = list(paragraph1)
sentences.extend(paragraph2)

batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
sentence_embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
sentence_embeddings = [x.detach().numpy() for x in sentence_embeddings]

In [86]:
len(sentence_embeddings)

4

In [101]:
paragraphs = [paragraph1[0] + ' ' + paragraph1[1], paragraph2[0] + ' ' + paragraph2[1]]
par_batch_dict = tokenizer(paragraphs, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**par_batch_dict)
par_embeddings = average_pool(outputs.last_hidden_state, par_batch_dict['attention_mask'])
par_embeddings = [x.detach().numpy() for x in par_embeddings]

In [94]:
1 - cosine(par_embeddings[0], par_embeddings[1])

0.8254002928733826

In [97]:
1 - cosine((sentence_embeddings[0] + sentence_embeddings[1])/2, (sentence_embeddings[2] + sentence_embeddings[3])/2)

0.8772649765014648

The cosine similarity is noticeably higher (0.877 versus 0.825) if you embed the sentences separately and then average the embeddings. This seems to be a general rule, not a one-off occurrence, and it makes some sense if you think about what happens when you average points: the averages tend to move toward the center of gravity of the space as a whole, and therefore toward each other.
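The center-of-gravity intuition can be checked on synthetic data. The sketch below assumes, purely for illustration, that every "embedding" is a shared center vector plus independent Gaussian noise; it then compares the similarity of two single vectors with the similarity of two averaged pairs. It uses no model at all, so it says nothing about how GTE embeds a whole paragraph; it only shows the averaging effect in isolation.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def noisy_copy(center, rng, noise=1.0):
    """The shared center plus independent Gaussian noise in every dimension."""
    return [c + rng.gauss(0.0, noise) for c in center]


rng = random.Random(0)
dim = 64
center = [1.0] * dim  # assumed stand-in for the direction all embeddings share

single_sims, averaged_sims = [], []
for _ in range(200):
    a1, a2, b1, b2 = (noisy_copy(center, rng) for _ in range(4))
    single_sims.append(cosine_similarity(a1, b1))
    averaged_sims.append(cosine_similarity(mean_vector([a1, a2]), mean_vector([b1, b2])))

mean_single = sum(single_sims) / len(single_sims)
mean_averaged = sum(averaged_sims) / len(averaged_sims)
print(f"single vectors:   {mean_single:.3f}")
print(f"averaged vectors: {mean_averaged:.3f}")
```

Averaging two vectors halves the variance of the noise while leaving the shared component untouched, so the averaged pairs sit closer to the center and score a higher similarity with each other.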

In [98]:
print(len(paragraphs[0].split()))

96


In [104]:
len(par_batch_dict['input_ids'][0])

125

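We're not particularly near the 512-token limit. But notice that the 96 words of the first paragraph become 125 tokens: the tokenizer counts punctuation separately, splits rarer words into sub-word pieces, and adds special tokens at the start and end.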
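To illustrate why the token count exceeds the word count, here is a toy sketch of greedy longest-match sub-word splitting, in the style of a WordPiece tokenizer. The vocabulary is invented for the example, so the splits the real GTE tokenizer produces may well differ.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry a "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No prefix of the remaining text is in the vocabulary.
            return ["[UNK]"]
        start = end
    return pieces


# Hypothetical vocabulary, not the model's real vocab.txt.
toy_vocab = {"born", "kre", "##utz", "##nae", "##r", "dun", "##kirk"}

words = ["born", "kreutznaer", "dunkirk"]
tokens = ["[CLS]"]
for word in words:
    tokens.extend(wordpiece(word, toy_vocab))
tokens.append("[SEP]")

print(len(words), "words ->", len(tokens), "tokens")  # 3 words -> 9 tokens
print(tokens)
```

To see the real pieces, `tokenizer.tokenize(paragraphs[0])` should list them for the model's own vocabulary.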