# Pipeline for cleaning volumes, chunking them, and producing embeddings

Becca Cohen, Sarah Griebel, Ted Underwood

The parts of this task are:

1. Figure out a strategy for trimming front and back matter. Features we might use include tokens per page, percentage of non-word tokens, and mean or median sentence length. My strategy right now is to create these features for a bunch of sample volumes and "train" a very simple linear model to make a conservative guess at the boundaries of "body text."

2. Sketch an overall strategy and a set of data objects. For instance, the input to a top-level function might be a volume. Then we pass that to a function that divides each volume into a list of pages and each page into a list of sentences. We could pass *that* to a function that trims front and back matter, as described above.

3. Once we have trimmed volumes, we need to turn those into chunks. This involves counting sentence lengths (in tokens) and grouping consecutive sentences so that each chunk stays within the model's 512-token limit.

4. Finally we turn a list of chunks into embeddings and return the list of embeddings.

5. Celebrate!!!
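As a rough illustration of step 1, the page-level features named above could be computed along these lines. This is only a sketch under stated assumptions: `page_features`, `WORD_RE`, and the two sample pages are hypothetical names and data invented for the example, the sentence splitter is a naive regex rather than a trained one, and the linear model that would consume these features is not shown.

```python
import re
from statistics import mean

# Hypothetical definition of a "word": letters, optionally joined by ' or -
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")

def page_features(page_text):
    """Return simple per-page features for spotting front and back matter."""
    tokens = page_text.split()
    # Naive split on terminal punctuation; a real pipeline would more
    # likely use a trained splitter such as NLTK's sent_tokenize.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", page_text.strip()) if s]
    # Strip surrounding punctuation before testing whether a token is a word.
    stripped = [t.strip(".,;:!?\"'()[]") for t in tokens]
    nonword = [t for t in stripped if not WORD_RE.match(t)]
    return {
        "n_tokens": len(tokens),
        "pct_nonword": len(nonword) / len(tokens) if tokens else 0.0,
        "mean_sentence_len": mean(len(s.split()) for s in sentences) if sentences else 0.0,
    }

# Made-up sample pages: one that looks like body text, one like a table of contents
body_page = "I was born in the year 1632, in the city of York. My father was a foreigner of Bremen."
toc_page = "CHAPTER I . . . . 1\nCHAPTER II . . . . 17"

print(page_features(body_page))
print(page_features(toc_page))
```

The table-of-contents page has a much higher share of non-word tokens than the body page, which is the kind of signal a simple model could use to guess where body text begins and ends.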

Much of the code below was borrowed from [the GTE-base page on HuggingFace](https://huggingface.co/thenlper/gte-base).

We first demonstrate that encoding works, then run some cosine-similarity experiments to see how much of a difference chunking makes.



In [3]:
!pip install torch
!pip install transformers

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.2-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [4]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

In [5]:
from scipy.spatial.distance import cosine
import pandas as pd

In [6]:
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")


In [7]:
input_texts = ["What is the largest satellite of Jupiter and indeed in the solar system?",
               "Beijing.",
               "Ganymede is composed of approximately equal amounts of silicate rock and water. It has an iron-rich, liquid core, and an internal ocean that may contain more water than all of Earth's oceans combined.",
               "What is the capital of China?"]


In [8]:
import nltk
from nltk.tokenize import sent_tokenize

# Punkt is the sentence-boundary model that sent_tokenize relies on.
nltk.download('punkt')

# Split each input text into sentences. The result is a list of lists,
# one inner list per input text.
sentence_list = []
for text in input_texts:
  sentences = sent_tokenize(text)
  sentence_list.append(sentences)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [9]:
sentence_list

[['What is the largest satellite of Jupiter and indeed in the solar system?'],
 ['Beijing.'],
 ['Ganymede is composed of approximately equal amounts of silicate rock and water.',
  "It has an iron-rich, liquid core, and an internal ocean that may contain more water than all of Earth's oceans combined."],
 ['What is the capital of China?']]

In [10]:
# Tokenize each text's sentences and record the token count per text,
# as a first step toward grouping sentences into chunks of up to 512 tokens.
import pandas as pd

sentence_list = []
batch_dict_list = []
token_counter_list = []

for text in input_texts:
  sentence_list.append(sent_tokenize(text))

for sentences in sentence_list:
  batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
  batch_dict_list.append(str(batch_dict['attention_mask']))
  # Each 1 in the attention mask marks a real (non-padding) token, so the
  # sum is the token count. Summing the tensor is safer than counting '1'
  # characters in its string form, which is abbreviated for large tensors.
  token_counter_list.append(int(batch_dict['attention_mask'].sum()))

df = pd.DataFrame({'sentence': sentence_list,
                   'batch_dict': batch_dict_list,
                   'Token_Length': token_counter_list})

In [11]:
df

|   | sentence | batch_dict | Token_Length |
|---|----------|------------|--------------|
| 0 | [What is the largest satellite of Jupiter and ... | tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1... | 16 |
| 1 | [Beijing.] | tensor([[1, 1, 1, 1]]) | 4 |
| 2 | [Ganymede is composed of approximately equal a... | tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1... | 49 |
| 3 | [What is the capital of China?] | tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]) | 9 |


In [13]:
# Keep a running total of tokens and record the row indexes where the
# total lands exactly on 512 or goes past it.
_512_counter = 0
indexs_at_512 = []
indexs_over_512 = []

for index, row in df.iterrows():
  _512_counter += int(row['Token_Length'])
  if _512_counter == 512:
    indexs_at_512.append(index)
  elif _512_counter > 512:
    indexs_over_512.append(index)





In [15]:
indexs_at_512

# The four texts total only 78 tokens, far short of 512, so an empty list is expected

[]
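One way to turn per-sentence token counts into chunks is a greedy pass that starts a new chunk whenever the next sentence would push the running total past the limit. The sketch below is a minimal illustration, not the final chunker: `group_into_chunks` is a hypothetical helper, and the sentences and token counts in the example are made up.

```python
def group_into_chunks(sentences, token_lengths, max_tokens=512):
    """Greedily pack consecutive sentences into chunks of at most max_tokens."""
    chunks, current, current_len = [], [], 0
    for sentence, n_tokens in zip(sentences, token_lengths):
        # Start a new chunk when adding this sentence would overflow the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    # Flush whatever is left over as the final chunk.
    if current:
        chunks.append(current)
    return chunks

# Toy example with made-up token counts and a small budget
sentences = ["s0", "s1", "s2", "s3", "s4"]
token_lengths = [6, 5, 4, 9, 3]
print(group_into_chunks(sentences, token_lengths, max_tokens=10))
# → [['s0'], ['s1', 's2'], ['s3'], ['s4']]
```

Two caveats apply. A single sentence longer than `max_tokens` still gets a chunk of its own here and would be truncated by the tokenizer later. Also, counts taken one sentence at a time include the tokenizer's special tokens for every sentence, so the sum slightly overestimates the length of the joined chunk.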

In [None]:
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
raw_embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

In [None]:
embeddings = F.normalize(raw_embeddings, p=2, dim=1)

In [None]:
embeddings

tensor([[-0.0561,  0.0076, -0.0014,  ...,  0.0266,  0.0480,  0.0357],
        [ 0.0037, -0.0045, -0.0036,  ...,  0.0160,  0.0324,  0.0128],
        [-0.0273,  0.0046,  0.0051,  ...,  0.0246,  0.0476,  0.0303],
        [-0.0051,  0.0029, -0.0053,  ...,  0.0152,  0.0496, -0.0020]],
       grad_fn=<DivBackward0>)

In [None]:
raw_embeddings

tensor([[-0.9026,  0.1228, -0.0232,  ...,  0.4279,  0.7722,  0.5740],
        [ 0.0597, -0.0723, -0.0577,  ...,  0.2578,  0.5228,  0.2062],
        [-0.4440,  0.0753,  0.0829,  ...,  0.4004,  0.7740,  0.4925],
        [-0.0830,  0.0463, -0.0860,  ...,  0.2453,  0.8025, -0.0330]],
       grad_fn=<DivBackward0>)

In [None]:
embedlen = len(embeddings)

row_data = [[0] * embedlen for _ in range(embedlen)]

for i in range(0, len(embeddings)):
  e = embeddings[i].detach().numpy()
  print(i)
  for j in range(i+1, embedlen):
    e2 = embeddings[j].detach().numpy()
    cosinesim = 1 - cosine(e, e2)
    print(i, j, cosinesim)
    row_data[i][j] = cosinesim
    row_data[j][i] = cosinesim
    # print(row_data)

0
0 1 0.7389065623283386
0 2 0.7952800393104553
0 3 0.701913058757782
1
1 2 0.7219271063804626
1 3 0.8907479047775269
2
2 3 0.6992931365966797
3


In [None]:
table = pd.DataFrame(row_data, columns = [x for x in range(embedlen)])
table

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.0 | 0.738907 | 0.79528 | 0.701913 |
| 1 | 0.738907 | 0.0 | 0.721927 | 0.890748 |
| 2 | 0.79528 | 0.721927 | 0.0 | 0.699293 |
| 3 | 0.701913 | 0.890748 | 0.699293 | 0.0 |


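Notice that the similarity is strongest between 0 and 2 (the Jupiter question and the Ganymede passage), and between 1 and 3 ("Beijing." and the question about the capital of China): each text's nearest neighbor is its matching partner. The diagonal shows 0.0 only because the loop never fills it in; each text's similarity with itself is 1.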
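Since cosine similarity is the dot product of L2-normalized vectors, the same table can be computed with one matrix product instead of a double loop. The sketch below uses NumPy and a toy array so that it runs without the model; `cosine_similarity_matrix` and `toy` are hypothetical names introduced for the example.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    # L2-normalize each row so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# Toy vectors: the first two are orthogonal, the third sits between them.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

With the notebook's tensors, the input would be something like `embeddings.detach().numpy()`. Unlike the loop above, this version puts 1 on the diagonal rather than 0.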

# Embedding paragraphs versus embedding sentences, and then averaging.

In [None]:
paragraph1 = ["I was born in the year 1632, in the city of York, of a good family, though not of that country, my father being a foreigner of Bremen, who settled first at Hull.",
              "He got a good estate by merchandise, and leaving off his trade, lived afterwards at York, from whence he had married my mother, whose relations were named Robinson, and from whom I was called Robinson Kreutznaer; but, by the usual corruption of words in England, we are now called—nay we call ourselves and write our name—Crusoe; and so my companions always called me."]

paragraph2 = ["I had two elder brothers, one of whom was lieutenant-colonel to an English regiment of foot in Flanders, formerly commanded by the famous Colonel Lockhart, and was killed at the battle near Dunkirk against the Spaniards.",
              "What became of my second brother I never knew, any more than my father or mother knew what became of me."]


In [None]:
sentences = list(paragraph1)
sentences.extend(paragraph2)

batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
sentence_embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
sentence_embeddings = [x.detach().numpy() for x in sentence_embeddings]

In [None]:
len(sentence_embeddings)

4

In [None]:
paragraphs = [paragraph1[0] + ' ' + paragraph1[1], paragraph2[0] + ' ' + paragraph2[1]]
par_batch_dict = tokenizer(paragraphs, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**par_batch_dict)
par_embeddings = average_pool(outputs.last_hidden_state, par_batch_dict['attention_mask'])
par_embeddings = [x.detach().numpy() for x in par_embeddings]

In [None]:
1 - cosine(par_embeddings[0], par_embeddings[1])

0.8254002928733826

In [None]:
1 - cosine((sentence_embeddings[0] + sentence_embeddings[1])/2, (sentence_embeddings[2] + sentence_embeddings[3])/2)

0.8772649765014648

The cosine similarity is noticeably higher (0.877 versus 0.825) if you embed the sentences separately and then average the embeddings. This seems to be a general rule, not a one-off occurrence, and it makes some sense if you think about what happens when you average points: the averages tend to move toward the center of gravity of the space as a whole.
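The averaging effect can be illustrated with synthetic data. The sketch below does not use GTE embeddings: it assumes, purely for illustration, that every point is a shared offset plus independent Gaussian noise, with an arbitrary dimension of 50. All names in it are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

random.seed(0)
dim, n_trials = 50, 200

def random_point():
    # Assumption: a shared offset of 1.0 in every dimension plus unit noise
    return [1.0 + random.gauss(0, 1) for _ in range(dim)]

single_sims, averaged_sims = [], []
for _ in range(n_trials):
    a, b, c, d = (random_point() for _ in range(4))
    # Similarity between two individual points
    single_sims.append(cosine_similarity(a, c))
    # Similarity between the averages of two pairs of points
    averaged_sims.append(cosine_similarity(mean_vector([a, b]), mean_vector([c, d])))

mean_single = sum(single_sims) / n_trials
mean_averaged = sum(averaged_sims) / n_trials
print(f"single points:  {mean_single:.3f}")
print(f"averaged pairs: {mean_averaged:.3f}")
```

Averaging halves the variance of the noise while leaving the shared offset untouched, so the averaged points sit closer to the common center and to each other. Whether real embedding spaces behave this way depends on how far their center of gravity lies from the origin, which this toy example does not measure.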

In [None]:
print(len(paragraphs[0].split()))

96


In [None]:
len(par_batch_dict['input_ids'][0])

125

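We're not particularly near the 512-token limit. But notice that the 96 words of the first paragraph become 125 tokens: the tokenizer splits some words into several subword pieces, treats punctuation marks as separate tokens, and adds special tokens at the start and end of the sequence.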