# Chunked Pooling
This notebook explains how chunked pooling can be implemented. First, install the requirements:

In [1]:
!pip install -r requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

Next, we load the model used to compute the embeddings. We use `jinaai/jina-embeddings-v2-base-en`, but any model that supports mean pooling works. Models with a long maximum context length are preferable, because the whole text is encoded in a single forward pass before it is split into chunks.

In [2]:
from transformers import AutoModel
from transformers import AutoTokenizer

from chunked_pooling import chunked_pooling, chunk_by_sentences, chunk_by_config

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)



Now we define the text to encode and split it into chunks. The `chunk_by_sentences` function also returns span annotations, which record the tokens that belong to each chunk. Chunked pooling needs them to pool the right token embeddings.

In [3]:
input_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."

# determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
print(span_annotations)


tensor(0) tensor(0) 
tensor(0) tensor(6) Berlin
tensor(7) tensor(9) is
tensor(10) tensor(13) the
tensor(14) tensor(21) capital
tensor(22) tensor(25) and
tensor(26) tensor(33) largest
tensor(34) tensor(38) city
tensor(39) tensor(41) of
tensor(42) tensor(49) Germany
tensor(49) tensor(50) ,
tensor(51) tensor(55) both
tensor(56) tensor(58) by
tensor(59) tensor(63) area
tensor(64) tensor(67) and
tensor(68) tensor(70) by
tensor(71) tensor(81) population
tensor(81) tensor(82) .
tensor(83) tensor(86) Its
tensor(87) tensor(91) more
tensor(92) tensor(96) than
tensor(97) tensor(98) 3
tensor(98) tensor(99) .
tensor(99) tensor(101) 85
tensor(102) tensor(109) million
tensor(110) tensor(121) inhabitants
tensor(122) tensor(126) make
tensor(127) tensor(129) it
tensor(130) tensor(133) the
tensor(134) tensor(142) European
tensor(143) tensor(148) Union
tensor(148) tensor(149) '
tensor(149) tensor(150) s
tensor(151) tensor(155) most
tensor(156) tensor(164) populous
tensor(165) tensor(169) city
tensor(169) 
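
The output above pairs each token with its character range. As a rough illustration of how such offsets can be turned into sentence-level token spans, here is a minimal sketch. It is not the implementation of `chunk_by_sentences`: the function name `spans_from_offsets`, the plain-string tokens, and the `(start, end)` span format with an exclusive end are assumptions made for this example, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def spans_from_offsets(tokens, offsets):
    """Group tokens into sentence-level spans (hypothetical helper).

    tokens:  list of token strings
    offsets: list of (char_start, char_end) pairs, one per token
    Returns a list of (token_start, token_end) pairs with an exclusive end.
    """
    spans = []
    span_start = 0
    for i, token in enumerate(tokens):
        if token != ".":
            continue
        is_last = i == len(tokens) - 1
        # A period closes a sentence only if it is the final token or is
        # followed by a character gap (whitespace). This keeps a number
        # such as "3.85" inside a single chunk.
        if is_last or offsets[i + 1][0] > offsets[i][1]:
            spans.append((span_start, i + 1))
            span_start = i + 1
    # Keep trailing tokens that are not terminated by a period.
    if span_start < len(tokens):
        spans.append((span_start, len(tokens)))
    return spans


# Toy input for the text "It costs 3.85. Fine."
tokens = ["It", "costs", "3", ".", "85", ".", "Fine", "."]
offsets = [(0, 2), (3, 8), (9, 10), (10, 11), (11, 13), (13, 14), (15, 19), (19, 20)]
spans = spans_from_offsets(tokens, offsets)
print(spans)
```

The gap check is the important design choice: without it, the period inside "3.85" in the Berlin text would split the second sentence in two.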

Now we encode the chunks twice: once with traditional chunking and once with the context-sensitive chunked pooling method:

In [4]:
# chunk first: each chunk is embedded on its own (traditional)
embeddings_traditional_chunking = model.encode(chunks)

# chunk afterwards: embed the full text once, then pool the token
# embeddings per chunk (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = chunked_pooling(model_output, [span_annotations])[0]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
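
To make the pooling step concrete, the following is a minimal NumPy sketch of mean pooling over token spans. It is an illustration only, not the `chunked_pooling` function imported above: the name `mean_pool_spans` is hypothetical, and the sketch assumes token embeddings of shape `(num_tokens, dim)` and spans given as `(start, end)` token indices with an exclusive end.

```python
import numpy as np


def mean_pool_spans(token_embeddings, spans):
    """Average the token embeddings inside each span (hypothetical helper).

    token_embeddings: array of shape (num_tokens, dim)
    spans:            list of (start, end) token indices, end exclusive
    Returns an array of shape (len(spans), dim).
    """
    pooled = []
    for start, end in spans:
        if end <= start:
            raise ValueError(f"empty span: ({start}, {end})")
        # Every token embedding was computed with the full text as context,
        # so the chunk average still carries information from other chunks.
        pooled.append(token_embeddings[start:end].mean(axis=0))
    return np.stack(pooled)


# Toy data: 6 tokens with 2-dimensional embeddings, split into two chunks.
token_embeddings = np.arange(12, dtype=float).reshape(6, 2)
chunk_embeddings = mean_pool_spans(token_embeddings, [(0, 2), (2, 6)])
print(chunk_embeddings)
```

The difference from traditional chunking lies only in the order of operations: here the model sees the whole text before any averaging happens.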


Finally, we compare the similarity of the word "Berlin" with each chunk. The similarity should be higher for the context-sensitive chunked pooling method, especially for the later chunks, which refer to Berlin only indirectly ("Its", "The city"):

In [5]:
import numpy as np

def cos_sim(x, y):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

berlin_embedding = model.encode('Berlin')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))

similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.8486218
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.8248903
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.7084339
similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.8498009
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.7534553
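
To read the numbers more easily, the short sketch below computes the per-chunk gain of chunked pooling over traditional chunking. The values are copied from the output above; the variable names are chosen for this example only.

```python
# Similarities copied from the output above, in chunk order.
similarity_new = [0.849546, 0.8248903, 0.8498009]
similarity_trad = [0.8486218, 0.7084339, 0.7534553]

# Positive gain means chunked pooling scored closer to "Berlin".
gains = [new - trad for new, trad in zip(similarity_new, similarity_trad)]
for i, gain in enumerate(gains, start=1):
    print(f"chunk {i}: gain = {gain:+.4f}")
```

The first chunk, which names Berlin explicitly, barely changes (about 0.0009), while the second and third chunks gain roughly 0.12 and 0.10.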
