In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

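# One-time setup: uncomment these lines to download the query-generation model
# from the Hugging Face Hub and save it locally, so later runs can load it from disk.
# tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
# model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
# tokenizer.save_pretrained('./saved_models/query-gen-msmarco')
# model.save_pretrained('./saved_models/query-gen-msmarco')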


In [19]:
tokenizer = T5Tokenizer.from_pretrained('./saved_models/query-gen-msmarco')
model = T5ForConditionalGeneration.from_pretrained('./saved_models/query-gen-msmarco')
model.eval()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=1024, out_features=4096, bias=False)
              (wo): Linear(in_features=4096, out_features=1024, bias=False)
              (d

In [5]:
passages = [
  "Asteroids are small rocky objects that orbit the Sun. They can be found abundantly in the main asteroid belt between the Mars and Jupiter. Despite their small size in planetary sense, large asteroids can measure 530 kilometres in diameter. These are sometimes called minor planets. Occasionally, asteroids smash into each other, knocking off smaller pieces of rock that are called meteoroids. Their size can range from no bigger than a few molecules to 100 meters in diameter.",
  "When meteoroids come close enough to be attracted by the Earth, they fall into our atmosphere and become a burning rock due to the friction between the falling objects and air molecules. They are, at this stage, meteors (or shooting stars) that we see in the sky. If the meteoroid does not burn completely in the atmosphere, it'd land on Earth and become “meteorites”.",
  "Comets are big icy objects coated with black organic materials. Each comet has a frozen part at the core called nucleus. When they nears the Sun, the heat causes some ice to turn into gases, creating an atmosphere called “coma” around the comet. As the comet travels across long distance, the coma may extend hundreds of thousands of kilometres, sometimes forming a long, bright tail. ",
  "Have you ever noticed a white spot on the egg yolk of the egg you just cracked? The white spot is called blastodisc. This is where the sperm enters the egg. The nucleus of the egg is located inside the blastodisc. By the time the fertilised egg is laid, many cycles of cell division would have taken place on the surface of the egg yolk, forming a blastoderm. Compared to the blastoderm, the blastodisc is a smaller and lighter spot and is more visually distinct from the egg yolk.",
  "Sometimes you might even find blood spots on your egg yolk. Little do people know, all egg yolks actually contain tiny blood vessels that would deliver nutrients to the chick embryo if the egg is fertilised. If the blood vessels are broken during the laying process (usually occur when the hen is being startled), there will be blood spots on the egg yolk. That being said, even if you find blood spot on the egg yolk, the egg is still safe for consumption. It's normal that there's blood spot on egg yolk once in a while, but if it is an on-going occurrence, you might want to check the health condition of the hen that laid those eggs.",
  "Everyone’s news feeds probably are full of Covid-19 reports these days. But the news on Asian giant hornets, also nicknamed “murder hornets”, gained traction in the US because many are worried that the invading species might decimate the ecological system. The first sight of Asian giant hornet in the US was reported in November last year. What followed was multiple incidents where the entire population of a bee hive were killed. It was an absolutely horrid sight - thousands of honey bees decapitated, dead on the ground. The Asian hornets won't even take over the hive. They attack honey bees, not for their honey or hive, but their thoraxes which they use to feed their young.",
  "My question is, since these Asian giant hornets have been hanging around in Asia for centuries, why haven't they killed all Asian honey bees already? It turns out that Asian honey bees evolved a defense mechanism to fend off or get back at the hornets. When they face attack from the murder hornets, Asian honey bees would quickly mobilise themselves to surround the hornet in great numbers, like a ball, and flex and flap their wings, creating a really hot environment inside the ball and basically cook the hornet to death. Unfortunately, American honey bees don't exhibit these behaviours when they are attacked."
]

In [6]:
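import os

import torch
from tqdm.auto import tqdm

pairs = []
file_count = 0
# make sure the output directory exists before any file is written
os.makedirs('data', exist_ok=True)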

# set to no_grad as we don't need to calculate gradients for back prop
with torch.no_grad():
    # loop through each passage individually
    for p in tqdm(passages):
        p = p.replace('\t', ' ')
        # create input tokens
        input_ids = tokenizer.encode(p, return_tensors='pt')
        # generate output tokens (query generation)
        outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=10
        )
        # decode output tokens to human-readable language
        for output in outputs:
            query = tokenizer.decode(output, skip_special_tokens=True)
            # append (query, passage) pair to pairs list, separate by \t
            pairs.append(query.replace('\t', ' ')+'\t'+p)
        
        # once we have at least 1024 pairs, write them to file
        if len(pairs) >= 1024:
            with open(f'data/pairs_{file_count}.tsv', 'w', encoding='utf-8') as fp:
                fp.write('\n'.join(pairs))
            file_count += 1
            pairs = []

if pairs:
    # save the final batch, which holds fewer than 1024 pairs
    with open(f'data/pairs_{file_count}.tsv', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(pairs))

100%|██████████| 7/7 [00:51<00:00,  7.41s/it]
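
The tab-separated format written above relies on neither the query nor the passage containing a tab or a newline. The seven passages in this notebook contain no newlines, but scraped text often does. A minimal sketch of a more defensive writer/reader pair follows; `to_tsv_line` and `from_tsv_line` are hypothetical helper names, not part of the notebook, and collapsing whitespace this way also trims leading and trailing spaces from the passage.

```python
def _clean(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(text.split())


def to_tsv_line(query: str, passage: str) -> str:
    """Build one 'query<TAB>passage' line that is safe to join with newlines."""
    return _clean(query) + '\t' + _clean(passage)


def from_tsv_line(line: str):
    """Split a line back into (query, passage); only the first tab is a separator."""
    query, passage = line.split('\t', 1)
    return query, passage


line = to_tsv_line("what\tis an asteroid", "Asteroids are\nsmall rocky\tobjects. ")
print(from_tsv_line(line))
```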


In [20]:
para = "Asteroids are small rocky objects that orbit the Sun. They can be found abundantly in the main asteroid belt between the Mars and Jupiter. Despite their small size in planetary sense, large asteroids can measure 530 kilometres in diameter. These are sometimes called minor planets. Occasionally, asteroids smash into each other, knocking off smaller pieces of rock that are called meteoroids. Their size can range from no bigger than a few molecules to 100 meters in diameter."
print("Paragraph:")
print(para)

input_ids = tokenizer.encode(para, return_tensors='pt')
# generate output tokens (query generation)
outputs = model.generate(
    input_ids=input_ids,
    max_length=128,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=10
)

print("\nGenerated Queries:")
for i, output in enumerate(outputs, start=1):
    query = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{i}: {query}')

Paragraph:
Asteroids are small rocky objects that orbit the Sun. They can be found abundantly in the main asteroid belt between the Mars and Jupiter. Despite their small size in planetary sense, large asteroids can measure 530 kilometres in diameter. These are sometimes called minor planets. Occasionally, asteroids smash into each other, knocking off smaller pieces of rock that are called meteoroids. Their size can range from no bigger than a few molecules to 100 meters in diameter.

Generated Queries:
1: what are small rocks or objects called?
2: what is the size of a large asteroids
3: what is asteroids
4: define asteroids
5: where are large asteroid clusters found
6: what are some of the largest asteroids
7: what is the largest known asteroids in the solar system
8: how big are asteroids
9: what makes an asteroid an asteroids
10: how many kilometers in diameter is a typical asteroid
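
The ten queries differ from each other (and from run to run) because `do_sample=True` with `top_p=0.95` selects nucleus sampling: at each step the next token is drawn only from the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The toy sketch below illustrates that filtering step on a made-up four-token distribution. It is not the `transformers` implementation, and `nucleus_filter` and the probabilities are invented for illustration.

```python
import random


def nucleus_filter(probs, top_p=0.95):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


# made-up next-token probabilities, purely for illustration
toy = {'asteroid': 0.6, 'asteroids': 0.25, 'comet': 0.12, 'egg': 0.03}
filtered = nucleus_filter(toy, top_p=0.95)

# the unlikely tail token 'egg' can no longer be sampled
tokens, weights = zip(*filtered.items())
print(sorted(filtered), random.choices(tokens, weights=weights, k=1)[0])
```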


In [7]:
from sentence_transformers import InputExample
pairs = []
with open('data/pairs_0.tsv', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')
    for line in lines:
        if '\t' in line:
            # split on the first tab only, so a stray tab cannot break unpacking
            q, p = line.split('\t', 1)
            pairs.append(InputExample(
                texts=[q, p]
            ))

In [8]:
from sentence_transformers import datasets

batch_size = 7

loader = datasets.NoDuplicatesDataLoader(
    pairs, batch_size=batch_size
)

In [9]:
from sentence_transformers import models, SentenceTransformer

mpnet = models.Transformer('microsoft/mpnet-base')
pooler = models.Pooling(
    mpnet.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[mpnet, pooler])

model

Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.bias', 'mpnet.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
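
The `Pooling` module with `pooling_mode_mean_tokens=True` turns the per-token vectors produced by MPNet into one sentence vector by averaging them while skipping padding positions. A rough sketch of that idea in plain Python, using tiny made-up 2-dimensional vectors instead of 768-dimensional ones; `mean_pool` is an illustrative helper, not the library's code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                total[j] += vec[j]
            count += 1
    count = max(count, 1)  # avoid dividing by zero for an all-padding input
    return [t / count for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # the last row plays the role of padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```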

In [10]:
from sentence_transformers import losses

loss = losses.MultipleNegativesRankingLoss(model)
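
`MultipleNegativesRankingLoss` treats each (query, passage) pair in a batch as a positive and every other passage in the same batch as a negative for that query, which is why the loader is built with `NoDuplicatesDataLoader`. Conceptually it is a softmax cross-entropy over scaled cosine similarities. The sketch below reproduces that idea in plain Python on toy vectors. The scale of 20 is assumed here as the library default, and `mnr_loss` is an illustrative function, not the library implementation.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the correct 'class' for query i."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cos_sim(q, p) for p in passage_vecs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # passage i points the same way as query i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # passage i matches the *other* query
print(mnr_loss(queries, aligned) < mnr_loss(queries, swapped))  # True
```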

In [11]:
epochs = 3
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='./saved_models/mpnet-genq-blog',
    show_progress_bar=True
)

Iteration: 100%|██████████| 10/10 [01:07<00:00,  6.71s/it]
Iteration: 100%|██████████| 10/10 [01:11<00:00,  7.18s/it]
Iteration: 100%|██████████| 10/10 [01:02<00:00,  6.27s/it]
Epoch: 100%|██████████| 3/3 [03:21<00:00, 67.24s/it]
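
With 10 batches per epoch and 3 epochs there are 30 optimisation steps in total, so `warmup_steps` evaluates to `int(30 * 0.1) = 3`. Assuming `fit` uses a linear warmup followed by linear decay (the notebook does not set a `scheduler`, so this is an assumption about the default), the learning-rate multiplier over the run can be sketched as below. `linear_warmup_multiplier` is a hypothetical helper written for this illustration.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Ramp linearly from 0 to 1 during warmup, then decay linearly towards 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch, epochs = 10, 3            # values taken from the training run above
total_steps = steps_per_epoch * epochs     # 30
warmup_steps = int(total_steps * 0.1)      # 3
schedule = [
    linear_warmup_multiplier(s, warmup_steps, total_steps)
    for s in range(total_steps)
]
print(warmup_steps, schedule[:5])
```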


In [13]:
# the pinecone.init() style API used below was removed in pinecone-client 3.0, so pin to 2.x
%pip install "pinecone-client<3"

Note: you may need to restart the kernel to use updated packages.


In [14]:
import pinecone

pinecone.init(
    api_key='key',
    environment='gcp-starter'  # find next to API key
)
# create a new genq index if it does not already exist
if 'genq' not in pinecone.list_indexes():
    pinecone.create_index(
        'genq',
        dimension=model.get_sentence_embedding_dimension()
    )
# connect
index = pinecone.Index('genq')

In [12]:
# encode all passages in a single call; encode() handles batching internally
embeds = model.encode(passages).tolist()
len(embeds)

7

In [15]:
ids = [str(i) for i in range(len(passages))]
meta = [{'context': p} for p in passages]
to_upsert = list(zip(ids, embeds, meta))

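# now upsert in batches; the loop step must equal the batch size,
# otherwise consecutive batches overlap and vectors are sent repeatedly
upsert_batch_size = 32
for i in range(0, len(to_upsert), upsert_batch_size):
    # get batch (slicing stops at the end of the list automatically)
    batch = to_upsert[i:i + upsert_batch_size]
    # upsert
    index.upsert(vectors=batch)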
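
The upsert loop sends the vectors to the index in slices, which matters once there are more vectors than fit comfortably in one request. A small generic helper for that slicing is sketched below; `chunked` is a hypothetical name, and the placeholder tuples only mimic the `(id, vector, metadata)` shape used in this notebook.

```python
def chunked(items, size):
    """Yield consecutive, non-overlapping slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# placeholder records standing in for (id, embedding, metadata) tuples
records = [(str(i), [0.0, 0.0], {'context': f'passage {i}'}) for i in range(7)]
batches = list(chunked(records, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Each batch could then be passed to `index.upsert(vectors=batch)` in turn.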

In [16]:
query = "What are shooting stars?"
xq = model.encode([query]).tolist()

res = index.query(xq, top_k=5, include_metadata=True)
res

{'matches': [{'id': '1',
              'metadata': {'context': 'When meteoroids come close enough to be '
                                      'attracted by the Earth, they fall into '
                                      'our atmosphere and become a burning '
                                      'rock due to the friction between the '
                                      'falling objects and air molecules. They '
                                      'are, at this stage, meteors (or '
                                      'shooting stars) that we see in the sky. '
                                      'If the meteoroid does not burn '
                                      "completely in the atmosphere, it'd land "
                                      'on Earth and become “meteorites”.'},
              'score': 0.604144573,
              'values': []},
             {'id': '0',
              'metadata': {'context': 'Asteroids are small rocky objects that '
                         

In [17]:
query = "How do bees defend themselves?"
xq = model.encode([query]).tolist()

res = index.query(xq, top_k=5, include_metadata=True)
res

{'matches': [{'id': '6',
              'metadata': {'context': 'My question is, since these Asian giant '
                                      'hornets have been hanging around in '
                                      "Asia for centuries, why haven't they "
                                      'killed all Asian honey bees already? It '
                                      'turns out that Asian honey bees evolved '
                                      'a defense mechanism to fend off or get '
                                      'back at the hornets. When they face '
                                      'attack from the murder hornets, Asian '
                                      'honey bees would quickly mobilise '
                                      'themselves to surround the hornet in '
                                      'great numbers, like a ball, and flex '
                                      'and flap their wings, creating a really '
                             