### Homework 7: Question search engine

Remember week01, where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [None]:
%pip install --upgrade pip transformers datasets accelerate deepspeed torchmetrics evaluate sentencepiece
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel
from transformers import TrainingArguments
from transformers import Trainer
from transformers import default_data_collator

from datasets import load_dataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

from IPython.display import clear_output; clear_output()

### Nucleus sampling. Task from seminar

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True); clear_output()
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device); clear_output()

In [None]:
p = 0.8
text = "The Fermi paradox"
tokens = tokenizer.encode(text)
num_steps = 50
line_length, max_length = 0, 70

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    with torch.no_grad():
        logits = model(torch.as_tensor([tokens], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1)

    p_next_sorted, indexes_sorted = torch.sort(p_next, dim=-1, descending=True)

    nucleus = torch.cumsum(p_next_sorted, dim=-1) >= p
    indexes = indexes_sorted[:nucleus.to(dtype=torch.int8).argmax() + 1]

    next_token_index = indexes[torch.multinomial(input=p_next[indexes], num_samples=1)]

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()


 The Fermi paradox. This is especially true for physics as it tends to rely on a series of
 equations known as Cangi (Haufmann 1980). These equations are further
 modified by the widely accepted general relativity theory (Galey and McL
aughlin 1983) that
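
The nucleus-selection step in the loop above (sort, cumulative sum, cut at `p`) can be illustrated on a toy distribution without a model. This is a minimal pure-Python sketch; the helper names `nucleus_indices` and `sample_from_nucleus` and the toy probabilities are illustrative assumptions, not part of the seminar code.

```python
import random


def nucleus_indices(probs, p):
    """Return indices of the smallest set of top tokens whose total probability is >= p."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    cumulative = 0.0
    nucleus = []
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # the token that crosses the threshold is still included
            break
    return nucleus


def sample_from_nucleus(probs, p, rng=random):
    """Sample one index from the nucleus, weighted by the original probabilities."""
    nucleus = nucleus_indices(probs, p)
    weights = [probs[i] for i in nucleus]  # weights do not need to sum to 1
    return rng.choices(nucleus, weights=weights, k=1)[0]


toy_probs = [0.05, 0.5, 0.15, 0.3]  # hypothetical next-token distribution
print(nucleus_indices(toy_probs, p=0.75))
print(sample_from_nucleus(toy_probs, p=0.75))
```

With `p=0.75`, the two most likely tokens already cover 0.8 of the mass, so only those two can be sampled.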

### Load data and model

In [None]:
qqp = load_dataset('SetFit/qqp'); clear_output()

print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])



Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [None]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"

tokenizer = AutoTokenizer.from_pretrained(model_name); clear_output()
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device); clear_output()

### Tokenize the data

In [None]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [None]:
qqp['train'][0]['text1']

'How is the life of a math student? Could you describe your own experiences?'

In [None]:
qqp['train'][0]['text2']

'Which level of prepration is enough for the exam jlpt5?'

In [None]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [None]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=default_data_collator
)

In [None]:
for batch in val_loader:
    break  # take only the first batch, to inspect it
print("Sample batch:")
for key, value in batch.items():
    print(key, ":", value[:10] if len(value.shape) == 1 else value[0, :10])

with torch.no_grad():
    predicted = model(
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
        token_type_ids=batch['token_type_ids'].to(device)
    )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.detach().cpu().numpy())

Sample batch:
labels : tensor([0])
idx : tensor([0])
input_ids : tensor([ 101, 2009, 1132, 2170,  118, 4038, 1177, 2712,  136,  102])
token_type_ids : tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
attention_mask : tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Prediction (probs): [[9.998927e-01 1.072812e-04]]


__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU under `torch.no_grad()`
- use a batch size larger than 1
- use an optimized data loader with `num_workers > 1`
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)
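
One detail matters once the batch size exceeds 1: the last batch is usually smaller than the rest, so averaging per-batch accuracies gives a slightly biased number. Below is a minimal pure-Python sketch of counting correct predictions instead. The helper `accumulate_accuracy` and the toy batches are illustrative assumptions; in the real loop the predictions would come from `logits.argmax(dim=1)`.

```python
def accumulate_accuracy(batches):
    """Accuracy over (predictions, labels) batches, computed by counting, not by averaging batch accuracies."""
    correct, total = 0, 0
    for predictions, labels in batches:
        correct += sum(int(p == y) for p, y in zip(predictions, labels))
        total += len(labels)
    return correct / total


# Toy data: two full batches of 4 and a smaller final batch of 1.
toy_batches = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 of 4 correct
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # 4 of 4 correct
    ([1], [0]),                    # 0 of 1 correct
]

counted_accuracy = accumulate_accuracy(toy_batches)
mean_of_batch_accuracies = sum(
    sum(int(p == y) for p, y in zip(ps, ys)) / len(ys) for ps, ys in toy_batches
) / len(toy_batches)

print(f"counted accuracy:         {counted_accuracy:.4f}")
print(f"mean of batch accuracies: {mean_of_batch_accuracies:.4f}")
```

The two numbers differ because the single-example batch gets the same weight as a full batch in the second computation.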


In [None]:
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=256, shuffle=False, collate_fn=default_data_collator, num_workers=os.cpu_count()
)

In [None]:
from tqdm.notebook import tqdm
from torchmetrics.classification import BinaryAccuracy

metric = BinaryAccuracy()

for batch in tqdm(val_loader):
    with torch.no_grad():
        predicted = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            token_type_ids=batch['token_type_ids'].to(device),
        )
        metric(predicted.logits.argmax(dim=1).detach().cpu(), batch['labels'])

accuracy = metric.compute()


  0%|          | 0/158 [00:00<?, ?it/s]

In [None]:
assert 0.9 < accuracy < 0.91; print(f'Accuracy: {accuracy:.2%}')

Accuracy: 90.84%


### Task 2: train the model (5 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second on your hardware setup, and (3) its size in megabytes. Please take care to compare the models in an equal setting, e.g. on the same CPU / GPU. Compile your results into a table and write a short report (about half a page, in addition to the table) summarizing your findings.
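
For Option B, speed and size could be measured along these lines. This is a sketch under stated assumptions: `predict_fn` is a hypothetical callable that runs one batch through a model, size is estimated as parameter count times bytes per parameter (ignoring buffers and file-format overhead), and a megabyte is taken as 10^6 bytes.

```python
import time


def samples_per_second(predict_fn, batches, warmup=1):
    """Time predict_fn over batches and return throughput in samples per second."""
    # Warm-up calls are excluded so that one-time setup costs do not skew the timing.
    for batch in batches[:warmup]:
        predict_fn(batch)
    start = time.perf_counter()
    n_samples = 0
    for batch in batches:
        predict_fn(batch)
        n_samples += len(batch)
    elapsed = time.perf_counter() - start
    return n_samples / max(elapsed, 1e-9)  # guard against a zero reading on very fast runs


def size_in_megabytes(num_parameters, bytes_per_parameter=4):
    """Estimate model size; 4 bytes per parameter corresponds to float32 weights."""
    return num_parameters * bytes_per_parameter / 1e6


def dummy_predict(batch):
    """Hypothetical stand-in for a model forward pass."""
    return [len(text) % 2 for text in batch]


toy_batches = [["How are you?", "What is QQP?"]] * 50
print(f"{samples_per_second(dummy_predict, toy_batches):.0f} samples/s")
print(f"{size_in_megabytes(110_000_000):.0f} MB for a hypothetical 110M-parameter float32 model")
```

With a PyTorch model, the parameter count would typically come from `sum(p.numel() for p in model.parameters())`. On GPU, it is safer to call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously.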

In [None]:
model_name = 'microsoft/deberta-v3-base'

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False); clear_output()
model = AutoModelForSequenceClassification.from_pretrained(model_name, finetuning_task='qqp'); clear_output()

In [None]:
qqp_preprocessed = qqp.map(preprocess_function, batched=True); clear_output()

In [None]:
import numpy as np
import evaluate

metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
training_args = TrainingArguments(
    output_dir='/content/gdrive/MyDrive/deberta-v3-base-trainer',
    overwrite_output_dir=True,
    save_total_limit=2,
    max_steps=1600,
    evaluation_strategy='steps',  # note: recent transformers releases rename this argument to eval_strategy
    eval_steps=200,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    dataloader_num_workers=2,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=qqp_preprocessed['train'],
    eval_dataset=qqp_preprocessed['validation'],
    compute_metrics=compute_metrics,
)

In [None]:
trainer.evaluate(eval_dataset=qqp_preprocessed["validation"])


{'eval_loss': 0.7391034960746765,
 'eval_accuracy': 0.8238189463269849,
 'eval_runtime': 397.088,
 'eval_samples_per_second': 101.816,
 'eval_steps_per_second': 12.728}

In [None]:
trainer.train()




| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 200 | No log | 0.32519 | 0.857111 |
| 400 | No log | 0.292697 | 0.872001 |
| 600 | 0.331700 | 0.271942 | 0.884665 |
| 800 | 0.331700 | 0.278048 | 0.878209 |
| 1000 | 0.280900 | 0.257632 | 0.889315 |
| 1200 | 0.280900 | 0.252731 | 0.892728 |
| 1400 | 0.280900 | 0.24349 | 0.895004 |
| 1600 | 0.255300 | 0.244501 | 0.895424 |


TrainOutput(global_step=1600, training_loss=0.2862422490119934, metrics={'train_runtime': 5916.3788, 'train_samples_per_second': 17.308, 'train_steps_per_second': 0.27, 'total_flos': 6735763813171200.0, 'train_loss': 0.2862422490119934, 'epoch': 0.28})
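
The `'epoch': 0.28` in the output above follows from the training arguments. Here is a short arithmetic check, assuming a single GPU (an assumption, since the number of devices is not stated) and the training-set size of 363846 shown by the `Map` progress bar earlier:

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1       # assumption: training ran on one GPU
max_steps = 1600
train_size = 363846   # number of QQP training examples

# One optimizer step consumes this many examples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps
epochs = samples_seen / train_size

print(effective_batch_size)  # 64
print(samples_seen)          # 102400
print(f"{epochs:.2f}")       # 0.28
```

In other words, the run above saw a bit more than a quarter of the training set.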

### Task 3: try the full pipeline (2 points)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds the top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.
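
The ranking part of such a function does not depend on the model. Below is a minimal sketch using `heapq.nlargest`; `score_fn` is a hypothetical callable standing in for the model's duplicate probability, and the word-overlap scorer is only a toy substitute for it.

```python
import heapq


def top_k_duplicates(question, candidates, score_fn, k=5):
    """Return the k (score, candidate) pairs with the highest score, best first."""
    scored = (
        (score_fn(question, candidate), candidate)
        for candidate in candidates
        if candidate != question  # skip the query itself
    )
    # nlargest keeps only k items in memory instead of sorting every candidate.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def toy_score(question, candidate):
    """Toy stand-in for a model: Jaccard overlap of lowercased words."""
    a, b = set(question.lower().split()), set(candidate.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


toy_candidates = [
    "What can one do after MBBS?",
    "What do i do after my MBBS ?",
    "How is the Orbitz interview process?",
]
for score, text in top_k_duplicates("What to do after MBBS?", toy_candidates, toy_score, k=2):
    print(f"{score:.2f}  {text}")
```

Swapping `toy_score` for a function that returns the transformer's probability of the `duplicate` class gives the slow but exact search the task asks for.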

In [None]:
texts = []
for example in tqdm(qqp['train']):
    texts.append(example['text1'])
    if example['label'] == 0:
        texts.append(example['text2'])
texts = list(set(texts))

  0%|          | 0/363846 [00:00<?, ?it/s]

In [None]:
def find_duplicate_questition(questition: str) -> list[str]:
    duplicates = []
    model.eval()
    with torch.no_grad():
        for text in tqdm(texts[:1000]):  # limited to the first 1000 candidates for the demo
            result = tokenizer(questition, text, padding='max_length', max_length=MAX_LENGTH, truncation=True)

            predicted = model(
                input_ids=torch.tensor([result['input_ids']], device=device),
                attention_mask=torch.tensor([result['attention_mask']], device=device),
                token_type_ids=torch.tensor([result['token_type_ids']], device=device),
            )

            is_duplicate_prob = predicted.logits.softmax(dim=-1)[0, 1].item()

            if text != questition:
                duplicates.append((is_duplicate_prob, text))

    duplicates = sorted(duplicates, key=lambda x: x[0])

    print(f'Initial questition is: {questition}')
    for p, text in reversed(duplicates[-5:]):
        print(f'Prob: {p:.2%}, Question: {text}')

    return duplicates

In [None]:
_ = find_duplicate_questition(texts[0])

  0%|          | 0/1000 [00:00<?, ?it/s]

Initial questition is: What was the production of artwork intended for in Hawaii and how is it compared to the one intended for in Wisconsin?
Prob: 0.11%, Question: Tennessee Titans Live Streaming | Watch Tennessee Titans Live Stream NFL Games Today Online?
Prob: 0.11%, Question: What is the exact role of the Lok Sabha speaker of India? What are the perks of being the Lok Sabha speaker of India?
Prob: 0.08%, Question: How did the first human murder happen? What was the reason? and how did a human being get the idea to kill another human being for the first time?
Prob: 0.08%, Question: Is there some way to identify genuine JBL speakers, as fake JBL speakers are also sold in market?
Prob: 0.07%, Question: What's the difference between Pokemon Ruby, Sapphire and Emerald?


In [None]:
_ = find_duplicate_questition(texts[1])

  0%|          | 0/1000 [00:00<?, ?it/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Initial questition is: Health emergency on overseas flight: What actions would be taken by the airlines if a life-threatening emergency (heart attack, anaphylactic reaction) occurred during a trans-oceanic flight?
Prob: 5.76%, Question: What is the exact role of the Lok Sabha speaker of India? What are the perks of being the Lok Sabha speaker of India?
Prob: 3.28%, Question: Is it good to prepare for IAS just after a B.Tech from IIT? What should be done for IAS preparation during my B.Tech? Currently I'm in my 3rd year of B.Tech.
Prob: 2.87%, Question: Tennessee Titans Live Streaming | Watch Tennessee Titans Live Stream NFL Games Today Online?
Prob: 2.72%, Question: Which
Prob: 2.52%, Question: How could a life threatening condition be determined?


In [None]:
_ = find_duplicate_questition(texts[2])

  0%|          | 0/1000 [00:00<?, ?it/s]

Initial questition is: What is the cure for polycystic ovary syndrome?
Prob: 0.67%, Question: What is the exact role of the Lok Sabha speaker of India? What are the perks of being the Lok Sabha speaker of India?
Prob: 0.34%, Question: Pregnancy k stating month m kya Khana chaiye?
Prob: 0.19%, Question: Is it good to prepare for IAS just after a B.Tech from IIT? What should be done for IAS preparation during my B.Tech? Currently I'm in my 3rd year of B.Tech.
Prob: 0.12%, Question: Tennessee Titans Live Streaming | Watch Tennessee Titans Live Stream NFL Games Today Online?
Prob: 0.09%, Question: What is the binifet of cpec for Pakistani traler owner?


In [None]:
_ = find_duplicate_questition(texts[3])

  0%|          | 0/1000 [00:00<?, ?it/s]

Initial questition is: How much does football physio earns?
Prob: 1.73%, Question: What is the exact role of the Lok Sabha speaker of India? What are the perks of being the Lok Sabha speaker of India?
Prob: 0.48%, Question: Tennessee Titans Live Streaming | Watch Tennessee Titans Live Stream NFL Games Today Online?
Prob: 0.09%, Question: How did the first human murder happen? What was the reason? and how did a human being get the idea to kill another human being for the first time?
Prob: 0.07%, Question: Is it good to prepare for IAS just after a B.Tech from IIT? What should be done for IAS preparation during my B.Tech? Currently I'm in my 3rd year of B.Tech.
Prob: 0.06%, Question: What's the difference between 4 GB mobile RAM and 4GB PC RAM? Why does mobile RAM cost lesser?


In [None]:
_ = find_duplicate_questition(texts[4])

  0%|          | 0/1000 [00:00<?, ?it/s]

Initial questition is: How is the Orbitz interview process?
Prob: 0.83%, Question: Is it good to prepare for IAS just after a B.Tech from IIT? What should be done for IAS preparation during my B.Tech? Currently I'm in my 3rd year of B.Tech.
Prob: 0.80%, Question: What is the exact role of the Lok Sabha speaker of India? What are the perks of being the Lok Sabha speaker of India?
Prob: 0.14%, Question: Tennessee Titans Live Streaming | Watch Tennessee Titans Live Stream NFL Games Today Online?
Prob: 0.12%, Question: What are the requirements for placing in ntt data?
Prob: 0.09%, Question: If I apply for a study visa in New Zealand, is it easy to apply for permanent residence in New Zealand?


__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For instance, you can form a short-list of potential candidates using a cheaper method, and then run your transformer on that short list. If you opt for this solution, please keep both the original implementation and the optimized one, and briefly explain the difference between them.
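
One way to read the short-list idea: rank all candidates with a cheap lexical score, keep the best few, and call the expensive model only on those. The sketch below uses an IDF-weighted word overlap as the cheap method (an assumption; TF-IDF vectors or nearest neighbours over sentence embeddings would be other options) and a hypothetical `expensive_score` callable in place of the transformer.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped; empty tokens are dropped."""
    return [w.strip("?.,!").lower() for w in text.split() if w.strip("?.,!")]


def build_idf(corpus):
    """Inverse document frequency per word: rare words get larger weights."""
    doc_freq = Counter(word for text in corpus for word in set(tokenize(text)))
    n = len(corpus)
    return {word: math.log((1 + n) / (1 + df)) + 1.0 for word, df in doc_freq.items()}


def cheap_score(question, candidate, idf):
    """Sum of IDF weights of the words both texts share (no model involved)."""
    shared = set(tokenize(question)) & set(tokenize(candidate))
    return sum(idf.get(word, 0.0) for word in shared)


def shortlist_then_rerank(question, corpus, idf, expensive_score, shortlist_size=50, k=5):
    """Stage 1: cheap lexical short-list. Stage 2: rerank it with the expensive scorer."""
    candidates = [text for text in corpus if text != question]
    candidates.sort(key=lambda text: cheap_score(question, text, idf), reverse=True)
    shortlist = candidates[:shortlist_size]
    reranked = sorted(shortlist, key=lambda text: expensive_score(question, text), reverse=True)
    return reranked[:k]


toy_corpus = [
    "What can one do after MBBS?",
    "What do i do after my MBBS ?",
    "How is the Orbitz interview process?",
    "What is the cure for polycystic ovary syndrome?",
]
toy_idf = build_idf(toy_corpus)


def toy_expensive_score(question, candidate):
    """Hypothetical stand-in for the transformer's duplicate probability."""
    return len(set(tokenize(question)) & set(tokenize(candidate)))


print(shortlist_then_rerank("What to do after MBBS?", toy_corpus, toy_idf,
                            toy_expensive_score, shortlist_size=2, k=1))
```

The loop version scores every candidate with the transformer. This version scores only `shortlist_size` of them, at the risk of missing paraphrases that share few words with the query.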