### Homework 5: Question search engine

Remeber week01 where you used GloVe embeddings to find related questions? That was.. cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [None]:
!pip install --upgrade transformers datasets accelerate deepspeed
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets
from tqdm import tqdm
import numpy as np
import pandas as pd
import time


device = torch.device('cuda:0')

In [None]:
# !pip uninstall transformers
# !pip install --no-cache-dir transformers sentencepiece

### Load data and model

In [2]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

Downloading readme:   0%|          | 0.00/313 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/70.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.83M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/76.0M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]



Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [None]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

Downloading (…)okenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/890 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

### Tokenize the data

In [4]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [None]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [None]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=512, shuffle=False, collate_fn=transformers.default_data_collator, num_workers = 2
)

In [None]:
for batch in val_loader:
     break  # here be your training code
print("Sample batch:", batch)

with torch.no_grad():
  predicted = model(
      input_ids=batch['input_ids'].to(device),
      attention_mask=batch['attention_mask'].to(device),
      token_type_ids=batch['token_type_ids'].to(device)
  )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.numpy())

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU with no_grad
- using batch size larger than 1
- use optimize data loader with num_workers > 1
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


In [None]:
accuracy = []
with torch.no_grad():
    for batch in tqdm(val_loader):
        predicted = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            token_type_ids=batch['token_type_ids'].to(device)
        )
        predict = torch.softmax(predicted.logits, dim=1).argmax(1)

        accuracy.append((batch['labels'].to(device) == predict).float().mean().item())

accuracy = np.mean(accuracy)

100%|██████████| 79/79 [04:26<00:00,  3.37s/it]


In [None]:
assert 0.9 < accuracy < 0.91

### Task 2: train the model (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second in your hardware setup and (3) its size in megabytes. Please take care to compare models in equal setting, e.g. same CPU / GPU. Compile your results into a table and write a short (~half-page on top of a table) report, summarizing your findings.

# Option A

Я попробовал, но не получилось accuracy сделать больше, чем у модели выше

In [None]:
from IPython.display import clear_output
from tqdm import tqdm


model_name = "microsoft/deberta-v3-base"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
lost_f = nn.CrossEntropyLoss()

Downloading pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'classifier.bias', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
transformers.logging.set_verbosity_error()

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

train_set = qqp_preprocessed['train']
val_set = qqp_preprocessed['validation']

train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=32, shuffle=True, collate_fn=transformers.default_data_collator, num_workers = 2
)

val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=32, shuffle=False, collate_fn=transformers.default_data_collator, num_workers = 2
)


Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [None]:
losses = []
for batch in tqdm(train_loader):
    opt.zero_grad()

    predicted = model(
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
        token_type_ids=batch['token_type_ids'].to(device)
    )
    predict = torch.softmax(predicted.logits, dim=1)
    loss = lost_f(predict, batch['labels'].to(device))
    losses.append(loss.item())
    loss.backward()
    opt.step()

print("loss: ", np.mean(losses))


accuracy = []
with torch.no_grad():
    for batch in tqdm(val_loader):
        predicted = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            token_type_ids=batch['token_type_ids'].to(device)
        )
        predict = torch.softmax(predicted.logits, dim=1).argmax(1)

        accuracy.append((batch['labels'].to(device) == predict).float().mean().item())

print("loss: ", np.mean(accuracy))

100%|██████████| 11371/11371 [2:24:01<00:00,  1.32it/s]


loss:  0.4201843806018029


100%|██████████| 1264/1264 [05:31<00:00,  3.82it/s]

loss:  0.8981973327507701





In [None]:
losses = []
for batch in tqdm(train_loader):
    opt.zero_grad()

    predicted = model(
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
        token_type_ids=batch['token_type_ids'].to(device)
    )
    predict = torch.softmax(predicted.logits, dim=1)
    loss = lost_f(predict, batch['labels'].to(device))
    losses.append(loss.item())
    loss.backward()
    opt.step()

print("loss: ", np.mean(losses))


accuracy = []
with torch.no_grad():
    for batch in tqdm(val_loader):
        predicted = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            token_type_ids=batch['token_type_ids'].to(device)
        )
        predict = torch.softmax(predicted.logits, dim=1).argmax(1)

        accuracy.append((batch['labels'].to(device) == predict).float().mean().item())

print("loss: ", np.mean(accuracy))

 17%|█▋        | 1936/11371 [24:31<1:59:32,  1.32it/s]

# Option B

In [None]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

In [None]:
def get_time_accuracy(model):
    qqp_preprocessed = qqp.map(preprocess_function, batched=True)
    val_set = qqp_preprocessed['validation']

    val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=32, shuffle=False, collate_fn=transformers.default_data_collator, num_workers = 2
    )

    accuracy = []
    with torch.no_grad():
        start = time.time()
        for batch in tqdm(val_loader):
            predicted = model(
                input_ids=batch['input_ids'].to(device),
                attention_mask=batch['attention_mask'].to(device),
                token_type_ids=batch['token_type_ids'].to(device),
                )

            predict = torch.softmax(predicted.logits, dim=1).argmax(dim=1)

            accuracy.append((batch['labels'].to(device) == predict).float().mean().item())
        time_evaluation = round(time.time() - start, 4)
    accuracy = round(np.mean(accuracy), 4)

    return accuracy, time_evaluation


In [None]:
def get_size(model):
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()

    size_all_mb = (param_size + buffer_size) / 1024**2
    return round(size_all_mb)

In [None]:
gchhablani_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(gchhablani_name)
gchhablani_model = transformers.AutoModelForSequenceClassification.from_pretrained(gchhablani_name).to(device)


gchhablani_accuracy, gchhablani_time_evaluation = get_time_accuracy(gchhablani_model)
gchhablani_size = get_size(gchhablani_model)

gchhablani_accuracy, gchhablani_time_evaluation, gchhablani_size

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

100%|██████████| 1264/1264 [04:48<00:00,  4.37it/s]


(0.9083, 288.9596, 413)

In [None]:
JeremiahZ_name = "JeremiahZ/bert-base-uncased-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(JeremiahZ_name)
JeremiahZ_model = transformers.AutoModelForSequenceClassification.from_pretrained(JeremiahZ_name).to(device)


JeremiahZ_accuracy, JeremiahZ_time_evaluation = get_time_accuracy(JeremiahZ_model)
JeremiahZ_size = get_size(JeremiahZ_model)

JeremiahZ_accuracy, JeremiahZ_time_evaluation, JeremiahZ_size

100%|██████████| 1264/1264 [04:48<00:00,  4.37it/s]


(0.9099, 288.9867, 418)

In [None]:
assemblyai_name = "assemblyai/distilbert-base-uncased-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(assemblyai_name)
assemblyai_model = transformers.AutoModelForSequenceClassification.from_pretrained(assemblyai_name).to(device)


assemblyai_accuracy, assemblyai_time_evaluation = get_time_accuracy(assemblyai_model)
assemblyai_size = get_size(assemblyai_model)

assemblyai_accuracy, assemblyai_time_evaluation, assemblyai_size

100%|██████████| 1264/1264 [02:27<00:00,  8.58it/s]


(0.8992, 147.3216, 255)

In [None]:
textattack_name = "textattack/bert-base-uncased-QQP"
tokenizer = transformers.AutoTokenizer.from_pretrained(textattack_name)
textattack_model = transformers.AutoModelForSequenceClassification.from_pretrained(textattack_name).to(device)


textattack_accuracy, textattack_time_evaluation = get_time_accuracy(textattack_model)
textattack_size = get_size(textattack_model)

textattack_accuracy, textattack_time_evaluation, textattack_size

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

100%|██████████| 1264/1264 [04:49<00:00,  4.37it/s]


(0.909, 289.106, 418)

In [None]:
pd.DataFrame({
    "name": [gchhablani_name, assemblyai_name, JeremiahZ_name, textattack_name],
    "accuracy": [gchhablani_accuracy, assemblyai_accuracy, JeremiahZ_accuracy, textattack_accuracy],
    "time_evaluation": [gchhablani_time_evaluation, assemblyai_time_evaluation, JeremiahZ_time_evaluation, textattack_time_evaluation],
    "size_model mb": [gchhablani_size, assemblyai_size, JeremiahZ_size, textattack_size]
    })

Unnamed: 0,name,accuracy,time_evaluation,size_model mb
0,gchhablani/bert-base-cased-finetuned-qqp,0.9083,288.9596,413
1,assemblyai/distilbert-base-uncased-qqp,0.8992,147.3216,255
2,JeremiahZ/bert-base-uncased-qqp,0.9099,288.9867,418
3,textattack/bert-base-uncased-QQP,0.909,289.106,418


Тесты проводил в колабе на одном и том GPU, все параметры были идентичны для всех моделей.

Как можно видеть по таблице, модель assemblyai лучшая, если нужна модель поменьше и побыстрее. Она почти в два раза меньше и быстрее других, а accuracy, лишь немногим уступает.

Если не важна ни скорость, ни вес, то можно выбрать JeremiahZ, у неё наибольшее accuracy

### Task 3: try the full pipeline (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For isntance, you can form a short-list of potential candidates using a cheaper method, and then run your tranformer on that short list. If you opted for this solution, please keep both the original implementation and the optimized one - and explain briefly what is the difference there.