### Homework 5: Question search engine

Remember week01, where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [None]:
%pip install --upgrade transformers datasets accelerate deepspeed
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets



### Load data and model

In [None]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])






Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [None]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenize the data

In [None]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)



In [None]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [None]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator,
    num_workers = 2
)

In [None]:
print(val_set['text1'][0])
print(val_set['text2'][0])
print((val_set['label'][0]))

Why are African-Americans so beautiful?
Why are hispanics so beautiful?
0


In [None]:
torch.tensor(val_set['input_ids'])[:2]

tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  

In [None]:
# model.attentions = torch.tensor(val_set['attention_mask'])[:2]
model(torch.tensor(val_set['input_ids'])[:2])

SequenceClassifierOutput(loss=None, logits=tensor([[ 3.4198, -2.9517],
        [ 3.3207, -2.7509]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [None]:
# Walk through the loader to grab its last batch; no forward pass is needed for that
for batch in val_loader:
  pass
print("Sample batch:", batch)

with torch.no_grad():
  predicted = model(
      input_ids=batch['input_ids'].to(device),
      attention_mask=batch['attention_mask'].to(device),
      token_type_ids=batch['token_type_ids'].to(device)
  )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.cpu().numpy())

Sample batch: {'labels': tensor([1]), 'idx': tensor([40429]), 'input_ids': tensor([[ 101, 1731, 1169,  146, 1294, 1948, 3294, 1976, 1105, 3253,  136,  102,
         1327, 1110, 1294, 1948, 3294,  136,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [None]:
int(torch.softmax(predicted.logits, dim=1).data.cpu().numpy()[0][1] >= 0.5)

1

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU under `torch.no_grad()`
- use a batch size larger than 1
- use an optimized data loader with `num_workers > 1`
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)
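
The optimizations listed above can be combined roughly as follows. This is a minimal sketch rather than the reference solution: `evaluate_accuracy` and its `input_keys` parameter are hypothetical names, and the code assumes that each batch is a dict with a `labels` entry and that the model returns an object with a `.logits` attribute, as Hugging Face sequence classifiers do.

```python
import torch


def evaluate_accuracy(model, loader, device, use_amp=False,
                      input_keys=("input_ids", "attention_mask", "token_type_ids")):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()  # switch off dropout
    correct, total = 0, 0
    amp_enabled = use_amp and device.type == "cuda"  # mixed precision only on GPU
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in loader:
            labels = batch["labels"].to(device)
            # Pass only model inputs; extra columns such as `idx` are skipped
            inputs = {k: batch[k].to(device) for k in input_keys if k in batch}
            with torch.autocast(device_type=device.type, enabled=amp_enabled):
                logits = model(**inputs).logits
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total
```

With the objects defined in this notebook, a call might look like `evaluate_accuracy(model, val_loader, device, use_amp=True)`, ideally with a loader whose `batch_size` is as large as GPU memory allows.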


In [None]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=5, shuffle=False, collate_fn=transformers.default_data_collator,
    num_workers = 2
)

In [None]:
from torch import autocast
from torch.cuda.amp import GradScaler 
import numpy as np


In [None]:
a = np.array([[0, 1], [1, 0]])
a = [int(el[1] >= 0.5) for el in a]
b = [0, 1]
c = []
[c.append(int(a[i] == b[i])) for i in range(len(a))]
c

[0, 0]

In [None]:
# def loss_fn(input, output):
predicted.logits

tensor([[-3.3581,  3.0561]], device='cuda:0')

In [None]:
accuracy = []
model.eval()  # disable dropout for evaluation
with torch.no_grad():  # gradients are not needed to measure accuracy
  for batch in val_loader:
    with autocast(device_type='cuda', dtype=torch.float16):
      predicted = model(
          input_ids=batch['input_ids'].to(device),
          attention_mask=batch['attention_mask'].to(device),
          token_type_ids=batch['token_type_ids'].to(device)
      )
    # predict "duplicate" (class 1) when its probability is at least 0.5
    probs = torch.softmax(predicted.logits.float(), dim=1)[:, 1].cpu()
    predicted_labels = (probs >= 0.5).long()
    accuracy.extend((predicted_labels == batch['labels']).int().tolist())
print("Sample batch:", batch)

with torch.no_grad():
  predicted = model(
      input_ids=batch['input_ids'].to(device),
      attention_mask=batch['attention_mask'].to(device),
      token_type_ids=batch['token_type_ids'].to(device)
  )
  
print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.cpu().numpy())
print('Accuracy is:', np.mean(accuracy))

Sample batch: {'labels': tensor([1, 1, 1, 1, 1]), 'idx': tensor([40425, 40426, 40427, 40428, 40429]), 'input_ids': tensor([[  101,  2009,  1110,  4542,  1105,  1103,  5922,  1602,  2412,  2628,
          1114,  4719,   136,   102,  2009,  1674,  4542,  4248,  4719,   136,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0, 

In [None]:
(torch.softmax(predicted.logits, dim=1).data.cpu().numpy())

array([[0.00399893, 0.99600106],
       [0.9945746 , 0.00542538],
       [0.01337479, 0.9866252 ],
       [0.00199273, 0.99800724],
       [0.00163549, 0.9983645 ]], dtype=float32)

In [None]:
accuracy = np.mean(accuracy)
assert 0.9 < accuracy < 0.91

### Task 2: train the model (5 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.
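
If you write the training loop manually, every step needs a forward pass, a backward pass, an optimizer update and a gradient reset. Here is a minimal sketch on a toy classifier; the `train_one_epoch` name and the synthetic data are illustrative assumptions and not part of the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_one_epoch(model, loader, optimizer, device):
    """Run one pass over `loader` and return the mean training loss."""
    model.train()
    losses = []
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        loss = F.cross_entropy(model(features), labels)
        optimizer.zero_grad()  # clear gradients left over from the previous step
        loss.backward()        # compute gradients of the loss
        optimizer.step()       # apply the update; without it the weights never change
        losses.append(loss.item())
    return sum(losses) / len(losses)


# Toy usage: a linearly separable two-class problem (stand-in for real data)
torch.manual_seed(0)
x = torch.randn(256, 4)
y = (x[:, 0] > 0).long()
toy_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=32
)
toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)
cpu = torch.device("cpu")
first_loss = train_one_epoch(toy_model, toy_loader, toy_optimizer, cpu)
for _ in range(9):
    last_loss = train_one_epoch(toy_model, toy_loader, toy_optimizer, cpu)
```

For a transformer, the same structure applies, with `model(**batch).loss` in place of the explicit cross-entropy call.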


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second on your hardware setup, and (3) its size in megabytes. Please take care to compare the models in equal settings, e.g. on the same CPU / GPU. Compile your results into a table and write a short report (about half a page on top of the table) summarizing your findings.
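
For Option B, size and speed could be measured along these lines. This is a sketch under stated assumptions: `model_size_mb` and `samples_per_second` are hypothetical helper names, a megabyte is taken as 2**20 bytes, and the toy `nn.Linear` stands in for a real transformer (for a Hugging Face model you would call it as `model(**batch)` instead).

```python
import time

import torch
import torch.nn as nn


def model_size_mb(model):
    """Size of all parameters and buffers in megabytes (1 MB = 2**20 bytes here)."""
    tensors = list(model.parameters()) + list(model.buffers())
    return sum(t.numel() * t.element_size() for t in tensors) / 2**20


def samples_per_second(model, batch, n_repeats=10):
    """Average inference throughput measured on one fixed batch."""
    model.eval()
    with torch.no_grad():
        model(batch)  # warm-up run, excluded from timing
        if batch.is_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_repeats):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return n_repeats * batch.shape[0] / elapsed


toy = nn.Linear(1024, 1024)  # stand-in for a real model
size = model_size_mb(toy)
speed = samples_per_second(toy, torch.randn(64, 1024))
print(f"size: {size:.2f} MB, speed: {speed:.0f} samples/s")
```

Using the same batch size, sequence length and device for every model keeps the comparison fair.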

In [None]:
#https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb
train_set = qqp_preprocessed['train']
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=50, shuffle=True, collate_fn=transformers.default_data_collator,
    num_workers = 3
)

In [None]:
%pip install mpi4py



In [None]:
from accelerate import Accelerator, DeepSpeedPlugin
# Mixed precision is selected through the `mixed_precision` argument
accelerator = Accelerator(mixed_precision='fp16')
if accelerator.is_main_process:
    datasets.utils.logging.set_verbosity_warning()
    transformers.utils.logging.set_verbosity_info()
else:
    datasets.utils.logging.set_verbosity_error()
    transformers.utils.logging.set_verbosity_error()



In [None]:
model_name = "microsoft/deberta-v3-base"
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--deberta-v3-base/snapshots/8ccc9b6f36199bec6961081d44eb72fb3f7353f3/config.json
Model config DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-base",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "transformers_version": "4.24.0",
  "type_vocab_size": 0,
  "vocab_size": 128100
}

loadi

In [None]:
model.classifier.parameters()

<generator object Module.parameters at 0x7fd79b3e2350>

In [None]:
import time
from tqdm.auto import tqdm

num_epochs = 10

In [None]:
model

DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0): DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
 

In [None]:
# NOTE: train_loader was tokenized with the BERT tokenizer loaded earlier. DeBERTa-v3 uses
# a different vocabulary, so re-tokenize the data with its own tokenizer before training.
# Only the classification head is optimized here; the encoder weights stay fixed.
optimizer = torch.optim.Adam(model.classifier.parameters())
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

progress_bar = tqdm(range(num_epochs * len(train_loader)), disable=not accelerator.is_main_process)

for epoch in range(num_epochs):
    start_time = time.time()
    accuracy = []
    model.train()
    for batch in train_loader:
        labels = batch['labels']
        output = model(batch['input_ids'], attention_mask=batch['attention_mask'],
                       token_type_ids=batch['token_type_ids'],
                       labels=labels)
        loss = output.loss

        optimizer.zero_grad()       # clear gradients from the previous step
        accelerator.backward(loss)  # backward pass with fp16 loss scaling
        optimizer.step()            # apply the update; without it the weights never change
        progress_bar.update(1)

        # track training accuracy so that it can be averaged after the loop
        accuracy.extend((output.logits.argmax(dim=1) == labels).int().tolist())

    print('Time for epoch, minutes:', (time.time() - start_time)/60)
    print('Loss:', loss)


Time for epoch, minutes: 54.48321794271469
Loss: tensor(0.6885, device='cuda:0', grad_fn=<NllLossBackward0>)


In [None]:
import numpy as np

np.mean(accuracy)

### Task 3: try the full pipeline (2 points)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.
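
One possible shape for the slow but correct version is sketched below. The names `find_duplicates` and `word_overlap` are hypothetical, and `word_overlap` is only a placeholder scorer so that the example runs without a model; in the actual task, the scoring function would return your model's predicted probability that two questions are duplicates.

```python
import heapq


def find_duplicates(query, candidates, score_fn, top_k=5):
    """Score every candidate against `query`; return the `top_k` best as (score, text)."""
    scored = ((score_fn(query, text), text) for text in candidates)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


def word_overlap(a, b):
    """Placeholder scorer: Jaccard similarity of lowercased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


questions = [
    "What can one do after MBBS?",
    "How is the life of a math student?",
    "What do i do after my MBBS ?",
]
top = find_duplicates("What should I do after MBBS?", questions, word_overlap, top_k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

`heapq.nlargest` avoids sorting the whole training set when only a handful of results is needed.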

__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For instance, you can form a short list of potential candidates using a cheaper method, and then run your transformer on that short list. If you opt for this solution, please keep both the original implementation and the optimized one, and briefly explain the difference between them.
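
The short-list idea from the bonus can be sketched as a two-stage search. Everything here is an assumption made for illustration: `find_duplicates_fast`, `jaccard` and `shortlist_size` are hypothetical names, word overlap is just one cheap filter among many (embedding similarity would be another), and `expensive_score` stands in for a call to your transformer.

```python
def jaccard(a, b):
    """Cheap similarity: Jaccard overlap of lowercased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def find_duplicates_fast(query, candidates, expensive_score, shortlist_size=100, top_k=5):
    """Shortlist by cheap word overlap, then rerank only the shortlist."""
    # Stage 1: keep the `shortlist_size` candidates with the highest cheap score
    shortlist = sorted(
        candidates, key=lambda text: jaccard(query, text), reverse=True
    )[:shortlist_size]
    # Stage 2: the expensive scorer runs once per shortlisted candidate only
    reranked = sorted(
        shortlist, key=lambda text: expensive_score(query, text), reverse=True
    )
    return reranked[:top_k]
```

The trade-off is recall: a true duplicate that shares few words with the query can be dropped in the first stage, so it is worth comparing the results against the exhaustive version on a few examples.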