### Homework 5: Question search engine

Remeber week01 where you used GloVe embeddings to find related questions? That was.. cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [1]:
# %pip install --upgrade transformers datasets accelerate deepspeed

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets

from tqdm.auto import tqdm

  from .autonotebook import tqdm as notebook_tqdm


### Load data and model

In [None]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

In [None]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenize the data

In [5]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map: 100%|████████████████████████████████████████████████████████████████████████████████| 363846/363846 [00:29<00:00, 12221.48 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████| 40430/40430 [00:03<00:00, 12755.40 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████| 390965/390965 [00:31<00:00, 12451.66 examples/s]


In [6]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[1, 577, 269, 262, 432, 265, 266, 5291, 1234, 302, 5047, 274, 3443, 290, 451, 2056, 302, 2, 2597, 67 ...


### Task 1: evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [6]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [7]:
for batch in val_loader:
     break  # here be your training code
print("Sample batch:", batch)

with torch.no_grad():
  predicted = model(
      input_ids=batch['input_ids'],
      attention_mask=batch['attention_mask'],
      token_type_ids=batch['token_type_ids']
  )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

In [8]:
batch

{'labels': tensor([0]),
 'idx': tensor([0]),
 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
           2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,    

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU with no_grad
- using batch size larger than 1
- use optimize data loader with num_workers > 1
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


In [9]:
@torch.no_grad()
def run_validation(model, loader, device='cuda:1', dtype=torch.float16):
    accuracy = 0
    validation_size = 0
    for batch in  tqdm(val_loader, desc='computing validation accuracy ...'):  
        batch = {name: tensor.to(device) for name, tensor in batch.items()}
        with torch.autocast(device_type=device, dtype=torch.float16): 
            predicted = model(
              input_ids=batch['input_ids'],
              attention_mask=batch['attention_mask'],
              token_type_ids=batch['token_type_ids']
            )
            accuracy += torch.sum(predicted.logits.argmax(dim=-1) == batch['labels'])
            validation_size += predicted.logits.size(0)

    return accuracy.item() / validation_size

In [43]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1024, shuffle=False, 
    num_workers=4,
    pin_memory=True,
    collate_fn=transformers.default_data_collator
)
model.to('cuda:1')
model.eval()
accuracy = run_validation(model, val_loader, device='cuda:1', dtype=torch.float16)

computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████| 40/40 [02:11<00:00,  3.30s/it]


In [44]:
assert 0.9 < accuracy < 0.91

### Task 2: train the model (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second in your hardware setup and (3) its size in megabytes. Please take care to compare models in equal setting, e.g. same CPU / GPU. Compile your results into a table and write a short (~half-page on top of a table) report, summarizing your findings.

#### Option A

In [2]:
model_name = 'microsoft/deberta-v3-base'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
MAX_LENGTH = 128
def process_data_for_finetuning(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp = datasets.load_dataset('SetFit/qqp')
qqp_for_finetuning = qqp.map(process_data_for_finetuning, batched=True)

Repo card metadata block was not found. Setting CardData to empty.


In [6]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

train_loader = torch.utils.data.DataLoader(
    qqp_for_finetuning['train'], 
    batch_size=72, shuffle=True, 
    num_workers=4,
    pin_memory=True,
    collate_fn=transformers.default_data_collator
)
val_loader = torch.utils.data.DataLoader(
    qqp_for_finetuning['validation'], 
    batch_size=512, shuffle=False, 
    num_workers=2,
    # pin_memory=True,
    collate_fn=transformers.default_data_collator
)

In [10]:
def finetune_model(model, optimizer, scaler, train_loader, val_loader, device, n_epochs, limit_iteration: int = 2000):
    model.to(device)
    val_accuracy = run_validation(model, val_loader, device=device)
    print(f'Initial accuracy: {val_accuracy:.3f}')
    
    for epoch in range(n_epochs):
    
        model.train()
        mean_loss = 0
        for idx, batch in enumerate(pbar := tqdm(train_loader, desc=f'running {epoch=} ...', total=limit_iteration)):
            optimizer.zero_grad(set_to_none=True)
            
            batch = {name: tensor.to(device) for name, tensor in batch.items()}
            
            with torch.autocast(device_type=device, dtype=torch.float16): 
                predicted = model(
                    input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    token_type_ids=batch['token_type_ids']
                )
                loss = F.cross_entropy(predicted.logits, batch['labels'])
            
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            loss_coeff = 1 / (idx + 1)
            ema_loss = (1 - loss_coeff) * mean_loss +  loss_coeff * loss.item()
            pbar.set_description(f'running {epoch=} ... | ema loss {ema_loss:.6f}')
    
            if idx + 1 >= limit_iteration: break
        
        model.eval()
        val_accuracy = run_validation(model, val_loader, device=device)
        print(f'{epoch=} {val_accuracy=:.3f}')

In [24]:
n_epochs = 10
device = 'cuda:1'
scaler = torch.amp.GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

finetune_model(model, optimizer, scaler, train_loader, val_loader, device, n_epochs)

computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


Initial accuracy: 0.368


running epoch=0 ... | ema loss 0.000134: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:02<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=0 val_accuracy=0.895


running epoch=1 ... | ema loss 0.000069: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:03<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=1 val_accuracy=0.899


running epoch=2 ... | ema loss 0.000092: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:03<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=2 val_accuracy=0.908


running epoch=3 ... | ema loss 0.000108: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:03<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=3 val_accuracy=0.913


running epoch=4 ... | ema loss 0.000133: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:03<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=4 val_accuracy=0.916


running epoch=5 ... | ema loss 0.000082: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:04<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=5 val_accuracy=0.916


running epoch=6 ... | ema loss 0.000079: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:03<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=6 val_accuracy=0.914


running epoch=7 ... | ema loss 0.000025: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:04<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=7 val_accuracy=0.916


running epoch=8 ... | ema loss 0.000026: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:04<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=8 val_accuracy=0.918


running epoch=9 ... | ema loss 0.000106: 100%|█████████████████████████████████████████████████████████████████████████████████████▉| 1999/2000 [34:03<00:01,  1.02s/it]
computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


epoch=9 val_accuracy=0.916


In [25]:
final_accuracy = run_validation(model, val_loader, device=device)
print(f'{final_accuracy=:.3f}')

computing validation accuracy ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [02:52<00:00,  2.19s/it]


final_accuracy=0.916


In [None]:
import math
import time

In [None]:
@torch.no_grad()
def estimate_model_size(model):
    r'''Credit goes to: https://discuss.pytorch.org/t/finding-model-size/130275'''
    param_size, buffer_size = 0, 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    
    size_all_mb = (param_size + buffer_size) / 1024**2
    print('model size: {:.3f}MB'.format(size_all_mb))


@torch.no_grad()
def estimate_model_inference_speed(model, loader, device='cuda:1'):
    model.eval()

    batch_time = []
    for batch in  tqdm(val_loader, desc='computing validation accuracy ...'):  
        batch = {name: tensor.to(device) for name, tensor in batch.items()}

        torch.cuda.synchronize()
        start = time.time()
        with torch.autocast(device_type=device, dtype=torch.float16): 
            _ = model(
              input_ids=batch['input_ids'],
              attention_mask=batch['attention_mask'],
              token_type_ids=batch['token_type_ids']
            )
        torch.cuda.synchronize()
        batch_time.append(time.time() - start)

    return math.mean(batch_time)



### Task 3: try the full pipeline (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

In [16]:
some_sentences_from_train_set = [
    "Is samsung j7 water proof?",
    "Do we need smaller states?",
    "What are the best places in delhi to chill with your best friend?",
    "Would America support Pakistan in war?",
    "How many software engineers at Google are able to write a balanced binary search tree in a Google Docs in the phone screen?",
    "What is best way to to become a good cyber security analyst?",
    "What's a good interview question on CSS?",
    "How is technology helping us?",
    "What are ways of earning money online?",
    "How long would I have to walk to lose 1kg per week with no other exercise?",
    "What is the purpose of our existence? I mean why do we exist?",
    "The human mind is both rational/irrational (it's just built that way). As an atheist, do you occasionally feel some emptiness or depression?",
    "What is the role of a brand manager?",
    "Which are the apps which use WebView for their apps in Android?",
    "If more vacuum energy appears with expansion and it has no limit, can infinite of this energy be created? If yes is energy infinite?",
    "Study tips to pas ca ipcc?",
    "What's the best smartphone in the market right now?",
    "How is depression cured without a therapist?",
]

In [17]:
class SimilarityDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, tokenizer, question):
        super().__init__()
        self.dataset = dataset
        self.question = question
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        tokenized_input = tokenizer(
            self.question, self.dataset[idx]['text1'],
            padding='max_length', 
            max_length=MAX_LENGTH, 
            truncation=True
        )
        tokenized_input['idx'] = [idx]
        return tokenized_input

In [18]:
@torch.no_grad()
def find_duplicates(model, dataset, device: str = 'cuda:1', topk: int = 2):
    test_loader = torch.utils.data.DataLoader(
        dataset, 
        batch_size=512, shuffle=False, 
        num_workers=2,
        collate_fn=transformers.default_data_collator
    )
    results = []

    for batch in tqdm(test_loader, ncols=100):
        batch = {name: tensor.to(device) for name, tensor in batch.items()}
        
        with torch.autocast(device_type=device, dtype=torch.float16): 
            scores = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                token_type_ids=batch['token_type_ids']
            )
            scores = scores.logits.softmax(dim=-1)[:, -1]
            results.extend(
                [*zip(batch['idx'].flatten().cpu(), scores.flatten().cpu())]
            )

    results.sort(key=lambda x: x[-1], reverse=True)
    return results

def get_topk_results_simple(results, dataset, top_k: int = 5):
    return [(dataset[idx.item()]['text1'], round(score.item(), 3)) for idx, score in results[:top_k]]


def get_topk_results_filter_duplicates(esults, dataset, top_k: int = 5):
    '''This function assumes that identical scores can occure only for duplicates. '''
    seen_scores = set()
    top_k_results = []
    
    for idx, score in results:
        if len(top_k_results) >= top_k:
            break
        
        text = dataset[idx.item()]['text1']
        if score.item() in seen_scores:
            continue

        seen_scores.add(score.item())
        top_k_results.append((text, round(score.item(), 3)))

    return top_k_results
        

In [135]:
dataset = SimilarityDataset(qqp_for_finetuning['train'], tokenizer, some_sentences_from_train_set[12])
results = find_duplicates(model, dataset)

100%|█████████████████████████████████████████████████████████████| 711/711 [26:14<00:00,  2.21s/it]


In [138]:
some_sentences_from_train_set[12]

'What is the role of a brand manager?'

In [136]:
get_topk_results_simple(results, qqp_for_finetuning['train'])

[('What is the role of a brand manager?', 0.966),
 ('What does a brand manager do?', 0.921),
 ('What do brand managers do?', 0.888),
 ('What are the coloured dots on Ping Golf clubs? What do the color dots on Ping Golf Irons mean?',
  0.76),
 ('What happens when a bee  flies inside a car moving at 70kmph? Does that mean that the bee flies at 70kmph?',
  0.733)]

In [137]:
get_topk_results_filter_duplicates(results, qqp_for_finetuning['train'])

[('What is the role of a brand manager?', 0.966),
 ('What does a brand manager do?', 0.921),
 ('What do brand managers do?', 0.888),
 ('What are the coloured dots on Ping Golf clubs? What do the color dots on Ping Golf Irons mean?',
  0.76),
 ('What happens when a bee  flies inside a car moving at 70kmph? Does that mean that the bee flies at 70kmph?',
  0.733)]

In [114]:
dataset = SimilarityDataset(qqp_for_finetuning['train'], tokenizer, "Why some people are better then the others")
results = find_duplicates(model, dataset)

100%|█████████████████████████████████████████████████████████████| 711/711 [26:15<00:00,  2.22s/it]


In [134]:
get_topk_results_simple(results, qqp_for_finetuning['train'])

[('Are there people who had successful long distance relationships? Can you tell me about your successful experience with long distance relationship?',
  0.897),
 ('Are there people who had successful long distance relationships? Can you tell me about your successful experience with long distance relationship?',
  0.897),
 ('Are there people who had successful long distance relationships? Can you tell me about your successful experience with long distance relationship?',
  0.897),
 ('Are there people who had successful long distance relationships? Can you tell me about your successful experience with long distance relationship?',
  0.897),
 ('Are there people who had successful long distance relationships? Can you tell me about your successful experience with long distance relationship?',
  0.897)]

In [132]:
get_topk_results_filter_duplicates(results, qqp_for_finetuning['train'])

[('Are there people who had successful long distance relationships? Can you tell me about your successful experience with long distance relationship?',
  0.897),
 ("Why do some people succeed and others don't?", 0.862),
 ('Why does good things happen to bad people and why bad things happen to good people?',
  0.84),
 ('Why do some people succeed and others succeed more?', 0.84),
 ('How do white people who believe that white privilege exists really know it exists? What experience made you realize white privilege exists?',
  0.837)]

In [55]:
@torch.no_grad()
def find_duplicates_fast(model, dataset, device: str = 'cuda:1', topk: int = 5, take_n_batches: int = 5):
    test_loader = torch.utils.data.DataLoader(
        dataset, 
        batch_size=512, shuffle=False, 
        num_workers=2,
        collate_fn=transformers.default_data_collator
    )
    results = []
    
    for batch in tqdm(test_loader, ncols=100):
        batch = {name: tensor.to(device) for name, tensor in batch.items()}
        with torch.autocast(device_type=device, dtype=torch.float16): 
            sep_idx = torch.argwhere(batch['input_ids'][0] == 2)[0].item()
            quest_tokens = batch['input_ids'][:1, :sep_idx]
            mask = batch['attention_mask'][:, sep_idx + 1:]
            train_quest_tokens = batch['input_ids'][:, sep_idx + 1:]

            quest_embeddings = model.deberta.embeddings.word_embeddings(quest_tokens)
            quest_embeddings = quest_embeddings.mean(dim=1)
            
            train_quest_embeddings = model.deberta.embeddings.word_embeddings(train_quest_tokens)
            train_quest_embeddings = (train_quest_embeddings * mask.unsqueeze(-1)).sum(dim=1) / mask.unsqueeze(-1).sum(dim=1)
            
            cos_sim = F.cosine_similarity(quest_embeddings, train_quest_embeddings)
            results.extend(
                [*zip(batch['idx'].flatten().cpu(), cos_sim.flatten().cpu())]
            )
    
    results.sort(key=lambda x: x[-1], reverse=True)
    results_top_choise = [result_pair[0].item() for result_pair in results[:512 * take_n_batches]]
    
    subset = torch.utils.data.Subset(dataset, results_top_choise)
    return find_duplicates(model, subset, device=device, topk=topk)

In [56]:
dataset = SimilarityDataset(qqp_for_finetuning['train'], tokenizer, some_sentences_from_train_set[12])
results = find_duplicates_fast(model, dataset, take_n_batches=300)

 37%|██████████████████████▌                                      | 263/711 [00:41<01:04,  6.91it/s]Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x2b64c5d94ca0>
Traceback (most recent call last):
  File "/Vol0/user/k.tamogachev/miniforge3/envs/sase/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1477, in __del__
    self._shutdown_workers()
  File "/Vol0/user/k.tamogachev/miniforge3/envs/sase/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1460, in _shutdown_workers
    if w.is_alive():
  File "/Vol0/user/k.tamogachev/miniforge3/envs/sase/lib/python3.10/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x2b64c5d94ca0>
Traceback (most recent call last):
  File "/Vol0/user/k.tamogachev/miniforge3/envs/sase/lib/python3.10/site-pa

In [57]:
some_sentences_from_train_set[12]

'What is the role of a brand manager?'

In [58]:
get_topk_results_filter_duplicates(results, qqp_for_finetuning['train'])

[('What does a brand manager do?', 0.966),
 ('What is the role of a brand manager?', 0.963),
 ('Google Tag Manager: Did you heard about Google Tag manager? Why and how it is useful for us?',
  0.591),
 ('What does a Social Media Manager?', 0.145),
 ('What is brand management?', 0.019)]

In [59]:
dataset = SimilarityDataset(qqp_for_finetuning['train'], tokenizer, some_sentences_from_train_set[3])
results = find_duplicates_fast(model, dataset, take_n_batches=300)

100%|█████████████████████████████████████████████████████████████| 711/711 [01:40<00:00,  7.05it/s]
100%|█████████████████████████████████████████████████████████████| 300/300 [11:05<00:00,  2.22s/it]


In [60]:
some_sentences_from_train_set[3]

'Would America support Pakistan in war?'

In [61]:
get_topk_results_filter_duplicates(results, qqp_for_finetuning['train'])

[('Would America support Pakistan in war?', 0.978),
 ('Is there any reservation in sliding round of IPU and what if no seat is allotted in sliding round can I retain my seat of 3rd round as?',
  0.64),
 ('Why do some sunnis support assad?', 0.268),
 ('Will Trump approve the bill for declaring Pakistan as a Terror State?',
  0.128),
 ('What are some alternate theories to the singularity existing at t=0 just before the Big Bang? What else could have existed just before the Big Bang?',
  0.065)]

In [62]:
dataset = SimilarityDataset(qqp_for_finetuning['train'], tokenizer, some_sentences_from_train_set[5])
results = find_duplicates_fast(model, dataset, take_n_batches=300)

100%|█████████████████████████████████████████████████████████████| 711/711 [01:45<00:00,  6.72it/s]
100%|█████████████████████████████████████████████████████████████| 300/300 [11:06<00:00,  2.22s/it]


In [64]:
get_topk_results_filter_duplicates(results, qqp_for_finetuning['train'])

[('What is best way to to become a good cyber security analyst?', 0.99),
 ('Is there any reservation in sliding round of IPU and what if no seat is allotted in sliding round can I retain my seat of 3rd round as?',
  0.7),
 ('Which is the best way to prepare for upsc exam?', 0.52),
 ('How should I proceeded to become a good programmer?', 0.445),
 ('How do you become a security specialist?', 0.333)]

__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For isntance, you can form a short-list of potential candidates using a cheaper method, and then run your tranformer on that short list. If you opted for this solution, please keep both the original implementation and the optimized one - and explain briefly what is the difference there.

**Results:**
Slow method uses batches while fast method uses embeddings to compute fast scores for the sentences, then assembles shortlist of candidates, finally, applies true bert to the short list of candidates. Notably, the 'fast' method performes worse, as the precomputed 'fast scores' do not necesserily perfectly correlate with the true scores.