# Week 4: Transfer Learning, BERT (Homework)

## Question Search Engine

Embeddings are a good source of information for solving various tasks. For example, we can classify texts or find similar documents using their representations. We already know about word2vec, GloVe and fastText, but these models assign each word a single fixed vector learned from the training corpus, so they do not use the context in which the word appears in the given text.

Today we will use the full power of context-aware embeddings to find duplicate texts!

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets

### Data Preparation

In [2]:
qqp = datasets.load_dataset("SetFit/qqp")
print("\n")
print("Sample[0]:", qqp["train"][0])
print("Sample[3]:", qqp["train"][3])

Generating train split:   0%|          | 0/363846 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/40430 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/390965 [00:00<?, ? examples/s]



Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)


In [4]:
MAX_LENGTH = 128

def preprocess_function(examples):
    result = tokenizer(
        examples["text1"],
        examples["text2"],
        padding="max_length",
        max_length=MAX_LENGTH,
        truncation=True,
    )

    result["label"] = examples["label"]

    return result

In [5]:
qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [6]:
print(repr(qqp_preprocessed["train"][0]["input_ids"])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [7]:
val_set = qqp_preprocessed["validation"]
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [13]:
from tqdm import tqdm

# Check if CUDA is available and set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move model to GPU
model = model.to(device)

# Evaluation code
correct_predictions = 0
total_predictions = 0

model.eval()  # Set model to evaluation mode

# Use tqdm to show progress
progress_bar = tqdm(val_loader, desc="Evaluating", unit="batch")

for i, batch in enumerate(progress_bar):
    # Move batch to GPU
    batch = {key: value.to(device) for key, value in batch.items()}

    with torch.no_grad():
        # Get model predictions
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            token_type_ids=batch["token_type_ids"],
        )

        # Get predicted class (0 or 1)
        predictions = torch.argmax(outputs.logits, dim=-1)

        # Compare with true labels
        true_labels = batch["labels"]  # default_data_collator renames "label" to "labels"
        correct_predictions += (predictions == true_labels).sum().item()
        total_predictions += true_labels.size(0)

        # Update the progress bar with the current accuracy every 100 batches
        if (i + 1) % 100 == 0:
            current_accuracy = correct_predictions / total_predictions
            progress_bar.set_postfix({"Current Acc": f"{current_accuracy:.4f}"})

# Calculate final accuracy
accuracy = correct_predictions / total_predictions
print(f"\nFinal Validation Accuracy: {accuracy:.4f}")
print(f"Correct predictions: {correct_predictions}/{total_predictions}")


with torch.no_grad():
    predicted = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        token_type_ids=batch["token_type_ids"],
    )

Using device: cuda


Evaluating: 100%|██████████| 40430/40430 [08:13<00:00, 81.85batch/s, Current Acc=0.9085]



Final Validation Accuracy: 0.9084
Correct predictions: 36726/40430


**Task 1 (1 point)**

- Measure the validation accuracy of your model. Doing so naively may take several hours. Please make sure you use the following optimizations:
  - Run the model on a GPU under `torch.no_grad()`
  - Use a batch size larger than 1
  - Use an optimized data loader with `num_workers` > 1
  - (Optional) Use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)
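
The optimizations listed above could be combined roughly as in the sketch below. The helper names `make_eval_loader` and `evaluate_accuracy`, the batch size of 64 and the choice to enable mixed precision only on CUDA are assumptions made for illustration, not part of the assignment. The function keeps only the keys the model accepts, because the collated batch may also carry extra columns such as `idx`.

```python
import torch
from torch.utils.data import DataLoader

MODEL_INPUT_KEYS = ("input_ids", "attention_mask", "token_type_ids")


def make_eval_loader(dataset, collate_fn, batch_size=64, num_workers=2):
    """Build a batched, multi-worker loader (hypothetical helper)."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
        collate_fn=collate_fn,
    )


def evaluate_accuracy(model, loader, device, use_amp=False):
    """Return the accuracy of `model` over an iterable of dict batches."""
    model = model.to(device).eval()
    correct, total = 0, 0
    # Assumption of this sketch: autocast is only switched on for CUDA devices.
    amp_enabled = use_amp and device.type == "cuda"
    with torch.no_grad():
        for batch in loader:
            # Pass only the tensors the model expects; skip columns like "idx".
            inputs = {k: batch[k].to(device) for k in MODEL_INPUT_KEYS if k in batch}
            labels = batch["labels"].to(device)
            with torch.autocast(device_type=device.type, enabled=amp_enabled):
                logits = model(**inputs).logits
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.size(0)
    return correct / total


# Possible usage with the objects defined in this notebook:
# loader = make_eval_loader(val_set, transformers.default_data_collator)
# accuracy = evaluate_accuracy(model, loader, device, use_amp=True)
```

With a batch size of 64 the 40,430 validation pairs need about 632 forward passes instead of 40,430, which is where most of the speed-up would be expected to come from.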


In [14]:
assert 0.9 < accuracy < 0.91

### Training (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second on your hardware setup, and (3) its size in megabytes. Please take care to compare the models under equal settings, e.g. the same CPU / GPU. Compile your results into a table and write a short report (about half a page, in addition to the table) summarizing your findings.
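
For Option B, size and speed could be measured along the lines of the sketch below. The helper names `model_size_mb` and `samples_per_second` are hypothetical, 1 MB is taken as 10^6 bytes, and the batches are assumed to contain only tensors the model accepts as keyword arguments. Size here counts in-memory parameters and buffers, which may differ slightly from the checkpoint file size on disk.

```python
import time

import torch


def model_size_mb(model):
    """Size of all parameters and buffers in megabytes (1 MB = 10**6 bytes)."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / 1e6


def samples_per_second(model, batches, device, warmup=1):
    """Rough inference throughput over a list of dict batches."""
    model = model.to(device).eval()
    batches = list(batches)
    n_samples = 0
    with torch.no_grad():
        # Warm-up passes are not timed (they absorb one-off setup costs).
        for batch in batches[:warmup]:
            model(**{k: v.to(device) for k, v in batch.items()})
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for pending GPU work before timing
        start = time.perf_counter()
        for batch in batches:
            inputs = {k: v.to(device) for k, v in batch.items()}
            model(**inputs)
            n_samples += next(iter(inputs.values())).size(0)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return n_samples / elapsed
```

To keep the comparison fair, the same batches, batch size and device would be reused for every model in the table.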

**Task 2 (4 points)**
- Choose Option A or Option B (only one will be graded)
- Follow all the instructions and restrictions

In [18]:
import torch
import torch.nn as nn
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from sklearn.metrics import accuracy_score
import numpy as np
from tqdm import tqdm

# Choose DeBERTa-v3-small for faster training (or use base if you prefer)
model_name = "microsoft/deberta-v3-small"  # Smaller model for faster training
print(f"Using model: {model_name}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)

print(f"Model loaded. Number of parameters: {model.num_parameters():,}")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["text1"],
        examples["text2"],
        padding=False,
        max_length=128,
        truncation=True,
    )

# Tokenize datasets
print("Tokenizing datasets...")
train_dataset = qqp["train"].map(tokenize_function, batched=True, remove_columns=["text1", "text2", "idx", "label_text"])
val_dataset = qqp["validation"].map(tokenize_function, batched=True, remove_columns=["text1", "text2", "idx", "label_text"])

# Rename label column to labels
train_dataset = train_dataset.rename_column("label", "labels")
val_dataset = val_dataset.rename_column("label", "labels")

# Use smaller subset for faster training
train_dataset = train_dataset.select(range(10000))  # 10k samples
val_dataset = val_dataset.select(range(1000))       # 1k samples

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Evaluation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Simplified training arguments
training_args = TrainingArguments(
    output_dir="./deberta-qqp-finetuned",
    learning_rate=3e-5,
    per_device_train_batch_size=8,  # Smaller batch size
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    warmup_steps=200,
    logging_steps=100,
    eval_strategy="epoch",  # Evaluate each epoch instead of steps
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",
    greater_is_better=True,
    remove_unused_columns=True,
    dataloader_num_workers=0,  # Set to 0 to avoid multiprocessing issues
    fp16=False,  # Disable mixed precision to avoid issues
    gradient_checkpointing=False,  # Disable gradient checkpointing
    report_to=[],
    seed=42,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,  # Use processing_class instead of tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on device: {device}")

# Clear any cached gradients
if hasattr(model, 'zero_grad'):
    model.zero_grad()

# Train the model
print("Starting training...")
try:
    train_result = trainer.train()
    print("Training completed successfully!")

    # Evaluate on validation set
    print("\nEvaluating on validation set...")
    eval_results = trainer.evaluate()

    print(f"\nFinal Validation Results:")
    for key, value in eval_results.items():
        print(f"{key}: {value:.4f}")

    final_accuracy = eval_results["eval_accuracy"]

    # Save the model
    print("Saving model...")
    trainer.save_model("./deberta-qqp-final")
    tokenizer.save_pretrained("./deberta-qqp-final")

except Exception as e:
    print(f"Training error: {e}")
    print("Attempting basic evaluation...")

    # Try manual evaluation
    model.eval()
    correct = 0
    total = 0

    # Create a simple dataloader for evaluation
    from torch.utils.data import DataLoader

    eval_dataloader = DataLoader(
        val_dataset,
        batch_size=16,
        collate_fn=data_collator
    )

    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Manual evaluation"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            predictions = torch.argmax(outputs.logits, dim=-1)
            correct += (predictions == batch["labels"]).sum().item()
            total += batch["labels"].size(0)

    final_accuracy = correct / total
    print(f"Manual evaluation accuracy: {final_accuracy:.4f}")

# Test inference
print("\nTesting inference...")
model.eval()
test_samples = [
    ("What is machine learning?", "What is ML?"),
    ("How to cook pasta?", "What is quantum physics?"),
    ("Best pizza in New York", "Top pizza places in NYC")
]

model = model.to(device)  # Move the model once, outside the loop

with torch.no_grad():
    for text1, text2 in test_samples:
        inputs = tokenizer(text1, text2, return_tensors="pt", padding=True, truncation=True, max_length=128)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

        result = "DUPLICATE" if predicted_class == 1 else "NOT DUPLICATE"
        print(f"'{text1}' vs '{text2}' -> {result} (confidence: {confidence:.3f})")

print(f"\n✅ Final accuracy: {final_accuracy:.4f}")


Using model: microsoft/deberta-v3-small


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded. Number of parameters: 141,896,450
Tokenizing datasets...


Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Training samples: 10000
Validation samples: 1000


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.


Training on device: cuda
Starting training...


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3755,0.335523,0.862


Training completed successfully!

Evaluating on validation set...



Final Validation Results:
eval_loss: 0.3355
eval_accuracy: 0.8620
eval_runtime: 1.1581
eval_samples_per_second: 863.4580
eval_steps_per_second: 54.3980
epoch: 1.0000
Saving model...

Testing inference...
'What is machine learning?' vs 'What is ML?' -> NOT DUPLICATE (confidence: 0.976)
'How to cook pasta?' vs 'What is quantum physics?' -> NOT DUPLICATE (confidence: 0.999)
'Best pizza in New York' vs 'Top pizza places in NYC' -> DUPLICATE (confidence: 0.888)

✅ Final accuracy: 0.8620


### Finding Duplicates (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds the top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

**Task 3 (1 point)**
- Implement a function for finding duplicates
- Test it on several examples (at least 5)
- Check the suggested duplicates and draw a conclusion about model correctness
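
A slow version scores one candidate per forward pass. If speed becomes a concern, the candidates could be scored in batches, as in the sketch below. The names `top_k_duplicates` and `make_pair_scorer` are hypothetical, the batch size of 64 is an arbitrary choice, and the scorer assumes that class 1 means "duplicate", as in the QQP labels shown earlier.

```python
import heapq

import torch


def top_k_duplicates(query, candidates, score_batch, k=5, batch_size=64):
    """Return the k (candidate, score) pairs with the highest scores."""
    scored = []
    for start in range(0, len(candidates), batch_size):
        chunk = candidates[start:start + batch_size]
        scored.extend(zip(chunk, score_batch(query, chunk)))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def make_pair_scorer(model, tokenizer, device, max_length=128):
    """Build score_batch(query, chunk) -> list of duplicate probabilities."""
    model = model.to(device).eval()

    def score_batch(query, chunk):
        # Pair the query with every candidate in the chunk and pad to the
        # longest pair in this chunk only.
        inputs = tokenizer(
            [query] * len(chunk),
            list(chunk),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        return probs[:, 1].tolist()  # assumes label 1 means "duplicate"

    return score_batch


# Possible usage with the objects defined in this notebook:
# scorer = make_pair_scorer(model, tokenizer, device)
# top_k_duplicates("How do I learn programming?", questions_list, scorer)
```

Separating the ranking from the scoring function also makes it easy to check the ranking logic with a cheap stand-in scorer before running the real model.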

In [19]:
import torch
import torch.nn.functional as F
from tqdm import tqdm

def find_duplicates(query_question, model, tokenizer, questions_list, top_k=5):
    scores = []
    # Move the model to the target device once, not once per candidate
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()

    print(f"Searching for duplicates of: '{query_question}'")

    with torch.no_grad():
        for candidate in tqdm(questions_list, desc="Computing similarities"):
            if candidate.strip().lower() == query_question.strip().lower():
                continue  # Skip identical questions

            # Tokenize pair of questions
            inputs = tokenizer(
                query_question,
                candidate,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            )

            # Move inputs to the same device as the model
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Get model prediction
            outputs = model(**inputs)
            probs = F.softmax(outputs.logits, dim=-1)
            duplicate_prob = probs[0][1].item()  # Probability of being duplicate

            scores.append((candidate, duplicate_prob))

    # Sort by duplicate probability
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

# Prepare question list from training data
print("Preparing question list...")
questions_list = []
for example in qqp["train"].select(range(5000)):
    questions_list.extend([example["text1"], example["text2"]])

# Remove duplicates and empty questions
questions_list = list(set([q.strip() for q in questions_list if q.strip()]))
print(f"Total unique questions: {len(questions_list)}")

# Test queries
test_queries = [
    "How do I learn programming?",
    "What is the best way to lose weight?",
    "How can I make money online?",
    "What are good programming languages to learn?",
    "How do I improve my English?"
]

print("DUPLICATE DETECTION RESULTS")

for i, query in enumerate(test_queries, 1):
    print(f"\n Query {i}: '{query}'")
    print("-" * 40)

    try:
        duplicates = find_duplicates(query, model, tokenizer, questions_list, top_k=5)

        print("Top 5 potential duplicates:")
        for rank, (question, score) in enumerate(duplicates, 1):
            print(f"{rank}. [{score:.3f}] {question}")

    except Exception as e:
        print(f"Error: {e}")

# Test on actual duplicate pair from validation
print(f"\n Testing on known duplicate pair:")
val_example = qqp["validation"][3]  # Note: the duplicate seen earlier was train[3]; validation[3] is a different pair, so rely on its actual label
text1, text2 = val_example["text1"], val_example["text2"]

print(f"Question 1: '{text1}'")
print(f"Question 2: '{text2}'")
print(f"Actual label: {'DUPLICATE' if val_example['label'] == 1 else 'NOT DUPLICATE'}")

# Test our model on this pair
inputs = tokenizer(text1, text2, return_tensors="pt", padding=True, truncation=True, max_length=128)
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    predicted_prob = probs[0][1].item()
    prediction = "DUPLICATE" if predicted_prob > 0.5 else "NOT DUPLICATE"

print(f"Model prediction: {prediction} (confidence: {predicted_prob:.3f})")


Preparing question list...
Total unique questions: 9821
DUPLICATE DETECTION RESULTS

 Query 1: 'How do I learn programming?'
----------------------------------------
Searching for duplicates of: 'How do I learn programming?'


Computing similarities: 100%|██████████| 9821/9821 [02:39<00:00, 61.68it/s]


Top 5 potential duplicates:
1. [0.898] How should you start learning programming?
2. [0.864] How can I learn programming from scratch?
3. [0.585] How would demonetizing 500 and 1000 rupee notes and introducing new 2000 rupee notes help curb black money and corruption?
4. [0.565] Could dark matter fill 'empty' space and be displaced by matter? Could the Milky Way's halo be the state of displacement of the dark matter?
5. [0.493] If the Indian government has decided to demonetise 500 and 1000 rupee notes, why are they bringing back new 500 and 2000 Rs notes?

 Query 2: 'What is the best way to lose weight?'
----------------------------------------
Searching for duplicates of: 'What is the best way to lose weight?'


Computing similarities: 100%|██████████| 9821/9821 [02:34<00:00, 63.42it/s]


Top 5 potential duplicates:
1. [0.963] What are the best simple ways to loose weight?
2. [0.963] What are the best ways to lose weight?
3. [0.963] What is the best and quick way to lose weight?
4. [0.962] What are the best way of loose the weight?
5. [0.962] What is the fastest way to lose weight?

 Query 3: 'How can I make money online?'
----------------------------------------
Searching for duplicates of: 'How can I make money online?'


Computing similarities: 100%|██████████| 9821/9821 [02:35<00:00, 63.28it/s]


Top 5 potential duplicates:
1. [0.960] How can I realistically make money online?
2. [0.960] How can i make money online easily?
3. [0.960] How do I can make extra money online?
4. [0.960] How could I make money online?
5. [0.960] How can one make money online?

 Query 4: 'What are good programming languages to learn?'
----------------------------------------
Searching for duplicates of: 'What are good programming languages to learn?'


Computing similarities: 100%|██████████| 9821/9821 [02:35<00:00, 63.23it/s]


Top 5 potential duplicates:
1. [0.939] What is the best programming language for beginners to learn?
2. [0.936] What is the best programming/coding language to learn?
3. [0.915] What are the best programming languages for beginners and why?
4. [0.913] Which is the best programming language for beginners?
5. [0.910] What are the best programming languages to learn today?

 Query 5: 'How do I improve my English?'
----------------------------------------
Searching for duplicates of: 'How do I improve my English?'


Computing similarities: 100%|██████████| 9821/9821 [02:35<00:00, 63.16it/s]


Top 5 potential duplicates:
1. [0.962] How can I improve my English Language?
2. [0.960] How can I improve my English speaking ability?
3. [0.960] How could I improve my English?
4. [0.960] How can I continue to improve my English?
5. [0.959] How I can improve my English communication?

 Testing on known duplicate pair:
Question 1: 'Why are people so obsessed with having a girlfriend/boyfriend?'
Question 2: 'How can a single male have a child?'
Actual label: NOT DUPLICATE
Model prediction: NOT DUPLICATE (confidence: 0.003)
