In [1]:
from datasets import load_dataset
import random
import pandas as pd
import math
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
import numpy as np
import os
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

In [2]:
ds = load_dataset("clarin-knext/fiqa-pl", "corpus")
ds_queries = load_dataset("clarin-knext/fiqa-pl","queries")
ds_qa = load_dataset("clarin-knext/fiqa-pl-qrels")['train'] # we will train the model on a qrels training subset

model_name = 'allegro/herbert-base-cased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allegro/herbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
corpus = pd.DataFrame(ds['corpus'])
queries = pd.DataFrame(ds_queries['queries'])
qrels = pd.DataFrame(ds_qa)

corpus.drop(columns=['title'], inplace=True)
queries.drop(columns=['title'], inplace=True)

In [4]:
print(f"rows: a) corpus: {len(corpus)} b) queries: {len(queries)} c) qrels (train subset): {len(qrels)}")
print(type(qrels['query-id'][0]), type(queries['_id'][0]), type(corpus['_id'][0]))
queries['_id'] = queries['_id'].astype(int)
corpus['_id'] = corpus['_id'].astype(int)

rows: a) corpus: 57638 b) queries: 6648 c) qrels (train subset): 14166
<class 'numpy.int64'> <class 'str'> <class 'str'>


Creating the dataset of positive pairs: I join each question with its relevant answers through qrels and sample 4,000 of the resulting pairs at random.

In [6]:
positives = qrels.merge(queries, left_on="query-id", right_on="_id").merge(corpus, left_on="corpus-id", right_on="_id")
positives = positives[['text_x', 'text_y', 'score']].rename(columns={'text_x': 'query', 'text_y': 'answer'}).sample(n=4000)

In [12]:
positives[:3]

Unnamed: 0,query,answer,score
13567,Handel poufny powiązanym papierem wartościowym...,"Jeśli jesteś w stanie uzyskać informacje, któr...",1
4034,"Pożyczanie pieniędzy, a następnie ich inwestow...",Wykorzystam 10% z tych 20 000 na spłatę pożycz...,1
11617,Podatki od sprzedaży akcji,@BlackJack robi dobrą odpowiedź na temat zyskó...,1


I build a dictionary that maps each passage to the questions it correctly answers, so that when drawing negative pairs I can check that a drawn pair is not actually a positive one.

In [15]:
qa_dict = {}

for row in ds_qa:
    query_id = row['query-id']
    corpus_id = row['corpus-id']
    
    if corpus_id not in qa_dict:
        qa_dict[corpus_id] = []
    qa_dict[corpus_id].append(query_id)

In [17]:
# Map ids to texts in a single pass; filtering the DataFrame once per id is quadratic in its size
corpus_dict = dict(zip(corpus['_id'], corpus['text']))
queries_dict = dict(zip(queries['_id'], queries['text']))
qa_queries = set(ds_qa['query-id'])

For each negative pair I pick a random passage from the whole corpus and a random question from ds_qa (that is, a question that has at least one correct answer), drawing the question again if the pair turns out to be a known positive.  
I draw 12,000 such pairs so that the ratio of positive to negative examples is 1:3.

In [19]:
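negatives_rows = []
corpus_ids = list(corpus['_id'])  # build the candidate lists once, not on every iteration
query_ids = list(qa_queries)

for i in range(12000):
    corpus_id = random.choice(corpus_ids)
    correct = qa_dict.get(corpus_id, [])  # queries for which this passage is a known answer

    query = random.choice(query_ids)
    while query in correct: # qa_dict holds lists, so test membership; if the pair is a positive, draw again
        query = random.choice(query_ids)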

    row = {
        'query': queries_dict.get(query, ''),
        'answer': corpus_dict.get(corpus_id, ''),
        'score': 0
    }

    negatives_rows.append(row)

negatives = pd.DataFrame(negatives_rows)
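
The rejection step can be sketched on toy data. The ids and the `sample_negative` helper below are hypothetical and not part of the notebook; the sketch only illustrates that `qa_dict` maps a passage to a list of query ids, so the check for a known positive is a membership test.

```python
import random

# Toy stand-ins for the notebook's structures (hypothetical ids).
qa_dict = {100: [1, 2], 200: [3]}   # corpus_id -> queries this passage answers
corpus_ids = [100, 200, 300]        # passage 300 answers no query
query_ids = [1, 2, 3]

def sample_negative(rng):
    """Draw a (query_id, corpus_id) pair that is not a known positive."""
    corpus_id = rng.choice(corpus_ids)
    known_positive = set(qa_dict.get(corpus_id, []))
    query_id = rng.choice(query_ids)
    while query_id in known_positive:   # redraw if the pair is actually relevant
        query_id = rng.choice(query_ids)
    return query_id, corpus_id

rng = random.Random(0)
negatives_sample = [sample_negative(rng) for _ in range(1000)]
```

Pairs that are relevant but missing from qrels can still slip in as negatives; this check only filters the labelled positives.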

Preparing a dataset as a collection of strings: {question} {separator} {passage}

In [21]:
dataset = pd.concat([positives, negatives], ignore_index=True)
dataset = dataset.sample(frac=1).reset_index(drop=True)

separator_token = tokenizer.sep_token
dataset['pair'] = dataset.apply(lambda row: f"{row['query']} {separator_token} {row['answer']}", axis=1)

dataset.drop(columns=['query', 'answer'], inplace=True)
dataset = dataset[['pair','score']]

In [22]:
dataset[:10]

Unnamed: 0,pair,score
0,Dlaczego transakcje bankowe nie są natychmiast...,1
1,Krótka koncepcja ruchu ceny określonej akcji [...,0
2,Budowanie niezależności finansowej </s> Ważne ...,1
3,jak późno mogę wpłacić pieniądze na konto IRA ...,1
4,Czy jako obywatel niebędący obywatelem Indii m...,0
5,Czy śledzenie pieniędzy i posiadanie budżetu t...,1
6,Czy istnieją karty kredytowe z okresem wyciągu...,0
7,Dlaczego nigdy nie widziałem podziału akcji? <...,1
8,Kiedy zlecenia stop/limit są widoczne na otwar...,0
9,"Przyznanie akcji, podatki i IRS </s> „Zawsze m...",0


Splitting dataset with stratification

In [24]:
train_df, test_df = train_test_split(dataset, test_size=0.2, stratify=dataset['score'], random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.2, stratify=train_df['score'], random_state=42)

In [25]:
train_pairs = train_df['pair'].tolist()
val_pairs = val_df['pair'].tolist()
test_pairs = test_df['pair'].tolist()

train_labels = train_df['score'].tolist()
val_labels = val_df['score'].tolist()
test_labels = test_df['score'].tolist()

Tokenizing the data, since the model operates on token IDs rather than raw text

In [27]:
def tokenize_data(pairs, labels):
    encoding = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
    encoding['labels'] = torch.tensor(labels)
    return encoding

train_encodings = tokenize_data(train_pairs, train_labels)
val_encodings = tokenize_data(val_pairs, val_labels)
test_encodings = tokenize_data(test_pairs, test_labels)
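
Calling the tokenizer with `padding=True` on a whole split pads every example to the longest sequence in that split. A rough, library-free sketch of what that costs compared with padding each batch separately; the token counts are invented for illustration and are not measured on this dataset.

```python
# Hypothetical token counts for eight examples (not taken from the dataset).
lengths = [12, 35, 18, 22, 480, 60, 510, 40]
batch_size = 4

# Padding the whole split at once: every row is as long as the longest one.
global_cost = len(lengths) * max(lengths)

# Padding per batch: each batch is only as long as its own longest row.
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
per_batch_cost = sum(len(batch) * max(batch) for batch in batches)

print(global_cost, per_batch_cost)  # 4080 2180
```

In `transformers`, per-batch padding is what `DataCollatorWithPadding` provides when the data is tokenized without padding; whether it helps here depends on the actual length distribution.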

In [28]:
from torch.utils.data import Dataset

class TextPairDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])

In [29]:
train_dataset = TextPairDataset(train_encodings)
val_dataset = TextPairDataset(val_encodings)
test_dataset = TextPairDataset(test_encodings)

In [30]:
# import transformers
# import accelerate
# print("Transformers version:", transformers.__version__)
# print("Accelerate version:", accelerate.__version__)

In [31]:
# pip install accelerate>=0.26.0

During training we compute several metrics, but the best model is selected by F1 score, which is better suited to imbalanced datasets than accuracy. The F1 reported here is the support-weighted average over both classes.

In [33]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1) # converting logits to class labels

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="weighted") #!

    return {
        "eval_accuracy": accuracy,
        "eval_precision": precision,
        "eval_recall": recall,
        "eval_f1": f1
    }
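
With `average="weighted"`, each class's F1 is weighted by its share of the true labels, so on 1:3 data the negative class dominates the reported value. A hand-computed sketch on toy labels (the labels and the `f1_for_class` helper are illustrative, not from the notebook):

```python
def f1_for_class(labels, preds, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy 1:3 split: 2 positives, 6 negatives; the classifier misses one positive.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 0, 0, 0, 0, 0, 0, 0]

f1_pos = f1_for_class(labels, preds, 1)          # 2/3
f1_neg = f1_for_class(labels, preds, 0)          # 12/13
# Weighted average: each class's F1 weighted by its number of true labels.
f1_weighted = (2 * f1_pos + 6 * f1_neg) / 8
```

Here the weighted F1 comes out well above the positive-class F1, so it is worth also looking at the positive-class score when the positive class is the one that matters for ranking.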

In [46]:
print("Is GPU available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

Is GPU available: True
GPU: NVIDIA GeForce RTX 3050 Ti Laptop GPU


In [48]:
model.to("cuda")

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    load_best_model_at_end=True,  # Reload the checkpoint with the best F1 score when training ends
    metric_for_best_model="eval_f1", 
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    run_name="lab5_run",
    logging_steps=30,
    save_strategy="epoch",  # Save a checkpoint every epoch, matching the evaluation strategy
    learning_rate=5e-05,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir='./logs',
    gradient_accumulation_steps=2,
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)
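
The batch and accumulation settings above imply the following arithmetic. This is a back-of-the-envelope sketch that assumes a single GPU and the 4,000 + 12,000 pairs and 80/20 splits used earlier; the variable names are illustrative.

```python
import math

total_pairs = 4000 + 12000                 # positives + negatives
n_test = total_pairs // 5                  # first split: 20% held out for testing
n_val = (total_pairs - n_test) // 5        # second split: 20% of the remainder
n_train = total_pairs - n_test - n_val

per_device_batch = 32
grad_accum_steps = 2
# Gradients from 2 batches of 32 are accumulated before each optimizer step.
effective_batch = per_device_batch * grad_accum_steps

updates_per_epoch = math.ceil(n_train / effective_batch)
total_updates = updates_per_epoch * 5      # num_train_epochs=5

print(n_train, n_val, n_test)              # 10240 2560 3200
print(effective_batch, updates_per_epoch, total_updates)  # 64 160 800
```

The validation size of 2,560 agrees with the evaluation output further down, where roughly 103 samples per second over about 24.8 seconds gives the same count.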



Training was commented out because the model was trained on Google Colab using a more powerful GPU, then saved to a computer from which the trained model is now being loaded.

In [51]:
# trainer.train()

In [53]:
from elasticsearch import Elasticsearch
client = Elasticsearch(
    "http://localhost:9200",
    basic_auth=("elastic", "CzemuTakieDlugie24"),
    verify_certs=False
)

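The code is not pretty: I reuse the search function from lab2, which was written for 4 analyzers, so it still returns a list of result lists even though only one analyzer is used here.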

In [55]:
def search_text(index_name: str, queries: dict, k: int):

    model_results = [[] for _ in range(1)] 
    
    for query_id, query_text in queries.items():
        search_bodies = [
            {
                "query": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["without_synonyms_with_lemmatizer"]
                    }
                },
                "size": k
            }
        ]

        for i, search_body in enumerate(search_bodies):
            response = client.search(index=index_name, body=search_body)
            
            for hit in response['hits']['hits']:
                model_results[i].append({
                    'query-id': query_id,
                    'corpus-id': int(hit['_id'])
                })

    return model_results

In [57]:
from collections import defaultdict, OrderedDict
ds_qa = load_dataset("clarin-knext/fiqa-pl-qrels")['test']  # Testing model and calculating NDCG on TEST data
test_query_ids = set(ds_qa['query-id'])
filtered_queries = OrderedDict(sorted((int(row['_id']), row['text']) for row in ds_queries['queries'] if int(row['_id']) in test_query_ids))

In [59]:
results = search_text("fiqa_pl_index_v3", filtered_queries, 30)[0] # we search top 30 texts for each query from the test set
print(len(results))
results[:3]

19440


[{'query-id': 8, 'corpus-id': 309023},
 {'query-id': 8, 'corpus-id': 65404},
 {'query-id': 8, 'corpus-id': 438975}]

### Loading the trained model

In [61]:
path = r"C:\Users\karol\Desktop\STUDIA\pjn\my_model"
print(os.listdir(path))

['config.json', 'merges.txt', 'model.safetensors', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'vocab.json']


In [63]:
model = AutoModelForSequenceClassification.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

trainer = Trainer(
    model=model,                   # Model to evaluate (loaded from disk)
    args=training_args,            # Training arguments
    train_dataset=train_dataset,   # Training dataset
    eval_dataset=val_dataset,       # Validation dataset
    compute_metrics=compute_metrics
)

Model evaluation

In [66]:
trainer.evaluate()

{'eval_accuracy': 0.953125,
 'eval_precision': 0.9529909490256199,
 'eval_recall': 0.953125,
 'eval_f1': 0.9530511832718206,
 'eval_loss': 0.17457541823387146,
 'eval_model_preparation_time': 0.0035,
 'eval_runtime': 24.8192,
 'eval_samples_per_second': 103.146,
 'eval_steps_per_second': 3.223}

Here I convert the retrieved (query, passage) pairs for the test queries into the format expected by the model, which is: query </s> text. Then I tokenize them, as the model requires tokenized input.

In [67]:
pairs_to_classify = []

for result in results:
    first_part = filtered_queries[result['query-id']] + " </s>"
    second_part = corpus_dict[result['corpus-id']]
    pair = first_part + second_part
    pairs_to_classify.append(pair)

pairs_tokenized = [tokenizer(pair, return_tensors="pt", truncation=True).to("cuda") for pair in pairs_to_classify]
len(pairs_to_classify)

19440

I create a list of (query_id, corpus_id) tuples so that the prediction results can be stored as a dictionary: (query_id, corpus_id): (predicted_class, probability)

In [69]:
results_tuple = [(d['query-id'], d['corpus-id']) for d in results]

Here we make predictions: for each tokenized pair, we compute the predicted class and the probability the model assigns to the positive class. We save both in predictions.

In [73]:
predictions = {}

for i, inputs in enumerate(pairs_tokenized):

    if i % 1000 == 0: # just tracking how much is left
        print(i)

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

        probabilities = F.softmax(logits, dim=-1)
        predicted_class = torch.argmax(logits, dim=-1).item()
        positive_class_prob = probabilities[0][1].item() 

    predictions[results_tuple[i]] = (predicted_class, positive_class_prob)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
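
For a two-class head, the positive-class probability used for ranking is a softmax over the two logits. A standard-library sketch with made-up logits (the values are hypothetical, not model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]   # hypothetical [negative, positive] scores for one pair
probs = softmax(logits)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])
positive_class_prob = probs[1]
```

With two classes this probability equals a sigmoid of the logit difference, so ranking passages by it is the same as ranking them by `logits[1] - logits[0]`.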


I sort the predictions by query_id and, within each query, by descending positive-class probability

In [76]:
pred_sorted = dict(
    sorted(
        predictions.items(),
        key=lambda item: (item[0][0], -item[1][1])
    )
)

Then from each group we keep only the k most certain results (5 in our case, because we will be calculating NDCG@5 as in lab2)

In [79]:
k=5
grouped = defaultdict(list)
for key, value in pred_sorted.items():
    grouped[key[0]].append((key, value))

# Keep only the first k (here 5) elements of each group
filtered_data = {key: v[:k] for key, v in grouped.items()}

# Merge the results back into a single dictionary
pred = {key: value for group in filtered_data.values() for key, value in group}
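
The sort-then-truncate steps above can also be written as a single grouping pass. A minimal sketch on toy predictions; the ids, probabilities, and the `top_k_per_query` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical predictions: (query_id, corpus_id) -> (predicted_class, positive_prob)
toy_predictions = {
    (8, 11): (1, 0.91), (8, 12): (0, 0.20), (8, 13): (1, 0.75),
    (15, 21): (0, 0.40), (15, 22): (1, 0.88),
}

def top_k_per_query(predictions, k):
    """Return {query_id: [corpus_id, ...]} with the k most probable passages."""
    by_query = defaultdict(list)
    for (query_id, corpus_id), (_, prob) in predictions.items():
        by_query[query_id].append((prob, corpus_id))
    return {
        query_id: [cid for _, cid in sorted(scored, key=lambda pc: pc[0], reverse=True)[:k]]
        for query_id, scored in by_query.items()
    }

top2 = top_k_per_query(toy_predictions, k=2)
print(top2)  # {8: [11, 13], 15: [22, 21]}
```

Keeping the result keyed by query id avoids relying on dictionary order later on.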

Then I convert the data to the same format as in lab2 so that the same evaluation code can be reused. In lab2 there were 4 analyzers, so the results were a list of lists; here there is only one, but for consistency I still wrap it in an outer list ([[ ]])

In [82]:
formatted_pred = [[{'query-id': query_id, 'corpus-id': corpus_id, 'analyzer': 0} for (query_id, corpus_id) in pred]]

Creating a list of the relevant passages for each test query (taken from the ds_qa test split)

In [85]:
correct_results = defaultdict(list)

for entry in ds_qa:
    query_id = entry['query-id']
    corpus_id = entry['corpus-id']
    correct_results[query_id].append(corpus_id)

correct_list = [{'query-id': query_id, 'corpuses-id': corpuses} for query_id, corpuses in correct_results.items()]

In [87]:
def get_DCG(relevance_scores, k):
    return sum(rel / np.log2(idx + 2) for idx, rel in enumerate(relevance_scores[:k]))

def get_NDCG(model_results, relevant_documents, k):

    relevance_scores = [1 if doc['corpus-id'] in relevant_documents else 0 for doc in model_results]
    dcg_k = get_DCG(relevance_scores, k)

    num_docs = len(relevant_documents)
    ideal_relevance_scores = [1] * min(num_docs, k) + [0] * (k - min(num_docs, k))
    idcg_k = get_DCG(ideal_relevance_scores, k)

    ndcg_k = dcg_k / idcg_k if idcg_k > 0 else 0
    return ndcg_k
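
A small worked example of the same binary-relevance NDCG, using only the standard library. The passage ids are hypothetical, and the function names differ from the notebook's to keep the sketch self-contained.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with a log2(rank + 1) discount (ranks start at 1)."""
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevant_ids, k):
    gains = [1 if doc_id in relevant_ids else 0 for doc_id in ranked_ids]
    ideal = [1] * min(len(relevant_ids), k)   # all relevant passages ranked first
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical ranking: relevant passages at positions 2 and 4, two relevant in total.
ranked = [501, 502, 503, 504, 505]
relevant = {502, 504}
score = ndcg_at_k(ranked, relevant, k=5)
```

Here DCG is 1/log2(3) + 1/log2(5) and the ideal DCG is 1 + 1/log2(3), which gives a score of about 0.65; moving both relevant passages to the top two positions would give 1.0.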

In [89]:
ndcgs = []
k=5

for j in range(len(formatted_pred)):
    results = []
    for i in range(len(correct_list)):
        ndcg = get_NDCG(formatted_pred[j][i*k:(i+1)*k], correct_list[i]['corpuses-id'], k)
        results.append((correct_list[i]['query-id'], float(ndcg)))

    ndcgs.append(results)
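
The loop above pairs the i-th slice of k predictions with the i-th entry of `correct_list`, which assumes that both are in the same query order and that every query kept exactly k passages. A sketch that looks predictions up by query id instead, on toy data in the same record format; all ids and names here are hypothetical.

```python
import math
from collections import defaultdict

def ndcg(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k for one query."""
    dcg = sum(1 / math.log2(r + 2) for r, d in enumerate(ranked_ids[:k]) if d in relevant_ids)
    idcg = sum(1 / math.log2(r + 2) for r in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg > 0 else 0.0

toy_pred = [
    {'query-id': 8, 'corpus-id': 11}, {'query-id': 8, 'corpus-id': 12},
    {'query-id': 15, 'corpus-id': 21},
]
toy_correct = [
    {'query-id': 15, 'corpuses-id': [21]},   # deliberately in a different order
    {'query-id': 8, 'corpuses-id': [12]},
    {'query-id': 42, 'corpuses-id': [99]},   # query with no retrieved passages
]

ranked_by_query = defaultdict(list)
for record in toy_pred:                      # assumes toy_pred is already ranked per query
    ranked_by_query[record['query-id']].append(record['corpus-id'])

scores = {
    entry['query-id']: ndcg(ranked_by_query[entry['query-id']], set(entry['corpuses-id']), k=5)
    for entry in toy_correct
}
mean_ndcg = sum(scores.values()) / len(scores)
```

A query with fewer than k hits, or none at all, then simply scores lower instead of shifting the slices of every query after it.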

In [91]:
for i in range(len(ndcgs)):
    avg = sum(value for _, value in ndcgs[i]) / len(ndcgs[i])
    print(f"Result after training: {round(avg*100,2)} %")

Result after training: 21.45 %


In Lab2, the result for the same dataset and analyzer was 18.51%. Reranking the top 30 retrieved passages with the fine-tuned classifier raises NDCG@5 to 21.45%, a gain of about 3 percentage points (roughly 16% relative). The absolute difference is modest, but it is a clear improvement over the purely lexical baseline.

## Questions

### 1) Do you think simpler methods, like Bayesian bag-of-words model, would work for sentence-pair classification? Justify your answer.

Although Bayesian and bag-of-words models can be used for sentence-pair classification, they generally lack the ability to capture the nuanced semantic relationships and contextual dependencies required for such tasks. For example, the bag-of-words model treats words in isolation, ignoring their order and interdependencies. For this reason, these models are not ideal and would likely yield subpar results compared to modern NLP models that are explicitly designed for such tasks.
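
A tiny illustration of the word-order point, using an invented sentence pair and a hypothetical `bag_of_words` helper: the two sentences mean different things, yet their bags of words are identical.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercased whitespace tokens with counts; word order is discarded."""
    return Counter(text.lower().split())

a = "the bank charged the customer a fee"
b = "the customer charged the bank a fee"

same_bag = bag_of_words(a) == bag_of_words(b)   # True: identical word counts
same_text = a == b                               # False: the sentences differ
```

Any classifier that sees only these counts receives exactly the same features for both sentences.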


### 2) What hyper-parameters have you selected for the training? What resources (papers, tutorials) have you consulted to select these hyper-parameters?

For the chosen hyperparameters, I decided on the following values:

- learning_rate=5e-05: this is one of the default values used for this type of problem; based on discussions I found on forums, it seems to perform well.
- batch_size=32: I experimented with various values, and this one gave a good balance between training stability, model generalization, and training speed. Combined with gradient_accumulation_steps=2, the effective batch size is 64.
- epochs=5: I conducted several experiments and did not observe a significant improvement from one epoch to the next, so I limited training to five epochs, especially since the model already took a considerable amount of time to train.
- fp16: after some research, I found that 16-bit floating point precision can speed up training, especially on certain GPUs.

### 3) Think about pros and cons of the neural-network models with respect to natural language processing. Provide at least 2 pros and 2 cons.

#### Pros
- Neural networks, especially deep learning models like transformers (e.g., BERT, GPT), excel at learning complex and subtle patterns in data. They can capture long-range dependencies in text, such as context between words that are far apart in a sentence, which simpler models like bag-of-words or rule-based systems cannot.
- Neural network models can be fine-tuned for specific tasks with relatively small amounts of labeled data, thanks to techniques like transfer learning. Models such as BERT and GPT can be adapted to new tasks with excellent performance even with limited task-specific data. This greatly reduces the need for extensive labeled datasets.

#### Cons
- Training deep neural networks requires significant computational resources, including powerful GPUs or TPUs and considerable time. Additionally, fine-tuning large pre-trained models for specific tasks can still be resource-intensive, even though it requires less data.
- It is difficult to understand how neural networks make their decisions. This lack of transparency can be problematic, especially in sensitive domains like healthcare, law, or finance, where understanding the rationale behind a model's prediction is critical.