<a href="https://colab.research.google.com/github/Kkordik/NovelQSI/blob/main/Compare_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this notebook I compare two pairs of models:

**Base models:**
- mrm8488/longformer-base-4096-finetuned-squadv2
- deepset/deberta-v3-base-squad2

With parameters:

`dataset_name = "Kkordik/TriviaQA_SQuAD"`

`split = "train"`

`intervals = [[1, 512], [512, 1024], [1024, 1500], [1500, 2000], [2000, 2500]]`

*The code differences for this run are commented and explained in the evaluation cells.*

**The chosen base model against the fine-tuned model:**
- mrm8488/longformer-base-4096-finetuned-squadv2
- Kkordik/test_longformer_4096_qsi

With parameters:

`dataset_name = "Kkordik/NovelQSI"`

`split = "test"`

`intervals = [[2100, 2300]]`
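
Each `[low, high]` pair in `intervals` is used by the evaluation cells as a half-open range: a row is kept when `low <= n_tokens < high`, where `n_tokens` is the token count of the context plus the question. A minimal sketch of that bucketing follows. It uses a whitespace split as a stand-in for the real tokenizer, and `count_tokens` and `bucket_for` are hypothetical names that do not appear in the notebook.

```python
from typing import List, Optional, Sequence


def count_tokens(text: str) -> int:
    # Stand-in for len(tokenizer.tokenize(text)); a subword tokenizer
    # will usually report a higher count than a whitespace split.
    return len(text.split())


def bucket_for(n_tokens: int, intervals: Sequence[Sequence[int]]) -> Optional[List[int]]:
    # Return the first interval with low <= n_tokens < high, or None if no interval matches.
    for low, high in intervals:
        if low <= n_tokens < high:
            return [low, high]
    return None


intervals = [[1, 512], [512, 1024], [1024, 1500], [1500, 2000], [2000, 2500]]

print(bucket_for(511, intervals))   # [1, 512]
print(bucket_for(512, intervals))   # [512, 1024]
print(bucket_for(2500, intervals))  # None: the upper bound is exclusive
```

Because the upper bound is exclusive, adjacent intervals such as `[1, 512]` and `[512, 1024]` never count the same row twice.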

In [5]:
!pip install evaluate



In [7]:
# @title Load dataset and initialize task_evaluator

import datasets
from evaluate import evaluator

dataset_name = "Kkordik/NovelQSI" # @param {type:"string"}
split = "test" # @param {type:"string"}
intervals = [[2100, 2300]] # @param

dataset = datasets.load_dataset(dataset_name, split=split)

task_evaluator = evaluator("question-answering")


Generating train split:   0%|          | 0/746 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/320 [00:00<?, ? examples/s]

In [8]:
# @title Initialize the first model and tokenizer

# datasets and evaluator are already imported in the previous cell
from transformers import AutoModelForQuestionAnswering, AutoTokenizer


hf_model_name_1 = "mrm8488/longformer-base-4096-finetuned-squadv2" # @param {type:"string"}

model_1 = AutoModelForQuestionAnswering.from_pretrained(hf_model_name_1)
tokenizer_1 = AutoTokenizer.from_pretrained(hf_model_name_1)

Some weights of the model checkpoint at mrm8488/longformer-base-4096-finetuned-squadv2 were not used when initializing LongformerForQuestionAnswering: ['longformer.pooler.dense.bias', 'longformer.pooler.dense.weight']
- This IS expected if you are initializing LongformerForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
# @title Run evaluation for the first model

for interval in intervals:
  tokenizer_1.model_max_length = interval[1]

  # For the Kkordik/TriviaQA_SQuAD dataset, tokenizing every row took too long, so the
  # character-length filter below (roughly 5 characters per token) was used instead.
  # It is imprecise and is considered bad practice.
  # filter_interval = lambda ex: interval[0]*5 <= len(ex['context'] + " " + ex['question']) < interval[1]*5 and len(ex['answers']['text']) > 0

  # Keep rows whose context + question token count falls in [interval[0], interval[1]).
  filter_interval = lambda row: interval[0] <= len(tokenizer_1.tokenize(row["context"] + " " + row["question"])) < interval[1]

  filtered_data = dataset.filter(filter_interval)

  print(filtered_data)

  eval_results = task_evaluator.compute(
      model_or_pipeline=model_1,
      tokenizer=tokenizer_1,
      data=filtered_data,
      strategy="bootstrap",
      n_resamples=30
  )
  print(interval, eval_results)

Filter:   0%|          | 0/320 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2318 > 2300). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['context', 'question', 'answers', 'id'],
    num_rows: 231
})


Filter:   0%|          | 0/231 [00:00<?, ? examples/s]



[2100, 2300] {'exact_match': {'confidence_interval': (9.090909090909092, 16.450216450216452), 'standard_error': 2.8392490950299343, 'score': 12.121212121212121}, 'f1': {'confidence_interval': (19.824994401700803, 27.508771433359502), 'standard_error': 2.5308886792106966, 'score': 22.799422799422796}, 'total_time_in_seconds': 217.63294786500023, 'samples_per_second': 1.0614201676084978, 'latency_in_seconds': 0.9421339734415595}
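
With `strategy="bootstrap"`, the reported `confidence_interval` and `standard_error` come from resampling the evaluated rows with replacement and recomputing the metric on each resample. The sketch below illustrates the idea with a plain percentile bootstrap. It is not the `evaluate` implementation, whose resampling method may differ in details. `bootstrap_mean` is a hypothetical helper, and the per-row flags are made-up data (28 hits out of 231 rows, which gives a mean of about 12.12).

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=30, confidence=0.95, seed=0):
    # Percentile bootstrap of the mean: resample with replacement,
    # recompute the mean each time, then read off the percentiles.
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * (n_resamples - 1))
    hi_idx = int(round((1.0 - alpha) * (n_resamples - 1)))
    return {
        "score": statistics.fmean(values),
        "confidence_interval": (means[lo_idx], means[hi_idx]),
        # Spread of the resampled means estimates the standard error.
        "standard_error": statistics.stdev(means),
    }


# Hypothetical per-row exact-match flags on a 0-100 scale (assumption, not notebook data).
flags = [100.0] * 28 + [0.0] * 203
result = bootstrap_mean(flags)
print(result)
```

With only 30 resamples the interval endpoints are coarse; a larger `n_resamples` gives a more stable estimate at the cost of extra computation.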


In [12]:
# @title Initialize the second model and tokenizer

hf_model_name_2 = "Kkordik/test_longformer_4096_qsi" # @param {type:"string"}


model_2 = AutoModelForQuestionAnswering.from_pretrained(hf_model_name_2)
tokenizer_2 = AutoTokenizer.from_pretrained(hf_model_name_2)


In [13]:
# @title Run evaluation for the second model

for interval in intervals:
  tokenizer_2.model_max_length = interval[1]

  # For the Kkordik/TriviaQA_SQuAD dataset, tokenizing every row took too long, so the
  # character-length filter below (roughly 5 characters per token) was used instead.
  # It is imprecise and is considered bad practice.
  # filter_interval = lambda ex: interval[0]*5 <= len(ex['context'] + " " + ex['question']) < interval[1]*5 and len(ex['answers']['text']) > 0

  # Keep rows whose context + question token count falls in [interval[0], interval[1]).
  filter_interval = lambda row: interval[0] <= len(tokenizer_2.tokenize(row["context"] + " " + row["question"])) < interval[1]

  filtered_data = dataset.filter(filter_interval)

  print(filtered_data)

  eval_results = task_evaluator.compute(
      model_or_pipeline=model_2,
      tokenizer=tokenizer_2,
      data=filtered_data,
      strategy="bootstrap",
      n_resamples=30
  )
  print(interval, eval_results)

Filter:   0%|          | 0/320 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2318 > 2300). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['context', 'question', 'answers', 'id'],
    num_rows: 231
})


Filter:   0%|          | 0/231 [00:00<?, ? examples/s]



[2100, 2300] {'exact_match': {'confidence_interval': (16.017316017316016, 23.1075215284475), 'standard_error': 2.2611645103110676, 'score': 20.346320346320347}, 'f1': {'confidence_interval': (22.921470264395815, 30.999957029781054), 'standard_error': 2.668262666925105, 'score': 26.580086580086572}, 'total_time_in_seconds': 206.34977920599977, 'samples_per_second': 1.1194584306746063, 'latency_in_seconds': 0.8932890874718604}
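
To read the two result dictionaries side by side, it helps to compute the score difference and check whether the bootstrap confidence intervals overlap. The snippet below does that with the values printed by the two evaluation cells, rounded to two decimals. `intervals_overlap` is a hypothetical helper, and an overlap check is only a rough indication, not a formal significance test.

```python
# Scores and confidence intervals copied from the two evaluation outputs (rounded).
base = {
    "exact_match": {"score": 12.12, "ci": (9.09, 16.45)},
    "f1": {"score": 22.80, "ci": (19.82, 27.51)},
}
fine_tuned = {
    "exact_match": {"score": 20.35, "ci": (16.02, 23.11)},
    "f1": {"score": 26.58, "ci": (22.92, 31.00)},
}


def intervals_overlap(a, b):
    # Two closed intervals overlap when each one starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]


for metric in ("exact_match", "f1"):
    delta = fine_tuned[metric]["score"] - base[metric]["score"]
    overlap = intervals_overlap(base[metric]["ci"], fine_tuned[metric]["ci"])
    print(f"{metric}: delta={delta:+.2f}, intervals overlap={overlap}")
```

On these numbers the fine-tuned model scores higher on both metrics, but the intervals overlap in both cases (narrowly for exact match), so with 231 rows and 30 resamples the gain is best read as suggestive rather than conclusive.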
