# QA Cross Evaluation

The purpose of this notebook is to evaluate every fine-tuned model on each of the SQuAD, TriviaQA, NQ, QuAC, and NewsQA question-answering datasets, in the same way as Table 3 of [this paper](https://arxiv.org/pdf/2004.03490.pdf).

## First Steps

**Mount Google Drive (to read the datasets and the models' parameters)**

First, we mount Google Drive to load the datasets and the parameters of the fine-tuned models. I did it this way because the files are too big to upload multiple times.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


**Install PyTorch interface for pre-trained BERT**

Then, we must install the interface to use the pre-trained BERT models.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Foun

**Import necessary libraries**

Now, it's time to import the necessary libraries for this notebook.

In [None]:
import torch
import json
import pandas as pd
import re
import string
import collections
from transformers import BertTokenizerFast, BertForQuestionAnswering
from tqdm import tqdm

pd.set_option('max_colwidth', 500)

**Enable CUDA**

Use the GPU through CUDA when one is available, and fall back to the CPU otherwise. Running the model on the GPU makes the calculations faster.

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda', index=0)

**BERT model definition**

Here we define the name of the pre-trained BERT model from [Hugging Face](https://huggingface.co/models) that we used previously in the fine-tuning process.

In [None]:
BERT_MODEL_NAME = 'bert-base-uncased'

## Dataset loading

Now, let's load our datasets. We will load only the validation (dev) sets:

**JSON to pandas DataFrame loader**

This function loads a dataset stored in the SQuAD JSON format into a pandas DataFrame.

In [None]:
def squad_load_from_json(json_file_path: str):
  # Returns one row per question: the question, its context and a list of answer spans
  with open(json_file_path, "r", encoding="utf-8") as f:
    json_data = json.load(f)['data']
  questions = []
  answers = []
  corpuses = []
  no_ans_questions = 0
  for category in json_data:
    for paragraph in category['paragraphs']:
      context = paragraph['context']
      for qa in paragraph['qas']:
        corpuses.append(context)
        questions.append(qa['question'])
        # Unanswerable questions fall back to their plausible answers, if there are any
        if qa.get('is_impossible', False):
          ans_list = qa.get('plausible_answers', [])
        else:
          ans_list = qa['answers']
        if len(ans_list) == 0:
          no_ans_questions += 1
        # Deduplicate the answers as (char start, exclusive char end, text) tuples
        ans_set = set()
        for ans in ans_list:
          ans_set.add((ans['answer_start'], ans['answer_start']+len(ans['text']), ans['text']))
        answers.append(list(ans_set))
  print("Questions with no answers: ", no_ans_questions)
  return pd.DataFrame(data={'question':questions, 'answer':answers, 'corpus':corpuses})
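
To make the expected input concrete, here is a minimal sketch of the SQuAD-style layout the loader walks through, and of how each answer becomes a `(start, end, text)` tuple. The sample document and the helper `to_span` are hypothetical and written by hand for illustration; they are not taken from any of the datasets used in this notebook.

```python
# Hypothetical, hand-written sample in the SQuAD v2.0 layout.
sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Normandy is a region in France.",
            "qas": [
                {"id": "q1", "question": "Where is Normandy?",
                 "is_impossible": False,
                 "answers": [{"text": "France", "answer_start": 24}]},
                {"id": "q2", "question": "Who founded Normandy?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }],
    }]
}

def to_span(ans):
    """Turn one SQuAD answer dict into (char start, exclusive char end, text)."""
    start = ans["answer_start"]
    return (start, start + len(ans["text"]), ans["text"])

rows = []
for article in sample["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            rows.append((qa["question"], [to_span(a) for a in qa["answers"]]))

for question, spans in rows:
    print(question, spans)
```

The end offset is exclusive, so slicing the context with `context[start:end]` gives back the answer text. A question with an empty answer list keeps an empty list of spans.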

**Load SQuAD dataset**

In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2022-03-13 01:00:22--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2022-03-13 01:00:22 (55.4 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
squad_dataset = squad_load_from_json("dev-v2.0.json").explode('answer').reset_index()
squad_dataset

Questions with no answers:  15


Unnamed: 0,index,question,answer,corpus
0,0,In what country is Normandy located?,"(159, 165, France)","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ..."
1,1,When were the Normans in Normandy?,"(94, 117, 10th and 11th centuries)","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ..."
2,1,When were the Normans in Normandy?,"(87, 117, in the 10th and 11th centuries)","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ..."
3,2,From which countries did the Norse originate?,"(256, 283, Denmark, Iceland and Norway)","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ..."
4,3,Who was the Norse leader?,"(308, 313, Rollo)","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ..."
...,...,...,...,...
16328,11868,What is the seldom used force unit equal to one thousand newtons?,"(665, 671, sthène)","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ..."
16329,11869,What does not have a metric counterpart?,"(4, 15, pound-force)","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ..."
16330,11870,What is the force exerted by standard gravity on one ton of mass?,"(82, 96, kilogram-force)","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ..."
16331,11871,What force leads to a commonly used unit of mass?,"(195, 209, kilogram-force)","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ..."


**Load TriviaQA dataset**

In [None]:
triviaqa_dataset = squad_load_from_json("/content/gdrive/MyDrive/NLP exercises/Datasets/Question_Answering_in_SQuAD_format/TriviaQA/triviaqa_dev.json").explode('answer').reset_index()
triviaqa_dataset

Questions with no answers:  4394


Unnamed: 0,index,question,answer,corpus
0,0,Which Lloyd Webber musical premiered in the US on 10th December 1993?,,"Andrew Lloyd Webber , Baron Lloyd-Webber ( born 22 March 1948 ) is an English composer and impresario of musical theatre . \n \n Several of his musicals have run for more than a decade both in the West End and on Broadway . He has composed 13 musicals , a song cycle , a set of variations , two film scores , and a Latin Requiem Mass . Several of his songs have been widely recorded and were hits outside of their parent musicals , notably `` The Music of the Night '' from The Phantom of the Ope..."
1,1,Who was the next British Prime Minister after Arthur Balfour?,,"The Prime Minister of the United Kingdom of Great Britain and Northern Ireland is the head of Her Majesty 's Government in the United Kingdom . The prime minister ( informal abbreviation : PM ) and Cabinet ( consisting of all the most senior ministers , most of whom are government department heads ) are collectively accountable for their policies and actions to the Monarch , to Parliament , to their political party and ultimately to the electorate . The , Theresa May , leader of the Conserva..."
2,2,Who was the next British Prime Minister after Arthur Balfour?,,"Arthur James Balfour , 1st Earl of Balfour , ( ; 25 July 1848 – 19 March 1930 ) was a British Conservative politician who was the Prime Minister of the United Kingdom from July 1902 to December 1905 , and later Foreign Secretary . \n \n Entering Parliament in 1874 , Balfour achieved prominence as Chief Secretary for Ireland , in which position he suppressed agrarian unrest whilst taking measures against absentee landlords . He opposed Irish Home Rule , saying there could be no half-way house..."
3,3,Who had a 70s No 1 hit with Kiss You All Over?,"(62, 67, Exile)","`` Kiss You All Over '' is a 1978 song performed by the group Exile . It was written by Mike Chapman and Nicky Chinn . It was included on the band 's album Mixed Emotions , and it featured Jimmy Stokley and guitarist JP Pennington on lead vocals . It was a number one single in the United States , but proved to be Exile 's only big hit in the pop rock market . Billboard ranked it as the No . 5 song for 1978 . In the United Kingdom , the song was released on Mickie Most 's RAK Records , and it..."
4,4,What claimed the life of singer Kathleen Ferrier?,"(2488, 2494, Cancer)","Kathleen Mary Ferrier , CBE ( 22 April 1912 - 8 October 1953 ) was an English contralto singer who achieved an international reputation as a stage , concert and recording artist , with a repertoire extending from folksong and popular ballads to the classical works of Bach , Brahms , Mahler and Elgar . Her death from cancer , at the height of her fame , was a shock to the musical world and particularly to the general public , which was kept in ignorance of the nature of her illness until afte..."
...,...,...,...,...
14224,14224,"With a motto of Always Ready, Always There, what US military branch had it's founding on Dec 14, 1636?","(4, 18, National Guard)","The National Guard of the United States , part of the reserve components of the United States Armed Forces , is a reserve military force , composed of National Guard military members or units of each state and the territories of Guam , of the Virgin Islands , and of Puerto Rico , as well as of the District of Columbia , for a total of 54 separate organizations . All members of the National Guard of the United States are also members of the militia of the United States as defined by . Nationa..."
14225,14225,Who tried to steal Christmas from the town of Whoville?,"(161, 167, Grinch)","Whoville is a fictional town created by author Theodor Seuss Geisel , under the name Dr. Seuss . Whoville appeared in the books Horton Hears a Who ! and How the Grinch Stole Christmas ! However , there were significant differences between the two renditions . \n \n Location \n \n The exact location of Whoville seems to vary depending on which book or media is being referenced . \n According to the book Horton Hears a Who ! , the city of Whoville is located within a floating speck of dust whi..."
14226,14226,"What is the name of the parson mentioned in the lyrics of the Christmas carol ""Winter Wonderland""?","(1908, 1913, Brown)","`` Winter Wonderland '' is a winter song , popularly treated as a Christmastime pop standard , written in 1934 by Felix Bernard ( music ) and Richard B. Smith ( lyricist ) . Through the decades it has been recorded by over 200 different artists . \n \n History \n \n Dick Smith , a native of Honesdale , Pennsylvania , was reportedly inspired to write the song after seeing Honesdale 's Central Park covered in snow . Smith had written the lyrics while in the West Mountain Sanitarium , being tre..."
14227,14227,"What is the name of the parson mentioned in the lyrics of the Christmas carol ""Winter Wonderland""?","(2032, 2037, Brown)","Frosty 's Winter Wonderland is a 1976 animated Christmas television special produced by Rankin/Bass Productions which originally aired on December 2 , 1976 on ABC . It is the second Frosty special and is a sequel to the 1969 Frosty the Snowman special , also written by Romeo Muller , with narration provided by Andy Griffith . \n \n Plot \n \n Years have passed since Frosty left for the North Pole , but kept his promise to the children that he would be back again someday . When he hears the n..."


**Load NQ dataset**

In [None]:
nq_dataset = squad_load_from_json("/content/gdrive/MyDrive/NLP exercises/Datasets/Question_Answering_in_SQuAD_format/NQ/nq_dev.json").explode('answer').reset_index()
nq_dataset

Questions with no answers:  1013


Unnamed: 0,index,question,answer,corpus
0,0,who is the owner of the mandalay bay in vegas,"(124, 149, MGM Resorts International)","Mandalay Bay is a 43-story luxury resort and casino on the Las Vegas Strip in Paradise, Nevada. It is owned and operated by MGM Resorts International. One of the property's towers operates as the Delano; the Four Seasons Hotel is independently operated within the Mandalay Bay tower, occupying 5 floors (35–39)."
1,1,who kicks the ball first to start a football game,"(138, 179, the team that lost the pre-game coin toss)","A kick-off is used to start each half of play, and each period of extra time where applicable. The kick-off to start a game is awarded to the team that lost the pre-game coin toss (the team that won the coin-toss chooses which direction they wish to play). The kick-off begins when the referee blows the whistle. The kick-off to start the second half is taken by the other team. If extra time is played another coin-toss is used at the beginning of this period."
2,1,who kicks the ball first to start a football game,"(127, 179, awarded to the team that lost the pre-game coin toss)","A kick-off is used to start each half of play, and each period of extra time where applicable. The kick-off to start a game is awarded to the team that lost the pre-game coin toss (the team that won the coin-toss chooses which direction they wish to play). The kick-off begins when the referee blows the whistle. The kick-off to start the second half is taken by the other team. If extra time is played another coin-toss is used at the beginning of this period."
3,2,mount everest how did it get its name,,"In 1865, Everest was given its official English name by the Royal Geographical Society, upon a recommendation by Andrew Waugh, the British Surveyor General of India. As there appeared to be several different local names, Waugh chose to name the mountain after his predecessor in the post, Sir George Everest, despite George Everest's objections.[6]"
4,3,who votes in the baseball hall of fame,"(80, 162, the Baseball Writers' Association of America (or BBWAA), or the Veterans Committee)","Players are currently inducted into the Hall of Fame through election by either the Baseball Writers' Association of America (or BBWAA), or the Veterans Committee,[8] which now consists of four subcommittees, each of which considers and votes for candidates from a separate era of baseball. Five years after retirement, any player with 10 years of major league experience who passes a screening committee (which removes from consideration players of clearly lesser qualification) is eligible to b..."
...,...,...,...,...
4664,3366,how many episodes are there in modern family,"(18, 21, 201)","As of January 17, 2018,[update] 201 episodes of Modern Family have aired."
4665,3367,who built the first temple for god in jerusalem,"(62, 69, Solomon)","The Hebrew Bible states that the temple was constructed under Solomon, king of the United Kingdom of Israel and Judah and that during the Kingdom of Judah, the temple was dedicated to Yahweh, and is said to have housed the Ark of the Covenant. Jewish historian Josephus says that ""the temple was burnt four hundred and seventy years, six months, and ten days after it was built"",[1] although rabbinic sources state that the First Temple stood for 410 years and, based on the 2nd-century work Sede..."
4666,3367,who built the first temple for god in jerusalem,"(62, 117, Solomon, king of the United Kingdom of Israel and Judah)","The Hebrew Bible states that the temple was constructed under Solomon, king of the United Kingdom of Israel and Judah and that during the Kingdom of Judah, the temple was dedicated to Yahweh, and is said to have housed the Ark of the Covenant. Jewish historian Josephus says that ""the temple was burnt four hundred and seventy years, six months, and ten days after it was built"",[1] although rabbinic sources state that the First Temple stood for 410 years and, based on the 2nd-century work Sede..."
4667,3368,what is a dropped pin on google maps for,"(58, 88, marks locations in Google Maps)","The Google Maps pin is the inverted-drop-shaped icon that marks locations in Google Maps. The pin is protected under a U.S. design patent as ""teardrop-shaped marker icon including a shadow.""[1][2] Google has used the pin in various graphics, games, and promotional materials."


**Load QuAC dataset**

In [None]:
quac_dataset = squad_load_from_json("/content/gdrive/MyDrive/NLP exercises/Datasets/Question_Answering_in_SQuAD_format/quac/quac_dev.json").explode('answer').reset_index()
quac_dataset

Questions with no answers:  1486


Unnamed: 0,index,question,answer,corpus
0,0,what happened in 1983?,"(0, 52, In May 1983, she married Nikos Karvelas, a composer,)","In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele..."
1,1,did they have any children?,"(92, 141, in November she gave birth to her daughter Sofia.)","In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele..."
2,2,did she have any other children?,,"In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele..."
3,3,what collaborations did she do with nikos?,"(213, 307, Since 1975, all her releases have become gold or platinum and have included songs by Karvelas.)","In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele..."
4,4,what influences does he have in her music?,,"In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele..."
...,...,...,...,...
7349,7349,How did Koufax perform in the post-season?,"(1661, 1769, Facing the Yankees in the 1963 World Series, Koufax beat Whitey Ford 5-2 in Game 1 and struck out 15 batters)","In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W..."
7350,7350,Are there any other interesting aspects about this article?,"(411, 563, The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, Warren Spahn, and above all Koufax - significantly reduced the walks)","In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W..."
7351,7351,How did this change in walks affect the game of baseball?,,"In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W..."
7352,7352,How did the league respond to this change?,,"In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W..."


**Load NewsQA dataset**

In [None]:
newsqa_dataset = squad_load_from_json("/content/gdrive/MyDrive/NLP exercises/Datasets/Question_Answering_in_SQuAD_format/NewsQA/newsqa_dev.json").explode('answer').reset_index()
newsqa_dataset

Questions with no answers:  0


Unnamed: 0,index,question,answer,corpus
0,0,Iran criticizes who ?,"(75, 108, U.S. President-elect Barack Obama)","TEHRAN , Iran -LRB- CNN -RRB- -- Iran 's parliament speaker has criticized U.S. President-elect Barack Obama for saying that Iran 's development of a nuclear weapon is unacceptable . Iranian President Mahmoud Ahmadinejad has outlined where he thinks U.S. policy needs to change . Ali Larijani said Saturday that Obama should apply his campaign message of change to U.S. dealings with Iran . `` Obama must know that the change that he talks about is not simply a superficial changing of colors or ..."
1,1,What happened to the U.N. compound ?,"(3246, 3265, hit and set on fire)","LONDON , England -LRB- CNN -RRB- -- Israeli military action in Gaza is comparable to that of German soldiers during the Holocaust , a Jewish UK lawmaker whose family suffered at the hands of the Nazis has claimed . A protester confronts police in London last weekend at a demonstration against Israeli action in Gaza . Gerald Kaufman , a member of the UK 's ruling Labour Party , also called for an arms embargo on Israel , currently fighting militant Palestinian group Hamas , during the debate ..."
2,2,Who said there is no immediate plans for deployment ?,"(122, 137, President Obama)","WASHINGTON -LRB- CNN -RRB- -- There are no immediate plans to commit more U.S. troops to the ongoing war in Afghanistan , President Obama said Wednesday . Canadian Prime Minister Stephen Harper , left , and President Obama meet in Washington on Wednesday . Speaking to reporters alongside Canadian Prime Minister Stephen Harper , Obama said he would consult with U.S. allies before determining a strategy in Afghanistan after last month 's elections there . `` I 'm going to take a very deliberat..."
3,3,Will Lieberman investigate further ?,"(1980, 2005, intends to follow up with)","LOS ANGELES , California -LRB- CNN -RRB- -- Former detainees of Immigration and Customs Enforcement accuse the agency in a lawsuit of forcibly injecting them with psychotropic drugs while trying to shuttle them out of the country during their deportation . Raymond Soeoth , pictured here with his wife , says he was injected with drugs by ICE agents against his will . One of the drugs in question is the potent anti-psychotic drug Haldol , which is often used to treat schizophrenia or other men..."
4,4,Who spent nine years in prison ?,"(112, 123, Tim Masters)","-LRB- CNN -RRB- -- A Colorado prosecutor Friday asked a judge to dismiss the first-degree murder charge against Tim Masters , who spent nine years in prison until new DNA evidence indicated someone else might have committed the crime . Tim Masters , center , walks out of a Fort Collins , Colorado , courthouse Tuesday with his attorney David Wymore . Court papers filed by District Attorney Larry Abrahamson cited `` newly discovered '' evidence , but took pains to state that evidence did n't c..."
...,...,...,...,...
5161,5161,What is the top drug choice in Hong Kong ?,"(141, 149, ketamine)","HONG KONG , China -LRB- CNN -RRB- -- A 16-year-old Hong Kong boy makes two phone calls for delivery : One for pizza , the other for the drug ketamine . Two teenage girls are found semi-conscious in a car park after overdosing on ketamine . A 13-year-old boy joins a gang and is given free ketamine . Glass capsules containing ketamine , which has become the drug of choice for Hong Kong 's youth . These are anecdotes told to CNN by police , a family doctor and a former gang member . Ketamine ha..."
5162,5162,What was the name of the agency ?,"(332, 339, Mohmand)","ISLAMABAD , Pakistan -LRB- CNN -RRB- -- Hundreds of militants , believed to be foreign fighters , launched attacks on various military check posts in Pakistan 's border with Afghanistan Saturday night and early Sunday morning , military officials said . A Pakistan soldier on patrol last fall against militants on the border of the Mohmand agency district . The ensuing fighting left 40 militants and six Pakistan soldiers dead , said military spokesman Gen. Athar Abbas . `` This is one of the l..."
5163,5163,who was at home with family in Los Angeles ?,"(19, 29, Bea Arthur)","-LRB- CNN -RRB- -- Bea Arthur , the actress best known for her roles as television 's `` Maude '' and the sardonic Dorothy on `` The Golden Girls , '' has died of cancer , a family spokesman said Saturday . Bea Arthur , right , with `` Golden Girls '' co-star Rue McClanahan in June 2008 . She was 86 . Spokesman Dan Watt said that Arthur died Saturday morning at her home in Los Angeles , her family by her side . She is survived by her sons Matthew and Daniel and grandchildren Kyra and Violet ..."
5164,5164,when Les Bleus avenged 2007 semifinal loss to England on home soil ?,"(304, 317, 19-12 victory)","-LRB- CNN -RRB- -- France 's reputation as rugby 's Jekyll and Hyde team was reaffirmed on Saturday as Marc Lievremont 's inconsistent side bounced back from two defeats to eliminate England and reach the World Cup semifinals . Les Bleus avenged their 2007 semi defeat by the English on home soil with a 19-12 victory in Auckland , setting up a last-four clash with Wales -- who went through after beating Celtic neighbors Ireland 22-10 . With the other half of the drawing pitting hosts New Zeal..."


## Tokenization process

After loading the datasets, and before using them for evaluation, we must tokenize them to calculate the start and end token positions of each gold answer, and then decode the tokens to get the gold answer text back. We do this because some texts contain characters that the BERT tokenizer does not preserve (such as the é in Beyoncé), so a correct answer would otherwise be scored as incorrect.
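
The mismatch can be illustrated without the tokenizer itself. The sketch below uses a hypothetical helper, `approx_uncased_normalize`, that only approximates the normalization of an uncased BERT tokenizer (lowercasing and accent stripping). It is a stand-in for illustration, not the tokenizer's actual implementation.

```python
import unicodedata

def approx_uncased_normalize(text: str) -> str:
    """Rough stand-in for uncased BERT normalization: lowercase, then strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (Unicode category "Mn"), e.g. the accent of "é".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

gold = "Beyoncé"         # answer text as stored in the dataset
predicted = "beyonce"    # assumed text obtained by decoding the predicted tokens

raw_match = gold == predicted
normalized_match = approx_uncased_normalize(gold) == predicted
print(raw_match, normalized_match)
```

Comparing the raw strings fails although the prediction is right, while comparing after the same normalization succeeds. This is why the gold answers are passed through the tokenizer before scoring.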

**Download BERT tokenizer**

In [None]:
tokenizer = BertTokenizerFast.from_pretrained(BERT_MODEL_NAME)

**Calculate start and end tokens of the answers**

In [None]:
def calculate_tokenized_ans_indices(dataset: pd.DataFrame):
  """Add token-level answer spans and tokenizer-normalized answer text, then group the rows by question index."""
  ans_tok_start = []
  ans_tok_end = []
  ans_tok_text = []
  for idx, ans in enumerate(dataset['answer'].values):
    if not pd.isna(ans):
      # Each answer is a (char_start, char_end, text) tuple; char_end is exclusive
      ans_text_start = ans[0]
      ans_text_end = ans[1]
      ans_text = ans[2]
      # Encode the context first and the question second (same order as in predict)
      encoding = tokenizer.encode_plus(
        text=dataset['corpus'].values[idx],
        text_pair=dataset['question'].values[idx],
        max_length=512, padding='max_length', truncation=True)
      # Map the character offsets in the context to token indices
      ans_start = encoding.char_to_token(0, ans_text_start)
      ans_end = encoding.char_to_token(0, ans_text_end-1)
      # Handle truncated answers
      if ans_start is None:
        # The answer starts after the truncation point: use the out-of-range index
        ans_start = ans_end = tokenizer.model_max_length
      elif ans_end is None:
        # Only the end was cut off: clamp it to the first [SEP] token
        ans_end = [i for i, inp in enumerate(encoding['input_ids']) if inp == tokenizer.sep_token_id][0]
      # ans_text_tok = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(encoding['input_ids'][ans_start:ans_end+1]))
      # Round-trip the gold text through the tokenizer so it matches decoded predictions
      ans_text_tok = tokenizer.decode(tokenizer.encode(ans_text), skip_special_tokens=True)
    else:
      # Unanswerable question: out-of-range index and empty answer text
      ans_start = ans_end = tokenizer.model_max_length
      ans_text_tok = ""
    ans_tok_start.append(ans_start)
    ans_tok_end.append(ans_end)
    ans_tok_text.append(ans_text_tok)
  dataset['ans_start_tok'] = ans_tok_start
  dataset['ans_end_tok'] = ans_tok_end
  dataset['ans_tok_text'] = ans_tok_text
  # Merge rows that share a question index; per-answer columns become lists
  return dataset.groupby('index').agg({
    'question': lambda x : x.tolist()[0],
    'answer': lambda x : x.tolist(),
    'corpus': lambda x : x.tolist()[0],
    'ans_start_tok': lambda x : x.tolist(),
    'ans_end_tok': lambda x : x.tolist(),
    'ans_tok_text': lambda x : x.tolist()})
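
To make the character-to-token mapping easier to follow, here is a minimal sketch that replaces WordPiece with plain whitespace tokens. `char_span_to_token_span` is a hypothetical helper written for illustration only. It assumes the answer span starts and ends on non-whitespace characters, and it clamps a cut-off end to the last kept token, whereas the function above uses the `[SEP]` position.

```python
import re

def char_span_to_token_span(context, char_start, char_end, max_tokens):
    """Map an exclusive character span to a (start, end) token span."""
    # Whitespace "tokens" with their character offsets, truncated like the encoder input
    offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)][:max_tokens]

    def char_to_token(pos):
        # Index of the kept token covering this character, or None if there is none
        for i, (start, end) in enumerate(offsets):
            if start <= pos < end:
                return i
        return None

    tok_start = char_to_token(char_start)
    tok_end = char_to_token(char_end - 1)
    if tok_start is None:
        # The answer lies entirely in the truncated part: out-of-range sentinel
        return max_tokens, max_tokens
    if tok_end is None:
        # Only the end was truncated: clamp it to the last kept token
        tok_end = len(offsets) - 1
    return tok_start, tok_end

context = "The Normans gave their name to Normandy, a region in France."
start = context.index("France")
full_span = char_span_to_token_span(context, start, start + len("France"), 512)
truncated_span = char_span_to_token_span(context, start, start + len("France"), 5)
```

With the full context the answer maps to a single token, while a budget of five tokens truncates it and yields the sentinel pair, which mirrors the `[512]` entries in the outputs below.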

In [None]:
squad_dataset = calculate_tokenized_ans_indices(squad_dataset)
squad_dataset

index,question,answer,corpus,ans_start_tok,ans_end_tok,ans_tok_text
0,In what country is Normandy located?,"[(159, 165, France)]","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ...",[41],[41],[france]
1,When were the Normans in Normandy?,"[(94, 117, 10th and 11th centuries), (87, 117, in the 10th and 11th centuries)]","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ...","[28, 26]","[31, 31]","[10th and 11th centuries, in the 10th and 11th centuries]"
2,From which countries did the Norse originate?,"[(256, 283, Denmark, Iceland and Norway)]","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ...",[63],[67],"[denmark, iceland and norway]"
3,Who was the Norse leader?,"[(308, 313, Rollo)]","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ...",[73],[74],[rollo]
4,What century did the Normans first gain their separate identity?,"[(671, 675, 10th), (649, 683, the first half of the 10th century), (671, 683, 10th century)]","The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants ...","[144, 139, 144]","[144, 145, 145]","[10th, the first half of the 10th century, 10th century]"
...,...,...,...,...,...,...
11868,What is the seldom used force unit equal to one thousand newtons?,"[(665, 671, sthène)]","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ...",[158],[160],[sthene]
11869,What does not have a metric counterpart?,"[(4, 15, pound-force)]","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ...",[2],[4],[pound - force]
11870,What is the force exerted by standard gravity on one ton of mass?,"[(82, 96, kilogram-force)]","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ...",[18],[21],[kilogram - force]
11871,What force leads to a commonly used unit of mass?,"[(195, 209, kilogram-force)]","The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for ...",[50],[53],[kilogram - force]


In [None]:
triviaqa_dataset = calculate_tokenized_ans_indices(triviaqa_dataset)
triviaqa_dataset

index,question,answer,corpus,ans_start_tok,ans_end_tok,ans_tok_text
0,Which Lloyd Webber musical premiered in the US on 10th December 1993?,[nan],"Andrew Lloyd Webber , Baron Lloyd-Webber ( born 22 March 1948 ) is an English composer and impresario of musical theatre . \n \n Several of his musicals have run for more than a decade both in the West End and on Broadway . He has composed 13 musicals , a song cycle , a set of variations , two film scores , and a Latin Requiem Mass . Several of his songs have been widely recorded and were hits outside of their parent musicals , notably `` The Music of the Night '' from The Phantom of the Ope...",[512],[512],[]
1,Who was the next British Prime Minister after Arthur Balfour?,[nan],"The Prime Minister of the United Kingdom of Great Britain and Northern Ireland is the head of Her Majesty 's Government in the United Kingdom . The prime minister ( informal abbreviation : PM ) and Cabinet ( consisting of all the most senior ministers , most of whom are government department heads ) are collectively accountable for their policies and actions to the Monarch , to Parliament , to their political party and ultimately to the electorate . The , Theresa May , leader of the Conserva...",[512],[512],[]
2,Who was the next British Prime Minister after Arthur Balfour?,[nan],"Arthur James Balfour , 1st Earl of Balfour , ( ; 25 July 1848 – 19 March 1930 ) was a British Conservative politician who was the Prime Minister of the United Kingdom from July 1902 to December 1905 , and later Foreign Secretary . \n \n Entering Parliament in 1874 , Balfour achieved prominence as Chief Secretary for Ireland , in which position he suppressed agrarian unrest whilst taking measures against absentee landlords . He opposed Irish Home Rule , saying there could be no half-way house...",[512],[512],[]
3,Who had a 70s No 1 hit with Kiss You All Over?,"[(62, 67, Exile)]","`` Kiss You All Over '' is a 1978 song performed by the group Exile . It was written by Mike Chapman and Nicky Chinn . It was included on the band 's album Mixed Emotions , and it featured Jimmy Stokley and guitarist JP Pennington on lead vocals . It was a number one single in the United States , but proved to be Exile 's only big hit in the pop rock market . Billboard ranked it as the No . 5 song for 1978 . In the United Kingdom , the song was released on Mickie Most 's RAK Records , and it...",[17],[17],[exile]
4,What claimed the life of singer Kathleen Ferrier?,"[(2488, 2494, Cancer)]","Kathleen Mary Ferrier , CBE ( 22 April 1912 - 8 October 1953 ) was an English contralto singer who achieved an international reputation as a stage , concert and recording artist , with a repertoire extending from folksong and popular ballads to the classical works of Bach , Brahms , Mahler and Elgar . Her death from cancer , at the height of her fame , was a shock to the musical world and particularly to the general public , which was kept in ignorance of the nature of her illness until afte...",[486],[486],[cancer]
...,...,...,...,...,...,...
14224,"With a motto of Always Ready, Always There, what US military branch had it's founding on Dec 14, 1636?","[(4, 18, National Guard)]","The National Guard of the United States , part of the reserve components of the United States Armed Forces , is a reserve military force , composed of National Guard military members or units of each state and the territories of Guam , of the Virgin Islands , and of Puerto Rico , as well as of the District of Columbia , for a total of 54 separate organizations . All members of the National Guard of the United States are also members of the militia of the United States as defined by . Nationa...",[2],[3],[national guard]
14225,Who tried to steal Christmas from the town of Whoville?,"[(161, 167, Grinch)]","Whoville is a fictional town created by author Theodor Seuss Geisel , under the name Dr. Seuss . Whoville appeared in the books Horton Hears a Who ! and How the Grinch Stole Christmas ! However , there were significant differences between the two renditions . \n \n Location \n \n The exact location of Whoville seems to vary depending on which book or media is being referenced . \n According to the book Horton Hears a Who ! , the city of Whoville is located within a floating speck of dust whi...",[39],[40],[grinch]
14226,"What is the name of the parson mentioned in the lyrics of the Christmas carol ""Winter Wonderland""?","[(1908, 1913, Brown)]","`` Winter Wonderland '' is a winter song , popularly treated as a Christmastime pop standard , written in 1934 by Felix Bernard ( music ) and Richard B. Smith ( lyricist ) . Through the decades it has been recorded by over 200 different artists . \n \n History \n \n Dick Smith , a native of Honesdale , Pennsylvania , was reportedly inspired to write the song after seeing Honesdale 's Central Park covered in snow . Smith had written the lyrics while in the West Mountain Sanitarium , being tre...",[399],[399],[brown]
14227,"What is the name of the parson mentioned in the lyrics of the Christmas carol ""Winter Wonderland""?","[(2032, 2037, Brown)]","Frosty 's Winter Wonderland is a 1976 animated Christmas television special produced by Rankin/Bass Productions which originally aired on December 2 , 1976 on ABC . It is the second Frosty special and is a sequel to the 1969 Frosty the Snowman special , also written by Romeo Muller , with narration provided by Andy Griffith . \n \n Plot \n \n Years have passed since Frosty left for the North Pole , but kept his promise to the children that he would be back again someday . When he hears the n...",[449],[449],[brown]


In [None]:
nq_dataset = calculate_tokenized_ans_indices(nq_dataset)
nq_dataset

index,question,answer,corpus,ans_start_tok,ans_end_tok,ans_tok_text
0,who is the owner of the mandalay bay in vegas,"[(124, 149, MGM Resorts International)]","Mandalay Bay is a 43-story luxury resort and casino on the Las Vegas Strip in Paradise, Nevada. It is owned and operated by MGM Resorts International. One of the property's towers operates as the Delano; the Four Seasons Hotel is independently operated within the Mandalay Bay tower, occupying 5 floors (35–39).",[29],[31],[mgm resorts international]
1,who kicks the ball first to start a football game,"[(138, 179, the team that lost the pre-game coin toss), (127, 179, awarded to the team that lost the pre-game coin toss)]","A kick-off is used to start each half of play, and each period of extra time where applicable. The kick-off to start a game is awarded to the team that lost the pre-game coin toss (the team that won the coin-toss chooses which direction they wish to play). The kick-off begins when the referee blows the whistle. The kick-off to start the second half is taken by the other team. If extra time is played another coin-toss is used at the beginning of this period.","[34, 32]","[43, 43]","[the team that lost the pre - game coin toss, awarded to the team that lost the pre - game coin toss]"
2,mount everest how did it get its name,[nan],"In 1865, Everest was given its official English name by the Royal Geographical Society, upon a recommendation by Andrew Waugh, the British Surveyor General of India. As there appeared to be several different local names, Waugh chose to name the mountain after his predecessor in the post, Sir George Everest, despite George Everest's objections.[6]",[512],[512],[]
3,who votes in the baseball hall of fame,"[(80, 162, the Baseball Writers' Association of America (or BBWAA), or the Veterans Committee), (73, 163, either the Baseball Writers' Association of America (or BBWAA), or the Veterans Committee,)]","Players are currently inducted into the Hall of Fame through election by either the Baseball Writers' Association of America (or BBWAA), or the Veterans Committee,[8] which now consists of four subcommittees, each of which considers and votes for candidates from a separate era of baseball. Five years after retirement, any player with 10 years of major league experience who passes a screening committee (which removes from consideration players of clearly lesser qualification) is eligible to b...","[14, 13]","[31, 32]","[the baseball writers'association of america ( or bbwaa ), or the veterans committee, either the baseball writers'association of america ( or bbwaa ), or the veterans committee,]"
4,who played taylor on the bold and beautiful,"[(112, 123, Hunter Tylo)]","Taylor Hayes is a fictional character from the American CBS soap opera The Bold and the Beautiful, portrayed by Hunter Tylo. The character was created by William J. Bell and debuted during the episode dated June 6, 1990. Tylo appeared as a regular continuously until 1994 when she took a hiatus for a few months before being written back into the series. In 1996, she left the serial after being cast on Melrose Place, where she was soon fired on the grounds of being pregnant, and returned short...",[21],[23],[hunter tylo]
...,...,...,...,...,...,...
3364,where are red blood cells made in adults,"[(357, 375, in the bone marrow)]","In humans, mature red blood cells are flexible and oval biconcave disks. They lack a cell nucleus and most organelles, in order to accommodate maximum space for hemoglobin; they can be viewed as sacks of hemoglobin, with a plasma membrane as the sack. Approximately 2.4 million new erythrocytes are produced per second in human adults.[2] The cells develop in the bone marrow and circulate for about 100–120 days in the body before their components are recycled by macrophages. Each circulation t...",[83],[86],[in the bone marrow]
3365,who was the pinkerton detective agency's first female detective,"[(0, 10, Kate Warne)]","Kate Warne (1833 – January 28, 1868)[1] was the first female detective, in 1856, in the Pinkerton Detective Agency and the United States.",[1],[3],[kate warne]
3366,how many episodes are there in modern family,"[(0, 44, As of January 17, 2018,[update] 201 episodes), (18, 21, 201)]","As of January 17, 2018,[update] 201 episodes of Modern Family have aired.","[1, 6]","[12, 6]","[as of january 17, 2018, [ update ] 201 episodes, 201]"
3367,who built the first temple for god in jerusalem,"[(62, 69, Solomon), (62, 117, Solomon, king of the United Kingdom of Israel and Judah)]","The Hebrew Bible states that the temple was constructed under Solomon, king of the United Kingdom of Israel and Judah and that during the Kingdom of Judah, the temple was dedicated to Yahweh, and is said to have housed the Ark of the Covenant. Jewish historian Josephus says that ""the temple was burnt four hundred and seventy years, six months, and ten days after it was built"",[1] although rabbinic sources state that the First Temple stood for 410 years and, based on the 2nd-century work Sede...","[11, 11]","[11, 21]","[solomon, solomon, king of the united kingdom of israel and judah]"


In [None]:
quac_dataset = calculate_tokenized_ans_indices(quac_dataset)
quac_dataset

index,question,answer,corpus,ans_start_tok,ans_end_tok,ans_tok_text
0,what happened in 1983?,"[(0, 52, In May 1983, she married Nikos Karvelas, a composer,)]","In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele...",[1],[16],"[in may 1983, she married nikos karvelas, a composer,]"
1,did they have any children?,"[(92, 141, in November she gave birth to her daughter Sofia.)]","In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele...",[24],[33],[in november she gave birth to her daughter sofia.]
2,did she have any other children?,[nan],"In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele...",[512],[512],[]
3,what collaborations did she do with nikos?,"[(213, 307, Since 1975, all her releases have become gold or platinum and have included songs by Karvelas.)]","In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele...",[49],[69],"[since 1975, all her releases have become gold or platinum and have included songs by karvelas.]"
4,what influences does he have in her music?,[nan],"In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (""I Want To Be A Star""), taking second place. This song is still unrele...",[512],[512],[]
...,...,...,...,...,...,...
7349,How did Koufax perform in the post-season?,"[(1661, 1769, Facing the Yankees in the 1963 World Series, Koufax beat Whitey Ford 5-2 in Game 1 and struck out 15 batters)]","In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W...",[417],[444],"[facing the yankees in the 1963 world series, koufax beat whitey ford 5 - 2 in game 1 and struck out 15 batters]"
7350,Are there any other interesting aspects about this article?,"[(411, 563, The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, Warren Spahn, and above all Koufax - significantly reduced the walks)]","In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W...",[93],[129],"[the top pitchers of the era - don drysdale, juan marichal, jim bunning, bob gibson, warren spahn, and above all koufax - significantly reduced the walks]"
7351,How did this change in walks affect the game of baseball?,[nan],"In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W...",[512],[512],[]
7352,How did the league respond to this change?,[nan],"In 1963, Major League Baseball expanded the strike zone. Compared to the previous season, National League walks fell 13 percent, strikeouts increased six percent, the league batting average fell from .261 to .245, and runs fell 15 percent. Koufax, who had reduced his walks allowed per nine innings to 3.4 in 1961 and 2.8 in 1962, reduced his walk rate further to 1.7 in 1963, which ranked fifth in the league. The top pitchers of the era - Don Drysdale, Juan Marichal, Jim Bunning, Bob Gibson, W...",[512],[512],[]


In [None]:
newsqa_dataset = calculate_tokenized_ans_indices(newsqa_dataset)
newsqa_dataset

index,question,answer,corpus,ans_start_tok,ans_end_tok,ans_tok_text
0,Iran criticizes who ?,"[(75, 108, U.S. President-elect Barack Obama)]","TEHRAN , Iran -LRB- CNN -RRB- -- Iran 's parliament speaker has criticized U.S. President-elect Barack Obama for saying that Iran 's development of a nuclear weapon is unacceptable . Iranian President Mahmoud Ahmadinejad has outlined where he thinks U.S. policy needs to change . Ali Larijani said Saturday that Obama should apply his campaign message of change to U.S. dealings with Iran . `` Obama must know that the change that he talks about is not simply a superficial changing of colors or ...",[22],[30],[u. s. president - elect barack obama]
1,What happened to the U.N. compound ?,"[(3246, 3265, hit and set on fire)]","LONDON , England -LRB- CNN -RRB- -- Israeli military action in Gaza is comparable to that of German soldiers during the Holocaust , a Jewish UK lawmaker whose family suffered at the hands of the Nazis has claimed . A protester confronts police in London last weekend at a demonstration against Israeli action in Gaza . Gerald Kaufman , a member of the UK 's ruling Labour Party , also called for an arms embargo on Israel , currently fighting militant Palestinian group Hamas , during the debate ...",[512],[512],[hit and set on fire]
2,Who said there is no immediate plans for deployment ?,"[(122, 137, President Obama)]","WASHINGTON -LRB- CNN -RRB- -- There are no immediate plans to commit more U.S. troops to the ongoing war in Afghanistan , President Obama said Wednesday . Canadian Prime Minister Stephen Harper , left , and President Obama meet in Washington on Wednesday . Speaking to reporters alongside Canadian Prime Minister Stephen Harper , Obama said he would consult with U.S. allies before determining a strategy in Afghanistan after last month 's elections there . `` I 'm going to take a very deliberat...",[33],[34],[president obama]
3,Will Lieberman investigate further ?,"[(1980, 2005, intends to follow up with)]","LOS ANGELES , California -LRB- CNN -RRB- -- Former detainees of Immigration and Customs Enforcement accuse the agency in a lawsuit of forcibly injecting them with psychotropic drugs while trying to shuttle them out of the country during their deportation . Raymond Soeoth , pictured here with his wife , says he was injected with drugs by ICE agents against his will . One of the drugs in question is the potent anti-psychotic drug Haldol , which is often used to treat schizophrenia or other men...",[404],[408],[intends to follow up with]
4,Who spent nine years in prison ?,"[(112, 123, Tim Masters)]","-LRB- CNN -RRB- -- A Colorado prosecutor Friday asked a judge to dismiss the first-degree murder charge against Tim Masters , who spent nine years in prison until new DNA evidence indicated someone else might have committed the crime . Tim Masters , center , walks out of a Fort Collins , Colorado , courthouse Tuesday with his attorney David Wymore . Court papers filed by District Attorney Larry Abrahamson cited `` newly discovered '' evidence , but took pains to state that evidence did n't c...",[28],[29],[tim masters]
...,...,...,...,...,...,...
5161,What is the top drug choice in Hong Kong ?,"[(141, 149, ketamine)]","HONG KONG , China -LRB- CNN -RRB- -- A 16-year-old Hong Kong boy makes two phone calls for delivery : One for pizza , the other for the drug ketamine . Two teenage girls are found semi-conscious in a car park after overdosing on ketamine . A 13-year-old boy joins a gang and is given free ketamine . Glass capsules containing ketamine , which has become the drug of choice for Hong Kong 's youth . These are anecdotes told to CNN by police , a family doctor and a former gang member . Ketamine ha...",[41],[43],[ketamine]
5162,What was the name of the agency ?,"[(332, 339, Mohmand)]","ISLAMABAD , Pakistan -LRB- CNN -RRB- -- Hundreds of militants , believed to be foreign fighters , launched attacks on various military check posts in Pakistan 's border with Afghanistan Saturday night and early Sunday morning , military officials said . A Pakistan soldier on patrol last fall against militants on the border of the Mohmand agency district . The ensuing fighting left 40 militants and six Pakistan soldiers dead , said military spokesman Gen. Athar Abbas . `` This is one of the l...",[64],[66],[mohmand]
5163,who was at home with family in Los Angeles ?,"[(19, 29, Bea Arthur)]","-LRB- CNN -RRB- -- Bea Arthur , the actress best known for her roles as television 's `` Maude '' and the sardonic Dorothy on `` The Golden Girls , '' has died of cancer , a family spokesman said Saturday . Bea Arthur , right , with `` Golden Girls '' co-star Rue McClanahan in June 2008 . She was 86 . Spokesman Dan Watt said that Arthur died Saturday morning at her home in Los Angeles , her family by her side . She is survived by her sons Matthew and Daniel and grandchildren Kyra and Violet ...",[12],[13],[bea arthur]
5164,when Les Bleus avenged 2007 semifinal loss to England on home soil ?,"[(304, 317, 19-12 victory)]","-LRB- CNN -RRB- -- France 's reputation as rugby 's Jekyll and Hyde team was reaffirmed on Saturday as Marc Lievremont 's inconsistent side bounced back from two defeats to eliminate England and reach the World Cup semifinals . Les Bleus avenged their 2007 semi defeat by the English on home soil with a 19-12 victory in Auckland , setting up a last-four clash with Wales -- who went through after beating Celtic neighbors Ireland 22-10 . With the other half of the drawing pitting hosts New Zeal...",[74],[77],[19 - 12 victory]


**Evaluation script**

This script calculates the F1 and exact match scores between gold and predicted answers. It is the same script published on the SQuAD website, available [here](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/).
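
To make the two metrics concrete, the following self-contained sketch scores one prediction against one gold answer taken from the SQuAD sample shown earlier. `_norm_tokens` and `token_f1` are condensed, hypothetical restatements of the normalization and F1 logic, written only for this illustration.

```python
import collections
import re
import string

_PUNCT = set(string.punctuation)

def _norm_tokens(text):
    # Lowercase, drop punctuation, drop articles, then split on whitespace
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(gold, pred):
    gold_toks, pred_toks = _norm_tokens(gold), _norm_tokens(pred)
    if not gold_toks or not pred_toks:
        # No-answer case: full credit only if both sides are empty
        return float(gold_toks == pred_toks)
    overlap = sum((collections.Counter(gold_toks) & collections.Counter(pred_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

gold = "10th and 11th centuries"
pred = "in the 10th and 11th centuries"
exact = int(_norm_tokens(gold) == _norm_tokens(pred))
f1 = token_f1(gold, pred)
```

Here the prediction contains one extra token ("in") after the article is removed, so precision is 4/5, recall is 4/4, and F1 is 8/9 (about 0.89), while the exact match score is 0.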

In [None]:
def normalize_answer(s):
  """Lower text and remove punctuation, articles and extra whitespace."""
  def remove_articles(text):
    regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
    return re.sub(regex, ' ', text)
  def white_space_fix(text):
    return ' '.join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()
  return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
  if not s: return []
  return normalize_answer(s).split()

def compute_exact(a_gold, a_pred):
  return int(normalize_answer(a_gold) == normalize_answer(a_pred))

def compute_f1(a_gold, a_pred):
  gold_toks = get_tokens(a_gold)
  pred_toks = get_tokens(a_pred)
  common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
  num_same = sum(common.values())
  if len(gold_toks) == 0 or len(pred_toks) == 0:
    # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
    return int(gold_toks == pred_toks)
  if num_same == 0:
    return 0
  precision = 1.0 * num_same / len(pred_toks)
  recall = 1.0 * num_same / len(gold_toks)
  f1 = (2 * precision * recall) / (precision + recall)
  return f1

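def predict(model: BertForQuestionAnswering, query: str, context: str):
  """Return the answer span the model extracts from `context` for `query`."""
  with torch.no_grad():
    model.eval()
    # Encode the (context, question) pair, padded or truncated to 512 tokens.
    inputs = tokenizer.encode_plus(text=context, text_pair=query, max_length=512, padding='max_length', truncation=True, return_tensors='pt').to(device)
    outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], token_type_ids=inputs['token_type_ids'])
    # outputs[0] and outputs[1] hold the start and end logits; take the most likely position of each.
    ans_start = torch.argmax(outputs[0])
    ans_end = torch.argmax(outputs[1])
    # Decode the tokens in [ans_start, ans_end]; the slice is empty when ans_end < ans_start.
    ans = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][ans_start:ans_end+1]))
    return ans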
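
To make the two metrics concrete, here is a small worked example. It is only a sketch: the functions are a compact re-statement of the scoring functions above so that the snippet runs on its own, and the predicted strings are made up for illustration (the gold answer is taken from one of the NewsQA rows shown earlier).

```python
import collections
import re
import string


def normalize_answer(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    exclude = set(string.punctuation)
    s = "".join(ch for ch in s.lower() if ch not in exclude)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def compute_exact(a_gold, a_pred):
    return int(normalize_answer(a_gold) == normalize_answer(a_pred))


def compute_f1(a_gold, a_pred):
    gold_toks = normalize_answer(a_gold).split()
    pred_toks = normalize_answer(a_pred).split()
    if not gold_toks or not pred_toks:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


gold = "Bea Arthur"

# Punctuation and case are ignored, so this counts as an exact match.
print(compute_exact(gold, "Bea Arthur."))

# A longer (hypothetical) prediction: 2 of its 3 tokens are correct,
# so precision = 2/3, recall = 2/2 and F1 = 0.8, while exact match is 0.
longer_pred = "the actress Bea Arthur"
print(compute_exact(gold, longer_pred), round(compute_f1(gold, longer_pred), 3))
```

Note that the article "the" is removed during normalization, which is why the longer prediction counts 3 tokens and not 4.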

## Model loading

After loading the datasets, it's time to load our models and their parameters.

**SQuAD model**

In [None]:
model_squad = BertForQuestionAnswering.from_pretrained(BERT_MODEL_NAME).to(device)
model_squad.load_state_dict(torch.load('/content/gdrive/MyDrive/NLP exercises/Question Answering/Model Parameters/model_squad.pt'))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

<All keys matched successfully>

**TriviaQA model**

In [None]:
model_trivia_qa = BertForQuestionAnswering.from_pretrained(BERT_MODEL_NAME).to(device)
model_trivia_qa.load_state_dict(torch.load('/content/gdrive/MyDrive/NLP exercises/Question Answering/Model Parameters/model_triviaqa.pt'))

<All keys matched successfully>

**NQ model**

In [None]:
model_nq = BertForQuestionAnswering.from_pretrained(BERT_MODEL_NAME).to(device)
model_nq.load_state_dict(torch.load('/content/gdrive/MyDrive/NLP exercises/Question Answering/Model Parameters/model_nq.pt'))

<All keys matched successfully>

**QuAC model**

In [None]:
model_quac = BertForQuestionAnswering.from_pretrained(BERT_MODEL_NAME).to(device)
model_quac.load_state_dict(torch.load('/content/gdrive/MyDrive/NLP exercises/Question Answering/Model Parameters/model_quac.pt'))

<All keys matched successfully>

**NewsQA model**

In [None]:
model_newsqa = BertForQuestionAnswering.from_pretrained(BERT_MODEL_NAME).to(device)
model_newsqa.load_state_dict(torch.load('/content/gdrive/MyDrive/NLP exercises/Question Answering/Model Parameters/model_news_qa.pt'))

<All keys matched successfully>

## Evaluation process

**Model evaluation function**

Before we begin the evaluation process, we must define the function that evaluates a QA model on a given dataset. The following function computes the average F1 and exact match scores of the model over all examples in the dataset. It also computes the fraction of unanswerable questions (unanswerable from the beginning, or because the answer was truncated away) for which the model still predicted an answer, reported as "Wrong no ans", to show how well the model does on unanswerable questions. In other words, the lower "Wrong no ans" is, the better the model handles questions without answers. If the dataset contains no unanswerable questions, the function reports 0.

In [None]:
def evaluate_model(model: BertForQuestionAnswering, dataset: pd.DataFrame):
  """Print average exact match, average F1 and the wrong-answer rate on unanswerable questions."""
  exact_match_sum = 0
  f1_sum = 0
  no_ans = 0        # number of unanswerable questions
  wrong_no_ans = 0  # unanswerable questions for which the model still returned an answer
  for val_row in tqdm(dataset.values):
    question = val_row[0]
    context = val_row[2]
    answers = val_row[5]
    predicted_answer = predict(model, question, context)
    # An unanswerable question is stored with a single empty gold answer.
    if len(answers) == 1 and answers[0] == '':
      no_ans += 1
      if len(predicted_answer) > 0:
        wrong_no_ans += 1
    # Score against every gold answer and keep the best match.
    em = max([compute_exact(gold_ans, predicted_answer) for gold_ans in answers])
    f1 = max([compute_f1(gold_ans, predicted_answer) for gold_ans in answers])
    exact_match_sum += em
    f1_sum += f1

  em_score = exact_match_sum / len(dataset.values)
  f1_score = f1_sum / len(dataset.values)

  print()
  print("Exact Match = ", em_score)
  print("F1 = ", f1_score)
  # Report 0 when the dataset has no unanswerable questions.
  percent_wrong_ans = wrong_no_ans / no_ans if no_ans > 0 else 0
  print("Wrong no ans = ", percent_wrong_ans)
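
The aggregation logic can be tried without a model. The sketch below is an illustration under assumptions, not the notebook's code: `summarize`, `toy_exact`, the toy examples and the canned predictions are all hypothetical stand-ins, and only exact match is aggregated to keep it short. It follows the same two rules as the function above: take the best score over all gold answers, and count an unanswerable question as wrong whenever the prediction is non-empty.

```python
def toy_exact(gold, pred):
    """Stand-in for compute_exact: case-insensitive string equality."""
    return int(gold.strip().lower() == pred.strip().lower())


def summarize(examples, predict_fn):
    """examples: list of (question, gold_answers); predict_fn: question -> answer string."""
    em_sum = 0
    no_ans = 0
    wrong_no_ans = 0
    for question, gold_answers in examples:
        pred = predict_fn(question)
        # Unanswerable questions carry a single empty gold answer.
        if gold_answers == [""]:
            no_ans += 1
            if pred:
                wrong_no_ans += 1
        # Best score over all gold answers.
        em_sum += max(toy_exact(gold, pred) for gold in gold_answers)
    em = em_sum / len(examples)
    # Fall back to 0 when there is nothing to divide by.
    wrong_rate = wrong_no_ans / no_ans if no_ans > 0 else 0
    return em, wrong_rate


examples = [
    ("q1", ["Paris", "Paris, France"]),  # answerable, two gold answers
    ("q2", [""]),                        # unanswerable
    ("q3", [""]),                        # unanswerable
    ("q4", ["Rome"]),                    # answerable
]
canned = {"q1": "paris", "q2": "", "q3": "Berlin", "q4": "Milan"}

em, wrong_rate = summarize(examples, canned.get)
print(em, wrong_rate)
```

Here q1 and q2 are scored as correct, and one of the two unanswerable questions (q3) receives an answer, so both values come out as 0.5. Because the fallback is the integer `0` while a real division returns a float, a printed `0` (and not `0.0`) indicates that no unanswerable rows were counted for that dataset.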

**Cross Evaluation**

And now it's the time all of you have been waiting for.......

Cross Evaluation!!!!!!!

In [None]:
print('SQuAD - SQuAD')
evaluate_model(model_squad, squad_dataset)

SQuAD - SQuAD


100%|██████████| 11873/11873 [15:02<00:00, 13.15it/s]


Exact Match =  0.6597321654173335
F1 =  0.7770930233701563
Wrong no ans =  1.0





In [None]:
print('SQuAD - TriviaQA')
evaluate_model(model_squad, triviaqa_dataset)

SQuAD - TriviaQA


100%|██████████| 14229/14229 [18:23<00:00, 12.90it/s]


Exact Match =  0.2866680722468199
F1 =  0.34671511994836374
Wrong no ans =  0.888838372357354





In [None]:
print('SQuAD - NQ')
evaluate_model(model_squad, nq_dataset)

SQuAD - NQ


100%|██████████| 3369/3369 [04:14<00:00, 13.23it/s]


Exact Match =  0.43514395963193825
F1 =  0.5139319968924498
Wrong no ans =  0.9032576505429417





In [None]:
print('SQuAD - QuAC')
evaluate_model(model_squad, quac_dataset)

SQuAD - QuAC


100%|██████████| 7354/7354 [09:25<00:00, 13.01it/s]


Exact Match =  0.055751971716072886
F1 =  0.14449650779692247
Wrong no ans =  0.8499327052489906





In [None]:
print('SQuAD - NewsQA')
evaluate_model(model_squad, newsqa_dataset)

SQuAD - NewsQA


100%|██████████| 5166/5166 [06:36<00:00, 13.03it/s]


Exact Match =  0.30952380952380953
F1 =  0.4428092721806003
Wrong no ans =  0





In [None]:
print('TriviaQA - SQuAD')
evaluate_model(model_trivia_qa, squad_dataset)

TriviaQA - SQuAD


100%|██████████| 11873/11873 [14:50<00:00, 13.33it/s]


Exact Match =  0.2916701760296471
F1 =  0.4066479644433105
Wrong no ans =  1.0





In [None]:
print('TriviaQA - TriviaQA')
evaluate_model(model_trivia_qa, triviaqa_dataset)

TriviaQA - TriviaQA


100%|██████████| 14229/14229 [18:27<00:00, 12.85it/s]


Exact Match =  0.3825286386956216
F1 =  0.4194406443745105
Wrong no ans =  0.8920209138440555





In [None]:
print('TriviaQA - NQ')
evaluate_model(model_trivia_qa, nq_dataset)

TriviaQA - NQ


100%|██████████| 3369/3369 [04:17<00:00, 13.11it/s]


Exact Match =  0.2861383199762541
F1 =  0.37604680117349526
Wrong no ans =  0.8825271470878578





In [None]:
print('TriviaQA - QuAC')
evaluate_model(model_trivia_qa, quac_dataset)

TriviaQA - QuAC


100%|██████████| 7354/7354 [09:31<00:00, 12.86it/s]


Exact Match =  0.03739461517541474
F1 =  0.08930756175633274
Wrong no ans =  0.8694481830417228





In [None]:
print('TriviaQA - NewsQA')
evaluate_model(model_trivia_qa, newsqa_dataset)

TriviaQA - NewsQA


100%|██████████| 5166/5166 [06:43<00:00, 12.80it/s]


Exact Match =  0.18447541618273325
F1 =  0.28828200120712655
Wrong no ans =  0





In [None]:
print('NQ - SQuAD')
evaluate_model(model_nq, squad_dataset)

NQ - SQuAD


100%|██████████| 11873/11873 [15:07<00:00, 13.08it/s]


Exact Match =  0.4459698475532721
F1 =  0.6001581599452476
Wrong no ans =  1.0





In [None]:
print('NQ - TriviaQA')
evaluate_model(model_nq, triviaqa_dataset)

NQ - TriviaQA


100%|██████████| 14229/14229 [18:33<00:00, 12.78it/s]


Exact Match =  0.2518096844472556
F1 =  0.3241383689705737
Wrong no ans =  0.8942941577631279





In [None]:
print('NQ - NQ')
evaluate_model(model_nq, nq_dataset)

NQ - NQ


100%|██████████| 3369/3369 [04:17<00:00, 13.08it/s]


Exact Match =  0.5244879786286732
F1 =  0.5917062438615326
Wrong no ans =  0.9397828232971372





In [None]:
print('NQ - QuAC')
evaluate_model(model_nq, quac_dataset)

NQ - QuAC


100%|██████████| 7354/7354 [09:33<00:00, 12.83it/s]


Exact Match =  0.04976883328800653
F1 =  0.14097679112411426
Wrong no ans =  0.8566621803499327





In [None]:
print('NQ - NewsQA')
evaluate_model(model_nq, newsqa_dataset)

NQ - NewsQA


100%|██████████| 5166/5166 [06:44<00:00, 12.77it/s]


Exact Match =  0.21641502129307008
F1 =  0.3577094964987927
Wrong no ans =  0





In [None]:
print('QuAC - SQuAD')
evaluate_model(model_quac, squad_dataset)

QuAC - SQuAD


100%|██████████| 11873/11873 [15:11<00:00, 13.03it/s]


Exact Match =  0.07933967826160196
F1 =  0.31457582861819866
Wrong no ans =  0.8





In [None]:
print('QuAC - TriviaQA')
evaluate_model(model_quac, triviaqa_dataset)

QuAC - TriviaQA


100%|██████████| 14229/14229 [18:35<00:00, 12.76it/s]


Exact Match =  0.09586056644880174
F1 =  0.16654775747299863
Wrong no ans =  0.8242782450556945





In [None]:
print('QuAC - NQ')
evaluate_model(model_quac, nq_dataset)

QuAC - NQ


100%|██████████| 3369/3369 [04:17<00:00, 13.07it/s]


Exact Match =  0.12911843276936777
F1 =  0.3106498958525553
Wrong no ans =  0.8795656465942744





In [None]:
print('QuAC - QuAC')
evaluate_model(model_quac, quac_dataset)

QuAC - QuAC


100%|██████████| 7354/7354 [09:31<00:00, 12.86it/s]


Exact Match =  0.1259178678270329
F1 =  0.2752948495260583
Wrong no ans =  0.8458950201884253





In [None]:
print('QuAC - NewsQA')
evaluate_model(model_quac, newsqa_dataset)

QuAC - NewsQA


100%|██████████| 5166/5166 [06:44<00:00, 12.78it/s]


Exact Match =  0.06620209059233449
F1 =  0.2425414138183661
Wrong no ans =  0





In [None]:
print('NewsQA - SQuAD')
evaluate_model(model_newsqa, squad_dataset)

NewsQA - SQuAD


100%|██████████| 11873/11873 [15:05<00:00, 13.12it/s]


Exact Match =  0.4840394171649962
F1 =  0.6423214040979349
Wrong no ans =  0.8666666666666667





In [None]:
print('NewsQA - TriviaQA')
evaluate_model(model_newsqa, triviaqa_dataset)

NewsQA - TriviaQA


100%|██████████| 14229/14229 [18:32<00:00, 12.78it/s]


Exact Match =  0.24611708482676226
F1 =  0.3192167396122766
Wrong no ans =  0.8745169356671971





In [None]:
print('NewsQA - NQ')
evaluate_model(model_newsqa, nq_dataset)

NewsQA - NQ


100%|██████████| 3369/3369 [04:17<00:00, 13.08it/s]


Exact Match =  0.398337785693084
F1 =  0.4844369352058514
Wrong no ans =  0.8775913129318855





In [None]:
print('NewsQA - QuAC')
evaluate_model(model_newsqa, quac_dataset)

NewsQA - QuAC


100%|██████████| 7354/7354 [09:31<00:00, 12.86it/s]


Exact Match =  0.054392167527875984
F1 =  0.15079696693121114
Wrong no ans =  0.8458950201884253





In [None]:
print('NewsQA - NewsQA')
evaluate_model(model_newsqa, newsqa_dataset)

NewsQA - NewsQA


100%|██████████| 5166/5166 [06:42<00:00, 12.82it/s]


Exact Match =  0.40321331784746417
F1 =  0.5416083851394523
Wrong no ans =  0
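
To read the 25 runs side by side, the F1 scores can be collected into one matrix. The snippet below is a minimal sketch: the numbers are copied from the outputs above and rounded to 4 decimals, while the layout (rows are the fine-tuning dataset, columns are the evaluation dataset) and the variable names are my own choice for illustration.

```python
datasets = ["SQuAD", "TriviaQA", "NQ", "QuAC", "NewsQA"]

# F1 scores from the runs above, rounded to 4 decimals.
# Key: dataset the model was fine-tuned on.
# Values: F1 on each evaluation dataset, in the order of `datasets`.
f1 = {
    "SQuAD":    [0.7771, 0.3467, 0.5139, 0.1445, 0.4428],
    "TriviaQA": [0.4066, 0.4194, 0.3760, 0.0893, 0.2883],
    "NQ":       [0.6002, 0.3241, 0.5917, 0.1410, 0.3577],
    "QuAC":     [0.3146, 0.1665, 0.3106, 0.2753, 0.2425],
    "NewsQA":   [0.6423, 0.3192, 0.4844, 0.1508, 0.5416],
}

# Print the matrix as a plain-text table.
corner = "train/eval"
print(f"{corner:<12}" + "".join(f"{name:>10}" for name in datasets))
for model_name in datasets:
    row = "".join(f"{score:>10.4f}" for score in f1[model_name])
    print(f"{model_name:<12}" + row)

# For each evaluation dataset, find the model with the highest F1.
best = {
    eval_name: max(datasets, key=lambda model_name: f1[model_name][col])
    for col, eval_name in enumerate(datasets)
}
print(best)
```

In these runs the best F1 on every dataset comes from the model fine-tuned on that same dataset.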



