# Assignment 4 Q/A Data Prep & Baselines
# 60 points

For assignments 4, 5, and 6 we will be exploring question answering (Q/A) using a small variant of BERT called "BERT mini" which is a 4-layer, 11M parameter model described here:

https://huggingface.co/prajjwal1/bert-mini 

https://github.com/prajjwal1/generalize_lm_nli

This model should be small enough to load into memory, run inference, and finetune on most laptop/desktop computers.  If you run into resource constraints in your compute environment, then you can use "BERT tiny" which is a 2-layer model with 4.4M parameters described here:

https://huggingface.co/prajjwal1/bert-tiny

In this assignment, we will focus on obtaining and preparing datasets for training and evaluation and on establishing a baseline Q/A system that uses BERT mini "out of the box".  We are going to use three datasets for this work: 

 * SQuAD - a publicly available Q/A dataset that is readily available and can be used without any preparation
 * Wikipedia - we will discuss this dataset in assignment 5
 * Custom Dataset - we will create a custom dataset as a class as discussed below.




# Imports

Our first step will be to import relevant python libraries including HuggingFace (transformers and datasets) and PyTorch (torch).  If you get an error when loading these libraries, then you may need to install them with a command like the following:

In [1]:
! pip install datasets



In [2]:
# run this block to import the necessary libraries
from transformers import BertTokenizer, BertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import random
from collections import Counter
import statistics
import json

import os
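# workaround for SSL certificate errors in some environments; remove this line if downloads work without it
os.environ['CURL_CA_BUNDLE'] = ''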



# SQuAD

You can read about the SQuAD dataset here:

https://rajpurkar.github.io/SQuAD-explorer/

SQuAD consists of 87,599 training examples and 10,570 validation examples.  See for yourself!

In [3]:
squad = load_dataset("squad")
print(squad)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [4]:
train_count = len(squad['train'])
val_count = len(squad['validation'])
print(f"number of training examples: {train_count:,}")
print(f"number of validation examples: {val_count:,}")
print(f"total number of examples: {train_count + val_count:,}")

## SQuAD Examples

Each example has five key/value pairs for the following keys: 
* id - a unique identifier
* title - the title of the Wikipedia article that the context passage was extracted from
* context - a snippet of a Wikipedia article that contains the answer to the question
* question - a question for which there is an answer that is provided by the context
* answers - each answer has two key/value pairs corresponding to a textual answer and a character offset
  * text - the text of the answer
  * answer_start - the character offset where the answer text can be found in the context string.

Look at an example from the training data below.  To see more, run the random-sampling cell that follows it several times:

In [5]:
print(squad['train'][1])

{'id': '5733be284776f4190066117f', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'What is in front of the Notre Dame Main Building?', 'answers': {'text': ['a copper statue of Christ'], 'answer_start': [188]}}


In [6]:
def print_example(index, example):
    print(f"example[{index}]: id = {example['id']}")
    print(f"title = {example['title']}")
    print(f"context = {example['context']}")
    print(f"question = {example['question']}")
    for answer_text in example['answers']['text']:
        print(f"answer = {answer_text}")

In [7]:
index = random.randint(0, train_count - 1)
print_example(index, squad['train'][index])

example[27085]: id = 570983a2200fba14003680f3
title = Grape
context = There are several sources of the seedlessness trait, and essentially all commercial cultivators get it from one of three sources: Thompson Seedless, Russian Seedless, and Black Monukka, all being cultivars of Vitis vinifera. There are currently more than a dozen varieties of seedless grapes. Several, such as Einset Seedless, Benjamin Gunnels's Prime seedless grapes, Reliance, and Venus, have been specifically cultivated for hardiness and quality in the relatively cold climates of northeastern United States and southern Ontario.
question = How many seedless grape sources are there for commercial cultivators? 
answer = three


## SQuAD Analysis
Whenever you are working with a new dataset it is important to inspect it and understand what is in it.  In this section, you are asked to write code that inspects the SQuAD data to answer the following questions:
 * What are the lengths of the contexts?
 * What are the lengths of the answers?
 * How many examples have multiple answers?
   * When there are multiple answers, are they different?
 * Is the answer text always consistent with the text found at the answer_start?

I have to admit that I'm irritated by the possibility that an example could have more than one answer.  What if they are different?  How will this complicate the evaluation?  Let's run a quick sanity check first:

In [8]:
# how many times does an example have more than one answer?
count = 0
for example in squad['train']:
    if len(example['answers']['text']) != 1:
        count +=1

print(f"count={count}")

count=0


Phew!  Each example has one answer.  That simplifies our lives.  But just to be sure, let's run it again on the validation data:

In [9]:
# how many times does an example have more than one answer?
count = 0
for example in squad['validation']:
    if len(example['answers']['text']) != 1:
        count +=1

print(f"count={count}")

count=10567


Ugh!  We have multiple answers in the validation data.  This will complicate the counting we do below and our evaluation too.  Let's look at one to understand why a question would have multiple answers:

In [10]:
# This example has three correct/possible answers.  This gives the model a better chance of getting an answer that is marked as correct.
squad['validation'][2]

{'id': '56be4db0acb8001400a502ee',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Where did Super Bowl 50 take place?',
 'answers': {'text': ['Santa Clara, California',
   "Levi's Stadium",
   "Levi's Stadium in th

In [11]:
# let's quickly verify that the first answer is a substring of the context
context = squad['validation'][2]['context']
answer1_text = squad['validation'][2]['answers']['text'][0]
answer1_start = squad['validation'][2]['answers']['answer_start'][0]
answer1_end = answer1_start + len(answer1_text)
assert answer1_text == context[answer1_start:answer1_end]
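
Multiple reference answers are usually handled at evaluation time by giving the model credit if its prediction matches any one of them.  Here is a minimal sketch of that idea, assuming a simplified lowercase/white-space normalization (the official SQuAD evaluation script normalizes more aggressively).  The helper names `normalize` and `exact_match_any` are hypothetical, and the references are two of the answers shown above:

```python
def normalize(text):
    """Lowercase and collapse runs of white space (an assumed, simplified normalization)."""
    return " ".join(text.lower().split())

def exact_match_any(prediction, reference_answers):
    """Return True if the prediction matches at least one reference answer."""
    return any(normalize(prediction) == normalize(ref) for ref in reference_answers)

# reference answers shaped like example['answers']['text'] in the validation data
references = ["Santa Clara, California", "Levi's Stadium"]

print(exact_match_any("levi's stadium", references))
print(exact_match_any("San Francisco", references))
```

With this scoring rule, extra reference answers can only help the model: a prediction is counted as correct as soon as one reference agrees with it.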

Similarly, I am irritated by the notion that perhaps the answers provided are not substrings of the contexts.  Let's do another quick sanity check to see how consistent the data is:

In [12]:
bad_answer_count = 0
good_answer_count = 0
for example in list(squad['train']) + list(squad['validation']):
    context = example['context']
    for answer, start in zip(example['answers']['text'], example['answers']['answer_start']):
        if answer != context[start:(start+len(answer))]:
            print(f"answer={answer}")
            print(f"start={start}")
            print(f"context[start:start+len(answer)]={context[start:(start+len(answer))]}")
            bad_answer_count += 1
        else:
            good_answer_count += 1
print(f"bad_answer_count = {bad_answer_count}")
print(f"good_answer_count = {good_answer_count}")

bad_answer_count = 0
good_answer_count = 122325


Someone did some quality control on this dataset!  
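
The question of whether multiple answers actually differ can be checked in the same style.  The sketch below runs on two made-up examples in the SQuAD format so that it is self-contained; the intended use would be to pass in `squad['validation']` instead.  The function name is hypothetical:

```python
def count_examples_with_distinct_answers(examples):
    """Count examples whose answer texts are not all identical."""
    count = 0
    for example in examples:
        # a set keeps one copy of each distinct answer text
        if len(set(example['answers']['text'])) > 1:
            count += 1
    return count

# toy examples (not taken from SQuAD): the first repeats one answer, the second has two different ones
toy_examples = [
    {'answers': {'text': ['three', 'three', 'three'], 'answer_start': [10, 10, 10]}},
    {'answers': {'text': ['Santa Clara', "Levi's Stadium"], 'answer_start': [5, 40]}},
]

print(count_examples_with_distinct_answers(toy_examples))
```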

Ok, let's answer the other questions.  First we are going to gather some basic statistics:

 * the lengths of the contexts in characters
 * the lengths of the contexts in tokens (white-space delimited)
 * the lengths of the answers in characters
 * the lengths of the answers in tokens (white-space delimited)

Because the above four will require very similar code, we are going to avoid writing lots of redundant code by using Python lambda functions as outlined next.  A lambda function is a small anonymous function written inline as a single expression.  Like any Python function, it can be passed as a parameter to another function, which can then call it.
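
As a minimal sketch of this pattern (the helper name `apply_and_sum` and the sample strings are hypothetical, not part of the assignment), the same function can measure characters or tokens depending on which lambda is passed in:

```python
def apply_and_sum(texts, len_fxn):
    """Call len_fxn on every text and add up the results."""
    return sum(len_fxn(text) for text in texts)

texts = ["a copper statue of Christ", "three"]

# measure in characters
char_total = apply_and_sum(texts, lambda text: len(text))
# measure in white-space delimited tokens
token_total = apply_and_sum(texts, lambda text: len(text.split()))

print(char_total, token_total)
```

Only the lambda changes between the two calls; the looping and summing logic is written once.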

# TODO (5 points)

In [13]:
# TODO Please implement the following method

def length_stats(examples, text_fxn, len_fxn):
    """
    Parameters:
    examples (Dataset): an iterable of example dictionaries, e.g. squad['train']
    text_fxn (lambda): is a lambda function that takes an example and produces a text (e.g. a context or an answer) which we can then measure the length of
    len_fxn (lambda): is a lambda function that takes a text and returns a count such as the length in character or the number of tokens

    Returns:
    total (int): total count for all lengths of all the texts seen
    maximum (int): maximum length seen
    minimum (int): minimum length seen
    mean (float): average length 
    median (float): median length
    mode (int): length that occurs most frequently.

    Methods you might find useful are: sum, max, min, list.append, statistics.mean, statistics.median, statistics.mode
    """
    text_lengths = []
    for example in examples:
        text = text_fxn(example)
        length = len_fxn(text)
        text_lengths.append(length)

    minimum = min(text_lengths)
    maximum = max(text_lengths)
    total = sum(text_lengths)
    mean = statistics.mean(text_lengths)
    median = statistics.median(text_lengths)
    mode = statistics.mode(text_lengths)
    # the last line of the function should be equivalent to the following:
    return total, maximum, minimum, mean, median, mode


In [14]:
# This first method is implemented for you
def compute_context_lengths_chars(examples):
    """
    This method should return the total, max, min, average (mean), median, and mode of the context lengths as measured in characters.
    Methods you might find useful are: len
    """
    # this function says "pass in an example and return its context"
    text_fxn = lambda example: example['context']
    # this function says "pass in a context and return its length"
    len_fxn = lambda context: len(context)
    return length_stats(examples, text_fxn, len_fxn)

In [15]:
# this method prints out the actual summary stats and then compares them with the expected values passed in.  
def print_and_assert(length_stats_fxn, examples, description, expected_total, expected_maximum, expected_minimum, expected_mean, expected_median, expected_mode):
    actual_total, actual_maximum, actual_minimum, actual_mean, actual_median, actual_mode = length_stats_fxn(examples)

    print(f"{description}")
    print(f"total: {actual_total}")
    print(f"maximum: {actual_maximum}")
    print(f"minimum: {actual_minimum}")
    print(f"mean: {actual_mean}")
    print(f"median: {actual_median}")
    print(f"mode: {actual_mode}")

    assert actual_total == expected_total
    assert actual_maximum == expected_maximum
    assert actual_minimum == expected_minimum
    assert actual_mean == expected_mean
    assert actual_median == expected_median
    assert actual_mode == expected_mode

In [16]:
# Please run the following code to check your work
print_and_assert(compute_context_lengths_chars, squad['train'], "length in characters of train contexts", 66081551, 3706, 151, 754.3642164864896, 693, 597)
print_and_assert(compute_context_lengths_chars, squad['validation'], "length in characters of validation contexts", 8233854, 4063, 157, 778.98334910122996, 703, 631)

length in characters of train contexts
total: 66081551
maximum: 3706
minimum: 151
mean: 754.3642164864896
median: 693
mode: 597
length in characters of validation contexts
total: 8233854
maximum: 4063
minimum: 157
mean: 778.9833491012299
median: 703.0
mode: 631


# TODO (5 points)

In [17]:
# TODO Please implement the following method

def compute_context_lengths_tokens(examples):
    """
    This method should return the total, max, min, average (mean), median, and mode of the context lengths as measured in white-space separated tokens.
    Methods you might find useful are: len, string.split (for tokenization)
    """
    text_fxn = lambda example: example['context']
    len_fxn = lambda context: len(context.split())
    return length_stats(examples, text_fxn, len_fxn)

In [18]:
# Please run the following code to check your work
print_and_assert(compute_context_lengths_tokens, squad['train'], "length in tokens of train contexts", 10491130, 653, 20, 119.76312514983047, 110, 87)
print_and_assert(compute_context_lengths_tokens, squad['validation'], "length in tokens of validation contexts", 1310201, 629, 22, 123.9546830652791, 112, 104)

length in tokens of train contexts
total: 10491130
maximum: 653
minimum: 20
mean: 119.76312514983047
median: 110
mode: 87
length in tokens of validation contexts
total: 1310201
maximum: 629
minimum: 22
mean: 123.9546830652791
median: 112.0
mode: 104


# TODO (5 Points)

In [19]:
# TODO Please implement the following method

def compute_answer_lengths_chars(examples):
    """
    This method should return the total, max, min, average (mean), median, and mode of the answer lengths as measured in characters.
    Methods you might find useful are: len
    """
    text_fxn = lambda example: example['answers']['text'][0]
    len_fxn = lambda text: len(text)
    return length_stats(examples, text_fxn, len_fxn)

In [20]:
# Please run the following code to check your work
print_and_assert(compute_answer_lengths_chars, squad['train'], "length in characters of train answers", 1764881, 239, 1, 20.147273370700578, 14, 4)
print_and_assert(compute_answer_lengths_chars, squad['validation'], "length in characters of validation answers", 204878, 160, 1, 19.382970671712393, 14, 4)

length in characters of train answers
total: 1764881
maximum: 239
minimum: 1
mean: 20.147273370700578
median: 14
mode: 4
length in characters of validation answers
total: 204878
maximum: 160
minimum: 1
mean: 19.382970671712393
median: 14.0
mode: 4


# TODO (5 Points)

In [21]:
# TODO Please implement the following method

def compute_answer_lengths_tokens(examples):
    """
    This method should return the total, max, min, average (mean), median, and mode of the answer lengths as measured in white-space separated tokens.
    Methods you might find useful are len, sum, list.append, statistics.mean, statistics.median, statistics.mode
    """
    text_fxn = lambda example: example['answers']['text'][0]
    len_fxn = lambda text: len(text.split())
    return length_stats(examples, text_fxn, len_fxn)

In [22]:
# Please run the following code to check your work
print_and_assert(compute_answer_lengths_tokens, squad['train'], "length in tokens of train answers", 277_002, 43, 1, 3.162159385381112, 2, 1)
print_and_assert(compute_answer_lengths_tokens, squad['validation'], "length in tokens of validation answers", 31_884, 29, 1, 3.016461684011353, 2, 1)

length in tokens of train answers
total: 277002
maximum: 43
minimum: 1
mean: 3.162159385381112
median: 2
mode: 1
length in tokens of validation answers
total: 31884
maximum: 29
minimum: 1
mean: 3.016461684011353
median: 2.0
mode: 1


## SQuAD Observations

This was an interesting exercise.  What did we learn?  Some of the contexts are very short - as short as 20 tokens.  Some are quite long - as many as 653 tokens.  Similarly, answers can be short or long.  Some are one token long - in fact, that's the most common length of an answer.  The median answer length (2 tokens) is noticeably shorter than the mean (about 3 tokens), which suggests that there might be a fair number of long answers.  We found that the longest answer is 43 tokens!  Let's print out some of the long contexts and long answers:

In [23]:
# if the current context is the longest we've seen thus far, then print it out
max_length = 0
for example in squad['train']:
    context_len = len(example['context'])
    if context_len > max_length:
        max_length = context_len
        print(f"context length={context_len}: {example['context']}\n\n")

context length=695: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


context length=1405: As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 187

In [24]:
# if the current answer is the longest we've seen thus far, then print it out
max_length = 0
for example in squad['train']:
    answer_len = len(example['answers']['text'][0])
    if answer_len > max_length:
        max_length = answer_len
        print(f"answer length={answer_len}: {example['answers']['text'][0]}\n")

answer length=26: Saint Bernadette Soubirous

answer length=39: a Marian place of prayer and reflection

answer length=54: Joan B. Kroc Institute for International Peace Studies

answer length=67: oldest university band in continuous existence in the United States

answer length=71: American Society of Composers, Authors, and Publishers Pop Music Awards

answer length=76: the American Society of Composers, Authors, and Publishers Pop Music Awards.

answer length=98: sportswear, denim offerings with fur, outerwear and accessories that include handbags and footwear

answer length=128: The word genocide is the combination of the Greek prefix geno- (meaning tribe or race) and caedere (the Latin word for to kill).

answer length=153: a specific set of violent crimes that are committed against a certain group with the attempt to remove the entire group from existence or to destroy them

answer length=239: that the sudden shift of a huge quantity of water into the region could have relaxed th
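
A median that sits below the mean, as observed for the answer lengths, is what a long right tail looks like.  A small sketch with made-up answer lengths (not taken from SQuAD) shows the effect, using `statistics.quantiles` to look at the upper end of the distribution:

```python
import statistics

# hypothetical answer lengths in tokens: mostly short, with a few long outliers
lengths = [1, 1, 1, 2, 2, 2, 3, 3, 20, 43]

mean = statistics.mean(lengths)
median = statistics.median(lengths)
# n=10 yields nine cut points; the last one approximates the 90th percentile
p90 = statistics.quantiles(lengths, n=10)[-1]

print(f"mean={mean}, median={median}, p90={p90}")
```

A handful of long answers pulls the mean well above the median, while the median stays with the bulk of the short answers.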

# Custom Dataset

We are going to create a custom Q/A dataset by having every student create 10 question / answer pairs in English.  If everyone creates 10 examples, then we should have a new corpus of Q/A pairs with over 500 examples.  There is no plan to release this data outside of this class, so we are not going to be particularly careful about how we gather data for this corpus.  Please use good judgment and use resources that are widely available.  Each example should have the following fields:

 * context - you can copy some span of text that contains the answer for the question
 * question - please compose a new question that is based on the context and whose answer is provided in the context
 * answers - these should correspond to spans of text found in the context
 * source - this should be a link to a page where the text of the context is visible somewhere on the screen.

There are many sources of widely available data including various news outlets, GitHub, [Project Gutenberg](https://www.gutenberg.org), [EDGAR](https://www.sec.gov/search-filings), [PubMed](https://pubmed.ncbi.nlm.nih.gov/), [arXiv](https://arxiv.org/), [IMDB](https://www.imdb.com/), etc.  Please do not use StackExchange or other websites built around questions and answers.  Also, please avoid social media sites such as X, Facebook, etc.  Please try to create question and answer pairs that are suitable for this learning task.  Please look at the SQuAD data for examples if you need some sense of what they should look like.  

Please create 10 question/answer examples and format them using the following code and then submit them to me as `your-name.jsonl`.  Everyone in class will be able to see all questions submitted, but no one will know which questions you contributed.

When all the examples have been submitted I will combine the jsonl files into a new dataset and distribute it.

In [25]:
def to_example(context, question, answers, source):
    if len(answers) != len(set(answers)):
        raise ValueError("Invalid answers: each answer must be unique.")
    qa = {'context':context, 'question':question, 'source':source}
    answer_starts = []
    for answer in answers:
        answer_start = context.find(answer)
        if answer_start >= 0:
            answer_starts.append(answer_start)
        else:
            raise ValueError(f"Invalid answer: {answer}. The answer must be a substring of the context.")
    qa['answers'] = {'text':answers, 'answer_start':answer_starts}  
    return qa
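
One caveat with `to_example`: `str.find` returns only the first occurrence, so if the answer text appears more than once in the context, the recorded `answer_start` may not be the occurrence you had in mind.  A hypothetical helper (not part of the assignment code) that lists every occurrence makes this easy to check before submitting:

```python
def find_all_starts(context, answer):
    """Return the character offsets of every occurrence of answer in context."""
    starts = []
    start = context.find(answer)
    while start >= 0:
        starts.append(start)
        # resume the search one character past the previous match
        start = context.find(answer, start + 1)
    return starts

# made-up context in which the answer text appears twice
context = "The film was released in 1984 and was a box office success in 1984."
print(find_all_starts(context, "1984"))
print(find_all_starts(context, "1990"))
```

If the helper reports more than one offset, consider trimming the context or choosing a longer answer span so that the answer is unambiguous.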

# TODO (20 Points)

Please email me the resulting jsonl file

In [26]:
# TODO please create 10 blocks of code corresponding to 10 question answer pairs.  
context1 = "Surf's Up is a 2007 American animated mockumentary comedy film produced by Columbia Pictures and Sony Pictures Animation, and distributed by Sony Pictures Releasing. It was directed by Ash Brannon and Chris Buck from a screenplay they co-wrote with Don Rhymer and producer Chris Jenkins, based on a story by Jenkins and Christian Darren. The film stars the voices of Shia LaBeouf, Jeff Bridges, Zooey Deschanel, Jon Heder, and James Woods. It is a parody of surfing documentaries, such as The Endless Summer and Riding Giants, with parts of the plot parodying North Shore."
question1 = "What year was Surf's Up released?"
answers1 = ["2007"]
source1 = "https://en.wikipedia.org/wiki/Surf's_Up_(film)"

context2 = "McDonald's Corporation, doing business as McDonald's, is an American multinational fast food chain, founded in 1940 as a restaurant operated by Richard and Maurice McDonald, in San Bernardino, California, United States. They rechristened their business as a hamburger stand and later turned the company into a franchise, with the Golden Arches logo being introduced in 1953 at a location in Phoenix, Arizona."
question2 = "Where were the Golden Arches first introduced?"
answers2 = ["Phoenix, Arizona"]
source2 = "https://en.wikipedia.org/wiki/McDonald%27s"

context3 = "Darien (formerly Cass) is a city in DuPage County, Illinois, United States. Per the 2020 census, the population was 22,011. A southwestern suburb of Chicago, Darien was named after the town of Darien, Connecticut. Darien is just north of I-55 and Historic U.S. Route 66 (now Frontage Road). The entire south edge of the town borders Waterfall Glen."
question3 = "What is the population of Darien?"
answers3 = ["22,011"]
source3 = "https://en.wikipedia.org/wiki/Darien,_Illinois"

context4 = "A synthesizer (also synthesiser or synth) is an electronic musical instrument that generates audio signals. Synthesizers typically create sounds by generating waveforms through methods including subtractive synthesis, additive synthesis and frequency modulation synthesis. These sounds may be altered by components such as filters, which cut or boost frequencies; envelopes, which control articulation, or how notes begin and end; and low-frequency oscillators, which modulate parameters such as pitch, volume, or filter characteristics affecting timbre. Synthesizers are typically played with keyboards or controlled by sequencers, software or other instruments, and may be synchronized to other equipment via MIDI."
question4 = "What do pitch, volume or filter characteristics affect?"
answers4 = ["timbre"]
source4 = "https://en.wikipedia.org/wiki/Synthesizer"

context5 = "Temple University (Temple or TU) is a public state-related research university in Philadelphia, Pennsylvania. It was founded in 1884 by the Baptist minister Russell Conwell and his congregation Grace Baptist Church of Philadelphia then called Baptist Temple."
question5 = "Who founded Temple University?"
answers5 = ["Russell Conwell"]
source5 = "https://en.wikipedia.org/wiki/Temple_University"

context6 = "Footloose is a 1984 American musical drama film directed by Herbert Ross and written by Dean Pitchford. It tells the story of Ren McCormack (Kevin Bacon), a teenager from Chicago who moves to a small town, where he attempts to overturn the ban on dancing instituted by the efforts of a local minister (John Lithgow). The film was released on February 17, 1984, by Paramount Pictures, and received mixed reviews from the critics and was a box office success, grossing $80 million in North America, becoming the seventh highest-grossing film of 1984. The songs Footloose by Kenny Loggins and Let's Hear It for the Boy by Deniece Williams were nominated for the Academy Award for Best Original Song."
question6 = "What were songs Footloose and Let's Hear It for the Boy nominated for?"
answers6 = ["the Academy Award for Best Original Song"]
source6 = "https://en.wikipedia.org/wiki/Footloose"

context7 = "Jennifer Shrader Lawrence (born August 15, 1990) is an American actress and producer. Lawrence is known for starring in both action film franchises and independent dramas, and her films have grossed over $6 billion worldwide. The world's highest-paid actress in 2015 and 2016, she appeared in Time's 100 most influential people in the world list in 2013 and the Forbes Celebrity 100 list from 2013 to 2016."
question7 = "What year was Jennifer Lawrence born?"
answers7 = ["1990"]
source7 = "https://en.wikipedia.org/wiki/Jennifer_Lawrence"

context8 = "The six classes of British nationality each have varying degrees of civil and political rights, due to the UK's historical status as a colonial empire. The principal class of British nationality is British citizenship, which is associated with the British Islands. British nationals associated with an overseas territory are British Overseas Territories citizens (BOTCs). Almost all BOTCs (except for those from Akrotiri and Dhekelia) have also been British citizens since 2002. Individuals connected with former British colonies may hold residual forms of British nationality, which do not confer an automatic right of abode in the United Kingdom and generally may no longer be acquired. These residual nationalities are the statuses of British Overseas citizen, British subject, British National (Overseas), and British protected person."
question8 = "What class of British nationality are British nationals associated with overseas territory?"
answers8 = ["British Overseas Territories citizens (BOTCs)"]
source8 = "https://en.wikipedia.org/wiki/British_nationality_law"

context9 = "The Ireland Act 1949 is an Act of the Parliament of the United Kingdom intended to deal with the consequences of the Republic of Ireland Act 1948 as passed by the Irish parliament, the Oireachtas. Following the secession of most of Ireland from the United Kingdom in 1922, the then created Irish Free State remained (for the purposes of British law) a dominion of the British Empire and thus its people remained British subjects with the right to live and work in the United Kingdom and elsewhere in the Empire. The British monarch continued to be head of state."
question9 = "What is the Irish Parliament called?"
answers9 = ["the Oireachtas"]
source9 = "https://en.wikipedia.org/wiki/Ireland_Act_1949"

context10 = "New York University (NYU) is a private research university in New York City, United States. Chartered in 1831 by the New York State Legislature, NYU was founded in 1832 by Albert Gallatin as a non-denominational all-male institution near City Hall based on a curriculum focused on a secular education. The university moved in 1833 and has maintained its main campus in Greenwich Village surrounding Washington Square Park. Since then, the university has added an engineering school in Brooklyn's MetroTech Center and graduate schools throughout Manhattan."
question10 = "When did New York University move?"
answers10 = ["1833"]
source10 = "https://en.wikipedia.org/wiki/New_York_University"

# repeat until you have 10 blocks of four variables 

qa1 = to_example(context1, question1, answers1, source1)
qa2 = to_example(context2, question2, answers2, source2)
qa3 = to_example(context3, question3, answers3, source3)
qa4 = to_example(context4, question4, answers4, source4)
qa5 = to_example(context5, question5, answers5, source5)
qa6 = to_example(context6, question6, answers6, source6)
qa7 = to_example(context7, question7, answers7, source7)
qa8 = to_example(context8, question8, answers8, source8)
qa9 = to_example(context9, question9, answers9, source9)
qa10 = to_example(context10, question10, answers10, source10)

qas = [qa1, qa2, qa3, qa4, qa5, qa6, qa7, qa8, qa9, qa10]

In [27]:
out_path = "jasper.wilkerson.jsonl" # edit this line to use your name

with open(out_path, 'w') as jsonl:
    for qa in qas:
        print(json.dumps(qa), file=jsonl)
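
Before emailing the file, it can be worth reading it back and re-checking every offset.  Here is a minimal sketch, assuming the record layout produced by `to_example`.  It writes a one-record temporary file so that it runs on its own; the function name `check_jsonl` and the toy record are hypothetical:

```python
import json
import os
import tempfile

def check_jsonl(path):
    """Read a jsonl file and verify each answer against its answer_start offset."""
    checked = 0
    with open(path, encoding='utf-8') as jsonl:
        for line in jsonl:
            qa = json.loads(line)
            context = qa['context']
            for text, start in zip(qa['answers']['text'], qa['answers']['answer_start']):
                assert context[start:start + len(text)] == text, f"bad offset for: {text}"
                checked += 1
    return checked

# a toy record in the same layout, written to a temporary file
toy_context = 'Darien is a city in DuPage County, Illinois.'
record = {'context': toy_context,
          'question': 'Which county is Darien in?',
          'source': 'https://en.wikipedia.org/wiki/Darien,_Illinois',
          'answers': {'text': ['DuPage County'],
                      'answer_start': [toy_context.find('DuPage County')]}}

with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False, encoding='utf-8') as tmp:
    print(json.dumps(record), file=tmp)

num_checked = check_jsonl(tmp.name)
os.remove(tmp.name)
print(f"answers checked: {num_checked}")
```

To check a real submission, call `check_jsonl` with the path of your own jsonl file instead of the temporary one.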

# BERT-Mini

We will be using a small version of BERT called "BERT mini", which is a 4-layer, 11M parameter model described here:

https://huggingface.co/prajjwal1/bert-mini

https://github.com/prajjwal1/generalize_lm_nli

If you find that your compute device (i.e. laptop or desktop or server) does not have enough resources to use "BERT mini", then please give "BERT tiny" a try.  

Let's start by loading the model and the tokenizer:

In [28]:
tokenizer = BertTokenizer.from_pretrained('prajjwal1/bert-mini')
model = BertForQuestionAnswering.from_pretrained('prajjwal1/bert-mini')

# if you need to use BERT tiny, then comment the code above and uncomment the code below
# tokenizer = BertTokenizer.from_pretrained('prajjwal1/bert-tiny')
# model = BertForQuestionAnswering.from_pretrained('prajjwal1/bert-tiny')

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at prajjwal1/bert-mini and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
# Please run the following to get some sense of what is in BERT Mini
# Count the total number of parameters
total_params = sum(p.numel() for p in model.parameters())

# Count the number of trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

Total parameters: 11105282
Trainable parameters: 11105282


## Answer one question

Let's try to get BERT Mini to answer a single question from SQuAD.

In [30]:
# first let's grab a random training example
index = random.randint(0, train_count - 1)
example = squad['train'][index]
question = example['question']
context = example['context']
print(f"example: question = '{question}', context = '{context}'")
print(f"expected answer = {example['answers']['text'][0]}")

example: question = 'After Findlay's charting how many islands of the group were named Ellice?', context = 'The next European to visit was Arent Schuyler de Peyster, of New York, captain of the armed brigantine or privateer Rebecca, sailing under British colours, which passed through the southern Tuvaluan waters in May 1819; de Peyster sighted Nukufetau and Funafuti, which he named Ellice's Island after an English Politician, Edward Ellice, the Member of Parliament for Coventry and the owner of the Rebecca's cargo. The name Ellice was applied to all nine islands after the work of English hydrographer Alexander George Findlay.'
expected answer = all nine islands


In [31]:
# tokenize the question and the context and inspect the results
inputs = tokenizer(question, context, return_tensors='pt')
input_ids = inputs["input_ids"].tolist()[0]
print(inputs)

{'input_ids': tensor([[  101,  2044,  2424,  8485,  1005,  1055, 17918,  2129,  2116,  3470,
          1997,  1996,  2177,  2020,  2315,  3449, 13231,  1029,   102,  1996,
          2279,  2647,  2000,  3942,  2001,  4995,  2102,  8040,  6979, 20853,
          2139, 21877, 27268,  2121,  1010,  1997,  2047,  2259,  1010,  2952,
          1997,  1996,  4273, 16908,  4630,  3170,  2030, 26790,  9423,  1010,
          8354,  2104,  2329,  8604,  1010,  2029,  2979,  2083,  1996,  2670,
         10722, 10175, 13860,  5380,  1999,  2089, 12552,  1025,  2139, 21877,
         27268,  2121, 19985, 16371,  5283,  7959,  2696,  2226,  1998,  4569,
         10354, 21823,  1010,  2029,  2002,  2315,  3449, 13231,  1005,  1055,
          2479,  2044,  2019,  2394,  3761,  1010,  3487,  3449, 13231,  1010,
          1996,  2266,  1997,  3323,  2005, 13613,  1998,  1996,  3954,  1997,
          1996,  9423,  1005,  1055,  6636,  1012,  1996,  2171,  3449, 13231,
          2001,  4162,  2000,  2035,  

In [32]:
# observe that the input ids begin with 101 and end with 102 for every example you observe
# these are the special tokens '[CLS]' and '[SEP]'.  You will observe there is always a 102 between the question token ids and the context token ids.
# also observe that the token type ids assign 0 to the question tokens and 1 to the context tokens.
special = tokenizer.decode([101, 102])
print(f"the special tokens at the beginning and end are {special}")

# you can decode any token id or encode any token to get some sense of how the tokenization works

token_id = tokenizer.encode("Who", add_special_tokens=False)[0]
print(f"token id for 'Who' is {token_id}")

token = tokenizer.decode([2079])
print(f"token for '2079' is {token}")

the special tokens at the beginning and end are [CLS] [SEP]
token id for 'Who' is 2040
token for '2079' is do
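
The layout described in the comments above, `[CLS] question [SEP] context [SEP]`, means the context tokens can be located from the positions of the `[SEP]` id alone. Here is a minimal sketch in plain Python, assuming a single question/context pair and the ids 101 and 102 shown above. `context_bounds` is a hypothetical helper name, and the short id list is made up for illustration.

```python
CLS_ID = 101  # [CLS], as decoded above
SEP_ID = 102  # [SEP], as decoded above

def context_bounds(input_ids, sep_id=SEP_ID):
    """Return (first, last) inclusive indices of the context tokens.

    Assumes the layout [CLS] question [SEP] context [SEP].
    """
    sep_positions = [i for i, tok in enumerate(input_ids) if tok == sep_id]
    if len(sep_positions) < 2:
        raise ValueError("expected at least two [SEP] tokens")
    first = sep_positions[0] + 1   # token right after the first [SEP]
    last = sep_positions[-1] - 1   # token right before the final [SEP]
    return first, last

# Made-up ids: [CLS] + 3 question tokens + [SEP] + 3 context tokens + [SEP]
ids = [101, 2040, 2003, 1029, 102, 1996, 2279, 2647, 102]
first, last = context_bounds(ids)
print(first, last)

# Token type ids follow the same split: 0 up to and including the first [SEP],
# 1 for the context and the final [SEP]
token_type_ids = [0 if i < first else 1 for i in range(len(ids))]
print(token_type_ids)
```

The same boundaries can be read from `inputs["token_type_ids"]`, but working from the `[SEP]` positions makes the layout explicit.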


OK, let's call the model.  There are many ways to use a BERT model.  Here we have instantiated BERT as a BertForQuestionAnswering object.  You can read about this here:

https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#bertforquestionanswering



In [33]:
with torch.no_grad():
    outputs = model(**inputs)
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

# want to see some logits?!!
print(outputs.start_logits)
print(outputs.end_logits)

tensor([[-0.1161,  0.1602,  0.4311,  0.3323,  0.3339, -0.0034,  0.3245,  0.4651,
          0.2019,  0.2221,  0.1454,  0.2009,  0.1181,  0.0544, -0.0920,  0.0696,
          0.1881,  0.1520,  0.0133,  0.2852,  0.0325,  0.1468, -0.2346, -0.2170,
          0.2616,  0.1069,  0.3365,  0.0863,  0.1668,  0.3600,  0.0614,  0.4924,
          0.1830,  0.0598,  0.0162,  0.2512,  0.2964, -0.2163, -0.0877, -0.0626,
          0.3703,  0.2311,  0.0493, -0.0461,  0.3656, -0.0251, -0.1655,  0.1667,
         -0.1079, -0.2763,  0.0054,  0.0110, -0.0685, -0.2000, -0.1252, -0.0144,
          0.2877,  0.1901,  0.3239,  0.2435,  0.4178,  0.1285,  0.3391, -0.1707,
         -0.1915, -0.2611, -0.1546, -0.0116,  0.3176,  0.5169,  0.2286,  0.1285,
          0.1572,  0.2429, -0.0031,  0.1389,  0.1611,  0.1907, -0.0405, -0.0183,
          0.1191,  0.2176, -0.0235,  0.0594,  0.4031, -0.1192,  0.0893,  0.3296,
          0.2979, -0.0102, -0.1475,  0.0191,  0.4348,  0.0403, -0.0377,  0.2116,
          0.1814,  0.0923,  

In [34]:
answer_start = torch.argmax(start_scores)
answer_end = torch.argmax(end_scores) + 1

# Convert tokens to string
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print(f"answer_start={answer_start}, answer_end={answer_end}")
print(f"answer={answer}")

answer_start=69, answer_end=78
answer=peyster sighted nukufetau


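Well, it tried.  We are using BERT mini after all, and as the warning at load time noted, the question-answering head (`qa_outputs`) is newly initialized and has not been trained yet.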

There are (at least) two problems with the above code.  The first is that we have not constrained the answer to start inside the context tokens.  The second is that we have not constrained the answer to end after the start.  Let's write a function that takes a question and context and returns an answer that fixes these two problems.

# TODO (10 Points)

In [35]:
# TODO Please implement the following method

def answer_question(question, context, model):
    """
    This method should return an answer predicted by the model.
    Use the code above and add a bit of logic to make sure that the answer is found in the context.  
    Also, make sure this method finds the best answer end that follows the best answer start.  
    That is, you should argmax over only the end scores that appear after the answer start.
    It's very easy to get index arithmetic wrong - so please test every line of code.
    """
    # I added a couple of parameters to deal with a few examples whose questions + contexts were too long
    inputs = tokenizer(question, context, return_tensors='pt', truncation=True, max_length=512)
    input_ids = inputs["input_ids"].tolist()[0]
    with torch.no_grad():
        outputs_logits = model(**inputs)
        start_scores = outputs_logits.start_logits
        end_scores = outputs_logits.end_logits
        
    num_tokens = len(input_ids)
    answer_start = torch.argmax(start_scores[0][:num_tokens])
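    # Cap the answer length: only consider end positions within 43 tokens of the start
    window_end = answer_start + 43
    end_scores = end_scores[0][answer_start:window_end]
    # argmax is relative to the window, so shift it back to an absolute index
    answer_end = torch.argmax(end_scores)
    answer_end += answer_start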

    # print("answer start", answer_start)
    # print("answer end", answer_end)

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    return answer

In [66]:
# test a random example
index = random.randint(0, train_count - 1)
example = squad['train'][index]
question = example['question']
context = example['context']
print(f"example: question = '{question}', context = '{context}'")
print(f"expected answer = {example['answers']['text'][0]}")
actual_answer = answer_question(question, context, model)
print(f"actual answer = {actual_answer}")

example: question = 'When did the Soviet Union break up?', context = 'On December 25, 1991, following the collapse of the Soviet Union, the republic was renamed the Russian Federation, which it remains to this day. This name and "Russia" were specified as the official state names in the April 21, 1992 amendment to the existing constitution and were retained as such in the 1993 Constitution of Russia.'
expected answer = December 25, 1991
actual answer = ? [SEP] on december 25 , 1991 , following the collapse of the soviet union , the republic was renamed the russian
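
The `? [SEP]` at the front of the answer above shows that the start index can still land in the question. One possible way to enforce both constraints is sketched below in plain Python on lists of scores (for example `outputs.start_logits[0].tolist()`). It assumes `[SEP]` has id 102, treats the end index as inclusive, and uses a hypothetical `max_answer_tokens` cap. `best_span` is an illustrative name, not part of the assignment's starter code, and the ids and scores are made up.

```python
def best_span(input_ids, start_scores, end_scores, sep_id=102, max_answer_tokens=30):
    """Pick (start, end), both inclusive, restricted to the context tokens.

    Greedy strategy: best start inside the context first, then the best end
    at or after that start (and within max_answer_tokens of it).
    """
    sep_positions = [i for i, tok in enumerate(input_ids) if tok == sep_id]
    if len(sep_positions) < 2:
        raise ValueError("expected [CLS] question [SEP] context [SEP]")
    ctx_first = sep_positions[0] + 1
    ctx_last = sep_positions[-1] - 1
    if ctx_last < ctx_first:
        raise ValueError("context is empty")

    # Constraint 1: the start must be a context token
    start = max(range(ctx_first, ctx_last + 1), key=lambda i: start_scores[i])

    # Constraint 2: the end must not precede the start or leave the context
    end_limit = min(ctx_last, start + max_answer_tokens - 1)
    end = max(range(start, end_limit + 1), key=lambda i: end_scores[i])
    return start, end

# Made-up example: the highest raw scores sit at index 1, inside the question
ids          = [101,  11,  12, 102,  21,  22,  23,  24, 102]
start_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.0, 0.5]
end_scores   = [0.0, 0.9, 0.1, 0.2, 0.8, 0.1, 0.6, 0.3, 0.7]

start, end = best_span(ids, start_scores, end_scores)
print(start, end, ids[start:end + 1])
```

With the tokenizer loaded, the selected ids can be turned back into text with `tokenizer.decode(input_ids[start:end + 1])`. Note the `+ 1`: because `end` is inclusive here, the slice has to extend one position past it.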


# How many of the training examples does BERT mini get correct?
The final bit of code to write is a loop that calls answer_question for each example in the training data and reports the number of times BERT mini gets the answer correct. Put your answer on the next line:

# The number of questions BERT mini answers correctly is 16-20

# TODO (10 Points)

In [69]:
# TODO write some code that loops over each example in the training data and calls the answer_question() method for each
# if the expected answer is the same as the actual, model-generated answer, then count it as correct
# Count all the correct answers and indicate it in the markdown block above
# This code may take a while (30 minutes or more) and so you might consider adding progress updates
# here's the code I used
# progress = 0
# (inside the loop)
# progress += 1
# if progress % 5000 == 0:
#     print(f"progress = {progress}, correct = {correct}")

correct = 0
progress = 0
print("Total examples to run:", len(squad['train']))

for example in squad['train']:
    progress += 1
    if progress % 5000 == 0:
        print(f"progress = {progress}, correct = {correct}")
 
    question = example['question']
    context = example['context']
    expected_answer = example['answers']['text'][0].lower().strip()
    predicted_answer = answer_question(question, context, model).lower().strip()
    if expected_answer == predicted_answer:
        correct += 1

print(f"correct={correct}")



Total examples to run: 87599


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

progress = 5000, correct = 1
progress = 10000, correct = 3


progress = 15000, correct = 4
progress = 20000, correct = 4


progress = 25000, correct = 5


progress = 30000, correct = 6
progress = 35000, correct = 6


progress = 40000, correct = 6


progress = 45000, correct = 7


progress = 50000, correct = 8


progress = 55000, correct = 8


progress = 60000, correct = 10


progress = 65000, correct = 12


progress = 70000, correct = 12


progress = 75000, correct = 12


progress = 80000, correct = 14


progress = 85000, correct = 16


correct=16
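
Part of the reason so few answers match is formatting. The uncased tokenizer lowercases text, and `convert_tokens_to_string` leaves spaces around punctuation (for example `december 25 , 1991` in the output earlier), so a character-for-character comparison can fail even when the span is right. A common remedy, similar in spirit to the normalization in the official SQuAD evaluation script, is to normalize both strings before comparing. This is a sketch under that assumption, not the grading criterion for this assignment. `normalize_answer` and `exact_match` are illustrative names, and the predicted strings below are hypothetical.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(expected, predicted):
    """True when the two answers are identical after normalization."""
    return normalize_answer(expected) == normalize_answer(predicted)

# A correct span that plain string equality would reject because of spacing
print(exact_match("December 25, 1991", "december 25 , 1991"))

# A genuinely wrong answer still fails
print(exact_match("all nine islands", "peyster sighted nukufetau"))
```

Swapping this comparison into the counting loop would change the reported number, so if you try it, report it separately from the strict count.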
