[Original Code](https://github.com/dredwardhyde/bert-examples/blob/main/bert_squad_tensorflow.py)

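SQuAD (the Stanford Question Answering Dataset) contains more than 100,000 question-answer pairs on 500+ articles.
The goal is to find, for each question, the span of context text that answers it. This notebook reports an exact-match score: the percentage of predictions that, after normalization (lowercasing and stripping punctuation and articles), equal any of the ground-truth answers.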

BERT is fine-tuned for this task as follows:
1.  The context and the question are tokenized, concatenated, and passed to the model as inputs.
2.  The final hidden state of every token is fed into the start-token classifier. This classifier has a single weight vector that it applies to every token. The dot product between each token's output embedding and the start weights gives one score per token, and a softmax over those scores yields a probability distribution over all tokens. The token with the highest probability is chosen as the start of the answer.
3.  The same process is repeated for the end token, using a separate weight vector.
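
The two classifier heads described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the notebook's Keras model: the hidden states and weight vectors are random stand-ins, and the toy sizes (6 tokens, 4 hidden units) replace the real 384 and 768.

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
seq_len, hidden = 6, 4  # toy sizes (assumption); the notebook uses 384 and 768

# Stand-in for BERT's last hidden layer: one vector per token.
sequence_output = rng.normal(size=(seq_len, hidden))

# One weight vector per head, shared across all token positions.
start_weights = rng.normal(size=hidden)
end_weights = rng.normal(size=hidden)

# Dot product per token gives a score; softmax turns scores into probabilities.
start_probs = softmax(sequence_output @ start_weights)
end_probs = softmax(sequence_output @ end_weights)

# The predicted span runs from the most likely start token to the most likely end token.
start_idx = int(np.argmax(start_probs))
end_idx = int(np.argmax(end_probs))
print(start_idx, end_idx)
```

Because the two heads are chosen independently, nothing in this scheme forces `end_idx >= start_idx`; the notebook's prediction code simply slices the context with whatever indices come out.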

In [1]:
!pip install tokenizers

Collecting tokenizers
  Downloading tokenizers-0.11.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.11.4


In [2]:
import json
import os
import re
import string
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras import layers
from tokenizers import BertWordPieceTokenizer
#os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [3]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [4]:
!nvidia-smi

Mon Feb 14 04:45:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [5]:
# ============================================= PREPARING DATASET ======================================================
class Sample:
    def __init__(self, question, context, start_char_idx=None, answer_text=None, all_answers=None):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx # zero-th char index
        self.answer_text = answer_text
        self.all_answers = all_answers
        self.skip = False # skip bool; something we'll look at later..
        self.start_token_idx = -1 # initialise these to -1
        self.end_token_idx = -1

    def preprocess(self):

        # normalize whitespace: collapse any run of whitespace in the context
        # and in the question to a single space
        context = " ".join(str(self.context).split())
        question = " ".join(str(self.question).split())

        # turn context and question into tokens
        tokenized_context = tokenizer.encode(context)
        tokenized_question = tokenizer.encode(question)

        # if there is an answer text
        if self.answer_text is not None:

            # normalize whitespace in the answer the same way
            answer = " ".join(str(self.answer_text).split())
            end_char_idx = self.start_char_idx + len(answer)  # exclusive end of the answer span

            # an answer running past the end of the context cannot be aligned; skip it
            # (end_char_idx is exclusive, so end_char_idx == len(context) is still valid)
            if end_char_idx > len(context):
                self.skip = True
                return

            # per-character mask over the context: 1 where the character belongs to the answer
            is_char_in_ans = [0] * len(context)
            for idx in range(self.start_char_idx, end_char_idx):
                is_char_in_ans[idx] = 1

            # tokenized_context.offsets maps each token to its (start, end) character
            # span in the context; collect the indexes of all tokens that overlap
            # at least one answer character
            ans_token_idx = []
            for idx, (start, end) in enumerate(tokenized_context.offsets):
                if sum(is_char_in_ans[start:end]) > 0:
                    ans_token_idx.append(idx)

            # no token overlaps the answer; skip this sample
            if len(ans_token_idx) == 0:
                self.skip = True
                return

            # then indexes for our tokens; set start and end
            self.start_token_idx = ans_token_idx[0]
            self.end_token_idx = ans_token_idx[-1]

        # build the input ids: context ids followed by question ids
        input_ids = tokenized_context.ids + tokenized_question.ids[1:] # 1: to skip the [CLS] token

        # token type ids: 0 for context tokens, 1 for question tokens
        token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(tokenized_question.ids[1:])
        # attention mask: 1 for every real token; padding added below gets 0
        attention_mask = [1] * len(input_ids)
        # number of padding positions needed to reach max_seq_length
        padding_length = max_seq_length - len(input_ids)

        # pad input ids, attention mask and token type ids with zeros up to max_seq_length
        if padding_length > 0:
            input_ids = input_ids + ([0] * padding_length)
            attention_mask = attention_mask + ([0] * padding_length)
            token_type_ids = token_type_ids + ([0] * padding_length)
        # samples longer than max_seq_length are skipped rather than truncated
        elif padding_length < 0:
            self.skip = True
            return

        # set these as class variables
        self.input_word_ids = input_ids
        self.input_type_ids = token_type_ids
        self.input_mask = attention_mask
        self.context_token_to_char = tokenized_context.offsets # also keep context offsets


def create_squad_examples(raw_data):
    """
    Walk the SQuAD JSON (articles -> paragraphs -> QAs) and build one
    preprocessed Sample per question. Layout of the raw data:
    article (442)
        - title
        - paragraphs(list)
            - context
            - qas (list)
                - answers (list)
                    - answer_start
                    - text
                - id
                - question
    """
    squad_examples = []
    for item in raw_data["data"]:
        for para in item["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                question = qa["question"]
                # labeled data: train on the first answer, keep all answers for evaluation
                if "answers" in qa:
                    answer_text = qa["answers"][0]["text"]  # text of the first answer
                    all_answers = [ans["text"] for ans in qa["answers"]]  # every reference answer
                    start_char_idx = qa["answers"][0]["answer_start"]  # character offset of the first answer

                    # create SQUAD sample with question, context; and all possible answers
                    squad_eg = Sample(question, context, start_char_idx, answer_text, all_answers)
                else:
                    # else create just sample with question and context
                    squad_eg = Sample(question, context)
                squad_eg.preprocess()
                squad_examples.append(squad_eg)
    return squad_examples


def create_inputs_targets(squad_examples):
    """Convert preprocessed samples into model inputs x and targets y."""

    # one list per model input / target, filled sample by sample
    dataset_dict = {
        "input_word_ids": [],
        "input_type_ids": [],
        "input_mask": [],
        "start_token_idx": [],
        "end_token_idx": [],
    }

    # collect the fields of every sample that preprocessing did not mark as skipped
    for item in squad_examples:
        if item.skip:
            continue

        # each key of dataset_dict is also an attribute name on Sample
        for key in dataset_dict:
            dataset_dict[key].append(getattr(item, key))

    # loop through dataset dict
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key]) # cast to numpy array
    # x is a triple; input word ids, input mask, input type ids
    x = [dataset_dict["input_word_ids"],
         dataset_dict["input_mask"],
         dataset_dict["input_type_ids"]]
         
    # y is start and end token indexes
    y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]
    return x, y
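
The character-to-token alignment inside `Sample.preprocess` can be tried out without downloading BERT. The sketch below is an illustration under stated assumptions, not the notebook's code: `whitespace_offsets` is a hypothetical toy tokenizer standing in for the WordPiece `offsets`, and `char_span_to_token_span` uses an interval-overlap test in place of the per-character mask (the two agree for non-empty tokens).

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first, last) indices of tokens overlapping [start_char, end_char), or None."""
    hits = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]
    return (hits[0], hits[-1]) if hits else None

def whitespace_offsets(text):
    """Toy tokenizer (assumption): character offsets of whitespace-separated words."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets

context = "Mary appeared to Saint Bernadette Soubirous in 1858"
answer = "Saint Bernadette Soubirous"
start_char = context.index(answer)          # plays the role of answer_start
end_char = start_char + len(answer)         # exclusive end, as in preprocess()

offsets = whitespace_offsets(context)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
```

Here the answer covers the fourth through sixth words, so the script prints `(3, 5)`; these two indices are what the model is trained to predict as `start_token_idx` and `end_token_idx`.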

In [6]:
# download the SQuAD v1.1 train and dev sets (keras.utils.get_file caches them locally)
train_path = keras.utils.get_file("train.json", "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json")
eval_path = keras.utils.get_file("eval.json", "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json")
with open(train_path) as f: raw_train_data = json.load(f)
with open(eval_path) as f: raw_eval_data = json.load(f)

# maximum number of tokens (context + question) per sample; longer samples are skipped
max_seq_length = 384

Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json


In [7]:
# initialize BERT inputs; word ids, mask and type ids
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name='input_word_ids')
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name='input_mask')
input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')

# load bert model and set output sequence
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2", trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, input_type_ids])

# load the WordPiece tokenizer with the vocabulary and casing shipped with the BERT module
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy().decode("utf-8")
do_lower_case = bool(bert_layer.resolved_object.do_lower_case.numpy())
tokenizer = BertWordPieceTokenizer(vocab=vocab_file, lowercase=do_lower_case)

In [8]:
print('Sample Context')
display(raw_train_data['data'][0]['paragraphs'][0]['context'])

print('Corresponding QAs')
display(raw_train_data['data'][0]['paragraphs'][0]['qas'][0])

Sample Context


'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

Corresponding QAs


{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'}

In [9]:
# build test sample
test_sample = Sample(
    raw_train_data['data'][0]['paragraphs'][0]['qas'][0]['question'], 
    raw_train_data['data'][0]['paragraphs'][0]['context'], 
    raw_train_data['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['answer_start'], 
    raw_train_data['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['text'], 
    [_['text'] for _ in raw_train_data['data'][0]['paragraphs'][0]['qas'][0]['answers']]
)
display(test_sample.__dict__) # display dict
print('')
test_sample.preprocess() # do preprocess
print(test_sample.input_word_ids)
print(test_sample.input_type_ids)
print(test_sample.input_mask)
print(test_sample.context_token_to_char)


{'all_answers': ['Saint Bernadette Soubirous'],
 'answer_text': 'Saint Bernadette Soubirous',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'end_token_idx': -1,
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'skip': False,
 'start_char_idx': 515,
 'start_token_idx': -1}


[101, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 8539, 2083, 1017, 11342, 1998, 1996, 2751, 8514, 1007, 1010, 2003, 1037, 3722, 1010, 2715, 2962, 6231, 1997, 2984, 1012, 102, 2000, 3183, 2106, 1996,

In [10]:
# create train set examples
train_squad_examples = create_squad_examples(raw_train_data)
x_train, y_train = create_inputs_targets(train_squad_examples)
print(f"{len(train_squad_examples)} training points created.")

# create evaluation set examples
eval_squad_examples = create_squad_examples(raw_eval_data)
x_eval, y_eval = create_inputs_targets(eval_squad_examples)
print(f"{len(eval_squad_examples)} evaluation points created.")

87599 training points created.
10570 evaluation points created.


In [11]:
class ValidationCallback(keras.callbacks.Callback):

    def normalize_text(self, text):
        text = text.lower()
        text = "".join(ch for ch in text if ch not in set(string.punctuation))
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        text = re.sub(regex, " ", text)
        text = " ".join(text.split())
        return text

    def __init__(self, x_eval, y_eval):
        super().__init__()  # let the base Callback set up its own state
        self.x_eval = x_eval
        self.y_eval = y_eval

    def on_epoch_end(self, epoch, logs=None):
        """Predict output on epoch end."""
        pred_start, pred_end = self.model.predict(self.x_eval) # make model prediction on these..
        count = 0
        # keep only non-skipped examples so indexes line up with the rows of x_eval
        eval_examples_no_skip = [eg for eg in eval_squad_examples if not eg.skip]

        for idx, (start, end) in enumerate(zip(pred_start, pred_end)):

            squad_eg = eval_examples_no_skip[idx] # get corresponding squad example
            offsets = squad_eg.context_token_to_char # and context offsets to reference to

            # get the index of start and end tokens
            start = np.argmax(start)
            end = np.argmax(end)

            # predicted start lies outside the context (in the question or padding); skip
            if start >= len(offsets):
                continue

            pred_char_start = offsets[start][0]  # first character of the predicted answer
            # if the predicted end is inside the context, cut at the end of that token
            if end < len(offsets):
                pred_char_end = offsets[end][1]
                pred_ans = squad_eg.context[pred_char_start:pred_char_end]
            # otherwise take everything up to the end of the context
            else:
                pred_ans = squad_eg.context[pred_char_start:]

            # normalize prediction and references; an exact match with any reference counts as correct
            normalized_pred_ans = self.normalize_text(pred_ans)
            normalized_true_ans = [self.normalize_text(_) for _ in squad_eg.all_answers]
            if normalized_pred_ans in normalized_true_ans:
                count += 1
        acc = count / len(self.y_eval[0])
        print(f"\nepoch={epoch + 1}, exact match score={acc:.2f}")

In [12]:
# build up our model now

# one logit per token for the start position and one for the end position;
# each head is a single bias-free Dense unit applied to every token's hidden state
start_logits = layers.Dense(1, name="start_logit", use_bias=False)(sequence_output)
start_logits = layers.Flatten()(start_logits)

end_logits = layers.Dense(1, name="end_logit", use_bias=False)(sequence_output)
end_logits = layers.Flatten()(end_logits)

# softmax over the sequence dimension turns logits into per-token probabilities
start_probs = layers.Activation(keras.activations.softmax)(start_logits)
end_probs = layers.Activation(keras.activations.softmax)(end_logits)

# build the model: inputs are the triple (word ids, mask, type ids);
# outputs are the start and end probability distributions over tokens
model = keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=[start_probs, end_probs])

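# define the loss and optimizer, then compile; targets are integer token indexes,
# so sparse categorical cross-entropy is applied to each output
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
optimizer = keras.optimizers.Adam(learning_rate=1e-5, beta_1=0.9, beta_2=0.98, epsilon=1e-9)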
model.compile(optimizer=optimizer, loss=[loss, loss])
model.summary()

# train for one epoch to keep training time down; set epochs=2 for a longer run
model.fit(x_train, y_train, epochs=1, batch_size=8, callbacks=[ValidationCallback(x_eval, y_eval)]) 
model.save_weights("./weights.h5")

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_word_ids (InputLayer)    [(None, 384)]        0           []                               
                                                                                                  
 input_mask (InputLayer)        [(None, 384)]        0           []                               
                                                                                                  
 input_type_ids (InputLayer)    [(None, 384)]        0           []                               
                                                                                                  
 keras_layer (KerasLayer)       [(None, 768),        109482241   ['input_word_ids[0][0]',         
                                 (None, 384, 768)]                'input_mask[0][0]',         



 1153/10767 [==>...........................] - ETA: 1:54:20 - loss: 4.7002 - activation_loss: 2.4024 - activation_1_loss: 2.2978

KeyboardInterrupt: ignored

In [20]:
# ==================================================== TESTING =========================================================
data = {"data":
    [
        {"title": "Project Apollo",
         "paragraphs": [
             {
                 "context": "The Apollo program, also known as Project Apollo, was the third United States human "
                            "spaceflight program carried out by the National Aeronautics and Space Administration ("
                            "NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First "
                            "conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to "
                            "follow the one-man Project Mercury which put the first Americans in space, Apollo was "
                            "later dedicated to President John F. Kennedy's national goal of landing a man on the "
                            "Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in "
                            "a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project "
                            "Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, "
                            "and was supported by the two man Gemini program which ran concurrently with it from 1962 "
                            "to 1966. Gemini missions developed some of the space travel techniques that were "
                            "necessary for the success of the Apollo missions. Apollo used Saturn family rockets as "
                            "launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications "
                            "Program, which consisted of Skylab, a space station that supported three manned missions "
                            "in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the "
                            "Soviet Union in 1975.",
                 "qas": [
                     {"question": "What project put the first Americans into space?",
                      "id": "Q1"
                      },
                     {"question": "What program was created to carry out these projects and missions?",
                      "id": "Q2"
                      },
                     {"question": "What year did the first manned Apollo flight occur?",
                      "id": "Q3"
                      },
                     {"question": "What President is credited with the original notion of putting Americans in space?",
                      "id": "Q4"
                      },
                     {"question": "Who did the U.S. collaborate with on an Earth orbit mission in 1975?",
                      "id": "Q5"
                      },
                     {"question": "How long did Project Apollo run?",
                      "id": "Q6"
                      },
                     {"question": "What program helped develop space travel techniques that Project Apollo used?",
                      "id": "Q7"
                      },
                     {"question": "What space station supported three manned missions in 1973-1974?",
                      "id": "Q8"
                      }
                 ]}]}]}

# build Sample objects for the test questions, dropping any that preprocessing
# marked as skipped so that indexes line up with the rows of x_test
test_samples = [s for s in create_squad_examples(data) if not s.skip]
x_test, _ = create_inputs_targets(test_samples)  # targets are unused at inference time
pred_start, pred_end = model.predict(x_test)


for idx, (start, end) in enumerate(zip(pred_start, pred_end)):

    test_sample = test_samples[idx] # get corresponding sample index
    offsets = test_sample.context_token_to_char

    start = np.argmax(start) # get argmax; highest probability of start/end pos
    end = np.argmax(end)

    # map the predicted token indexes back to a character span in the context
    if start >= len(offsets):
        continue  # predicted start lies outside the context (question or padding)
    pred_char_start = offsets[start][0]
    if end < len(offsets):
        pred_ans = test_sample.context[pred_char_start:offsets[end][1]]
    else:
        # predicted end lies outside the context; take everything to the end
        pred_ans = test_sample.context[pred_char_start:]
    print("Q: " + test_sample.question)
    print("A: " + pred_ans)
    print()

Q: What project put the first Americans into space?
A: Mercury

Q: What program was created to carry out these projects and missions?
A: National Aeronautics and Space Administration

Q: What year did the first manned Apollo flight occur?
A: 1968

Q: What President is credited with the original notion of putting Americans in space?
A: John F. Kennedy

Q: Who did the U.S. collaborate with on an Earth orbit mission in 1975?
A: Soviet Union

Q: How long did Project Apollo run?
A: 1961 to 1972

Q: What program helped develop space travel techniques that Project Apollo used?
A: Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets

Q: What space station supported three manned missions in 1973-1974?
A: Skylab

