In [1]:
username = 'MarcelloCeresini'
repository = 'QuestionAnswering'

# COLAB ONLY CELLS
try:
    import google.colab
    IN_COLAB = True
    !pip3 install transformers
    !git clone https://www.github.com/{username}/{repository}.git
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd /content/QuestionAnswering/src
except ImportError:    # google.colab is only importable inside Colab
    IN_COLAB = False

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.1-py3-none-any.whl (4.9 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.1
Cloning into 'QuestionAnswering'...
remote: Enumerating objects: 1053, done.
remote: Counting objects: 100% (223/223), done.
remote: Compressing objects: 100% (169/169), done.
remote: Total 1053 (delta 150), reused 78 (delta

# Tf-Idf retrieval baseline

## Preparation

In this notebook, we implement a simple baseline for paragraph retrieval using Tf-Idf weighted sparse representations of documents and query questions.

In [2]:
%matplotlib inline

import os
from tqdm import tqdm
import random
import json
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from functools import partial

from sklearn.feature_extraction.text import TfidfVectorizer

from config import Config
config = Config()
import utils

# Fix random seed for reproducibility
np.random.seed(config.RANDOM_SEED)
random.seed(config.RANDOM_SEED)
tf.random.set_seed(config.RANDOM_SEED)

from typing import List, Dict
#os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'


Next, we load the training, validation and test sets. The test split is read from `dev_set.json`.

In [3]:
ROOT_PATH = os.path.dirname(os.getcwd())
TRAINING_FILE = os.path.join(ROOT_PATH, 'data', 'training_set.json')
VALIDATION_FILE = os.path.join(ROOT_PATH, 'data', 'validation_set.json')
TEST_FILE = os.path.join(ROOT_PATH, 'data', 'dev_set.json')

train_paragraphs_and_questions = utils.read_question_set(TRAINING_FILE)['data']
val_paragraphs_and_questions = utils.read_question_set(VALIDATION_FILE)['data']
test_paragraphs_and_questions = utils.read_question_set(TEST_FILE)['data']

# Remove the validation set from the train set
train_paragraphs_and_questions = [article for article in train_paragraphs_and_questions \
                                  if article not in val_paragraphs_and_questions]

First of all, we separate questions from their paragraphs. We assign a `context_id` to each question and paragraph:

- For questions, the context ID is a tuple `(article_index, paragraph_index_in_article)`.
- For paragraphs, the context ID is the index of the article that contains them, i.e. the first value of the question tuples.
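
To make the ID scheme concrete, here is a minimal sketch on a made-up two-article dataset. The name `toy_dataset` and its texts are assumptions for illustration; only the nesting (articles, then paragraphs, then `qas`) mirrors the real files.

```python
# Hypothetical toy dataset with the same nesting as the real one:
# articles -> paragraphs -> qas
toy_dataset = [
    {'paragraphs': [
        {'context': 'Paragraph A0', 'qas': [{'question': 'qA0-1'}, {'question': 'qA0-2'}]},
        {'context': 'Paragraph A1', 'qas': [{'question': 'qA1-1'}]},
    ]},
    {'paragraphs': [
        {'context': 'Paragraph B0', 'qas': [{'question': 'qB0-1'}]},
    ]},
]

# Each question carries (article index, paragraph index within that article)
toy_question_ids = [
    (i, j)
    for i, article in enumerate(toy_dataset)
    for j, para in enumerate(article['paragraphs'])
    for _ in para['qas']
]

# Each paragraph carries only the index of its article
toy_paragraph_ids = [
    i
    for i, article in enumerate(toy_dataset)
    for _ in article['paragraphs']
]

print(toy_question_ids)   # [(0, 0), (0, 0), (0, 1), (1, 0)]
print(toy_paragraph_ids)  # [0, 0, 1]
```

Note that the paragraph ID alone does not identify a paragraph uniquely: all paragraphs of the same article share it.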

In [4]:
def get_questions_and_paragraphs(dataset):
    questions = [{
            'qas': qas,
            'context_id': (i,j)    # Article and paragraph indices of the question's context, kept as ground truth
        }
        for i in range(len(dataset))
        for j, para in enumerate(dataset[i]['paragraphs'])
        for qas in para['qas']
    ]

    paragraphs = [{
            'context': para['context'],
            'context_id': i
        }
        for i in range(len(dataset))
        for para in dataset[i]['paragraphs']
    ]

    return questions, paragraphs

train_questions, train_paragraphs = get_questions_and_paragraphs(train_paragraphs_and_questions)
val_questions, val_paragraphs = get_questions_and_paragraphs(val_paragraphs_and_questions)
test_questions, test_paragraphs = get_questions_and_paragraphs(test_paragraphs_and_questions)

print(f"Number of training questions: {len(train_questions)}")
print(f"Number of training paragraphs: {len(train_paragraphs)}")
print()
print(f"Number of val questions: {len(val_questions)}")
print(f"Number of val paragraphs: {len(val_paragraphs)}")
print()
print(f"Number of test questions: {len(test_questions)}")
print(f"Number of test paragraphs: {len(test_paragraphs)}")

Number of training questions: 65064
Number of training paragraphs: 13975

Number of val questions: 22535
Number of val paragraphs: 4921

Number of test questions: 10570
Number of test paragraphs: 2067


We define a function that returns the ground-truth paragraph of a question from its `context_id`.

In [5]:
def get_paragraph_from_question(qas, dataset):
    i,j = qas['context_id']
    return dataset[i]['paragraphs'][j]

Let's try it on a random question.

In [6]:
x = random.randint(0, len(train_questions)-1)
print(f"Question: {train_questions[x]['qas']['question']}")
print()
print(f"Ground truth context: '{get_paragraph_from_question(train_questions[x], train_paragraphs_and_questions)['context']}'")

Question: What historical event illustrated that dividing sovereignty was not possible?

Ground truth context: 'Until recently, in the absence of prior agreement on a clear and precise definition, the concept was thought to mean (as a shorthand) 'a division of sovereignty between two levels of government'. New research, however, argues that this cannot be correct, as dividing sovereignty - when this concept is properly understood in its core meaning of the final and absolute source of political authority in a political community - is not possible. The descent of the United States into Civil War in the mid-nineteenth century, over disputes about unallocated competences concerning slavery and ultimately the right of secession, showed this. One or other level of government could be sovereign to decide such matters, but not both simultaneously. Therefore, it is now suggested that federalism is more appropriately conceived as 'a division of the powers flowing from sovereignty between two le

## Vectorizers

Now, given a question (query), we would like to retrieve the paragraph that most likely contains the answer. Once we have that paragraph, we pass it to the BERT QA model we built for the standard project to obtain an answer.

One way to do that is to apply Tf-Idf to the full set of paragraphs. This should be considered a baseline: in practice we will use more complex methods. We use the `TfidfVectorizer` from scikit-learn.
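
As a quick illustration of what the vectorizer produces, the sketch below fits a `TfidfVectorizer` on a made-up three-sentence corpus (the sentences and the `toy_*` names are assumptions, not part of the project data). It shows the shape of the sparse matrix and that rarer terms receive a larger idf weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

toy_vectorizer = TfidfVectorizer(lowercase=True, norm='l2')
# Sparse matrix with one row per document and one column per vocabulary term
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)
print(toy_matrix.shape)

# Terms that occur in fewer documents get a larger idf weight:
# 'cat' appears in two documents, 'mat' in only one
vocab = toy_vectorizer.vocabulary_
idf = toy_vectorizer.idf_
print(idf[vocab['cat']] < idf[vocab['mat']])
```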

In [7]:
train_vectorizer = TfidfVectorizer(strip_accents='unicode', 
    lowercase=True, 
    max_df=0.8,     # Filter out common words that appear in more than 80% of the paragraphs
    norm='l2')      # The vectorizer also l2 normalizes the vectors it produces, 
                    # so that the cosine similarity operation between vectors simply becomes a dot product.

val_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')
test_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')

train_docs = train_vectorizer.fit_transform([train_paragraphs[i]['context'] for i in range(len(train_paragraphs))])
val_docs = val_vectorizer.fit_transform([val_paragraphs[i]['context'] for i in range(len(val_paragraphs))])
test_docs = test_vectorizer.fit_transform([test_paragraphs[i]['context'] for i in range(len(test_paragraphs))])

In [8]:
train_docs.shape, val_docs.shape, test_docs.shape

((13975, 65808), (4921, 36612), (2067, 22934))

- The training set is made of 13975 paragraphs, represented over a vocabulary of 65808 distinct terms weighted with Tf-Idf.
- The validation set is made of 4921 paragraphs, with a vocabulary of 36612 distinct terms.
- The test set is made of 2067 paragraphs, with a vocabulary of 22934 distinct terms.

To compute **scores** between a query question and the set of documents, we use the **cosine similarity**.

$$
S_C (A,B) = \frac{A \cdot B}{\lVert A\rVert  \lVert B\rVert }
$$

The `TfidfVectorizer` already L2-normalizes all the vectors it produces, so in the actual implementation we only compute a dot product, which is slightly faster.
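
The equivalence can be checked numerically. The sketch below uses random dense vectors (an assumption for illustration; the real vectors are sparse) and compares the full cosine formula with a plain dot product of the normalized vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(8)
b = rng.random(8)

# Full cosine similarity on the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once up front, then a plain dot product is enough
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(np.isclose(cosine, dot_of_units))  # True
```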

In [9]:
def score_documents(vectorizer, query, docs):
    q = query['qas']['question']
    q = vectorizer.transform([q]) # q is a sparse matrix of shape 1 x vocab_dim
    # Sparse matrix product: one dot-product score per document, flattened into a 1-D numpy array
    return (docs @ q.T).toarray().ravel()

def top_n_for_question(paragraphs, vectorizer, query, docs, n=5):
    scores = score_documents(vectorizer, query, docs)
    sorted_scores = np.argsort(-scores) # Negated scores for descending order
    return [paragraphs[i] for i in sorted_scores[:n]], scores[sorted_scores[:n]], sorted_scores[:n]

def print_top_5(query, top_5_para, top_5_scores, top_5_indices):
    print(f"Top-5 paragraphs: {top_5_indices}")
    print(f"Question: {query['qas']['question']}")
    print(f"Paragraphs:")
    for i in range(5):
        print(f"{i} (score: {top_5_scores[i]:.2f}): {top_5_para[i]['context']}\n")

In [10]:
QUERY = train_questions[0]
top5_para, top5_scores, top5_indices = top_n_for_question(train_paragraphs, train_vectorizer, QUERY, train_docs)
print_top_5(QUERY, top5_para, top5_scores, top5_indices)

Top-5 paragraphs: [   0 5037 5045 5052 8619]
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Paragraphs:
0 (score: 0.26): Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

1 (score: 0.26):  In Methodism, Mary is honored as the Mother of God. Methodists do not have any additional teachings on the Virgin Mary except fro

In this case the top-ranked paragraph is the correct one, and the other retrieved paragraphs also look relevant.

## Paragraph Retrieval evaluation

We can measure how often this simple baseline retrieves the correct paragraph on the training and test sets by computing the top-1, top-5, top-20 and top-100 accuracies.
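
For reference, top-k accuracy is the percentage of questions whose ground-truth paragraph appears among the k best-scoring candidates. A minimal sketch follows; the helper `top_k_accuracy` and the toy rankings are hypothetical and not part of the project code.

```python
import numpy as np

def top_k_accuracy(ranked_indices, gt_indices, k):
    """Percentage of queries whose ground-truth index is among the first k ranked candidates."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_indices, gt_indices))
    return 100.0 * hits / len(gt_indices)

# Made-up rankings for three queries (best candidate first)
toy_rankings = np.array([
    [4, 1, 7],
    [2, 0, 5],
    [9, 3, 6],
])
# Ground-truth paragraph index for each query
toy_ground_truth = [4, 0, 8]

print(f"Top 1: {top_k_accuracy(toy_rankings, toy_ground_truth, k=1):.2f}%")
print(f"Top 2: {top_k_accuracy(toy_rankings, toy_ground_truth, k=2):.2f}%")
```

The first query is a hit at k=1, the second only from k=2, and the third is never retrieved, so the accuracy grows with k but cannot reach 100% here.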

### Training set evaluation

We use the following function to obtain the ground-truth index of a question's paragraph in the flattened paragraph list, given the question and the dataset it belongs to.

In [11]:
def get_paragraph_encoding_index(question, dataset):
    art_id, par_id = question['context_id']
    idx = sum([len(dataset[i]['paragraphs']) for i in range(art_id)]) + par_id
    return idx
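
The index arithmetic can be checked by hand on a small example. In the sketch below, `toy_articles` and `flat_index` are hypothetical names; the logic is the same as above: skip all paragraphs of the preceding articles, then add the paragraph's position inside its own article.

```python
# Hypothetical articles with 3, 2 and 4 paragraphs respectively
toy_articles = [
    {'paragraphs': ['a0', 'a1', 'a2']},
    {'paragraphs': ['b0', 'b1']},
    {'paragraphs': ['c0', 'c1', 'c2', 'c3']},
]

# The flattened list keeps articles in order, like the paragraph lists built earlier
flat_paragraphs = [p for article in toy_articles for p in article['paragraphs']]

def flat_index(art_id, par_id, dataset):
    # Paragraphs of all preceding articles, plus the offset inside the article
    return sum(len(dataset[i]['paragraphs']) for i in range(art_id)) + par_id

idx = flat_index(2, 1, toy_articles)   # 3 + 2 paragraphs skipped, then offset 1
print(idx, flat_paragraphs[idx])       # 6 c1
```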

Then we compute the accuracy measures.

In [12]:
%%time
count_top1   = 0
count_top5   = 0
count_top20  = 0
count_top100 = 0
count_total = len(train_questions)

for q in tqdm(train_questions):
    topn_para, topn_scores, topn_indices = top_n_for_question(train_paragraphs, train_vectorizer, q, train_docs, n=100)
    gt_index = get_paragraph_encoding_index(q, train_paragraphs_and_questions)
    if gt_index == topn_indices[0]:
        count_top1 += 1
    if gt_index in topn_indices[:5]:
        count_top5 += 1
    if gt_index in topn_indices[:20]:
        count_top20 += 1
    if gt_index in topn_indices:
        count_top100 += 1

top1_score   = count_top1 / count_total * 100
top5_score   = count_top5 / count_total * 100
top20_score  = count_top20 / count_total * 100
top100_score = count_top100 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%"
      f"\nTop 20 score: {top20_score:.2f}%,\nTop 100 score: {top100_score:.2f}%")

100%|██████████| 65064/65064 [06:58<00:00, 155.29it/s]


Top 1 score: 48.83%,
Top 5 score: 73.34%
Top 20 score: 88.42%,
Top 100 score: 96.51%
CPU times: user 6min 38s, sys: 12.5 s, total: 6min 50s
Wall time: 6min 59s





### Test set evaluation

In [13]:
%%time
count_top1   = 0
count_top5   = 0
count_top20  = 0
count_top100 = 0
count_total = len(test_questions)

for q in tqdm(test_questions):
    topn_para, topn_scores, topn_indices = top_n_for_question(test_paragraphs, test_vectorizer, q, test_docs, n=100)
    gt_index = get_paragraph_encoding_index(q, test_paragraphs_and_questions)
    if gt_index == topn_indices[0]:
        count_top1 += 1
    if gt_index in topn_indices[:5]:
        count_top5 += 1
    if gt_index in topn_indices[:20]:
        count_top20 += 1
    if gt_index in topn_indices:
        count_top100 += 1

top1_score   = count_top1 / count_total * 100
top5_score   = count_top5 / count_total * 100
top20_score  = count_top20 / count_total * 100
top100_score = count_top100 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%"
      f"\nTop 20 score: {top20_score:.2f}%,\nTop 100 score: {top100_score:.2f}%")

100%|██████████| 10570/10570 [00:16<00:00, 646.87it/s]


Top 1 score: 60.79%,
Top 5 score: 84.30%
Top 20 score: 95.00%,
Top 100 score: 98.78%
CPU times: user 16.2 s, sys: 121 ms, total: 16.3 s
Wall time: 16.3 s





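The results are already quite good. The test set scores are higher than the training set ones, which is likely due at least in part to the much smaller pool of candidate paragraphs (2067 versus 13975). We will investigate whether a dense representation (e.g. vectors computed by BERT) can further improve these scores.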

### Test set evaluation with generalization

We evaluate on the test set again, but using the `train_vectorizer` in order to check the generalization capabilities of this method.
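
One reason a vectorizer fitted elsewhere may do worse is that terms it never saw during fitting are silently dropped at transform time. The sketch below shows this on made-up sentences (all names and texts are assumptions for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts, for illustration only
train_texts = ["the cat sat on the mat", "the dog chased the cat"]
unseen_text = ["the cat studied astrophysics"]

vec = TfidfVectorizer()
vec.fit(train_texts)

# Encode a text containing words that were not in the fitting corpus
encoded = vec.transform(unseen_text)
known_terms = [term for term, col in vec.vocabulary_.items() if encoded[0, col] != 0]

# 'studied' and 'astrophysics' were never seen during fit, so they are dropped
print(sorted(known_terms))  # ['cat', 'the']
```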

In [14]:
test_docs_with_train_vectorizer = \
    train_vectorizer.transform([test_paragraphs[i]['context'] 
                                for i in range(len(test_paragraphs))])

In [15]:
%%time
count_top1   = 0
count_top5   = 0
count_top20  = 0
count_top100 = 0
count_total = len(test_questions)

for q in tqdm(test_questions):
    topn_para, topn_scores, topn_indices = top_n_for_question(test_paragraphs, train_vectorizer, 
                                                              q, test_docs_with_train_vectorizer, 
                                                              n=100)
    gt_index = get_paragraph_encoding_index(q, test_paragraphs_and_questions)
    if gt_index == topn_indices[0]:
        count_top1 += 1
    if gt_index in topn_indices[:5]:
        count_top5 += 1
    if gt_index in topn_indices[:20]:
        count_top20 += 1
    if gt_index in topn_indices:
        count_top100 += 1

top1_score   = count_top1 / count_total * 100
top5_score   = count_top5 / count_total * 100
top20_score  = count_top20 / count_total * 100
top100_score = count_top100 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%"
      f"\nTop 20 score: {top20_score:.2f}%,\nTop 100 score: {top100_score:.2f}%")

100%|██████████| 10570/10570 [00:21<00:00, 503.26it/s]


Top 1 score: 50.46%,
Top 5 score: 74.27%
Top 20 score: 89.91%,
Top 100 score: 97.24%
CPU times: user 20.5 s, sys: 493 ms, total: 21 s
Wall time: 21 s





Top-1 accuracy drops from 60.79% to 50.46% compared with the vectorizer fitted on the test set, but the results are still quite good and in line with those on the training set (48.83% top-1).

## Question Answering evaluation

We can plug this simple baseline into the BERT Question Answering pipeline we built for the standard project. This lets us evaluate the actual Question Answering performance of the method.

To do that, we need to modify the dataset generator slightly: each question must be paired with the paragraph predicted by the retriever rather than with its ground-truth paragraph.

In [None]:
# New dataset generator using the Bert or DistilBert tokenizer to create input encodings for the model
# using the original questions and the PREDICTED paragraphs instead of the ground truth ones
def tf_idf_dataset_generator(questions: List[Dict],       # Can be training or test questions
                             predicted_paragraphs: List,  # This generator takes in input the List of predicted paragraphs for each question.
                             config: Config,
                             return_question_id:bool=False):
    # Iterate over questions
    for i, q in enumerate(questions):
        # Use the paragraph that the retriever ranked first for this question
        paragraph = predicted_paragraphs[i]
        # Then encode the input as usual using Bert's tokenizer
        encoded_inputs = config.tokenizer(
            q['qas']["question"],               # First we pass the question text
            paragraph['context'],               # Then the best scoring paragraph text
            max_length = config.INPUT_LEN,      # We want to pad and truncate to the max length
            truncation = True,
            padding = 'max_length',             # Pads all sequences to max_length
            return_token_type_ids = config.bert,# Return if the token is from sentence 
                                                # 0 or sentence 1
            return_attention_mask = True,       # Return if it's a pad token or not
        )
        if return_question_id:
            yield dict(encoded_inputs), q['qas']['id']
        else:
            yield dict(encoded_inputs)
    
# Here instead we generate the "original" dataset containing only the context of the predicted
# paragraph and the offset mappings of the tokens.
def create_original_dataset_with_tf_idf(questions: List[Dict], 
                                        predicted_paragraphs: List,
                                        config: Config):
    features = []
    for i, q in enumerate(questions):
        inputs={}
        # The paragraph is collected from those that were pre-predicted
        paragraph = predicted_paragraphs[i]
        encoded_inputs = config.tokenizer(
            q['qas']["question"],               # First we pass the question
            paragraph["context"],               # Then the context

            max_length = config.INPUT_LEN,      # We want to pad and truncate to this length
            truncation = True,
            padding = 'max_length',             # Pads all sequences to max_length

            return_token_type_ids = False,      # Segment ids are not needed here
            return_attention_mask = False,      # Neither is the padding mask

            return_offsets_mapping = True       # Returns each token's first and last char 
                                                # positions in the original sentence
                                                # (we will use it to match answers starting 
                                                # and ending points to tokens)
        )
        # We fill the inputs dictionary
        inputs["context"] = paragraph["context"]
        inputs["offset_mapping"] = encoded_inputs["offset_mapping"]
        features.append(inputs)

    return tf.data.Dataset.from_tensor_slices(
        pd.DataFrame.from_dict(features).to_dict(orient="list"))


# This function creates the actual dataset using the generator.
def create_dataset_using_tf_idf_vectorizer( questions: List[Dict],
                                            predicted_paragraphs: List,
                                            config: Config  ) -> tf.data.Dataset:
    # Create expected signature for the generator output.
    # The tokenizer pads to config.INPUT_LEN (the original code hard-coded 512 here).
    seq_spec = tf.TensorSpec(shape=(config.INPUT_LEN,), dtype=tf.int32)
    features = {'input_ids': seq_spec, 'attention_mask': seq_spec}
    if config.bert:
        # token_type_ids are only produced when config.bert is set
        features['token_type_ids'] = seq_spec
    # The dataset contains the features and the question IDs (strings)
    signature = (features, tf.TensorSpec(shape=(), dtype=tf.string))
    # Instantiates a partial generator
    data_gen = partial(tf_idf_dataset_generator, 
        questions, predicted_paragraphs, config, return_question_id=True)
    # Creates the dataset with the computed signature
    dataset = tf.data.Dataset.from_generator(data_gen,
        output_signature=signature)
    # Compute dataset length, to be used by tensorflow internals
    dataset = dataset.apply(tf.data.experimental.assert_cardinality(len(questions)))
    # Return the dataset
    return dataset

Then, we make some changes to the prediction and evaluation function.

In [None]:
### Prediction and evaluation function ###
def predict_and_evaluate(best_weights_path:str, 
                         path_to_predictions_json:str,
                         config:Config,
                         dataset_type:str='test',               # One of 'train', 'val', 'test'
                         hidden_state_list:List[int]=[3,4,5,6],
                         bert=False):
    print("Collecting the requested dataset and vectorizer...")
    if dataset_type == 'train':
        questions = train_questions
        paragraphs = train_paragraphs
        vectorizer = train_vectorizer
        docs = train_docs
        # dataset_path = 
    elif dataset_type == 'val':
        questions = val_questions
        paragraphs = val_paragraphs
        vectorizer = val_vectorizer
        docs = val_docs
        dataset_path = os.path.join(config.ROOT_PATH, 'data', 'validation_set.json')
    elif dataset_type == 'test':
        questions = test_questions
        paragraphs = test_paragraphs
        vectorizer = test_vectorizer
        docs = test_docs
        dataset_path = os.path.join(config.ROOT_PATH, 'data', 'dev_set.json')
    else:
        raise NotImplementedError("That dataset type does not exist. Change the dataset_type argument into one of ['train', 'val', 'test']")
    # We pre-compute the predicted paragraph for each question in the set.
    print("Obtaining best paragraph for questions...")
    predicted_paragraphs = [top_n_for_question(paragraphs, vectorizer, q, docs)[0][0] for q in tqdm(questions)]
    print("Creating model and dataset...")
    config = Config(bert=bert)
    # Process questions
    dataset = create_dataset_using_tf_idf_vectorizer(questions, predicted_paragraphs, config)
    print("Number of samples: ", len(dataset))
    dataset = dataset.batch(config.BATCH_SIZE)
    # Generate the original dataset that contains the original context and token-char mapping
    original_dataset = create_original_dataset_with_tf_idf(questions, predicted_paragraphs, config)
    original_dataset = original_dataset.batch(config.BATCH_SIZE)
    # Load model with the best obtained weights from the old project
    model = config.create_standard_model(hidden_state_list=hidden_state_list)
    model.load_weights(best_weights_path)
    print("Computing predictions...")
    # Predict the answers to the questions in the dataset
    predictions = utils.compute_predictions(dataset, original_dataset, model)
    print(f"Done! Saving predictions at {path_to_predictions_json} and running evaluation script...")
    # Create a prediction file formatted like the one that is expected
    with open(path_to_predictions_json, 'w') as f:
        json.dump(predictions, f)
    
    !python eval/evaluate.py $dataset_path $path_to_predictions_json

Finally, we run an evaluation on the test set.

In [None]:
BEST_WEIGHTS_PATH = "./checkpoints/normal.h5" if not IN_COLAB else '/content/drive/MyDrive/Uni/Magistrale/NLP/Project/weights/normal_100_tpu_h5_cval/normal.h5'
PATH_TO_PREDICTIONS_JSON = '../data/results/tf_idf_test_predictions_normal.json'

In [None]:
predict_and_evaluate(BEST_WEIGHTS_PATH, PATH_TO_PREDICTIONS_JSON, config, dataset_type='test')

Collecting the requested dataset and vectorizer...
Obtaining best paragraph for questions...


100%|██████████| 10570/10570 [00:19<00:00, 540.27it/s]


Creating model and dataset...
Number of samples:  10570


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Computing predictions...


100%|██████████| 661/661 [04:26<00:00,  2.48it/s]


Done! Saving predictions at ../data/results/tf_idf_test_predictions_normal.json and running evaluation script...
{
  "exact": 38.666035950804165,
  "f1": 49.26541133366401,
  "total": 10570,
  "HasAns_exact": 38.666035950804165,
  "HasAns_f1": 49.26541133366401,
  "HasAns_total": 10570
}
