In [1]:
username = 'MarcelloCeresini'
repository = 'QuestionAnswering'

# COLAB ONLY CELLS
try:
    import google.colab
    IN_COLAB = True
    !pip3 install transformers
    !nvidia-smi             # Check which GPU has been chosen for us
    !git clone https://www.github.com/{username}/{repository}.git
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd /content/QuestionAnswering/src
except ImportError:
    IN_COLAB = False


# Tf-Idf retrieval baseline

In this notebook, we implement a simple baseline for paragraph retrieval using Tf-Idf weighted sparse representations of documents and query questions.

In [2]:
%matplotlib inline

import os
from tqdm import tqdm
import random
import json
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from functools import partial

from sklearn.feature_extraction.text import TfidfVectorizer

from config import Config
config = Config()
import utils

# Fix random seed for reproducibility
np.random.seed(config.RANDOM_SEED)
random.seed(config.RANDOM_SEED)
tf.random.set_seed(config.RANDOM_SEED)

from typing import List, Dict
#os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'


In [3]:
ROOT_PATH = os.path.dirname(os.getcwd())
TRAINING_FILE = os.path.join(ROOT_PATH, 'data', 'training_set.json')
paragraphs_and_questions = utils.read_question_set(TRAINING_FILE)

First of all, we separate the questions from their paragraphs.

In [4]:
questions = [{
        'qas': qas,
        'context_id': (i,j)    # We also track the question's original context and paragraph indices so as to have a ground truth
    }
    for i in range(len(paragraphs_and_questions['data']))
    for j, para in enumerate(paragraphs_and_questions['data'][i]['paragraphs'])
    for qas in para['qas']
]

paragraphs = [{
        'context': para['context'],
        'context_id': i
    }
    for i in range(len(paragraphs_and_questions['data']))
    for para in paragraphs_and_questions['data'][i]['paragraphs']
]

print(f"Number of questions: {len(questions)}")
print(f"Number of paragraphs: {len(paragraphs)}")

Number of questions: 87599
Number of paragraphs: 18896


We build a function to obtain the paragraph given our context indices

In [5]:
def get_paragraph_from_question(qas, dataset):
    i,j = qas['context_id']
    return dataset['data'][i]['paragraphs'][j]

In [6]:
x = random.randint(0, len(questions)-1)
print(f"Question: {questions[x]['qas']['question']}")
print()
print(f"Ground truth context: '{get_paragraph_from_question(questions[x], paragraphs_and_questions)['context']}'")

Question: What poet wrote a long poem describing Roman religious holidays?

Ground truth context: 'The meaning and origin of many archaic festivals baffled even Rome's intellectual elite, but the more obscure they were, the greater the opportunity for reinvention and reinterpretation — a fact lost neither on Augustus in his program of religious reform, which often cloaked autocratic innovation, nor on his only rival as mythmaker of the era, Ovid. In his Fasti, a long-form poem covering Roman holidays from January to June, Ovid presents a unique look at Roman antiquarian lore, popular customs, and religious practice that is by turns imaginative, entertaining, high-minded, and scurrilous; not a priestly account, despite the speaker's pose as a vates or inspired poet-prophet, but a work of description, imagination and poetic etymology that reflects the broad humor and burlesque spirit of such venerable festivals as the Saturnalia, Consualia, and feast of Anna Perenna on the Ides of March,

Now, given a question (query), we would like to obtain the paragraph that most likely contains the answer, so that we can pass it to the BERT QA model. One way to do that is to use Tf-Idf over the full set of paragraphs. In practice we will use more complex methods, and this should be considered a baseline. We will use a `TfidfVectorizer` from scikit-learn.
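
To make the idea concrete, here is a minimal pure-Python sketch of Tf-Idf retrieval on a toy corpus. It is not the notebook's implementation: the whitespace tokenizer, the toy sentences and all helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `dot`) are assumptions made for illustration. The idf formula is the smoothed variant `ln((1 + N) / (1 + df)) + 1`, which is what scikit-learn applies by default; the tokenization and filtering steps of `TfidfVectorizer` are not reproduced.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase + whitespace tokenizer (an assumption for this sketch)
    return text.lower().split()

def fit_idf(corpus):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in df.items()}

def tfidf_vector(text, idf):
    # Term frequency times idf, followed by L2 normalization.
    # Terms that were not seen while fitting are dropped.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def dot(u, v):
    # Sparse dot product between two {term: weight} dictionaries
    return sum(w * v.get(t, 0.0) for t, w in u.items())

toy_corpus = [
    "ovid wrote a long poem about roman holidays",
    "the basilica stands next to the main building",
    "methodists honor mary as the mother of god",
]
toy_idf = fit_idf(toy_corpus)
toy_docs = [tfidf_vector(d, toy_idf) for d in toy_corpus]
toy_query = tfidf_vector("which poet wrote about roman holidays", toy_idf)
toy_scores = [dot(toy_query, d) for d in toy_docs]
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best, [round(s, 3) for s in toy_scores])
```

Only the first toy paragraph shares terms with the query, so it is the one retrieved; the other two receive a score of zero.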

In [7]:
vectorizer = TfidfVectorizer(strip_accents='unicode', 
    lowercase=True, 
    max_df=0.8,     # Filter out common words that appear in more than 80% of the paragraphs
    norm='l2') # The vectorizer also l2 normalizes the vectors it produces, so that the cosine similarity operation between vectors simply becomes a dot product.
docs = vectorizer.fit_transform([paragraphs[i]['context'] for i in range(len(paragraphs))])

In [8]:
docs.shape

(18896, 77747)

Now, in order to compute scores between a query question and the set of documents, we use **cosine similarity**.

$$
S_C (A,B) = \frac{A \cdot B}{\lVert A\rVert  \lVert B\rVert }
$$

Note: the `TfidfVectorizer` we use already L2-normalizes all vectors it produces, so in the actual implementation we only compute a dot product.
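
The equivalence in the note can be checked numerically. The following is a small sketch with hand-picked vectors; the helper names (`l2_normalize`, `dot_product`, `cosine_similarity`) are illustrative and do not come from the notebook.

```python
import math

def l2_normalize(v):
    # Scale a vector so that its Euclidean norm is 1
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Direct application of S_C(A, B) = (A . B) / (||A|| ||B||)
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

a = [1.0, 2.0, 0.0, 3.0]
b = [0.5, 0.0, 4.0, 1.0]

cos_ab = cosine_similarity(a, b)
dot_normalized = dot_product(l2_normalize(a), l2_normalize(b))
print(math.isclose(cos_ab, dot_normalized))
```

Once both vectors have unit norm, the denominator of the formula equals 1, which is why a plain dot product is enough.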

In [9]:
def score_documents(vectorizer, query, docs):
    q = query['qas']['question']
    q = vectorizer.transform([q]) # q will be a (sparse) matrix with dimensionality 1 x vocab_dim
    # docs is (n_docs x vocab_dim) and q.T is (vocab_dim x 1): their sparse product holds all
    # dot-product scores, which we convert to a flat numpy array
    return docs.dot(q.T).toarray().ravel()

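def top_5_for_question(vectorizer, query, docs):
    # Note: the returned paragraphs are looked up in the notebook-level `paragraphs` list,
    # so `docs` must be the Tf-Idf matrix built from that same list.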
    scores = score_documents(vectorizer, query, docs)
    sorted_scores = np.argsort(-scores) # Negated for descending order
    return [paragraphs[i] for i in sorted_scores[:5]], scores[sorted_scores[:5]], sorted_scores[:5]

def print_top_5(query, top_5_para, top_5_scores, top_5_indices):
    print(f"Top-5 paragraphs: {top_5_indices}")
    print(f"Question: {query['qas']['question']}")
    print(f"Paragraphs:")
    for i in range(5):
        print(f"{i} (score: {top_5_scores[i]:.2f}): {top_5_para[i]['context']}\n")

QUERY = questions[0]
top5_para, top5_scores, top5_indices = top_5_for_question(vectorizer, QUERY, docs)
print_top_5(QUERY, top5_para, top5_scores, top5_indices)


Top-5 paragraphs: [    0  6929  6937  6944 12250]
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Paragraphs:
0 (score: 0.27): Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

1 (score: 0.26):  In Methodism, Mary is honored as the Mother of God. Methodists do not have any additional teachings on the Virgin Mary excep

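We can measure how often this simple baseline retrieves a paragraph from the correct article (top-1 and top-5 accuracy). Since each paragraph's `context_id` stores only its article index, a hit is counted at the article level.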
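
As a self-contained illustration of the metric, here is a minimal sketch of top-k accuracy over ranked retrieval results. The function name `top_k_accuracy` and the toy identifiers are assumptions made for this example; the notebook computes the same quantities inline.

```python
def top_k_accuracy(ranked_ids, gold_ids, k):
    # ranked_ids[n] lists the identifiers retrieved for query n, best first.
    # A query counts as a hit if its gold identifier is among the first k.
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if gold in ranked[:k])
    return 100.0 * hits / len(gold_ids)

# Four toy queries with three retrieved identifiers each
toy_ranked = [[3, 1, 7], [2, 5, 0], [4, 4, 9], [8, 6, 1]]
toy_gold = [3, 0, 9, 5]

top1 = top_k_accuracy(toy_ranked, toy_gold, k=1)
top3 = top_k_accuracy(toy_ranked, toy_gold, k=3)
print(f"Top-1: {top1:.2f}%, Top-3: {top3:.2f}%")
```

By construction top-k accuracy can only grow with k, so the top-5 score is always at least as high as the top-1 score.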

In [None]:
%%time
count_top1 = 0
count_top5 = 0
count_total = len(questions)

for q in tqdm(questions):
    top5_para, top5_scores, top5_indices = top_5_for_question(vectorizer, q, docs)
    top5_context_ids = [top5_para[i]['context_id'] for i in range(len(top5_para))]
    gt_context_id = q['context_id'][0]
    if gt_context_id == top5_context_ids[0]:
        count_top1 += 1
    if gt_context_id in top5_context_ids:
        count_top5 += 1

top1_score = count_top1 / count_total * 100
top5_score = count_top5 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%")

Top 1 score: 72.01%,
Top 5 score: 87.82%


The results are quite good already, but we'll investigate whether using a dense representation (e.g. vectors computed by BERT) can further improve these scores.

## Answer computation

Here we check the quality of the answers produced by our usual model when it is given the top-ranked paragraph retrieved by Tf-Idf.

First, we define a new kind of dataset containing everything we need.
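
The dataset construction relies on the tokenizer's offset mapping to convert an answer's character span into token positions. The sketch below shows one possible way to do that conversion; it is not the code of `utils.find_start_end_token_one_hot_encoded`, whose implementation is not shown here. The function name `char_span_to_token_span` and the hand-written offsets are assumptions, and the example encodes the context alone, whereas a real question and context pair would also require skipping the question tokens.

```python
def char_span_to_token_span(offsets, answer_start, answer_text):
    # offsets[i] = (first_char, last_char_exclusive) of token i in the context
    answer_end = answer_start + len(answer_text)  # exclusive end position
    start_token = end_token = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # special or padding tokens have an empty span
        if start_token is None and tok_start <= answer_start < tok_end:
            start_token = idx
        if tok_start < answer_end <= tok_end:
            end_token = idx
    return start_token, end_token

context = "Ovid wrote the Fasti"
# Hand-written offsets imitating a tokenizer output; (0, 0) marks special tokens
offsets = [(0, 0), (0, 4), (5, 10), (11, 14), (15, 20), (0, 0)]
answer = "the Fasti"
start_char = context.index(answer)

span = char_span_to_token_span(offsets, start_char, answer)
print(span)
```

If truncation cuts the answer out of the input, one or both returned positions stay `None`, and the caller has to decide how to label such samples.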

In [13]:
# First of all, we create a function that instantiates a TensorFlow dataset
# matching the best scoring paragraph for each question using the provided 
# Vectorizer
def tf_idf_dataset_generator(
    questions: List[Dict], paragraphs: List[Dict], 
    vectorizer:TfidfVectorizer, config: Config,
    return_labels:bool=False, return_NER_attention:bool=False,
    return_question_id:bool=False, NER_value:float=0):

    for question_and_answer in questions:
        # Use the vectorizer to obtain the best scoring paragraph from the question
        top5_para, _, _ = top_5_for_question(vectorizer, question_and_answer, paragraphs)
        question_and_answer = question_and_answer['qas']
        paragraph = top5_para[0]
        # Then encode the input using Bert's tokenizer
        encoded_inputs = config.tokenizer(
            question_and_answer["question"],    # First we pass the question
            paragraph['context'],            # Then the best scoring context

            max_length = config.INPUT_LEN,      # We want to pad and truncate to this length
            truncation = True,
            padding = 'max_length',             # Pads all sequences to 512.

            return_token_type_ids = config.bert,# Return if the token is from sentence 
                                                # 0 or sentence 1
            return_attention_mask = True,       # Return if it's a pad token or not

            return_offsets_mapping = True       # Returns each token's first and last char 
                                                # positions in the original sentence
                                                # (we will use it to match answers starting 
                                                # and ending points to tokens)
        )

        if return_labels:
            ### MAPPING OF THE START OF THE ANSWER BETWEEN CHARS AND TOKENS ###
            # We want to pass from the starting position in chars to the starting position in tokens
            label = utils.find_start_end_token_one_hot_encoded(
                # We pass the list of answers (usually there is still one per question,
                #   but we mustn't assume anything)
                answers = question_and_answer["answers"],
                # And also the inputs' offset mapping just received from the tokenizer
                offsets = encoded_inputs["offset_mapping"]
            )

        if return_NER_attention:
            encoded_inputs['NER_attention'] = utils.create_NER_attention_vector(
                context=paragraph["context"], 
                offsets=encoded_inputs["offset_mapping"],
                spacy_instance=config.ner_extractor,
                config=config, 
                non_ne_weight=1-NER_value,
                ne_weight=1+NER_value
            )

        encoded_inputs.pop("offset_mapping", None) # Removes the offset mapping, not useful anymore 
                                                    # ("None" is used because otherwise KeyError 
                                                    # could be raised if the key wasn't present)

        if return_question_id and return_labels:  
            yield dict(encoded_inputs), question_and_answer['id'], {
                'out_S': label['out_S'],
                'out_E': label['out_E']
            }
        elif return_labels:
            yield dict(encoded_inputs), {
                'out_S': label['out_S'],
                'out_E': label['out_E']
            }
        elif return_question_id:
            yield dict(encoded_inputs), question_and_answer['id']
        else:
            yield dict(encoded_inputs)

def create_original_dataset_with_tf_idf(questions, paragraphs, vectorizer, config: Config):
    features = []
    for question_and_answer in questions:
        # Use the vectorizer to obtain the best scoring paragraph from the question
        top5_para, _, _ = top_5_for_question(vectorizer, question_and_answer, paragraphs)
        question_and_answer = question_and_answer['qas']
        paragraph = top5_para[0]
        inputs={}
        ### QUESTION AND CONTEXT TOKENIZATION ###
        # For question answering with DistilBERT we need to encode both 
        # question and context, and this is the way in which 
        # HuggingFace's DistilBertTokenizer does it.
        # The tokenizer returns a dictionary containing all the information we need
        encoded_inputs = config.tokenizer(
            question_and_answer["question"],    # First we pass the question
            paragraph["context"],               # Then the context

            max_length = config.INPUT_LEN,      # We want to pad and truncate to this length
            truncation = True,
            padding = 'max_length',             # Pads all sequences to 512.

            return_token_type_ids = False,      # Return if the token is from sentence 
                                                # 0 or sentence 1
            return_attention_mask = False,      # Return if it's a pad token or not

            return_offsets_mapping = True       # Returns each token's first and last char 
                                                # positions in the original sentence
                                                # (we will use it to match answers starting 
                                                # and ending points to tokens)
        )
        inputs["context"] = paragraph["context"]
        inputs["offset_mapping"] = encoded_inputs["offset_mapping"]
        features.append(inputs)

    return tf.data.Dataset.from_tensor_slices(
        pd.DataFrame.from_dict(features).to_dict(orient="list"))


def create_dataset_using_tf_idf_vectorizer(
        questions: List[Dict],
        paragraphs: List[Dict],
        vectorizer: TfidfVectorizer,
        config: Config,
        for_training: bool=True,
        use_NER_attention:bool=False,
        NER_value:float=0
    ) -> tf.data.Dataset:
    # Labels are only returned in training, while question IDs only when not training
    return_labels = for_training
    return_question_id = not for_training
    # Create expected signature for the generator output
    if config.bert:
        features = {
            'input_ids': tf.TensorSpec(shape=(512,), dtype=tf.int32), 
            'attention_mask': tf.TensorSpec(shape=(512,), dtype=tf.int32),
            'token_type_ids': tf.TensorSpec(shape=(512,), dtype=tf.int32)
        }
    else:
        features = {
            'input_ids': tf.TensorSpec(shape=(512,), dtype=tf.int32), 
            'attention_mask': tf.TensorSpec(shape=(512,), dtype=tf.int32)
        }
    if use_NER_attention:
        features['NER_attention'] = tf.TensorSpec(shape=(512,), dtype=tf.float64)
    if for_training:
        # The dataset contains the features and the labels
        signature = (features, {
            'out_S': tf.TensorSpec(shape=(512,), dtype=tf.float64), 
            'out_E': tf.TensorSpec(shape=(512,), dtype=tf.float64)
        })
    else:
        # The dataset contains the features and the question IDs (strings)
        signature = (features, tf.TensorSpec(shape=(), dtype=tf.string))
    # Instantiates a partial generator
    data_gen = partial(tf_idf_dataset_generator, questions,
        paragraphs, vectorizer, config, 
        return_labels=return_labels, 
        return_question_id=return_question_id,
        return_NER_attention=use_NER_attention,
        NER_value=NER_value)
    # Creates the dataset with the computed signature
    dataset = tf.data.Dataset.from_generator(data_gen,
        output_signature=signature)
    # Compute dataset length, to be used by tensorflow internals
    dataset = dataset.apply(tf.data.experimental.assert_cardinality(len(questions)))
    # Return the dataset
    return dataset

In [14]:
### Prediction and evaluation function ###
def predict_and_evaluate(
                         DATASET_PATH:str, 
                         BEST_WEIGHTS_PATH:str, 
                         PATH_TO_PREDICTIONS_JSON:str,
                         hidden_state_list:List[int]=[3,4,5,6],
                         use_NER_attention=False, NER_value=0,
                         bert=False):
    '''
    Uses a TfIdf vectorizer to gather the best scoring document for each question (query),
    then uses the standard model to predict the answers to the dataset provided in 
    `DATASET_PATH` using the selected weights (`BEST_WEIGHTS_PATH`), 
    saves the predictions into `PATH_TO_PREDICTIONS_JSON` and executes SQuAD's
    evaluation script to get the exact match accuracy and F1 score.
    '''
    data = utils.read_question_set(DATASET_PATH)
    questions = [{
            'qas': qas,
            'context_id': (i,j)    # We also track the question's original context 
                                   # and paragraph indices so to have a ground truth
        }
        for i in range(len(data['data']))
        for j, para in enumerate(data['data'][i]['paragraphs'])
        for qas in para['qas']
    ]
    paragraphs = [{
            'context': para['context'],
            'context_id': i
        }
        for i in range(len(data['data']))
        for para in data['data'][i]['paragraphs']
    ]
    vectorizer = TfidfVectorizer(
        strip_accents='unicode',    # Accents are stripped from characters using Unicode normalization
        lowercase=True, # Then, we transform all text to lowercase
        max_df=0.8,     # Filter out common words that appear in more 
                        # than 80% of the paragraphs
        norm='l2'       # The vectorizer also l2 normalizes the vectors it produces, 
                        # so that the cosine similarity operation between 
                        # vectors simply becomes a dot product.
    )      
    docs = vectorizer.fit_transform(
        [paragraphs[i]['context'] for i in range(len(paragraphs))] # Transform the paragraphs and fit the vectorizer
    )
    config = Config(bert=bert)
    # Process questions
    dataset = create_dataset_using_tf_idf_vectorizer(questions, 
        docs, vectorizer, config, for_training=False, 
        use_NER_attention=use_NER_attention, 
        NER_value=NER_value
    )
    print("Number of samples: ", len(dataset))
    # Generate the original dataset that contains the original context and token-char mapping
    original_dataset = create_original_dataset_with_tf_idf(questions, docs, vectorizer, config)
    original_dataset = original_dataset.batch(config.BATCH_SIZE)
    dataset = dataset.batch(config.BATCH_SIZE)
    # Load model
    if not use_NER_attention:
        model = config.create_standard_model(hidden_state_list=hidden_state_list)
    else:
        raise NotImplementedError
    # Load best model weights
    model.load_weights(BEST_WEIGHTS_PATH)
    # Predict the answers to the questions in the dataset
    predictions = utils.compute_predictions(dataset, original_dataset, model)
    # Create a prediction file formatted like the one that is expected
    with open(PATH_TO_PREDICTIONS_JSON, 'w') as f:
        json.dump(predictions, f)
    
    !python eval/evaluate.py $DATASET_PATH $PATH_TO_PREDICTIONS_JSON

In [18]:
!ls "/content/QuestionAnswering/data/"

dev_set.json   logs	tiny_training_set.json	validation_set.json
dev-v2.0.json  results	training_set.json


In [19]:
DATASET_PATH = "/content/QuestionAnswering/data/dev_set.json"
BEST_WEIGHTS_PATH = "/content/drive/MyDrive/Uni/Magistrale/NLP/Project/weights/normal_100_tpu_h5_cval/normal.h5"
PATH_TO_PREDICTIONS_JSON = '/content/drive/MyDrive/Uni/Magistrale/NLP/Project/results/tf_idf_predictions_tpu_normal.txt'

predict_and_evaluate(DATASET_PATH, BEST_WEIGHTS_PATH, PATH_TO_PREDICTIONS_JSON)

Number of samples:  10570


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
100%|██████████| 661/661 [04:43<00:00,  2.33it/s]


{
  "exact": 0.01892147587511826,
  "f1": 1.1653256545487614,
  "total": 10570,
  "HasAns_exact": 0.01892147587511826,
  "HasAns_f1": 1.1653256545487614,
  "HasAns_total": 10570
}
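
For reference, the two reported metrics can be sketched as follows. This is a simplified illustration of SQuAD-style exact match and token-level F1, not the contents of `eval/evaluate.py`; the function names and the example strings are assumptions, and the handling of unanswerable questions is left out.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    # Lowercase, drop punctuation, drop English articles, collapse whitespace
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Fasti", "fasti")
f1 = f1_score("Ovid's long poem Fasti", "the Fasti")
print(em, round(f1, 2))
```

Exact match rewards only answers that are identical after normalization, while F1 gives partial credit for overlapping tokens, which is why F1 is never lower than exact match.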
