In [1]:
username = 'MarcelloCeresini'
repository = 'QuestionAnswering'

# COLAB ONLY CELLS
try:
    import google.colab
    IN_COLAB = True
    !pip3 install transformers
    !git clone https://www.github.com/{username}/{repository}.git
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd /content/QuestionAnswering/src
except:
    IN_COLAB = False

# Tf-Idf retrieval baseline

In this notebook, we implement a simple baseline for paragraph retrieval using Tf-Idf weighted sparse representations of documents and query questions.

In [2]:
%matplotlib inline

import os
from tqdm import tqdm
import random
import json
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from functools import partial

from sklearn.feature_extraction.text import TfidfVectorizer

from config import Config
config = Config()
import utils

# Fix random seed for reproducibility
np.random.seed(config.RANDOM_SEED)
random.seed(config.RANDOM_SEED)
tf.random.set_seed(config.RANDOM_SEED)

from typing import List, Dict
#os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

Load the datasets

In [17]:
ROOT_PATH = os.path.dirname(os.getcwd())
TRAINING_FILE = os.path.join(ROOT_PATH, 'data', 'training_set.json')
VALIDATION_FILE = os.path.join(ROOT_PATH, 'data', 'validation_set.json')
TEST_FILE = os.path.join(ROOT_PATH, 'data', 'dev_set.json')

train_paragraphs_and_questions = utils.read_question_set(TRAINING_FILE)['data']
val_paragraphs_and_questions = utils.read_question_set(VALIDATION_FILE)['data']
test_paragraphs_and_questions = utils.read_question_set(TEST_FILE)['data']

# Remove the validation set from the train set
train_paragraphs_and_questions = [article for article in train_paragraphs_and_questions \
                                  if article not in val_paragraphs_and_questions]

First of all, we separate questions from their paragraphs. We assign a `context_id` to each question and paragraph:

- For questions, the context ID is a tuple `(paragraph_id, question_id_in_paragraph)`
- For paragraphs, the context ID is a simple index corresponding to the first value in the tuple of questions.

In [19]:
def get_questions_and_paragraphs(dataset):
    questions = [{
            'qas': qas,
            'context_id': (i,j)    # We also track the question's original context and paragraph indices so to have a ground truth
        }
        for i in range(len(dataset))
        for j, para in enumerate(dataset[i]['paragraphs'])
        for qas in para['qas']
    ]

    paragraphs = [{
            'context': para['context'],
            'context_id': i
        }
        for i in range(len(dataset))
        for para in dataset[i]['paragraphs']
    ]

    return questions, paragraphs

train_questions, train_paragraphs = get_questions_and_paragraphs(train_paragraphs_and_questions)
val_questions, val_paragraphs = get_questions_and_paragraphs(val_paragraphs_and_questions)
test_questions, test_paragraphs = get_questions_and_paragraphs(test_paragraphs_and_questions)

print(f"Number of training questions: {len(train_questions)}")
print(f"Number of training paragraphs: {len(train_paragraphs)}")
print()
print(f"Number of val questions: {len(val_questions)}")
print(f"Number of val paragraphs: {len(val_paragraphs)}")
print()
print(f"Number of test questions: {len(test_questions)}")
print(f"Number of test paragraphs: {len(test_paragraphs)}")

Number of training questions: 65064
Number of training paragraphs: 13975

Number of val questions: 22535
Number of val paragraphs: 4921

Number of test questions: 10570
Number of test paragraphs: 2067


We build a function to obtain paragraphs given the `context_id`s we built.

In [20]:
def get_paragraph_from_question(qas, dataset):
    i,j = qas['context_id']
    return dataset[i]['paragraphs'][j]

Let's try it on a random question.

In [23]:
x = random.randint(0, len(train_questions)-1)
print(f"Question: {train_questions[x]['qas']['question']}")
print()
print(f"Ground truth context: '{get_paragraph_from_question(train_questions[x], train_paragraphs_and_questions)['context']}'")

Question: Which 1909 ballet used Chopin's music?

Ground truth context: 'Chopin's music was used in the 1909 ballet Chopiniana, choreographed by Michel Fokine and orchestrated by Alexander Glazunov. Sergei Diaghilev commissioned additional orchestrations—from Stravinsky, Anatoly Lyadov, Sergei Taneyev and Nikolai Tcherepnin—for later productions, which used the title Les Sylphides.'


Now, given a question (query) we would like to obtain the paragraph that most probably contains the answer. Once we have a paragraph, we pass it into the BERT QA model we created for the standard project to obtain an answer. 

One way to do that is by using Tf-Idf on the large set of paragraphs. In reality we will use more complex methods and this should be considered a baseline. We will use a `TdIdfVectorizer` from Scikit Learn.

In [24]:
train_vectorizer = TfidfVectorizer(strip_accents='unicode', 
    lowercase=True, 
    max_df=0.8,     # Filter out common words that appear in more than 80% of the paragraphs
    norm='l2')      # The vectorizer also l2 normalizes the vectors it produces, 
                    # so that the cosine similarity operation between vectors simply becomes a dot product.

val_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')
test_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')

train_docs = train_vectorizer.fit_transform([train_paragraphs[i]['context'] for i in range(len(train_paragraphs))])
val_docs = val_vectorizer.fit_transform([val_paragraphs[i]['context'] for i in range(len(val_paragraphs))])
test_docs = test_vectorizer.fit_transform([test_paragraphs[i]['context'] for i in range(len(test_paragraphs))])

In [25]:
train_docs.shape, val_docs.shape, test_docs.shape

((13975, 65808), (4921, 36612), (2067, 22934))

TODO

Now, in order to compute scores between a query question and the set of document, we use **cosine similarity**.

$$
S_C (A,B) = \frac{A \cdot B}{\lVert A\rVert  \lVert B\rVert }
$$

Note: the `TfIdfVectorizer` we use already L2-normalizes all vectors it produces, so in the actual implementation we only compute a dot product.

In [9]:
def score_documents(vectorizer, query, docs):
    q = query['qas']['question']
    q = vectorizer.transform([q]) # q will be a (sparse) matrix with dimensionality 1 x vocab_dim
    # We can compute a vector of all dot products scores and transform it from dense matrix to numpy array like this:
    return np.asarray(np.dot(docs, q.T).todense()).flatten()

def top_5_for_question(paragraphs, vectorizer, query, docs):
    scores = score_documents(vectorizer, query, docs)
    sorted_scores = np.argsort(-scores) # Negated for descending order
    return [paragraphs[i] for i in sorted_scores[:5]], scores[sorted_scores[:5]], sorted_scores[:5]

def print_top_5(query, top_5_para, top_5_scores, top_5_indices):
    print(f"Top-5 paragraphs: {top_5_indices}")
    print(f"Question: {query['qas']['question']}")
    print(f"Paragraphs:")
    for i in range(5):
        print(f"{i} (score: {top_5_scores[i]:.2f}): {top_5_para[i]['context']}\n")

In [10]:
QUERY = questions[0]
top5_para, top5_scores, top5_indices = top_5_for_question(paragraphs, vectorizer, QUERY, docs)
print_top_5(QUERY, top5_para, top5_scores, top5_indices)

Top-5 paragraphs: [    0  6929  6937  6944 12250]
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Paragraphs:
0 (score: 0.27): Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

1 (score: 0.26):  In Methodism, Mary is honored as the Mother of God. Methodists do not have any additional teachings on the Virgin Mary excep

We can measure how many times this simple baseline retrieves the correct paragraph on the training and test sets (top-1 and top-5 accuracy)

### Training set evaluation

In [12]:
%%time
count_top1 = 0
count_top5 = 0
count_total = len(questions)

for q in tqdm(questions):
    top5_para, top5_scores, top5_indices = top_5_for_question(paragraphs, vectorizer, q, docs)
    top5_context_ids = [top5_para[i]['context_id'] for i in range(len(top5_para))]
    gt_context_id = q['context_id'][0]
    if gt_context_id == top5_context_ids[0]:
        count_top1 += 1
    if gt_context_id in top5_context_ids:
        count_top5 += 1

top1_score = count_top1 / count_total * 100
top5_score = count_top5 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%")

100%|██████████| 87599/87599 [07:19<00:00, 199.44it/s]


Top 1 score: 72.01%,
Top 5 score: 87.82%
Wall time: 7min 19s





### Test set evaluation

In [13]:
test_paragraphs_and_questions = utils.read_question_set(TEST_FILE)

test_questions = [{
        'qas': qas,
        'context_id': (i,j)    # We also track the question's original context and paragraph indices so to have a ground truth
    }
    for i in range(len(test_paragraphs_and_questions['data']))
    for j, para in enumerate(test_paragraphs_and_questions['data'][i]['paragraphs'])
    for qas in para['qas']
]

In [14]:
%%time
count_top1 = 0
count_top5 = 0
count_total = len(test_questions)

for q in tqdm(test_questions):
    top5_para, top5_scores, top5_indices = top_5_for_question(paragraphs, vectorizer, q, docs)
    top5_context_ids = [top5_para[i]['context_id'] for i in range(len(top5_para))]
    gt_context_id = q['context_id'][0]
    if gt_context_id == top5_context_ids[0]:
        count_top1 += 1
    if gt_context_id in top5_context_ids:
        count_top5 += 1

top1_score = count_top1 / count_total * 100
top5_score = count_top5 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%")

100%|██████████| 10570/10570 [00:52<00:00, 201.93it/s]


Top 1 score: 0.18%,
Top 5 score: 0.96%
Wall time: 52.3 s





In [20]:
paragraphs[0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'context_id': 0}

The results are quite good already, but we'll investigate whether using a dense representation (eg. vectors computed by Bert) can further improve these scores.

## Answer computation

Here we check how good are the answers of our usual model which selects the first document retrieved by Tf-Idf. 

Firstly, we define a new kind of dataset containing everything we need.

In [15]:
def tf_idf_dataset_generator(questions: List[Dict], 
                             predicted_paragraphs: List,
                             config: Config,
                             return_question_id:bool=False):
    # Iterate over questions
    for i, q in enumerate(questions):
        # Use the vectorizer to obtain the best scoring paragraph from the question
        paragraph = predicted_paragraphs[i]
        # Then encode the input using Bert's tokenizer
        encoded_inputs = config.tokenizer(
            q['qas']["question"],               # First we pass the question text
            paragraph['context'],               # Then the best scoring paragraph text
            max_length = config.INPUT_LEN,      # We want to pad and truncate to the max length
            truncation = True,
            padding = 'max_length',             # Pads all sequences to 512.
            return_token_type_ids = config.bert,# Return if the token is from sentence 
                                                # 0 or sentence 1
            return_attention_mask = True,       # Return if it's a pad token or not
        )
        if return_question_id:
            yield dict(encoded_inputs), q['qas']['id']
        else:
            yield dict(encoded_inputs)


def create_original_dataset_with_tf_idf(questions: List[Dict], 
                                        predicted_paragraphs: List,
                                        config: Config):
    features = []
    for i, q in enumerate(questions):
        inputs={}
        paragraph = predicted_paragraphs[i]
        encoded_inputs = config.tokenizer(
            q['qas']["question"],               # First we pass the question
            paragraph["context"],               # Then the context

            max_length = config.INPUT_LEN,      # We want to pad and truncate to this length
            truncation = True,
            padding = 'max_length',             # Pads all sequences to 512.

            return_token_type_ids = False,      # Return if the token is from sentence 
                                                # 0 or sentence 1
            return_attention_mask = False,      # Return if it's a pad token or not

            return_offsets_mapping = True       # Returns each token's first and last char 
                                                # positions in the original sentence
                                                # (we will use it to match answers starting 
                                                # and ending points to tokens)
        )
        inputs["context"] = paragraph["context"]
        inputs["offset_mapping"] = encoded_inputs["offset_mapping"]
        features.append(inputs)

    return tf.data.Dataset.from_tensor_slices(
        pd.DataFrame.from_dict(features).to_dict(orient="list"))


def create_dataset_using_tf_idf_vectorizer( questions: List[Dict],
                                            predicted_paragraphs: List,
                                            config: Config  ) -> tf.data.Dataset:
    # Create expected signature for the generator output
    if config.bert:
        features = {
            'input_ids': tf.TensorSpec(shape=(512,), dtype=tf.int32), 
            'attention_mask': tf.TensorSpec(shape=(512,), dtype=tf.int32),
            'token_type_ids': tf.TensorSpec(shape=(512,), dtype=tf.int32)
        }
    else:
        features = {
            'input_ids': tf.TensorSpec(shape=(512,), dtype=tf.int32), 
            'attention_mask': tf.TensorSpec(shape=(512,), dtype=tf.int32)
        }
    # The dataset contains the features and the question IDs (strings)
    signature = (features, tf.TensorSpec(shape=(), dtype=tf.string))
    # Instantiates a partial generator
    data_gen = partial(tf_idf_dataset_generator, 
        questions, predicted_paragraphs, config, return_question_id=True)
    # Creates the dataset with the computed signature
    dataset = tf.data.Dataset.from_generator(data_gen,
        output_signature=signature)
    # Compute dataset length, to be used by tensorflow internals
    dataset = dataset.apply(tf.data.experimental.assert_cardinality(len(questions)))
    # Return the dataset
    return dataset

Then, we make some changes to the prediction and evaluation function.

In [16]:
### Prediction and evaluation function ###
def predict_and_evaluate(DATASET_PATH:str, 
                         BEST_WEIGHTS_PATH:str, 
                         PATH_TO_PREDICTIONS_JSON:str,
                         hidden_state_list:List[int]=[3,4,5,6],
                         bert=False):
    data = utils.read_question_set(DATASET_PATH)
    print("Gathering questions and paragraphs from dataset...")
    questions = [{
            'qas': qas,
            'context_id': (i,j)    # We also track the question's original context 
                                   # and paragraph indices so to have a ground truth
        }
        for i in range(len(data['data']))
        for j, para in enumerate(data['data'][i]['paragraphs'])
        for qas in para['qas']
    ]
    paragraphs = [{
            'context': para['context'],
            'context_id': i
        }
        for i in range(len(data['data']))
        for para in data['data'][i]['paragraphs']
    ]
    print("Fitting the vectorizer...")
    vectorizer = TfidfVectorizer(
        strip_accents='unicode',    # Text is normalized into unicode characters
        lowercase=True, # Then, we transform all text to lowercase
        max_df=0.8,     # Filter out common words that appear in more 
                        # than 80% of the paragraphs
        norm='l2'       # The vectorizer also l2 normalizes the vectors it produces, 
                        # so that the cosine similarity operation between 
                        # vectors simply becomes a dot product.
    )      
    docs = vectorizer.fit_transform(
        [paragraphs[i]['context'] for i in range(len(paragraphs))] # Transform the paragraphs and fit the vectorizer
    )
    # To be sure that the matching question -> paragraph is always the same
    # we pre-compute it in this generator 
    print("Obtaining best paragraph for questions...")
    predicted_paragraph = [top_5_for_question(paragraphs, vectorizer, q, docs)[0][0] for q in tqdm(questions)]
    print("Creating model and dataset...")
    config = Config(bert=bert)
    # Process questions
    dataset = create_dataset_using_tf_idf_vectorizer(questions, predicted_paragraph, config)
    print("Number of samples: ", len(dataset))
    # Generate the original dataset that contains the original context and token-char mapping
    original_dataset = create_original_dataset_with_tf_idf(questions, predicted_paragraph, config)
    original_dataset = original_dataset.batch(config.BATCH_SIZE)
    dataset = dataset.batch(config.BATCH_SIZE)
    # Load model
    model = config.create_standard_model(hidden_state_list=hidden_state_list)
    model.load_weights(BEST_WEIGHTS_PATH)
    print("Computing predictions...")
    # Predict the answers to the questions in the dataset
    predictions = utils.compute_predictions(dataset, original_dataset, model)
    print(f"Done! Saving predictions at {PATH_TO_PREDICTIONS_JSON} and running evaluation script...")
    # Create a prediction file formatted like the one that is expected
    with open(PATH_TO_PREDICTIONS_JSON, 'w') as f:
        json.dump(predictions, f)
    
    !python eval/evaluate.py $DATASET_PATH $PATH_TO_PREDICTIONS_JSON

Finally, we run the test on the test set.

In [17]:
BEST_WEIGHTS_PATH = "./checkpoints/normal.h5"
PATH_TO_PREDICTIONS_JSON = '../data/results/tf_idf_predictions_tpu_normal.json'

In [18]:
predict_and_evaluate(TEST_FILE, BEST_WEIGHTS_PATH, PATH_TO_PREDICTIONS_JSON)

Gathering questions and paragraphs from dataset...
Fitting the vectorizer...
Obtaining best paragraph for questions...


100%|██████████| 10570/10570 [00:10<00:00, 991.16it/s]


Creating model and dataset...
Number of samples:  10570


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_transform', 'vocab_layer_norm', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Computing predictions...


100%|██████████| 661/661 [02:31<00:00,  4.36it/s]

Done! Saving predictions at ../data/results/tf_idf_predictions_tpu_normal.json and running evaluation script...





{
  "exact": 38.675496688741724,
  "f1": 49.271948260308754,
  "total": 10570,
  "HasAns_exact": 38.675496688741724,
  "HasAns_f1": 49.271948260308754,
  "HasAns_total": 10570
}
