In [1]:
username = 'MarcelloCeresini'
repository = 'QuestionAnswering'

# COLAB ONLY CELLS
try:
    import google.colab
    IN_COLAB = True
    !pip3 install transformers
    !git clone https://www.github.com/{username}/{repository}.git
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd /content/QuestionAnswering/src
except ImportError:
    IN_COLAB = False

# Tf-Idf retrieval baseline

## Preparation

In this notebook, we implement a simple baseline for paragraph retrieval using Tf-Idf weighted sparse representations of documents and query questions.

In [2]:
%matplotlib inline

import os
from tqdm import tqdm
import random
import json
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from functools import partial

from sklearn.feature_extraction.text import TfidfVectorizer

from config import Config
config = Config()
import utils

# Fix random seed for reproducibility
np.random.seed(config.RANDOM_SEED)
random.seed(config.RANDOM_SEED)
tf.random.set_seed(config.RANDOM_SEED)

from typing import List, Dict
#os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

Load the datasets

In [3]:
ROOT_PATH = os.path.dirname(os.getcwd())
TRAINING_FILE = os.path.join(ROOT_PATH, 'data', 'training_set.json')
VALIDATION_FILE = os.path.join(ROOT_PATH, 'data', 'validation_set.json')
TEST_FILE = os.path.join(ROOT_PATH, 'data', 'dev_set.json')

train_paragraphs_and_questions = utils.read_question_set(TRAINING_FILE)['data']
val_paragraphs_and_questions = utils.read_question_set(VALIDATION_FILE)['data']
test_paragraphs_and_questions = utils.read_question_set(TEST_FILE)['data']

# Remove the validation articles from the training set
train_paragraphs_and_questions = [article for article in train_paragraphs_and_questions
                                  if article not in val_paragraphs_and_questions]

First, we separate the questions from their paragraphs and assign a `context_id` to each question and paragraph.

The context ID is a tuple `(article_id, paragraph_id)`: `article_id` is the index of the article within the dataset split, and `paragraph_id` is the index of the paragraph within that article.
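
As a rough illustration of the indexing scheme, the sketch below flattens a toy, SQuAD-like list of articles and tags every question and paragraph with its `(article_id, paragraph_id)` tuple. The helper name `split_questions_and_paragraphs` and the toy data are hypothetical, and the nested `paragraphs` / `qas` layout is an assumption; the real `utils.get_questions_and_paragraphs` may work differently.

```python
from typing import Dict, List, Tuple


def split_questions_and_paragraphs(articles: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical helper: flatten articles into question and paragraph lists.

    Each item is tagged with context_id = (article index, paragraph index).
    """
    questions, paragraphs = [], []
    for article_idx, article in enumerate(articles):
        for paragraph_idx, paragraph in enumerate(article["paragraphs"]):
            context_id = (article_idx, paragraph_idx)
            paragraphs.append({"context": paragraph["context"], "context_id": context_id})
            for qa in paragraph["qas"]:
                # Every question inherits the ID of the paragraph it was written for
                questions.append({"qas": qa, "context_id": context_id})
    return questions, paragraphs


# Toy data (assumed layout): two articles, three paragraphs, four questions
toy_articles = [
    {"title": "A", "paragraphs": [
        {"context": "First paragraph.", "qas": [{"question": "Q1?"}, {"question": "Q2?"}]},
        {"context": "Second paragraph.", "qas": [{"question": "Q3?"}]},
    ]},
    {"title": "B", "paragraphs": [
        {"context": "Third paragraph.", "qas": [{"question": "Q4?"}]},
    ]},
]

toy_questions, toy_paragraphs = split_questions_and_paragraphs(toy_articles)
print(toy_questions[2]["context_id"])  # (0, 1): second paragraph of the first article
```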

In [13]:
from utils import get_questions_and_paragraphs

train_questions, train_paragraphs = get_questions_and_paragraphs(train_paragraphs_and_questions)
val_questions, val_paragraphs = get_questions_and_paragraphs(val_paragraphs_and_questions)
test_questions, test_paragraphs = get_questions_and_paragraphs(test_paragraphs_and_questions)

print(f"Number of training questions: {len(train_questions)}")
print(f"Number of training paragraphs: {len(train_paragraphs)}")
print()
print(f"Number of val questions: {len(val_questions)}")
print(f"Number of val paragraphs: {len(val_paragraphs)}")
print()
print(f"Number of test questions: {len(test_questions)}")
print(f"Number of test paragraphs: {len(test_paragraphs)}")

Number of training questions: 65064
Number of training paragraphs: 13975

Number of val questions: 22535
Number of val paragraphs: 4921

Number of test questions: 10570
Number of test paragraphs: 2067


We test some functions we built to ease indexing into the questions and paragraphs sets we have just obtained.

In [16]:
from utils import get_paragraph_from_question

x = random.randint(0, len(train_questions)-1)
print(f"Question: {train_questions[x]['qas']['question']}")
print()
print(f"Ground truth context: '{get_paragraph_from_question(train_questions[x], train_paragraphs_and_questions)['context']}'")

Question: What is the French term for Sassou's political party?

Ground truth context: 'Congo-Brazzaville has had a multi-party political system since the early 1990s, although the system is heavily dominated by President Denis Sassou Nguesso; he has lacked serious competition in the presidential elections held under his rule. Sassou Nguesso is backed by his own Congolese Labour Party (French: Parti Congolais du Travail) as well as a range of smaller parties.'


## Vectorizers

Now, given a question (query), we would like to retrieve the paragraph that most likely contains the answer. Once we have a paragraph, we pass it to the BERT QA model we created for the standard project to obtain an answer.

One way to do that is to use Tf-Idf over the whole set of paragraphs. In practice we will use more complex methods, so this should be considered a baseline. We use the `TfidfVectorizer` from scikit-learn.
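
To get a feel for the vectorizer options used in this notebook, here is a small sketch on a made-up corpus (the sentences are purely illustrative). It checks three things: a word occurring in every document is dropped by `max_df=0.8`, accents are stripped, and every row of the resulting matrix has unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): "the" appears in all 5 documents
toy_corpus = [
    "the cat sat on a mat",
    "the dog chased a ball",
    "the bird sang at dawn",
    "the fish swam away",
    "the café opened early",
]

vec = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')
X = vec.fit_transform(toy_corpus)

# "the" has a document frequency of 100%, above the 80% threshold, so it is filtered out
print("the" in vec.vocabulary_)   # False
# strip_accents='unicode' maps "café" to "cafe"
print("cafe" in vec.vocabulary_)  # True

# Each row is L2-normalized: the sum of its squared entries is 1
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(np.allclose(row_norms, 1.0))  # True
```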

In [17]:
train_vectorizer = TfidfVectorizer(strip_accents='unicode', 
    lowercase=True, 
    max_df=0.8,     # Filter out common words that appear in more than 80% of the paragraphs
    norm='l2')      # The vectorizer also l2 normalizes the vectors it produces, 
                    # so that the cosine similarity operation between vectors simply becomes a dot product.

val_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')

train_docs = train_vectorizer.fit_transform([train_paragraphs[i]['context'] for i in range(len(train_paragraphs))])
val_docs = val_vectorizer.fit_transform([val_paragraphs[i]['context'] for i in range(len(val_paragraphs))])

In [18]:
train_docs.shape, val_docs.shape

((13975, 65808), (4921, 36612))

We do not fit the `test_vectorizer` on the test set itself, because we must assume that the test set contains entirely new and unseen data. Instead, we fit it on the union of the training and validation sets and then use it to transform the test paragraphs.
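
A consequence of this choice is that the vocabulary is frozen at fit time: words that only occur in unseen documents are silently ignored when transforming them. The sketch below shows this on a made-up corpus, using a `TfidfVectorizer` with default settings for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data: the "unseen" document contains words absent from the fit corpus
fit_corpus = ["red apples are sweet", "green apples are sour", "bananas are yellow"]
unseen_docs = ["purple grapes are sweet"]

vec = TfidfVectorizer()
vec.fit(fit_corpus)                        # the vocabulary is learned here only
unseen_vectors = vec.transform(unseen_docs)

# The unseen document lives in the vocabulary space of the fit corpus
print(unseen_vectors.shape[1] == len(vec.vocabulary_))  # True
# "purple" and "grapes" were never seen during fit, so they get no column
print("grapes" in vec.vocabulary_)                      # False
# Only "are" and "sweet" contribute non-zero weights
print(unseen_vectors.nnz)                               # 2
```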

In [19]:
test_vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, max_df=0.8, norm='l2')
test_vectorizer.fit([train_paragraphs[i]['context'] for i in range(len(train_paragraphs))] + 
                    [val_paragraphs[i]['context'] for i in range(len(val_paragraphs))])         # Train on the union of dataset splits
test_docs = test_vectorizer.transform([test_paragraphs[i]['context'] for i in range(len(test_paragraphs))]) # Then transform the test paragraphs

test_docs.shape

(2067, 77747)

- The training set contains 13975 paragraphs, represented over a vocabulary of 65808 terms weighted with Tf-Idf.
- The validation set contains 4921 paragraphs, represented over a vocabulary of 36612 terms.
- The test set contains 2067 paragraphs, represented over the 77747-term vocabulary learned from the union of the training and validation sets.

To compute **scores** between a query question and the set of documents, we use the **cosine similarity**:

$$
S_C (A,B) = \frac{A \cdot B}{\lVert A\rVert  \lVert B\rVert }
$$

Since the `TfidfVectorizer` already L2-normalizes all the vectors it produces, the denominator is 1 and the actual implementation only computes a dot product, which is slightly faster.
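
This equivalence can be checked numerically. The sketch below, on a made-up three-document corpus, compares the plain sparse dot product with scikit-learn's `cosine_similarity` and confirms that both give the same scores when the vectors are L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents and query
docs_text = [
    "the capital of France is Paris",
    "Rome is the capital of Italy",
    "bananas are rich in potassium",
]
vec = TfidfVectorizer(norm='l2')
docs = vec.fit_transform(docs_text)                # shape: (3, vocab_dim)
q = vec.transform(["what is the capital of Italy"])  # shape: (1, vocab_dim)

# Dot product between every document and the query
dot_scores = (docs @ q.T).toarray().ravel()
# Reference cosine similarity, which divides by the (unit) norms explicitly
cos_scores = cosine_similarity(docs, q).ravel()

print(np.allclose(dot_scores, cos_scores))  # True
print(int(np.argmax(dot_scores)))           # 1: the paragraph about Italy
```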

In [20]:
def score_documents(vectorizer, query, docs):
    q = query['qas']['question']
    q = vectorizer.transform([q]) # q will be a (sparse) matrix with dimensionality 1 x vocab_dim
    # Sparse matrix product: one dot-product score per document, flattened into a 1-D numpy array
    return (docs @ q.T).toarray().ravel()

def top_n_for_question(paragraphs, vectorizer, query, docs, n=5):
    scores = score_documents(vectorizer, query, docs)
    sorted_scores = np.argsort(-scores) # Negated scores for descending order
    return [paragraphs[i] for i in sorted_scores[:n]], scores[sorted_scores[:n]], sorted_scores[:n]

def print_top_5(query, top_5_para, top_5_scores, top_5_indices):
    print(f"Top-5 paragraphs: {top_5_indices}")
    print(f"Question: {query['qas']['question']}")
    print("Paragraphs:")
    for i in range(len(top_5_para)):  # Also works if fewer than 5 paragraphs are passed
        print(f"{i} (score: {top_5_scores[i]:.2f}): {top_5_para[i]['context']}\n")

In [22]:
QUERY = train_questions[x]
top5_para, top5_scores, top5_indices = top_n_for_question(train_paragraphs, train_vectorizer, QUERY, train_docs)
print_top_5(QUERY, top5_para, top5_scores, top5_indices)

Top-5 paragraphs: [1273 1276 1271 1277 8829]
Question: What is the French term for Sassou's political party?
Paragraphs:
0 (score: 0.39): Congo's democratic progress was derailed in 1997 when Lissouba and Sassou started to fight for power in the civil war. As presidential elections scheduled for July 1997 approached, tensions between the Lissouba and Sassou camps mounted. On June 5, President Lissouba's government forces surrounded Sassou's compound in Brazzaville and Sassou ordered members of his private militia (known as "Cobras") to resist. Thus began a four-month conflict that destroyed or damaged much of Brazzaville and caused tens of thousands of civilian deaths. In early October, the Angolan socialist régime began an invasion of Congo to install Sassou in power. In mid-October, the Lissouba government fell. Soon thereafter, Sassou declared himself president.

1 (score: 0.38): Congo-Brazzaville has had a multi-party political system since the early 1990s, although the system is h

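In this example, the ground-truth paragraph is ranked second, just behind another paragraph about Sassou. More generally, the correct paragraph is often either the top result or among the top-5 retrieved paragraphs, and the other retrieved paragraphs tend to be topically relevant too.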

## Paragraph Retrieval evaluation

We can measure how often this simple baseline retrieves the correct paragraph on the training and test sets by computing the top-1, top-5, top-20 and top-100 accuracies.
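
For reference, top-k accuracy is the percentage of questions whose ground-truth context ID appears among the first k retrieved IDs. A minimal sketch on made-up rankings follows; the helper `top_k_accuracy` is hypothetical and not part of `utils`.

```python
def top_k_accuracy(ranked_ids, gold_ids, ks=(1, 5, 20, 100)):
    """Hypothetical helper: percentage of queries whose gold ID is in the first k results."""
    hits = {k: 0 for k in ks}
    for ranked, gold in zip(ranked_ids, gold_ids):
        for k in ks:
            if gold in ranked[:k]:
                hits[k] += 1
    return {k: 100 * hits[k] / len(gold_ids) for k in ks}


# Toy rankings of context IDs for four queries; the gold ID is (0, 1) for all of them
ranked_ids = [
    [(0, 1), (0, 0), (1, 2)],  # gold ranked 1st
    [(1, 0), (0, 1), (1, 2)],  # gold ranked 2nd
    [(0, 0), (1, 2), (0, 1)],  # gold ranked 3rd
    [(1, 2), (1, 0), (0, 0)],  # gold not retrieved
]
gold_ids = [(0, 1)] * 4

accuracies = top_k_accuracy(ranked_ids, gold_ids, ks=(1, 2, 3))
print(accuracies)  # {1: 25.0, 2: 50.0, 3: 75.0}
```

By construction the values never decrease as k grows, which matches the pattern in the results reported in this notebook.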

### Training set evaluation

In [None]:
from utils import get_context_ids_from_top_indices

Then we compute the accuracy measures.

In [34]:
%%time
count_top1   = 0
count_top5   = 0
count_top20  = 0
count_top100 = 0
count_total = len(train_questions)

for q in tqdm(train_questions):
    topn_para, topn_scores, topn_indices = top_n_for_question(train_paragraphs, train_vectorizer, q, train_docs, n=100)
    retrieved_context_ids = get_context_ids_from_top_indices(train_paragraphs, topn_indices)
    gt_context_id = q['context_id']
    if gt_context_id == retrieved_context_ids[0]:
        count_top1 += 1
    if gt_context_id in retrieved_context_ids[:5]:
        count_top5 += 1
    if gt_context_id in retrieved_context_ids[:20]:
        count_top20 += 1
    if gt_context_id in retrieved_context_ids:
        count_top100 += 1

top1_score   = count_top1 / count_total * 100
top5_score   = count_top5 / count_total * 100
top20_score  = count_top20 / count_total * 100
top100_score = count_top100 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%"
      f"\nTop 20 score: {top20_score:.2f}%,\nTop 100 score: {top100_score:.2f}%")

100%|██████████| 65064/65064 [06:07<00:00, 176.96it/s]


Top 1 score: 48.83%,
Top 5 score: 73.34%
Top 20 score: 88.42%,
Top 100 score: 96.51%
CPU times: total: 4min 30s
Wall time: 6min 7s





### Test set evaluation

In [35]:
%%time
count_top1   = 0
count_top5   = 0
count_top20  = 0
count_top100 = 0
count_total = len(test_questions)

for q in tqdm(test_questions):
    topn_para, topn_scores, topn_indices = top_n_for_question(test_paragraphs, test_vectorizer, q, test_docs, n=100)
    retrieved_context_ids = get_context_ids_from_top_indices(test_paragraphs, topn_indices)
    gt_context_id = q['context_id']
    if gt_context_id == retrieved_context_ids[0]:
        count_top1 += 1
    if gt_context_id in retrieved_context_ids[:5]:
        count_top5 += 1
    if gt_context_id in retrieved_context_ids[:20]:
        count_top20 += 1
    if gt_context_id in retrieved_context_ids:
        count_top100 += 1

top1_score   = count_top1 / count_total * 100
top5_score   = count_top5 / count_total * 100
top20_score  = count_top20 / count_total * 100
top100_score = count_top100 / count_total * 100

print(f"\nTop 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%"
      f"\nTop 20 score: {top20_score:.2f}%,\nTop 100 score: {top100_score:.2f}%")

100%|██████████| 10570/10570 [00:18<00:00, 586.99it/s]


Top 1 score: 51.37%,
Top 5 score: 74.87%
Top 20 score: 90.19%,
Top 100 score: 97.46%
CPU times: total: 10.7 s
Wall time: 18 s





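The results are already quite good: the correct paragraph is ranked first for about half of the questions and appears in the top 5 for roughly three quarters of them. We will investigate whether a dense representation (e.g. vectors computed by BERT) can further improve these scores.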