# Tf-Idf retrieval baseline

In this notebook, we implement a simple baseline for paragraph retrieval using Tf-Idf weighted sparse representations of documents and query questions.

In [1]:
%matplotlib inline

import os
from tqdm import tqdm
import random
import json
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer

from config import Config
config = Config()
import utils

# Fix random seed for reproducibility
np.random.seed(config.RANDOM_SEED)
random.seed(config.RANDOM_SEED)
tf.random.set_seed(config.RANDOM_SEED)

from typing import List, Dict
#os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

In [2]:
ROOT_PATH = os.path.dirname(os.getcwd())
TRAINING_FILE = os.path.join(ROOT_PATH, 'data', 'training_set.json')
paragraphs_and_questions = utils.read_question_set(TRAINING_FILE)

First, we separate the questions from their paragraphs.

In [10]:
questions = [{
        'qas': qas,
        'context_id': (i,j)    # Track the (article, paragraph) indices of the question's source paragraph as ground truth
    }
    for i in range(len(paragraphs_and_questions['data']))
    for j, para in enumerate(paragraphs_and_questions['data'][i]['paragraphs'])
    for qas in para['qas']
]

paragraphs = [{
        'context': para['context'],
        'context_id': i    # Index of the article this paragraph belongs to (not the paragraph's own index)
    }
    for i in range(len(paragraphs_and_questions['data']))
    for para in paragraphs_and_questions['data'][i]['paragraphs']
]

print(f"Number of questions: {len(questions)}")
print(f"Number of paragraphs: {len(paragraphs)}")

Number of questions: 87599
Number of paragraphs: 18896
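Note that each entry of `paragraphs` stores only the article index `i`, while each question stores the pair `(i, j)`. If paragraph-level ground truth is needed later, one option is to map every `(i, j)` pair to its position in the flat paragraph list. The sketch below uses made-up toy data with the same nesting as the training file; `toy` and `flat_index` are hypothetical names, not part of the notebook.

```python
# Toy data with the same nesting as the training JSON (contents are made up).
toy = {
    'data': [
        {'paragraphs': [{'context': 'a', 'qas': []}, {'context': 'b', 'qas': []}]},
        {'paragraphs': [{'context': 'c', 'qas': []}]},
    ]
}

# Map each (article index, paragraph index) pair to its position in the
# flat list, following the same iteration order used to build `paragraphs`.
flat_index = {}
for i, article in enumerate(toy['data']):
    for j, _ in enumerate(article['paragraphs']):
        flat_index[(i, j)] = len(flat_index)

print(flat_index)
```

With such a mapping, a question's `context_id` could be compared directly against the row indices returned by a retriever.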


We define a function that returns the ground-truth paragraph for a question from its context indices.

In [11]:
def get_paragraph_from_question(qas):
    i,j = qas['context_id']
    return paragraphs_and_questions['data'][i]['paragraphs'][j]

In [12]:
x = random.randint(0, len(questions)-1)
print(f"Question: {questions[x]['qas']['question']}")
print()
print(f"Ground truth context: '{get_paragraph_from_question(questions[x])['context']}'")

Question: What adjective did Lawrence Toppman use to describe Craig's portrayal of James Bond?

Ground truth context: 'Christopher Orr, writing in The Atlantic, also criticised the film, saying that Spectre "backslides on virtually every [aspect]". Lawrence Toppman of The Charlotte Observer called Craig's performance "Bored, James Bored." Alyssa Rosenberg, writing for The Washington Post, stated that the film turned into "a disappointingly conventional Bond film."'


Now, given a question (query), we would like to retrieve the paragraph most likely to contain the answer, so that it can be passed to the BERT QA model. One way to do that is to apply tf-idf to the full set of paragraphs. This should be considered a baseline: the final system will use more complex methods. We will use the `TfidfVectorizer` from scikit-learn.

In [17]:
vectorizer = TfidfVectorizer(strip_accents='unicode', 
    lowercase=True, 
    max_df=0.8,     # Filter out common words that appear in more than 80% of the paragraphs
    norm='l2') # The vectorizer also l2 normalizes the vectors it produces, so that the cosine similarity operation between vectors simply becomes a dot product.
docs = vectorizer.fit_transform([p['context'] for p in paragraphs])

In [19]:
docs.shape

(18896, 77747)
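To make the effect of `max_df` concrete, here is a minimal sketch on a made-up corpus (the names `toy_corpus`, `toy_vectorizer` and `toy_docs` are illustrative and not part of the notebook). With five documents and `max_df=0.8`, a term is dropped from the vocabulary when it appears in more than four documents, which here removes only "the".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "the" occurs in all 5 documents, every other term in at most 2.
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
    "the fish swam",
    "the horse ran",
]

# Same max_df threshold as above: drop terms found in more than 80% of documents.
toy_vectorizer = TfidfVectorizer(max_df=0.8)
toy_docs = toy_vectorizer.fit_transform(toy_corpus)

vocab = set(toy_vectorizer.vocabulary_)
print(f"'the' kept: {'the' in vocab}, 'cat' kept: {'cat' in vocab}")
print(f"Matrix shape: {toy_docs.shape}")
```

On the real corpus the same threshold acts as a data-driven stop-word filter, without relying on a fixed stop-word list.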

Now, to score each document against a query question, we use **cosine similarity**.

$$
S_C (A,B) = \frac{A \cdot B}{\lVert A\rVert  \lVert B\rVert }
$$

Note: the `TfidfVectorizer` we use already L2-normalizes all vectors it produces, so in the actual implementation we only compute a dot product.
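The equivalence can be checked on a small made-up example. This is a sketch under the assumption that `norm='l2'` is set, as in the vectorizer above; the names `toy_corpus`, `vec`, `D` and `q` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus and query, only to illustrate the normalization.
toy_corpus = [
    "virgin mary statue in lourdes",
    "james bond film review",
    "statue of liberty in new york",
]
vec = TfidfVectorizer(norm='l2')
D = vec.fit_transform(toy_corpus)                     # sparse, one row per document
q = vec.transform(["where is the statue in lourdes"])  # sparse, 1 x vocab_dim

# Every document row has unit L2 norm.
row_norms = np.linalg.norm(D.toarray(), axis=1)

# Plain dot products via a sparse matrix product.
dot_scores = (D @ q.T).toarray().ravel()

# Cosine similarity computed explicitly from the formula above.
D_dense, q_dense = D.toarray(), q.toarray().ravel()
cos_scores = D_dense @ q_dense / (
    np.linalg.norm(D_dense, axis=1) * np.linalg.norm(q_dense)
)

print(row_norms)
print(np.allclose(dot_scores, cos_scores))
```

Because both the document rows and the query vector are already unit length, the denominator in the formula equals 1 and the two score vectors coincide.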

In [33]:
def score_documents(query):
    q = query['qas']['question']
    q = vectorizer.transform([q]) # q will be a (sparse) matrix with dimensionality 1 x vocab_dim
    # A sparse matrix product gives all dot-product scores at once; convert the (n_docs x 1) result to a flat numpy array
    return (docs @ q.T).toarray().ravel()

def top_5_for_question(query):
    scores = score_documents(query)
    sorted_scores = np.argsort(-scores) # Negated for descending order
    return [paragraphs[i] for i in sorted_scores[:5]], scores[sorted_scores[:5]], sorted_scores[:5]

def print_top_5(query, top_5_para, top_5_scores, top_5_indices):
    print(f"Top-5 paragraphs: {top_5_indices}")
    print(f"Question: {query['qas']['question']}")
    print("Paragraphs:")
    for i, (para, score) in enumerate(zip(top_5_para, top_5_scores)):
        print(f"{i} (score: {score:.2f}): {para['context']}\n")

QUERY = questions[0]
top5_para, top5_scores, top5_indices = top_5_for_question(QUERY)
print_top_5(QUERY, top5_para, top5_scores, top5_indices)


Top-5 paragraphs: [    0  6929  6937  6944 12250]
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Paragraphs:
0 (score: 0.27): Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

1 (score: 0.26):  In Methodism, Mary is honored as the Mother of God. Methodists do not have any additional teachings on the Virgin Mary excep

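We can measure how often this simple baseline retrieves the right context (top-1 and top-5 accuracy). Note that `context_id` in `paragraphs` holds the article index, so the check below counts a hit whenever a retrieved paragraph comes from the same article as the ground-truth paragraph.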

In [41]:
%%time
count_top1 = 0
count_top5 = 0
count_total = len(questions)

for q in questions:
    top5_para, top5_scores, top5_indices = top_5_for_question(q)
    top5_context_ids = [p['context_id'] for p in top5_para]
    gt_context_id = q['context_id'][0]  # Article index of the ground-truth paragraph
    if gt_context_id == top5_context_ids[0]:
        count_top1 += 1
    if gt_context_id in top5_context_ids:
        count_top5 += 1

top1_score = count_top1 / count_total * 100
top5_score = count_top5 / count_total * 100

print(f"Top 1 score: {top1_score:.2f}%,\nTop 5 score: {top5_score:.2f}%")

Top 1 score: 72.01%,
Top 5 score: 87.82%
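The top-k metric itself can be written as a small vectorized helper. The sketch below is not the code that produced the numbers above: `top_k_accuracy`, `toy_scores` and `toy_gt` are hypothetical names, and it assumes a dense score matrix with one row per query and a single ground-truth column index per query.

```python
import numpy as np

def top_k_accuracy(score_matrix, gt_indices, k):
    """Fraction of queries whose ground-truth index is among the k highest scores."""
    # argpartition moves the k largest scores of each row (in arbitrary
    # order) into the last k columns, without fully sorting the row.
    top_k = np.argpartition(score_matrix, -k, axis=1)[:, -k:]
    hits = (top_k == np.asarray(gt_indices)[:, None]).any(axis=1)
    return hits.mean()

# Made-up scores: 3 queries scored against 4 documents.
toy_scores = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.5, 0.6, 0.1],
    [0.1, 0.2, 0.3, 0.8],
])
toy_gt = [0, 1, 0]  # Ground-truth document index for each query

print(f"Top 1: {top_k_accuracy(toy_scores, toy_gt, 1):.2f}")
print(f"Top 2: {top_k_accuracy(toy_scores, toy_gt, 2):.2f}")
```

Using `np.argpartition` instead of a full `np.argsort` avoids sorting every score when only the top few entries matter, which can help with tens of thousands of paragraphs per query.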


The results are already quite good, but we will investigate whether a dense representation (e.g. vectors computed by BERT) can further improve these scores.

## TODO: Use the paragraphs retrieved with tf-idf to compute candidate answers and measure the actual QA score.