TensorFlow 2.0 Question Answering Competition Notebook
==================
* Develop a more effective and robust QA system
* This is forked from the following notebook
* Source: https://www.kaggle.com/jazivxt/on-the-professor-and-the-madman

Added comments for understanding the code

In [None]:
import tensorflow as tf
print(tf.__version__)

Importing the usual pandas & numpy libraries, plus json for parsing the input

In [None]:
import numpy as np
import pandas as pd
import json

This loads lines from a JSON Lines dump (one JSON record per line). The maximum number of lines to read is set by the parameter *max_limit*.
Each record is appended to a list, which is then converted into a dataframe.

In [None]:
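def read_lines_m(path, max_limit=4000):
    """Read at most max_limit lines of a JSON Lines file into a DataFrame."""
    rlm = []
    with open(path, 'r') as f:  # the file is closed once reading is done
        for l in f:
            rlm.append(json.loads(l))
            if len(rlm) >= max_limit:
                break
    return pd.DataFrame(rlm)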
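
The same "read only the first N records" idea can be tried without the competition data. The sketch below is a minimal illustration, not the notebook's method: the helper name `read_jsonl_head` and the sample records are made up, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json
from itertools import islice

def read_jsonl_head(handle, max_limit=4000):
    """Parse at most max_limit lines of JSON Lines input into a list of dicts."""
    # islice stops pulling lines from the handle after max_limit of them
    return [json.loads(line) for line in islice(handle, max_limit)]

# Made-up records standing in for the competition file
sample = io.StringIO(
    '{"example_id": 1, "question_text": "first question"}\n'
    '{"example_id": 2, "question_text": "second question"}\n'
    '{"example_id": 3, "question_text": "third question"}\n'
)
records = read_jsonl_head(sample, max_limit=2)
print(len(records), records[0]["question_text"])
```

Because the input is consumed lazily, the remaining lines are never parsed, which matters when the training file is too large to load in full.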

Specifies the path to the input files

In [None]:
p = '../input/tensorflow2-question-answering/'

The next few lines of code read the input file and print out its shape and columns

In [None]:
train = read_lines_m(p + 'simplified-nq-train.jsonl')

print(train.shape)
print(train.columns)

The *annotations* column is interesting since it holds the annotated answers for the given question.
For example, take the first question in the dataset - *Which is the most common use of opt-in e-mail marketing*
The start & end tokens of its answers can be found in the *long_answer* dictionary & the *short_answers* list

In [None]:
train.question_text[0]

In [None]:
train.annotations[0]

So the long answer & short answers are token spans of the document_text, and can be recovered by slicing its tokens

In [None]:
document_text_tokens = train.document_text[0].split(' ')

In [None]:
' '.join(document_text_tokens[train.annotations[0][0]['long_answer']['start_token']:train.annotations[0][0]['long_answer']['end_token']])

In [None]:
' '.join(document_text_tokens[train.annotations[0][0]['short_answers'][0]['start_token']:train.annotations[0][0]['short_answers'][0]['end_token']])
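
Since the same slicing pattern is repeated for every lookup, it can be wrapped in a small helper. This is a sketch under stated assumptions: `span_text` is a hypothetical name, the document and annotation below are made up, and a negative token index is assumed to mean "no answer".

```python
def span_text(document_text, span):
    """Return the text covered by a span dict, or '' when the span is missing."""
    start, end = span['start_token'], span['end_token']
    if start < 0 or end < 0:  # assumption: negative indices mark a missing answer
        return ''
    return ' '.join(document_text.split(' ')[start:end])

# Made-up document and annotation in the same shape as one annotations entry
doc = 'the cat sat on the mat while the dog slept'
annotation = {
    'long_answer': {'start_token': 0, 'end_token': 6},
    'short_answers': [{'start_token': 1, 'end_token': 2}],
}
print(span_text(doc, annotation['long_answer']))
print(span_text(doc, annotation['short_answers'][0]))
```

With the real data, the call would look like `span_text(train.document_text[0], train.annotations[0][0]['long_answer'])`.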

In [None]:
# Let's look at another question
print(train.annotations[105])


In [None]:
print(train.question_text[105])
print("Long Answer:")

document_text_tokens = train.document_text[105].split(' ')
print(' '.join(document_text_tokens[train.annotations[105][0]['long_answer']['start_token']:train.annotations[105][0]['long_answer']['end_token']]))
print("Short Answer:")
print(' '.join(document_text_tokens[train.annotations[105][0]['short_answers'][0]['start_token']:train.annotations[105][0]['short_answers'][0]['end_token']]))

Well... Let's get back to the rest of the code

The next piece of code copies the long answer start_token of every question into a new column, *D*

In [None]:
train['D'] = [t[0]['long_answer']['start_token'] for t in train.annotations]
train['D'].head()

The 4th index has a -1; this may indicate that the long answer is missing. Let's check it out

In [None]:
print(train.annotations[5])
print(train.question_text[5])

document_text_tokens = train.document_text[5].split(' ')
long_answer = train.annotations[5][0]['long_answer']
short_answers = train.annotations[5][0]['short_answers']
print("Long Answer:")
print(document_text_tokens[long_answer['start_token']:long_answer['end_token']])
print("Short Answer:")
# short_answers can be an empty list, so guard the [0] lookup
if short_answers:
    print(document_text_tokens[short_answers[0]['start_token']:short_answers[0]['end_token']])
else:
    print([])

If the question doesn't have a long answer, the start_token is -1. This can also mean that the document doesn't contain a relevant answer.


In the next code block we remove all questions which do not have a long answer

In [None]:
train = train[train['D']>-1].reset_index(drop=True)
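
Before dropping rows it can be worth counting how many questions are affected. The sketch below mirrors the filter with plain Python lists standing in for the DataFrame column; the three annotations are made up for illustration.

```python
# Made-up annotations: the second question has no long answer
annotations = [
    [{'long_answer': {'start_token': 12, 'end_token': 40}}],
    [{'long_answer': {'start_token': -1, 'end_token': -1}}],
    [{'long_answer': {'start_token': 7, 'end_token': 19}}],
]

# Same extraction as the 'D' column, then the same "> -1" filter
starts = [a[0]['long_answer']['start_token'] for a in annotations]
kept = [a for a, s in zip(annotations, starts) if s > -1]

dropped = len(annotations) - len(kept)
print(f'{dropped} of {len(annotations)} questions dropped')
```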

In [None]:
test = read_lines_m(p + 'simplified-nq-test.jsonl').reset_index(drop=True)
sub = pd.read_csv(p + 'sample_submission.csv')
train.shape, test.shape, sub.shape

In [None]:
i = 99
print('URL:', train.document_url[i])
print(train.question_text[i])
print(train.long_answer_candidates[i][0])
print('Long Answer')
print(' '.join(train.document_text[i].split()[train.annotations[i][0]['long_answer']['start_token'] : train.annotations[i][0]['long_answer']['end_token']]))
if len(train.annotations[i][0]['short_answers']) > 0:
    print('Short Answer')
    print(' '.join(train.document_text[i].split()[train.annotations[i][0]['short_answers'][0]['start_token'] : train.annotations[i][0]['short_answers'][0]['end_token']]))

In the next code block, we calculate the span length (in tokens) of every long answer & of the first short answer, where one exists.
We then print out the median of the long answer spans & of the short answer spans

In [None]:
la=[t[0]['long_answer']['end_token'] - t[0]['long_answer']['start_token'] for t in train.annotations]
sa=[t[0]['short_answers'][0]['end_token'] - t[0]['short_answers'][0]['start_token'] for t in train.annotations if len(t[0]['short_answers'])>0]
np.median(la), np.median(sa)
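
To make the span arithmetic concrete, here is a small worked example on made-up annotations. It uses `statistics.median` from the standard library in place of `np.median`; for these inputs the two should agree.

```python
from statistics import median

# Made-up annotations: the second question has no short answer
annotations = [
    [{'long_answer': {'start_token': 10, 'end_token': 38},
      'short_answers': [{'start_token': 12, 'end_token': 14}]}],
    [{'long_answer': {'start_token': 100, 'end_token': 150},
      'short_answers': []}],
    [{'long_answer': {'start_token': 5, 'end_token': 17},
      'short_answers': [{'start_token': 6, 'end_token': 10}]}],
]

# Span length = end_token - start_token (the end token is exclusive)
long_spans = [t[0]['long_answer']['end_token'] - t[0]['long_answer']['start_token']
              for t in annotations]
# Only the first short answer is measured, and only when one exists
short_spans = [t[0]['short_answers'][0]['end_token'] - t[0]['short_answers'][0]['start_token']
               for t in annotations if len(t[0]['short_answers']) > 0]

print(long_spans, short_spans)
print(median(long_spans), median(short_spans))
```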

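Importing BeautifulSoup & the usual nltk related packages. If the stopword corpus is not already available in your environment, it may first need to be fetched with nltk.download('stopwords')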

In [None]:
from bs4 import BeautifulSoup as b
from nltk.corpus import stopwords
import random, nltk

Let's walk through the next code segment.

We first iterate through each question in the test set.
The document_text is then parsed using an *html parser*:
> s = b(test.document_text[0], 'html.parser')

We then extract the text of all paragraphs in the document text into a list. A paragraph is added only if its text is longer than 50 characters
> p = [p.get_text() for p in s.find_all('p', text=True) if len(p.get_text()) > 50]
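
To show the paragraph filter without installing BeautifulSoup, the sketch below swaps in the standard library's `html.parser.HTMLParser`. It is a stand-in, not the notebook's method: the class name `ParagraphCollector` and the HTML snippet are made up, and it does not reproduce the `text=True` filter used above.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> ... </p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not inside a <p> right now"

    def handle_starttag(self, tag, attrs):
        if tag == 'p':  # HTMLParser lowercases tag names
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self._buffer is not None:
            self.paragraphs.append(''.join(self._buffer))
            self._buffer = None

# Made-up document in the space-separated style of document_text
html = (
    '<H1> Example page </H1> '
    '<P> Too short </P> '
    '<P> This made-up paragraph is comfortably longer than fifty characters in total </P>'
)
collector = ParagraphCollector()
collector.feed(html)
collector.close()

# Same length rule as the notebook: keep paragraphs longer than 50 characters
long_paragraphs = [p for p in collector.paragraphs if len(p) > 50]
print(long_paragraphs)
```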

For each paragraph in the above list we run a word match. This is done through the *qa_word_match* function

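The short answer is randomly chosen. If the question contains a yes/no-style auxiliary word (such as *is*, *are* or *did*), it is a coin flip:

>short_answer=random.choice(['YES','NO'])

Otherwise it is a two-token span starting at a random position within 114 tokens of the long answer start:

> r = random.randrange(r, r + 114)

> short_answer=''.join([str(r),':', str(r + 2)])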
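
The two branches can be put side by side in one small function. This is a sketch under stated assumptions: `short_answer_guess` is a hypothetical name, the word list is copied from the notebook's main loop, and a seeded `random.Random` is used only to make the example repeatable.

```python
import random

# Auxiliary words that suggest a yes/no question (same list as the main loop)
AUXILIARY_WORDS = {'am', 'are', 'can', 'could', 'did', 'do', 'does', 'has',
                   'have', 'is', 'may', 'should', 'was', 'were', 'will'}

def short_answer_guess(question, long_start, rng):
    """Return 'YES'/'NO' for yes/no-style questions, else a random 2-token span."""
    words = question.lower().split()
    if any(w in AUXILIARY_WORDS for w in words):
        return rng.choice(['YES', 'NO'])
    # Random start within 114 tokens of the long answer start
    start = rng.randrange(long_start, long_start + 114)
    return f'{start}:{start + 2}'

rng = random.Random(0)  # seeded for repeatability in this example only
yes_no = short_answer_guess('is the sky blue', 100, rng)
span = short_answer_guess('who wrote hamlet', 100, rng)
print(yes_no, span)
```

Note that the check looks for an auxiliary word anywhere in the question, not only at the start, so some non-yes/no questions will also receive a YES/NO guess.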

**Inside the qa_word_match function call**

Each question string is lowercased & then split into words. The nltk stopword list is then used to remove common words that carry little meaning

> q = q.lower().split()


> q = [q1 for q1 in q if q1 not in list(set(stopwords.words('english')))]

We then count how many words of each potential answer (*the paragraph texts that we have extracted*) also appear in the question word list.
The paragraph with the highest count is returned; if nothing matches, the first paragraph is returned
> m = np.sum([1 for w in a1.lower().split() if w in q])

The complete function is defined below

In [None]:
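def qa_word_match(q, a):
    q = q.lower().split()
    # build the stopword set once, not once per question word
    stop_words = set(stopwords.words('english'))
    q = [q1 for q1 in q if q1 not in stop_words]
    tm = 0      # best match count so far
    a2 = a[0]   # default to the first candidate
    for a1 in a:
        # number of words in the candidate that also appear in the question
        m = np.sum([1 for w in a1.lower().split() if w in q])
        if m > tm:
            tm = int(m)
            a2 = str(a1)
    return a2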
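
To see the scoring in action without the NLTK corpus, the sketch below applies the same word-overlap idea to a made-up question and three made-up paragraphs. The function name `best_paragraph` is hypothetical, and the tiny stopword set is a hand-written stand-in, not NLTK's list.

```python
# Tiny hand-written stopword set, a stand-in for NLTK's English list
STOPWORDS = {'the', 'is', 'of', 'a', 'what', 'in', 'and'}

def best_paragraph(question, paragraphs, stop_words=STOPWORDS):
    """Return the paragraph sharing the most non-stopword words with the question."""
    keywords = {w for w in question.lower().split() if w not in stop_words}
    best, best_score = paragraphs[0], 0  # default to the first paragraph
    for para in paragraphs:
        score = sum(1 for w in para.lower().split() if w in keywords)
        if score > best_score:  # strict '>' keeps the earliest paragraph on ties
            best, best_score = para, score
    return best

question = 'What is the capital of France'
paragraphs = [
    'France is a country in western Europe',
    'Paris is the capital and largest city of France',
    'The Seine flows through northern France',
]
print(best_paragraph(question, paragraphs))
```

Here the keywords are *capital* and *france*; the second paragraph contains both, while the other two contain only one, so the second paragraph is selected.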

In [None]:
result = []
# words that suggest a yes/no question
yes_no_words = ['am', 'are', 'can', 'could', 'did', 'do', 'does', 'has', 'have', 'is', 'may', 'should', 'was', 'were', 'will']
for i in range(len(test.example_id)):
    s = b(test.document_text[i], 'html.parser')
    # keep paragraphs longer than 50 characters
    p = [p.get_text() for p in s.find_all('p', text=True) if len(p.get_text()) > 50]
    if len(p) > 0:
        # long answer: the paragraph sharing the most words with the question
        a = qa_word_match(test.question_text[i], p)
        r = test.document_text[i].find(a)
        # convert the character offset into a token index
        r = len(test.document_text[i][:r].split()) - 1
        # note: the span length is taken from the first paragraph, p[0]
        long_answer = ''.join([str(r), ':', str(r + len(p[0].split()) + 2)])
    else:
        # no usable paragraph: fall back to a random 114-token span
        try:
            r = random.randrange(390, len(test.document_text[i].split()))
        except ValueError:  # raised when the document has 390 tokens or fewer
            r = 7
        long_answer = ''.join([str(r), ':', str(r + 114)])

    if len([q for q in yes_no_words if q in test.question_text[i].lower().split()]) > 0:
        short_answer = random.choice(['YES', 'NO'])
    else:
        r = random.randrange(r, r + 114)
        short_answer = ''.join([str(r), ':', str(r + 2)])
    result.append([test.example_id[i] + '_long', long_answer])
    result.append([test.example_id[i] + '_short', short_answer])
pd.DataFrame(result, columns=['example_id', 'PredictionString']).to_csv('submission.csv', index=False)
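
The least obvious step in the loop is turning the character offset returned by `str.find` into a token index and then into a `start:end` prediction string. The sketch below works through that arithmetic on a made-up document. Two things are assumptions: the helper name `paragraph_span`, and the use of the matched paragraph's own length for the end token (the cell above uses the first paragraph's length instead).

```python
def paragraph_span(document_text, paragraph_text):
    """Return (start, end) token indices covering '<P> paragraph </P>', or None."""
    pos = document_text.find(paragraph_text)
    if pos < 0:
        return None
    # Tokens before the paragraph text; the last one is the opening <P> tag,
    # so subtracting 1 makes the span start on that tag
    start = len(document_text[:pos].split()) - 1
    # +2 accounts for the opening and closing tags around the paragraph words
    end = start + len(paragraph_text.split()) + 2
    return start, end

# Made-up document; the paragraph text keeps its surrounding spaces,
# as it would when taken from the text between the two tags
doc = '<H1> Title </H1> <P> Paris is the capital of France </P> <P> Other text </P>'
paragraph = ' Paris is the capital of France '

start, end = paragraph_span(doc, paragraph)
prediction = f'{start}:{end}'
print(prediction)
print(' '.join(doc.split()[start:end]))
```

Slicing the document tokens with the resulting indices returns the paragraph together with its tags, which is a quick way to sanity-check a prediction string.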

Happy Kaggling!
===============