# <center>Question Answering System Using COVID-19 Article</center>

In [1]:
# importing libraries

import json
import re
import string

import nltk
import numpy as np
import pandas as pd
import spacy
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

## Question and Answering (QA) System

Develop a QA system that allows you to search for answers in an article. For example, the file `qa.json` contains a research article that can answer a number of questions about COVID-19. You will design a solution that automatically searches this article for answers to these questions.

`qa.json` is taken from https://github.com/deepset-ai/COVID-QA. The file contains a few questions whose answers have already been located in the article. Let's define a QA system and check whether it can locate the right answers.

The following script helps you understand `qa.json`:

In [7]:
# Retrieve the article

with open("qa.json", "r") as f:
    data = json.load(f)
article = data["context"]

# A long article. Just print the first 200 characters
print(article[0:200])

CDC Summary 21 MAR 2020,
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/summary.html

This is a rapidly evolving situation and CDC will provide updated information and guidance as it becomes 


In [8]:
# Retrieve all the questions and answers
qas = data["qas"]

# show the first question-answer pair. Note the answer starts at the 6117th character
print(qas[0])

# get all questions
qs = [item["question"] for item in qas]
qs

{'question': 'What age group has the highest rate of severe outcomes?', 'id': 236, 'answers': [{'text': 'people 85 years and older', 'answer_start': 6117}], 'is_impossible': False}


['What age group has the highest rate of severe outcomes?',
 'How is COVID-19 spread?',
 'How many states in the U.S. have reported cases of COVID-19?',
 'When did the White House launch the "15 Days to Slow the Spread" program?',
 'What should mildly-ill patients do?',
 'What type of virus is SARS-CoV-2?',
 'What viruses are similar to the COVID-19 coronavirus?',
 'What are the phases of a pandemic?',
 'At which phase does the peak of the pandemic occur?',
 'People with which medical conditions have a higher rate of severe illness?',
 'What kind of test can diagnose COVID-19?',
 'In what species did the COVID-19 virus likely originate?',
 'What risk factors should be considered in addition to clinical symptoms?']
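
Since `answer_start` is a character offset into `context`, a ground-truth answer can be recovered by slicing the article. The sketch below uses a made-up record (`toy`, not the real `qa.json`) and assumes the offset is zero-based, as in SQuAD-style files; `answer_span` is a hypothetical helper name.

```python
def answer_span(context, answer):
    """Return the slice of `context` that an answer record points to."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]

# Made-up record in the same shape as qa.json (not the real data)
toy = {
    "context": "All 50 states have reported cases.",
    "qas": [{"question": "How many states have reported cases?",
             "answers": [{"text": "All 50 states", "answer_start": 0}]}],
}

for qa in toy["qas"]:
    for ans in qa["answers"]:
        # The slice should reproduce the stored answer text exactly
        print(answer_span(toy["context"], ans) == ans["text"])
```

Running the same check over `data["qas"]` and `article` is a quick way to confirm how the offsets should be interpreted before using them in the analysis.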

Next, follow the instructions below step by step to develop the QA system.

### Creating a Tokenizer

Define a function `tokenize(doc)`  as follows:
   - Take a piece of text (i.e. variable `doc`) as an input
   - Split the input text into unigrams
   - Clean up tokens as follows:
       - Lemmatize all unigrams
       - Remove all stop words
       - Remove all punctuation
       - Convert all unigrams to lowercase
       - Remove empty unigrams
   - Return the list of unigrams after all the processing. (Hint: you can use the spaCy package for this task. To test whether a token is a stop word or punctuation, check https://spacy.io/api/token#attributes)

In [9]:
# Define the function

def tokenize(doc):
    doc = nlp(doc)

    # create unigrams: lemmatize, lowercase, and trim surrounding whitespace
    # (lowercasing first so the stop-word check below is case-insensitive)
    tokens = [token.lemma_.lower().strip() for token in doc]

    # NLTK stop words and punctuation characters (including the em dash)
    stop_words = set(stopwords.words('english'))
    punctuations = set(string.punctuation + '—')

    # remove empty tokens, stop words, and tokens made up only of punctuation
    tokens = [token for token in tokens
              if token
              and token not in stop_words
              and not all(ch in punctuations for ch in token)]

    return tokens

In [10]:
# Test the function
doc = 'Older people and people of all ages with severe chronic medical conditions — \
like heart disease, lung disease and diabetes, \
for example — seem to be at higher risk of developing serious COVID-19 illness.'

print(tokenize(doc))

['old', 'people', 'people', 'age', 'severe', 'chronic', 'medical', 'condition', 'like', 'heart', 'disease', 'lung', 'disease', 'diabetes', 'example', 'seem', 'high', 'risk', 'develop', 'serious', 'covid-19', 'illness']


### Computing TF-IDF Matrix

Define a function `compute_tfidf(docs)` as follows:

- Take `docs`, a list of documents (e.g. a list of questions) as an input
- Tokenize each document in `docs` using the `tokenize` function defined above.
- Calculate tf_idf weights as shown in lecture notes (Hint: you can reuse the last code segment in NLP Lecture Notes (II))
- Return a smoothed normalized `tf_idf` array
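
For reference, the smoothed weights described by these steps can be written out as follows. This is a sketch of the usual smoothed formulation, matching what the `compute_tfidf` cell below computes, with $f_{d,t}$ the count of term $t$ in document $d$, $N$ the number of documents, and $\mathrm{df}_t$ the number of documents containing $t$. The symbol names are chosen here for illustration and are not taken from the lecture notes.

$$
\mathrm{tf}_{d,t} = \frac{f_{d,t}}{\sum_{t'} f_{d,t'}}, \qquad
\mathrm{idf}_t = \ln\frac{N + 1}{\mathrm{df}_t + 1} + 1, \qquad
w_{d,t} = \frac{\mathrm{tf}_{d,t}\,\mathrm{idf}_t}{\sqrt{\sum_{t'} \left(\mathrm{tf}_{d,t'}\,\mathrm{idf}_{t'}\right)^2}}
$$

The last expression is the row-wise L2 normalization, so every non-empty document ends up as a unit-length vector.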

In [11]:
# Define the function

def compute_tfidf(docs):

    # Steps 1 and 2: tokenize each document once and count its tokens
    docs_tokens = {idx: nltk.FreqDist(tokenize(doc)) for idx, doc in enumerate(docs)}

    # Step 3: document-term matrix (rows = documents, columns = terms)
    dtm = pd.DataFrame.from_dict(docs_tokens, orient="index")
    dtm = dtm.fillna(0)
    dtm = dtm.sort_index(axis=0)

    # Step 4: normalized term frequency (tf); documents with no tokens stay all zeros
    tf = dtm.values.astype(float)
    doc_len = tf.sum(axis=1, keepdims=True)
    tf = np.divide(tf, doc_len, out=np.zeros_like(tf), where=doc_len > 0)

    # Step 5: smoothed idf
    df = np.where(tf > 0, 1, 0)
    smoothed_idf = np.log(np.divide(len(docs) + 1, np.sum(df, axis=0) + 1)) + 1

    # Step 6: L2-normalized smoothed tf-idf
    smoothed_tf_idf = normalize(tf * smoothed_idf)

    return smoothed_tf_idf

In [12]:
# Test the function using three questions

np.set_printoptions(precision=2)

compute_tfidf(qs[0:3])

array([[0.41, 0.41, 0.41, 0.41, 0.41, 0.41, 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.61, 0.8 , 0.  , 0.  , 0.  ,
        0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.32, 0.  , 0.42, 0.42, 0.42,
        0.42, 0.42]])
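
To see where values such as `0.61` and `0.8` in the second row come from, the same arithmetic can be repeated in plain Python. This is a minimal sketch, not the notebook's implementation. The token lists are assumed stand-ins for what `tokenize` returns on the three questions, and `smoothed_tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def smoothed_tfidf(token_lists):
    """Return (vocab, rows) of L2-normalized smoothed tf-idf weights."""
    n_docs = len(token_lists)
    # Vocabulary in order of first appearance
    vocab = []
    for tokens in token_lists:
        for t in tokens:
            if t not in vocab:
                vocab.append(t)
    # Document frequency: number of documents containing each term
    df = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        weights = [
            (counts[t] / total) * (math.log((n_docs + 1) / (df[t] + 1)) + 1)
            if total else 0.0
            for t in vocab
        ]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Assumed token lists for the three test questions (illustrative only)
docs_tokens = [
    ["age", "group", "high", "rate", "severe", "outcome"],
    ["covid-19", "spread"],
    ["many", "state", "u.s.", "report", "case", "covid-19"],
]

vocab, rows = smoothed_tfidf(docs_tokens)
for row in rows:
    print([round(w, 2) for w in row])
```

In the second question, `covid-19` also occurs in the third question, so its idf is lower than that of `spread`, which is why the two weights differ even though both terms have the same term frequency.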

### Putting Everything Together

Define a function `find_solutions(qs, article)` as follows: 

- Take two inputs:
    - `qs`: a list of questions (i.e. strings)
    - `article`: a document which may contain answers to the questions
- Segment the article into sentences (i.e. `sents`). You will locate the sentence which can answer a question.
- Concatenate the questions (`qs`) and sentences (`sents`) into a single list (i.e. `qs + sents`)
- Call the function `compute_tfidf` defined above with `qs + sents` to get a `TF-IDF` matrix. (Note: `qs` and `sents` are now converted to TF-IDF vectors of the same dimension, so you can measure their similarities.)
- Split the `TF-IDF` matrix into two sub matrices, one corresponding to `qs` and the other for `sents`. 
- Next, calculate the pairwise cosine similarity between `qs` and `sents`. With $m$ questions and $n$ sentences, you should get an $m \times n$ matrix. (Hint: you can use `sklearn.metrics.pairwise_distances` to calculate pairwise distances between two matrices; cosine similarity is 1 minus cosine distance.)
- Finally, the answer to each question is the sentence which has the `maximum similarity` to it. 
- Print out each question and its matched answer. Check if your QA system is able to find the right answer.
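
The matching step at the end of this list can be illustrated on its own. The sketch below is pure Python with made-up 3-dimensional vectors, not the `sklearn` call the hint suggests; `cosine_similarity` and `best_matches` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_matches(question_vecs, sentence_vecs):
    """For each question vector, return the index of the most similar sentence vector."""
    return [
        max(range(len(sentence_vecs)),
            key=lambda j: cosine_similarity(q, sentence_vecs[j]))
        for q in question_vecs
    ]

# Made-up vectors: 2 questions and 3 sentences over a 3-term vocabulary
questions = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
sentences = [[0.0, 1.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]

print(best_matches(questions, sentences))
```

When the rows are already L2-normalized, as the TF-IDF rows are expected to be here, the denominator is 1 and the similarity reduces to a dot product.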

In [17]:
# Define the function

def find_solutions(qs, doc):
    
    # add your code here
    
    #segment article into sentences
    sents = nltk.sent_tokenize(doc)
    #len(sents)
    #type(sents)
    #sents
    
    # concatenate the questions (qs) and sentences (sents) into a single list, i.e. qs + sents
    simple_list = qs + sents
    #print(simple_list)
    
    # call compute_tfidf (defined above) with qs + sents to get the TF-IDF matrix
    tfidf_matrix = compute_tfidf(simple_list)
    print(tfidf_matrix)
    print(tfidf_matrix.shape)
    
    # split the TF-IDF matrix into two sub-matrices: questions (Q) and sentences (A)
    Q = tfidf_matrix[:len(qs),0:]
    #print(Q)
    #print(Q.shape)
    
    A = tfidf_matrix[len(qs):,0:]
    #print(A)
    #print(A.shape)
    
    # pairwise cosine similarity:
    # compute the cosine distance for every question-sentence pair; similarity is 1 - distance
    similarity=1-pairwise_distances(Q, A, metric = 'cosine')
    #print(similarity)
    #print(similarity.shape)
    
    QA_system=np.argsort(similarity, axis=1)[:,::-1]
    #print(QA_system)
    for idx,row in enumerate(QA_system):
        print("\033[1m""Question:", " ", qs[idx])
        print( "\033[0m""Answer:"," ", sents[row[0]])
        print(" ")
        print("...............")

In [18]:
# Test the system

find_solutions(qs, article)

[[0.4  0.45 0.36 ... 0.   0.   0.  ]
 [0.   0.   0.   ... 0.   0.   0.  ]
 [0.   0.   0.   ... 0.   0.   0.  ]
 ...
 [0.   0.   0.   ... 0.   0.   0.  ]
 [0.   0.   0.   ... 0.   0.   0.  ]
 [0.   0.   0.   ... 0.31 0.31 0.31]]
(122, 500)
Question:   What age group has the highest rate of severe outcomes?
Answer:   A CDC Morbidity & Mortality Weekly Report that looked at severity of disease among COVID-19 cases in the United States by age group found that 80% of deaths were among adults 65 years and older with the highest percentage of severe outcomes occurring in people 85 years and older.
 
...............
Question:   How is COVID-19 spread?
Answer:   Does the patient reside in an area where there has been community spread of COVID-19?
 
...............
Question:   How many states in the U.S. have reported cases of COVID-19?
Answer:   All 50 states have reported cases of COVID-19 to CDC.
 
...............
Question:   When did the White House launch the "15

### Analysis

Compare the answers you find with the "ground-truth" answers in the json file and analyze the following:

- What kind of questions can be correctly answered by your system? What kind of questions CANNOT be correctly answered by your system?

- Why does your system fail to locate the right answers to these questions?

- How should your system be improved so that such questions can be answered?
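
One way to start this comparison is to check whether each ground-truth answer text appears inside the sentence the system returned. The sketch below assumes the matched sentences have been collected into a list (the `find_solutions` function above only prints them). `evaluate` and the toy data are hypothetical and are not taken from `qa.json`.

```python
def evaluate(qas, predicted):
    """Split questions into hits and misses by substring match on the answer text."""
    hits, misses = [], []
    for qa, sentence in zip(qas, predicted):
        gold_texts = [answer["text"] for answer in qa["answers"]]
        # Case-insensitive check: is any ground-truth answer inside the sentence?
        found = any(text.lower() in sentence.lower() for text in gold_texts)
        (hits if found else misses).append(qa["question"])
    return hits, misses

# Made-up question records and predicted sentences (illustrative only)
toy_qas = [
    {"question": "How many states have reported cases?",
     "answers": [{"text": "All 50 states"}]},
    {"question": "How does it spread?",
     "answers": [{"text": "respiratory droplets"}]},
]
toy_predicted = [
    "All 50 states have reported cases.",
    "Is there community spread in the area?",
]

hits, misses = evaluate(toy_qas, toy_predicted)
print(len(hits), "correct;", "missed:", misses)
```

The questions that land in `misses` are the natural starting point for the analysis, for example cases where the question and the true answer sentence share few content words.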