## Assignment 3 Group 3
### Zhengjie Deng a1865926
### Harsh Mukeshkumar Gandhi a1879980
### Wei You a1728091

#### Dependency version:
- Python: 3.8.16
- pandas: 1.4.2
- sklearn: 1.0.2
- nltk: 3.7
- tqdm: 4.64.1
- matplotlib: 3.7.0
- spacy: 3.5.0
- numpy: 1.23.5
- gensim: 3.8.3
- pytorch: 2.0.0
- pyserini: 0.21.0
- transformers: 4.27.4

In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import BertForQuestionAnswering, BertTokenizer
from transformers import pipeline
import torch
from pyserini.search.lucene import LuceneSearcher
import os
import re
import json
import numpy as np
from nltk.corpus import stopwords
import nltk
from tqdm import tqdm
import pandas as pd
import spacy
from collections import Counter
import gensim.downloader as api
import warnings

# might need to restart the kernel after insatll pyserini
%pip install pyserini
%pip install transformers


# Download stopwords
nltk.download('stopwords')

# Download en_core_web_lg model

spacy.cli.download("en_core_web_sm")

# Download the pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-100')

# ignore the warning
warnings.filterwarnings("ignore")


Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 10.9 MB/s eta 0:00:00


### 1. Reading dataset and pre-processing

#### 1.1 Read dataset

In [2]:
# load the dataset
f_metadata_path = "./archive/metadata.csv"

# construct the data frame of metadata
df = pd.read_csv(f_metadata_path)
df.head(3)

  df = pd.read_csv(f_metadata_path)


Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,


In [61]:
df.shape

(1056660, 19)

#### 1.2 Dropping useless columns

In [3]:
# drop useless columns for this assignment
df_trimed = df.drop(["sha", "source_x", "doi", "license", "authors", "journal", "pmc_json_files",
                    "pmcid", "pubmed_id", "mag_id", "who_covidence_id", "arxiv_id", "url", "s2_id"], axis=1)
df_trimed.head(1)


Unnamed: 0,cord_uid,title,abstract,publish_time,pdf_json_files
0,ug7v899j,Clinical features of culture-proven Mycoplasma...,OBJECTIVE: This retrospective chart review des...,2001-07-04,document_parses/pdf_json/d1aafb70c066a2068b027...


#### 1.3 Sampling

Given that COVID-19 was first reported in December 2019, we can limit our data selection to articles published after this date. Therefore, we will sample data from December 1st, 2019, onwards.

In [4]:
# only keep the documents published after 2019-11
df_trimed = df_trimed[df_trimed["publish_time"] > '2019-12']
df_trimed.shape[0]


958037

Check the missing values in the dataset.

In [5]:
# check the number of missing value in each column
missing_values_count = df_trimed.isnull().sum()
missing_values_count


cord_uid               0
title                488
abstract          220150
publish_time           0
pdf_json_files    628520
dtype: int64

Many rows of data have missing titles, abstracts, and PDF files, as demonstrated above. We will drop these rows from our analysis.

In [6]:
df_trimed = df_trimed.dropna(subset=['pdf_json_files', 'title', 'abstract'])
df_trimed.shape[0]

287065

Next, we will verify the validity of the PDF file URLs. Any rows with invalid URLs will be dropped from the dataset.

In [7]:
# remove all the rows that do not have the real file
# tqdm progress apply
tqdm.pandas(desc="removing all the rows that do not have the real file...")
df_trimed = df_trimed[df_trimed['pdf_json_files'].progress_apply(
    lambda x: os.path.isfile("./archive/"+x))]
df_trimed.shape[0]


removing all the rows that do not have the real file...: 100%|██████████| 287065/287065 [00:56<00:00, 5072.17it/s] 


267675

Based on the keywords of the test queries, we can sample the data further. We will only keep the rows whose title or abstract contain the keywords in the test queries. 

In [8]:
# function: select the rows whose title or abstract contain the given strings, ignoring the case
def select_rows_contain_string(df, strings):
    tqdm.pandas(
        desc="selecting the rows whose title or abstract contain the given strings...")
    return df[df.progress_apply(lambda row: any(string in row['title'].lower() or string in row['abstract'].lower() for string in strings), axis=1)]

In [9]:
# title or abstract contain "COVID-19", "SARS-CoV-2", "coronavirus", "2019-nCoV", "covid", "covid-19", "Covid-19"
strings = ["COVID-19", "SARS-CoV-2", "coronavirus",
           "2019-nCoV", "covid", "covid-19", "Covid-19"]
df_trimed_covid = select_rows_contain_string(df_trimed, strings)

keyword_sampled_df_list = []

# title or abstract contain "origin", "Wuhan" from the df_trimed_covid
strings = ["origin", "Wuhan"]
df_trimed_origin = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_origin)

# title or abstract contain "rapid testing"
strings = ["rapid testing"]
df_trimed_testing = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_testing)

# title or abstract contain "social", "distancing", "lockdown", "quarantine"
strings = ["social distancing", "lockdown", "quarantine"]
df_trimed_social = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_social)

# title or abstract contain "transmission route"
strings = ["transmission route"]
df_trimed_transmission = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_transmission)

# title or abstract contain "best masks", "preventing infection", "prevent infection", "preventing transmission", "prevent transmission", "preventing spread", "prevent spread"
strings = ["best masks", "preventing infection", "prevent infection",
           "preventing transmission", "prevent transmission", "preventing spread", "prevent spread"]
df_trimed_masks = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_masks)

# title or abstract contain "hand sanitizer"
strings = ["hand sanitizer"]
df_trimed_sanitizer = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_sanitizer)

# title or abstract contain "vaccine", "vaccination", "vaccines", "vaccinations"
strings = ["vaccine", "vaccination", "vaccines", "vaccinations"]
df_trimed_vaccine = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_vaccine)

# title or abstract contain "Vitamin"
strings = ["Vitamin"]
df_trimed_vitamin = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_vitamin)

# title or abstract contain "live outside the body"
strings = ["live outside the body"]
df_trimed_outside = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_outside)

# title or abstract contain "initial symptoms"
strings = ["initial symptoms"]
df_trimed_symptoms = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_symptoms)


selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 267675/267675 [00:14<00:00, 18940.84it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 158540/158540 [00:03<00:00, 40936.46it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 158540/158540 [00:02<00:00, 64529.63it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 158540/158540 [00:04<00:00, 32059.81it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 158540/158540 [00:02<00:00, 65275.43it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 158540/158540 [00:10<00:00, 14715.14it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 158540/158540 [00:02<00:00, 58777.23it/s]
selecting the rows whose title or abstract contain the given strings...: 100

In [10]:
# join all the dataframes above into one dataframe and remove the duplicates
df_trimed_covid = pd.concat(keyword_sampled_df_list).drop_duplicates()
df_trimed_covid.shape[0]

43311

Finally, we randomly sample 10000 rows from the dataset.

In [11]:
# randomly pick 10000 documents from the dataset
# df_sampled = df_trimed_covid.sample(n=10000, random_state=42)
df_sampled = df_trimed_covid.sample(n=10000, random_state=42)

#### 1.4 Pre-processing the text data 

After sampling the data, our next step is to pre-process the text data. This involves extracting the text from the PDF JSON files as our first task. And then, we will put all the paragraphs of the corpus into a list.

In [12]:
# generate the list of paragraphs
paragraphs = []
for index, row in df_sampled.iterrows():
    paragraph_obj = {}
    paragraph_obj['text'] = row['abstract']
    paragraph_obj['p_id'] = row['cord_uid'] + "_0"
    paragraphs.append(paragraph_obj)
    # extract the paragraphs from the json file
    with open("./archive/"+row['pdf_json_files']) as f:
        json_data = json.load(f)
        p_index = 1
        for body in json_data['body_text']:
            paragraph_obj = {}
            paragraph_obj['text'] = body['text']
            paragraph_obj['p_id'] = row['cord_uid'] + "_" + str(p_index)
            paragraphs.append(paragraph_obj)
            p_index += 1

# turn the list of paragraphs into a dataframe
df_paragraphs = pd.DataFrame(paragraphs)
# set the index of the dataframe to be the p_id
df_paragraphs = df_paragraphs.set_index('p_id')
df_paragraphs

Unnamed: 0_level_0,text
p_id,Unnamed: 1_level_1
0ex7dy6g_0,The COVID‐19 pandemic and the associated infec...
0ex7dy6g_1,"A severe acute respiratory syndrome, coronavir..."
0ex7dy6g_2,"Typically, national/governmental IPC initiativ..."
0ex7dy6g_3,"MB designed the study, collected, and analysed..."
0ex7dy6g_4,All procedures involved in this study were in ...
...,...
7cwve06g_22,"This study has some limitations. First, the su..."
7cwve06g_23,"Our survey revealed that 89% of the 1,499 Chin..."
7cwve06g_24,The original contributions presented in the st...
7cwve06g_25,"NZ, HL, JX, and YL conceived the study. NZ, LL..."


In [13]:
# drop the paragraphs that are not text
df_paragraphs = df_paragraphs[df_paragraphs['text'].progress_apply(lambda x: type(x) == str)]

# drop the paragraphs what are shorter than 50 characters
df_paragraphs = df_paragraphs[df_paragraphs['text'].progress_apply(lambda x: len(x) > 50)]

selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 350671/350671 [00:00<00:00, 1117915.10it/s]
selecting the rows whose title or abstract contain the given strings...: 100%|██████████| 350671/350671 [00:00<00:00, 1141517.96it/s]


Next, we can preprocess each paragraph in the paragraph list for the following word embedding. Preprocessing the text involves several tasks: first, we will convert all text to lowercase and remove stopwords. Next, we will perform lemmatization and store the preprocessed text in a separate column for further analysis.

In [14]:
stop_words = set(stopwords.words('english'))

# function that preprocesses the input string
def preprocess_text(string):
    string = string.lower()
    # remove stopwords
    string = " ".join([word for word in string.split()
                      if word not in stop_words])
    # lemmatization
    lemmatizer = nltk.stem.WordNetLemmatizer()
    string = " ".join([lemmatizer.lemmatize(word) for word in string.split()])
    return string

# function that preprocesses the text of a row in the paragraphs dataset
def preprocess_text_in_df_paragraphs(row):
    row["preprocessed_text"] = preprocess_text(row['text'])
    return row

tqdm.pandas(desc="preprocessing the text in the paragraphs...")
df_paragraphs = df_paragraphs.progress_apply(
    lambda row: preprocess_text_in_df_paragraphs(row), axis=1)


preprocessing the text in the paragraphs...: 100%|██████████| 340517/340517 [04:06<00:00, 1380.45it/s]


In [15]:
df_paragraphs.head()

Unnamed: 0_level_0,text,preprocessed_text
p_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0ex7dy6g_0,The COVID‐19 pandemic and the associated infec...,covid‐19 pandemic associated infection prevent...
0ex7dy6g_1,"A severe acute respiratory syndrome, coronavir...","severe acute respiratory syndrome, coronavirus..."
0ex7dy6g_2,"Typically, national/governmental IPC initiativ...","typically, national/governmental ipc initiativ..."
0ex7dy6g_3,"MB designed the study, collected, and analysed...","mb designed study, collected, analysed data, w..."
0ex7dy6g_4,All procedures involved in this study were in ...,procedure involved study accordance ethical st...


In [16]:
df_paragraphs.to_csv("paragraphs_preprocessed.csv")

### 2. Named Entity Recognition and Knowledge Base

In [17]:
# load the df_sampled dataframe and the paragraphs dataframe
df_paragraphs = pd.read_csv("paragraphs_preprocessed.csv")

#### 2.1 Entity extraction

To save time, we will only extract entities from the title and abstract sections of the text. These sections of the article are most likely to contain entities that are relevant to the topic of the article.

In [18]:
nlp = spacy.load("en_core_web_sm")

# function: get the name entities from the title, abstract, introduction, and conclusion of each data in the dataset
def get_name_entity(row):
    name_entity = []
    doc = nlp(row["title"])
    for ent in doc.ents:
        # if ent is not number
        if ent.label_ != "CARDINAL" and ent.label_ != "PERCENT" and ent.label_ != "MONEY":
            name_entity.append(ent.text)
    doc = nlp(row["abstract"])
    for ent in doc.ents:
        if ent.label_ != "CARDINAL" and ent.label_ != "PERCENT" and ent.label_ != "MONEY":
            name_entity.append(ent.text)
    return name_entity


In [19]:
# get the name entity of the title, abstract, introduction, and conclusion of each data in the dataset
tqdm.pandas(desc="Getting name entity...")
df_sampled['name_entity'] = df_sampled.progress_apply(
    lambda row: get_name_entity(row), axis=1)


Getting name entity...: 100%|██████████| 10000/10000 [06:27<00:00, 25.78it/s]


Presenting the frequency of the extracted entities.

In [20]:
# merge the name entity into one list
name_entity_list = []
for name_entity in df_sampled['name_entity']:
    name_entity_list.extend(name_entity)

# count the frequency of each name entity and sort it
name_entity_count = Counter(name_entity_list)
name_entity_count = sorted(name_entity_count.items(),
                           key=lambda x: x[1], reverse=True)

# show the top 10 name entity
name_entity_count[:10]


[('COVID-19', 26011),
 ('first', 2644),
 ('2019', 2381),
 ('CI', 1797),
 ('COVID‐19', 1511),
 ('China', 1378),
 ('second', 1330),
 ('2020', 924),
 ('India', 846),
 ('daily', 666)]

#### 2.2 Knowledge base

We will manually build knowledge bases that display synonyms of the entities and their associated keywords. This will be based on the results of the Named Entity Recognition and the test query set.

In [21]:
# manually create the knowledge base

# create the knowledge base
knowledge_base_synonym = {
    "COVID-19": ["SARS-CoV-2", "coronavirus disease 2019", "2019-nCoV"],
    "rapid testing": ["RAT", "rapid test", "rapid antigen test", "rapid antigen tests", "rapid antigen testing", "rapid antigen tests", "rapid antigen test kit"],
    "origin": ["origins", "source", "sources"],
    "initial symtoms": ["early signs"]
}

knowledge_base_association = {
    "mask": ["n95", "cloth mask"],
    "vaccine": ["mrna"],
    "origin": ["wuhan", "fish market"],
    "symptoms": ["fever", "chill", "cough", "tired", "headache", "loss taste or small", "sore throat", "diarrhea"],
    "sanitizer": ["alcohol"],
    "social distancing": ["quarantine", "lockdown"],
    "transmission route": ["airborne", "droplet", "contact", "fomite"],
    "testing": ["PCR"], 
    "mental health": ["depression", "anxiety"]
}


### 3. Indexing method

To efficiently retrieve paragraphs containing words from the query, we will utilize the inverted index method. This method creates a dictionary that maps words to the paragraphs that contain them.

In [22]:
# the unique word set of the preprocessed_text of the dataset
unique_word_set = set([])

def get_unique_words(row):
    for word in row['preprocessed_text'].split():
        # if not a single puctuation
        if re.match(r'^[^\w\s]$', word) == None:
            unique_word_set.add(word)


tqdm.pandas(desc="Getting unique words...")
result = df_paragraphs.progress_apply(lambda row: get_unique_words(row), axis=1)


Getting unique words...: 100%|██████████| 340517/340517 [00:29<00:00, 11411.53it/s]


In [23]:
# the inverted index of the preprocessed_text of the dataset
inverted_index = inv_indx = {i: [] for i in unique_word_set}

# function: get the inverted index of the preprocessed_text of the dataset
def get_inverted_index(row, inverted_index):
    for word in row['preprocessed_text'].split():
        if word in inverted_index:
            inverted_index[word].append(row['p_id'])


tqdm.pandas(desc="Getting inverted index...")
result = df_paragraphs.progress_apply(
    lambda row: get_inverted_index(row, inverted_index), axis=1)


Getting inverted index...: 100%|██████████| 340517/340517 [01:16<00:00, 4447.22it/s]


In [24]:
# demonstration: get the paragraphs containing the word "covid"
inverted_index["covid"][:5]

['pyxmblll_8', 'pyxmblll_13', 'ilq5xaey_10', 'n9ytx2pa_11', '8e8evwn4_4']

### 4. Relevant paragraphs retrieval

#### 4.1 Word embedding based 

The first method of retrieval is based on the method we used in the Assignment 2, which is using word embedding and cosine similarity. We will first extend the query by adding related entities from the knowledge base. Next, we will use the inverted index method to filter paragraphs containing the words from the extended query. We will then calculate cosine similarity between the query vector and the paragraph vector to rank them and select the top 3 paragraphs as the relevant paragraphs.

First, we define the function to preprocess the query.

In [25]:
# function: extend the query with the knowledge base
def extend_query(query, knowledge_base_synonym, knowledge_base_association):
    # convert the query to lower case
    query = query.lower()
    # for each key in the knowledge base, if the key is in the query, then add the value to the query string
    for key in knowledge_base_synonym:
        # if the query string contains the key
        if key.lower() in query:
            # concatenate the query string with each value of the key
            for value in knowledge_base_synonym[key]:
                query += " " + value
    for key in knowledge_base_association:
        if key.lower() in query:
            for value in knowledge_base_association[key]:
                query += " " + value
    return query

# query preprocessing: extend the query with the knowledge base, remove stopwords, lemmatization, and remove question mark
def preprocess_query(query):
    # extend the query with the knowledge base
    query = extend_query(query, knowledge_base_synonym,
                         knowledge_base_association)
    query = query.lower()
    # remove stopwords
    query = " ".join([word for word in query.split()
                     if word not in stop_words])
    # lemmatization
    lemmatizer = nltk.stem.WordNetLemmatizer()
    query = " ".join([lemmatizer.lemmatize(word) for word in query.split()])
    # remove question mark
    query = query.replace("?", "")
    return query

# function: get the paragraph id of the paragraphs that contain the query
def get_paragraph_id(query, inverted_index):
    # define the paragraph id set
    paragraph_id_set = set([])
    # for each word in the query
    for word in query.split():
        # if the word is in the inverted index
        if word in inverted_index:
            # add the paragraph id to the paragraph id set
            paragraph_id_set.update(inverted_index[word])
    return paragraph_id_set

# function: get the vector representation of the query by averaging the vector representation of each word in the query
def embedding_string(string):
    vectors = []
    for word in string.split():
        if word in glove_model:
            vectors.append(glove_model[word])
    if vectors == []:
        return np.zeros(100)
    return np.mean(vectors, axis=0)

# function: calculate the cosine similarity between two vectors
def cosine_similarity(vector1, vector2):
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

Then we can get the embedding of each paragraph.

In [30]:
# get the vector representation of each paragraph
tqdm.pandas(desc="Getting embedding for all paragraphs...")
df_paragraphs['embedding'] = df_paragraphs.progress_apply(
    lambda row: embedding_string(row['preprocessed_text']), axis=1)

df_paragraphs['embedding'][:5]

Getting embedding for all paragraphs...: 100%|██████████| 340517/340517 [01:01<00:00, 5531.58it/s]


0    [-0.0042234645, 0.21141529, 0.26677096, 0.1978...
1    [-0.040232472, 0.1667354, 0.16827837, 0.267589...
2    [-0.07398014, 0.13700563, 0.047724877, -0.0032...
3    [-0.27477884, 0.23193273, 0.087432, -0.1762163...
4    [0.021956295, 0.05252984, -0.027892865, 0.1695...
Name: embedding, dtype: object

Now we can use the above functions to get the relevant paragraphs using the word embedding method.

In [31]:
# function: get the top n paragraphs that are most similar to the query from the paragraphs dataframe, using word embedding
def get_top_n_paragraphs_wb(query_string, df_paragraphs, inverted_index,  n):
    # embad the query
    query_vector = embedding_string(query_string)
    # get the paragraph id of the paragraphs that contain the query
    indexed_paragraph_id_set = get_paragraph_id(query_string, inverted_index)
    # filter the paragraphs dataframe by the paragraph id set
    indexed_df_paragraphs = df_paragraphs[df_paragraphs['p_id'].isin(
        indexed_paragraph_id_set)]
    # calculate the cosine similarity between the query and each paragraph
    indexed_df_paragraphs['similarity'] = indexed_df_paragraphs.apply(
        lambda row: cosine_similarity(query_vector, row['embedding']), axis=1)
    # sort the paragraphs by the similarity score
    indexed_df_paragraphs = indexed_df_paragraphs.sort_values(by=['similarity'], ascending=False)
    # get the top n paragraphs
    df_top_n = indexed_df_paragraphs[:n]
    return df_top_n

Now we can find the top 3 paragraphs for the question "what are the symptoms of covid-19".

In [32]:
top_3_paragraph_wb = get_top_n_paragraphs_wb("what are the symptom of covid-19", df_paragraphs, inverted_index, 3)

for index, row in top_3_paragraph_wb.iterrows():
    print(row['text'])
    print()

Regarding COVID-19, the research shows that 6.9% of participants had symptoms related to this. However, many of the COVID-19 symptoms such as the flu, cough, fever, and tiredness are already part of the health issues experienced by those living on the streets. As a result of this, they may have gone unnoticed as something different from the usual for many people living in these conditions.

Six months and several thousand papers and preprints after the beginning of the pandemic, if there is one thing we have learnt about SARS-CoV-2, it is that almost every assumption that has been made about the virus has been wrong. Although viral pneumonia, complicated by the "cytokine storm" and a prothrombotic state (28-31), is still the principal symptom in severely ill patients, other tissues, notably the gut, are also directly susceptible to infection (32). While our study may seem at first sight to resemble the parable of the blind men and the elephant, we consider the possibility that SARS-CoV

#### 4.2 Text matching utility based on the pyserini library (Harsh's code)

In [33]:
# convert the paragraphs dataframe to json format that can be used by pyserini
convert_json_list = []
def convert_to_json(row):
    convert_json_list.append({
        "id": row["p_id"],
        "contents": row["text"]
    })

x = df_paragraphs.apply(lambda row: convert_to_json(row), axis=1)

json_str = json.dumps(convert_json_list)
with open("collection/collection.json", "w") as outfile:
    outfile.write(json_str)

In [34]:
%%capture

# use pyserini to index the json file
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input ./collection \
  --index index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 4 \
  --storePositions --storeDocvectors --storeRaw

In [35]:
def get_top_n_paragraphs_pyserini(query, n):
    searcher = LuceneSearcher('./index')

    hits = searcher.search(query)

    top_n_paragraphs = []

    for i in range(0, n):
        top_n_paragraphs.append(json.loads(hits[i].raw))
    # turn the top_n_paragraphs list into a dataframe
    df_top_n_paragraphs = pd.DataFrame(top_n_paragraphs)

    # turn the contents column into a list
    paragraph_list = df_top_n_paragraphs["contents"].tolist()
    return paragraph_list

top_3_paragraphs_pyserini = get_top_n_paragraphs_pyserini("what are the symptom of covid-19", 3)
top_3_paragraphs_pyserini

["Respondents were asked what they believed their risk of getting COVID-19 was in the next month, 3 months, and 6 months. Responses were rated on scale from 0 to 100%. COVID-19 symptom severity: Respondents were asked if they were to get COVID-19 what severity of symptoms they believed they would experience. This was rated on 5point Likert scale ranging from asymptomatic/no symptoms (1) to deadly symptoms (5). COVID-19 exposure: Respondents were asked whether, at the time of completing the survey, they knew someone who was currently or in the past had been diagnosed with COVID-19. This variable was coded as yes (1), no (0). Responses of 'don't know' were recoded and treated as missing data. COVID-19 media consumption: Frequency of watching, reading, and hearing reports or updates about COVID-19 on social media (e.g., Twitter, Facebook, and WhatsApp) and on traditional media (e.g., TV, radio, and newspaper) over the past month were assessed. Responses were dichotomized to indicate 'low 

### 5. Paragraphs summarization (Wei's code)

In [36]:
import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

model_name = "t5-base"
tokenizer_sum = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Initializa Pipeline
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer_sum)

In [37]:
def get_summary(paragraphs):
  max_length = 1024

  # to store the summaries of each paragraph
  summaries = []

  for paragraph in paragraphs:
      # devide the paragraph into sub-paragraphs
      sub_paragraphs = re.findall(".{1,%d}" % max_length, paragraph)
      sub_summaries = summarizer(sub_paragraphs, max_length=250, min_length=10, do_sample=False);
      # concatenate the sub-paragraphs into one paragraph
      summary = " ".join([s["summary_text"] for s in sub_summaries])

      summaries.append(summary)
      merged_summary = " ".join(summaries)
    
  tokenizer_sum = BertTokenizer.from_pretrained('bert-base-uncased')
  
  # if merged_summary has more than 512 tokens, then we only get the first 500 tokens
  if len(tokenizer_sum.encode(merged_summary)) > 512:
    merged_summary = tokenizer_sum.decode(tokenizer_sum.encode(merged_summary)[:400])

  text = merged_summary
  tokens = tokenizer_sum.encode(text)
  num_tokens = len(tokens)
  return merged_summary, num_tokens


### 6. Bert QA system (Jack's code)

In [38]:
# load the bert model for question answering
model_qa = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer_qa = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [39]:
# in-text reference: [1]
# function: get the answer to the question from the paragraph
# question: the question
# answer_text: the paragraph that contains the answer
# output: answer
def answer_question(question, answer_text):
    # Apply the tokenizer to the input question and reference text
    input_ids = tokenizer_qa.encode(question, answer_text)

    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer_qa.sep_token_id)

    # The number of segment A tokens includes the [SEP] token istelf.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # Run the model and gather the ouput tensors.
    outputs = model_qa(torch.tensor([input_ids]), # The tokens representing our input text.
                    token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                    return_dict=True) 

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    # Get the string versions of the input tokens.
    tokens = tokenizer_qa.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    # if contains [CLS] or [SEP], then return no answer
    if '[CLS]' in answer or '[SEP]' in answer:
        return "No answer"
    return answer

### 7. Integrated QA system

In [40]:
# function: using the word-emebdding method to get the answer to the question
def get_answer_a2(question):
    # preprocess the query
    preprocessed_query = preprocess_query(question)
    # get the top 3 paragraphs
    paragraph_df = get_top_n_paragraphs_wb(preprocessed_query, df_paragraphs, inverted_index, 3)
    # turn the text column into a list
    paragraph_list = paragraph_df["text"].tolist()
    # get the summary
    summary, token_nums = get_summary(paragraph_list)
    # get the answer
    return [answer_question(question, summary), summary]


get_answer_a2("what are the symptoms of covid-19?")    

['nausea , eye pain , loss of appetite , cutaneous rush , and hypothermia',
 'fever was the most reported with 98.6%, followed by dry cough (88.7%), dyspnea (71,8%), fatigue (57%), desaturation (35.2%), headache (32,4%), diarrhea (30.3%), and myalgia (25.4%) others, nonrelated with covid-19, were remembered as nausea, eye pain, loss of appetite, cutaneous rush, and hypothermia . Among the 314 symptomatic individuals, the most frequent symptoms were cough (n =93, 29.6%), headaches (n=85, 27.1%), runny nose (n >79, 25.2%), unusual fatigue (n65, 20.7%), fever (n>58, 18.5%), and sore throat (n 49, 15.6%). bob greene: fever or chills, Cough, Shortness of breath or difficulty breathing, Fatigue, Muscle or body aches, Headache, New loss of taste or smell, Sore throat, Congestion or runny nose, Nausea or vomiting, Diarrhea, I do not know, he says.']

In [41]:
# function: using the pyserini method to get the answer to the question
def get_answer_a3(question):
    # preprocess the query
    preprocessed_query = preprocess_query(question)
    # get the top 3 paragraphs using pyserini
    top_3_paragraphs = get_top_n_paragraphs_pyserini(preprocessed_query, 3)
    # get the summary of the top 3 paragraphs
    summary, token_nums = get_summary(top_3_paragraphs)
    # answer the question
    return [answer_question(question, summary), summary]


get_answer_a3("where is the origin of COVID-19?")

['wuhan city of china',
 'genetic studies conducted during the regional spread of the virus revealed that it has a 70-79% genetic similarity to the severe acute respiratory syndrome coronavirus (SARS-CoV) that caused a serious outbreak in 2003 . in a short period of time, Thailand and Japan, the regional neighbors of China, became the first other countries where the disease was seen . SARS-CoV-2 is 96.2% similar to coronaviruses found in bats . it is not possible to speak clearly about the source of the disease since the Wuhan seafood market does not sell bats or bat meat . the global epidemic of novel coronavirus (nCoV) began in Wuhan, china . within four months, the disease was reported from more than 180 countries . the disease is related to sARS-CoV-1, 5,10,11 which led the ivc . clinically, the disease spectrum ranges from mild respiratory tract illness (self-limiting), severe pneumonia, organ failure and death. pandemic has already infected nearly 21.83 million people worldwide a

### 8. Test utility and test results

In [42]:
# the test queries set 
test_set = [
    "what is the origin of COVID-19?",
    "Which city is the origin of COVID-19?",
    "what types of rapid testing for Covid-19 have been developed?",
    "has social distancing had an impact on slowing the spread of COVID-19?",
    "what are the transmission routes of coronavirus?",
    "what are the best masks for preventing infection by Covid-19?",
    "what type of hand sanitizer is needed to destroy Covid-19?",
    "What vaccine candidates are being tested for Covid-19?",
    "does Vitamin D impact COVID-19 prevention and treatment?",
    "how long can the coronavirus live outside the body?",
    "what are the initial symptoms of Covid-19?",
    "which biomarkers predict the severe clinical course of 2019-nCOV infection?",
    "What is the result of phylogenetic analysis of SARS-CoV-2 genome sequence?",
    "what are best practices in hospitals and at home in maintaining quarantine?",
    "how much percentage of persons infected with SARS-CoV-2 might be asymptomatic?",
    "Does COVID-19 have a higher transmission rate than SARS and middle east respiratory syndrome?",
    "What is the short form for coronavirus disease 2019?",
    "Why was covid-19 originally called SARS-CoV-2?",
    "What is the percentage of genetic similarity between covid-19 and the severe acute respiratory syndrome coronavirus (SARS-CoV)",
    "which contries became the first other countries where COVID-19 was seen?",
    "What are the vaccine candidates being developed to bring the pandemic under control?", 
    "can available vaccines for SARS bring rapid contrl of the current pandemic?"
]

In [43]:
QA_results = []

# loop through the test set
for query in test_set:
    result = {}
    result["query"] = query
    result["answer_a2"] = get_answer_a2(query)[0]
    result["answer_a3"] = get_answer_a3(query)[0]
    QA_results.append(result)

# convert the results to a dataframe
QA_results_df = pd.DataFrame(QA_results)

In [44]:
# present the results in a better way
QA_results_df.style.set_properties(**{'text-align': 'left'})

Unnamed: 0,query,answer_a2,answer_a3
0,what is the origin of COVID-19?,huanan animal market,west district of southern china seafood wholesale market
1,Which city is the origin of COVID-19?,huanan,wuhan
2,what types of rapid testing for Covid-19 have been developed?,"rapid diagnostic test ( rdt ) , chemiluminescent immunoassay ( cia ) , enzymelinked immunosorbent assay ( elisa ) , and neutralization assay",No answer
3,has social distancing had an impact on slowing the spread of COVID-19?,many countries implemented statewide social distancing measures and other preventive interventions .,public health authorities have recommended social distancing and even quarantine
4,what are the transmission routes of coronavirus?,droplet and contact transmission,"contact , droplets , airborne , fomite , fecal - oral , bloodborne , mother - to - child , and animalto - human transmission"
5,what are the best masks for preventing infection by Covid-19?,surgical face masks,triple layer surgical masks
6,what type of hand sanitizer is needed to destroy Covid-19?,No answer,alcohol based
7,What vaccine candidates are being tested for Covid-19?,mrna and viral vector vaccines,two mrna vaccines of bnt162b2 ( pfizer / biontech ) and mdna - 1273 ( moderna )
8,does Vitamin D impact COVID-19 prevention and treatment?,rapid vaccine development using the mrna platform is mitigating covid - 19 hospitalizations and deaths,potential
9,how long can the coronavirus live outside the body?,several hours,up to 60 minutes


#### 5.1 result

After obtaining the test results, we will manually identify the correctness of each return snippets and calculate the reciprocal rank for each query.

In [45]:
# construct the results

result = [
    {"query": "what is the origin of COVID-19?", "answer_a2": 1, "answer_a3": 1},
    {"query": "Which city is the origin of COVID-19?", "answer_a2": 0, "answer_a3": 1},
    {"query": "what types of rapid testing for Covid-19 have been developed?", "answer_a2": 1, "answer_a3": 0},
    {"query": "has social distancing had an impact on slowing the spread of COVID-19?", "answer_a2": 0, "answer_a3": 0},
    {"query": "what are the transmission routes of coronavirus?", "answer_a2": 0.5, "answer_a3": 1},
    {"query": "what are the best masks for preventing infection by Covid-19?", "answer_a2": 1, "answer_a3": 1},
    {"query": "what type of hand sanitizer is needed to destroy Covid-19?", "answer_a2": 0, "answer_a3": 1},
    {"query": "What vaccine candidates are being tested for Covid-19?", "answer_a2": 1, "answer_a3": 1},
    {"query": "does Vitamin D impact COVID-19 prevention and treatment?", "answer_a2": 0, "answer_a3": 0},
    {"query": "how long can the coronavirus live outside the body?", "answer_a2": 0, "answer_a3": 1},
    {"query": "what are the initial symptoms of Covid-19?", "answer_a2": 0.5, "answer_a3": 1},
    {"query": "which biomarkers predict the severe clinical course of 2019-nCOV infection?", "answer_a2": 0, "answer_a3": 0},
    {"query": "What is the result of phylogenetic analysis of SARS-CoV-2 genome sequence?", "answer_a2": 0, "answer_a3": 1},
    {"query": "what are best practices in hospitals and at home in maintaining quarantine?", "answer_a2": 0.5, "answer_a3": 0},
    {"query": "how much percentage of persons infected with SARS-CoV-2 might be asymptomatic?", "answer_a2": 0, "answer_a3": 1},
    {"query": "Does COVID-19 have a higher transmission rate than SARS and middle east respiratory syndrome?", "answer_a2": 0, "answer_a3": 1},
    {"query": "What is the short form for coronavirus disease 2019?", "answer_a2": 1, "answer_a3": 1},
    {"query": "Why was covid-19 originally called SARS-CoV-2?", "answer_a2": 0, "answer_a3": 0},
    {"query": "What is the percentage of genetic similarity between covid-19 and the severe acute respiratory syndrome coronavirus (SARS-CoV)", "answer_a2": 0, "answer_a3": 1},
    {"query": "which contries became the first other countries where COVID-19 was seen?", "answer_a2": 0, "answer_a3": 1},
    {"query": "What are the vaccine candidates being developed to bring the pandemic under control?",  "answer_a2": 0, "answer_a3": 1},
    {"query": "can available vaccines for SARS bring rapid contrl of the current pandemic?", "answer_a2": 0, "answer_a3": 1},
]

# convert the results to a dataframe
result_df = pd.DataFrame(result)
result_df

Unnamed: 0,query,answer_a2,answer_a3
0,what is the origin of COVID-19?,1.0,1
1,Which city is the origin of COVID-19?,0.0,1
2,what types of rapid testing for Covid-19 have ...,1.0,0
3,has social distancing had an impact on slowing...,0.0,0
4,what are the transmission routes of coronavirus?,0.5,1
5,what are the best masks for preventing infecti...,1.0,1
6,what type of hand sanitizer is needed to destr...,0.0,1
7,What vaccine candidates are being tested for C...,1.0,1
8,does Vitamin D impact COVID-19 prevention and ...,0.0,0
9,how long can the coronavirus live outside the ...,0.0,1


In [None]:
# present the results in a better way
# make a copy of the QA_results_df
QA_results_df_copy = QA_results_df.copy()

for i in range(len(QA_results_df_copy)):
    QA_results_df_copy.at[i, "answer_a2"] = [QA_results_df_copy.at[i, "answer_a2"], result_df.at[i, "answer_a2"]]
    QA_results_df_copy.at[i, "answer_a3"] = [QA_results_df_copy.at[i, "answer_a3"], result_df.at[i, "answer_a3"]]

In [None]:
# change the answer_a2 column name to answer_word_embedding
QA_results_df_copy.rename(columns={"answer_a2": "answer_word_embedding"}, inplace=True)

# change the answer_a3 column name to answer_pyserini
QA_results_df_copy.rename(columns={"answer_a3": "answer_pyserini"}, inplace=True)

To shows the result more clearly, we will plot the answers with different colors. The correct answers are shown in yellow, the half correct answers are shown in white, and the incorrect answers are shown in red.

In [None]:
# QA_results_df_copy.style.set_properties(**{'text-align': 'left'})

# change the cell color based on the score
def color_negative_red(val):
    if val[1] == 1:
        color = 'yellow'
    elif val[1] == 0:
        color = 'red'
    else:
        color = 'white'
    return 'color: %s' % color

QA_results_df_copy.style.applymap(color_negative_red, subset=["answer_word_embedding", "answer_pyserini"]).set_properties(**{'text-align': 'left'})

Unnamed: 0,query,answer_word_embedding,answer_pyserini
0,what is the origin of COVID-19?,"['huanan animal market', 1.0]","['west district of southern china seafood wholesale market', 1]"
1,Which city is the origin of COVID-19?,"['huanan', 0.0]","['wuhan', 1]"
2,what types of rapid testing for Covid-19 have been developed?,"['rapid diagnostic test ( rdt ) , chemiluminescent immunoassay ( cia ) , enzymelinked immunosorbent assay ( elisa ) , and neutralization assay', 1.0]","['No answer', 0]"
3,has social distancing had an impact on slowing the spread of COVID-19?,"['many countries implemented statewide social distancing measures and other preventive interventions .', 0.0]","['public health authorities have recommended social distancing and even quarantine', 0]"
4,what are the transmission routes of coronavirus?,"['droplet and contact transmission', 0.5]","['contact , droplets , airborne , fomite , fecal - oral , bloodborne , mother - to - child , and animalto - human transmission', 1]"
5,what are the best masks for preventing infection by Covid-19?,"['surgical face masks', 1.0]","['triple layer surgical masks', 1]"
6,what type of hand sanitizer is needed to destroy Covid-19?,"['No answer', 0.0]","['alcohol based', 1]"
7,What vaccine candidates are being tested for Covid-19?,"['mrna and viral vector vaccines', 1.0]","['two mrna vaccines of bnt162b2 ( pfizer / biontech ) and mdna - 1273 ( moderna )', 1]"
8,does Vitamin D impact COVID-19 prevention and treatment?,"['rapid vaccine development using the mrna platform is mitigating covid - 19 hospitalizations and deaths', 0.0]","['potential', 0]"
9,how long can the coronavirus live outside the body?,"['several hours', 0.0]","['up to 60 minutes', 1]"


Finally, we calculate the scores for each method. The result analysis is presented in the report.

In [46]:
# get the sum of the answer_a2 column
answer_a2_score = result_df["answer_a2"].sum() / len(result_df)

# get the sum of the answer_a3 column
answer_a3_score = result_df["answer_a3"].sum() / len(result_df)

# print the results in percentage with 2 decimal places
print("answer_a2 score: {:.2%}".format(answer_a2_score))
print("answer_a3 score: {:.2%}".format(answer_a3_score))

answer_a2 score: 29.55%
answer_a3 score: 72.73%


### 9. Simple user interface

In this section, we will build a simple user interface for the QA system. The user can input a question and the system will return the answer. The user can also choose the whether to play the audio of the answer.

In [47]:
import gtts
from playsound import playsound

In [74]:
def speak(text):
    # make request to google to get synthesis
    tts = gtts.gTTS(text=text, lang="en")

    # save the audio file
    tts.save("temp_audio.mp3")

    # play the audio file
    playsound("temp_audio.mp3")

def QA_user_interface():
    query = input("Please enter your question: ")
    if query == "exit":
        return
    speakOrNot = input("Do you want to hear the answer? This requires network connection. (y/n): ")
    answer = get_answer_a3(query)[0]
    print("Question: {}".format(query))
    print("Answer: {}".format(answer))
    if speakOrNot == "y":
        speak(answer)

In [75]:
QA_user_interface()

Question: Which city is the origin of COVID-19?
Answer: wuhan


### 7. References

[1] https://www.youtube.com/watch?v=l8ZYCvgGu0o&ab_channel=ChrisMcCormickAI