# Question Answer System Based on the COVID-19 Data (CORD 19)

Dataset: COVID-19 Open Research Dataset Challenge (CORD-19) \
<https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge>


Members of the team:

- Chinmay Dharmik | a1855351
- Harpreet Kaur Hans | a1873328
- Priyank Dave | a1843068

Group 20 

***

# Section 1 - Data Preparation

1. Load the dataset
2. Preprocess the dataset
3. Detect the language of the dataset
4. Remove irrelevant and duplicate data

***

In [28]:
#Importing libraries and packages

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import json
import re
import string
import nltk
from langdetect import detect
from langdetect import DetectorFactory

from tqdm import tqdm

import warnings
warnings.filterwarnings("ignore")

import multiprocessing as mp
from multiprocessing import Pool
from tqdm import tqdm
import spacy

nlp = spacy.load("en_core_sci_sm")
nlp.max_length = 15000000

# Importing the stop words list from the spacy package
from spacy.lang.en.stop_words import STOP_WORDS

# Creating a list of stop words
stop_words = list(STOP_WORDS)

# Adding custom stop words to the list
stop_words.extend([
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 
    'al.', 'Elsevier', 'PMC', 'CZI'
])

# Converting the list to a set of lowercased words
stop_words = set([word.lower() for word in stop_words])


In [29]:
# Running Functions
import string
import re
import numpy as np

# Define a method to preprocess the body text of an article by applying several text cleaning operations
def preprocess(lst):
    # Initialize an empty list to store preprocessed text
    pp_list = []
    
    # Iterate over each text in the input list
    for text in lst:
        # Define allowed characters to keep
        allowed_char = string.ascii_uppercase + string.ascii_lowercase + " ."
        
        # Remove email addresses using regular expression
        text = re.sub(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}', '', text)
        
        # Remove DOI links using regular expression
        text = re.sub(r'https\:\/\/doi\.org[^\s]+', '', text)
        
        # Remove HTTPS links using regular expression
        text = re.sub(r'(\()?\s?http(s)?\:\/\/[^\)]+(\))?', '', text)
        
        # Remove single characters repeated at least 3 times for spacing error (e.g. s u m m a r y)
        text = re.sub(r'(\w\s+){3,}', '', text)
        
        # Remove runs of three or more bracketed citation tags (e.g. [3] [4] [5])
        text = re.sub(r'(\[\d+\]\,?\s?){3,}(\.|\,)?', '', text)
        
        # Remove bracketed citation lists (e.g. [3, 4, 5])
        text = re.sub(r'\[[\d\,\s]+\]', '', text)
        
        # Remove runs of three or more parenthesised numbers (e.g. (1) (2) (3))
        text = re.sub(r'(\(\d+\)\s){3,}', '', text)
        
        # Remove decimal numbers (e.g. 1.3); their full stop would interfere with sentence splitting later
        text = re.sub(r'(\d+)\.(\d+)', '', text)
        
        # Remove full stops that are not followed by an uppercase letter, treating them as
        # abbreviations (e.g. i.e., cit.); the character after the full stop is removed as well
        text = re.sub(r'\.(\s)?([^A-Z\s])', '', text)
        
        # Keep only allowed characters
        text = "".join([x for x in text if x in allowed_char])
        
        # Remove runs of two or more spaces and runs of two or more full stops
        text = re.sub(r' {2,}', '', text)
        text = re.sub(r'\.{2,}', '', text)
        
        # Convert text to lowercase
        text = text.lower()
        
        # Remove stop words
        # text = " ".join([x for x in text.split(" ") if x.lower() not in STOP_WORDS]) 
        
        # Append preprocessed text to the list
        pp_list.append(text)
    
    # Return the preprocessed text as a numpy array
    return np.array(pp_list)



import json

def get_body_text(file_path: str) -> list[str]:
    """
    Given a file path, returns a list of the body texts of all entries in the JSON file
    whose length (in number of words) is greater than 5.
    """
    # initialize the body text to an empty list
    body_text = []

    # open the JSON file at the specified file path
    with open(file_path.split("; ")[0]) as file:
        # load the JSON content from the file
        content = json.load(file)
        # extract the body text of each entry whose length is greater than 5
        body_text = [entry['text'] for entry in content['body_text'] if len(entry['text'].split(" ")) > 5]

    # return the list of body texts
    return body_text

def get_publish_date(text):
    year_str = text[:4] # extract the first 4 characters, which should be the year
    year_int = int(year_str) # convert the year string to an integer
    return year_int # return the year as an integer

# method to detect the language of the body_text
def dect_lang(lst):
    text = ' '.join(lst)
    # Split the body text into words
    text = text.split(" ")
    # Default language is English
    lang = "en"
    try:
        # If there are more than 50 words in the text, use the first 50 words to detect the language
        if len(text) > 50:
            lang = detect(" ".join(text[:50]))
        # Otherwise, use all the words to detect the language
        elif len(text) > 0:
            lang = detect(" ".join(text[:len(text)]))
    # If there is an exception raised, it might be because the beginning of the document is not in a good format
    except Exception as e:
        # Create a set of all the words in the text
        all_words = set(text)
        try:
            # Try to detect the language using all the words
            lang = detect(" ".join(all_words))
        # If detection fails again, label the language as unknown
        except Exception as e:
            lang = "unknown"
    # Return the detected language
    return lang

# method to check if the given paper is related to COVID-19 or not
def check_covid_paper(lst):
    text = ' '.join(lst)
    if text.strip() == '':
        return -1
    # list of terms that indicate a paper is related to COVID-19
    covid_terms = ['covid', 'coronavirus disease 19', 'sars cov 2', '2019 ncov', '2019ncov', '2019 n cov', '2019n cov',
        'ncov 2019', 'n cov 2019', 'coronavirus 2019', 'wuhan pneumonia', 'wuhan virus', 'wuhan coronavirus',
        'coronavirus 2', 'covid-19', 'SARS-CoV-2', '2019-nCov']
    
    # convert all terms to lowercase
    covid_terms = [elem.lower() for elem in covid_terms]
    
    # compile the terms into a regular expression pattern for efficient matching
    covid_terms = re.compile('|'.join(covid_terms))
    # search for the terms in the body text of the paper
    # (the search is case-sensitive: the terms are lowercased, the text is not)
    return bool(covid_terms.search(text))
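
Because the pattern in `check_covid_paper` is built from lowercased terms while the body text is searched as-is, spellings such as "COVID-19" or "SARS-CoV-2" in their usual capitalisation would not match. The following is a minimal sketch of a case-insensitive variant, assuming that is the intended behaviour. `is_covid_text` and the shortened term list are hypothetical and are not the code that produced the outputs in this notebook.

```python
import re

# Shortened, illustrative subset of the term list used in check_covid_paper.
COVID_TERMS = ["covid", "sars cov 2", "sars-cov-2", "2019-ncov", "wuhan pneumonia"]

# re.escape guards against regex metacharacters inside a term;
# re.IGNORECASE makes the match independent of capitalisation.
COVID_PATTERN = re.compile("|".join(re.escape(t) for t in COVID_TERMS), re.IGNORECASE)


def is_covid_text(paragraphs):
    """Return True if any COVID-19 term occurs in the joined paragraphs (hypothetical helper)."""
    text = " ".join(paragraphs)
    return bool(COVID_PATTERN.search(text))


print(is_covid_text(["Patients with SARS-CoV-2 infection were enrolled."]))  # True
print(is_covid_text(["Influenza A outbreaks in poultry farms."]))            # False
```

Switching to such a matcher would change which papers pass the filter, so the row counts reported further down would no longer apply.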


In [30]:
# Importing the MetaData

Dataset = pd.read_csv('metadata.csv', low_memory=False)
print(Dataset.shape)
print(Dataset.columns)
Dataset.drop_duplicates(subset=['cord_uid'], inplace=True)        
Dataset = Dataset[['sha','title', 'doi', 'publish_time', 'authors','pdf_json_files']]
Dataset = Dataset[Dataset['pdf_json_files'].notna()]                         # considering only pdf data files
Dataset = Dataset[Dataset['pdf_json_files'].str.endswith('.json')]           # filtering only valid json files
print(Dataset.shape)  
Dataset.head()

(1056660, 19)
Index(['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',
       'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',
       'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',
       'url', 's2_id'],
      dtype='object')
(372888, 6)


Unnamed: 0,sha,title,doi,publish_time,authors,pdf_json_files
0,d1aafb70c066a2068b02786f8929fd9c900897fb,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",document_parses/pdf_json/d1aafb70c066a2068b027...
1,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",document_parses/pdf_json/6b0567729c2143a66d737...
2,06ced00a5fc04215949aa72528f2eeaae1d58927,Surfactant protein-D and pulmonary host defense,10.1186/rr19,2000-08-25,"Crouch, Erika C",document_parses/pdf_json/06ced00a5fc04215949aa...
3,348055649b6b8cf2b9a376498df9bf41f7123605,Role of endothelin-1 in lung disease,10.1186/rr44,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",document_parses/pdf_json/348055649b6b8cf2b9a37...
4,5f48792a5fa08bed9f56016f4981ae2ca6031b32,Gene expression in epithelial cells in respons...,10.1186/rr61,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",document_parses/pdf_json/5f48792a5fa08bed9f560...


In [31]:
with mp.Pool(15) as pool:
    Dataset['publish_time'] = pool.map(get_publish_date, Dataset.publish_time)
Dataset=Dataset[Dataset['publish_time']>2019]
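
`get_publish_date` assumes every `publish_time` value is a string that starts with a four-digit year. Whether the filtered metadata contains missing or malformed dates is not shown here, so the following is only a defensive sketch. `parse_publish_year` is a hypothetical helper that returns `None` instead of raising.

```python
def parse_publish_year(value):
    """Return the leading four-digit year of a publish_time value, or None (hypothetical helper)."""
    if not isinstance(value, str):
        return None  # e.g. the float NaN that pandas uses for a missing date
    head = value.strip()[:4]
    # Accept only four decimal digits, so int() cannot fail.
    return int(head) if len(head) == 4 and head.isdecimal() else None


samples = ["2020-03-11", "2019", "", float("nan")]
print([parse_publish_year(s) for s in samples])  # [2020, 2019, None, None]
```

Rows with a `None` year would need to be dropped before a comparison such as `> 2019`.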

In [32]:
with mp.Pool(15) as pool:
    Dataset['body_text'] = pool.map(get_body_text, Dataset.pdf_json_files)
Dataset.head(10)

Unnamed: 0,sha,title,doi,publish_time,authors,pdf_json_files,body_text
4237,44449ad1cca160ce491d7624f8ae1028f3570c45,Dexmedetomidine improved renal function in pat...,10.1186/s40560-019-0415-z,2020,"Nakashima, Tsuyoshi; Miyamoto, Kyohei; Shima, ...",document_parses/pdf_json/44449ad1cca160ce491d7...,[Dexmedetomidine is a sedative drug that has a...
4238,def41c08c3cb1b3752bcff34d3aed7f8486e1c86,Aortic volume determines global end-diastolic ...,10.1186/s40635-019-0284-8,2020,"Akohov, Aleksej; Barner, Christoph; Grimmer, S...",document_parses/pdf_json/def41c08c3cb1b3752bcf...,[Transpulmonary thermodilution is commonly use...
4239,f5ae3f66face323615df39d838e056ab5fcc98df,Whole genome sequencing and phylogenetic analy...,10.1186/s12864-019-6400-z,2020,"Kamau, Everlyn; Oketch, John W.; de Laurent, Z...",document_parses/pdf_json/f5ae3f66face323615df3...,[Human metapneumovirus (HMPV) is a single-stra...
4240,5be75ae4e7f8c892abd8dc396b9dbd035772c84a,European intensive care physicians’ experience...,10.1186/s13756-019-0662-8,2020,"Lepape, Alain; Jean, Astrid; De Waele, Jan; Fr...",document_parses/pdf_json/5be75ae4e7f8c892abd8d...,[Antimicrobial resistance (AMR) is a threat to...
4241,1cee4a0d0e823379ec34a462a04561bf4cd736a2,Synthetic carbohydrate-based vaccines: challen...,10.1186/s12929-019-0591-0,2020,"Mettu, Ravinder; Chen, Chiang-Yun; Wu, Chung-Yi",document_parses/pdf_json/1cee4a0d0e823379ec34a...,[Carbohydrate-based vaccines have a long histo...
4242,dfbc39af0a5845b6f3698657c81a4527a45a4021,Acute kidney injury in burn patients admitted ...,10.1186/s13054-019-2710-4,2020,"Folkestad, Torgeir; Brurberg, Kjetil Gundro; N...",document_parses/pdf_json/dfbc39af0a5845b6f3698...,[Acute kidney injury (AKI) is a common complic...
4243,2caf288d2b6723ea0d82657ab1ecb1199a5b3b6b,Identification of antigens presented by MHC fo...,10.1038/s41541-019-0148-y,2020,"Bettencourt, Paulo; Müller, Julius; Nicastri, ...",document_parses/pdf_json/2caf288d2b6723ea0d826...,"[Mycobacterium tuberculosis (M.tb), the etiolo..."
4244,81f4c4710b9844ee1b4981af6e45fe286e44caa8,Imaging of tumour response to immunotherapy,10.1186/s41747-019-0134-1,2020,"Dromain, Clarisse; Beigelman, Catherine; Pozze...",document_parses/pdf_json/81f4c4710b9844ee1b498...,[Immune checkpoint inhibitors remove inhibitor...
4245,d6430e4ecda76d74dc295a831b9fc1914f7c04db,Quantifying the relative impact of contact het...,10.1186/s12879-019-4738-0,2020,"Lei, Hao; Jones, Rachael M.; Li, Yuguo",document_parses/pdf_json/d6430e4ecda76d74dc295...,[Healthcare-associated infections (HAIs) pose ...
4246,778903e91db14b0973295cc8e58a6cd4c7a136ba,"Paediatric nurses’ general self-efficacy, perc...",10.1186/s12913-019-4878-3,2020,"Cheng, Linan; Cui, Yajuan; Chen, Qian; Ye, Yan...",document_parses/pdf_json/778903e91db14b0973295...,[A contradiction between the supply and demand...


In [33]:
with mp.Pool(15) as pool:
    Dataset['check_covid']=pool.map(check_covid_paper,Dataset.body_text)
    Dataset['Language']=pool.map(dect_lang,Dataset.body_text)

In [34]:
Dataset=Dataset[Dataset['check_covid']==True]
Dataset=Dataset[Dataset['Language']=='en']
Dataset.drop('check_covid',axis=1,inplace=True)
Dataset.drop('Language',axis=1,inplace=True)
Dataset=Dataset.reset_index(drop=True)

In [35]:
with mp.Pool(15) as pool:
    Dataset['pp_body_text'] = pool.map(preprocess, Dataset.body_text)
Dataset.head(10)

Unnamed: 0,sha,title,doi,publish_time,authors,pdf_json_files,body_text,pp_body_text
0,6a43f1603edfe8220dc7706a2c269b7aa5370044,COVID-19: A critical care perspective informed...,10.1016/j.accpm.2020.02.002,2020,"Ling, Lowell; Joynt, Gavin M.; Lipman, Jeff; C...",document_parses/pdf_json/6a43f1603edfe8220dc77...,[The world is closely watching the outbreak of...,[the world is closely watching the outbreak of...
1,8f0df7fdd64d2f7b7763c00956a499956139a60e,The response of Milan's Emergency Medical Syst...,10.1016/s0140-6736(20)30493-1,2020,"Spina, Stefano; Marrazzo, Francesco; Migliari,...",document_parses/pdf_json/8f0df7fdd64d2f7b7763c...,[zone or China. The COVID-19 Response Team ass...,[zone or china. the covid response team assess...
2,014ce88b18095a1eaa877ab5aae818df6ec214f6,Viral load of SARS-CoV-2 in clinical samples,10.1016/s1473-3099(20)30113-4,2020,"Pan, Yang; Zhang, Daitao; Yang, Peng; Poon, Le...",document_parses/pdf_json/014ce88b18095a1eaa877...,"[, days 4-7 (R²=0·93, p<0·001) , and days 7-14...","[ daysr pand daysr p., fromconfirmed cases of ..."
3,1b3d032fe3b4d7bc9d975c83b058c73c3c5143b7,Treatment and Outcome of a Patient With Lung C...,10.1016/j.jtho.2020.02.025,2020,"Zhang, Hongyan; Xie, Conghua; Huang, Yihua",document_parses/pdf_json/1b3d032fe3b4d7bc9d975...,[Treatment and Outcome of a Patient With Lung ...,[treatment and outcome of a patient with lung ...
4,0ce817b1fa3f295c44f1ec1bb1a93784cfa1ba4e,Liver injury in COVID-19: management and chall...,10.1016/s2468-1253(20)30057-1,2020,"Zhang, Chao; Shi, Lei; Wang, Fu-Sheng",document_parses/pdf_json/0ce817b1fa3f295c44f1e...,"[are acute and resolve quickly, but the diseas...",[are acute and resolve quickly but the disease...
5,f5a53197117badae809250b44f55f63a19ed7019,Initiation of a new infection control system f...,10.1016/s1473-3099(20)30110-9,2020,"Chen, Xuejiao; Tian, Junzhang; Li, Guanming; L...",document_parses/pdf_json/f5a53197117badae80925...,"[In December, 2019, a group of patients with p...",[in decembera group of patients with pneumonia...
6,f863c592acdbb85b20cfd781082d3ba883dbc8f7,Enteric involvement of coronaviruses: is faeca...,10.1016/s2468-1253(20)30048-0,2020,"Yeo, Charleen; Kaushal, Sanghvi; Yeo, Danson",document_parses/pdf_json/f863c592acdbb85b20cfd...,[knowledge gap of prevalence of viraemic HCV i...,[knowledge gap of prevalence of viraemic hcv i...
7,17b7eebc3090a925fb97c774c94b7f65a4fd527d,Full spectrum of COVID-19 severity still being...,10.1016/s0140-6736(20)30308-1,2020,"Xu, Zhou; Li, Shu; Tian, Shen; Li, Hao; Kong, ...",document_parses/pdf_json/17b7eebc3090a925fb97c...,[the outbreak would involve testing sera of bl...,[the outbreak would involve testing sera of bl...
8,9b7cf9483474f9820bd3350faf461ddb45fc785e,Molecular characterization of SARS-CoV-2 in th...,10.1016/j.cmi.2020.03.020,2020,"Bal, A.; Destras, G.; Gaymard, A.; Bouscambert...",document_parses/pdf_json/9b7cf9483474f9820bd33...,"[In December 2019, a novel coronavirus emerged...",[in decembera novel coronavirus emerged in chi...
9,e7522f4e4b1359da1656fa2c5e1ee9cba9d16909,Considerations for Drug Interactions on QTc in...,10.1016/j.jacc.2020.04.016,2020,"Roden, Dan M.; Harrington, Robert A.; Poppas, ...",document_parses/pdf_json/e7522f4e4b1359da1656f...,[Hydroxychloroquine and azithromycin have been...,[hydroxychloroquine and azithromycin have been...
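
Some tokens in the `pp_body_text` column above appear fused, for example "in decembera group" from "In December, 2019, a group". This is consistent with `preprocess` deleting runs of spaces rather than collapsing them to a single space. The sketch below shows only the final cleaning steps under the assumption that word boundaries should be preserved. `collapse_spacing` is a hypothetical helper, not the function that produced the output shown.

```python
import re
import string

# Same allowed character set as in preprocess: ASCII letters, space and full stop.
ALLOWED = set(string.ascii_letters + " .")


def collapse_spacing(text):
    """Drop disallowed characters, then squeeze repeated spaces and full stops (hypothetical helper)."""
    text = "".join(ch for ch in text if ch in ALLOWED)
    text = re.sub(r" {2,}", " ", text)   # keep one space so neighbouring words stay separated
    text = re.sub(r"\.{2,}", ".", text)  # keep one full stop as the sentence boundary
    return text.lower().strip()


print(collapse_spacing("In December, 2019, a group of patients"))
# in december a group of patients
```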


In [36]:
# import pickle
# pickle.dump(Dataset, open('dataset.pkl', 'wb'))

*** 

# Section 2 - Relevant Paragraph Extraction

1. Getting a list of paragraphs from the dataset
2. Entity extraction and Indexing
3. Paragraph Ranking Using Spacy Similarity
4. Creating a list of top 10 paragraphs to be used as context for the Question Answering Model
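
Steps 2 and 3 amount to building an inverted index from entity to paragraph positions and intersecting the posting sets for the entities in a query. The toy sketch below illustrates the idea with lowercase whitespace tokens standing in for SciSpacy entities. That substitution, the sample paragraphs and the names `extract_terms` and `candidates` are assumptions made for illustration.

```python
from collections import defaultdict

paragraphs = [
    "fever and cough are common symptoms",
    "vaccines reduce severe disease",
    "cough spreads droplets",
]


def extract_terms(text):
    """Stand-in for named-entity extraction: unique lowercase tokens (assumption)."""
    return set(text.lower().split())


# Inverted index: term -> set of paragraph positions that contain it.
index = defaultdict(set)
for i, paragraph in enumerate(paragraphs):
    for term in extract_terms(paragraph):
        index[term].add(i)


def candidates(query):
    """Intersect the posting sets of every query term known to the index."""
    known = [index[t] for t in extract_terms(query) if t in index]
    return set.intersection(*known) if known else set()


print(sorted(candidates("cough symptoms")))  # [0]
print(sorted(candidates("cough")))           # [0, 2]
```

Only the candidate paragraphs then need to be scored for similarity, which keeps the expensive step small.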

***


In [37]:
import itertools
lib = Dataset.pp_body_text.tolist()     #   getting the preprocessed body text 
lib = list(itertools.chain(*lib))       #   flattening the nested lists into a single list


libBT = Dataset.body_text.tolist()      #   getting the original (unprocessed) body text
libBT = list(itertools.chain(*libBT))   #   flattening the nested lists into a single list

In [38]:
del Dataset # deleting the Dataset Dataframe to free up memory 

In [39]:
from multiprocessing import Pool
from tqdm import tqdm
import spacy

nlp = spacy.load("en_core_sci_sm")
nlp.max_length = 15000000

In [40]:
# Defining a function to extract named entities from the paper
def process_paper(paper):
    doc = nlp(str(paper))  # Converting paper to string and processing it using spacy
    entities = set([ent.text for ent in doc.ents])  # Extracting unique named entities from the paper
    return entities

# Creating a dictionary to store the indices of papers that contain each named entity
entity_index = {}

# Using multiprocessing to process papers in parallel for faster execution
with Pool(15) as pool:  # Creating a pool of 15 worker processes
    entities = pool.map(process_paper, lib)  # Applying the process_paper function to each paper in lib

# Iterating over the extracted entities from each paper to update the entity index
for i, paper_entities in tqdm(enumerate(entities), total=len(entities)):
    for entity in paper_entities:
        if entity not in entity_index:
            entity_index[entity] = set()  # Creating a new set for the entity if it doesn't exist in the index
        entity_index[entity].add(i)  # Adding the index of the paper that contains the entity to the set

100%|██████████| 1824793/1824793 [38:16<00:00, 794.50it/s]  


In [41]:
import pickle
pickle.dump(entity_index, open('entity_index.pkl', 'wb'))

In [42]:
# import pickle
# entity_index = pickle.load(open('entity_index.pkl', 'rb'))

In [43]:
# Support Functions for Section 2

# Define a function to compute the similarity between a question and an article
def compute_similarity_mp(args):
    # Unpack the arguments
    question, article = args
    # Create Doc objects for the question and article using the nlp pipeline
    question_doc = nlp(question)
    article_doc = nlp(str(lib[article]))
    # Compute the similarity between the question and article and return a tuple with the article index and the similarity score
    return (article, question_doc.similarity(article_doc))

# Define a function to search for articles that match a given query using multiprocessing
def search_mp(query, entity_index):
    # Create a Doc object for the query using the nlp pipeline
    doc = nlp(query)
    # Extract the named entities from the query and store them in a set
    query_entities = set([ent.text for ent in doc.ents])
    # Initialize a set of article indices that match all the named entities in the query
    matching_articles = set(range(len(libBT)))
    for entity in tqdm(query_entities):
        # Check if the named entity is in the entity index
        if entity in entity_index:
            # Update the set of matching articles to only include articles that contain the named entity
            matching_articles &= entity_index[entity]
    # Create a multiprocessing pool with 15 workers
    pool = mp.Pool(15)
    # Create a list of arguments to pass to the compute_similarity_mp function
    args = [(query, i) for i in matching_articles]
    # Use the multiprocessing pool to compute the similarity scores for each matching article
    results = pool.map(compute_similarity_mp, args)
    # Sort the results by similarity score in descending order
    results.sort(key=lambda x: x[-1], reverse=True)
    # Close the multiprocessing pool and wait for all tasks to complete
    pool.close()
    pool.join()
    # Return the top 10 matching articles
    return results[:10]
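
`search_mp` sorts every scored paragraph and then keeps the first ten. When the candidate set is large, a top-k selection avoids the full sort. The sketch below uses `heapq.nlargest` and a Jaccard token overlap as a stand-in for spaCy's `Doc.similarity`. The scoring function, the sample paragraphs and the names `jaccard` and `top_k` are assumptions made for illustration.

```python
import heapq


def jaccard(a, b):
    """Token-overlap score in [0, 1]; a stand-in for an embedding similarity (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_k(query, paragraphs, candidate_ids, k=10):
    """Return up to k (paragraph_id, score) pairs, best score first."""
    scored = ((i, jaccard(query, paragraphs[i])) for i in candidate_ids)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


paragraphs = [
    "covid spreads through droplets",
    "vaccines reduce severe disease",
    "droplets carry the virus",
]
best = top_k("how does covid spread through droplets", paragraphs, {0, 1, 2}, k=2)
print([i for i, _ in best])  # [0, 2]
```

An empty candidate set yields an empty list, which matches the "no results" check performed by the caller.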


***

# Section 3

[1] Prompt the user to input a question. \
[2] Call the search_mp() function to search for the user's query in the list of paragraphs and store the results in the result variable.\
[3] Check if the result variable is empty. If it is, notify the user that no results were found and exit the function.\
[4] Concatenate the relevant paragraphs into a single context.\
[5] Use the pre-trained RoBERTa model to find the answer to the user's query within the context.\
[6] Replace the question words in the user's query with the [CLS] token, and combine it with the answer to form a new answer_text string.\
[7] Use the pre-trained Bart model to generate a summary of the answer.\
[8] Prompt the user to choose whether or not to hear the answer.\
[9] If the user wants to hear the answer, call the text_to_speech() function to speak the answer aloud.\
[10] Print the user's question and the generated answer to the console.
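
Step [6] can be sketched in isolation. The notebook's code uses chained `str.replace` calls, which also match inside longer words (for example "how" inside "show"). The variant below uses a word-boundary regular expression instead. `build_answer_text` is a hypothetical helper and the sample question is made up for illustration.

```python
import re

# Match the question words only as whole words.
QUESTION_WORDS = re.compile(r"\b(what|when|how|where|which|who)\b")


def build_answer_text(query, answer):
    """Replace question words with [CLS] and append the extracted answer (hypothetical helper)."""
    question = QUESTION_WORDS.sub("[CLS]", query.lower())
    return f"{question} [SEP] {answer}[SEP]"


print(build_answer_text("How does the chart show spread?", "via droplets"))
# [CLS] does the chart show spread? [SEP] via droplets[SEP]
```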


In [44]:
import pyttsx3
def text_to_speech(text):
    # Initialize the pyttsx3 engine
    engine = pyttsx3.init()

    # Set the speech rate and volume
    rate = engine.getProperty('rate')
    engine.setProperty('rate', 183)
    engine.setProperty('volume', 1)

    # Convert the text to speech
    engine.say(text)
    engine.runAndWait()

In [45]:
print(len(entity_index.keys()))

4784616


In [49]:
from transformers import pipeline

def query_run(text):
    """Answers a user's query by searching through a list of paragraphs and generating a summary using pre-trained models."""
    
    # Use the question passed in as the query
    query = text
    
    # Search for relevant paragraphs using the user's query
    print(f"Searching for answer for the question: {query} in the list of paragraphs...")
    result = search_mp(query, entity_index)
    
    # If no results are found, notify the user and exit the function
    if len(result) == 0:
        print("No results found.")
        return
    
    # Concatenate the relevant paragraphs into a single context
    context = '. '.join([libBT[a[0]] for a in result])
    
    # Use the Roberta model to find the answer to the user's query within the context
    print("Searching the answer in the top matching paragraphs using Roberta...")
    qa_model = pipeline("question-answering",model='deepset/roberta-base-squad2-covid')
    ans= qa_model(question = query, context = context)
    print("Answer Snippit: " + ans['answer'])

    # Replace the question words in the user's query with the [CLS] token, and combine it with the answer
    question = query.lower().replace('what','[CLS]').replace('when','[CLS]').replace('how','[CLS]').replace('where','[CLS]').replace('which','[CLS]').replace('who','[CLS]')
    answer_text = f"{question} [SEP] {ans['answer']}[SEP]"
    
    # Use the Bart model to generate a summary of the answer
    print("Generating answer using facebook/bart-large-cnn pretrained model...")
    summarizer = pipeline("summarization", model='facebook/bart-large-cnn')
    summary = summarizer(answer_text, min_length=5, max_length=40)
    
    # Ask the user if they want to hear the answer, and speak it if they do
    #t2s = input("Do you want to hear the answer? (y/n) ")
    #if t2s.lower() == 'y':
        #text_to_speech(summary[0]['summary_text'])
    
    # Print the answer to the console
    print(f"Question: {query}\nAnswer: {summary[0]['summary_text']}\n\n\n\n")

# Run query_run on each question in the list
questions = ["When was covid-19 outbreak declared as a global pandemic?",
        "What health measures should be taken for covid-19?",
        "How to control the spread of coronavirus?",
        "Where did coronavirus start?",
       "How does covid-19 spread?",
       "What are the symptoms of coronavirus?",
       "Which country became the epicenter of the global pandemic in 2021?",
       "What could be the origin of the coronavirus?",
       "What are the vaccines approved against covid-19?",
       "What causes covid-19?"]
[ query_run(query) for query in questions ]

Searching for answer for the question: When was covid-19 outbreak declared as a global pandemic? in the list of paragraphs...


100%|██████████| 3/3 [00:00<00:00, 10.78it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: March 11
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 29. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)


Question: When was covid-19 outbreak declared as a global pandemic?
Answer: Was covid-19 outbreak declared as a global pandemic? [SEP] March 11.




Searching for answer for the question: What health measures should be taken for covid-19? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00, 210.70it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: ."
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 26. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


Question: What health measures should be taken for covid-19?
Answer: "What health measures should be taken for covid-19? [SEP] ."




Searching for answer for the question: How to control the spread of coronavirus? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00,  7.60it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: total lockdown is in place in India from 24 th March 2020 for 21 days
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 38. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


Question: How to control the spread of coronavirus?
Answer: Total lockdown is in place in India from 24 th March 2020 for 21 days. [CLS] to control the spread of coronavirus?




Searching for answer for the question: Where did coronavirus start? in the list of paragraphs...


100%|██████████| 1/1 [00:00<00:00,  5.54it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: Wuhan, Hubei Province, China
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 30. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


Question: Where did coronavirus start?
Answer: Did coronavirus start in Wuhan, China?




Searching for answer for the question: How does covid-19 spread? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00, 33.74it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: tracing contacts within 3 days of cases being confirmed
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 30. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


Question: How does covid-19 spread?
Answer: Covid-19 can spread within 3 days of cases being confirmed, according to the CDC.




Searching for answer for the question: What are the symptoms of coronavirus? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00,  9.44it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: cough, fever, and breathlessness
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 29. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)


Question: What are the symptoms of coronavirus?
Answer: What are the symptoms of coronavirus? cough, fever, and breathlessness.




Searching for answer for the question: Which country became the epicenter of the global pandemic in 2021? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00, 53.40it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: Saudi Arabia
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 29. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)


Question: Which country became the epicenter of the global pandemic in 2021?
Answer: Saudi Arabia could become the epicenter of the global pandemic in 2021.




Searching for answer for the question: What could be the origin of the coronavirus? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00, 73.59it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: China
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 25. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


Question: What could be the origin of the coronavirus?
Answer: China could be the origin of the coronavirus?




Searching for answer for the question: What are the vaccines approved against covid-19? in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00, 56.59it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: 2021
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 25. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


Question: What are the vaccines approved against covid-19?
Answer: Vaccines approved against covid-19 will be available in 2021.




Searching for answer for the question: What causes covid-19? in the list of paragraphs...


100%|██████████| 1/1 [00:00<00:00, 605.06it/s]


KeyboardInterrupt: 

The run was interrupted manually as a precaution (CPU temperature > 98 °C).

In [46]:
from transformers import pipeline

def answer():
    """Answers a user's query by searching through a list of paragraphs and generating a summary using pre-trained models."""
    
    # Prompt user for query
    query = input("What is your question? ")
    
    # Search for relevant paragraphs using the user's query
    print(f"Searching for answer for the question: {query} in the list of paragraphs...")
    result = search_mp(query, entity_index)
    
    # If no results are found, notify the user and exit the function
    if len(result) == 0:
        print("No results found.")
        return
    
    # Concatenate the relevant paragraphs into a single context
    context = '. '.join([libBT[a[0]] for a in result])
    
    # Use the Roberta model to find the answer to the user's query within the context
    print("Searching the answer in the top matching paragraphs using Roberta...")
    qa_model = pipeline("question-answering",model='deepset/roberta-base-squad2-covid')
    ans= qa_model(question = query, context = context)
    print("Answer Snippit: " + ans['answer'])

    # Replace the question words in the user's query with the [CLS] token, and combine it with the answer
    question = query.lower().replace('what','[CLS]').replace('when','[CLS]').replace('how','[CLS]').replace('where','[CLS]').replace('which','[CLS]').replace('who','[CLS]')
    answer_text = f"{question} [SEP] {ans['answer']}[SEP]"
    
    # Use the Bart model to generate a summary of the answer
    print("Generating answer using facebook/bart-large-cnn pretrained model...")
    summarizer = pipeline("summarization", model='facebook/bart-large-cnn')
    summary = summarizer(answer_text, min_length=5, max_length=40)
    
    # Ask the user if they want to hear the answer, and speak it if they do
    t2s = input("Do you want to hear the answer? (y/n) ")
    if t2s.lower() == 'y':
        text_to_speech(summary[0]['summary_text'])
        print("Playing Answer Snippet audio...")
    else:
        print("Skipping Answer Snippet audio...")
    
    # Print the answer to the console
    print(f"Question: {query}\nAnswer: {summary[0]['summary_text']}")

# Call the answer function to begin the question-answering process
answer()


Searching for answer for the question: "What could be the origin of the coronavirus?" in the list of paragraphs...


100%|██████████| 2/2 [00:00<00:00, 171.84it/s]


Searching the answer in the top matching paragraphs using Roberta...
Answer Snippit: Huanan Seafood Market (in Wuhan, China),
Generating answer using facebook/bart-large-cnn pretrained model...


Your max_length is set to 40, but you input_length is only 38. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


Question: "What could be the origin of the coronavirus?"
Answer: Huanan Seafood Market (in Wuhan, China) could be the origin of the coronavirus.


References

[1] Sentence Transformers, "All-MiniLM-L6-v2", https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, accessed on April 19, 2023.\
[2] Doc2Vec Library, Gensim, https://radimrehurek.com/gensim/models/doc2vec.html, accessed on April 19, 2023.\
[3] SciSpacy, https://allenai.github.io/scispacy/, accessed on April 19, 2023.\
[4] RoBERTa model, Deepset, https://huggingface.co/deepset/roberta-base-squad2-covid, accessed on April 19, 2023.\
[5] Squad Style annotated Squad dataset, COVID-QA, https://github.com/deepset-ai/COVID-QA/blob/master/data/question-answering/200423_covidQA.json, accessed on April 19, 2023.\
[6] Facebook's bart-large-cnn model, Hugging Face, https://huggingface.co/facebook/bart-large-cnn, accessed on April 19, 2023.\
[7] T5-Base_GNAD model, Hugging Face, https://huggingface.co/Einmalumdiewelt/T5-Base_GNAD, accessed on April 19, 2023.\
[8] Spacy, https://spacy.io/models, accessed on April 19, 2023.\
[9] pyttsx3, https://pypi.org/project/pyttsx3/, accessed on April 19, 2023.