In this notebook we explore what the Haystack framework and Hugging Face Transformers offer for building and deploying a question answering pipeline behind a REST API.

# **Environment Setup**

**Install Global Dependencies**

In [None]:
!pip install transformers --quiet

In [None]:
!pip install datasets --quiet

In [None]:
!pip install -U sentence-transformers

# **Dataset Import and Processing**

**Import Dependencies**

In [None]:
import json
import pandas as pd
from datasets import load_dataset,arrow_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from numpy import random

time: 736 µs (started: 2023-03-28 13:58:57 +00:00)


**Download**

In [None]:
!wget https://raw.githubusercontent.com/deepset-ai/COVID-QA/master/data/question-answering/COVID-QA.json

**JSON to Dataset**

In [None]:
ds = load_dataset('json',data_files = "/notebooks/COVID-QA.json",field="data")

In [None]:
ds = ds["train"].train_test_split(test_size=0.25)

In [None]:
ds["train"].to_json("train.json",lines=False)

In [None]:
ds["test"].to_json("test.json",lines=False)

**Reader Model Testing**

In [None]:
model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)
context = ds["train"]["paragraphs"][5][0]["context"]
question = ds["train"]["paragraphs"][5][0]["qas"][0]["question"]

In [None]:
pipe = pipeline("question-answering",model=model,tokenizer=tokenizer)

In [None]:
pipe(question=question,context=context,top_k=3)

[{'score': 0.6244222521781921,
  'start': 2380,
  'end': 2412,
  'answer': 'interferon-induced transmembrane'},
 {'score': 0.5352410674095154,
  'start': 6107,
  'end': 6147,
  'answer': 'bonerestricted IFITM-like (BRIL) protein'},
 {'score': 0.48188379406929016, 'start': 1654, 'end': 1657, 'answer': 'TM1'}]

In [None]:
question

'What is IFITM?'

In [None]:
ds["train"]["paragraphs"][5][0]["qas"][0]["answers"][0]["text"] #ground-truth

'interferon-induced transmembrane'
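
The indexing used above (`["paragraphs"][5][0]["qas"][0]["answers"][0]["text"]`) follows the SQuAD-style nesting of the dataset: each record holds a list of paragraphs, each paragraph has a `context` and a list of `qas`, and each question has a list of `answers`. The sketch below flattens that nesting on a hand-made toy record (invented for illustration, not taken from COVID-QA); `flatten_squad` is a hypothetical helper name.

```python
def flatten_squad(records):
    """Return one row per answer: question, answer text and answer start offset."""
    rows = []
    for record in records:
        for paragraph in record["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    rows.append({
                        "question": qa["question"],
                        "answer": answer["text"],
                        "answer_start": answer["answer_start"],
                    })
    return rows


# Toy record mimicking the SQuAD layout, invented for illustration only
toy_context = "IFITM stands for interferon-induced transmembrane protein."
toy = [{
    "paragraphs": [{
        "context": toy_context,
        "qas": [{
            "question": "What is IFITM?",
            "is_impossible": False,
            "answers": [{"text": "interferon-induced transmembrane", "answer_start": 17}],
        }],
    }]
}]

rows = flatten_squad(toy)
start = rows[0]["answer_start"]
# In an extractive dataset the answer is a substring of the context at answer_start
assert toy_context[start:start + len(rows[0]["answer"])] == rows[0]["answer"]
print(rows)
```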

# **Building Question Answer Pipeline with Haystack**

**Install Dependencies**

In [None]:
!pip uninstall lxml -y

In [None]:
!pip install lxml

In [None]:
!pip install farm-haystack --quiet

In [None]:
!pip install farm-haystack[faiss] --quiet

**Import Dependencies**

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes.retriever import BM25Retriever,DensePassageRetriever
from haystack.nodes.reader import FARMReader
from haystack.nodes import PreProcessor
from haystack.pipelines import Pipeline
from haystack.utils import print_answers

**Start Elasticsearch as a Background Process**

In [None]:
url = """https://artifacts.elastic.co/downloads/elasticsearch/\
elasticsearch-7.9.2-linux-x86_64.tar.gz"""
!wget -nc -q {url}
!tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz

In [None]:
# Run Elasticsearch as a background process
!chown -R daemon:daemon elasticsearch-7.9.2
es_server = Popen(args=['elasticsearch-7.9.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
# Wait until Elasticsearch has started
!sleep 60
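
A fixed 60-second sleep can be too short on a slow machine and wasteful on a fast one. As a possible alternative, the sketch below polls until the server answers. It assumes Elasticsearch listens on the default `http://localhost:9200`; `http_ok` and `wait_until` are hypothetical helper names, not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, timeout=120.0, interval=2.0):
    """Call probe() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage (assumes the default Elasticsearch address):
# ready = wait_until(lambda: http_ok("http://localhost:9200"))
```

Passing the probe as a callable keeps the waiting logic independent of the service being checked.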

In [None]:
!curl -X GET "localhost:9200/?pretty"

{
  "name" : "nbi89opxsn",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "8x35wew6RNG-0kH9QRCF9w",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


**Loading Documents into the Elasticsearch Document Store**

In [None]:
document_store = ElasticsearchDocumentStore(return_embedding=True)

In [None]:
document_store.delete_documents()
document_store.delete_labels()

time: 6.04 s (started: 2023-03-28 13:59:11 +00:00)


In [None]:
documents = []
for split, dataset in ds.items():
  # Read the "paragraphs" column once per split instead of once per row
  for paragraphs in dataset["paragraphs"]:
    document = {"content": paragraphs[0]["context"],
                "meta": {"split": split}}
    documents.append(document)

time: 8.81 s (started: 2023-03-28 13:59:20 +00:00)


In [None]:
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=1000,
    split_respect_sentence_boundary=True,
)
documents = preprocessor.process(documents)

In [None]:
document_store.write_documents(documents,index="document")

In [None]:
print(f"{document_store.get_document_count()} documents were loaded")
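
The `PreProcessor` settings above split every context into chunks of at most 1000 words without cutting through a sentence. The following pure-Python sketch illustrates that idea only. It is not Haystack's implementation: the regex sentence splitter is a naive assumption and `split_by_words` is a hypothetical name.

```python
import re


def split_by_words(text, split_length):
    """Greedily pack whole sentences into chunks of at most split_length words.

    A single sentence longer than split_length becomes its own oversized chunk.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the limit
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
print(split_by_words(text, split_length=5))
# ['One two three. Four five.', 'Six seven eight nine. Ten.']
```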

**Instantiate and Test Retriever Component**

In [None]:
bm25 = BM25Retriever(document_store = document_store)

In [None]:
retrieved_doc = bm25.retrieve(query = question,top_k=3)

In [None]:
for doc in retrieved_doc:
  print(doc.content[:150])

Architectural Insight into Inovirus-Associated Vectors (IAVs) and Development of IAV-Based Vaccines Inducing Humoral and Cellular Responses: Implicati
The distal end of the virion (right) consists of five copies of each minor coat proteins gp7 and gp9, modeled following the helical parameters of gp8 
Schematic representations of antigen display on the surface of Ff inovirus-associated vectors (IAVs). Foreign antigens are shown as red spheres. The d


In [None]:
context[:50]

'Architectural Insight into Inovirus-Associated Vec'

**Instantiate and Test Reader Component**

In [None]:
reader = FARMReader(model_name_or_path=model_ckpt,max_seq_len=512,doc_stride=128,max_query_length=200,return_no_answer=True,progress_bar=False)

In [None]:
print(reader.predict_on_texts(question,top_k=1,texts=[context]))

**Instantiate and Test the Pipeline**

In [None]:
pipe = Pipeline()
pipe.add_node(component=bm25,name="Retriever",inputs=["Query"])
pipe.add_node(component=reader,name="Reader",inputs=["Retriever"])

In [None]:
preds = pipe.run(query=question,params={"Retriever":{"top_k":3},"Reader":{"top_k":5}},debug=True)

In [None]:
print_answers(
    preds,
    details="all"
)


Query: What are inovirus-associated vectors?
Answers:
[   <Answer {'answer': 'engineered, non-lytic, filamentous bacteriophages', 'type': 'extractive', 'score': 0.8232371807098389, 'context': '\n\nAbstract: Inovirus-associated vectors (IAVs) are engineered, non-lytic, filamentous bacteriophages that are assembled primarily from thousands of co', 'offsets_in_document': [{'start': 454, 'end': 503}], 'offsets_in_context': [{'start': 51, 'end': 100}], 'document_ids': ['15e77b2f9ef601536928de6a42d60db2'], 'meta': {'split': 'train', '_split_id': 0}}>,
    <Answer {'answer': 'IAVs', 'type': 'extractive', 'score': 0.6021418571472168, 'context': 'ults in the presentation of oligopeptides as fusion proteins on the surface of the virion and are herein termed IAVs for inovirus-associated vectors. ', 'offsets_in_document': [{'start': 564, 'end': 568}], 'offsets_in_context': [{'start': 112, 'end': 116}], 'document_ids': ['7d19633cbf2eed6349a144804452115c'], 'meta': {'split': 'train', '_split_id': 2

# **Pipeline Evaluation**

In this section we evaluate the pipeline to get a rough idea of the system's performance and to select the best retriever.
In the following sections we fine-tune the best dense retriever and then repeat the same process with the reader.
We label only the test data so that these results are directly comparable with the evaluation we run after fine-tuning.

**Summary:**

1.   Evaluate BM25 + reader model
2.   Evaluate DPR + reader model
3.   Evaluate EmbeddingRetriever (ER) + reader model
4.   Fine-tune the best dense model (DPR/ER) on the training data and evaluate it
5.   Fine-tune the reader model and evaluate it






**Import Dependencies**

In [None]:
from haystack.nodes import EmbeddingRetriever

**Utility Functions**

In [None]:
def convert_json_to_squad_format(path):
  # Wrap the exported JSON array in a top-level "data" key, as the SQuAD format expects
  with open(path,"r") as f:
    text = f.read().strip()
  with open(path,"w") as f:
    f.write('{"data": ' + text + "}")
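
The helper above edits the file as plain text, so running the cell twice wraps the content twice. A hedged alternative, sketched below, does the same wrapping through the `json` module and skips files that already have a top-level `data` key. `wrap_in_data_key` is a hypothetical name, and the demo uses a temporary toy file rather than the notebook's data.

```python
import json
import os
import tempfile


def wrap_in_data_key(path):
    """Wrap the JSON array stored at path as {"data": [...]}; no-op if already wrapped."""
    with open(path, "r", encoding="utf-8") as f:
        content = json.load(f)
    if isinstance(content, dict) and "data" in content:
        return False  # already in the expected layout
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"data": content}, f)
    return True


# Demo on a temporary file holding a toy list of records
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump([{"paragraphs": []}], f)
    first = wrap_in_data_key(demo_path)
    second = wrap_in_data_key(demo_path)  # second call leaves the file untouched
    with open(demo_path, encoding="utf-8") as f:
        wrapped = json.load(f)

print(first, second, wrapped)
# True False {'data': [{'paragraphs': []}]}
```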

**Adding Evaluation Data to Document Store**

In [None]:
convert_json_to_squad_format("/notebooks/train.json")

In [None]:
convert_json_to_squad_format("/notebooks/test.json")

In [None]:
document_store.add_eval_data(
    filename="/notebooks/test.json",
    doc_index = "eval_docs",
    label_index = "eval_labels"
)

In [None]:
eval_labels = document_store.get_all_labels_aggregated(index="eval_labels")

**Evaluate BM25 + minilm-uncased-squad2**

In [None]:
eval_results = pipe.eval(labels=eval_labels,params={"Retriever":{"top_k":3},"Reader":{"top_k":1}})

In [None]:
pipe.print_eval_report(eval_results) 

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | map: 0.581
                        | mrr: 0.582
                        | ndcg: 0.601
                        | precision: 0.225
                        | recall_multi_hit: 0.657
                        | recall_single_hit: 0.657
                        |
                      Reader
                        |
                        | exact_match: 0.201
                        | exact_match_top_1: 0.201
                        | f1: 0.318
                        | f1_top_1: 0.318
                        | num_examples_for_eval: 5.91e+02
                        | num_examples_for_eval_top_1: 5.91e+02
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	How does the infected airway cell respond?
Gold Document Ids: 
 	b32659e119337dac1ac5
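
The retriever figures in the report above are ranking metrics. As a rough sketch of two of them: `recall_single_hit` is the share of queries for which at least one gold document appears in the top-k results, and `mrr` averages the reciprocal rank of the first gold document. The code below uses invented document ids and is an illustration of the definitions, not Haystack's own evaluation code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, gold_ids, k):
    """1/rank of the first gold document within the top-k results, or 0.0 if absent."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


# Toy retrieval results: (ranked document ids, gold document ids) per query
results = [
    (["d1", "d7", "d3"], {"d1"}),  # hit at rank 1
    (["d4", "d2", "d9"], {"d2"}),  # hit at rank 2
    (["d5", "d6", "d8"], {"d0"}),  # miss
]
k = 3
recall = sum(recall_at_k(r, g, k) for r, g in results) / len(results)
mrr = sum(reciprocal_rank(r, g, k) for r, g in results) / len(results)
print(f"recall_single_hit@{k}: {recall:.3f}, mrr@{k}: {mrr:.3f}")
# recall_single_hit@3: 0.667, mrr@3: 0.500
```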

We will now try two embedding-based semantic search approaches (DPR and EmbeddingRetriever) and fine-tune the better one to see whether it can outperform BM25.

**Evaluate DPR + minilm-uncased-squad2**

In [None]:
query_dpr_model = "facebook/dpr-question_encoder-single-nq-base"
passage_dpr_model = "facebook/dpr-ctx_encoder-single-nq-base"

In [None]:
dpr = DensePassageRetriever(document_store=document_store,
                            query_embedding_model=query_dpr_model,
                            passage_embedding_model=passage_dpr_model,
                            embed_title=False)

In [None]:
document_store.update_embeddings(retriever=dpr)
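
Dense retrievers rank passages by the similarity between a query embedding and passage embeddings computed ahead of time, which is why `update_embeddings` has to run before querying. The toy sketch below uses hand-made 3-dimensional vectors in place of model outputs and the dot product as the score (the similarity DPR is trained with). All names and vectors are invented for illustration.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(query_vec, passage_vecs, k):
    """Return the k (passage id, score) pairs with the highest dot product, best first."""
    scored = [(dot(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(pid, score) for score, pid in scored[:k]]


# Hand-made "embeddings"; a real retriever would produce these with a neural encoder
passages = {
    "p_ifitm": [0.9, 0.1, 0.0],
    "p_vaccine": [0.1, 0.8, 0.2],
    "p_ards": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranked = top_k_by_dot_product(query, passages, k=2)
print(ranked)  # p_ifitm first, then p_vaccine
```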

In [None]:
pipe = Pipeline()
pipe.add_node(component=dpr,name="Retriever",inputs=["Query"])
pipe.add_node(component=reader,name="Reader",inputs=["Retriever"])

In [None]:
eval_results = pipe.eval(labels=eval_labels,params={"Retriever":{"top_k":3},"Reader":{"top_k":1}})

In [None]:
pipe.print_eval_report(eval_results)

**Evaluate EmbeddingRetriever + minilm-uncased-squad2**

In [None]:
embedding_model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

In [None]:
er = EmbeddingRetriever(document_store = document_store,
                        embedding_model=embedding_model)

In [None]:
document_store.update_embeddings(er)

In [None]:
pipe = Pipeline()
pipe.add_node(component=er,name="Retriever",inputs=["Query"])
pipe.add_node(component=reader,name="Reader",inputs=["Retriever"])

In [None]:
eval_results = pipe.eval(labels=eval_labels,params={"Retriever":{"top_k":3},"Reader":{"top_k":1}})

In [None]:
pipe.print_eval_report(eval_results)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | map: 0.436
                        | mrr: 0.437
                        | ndcg:  0.46
                        | precision: 0.187
                        | recall_multi_hit: 0.529
                        | recall_single_hit:  0.53
                        |
                      Reader
                        |
                        | exact_match: 0.162
                        | exact_match_top_1: 0.162
                        | f1:  0.27
                        | f1_top_1:  0.27
                        | num_examples_for_eval: 5.91e+02
                        | num_examples_for_eval_top_1: 5.91e+02
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	Can biomarkers be used to predict outcomes in acute respiratory distress (ARDS) patie

# **Retriever Improvement**

**Utility Functions**

In [None]:
def ds_to_er_format(ds):
  # Build (question, positive document) pairs, skipping unanswerable questions
  train_set = []
  for item in ds:
    for label in item[0]["qas"]:
      if not label["is_impossible"]:
        train_set.append({"question": label["question"],
                          "pos_doc": item[0]["context"]})
  return train_set

**Fine-Tuning EmbeddingRetriever**

In [None]:
er = EmbeddingRetriever(document_store = document_store,
                        embedding_model=embedding_model)

In [None]:
train_ds = load_dataset("json",data_files= "/notebooks/train.json",field="data")

In [None]:
train_ds = train_ds["train"]["paragraphs"]

In [None]:
train_set = ds_to_er_format(train_ds)

In [None]:
er.train(training_data=train_set,learning_rate=1e-05)

In [None]:
document_store.update_embeddings(er)

In [None]:
pipe = Pipeline()
pipe.add_node(component=er,name="Retriever",inputs=["Query"])
pipe.add_node(component=reader,name="Reader",inputs=["Retriever"])

In [None]:
eval_results = pipe.eval(labels=eval_labels,params={"Retriever":{"top_k":3},"Reader":{"top_k":1}})

In [None]:
pipe.print_eval_report(eval_results)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | map: 0.604
                        | mrr: 0.606
                        | ndcg: 0.627
                        | precision: 0.238
                        | recall_multi_hit: 0.691
                        | recall_single_hit: 0.693
                        |
                      Reader
                        |
                        | exact_match: 0.184
                        | exact_match_top_1: 0.184
                        | f1: 0.335
                        | f1_top_1: 0.335
                        | num_examples_for_eval: 6.58e+02
                        | num_examples_for_eval_top_1: 6.58e+02
                        |
                      Output

                Wrong Retriever Examples
Query: 
 	What was the purpose of the research?
Gold Document Ids: 
 	38edff45e2b790704e8f279b3

# **Reader Improvement**

**minilm-uncased-squad2 Fine-Tuning**

In [None]:
reader = FARMReader(model_name_or_path=model_ckpt,max_seq_len=384,doc_stride=128,max_query_length=200,return_no_answer=True,no_ans_boost=-100,progress_bar=False)

In [None]:
reader.train(
    data_dir="/notebooks",
    train_filename="/notebooks/train.json",
    use_gpu=True,
    n_epochs=3,
    batch_size=24,
    learning_rate=3e-5,
    save_dir="/notebooks/ft_minilm",
    checkpoint_every = 1,
    checkpoint_root_dir = "/notebooks/ft_minilm/checkpoints",
    use_amp=True,
    warmup_proportion = 0.1
)




**Evaluating BM25 + Fine-Tuned minilm-uncased-squad2**

In [None]:
pipe = Pipeline()
pipe.add_node(component=bm25,name="Retriever",inputs=["Query"])
pipe.add_node(component=reader,name="Reader",inputs=["Retriever"])

In [None]:
eval_results = pipe.eval(labels=eval_labels,params={"Retriever":{"top_k":3},"Reader":{"top_k":1}})

In [None]:
pipe.print_eval_report(eval_results)

In [None]:
eval_results = reader.eval(document_store=document_store,label_index="eval_labels",doc_index="eval_docs")

In [None]:
eval_results

{'EM': 20.72072072072072,
 'f1': 50.32282947786072,
 'top_n_accuracy': 82.43243243243244,
 'top_n': 4,
 'reader_time': 135.27690143883228,
 'seconds_per_query': 0.30467770594331595,
 'EM_text_answer': 20.72072072072072,
 'f1_text_answer': 50.32282947786072,
 'top_n_accuracy_text_answer': 82.43243243243244,
 'top_n_EM_text_answer': 26.576576576576578,
 'top_n_f1_text_answer': 66.15020245147878,
 'Total_text_answer': 444,
 'EM_no_answer': 0,
 'f1_no_answer': nan,
 'top_n_accuracy_no_answer': nan,
 'Total_no_answer': 0}
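
`EM` and `f1` in the dictionary above are reported on a 0-100 scale. Exact match checks whether the predicted span equals the gold answer after normalization, while F1 measures token overlap between the two. The sketch below follows the usual SQuAD conventions (lowercasing, removing punctuation and English articles); the exact normalization used by Haystack may differ, so treat this as an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if both strings are identical after normalization, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "engineered, non-lytic, filamentous bacteriophages"
prediction = "filamentous bacteriophages"
print(exact_match(prediction, truth), round(f1_score(prediction, truth), 3))
# 0.0 0.667
```

A partially correct span scores zero on exact match but still earns F1 credit, which is consistent with the gap between the two values in the report.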

# **Generative LFQA**

We now try generative long-form question answering (LFQA) and compare it with the extractive solution.

**Import Dependencies**

In [None]:
from haystack.nodes import Seq2SeqGenerator
from datasets import load_dataset
from haystack.schema import Document

time: 452 µs (started: 2023-03-28 18:20:31 +00:00)


**Utility Functions**

In [None]:
def evaluate_generator(generator,ds,n):
    for e in ds[:n]:
        question = e[0]['qas'][0]['question']
        answer = e[0]['qas'][0]['answers'][0]
        context = e[0]['context']
        doc = Document(content=context)
        docs = [doc]
        
        results = generator.predict(query=question,documents=docs)
        print(f"Question:{question}\nAnswer:{answer['text']}\nGenerated Answer:{results['answers'][0].answer}\n\n\n")
        

In [None]:
test_set = load_dataset("json",data_files="/notebooks/test.json",field="data")
test_set = test_set["train"]['paragraphs']

In [None]:
generator = Seq2SeqGenerator(model_name_or_path="Davidai/lfqa_covid")

In [None]:
evaluate_generator(generator,test_set,5)

Question:What enzymes have been reported to be linked with severity of infection and various pathological conditions caused by microorganisms?
Answer:cysteine proteases
Generated Answer:I'm not sure if this is what you're looking for, but there are a few enzymes that have been shown to be associated with increased susceptibility to certain infections. The most well-known of these is the cysteine-protease inhibitor (CPI), which has been used to treat a variety of bacterial infections.



Question:What are inovirus-associated vectors?
Answer:engineered, non-lytic, filamentous bacteriophages
Generated Answer:I'm not sure if this is what you're looking for, but I'll give it a shot. Inovirus-associated vectors (IAVs) are a type of bacteriophage that infects bacteria. There are over 50 different species of filamentous viruses; the majority of them capable of infecting Gram-negative bacteria. They are made up of thousands of copies of the major coat protein gp8 and just five copies of each of

# **FAQ QA**

Finally, as a last approach, we try FAQ-style question answering, which uses the retriever's embeddings to match the query against the stored questions and returns the answer of the best match.
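
To make the mechanism concrete, here is a toy sketch of the lookup: the query is embedded, compared with the embedding of every stored question, and the answer saved with the closest question is returned, with no reader model involved. Bag-of-words counts stand in for the neural embeddings, cosine similarity is used as the score, and the FAQ entries are invented; all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (a real pipeline uses a neural encoder)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def answer_from_faq(query, faq):
    """Return (answer, matched question, score) for the stored question closest to query."""
    query_vec = embed(query)
    best = max(faq, key=lambda item: cosine(query_vec, item["embedding"]))
    return best["answer"], best["content"], cosine(query_vec, best["embedding"])


# Invented FAQ entries; question embeddings are computed once, at indexing time
faq = [
    {"content": "What tests exist for COVID?", "answer": "Viral tests and antibody tests."},
    {"content": "How does the virus spread?", "answer": "Mainly through respiratory droplets."},
]
for item in faq:
    item["embedding"] = embed(item["content"])

answer, matched, score = answer_from_faq("What are the tests for COVID?", faq)
print(answer, "| matched:", matched)
```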

**Import Dependencies**

In [None]:
from haystack.pipelines import ExtractiveQAPipeline, GenerativeQAPipeline,FAQPipeline
import pandas as pd
from haystack.nodes import EmbeddingRetriever

We need additional data to build a FAQ pipeline. We will download and process 6 sets of question-answer pairs (no context) of various sizes.

**Utility Functions**

In [None]:
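def preprocess_faq_csv(df,q_column,a_column):
  # Edit df in place so the caller's DataFrame is the one later passed to load_faq_data
  df.drop(columns=[c for c in df.columns if c not in (q_column,a_column)], inplace=True)
  df.fillna(value="", inplace=True)
  df[q_column] = df[q_column].apply(lambda x: x.strip())
  questions = list(df[q_column].values)
  df["embedding"] = er.embed_queries(queries=questions).tolist()
  # Haystack FAQ documents keep the question in "content" and the answer in "answer"
  df.rename(columns={q_column: "content", a_column: "answer"}, inplace=True)
  return df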

In [None]:
def load_faq_data(df):
  docs_to_index = df.to_dict(orient="records")
  document_store.write_documents(docs_to_index,index="faq")

In [None]:
ds_list = ["additional_data/big_faq.csv",
           "additional_data/community.csv",
           "additional_data/COVID19_FAQ.csv",
           "additional_data/multilingual.csv",
           "additional_data/news.csv",
           "additional_data/small_faq_covid.csv"]

**BIG FAQ Loading**

In [None]:
df = pd.read_csv(ds_list[0])
df = df.loc[df["language"] == "en"]
preprocess_faq_csv(df,"question","answer")
load_faq_data(df)


**Community FAQ Loading**

In [None]:
df = pd.read_csv(ds_list[1])
preprocess_faq_csv(df,"question","answer")
load_faq_data(df)

**COVID 19 FAQ**

In [None]:
df = pd.read_csv(ds_list[2])
preprocess_faq_csv(df,"questions","answers")
load_faq_data(df)


**Multilingual FAQ**

In [None]:
df = pd.read_csv(ds_list[3])
df = df.loc[df["language"] == "english"]
load_faq_data(df)


**News FAQ**

In [None]:
df = pd.read_csv(ds_list[4])
preprocess_faq_csv(df,"question","answer")
load_faq_data(df)


**Small FAQ**

In [None]:
df = pd.read_csv(ds_list[5])
preprocess_faq_csv(df,"question","answer")
load_faq_data(df)


**FAQ Pipeline Testing**

In [None]:
er = EmbeddingRetriever(document_store,embedding_model)

In [None]:
faq_pipeline = FAQPipeline(er)

In [None]:
question = "What are the tests for COVID?"
answer = faq_pipeline.run(query=question,params={"Retriever":{'top_k':1, 'index':'faq'}})
print(answer['answers'][0].answer)


There are two different types of tests available: viral tests (diagnostic) and antibody tests.

A viral (diagnostic) test tells you if you have a current infection.
An antibody test might tell you if you had a past infection. An antibody test might not show if you have a current infection because it can take 1–3 weeks after infection for your body to make antibodies. Having antibodies to the virus that causes COVID-19 might provide protection from getting infected with the virus again. If it does, we do not know how much protection the antibodies might provide or how long this protection might last.

For more information about differences between the different types of tests, please visit VDH’s Testing Webpage.
