<a href="https://colab.research.google.com/github/MartinJohannessen/repoSearch/blob/main/repoSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Haystack is an open-source framework that lets you scale open-source QA transformer models to millions of documents by pairing them with an open-source document store.

In [None]:
!pip install git+https://github.com/deepset-ai/haystack.git
#!pip install urllib3==1.25.3
!pip install grpcio==1.32.0


In [7]:
%%capture
# %%capture on the first line suppresses the cell's output.
import numpy as np
import pandas as pd

from haystack.preprocessor import PreProcessor
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.pipeline import ExtractiveQAPipeline
from haystack.retriever.dense import DensePassageRetriever

In [8]:
# Make sure a GPU is available; if not, go to Runtime > Change runtime type
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [9]:
from farm.utils import initialize_device_settings

device, n_gpu = initialize_device_settings(use_cuda=True)

Load the documents from the repository, convert them to a list of dicts, and preprocess the texts.

In [12]:
# Upload Covid Repository Task 3.xlsx to Files first
df = pd.read_excel('/content/Covid Repository Task 3.xlsx', header=1)
df = df[['Title','Full_Text', 'URL']]


dicts = []
for index, row in df.iterrows():
  dicts.append({'text':row['Full_Text'], 
                'meta':{'name':row['Title'], 
                        'URL':row['URL']
                        }})
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="sentence",
    split_length=7,
    split_respect_sentence_boundary=False
)

nested_docs = [preprocessor.process(d) for d in dicts]
docs = [d for x in nested_docs for d in x]

print(f"n_files_input: {len(dicts)}\nn_docs_output: {len(docs)}")  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
n_files_input: 101
n_docs_output: 1994
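To illustrate what `split_by="sentence"` with `split_length=7` does to a text, here is a minimal sketch in plain Python. It uses a naive regex sentence splitter instead of the NLTK tokenizer that the output above shows being downloaded, and `split_into_chunks` is a hypothetical helper, not part of Haystack.

```python
import re

def split_into_chunks(text, split_length=7):
    """Group sentences into chunks of at most `split_length` sentences."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + split_length])
        for i in range(0, len(sentences), split_length)
    ]

# Toy input: 15 short sentences, so the chunks hold 7, 7 and 1 sentences.
text = " ".join(f"Sentence number {i}." for i in range(1, 16))
chunks = split_into_chunks(text, split_length=7)
print(len(chunks))
```

Each input file becomes several shorter passages this way, which is why the number of output documents is much larger than the number of input files.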


The document store is the index that holds the documents and their embeddings, so that the retriever can search them quickly.

In [13]:
from haystack.document_store.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="dot_product")
document_store.write_documents(docs)
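A "Flat" index with dot-product similarity amounts to an exhaustive search: every stored vector is scored against the query vector and the highest scores win. The sketch below shows that idea on a toy index in plain Python. It is a stand-in for illustration only, not the FAISS API, and the names `flat_search` and `toy_index` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def flat_search(query_vec, index, top_k=2):
    """Score every stored vector against the query and return the top_k hits."""
    scored = [(dot(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest dot product first
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Toy 3-dimensional embeddings (made up for this example).
toy_index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.5, 0.5, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
hits = flat_search([1.0, 0.2, 0.0], toy_index, top_k=2)
print(hits)
```

Exhaustive search is exact but its cost grows linearly with the number of stored passages, which is acceptable for a collection of a few thousand passages like this one.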


Load the search models. The retriever finds the most relevant passages, and the reader extracts the answer from them. Alternatively, a retrieval-augmented generator can paraphrase the answer, trading explainability for quality.

In [14]:
%%capture
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid", use_gpu=True)

In [15]:
%%capture
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=False,
                                  use_fast_tokenizers=True)


In [16]:
%%capture
# Add document embeddings to the index
document_store.update_embeddings(retriever=retriever)

In [17]:
# Format answers
def print_answers(results: dict):
  """Reduce each answer to its score, URL, paper name and answer text."""
  answers = results["answers"]

  # Keep only the fields we want to show
  filtered_answers = []
  for ans in answers:
    filtered_answers.append({'score': ans['score'],
                             'URL': ans['meta']['URL'],
                             'paperName': ans['meta']['name'],
                             'answer': ans['answer']})
  return filtered_answers

Run the pipeline, test with a query

In [18]:
%%capture
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(query="What are some economic repercussions from the covid-19 spread?", top_k_retriever=50, top_k_reader=10)
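The two `top_k` arguments describe a funnel: the retriever narrows the whole collection down to `top_k_retriever` passages, and the reader then returns its `top_k_reader` best answer spans from those passages only. The following sketch mimics that flow with toy word-overlap scoring. The functions `retrieve` and `read` are hypothetical stand-ins and do not reflect how DPR or the FARM reader score text.

```python
def retrieve(query, passages, top_k_retriever):
    """Toy retriever: rank passages by the number of words shared with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k_retriever]

def read(query, candidates, top_k_reader):
    """Toy reader: score every sentence of the candidates, keep the best ones."""
    q_words = set(query.lower().split())
    answers = []
    for passage in candidates:
        for sentence in passage.split(". "):
            score = len(q_words & set(sentence.lower().split()))
            answers.append({"answer": sentence, "score": score})
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k_reader]

# Made-up passages for illustration.
passages = [
    "Lockdowns reduced transport. Unemployment rose sharply during the pandemic",
    "Vaccines were developed quickly. Trials ran in parallel",
    "The economic slowdown increased poverty. Food insecurity rose",
]
query = "economic slowdown and unemployment"
candidates = retrieve(query, passages, top_k_retriever=2)
answers = read(query, candidates, top_k_reader=1)
print(answers)
```

A larger `top_k_retriever` gives the reader more chances to find the answer but increases reader run time, since the reader is the slow stage.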


In [19]:
# Run this to collect the filtered answers
filtered_answers = print_answers(prediction)

In [20]:
# export dummy for front end
#import json
#with open('predictions.json', 'w') as fout:
#    json.dump(filtered_answers, fout)

In [21]:
filtered_answers

[{'URL': 'https://ps.psychiatryonline.org/doi/10.1176/appi.ps.202000393?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub++0pubmed&',
  'answer': 'A global economic slowdown has resulted from containment measures, accompanied by massive unemployment and fewer governmental resources to support public welfare systems. For each percentage point slowdown in the global economy, at least 14 million people are falling into poverty and food insecurity worldwide',
  'paperName': 'Social Determinants of\xa0Mental\xa0Health\xa0As Mediators and Moderators of the\xa0Mental\xa0Health\xa0Impacts of the\xa0COVID-19\xa0Pandemic.',
  'score': 8.75284194946289},
 {'URL': 'https://onlinelibrary.wiley.com/doi/10.1111/1746-692X.12288',
  'answer': 'Harvesting may be disrupted because of a lack of workers; planting because of a lack of seed or fertiliser; transport because of reduced transport facilities; market exchange because of lockdowns or social distancing.',
  'paperName': 'Covid‐19 a

Now we evaluate the system.

In [25]:
# fresh document store
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="dot_product")

In [26]:
#index names
doc_index = "evaluation_docs"
label_index = "evaluation_labels"

# Limited preprocessor capabilities for the evaluation system
preprocessor = PreProcessor(
    clean_empty_lines=False,
    clean_whitespace=False,
    clean_header_footer=False,
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=False
)

document_store.delete_all_documents(index=doc_index)
document_store.delete_all_documents(index=label_index)
document_store.add_eval_data(
    filename="/content/answers.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor,
    open_domain=True
)

labels = document_store.get_all_labels_aggregated(index=label_index)
q_to_l_dict = {
    l.question: {
        "retriever": l,
        "reader": l
    } for l in labels
}
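The evaluation file is not shown in this notebook. Assuming `answers.json` follows the SQuAD-style layout that Haystack's evaluation examples use, a minimal file could be built as in the sketch below. The title, question and context are made up for illustration.

```python
import json

# Hypothetical context and gold answer.
context = "Containment measures caused a global economic slowdown."
answer_text = "a global economic slowdown"

squad_style = {
    "data": [{
        "title": "Hypothetical paper title",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q1",
                "question": "What did containment measures cause?",
                "is_impossible": False,
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer inside the context
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }]
}

# Sanity check: every answer_start must point at the answer text.
for article in squad_style["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            for ans in qa["answers"]:
                start = ans["answer_start"]
                assert paragraph["context"][start:start + len(ans["text"])] == ans["text"]

serialized = json.dumps(squad_style, indent=2)
print(serialized[:60])
```

Wrong `answer_start` offsets are a common source of misleading scores, so a check like the loop above is worth running on a hand-labelled file.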


In [27]:
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=False,
                                  use_fast_tokenizers=True)
document_store.update_embeddings(retriever=retriever, index=doc_index)


In [28]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid", use_gpu=True, return_no_answer=True)

In [29]:
# Import the eval nodes here; the instances are used in the pipeline at the end.
from haystack.eval import EvalReader, EvalRetriever

eval_retriever = EvalRetriever()
eval_reader = EvalReader()


From here on, the cells only produce evaluation results.

In [30]:
retriever_eval_results = retriever.eval(top_k=20, label_index=label_index, doc_index=doc_index, open_domain=True)
print("Retriever Recall:", retriever_eval_results["recall"])
print("Retriever Mean Avg Precision:", retriever_eval_results["map"])


100%|██████████| 48/48 [00:14<00:00,  3.25it/s]

Retriever Recall: 0.5625
Retriever Mean Avg Precision: 0.21982135341510345
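To make the two retriever metrics concrete, the sketch below computes them on toy data. Recall here is the fraction of queries with at least one relevant document in the top k, and average precision is normalized by the number of relevant documents found. That is one common convention; the exact formula Haystack uses may differ.

```python
def recall_at_k(ranked_hits):
    """Fraction of queries with at least one relevant document in the top k."""
    return sum(any(hits) for hits in ranked_hits) / len(ranked_hits)

def mean_average_precision(ranked_hits):
    """Mean over queries of the average precision at each relevant rank."""
    ap_values = []
    for hits in ranked_hits:
        found, precisions = 0, []
        for rank, is_relevant in enumerate(hits, start=1):
            if is_relevant:
                found += 1
                precisions.append(found / rank)
        ap_values.append(sum(precisions) / found if found else 0.0)
    return sum(ap_values) / len(ap_values)

# Each inner list marks, per rank, whether the retrieved document is relevant.
toy = [
    [True, False, False],   # relevant document at rank 1
    [False, False, True],   # relevant document only at rank 3
    [False, False, False],  # nothing relevant retrieved
    [False, True, True],    # relevant documents at ranks 2 and 3
]
recall = recall_at_k(toy)
mean_ap = mean_average_precision(toy)
print(recall, mean_ap)
```

Under these definitions, a mean average precision well below recall means that relevant passages are often retrieved but tend to appear far down the ranking.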





48

In [32]:
reader_eval_results = reader.eval(document_store=document_store, device=device,  label_index=label_index, doc_index=doc_index, top_k=10)

print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
print("Reader Exact Match:", reader_eval_results["EM"])
print("Reader F1-Score:", reader_eval_results["f1"])


Evaluating: 100%|██████████| 5/5 [03:27<00:00, 41.43s/it]

Reader Top-N-Accuracy: 97.91666666666666
Reader Exact Match: 2.083333333333333
Reader F1-Score: 40.34146548652496
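The gap between exact match and F1 is easier to read with the definitions in hand. The sketch below follows the SQuAD-style convention: lowercase, strip punctuation and articles, then compare token by token. It is an illustration under that assumption and may not match the reader's internal implementation in every detail.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold label and a longer predicted span that contains it.
gold = "a global economic slowdown"
pred = "A global economic slowdown has resulted from containment measures"
em = exact_match(pred, gold)
f1 = f1_score(pred, gold)
print(em, round(f1, 3))
```

A prediction that contains the gold answer inside a longer span scores zero on exact match but still earns partial F1, which is one plausible reason for a low EM alongside a much higher F1.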





From here on, the cells are not feasible to run without a GPU.

In [33]:
from haystack import Pipeline

p = Pipeline()
p.add_node(component=retriever, name="denseRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalRetriever", inputs=["denseRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalRetriever"])
p.add_node(component=eval_reader, name="EvalReader", inputs=["QAReader"])
results = []


In [34]:
%%capture
for q, l in q_to_l_dict.items():
    res = p.run(
        query=q,
        top_k_retriever=25,
        labels=l,
        top_k_reader=10,
        index=doc_index,
    )
    results.append(res)


KeyboardInterrupt: ignored

In [35]:
n_queries = len(labels)
eval_retriever.print()
print()
retriever.print_time()
print()
eval_reader.print(mode="reader")
print()
reader.print_time()
print()
eval_reader.print(mode="pipeline")


Retriever
-----------------
recall: 0.5897 (23 / 39)

Retriever (Speed)
---------------
No indexing performed via Retriever.run()
Queries Performed: 39
Query time: 13.102201267000964s
0.33595387864105036 seconds per query

Reader
-----------------
has answer queries: 22
top 1 EM: 0.0455
top k EM: 0.0455
top 1 F1: 0.3374
top k F1: 0.4874

Reader (Speed)
---------------
Queries Performed: 39
Query time: 4784.799563666998s
122.68716829915381 seconds per query

Pipeline
-----------------
queries: 38
top 1 EM: 0.0263
top k EM: 0.0263
top 1 F1: 0.1953
top k F1: 0.2822
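The per-query times and the recall in the report above can be reproduced from the raw totals, which also shows where the run time goes. The numbers are copied from the output; the variable names are illustrative.

```python
# Totals copied from the speed report above.
n_queries = 39
retriever_seconds = 13.102201267000964
reader_seconds = 4784.799563666998

retriever_per_query = retriever_seconds / n_queries
reader_per_query = reader_seconds / n_queries

# Share of the total query time spent in the reader.
reader_share = reader_seconds / (reader_seconds + retriever_seconds)

# Retriever recall as reported: 23 of 39 queries had a relevant document.
retriever_recall = 23 / 39

print(f"retriever: {retriever_per_query:.3f} s/query")
print(f"reader:    {reader_per_query:.1f} s/query")
print(f"reader share of query time: {reader_share:.1%}")
print(f"retriever recall: {retriever_recall:.4f}")
```

The reader accounts for almost all of the query time, roughly 80 minutes for 39 queries, which is consistent with this run having no GPU (the `nvidia-smi` call earlier failed).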
