<a href="https://colab.research.google.com/github/FriendlyUser/stonk_doc_search/blob/main/QA_Pipeline_for_BB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple QA System


Question answering (QA) has many use cases. A very common one is navigating complex knowledge bases or long documents (the "search setting").

A "knowledge base" could, for example, be your website, an internal wiki, or a collection of financial reports.
The Haystack tutorial this notebook is based on works with "Game of Thrones" articles; here we work with company filings instead.

We upload documents for a particular company and index them so that the Haystack QA system can search them. No model is trained in this notebook; the Reader uses a pretrained model.


### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [None]:
# Make sure you have a GPU running
!nvidia-smi

Tue Dec 21 06:43:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    61W / 149W |   1653MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

! wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz -q
! tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-aidbs1ag
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-aidbs1ag
xpdf-tools-linux-4.03/
xpdf-tools-linux-4.03/ANNOUNCE
xpdf-tools-linux-4.03/bin32/
xpdf-tools-linux-4.03/bin32/pdftotext
xpdf-tools-linux-4.03/bin32/pdfinfo
xpdf-tools-linux-4.03/bin32/pdftopng
xpdf-tools-linux-4.03/bin32/pdfimages
xpdf-tools-linux-4.03/bin32/pdftoppm
xpdf-tools-linux-4.03/bin32/pdftops
xpdf-tools-linux-4.03/bin32/pdfdetach
xpdf-tools-linux-4.03/bin32/pdffonts
xpdf-tools-linux-4.03/bin32/pdftohtml
xpdf-tools-linux-4.03/CHANGES
xpdf-tools-linux-4.03/bin64/
xpdf-tools-linux-4.03/bin64/pdftotext
xpdf-tools-linux-4.03/bin64/pdfinfo
xpdf-tools-linux-4.03/bin64/pdftopng
xpdf-tools-linux-4.03/bin64/pdfimages
xpdf-tools-linux-4.03/bin64/pdftoppm
xpdf-tools-linux-4.03/bin64/pdftops
xpdf-tools-linux-4.03/bin64/pdfdetach
xpdf-tools-linux-4.

In [None]:
from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers
from haystack.nodes import FARMReader, TransformersReader

## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`,  `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

**Alternatives:** If you are unable to set up an Elasticsearch instance, follow [Tutorial 3](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb) to use the SQL or in-memory document stores instead.

**Hint**: This notebook creates a new document store instance and fills it with the company filings in `sample_data`. However, you can configure Haystack to work with your existing document stores.

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), you can instead download the Elasticsearch release archive and run it manually.

In [None]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()



In [None]:
# In Colab / no-Docker environments: start Elasticsearch from the downloaded release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
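
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As a minimal sketch of an alternative, the helper below polls an HTTP endpoint until it responds. The name `wait_for_http` is hypothetical and not part of Haystack, and the URL in the usage comment assumes Elasticsearch's default HTTP port 9200.

```python
import time
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0, opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Returns True if the server answered in time, False otherwise.
    `opener` is injectable so the helper can be tried without a running server.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections and urllib's URLError/HTTPError:
            # the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage (assumes the default Elasticsearch port):
# if not wait_for_http("http://localhost:9200", timeout=120):
#     raise RuntimeError("Elasticsearch did not start in time")
```

The call returns as soon as the server is reachable, so the notebook waits only as long as it has to.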

In [None]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this notebook, we convert the PDF filings in `sample_data` to text, apply a basic cleaning function, and index them in Elasticsearch.
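
To make the cleaning and splitting steps concrete, here is a minimal sketch in plain Python, independent of Haystack. The helpers `clean_text`, `split_paragraphs` and `to_dicts` are hypothetical names used for illustration only. The `content` and `meta` keys follow the dictionaries printed later in this notebook.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical cleaning function: takes a str and returns a str."""
    # Collapse runs of spaces/tabs and strip surrounding whitespace on each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def split_paragraphs(text: str) -> list:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def to_dicts(name: str, raw_text: str) -> list:
    """Turn one raw document into one dict per paragraph."""
    cleaned = clean_text(raw_text)
    return [{"content": p, "meta": {"name": name}} for p in split_paragraphs(cleaned)]


# Made-up input text, for illustration only
docs = to_dicts("report.txt", "Revenue   grew.\n\n\nOutlook  is stable.  ")
print(docs)
```

A list built this way could be passed to a document store's `write_documents` in place of the output of `convert_files_to_dicts`, for example when the texts come from a database rather than from files.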

In [None]:
# Let's first load some documents that we want to query
# Here: PDF filings for one company, uploaded to the `sample_data` directory
from haystack.utils import convert_files_to_dicts
# The original tutorial instead downloads 517 Wikipedia articles on Game of Thrones:
# doc_dir = "data/article_txt_got"
# s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
# fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path="sample_data", clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is:
# {
#    'content': "<DOCUMENT_TEXT_HERE>",
#    'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
# }
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Pipeline)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-07-27.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-10-29.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-08-26.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-11-15.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-08-03.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-07-22.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-10-22.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-09-07.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-09-20.pdf
INFO - haystack.utils.preprocessing -  Converting sample_data/Continuous_Disclosure_2021-10-01.pdf
INFO - hay

[{'content': "Canada Business Corporations Act\nLoi canadienne sur les socits par actions\nPeak Fintech Group Inc. Groupe Peak Fintech Inc.\nCorporate name / Dnomination sociale\nCorporation number / Numro de socit\nI HEREBY CERTIFY that the articles of the above-named corporation are amended under section 178 of the Canada Business Corporations Act as set out in the attached articles of amendment.\nJE CERTIFIE que les statuts de la socit susmentionne sont modifis aux termes de l'article 178 de la Loi canadienne sur les socits par actions, tel qu'il est indiqu dans les clauses modificatrices ci-jointes.\nDate of amendment (YYYY-MM-DD) Date de modification (AAAA-MM-JJ)\nCanada Business Corporations Act (CBCA) (s. 27 or 177)\n1 Corporate name Dnomination sociale Peak Fintech Group Inc. Groupe Peak Fintech Inc.\n2 Corporation number Numro de la socit 1055923-7\n3 The articles are amended as follows Les statuts sont modifis de la faon suivante\nFormulaire 4 Clauses modificatrices\nLoi cana

## Initialize Retriever, Reader, & Pipeline

### Retriever

Retrievers narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm.

**Alternatives:**

- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
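
To give an intuition for what BM25 ranking does, the following is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Elasticsearch's implementation, which differs in details such as text analysis and score normalization. The function name `bm25_scores`, the whitespace tokenizer, and the three sample texts are assumptions made for illustration; `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with textbook Okapi BM25.

    Tokenization is a naive lowercase whitespace split (an assumption made
    for brevity; a search engine would use a proper analyzer).
    """
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates (k1) and is normalized by length (b).
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up sample texts, for illustration only
docs = [
    "revenue guidance for fiscal 2023",
    "the patent sale is expected to close soon",
    "cybersecurity revenue and revenue growth",
]
scores = bm25_scores("revenue guidance", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first text ranks highest because it matches both query terms, and the rarer term "guidance" carries more weight than "revenue", which appears in two of the three texts.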

In [None]:
from haystack.nodes import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick prototypes with an SQLite document store.

# from haystack.nodes import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
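
Extractive readers typically score every token as a possible answer start and a possible answer end, then keep the best-scoring spans. The sketch below illustrates that idea and the effect of a no-answer boost in plain Python. It is a simplification, not FARM's actual code: the function `top_k_spans`, the made-up tokens and logits, and the convention that position 0 stands for "no answer" are all assumptions for illustration.

```python
def top_k_spans(start_logits, end_logits, k=3, max_answer_len=5, no_ans_boost=0.0):
    """Return the k best (score, start, end) candidates, best first.

    Position 0 is treated as the "no answer" slot (assumption: a SQuAD 2.0
    style model that scores "no answer" on the first token). It is reported
    as the span (0, 0) with `no_ans_boost` added to its score.
    """
    candidates = [(start_logits[0] + end_logits[0] + no_ans_boost, 0, 0)]
    for start in range(1, len(start_logits)):
        # Only consider spans of at most `max_answer_len` tokens.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            candidates.append((start_logits[start] + end_logits[end], start, end))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k]


# Made-up tokens and logits, for illustration only
tokens = ["[CLS]", "the", "answer", "is", "forty", "two"]
start_logits = [0.5, 0.1, 0.0, 0.2, 4.0, 1.0]
end_logits = [0.5, 0.0, 0.1, 0.1, 1.0, 3.5]

best = top_k_spans(start_logits, end_logits, k=2)
answers = [" ".join(tokens[s:e + 1]) for _, s, e in best]
print(answers)
```

Raising `no_ans_boost` lifts the score of the (0, 0) candidate, so with a large enough value "no answer" outranks every extracted span.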

#### FARMReader

In [None]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


#### TransformersReader

In [None]:
# Alternative:
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline

With a Haystack `Pipeline`, you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
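
As a rough mental model, a retriever-reader pipeline is a chain of named nodes, each of which receives per-node parameters such as `top_k`. The toy sketch below shows only that idea in plain Python. `ToyPipeline`, `retriever`, `reader` and `CORPUS` are hypothetical stand-ins, not Haystack classes, and a linear chain is just the simplest special case of a DAG.

```python
class ToyPipeline:
    """Runs named nodes in insertion order, passing each output to the next."""

    def __init__(self):
        self.nodes = []  # list of (name, callable) pairs

    def add_node(self, name, component):
        self.nodes.append((name, component))

    def run(self, query, params=None):
        params = params or {}
        state = {"query": query}
        for name, component in self.nodes:
            # Each node only sees the parameters filed under its own name.
            state = component(state, **params.get(name, {}))
        return state


CORPUS = ["alpha beta", "beta gamma", "gamma delta"]  # made-up documents


def retriever(state, top_k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    words = set(state["query"].lower().split())
    ranked = sorted(CORPUS, key=lambda d: len(words & set(d.split())), reverse=True)
    return {**state, "documents": ranked[:top_k]}


def reader(state, top_k=1):
    """Toy reader: 'extract' query words found in the retrieved documents."""
    words = set(state["query"].lower().split())
    answers = [w for d in state["documents"] for w in d.split() if w in words]
    return {**state, "answers": answers[:top_k]}


pipe_sketch = ToyPipeline()
pipe_sketch.add_node("Retriever", retriever)
pipe_sketch.add_node("Reader", reader)
result = pipe_sketch.run(
    "beta gamma", params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 3}}
)
print(result["documents"], result["answers"])
```

The reader only ever sees what the retriever passed on, which is why the retriever's `top_k` bounds the quality of the final answers.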

In [None]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [None]:
# You can configure how many candidates the Retriever and the Reader shall return
# The higher the Retriever's top_k, the better (but also the slower) your answers.
from pprint import pprint

questions = [
             "What is QNX",
             "What are the future revenue guidance",
             "When is the patent sale expected to be finalized",
             "What is the future of Cylance",
             "What is the projected growth of cybersecurity",
             "How much money will be made with blackberry ivy"
]

for question in questions:
  prediction = pipe.run(
      query=question, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
  )
  pprint(prediction)

  print_answers(prediction, details="minimum")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.12 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.42 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.46 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.76 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.93 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.85 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.67 Batches/s]


In [None]:
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", params={"Reader": {"top_k": 5}})
# prediction = pipe.run(query="Who is the sister of Sansa?", params={"Reader": {"top_k": 5}})

In [None]:
# Now you can either print the object directly...

# Sample output:    
# {
#     'answers': [ <Answer: answer='Eddard', type='extractive', score=0.9919578731060028, offsets_in_document=[{'start': 608, 'end': 615}], offsets_in_context=[{'start': 72, 'end': 79}], document_id='cc75f739897ecbf8c14657b13dda890e', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
#                  <Answer: answer='Ned', type='extractive', score=0.9767240881919861, offsets_in_document=[{'start': 3687, 'end': 3801}], offsets_in_context=[{'start': 18, 'end': 132}], document_id='9acf17ec9083c4022f69eb4a37187080', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
#                  ...
#                ]
#     'documents': [ <Document: content_type='text', score=0.8034909798951382, meta={'name': '332_Sansa_Stark.txt'}, embedding=None, id=d1f36ec7170e4c46cde65787fe125dfe', content='\n===\'\'A Game of Thrones\'\'===\nSansa Stark begins the novel by being betrothed to Crown ...'>,
#                    <Document: content_type='text', score=0.8002150354529785, meta={'name': '191_Gendry.txt'}, embedding=None, id='dd4e070a22896afa81748d6510006d2', 'content='\n===Season 2===\nGendry travels North with Yoren and other Night's Watch recruits, including Arya ...'>,
#                    ...
#                  ],
#     'no_ans_gap':  11.688868522644043,
#     'node_id': 'Reader',
#     'params': {'Reader': {'top_k': 5}, 'Retriever': {'top_k': 5}},
#     'query': 'Who is the father of Arya Stark?',
#     'root_node': 'Query'
# }


{'answers': [<Answer {'answer': '$42.7M', 'type': 'extractive', 'score': 0.364032618701458, 'context': 'ofit margin of 25% by 2023."\nUpdated Financial Guidance Summary\nRevenue $42.7M $109.0M $345.0M $814.0M\nEBITDA** ($2.78M) $11.3M $81.8M $295.7M\nNet Inc', 'offsets_in_document': [{'start': 1808, 'end': 1814}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': '8b0e1a6fe5187dac29e1765967e5f56b', 'meta': {'name': 'Continuous_Disclosure_2021-10-22.pdf'}}>,
             <Answer {'answer': 'Revenues from external customers', 'type': 'extractive', 'score': 0.302394762635231, 'context': "(1): Revenues from external customers have been identified on the basis of the customer's geographical location, which is China.", 'offsets_in_document': [{'start': 5, 'end': 37}], 'offsets_in_context': [{'start': 5, 'end': 37}], 'document_id': 'e507207657f750b8bbcf1355d98a2dd7', 'meta': {'name': 'Continuous_Disclosure_2021-08-26.pdf'}}>,
             <Answer {'answer': 'Financial service r

In [None]:
# ...or use a util to simplify the output
# Change `minimum` to `medium` or `all` to raise the level of detail
print_answers(prediction, details="minimum")


Query: What are Q3 revenues
Answers:
[   {   'answer': '$42.7M',
        'context': 'ofit margin of 25% by 2023."\n'
                   'Updated Financial Guidance Summary\n'
                   'Revenue $42.7M $109.0M $345.0M $814.0M\n'
                   'EBITDA** ($2.78M) $11.3M $81.8M $295.7M\n'
                   'Net Inc'},
    {   'answer': 'Revenues from external customers',
        'context': '(1): Revenues from external customers have been identified '
                   "on the basis of the customer's geographical location, "
                   'which is China.'},
    {   'answer': 'Financial service revenue Fees/sales from\n'
                  'external customers Supply chain services Inter-segment',
        'context': 'Revenues (1) Financial service revenue Fees/sales from\n'
                   'external customers Supply chain services Inter-segment '
                   'Total revenues Expenses Depreciation and'},
    {   'answer': 'Financial service revenue',
        'con

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)
