Adapted from: https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb

In [None]:
import os
import urllib.parse

import requests
from dotenv import load_dotenv
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

In [2]:
use_gpu = False

## Document Store


In [3]:
# In-Memory Document Store
from haystack.document_store.memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

In [4]:
# SQLite Document Store
# from haystack.document_store.sql import SQLDocumentStore
# document_store = SQLDocumentStore(url="sqlite:///qa.db")

In [5]:
doc_dir = "data/article_txt_kbase"
os.makedirs(doc_dir, exist_ok=True)

In [6]:
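# Read the API location and credentials from a local env file
load_dotenv("../aws.env")
api_url = urllib.parse.urljoin(os.environ["DBAPI_URL"], os.environ["DBAPI_STAGE"])
login_data = {
    "username": os.environ["FIRST_USER"],
    "password": os.environ["FIRST_USER_PASSWORD"],
}
# Log in and build the bearer-token header used by the requests below
r = requests.post(f"{api_url}/token", data=login_data)
r.raise_for_status()  # fail early if the login was rejected
tokens = r.json()
a_token = tokens["access_token"]
token_headers = {"Authorization": f"Bearer {a_token}"}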

In [7]:
# Fetch the first 10 documents from the API
document_response = requests.get(
    f"{api_url}/documents/", params={"skip": 0, "limit": 10}, headers=token_headers
).json()

In [8]:
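# Write each document to its own text file: title on the first line, then the parsed text
for d in document_response:
    file_name = os.path.join(doc_dir, d['id'] + '.txt')
    with open(file_name, 'wt', encoding='utf-8') as fout:
        fout.write(d['title'] + '\n' + d['parsed_text'])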

## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store
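
The four steps listed above can be sketched in plain Python. This is only an illustration of the idea, not Haystack's implementation: `clean_text` and `files_to_dicts` are hypothetical helpers, and the sample file is made up. The resulting list has the `{"name": ..., "text": ...}` shape that is later written to the document store.

```python
import re
import tempfile
from pathlib import Path


def clean_text(text: str) -> str:
    """Hypothetical cleaner: trim each line and collapse runs of blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse three or more newlines into exactly one blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


def files_to_dicts(dir_path: str) -> list:
    """Read every .txt file, clean it, split it on blank lines, and build dicts."""
    dicts = []
    for path in sorted(Path(dir_path).glob("*.txt")):
        text = clean_text(path.read_text(encoding="utf-8"))
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                dicts.append({"name": path.name, "text": paragraph.strip()})
    return dicts


# Usage on a throwaway directory containing one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "episode.txt").write_text(
        "Title line\n\n\n\n  First paragraph.  \n\nSecond paragraph.\n",
        encoding="utf-8",
    )
    sample_dicts = files_to_dicts(tmp)

print(sample_dicts)
```

Splitting on blank lines is the simplest possible paragraph splitter; real transcripts may need a different rule.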

In [None]:
# # Let's first get some documents that we want to query
# # Here: 517 Wikipedia articles for Game of Thrones
# doc_dir = "data/article_txt_got"
# s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
# fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# convert files to dicts containing documents that can be indexed to our datastore
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "text": "<the-actual-text>"}

# Let's have a look at the first 3 entries:
print(dicts[:3])
# Now, let's write the docs to our DB.
document_store.write_documents(dicts)

## Initialize Retriever, Reader, & Pipeline

### Retriever

Retrievers narrow down the scope for the Reader to smaller units of text where a given question could be answered.

With InMemoryDocumentStore or SQLDocumentStore, you can use the TfidfRetriever. For more retrievers, refer to Tutorial 1.
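
To make the idea of TF-IDF retrieval concrete, here is a minimal pure-Python sketch. It is not how `TfidfRetriever` is implemented; `tokenize`, `tfidf_rank`, the smoothed IDF formula, and the sample paragraphs are all assumptions chosen for illustration. It scores each paragraph by the cosine similarity between its TF-IDF vector and the query's, then keeps the best `top_k`.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query: str, paragraphs: list, top_k: int = 3) -> list:
    """Return indices of the top_k paragraphs by TF-IDF cosine similarity."""
    tokenized = [tokenize(p) for p in paragraphs]
    n_docs = len(tokenized)
    # Document frequency: in how many paragraphs each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        # Terms unknown to the corpus are ignored.
        counts = Counter(t for t in tokens if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in tokenized]
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


paragraphs = [
    "The GIL is the global interpreter lock in Python.",
    "Bandit is a static analysis security tool.",
    "Ergonomic keyboards let you change your keymap.",
]
top = tfidf_rank("What is the GIL?", paragraphs, top_k=2)
print(top)
```

The Reader then only has to scan the few paragraphs that survive this cheap ranking step.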

In [10]:
# An in-memory TfidfRetriever based on Pandas dataframes

from haystack.retriever.sparse import TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)

02/26/2021 17:31:31 - INFO - haystack.retriever.sparse -   Found 10 candidate paragraphs from 10 docs in DB


### Reader

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful, but slower, deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
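
The effect of such a boost can be pictured with a simplified decision rule. The sketch below is an assumption about the general idea, not FARM's actual code: `pick_answer` is a hypothetical helper and the scores are invented. The boost is added to the "no answer" score before it is compared with the best answer span.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span text, or None when "no answer" wins.

    span_candidates: list of (answer_text, score) pairs (invented scores).
    no_answer_score: the score assigned to "no answer possible".
    no_ans_boost: value added to no_answer_score before the comparison.
    """
    best_text, best_score = max(span_candidates, key=lambda pair: pair[1])
    if no_answer_score + no_ans_boost > best_score:
        return None
    return best_text


candidates = [("Global Interpreter Lock", 7.5), ("change up your keymap", 1.2)]
without_boost = pick_answer(candidates, no_answer_score=5.0)
with_boost = pick_answer(candidates, no_answer_score=5.0, no_ans_boost=4.0)
print(without_boost, with_boost)
```

With no boost the best span wins; with a boost of 4.0 the same inputs yield no answer.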

#### FARMReader

In [11]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=use_gpu)

02/26/2021 17:31:33 - INFO - farm.utils -   Using device: CPU 
02/26/2021 17:31:33 - INFO - farm.utils -   Number of GPUs: 0
02/26/2021 17:31:33 - INFO - farm.utils -   Distributed Training: False
02/26/2021 17:31:33 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
02/26/2021 17:31:42 - INFO - farm.utils -   Using device: CPU 
02/26/2021 17:31:42 - INFO - farm.utils -   Number of GPUs: 0
02/26/2021 17:31:42 - INFO - farm.utils -   Distributed Training: False
02/26/2021 17:31:42 - INFO - farm.utils -   Automatic Mixed Precision: None
02/26/2021 17:31:42 - INFO - farm.infer -   Got ya 7 parallel workers to do inference ...
02/26/2021 17:31:42 - INFO - farm.infer -    0    0    0  

#### TransformersReader

In [12]:
# Alternative:
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
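
As a rough mental model, a retriever-then-reader pipeline is a chain of nodes that each take the current state and return an updated one. The sketch below is a toy, not Haystack's `Pipeline` class: `ToyPipeline`, `toy_retriever`, and `toy_reader` are hypothetical stand-ins, and a linear chain is only the simplest case of a DAG.

```python
class ToyPipeline:
    """A linear chain of named nodes; each node maps a state dict to a new one."""

    def __init__(self):
        self.nodes = []

    def add_node(self, name, func):
        self.nodes.append((name, func))

    def run(self, **state):
        # Pass the state through every node in the order they were added.
        for _name, func in self.nodes:
            state = func(state)
        return state


def toy_retriever(state):
    # Keep the documents that share at least one word with the query.
    query_words = set(state["query"].lower().rstrip("?").split())
    hits = [d for d in state["documents"] if query_words & set(d.lower().split())]
    return {**state, "documents": hits[: state["top_k_retriever"]]}


def toy_reader(state):
    # Stand-in for a QA model: treat each retrieved document as one "answer".
    answers = [{"answer": d, "context": d} for d in state["documents"]]
    return {**state, "answers": answers[: state["top_k_reader"]]}


toy_pipe = ToyPipeline()
toy_pipe.add_node("Retriever", toy_retriever)
toy_pipe.add_node("Reader", toy_reader)
result = toy_pipe.run(
    query="What is Bandit?",
    documents=["Bandit is a static analysis security tool", "Pancakes is a cat"],
    top_k_retriever=3,
    top_k_reader=1,
)
print(result["answers"])
```

The two `top_k_*` values play the same role as in the real pipeline: the first limits what the reader sees, the second limits what is returned.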

In [13]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [18]:
# You can configure how many candidates the retriever and the reader return.
# A higher top_k_retriever tends to give better answers, but makes the query slower.
questions = ["What is the GIL?", "What is Bandit?", "Which Pep has been accepted?", "Who is Bryan?"]
for q in questions:
    print(q)
    print("-------------------")
    prediction = pipe.run(query=q, top_k_retriever=3, top_k_reader=5)
    print_answers(prediction, details="minimal")
    print("-------------------")
    print("-------------------")

What is the GIL?
-------------------


Inferencing Samples: 100%|██████████| 3/3 [01:02<00:00, 20.96s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:52<00:00, 26.14s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:43<00:00, 21.81s/ Batches]


[   {   'answer': 'Global Interpreter Lock',
        'context': 'this problem up by doing something like this:\n'
                   'However, the GIL (Global Interpreter Lock) prevents us '
                   'from achieving the performance improvement we are'},
    {   'answer': 'global interpreter lock',
        'context': 'ms of the Gil so like, the Gil is known, otherwise known '
                   'as the global interpreter lock in Python, prevents us from '
                   'really like running a multi thread'},
    {   'answer': 'you can only run one thread at a time for like, one opcode '
                  'at a time',
        'context': 'me up with the Gil, which basically says you can only run '
                   'one thread at a time for like, one opcode at a time as as '
                   'attempts have been made to remove '},
    {   'answer': 'change up your keymap',
        'context': 'keyboards is, in addition to getting ergonomic benefits, '
                   'yo

Inferencing Samples: 100%|██████████| 3/3 [00:58<00:00, 19.56s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:43<00:00, 21.82s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:44<00:00, 22.08s/ Batches]


[   {   'answer': 'a static analysis security tool',
        'context': '      beyond the location indicated by --> and ^.\n'
                   'Bandit is a static analysis security tool.\n'
                   'It’s like a linter but for security issues.\n'
                   'I prefer to r'},
    {   'answer': 'a static',
        'context': 'the file\n'
                   '        beyond the location indicated by --> and ^.\n'
                   'Bandit is a static analysis security tool.\n'
                   'It’s like a linter but for security issues.\n'
                   'I '},
    {   'answer': 'a linter',
        'context': "t, we'll just do that. Um, but yeah, so bandit is "
                   'basically like, like a linter. But it looks for security '
                   'issues. So you can just like pip install it'},
    {   'answer': 'PI test developer',
        'context': 'een on guest on talk Python to me. He maintains pre '
                   "commit, he's a PI test deve

Inferencing Samples: 100%|██████████| 3/3 [00:59<00:00, 19.72s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:45<00:00, 22.71s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:54<00:00, 27.41s/ Batches]


[   {   'answer': 'Pep 634 structural pattern matching',
        'context': 'h. All right. Awesome. So I got a couple of quick things. '
                   'Pep 634 structural pattern matching in Python has been '
                   "accepted for Python. 310. That's like"},
    {   'answer': 'PEP 634',
        'context': ' that trainings can be virtual, a couple half days is '
                   'super easy to do.\n'
                   'PEP 634 -- Structural Pattern Matching: Specification '
                   'accepted in 3.10\n'
                   'Sent in'},
    {   'answer': 'my hands',
        'context': 's both awesome and terrifying. Yes, exactly. Yeah.\n'
                   '0:00 Yeah. Yeah. So my hands like this got accepted. It '
                   'seemed to be sort of counter to the simplic'},
    {   'answer': 'Pancakes',
        'context': 'd": 1, "age": 4, "name": "Cleo"},\n'
                   '        {"id": 2, "age": 2, "name": "Pancakes"}\n'
                   'Antho

Inferencing Samples: 100%|██████████| 2/2 [00:56<00:00, 28.14s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:48<00:00, 24.03s/ Batches]
Inferencing Samples: 100%|██████████| 2/2 [00:45<00:00, 22.86s/ Batches]

[   {   'answer': 'Brian knockin',
        'context': 'buds. This is Episode 217. recorded. What is it January 19 '
                   "2021. I'm Brian knockin. I'm Michael Kennedy. And I'm "
                   'Omar. Welcome. Thanks for joining us.'},
    {   'answer': 'Byrne Hobart',
        'context': 'respected genius and end up being a janitor who gets into '
                   'fights." - Byrne Hobart\n'
                   '00:00:00 Hello, and welcome to Python bytes where we '
                   'deliver Python '},
    {   'answer': 'Brian',
        'context': '. This was really fun. Yeah, great for vitamins brought. '
                   'Enjoy them. And Brian, thanks as always, man. Thank you. '
                   "It's been fun. Yep. See you. Bye. Th"},
    {   'answer': 'Magnus Carlsen',
        'context': 'ing so this will solve that for sure. And then what free '
                   'Brian from Magnus Carlsen. Yeah, does was it does PIP 621, '
                   'the tama sp


