## Task: Question Answering for Game of Thrones

<img style="float: right;" src="https://upload.wikimedia.org/wikipedia/en/d/d8/Game_of_Thrones_title_card.jpg">

Question Answering can be used in a variety of use cases. A very common one is navigating complex knowledge bases or long documents (a "search setting").

A "knowledge base" could for example be your website, an internal wiki or a collection of financial reports. 
In this tutorial we will work on a slightly different domain: "Game of Thrones". 

Let's see how we can use a bunch of Wikipedia articles to answer a variety of questions about the 
marvellous seven kingdoms...


In [None]:
! pip install farm-haystack

In [12]:
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.io import write_documents_to_db, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.retriever.tfidf import TfidfRetriever
from haystack.utils import print_answers

## Document Store

Haystack finds answers to queries from the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

This tutorial creates a new document store instance with Wikipedia articles on Game of Thrones. However, you can configure Haystack to work with your existing document stores. 

*If you are unable to set up an Elasticsearch instance, follow Tutorial 2, which uses the SQL/InMemory document stores.*


### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and execute Elasticsearch from source.

In [16]:
# Start Elasticsearch using Docker
! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

ad2b9ac521a0b56de0cfd7ce3a0f3b1cfa9d3d223e74f87bbe18107502477f6a
docker: Error response from daemon: driver failed programming external connectivity on endpoint happy_bose (be301efcc8fe7df9cd0fb1b23d840de2fbc4ac222528bd115f0fd2e4c5b6abe1): Bind for 0.0.0.0:9200 failed: port is already allocated.


In [14]:
# Start Elasticsearch from source (ONLY REQUIRED IN COLAB)

!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],  # path must match the 7.6.2 archive extracted above
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )

chown: elasticsearch-7.6.2/bin/elasticsearch-syskeygen: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-env: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-sql-cli-7.6.2.jar: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-env-from-file: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-node: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-saml-metadata: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-keystore: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-plugin: Operation not permitted
chown: elasticsearch-7.6.2/bin/x-pack-env: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-sql-cli: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-shard: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticsearch-setup-passwords: Operation not permitted
chown: elasticsearch-7.6.2/bin/elasticse

SubprocessError: Exception occurred in preexec_fn.
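
Elasticsearch usually needs a few seconds after launch before it accepts connections, so it can help to wait for the HTTP port before connecting. The following is a minimal sketch using only the standard library; `wait_for_http` is a hypothetical helper (not part of Haystack), and the timeout values are assumptions you may need to adjust.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout_s=60.0, interval_s=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    Hypothetical helper, not part of Haystack or Elasticsearch.
    Returns True if the server responded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

Calling `wait_for_http("http://localhost:9200")` before creating the document store would then block until the server started above is reachable, or return `False` after the timeout.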

In [17]:
# Connect to Elasticsearch

from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

04/27/2020 17:36:41 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:400 request:0.016s]


## Cleaning & indexing documents

Haystack provides a document cleaning and indexing pipeline for ingesting documents into a `DocumentStore`.

In [18]:
# Let's first get some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)


# Now, let's write the docs to our DB.
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True)

04/27/2020 17:37:24 - INFO - haystack.indexing.io -   Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.
04/27/2020 17:37:24 - INFO - elasticsearch -   POST http://localhost:9200/_count [status:200 request:0.008s]
04/27/2020 17:37:24 - INFO - haystack.indexing.io -   Skip writing documents since DB already contains 517 docs ...  (Disable `only_empty_db`, if you want to add docs anyway.)
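
As noted in the comments above, a cleaning function only has to take a `str` and return a `str`. Here is a rough sketch of what a custom one could look like; `clean_article_text` and the section names it strips are illustrative assumptions, not the behavior of `clean_wiki_text`.

```python
import re


def clean_article_text(text: str) -> str:
    """Hypothetical cleaning function: takes a str, returns a str."""
    # Cut off a trailing "See also" / "References" / "External links" section, if present
    text = re.split(
        r"\n==\s*(?:See also|References|External links)\s*==\s*\n",
        text,
        maxsplit=1,
    )[0]
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Such a function could be passed as `clean_func=clean_article_text` in place of `clean_wiki_text`.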


## Initialize Reader, Retriever & Finder

In [None]:
# A retriever identifies the k most promising chunks of text that might contain the answer for our question
# Retrievers use some simple but fast algorithm, here: TF-IDF
retriever = TfidfRetriever(document_store=document_store)
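
To give an intuition for what a TF-IDF ranking does, here is a toy sketch in plain Python. It is not how `TfidfRetriever` is implemented; `tokenize`, `tfidf_rank`, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, paragraphs, top_k=3):
    """Return the indices of the `top_k` paragraphs with the highest TF-IDF score."""
    tokenized = [tokenize(p) for p in paragraphs]
    n_docs = len(tokenized)
    # Document frequency: in how many paragraphs does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                # Smoothed inverse document frequency: rare terms weigh more
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Highest score first; ties keep the original paragraph order
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_k]]
```

Terms that are rare across the collection (such as a character's name) contribute more to the score than terms that appear in nearly every paragraph.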

In [7]:
# A reader scans the text chunks in detail and extracts the k best answers
# Readers use more powerful but slower deep learning models
# You can select a local model or any of the QA models published on huggingface's model hub (https://huggingface.co/models)
# here: a medium sized BERT QA model trained via FARM on SQuAD 2.0
reader = FARMReader(model_name_or_path="deepset/bert-base-cased-squad2", use_gpu=False)

# Alternatively, use a reader from huggingface's transformers package (https://github.com/huggingface/transformers)
# reader = TransformersReader(model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

04/27/2020 15:13:58 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
04/27/2020 15:13:58 - INFO - farm.infer -   Could not find `deepset/bert-base-cased-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
04/27/2020 15:14:09 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None


In [8]:
# The Finder combines the reader and retriever in a pipeline to answer our actual questions
finder = Finder(reader, retriever)
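
The retrieve-then-read idea behind such a pipeline can be sketched with toy components. Everything below (`ToyFinder`, `make_retriever`, `toy_reader`, and the word-overlap scoring) is a hypothetical stand-in for illustration; the real `Finder`, retriever, and reader work differently internally.

```python
def overlap(question, text):
    """Count how many distinct question words also occur in `text`."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def make_retriever(paragraphs):
    """Build a toy retriever: fast, coarse ranking over all paragraphs."""
    def retrieve(question, top_k):
        ranked = sorted(paragraphs, key=lambda p: overlap(question, p), reverse=True)
        return ranked[:top_k]
    return retrieve


def toy_reader(question, paragraphs, top_k):
    """Toy reader: a neural reader would extract answer spans here;
    this stand-in just ranks whole sentences by word overlap."""
    sentences = [s.strip() for p in paragraphs for s in p.split(".") if s.strip()]
    ranked = sorted(sentences, key=lambda s: overlap(question, s), reverse=True)
    return ranked[:top_k]


class ToyFinder:
    """Chains a retriever and a reader, mirroring the two-stage idea."""

    def __init__(self, reader, retriever):
        self.reader = reader
        self.retriever = retriever

    def get_answers(self, question, top_k_retriever=10, top_k_reader=1):
        # Stage 1: narrow the collection down to a few candidate paragraphs
        candidates = self.retriever(question, top_k_retriever)
        # Stage 2: run the slower, more detailed step only on those candidates
        return self.reader(question, candidates, top_k_reader)
```

The design point is that the expensive reader only ever sees the `top_k_retriever` candidates, which is why the retriever's recall bounds the quality of the final answers.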

## Voilà! Ask a question!

In [9]:
# You can configure how many candidates the reader and retriever return
# A higher top_k_retriever tends to give better answers, but also makes the query slower
prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)

04/27/2020 15:14:15 - INFO - haystack.retriever.tfidf -   Identified 10 candidates via retriever:
  paragraph_id           document_id

04/27/2020 15:14:15 - INFO - haystack.finder -   Reader is looking for detailed answer in 12569 chars ...


In [None]:
# prediction = finder.get_answers(question="Who created the Dothraki vocabulary?", top_k_reader=5)
# prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_reader=5)

In [10]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Robert Baratheon',
        'context': 'hen Gendry gives it to Arya, he tells her he is the '
                   'bastard son of Robert Baratheon. Aware of their chances of '
                   'dying in the upcoming battle and Arya w'},
    {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Eddard and Catelyn Stark',
        'context': 'tark ===\n'
                   'Arya Stark is the third child and younger daughter of '
                   'Eddard and Catelyn Stark. She serve