# Build a QA System Without Elasticsearch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb)

Haystack provides alternatives to Elasticsearch for developing quick prototypes.

You can use an `InMemoryDocumentStore` or a `SQLDocumentStore` (with SQLite) as the document store.

If you are interested in the more feature-rich Elasticsearch, refer to Tutorial 1.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [1]:
# Make sure you have a GPU running
!nvidia-smi

Fri Aug 20 17:30:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git


Collecting grpcio-tools==1.34.1
  Downloading grpcio_tools-1.34.1-cp37-cp37m-manylinux2014_x86_64.whl (2.5 MB)
Installing collected packages: grpcio-tools
Successfully installed grpcio-tools-1.34.1
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-ky0s5p65
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-ky0s5p65
Collecting farm==0.8.0
  Downloading farm-0.8.0-py3-none-any.whl (204 kB)
Collecting fastapi
  Downloading fastapi-0.68.0-py3-none-any.whl (52 kB)
Collecting uvicorn
  Downloading uvicorn-0.15.0-py3-none-any.whl (54 kB)
Collecting gunicorn
  Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)

In [4]:
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## Document Store


In [5]:
# In-Memory Document Store
from haystack.document_store.memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

In [6]:
# SQLite Document Store
# from haystack.document_store.sql import SQLDocumentStore
# document_store = SQLDocumentStore(url="sqlite:///qa.db")

## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this tutorial, we download Wikipedia articles on Game of Thrones, apply a basic cleaning function, and index them in the document store created above.
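The four steps listed above can be sketched in plain Python. This is only an illustration of the idea, not Haystack's implementation: `clean_text`, `split_paragraphs`, and `to_dicts` are hypothetical helpers, and the `{"text": ..., "meta": {"name": ...}}` layout is an assumption based on the fields this tutorial reads back from the answers later on.

```python
import re

def clean_text(text: str) -> str:
    """Toy cleaning function: takes a str and returns a str."""
    # Collapse runs of spaces and tabs, then strip whitespace around each line.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).strip()

def split_paragraphs(text: str) -> list:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def to_dicts(name: str, raw_text: str) -> list:
    """Clean one file's text, split it, and wrap each paragraph in a dict."""
    cleaned = clean_text(raw_text)
    return [{"text": p, "meta": {"name": name}} for p in split_paragraphs(cleaned)]

# Hypothetical file content standing in for a converted text file.
raw = "Arya  Stark is a character.\n\n\nShe is from   Winterfell.  \n"
docs = to_dicts("example.txt", raw)
print(docs)
```

In the tutorial itself, `convert_files_to_dicts` with `clean_func=clean_wiki_text` and `split_paragraphs=True` covers these steps, and `write_documents` performs the final write.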

In [7]:
# Let's first get some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# convert files to dicts containing documents that can be indexed to our datastore
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "text": "<the-actual-text>"}

# Let's have a look at the first 3 entries:
print(dicts[:3])
# Now, let's write the docs to our DB.
document_store.write_documents(dicts)

08/20/2021 17:34:35 - INFO - haystack.preprocessor.utils -   Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip to `data/article_txt_got`
100%|██████████| 1095120/1095120 [00:00<00:00, 16740650.48B/s]
08/20/2021 17:34:36 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got/371_Cersei_Lannister.txt
08/20/2021 17:34:36 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got/25_Game_of_Thrones__Season_2__soundtrack_.txt
08/20/2021 17:34:36 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got/506_Game_of_Thrones_Theme.txt
08/20/2021 17:34:36 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got/211_The_Watchers_on_the_Wall.txt
08/20/2021 17:34:36 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got/367_Gregor_Clegane.txt
08/20/2021 17:34:36 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got/52_Catch_the_Thron

[{'text': "'''Cersei Lannister''' is a fictional character in the ''A Song of Ice and Fire'' series of fantasy novels by American author George R. R. Martin, and its television adaptation ''Game of Thrones'', where she is portrayed by English actress Lena Headey. In the later novels of the series, she is a point of view character.\nIntroduced in 1996's ''A Game of Thrones'', Cersei is a member of House Lannister, one of the wealthiest and most powerful families on the continent of Westeros. She subsequently appeared in ''A Clash of Kings'' (1998) and ''A Storm of Swords'' (2000). She becomes a prominent point of view character in the novels beginning in ''A Feast for Crows'' (2005) and ''A Dance with Dragons'' (2011). The character will also appear in the forthcoming volume ''The Winds of Winter''.\nIn the story, Cersei Lannister, Queen of the Seven Kingdoms of Westeros, is the wife of King Robert Baratheon. Her father arranged the marriage after his attempt to betroth her to Prince Rh
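If your texts come from a database rather than from files, the dictionaries can be built by hand, as the comment in the cell above suggests. The sketch below uses an in-memory SQLite table purely as a stand-in source. The table, its columns, and the rows are hypothetical, and the `"text"` plus `"meta"` layout is an assumption that mirrors the fields used elsewhere in this tutorial.

```python
import sqlite3

# Hypothetical source table; in practice this would be your own database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Arya Stark", "Arya is the younger daughter of Eddard Stark."),
        ("Jon Snow", "Jon Snow is raised at Winterfell."),
    ],
)

# One dictionary per row: the text itself plus the document name as metadata.
rows = conn.execute("SELECT title, body FROM articles ORDER BY title").fetchall()
custom_dicts = [{"text": body, "meta": {"name": title}} for title, body in rows]
conn.close()

print(custom_dicts)
```

A list built this way could then be passed to `document_store.write_documents()` in place of the output of `convert_files_to_dicts()`.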

## Initialize Retriever, Reader & Pipeline

### Retriever

Retrievers narrow down the scope for the Reader to smaller units of text where a given question could be answered.

With `InMemoryDocumentStore` or `SQLDocumentStore`, you can use the `TfidfRetriever`. For more retrievers, refer to Tutorial 1.
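To give a feel for what a TF-IDF based retriever does, here is a minimal sketch in plain Python. It is not Haystack's `TfidfRetriever`: the tokenizer, the smoothing term, and the function name `tfidf_rank` are assumptions chosen for brevity. Paragraphs are ranked by the summed TF-IDF weight of the query terms they contain.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_rank(query, paragraphs, top_k=2):
    """Rank paragraphs by the summed TF-IDF weight of the query terms."""
    tokenized = [tokenize(p) for p in paragraphs]
    n_docs = len(tokenized)
    # Document frequency: in how many paragraphs does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_freq:
                # The +1 keeps terms that occur in every paragraph from scoring zero.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += (term_freq[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Highest score first; keep only paragraphs that matched at least one term.
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    return [paragraphs[idx] for score, idx in ranked if score > 0][:top_k]

paragraphs = [
    "Arya Stark travels with her father Eddard to the capital",
    "The soundtrack was composed by Ramin Djawadi",
    "Eddard Stark is the lord of Winterfell",
]
top = tfidf_rank("father of Arya", paragraphs)
print(top)
```

The paragraph about the soundtrack shares no term with the query, so it is filtered out before the Reader would ever see it. This is the narrowing effect described above.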

In [8]:
# An in-memory TfidfRetriever based on Pandas dataframes

from haystack.retriever.sparse import TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)

08/20/2021 17:34:37 - INFO - haystack.retriever.sparse -   Found 2357 candidate paragraphs from 2357 docs in DB


### Reader

A Reader scans the texts returned by the retriever in detail and extracts the k best answers. Readers are based
on powerful, but slower, deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
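The effect of such a boost can be pictured with a toy decision rule. The sketch below is an assumption made for illustration only: the function `pick_answer`, the scores, and the comparison are hypothetical, and FARM's actual scoring logic may differ.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span, or None when the boosted no-answer score wins.

    span_candidates: list of (answer_text, score) pairs (made-up scores).
    """
    best_text, best_score = max(span_candidates, key=lambda c: c[1])
    if no_answer_score + no_ans_boost > best_score:
        return None  # "no answer possible" is preferred
    return best_text

candidates = [("Eddard", 9.5), ("Robert Baratheon", 7.1)]
without_boost = pick_answer(candidates, no_answer_score=8.0)
with_boost = pick_answer(candidates, no_answer_score=8.0, no_ans_boost=2.0)
print(without_boost, with_boost)
```

With no boost the best span wins. Adding a positive boost raises the bar that a span has to clear, so the same candidates can end in "no answer".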

#### FARMReader

In [9]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

08/20/2021 17:34:38 - INFO - farm.utils -   Using device: CUDA 
08/20/2021 17:34:38 - INFO - farm.utils -   Number of GPUs: 1
08/20/2021 17:34:38 - INFO - farm.utils -   Distributed Training: False
08/20/2021 17:34:38 - INFO - farm.utils -   Automatic Mixed Precision: None
08/20/2021 17:34:38 - INFO - filelock -   Lock 140177142582672 acquired on /root/.cache/huggingface/transformers/c40d0abb589629c48763f271020d0b1f602f5208c432c0874d420491ed37e28b.122ed338b3591c07dba452777c59ff52330edb340d3d56d67aa9117ad9905673.lock


Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

08/20/2021 17:34:38 - INFO - filelock -   Lock 140177142582672 released on /root/.cache/huggingface/transformers/c40d0abb589629c48763f271020d0b1f602f5208c432c0874d420491ed37e28b.122ed338b3591c07dba452777c59ff52330edb340d3d56d67aa9117ad9905673.lock
08/20/2021 17:34:39 - INFO - filelock -   Lock 140177159300112 acquired on /root/.cache/huggingface/transformers/eac3273a8097dda671e3bea1db32c616e74f36a306c65b4858171c98d6db83e9.084aa7284f3a51fa1c8f0641aa04c47d366fbd18711f29d0a995693cfdbc9c9e.lock


Downloading:   0%|          | 0.00/496M [00:00<?, ?B/s]

08/20/2021 17:34:54 - INFO - filelock -   Lock 140177159300112 released on /root/.cache/huggingface/transformers/eac3273a8097dda671e3bea1db32c616e74f36a306c65b4858171c98d6db83e9.084aa7284f3a51fa1c8f0641aa04c47d366fbd18711f29d0a995693cfdbc9c9e.lock
Some weights of the model checkpoint at deepset/roberta-base-squad2 were not used when initializing RobertaModel: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['ro

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

08/20/2021 17:35:03 - INFO - filelock -   Lock 140177129914192 released on /root/.cache/huggingface/transformers/81c80edb4c6cefa5cae64ccfdb34b3b309ecaf60da99da7cd1c17e24a5d36eb5.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05.lock
08/20/2021 17:35:03 - INFO - filelock -   Lock 140177094806992 acquired on /root/.cache/huggingface/transformers/b87d46371731376b11768b7839b1a5938a4f77d6bd2d9b683f167df0026af432.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock


Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

08/20/2021 17:35:04 - INFO - filelock -   Lock 140177094806992 released on /root/.cache/huggingface/transformers/b87d46371731376b11768b7839b1a5938a4f77d6bd2d9b683f167df0026af432.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock
08/20/2021 17:35:05 - INFO - filelock -   Lock 140177103803024 acquired on /root/.cache/huggingface/transformers/c9d2c178fac8d40234baa1833a3b1903d393729bf93ea34da247c07db24900d0.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0.lock


Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

08/20/2021 17:35:06 - INFO - filelock -   Lock 140177103803024 released on /root/.cache/huggingface/transformers/c9d2c178fac8d40234baa1833a3b1903d393729bf93ea34da247c07db24900d0.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0.lock
08/20/2021 17:35:06 - INFO - filelock -   Lock 140177094801424 acquired on /root/.cache/huggingface/transformers/e8a600814b69e3ee74bb4a7398cc6fef9812475010f16a6c9f151b2c2772b089.451739a2f3b82c3375da0dfc6af295bedc4567373b171f514dd09a4cc4b31513.lock


Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

08/20/2021 17:35:06 - INFO - filelock -   Lock 140177094801424 released on /root/.cache/huggingface/transformers/e8a600814b69e3ee74bb4a7398cc6fef9812475010f16a6c9f151b2c2772b089.451739a2f3b82c3375da0dfc6af295bedc4567373b171f514dd09a4cc4b31513.lock
08/20/2021 17:35:07 - INFO - farm.utils -   Using device: CUDA 
08/20/2021 17:35:07 - INFO - farm.utils -   Number of GPUs: 1
08/20/2021 17:35:07 - INFO - farm.utils -   Distributed Training: False
08/20/2021 17:35:07 - INFO - farm.utils -   Automatic Mixed Precision: None
08/20/2021 17:35:07 - INFO - farm.infer -   Got ya 2 parallel workers to do inference ...
08/20/2021 17:35:07 - INFO - farm.infer -    0    0 
08/20/2021 17:35:07 - INFO - farm.infer -   /w\  /w\
08/20/2021 17:35:07 - INFO - farm.infer -   /'\  / \
08/20/2021 17:35:07 - INFO - farm.infer -     
08/20/2021 17:35:07 - INFO - farm.utils -   Using device: CUDA 
08/20/2021 17:35:07 - INFO - farm.utils -   Number of GPUs: 1
08/20/2021 17:35:07 - INFO - farm.utils -   Distributed 

#### TransformersReader

In [10]:
# Alternative:
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
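The simplest such graph is a linear chain: the query flows into a retriever node, and the retriever's output flows into a reader node. The sketch below mimics that flow with plain functions. `ToyPipeline`, `toy_retriever`, and `toy_reader` are hypothetical stand-ins, not Haystack classes, and the "reader" here simply passes the retrieved texts through instead of running a neural model.

```python
import re

DOCS = [
    "Arya travels with her father Eddard to King's Landing",
    "Theme music was composed by Ramin Djawadi",
    "Arya names her direwolf Nymeria",
]

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def toy_retriever(data, top_k_retriever=10, **unused):
    """Keep the top_k_retriever documents sharing the most words with the query."""
    query_words = words(data["query"])
    scored = [(len(query_words & words(doc)), doc) for doc in DOCS]
    matches = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return {**data, "documents": [doc for _score, doc in matches[:top_k_retriever]]}

def toy_reader(data, top_k_reader=5, **unused):
    """Stand-in for a neural reader: wrap each retrieved text as an 'answer'."""
    answers = [{"answer": doc, "context": doc} for doc in data["documents"]]
    return {**data, "answers": answers[:top_k_reader]}

class ToyPipeline:
    """Runs nodes in insertion order; each node receives the previous node's output."""

    def __init__(self):
        self.nodes = []  # list of (name, callable)

    def add_node(self, name, component):
        self.nodes.append((name, component))

    def run(self, query, **params):
        data = {"query": query}
        for _name, component in self.nodes:
            data = component(data, **params)
        return data

toy_pipe = ToyPipeline()
toy_pipe.add_node("Retriever", toy_retriever)
toy_pipe.add_node("Reader", toy_reader)

narrow = toy_pipe.run("Who is the father of Arya?", top_k_retriever=1)
wide = toy_pipe.run("Who is the father of Arya?", top_k_retriever=10)
print(len(narrow["answers"]), len(wide["answers"]))
```

Raising `top_k_retriever` hands more candidate texts to the reader, which is the same trade-off between answer quality and speed that the `top_k_retriever` and `top_k_reader` arguments control in the real pipeline.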

In [11]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [12]:
# You can configure how many candidates the reader and retriever shall return.
# A higher top_k_retriever gives the reader more candidates, which tends to improve answers but also slows the pipeline down.
prediction = pipe.run(query="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.28 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.79 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.10 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.59 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.36 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 23.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 22.34 Batches/s]


In [13]:
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", top_k_reader=5)
# prediction = pipe.run(query="Who is the sister of Sansa?", top_k_reader=5)

In [14]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Eddard and Catelyn Stark',
        'context': 'tark ===\n'
                   'Arya Stark is the third child and younger daughter of '
                   'Eddard and Catelyn Stark. She serves as a POV character '
                   "for 33 chapters throughout ''A "},
    {   'answer': 'Robert Baratheon',
        'context': 'hen Gendry gives it to Arya, he tells her he is the '
                   'bastard son of Robert Baratheon. Aware of thei

In [39]:
import pandas as pd

# prediction['answers'] is a list of dicts, one per extracted answer
results = prediction['answers']

# Collect the fields of interest into a DataFrame; scores are scaled to percent
data = pd.DataFrame({
    'answers': [r['answer'] for r in results],
    'contexts': [r['context'] for r in results],
    'names': [r['meta']['name'] for r in results],
    'scores': [r['score'] * 100 for r in results],
})

data.head()

| | answers | contexts | names | scores |
|---|---|---|---|---|
| 0 | Eddard | s Nymeria after a legendary warrior queen. She travels with her father, Edda... | 43_Arya_Stark.txt | 98.99838 |
| 1 | Ned | \n====Season 1====\nArya accompanies her father Ned and her sister Sansa to ... | 43_Arya_Stark.txt | 97.366059 |
| 2 | Eddard and Catelyn Stark | tark ===\nArya Stark is the third child and younger daughter of Eddard and C... | 30_List_of_A_Song_of_Ice_and_Fire_characters.txt | 95.734215 |
| 3 | Robert Baratheon | hen Gendry gives it to Arya, he tells her he is the bastard son of Robert Ba... | 43_Arya_Stark.txt | 95.309973 |
| 4 | Lord Eddard and Catelyn Stark | rk of House Stark is the younger daughter and third child of Lord Eddard and... | 349_List_of_Game_of_Thrones_characters.txt | 95.06138 |


## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://apply.workable.com/deepset/) 