# Answering questions from a document corpus in an extractive manner

When a corpus contains a large number of documents, it’s not feasible to load and scan the document content at query time to answer a question. Such an approach leads to long query times and is not suitable for production-grade systems.

In this recipe, we will learn how to preprocess and index the documents ahead of time so that the system can retrieve candidate documents quickly and extract the answer to a given question with short query times.


### Getting ready

As part of this recipe, we will use the **Haystack** (https://haystack.deepset.ai/) framework to build a **QA system** that can answer questions from a document corpus. We will download a dataset based on Game of Thrones and index it. For our QA system to be performant, we will need to index the documents beforehand. Once the documents are indexed, answering a question follows a two-step process:


1. Retriever: Since we have many documents, scanning each document to fetch an answer is not a feasible approach. We first retrieve a set of candidate documents that may contain an answer to our question. This step is performed by a Retriever component, which searches the pre-created index to reduce the number of documents that we need to scan for the exact answer.
2. Reader: Once we have a candidate set of documents that could contain the answer, we search these documents to extract the exact answer to our question.
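
The two steps above can be sketched in plain Python. This is only a toy illustration under stated assumptions, not how Haystack implements either component: the hypothetical `retrieve` function ranks documents by counting query-term occurrences, and the hypothetical `read` function returns the best-matching sentence, whereas a real reader uses a neural model to extract a shorter answer span.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by query-term occurrences, keep top_k."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def read(query, candidates, top_k=1):
    """Toy reader: return the candidate sentences sharing the most query terms."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in candidates:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            overlap = len(query_terms & set(tokenize(sentence)))
            if overlap:
                scored.append((overlap, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


# Made-up miniature corpus, for illustration only
corpus = [
    "Arya Stark is the daughter of Ned Stark. She trains with a sword called Needle.",
    "Tyrion Lannister is the youngest child of Tywin Lannister.",
    "Winterfell is the seat of House Stark in the North.",
]
question = "Who is the father of Arya Stark?"

candidates = retrieve(question, corpus, top_k=2)  # step 1: narrow the corpus
answers = read(question, candidates, top_k=1)     # step 2: scan only the candidates
print(answers)
```

The point of the split is cost: the cheap retriever touches every document, while the expensive reader only sees the few candidates that survive the first step.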


Imports

In [5]:
%pip install farm-haystack

Collecting farm-haystack
  Downloading farm_haystack-1.26.4.post0-py3-none-any.whl.metadata (28 kB)

In [23]:
import os
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.pipelines.standard_pipelines import(
    TextIndexingPipeline)
from haystack.utils import (fetch_archive_from_http,
    print_answers)

In this step, we specify the folder in which our dataset will be saved and then retrieve the dataset from the source. The second parameter of the fetch_archive_from_http function, output_dir, is the folder into which the dataset is downloaded; we set it to the folder that we defined in the first line. The function decompresses the .zip archive and extracts all files into that folder. We then read the folder, create a list of the files it contains, and print the number of files:

In [24]:
doc_dir = "data/got_dataset"
fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir,
    )
files_to_index = [doc_dir + "/" + f for f in os.listdir(
    doc_dir)]
print(len(files_to_index))

183
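
As a side note, `os.listdir` returns every entry in the folder, including subfolders and any stray non-text files. A slightly more defensive way to build the file list is sketched below, assuming the archive extracts to `.txt` files. The `list_text_files` helper is hypothetical and not part of Haystack, and the demo runs on a throwaway folder rather than on the real dataset.

```python
import tempfile
from pathlib import Path


def list_text_files(folder, suffix=".txt"):
    """Return the sorted paths (as strings) of regular files with the given suffix."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix == suffix
    )


# Demo on a temporary folder with two text files, one other file, and a subfolder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("Arya")
    (Path(tmp) / "b.txt").write_text("Ned")
    (Path(tmp) / "notes.md").write_text("skip me")
    (Path(tmp) / "nested").mkdir()
    demo_files = list_text_files(tmp)

print(len(demo_files))
```

With the dataset from this recipe, the call would be `list_text_files(doc_dir)`.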


We initialize a document store and index the files into it. To achieve this, we create an InMemoryDocumentStore instance with the use_bm25 argument set to True, build an indexing pipeline on top of the document store, and execute the indexing operation. With this setting, the document store uses Best Match 25 (BM25) as the algorithm for the retriever step. BM25 is a simple bag-of-words algorithm whose scoring function uses the number of times a term is present in a document and the length of the document. Chapter 3 covers the BM25 algorithm in more detail, and we recommend that you refer to that chapter for a better understanding. Note that there are various other DocumentStore options, such as Elasticsearch, OpenSearch, and so on. We use an InMemoryDocumentStore to keep the recipe simple and to focus on the retriever and reader concepts:



In [29]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25 = True)
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths = files_to_index)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
Converting files: 100%|██████████| 183/183 [00:01<00:00, 140.33it/s]
Preprocessing: 100%|██████████| 183/183 [00:06<00:00, 27.69docs/s]
Updating BM25 representation...: 100%|██████████| 2359/2359 [00:00<00:00, 13349.15 docs/s]


{'documents': [<Document: {'content': "\n\n'''Petyr Baelish''', nicknamed '''Littlefinger''', is a fictional character in the ''A Song of Ice and Fire'' series of fantasy novels by American author George R. R. Martin, and its television adaptation ''Game of Thrones''.\n\nIntroduced in 1996's ''A Game of Thrones'', Littlefinger is the master of coin on King Robert's small council. He is a childhood friend of Catelyn Stark, having grown up with her and her two siblings at Riverrun. He subsequently appeared in Martin's books ''A Clash of Kings'' (1998), ''A Storm of Swords'' (2000), and ''A Feast for Crows'' (2005). He is set to appear in the forthcoming novel ''The Winds of Winter''. Littlefinger's primary character attributes are his cunning and boundless ambition. Originally hailing from a minor family with little wealth or influence, Baelish used manipulation, bribery, and the connections he secured at Riverrun to gain power and prestige in King's Landing. Since then, his various intr
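
To make the BM25 scoring idea more concrete, here is a minimal sketch of a textbook Okapi BM25 scorer in plain Python. It is an illustration under assumptions, not the code Haystack runs: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the `+ 1` inside the logarithm (a common variant that keeps the inverse document frequency non-negative) are choices made for this example, and the library's exact variant and defaults may differ.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query terms with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Made-up miniature corpus, already tokenized
docs = [
    "ned stark is the father of arya stark".split(),
    "tyrion lannister is the son of tywin lannister".split(),
    "arya".split(),
]

scores = bm25_scores(["arya", "father"], docs)
arya_only = bm25_scores(["arya"], docs)
print(scores)
print(arya_only)
```

In the second call, the one-word document outranks the longer one even though both contain the query term once. That is the document-length normalization at work.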

Once the documents are indexed, we initialize the retriever and reader components. BM25Retriever uses the BM25 scoring function to retrieve the initial set of documents. For the reader, the recipe calls for a FARMReader object, which is based on deepset’s FARM framework and can utilize the QA models from Hugging Face. The cells that follow attempt to install the farm-haystack[inference] extra; the tokenizers wheel fails to build in this environment, and the final cell uses TransformersReader instead, which loads the same model through the Hugging Face transformers library. In both cases, we use the deepset/roberta-base-squad2 model as the reader. The use_gpu argument can be set appropriately based on whether your device has a GPU or not:

In [30]:
#!pip install farm-haystack[inference]

In [31]:
!sudo apt update
!sudo apt install -y build-essential curl
# Install Rust non-interactively (-y); it is needed to build tokenizers from source
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
# `!source` only affects a throwaway subshell, so extend PATH for this kernel instead
os.environ["PATH"] = os.path.expanduser("~/.cargo/bin") + os.pathsep + os.environ["PATH"]


Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 133 kB in 1s (129 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
32 packages can be upgraded. Run 'apt list --upgradable'

In [32]:
!pip install --upgrade pip setuptools wheel




In [33]:
!pip install tokenizers --prefer-binary




In [34]:
!pip install tokenizers==0.13.3


Collecting tokenizers==0.13.3
  Using cached tokenizers-0.13.3.tar.gz (314 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> No available output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... error
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
error: failed-wheel-build-for-install

× Failed to build installable wheels for some pyproject.toml based projects
╰─> tokenizers


In [35]:
!pip install transformers==4.32.1
!pip install farm-haystack[inference]


Collecting transformers==4.32.1
  Using cached transformers-4.32.1-py3-none-any.whl.metadata (118 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.32.1)
  Using cached tokenizers-0.13.3.tar.gz (314 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Using cached transformers-4.32.1-py3-none-any.whl (7.5 MB)
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> No available output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... error
  ERROR: Failed building wheel for tokenizers

In [36]:
from haystack.nodes import BM25Retriever, TransformersReader

retriever = BM25Retriever(document_store=document_store)
reader = TransformersReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)


Device set to use cpu


We now create a pipeline that we can use to answer questions. Having initialized the retriever and reader in the previous step, we combine them for querying. Haystack’s pipeline abstraction lets us integrate the reader and retriever, and the framework provides a series of ready-made pipelines that address different use cases. In this instance, we use ExtractiveQAPipeline for our QA system. After initializing the pipeline, we extract the answer to a question about the Game of Thrones series. The run method takes the question as the query. The second argument, params, controls how many results each component passes on:

1. "Retriever": {"top_k": 10}: The top_k keyword argument specifies that the top-k (in this case, 10) results from the retriever are used by the reader to search for the exact answer
2. "Reader": {"top_k": 5}: The top_k keyword argument specifies that the top-k (in this case, 5) results from the reader are presented as the output of the method:

In [37]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
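
The dictionary returned by `pipe.run` holds the reader's results under the `answers` key, and each answer carries the extracted text and a score. The sketch below shows one way to keep only answers above a score threshold. It uses a hypothetical `AnswerStub` class in place of Haystack's answer objects so that it runs on its own; the scores and the `min_score` value are made up for illustration, and a suitable threshold depends on the reader and should be tuned.

```python
from dataclasses import dataclass


@dataclass
class AnswerStub:
    """Hypothetical stand-in exposing only the two fields used in this sketch."""
    answer: str
    score: float


def filter_answers(answers, min_score):
    """Keep the answers whose score is at least min_score, preserving order."""
    return [ans for ans in answers if ans.score >= min_score]


# Illustrative values only, not real model output
stub_answers = [
    AnswerStub("Ned", 1.91),
    AnswerStub("Ned", 1.18),
    AnswerStub("Eddard", 0.99),
    AnswerStub("Robert", 0.12),
]

kept = filter_answers(stub_answers, min_score=0.5)
for ans in kept:
    print(f"{ans.answer}: {ans.score:.2f}")
```

With a real prediction, the same function could be applied as `filter_answers(prediction["answers"], min_score=...)`, assuming each answer object exposes a numeric `score` attribute.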



We print the answers to our question. The system prints each answer along with the context from which it was extracted. Note that we use the value all for the details argument, which prints the start and end spans of each answer along with all the auxiliary information. Setting details to medium provides the relative score of each answer. This score can be used to filter the results further based on the accuracy requirements of the system. Setting details to minimum presents only the answer and the context. We encourage you to make a suitable choice based on your requirements:

In [38]:
print_answers(prediction, details = "all")

'Query: Who is the father of Arya Stark?'
'Answers:'
[   <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 1.9078067541122437, 'context': " the television series.\n\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya'", 'offsets_in_document': [{'start': 630, 'end': 633}], 'offsets_in_context': [{'start': 70, 'end': 73}], 'document_ids': ['7d3360fa29130e69ea6b2ba5c5a8f9c8'], 'meta': {'_split_id': 10}}>,
    <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 1.1788423657417297, 'context': "l disguised as a boy all along and is surprised to learn she is Arya, Ned Stark's daughter. After the Goldcloaks get help from Ser Amory Lorch ", 'offsets_in_document': [{'start': 848, 'end': 851}], 'offsets_in_context': [{'start': 70, 'end': 73}], 'document_ids': ['257088f56d2faba55e2ef2ebd19502dc'], 'meta': {'_split_id': 31}}>,
    <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9892422031261958, 'context':