<a href="https://colab.research.google.com/github/Shhreyya/py/blob/master/QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Build Your First Question Answering System

> We've modified this first tutorial to make it simpler to start with. If you're looking for a Question Answering tutorial that uses a DocumentStore such as Elasticsearch, go to our new [Build a Scalable Question Answering System](https://haystack.deepset.ai/tutorials/03_scalable_qa_system) tutorial.

- **Level**: Beginner
- **Time to complete**: 15 minutes
- **Nodes Used**: `InMemoryDocumentStore`, `BM25Retriever`, `FARMReader`
- **Goal**: After completing this tutorial, you will have learned about the Reader and Retriever, and built a question answering pipeline that can answer questions about the Game of Thrones series.


## Overview

Learn how to build a question answering system using Haystack's DocumentStore, Retriever, and Reader. Your system will use Game of Thrones files and will be able to answer questions like "Who is the father of Arya Stark?". But you can use it to run on any other set of documents, such as your company's internal wikis or a collection of financial reports. 

To help you get started quicker, we simplified certain steps in this tutorial. For example, Document preparation and pipeline initialization are handled by ready-made classes that replace lines of initialization code. But don't worry! This doesn't affect how well the question answering system performs.


## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/log-level)


## Installing Haystack

To start, let's install the latest release of Haystack with `pip`:

In [2]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 27.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.0.1
    Uninstalling pip-23.0.1:
      Successfully uninstalled pip-23.0.1
Successfully installed pip-23.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab]
  Downloading farm_haystack-1.15.1-py3-none-any.whl (681 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 681.0/681.0 kB 10.0 MB/s eta 0:00:00
Collecting azure-ai-formrecognizer>=3.2.0b2 (from farm-haystack[colab])
  Downloading azure_ai_formrecognizer-3.3.0b1-py3-none-any.whl (299 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 299.9/299.9 kB 24.2 MB/s eta 0:00:00
Collecting boilerpy3 (from farm-haystack[colab])
 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.12.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.20.2 which is incompatible.
tensorflow-metadata 1.13.1 requires protobuf<5,>=3.20.3, but you have protobuf 3.20.2 which is incompatible.


### Enabling Telemetry 
Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details.

In [3]:
from haystack.telemetry import tutorial_running

tutorial_running(1)

Set the logging level to INFO:

In [4]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, we're using the `InMemoryDocumentStore`, which is the simplest DocumentStore to get started with. It requires no external dependencies and it's a good option for smaller projects and debugging. But it doesn't scale up so well to larger Document collections, so it's not a good choice for production systems. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

Let's initialize the the DocumentStore:

In [5]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


The DocumentStore is now ready. Now it's time to fill it with some Documents.

## Preparing Documents

1. Download 517 articles from the Game of Thrones Wikipedia. You can find them in *data/build_your_first_question_answering_system* as a set of *.txt* files.

In [6]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_your_first_question_answering_system"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir
)

INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip to 'data/build_your_first_question_answering_system'


True

2. Use `TextIndexingPipeline` to convert the files you just downloaded into Haystack [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) and write them into the DocumentStore:

In [19]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)



INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


Converting files:   0%|          | 0/184 [00:00<?, ?it/s]

Preprocessing:   0%|          | 0/184 [00:00<?, ?docs/s]



Updating BM25 representation...:   0%|          | 0/2433 [00:00<?, ? docs/s]

{'documents': [<Document: {'content': '\n\n"\'\'\'The Bear and the Maiden Fair\'\'\'" is a folk song in \'\'A Song of Ice and Fire\'\', and it is sung in the television series adaptation \'\'Game of Thrones\'\'.  The lyrics are provided by George R. R. Martin in the original novel; Ramin Djawadi composed the tune\'s music in 2012, at the request of the series creators David Benioff and D. B. Weiss, and the recording, by The Hold Steady, was arranged by Tad Kubler.\n\n==History==\nThe US indie rock band The Hold Steady recorded "The Bear and the Maiden Fair" for season 3. Brienne and Jaime\'s captors (who include musician Gary Lightbody from Snow Patrol, in a cameo appearance) sing the song in episode 3 of that season ("Walk of Punishment"), and The Hold Steady\'s recording is played over the end credits. The recording was released on a seven-inch record on April 20, 2013.\n\nIn the \'\'A Song of Ice and Fire\'\' novels, "The Bear and the Maiden Fair" is a traditional song popular among

The code in this tutorial uses the Game of Thrones data, but you can also supply your own *.txt* files and index them in the same way.

As an alternative, you can cast you text data into [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) and write them into the DocumentStore using `DocumentStore.write_documents()`.

## Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. This tutorial uses the BM25 algorithm. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

Let's initialize a BM25Retriever and make it use the InMemoryDocumentStore we initialized earlier in this tutorial:

In [8]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

The Retriever is ready but we still need to initialize the Reader. 

## Initializing the Reader

A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. In this tutorial, we're using a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a strong all-round model that's good as a starting point. To find the best model for your use case, see [Models](https://haystack.deepset.ai/pipeline_nodes/reader#models).

Let's initialize the Reader:

In [9]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="google/electra-large-discriminator", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading (…)lve/main/config.json:   0%|          | 0.00/668 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'google/electra-large-discriminator' (Electra)


Downloading pytorch_model.bin:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'google/electra-large-discriminator' (Electra model) from model hub.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassi

Downloading (…)okenizer_config.json:   0%|          | 0.00/27.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


We've initalized all the components for our pipeline. We're now ready to create the pipeline.

## Creating the Retriever-Reader Pipeline

In this tutorial, we're using a ready-made pipeline called `ExtractiveQAPipeline`. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines).

To create the pipeline, run:

In [10]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

The pipeline's ready, you can now go ahead and ask a question!

## Asking a Question

1. Use the pipeline `run()` method to ask a question. The query argument is where you type your question. Additionally, you can set the number of documents you want the Reader and Retriever to return using the `top-k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top-k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).

In [21]:
prediction = pipe.run(
    query="Washing machine is over foaming",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Here are some questions you could try out:
- Who is the father of Arya Stark?
- Who created the Dothraki vocabulary?
- Who is the sister of Sansa?

2. Print out the answers the pipeline returned:

In [22]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': '.\n\nThe door is not locked properly\nClose the door properly until you can hear an audible click. An overloaded washing machine drum may be preventing the door from being', 'type': 'extractive', 'score': 0.010078501887619495, 'context': '.\n\nThe door is not locked properly\nClose the door properly until you can hear an audible click. An overloaded washing machine drum may be preventing the door from being', 'offsets_in_document': [{'start': 596, 'end': 764}], 'offsets_in_context': [{'start': 0, 'end': 168}], 'document_ids': ['10a1b519f616f5fdcb655bc4a1d7b9ef'], 'meta': {'_split_id': 7}}>,
             <Answer {'answer': '.\n\nWashing machine is not taking detergent\nThe', 'type': 'extractive', 'score': 0.008626541122794151, 'context': 'inspection by a trained and qualified Bosch engineer.\n\nWashing machine is not taking detergent\nThe water pressure is too low\nCheck the water pressure ', 'offsets_in_document': [{'start': 195, 'end': 241}], 'offsets_

3. Simplify the printed answers:

In [25]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum" ## Choose from `minimum`, `medium`, and `all`
)

'Query: Washing machine is over foaming'
'Answers:'
[   {   'answer': '.\n'
                  '\n'
                  'The door is not locked properly\n'
                  'Close the door properly until you can hear an audible '
                  'click. An overloaded washing machine drum may be preventing '
                  'the door from being',
        'context': '.\n'
                   '\n'
                   'The door is not locked properly\n'
                   'Close the door properly until you can hear an audible '
                   'click. An overloaded washing machine drum may be '
                   'preventing the door from being'},
    {   'answer': '.\n\nWashing machine is not taking detergent\nThe',
        'context': 'inspection by a trained and qualified Bosch engineer.\n'
                   '\n'
                   'Washing machine is not taking detergent\n'
                   'The water pressure is too low\n'
                   'Check the water pressure '},
    {   

And there you have it! Congratulations on building your first machine learning based question answering system!

# Next Steps

Check out [Build a Scalable Question Answering System](https://haystack.deepset.ai/tutorials/03_scalable_qa_system) to learn how to make a more advanced question answering system that uses an Elasticsearch backed DocumentStore and makes more use of the flexibility that pipelines offer.