# Predicting most relevant answers to a particular question from a given set of paragraphs and returning the exact answer
---


The dataset used here, along with all preprocessed datasets, is available in this [repository](https://github.com/ArkadeepAcharya/Question-Answering).


## Preparing the Colab Environment

- Enable GPU Runtime in Colab
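
The logs further down report `Using devices: CPU - Number of GPUs: 0`, which suggests the GPU runtime was not active when those cells ran. A minimal sketch for checking this before loading any model, assuming PyTorch is the backend in use (the helper name `gpu_available` is hypothetical):

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumed to be present as a Haystack dependency
    except ImportError:
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` in Colab, the runtime type probably still needs to be switched to GPU.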



## Installing Haystack
Haystack provides a convenient way to store documents and run inference with our own models.


In [None]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab]

Set the logging level to INFO for Haystack and to WARNING for everything else:

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the DocumentStore

A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, we're using the `InMemoryDocumentStore`, which is the simplest DocumentStore to get started with. It requires no external dependencies and is a good option for smaller projects and debugging.
Let's initialize the DocumentStore:

In [11]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


The DocumentStore is now ready. Next, we fill it with the paragraphs.

In [12]:
import pandas as pd
df_para = pd.read_csv("paragraphs.csv", index_col="id")

In [13]:
df_para.head()

| id | paragraph | theme |
|---|---|---|
| 1 | The iPod is a line of portable media players and multi-purpose pocket comput... | IPod |
| 2 | Like other digital music players, iPods can serve as external data storage d... | IPod |
| 3 | Apple's iTunes software (and other alternative software) can be used to tran... | IPod |
| 4 | Before the release of iOS 5, the iPod branding was used for the media player... | IPod |
| 5 | In mid-2015, a new model of the iPod Touch was announced by Apple, and was o... | IPod |


In [22]:
df_para['theme'].unique()

array(['IPod', '2008_Sichuan_earthquake', 'Wayback_Machine',
       'Canadian_Armed_Forces', 'Cardinal_(Catholicism)',
       'Human_Development_Index', 'Heresy', 'Warsaw_Pact', 'Materialism',
       'Pub', 'Web_browser', 'Catalan_language', 'Paper',
       'Adult_contemporary_music', 'Nanjing', 'Dialect', 'Southampton',
       'The_Times', 'Immunology', 'Imamah_(Shia_doctrine)', 'Grape',
       'United_States_dollar', 'Everton_F.C.', 'Hard_rock',
       'Great_Plains', 'Biodiversity', 'Federal_Bureau_of_Investigation',
       'Mary_(mother_of_Jesus)', 'Unknown', 'DevRev'], dtype=object)

In [14]:
# Convert each row into the dictionary format that Haystack expects.
docs = []
for index, row in df_para.iterrows():
    docs.append({
        "content": row["paragraph"],
        "meta": {"name": row["theme"]},
        "id": index,
    })
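
The conversion above can also be sketched without pandas, which makes it easy to add a guard against empty paragraphs. This is only an illustration: the two sample rows are shortened stand-ins in the same column layout as `paragraphs.csv`, and the helper name `rows_to_docs` is hypothetical.

```python
import csv
import io

# Two stand-in rows using the same columns as paragraphs.csv (id, paragraph, theme).
SAMPLE_CSV = """id,paragraph,theme
1,The iPod is a line of portable media players.,IPod
2,,IPod
"""


def rows_to_docs(csv_text: str) -> list[dict]:
    """Convert CSV rows into Haystack-style document dictionaries."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        paragraph = (row["paragraph"] or "").strip()
        if not paragraph:
            continue  # an empty paragraph gives the retriever and reader nothing to use
        docs.append(
            {
                "content": paragraph,
                "meta": {"name": row["theme"]},
                "id": str(row["id"]),  # string ids, matching the printed `document_ids`
            }
        )
    return docs


sample_docs = rows_to_docs(SAMPLE_CSV)
print(sample_docs)
```

The second sample row is skipped because its paragraph is empty, so only one document is produced.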

## Preparing Documents

We add the paragraphs to the document store, clearing any previously stored documents first.

In [None]:
document_store.delete_documents()
document_store.write_documents(docs)


Updating BM25 representation...:   0%|          | 0/1179 [00:00<?, ? docs/s]

## Initializing the Retriever

We use a BM25 retriever to select the most relevant paragraphs before using the Reader to find the exact answer. Running the Reader on the entire dataset would be extremely computationally expensive.
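
To give an intuition for how BM25 ranks paragraphs, here is a minimal sketch of textbook Okapi BM25 with a non-negative IDF variant. It is not Haystack's implementation, which may differ in tokenization and IDF details; the toy corpus, the `tokenize` helper, and the default `k1` and `b` values are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real tokenizers are more careful."""
    return text.lower().split()


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents need more matches to score well.
            norm = term_freq[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "the ipod is a portable media player",
    "the earthquake struck sichuan in 2008",
    "ipod storage capacity varies by model",
]
scores = bm25_scores("ipod storage capacity", corpus)
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranked)  # document indices, best match first
```

The third toy document matches all three query terms and ranks first, while the earthquake sentence shares no terms with the query and scores zero.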

In [15]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

The Retriever is ready but we still need to initialize the Reader. 

## Initializing the Reader

A Reader scans the texts it receives from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. We're using a `FARMReader` with a question answering model called `deepset/minilm-uncased-squad2`.
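
Extractive readers of this kind typically assign each token a start score and an end score, then pick the span with the highest combined score. The sketch below shows only that selection step with made-up scores. The function name `best_span` and the `max_answer_len` cap are assumptions, and a real reader additionally handles long documents and unanswerable questions.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) for the highest-scoring valid span."""
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_scores):
        # The end token may not precede the start token or exceed the length cap.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


tokens = ["the", "ultra-compact", "ipod", "shuffle", "is", "small"]
start_scores = [0.1, 0.3, 2.5, 0.2, 0.0, 0.1]  # made-up numbers
end_scores = [0.0, 0.1, 0.4, 3.0, 0.2, 0.1]    # made-up numbers

start, end, score = best_span(start_scores, end_scores)
answer_text = " ".join(tokens[start : end + 1])
print(answer_text)
```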

Let's initialize the Reader:

In [16]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2", use_gpu=True, progress_bar=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


Downloading (…)lve/main/config.json:   0%|          | 0.00/477 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/minilm-uncased-squad2' (Bert)


Downloading pytorch_model.bin:   0%|          | 0.00/133M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/minilm-uncased-squad2' (Bert model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


We've initialized all the components for our pipeline. We're now ready to create the pipeline.

## Creating the Retriever-Reader Pipeline
To create the pipeline, run:

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

The pipeline is ready. You can now go ahead and ask a question!

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of documents the Retriever returns and the number of answers the Reader returns using the `top_k` parameter of each node.

For experimentation purposes, we take the top 10 paragraphs from the Retriever and keep the top 5 answers given by the Reader.

In [None]:
prediction = pipe.run(
    query="Which current iPod product features the largest data storage capacity?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
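
The effect of the two `top_k` values can be sketched with toy stand-ins: a cheap scoring function ranks every document, and the expensive reader only runs on the few that survive. Everything here (the `overlap` score, `toy_reader`, and the sample documents) is hypothetical and only mirrors the shape of the retrieve-then-read flow, not Haystack's internals.

```python
def overlap(query, doc):
    """Cheap relevance score: number of words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


reader_calls = []  # records which documents the expensive stage actually saw


def toy_reader(query, doc):
    """Stand-in for a neural reader: returns (answer, score) for one document."""
    reader_calls.append(doc)
    answer = " ".join(doc.split()[:2])
    return answer, overlap(query, doc) / len(query.split())


def retrieve_then_read(query, documents, retriever_top_k, reader_top_k):
    # Stage 1: rank all documents cheaply and keep only the best few.
    ranked = sorted(documents, key=lambda d: overlap(query, d), reverse=True)
    candidates = ranked[:retriever_top_k]
    # Stage 2: run the reader on the candidates and keep the best answers.
    answers = [toy_reader(query, d) for d in candidates]
    answers.sort(key=lambda pair: pair[1], reverse=True)
    return answers[:reader_top_k]


documents = [
    "paper is made from pulp",
    "ipod shuffle has 2 gb",
    "ipod touch has 128 gb storage",
    "the pub serves ale",
]
top_answers = retrieve_then_read(
    "ipod touch storage", documents, retriever_top_k=2, reader_top_k=1
)
print(top_answers, len(reader_calls))
```

With `retriever_top_k=2`, the reader runs on two of the four documents. A larger value trades speed for a better chance that the paragraph holding the answer reaches the reader.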

Here are some questions you could try out:
- What is the largest data capacity for an iPod product?
- What did an official with the Seismological Bureau deny receiving?
- Where were office towers evacuated?

2. Print out the answers the pipeline returned:

In [None]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'iPod Shuffle', 'type': 'extractive', 'score': 0.7516720294952393, 'context': ' on July 15, 2015. There are three current versions of the iPod: the ultra-compact iPod Shuffle, the compact iPod Nano and the touchscreen iPod Touch.', 'offsets_in_document': [{'start': 356, 'end': 368}], 'offsets_in_context': [{'start': 83, 'end': 95}], 'document_ids': ['1'], 'meta': {'name': 'IPod'}}>,
             <Answer {'answer': 'U2', 'type': 'extractive', 'score': 0.7421851754188538, 'context': 'In 2006 Apple presented a special edition for iPod 5G of Irish rock band U2. Like its predecessor, this iPod has engraved the signatures of the four m', 'offsets_in_document': [{'start': 73, 'end': 75}], 'offsets_in_context': [{'start': 73, 'end': 75}], 'document_ids': ['8'], 'meta': {'name': 'IPod'}}>,
             <Answer {'answer': 'iPod Hi-Fi', 'type': 'extractive', 'score': 0.5721650123596191, 'context': ' number are made by third party companies, although many, such as t

3. Print the predicted answers in a simplified format:

In [None]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum",  # Choose from `minimum`, `medium`, and `all`
)

'Query: Which current iPod product features the largest data storage capacity?'
'Answers:'
[   {   'answer': 'iPod Shuffle',
        'context': ' on July 15, 2015. There are three current versions of the '
                   'iPod: the ultra-compact iPod Shuffle, the compact iPod '
                   'Nano and the touchscreen iPod Touch.'},
    {   'answer': 'U2',
        'context': 'In 2006 Apple presented a special edition for iPod 5G of '
                   'Irish rock band U2. Like its predecessor, this iPod has '
                   'engraved the signatures of the four m'},
    {   'answer': 'iPod Hi-Fi',
        'context': ' number are made by third party companies, although many, '
                   'such as the iPod Hi-Fi, are made by Apple. Some '
                   'accessories add extra features that other mu'},
    {   'answer': 'iPod Touch',
        'context': 'ies by model, ranging from 2 GB for the iPod Shuffle to '
                   '128 GB for the iPod Touch (previous
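
For further processing, it can help to reduce the answers to plain tuples and drop low-confidence ones. The sketch below assumes each answer object exposes `answer`, `score`, and `document_ids` attributes, as the printed output suggests. The helper name `summarize_answers`, the `min_score` threshold, and the `SimpleNamespace` stand-ins are hypothetical; with Haystack installed you would pass `prediction["answers"]` instead.

```python
from types import SimpleNamespace


def summarize_answers(answers, min_score=0.0):
    """Return (answer, rounded score, document ids) for answers above a threshold."""
    rows = []
    for ans in answers:
        if ans.score is not None and ans.score >= min_score:
            rows.append((ans.answer, round(ans.score, 3), list(ans.document_ids)))
    return rows


# Stand-ins copying the first two answers from the printed prediction.
sample_answers = [
    SimpleNamespace(answer="iPod Shuffle", score=0.7516720294952393, document_ids=["1"]),
    SimpleNamespace(answer="U2", score=0.7421851754188538, document_ids=["8"]),
]

summary = summarize_answers(sample_answers, min_score=0.75)
print(summary)
```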

Thus we get the most relevant answers from the given text. The accuracy of this approach can be further enhanced by distilling the model on our own custom dataset and ranking the retrieved paragraphs efficiently.
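
To track whether such changes actually help, predicted answers can be compared against reference answers. The sketch below follows the commonly used SQuAD-style exact match and token-level F1. It is not an evaluation taken from this notebook, and the reference answer `"iPod Touch"` is a hypothetical gold label chosen for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    shared = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "iPod Touch"  # hypothetical gold answer
em_score = exact_match("The iPod Touch", reference)
f1_score = token_f1("iPod Shuffle", reference)
print(em_score, f1_score)
```

Averaging these two numbers over a set of questions gives a simple baseline to compare against after any change to the retriever or reader.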