<a href="https://colab.research.google.com/github/KunalSinha1125/haystack/blob/master/tutorials/Tutorial12_LFQA_modified.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Long-Form Question Answering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial12_LFQA.ipynb)

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [4]:
# Make sure you have a GPU running
!nvidia-smi

Fri Jan 28 23:14:48 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
# Install the latest release of Haystack in your own environment 
!pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git

# Install FAISS
!pip install faiss-gpu

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-h2me1n5p
  Running command git clone --filter=blob:none -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-h2me1n5p
  Resolved https://github.com/deepset-ai/haystack.git to commit ee6b8d068834eae064cee15f20adf8ff18f88809
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, clean_wiki_text
from haystack.nodes import Seq2SeqGenerator

### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried when searching for answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For guidance on which index suits your use case, see: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
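To make the "Flat" flavour concrete, here is a minimal pure-Python sketch of what an exhaustive index does conceptually: it scores every stored vector against the query and returns the best matches. This is not FAISS code; the function name `flat_search`, the toy vectors, and the use of inner-product scoring are assumptions made for illustration only.

```python
def flat_search(index_vectors, query_vector, top_k=2):
    """Score every stored vector against the query (brute force) and return the top_k hits."""
    scored = []
    for vector_id, vector in enumerate(index_vectors):
        # Inner product as the similarity score (an assumption for this sketch)
        score = sum(a * b for a, b in zip(vector, query_vector))
        scored.append((score, vector_id))
    # Highest score first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(vector_id, score) for score, vector_id in scored[:top_k]]


# Toy 2-dimensional "embeddings"; real embeddings in this tutorial have 128 dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = flat_search(toy_vectors, [1.0, 0.2], top_k=2)
hit_ids = [vector_id for vector_id, _ in hits]
print(hit_ids)
```

An approximate index such as HNSW avoids scoring every vector, which is where the speed gain and the small accuracy loss come from.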

In [3]:
import faiss
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str="Flat")

### Cleaning & indexing documents

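As in the previous tutorials, we convert and index documents into our DocumentStore. The original tutorial downloads Game of Thrones articles (the commented-out lines below); this modified version instead reads two local text files, `rescadm.txt` and `ibtogpa.txt`.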

In [17]:
# Let's first get some files that we want to use
#doc_dir = "data/article_txt_got"
#s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
#fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# Convert files to dicts
#dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

dicts = []
for file_name in [r'rescadm.txt', r'ibtogpa.txt']:
    with open(file_name, 'r') as f:  # open the current file, not always the first one
        dic = {
            'content': f.read(),
            'meta': {
                'name': file_name
            }
        }
        dicts.append(dic)
    
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

Writing Documents:   0%|          | 0/2 [00:00<?, ?it/s]
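The commented-out `convert_files_to_dicts` call passes `split_paragraphs=True`, whereas the loop above stores each file as a single document. If you want paragraph-level documents from your own files, a rough equivalent could look like the sketch below. The helper `text_to_dicts` is hypothetical, and splitting on blank lines is an assumption about what counts as a paragraph; only the `content` / `meta` / `name` layout is taken from the cell above.

```python
def text_to_dicts(text, name):
    """Split text on blank lines and wrap each non-empty paragraph in the dict layout used above."""
    paragraphs = [paragraph.strip() for paragraph in text.split("\n\n")]
    return [
        {"content": paragraph, "meta": {"name": name}}
        for paragraph in paragraphs
        if paragraph  # skip empty chunks left over from repeated blank lines
    ]


sample_text = "First paragraph.\n\nSecond paragraph.\n\n\n"
paragraph_dicts = text_to_dicts(sample_text, "example.txt")
print(len(paragraph_dicts))
```

The resulting list can be passed to `document_store.write_documents` in the same way as `dicts`. Shorter documents tend to give the retriever more focused passages to match against.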

### Initialize Retriever and Reader/Generator

#### Retriever

**Here:** We use an `EmbeddingRetriever` with the RetriBERT model (`yjernite/retribert-base-uncased`) and invoke `update_embeddings` to index the document embeddings in the `FAISSDocumentStore`.



In [18]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="yjernite/retribert-base-uncased",
                               model_format="retribert")

document_store.update_embeddings(retriever)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model yjernite/retribert-base-uncased
Some weights of RetriBertModel were not initialized from the model checkpoint at yjernite/retribert-base-uncased and are newly initialized: ['bert_query.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO - haystack.document_stores.faiss -  Updating embeddings for 2358 docs...


Updating Embedding:   0%|          | 0/2358 [00:00<?, ? docs/s]

Creating Embeddings:   0%|          | 0/74 [00:00<?, ? Batches/s]

Before we blindly use the retriever, let's empirically test it to make sure a simple search indeed finds the relevant documents.

In [19]:
from haystack.utils import print_documents
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(
    query="Tell me something about Arya Stark?",
    params={"Retriever": {"top_k": 10}}
)
print_documents(res, max_text_len=512)



Query: Tell me something about Arya Stark?

{   'content': '\n'
               '=== Arya Stark ===\n'
               'Arya Stark is the third child and younger daughter of Eddard '
               'and Catelyn Stark. She serves as a POV character for 33 '
               "chapters throughout ''A Game of Thrones'', ''A Clash of "
               "Kings'', ''A Storm of Swords'', ''A Feast for Crows'', and ''A "
               "Dance with Dragons''. So far she is the only character to "
               'appear in all 5 books as a POV character.\n'
               'In the HBO television adaptation, she is portrayed by Maisie '
               'Williams.',
    'name': '30_List_of_A_Song_of_Ice_and_Fire_characters.txt'}

{   'content': '\n'
               '=== Background ===\n'
               'Arya is the third child and younger daughter of Eddard and '
               'Catelyn Stark and is nine years old at the beginning of the '
               'book series.  She has five siblings: an older broth
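Eyeballing the printed documents works for one query. For several queries, the same sanity check can be written as a small helper, sketched below. The function `hit_at_k` is hypothetical and operates on plain lists of document names rather than on Haystack objects; the second file name in the example is made up for illustration.

```python
def hit_at_k(retrieved_names, expected_name, k):
    """Return True if the expected document name appears among the first k retrieved names."""
    return expected_name in retrieved_names[:k]


# Names as they would appear in the 'name' field of the printed results;
# 'some_other_article.txt' is a placeholder, not a file from the dataset.
retrieved = [
    "30_List_of_A_Song_of_Ice_and_Fire_characters.txt",
    "some_other_article.txt",
]
found_in_top_1 = hit_at_k(retrieved, "30_List_of_A_Song_of_Ice_and_Fire_characters.txt", k=1)
missed_in_top_1 = hit_at_k(retrieved, "some_other_article.txt", k=1)
print(found_in_top_1, missed_in_top_1)
```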

#### Reader/Generator

As in the previous tutorials, we now initialize our reader/generator.

Here we use a `Seq2SeqGenerator` with the *yjernite/bart_eli5* model (see: https://huggingface.co/yjernite/bart_eli5)
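A seq2seq generator does not see documents as separate objects: the question and the retrieved passages are flattened into one input string. The sketch below shows one plausible template, with each passage prefixed by a `<P>` marker. Treat the exact template as an assumption rather than the library's definitive behaviour; `build_generator_input` is a hypothetical helper.

```python
def build_generator_input(query, passages):
    """Flatten a question and its supporting passages into a single seq2seq input string."""
    # Prefix each passage with a separator token (assumed format for this sketch)
    context = " ".join("<P> " + passage for passage in passages)
    return "question: {} context: {}".format(query, context)


generator_input = build_generator_input(
    "Why is the sky blue?",
    ["Passage one.", "Passage two."],
)
print(generator_input)
```

Because everything ends up in a single string, the number of retrieved documents (`top_k`) directly affects how much context the model receives before its input is truncated.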



In [20]:
generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5")

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1


### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline` that combines a retriever and a reader/generator to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
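To illustrate the idea, here is a toy sketch of the simplest possible graph, a linear chain of two nodes, where per-node parameters are routed by node name in the same style as `params={"Retriever": {"top_k": 2}}`. `ToyPipeline`, `toy_retriever`, and `toy_generator` are hypothetical stand-ins and do not reflect how Haystack implements pipelines internally.

```python
class ToyPipeline:
    """A linear chain of named nodes; each node reads the shared state and adds to it."""

    def __init__(self):
        self.nodes = []  # (name, callable) pairs in execution order

    def add_node(self, name, component):
        self.nodes.append((name, component))

    def run(self, query, params=None):
        params = params or {}
        state = {"query": query}
        for name, component in self.nodes:
            # Pass only the parameters addressed to this node by name
            state.update(component(state, **params.get(name, {})))
        return state


def toy_retriever(state, top_k=10):
    corpus = ["doc a", "doc b", "doc c"]  # placeholder corpus
    return {"documents": corpus[:top_k]}


def toy_generator(state):
    return {"answers": ["answer based on {} documents".format(len(state["documents"]))]}


toy_pipe = ToyPipeline()
toy_pipe.add_node("Retriever", toy_retriever)
toy_pipe.add_node("Generator", toy_generator)
result = toy_pipe.run("a question", params={"Retriever": {"top_k": 2}})
print(result["answers"])
```

The returned dictionary carries both the retrieved documents and the generated answers, which mirrors the shape of the output printed by `pipe.run` further down.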

In [21]:
from haystack.pipelines import GenerativeQAPipeline
pipe = GenerativeQAPipeline(generator, retriever)

## Voilà! Ask a question!

In [24]:
pipe.run(
    query="By how much does GPA need to drop for admission to be rescinded?",
    params={"Retriever": {"top_k": 2}}
)

{'answers': [" GPA is a measure of how well a student performs in school. If a student's GPA drops below a certain threshold, they are no longer eligible to be admitted to the school."],
 'documents': [<Document: {'content': "Colleges do not like to renege on admission decisions but will do so on occasion. This most typically happens when a student's grades drop SIGNIFICANTLY after the student is admitted. In other words, if an A student suffers a bout of senioritis and drops to a B average, it's not a dealbreaker. But if the grades plummet to C's and D's (or worse), it can be. If there are extenuating circumstances behind this change in GPA (e.g., an illness or family crisis), they should be explained by the school counselor. The college will probably be sympathetic and stand by their original acceptance, sometimes putting the student on academic probation when the school year starts.\n\nColleges may also revoke acceptances if the student is suspended from school or arrested outside o

In [25]:
pipe.run(query="IB scores are from 0-7. How does it convert to GPA?", params={"Retriever": {"top_k": 3}})

{'answers': [' GPA is an average of all your grades. IB is a weighted average of your IB scores.'],
 'documents': [<Document: {'content': "\n=== Duolingo course ===\nOn  a course in High Valyrian for English speakers began to be constructed in the Duolingo Language Incubator. David J. Peterson is one of the contributors to the course. The beta version was released on July 12, 2017. In April of 2019, the course was updated in anticipation of Game of Thrones' eighth and final season. As a part of this update, Peterson created audio for the course's lessons and exercises.", 'content_type': 'text', 'score': 0.5378316354613146, 'meta': {'name': '213_Valyrian_languages.txt', 'vector_id': '625'}, 'embedding': None, 'id': 'fa02ad06b0a4c9c740af454d63a89daa'}>,
  <Document: {'content': "Colleges do not like to renege on admission decisions but will do so on occasion. This most typically happens when a student's grades drop SIGNIFICANTLY after the student is admitted. In other words, if an A stud

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!
Our focus: Industry-specific language models & large-scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)