## Multiple pipeline paths

In this test, I will build a pipeline that behaves differently depending on whether the input is a keyword query or a question.
For a keyword query, the pipeline only uses the Elasticsearch retriever. For a question, it uses a DPR retriever followed by a generator to provide a better answer.
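The routing idea can be sketched without any Haystack components. This is a toy illustration only: `looks_like_question`, `keyword_path`, `question_path` and `route` are hypothetical names, and the hand-written heuristic merely stands in for the trained classifier used later in this notebook.

```python
# Toy sketch of query routing; not Haystack code.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def looks_like_question(query: str) -> bool:
    """Crude heuristic: ends with '?' or starts with a question word."""
    text = query.strip().lower()
    if not text:
        return False
    return text.endswith("?") or text.split()[0] in QUESTION_WORDS


def keyword_path(query: str) -> str:
    # Stand-in for the sparse (Elasticsearch) branch.
    return f"sparse retrieval for: {query}"


def question_path(query: str) -> str:
    # Stand-in for the dense retriever + generator branch.
    return f"dense retrieval + generation for: {query}"


def route(query: str) -> str:
    """Send the query down exactly one of the two branches."""
    if looks_like_question(query):
        return question_path(query)
    return keyword_path(query)


print(route("who is harry?"))
print(route("quidditch rules"))
```

Only one branch runs per query, which is why the keyword path stays cheap: the dense retriever and the generator are never called for it.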

Let's start with the usual settings.

In [4]:
# Make sure you have a GPU running
!nvidia-smi

Fri Apr  8 09:02:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr,faiss]

!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

In [72]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
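A fixed 30-second sleep works, but it can be too short on a slow machine and wasteful on a fast one. As an alternative, here is a minimal sketch that polls the port until it accepts connections. `wait_for_port` is a hypothetical helper, and the sketch assumes Elasticsearch listens on its default HTTP port 9200. An open port does not guarantee the cluster is fully ready, so treat it as a rough check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a bit and retry.
            time.sleep(interval)
    return False
```

In the notebook this could be called as `wait_for_port("localhost", 9200)` right after starting the server, raising an error if it returns `False`.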

In [73]:
from haystack.document_stores import ElasticsearchDocumentStore

document_store_es = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

I will install Graphviz to visualize the pipeline.

In [None]:
!apt install libgraphviz-dev graphviz
!pip install pygraphviz

The usual preprocessing of all the Harry Potter books

In [6]:
from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs

DOC_DIR = '/content/data'

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/


In [7]:
all_docs = convert_files_to_docs(dir_path=DOC_DIR)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)

print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(docs)}")

INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO - haystack.utils.preprocessing -  Converting /content/data/HP-half-blood-prince.pdf
INFO - haystack.utils.preprocessing -  Converting /content/data/harry-potter-and-the-order-of-the-phoenix.pdf
INFO - haystack.utils.preprocessing -  Converting /content/data/Harry Potter and The Sorcerer’s Stone.pdf
INFO - haystack.utils.preprocessing -  Converting /content/data/HP-chamber-of-secret.pdf
INFO - haystack.utils.preprocessing -  Converting /content/data/Harry-potter-prison-of-azkaban.pdf
INFO - haystack.utils.preprocessing -  Converting /content/data/Har

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


100%|██████████| 7/7 [00:03<00:00,  2.04docs/s]

n_files_input: 7
n_docs_output: 12563
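To give an intuition for what `split_by="word"`, `split_length=100` and `split_respect_sentence_boundary=True` aim for, here is a simplified sketch that packs whole sentences into chunks of at most a given number of words. It is not Haystack's implementation: `split_by_words` is a hypothetical function, and the regex-based sentence splitter is an assumption standing in for the NLTK tokenizer.

```python
import re
from typing import List


def split_by_words(text: str, split_length: int = 100) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `split_length` words.

    A single sentence longer than the limit becomes its own chunk.
    """
    # Naive sentence splitting: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would exceed the limit.
        if current and count + n_words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Harry looked up. Hagrid was at the door. He was holding a pink umbrella."
for chunk in split_by_words(sample, split_length=10):
    print(chunk)
```

With a limit of 10 words, the first two sentences (8 words) share a chunk and the third starts a new one, so no sentence is cut in half.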





In [8]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

In [9]:
document_store.write_documents(docs)

Writing Documents:   0%|          | 0/12563 [00:00<?, ?it/s]

A sparse retriever for fast results when the query is a keyword

In [74]:
from haystack.nodes import ElasticsearchRetriever
document_store_es.write_documents(docs)
es_retriever = ElasticsearchRetriever(document_store=document_store_es)

A dense retriever for when a question is asked

In [None]:
from haystack.nodes import DensePassageRetriever

dpr_retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=16,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True,
)
# Important:
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it with the existing doc embeddings, which is very fast.
document_store.update_embeddings(dpr_retriever)
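The query-time comparison mentioned in the comments above can be illustrated with a tiny pure-Python sketch. The three-dimensional vectors, the document ids, and the `dot` and `top_k` helpers are made up for illustration. Real DPR embeddings have hundreds of dimensions and the search is done by FAISS, not by a Python loop. The sketch assumes dot-product similarity.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Rank document ids by dot-product similarity with the query vector."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: dot(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Pretend these were computed once, ahead of time, by the passage encoder.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}

# Pretend this came from the query encoder at query time.
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```

The expensive part (encoding every passage) happens once, while each query only costs one encoder call plus a similarity search.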



A generator to turn the retrieved passages into a free-text answer

In [None]:
from haystack.nodes import Seq2SeqGenerator
generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

In [82]:
from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier

query_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")

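p = Pipeline()
p.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
# output_1: the query was classified as a question/statement -> dense retrieval
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
# output_2: the query was classified as a keyword query -> sparse retrieval only
p.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_2"])
# The generator (named "QAReader" here) only runs on the question branch
p.add_node(component=generator, name="QAReader", inputs=["DPRRetriever"])
res = p.run(query="who is harry?")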

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1


In [81]:
res

{'answers': [<Answer {'answer': 'Harry Potter is a fictional character from the Harry Potter series of books. He is the main character of the series, and he is a wizard. He was a student at Hogwarts School of Witchcraft and Wizardry, which is a magical school in the United Kingdom. Harry was a member of the Order of the Phoenix, which was a group of witches and wizards who were responsible for the creation of the Wizarding World. Dumbledore was the head of the school, and Harry was one of the students at Hogwarts. Harry is the son of Harry Potter and his wife, Hermione. His father was a wizard, and his mother was a witch. Harry grew up in an orphanage, where he was raised by his mother and his grandmother. His mother died when he was very young, leaving him with a younger sister, Hermione, and a younger brother, Sirius Black. Sirius Black was the leader of the Quidditch team, and was responsible for defeating Voldemort in the Battle of Hogwarts. He', 'type': 'generative', 'score': None
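Because the two branches end in different node types, the result dictionary will not look the same for both. The sketch below shows one way to handle either shape. It assumes the question branch returns an `answers` list whose items have an `answer` attribute (as in the output above) and that the keyword branch returns a `documents` list whose items have a `content` attribute. `summarise` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real Haystack objects so the snippet runs on its own.

```python
from types import SimpleNamespace


def summarise(result: dict, max_chars: int = 80) -> list:
    """Return short text snippets from whichever branch produced the result."""
    if result.get("answers"):
        # Question branch: generated answers.
        texts = [a.answer for a in result["answers"]]
    else:
        # Keyword branch: retrieved documents only.
        texts = [d.content for d in result.get("documents", [])]
    return [t[:max_chars] for t in texts]


# Stand-in objects (assumption), shaped like the real results.
question_result = {"answers": [SimpleNamespace(answer="Harry Potter is a wizard.")]}
keyword_result = {"documents": [SimpleNamespace(content="Quidditch is played on broomsticks.")]}

print(summarise(question_result))
print(summarise(keyword_result))
```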

Let's save a diagram of the pipeline structure.

In [59]:
p.draw(path='/content/a.png')