### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">

In [None]:
# Make sure you have a GPU running
!nvidia-smi

Sat Mar  5 04:53:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
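
The same check can also be done from Python, which is handy if later cells should branch on whether a GPU is present. This is a minimal sketch that assumes the `nvidia-smi` tool is on the `PATH` whenever a GPU runtime is active; `gpu_names` is a hypothetical helper, not part of Haystack.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed, so assume no GPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print("GPU runtime detected:" if names else "No GPU detected:", names)
```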

In [None]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Collecting pip
  Downloading pip-22.0.3-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 35.3 MB/s 
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.0.3
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-938fb8xu/farm-haystack_e0524afa82314e0999ab39c39c9a0a41
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-938fb8xu/farm-haystack_e0524afa82314e0999ab39c39c9a0a41
  Resolved https://github.com/deepset-ai/haystack.git to commit 5951fc463ec53ed889f2243b2e8d9832b9f01355
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting mlflow<=1.13.1
  Downloading mlflow-
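
After the installation finishes, it can be useful to confirm which version ended up in the environment. The sketch below assumes Python 3.8 or newer (for `importlib.metadata`); `installed_version` is a hypothetical helper, and it simply reports `None` when the package is missing.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# farm-haystack is the distribution name used by the pip command above.
print("farm-haystack:", installed_version("farm-haystack"))
```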

In [None]:
from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers
from haystack.nodes import FARMReader, TransformersReader

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/


## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

**Alternatives:** If you are unable to set up an Elasticsearch instance, follow [Tutorial 3](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb) to use the SQL or InMemory document stores instead.

**Hint**: This tutorial creates a new document store instance with Portuguese-language news articles. However, you can configure Haystack to work with your existing document stores.

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), you can instead download the Elasticsearch archive and run it manually.

In [None]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()



In [None]:
# In Colab / No Docker environments: Start Elasticsearch from the downloaded archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
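
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As an alternative, the sketch below polls an HTTP endpoint until it answers. It uses only the standard library; `wait_for_http` is a hypothetical helper, and treating an HTTP 200 on the root URL as "ready" is an assumption about an Elasticsearch instance without authentication.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this notebook it could be called as `wait_for_http("http://localhost:9200", timeout=120)`, assuming Elasticsearch listens on its default port 9200.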

In [None]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document", analyzer="portuguese")

## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this tutorial, we download an archive of Portuguese-language news articles, split them into paragraphs, and index them in Elasticsearch.
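
To make the steps listed above concrete, here is a rough plain-Python sketch of cleaning, splitting, and packaging a text into document dictionaries. It is not what `convert_files_to_dicts` does internally; `clean_text`, `split_paragraphs`, and `to_dicts` are hypothetical helpers, and the sample text and file name are made up. Only the `content` / `meta` layout mirrors the dictionaries printed in this notebook.

```python
import re


def clean_text(text):
    """Collapse runs of spaces/tabs and strip whitespace around each line."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def to_dicts(name, raw_text):
    """Build one document dict per paragraph, ready to hand to a document store."""
    return [
        {"content": paragraph, "meta": {"name": name}}
        for paragraph in split_paragraphs(clean_text(raw_text))
    ]


# Made-up sample input, used only to show the shape of the result.
sample = "  First   paragraph.  \n\n\nSecond\tparagraph.\n"
for doc in to_dicts("example.txt", sample):
    print(doc)
```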

In [None]:
# Let's first fetch some documents that we want to query
doc_dir = "data/fakebr"
git_url = "https://github.com/Jthnn/Q-A-Haystack/raw/main/noticias-true.zip"
#git_url = "https://github.com/Jthnn/Q-A-Haystack/blob/main/noticias-true-normalized.zip?raw=true"

fetch_archive_from_http(url=git_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is:
# {
#    'content': "<DOCUMENT_TEXT_HERE>",
#    'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
# }
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Pipeline)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

INFO - haystack.utils.import_utils -  Fetching from https://github.com/Jthnn/Q-A-Haystack/raw/main/noticias-true.zip to `data/fakebr`
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/1444.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/1453.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/3131.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/2507.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/2904.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/79.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/1701.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/2347.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/1833.txt
INFO - haystack.utils.preprocessing -  Converting data/fakebr/noticias-true/3337.txt
INFO - haystack.ut

[{'content': 'Antes de voto do relator, Lula disse que vai até o fim com candidatura. A portas fechadas no Sindicato dos Metalúrgicos, o petista pediu votos e afirmou: \'Comecei aqui e aqui vou recomeçar\'.  SÃO BERNARDO DO CAMPO - Vestido com uma camiseta vermelha, o ex-presidente Luiz Inácio Lula da Silva disse nesta quarta-feira, 24, que seus julgadores estão com a consciência "menos tranquila" do que a dele. "A única decisão que espero hoje é 3 a 0 pela minha absolvição", afirmou Lula no Sindicato dos Metalúrgicos do ABC, em São Bernardo do Campo (SP). A fala ocorreu antes da decisão\xa0do relator do processo, João Pedro Gebran Neto , sobre sua condenação a 12 anos e 1 mês de prisão em regime fechado. + AO VIVO: Julgamento de Lula no TRF-4 A portas fechadas, Lula disse aos amigos, em uma sala reservada do sindicato, que irá até o fim para ser candidato à Presidência, independentemente do resultado do julgamento desta quarta no Tribunal Regional Federal da 4.a Região, em Porto Alegr

## Initialize Retriever, Reader, & Pipeline

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm.
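
To give a feel for what BM25 rewards, here is a small sketch of the textbook scoring formula in plain Python. It is an illustration only: Elasticsearch's own implementation differs in details such as tokenization and constant factors, the parameter values `k1=1.2` and `b=0.75` are the commonly cited defaults, and the three toy sentences are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with the textbook BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through the b term.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up toy corpus, not taken from the indexed news articles.
docs = [
    "inflation ended 2017 below target",
    "the coach criticized the squad",
    "savings beat inflation",
]
print(bm25_scores("inflation 2017", docs))
```

Documents that contain rarer query terms, and contain them in shorter texts, score higher; documents without any query term score zero.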

**Alternatives:**

- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
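
The embedding-based alternatives in this list rank documents by vector similarity rather than by term overlap. The sketch below shows the idea with cosine similarity over hand-written toy vectors; a real `EmbeddingRetriever` would obtain the vectors from a trained model, and `rank_by_embedding` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_embedding(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)


# Hand-written toy vectors standing in for real sentence embeddings.
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]
print(rank_by_embedding(query_vec, doc_vecs))
```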

In [None]:
from haystack.nodes import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes, useful for quick prototyping with a SQLite document store.

#from haystack.nodes import TfidfRetriever
#retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a large Portuguese BERT QA model using a Reader based on FARM (https://huggingface.co/pierreguillou/bert-large-cased-squad-v1.1-portuguese)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust the model's tendency to return "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
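
Conceptually, an extractive reader scores every candidate start and end token and keeps the best span, while a boost on the "no answer" score makes abstaining more likely. The sketch below is a simplified illustration under stated assumptions, not FARM's actual implementation: `best_span` is a hypothetical helper, position 0 is assumed to be the special no-answer position (as in SQuAD 2.0 style models), and the logits are made up.

```python
def best_span(start_logits, end_logits, max_answer_len=30, no_ans_boost=0.0):
    """Pick the highest-scoring (start, end) token span, or None for "no answer".

    Index 0 is treated as the special no-answer position.
    """
    no_answer_score = start_logits[0] + end_logits[0] + no_ans_boost
    best, best_score = None, no_answer_score
    for i in range(1, len(start_logits)):
        # Only consider spans that end at or after their start and are short enough.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# Made-up logits for a four-token input.
start = [1.0, 0.2, 3.0, 0.1]
end = [1.0, 0.1, 0.5, 2.5]
print(best_span(start, end))                    # a span beats the no-answer score
print(best_span(start, end, no_ans_boost=4.0))  # the boost tips it to "no answer"
```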

#### FARMReader

In [None]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="pierreguillou/bert-large-cased-squad-v1.1-portuguese", use_gpu=True)
#reader = FARMReader(model_name_or_path="pierreguillou/bert-base-cased-squad-v1.1-portuguese", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find pierreguillou/bert-large-cased-squad-v1.1-portuguese locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/918 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.24G [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded pierreguillou/bert-large-cased-squad-v1.1-portuguese


Downloading:   0%|          | 0.00/205k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/506 [00:00<?, ?B/s]

INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
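
To illustrate the idea of chaining nodes, here is a toy sketch of a linear pipeline, the simplest kind of DAG. It is not Haystack's `Pipeline` class: `MiniPipeline`, `toy_retriever`, and `toy_reader` are hypothetical, and the two-sentence corpus is made up. The per-node `params` dictionary mimics the way `top_k` is passed to the `Retriever` and `Reader` in the queries of this notebook.

```python
class MiniPipeline:
    """A linear chain of named nodes; each node takes and returns a state dict."""

    def __init__(self):
        self.nodes = []  # (name, callable) pairs in execution order

    def add_node(self, name, component):
        self.nodes.append((name, component))

    def run(self, query, params=None):
        params = params or {}
        state = {"query": query}
        for name, component in self.nodes:
            # Each node only receives the parameters addressed to its own name.
            state = component(state, **params.get(name, {}))
        return state


def toy_retriever(state, top_k=2):
    """Keep documents that share at least one word with the query."""
    corpus = ["Inflation ended 2017 at 2.95%.", "The squad was poorly assembled."]
    words = state["query"].lower().split()
    hits = [doc for doc in corpus if any(word in doc.lower() for word in words)]
    return {**state, "documents": hits[:top_k]}


def toy_reader(state, top_k=1):
    """Pretend the answer is the last word of each retrieved document."""
    answers = [doc.split()[-1].rstrip(".") for doc in state["documents"]]
    return {**state, "answers": answers[:top_k]}


pipeline = MiniPipeline()
pipeline.add_node("Retriever", toy_retriever)
pipeline.add_node("Reader", toy_reader)
result = pipeline.run("inflation 2017", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}})
print(result["answers"])
```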

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [None]:
# Print the answers -3601

prediction = pipe.run(query="Qual é a crítica de Zico ao Flamengo?", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}})

print_answers(prediction, details="all")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.95 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.29s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.55 Batches/s]


Query: Qual é a crítica de Zico ao Flamengo?
Answers:
[   <Answer {'answer': 'montagem do elenco', 'type': 'extractive', 'score': 0.8862322270870209, 'context': ' no estádio do Maracanã, no Rio de Janeiro, Zico foi questionado sobre o ano do Flamengo. E o ídolo do clube não poupou críticas à montagem do elenco.', 'offsets_in_document': [{'start': 260, 'end': 278}], 'offsets_in_context': [{'start': 131, 'end': 149}], 'document_id': 'd1689758473f6284e4a185c5c130fba7', 'meta': {'name': '2.txt'}}>,
    <Answer {'answer': 'falta de garra', 'type': 'extractive', 'score': 0.704047679901123, 'context': 'om o clube. Na reta final da temporada, muito se criticou a suposta falta de garra de alguns atletas. "Precisa saber também o que representa a camisa ', 'offsets_in_document': [{'start': 160, 'end': 174}], 'offsets_in_context': [{'start': 68, 'end': 82}], 'document_id': '89f4eaa0fa016d2a7f41015689ab8c4d', 'meta': {'name': '2.txt'}}>,
    <Answer {'answer': 'faltou um trabalho melhor de avaliaç
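
Each answer carries a `score`, so a common follow-up step is to keep only the confident ones. In the sketch below, plain dictionaries copied from the first two answers printed above stand in for Haystack's `Answer` objects; `confident_answers` and the 0.8 threshold are illustrative assumptions, not a recommended setting.

```python
def confident_answers(answers, min_score):
    """Return answers scoring at least `min_score`, best first."""
    kept = [answer for answer in answers if answer["score"] >= min_score]
    return sorted(kept, key=lambda answer: answer["score"], reverse=True)


# Values copied from the output above; only the fields needed here are kept.
answers = [
    {"answer": "falta de garra", "score": 0.704047679901123, "meta": {"name": "2.txt"}},
    {"answer": "montagem do elenco", "score": 0.8862322270870209, "meta": {"name": "2.txt"}},
]

for answer in confident_answers(answers, min_score=0.8):
    print(answer["answer"], round(answer["score"], 3))
```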




In [None]:
# Print the answers -3653
prediction = pipe.run(query="O que o deputado Wladimir Costa divulgou no whatsapp?", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}})

print_answers(prediction, details="all")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.64 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.79 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.88 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.39 Batches/s]


Query: O que o deputado Wladimir Costa divulgou no whatsapp?
Answers:
[   <Answer {'answer': 'imagem de caráter homofóbico', 'type': 'extractive', 'score': 0.9590848982334137, 'context': 'chel Temer no ombro, divulgou nesta terça (8), via WhatsApp, imagem de caráter homofóbico, cujo alvo é o jornalista Ricardo Boechat, apresentador do g', 'offsets_in_document': [{'start': 152, 'end': 180}], 'offsets_in_context': [{'start': 61, 'end': 89}], 'document_id': '4b466f2a48ea27e2506407898e10b5dc', 'meta': {'name': '54.txt'}}>,
    <Answer {'answer': 'uma montagem', 'type': 'extractive', 'score': 0.8906220197677612, 'context': 'o aplicativo Whastapp – do qual fazem parte deputados e assessores – uma montagem com o propósito de “atacar a condição de mulher, mãe e parlamentar” ', 'offsets_in_document': [{'start': 630, 'end': 642}], 'offsets_in_context': [{'start': 69, 'end': 81}], 'document_id': '12092c42e01fd1fad5108f8553892d97', 'meta': {'name': '2695.txt'}}>,
    <Answer {'answer': 'fotos ínt




In [None]:
# Print the answers - 3745
prediction = pipe.run(query="Quem é Jair Bolsonaro?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

print_answers(prediction, details="all")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.62 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.77 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.69 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.72 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 22.04 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.21 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.19 Batches/s]


Query: Quem é Jair Bolsonaro?
Answers:
[   <Answer {'answer': 'pré-candidato à Presidência', 'type': 'extractive', 'score': 0.9899130165576935, 'context': 'nto é o deputado federal Eduardo Bolsonaro (PSC-SP), filho do pré-candidato à Presidência Jair Bolsonaro, que repete a exaustão que "bandido só respei', 'offsets_in_document': [{'start': 87, 'end': 114}], 'offsets_in_context': [{'start': 62, 'end': 89}], 'document_id': '71d3f0093fce7366bc660d8fcf5f855b', 'meta': {'name': '45.txt'}}>,
    <Answer {'answer': 'deputado federal e militar da reserva', 'type': 'extractive', 'score': 0.9754301905632019, 'context': 'itares podem voltar ao poder por meio do voto –é o que o deputado federal e militar da reserva Jair Bolsonaro, em segundo lugar nas pesquisas, afirmou', 'offsets_in_document': [{'start': 184, 'end': 221}], 'offsets_in_context': [{'start': 57, 'end': 94}], 'document_id': '5ec52664d111f19974693abaf21da7bc', 'meta': {'name': '72.txt'}}>,
    <Answer {'answer': 'PSC-RJ', 'type': 'e




In [None]:
# Print the answers -3752
prediction = pipe.run(query="Qual foi a captação da poupança em 2017?", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}})

print_answers(prediction, details="all")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.22 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.43 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.45 Batches/s]


Query: Qual foi a captação da poupança em 2017?
Answers:
[   <Answer {'answer': '6,61%', 'type': 'extractive', 'score': 0.9804734885692596, 'context': 'Em 2017, a remuneração dos depósitos de poupança foi de 6,61%, superando largamente a inflação estimada em menos de 3%. Em 2016, as cadernetas já havi', 'offsets_in_document': [{'start': 56, 'end': 61}], 'offsets_in_context': [{'start': 56, 'end': 61}], 'document_id': '3863f452c678dc4336aa64125567d43f', 'meta': {'name': '153.txt'}}>,
    <Answer {'answer': 'R$ 7,74 bilhões', 'type': 'extractive', 'score': 0.9683222770690918, 'context': 'ões por ano para a linha, o que representa uma queda de 35% ante os R$ 7,74 bilhões disponibilizados para 2017 (orçamento final já considerando remane', 'offsets_in_document': [{'start': 630, 'end': 645}], 'offsets_in_context': [{'start': 68, 'end': 83}], 'document_id': 'f00b11796ead2ceb5d9c133a016f4ce8', 'meta': {'name': '2244.txt'}}>,
    <Answer {'answer': 'R$ 7,74 bilhões', 'type': 'extractive', 'sc




In [None]:
# Print the answers -3752
prediction = pipe.run(query="Qual foi a inflação em 2017?", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 5}})

print_answers(prediction, details="all")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 17.29 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 21.49 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.72 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.52 Batches/s]


Query: Qual foi a inflação em 2017?
Answers:
[   <Answer {'answer': '2,95%', 'type': 'extractive', 'score': 0.9960058033466339, 'context': 'A inflação encerrou 2017 em 2,95%, divulgou o IBGE nesta quarta-feira (10), abaixo do piso da meta do governo, de 3%.', 'offsets_in_document': [{'start': 28, 'end': 33}], 'offsets_in_context': [{'start': 28, 'end': 33}], 'document_id': '3788353e5002016a97e962a670c2cd1e', 'meta': {'name': '214.txt'}}>,
    <Answer {'answer': '5,3%', 'type': 'extractive', 'score': 0.9959687888622284, 'context': ' reajustes abaixo da inflação, que ficou em 30% em dezembro de 2016, despencou para 5,3% em novembro de 2017, segundo o boletim Salariômetro, da Fipe.', 'offsets_in_document': [{'start': 105, 'end': 109}], 'offsets_in_context': [{'start': 84, 'end': 88}], 'document_id': '1c02fcd36662bb7b827e3995fda7037d', 'meta': {'name': '214.txt'}}>,
    <Answer {'answer': '79,6%', 'type': 'extractive', 'score': 0.9795089662075043, 'context': 'Considerando o período de jan
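
The `offsets_in_context` field in the output tells you where the answer sits inside its `context` string. The sketch below recovers the answer text from those character offsets, using the context and offsets of the top answer to the inflation query above; `span_from_offsets` is a hypothetical helper.

```python
def span_from_offsets(context, offsets):
    """Return the substrings of `context` selected by a list of start/end offsets."""
    return [context[offset["start"]:offset["end"]] for offset in offsets]


# Context and offsets copied from the first answer in the output above.
context = (
    "A inflação encerrou 2017 em 2,95%, divulgou o IBGE nesta quarta-feira (10), "
    "abaixo do piso da meta do governo, de 3%."
)
offsets_in_context = [{"start": 28, "end": 33}]

print(span_from_offsets(context, offsets_in_context))
```

The same slicing is useful for highlighting the answer inside its surrounding text when displaying results.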


