Let's implement a QA system, built as a retriever-reader pipeline, to answer questions about "Harry Potter and The Sorcerer’s Stone" (HP).
I will preprocess the HP PDF with the Haystack suite and then store the documents in Elasticsearch.
First I will use a standard sparse retriever, and then I will try a Dense Passage Retriever (DPR) to compare the results.

We start by setting up Haystack and Elasticsearch.

In [3]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]

!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin


In [6]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
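A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As a rough alternative, here is a minimal sketch that polls the TCP port until it accepts connections. The helper name `wait_for_port` is my own (it is not part of Haystack or Elasticsearch), and an open port only suggests that the server is up, not that the cluster is fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True as soon as a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of Haystack or Elasticsearch.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: wait a little and retry.
            time.sleep(interval)
    return False


# Example (9200 is the default Elasticsearch HTTP port):
# wait_for_port("localhost", 9200)
```

If the call returns False, the server most likely failed to start, and the process output is the first place to look.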

In [7]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")


Let's preprocess our Harry Potter PDF: http://www.passuneb.com/elibrary/ebooks/Harry%20Potter%20and%20The%20Sorcerer%E2%80%99s%20Stone.pdf

In [8]:
# Here are the imports we need
from haystack.nodes import PDFToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs

In [9]:
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path="/content/Harry Potter and The Sorcerer’s Stone.pdf", meta=None)[0]

In [10]:
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process([doc_pdf])
print(f"n_docs_input: 1\nn_docs_output: {len(docs)}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


100%|██████████| 1/1 [00:00<00:00,  3.91docs/s]

n_docs_input: 1
n_docs_output: 887
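To get an intuition for what `split_by="word"`, `split_length=100` and `split_respect_sentence_boundary=True` do together, here is a simplified sketch of the idea: whole sentences are packed greedily into chunks of at most `split_length` words. This is not Haystack's actual implementation (which, judging by the log above, relies on NLTK's punkt tokenizer); the function name `split_by_words` and the naive regex sentence splitter are assumptions made only for illustration.

```python
import re
from typing import List


def split_by_words(text: str, split_length: int = 100) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `split_length` words.

    A single sentence longer than `split_length` becomes its own oversized chunk.
    Hypothetical helper for illustration; not Haystack's implementation.
    """
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit.
        if current and current_len + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a limit of 100 words, a book-length text ends up as several hundred short passages, which is consistent with the 887 documents reported above.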





Let's add the preprocessed docs to Elasticsearch.

In [11]:
# Now, let's write the preprocessed documents to our document store.
document_store.write_documents(docs)

Now I will set up the pipeline with the sparse retriever.

Step 1: retriever - BM25, as implemented by Elasticsearch

In [89]:
from haystack.nodes import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)
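For readers who have not met BM25 before, here is a minimal sketch of the textbook scoring formula in plain Python. It is only meant to show the idea (term frequency, inverse document frequency and length normalization); Elasticsearch's own implementation differs in tokenization and other details. The function name `bm25_scores` and the whitespace tokenization are assumptions of this sketch; `k1=1.2` and `b=0.75` are the usual default values.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with the textbook BM25 formula.

    Hypothetical helper for illustration; tokenization is a naive lowercase split.
    """
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores: List[float] = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length with b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the score depends only on exact term overlap, a question phrased with words that never appear in the book has no chance of retrieving the right passage. That is the main limitation a dense retriever tries to address.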

Step 2: reader - let's try roberta-base-squad2, the model suggested in the docs

In [13]:
from haystack.nodes import FARMReader, TransformersReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2


INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


For this first attempt, I will leverage the ready-made ExtractiveQAPipeline.

In [90]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)


Let's now ask some questions to our system.

In [107]:
questions = ['Who is Dumbledore?', "How is it called Harry's aunt?", "what are the four houses names?", "How is it called Harry's uncle?", "who is Norbert"]

In [108]:
QA_set = {}
for q in questions:
  prediction = pipe.run(
    query=q, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
  )
  QA_set[q] = [(answer.answer, answer.score) for answer in prediction['answers']]


In [109]:
for k, v in QA_set.items():
  print(k)
  for answer in v:
    print('\t{}'.format(answer))
  print("\n")

Who is Dumbledore?
	('Albus Dumbledore', 0.8194479644298553)
	('a very great wizard', 0.4702693670988083)
	('Professor Dumbledore', 0.23260757327079773)
	('the only one You-Know-Who was ever afraid of', 0.19613751024007797)
	('Lily and James Potter', 0.19224070757627487)


How is it called Harry's aunt?
	('Aunt Petunia', 0.7305570542812347)
	('Aunt Petunia', 0.2939871773123741)
	('Nothing, nothing..."', 0.27101390808820724)
	('Aunt Petunia', 0.21833521127700806)
	("Devil's Snare", 0.05505102686583996)


what are the four houses names?
	('Gryffindor, Hufflepuff, Ravenclaw, and Slytherin', 0.9716241955757141)
	('School houses', 0.6905372440814972)
	('Gryffindor', 0.5084449350833893)
	('Some sort of test', 0.10287788510322571)
	('Houses', 0.06106731854379177)


How is it called Harry's uncle?
	('Uncle Vernon', 0.704220324754715)
	('Uncle Vernon', 0.6269576549530029)
	('Uncle Vernon', 0.4061504751443863)
	('Uncle Vernon', 0.24284610897302628)
	('Quirrell', 0.04651808366179466)


who is Nor

The answers are mostly valid. I also tried other questions, but the results were poor (e.g. who Harry Potter's parents are, Hermione Granger's hair color).
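Several of the lists above repeat the same answer string ('Aunt Petunia' appears three times, 'Uncle Vernon' four times), presumably because the reader extracts it from different retrieved passages. As a small convenience, here is a sketch that keeps only the best score per distinct answer. The helper name `best_unique_answers` is my own, and it works on the `(answer, score)` tuples stored in `QA_set`.

```python
from typing import Dict, List, Tuple


def best_unique_answers(answers: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Keep the highest score for each distinct answer string, sorted by score.

    Hypothetical helper for illustration; operates on (answer, score) tuples.
    """
    best: Dict[str, float] = {}
    for text, score in answers:
        if text not in best or score > best[text]:
            best[text] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Example with the scores printed above for "How is it called Harry's aunt?"
aunt_answers = [
    ("Aunt Petunia", 0.7305570542812347),
    ("Aunt Petunia", 0.2939871773123741),
    ('Nothing, nothing..."', 0.27101390808820724),
    ("Aunt Petunia", 0.21833521127700806),
    ("Devil's Snare", 0.05505102686583996),
]
unique_aunt_answers = best_unique_answers(aunt_answers)
```

This makes it easier to see how many genuinely different candidates the reader proposed for each question.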

Now let's try with a DPR:

1.   Load the new retriever
2.   Rebuild the pipeline with the new retriever but same reader
3.   Run the questions and check the results



In [84]:
from haystack.nodes import DensePassageRetriever

retriever_DPR = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=16,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True,
)
# Important:
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it with the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever_DPR)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-question_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-question_encoder-single-nq-base
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-ctx_encoder-single-nq-base locally.
IN

Updating embeddings:   0%|          | 0/887 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/896 [00:00<?, ? Docs/s]
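The comment in the cell above says that, at query time, only the query is embedded and then compared with the stored document embeddings. Here is a minimal sketch of that comparison step, assuming dot-product similarity (the similarity DPR was trained with) and plain Python lists as vectors. The function name `top_k_by_dot_product` and the toy two-dimensional vectors are made up for illustration; real DPR embeddings have far more dimensions, and the search itself is done by the document store.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the top_k most similar documents.

    Hypothetical helper for illustration; not how the document store searches internally.
    """
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy example: the query points in the same direction as document 1.
toy_query = [1.0, 0.0]
toy_docs = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]
toy_ranking = top_k_by_dot_product(toy_query, toy_docs, top_k=2)
```

Unlike BM25, nothing here depends on exact word overlap: two texts can be close in the embedding space even if they share no terms.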

In [95]:
pipe_DPR = ExtractiveQAPipeline(reader, retriever_DPR)

QA_set_DPR = {}
for q in questions:
  prediction = pipe_DPR.run(
    query=q, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
  )
  QA_set_DPR[q] = [(answer.answer, answer.score) for answer in prediction['answers']]

for k, v in QA_set_DPR.items():
  print(k)
  for answer in v:
    print('\t{}'.format(answer))
  print("\n")


Who is Dumbledore?
	('Albus Dumbledore', 0.821822464466095)
	('Professor Dumbledore, sir', 0.401757150888443)
	('Hagrid', 0.3709588944911957)
	('Hagrid', 0.22871458530426025)
	('Dumbledore is particularly famous', 0.22741584479808807)


How is it called Harry's aunt?
	('Aunt Petunia', 0.8935014009475708)
	('Aunt Petunia', 0.8920823037624359)
	('Great Auntie Enid', 0.5380969643592834)
	('Hagrid', 0.09339652583003044)
	('Mrs. Norris', 0.09175241366028786)


what are the four houses names?
	('Gryffindor, Hufflepuff, Ravenclaw, and Slytherin', 0.9716241955757141)
	('Gryffindor', 0.5084449350833893)
	('Gryffindor', 0.07008694484829903)
	('your houses', 0.030082549899816513)
	('Great Hall', 0.025165279395878315)


How is it called Harry's uncle?
	('Uncle Vernon', 0.8343906700611115)
	('Uncle Vernon', 0.7464466989040375)
	('Harvey', 0.04913013614714146)
	('Dudley', 0.03689198009669781)
	('wizard', 0.027278142981231213)







The results are quite similar, I'd say. The major difference is in the scores: in some cases the top answer scores higher than with the sparse method (for example, 'Aunt Petunia' rises from about 0.73 to 0.89).
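To make the comparison a bit more concrete, here is a small sketch that puts the top-1 scores of the two runs side by side. The numbers are copied from the outputs printed above; in the notebook itself they could presumably be read from `QA_set[q][0][1]` and `QA_set_DPR[q][0][1]`. The dictionary names are my own.

```python
# Top-1 reader scores copied from the BM25 run above.
top1_bm25 = {
    "Who is Dumbledore?": 0.8194479644298553,
    "How is it called Harry's aunt?": 0.7305570542812347,
    "what are the four houses names?": 0.9716241955757141,
    "How is it called Harry's uncle?": 0.704220324754715,
}
# Top-1 reader scores copied from the DPR run above.
top1_dpr = {
    "Who is Dumbledore?": 0.821822464466095,
    "How is it called Harry's aunt?": 0.8935014009475708,
    "what are the four houses names?": 0.9716241955757141,
    "How is it called Harry's uncle?": 0.8343906700611115,
}

# Positive delta: the DPR pipeline scored its best answer higher.
score_delta = {q: top1_dpr[q] - top1_bm25[q] for q in top1_bm25}
for question, delta in score_delta.items():
    print(f"{question:35s} {delta:+.3f}")
```

With only four questions this is an observation rather than an evaluation; a labeled set of questions and answers would be needed to say which retriever is actually better.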

In [105]:
questions_hard = ["Who are Harry Potter's parents", "lily hair color", "who is Norbert?"]

In [106]:
QA_set_DPR_hard = {}
for q in questions_hard:
  prediction = pipe_DPR.run(
    query=q, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
  )
  QA_set_DPR_hard[q] = [(answer.answer, answer.score) for answer in prediction['answers']]

for k, v in QA_set_DPR_hard.items():
  print(k)
  for answer in v:
    print('\t{}'.format(answer))
  print("\n")


Who are Harry Potter's parents
	("mum an' dad", 0.7500946819782257)
	('Mr. and Mrs. Dursley', 0.6728163659572601)
	('Weasley twins', 0.6465665996074677)
	('aunt and uncle', 0.39339931309223175)
	('Mrs. Dursley', 0.35713834315538406)


lily hair color
	('ebony and unicorn', 0.5361964553594589)
	('dark red', 0.43080934882164)
	('silver', 0.31448034942150116)
	('holly', 0.0568766500800848)
	('bald', 0.031883254647254944)


who is Norbert?
	('Norwegian Ridgeback', 0.8646981120109558)
	('GRYFFINDOR', 0.2662728950381279)
	('Malfoy', 0.1594613641500473)
	('Hagrid hadn\'t been doing his gamekeeping duties because the dragon was keeping him so busy. There were empty brandy bottles and chicken feathers all over the floor. "I\'ve decided to call him Norbert," said Hagrid', 0.13378018140792847)
	('Nitwit! Blubber! Oddment! Tweak! "Thank you!" He sat back down. Everybody clapped and cheered. Harry didn\'t know whether to laugh or not. "Is he -- a bit mad?" he asked Percy uncertainly. "Mad?" said Pe




I am very surprised by the answers to the last question! Even though the list is not perfect, the top answer clearly shows that the model was able to associate the type of egg (Norwegian Ridgeback) with the dragon, Norbert.