# Open-domain question answering with DeepPavlov


The architecture of the DeepPavlov ODQA skill is modular and consists of two components: a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves a few relevant articles from the article collection, and then the **reader** scans them carefully to identify the answer.

The **ranker** is based on DrQA [1], proposed by Facebook Research. Specifically, the DrQA approach uses unigram-bigram hashing and TF-IDF matching to efficiently return a subset of relevant articles for a question.

The **reader** is based on R-NET [2], proposed by Microsoft Research Asia, and on its implementation by Wenxuan Zhou. R-NET is an end-to-end neural network model that aims to answer questions based on a given article. It first matches the question and the article via gated attention-based recurrent networks to obtain a question-aware article representation. Then a self-matching attention mechanism refines the representation by matching the article against itself, which effectively encodes information from the whole article. Finally, pointer networks locate the positions of answers in the article. Picture 1 below shows the architecture of the DeepPavlov ODQA system.

DeepPavlov’s ODQA system has two Wikipedia-based models. The first one is based on the English Wikipedia dump from 2018-02-11 (5,180,368 articles) and the second one is based on the Russian Wikipedia dump from 2018-04-01 (1,463,888 articles).

[1] [Chen, Danqi, et al. "Reading wikipedia to answer open-domain questions." arXiv preprint arXiv:1704.00051 (2017)](https://arxiv.org/pdf/1704.00051.pdf)

[2] [R-NET: Machine reading comprehension with self-matching networks](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf)

<img src="https://github.com/deepmipt/dp_notebooks/blob/master/odqa.png?raw=1">

<center>Picture 1. The DeepPavlov-based ODQA system architecture</center>
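
The ranker idea described above (hash unigrams and bigrams into a fixed number of buckets, then score documents by TF-IDF) can be illustrated with a minimal Python sketch. This is not the DrQA or DeepPavlov implementation: the bucket count, the use of `zlib.crc32` as the hash function, the weighting formula, and the names `ngrams`, `hashed_counts`, and `rank` are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # hypothetical hash-space size chosen for this sketch


def ngrams(text):
    """Return lowercase unigrams and bigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def hashed_counts(text):
    """Hash every n-gram to a bucket id and count occurrences per bucket."""
    return Counter(
        zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS for gram in ngrams(text)
    )


def rank(question, documents, top_n=2):
    """Return the titles of the top_n documents by TF-IDF dot product."""
    doc_vectors = {title: hashed_counts(text) for title, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: in how many documents each bucket occurs.
    df = Counter(bucket for vec in doc_vectors.values() for bucket in vec)

    def idf(bucket):
        # Smoothed inverse document frequency (an assumed formula).
        return math.log((n_docs + 1) / (df[bucket] + 0.5))

    q_vec = hashed_counts(question)
    scores = {}
    for title, vec in doc_vectors.items():
        # Only buckets shared by the question and the document contribute.
        scores[title] = sum(
            math.log1p(q_vec[b]) * idf(b) * math.log1p(vec[b]) * idf(b)
            for b in q_vec
            if b in vec
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


docs = {
    "Guinea pig": "the guinea pig originated in the andes of south america",
    "Bastille Day": "bastille day is the national day of france celebrated on 14 july",
    "Lynmouth": "the lynmouth flood happened in devon in august 1952",
}
print(rank("where did the guinea pig originate", docs, top_n=1))
```

Hashing keeps the vocabulary size fixed regardless of how many distinct bigrams the collection contains, at the cost of occasional collisions between unrelated n-grams.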

# Model Requirements

The English Wikipedia model requires 35 GB of local storage, whereas the Russian model takes up about 20 GB. The Wikipedia dumps can be rebuilt by following the steps described in the [documentation](http://docs.deeppavlov.ai/en/0.1.6/components/tfidf_ranking.html#available-data-and-pretrained-models). Both models require about 24 GB of RAM. They can also run on a machine with 16 GB of RAM, provided that the swap size is at least 8 GB.
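
Before downloading anything, it may help to compare a machine against the figures quoted above. The sketch below is only a rough check under stated assumptions: the helper names are hypothetical, `os.sysconf` is available on POSIX systems only, the swap size has to be supplied by hand, and the reported RAM is usually slightly below the nominal value, so treat the result as a hint rather than a guarantee.

```python
import os
import shutil

GIB = 1024 ** 3


def physical_ram_gib():
    """Total physical RAM in GiB (POSIX only, via os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB


def free_disk_gib(path="."):
    """Free space in GiB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def can_run_odqa(ram_gib, swap_gib, disk_gib, required_disk_gib=35):
    """Apply the thresholds from the text: 24 GB of RAM, or 16 GB plus 8 GB of swap.

    required_disk_gib defaults to the English model (35 GB); pass 20 for Russian.
    """
    enough_memory = ram_gib >= 24 or (ram_gib >= 16 and swap_gib >= 8)
    return enough_memory and disk_gib >= required_disk_gib


# Example with hand-entered numbers: 16 GB of RAM, 8 GB of swap, 40 GB free disk.
print(can_run_odqa(ram_gib=16, swap_gib=8, disk_gib=40))
```

On a real machine, `physical_ram_gib()` and `free_disk_gib()` can supply the first and third arguments.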
 
First, install DeepPavlov and the model's requirements.

In [1]:
!pip install -q deeppavlov
!python -m deeppavlov install en_odqa_infer_wiki


# Model Description

The architecture of the ODQA skill is modular and consists of two components, a **ranker** and a **reader**. To answer a question, the **ranker** first retrieves the **top_n** most relevant articles from the document collection, and then the **reader** scans them carefully to identify the answer. A detailed description of the ODQA models can be found in the [DeepPavlov documentation](http://docs.deeppavlov.ai/en/0.1.6/skills/odqa.html).
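
The retrieve-then-read flow described above can be sketched with toy stand-ins for the two components. Everything here is hypothetical and chosen for illustration: `toy_ranker` and `toy_reader` use plain word overlap instead of TF-IDF and a neural reader, and picking the highest-scoring candidate is an assumption about how answers from several articles might be combined, not a description of DeepPavlov's internals.

```python
def overlap(a, b):
    """Number of distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_ranker(question, articles):
    """Return article titles sorted by word overlap with the question."""
    return sorted(
        articles, key=lambda title: overlap(question, articles[title]), reverse=True
    )


def toy_reader(question, text):
    """Return (sentence, score) for the sentence that best matches the question."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    best = max(sentences, key=lambda s: overlap(question, s))
    return best, overlap(question, best)


def answer(question, articles, top_n=2):
    """Stage 1: keep the top_n articles. Stage 2: read each one, keep the best span."""
    titles = toy_ranker(question, articles)[:top_n]
    candidates = [toy_reader(question, articles[title]) for title in titles]
    return max(candidates, key=lambda candidate: candidate[1])[0]


articles = {
    "Tuberculosis": "Tuberculosis is an infectious disease. It usually affects the lungs",
    "Antibiotics": "Antibiotics treat bacterial infections. They do not work against viruses",
    "Cerebellum": "The cerebellum coordinates movement. It sits at the back of the brain",
}
print(answer("what is tuberculosis", articles))
```

A larger `top_n` gives the reader more chances to find the answer but also more text to scan, which is the trade-off the parameter controls.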

In [0]:
%load https://github.com/deepmipt/DeepPavlov/blob/0.1.6/deeppavlov/configs/odqa/en_odqa_infer_wiki.json

# Interacting with the model

**As mentioned above, the Wikipedia-based models have significant storage and RAM requirements, so it is not possible to interact with them on Colab. You can, however, run them locally if your machine meets the requirements. Alternatively, you can check out our [demo](http://demo.ipavlov.ai/).**

Note that you can navigate the configuration files by using autocomplete (Tab key) on the **configs** module.

from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

# Build the English ODQA model; download=True fetches the pretrained files
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)
answers = odqa([
    "Where did guinea pigs originate?",
    "When did the Lynmouth floods happen?",
    "When is the Bastille Day?",
])

# Training the model

You can train a model by running the framework with the **train** parameter; the model will then be trained on the document collection defined in the **dataset_reader** section of the configuration file. The **dataset_reader** section of the ranker’s configuration defines the source of the articles. The source can be in one of the following formats, selected with **dataset_format**:

* *wiki* - a Wikipedia dump
* *txt* - a path to text files, with each document in a separate txt file
* *json* - JSON files, each formatted as a list of dicts that contain the 'title' and 'doc' keys
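
For the *json* format, a corpus file could be prepared along the following lines. This is a small sketch that relies only on the description above (a list of dicts with 'title' and 'doc' keys); the helper name `write_json_corpus`, the file name, and the sample documents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_json_corpus(documents, path):
    """Validate the documents and write them as a JSON list of dicts."""
    for index, document in enumerate(documents):
        missing = {"title", "doc"} - set(document)
        if missing:
            raise ValueError(f"document {index} is missing keys: {sorted(missing)}")
    path = Path(path)
    path.write_text(
        json.dumps(documents, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path


# Hypothetical sample documents, used only to show the expected structure.
documents = [
    {"title": "article_1", "doc": "Text of the first article."},
    {"title": "article_2", "doc": "Text of the second article."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = write_json_corpus(documents, Path(tmp_dir) / "corpus.json")
    loaded = json.loads(corpus_path.read_text(encoding="utf-8"))

print(len(loaded), sorted(loaded[0]))
```

Validating the keys before writing catches malformed entries early, before the reader component ever sees the file.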

As a training corpus, we will use the PLoS sentence corpus. It consists of 300 computational biology articles, each stored in a separate *txt* file. For simplicity, we will use the same configuration file that is used for the Wikipedia-based ODQA system; however, we strongly encourage you to create custom configuration files for your own models.

In [3]:
!wget -q http://archive.ics.uci.edu/ml/machine-learning-databases/00311/SentenceCorpus.zip
!unzip SentenceCorpus.zip

Archive:  SentenceCorpus.zip
   creating: SentenceCorpus/
  inflating: SentenceCorpus/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/SentenceCorpus/
  inflating: __MACOSX/SentenceCorpus/._.DS_Store  
  inflating: SentenceCorpus/Instructions_for_SentenceAnnotation.pdf  
  inflating: __MACOSX/SentenceCorpus/._Instructions_for_SentenceAnnotation.pdf  
   creating: SentenceCorpus/labeled_articles/
  inflating: SentenceCorpus/labeled_articles/.DS_Store  
   creating: __MACOSX/SentenceCorpus/labeled_articles/
  inflating: __MACOSX/SentenceCorpus/labeled_articles/._.DS_Store  
  inflating: SentenceCorpus/labeled_articles/arxiv_annotate10_7_1.txt  
  inflating: __MACOSX/SentenceCorpus/labeled_articles/._arxiv_annotate10_7_1.txt  
  inflating: SentenceCorpus/labeled_articles/arxiv_annotate10_7_2.txt  
  inflating: __MACOSX/SentenceCorpus/labeled_articles/._arxiv_annotate10_7_2.txt  
  inflating: SentenceCorpus/labeled_articles/arxiv_annotate10_7_3.txt  
  inflating: __MACOSX/SentenceC

In order to fit a model on new data, first, change the **data_path** parameter of the **dataset_reader** section. Then change the **dataset_format** to *txt*. Finally, train the model.

In [4]:
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Load the TF-IDF ranker config and point it at the unpacked corpus
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "/content/SentenceCorpus/unlabeled_articles/plos_unlabeled"
model_config["dataset_reader"]["dataset_format"] = "txt"
# Fit the ranker on the new documents
doc_retrieval = train_model(model_config)

2019-08-06 09:36:39.466 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 57: Reading files...
2019-08-06 09:36:39.471 INFO in 'deeppavlov.dataset_readers.odqa_reader'['odqa_reader'] at line 134: Building the database...
  0%|          | 0/300 [00:00<?, ?it/s]
 72%|███████▏  | 215/300 [00:00<00:00, 2138.22it/s]
100%|██████████| 300/300 [00:00<00:00, 2724.41it/s]
2019-08-06 09:36:39.675 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 57: Connecting to database, path: /root/.deeppavlov/downloads/odqa/enwiki.db
2019-08-06 09:36:39.678 INFO in 'deeppavlov.dataset_iterators.sqlite_iterator'['sqlite_iterator'] at line 112: SQLite iterator: The size of the database is 300 documents
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprop

Examine the ranker output.

In [5]:
doc_retrieval(['cerebellum'])

[['499.txt',
  '563.txt',
  '566.txt',
  '585.txt',
  '58.txt',
  '50.txt',
  '426.txt',
  '494.txt',
  '490.txt',
  '485.txt',
  '484.txt',
  '583.txt',
  '478.txt',
  '466.txt',
  '46.txt',
  '453.txt',
  '445.txt',
  '438.txt',
  '437.txt',
  '436.txt',
  '430.txt',
  '429.txt',
  '505.txt',
  '470.txt',
  '59.txt']]
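
The ranker returns one list of document ids per query, and in this run the ids appear to be the file names of the corpus. Under that assumption, the top-ranked documents can be read back from disk for a quick inspection. The helper `read_top_documents` is hypothetical, and the demo below uses a temporary folder with placeholder files so that it runs without the corpus.

```python
import tempfile
from pathlib import Path


def read_top_documents(doc_ids, corpus_dir, k=3):
    """Return {doc_id: text} for those of the first k ranked ids that exist on disk."""
    corpus_dir = Path(corpus_dir)
    texts = {}
    for doc_id in doc_ids[:k]:
        path = corpus_dir / doc_id
        if path.is_file():
            texts[doc_id] = path.read_text(encoding="utf-8", errors="replace")
    return texts


# Stand-in for the ranker output: one ranked list of ids per query.
ranked = [["499.txt", "563.txt", "566.txt", "585.txt"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder files; 563.txt is left out on purpose to show that it is skipped.
    for name in ("499.txt", "566.txt"):
        (Path(tmp_dir) / name).write_text(f"placeholder text of {name}", encoding="utf-8")
    top_texts = read_top_documents(ranked[0], tmp_dir, k=3)

for doc_id, text in top_texts.items():
    print(doc_id, text[:40])
```

With the real corpus, `corpus_dir` would be the **data_path** used for training, and `ranked` would be the value returned by `doc_retrieval`.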

Everything is now ready to run the ODQA component. Make sure to set **download = False**; otherwise the downloaded pretrained Wikipedia files will overwrite your model.

In [8]:
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

# Download the pretrained SQuAD model files
squad = build_model(configs.squad.multi_squad_noans_infer, download=True)
# Do not download the ODQA ranker files: we have just trained the ranker
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)
answers = odqa(["what is tuberculosis?", "how should I take antibiotics?"])

2019-08-06 09:38:25.848 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/deeppavlov_data/multi_squad_model_noans_1.1.tar.gz to /root/.deeppavlov/multi_squad_model_noans_1.1.tar.gz
I0806 09:38:25.848002 140130954237824 utils.py:63] Downloading from http://files.deeppavlov.ai/deeppavlov_data/multi_squad_model_noans_1.1.tar.gz to /root/.deeppavlov/multi_squad_model_noans_1.1.tar.gz
100%|██████████| 265M/265M [01:02<00:00, 4.23MB/s]
2019-08-06 09:39:28.476 INFO in 'deeppavlov.core.data.utils'['utils'] at line 201: Extracting /root/.deeppavlov/multi_squad_model_noans_1.1.tar.gz archive into /root/.deeppavlov/models
I0806 09:39:28.476614 140130954237824 utils.py:201] Extracting /root/.deeppavlov/multi_squad_model_noans_1.1.tar.gz archive into /root/.deeppavlov/models
2019-08-06 09:39:33.930 INFO in 'deeppavlov.models.preprocessors.squad_preprocessor'['squad_preprocessor'] at line 310: SquadVocabEmbedder: loading saved tokens vocab from /ro

UnknownError: ignored

# Useful links

[DeepPavlov repository](https://github.com/deepmipt/DeepPavlov)

[DeepPavlov demo page](https://demo.ipavlov.ai)

[DeepPavlov documentation](https://docs.deeppavlov.ai)