## Build an Open-Domain Question-Answering System With BERT and `ktrain`

We first install `ktrain` and load a dataset into a Python list. We use the [20 Newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) in this example.

In [1]:
!pip3 install -q ktrain

In [2]:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
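
Stripping headers, footers, and quotes can leave some posts empty or whitespace-only. The following is a minimal sketch of dropping such posts before indexing. `drop_empty_docs` is a hypothetical helper, not part of `ktrain` or scikit-learn, and the sample list stands in for the real `docs`. Filtering shifts list positions, so document references reported later would no longer match the unfiltered list.

```python
def drop_empty_docs(docs):
    """Return only the documents that contain non-whitespace text.

    Hypothetical helper for illustration; not part of ktrain or scikit-learn.
    """
    return [doc for doc in docs if doc and doc.strip()]


# Small stand-in for the list returned by fetch_20newsgroups(...)['data'].
sample = ["Cassini is scheduled for launch in 1997.", "", "   \n", "Gamma correction matters."]
cleaned = drop_empty_docs(sample)
print(len(sample), "->", len(cleaned))
```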

Next, we will import `ktrain` modules and set the location of the search index.

In [3]:
from ktrain.text import SimpleQA

In [4]:
INDEXDIR = '/tmp/myindex'
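
When rerunning the notebook, an index directory left over from an earlier run can get in the way of creating a new one. A minimal sketch of clearing it first is shown below. `reset_index_dir` is a hypothetical helper (not a `ktrain` function), and it permanently deletes the directory, so it is demonstrated here on a throwaway path rather than on `INDEXDIR`.

```python
import os
import shutil
import tempfile


def reset_index_dir(path):
    """Delete path if it is an existing directory, then return path unchanged."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    return path


# Demonstrated on a throwaway directory rather than the real INDEXDIR.
demo_dir = os.path.join(tempfile.mkdtemp(), "myindex")
os.makedirs(demo_dir)
reset_index_dir(demo_dir)
print(os.path.exists(demo_dir))
```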

### STEP 1: Create a Search Index

In [5]:
SimpleQA.initialize_index(INDEXDIR)
SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs),
                         multisegment=True, procs=4, # these args speed up indexing
                         breakup_docs=True         # this slows indexing but speeds up answer retrieval
                         )
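
The `breakup_docs=True` option trades indexing time for faster answer retrieval, presumably because answers are then extracted from shorter passages instead of whole posts. The sketch below only illustrates the general idea of passage-level splitting. It assumes paragraphs are separated by blank lines, and `split_into_passages` is a hypothetical helper rather than `ktrain`'s actual implementation.

```python
import re


def split_into_passages(doc, min_chars=1):
    """Split a document on blank lines and drop passages shorter than min_chars."""
    parts = re.split(r"\n\s*\n", doc)
    passages = [part.strip() for part in parts]
    return [p for p in passages if len(p) >= min_chars]


post = "Cassini will launch in 1997.\n\nIt will reach Saturn in 2004.\n\n\n"
for i, passage in enumerate(split_into_passages(post)):
    print(i, passage)
```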

### STEP 2: Create a QA Instance

In [6]:
qa = SimpleQA(INDEXDIR)

### STEP 3: Ask Questions

##### Space Question

In [7]:
answers = qa.ask('When did the Cassini probe launch?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | in october of 1997 | cassini is scheduled for launch aboard a titan iv / centaur in october of 1997 . | 0.819033 | 1257 |
| 1 | on january 26,1962 | ranger 3, launched on january 26,1962 , was intended to land an instrument capsule on the surface of the moon, but problems during the launch caused the probe to miss the moon and head into solar orbit. | 0.151229 | 9591 |
| 2 | - 10 / 06 / 97 | key scheduled dates for the cassini mission (vvejga trajectory)-------------------------------------------------------------10 / 06 / 97-titan iv / centaur launch 04 / 21 / 98-venus 1 gravity assist 06 / 20 / 99-venus 2 gravity assist 08 / 16 / 99-earth gravity assist 12 / 30 / 00-jupiter gravity assist 06 / 25 / 04-saturn arrival 01 / 09 / 05-titan probe release 01 / 30 / 05-titan probe entry 06 / 25 / 08-end of primary mission (schedule last updated 7 / 22 / 92) - 10 / 06 / 97 | 0.029694 | 1257 |
| 3 | * 98 | cassini * * * * * * * * * * * * * * * * * * 98 ,115 * * * * | 2.6e-05 | 12775 |
| 4 | the latter part of the 1990s | scheduled for launch in the latter part of the 1990s , the craf and cassini missions are a collaborative project of nasa, the european space agency and the federal space agencies of germany and italy, as well as the united states air force and the department of energy. | 1.7e-05 | 16868 |


As shown above, the top candidate answer, **October 1997**, is correct. (The top-ranked candidate will not always be the right one, but it is here.)
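
Because the top-ranked candidate is not guaranteed to be right, it can help to discard low-confidence candidates before showing them. The following is a minimal sketch, assuming each item returned by `qa.ask` is a dict with `answer` and `confidence` keys. The `confidence` key is an assumption based on the Confidence column in the output above, the sample values are copied from that output, and `filter_by_confidence` is a hypothetical helper.

```python
def filter_by_confidence(answers, min_confidence=0.1):
    """Keep candidate answers whose confidence is at least min_confidence."""
    return [a for a in answers if a.get("confidence", 0.0) >= min_confidence]


# Stand-in for the list returned by qa.ask(...); the "confidence" key is assumed.
sample_answers = [
    {"answer": "in october of 1997", "confidence": 0.819033},
    {"answer": "on january 26,1962", "confidence": 0.151229},
    {"answer": "- 10 / 06 / 97", "confidence": 0.029694},
]

for a in filter_by_confidence(sample_answers):
    print(f"{a['confidence']:.3f}  {a['answer']}")
```

The threshold of 0.1 is arbitrary; a suitable value depends on the corpus and on how costly a wrong answer is.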

##### Technical Support Question

In [8]:
answers = qa.ask('What causes computer images to be too dark?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then linear images will look too dark, and gamma corrected images will ok. | 0.93799 | 10107 |
| 1 | is gamma correction | this, is gamma correction (or the lack of it). | 0.045165 | 10107 |
| 2 | so if you just dump your nice linear image out to a crt | so if you just dump your nice linear image out to a crt , the image will look much too dark. | 0.010337 | 10107 |
| 3 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 2808 |
| 4 | that small color details | the algorithm achieves much of its compression by exploiting known limitations of the human eye, notably the fact that small color details are not perceived as well as small details of light and dark. | 0.002114 | 1948 |


It looks like a **lack of gamma correction** is a cause of this technical problem.
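
The last two rows of the output above repeat the same candidate answer and context from two different documents (2808 and 1948), possibly because the same passage appears in more than one post. Below is a sketch of collapsing such duplicates while preserving rank order. It assumes each answer is a dict with `answer`, `context`, and `reference` keys (the latter two are assumptions based on the column names), and `dedupe_answers` is a hypothetical helper.

```python
def dedupe_answers(answers):
    """Drop later candidates whose (answer, context) pair was already seen."""
    seen = set()
    unique = []
    for a in answers:
        key = (a.get("answer"), a.get("context"))
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique


# Stand-in for qa.ask(...) output; contexts are shortened and key names are assumed.
sample_answers = [
    {"answer": "is gamma correction", "context": "this, is gamma correction", "reference": 10107},
    {"answer": "that small color details", "context": "the algorithm achieves", "reference": 2808},
    {"answer": "that small color details", "context": "the algorithm achieves", "reference": 1948},
]
print([a["reference"] for a in dedupe_answers(sample_answers)])
```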

##### Religious Question

In [35]:
answers = qa.ask('Who was Muhammad?')
# Print only the text of the top-ranked candidate answer
print(answers[0]['answer'])

the holy prophet of islam


Finally, we install Gradio to put a simple web interface in front of the QA system.

In [11]:
!pip3 install -q gradio

In [36]:
import gradio as gr

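def answer_question(question):
    # Return the text of the top-ranked candidate, or a fallback if none is found
    answers = qa.ask(question)
    if not answers:
        return 'No answer found.'
    return answers[0]['answer']

demo = gr.Interface(fn=answer_question, inputs="text", outputs="text")

demo.launch()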

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`

Using Embedded Colab Mode (NEW). If you have issues, please use share=True and file an issue at https://github.com/gradio-app/gradio/
Note: opening the browser inspector may crash Embedded Colab Mode.

To create a public link, set `share=True` in `launch()`.


(<gradio.routes.App at 0x7fcbc63a2d50>, 'http://127.0.0.1:7866/', None)
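
The interface above returns only the single best answer. One possible variant formats the top few candidates together with their confidence so that the user can judge them. This is a sketch under the assumption that each answer dict has `answer` and `confidence` keys; `format_top_answers` is a hypothetical helper that could be called inside the function passed to `gr.Interface`.

```python
def format_top_answers(answers, k=3):
    """Render up to k candidate answers as numbered lines of plain text."""
    if not answers:
        return "No answer found."
    lines = []
    for rank, a in enumerate(answers[:k], start=1):
        confidence = a.get("confidence")  # assumed key; omitted from output if missing
        score = f" ({confidence:.2f})" if confidence is not None else ""
        lines.append(f"{rank}. {a['answer']}{score}")
    return "\n".join(lines)


# Stand-in for the list returned by qa.ask(...).
sample_answers = [
    {"answer": "in october of 1997", "confidence": 0.819033},
    {"answer": "on january 26,1962", "confidence": 0.151229},
]
print(format_top_answers(sample_answers))
```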