# Building a document answering model 
In this section, let us learn how to use ktrain for building a document answering model. We know that in document answering, we will have some set of documents and we use those documents to answer a question. Let us see how we can do this using ktrain. 

First, let us import the necessary libraries:


In [1]:
%%capture
!pip install ktrain==0.25.3

In [2]:
from ktrain import text
import os
import shutil


In this exercise, we will use the BBC news dataset. The BBC news dataset consists of 2225 documents containing news from 2004 to 2005. It includes news in these five categories - business, entertainment, politics, sport, tech. 

First, let us download the dataset: 


In [3]:
!wget http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip

--2020-12-30 22:23:23--  http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip
Resolving mlg.ucd.ie (mlg.ucd.ie)... 137.43.93.132
Connecting to mlg.ucd.ie (mlg.ucd.ie)|137.43.93.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2874078 (2.7M) [application/zip]
Saving to: ‘bbc-fulltext.zip’


2020-12-30 22:23:24 (3.76 MB/s) - ‘bbc-fulltext.zip’ saved [2874078/2874078]




Next, unzip the dataset: 


In [4]:
%%capture
!unzip bbc-fulltext.zip


Now, import the necessary libraries: 


In [5]:
from ktrain import text
import os


Change directory to our BBC news folder: 

In [6]:
os.chdir(os.getcwd() + '/bbc')


The first step is initializing the index directory. It is used for indexing all the documents. We don't need to create any new directory manually. We just need to pass the name of the index directory to the function intialize_index and then the index directory will be created: 


In [7]:
text.SimpleQA.initialize_index('index')

FileIndex(FileStorage('index'), 'MAIN')


After initializing the index, we need to index the documents. Since we will have all the documents in the folder, we will use the function index_from_folder.  The function index_from_folder takes the folder_path where we have all the documents and the index_dir as the parameter: 


In [8]:
text.SimpleQA.index_from_folder(folder_path='entertainment',index_dir='index')


The next step is creating an instance of the SimpleQA class as shown below. We need to pass the index directory as an argument: 


In [9]:
qa = text.SimpleQA('index')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=443.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1341090760.0, style=ProgressStyle(descr…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=536063208.0, style=ProgressStyle(descri…





That's it. Now, we can use the ask function and retrieve answers from the documents for any question: 


In [10]:
answers = qa.ask('who who had a global hit with where is the love?')



Let us print the top 5 answers:


In [11]:
qa.display_answers(answers[:5])

Unnamed: 0,Candidate Answer,Context,Confidence,Document Reference
0,the black eyed peas,"the black eyed peas -who had a global hit with where is the love ?-picked up the prize for best pop act, beating anastacia, avril lavigne, robbie williams and britney spears.",0.9997414,153.txt
1,out kast,out kast will add their awards to the four they won at the us mtv awards in august and three grammys in february.,0.0001829924,153.txt
2,of johnny cash,"he wrote ring of fire with june carter cash, the future wife of johnny cash who went on to score his most popular hit with the track.",7.449766e-05,119.txt
3,s free,"rodgers was singer with early 1970s rocker s free , who had a global hit with all right now, before forming bad company, a successful "" supergroup "" with members of king crimson and mott the hoople.",3.345142e-07,138.txt
4,s free,"rodgers was singer with early 1970s rocker s free , who had a global hit with all right now, before forming bad company, a successful "" supergroup "" with members of king crimson and mott the hoople.",3.345142e-07,158.txt



As we can observe, along with obtaining the candidate answer, we will get other information such as context, confidence, and document reference. 

Let us try with another question: 

In [12]:
answers = qa.ask('who win at mtv europe awards?')

In [13]:
qa.display_answers(answers[:5])

Unnamed: 0,Candidate Answer,Context,Confidence,Document Reference
0,duo out kast,us hip hop duo out kast have capped a year of award glory with three prizes at the mtv europe music awards in rome.,0.600309,153.txt
1,was justin timberlake,"last year ' s big winner at the mtv europe awards, held in edinburgh, scotland, was justin timberlake , who walked away with three trophies.",0.157786,153.txt
2,duo out kast,us rap duo out kast ' s trio of trophies at the mtv europe awards crowns a year of huge success for the band.,0.121886,132.txt
3,eminem,"eminem performed on hms belfast on friday, which is docked on the river thames, where he filmed two songs for bbc one ' s top of the pops.",0.05998,151.txt
4,the black sabbath,""" at the end of it i did not like having cameras around the house all the time, "" the black sabbath singer told reporters at the mtv europe awards in rome.",0.059523,206.txt



In this way, we can use the ktrain for document answering use cases. 