- `pip3 install -r requirements.txt`
- Download the questions from here and save them as `qanta.train.json`. Alternatively, change the value of the variable `questions_file` in `test_queries.py` to the correct path.
- Download the documents from here and save them as `wiki_lookup.json`. Alternatively, change the value of the variable `file_name_documents` in `Index_Creation_code.py` to the correct path.
- Install all the packages listed in `requirements.txt` using the command above.
- Run `python run.py` (after the first run, comment out the command that runs the index-creation code).
- Index creation for TF-IDF is done by the Python file `Index_Creation_code.py`.
- The locations of the files containing the corpus can be changed at the beginning of each Python file.
- I retrieve only the document titles when displaying the ranking, on the assumption that the titles are sufficient to identify the documents. The top three documents are also stored for use in BERT-SQuAD.
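The TF-IDF indexing done in `Index_Creation_code.py` can be sketched in pure Python. The function and variable names below are illustrative, not the actual identifiers in the repository:

```python
import math
from collections import Counter

def build_tfidf_index(documents):
    """Build an inverted TF-IDF index: term -> {doc_id: weight}.

    `documents` maps a document id (e.g. a Wikipedia title)
    to its raw text. Tokenization here is a plain lowercase
    whitespace split; the real code may normalize differently.
    """
    tokenized = {doc_id: text.lower().split()
                 for doc_id, text in documents.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    n_docs = len(documents)
    index = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        for term, tf in counts.items():
            # Log-scaled term frequency times inverse document frequency.
            weight = (1 + math.log(tf)) * math.log(n_docs / df[term])
            index.setdefault(term, {})[doc_id] = weight
    return index
```

At query time, a query is tokenized the same way and documents are ranked by the sum of the weights of the query terms they contain.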
- Spell checking.
- Using synonyms for equivalence classes.
- Using zonal indexing to give weights to the title and the body.
Case | Accuracy
---|---
Actual document in the top 1 results | 64%
Actual document in the top 5 results | 80%
Actual document in the top 10 results | 84%
Actual document in the top 20 results | 84%
Actual document in the top 30 results | 85%
Case (only the first sentence of the document used as the query) | Accuracy
---|---
Actual document in the top 1 results | 9%
Actual document in the top 5 results | 20%
Actual document in the top 10 results | 26%
Actual document in the top 20 results | 30%
Actual document in the top 30 results | 36%
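The top-k accuracies in the tables above can be computed with a small helper like the following (names are illustrative):

```python
def top_k_accuracy(rankings, gold, ks=(1, 5, 10, 20, 30)):
    """Compute top-k retrieval accuracy.

    `rankings` is a list of ranked document-title lists, one per
    question; `gold` is the list of correct titles. A prediction
    counts as correct at k if the gold title appears among the
    first k retrieved titles. Returns {k: accuracy}.
    """
    result = {}
    for k in ks:
        hits = sum(1 for ranked, answer in zip(rankings, gold)
                   if answer in ranked[:k])
        result[k] = hits / len(gold)
    return result
```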
I use the BERT-SQuAD pre-trained model from [here](https://github.com/kamalkraj/BERT-SQuAD). Given a document and a question, the model returns a specific answer together with a confidence level.
- Download the pretrained model from here.
- Unzip it and move the files to the `model` directory.
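A minimal sketch of combining the retriever's top documents with the QA model. It assumes a hypothetical `model.predict(document, question)` that returns a dict with `answer` and `confidence` keys; the actual interface in the BERT-SQuAD repository may differ:

```python
def best_answer(model, question, top_docs):
    """Run the QA model over each of the top retrieved documents
    and keep the answer with the highest confidence."""
    best = None
    for doc in top_docs:
        # Assumed interface: returns {"answer": ..., "confidence": ...}.
        pred = model.predict(doc, question)
        if best is None or pred["confidence"] > best["confidence"]:
            best = pred
    return best
```

This is why the top three retrieved documents are stored: each is passed to the reader, and the most confident span wins.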