Raj-Sanjay-Shah/QA_Document_Retrieval

Create a better system for answering questions

Requirements

  1. pip3 install -r requirements.txt
  2. Download the questions from here and save them as 'qanta.train.json', or change the variable 'questions_file' in test_queries.py to point to the correct path.
  3. Download the documents from here and save them as 'wiki_lookup.json', or change the variable 'file_name_documents' in Index_Creation_code.py to point to the correct path.

Steps to run the code:

  1. Install all the requirements in requirements.txt using the command above.
  2. python run.py (after the first run, comment out the command that runs the index-creation code so the index is not rebuilt.)

Document Retrieval

  1. Index creation for tf-idf is done by the Python file Index_Creation_code.py.
  2. Please note that the locations of the files containing the corpus can be changed, as convenient, at the beginning of each Python file.
    • I only retrieve the document titles when displaying the ranking; I assume the titles are sufficient to identify the documents. The top three documents are also stored for use in BERT-SQuAD.
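The contents of Index_Creation_code.py are not shown here; as a hedged sketch of what TF-IDF indexing with title-only ranking could look like (this uses scikit-learn, and the "text" field name follows the wiki_lookup.json layout — both are assumptions, not the repository's actual code):

```python
# Sketch of TF-IDF indexing and title-ranked retrieval (illustrative, not
# the repository's Index_Creation_code.py).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(docs):
    """docs: dict mapping title -> {"text": ...} (wiki_lookup.json-style layout).
    Returns the fitted vectorizer, the document-term matrix, and the titles."""
    titles = list(docs)
    texts = [docs[t]["text"] for t in titles]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix, titles

def top_titles(query, vectorizer, matrix, titles, k=3):
    """Rank documents by cosine similarity to the query; return the top-k titles."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    return [titles[i] for i in scores.argsort()[::-1][:k]]
```

The top-k titles returned here correspond to the "top three documents" that are stored for the BERT-SQuAD stage.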

Supported improvements to tf-idf vector based retrieval:

  1. Spell checking.
  2. Using synonyms for equivalence classes.
  3. Using zonal indexing to give weights to the title and the body.
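The third improvement, zonal indexing, scores the title and body zones separately and combines them with zone weights. A minimal sketch of that idea — the 0.7/0.3 weights and the simple term-overlap scoring are illustrative, not the repository's actual values:

```python
# Illustrative zonal (fielded) scoring: the final score is a weighted sum of
# per-zone match scores, with the title zone weighted higher than the body.
def zone_score(query_terms, title, body, w_title=0.7, w_body=0.3):
    """Fraction of query terms found in each zone, combined with zone weights.
    The weights are example values and should sum to 1."""
    title_terms = set(title.lower().split())
    body_terms = set(body.lower().split())
    hits_title = sum(t in title_terms for t in query_terms)
    hits_body = sum(t in body_terms for t in query_terms)
    n = len(query_terms) or 1  # avoid division by zero on an empty query
    return w_title * hits_title / n + w_body * hits_body / n
```

In a full system the per-zone scores would be tf-idf similarities rather than raw term overlap, but the weighting scheme is the same.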

Results for retrieval

A prediction is counted as correct when the actual document appears in the top-k retrieved results.

| Top-k results | Full question as query | First sentence only as query |
| --- | --- | --- |
| Top 1 | 64% | 9% |
| Top 5 | 80% | 20% |
| Top 10 | 84% | 26% |
| Top 20 | 84% | 30% |
| Top 30 | 85% | 36% |
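The accuracies above count a prediction as correct when the actual document appears among the top k retrieved results. A minimal sketch of such a top-k evaluation (function and argument names are mine, not the repository's):

```python
# Illustrative top-k retrieval accuracy: the fraction of queries whose gold
# document title appears in the first k ranked titles.
def top_k_accuracy(ranked_titles_per_query, gold_titles, k):
    """ranked_titles_per_query: one ranked list of titles per query.
    gold_titles: the correct title for each query, in the same order."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_titles_per_query, gold_titles))
    return hits / len(gold_titles)
```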

Answering system

I use the BERT-SQuAD pre-trained model from here [https://github.com/kamalkraj/BERT-SQuAD]. Given a document and a question, this model returns a specific answer with a confidence score.

Download the pretrained model from here, unzip it, and move the files to the model directory.
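The top documents from the retrieval step are fed to the QA model one at a time, and the most confident answer wins. A hedged sketch of that glue — `answer_question` below is a trivial placeholder standing in for the BERT-SQuAD predict call (its real signature and return shape are assumptions), included only so the control flow is runnable:

```python
# Illustrative retrieval-to-answering glue. `answer_question` is a placeholder
# for the BERT-SQuAD model's predict call, which returns an answer span and a
# confidence score for a (document, question) pair.
def answer_question(document, question):
    # Placeholder: a real implementation would call the BERT-SQuAD model here.
    return document.split()[0], 0.5

def best_answer(top_documents, question):
    """Run the QA model on each retrieved document; keep the most confident answer."""
    candidates = [answer_question(doc, question) for doc in top_documents]
    return max(candidates, key=lambda c: c[1])
```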

About

Creating a question answering system that uses TF-IDF for document retrieval and BERT for retrieving a specific answer.
