- `pip3 install -r requirements.txt`
- Download the questions from here and save them as `qanta.train.json`. Alternatively, change the value of the variable `questions_file` in `test_queries.py` to the correct path.
- Download the documents from here and save them as `wiki_lookup.json`. Alternatively, change the value of the variable `file_name_documents` in `Index_Creation_code.py` to the correct path.
- Install all the packages listed in `requirements.txt` using the command above.
- Run `python run.py` (after the first run, comment out the command that runs the index-creation code).
- Index creation for TF-IDF is done by the Python file `Index_Creation_code.py`.
- The locations of the files containing the corpus can be changed at the beginning of each Python file.
- I retrieve only the document titles when displaying the ranking, on the assumption that the titles are sufficient to identify the documents. The top three documents are also stored for use in BERT-SQuAD.
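The TF-IDF indexing done in `Index_Creation_code.py` can be sketched in pure Python. The function and variable names below are illustrative, not the actual identifiers in the repository:

```python
import math
from collections import Counter

def build_tfidf_index(documents):
    """Build an inverted TF-IDF index: term -> {doc_id: weight}.

    `documents` maps a document id (e.g. a Wikipedia title)
    to its raw text. Tokenization here is a plain lowercase
    whitespace split; the real code may normalize differently.
    """
    tokenized = {doc_id: text.lower().split()
                 for doc_id, text in documents.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    n_docs = len(documents)
    index = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        for term, tf in counts.items():
            # Log-scaled term frequency times inverse document frequency.
            weight = (1 + math.log(tf)) * math.log(n_docs / df[term])
            index.setdefault(term, {})[doc_id] = weight
    return index
```

At query time, a query is tokenized the same way and documents are ranked by the sum of the weights of the query terms they contain.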
- Spell checking.
- Using synonyms for equivalence classes.
- Using zonal indexing to give weights to the title and the body.
Case | Accuracy
---|---
Actual document in the top 1 results | 64%
Actual document in the top 5 results | 80%
Actual document in the top 10 results | 84%
Actual document in the top 20 results | 84%
Actual document in the top 30 results | 85%
Case (only the first sentence of the document used as the query) | Accuracy
---|---
Actual document in the top 1 results | 9%
Actual document in the top 5 results | 20%
Actual document in the top 10 results | 26%
Actual document in the top 20 results | 30%
Actual document in the top 30 results | 36%
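The top-k accuracies in the tables above can be computed with a small helper like the following (names are illustrative):

```python
def top_k_accuracy(rankings, gold, ks=(1, 5, 10, 20, 30)):
    """Compute top-k retrieval accuracy.

    `rankings` is a list of ranked document-title lists, one per
    question; `gold` is the list of correct titles. A prediction
    counts as correct at k if the gold title appears among the
    first k retrieved titles. Returns {k: accuracy}.
    """
    result = {}
    for k in ks:
        hits = sum(1 for ranked, answer in zip(rankings, gold)
                   if answer in ranked[:k])
        result[k] = hits / len(gold)
    return result
```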
I use the BERT-SQuAD pre-trained model from [here](https://github.com/kamalkraj/BERT-SQuAD). Given a document and a question, the model returns a specific answer together with a confidence level.
- Download the pretrained model from here.
- Unzip it and move the files to the `model` directory.
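A minimal sketch of combining the retriever's top documents with the QA model. It assumes a hypothetical `model.predict(document, question)` that returns a dict with `answer` and `confidence` keys; the actual interface in the BERT-SQuAD repository may differ:

```python
def best_answer(model, question, top_docs):
    """Run the QA model over each of the top retrieved documents
    and keep the answer with the highest confidence."""
    best = None
    for doc in top_docs:
        # Assumed interface: returns {"answer": ..., "confidence": ...}.
        pred = model.predict(doc, question)
        if best is None or pred["confidence"] > best["confidence"]:
            best = pred
    return best
```

This is why the top three retrieved documents are stored: each is passed to the reader, and the most confident span wins.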