This work focuses on comparing and improving BERT-based question-answering (QA) models while using less computation and GPU time.
The goal is to improve the performance of BERT-based QA models on the SQuAD dataset by proposing several enhancements to the BERT architecture.
Techniques Used:
- Hyperparameter tuning
- Ensembling Multiple Models
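Hyperparameter tuning can be sketched as a simple grid search over the usual BERT fine-tuning knobs. The `train_and_evaluate` function below is a hypothetical placeholder standing in for an actual fine-tuning run on SQuAD; a real search would fine-tune the model once per configuration and score it on the dev set.

```python
from itertools import product

# Hypothetical stand-in for fine-tuning BERT and returning a dev-set F1.
# A real run would fine-tune on SQuAD 2.0 for each configuration.
def train_and_evaluate(learning_rate, batch_size):
    # Placeholder scoring function, for illustration only.
    return 0.9 - abs(learning_rate - 3e-5) * 1000 + 0.001 * (batch_size == 16)

search_space = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32],
}

best_score, best_config = float("-inf"), None
for lr, bs in product(search_space["learning_rate"], search_space["batch_size"]):
    score = train_and_evaluate(lr, bs)
    if score > best_score:
        best_score, best_config = score, (lr, bs)

print(best_config)  # best (learning_rate, batch_size) pair found
```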
Dataset:
- The Stanford Question Answering Dataset (SQuAD) 2.0 is designed to evaluate question-answering systems on more challenging tasks.
- It contains questions that have no answer in the given context (unanswerable questions).
- The dataset contains over 100,000 questions derived from Wikipedia articles, covering a wide range of topics.
Dataset download: https://rajpurkar.github.io/SQuAD-explorer/
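The downloaded SQuAD 2.0 file is a nested JSON of articles, paragraphs, and question-answer pairs; unanswerable questions are flagged with `is_impossible`. The sketch below flattens that layout into per-question examples, using a small inline record in the same shape (the sample text and IDs are made up for illustration).

```python
# A minimal record in the SQuAD 2.0 JSON layout; the real downloaded file
# has the same structure with many more articles.
squad_like = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "BERT was released by Google in 2018.",
            "qas": [
                {"id": "q1", "question": "Who released BERT?",
                 "is_impossible": False,
                 "answers": [{"text": "Google", "answer_start": 21}]},
                {"id": "q2", "question": "Who released GPT-4?",
                 "is_impossible": True,
                 "answers": []},  # unanswerable: no answer in the context
            ],
        }],
    }],
}

def iter_examples(dataset):
    """Flatten the nested SQuAD layout into per-question tuples."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield (qa["id"], qa["question"], paragraph["context"],
                       [a["text"] for a in qa["answers"]], qa["is_impossible"])

examples = list(iter_examples(squad_like))
print(len(examples))    # 2
print(examples[1][-1])  # True: the second question is unanswerable
```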
Evaluation Metrics:
- The F1 score measures how well the model predicts the answer, computed from the token-level overlap between the predicted answer and the ground-truth answer.
- Exact Match (EM) measures whether the model's answer is exactly identical to the ground-truth answer.
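The two metrics can be computed as below. The normalization here is simplified for illustration; the official SQuAD evaluation script additionally lowercases, strips punctuation, and removes articles before comparing.

```python
from collections import Counter

def normalize(text):
    # Simplified normalization: lowercase and split on whitespace.
    # The official SQuAD script also strips articles and punctuation.
    return text.lower().split()

def exact_match(prediction, truth):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall of the overlap."""
    pred_tokens, truth_tokens = normalize(prediction), normalize(truth)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Eiffel Tower", "Eiffel Tower"))  # 0.8
print(exact_match("Eiffel Tower", "eiffel tower"))   # 1.0
```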
Results:
- Ensembling techniques improve QA model performance across both metrics (F1 and EM).
- Hyperparameter tuning was performed but did not yield a significant performance improvement.
- Ensembling pretrained large-scale language models is a time-efficient solution.
- The best model is the ensemble with majority voting, achieving an F1 score of 0.942 and an exact match of 0.960.
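The majority-voting ensemble described above can be sketched as follows: each fine-tuned model proposes an answer span for a question, and the span most models agree on wins. The model answers shown are hypothetical examples.

```python
from collections import Counter

def majority_vote(predictions):
    """Pick the answer most models agree on; ties break by first occurrence."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical answers from three fine-tuned BERT models for one question.
model_answers = ["Google", "Google Inc.", "Google"]
print(majority_vote(model_answers))  # Google
```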