Data Quality Index (DQI)

Neural language models have achieved human-level performance on several NLP datasets. However, recent studies have shown that these models do not truly learn to perform the desired task; rather, their high performance is attributed to overfitting on spurious biases. To help dataset creators build datasets free of such unwanted biases, and to help dataset solvers adopt special methods that exploit them, we introduce an empirical formula for a Data Quality Index (DQI) and tune it through rigorous experimentation. We also propose a novel adversarial filtering algorithm, Robust AFLite (RAFLite), to remove dataset biases. We show the efficacy of our approach across various NLI, Question Answering, and Reading Comprehension datasets. Our work advances the process of dynamic dataset creation, in which datasets evolve together with the state of the art and thereby serve as a means of benchmarking the true progress of AI.

Repository Structure:
AFLite Implementation Python Notebook: Our implementation of AFLite, as described in the WinoGrande paper.
RAFLite: Implementation of the proposed adversarial filtering algorithm.
Viz: UI work that incorporates DQI to provide feedback to data creators.
Papers: List of the papers read to identify parameters, with their summaries in the Excel sheet.
Pre Analysis Images: Plots showing the quality of the various datasets we consider.
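
For readers unfamiliar with AFLite, the following is a minimal sketch of the general filtering loop described in the WinoGrande paper. It is not the code in this repository and it is not RAFLite. It assumes precomputed feature vectors (in practice, embeddings) and swaps in a nearest-centroid classifier for the linear classifiers used in AFLite so that the example needs only the standard library. The function name `aflite_sketch`, its parameters, their default values, and the synthetic data are all hypothetical.

```python
import random


def centroid(rows):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict(x, centroids):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def aflite_sketch(X, y, n_partitions=32, train_frac=0.8, tau=0.75, k=10,
                  min_size=20, seed=0):
    """Iteratively drop the most predictable instances; return kept indices.

    Hypothetical parameters: tau is the predictability threshold, k the
    maximum number of instances removed per round, min_size the target size.
    """
    rng = random.Random(seed)
    keep = list(range(len(X)))
    while len(keep) > min_size:
        correct = {i: 0 for i in keep}
        seen = {i: 0 for i in keep}
        for _ in range(n_partitions):
            # Random train / held-out split of the instances still kept.
            shuffled = keep[:]
            rng.shuffle(shuffled)
            cut = int(train_frac * len(shuffled))
            train, held_out = shuffled[:cut], shuffled[cut:]
            # Stand-in for the linear classifier: one centroid per label.
            by_label = {}
            for i in train:
                by_label.setdefault(y[i], []).append(X[i])
            centroids = {lab: centroid(rows) for lab, rows in by_label.items()}
            for i in held_out:
                seen[i] += 1
                correct[i] += predict(X[i], centroids) == y[i]
        # Predictability score: fraction of held-out predictions that were right.
        scores = {i: correct[i] / seen[i] for i in keep if seen[i] > 0}
        flagged = sorted((i for i in scores if scores[i] >= tau),
                         key=lambda i: scores[i], reverse=True)[:k]
        flagged = flagged[: len(keep) - min_size]  # never go below min_size
        if not flagged:
            break
        removed = set(flagged)
        keep = [i for i in keep if i not in removed]
        if len(flagged) < k:  # few predictable instances left, so stop
            break
    return keep


# Synthetic data (an assumption for illustration only).
rng = random.Random(1)
X, y = [], []
for i in range(40):  # "easy" instances: feature 0 leaks the label
    label = i % 2
    X.append([5.0 if label else -5.0, 0.0])
    y.append(label)
for i in range(40):  # "hard" instances: the label is drawn independently
    X.append([0.0, rng.uniform(-1, 1)])
    y.append(rng.randint(0, 1))

kept = aflite_sketch(X, y)
print(len(kept), "of", len(X), "instances kept")
```

Instances whose labels can be predicted from held-out splits most of the time are treated as carrying spurious cues and are removed first, so what remains should be harder to solve by exploiting such cues.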


bhavdeep98/DQI_Released
