MRC Ablation

This is a repository for the paper "Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets" (Sugawara et al., AAAI 2020).

Analyzed Datasets

| # | Dataset | year | web | spec | size | paper | misc |
|---|---|---|---|---|---|---|---|
| 1 | CoQA | 2018 | link | dialogue-based QA | 127k | link | |
| 2 | DuoRC | 2018 | link | QA on movie scripts | 186k | link | |
| 3 | HotpotQA | 2018 | link | multi-hop reasoning | 113k | link | |
| 4 | SQuAD1.1 | 2016 | link | QA on Wikipedia | 100k | link | |
| 5 | SQuAD2.0 | 2018 | link | unanswerable QA on Wikipedia | 100k | link | |
| 6 | ARC | 2018 | link | science exam on retrieved docs | 8k | link | |
| 7 | MCTest | 2015 | link | children-level narrative QA | 2.6k | link | |
| 8 | MultiRC | 2018 | link | multi-sentence QA | 6k | link | |
| 9 | RACE | 2017 | link | English exam | 100k | link | |
| 10 | SWAG | 2018 | link | machine-generated commonsense QA | 113k | link | |

Scripts for Ablation

Our codebase extends Hugging Face's BERT implementation (originally huggingface/pytorch-pretrained-bert, as of Nov. 2018).

Coming soon.

Ablation Methods

Each dataset directory under results contains the following directories:

| # | Ablation method | Directory | Description |
|---|---|---|---|
| 0 | original | original | the original data (development set) |
| 1 | Question interrogatives only | drop_question_except_interrogatives | drop question words except interrogatives (wh*, how) |
| 2 | Function words only | drop_content_words | drop content words (verb, noun, ...) |
| 3 | Content words only | drop_function_words | drop function words (= stop words here) |
| 4 | Vocabulary anonymization | vocab_anon | replace tokens with their POS tags |
| 5 | Question-context similarity | drop_except_most_similar_sentences | keep the sentences that are most similar to the question in terms of unigram overlap and drop the other sentences |
| 6 | Shuffle context words | shuffle_document_words | randomly shuffle all words in the context |
| 7 | Shuffle sentence words | shuffle_sentence_words | randomly shuffle the words within each sentence, except the last token |
| 8 | Shuffle sentence order | shuffle_sentence_order | randomly shuffle the order of the sentences in the context |
| 9 | Dummy numerics | mask_numerics | replace numerical expressions with random numbers |
| 10 | Logical words dropped | drop_logical_words | drop logical terms such as not, every, and if |
| 11 | Pronoun words dropped | mask_pronouns | drop personal and possessive pronouns (PRP and PRP$ tags) |
| 12 | Causal words dropped | drop_causal_words | drop causal terms/clauses such as because and therefore |
| 3' | (trained) content words only | train_content_only | drop function words (= stop words here) (also in training) |
| 6' | (trained) shuffle context words | train_doc_shuff | randomly shuffle all words in the context (also in training) |
| 7' | (trained) shuffle sentence words | train_sent_shuff | randomly shuffle the words within each sentence, except the last token (also in training) |
| x | Question dropped | drop_question_words | drop all question words |
| y | Context dropped | drop_context_words | drop all context words |
| z | Options only | drop_except_options | drop all question and context words (only for multiple choice datasets) |
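
Since the ablation scripts are not released yet, the following is only a rough sketch of how three of the methods above (5, 7, and 8) could be implemented. It is not the authors' code: whitespace and regex tokenization, counting shared unigrams as the similarity score, and the function names and the `k` parameter are all assumptions made for illustration.

```python
import random
import re


def tokenize(text):
    # Assumption: lowercase word tokens; the tokenizer used in the paper may differ.
    return re.findall(r"\w+", text.lower())


def shuffle_sentence_words(sentence, rng):
    """Method 7 (sketch): shuffle all tokens of a sentence except the last one."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    head, last = tokens[:-1], tokens[-1]
    rng.shuffle(head)
    return " ".join(head + [last])


def shuffle_sentence_order(sentences, rng):
    """Method 8 (sketch): return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def most_similar_sentences(question, sentences, k=1):
    """Method 5 (sketch): keep the k sentences sharing the most unigrams
    with the question, preserving their original order."""
    question_words = set(tokenize(question))
    scores = [len(question_words & set(tokenize(s))) for s in sentences]
    # Rank by descending overlap; ties are broken by position in the context.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    rng = random.Random(0)
    context = ["Tom lives in Paris .", "He likes tea .", "Paris is in France ."]
    print([shuffle_sentence_words(s, rng) for s in context])
    print(shuffle_sentence_order(context, rng))
    print(most_similar_sentences("Where does Tom live ?", context))
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps each ablation reproducible for a given seed.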

Results for the shuffle-based methods are provided for five different random seeds (seed1 to seed12345).
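
As an illustration of why several seeds are reported, the sketch below shuffles the same context under different seeds: a fixed seed always yields the same permutation, while different seeds generally yield different ones. The function name and the seed values used here are placeholders, not the ones used to produce the results in this repository.

```python
import random


def shuffle_document_words(context, seed):
    """Shuffle all whitespace-separated tokens of the context (method 6, sketch)."""
    rng = random.Random(seed)  # a dedicated generator makes the result depend only on the seed
    tokens = context.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    context = "Tom lives in Paris . He likes tea ."
    for seed in (1, 2, 3):  # placeholder seeds
        print(seed, shuffle_document_words(context, seed))
```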

Each result directory contains an args_log.txt file that specifies the hyperparameters.

