
DMath (Diverse Math Word Problems)

This repository provides DMath (Diverse Math Word Problems), a collection of 10K high-quality grade school-level math word problems for the paper "It Ain’t Over: A Multi-aspect Diverse Math Word Problem Dataset".

Details of the dataset

DMath is a multi-aspect diverse MWP dataset, which has the following key features:

  • It fully covers problem types across five categories: arithmetic calculation (ARI), comparison (COM), correspondence (COR), geometry (GEO), and possibility (POS).
  • It consists of about 10,000 problems manually created by 43 human workers, covering various lexical usage patterns in natural language narratives and expression trees.
  • It supports two input languages, i.e., English and Korean.
  • It offers the annotation of expression trees and Python code as intermediate solutions.

In summary, DMath offers a wide range of diversity in problem types, lexical usage patterns, languages, and intermediate solution forms.

Statistics of the dataset

The table below shows the number of samples per category in DMath.

Split  ARI    COM    COR    POS    GEO    Total
Train  2,476  1,338  1,656  1,417  1,056  7,943
Test     669    334    402    383    291  2,079
Total  3,145  1,672  2,058  1,800  1,347  10,022

Data format

You can see a sample of the data below.

"1": {
    "category": "Geometry",
    "question_ko": "테이프를 일직선으로 쭉 붙이려고 한다. 잘라낸 3개의 테이프가 각각 36센티미터(㎝), 42센티미터(㎝), 48센티미터(㎝)이고 같은 부분만큼 겹쳐서 붙이면 총 길이가 97센티미터(㎝)이다. 겹쳐진 부분의 길이는 얼마인지 구하시오.",
    "question_en": "We are trying to stick a tape in a straight line. The three tapes cut are 36 centimeters (cm), 42 centimeters (cm), and 48 centimeters (cm) respectively, and if the same part is overlapped and attached, the total length is 97 centimeters (cm). Find the length of the overlapping part.",
    "answer_ko": "14.5",
    "answer_en": "14.5",
    "solution_abst_ko": "36 42 [OP_ADD] 48 [OP_ADD] 97 [OP_SUB] 3 1 [OP_SUB] [OP_DIV]",
    "solution_abst_en": "36 42 [OP_ADD] 48 [OP_ADD] 97 [OP_SUB] 3 1 [OP_SUB] [OP_DIV]",
    "solution_code_ko": "var_a = 36\nvar_b = 42\nvar_c = var_a + var_b\nvar_d = 48\nvar_e = var_c + var_d\nvar_f = 97\nvar_g = var_e - var_f\nvar_h = 3\nvar_i = 1\nvar_j = var_h - var_i\nvar_k = var_g / var_j\nprint('{:.2f}'.format(round(var_k+1e-10,2)))",
    "solution_code_en": "var_a = 36\nvar_b = 42\nvar_c = var_a + var_b\nvar_d = 48\nvar_e = var_c + var_d\nvar_f = 97\nvar_g = var_e - var_f\nvar_h = 3\nvar_i = 1\nvar_j = var_h - var_i\nvar_k = var_g / var_j\nprint('{:.2f}'.format(round(var_k+1e-10,2)))"
    },
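
The solution_code fields are plain Python and can be executed directly to reproduce the answer. Below is a minimal sketch that runs the solution_code_en of the sample above and checks the printed result against answer_en; the sample dictionary is written out inline purely for illustration, and how you load it in practice depends on how the released JSON files are organized.

import io
from contextlib import redirect_stdout

# The dictionary shown above, abbreviated to the two fields this sketch needs.
sample = {
    "answer_en": "14.5",
    "solution_code_en": (
        "var_a = 36\nvar_b = 42\nvar_c = var_a + var_b\nvar_d = 48\n"
        "var_e = var_c + var_d\nvar_f = 97\nvar_g = var_e - var_f\n"
        "var_h = 3\nvar_i = 1\nvar_j = var_h - var_i\nvar_k = var_g / var_j\n"
        "print('{:.2f}'.format(round(var_k+1e-10,2)))"
    ),
}

# Execute the code solution and capture what it prints.
buffer = io.StringIO()
with redirect_stdout(buffer):
    exec(sample["solution_code_en"])
predicted = buffer.getvalue().strip()

# The code prints "14.50" while answer_en stores "14.5", so compare numerically.
print(predicted, float(predicted) == float(sample["answer_en"]))  # 14.50 True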

Each data sample consists of the following keys:

  • category : The problem type, one of Arithmetic Calculation, Comparison, Correspondence, Possibility, and Geometry.
  • question_ko : The natural language narrative of the problem in Korean.
  • question_en : The natural language narrative of the problem in English.
  • answer_ko : The answer to the question in Korean.
  • answer_en : The answer to the question in English.
  • solution_abst_ko : The expression tree solution (i.e., abstract solution) of the question in Korean.
  • solution_abst_en : The expression tree solution (i.e., abstract solution) of the question in English.
  • solution_code_ko : The Python code solution of the question in Korean.
  • solution_code_en : The Python code solution of the question in English.
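
The solution_abst fields encode the expression tree as a postfix sequence, with operands preceding their operator. The sketch below evaluates such a string; it implements only the operators that appear in the sample above ([OP_ADD], [OP_SUB], [OP_DIV]), whereas the full dataset uses a larger operator vocabulary that this sketch does not cover.

import operator

# Only the operators appearing in the sample above; the dataset defines more.
OPS = {
    "[OP_ADD]": operator.add,
    "[OP_SUB]": operator.sub,
    "[OP_DIV]": operator.truediv,
}

def eval_abst(solution_abst):
    # Standard postfix evaluation: push operands, pop two values per operator.
    stack = []
    for token in solution_abst.split():
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    return stack[0]

print(eval_abst("36 42 [OP_ADD] 48 [OP_ADD] 97 [OP_SUB] 3 1 [OP_SUB] [OP_DIV]"))  # 14.5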

Experimental Results

The figure below shows the accuracy comparison over various reasoning categories on DMath for RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019), ChatGPT (gpt-3.5-turbo; OpenAI), and GPT-4 (OpenAI, 2023). We fine-tune RoBERTa and GPT-2 and use prompting for ChatGPT and GPT-4. The worst-performing problem category differs across MWP models.

[Figure: Accuracy per category]

The figure below shows the accuracy comparison of MWP models on DMath per solution form in English. "NL prompt" means the natural language prompt. We use few-shot CoT (Wei et al., 2022) as the NL prompt and PAL (Gao et al., 2022) as the Python code prompt. In the fine-tuning approach, we use RoBERTa (Liu et al., 2019) and CodeGPT (Lu et al., 2021) as the models for the expression tree and Python code forms; in the prompting approach, we use GPT-4 (OpenAI, 2023) as the model for both the NL prompt and the Python code prompt. A specific prompt type is preferred depending on the problem.

[Figure: Accuracy per solution form]

We use gpt-3.5-turbo-0301 for ChatGPT and gpt-4-0314 for GPT-4.
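
For context, the two prompt styles differ in what the model is asked to produce: the NL prompt elicits step-by-step natural language reasoning, while the Python code prompt elicits a program whose execution yields the answer. The sketch below only illustrates that difference on the sample problem above; it omits the few-shot exemplars and is not the exact prompt text used in the paper.

# Illustrative only: simplified, exemplar-free versions of the two prompt styles.
question = (
    "We are trying to stick a tape in a straight line. The three tapes cut are "
    "36 centimeters (cm), 42 centimeters (cm), and 48 centimeters (cm) respectively, "
    "and if the same part is overlapped and attached, the total length is 97 "
    "centimeters (cm). Find the length of the overlapping part."
)

# NL (CoT-style) prompt: the model is asked to reason step by step in natural language.
nl_prompt = f"Q: {question}\nA: Let's think step by step."

# Python code (PAL-style) prompt: the model is asked to write code that computes the answer.
code_prompt = (
    f"# Q: {question}\n"
    "# Write a Python function solution() that returns the answer.\n"
    "def solution():"
)

print(nl_prompt)
print(code_prompt)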

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{kim-etal-2023-aint,
    title = "It Ain{'}t Over: A Multi-aspect Diverse Math Word Problem Dataset",
    author = "Kim, Jiwoo  and
      Kim, Youngbin  and
      Baek, Ilwoong  and
      Bak, JinYeong  and
      Lee, Jongwuk",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.927",
    doi = "10.18653/v1/2023.emnlp-main.927",
    pages = "14984--15011",
    abstract = "The math word problem (MWP) is a complex task that requires natural language understanding and logical reasoning to extract key knowledge from natural language narratives. Previous studies have provided various MWP datasets but lack diversity in problem types, lexical usage patterns, languages, and annotations for intermediate solutions. To address these limitations, we introduce a new MWP dataset, named DMath (Diverse Math Word Problems), offering a wide range of diversity in problem types, lexical usage patterns, languages, and intermediate solutions. The problems are available in English and Korean and include an expression tree and Python code as intermediate solutions. Through extensive experiments, we demonstrate that the DMath dataset provides a new opportunity to evaluate the capability of large language models, i.e., GPT-4 only achieves about 75{\%} accuracy on the DMath dataset.",
}
