MathCheck

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

🌐 Homepage | 🤗 Dataset | 📖 Paper | 💻 Results

Intro

Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks predominantly concentrate on problem-solving capabilities, which presents a substantial risk of model overfitting and fails to accurately represent genuine mathematical reasoning abilities. In this paper, we argue that if a model really understands a problem, it should be able to apply that understanding robustly and readily across a diverse array of tasks. Motivated by this, we introduce MATHCHECK, a well-designed checklist for testing task generalization and reasoning robustness, as well as an automatic tool to generate checklists efficiently. MATHCHECK includes multiple mathematical reasoning tasks and robustness test types to facilitate a comprehensive evaluation of both mathematical reasoning ability and behavior testing. Utilizing MATHCHECK, we develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual reasoning and multi-modal reasoning capabilities, respectively, serving as upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K.


Commands for Prediction

In our paper, we evaluate base models and mathematical models in the few-shot setting so that they follow the instructions; all other models are evaluated zero-shot. Before running prediction, please unzip images.zip first. A minimal sketch of a single zero-shot query is given after the commands below.

# Call GPT model for GSM-checklist
python scripts/openai_model_inference.py --input_file gsm_checklist.json --model_name gpt-4o  --check_task all --check_question all --task_prompt zeroshot

# Call GPT model for GEO-checklist
python scripts/openai_model_inference.py --input_file geo_checklist.json --model_name gpt-4o  --check_task all --check_question all --task_prompt zeroshot

# [MODEL_NAME] can be: gpt-3.5-turbo, gpt-4o, gpt-4-turbo-2024-04-09, etc.
# [TASK_PROMPT] can be: fewshot, zeroshot

# Call LLaMa-3-8b-Instruct for GSM-checklist; modify the model name and the parameters to be evaluated in the .sh file
bash scripts/llama3-8b-instruct_inference.sh

# Call Phi-3V for GEO-checklist; first install the requirements by following https://github.com/modelscope/swift
python scripts/vl_model_inference_phi3.py --input_file geo_checklist.json --check_task all --check_question all --task_prompt zeroshot 
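
For reference, here is a minimal sketch of what one such zero-shot query can look like when calling the OpenAI API directly. The prompt wording and the "question" field name are illustrative assumptions and do not reproduce the exact prompts or data schema used by scripts/openai_model_inference.py.

# Minimal zero-shot sketch (illustrative; not the exact prompt or schema of
# scripts/openai_model_inference.py). Assumes gsm_checklist.json sits in the
# repository root and that each entry carries a hypothetical "question" field.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("gsm_checklist.json", encoding="utf-8") as f:
    checklist = json.load(f)

question = checklist[0].get("question", "")  # hypothetical field name

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a careful math reasoner. Think step by step."},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)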

Commands for Score Output

python scripts/results_evaluate.py --model_name gpt-4o --eval_data gsm_checklist --task_prompt zeroshot
# [MODEL_NAME] can be: gpt-3.5-turbo, gpt-4o, gpt-4-turbo-2024-04-09, etc.
# [TASK_PROMPT] can be: fewshot, zeroshot
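
The evaluation script aggregates accuracy over the checklist. As a rough illustration of the kind of cell-wise (task × robustness variant) aggregation involved, a sketch is given below; the record fields are assumptions, not the repository's actual prediction format.

# Illustrative cell-wise accuracy aggregation; the real prediction files written
# by the inference scripts may use a different layout and field names.
from collections import defaultdict

def aggregate(records):
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        cell = (r["task"], r["robustness"])   # hypothetical field names
        totals[cell] += 1
        hits[cell] += int(r["correct"])
    return {cell: hits[cell] / totals[cell] for cell in totals}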

Contact

Citation

@article{zhou2024modelreallygoodmath,
    title={Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist}, 
    author={Zihao Zhou and Shudong Liu and Maizhen Ning and Wei Liu and Jindong Wang and Derek F. Wong and Xiaowei Huang and Qiufeng Wang and Kaizhu Huang},
    year={2024},
    eprint={2407.08733}
}
