
# GeoEval


This is the repository for geometry problem-solving method evaluation.

Code for the Paper "GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving".

## Overview

This project constructs comprehensive datasets for geometry problem solving and provides a thorough evaluation of current large language models (LLMs) on them, with the aim of advancing research on solving geometry problems.

## Dataset Download

The open version of our dataset will be released soon.

## About GeoEval

The GeoEval benchmark is specifically designed to assess the ability of models to solve geometric math problems. It features five characteristics: Comprehensive Variety, Varied Problems, Dual Inputs, Diverse Challenges, and Complexity Ratings.

To put these characteristics in context, we also provide a comparative analysis of GeoEval against earlier datasets.

### GeoEval Benchmark Features

- **Comprehensive Variety**: The benchmark covers a wide range of geometric topics, providing a comprehensive test for models.
- **Varied Problems**: Problems in the benchmark are varied, testing the model's ability to handle different types of geometric problems.
- **Dual Inputs**: The benchmark includes both text and diagram inputs, testing the model's ability to process and integrate information from different sources.
- **Diverse Challenges**: The benchmark poses diverse challenges, testing the model's ability to handle complex and varied geometric problems.
- **Complexity Ratings**: The benchmark includes problems of different complexity levels, allowing for a nuanced assessment of the model's capabilities.


## Table of Contents
1. Data Preparation
2. Model Evaluation

## Model Evaluation

Use the scripts under `sh_files/{model_name}` to run model inference, extract answers from the responses, and compute evaluation metrics for each model.

```bash
# Run model inference on the benchmark
bash sh_files/{model_name}/evaluate_general.sh
# Extract answers from the raw model responses
bash sh_files/{model_name}/ext_all.sh
# Compute scores for the GeoEval subsets
bash sh_files/{model_name}/caculate_score.sh
bash sh_files/{model_name}/caculate_aug_score.sh
bash sh_files/{model_name}/caculate_back_score.sh
bash sh_files/{model_name}/caculate_solid_score.sh
```
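
As a minimal usage sketch, substitute `{model_name}` with one of the model directories that actually exists under `sh_files/`; the `gpt-3.5` name below is only an illustrative assumption, not necessarily a real directory:

```bash
# Hypothetical end-to-end run; "gpt-3.5" is an assumed directory name,
# replace it with an actual folder found under sh_files/.
MODEL=gpt-3.5
bash sh_files/${MODEL}/evaluate_general.sh   # inference
bash sh_files/${MODEL}/ext_all.sh            # answer extraction
bash sh_files/${MODEL}/caculate_score.sh     # metric calculation
```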

## 🏆 Leaderboard 🏆

| Model | GeoEval-2000 (A/T %) | GeoEval-backward (A %) | GeoEval-aug (A %) | GeoEval-hard (A %) |
|---|---|---|---|---|
| CodeGen2-16B $\lozenge$ | 28.76 / 22.06 | 5.10 | 8.50 | 5.66 |
| GPT-3.5 $\lozenge$ | 24.71 / 21.27 | 22.66 | 41.25 | 22.33 |
| GPT-4 $\lozenge$ | 27.95 / 43.86 | 26.00 | 45.75 | 10.10 |
| WizardMath-70B $\lozenge$ | 55.67 / 34.20 | 28.66 | 37.75 | 6.00 |
| WizardMath-7B-V1.1 $\lozenge$ | 54.78 / 32.76 | 32.66 | 47.75 | 6.00 |
| llava-7B-V1.5 | 12.80 / 21.01 | 11.33 | 20.25 | 20.30 |
| Qwen-VL | 25.60 / 25.97 | 5.66 | 22.25 | 21.66 |
| mPLUG-Owl2 | 37.76 / n/a | 35.33 | 38.00 | 22.66 |
| InstructBLIP $\dagger$ | 52.18 / n/a | 15.66 | 35.00 | 70.30 |
| GPT-4V | 37.22 / 43.86 $\ddagger$ | 26.00 | 45.75 | 10.10 |
