SymbolBench is a comprehensive multimodal benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to recognize, parse, and reason over discrete visual symbols across five domains: Language, Culture, Mathematics, Physics, and Chemistry.
```
SymbolBench/
├── evaluation/
│   ├── language/              # Language domain evaluation
│   │   ├── infer.py           # Inference via OpenAI-compatible API
│   │   ├── evaluate_llm.py    # LLM-as-Judge scoring
│   │   ├── evaluate_metric.py # Rule-based metric evaluation (F1 / EM / Edit Distance)
│   │   ├── infer.sh           # Example inference shell script
│   │   ├── extract_answer.py  # Example script to extract answers from responses
│   │   └── evaluate_llm.sh    # Example LLM evaluation shell script
│   │
│   ├── STEM/                  # Math / Physics / Chemistry domain evaluation
│   │   ├── baseline_test.py   # Local open-source model inference via vLLM
│   │   ├── infer_API.py       # API inference
│   │   ├── LLM_evaluate.py    # LLM-as-Judge scoring
│   │   ├── evaluate_metric.py # Rule-based \boxed{} exact-match evaluation
│   │   └── eval.sh            # Example vLLM inference shell script
│   │
│   └── emoji/
│       └── GPT_idiom.py       # GPT-4o few-shot inference for the emoji→idiom task
│
├── data/                      # Data on Hugging Face
│
├── figures/                   # Visualization assets
│   ├── introduction.png
│   ├── data_task_introduction.png
│   ├── data_case.png
│   ├── overall_performance.png
│   └── analysis.png
│
├── requirements.txt
├── environment.yml
└── README.md
```
SymbolBench spans five domains with multi-level difficulty (Levels 1–3) and multiple task types per domain.

| Domain | Task Description | Evaluation Metric |
|---|---|---|
| Language | Task 1: Unrecognizable character detection (mark with X) | Character-level F1 |
| | Task 2: Miswritten character detection (output diff list as JSON) | Token-pair F1 |
| | Task 3: Sentence correction (output the corrected full sentence) | Exact Match / Edit Distance |
| Chemistry | Identify atoms and counts from molecular structure images | Exact Match / LLM-Judge |
| Physics | Multiple-choice physics questions with diagrams (from MMMU) | Accuracy |
| Math | Symbolic math reasoning (answer in \boxed{}) | Exact Match / LLM-Judge |
| Culture | Infer Chinese/English idiom or word from emoji images | LLM-Judge / Accuracy |
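For intuition, the character-level F1 for Task 1 can be computed as a bag-of-characters score. This is an illustrative formulation only; the exact definition used in `evaluate_metric.py` may differ:

```python
from collections import Counter

def char_f1(pred: str, gold: str) -> float:
    """Character-level F1: harmonic mean of precision and recall over
    the multiset of characters shared by prediction and reference."""
    if not pred or not gold:
        return 0.0
    # Counter intersection keeps the minimum count of each shared character.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```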
The evaluation consists of two sequential stages:
```
┌─────────────────────┐
│   Dataset (JSON)    │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────────────────────┐
│  Stage 1: Inference                 │
│                                     │
│  Closed-source (GPT / Gemini)       │
│                                     │
│  Open-source (vLLM)                 │
└─────────┬───────────────────────────┘
          │ predictions.jsonl
          ▼
┌─────────────────────────────────────┐
│  Stage 2: Evaluation                │
│                                     │
│  Rule-based metrics:                │
│    evaluation/language/             │
│      evaluate_metric.py (F1/EM)     │
│    evaluation/STEM/                 │
│      evaluate_metric.py (EM)        │
│                                     │
│  LLM-as-Judge:                      │
│    evaluation/language/             │
│      evaluate_llm.py                │
│    evaluation/STEM/                 │
│      LLM_evaluate.py                │
└─────────┬───────────────────────────┘
          │ score.jsonl + metrics.json
          ▼
┌─────────────────────┐
│     Aggregated      │
│     Metrics by      │
│  level / task_type  │
└─────────────────────┘
```
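For the Math track, the rule-based Stage 2 extracts the `\boxed{}` answer from the model response and compares it with the reference. A minimal brace-aware sketch (illustrative only; the repo's `evaluate_metric.py` may apply additional normalization):

```python
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a response,
    tracking brace depth so nested braces (e.g. \\frac{1}{2}) survive."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces

def boxed_exact_match(pred: str, gold: str) -> bool:
    """Exact match after stripping surrounding whitespace."""
    extracted = extract_boxed(pred)
    return extracted is not None and extracted.strip() == gold.strip()
```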
```bash
# Clone the repository
git clone https://github.com/THUKElab/SymbolBench.git
cd SymbolBench

# Create and activate the conda environment (recommended)
conda env create -f environment.yml
conda activate symbol-bench

# Or install with pip
pip install -r requirements.txt
```

See Environment Configuration for details.
Set API credentials:

```bash
# For OpenAI-compatible APIs (inference + LLM-judge)
export OPENAI_API_KEY="sk-xxxxxxxxxxxx"
export OPENAI_API_BASE="https://your-proxy-endpoint/v1"  # optional, if using a proxy
```

Language inference via an OpenAI-compatible API:

```bash
cd scripts/language
python infer.py \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --output_jsonl results/predictions.jsonl \
    --model gpt-5-mini \
    --temperature 0.7
```

Or use the provided shell script (defaults to gemini-2.5-pro):
```bash
cd scripts/language
bash infer.sh
```

STEM inference with a local open-source model via vLLM:

```bash
cd scripts/STEM
python baseline_test.py \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --save_dir results/Qwen2.5-VL-7B \
    --model_name /path/to/Qwen2.5-VL-7B-Instruct \
    --clm_max_length 2048 \
    --eval_lang chn \
    --additional_stop_sequence "<|im_end|>"
```

Or use the provided shell script:
```bash
cd scripts/STEM
bash eval.sh
```

STEM inference via API:

```bash
cd scripts/STEM
python infer_API.py \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --output_jsonl results/gpt-5-mini/predictions.jsonl \
    --language zh \
    --model gpt-5-mini
```

Culture (emoji) inference:

```bash
cd scripts/culture
bash infer.sh  # runs all four sub-datasets with gpt-4o by default
```

Or run a single sub-dataset directly:
```bash
cd scripts/culture
python infer.py \
    --subset Chinese_idiom_4 \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --xlsx chengyu.xlsx \
    --output_jsonl results/Chinese_idiom_4/predictions.jsonl \
    --model gpt-5-mini \
    --shot 1
```

Language rule-based evaluation:

```bash
cd scripts/language
python evaluate_metric.py \
    --input results/predictions.jsonl \
    --output results/score.jsonl \
    --result results/metrics.json
```

Language LLM-as-Judge evaluation:

```bash
cd scripts/language
python evaluate_llm.py \
    --input results/predictions.jsonl \
    --score results/score.jsonl \
    --metrics results/metrics.json \
    --model gpt-5-mini
```

Or use the shell script:

```bash
cd scripts/language
bash evaluate_llm.sh
```

STEM rule-based evaluation:

```bash
cd scripts/STEM
python evaluate_metric.py \
    --input results/predictions.jsonl \
    --output results/score.jsonl \
    --result results/metrics.json
```

STEM LLM-as-Judge evaluation:

```bash
cd scripts/STEM
python LLM_evaluate.py \
    --input results/predictions.jsonl \
    --output results/llm_judge.jsonl \
    --result results/results.json \
    --model gpt-5-mini
```

Culture evaluation:

```bash
cd scripts/culture
bash evaluate.sh
```

One JSON object per line. Each object contains all original dataset fields plus:
| Field | Description |
|---|---|
| `prediction` | Model's raw text output |
| `model_output` | Full generation from vLLM (math only) |
Extends `predictions.jsonl` with:

| Field | Description |
|---|---|
| `correct` | `{"correct": 1\|0, "reason": "..."}` from the judge LLM |
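Schematically, the LLM-as-Judge scripts send each prediction to a judge model and parse a JSON verdict of this shape. A minimal sketch, assuming a hypothetical prompt template and a caller-supplied completion function (the real prompts and API calls live in `evaluate_llm.py` / `LLM_evaluate.py`):

```python
import json
from typing import Callable

# Hypothetical prompt template; the actual wording is defined in the repo's scripts.
JUDGE_PROMPT = (
    "You are a strict grader.\nQuestion: {q}\nReference answer: {ref}\n"
    "Model answer: {pred}\n"
    'Respond with JSON only: {{"correct": 1 or 0, "reason": "<short justification>"}}'
)

def judge_one(q: str, ref: str, pred: str, complete: Callable[[str], str]) -> dict:
    """Score one prediction with a judge LLM. `complete` wraps a single chat call,
    e.g. with the OpenAI client:
    lambda p: client.chat.completions.create(
        model="gpt-5-mini", messages=[{"role": "user", "content": p}]
    ).choices[0].message.content
    """
    raw = complete(JUDGE_PROMPT.format(q=q, ref=ref, pred=pred))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a conservative verdict when the judge output is unparseable.
        verdict = {"correct": 0, "reason": "unparseable judge output"}
    return verdict
```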
Aggregated statistics:

```json
{
  "total": 700,
  "evaluated": 698,
  "correct": 423,
  "accuracy": 0.605,
  "by_level": { "1": { "evaluated": 233, "correct": 155, "accuracy": 0.665 }, ... },
  "by_task_type": { "1": { ... }, "2": { ... }, "3": { ... } }
}
```

Our benchmark reveals significant cognitive mismatches in current MLLMs when processing discrete visual symbols, with performance gaps varying across domains and difficulty levels.
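A minimal sketch of how such aggregated statistics can be folded from per-example score rows, assuming `correct`, `level`, and `task_type` field names as in the output format above:

```python
from collections import defaultdict

def aggregate(rows: list) -> dict:
    """Fold per-example verdicts (score.jsonl rows) into the metrics.json shape."""
    def bucket(items):
        n = len(items)
        c = sum(r["correct"] for r in items)
        return {"evaluated": n, "correct": c,
                "accuracy": round(c / n, 3) if n else 0.0}

    scored = [r for r in rows if "correct" in r]  # rows the judge actually scored
    by_level, by_task = defaultdict(list), defaultdict(list)
    for r in scored:
        by_level[str(r.get("level"))].append(r)
        by_task[str(r.get("task_type"))].append(r)
    top = bucket(scored)
    return {"total": len(rows), **top,
            "by_level": {k: bucket(v) for k, v in sorted(by_level.items())},
            "by_task_type": {k: bucket(v) for k, v in sorted(by_task.items())}}
```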
See requirements.txt and environment.yml for the full dependency list.
| Package | Version | Purpose |
|---|---|---|
| `openai` | ≥1.30 | OpenAI API client |
| `vllm` | ≥0.4 | Local open-source model inference |
| `torch` | ≥2.1 | PyTorch (required by vLLM) |
| `transformers` | ≥4.40 | HuggingFace tokenizer support |
| `tqdm` | ≥4.60 | Progress bars |
| `requests` | ≥2.28 | HTTP requests for direct API calls |
| `openpyxl` | ≥3.1 | Reading .xlsx emoji data files |
If you use SymbolBench in your research, please cite:
```bibtex
@article{li2026cognitive,
  title={Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding},
  author={Li, Yinghui and Kuang, Jiayi and Xing, Peng and Liu, Daixian and Dong, Junnan and Guo, Shu-Yu and Li, Yangning and Zhou, Qingyu and Jiang, Wenhao and Zheng, Hai-Tao and others},
  journal={arXiv preprint arXiv:2603.18472},
  year={2026}
}
```

This project is released under the MIT License.




