
Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding


A comprehensive multimodal benchmark designed to evaluate the capability of Multimodal Large Language Models to recognize, parse, and reason over discrete visual symbols across five domains: Language, Culture, Mathematics, Physics, and Chemistry.

SymbolBench Overview

Figure 1: Overview of SymbolBench benchmark covering five domains with discrete visual symbols.


Repository Structure

SymbolBench/
├── evaluation/
│   ├── language/                   # Language domain evaluation
│   │   ├── infer.py                # Inference via OpenAI-compatible API
│   │   ├── evaluate_llm.py         # LLM-as-Judge scoring
│   │   ├── evaluate_metric.py      # Rule-based metric evaluation (F1 / EM / Edit Distance)
│   │   ├── infer.sh                # Example inference shell script
│   │   ├── extract_answer.py       # Example script to extract answers from responses
│   │   └── evaluate_llm.sh         # Example LLM evaluation shell script
│   │
│   ├── STEM/                       # Math / Physics / Chemistry domain evaluation
│   │   ├── baseline_test.py        # Local open-source model inference via vLLM
│   │   ├── infer_API.py            # API inference
│   │   ├── LLM_evaluate.py         # LLM-as-Judge scoring
│   │   ├── evaluate_metric.py      # Rule-based \boxed{} exact-match evaluation
│   │   └── eval.sh                 # Example vLLM inference shell script
│   │
│   └── emoji/
│       └── GPT_idiom.py            # GPT-4o few-shot inference for emoji→idiom task
│
├── data/                           # Data hosted on Hugging Face
│
├── figures/                        # Visualization assets
│   ├── introduction.png
│   ├── data_task_introduction.png
│   ├── data_case.png
│   ├── overall_performance.png
│   └── analysis.png
│
├── requirements.txt
├── environment.yml
└── README.md

Benchmark Overview

SymbolBench spans five domains with multi-level difficulty (Levels 1–3) and multiple task types per domain.

Task Types and Examples

Figure 2: Task types and representative examples across five domains.

Task Summary

| Domain | Task Description | Evaluation Metric |
| --- | --- | --- |
| Language | Task 1: Unrecognizable character detection (mark with X) | Character-level F1 |
| Language | Task 2: Miswritten character detection (output diff list as JSON) | Token-pair F1 |
| Language | Task 3: Sentence correction (output corrected full sentence) | Exact Match / Edit Distance |
| Chemistry | Identify atoms and counts from molecular structure images | Exact Match / LLM-Judge |
| Physics | Multiple-choice physics questions with diagrams (from MMMU) | Accuracy |
| Math | Symbolic math reasoning (answer in \boxed{}) | Exact Match / LLM-Judge |
| Culture | Infer Chinese/English idiom or word from emoji images | LLM-Judge / Accuracy |
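As a concrete reading of the Language metrics above, here is a minimal sketch of character-level F1 over sets of marked character positions. The position-set formulation is an illustrative assumption, not necessarily the exact implementation in `evaluate_metric.py`.

```python
def char_f1(pred_positions, gold_positions):
    """Character-level F1: compare the set of character positions the
    model marked (e.g. with X) against the gold positions."""
    pred, gold = set(pred_positions), set(gold_positions)
    if not pred or not gold:
        # Both empty: perfect agreement; only one empty: no overlap possible.
        return 1.0 if pred == gold else 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Token-pair F1 for Task 2 follows the same formula, with (original, corrected) token pairs in place of positions.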
Data Cases

Figure 3: Detailed examples from each domain showing the diversity of visual symbols.


Evaluation Pipeline

The evaluation consists of two sequential stages:

┌─────────────────────┐
│    Dataset (JSON)   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────────────────────┐
│         Stage 1: Inference          │
│                                     │
│  Closed-source (GPT / Gemini)       │
│  Open-source (vLLM)                 │
└──────────┬──────────────────────────┘
           │  predictions.jsonl
           ▼
┌─────────────────────────────────────┐
│         Stage 2: Evaluation         │
│                                     │
│  Rule-based metrics:                │
│    evaluation/language/             │
│        evaluate_metric.py  (F1/EM)  │
│    evaluation/STEM/                 │
│        evaluate_metric.py  (EM)     │
│                                     │
│  LLM-as-Judge:                      │
│    evaluation/language/             │
│        evaluate_llm.py              │
│    evaluation/STEM/                 │
│        LLM_evaluate.py              │
└──────────┬──────────────────────────┘
           │  score.jsonl + metrics.json
           ▼
┌─────────────────────┐
│  Aggregated metrics │
│ by level / task_type│
└─────────────────────┘

Quick Start

1. Environment Setup

# Clone the repository
git clone https://github.com/THUKElab/SymbolBench.git
cd SymbolBench

# Create and activate conda environment (recommended)
conda env create -f environment.yml
conda activate symbol-bench

# Or install with pip
pip install -r requirements.txt

See Environment Configuration for details.

2. Prepare API Keys

# For OpenAI-compatible APIs (inference + LLM-judge)
export OPENAI_API_KEY="sk-xxxxxxxxxxxx"
export OPENAI_API_BASE="https://your-proxy-endpoint/v1"   # optional, if using a proxy
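For reference, the scripts talk to a standard OpenAI-compatible `/chat/completions` endpoint, with each image inlined as a base64 data URL. Below is a stdlib-only sketch of the request shape; `build_payload` and `post_chat` are hypothetical helpers for illustration, not functions from this repo.

```python
import base64
import json
import os
import urllib.request

def build_payload(model: str, question: str, image_path: str) -> dict:
    """Build an OpenAI-compatible chat payload with the image inlined
    as a base64 data URL, the format /v1/chat/completions accepts
    for image inputs."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

def post_chat(payload: dict) -> dict:
    """POST the payload using the environment variables set above."""
    base = os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1")
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Pointing `OPENAI_API_BASE` at a different host is all that is needed to target a proxy or a locally served open-source model.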

3. Run Inference

Language Domain

cd evaluation/language
python infer.py \
  --input_json  DATAPATH \
  --images_dir  IMAGEPATH \
  --output_jsonl results/predictions.jsonl \
  --model gpt-5-mini \
  --temperature 0.7

Or use the provided shell script (defaults to gemini-2.5-pro):

cd evaluation/language
bash infer.sh

STEM Domain (Math / Physics / Chemistry) — open-source model via vLLM

cd evaluation/STEM
python baseline_test.py \
  --input_json  DATAPATH \
  --images_dir  IMAGEPATH \
  --save_dir    results/Qwen2.5-VL-7B \
  --model_name  /path/to/Qwen2.5-VL-7B-Instruct \
  --clm_max_length 2048 \
  --eval_lang   chn \
  --additional_stop_sequence "<|im_end|>"

Or use the provided shell script:

cd evaluation/STEM
bash eval.sh

STEM Domain — GPT / closed-source API

cd evaluation/STEM
python infer_API.py \
  --input_json  DATAPATH \
  --images_dir  IMAGEPATH \
  --output_jsonl results/gpt-5-mini/predictions.jsonl \
  --language zh \
  --model gpt-5-mini

Culture (Emoji) Domain

cd evaluation/emoji
bash infer.sh   # runs all four sub-datasets with gpt-4o by default

Or run a single sub-dataset directly:

cd evaluation/emoji
python infer.py \
  --subset       Chinese_idiom_4 \
  --input_json  DATAPATH \
  --images_dir  IMAGEPATH \
  --xlsx         chengyu.xlsx \
  --output_jsonl results/Chinese_idiom_4/predictions.jsonl \
  --model        gpt-5-mini \
  --shot         1

4. Run Evaluation

Language – Rule-based Metrics

cd evaluation/language
python evaluate_metric.py \
  --input  results/predictions.jsonl \
  --output results/score.jsonl \
  --result results/metrics.json
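For Task 3, the Edit Distance column is character-level Levenshtein distance between the corrected sentence and the reference. A self-contained sketch of that computation (illustrative; the repo's `evaluate_metric.py` may normalize text first):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance after deleting i chars of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```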

Language – LLM-as-Judge

cd evaluation/language
python evaluate_llm.py \
  --input   results/predictions.jsonl \
  --score   results/score.jsonl \
  --metrics results/metrics.json \
  --model   gpt-5-mini

Or use the shell script:

cd evaluation/language
bash evaluate_llm.sh

STEM – Rule-based Exact Match

cd evaluation/STEM
python evaluate_metric.py \
  --input  results/predictions.jsonl \
  --output results/score.jsonl \
  --result results/metrics.json
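The rule-based STEM metric hinges on pulling the answer out of \boxed{}. Here is a sketch that tracks nested braces so LaTeX like \boxed{\frac{1}{2}} survives intact; it illustrates the idea rather than reproducing the repo's exact code.

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in `text`,
    tracking nested braces, or None if no box is found."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = begin = start + len(r"\boxed{")
    depth = 1
    while i < len(text) and depth:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[begin:i - 1] if depth == 0 else None

def exact_match(pred_text: str, gold: str) -> bool:
    """Compare the extracted boxed answer to gold after squeezing
    whitespace; a missing box counts as wrong."""
    ans = extract_boxed(pred_text)
    if ans is None:
        return False
    squeeze = lambda s: "".join(s.split())
    return squeeze(ans) == squeeze(gold)
```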

STEM – LLM-as-Judge

cd evaluation/STEM
python LLM_evaluate.py \
  --input  results/predictions.jsonl \
  --output results/llm_judge.jsonl \
  --result results/results.json \
  --model  gpt-5-mini
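The judge model is expected to return a verdict object (see score.jsonl under Output Format). Since LLMs often wrap JSON in prose or a markdown fence, a defensive parsing step is useful; the parser below is a hypothetical sketch, not `LLM_evaluate.py` itself.

```python
import json
import re

def parse_judge(raw: str) -> dict:
    """Pull a {"correct": 0|1, "reason": ...} object out of a judge
    reply that may surround the JSON with prose or a code fence."""
    m = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not m:
        return {"correct": 0, "reason": "unparseable judge output"}
    try:
        obj = json.loads(m.group(0))
    except json.JSONDecodeError:
        return {"correct": 0, "reason": "unparseable judge output"}
    obj["correct"] = int(obj.get("correct", 0))  # normalize to 0/1 int
    return obj
```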

Culture – Evaluation

cd evaluation/emoji
bash evaluate.sh

Output Format

predictions.jsonl

One JSON object per line. Each object contains all original dataset fields plus:

| Field | Description |
| --- | --- |
| prediction | Model's raw text output |
| model_output | Full generation from vLLM (math only) |

score.jsonl (LLM-as-Judge)

Extends predictions.jsonl with:

| Field | Description |
| --- | --- |
| correct | `{"correct": 1\|0, "reason": "..."}` from the judge LLM |

metrics.json

Aggregated statistics:

{
  "total": 700,
  "evaluated": 698,
  "correct": 423,
  "accuracy": 0.605,
  "by_level": { "1": { "evaluated": 233, "correct": 155, "accuracy": 0.665 }, ... },
  "by_task_type": { "1": { ... }, "2": { ... }, "3": { ... } }
}
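The aggregation itself is a straightforward roll-up. The sketch below mirrors the layout above, assuming each scored record carries an integer `correct` flag and a `level` field; the real score.jsonl stores the judge's full verdict object, so a small unwrapping step may be needed first.

```python
from collections import defaultdict

def aggregate(records):
    """Roll per-example scores up into overall and per-level accuracy,
    mirroring the metrics.json layout."""
    overall = {"total": len(records), "evaluated": 0, "correct": 0}
    by_level = defaultdict(lambda: {"evaluated": 0, "correct": 0})
    for r in records:
        if "correct" not in r:
            continue  # skipped / unevaluated example
        c = int(r["correct"])
        overall["evaluated"] += 1
        overall["correct"] += c
        bucket = by_level[str(r.get("level", "unknown"))]
        bucket["evaluated"] += 1
        bucket["correct"] += c
    for b in [overall, *by_level.values()]:
        b["accuracy"] = (round(b["correct"] / b["evaluated"], 3)
                         if b["evaluated"] else 0.0)
    return {**overall, "by_level": dict(by_level)}
```

A `by_task_type` breakdown would follow the same pattern keyed on the record's task-type field.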

Evaluation Performance

Overall Performance

Figure 4: Performance comparison of mainstream VLMs on SymbolBench.

Key Findings

Analysis and Insights

Figure 5: Cognitive mismatch analysis across different symbol types and model architectures.

Our benchmark reveals significant cognitive mismatches in current VLMs when processing discrete visual symbols, with performance gaps varying across domains and difficulty levels.


Environment Configuration

See requirements.txt and environment.yml for the full dependency list.

Key Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| openai | ≥1.30 | OpenAI API client |
| vllm | ≥0.4 | Local open-source model inference |
| torch | ≥2.1 | PyTorch (required by vLLM) |
| transformers | ≥4.40 | Hugging Face tokenizer support |
| tqdm | ≥4.60 | Progress bars |
| requests | ≥2.28 | HTTP requests for direct API calls |
| openpyxl | ≥3.1 | Read .xlsx emoji data files |

Citation

If you use SymbolBench in your research, please cite:

@article{li2026cognitive,
  title={Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding},
  author={Li, Yinghui and Kuang, Jiayi and Xing, Peng and Liu, Daixian and Dong, Junnan and Guo, Shu-Yu and Li, Yangning and Zhou, Qingyu and Jiang, Wenhao and Zheng, Hai-Tao and others},
  journal={arXiv preprint arXiv:2603.18472},
  year={2026}
}

License

This project is released under the MIT License.
