SymbolBench is a comprehensive multimodal benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to recognize, parse, and reason over discrete visual symbols across five domains: Language, Culture, Mathematics, Physics, and Chemistry.
```
SymbolBench/
├── evaluation/
│   ├── language/              # Language domain evaluation
│   │   ├── infer.py           # Inference via OpenAI-compatible API
│   │   ├── evaluate_llm.py    # LLM-as-Judge scoring
│   │   ├── evaluate_metric.py # Rule-based metric evaluation (F1 / EM / Edit Distance)
│   │   ├── infer.sh           # Example inference shell script
│   │   ├── extract_answer.py  # Example script to extract answers from responses
│   │   └── evaluate_llm.sh    # Example LLM evaluation shell script
│   │
│   ├── STEM/                  # Math / Physics / Chemistry domain evaluation
│   │   ├── baseline_test.py   # Local open-source model inference via vLLM
│   │   ├── infer_API.py       # API inference
│   │   ├── LLM_evaluate.py    # LLM-as-Judge scoring
│   │   ├── evaluate_metric.py # Rule-based \boxed{} exact-match evaluation
│   │   └── eval.sh            # Example vLLM inference shell script
│   │
│   └── emoji/
│       └── GPT_idiom.py       # GPT-4o few-shot inference for the emoji→idiom task
│
├── data/                      # Data on Hugging Face
│
├── figures/                   # Visualization assets
│   ├── introduction.png
│   ├── data_task_introduction.png
│   ├── data_case.png
│   ├── overall_performance.png
│   └── analysis.png
│
├── requirements.txt
├── environment.yml
└── README.md
```
SymbolBench spans five domains with multi-level difficulty (Levels 1–3) and multiple task types per domain.

| Domain | Task Description | Evaluation Metric |
|---|---|---|
| Language | Task 1: Unrecognizable character detection (mark with X) | Character-level F1 |
| | Task 2: Miswritten character detection (output diff list as JSON) | Token-pair F1 |
| | Task 3: Sentence correction (output the corrected full sentence) | Exact Match / Edit Distance |
| Chemistry | Identify atoms and counts from molecular structure images | Exact Match / LLM-Judge |
| Physics | Multiple-choice physics questions with diagrams (from MMMU) | Accuracy |
| Math | Symbolic math reasoning (answer in \boxed{}) | Exact Match / LLM-Judge |
| Culture | Infer Chinese/English idiom or word from emoji images | LLM-Judge / Accuracy |
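For intuition, the character-level F1 for Task 1 can be computed as a bag-of-characters score. This is an illustrative formulation only; the exact definition used in `evaluate_metric.py` may differ:

```python
from collections import Counter

def char_f1(pred: str, gold: str) -> float:
    """Character-level F1: harmonic mean of precision and recall over
    the multiset of characters shared by prediction and reference."""
    if not pred or not gold:
        return 0.0
    # Counter intersection keeps the minimum count of each shared character.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```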
The evaluation consists of two sequential stages:
```
┌─────────────────────┐
│   Dataset (JSON)    │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────────────────────┐
│  Stage 1: Inference                 │
│                                     │
│  Closed-source (GPT / Gemini)       │
│                                     │
│  Open-source (vLLM)                 │
└─────────┬───────────────────────────┘
          │ predictions.jsonl
          ▼
┌─────────────────────────────────────┐
│  Stage 2: Evaluation                │
│                                     │
│  Rule-based metrics:                │
│    evaluation/language/             │
│      evaluate_metric.py (F1/EM)     │
│    evaluation/STEM/                 │
│      evaluate_metric.py (EM)        │
│                                     │
│  LLM-as-Judge:                      │
│    evaluation/language/             │
│      evaluate_llm.py                │
│    evaluation/STEM/                 │
│      LLM_evaluate.py                │
└─────────┬───────────────────────────┘
          │ score.jsonl + metrics.json
          ▼
┌─────────────────────┐
│     Aggregated      │
│     Metrics by      │
│  level / task_type  │
└─────────────────────┘
```
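For the Math track, the rule-based Stage 2 extracts the `\boxed{}` answer from the model response and compares it with the reference. A minimal brace-aware sketch (illustrative only; the repo's `evaluate_metric.py` may apply additional normalization):

```python
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a response,
    tracking brace depth so nested braces (e.g. \\frac{1}{2}) survive."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces

def boxed_exact_match(pred: str, gold: str) -> bool:
    """Exact match after stripping surrounding whitespace."""
    extracted = extract_boxed(pred)
    return extracted is not None and extracted.strip() == gold.strip()
```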
```bash
# Clone the repository
git clone https://github.com/THUKElab/SymbolBench.git
cd SymbolBench

# Create and activate the conda environment (recommended)
conda env create -f environment.yml
conda activate symbol-bench

# Or install with pip
pip install -r requirements.txt
```

See Environment Configuration for details.
Set API credentials:

```bash
# For OpenAI-compatible APIs (inference + LLM-judge)
export OPENAI_API_KEY="sk-xxxxxxxxxxxx"
export OPENAI_API_BASE="https://your-proxy-endpoint/v1"  # optional, if using a proxy
```

Language inference via an OpenAI-compatible API:

```bash
cd scripts/language
python infer.py \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --output_jsonl results/predictions.jsonl \
    --model gpt-5-mini \
    --temperature 0.7
```

Or use the provided shell script (defaults to gemini-2.5-pro):
```bash
cd scripts/language
bash infer.sh
```

STEM inference with a local open-source model via vLLM:

```bash
cd scripts/STEM
python baseline_test.py \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --save_dir results/Qwen2.5-VL-7B \
    --model_name /path/to/Qwen2.5-VL-7B-Instruct \
    --clm_max_length 2048 \
    --eval_lang chn \
    --additional_stop_sequence "<|im_end|>"
```

Or use the provided shell script:
```bash
cd scripts/STEM
bash eval.sh
```

STEM inference via API:

```bash
cd scripts/STEM
python infer_API.py \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --output_jsonl results/gpt-5-mini/predictions.jsonl \
    --language zh \
    --model gpt-5-mini
```

Culture (emoji) inference:

```bash
cd scripts/culture
bash infer.sh  # runs all four sub-datasets with gpt-4o by default
```

Or run a single sub-dataset directly:
```bash
cd scripts/culture
python infer.py \
    --subset Chinese_idiom_4 \
    --input_json DATAPATH \
    --images_dir IMAGEPATH \
    --xlsx chengyu.xlsx \
    --output_jsonl results/Chinese_idiom_4/predictions.jsonl \
    --model gpt-5-mini \
    --shot 1
```

Language rule-based evaluation:

```bash
cd scripts/language
python evaluate_metric.py \
    --input results/predictions.jsonl \
    --output results/score.jsonl \
    --result results/metrics.json
```

Language LLM-as-Judge evaluation:

```bash
cd scripts/language
python evaluate_llm.py \
    --input results/predictions.jsonl \
    --score results/score.jsonl \
    --metrics results/metrics.json \
    --model gpt-5-mini
```

Or use the shell script:

```bash
cd scripts/language
bash evaluate_llm.sh
```

STEM rule-based evaluation:

```bash
cd scripts/STEM
python evaluate_metric.py \
    --input results/predictions.jsonl \
    --output results/score.jsonl \
    --result results/metrics.json
```

STEM LLM-as-Judge evaluation:

```bash
cd scripts/STEM
python LLM_evaluate.py \
    --input results/predictions.jsonl \
    --output results/llm_judge.jsonl \
    --result results/results.json \
    --model gpt-5-mini
```

Culture evaluation:

```bash
cd scripts/culture
bash evaluate.sh
```

One JSON object per line. Each object contains all original dataset fields plus:
| Field | Description |
|---|---|
| `prediction` | Model's raw text output |
| `model_output` | Full generation from vLLM (math only) |
Extends `predictions.jsonl` with:

| Field | Description |
|---|---|
| `correct` | `{"correct": 1\|0, "reason": "..."}` from the judge LLM |
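Schematically, the LLM-as-Judge scripts send each prediction to a judge model and parse a JSON verdict of this shape. A minimal sketch, assuming a hypothetical prompt template and a caller-supplied completion function (the real prompts and API calls live in `evaluate_llm.py` / `LLM_evaluate.py`):

```python
import json
from typing import Callable

# Hypothetical prompt template; the actual wording is defined in the repo's scripts.
JUDGE_PROMPT = (
    "You are a strict grader.\nQuestion: {q}\nReference answer: {ref}\n"
    "Model answer: {pred}\n"
    'Respond with JSON only: {{"correct": 1 or 0, "reason": "<short justification>"}}'
)

def judge_one(q: str, ref: str, pred: str, complete: Callable[[str], str]) -> dict:
    """Score one prediction with a judge LLM. `complete` wraps a single chat call,
    e.g. with the OpenAI client:
    lambda p: client.chat.completions.create(
        model="gpt-5-mini", messages=[{"role": "user", "content": p}]
    ).choices[0].message.content
    """
    raw = complete(JUDGE_PROMPT.format(q=q, ref=ref, pred=pred))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a conservative verdict when the judge output is unparseable.
        verdict = {"correct": 0, "reason": "unparseable judge output"}
    return verdict
```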
Aggregated statistics:

```json
{
  "total": 700,
  "evaluated": 698,
  "correct": 423,
  "accuracy": 0.605,
  "by_level": { "1": { "evaluated": 233, "correct": 155, "accuracy": 0.665 }, ... },
  "by_task_type": { "1": { ... }, "2": { ... }, "3": { ... } }
}
```

Our benchmark reveals significant cognitive mismatches in current MLLMs when processing discrete visual symbols, with performance gaps varying across domains and difficulty levels.
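A minimal sketch of how such aggregated statistics can be folded from per-example score rows, assuming `correct`, `level`, and `task_type` field names as in the output format above:

```python
from collections import defaultdict

def aggregate(rows: list) -> dict:
    """Fold per-example verdicts (score.jsonl rows) into the metrics.json shape."""
    def bucket(items):
        n = len(items)
        c = sum(r["correct"] for r in items)
        return {"evaluated": n, "correct": c,
                "accuracy": round(c / n, 3) if n else 0.0}

    scored = [r for r in rows if "correct" in r]  # rows the judge actually scored
    by_level, by_task = defaultdict(list), defaultdict(list)
    for r in scored:
        by_level[str(r.get("level"))].append(r)
        by_task[str(r.get("task_type"))].append(r)
    top = bucket(scored)
    return {"total": len(rows), **top,
            "by_level": {k: bucket(v) for k, v in sorted(by_level.items())},
            "by_task_type": {k: bucket(v) for k, v in sorted(by_task.items())}}
```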
See requirements.txt and environment.yml for the full dependency list.
| Package | Version | Purpose |
|---|---|---|
| `openai` | ≥1.30 | OpenAI API client |
| `vllm` | ≥0.4 | Local open-source model inference |
| `torch` | ≥2.1 | PyTorch (required by vLLM) |
| `transformers` | ≥4.40 | HuggingFace tokenizer support |
| `tqdm` | ≥4.60 | Progress bars |
| `requests` | ≥2.28 | HTTP requests for direct API calls |
| `openpyxl` | ≥3.1 | Reading .xlsx emoji data files |
If you use SymbolBench in your research, please cite:
```bibtex
@article{li2026cognitive,
  title={Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding},
  author={Li, Yinghui and Kuang, Jiayi and Xing, Peng and Liu, Daixian and Dong, Junnan and Guo, Shu-Yu and Li, Yangning and Zhou, Qingyu and Jiang, Wenhao and Zheng, Hai-Tao and others},
  journal={arXiv preprint arXiv:2603.18472},
  year={2026}
}
```

This project is released under the MIT License.




