A²RBench: Automated Paradigm for Formally Verifiable Abstract Reasoning

This repository contains the official implementation of the pipeline described in the paper: "A²RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation".

A²RBench is an automated framework that leverages LLMs to generate, verify, expand, and evaluate abstract reasoning tasks. It ensures logical soundness through Cycle Consistency Check ($g(f(x)) = x$) and code-based verification.

🌟 Task Visualizations

A²RBench generates tasks across 1D, 2D, and 3D dimensions. Below are examples of the symbolic transformation rules:

1D: Sequence Interleaving	2D: Block-wise Diagonal Swap	3D: Coordinate Mapping

Rule: Interleaving two halves.	Rule: Local 2x2 diagonal swaps.	Rule: Multi-axis cyclic shifts.

📂 Repository Structure

.
├── llm_client.py           # Unified API wrapper for various LLM providers (OpenAI, Gemini, etc.)
├── question.py             # Stage 1: Seed Generation (Rules & Code) via Cycle Consistency
├── expand_questions.py     # Stage 2: Task Expansion (Variations V0-V9)
├── answer.py               # Stage 3: Solver Evaluation (P0 & P1 perturbations)
├── analysis.py             # Stage 4: Global Performance & Symbolic Dependency Analysis
├── analyze_complexity.py   # Analysis: Code Complexity (AST) & Author Fingerprinting
├── classify_failures.py    # Analysis: Cognitive Failure Classification (CoT Diagnosis)
└── analysis_expand.py      # Analysis: Augmentation Paradox & Entropy Analysis

🛠️ Installation

Clone the repository:
Install dependencies:

🚀 Usage Pipeline

The pipeline consists of four sequential stages.

Stage 1: Seed Generation

Generates valid Python-based abstract reasoning rules (Forward $f$ and Inverse $g$) and verifies them via Cycle Consistency. Supports 1D, 2D, and 3D tasks.

python question.py

Outputs: questions_arc_text_v9/validated_arc_text_questions.jsonl

Stage 2: Task Expansion

Augments the seed tasks into 9 distinct variations (Standard, Edge Case, Adversarial) to test robustness.

python expand_questions.py

Outputs: questions_arc_text_v9/expanded_arc_text_questions.jsonl

Stage 3: Solver Evaluation

Evaluates various "Solver" LLMs on the generated tasks. This script handles Symbolic Dependency Testing by creating symbol-remapped versions (P1) of the tasks.

python answer.py

Outputs: results_arc_text_v9/results_raw.jsonl

Stage 4: Analysis & Metrics

Global Leaderboard: Generate accuracy tables and symbolic dependency gaps.
```
python analysis.py
```
Code Complexity: Analyze AST of generated rules (loop depth, conditional complexity).
```
python analyze_complexity.py
```
Cognitive Failure Diagnosis: Diagnose why a model failed (e.g., Abstraction vs. Reasoning).
```
python classify_failures.py
```

📊 Data Format

Each generated task includes executable Python code, a natural language rule description, and the puzzle data.

{
  "question_id": "SYM_O4_D3_001_V0",
  "task_type": "SymbolicRule",
  "dimensionality": 3,
  "rule_description": "Forward transformation (Encoder): ...",
  "python_code": "def transform_grid(grid): ...",
  "puzzle_data": {
    "examples": [{"input": [...], "output": [...]}],
    "question_plaintext": [...],
    "answer_ciphertext": [...]
  }
}

⚖️ License

This project is licensed under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A²RBench: Automated Paradigm for Formally Verifiable Abstract Reasoning

🌟 Task Visualizations

📂 Repository Structure

🛠️ Installation

🚀 Usage Pipeline

Stage 1: Seed Generation

Stage 2: Task Expansion

Stage 3: Solver Evaluation

Stage 4: Analysis & Metrics

📊 Data Format

⚖️ License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
questions_arc_text_v9		questions_arc_text_v9
1d.png		1d.png
2d.png		2d.png
3d.png		3d.png
analysis.py		analysis.py
analysis_expand.py		analysis_expand.py
analyze_complexity.py		analyze_complexity.py
answer.py		answer.py
classify_failures.py		classify_failures.py
expand_questions.py		expand_questions.py
llm_client.py		llm_client.py
question.py		question.py
readme.md		readme.md
seed_rule_clusters_pca_by_author.png		seed_rule_clusters_pca_by_author.png

Folders and files

Latest commit

History

Repository files navigation

A²RBench: Automated Paradigm for Formally Verifiable Abstract Reasoning

🌟 Task Visualizations

📂 Repository Structure

🛠️ Installation

🚀 Usage Pipeline

Stage 1: Seed Generation

Stage 2: Task Expansion

Stage 3: Solver Evaluation

Stage 4: Analysis & Metrics

📊 Data Format

⚖️ License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages