A complete LLM evaluation solution for the nuclear fusion domain, comprising the professional FusionBench dataset and an accompanying evaluation framework. It is used to systematically evaluate and compare large language models' knowledge mastery and reasoning capabilities on nuclear fusion science problems.
- Complete Solution: Provides both FusionBench dataset and professional evaluation framework for one-stop nuclear fusion LLM evaluation
- High-Quality Dataset: Carefully constructed nuclear fusion domain benchmark dataset covering core scientific concepts and latest research
- Comprehensive Model Support: Supports mainstream LLM services including DeepSeek, Ollama, SiliconFlow, Gemini, Zhizengzeng, etc.
- Rich Question Types: Supports true/false, multiple choice, and fill-in questions to comprehensively evaluate models' understanding and reasoning capabilities
- Smart Cache Optimization: Content hash-based caching mechanism that significantly improves testing efficiency and reduces API costs
- Deep Performance Analysis: Provides detailed accuracy statistics, domain-level breakdowns, and error-pattern analysis reports
- Flexible Extensibility: Supports multiple dataset formats, making it easy to integrate new nuclear fusion research questions
- Professional CLI Tools: Rich parameter configuration supporting batch testing, conditional filtering, result export, and other advanced features
FusionBench is a high-quality benchmark dataset specifically constructed for the nuclear fusion domain, including:
- Comprehensive Coverage: Covers core nuclear fusion areas including plasma physics, magnetic confinement fusion, laser inertial confinement, etc.
- Question Diversity: Includes true/false, multiple choice, and fill-in questions to test different levels of understanding capabilities
- Academic Foundation: Built based on the latest nuclear fusion research literature and experimental data
- Continuous Updates: Supports community contributions so the dataset can promptly reflect the latest developments in the nuclear fusion field
- Standard Format: Uses standard JSON format for easy use and expansion by the academic community
- Python 3.7+
```bash
pip install -r requirements.txt
```

```bash
# Evaluate all configured models using the FusionBench dataset
python main.py

# Quick test: evaluate with lightweight models
python main.py --models fast

# Small-scale test: evaluate 100 FusionBench questions
python main.py --limit 100

# Specialized evaluation: test only true/false and multiple-choice questions
python main.py --exclude-types fill_in

# Quick verification: evaluate only cached FusionBench answers
python main.py --eval-only
```

- Qwen2.5-7B
- Qwen3-8B
- Llama3-70B
- Llama3.3-70B
- DeepSeek-R1 series
- DeepSeek: DeepSeek-R1-Distill-Qwen-32B, Qwen2.5-72B-instruct
- SiliconFlow: DeepSeek-V3, DeepSeek-R1
- Gemini: Gemini-2.0-Flash
- Zhizengzeng: GPT-4.1, O3-mini, Claude-Sonnet-4
Configure API keys and other parameters in config.py:
```python
class LLMConfig:
    # DeepSeek configuration
    DEEPSEEK_PRIVATE_TOKEN = "your_token_here"
    DEEPSEEK_BEARER_TOKEN = "your_token_here"

    # Gemini configuration
    GEMINI_API_KEY = "your_key_here"

    # SiliconFlow configuration
    SILICONFLOW_API_KEY = "your_key_here"

    # Zhizengzeng configuration
    ZHIZENGZENG_API_KEY = "your_key_here"

    # Ollama configuration (if using a custom port)
    OLLAMA_BASE_URL = "http://localhost:11434"
```

Define the list of models to test in `LLMConfig.MODELS_TO_TEST`.
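A model list entry might look like the sketch below. The exact shape of `MODELS_TO_TEST` in `config.py` is not shown in this README, so the `service`/`model` fields here are hypothetical placeholders:

```python
class LLMConfig:
    # Hypothetical sketch: each entry names a backend service and a model
    # identifier; the real config.py may use a different structure.
    MODELS_TO_TEST = [
        {"service": "deepseek", "model": "DeepSeek-R1-Distill-Qwen-32B"},
        {"service": "ollama", "model": "qwen2.5:7b"},
        {"service": "gemini", "model": "gemini-2.0-flash"},
    ]
```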
After nuclear fusion benchmark evaluation, professional analysis reports will be generated in the results/ directory:
- Standard Report (`{model_name}_report.json`): detailed result data from the nuclear fusion knowledge evaluation
- Enhanced Report (`{model_name}_enhanced_report.json`): in-depth analysis including domain error analysis and statistics
- Summary Report (`{model_name}_summary.txt`): researcher-friendly readable summary
- Answer Cache (`{model_name}_answers.json`): cached answers to nuclear fusion questions for repeated evaluation
- Overall accuracy statistics and ranking in the nuclear fusion domain
- Specialized accuracy analysis by question types (physical concepts, engineering parameters, experimental phenomena, etc.)
- Classification accuracy by nuclear fusion sub-domains (plasma physics, magnetic confinement, laser inertial confinement, etc.)
- Token consumption statistics of LLM models on nuclear fusion professional questions
- Common error pattern analysis and typical error cases
- Model inference time and efficiency comparison statistics
| Parameter | Description | Example |
|---|---|---|
| `--dataset` | Dataset file path | `--dataset custom_dataset.json` |
| `--limit` | Limit the number of evaluation questions | `--limit 500` |
| `--models` | Select model groups or specific models | `--models fast deepseek` |
| `--output-dir` | Output directory | `--output-dir my_results` |
| `--exclude-types` | Skip specific question types | `--exclude-types fill_in` |
| `--no-cache` | Disable caching and force API calls | `--no-cache` |
| `--eval-only` | Evaluate only cached answers | `--eval-only` |
- `fast`: fast local models (Qwen2.5-7B, Qwen3-8B)
- `large`: large local models (Llama3-70B, Llama3.3-70B)
- `ollama`: all Ollama models
- `deepseek`: DeepSeek cloud models
- `config`: all models defined in config.py
- `all`: all available models
```
fusionbench/
├── main.py              # Main program entry and command-line interface
├── llm_adapters.py      # Multi-LLM service adapters (DeepSeek, Ollama, etc.)
├── evaluator.py         # Nuclear fusion domain answer evaluator
├── reporter.py          # FusionBench report generator
├── dataset.py           # FusionBench dataset loader
├── config.py            # Model and service configuration management
├── prompt_templates.py  # Prompt templates for nuclear fusion questions
├── requirements.txt     # Python dependency list
├── fusionbench.json     # FusionBench nuclear fusion dataset
└── run_*.py             # Dedicated model testing scripts
```
The FusionBench dataset supports two nuclear fusion domain data formats:
- FusionBench format (a dictionary keyed by article ID): a question bank built from nuclear fusion literature with rich contextual information
- Items format (a list where each item contains one question's data): a structured nuclear fusion Q&A dataset that is easy to extend and maintain
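The two formats might look like the sketches below. The field names (`context`, `questions`, `type`, `answer`, and so on) are hypothetical placeholders; consult `fusionbench.json` for the actual schema:

```json
{
  "article_001": {
    "context": "Excerpt from a nuclear fusion paper...",
    "questions": [
      {"type": "true_false", "question": "...", "answer": "true"}
    ]
  }
}
```

```json
{
  "items": [
    {"type": "multiple_choice", "question": "...", "options": ["A", "B"], "answer": "A"}
  ]
}
```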
- Smart caching keyed by a hash of the question content
- Cache migration support (old ID-based caches are automatically converted to content-hash caches)
- Cache is auto-saved, with a backup every 100 API calls
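Content-hash keying means the cache survives question reordering or renumbering. A minimal sketch of the idea, assuming SHA-256 over the question text and type (the actual key derivation in the project may differ):

```python
import hashlib
import json


def cache_key(question: dict) -> str:
    """Derive a stable cache key from question content rather than its ID,
    so re-ordered or renumbered datasets still hit the cache.
    Hypothetical sketch; field names are assumptions."""
    payload = json.dumps(
        {"question": question["question"], "type": question["type"]},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key depends only on content, two datasets containing the same question under different IDs share one cached answer.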
- Retry on network timeouts
- Flagging of answers that fail to parse
- Detailed error logging and statistics
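The timeout-retry behavior can be sketched as an exponential-backoff wrapper. This is an illustrative pattern, not the project's actual implementation (function name and parameters are assumptions):

```python
import time


def call_with_retry(fn, retries=3, backoff=1.0):
    """Hypothetical sketch of the retry pattern: retry the call on timeouts
    with exponential backoff, re-raising after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries - 1:
                raise
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts.
            time.sleep(backoff * (2 ** attempt))
```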
This project is licensed under the MIT License - see the LICENSE file for details.
- Please comply with the terms of service and API limits of each LLM service provider
- Large-scale evaluation may incur significant API costs; run the full nuclear fusion dataset cautiously
- Test your configuration on a small subset first (e.g., --limit 50) to ensure models can properly handle nuclear fusion questions
- Nuclear fusion knowledge evolves rapidly; update the dataset regularly to reflect the latest research developments