A complete LLM evaluation solution for the nuclear fusion domain, comprising the professional FusionBench dataset and an accompanying evaluation framework. It is used to systematically evaluate and compare large language models' knowledge mastery and reasoning capabilities on nuclear fusion science problems.
- Complete Solution: Provides both FusionBench dataset and professional evaluation framework for one-stop nuclear fusion LLM evaluation
- High-Quality Dataset: Carefully constructed nuclear fusion domain benchmark dataset covering core scientific concepts and latest research
- Comprehensive Model Support: Supports mainstream LLM services including DeepSeek, Ollama, SiliconFlow, Gemini, Zhizengzeng, etc.
- Rich Question Types: Supports true/false, multiple choice, and fill-in questions to comprehensively evaluate models' understanding and reasoning capabilities
- Smart Cache Optimization: Content hash-based caching mechanism that significantly improves testing efficiency and reduces API costs
- Deep Performance Analysis: Provides detailed accuracy statistics, domain-level breakdowns, and error-pattern analysis reports
- Flexible Extensibility: Supports multiple dataset formats, making it easy to integrate new nuclear fusion research questions
- Professional CLI Tools: Rich parameter configuration supporting batch testing, conditional filtering, result export, and other advanced features
FusionBench is a high-quality benchmark dataset specifically constructed for the nuclear fusion domain, including:
- Comprehensive Coverage: Covers core nuclear fusion areas including plasma physics, magnetic confinement fusion, laser inertial confinement, etc.
- Question Diversity: Includes true/false, multiple choice, and fill-in questions to test different levels of understanding capabilities
- Academic Foundation: Built based on the latest nuclear fusion research literature and experimental data
- Continuous Updates: Supports community contributions so the dataset can promptly reflect the latest developments in the nuclear fusion field
- Standard Format: Uses standard JSON format for easy use and expansion by the academic community
- Python 3.7+
```bash
pip install -r requirements.txt
```

```bash
# Evaluate all configured models using the FusionBench dataset
python main.py

# Quick test: evaluate with lightweight models
python main.py --models fast

# Small-scale test: evaluate 100 FusionBench questions
python main.py --limit 100

# Specialized evaluation: test only true/false and multiple-choice questions
python main.py --exclude-types fill_in

# Quick verification: evaluate only cached FusionBench answers
python main.py --eval-only
```

- Qwen2.5-7B
- Qwen3-8B
- Llama3-70B
- Llama3.3-70B
- DeepSeek-R1 series
- DeepSeek: DeepSeek-R1-Distill-Qwen-32B, Qwen2.5-72B-instruct
- SiliconFlow: DeepSeek-V3, DeepSeek-R1
- Gemini: Gemini-2.0-Flash
- Zhizengzeng: GPT-4.1, O3-mini, Claude-Sonnet-4
Configure API keys and other parameters in config.py:
```python
class LLMConfig:
    # DeepSeek configuration
    DEEPSEEK_PRIVATE_TOKEN = "your_token_here"
    DEEPSEEK_BEARER_TOKEN = "your_token_here"

    # Gemini configuration
    GEMINI_API_KEY = "your_key_here"

    # SiliconFlow configuration
    SILICONFLOW_API_KEY = "your_key_here"

    # Zhizengzeng configuration
    ZHIZENGZENG_API_KEY = "your_key_here"

    # Ollama configuration (if using a custom port)
    OLLAMA_BASE_URL = "http://localhost:11434"
```

Define the list of models to test in `LLMConfig.MODELS_TO_TEST`.
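A model list entry might look like the sketch below. The exact shape of `MODELS_TO_TEST` in `config.py` is not shown in this README, so the `service`/`model` fields here are hypothetical placeholders:

```python
class LLMConfig:
    # Hypothetical sketch: each entry names a backend service and a model
    # identifier; the real config.py may use a different structure.
    MODELS_TO_TEST = [
        {"service": "deepseek", "model": "DeepSeek-R1-Distill-Qwen-32B"},
        {"service": "ollama", "model": "qwen2.5:7b"},
        {"service": "gemini", "model": "gemini-2.0-flash"},
    ]
```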
After nuclear fusion benchmark evaluation, professional analysis reports will be generated in the results/ directory:
- Standard Report (`{model_name}_report.json`): detailed result data from the nuclear fusion knowledge evaluation
- Enhanced Report (`{model_name}_enhanced_report.json`): in-depth analysis including domain error analysis and statistics
- Summary Report (`{model_name}_summary.txt`): researcher-friendly readable summary
- Answer Cache (`{model_name}_answers.json`): cached answers to nuclear fusion questions for repeated evaluation
- Overall accuracy statistics and ranking in the nuclear fusion domain
- Specialized accuracy analysis by question types (physical concepts, engineering parameters, experimental phenomena, etc.)
- Classification accuracy by nuclear fusion sub-domains (plasma physics, magnetic confinement, laser inertial confinement, etc.)
- Token consumption statistics of LLM models on nuclear fusion professional questions
- Common error pattern analysis and typical error cases
- Model inference time and efficiency comparison statistics
| Parameter | Description | Example |
|---|---|---|
| `--dataset` | Dataset file path | `--dataset custom_dataset.json` |
| `--limit` | Limit the number of evaluation questions | `--limit 500` |
| `--models` | Select model groups or specific models | `--models fast deepseek` |
| `--output-dir` | Output directory | `--output-dir my_results` |
| `--exclude-types` | Skip specific question types | `--exclude-types fill_in` |
| `--no-cache` | Disable caching and force API calls | `--no-cache` |
| `--eval-only` | Evaluate only cached answers | `--eval-only` |
- `fast`: fast local models (Qwen2.5-7B, Qwen3-8B)
- `large`: large local models (Llama3-70B, Llama3.3-70B)
- `ollama`: all Ollama models
- `deepseek`: DeepSeek cloud models
- `config`: all models defined in config.py
- `all`: all available models
```
fusionbench/
├── main.py              # Main program entry and command-line interface
├── llm_adapters.py      # Multi-LLM service adapters (DeepSeek, Ollama, etc.)
├── evaluator.py         # Nuclear fusion domain answer evaluator
├── reporter.py          # FusionBench report generator
├── dataset.py           # FusionBench dataset loader
├── config.py            # Model and service configuration management
├── prompt_templates.py  # Prompt templates for nuclear fusion questions
├── requirements.txt     # Python dependency list
├── fusionbench.json     # FusionBench nuclear fusion dataset
└── run_*.py             # Dedicated model testing scripts
```
The FusionBench dataset supports two nuclear fusion domain data formats:
- FusionBench format (a dictionary keyed by article ID): a question bank built from nuclear fusion literature with rich contextual information
- Items format (a list where each item contains one question's data): a structured nuclear fusion Q&A dataset that is easy to extend and maintain
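The two formats might look like the sketches below. The field names (`context`, `questions`, `type`, `answer`, and so on) are hypothetical placeholders; consult `fusionbench.json` for the actual schema:

```json
{
  "article_001": {
    "context": "Excerpt from a nuclear fusion paper...",
    "questions": [
      {"type": "true_false", "question": "...", "answer": "true"}
    ]
  }
}
```

```json
{
  "items": [
    {"type": "multiple_choice", "question": "...", "options": ["A", "B"], "answer": "A"}
  ]
}
```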
- Smart caching keyed by a hash of the question content
- Cache migration support (old ID-based caches are automatically converted to content-hash caches)
- Cache is auto-saved, with a backup every 100 API calls
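Content-hash keying means the cache survives question reordering or renumbering. A minimal sketch of the idea, assuming SHA-256 over the question text and type (the actual key derivation in the project may differ):

```python
import hashlib
import json


def cache_key(question: dict) -> str:
    """Derive a stable cache key from question content rather than its ID,
    so re-ordered or renumbered datasets still hit the cache.
    Hypothetical sketch; field names are assumptions."""
    payload = json.dumps(
        {"question": question["question"], "type": question["type"]},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key depends only on content, two datasets containing the same question under different IDs share one cached answer.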
- Retry on network timeouts
- Flagging of answers that fail to parse
- Detailed error logging and statistics
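The timeout-retry behavior can be sketched as an exponential-backoff wrapper. This is an illustrative pattern, not the project's actual implementation (function name and parameters are assumptions):

```python
import time


def call_with_retry(fn, retries=3, backoff=1.0):
    """Hypothetical sketch of the retry pattern: retry the call on timeouts
    with exponential backoff, re-raising after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries - 1:
                raise
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts.
            time.sleep(backoff * (2 ** attempt))
```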
This project is licensed under the MIT License - see the LICENSE file for details.
- Please comply with the terms of service and API limits of each LLM service provider
- Large-scale evaluation may incur significant API costs; run the full nuclear fusion dataset cautiously
- Test your configuration on a small subset first (e.g., --limit 50) to ensure models can properly handle nuclear fusion questions
- Nuclear fusion knowledge evolves rapidly; update the dataset regularly to reflect the latest research developments