PKU-XLab/FusionBench
FusionBench: Nuclear Fusion Benchmark Dataset and Evaluation Framework

A complete evaluation solution for LLMs in the nuclear fusion domain, comprising the FusionBench dataset and an accompanying evaluation framework. It is used to systematically evaluate and compare the knowledge and reasoning capabilities of large language models on nuclear fusion science problems.

🌟 Features

  • Complete Solution: Provides both the FusionBench dataset and a professional evaluation framework for one-stop nuclear fusion LLM evaluation
  • High-Quality Dataset: Carefully constructed nuclear fusion benchmark covering core scientific concepts and recent research
  • Broad Model Support: Supports mainstream LLM services including DeepSeek, Ollama, SiliconFlow, Gemini, and Zhizengzeng
  • Rich Question Types: Supports true/false, multiple choice, and fill-in questions to evaluate models' understanding and reasoning at different levels
  • Smart Caching: Content-hash-based caching that significantly improves testing efficiency and reduces API costs
  • Deep Performance Analysis: Detailed accuracy statistics, per-domain breakdowns, and error-pattern analysis reports
  • Flexible Extensibility: Supports multiple dataset formats, making it easy to add new nuclear fusion research questions
  • Professional CLI Tools: Rich parameter options for batch testing, conditional filtering, result export, and other advanced features

πŸ“š FusionBench Dataset

FusionBench is a high-quality benchmark dataset specifically constructed for the nuclear fusion domain, including:

  • Comprehensive Coverage: Spans core areas including plasma physics, magnetic confinement fusion, and laser inertial confinement
  • Question Diversity: Includes true/false, multiple choice, and fill-in questions to test different levels of understanding
  • Academic Foundation: Built on recent nuclear fusion research literature and experimental data
  • Continuous Updates: Accepts community contributions so the benchmark reflects the latest developments in the field
  • Standard Format: Uses standard JSON for easy use and extension by the academic community

πŸš€ Quick Start

Environment Requirements

  • Python 3.7+

Install Dependencies

pip install -r requirements.txt

Basic Usage

# Evaluate all configured models using FusionBench dataset
python main.py

# Quick test: Evaluate FusionBench dataset with lightweight models
python main.py --models fast

# Small-scale test: Evaluate 100 FusionBench questions
python main.py --limit 100

# Specialized evaluation: Test only true/false and multiple choice questions
python main.py --exclude-types fill_in

# Quick verification: Evaluate only cached FusionBench answers
python main.py --eval-only

πŸ“‹ Supported Models and Services

Local Models (Ollama)

  • Qwen2.5-7B
  • Qwen3-8B
  • Llama3-70B
  • Llama3.3-70B
  • DeepSeek-R1 series

Cloud Services

  • DeepSeek: DeepSeek-R1-Distill-Qwen-32B, Qwen2.5-72B-instruct
  • SiliconFlow: DeepSeek-V3, DeepSeek-R1
  • Gemini: Gemini-2.0-Flash
  • Zhizengzeng: GPT-4.1, O3-mini, Claude-Sonnet-4

βš™οΈ Configuration

Environment Configuration

Configure API keys and other parameters in config.py:

class LLMConfig:
    # DeepSeek configuration
    DEEPSEEK_PRIVATE_TOKEN = "your_token_here"
    DEEPSEEK_BEARER_TOKEN = "your_token_here"

    # Gemini configuration
    GEMINI_API_KEY = "your_key_here"

    # SiliconFlow configuration
    SILICONFLOW_API_KEY = "your_key_here"

    # Zhizengzeng configuration
    ZHIZENGZENG_API_KEY = "your_key_here"

    # Ollama configuration (if using custom port)
    OLLAMA_BASE_URL = "http://localhost:11434"

Model Configuration

Define the list of models to test in LLMConfig.MODELS_TO_TEST.
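The exact schema of MODELS_TO_TEST is defined by the project's adapters; as a hypothetical sketch (field names are illustrative, not taken from config.py):

```python
# Hypothetical sketch of a model list; the actual schema used by
# LLMConfig.MODELS_TO_TEST in config.py may differ.
MODELS_TO_TEST = [
    {"name": "qwen2.5:7b", "provider": "ollama", "group": "fast"},
    {"name": "deepseek-r1", "provider": "siliconflow", "group": "deepseek"},
]
```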

πŸ“Š Output and Reports

After an evaluation run, analysis reports are generated in the results/ directory:

  • Standard Report: {model_name}_report.json - Detailed result data from nuclear fusion knowledge evaluation
  • Enhanced Report: {model_name}_enhanced_report.json - In-depth analysis report containing domain error analysis and statistics
  • Summary Report: {model_name}_summary.txt - Human-readable summary report for researchers
  • Answer Cache: {model_name}_answers.json - Nuclear fusion question answer cache for repeated evaluation
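Since the file names follow a fixed {model_name} prefix pattern, reports can be loaded programmatically; a minimal sketch (each report's internal fields are project-specific and not shown here):

```python
import json
from pathlib import Path

def load_report(results_dir: str, model_name: str) -> dict:
    # File naming ({model_name}_report.json) follows the README;
    # the report's internal fields are project-specific.
    path = Path(results_dir) / f"{model_name}_report.json"
    return json.loads(path.read_text(encoding="utf-8"))
```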

Professional Report Contents Include

  • Overall accuracy statistics and ranking in the nuclear fusion domain
  • Specialized accuracy analysis by question types (physical concepts, engineering parameters, experimental phenomena, etc.)
  • Classification accuracy by nuclear fusion sub-domains (plasma physics, magnetic confinement, laser inertial confinement, etc.)
  • Token consumption statistics of LLM models on nuclear fusion professional questions
  • Common error pattern analysis and typical error cases
  • Model inference time and efficiency comparison statistics

🎯 Command Line Parameters

Parameter         Description                              Example
--dataset         Dataset file path                        --dataset custom_dataset.json
--limit           Limit number of evaluation questions     --limit 500
--models          Select model groups or specific models   --models fast deepseek
--output-dir      Output directory                         --output-dir my_results
--exclude-types   Skip specific question types             --exclude-types fill_in
--no-cache        Disable caching, force API calls         --no-cache
--eval-only       Evaluate only cached answers             --eval-only

Available Model Groups

  • fast: Fast local models (Qwen2.5-7B, Qwen3-8B)
  • large: Large local models (Llama3-70B, Llama3.3-70B)
  • ollama: All Ollama models
  • deepseek: DeepSeek cloud models
  • config: All models defined in config.py
  • all: All available models

πŸ“ Project Structure

fusionbench/
β”œβ”€β”€ main.py                 # Main program entry and command line interface
β”œβ”€β”€ llm_adapters.py         # Multi-LLM service adapters (DeepSeek, Ollama, etc.)
β”œβ”€β”€ evaluator.py           # Nuclear fusion domain answer evaluator
β”œβ”€β”€ reporter.py            # Professional FusionBench report generator
β”œβ”€β”€ dataset.py             # FusionBench dataset loader
β”œβ”€β”€ config.py              # Model and service configuration management
β”œβ”€β”€ prompt_templates.py    # Nuclear fusion professional question prompt templates
β”œβ”€β”€ requirements.txt       # Python dependency package list
β”œβ”€β”€ fusionbench.json       # FusionBench nuclear fusion dataset
└── run_*.py              # Dedicated model testing scripts

πŸ”§ Advanced Usage

FusionBench Dataset Format

The FusionBench dataset loader supports two nuclear fusion data formats:

  1. FusionBench Format (dictionary format with article IDs as keys) - Professional question bank built from nuclear fusion literature with rich contextual information
  2. Items Format (list format with each item containing question data) - Structured nuclear fusion knowledge Q&A dataset for easy expansion and maintenance
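As an illustration, the two layouts might look like this (field names are hypothetical; consult fusionbench.json for the actual schema):

```python
# Hypothetical examples of the two layouts; field names are
# illustrative and may not match fusionbench.json exactly.

# FusionBench Format: dictionary keyed by article ID, with context
# and the questions derived from that article.
fusionbench_format = {
    "article_001": {
        "context": "Excerpt from a nuclear fusion paper...",
        "questions": [
            {"type": "true_false", "question": "...", "answer": True},
        ],
    },
}

# Items Format: flat list, one question per item.
items_format = [
    {"id": 1, "type": "multiple_choice", "question": "...",
     "options": ["A", "B", "C", "D"], "answer": "A"},
]
```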

Caching Mechanism

  • Smart caching keyed on a hash of the question content
  • Cache migration: old ID-based caches are automatically converted to content-hash caches
  • Cache is auto-saved, with a backup every 100 API calls
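A content-hash cache key of this kind can be sketched as follows (illustrative only; the project's actual hashing scheme may differ):

```python
import hashlib
import json

def cache_key(question: dict) -> str:
    # Serialize with sorted keys so logically identical questions
    # always hash to the same value, regardless of dict ordering.
    # Illustrative sketch, not the project's actual implementation.
    payload = json.dumps(question, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on content rather than question ID means the cache survives dataset reordering or renumbering, which is what makes the ID-to-hash migration above possible.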

Error Handling

  • Retries on network timeouts
  • Marks answers whose responses could not be parsed
  • Detailed error logging and statistics
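Timeout retry of this kind is typically implemented with exponential backoff; a minimal sketch (retry counts and delays here are assumptions, not the project's actual parameters):

```python
import time

def call_with_retry(fn, max_retries=3, base_delay=1.0):
    # Retry a flaky API call with exponential backoff; illustrative
    # sketch, not the project's actual implementation.
    for attempt in range(max_retries):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error for logging
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```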

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Important Notes

  • Please comply with the terms of service and API limits of each LLM service provider
  • Large-scale evaluation can incur significant API costs; run large jobs cautiously
  • Test your configuration on a small subset first (e.g., --limit 50) to ensure models handle the nuclear fusion questions properly
  • Nuclear fusion knowledge evolves rapidly; update the dataset regularly to reflect the latest research developments
