PowerInstruct is an automated dataset generation tool for power system instructions based on large language models. The tool supports two generation modes:
- Seed Generation: Uses models like GPT-4o/Qwen to generate seed data in a standard format.
- Code Generation: Uses models like Claude/o1 to generate transformation code based on the seed data, enabling batch dataset generation.
- Supports multiple large language models (GPT-4o, Claude, Qwen, etc.)
- Batch data processing capabilities
- Real-time execution result display
- Supports JSON/JSONL format export
- Visualization platform support
- Clone the repository
git clone https://github.com/Joserii/PowerInstruct.git
cd PowerInstruct- Create and activate a virtual environment
conda create -n powerinstruct python=3.10
conda activate powerinstruct- Install dependencies
pip install -r requirements.txt- Configure environment variables (OpenAI, Anthropic, Qwen API key support)
You need to manually set the following environment variables in your terminal:
- Linux/macOS (bash/zsh):
export OPENAI_API_KEY="sk-xxxxxxx"
export OPENAI_BASE_URL="https://api.openai.com/v1"
export ANTHROPIC_API_KEY="sk-xxxxxxx"
export ANTHROPIC_BASE_URL="https://api.anthropic.com/v1"
export DASHSCOPE_API_KEY="sk-xxxxxxxxx"
export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"- Windows (Powershell):
$env:OPENAI_API_KEY="sk-xxxxxx"
$env:OPENAI_BASE_URL="https://api.openai.com/v1"
$env:ANTHROPIC_API_KEY="sk-xxxxxx"
$env:ANTHROPIC_BASE_URL="https://api.anthropic.com/v1"
$env:DASHSCOPE_API_KEY="sk-xxxxxxx"
$env:DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"- Start the service
python run.py- Access the web interface
- Open your browser and visit
http://localhost:5000
- Workflow
- Upload data file (support JSON format). Here we provide a test data set, download link test_data10.json
- Select the model to use (different models can be chosen for Seed Generation and Code Generation)
- Execute seed data generation
- Merge code generation templates
- Execute code generation
- Download the generated dataset (JSON/JSONL format supported)
powerinstruct/
├── app/ # Backend service
│ ├── api/ # API routes
│ ├── core/ # Core business logic
│ ├── services/ # Service layer
│ └── utils/ # Utility functions
├── static/ # Frontend assets
│ ├── css/ # Stylesheets
│ └── js/ # JavaScript files
│ templates/ # HTML templates
├── data/ # Data directory
│ ├── templates/ # Template files
│ └── uploads/ # Uploaded files
├── config/ # Configuration files
├── tests/ # Test cases
├── requirements.txt # Project dependencies
└── README.md # Project documentation
-
File Handling
- Supports JSON file uploads
- File format validation
- Batch file processing
-
Model Integration
- Supports multiple large language models
- Configurable model parameters
- Error retry mechanism
-
Data Generation
- Seed data generation
- Code template generation
- Batch data transformation
-
Result Export
- JSON format export
- JSONL format export
- Instruction data extraction
- Model Configuration
SUPPORT_MODELS = [
# Qwen Series
"qwen-max",
"qwen-coder-plus",
"qwen2.5-7b-instruct",
# OpenAI Series
"o1",
"o1-mini",
"gpt-4o",
"gpt-4o-mini",
# Anthropic Claude Series
"claude-3-7-sonnet",
"claude-3-5-sonnet"
]- Template Configuration
{
"prompt": "your_prompt_template",
"codegen": "your_code_template"
}

