DataArcTech/DataArc-SynData-Toolkit

DataArc SynData Toolkit

A modular, highly user-friendly synthetic data generation toolkit supporting multi-source, multi-language data synthesis.

Easily synthesize training data for LLMs with a zero-code CLI and GUI!

📖 [ English | 中文 ]

🎯 Project Overview

DataArc SynData Toolkit is a synthetic data generation toolkit developed and open-sourced by DataArc. It enables users to generate customized training data in one step through simple configuration files based on their requirements.

💡 Key Features

  • Extremely Simple Usage: Synthesize data with a single command and a configuration file. A Gradio UI is also provided.
  • Support for Multi-Source Synthetic Data:
    • Local Synthesis: Generates data based on local corpora.
    • Huggingface Integration: Automatically screens and retrieves data from Huggingface.
    • Model Distillation: Generates synthetic data through model distillation.
  • Multilingual Support: Supports English and various low-resource languages.
  • Multi-Provider Model Support: Works with local deployment, OpenAI APIs, and more.
  • Highly Extensible: The entire synthetic data workflow is modular, allowing developers to customize each step.

🔬 Performance

| Model                       | Medical | Finance | Law    |
|-----------------------------|---------|---------|--------|
| Qwen-2.5-7B-Instruct        | 42.34%  | 52.91%  | 19.80% |
| Trained with Synthetic Data | 64.57%  | 73.93%  | 42.80% |

A few lines of code deliver gains of more than 20 percentage points in each of the three domains.
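The size of the gain can be checked directly from the table above. The snippet below is only an arithmetic check on the reported scores; the variable and function names are illustrative and not part of the toolkit.

```python
# Scores copied from the table above, in percent.
baseline = {"Medical": 42.34, "Finance": 52.91, "Law": 19.80}
trained = {"Medical": 64.57, "Finance": 73.93, "Law": 42.80}


def gains(before, after):
    """Return the absolute gain in percentage points for each domain."""
    return {domain: round(after[domain] - before[domain], 2) for domain in before}


improvement = gains(baseline, trained)
for domain, delta in improvement.items():
    print(f"{domain}: +{delta:.2f} percentage points")
```

Each domain improves by between 21 and 23 percentage points, so the "over 20" figure refers to absolute points rather than a relative change.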

📓 Changelog

[25/11/17] We open-sourced our synthetic data platform.

Tip

If you cannot use the latest feature, please pull the latest code.

🏭 DataArc SynData Toolkit Pipeline

DataArc SynData Toolkit is designed to synthesize data in a modular pipeline, allowing users to customize the strategies and implementation methods of each step. The main components include:

  • Synthetic Data Generation: Generate data through methods such as local synthesis, Huggingface dataset retrieval, and model distillation.
  • Data Filtering and Rewriting: Filter and rewrite initially synthesized data according to the target model's requirements.

[Figure: DataArc SynData Toolkit pipeline]

Because the modules are decoupled, developers can customize or replace individual modules to suit their needs.
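To illustrate what a decoupled generate, filter, and rewrite flow can look like, here is a minimal sketch. It is not the toolkit's actual API: `run_pipeline`, `toy_generate`, `toy_keep`, and `toy_rewrite` are hypothetical names, and each stage is reduced to a plain function so that any one of them can be swapped independently.

```python
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, str]  # hypothetical record type, e.g. {"question": ..., "answer": ...}


def run_pipeline(
    generate: Callable[[], Iterable[Sample]],
    keep: Callable[[Sample], bool],
    rewrite: Callable[[Sample], Sample],
) -> List[Sample]:
    """Generate samples, drop those that fail the filter, then rewrite the rest."""
    return [rewrite(sample) for sample in generate() if keep(sample)]


# Toy stand-ins for the real stages (all hypothetical).
def toy_generate() -> Iterable[Sample]:
    yield {"question": "What is 2 + 2?", "answer": "4"}
    yield {"question": "", "answer": "n/a"}  # malformed sample, should be filtered out


def toy_keep(sample: Sample) -> bool:
    return bool(sample["question"].strip())


def toy_rewrite(sample: Sample) -> Sample:
    return {**sample, "question": "Question: " + sample["question"]}


result = run_pipeline(toy_generate, toy_keep, toy_rewrite)
print(result)
```

Passing the stages in as arguments means a different generation source or filtering rule only requires a new function, with no change to the driver.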

🧩 Use Cases

We provide three different use cases that synthesize data with DataArc SynData Toolkit.

📁 Project Structure

dataarc-sdg/
├── configs/                        # Configuration Examples
│   ├── example.yaml                # example YAML file
│
├── sdgsystem/                      # Implementation of Functions
│   ├── configs/                    # Configuration Module
│   │   ├── config.py               # configuration parsing
│   │   └── constants.py            # default arguments
│   │
│   ├── dataset/                    # Dataset Module
│   │   ├── dataset.py              # dataset class
│   │   └── process.py              # quality control and formatting
│   │
│   ├── huggingface/                # Huggingface Crawling
│   ├── documents/                  # Retrieval/Parsing/Chunking of Local Corpora
│   ├── distillation/               # Model Distillation
│   │
│   ├── evaluation/                 # Evaluation Module
│   │   ├── answer_comparison.py    # answer comparison
│   │   ├── evaluator.py            # evaluator
│   │
│   ├── generation/                 # Generation Module
│   │   ├── base.py                 # base class of generation
│   │   ├── generator.py            # data generator
│   │   ├── rewriter.py             # data rewriter
│   │
│   ├── models/                     # Model Interaction Module
│   │   ├── postprocess/            # postprocess of model responses (e.g. majority voting)
│   │   ├── answer_extraction.py    # answer extraction from responses
│   │   ├── models.py               # model deployment and chatting
│   │   ├── processor_arguments.py  # arguments of post-processor
│   │   ├── client.py               # client for interacting with models
│   │
│   ├── tasks/                      # Generation Task Execution Module
│   │   ├── base.py                 # base class of executor
│   │   ├── (local/web/distill).py  # executor for different sources/route
│   │   ├── total_executor.py       # total executor
│   │
│   ├── translation/                # Support for Low-Resource Languages
│   │
│   ├── cli.py                      # API for project functions
│   ├── pipeline.py                 # pipeline of data synthesis
│   ├── prompts.py                  # prompts used in project
│   ├── token_counter.py            # token usage estimation
│   └── utils.py                    # other function utils
│
├── tests/                          # Test Suite
│
├── app.py                          # gradio UI
├── pyproject.toml                  # project dependencies
└── README.md                       # project documentation

🚀 Quick Start

1. Install DataArc SynData Toolkit

# 1. Clone the repository
git clone https://github.com/DataArcTech/DataArc-SynData-Toolkit.git
cd DataArc-SynData-Toolkit

# 2. Install uv if not already installed
pip install uv

# 3. Install dependencies 
uv sync

For hardware requirements and dependency details, please refer to the dependency and installation guide.

2. Configuration

Please refer to the example configuration file and modify the configuration based on your requirements.

3. Synthesize Data

Run through CLI:

Create a .env file and specify the following fields.

OPENAI_API_KEY=sk-xxx   # your api key
OPENAI_BASE_URL=https://api.openai.com/v1  # Optional: your base url

Then run the following command.

uv run sdg configs/example.yaml  # or change to your .yaml file
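Before launching a run, it can be useful to confirm that the .env fields are present. The sketch below parses the two fields shown above and assembles the same `uv run sdg` command as an argument list. The helpers `parse_env` and `build_command` are hypothetical and not part of the toolkit, and how the toolkit itself loads .env is not covered here.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and '#' comments.

    Assumption: values contain no '#' character, since everything after
    the first '#' on a line is treated as a comment.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def build_command(config_path, env):
    """Return the CLI invocation as a list, after checking the API key is set."""
    if not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is missing from the .env contents")
    return ["uv", "run", "sdg", str(config_path)]


# Example .env contents, copied from the fields shown above.
example = (
    "OPENAI_API_KEY=sk-xxx   # your api key\n"
    "OPENAI_BASE_URL=https://api.openai.com/v1  # Optional: your base url\n"
)
env = parse_env(example)
command = build_command("configs/example.yaml", env)
print(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root if the run should be started from Python rather than from the shell.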

🖥️ Synthesizing Data with GUI

The UI is powered by Gradio. Launch it with the following command.

uv run python app.py

🔧 Configuration System

DataArc-SDG is configured using a flexible YAML file. Please see the provided example YAML file.

📅 Schedule for the Next Release

  • Arabic Support: Support for generating Arabic synthetic data.
  • Custom Data Sources: Support for custom addition of data sources and corresponding protocol file conversion.
  • Model Fine-tuning Module: Support for fine-tuning models using synthetic data within the pipeline.

🤝 Contributing

We welcome contributions!
