DataSynth-Gen is a Python library that generates synthetic data for machine learning, with a focus on privacy-preserving techniques and data augmentation. It lets developers and researchers create realistic synthetic datasets for situations where real data is scarce, sensitive, or expensive to acquire.
Key features:

- Diverse Data Types: Supports generation of numerical, categorical, and time-series data.
- Privacy-Preserving: Implements techniques such as differential privacy and generative adversarial networks (GANs) to help protect the privacy of the source data.
- Data Augmentation: Provides methods to expand existing datasets, improving model robustness and performance.
- Customizable Schemas: Define complex data schemas to generate synthetic data that closely mimics real-world distributions.
- Integration: Designed for easy integration with popular machine learning frameworks like scikit-learn, TensorFlow, and PyTorch.
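
To make the differential-privacy feature more concrete, here is a minimal sketch of the Laplace mechanism applied to a bounded mean. It is an illustration of the general technique, not DataSynth-Gen's implementation: the function names `laplace_noise` and `dp_mean` are hypothetical, the sample values are made up, and the sketch assumes the number of records is public and that each value is clipped to known bounds.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale)."""
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Hypothetical helper: epsilon-DP mean of values clipped to [lower, upper]."""
    rng = rng or random.Random()
    # Clipping bounds how much any single record can influence the result.
    clipped = [min(max(float(v), lower), upper) for v in values]
    # Replacing one record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return statistics.fmean(clipped) + laplace_noise(sensitivity / epsilon, rng)


# Made-up example values, for illustration only.
ages = [25, 45, 30, 52, 38, 41, 29, 60]
private_mean = dp_mean(ages, lower=18, upper=70, epsilon=1.0, rng=random.Random(0))
print(f"true mean: {statistics.fmean(ages):.2f}, private mean: {private_mean:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy; a larger one tracks the true mean more closely.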
Requirements:

- Python 3.8+
- pandas
- numpy
- scikit-learn
- (Optional) tensorflow or pytorch for GAN-based synthesis
Installation:

git clone https://github.com/FunctionFlow1/DataSynth-Gen.git
cd DataSynth-Gen
pip install -r requirements.txt

Usage:

import pandas as pd
from datasynth_gen import Synthesizer, augment_data
# Define a schema for synthetic data generation
schema = {
    "age": {"type": "numerical", "mean": 35, "std": 10, "min": 18, "max": 70},
    "gender": {"type": "categorical", "categories": ["Male", "Female", "Other"], "probabilities": [0.49, 0.49, 0.02]},
    "income": {"type": "numerical", "distribution": "lognormal", "mean": 50000, "std": 15000},
    "education": {"type": "categorical", "categories": ["High School", "Bachelors", "Masters", "PhD"]},
    "purchase_frequency": {"type": "numerical", "mean": 2.5, "std": 1.2, "min": 0, "max": 10}
}
# Initialize and generate synthetic data
synthesizer = Synthesizer(schema)
synthetic_df = synthesizer.generate(num_samples=1000)
print("\n--- Synthetic Data Sample ---")
print(synthetic_df.head())
print("\n--- Synthetic Data Description ---")
print(synthetic_df.describe(include='all'))
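
For readers curious how a schema of this shape could be turned into samples, the following standard-library sketch shows one possible approach. It is not the actual `Synthesizer` implementation: `sample_numerical`, `sample_categorical`, and `generate_rows` are hypothetical names, and several behaviours are assumptions (a normal distribution by default, clipping to `min`/`max`, uniform probabilities when none are given, and `mean`/`std` of a lognormal column describing the generated values rather than the underlying normal).

```python
import math
import random


def sample_numerical(spec, rng):
    mean, std = spec["mean"], spec["std"]
    if spec.get("distribution") == "lognormal":
        # Assumption: mean/std describe the output values, so convert them
        # to the parameters (mu, sigma) of the underlying normal.
        sigma2 = math.log(1.0 + (std / mean) ** 2)
        value = rng.lognormvariate(math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2))
    else:
        value = rng.gauss(mean, std)
    # Clip to the optional bounds (simple, but piles mass at the edges).
    return min(max(value, spec.get("min", -math.inf)), spec.get("max", math.inf))


def sample_categorical(spec, rng):
    # random.choices uses uniform weights when `weights` is None.
    return rng.choices(spec["categories"], weights=spec.get("probabilities"), k=1)[0]


def generate_rows(schema, num_samples, seed=None):
    """Hypothetical generator: returns a list of dicts, one per synthetic record."""
    rng = random.Random(seed)
    samplers = {"numerical": sample_numerical, "categorical": sample_categorical}
    return [
        {name: samplers[spec["type"]](spec, rng) for name, spec in schema.items()}
        for _ in range(num_samples)
    ]


demo_schema = {
    "age": {"type": "numerical", "mean": 35, "std": 10, "min": 18, "max": 70},
    "gender": {"type": "categorical", "categories": ["Male", "Female", "Other"],
               "probabilities": [0.49, 0.49, 0.02]},
    "income": {"type": "numerical", "distribution": "lognormal", "mean": 50000, "std": 15000},
}
rows = generate_rows(demo_schema, num_samples=1000, seed=42)
print(rows[0])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` if a DataFrame is preferred. Each column is sampled independently here, so correlations between columns are not modelled.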
# Example of data augmentation (conceptual - actual implementation would vary)
# For demonstration, let's assume we have a small real dataset
real_data = pd.DataFrame({
    "age": [25, 45, 30],
    "gender": ["Female", "Male", "Female"],
    "income": [40000, 70000, 55000],
    "education": ["Bachelors", "Masters", "Bachelors"],
    "purchase_frequency": [3, 2, 4]
})
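
Since the `augment_data` call in this example is only conceptual, here is a small self-contained sketch of one simple augmentation strategy: Gaussian jitter on numeric columns. This is a deliberately different technique from SMOTE, chosen because it works on a handful of rows, and it should not be read as what `augment_data` does. The function name `jitter_augment` and its parameters are hypothetical.

```python
import random


def jitter_augment(rows, numeric_columns, copies=3, noise_fraction=0.05, seed=None):
    """Hypothetical helper: append noisy copies of each row to the originals."""
    rng = random.Random(seed)
    augmented = [dict(row) for row in rows]  # keep the original records first
    for _ in range(copies):
        for row in rows:
            new_row = dict(row)  # categorical fields are copied unchanged
            for col in numeric_columns:
                # Noise scale is proportional to the magnitude of the value.
                new_row[col] = row[col] + rng.gauss(0.0, noise_fraction * abs(row[col]))
            augmented.append(new_row)
    return augmented


# Same records as the small dataset above, written as a list of dicts
# (the form returned by real_data.to_dict("records")).
rows = [
    {"age": 25, "gender": "Female", "income": 40000, "education": "Bachelors", "purchase_frequency": 3},
    {"age": 45, "gender": "Male", "income": 70000, "education": "Masters", "purchase_frequency": 2},
    {"age": 30, "gender": "Female", "income": 55000, "education": "Bachelors", "purchase_frequency": 4},
]
augmented = jitter_augment(rows, numeric_columns=["age", "income"], copies=3, seed=0)
print(len(augmented), "rows after augmentation")
```

Jitter keeps each synthetic row close to a real one, so it adds variety but offers no privacy protection on its own.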
# Augmented data (conceptual call, actual implementation depends on strategy)
# augmented_df = augment_data(real_data, strategy='SMOTE', target_column='purchase_frequency')
# print("\n--- Augmented Data Sample ---")
# print(augmented_df.head())

We welcome contributions from the community! Please read our Contributing Guidelines for more information.

DataSynth-Gen is released under the MIT License.