DataSynth-Gen

DataSynth-Gen is a Python library for synthetic data generation for machine learning, focusing on privacy-preserving techniques and data augmentation. It enables developers and researchers to create realistic synthetic datasets, crucial for scenarios where real data is scarce, sensitive, or expensive to acquire.

Key Features

  • Diverse Data Types: Supports generation of numerical, categorical, and time-series data.
  • Privacy-Preserving: Implements techniques such as differential privacy and generative adversarial networks (GANs) to reduce the risk that synthetic records expose the real data they were derived from.
  • Data Augmentation: Provides methods to expand existing datasets, improving model robustness and performance.
  • Customizable Schemas: Define complex data schemas to generate synthetic data that closely mimics real-world distributions.
  • Integration: Designed for easy integration with popular machine learning frameworks like scikit-learn, TensorFlow, and PyTorch.
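
To make the differential privacy feature above more concrete, here is a minimal sketch of the Laplace mechanism applied to a bounded mean. It is not taken from DataSynth-Gen: the function names `laplace_noise` and `private_mean` are hypothetical, the record count is assumed to be public, and only the Python standard library is used so the snippet runs on its own.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with the same
    rate (1 / scale) follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Return an epsilon-differentially-private estimate of the mean.

    Hypothetical helper, not part of the DataSynth-Gen API.
    Values are clipped to [lower, upper], so replacing one record changes
    the mean by at most (upper - lower) / n. That bound is the sensitivity.
    """
    values = list(values)
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must not be empty")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    # Noise scale grows as epsilon shrinks (stronger privacy, less accuracy).
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


ages = [25, 45, 30, 38, 52]
noisy = private_mean(ages, lower=18, upper=70, epsilon=1.0, rng=random.Random(0))
print(f"private mean age: {noisy:.2f}")
```

A smaller `epsilon` gives stronger privacy at the cost of a noisier estimate, and the clipping bounds should be chosen without looking at the private data.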

Getting Started

Prerequisites

  • Python 3.8+
  • pandas
  • numpy
  • scikit-learn
  • (Optional) tensorflow or pytorch for GAN-based synthesis
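
One quick way to confirm these prerequisites before installing is a small check script. This is a sketch rather than something shipped with the repository: `check_environment` is a hypothetical helper, and it assumes the usual import names (`sklearn` for scikit-learn, `torch` for PyTorch).

```python
import sys
from importlib.util import find_spec

# Import names differ from package names for two of these.
REQUIRED = ("pandas", "numpy", "sklearn")  # scikit-learn imports as sklearn
OPTIONAL = ("tensorflow", "torch")         # PyTorch imports as torch


def check_environment(min_python=(3, 8)):
    """Return a dict mapping each prerequisite to True if it is available."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in REQUIRED + OPTIONAL:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

Missing entries from `OPTIONAL` only matter if you plan to use GAN-based synthesis.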

Installation

git clone https://github.com/FunctionFlow1/DataSynth-Gen.git
cd DataSynth-Gen
pip install -r requirements.txt

Usage Example

import pandas as pd
from datasynth_gen import Synthesizer, augment_data

# Define a schema for synthetic data generation
schema = {
    "age": {"type": "numerical", "mean": 35, "std": 10, "min": 18, "max": 70},
    "gender": {"type": "categorical", "categories": ["Male", "Female", "Other"], "probabilities": [0.49, 0.49, 0.02]},
    "income": {"type": "numerical", "distribution": "lognormal", "mean": 50000, "std": 15000},
    "education": {"type": "categorical", "categories": ["High School", "Bachelors", "Masters", "PhD"]},
    "purchase_frequency": {"type": "numerical", "mean": 2.5, "std": 1.2, "min": 0, "max": 10}
}

# Initialize and generate synthetic data
synthesizer = Synthesizer(schema)
synthetic_df = synthesizer.generate(num_samples=1000)

print("\n--- Synthetic Data Sample ---")
print(synthetic_df.head())
print("\n--- Synthetic Data Description ---")
print(synthetic_df.describe(include='all'))

# Example of data augmentation (conceptual - actual implementation would vary)
# For demonstration, let's assume we have a small real dataset
real_data = pd.DataFrame({
    "age": [25, 45, 30],
    "gender": ["Female", "Male", "Female"],
    "income": [40000, 70000, 55000],
    "education": ["Bachelors", "Masters", "Bachelors"],
    "purchase_frequency": [3, 2, 4]
})

# Augmented data (conceptual call, actual implementation depends on strategy)
# augmented_df = augment_data(real_data, strategy='SMOTE', target_column='purchase_frequency')
# print("\n--- Augmented Data Sample ---")
# print(augmented_df.head())
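
The schema in the example can be read as a per-column sampling recipe. The sketch below shows one way such a schema might be interpreted. It is an illustration only, not the library's implementation: `sample_numerical`, `sample_categorical`, and `generate_rows` are hypothetical names, numerical fields are assumed to be normal and clipped to `min`/`max`, categorical fields without `probabilities` are assumed to be uniform, and the `lognormal` option is deliberately left out because the source does not say how its parameters are defined.

```python
import random


def sample_numerical(spec, rng):
    """Draw from a normal distribution and clip to optional min/max bounds."""
    if spec.get("distribution", "normal") != "normal":
        raise NotImplementedError("this sketch only covers normal distributions")
    value = rng.gauss(spec["mean"], spec["std"])
    if "min" in spec:
        value = max(value, spec["min"])
    if "max" in spec:
        value = min(value, spec["max"])
    return value


def sample_categorical(spec, rng):
    """Pick one category; weights of None mean every category is equally likely."""
    weights = spec.get("probabilities")
    return rng.choices(spec["categories"], weights=weights, k=1)[0]


SAMPLERS = {"numerical": sample_numerical, "categorical": sample_categorical}


def generate_rows(schema, num_samples, seed=None):
    """Return a list of dicts, one per synthetic record (assumed behavior)."""
    rng = random.Random(seed)
    return [
        {col: SAMPLERS[spec["type"]](spec, rng) for col, spec in schema.items()}
        for _ in range(num_samples)
    ]


demo_schema = {
    "age": {"type": "numerical", "mean": 35, "std": 10, "min": 18, "max": 70},
    "gender": {"type": "categorical", "categories": ["Male", "Female", "Other"],
               "probabilities": [0.49, 0.49, 0.02]},
    "education": {"type": "categorical",
                  "categories": ["High School", "Bachelors", "Masters", "PhD"]},
}
for row in generate_rows(demo_schema, num_samples=3, seed=42):
    print(row)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(rows)`. Clipping is the simplest way to honor bounds, but it piles probability mass at `min` and `max`; resampling out-of-range draws would avoid that at the cost of extra iterations.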

Contributing

We welcome contributions from the community! Please read our Contributing Guidelines for more information.

License

DataSynth-Gen is released under the MIT License.
