DataSynth-Gen

DataSynth-Gen is a Python library for generating synthetic data for machine learning, with a focus on privacy-preserving techniques and data augmentation. It lets developers and researchers create realistic synthetic datasets for cases where real data is scarce, sensitive, or expensive to acquire.

Key Features

Diverse Data Types: Supports generation of numerical, categorical, and time-series data.
Privacy-Preserving: Implements techniques such as differential privacy and generative adversarial networks (GANs) to reduce the risk of exposing individual records from the source data.
Data Augmentation: Provides methods to expand existing datasets, improving model robustness and performance.
Customizable Schemas: Define complex data schemas to generate synthetic data that closely mimics real-world distributions.
Integration: Designed for easy integration with popular machine learning frameworks like scikit-learn, TensorFlow, and PyTorch.
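
The differential privacy feature listed above can be illustrated with the Laplace mechanism. The sketch below is not DataSynth-Gen's implementation, which this README does not describe. `laplace_noise` and `private_mean` are hypothetical helper names, and the sketch assumes that the number of records is public and that neighboring datasets differ in the value of one record.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Return an epsilon-differentially-private estimate of the mean.

    Values are clipped to [lower, upper], so changing one record moves
    the mean by at most (upper - lower) / n. That bound is the
    sensitivity used to scale the noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not values:
        raise ValueError("values must not be empty")
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    true_mean = sum(clipped) / n
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    ages = [25, 45, 30, 52, 38, 61, 29, 44]
    # Smaller epsilon means stronger privacy and a noisier estimate.
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_mean(ages, lower=18, upper=70, epsilon=eps))
```

A smaller `epsilon` adds more noise, so the privacy guarantee is paid for with accuracy. The clipping bounds must be chosen without looking at the private data.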

Getting Started

Prerequisites

Python 3.8+
pandas
numpy
scikit-learn
(Optional) tensorflow or pytorch (installed with pip as torch) for GAN-based synthesis

Installation

git clone https://github.com/FunctionFlow1/DataSynth-Gen.git
cd DataSynth-Gen
pip install -r requirements.txt
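
After installing, it can help to confirm that the prerequisites are importable. The snippet below is a small sketch that uses only the standard library; `check_prerequisites` is a hypothetical helper and not part of DataSynth-Gen. Note that the import names differ from the pip names for scikit-learn (`sklearn`) and PyTorch (`torch`).

```python
import importlib.util
import sys

# Import names for the prerequisites listed above.
REQUIRED = ("pandas", "numpy", "sklearn")
OPTIONAL = ("tensorflow", "torch")


def check_prerequisites():
    """Return {import_name: bool} saying whether each module can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED + OPTIONAL
    }


if __name__ == "__main__":
    print("Python 3.8+:", sys.version_info >= (3, 8))
    for name, found in check_prerequisites().items():
        kind = "required" if name in REQUIRED else "optional"
        print(f"{name} ({kind}): {'found' if found else 'missing'}")
```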

Usage Example

import pandas as pd
from datasynth_gen import Synthesizer, augment_data

# Define a schema for synthetic data generation
schema = {
    "age": {"type": "numerical", "mean": 35, "std": 10, "min": 18, "max": 70},
    "gender": {"type": "categorical", "categories": ["Male", "Female", "Other"], "probabilities": [0.49, 0.49, 0.02]},
    "income": {"type": "numerical", "distribution": "lognormal", "mean": 50000, "std": 15000},
    "education": {"type": "categorical", "categories": ["High School", "Bachelors", "Masters", "PhD"]},
    "purchase_frequency": {"type": "numerical", "mean": 2.5, "std": 1.2, "min": 0, "max": 10}
}

# Initialize and generate synthetic data
synthesizer = Synthesizer(schema)
synthetic_df = synthesizer.generate(num_samples=1000)

print("\n--- Synthetic Data Sample ---")
print(synthetic_df.head())
print("\n--- Synthetic Data Description ---")
print(synthetic_df.describe(include='all'))

# Example of data augmentation (conceptual - actual implementation would vary)
# For demonstration, let's assume we have a small real dataset
real_data = pd.DataFrame({
    "age": [25, 45, 30],
    "gender": ["Female", "Male", "Female"],
    "income": [40000, 70000, 55000],
    "education": ["Bachelors", "Masters", "Bachelors"],
    "purchase_frequency": [3, 2, 4]
})

# Augmented data (conceptual call, actual implementation depends on strategy)
# augmented_df = augment_data(real_data, strategy='SMOTE', target_column='purchase_frequency')
# print("\n--- Augmented Data Sample ---")
# print(augmented_df.head())
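
To make the schema format above more concrete, here is a rough sketch of how such a schema could be sampled with the Python standard library alone. It does not show how `Synthesizer` works internally. The function names are hypothetical, and three behaviors are assumptions: `mean` and `std` on a lognormal column describe the generated values themselves (not the underlying normal), `min`/`max` are enforced by resampling, and a categorical column without `probabilities` is sampled uniformly.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert the desired mean/std of a lognormal variable into the
    mu/sigma of the underlying normal distribution."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def sample_column(spec, n, rng):
    """Draw n values for one schema entry."""
    if spec["type"] == "categorical":
        # weights=None makes random.choices sample uniformly.
        return rng.choices(spec["categories"],
                           weights=spec.get("probabilities"), k=n)
    if spec["type"] != "numerical":
        raise ValueError(f"unsupported type: {spec['type']}")

    if spec.get("distribution") == "lognormal":
        mu, sigma = lognormal_params(spec["mean"], spec["std"])
        draw = lambda: rng.lognormvariate(mu, sigma)
    else:
        draw = lambda: rng.gauss(spec["mean"], spec["std"])

    low = spec.get("min", -math.inf)
    high = spec.get("max", math.inf)
    values, attempts = [], 0
    # Rejection sampling: keep only draws that fall inside [low, high].
    while len(values) < n:
        attempts += 1
        if attempts > 1000 * n + 1000:
            raise RuntimeError("bounds reject almost every draw")
        x = draw()
        if low <= x <= high:
            values.append(x)
    return values


def generate_columns(schema, n, seed=None):
    """Return {column_name: list of n sampled values}."""
    rng = random.Random(seed)
    return {name: sample_column(spec, n, rng) for name, spec in schema.items()}
```

The returned dictionary of equal-length lists can be passed straight to `pd.DataFrame(...)`. Resampling keeps the shape of the distribution inside the bounds, whereas clipping would pile values up at `min` and `max`; which of the two the library uses is not stated here.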

Contributing

We welcome contributions from the community! Please read our Contributing Guidelines for more information.

License

DataSynth-Gen is released under the MIT License.
