# Tabular synthetic data 
### A generation example with **Clearbox Synthetic Kit**

This notebook walks you through the tabular synthetic data generation process with **Clearbox Synthetic Kit**.

You can run this notebook on Google Colab or on your local machine.<br>
If you run it locally, we strongly recommend creating a dedicated virtual environment.

<div class="alert alert-secondary">
To run this notebook, make sure you change the runtime to <strong>GPU</strong><br>
<hr>
<strong>Runtime</strong> --> <strong>Change Runtime Type</strong> <br>
and set <strong>Hardware Accelerator</strong> to "<strong>GPU</strong>"
</div>
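
As a quick sanity check after switching the runtime, the sketch below looks for the `nvidia-smi` tool and asks it to list the GPUs the driver can see. It uses only the Python standard library; the helper name `nvidia_gpu_visible` is hypothetical and not part of Clearbox Synthetic Kit. A positive result only means an NVIDIA driver is reachable, not that the library will actually use the GPU.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and reports at least one GPU."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False
    try:
        result = subprocess.run(
            [executable, "-L"],  # "-L" lists the GPUs the driver can see
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("NVIDIA GPU visible:", nvidia_gpu_visible())
```

If this prints `False` on Colab, the runtime type has most likely not been switched yet.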

In [None]:
# Install the library and its dependencies

%pip install clearbox-synthetic-kit

In [None]:
# Import necessary dependencies
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from clearbox_synthetic.utils import Dataset, Preprocessor
from clearbox_synthetic.generation import TabularEngine, LabeledSynthesizer

## 0. Data import and preparation

In [None]:
# Load the example datasets from GitHub

file_path = "https://raw.githubusercontent.com/Clearbox-AI/clearbox-synthetic-kit/main/tests/resources/uci_adult_dataset"

# Build the URLs with "/" explicitly: os.path.join would insert "\" on Windows
train_dataset = Dataset.from_csv(
    f"{file_path}/dataset.csv",
    target_column="income",
    regression=False,
)

# Reuse the column types and target settings inferred from the training dataset
validation_dataset = Dataset.from_csv(
    f"{file_path}/validation_dataset.csv",
    column_types=train_dataset.column_types,
    target_column=train_dataset.target_column,
    regression=train_dataset.regression,
)

### Data pre-processing
Datasets are pre-processed with the **Preprocessor** class, which prepares the data for the subsequent steps.

In [None]:
# Preprocessor initialization
preprocessor = Preprocessor(train_dataset) 

# Preprocessing training dataset 
X_train_raw = train_dataset.get_x() # Get all columns of the training dataset except the target column (y)
X_train = preprocessor.transform(X_train_raw)

In [None]:
# Preprocessing validation dataset

X_val_raw = validation_dataset.get_x() # Get all columns of the validation dataset except the target column (y)
X_val = preprocessor.transform(X_val_raw)
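
Note that the validation set is transformed with the preprocessor that was initialized on the training set. The sketch below illustrates the general idea with plain pandas on invented toy data: scaling statistics and category lists are learned from the training data only and then reused unchanged. It is an illustration under these assumptions, not the internals of the **Preprocessor** class, and all names in it (`train_toy`, `val_toy`, `transform_toy`) are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
train_toy = pd.DataFrame(
    {"age": [20, 40, 60], "workclass": ["Private", "State-gov", "Private"]}
)
val_toy = pd.DataFrame({"age": [30, 50], "workclass": ["Private", "Self-emp"]})

# "Fit": learn statistics and categories from the training data only
age_min, age_max = train_toy["age"].min(), train_toy["age"].max()
categories = sorted(train_toy["workclass"].unique())


def transform_toy(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale `age` and one-hot encode `workclass` using training statistics."""
    out = pd.DataFrame(index=df.index)
    out["age"] = (df["age"] - age_min) / (age_max - age_min)
    # A fixed category list keeps the columns identical across datasets;
    # values unseen during fitting (here "Self-emp") become all-zero rows.
    cat = pd.Series(
        pd.Categorical(df["workclass"], categories=categories), index=df.index
    )
    one_hot = pd.get_dummies(cat, prefix="workclass", dtype=float)
    return pd.concat([out, one_hot], axis=1)


X_train_toy = transform_toy(train_toy)
X_val_toy = transform_toy(val_toy)
print(X_val_toy)
```

Because both frames pass through the same fitted transformation, they end up with the same columns in the same order, which is what a model trained on one and evaluated on the other needs.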

In [6]:
# Prepare the target column (y) of the training dataset:
# normalize it for regression tasks, one-hot encode it otherwise

if train_dataset.regression:
    Y = train_dataset.get_normalized_y()
else:
    Y = train_dataset.get_one_hot_encoded_y()
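
To make the two branches concrete, here is a minimal sketch of what one-hot encoding and normalization of a target column typically look like, written with pandas and NumPy on invented toy values. Standardization (zero mean, unit variance) is used for the regression case as an assumption; the exact scheme applied by `get_normalized_y()` may differ.

```python
import numpy as np
import pandas as pd

# Classification: one column per class, exactly one 1.0 per row
income_toy = pd.Series(["<=50K", ">50K", "<=50K"], name="income")  # toy labels
y_one_hot = pd.get_dummies(income_toy, dtype=float).to_numpy()

# Regression: rescale a continuous target (standardization, as an assumption)
hours_toy = np.array([20.0, 40.0, 60.0])  # toy continuous target
y_normalized = (hours_toy - hours_toy.mean()) / hours_toy.std()

print(y_one_hot.shape)  # one row per sample, one column per class
print(y_normalized)
```

In the one-hot case the per-sample shape equals the number of classes, which is the kind of value later passed as `y_shape` through `Y[0].shape`.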

## 1. Synthetic Data Generation

In [None]:
# Initialize the tabular synthetic data generator

engine = TabularEngine(
    layers_size=[50],
    x_shape=X_train[0].shape,
    y_shape=Y[0].shape,
    numerical_feature_sizes=preprocessor.get_features_sizes()[0],
    categorical_feature_sizes=preprocessor.get_features_sizes()[1],
)

# Start the training of the tabular synthetic data generator

engine.fit(X_train, y_train_ds=Y, epochs=5, learning_rate=0.001)

In [10]:
# Initialize the Synthesizer for data generation

synthesizer = LabeledSynthesizer(train_dataset, engine)

In [11]:
# Generate the synthetic dataset with the Synthesizer and save it to a .csv file

pd_synthetic_dataset = synthesizer.generate(has_header=True)

pd_synthetic_dataset.to_csv("synthetic_dataset.csv", index=False)

In [12]:
# Load the synthetic dataset

synthetic_dataset = Dataset.from_csv(
        "synthetic_dataset.csv",
        column_types=train_dataset.column_types,  
        target_column=train_dataset.target_column, 
        regression=train_dataset.regression
    )
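
Once the synthetic dataset is available, a natural first step is to check how closely it follows the original data. The sketch below is one simple way to do that with plain pandas: it compares column means for numerical columns and category frequencies for the others. The function `compare_columns` and the toy frames are hypothetical and not part of Clearbox Synthetic Kit; this is a rough sanity check, not a full quality evaluation.

```python
import pandas as pd


def compare_columns(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Summarize, per shared column, how far the synthetic data is from the real data.

    Numerical columns: absolute difference between the means.
    Other columns: total variation distance between category frequencies
    (0 means identical frequencies, 1 means no overlap).
    """
    rows = []
    for column in real.columns.intersection(synthetic.columns):
        if pd.api.types.is_numeric_dtype(real[column]):
            gap = abs(real[column].mean() - synthetic[column].mean())
            metric = "mean_abs_diff"
        else:
            real_freq = real[column].value_counts(normalize=True)
            synth_freq = synthetic[column].value_counts(normalize=True)
            # Align on the union of categories; missing ones count as frequency 0
            gap = 0.5 * real_freq.sub(synth_freq, fill_value=0.0).abs().sum()
            metric = "total_variation"
        rows.append({"column": column, "metric": metric, "value": float(gap)})
    return pd.DataFrame(rows)


# Toy frames, invented for illustration only
real_toy = pd.DataFrame({"age": [20, 40, 60], "income": ["<=50K", ">50K", "<=50K"]})
synthetic_toy = pd.DataFrame(
    {"age": [25, 45, 65], "income": ["<=50K", "<=50K", "<=50K"]}
)
report = compare_columns(real_toy, synthetic_toy)
print(report)
```

In this notebook, the same function could be applied to two data frames read with `pd.read_csv`, one from the original training CSV and one from `synthetic_dataset.csv`.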