# Hands-on Synthetic data generation - Why and how?

### Synthetic data

**Synthetic data** is artificially generated information that mimics the statistical properties of real-world data but does
not directly correspond to real events or individuals. This type of data is created through algorithms and statistical models,
such as generative adversarial networks (GANs) or other simulation techniques.

In [None]:
!pip install ydata-profiling==4.2.*
!pip install ydata-synthetic==1.4.*

### The dataset
The data used is the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income) dataset, which we will fetch with the pmlb library (a Python wrapper for the Penn Machine Learning Benchmarks data repository).

In [1]:
from pmlb import fetch_data

In [2]:
# Load data
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'native-country', 'target']

### Keeping all the analysis unbiased - Holdout

#### Train Test Data split and the holdout

Holdout refers to a portion of historical data that is kept out of the datasets used for training and validating machine learning models. Some might recognize the **holdout** set as the **test** dataset.
*So why is it relevant if we already create train and validation sets?*

- The **train** dataset always returns optimistic model performance, as the model has seen all the training data throughout the training process.
- The **validation** set is less optimistic than the training set, but still somewhat optimistic. Why? Because it is also used to select the best model, so the results obtained on it are biased.
- The **holdout** or **test** dataset is completely independent from training and model selection. For that reason it is the best set for computing unbiased performance metrics that properly represent the behaviour of the model on new data inputs (e.g. a production system).

In [None]:
from sklearn.model_selection import train_test_split

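# Separate the features from the target column
y = data['target']
X = data.iloc[:, data.columns != 'target']

# Hold out 30% of the rows; the fixed random_state makes the split reproducible
X_train, X_Hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=123)

# Re-attach the target so each split is a complete table again
X_train['target'] = y_train
X_Hold['target'] = y_hold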

X_train.info()
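The cell above creates only the train and holdout sets. The validation set discussed earlier would be carved out of the training portion. Here is a minimal sketch of such a three-way split, using toy data and a hypothetical helper named `three_way_split` (neither is part of the original notebook):

```python
import numpy as np
import pandas as pd


def three_way_split(df, holdout_frac=0.3, valid_frac=0.2, seed=123):
    """Split df into train, validation and holdout sets (hypothetical helper)."""
    # Shuffle once with a fixed seed so the split is reproducible.
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_hold = int(round(len(shuffled) * holdout_frac))
    # The holdout is set aside first and never used for training or model selection.
    holdout = shuffled.iloc[:n_hold]
    rest = shuffled.iloc[n_hold:]
    # The validation set is taken from what remains.
    n_valid = int(round(len(rest) * valid_frac))
    valid = rest.iloc[:n_valid]
    train = rest.iloc[n_valid:]
    return train, valid, holdout


# Toy data standing in for the census table.
toy = pd.DataFrame({"age": np.arange(100), "target": np.arange(100) % 2})
train, valid, holdout = three_way_split(toy)
print(len(train), len(valid), len(holdout))
```

The fractions are illustrative. What matters is that the holdout rows are removed before any training or model selection takes place.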

## 1. Data profiling

Data profiling (from exploratory analysis to data monitoring) helps us understand how the data behaves and how best to select the model and process for synthetic data generation. Does my data have missing values? Do I want to keep the missing-data behaviour or have it imputed during the process? What are the current distributions per variable? Are there any outliers?

Data profiling, in particular, is a powerful tool that lets us examine the data and assess not only its validity for a certain use case but also its quality.

Let's start with the overall data quality.

In [None]:
from ydata_profiling import ProfileReport

report = ProfileReport(X_train,
                       title='Census dataset',
                       minimal=True)
report
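The report answers the questions above interactively. For a quick programmatic check of the same three things (missing data, distributions, outliers), a rough pandas sketch could look like this. The toy frame and its values are made up for illustration, and the 1.5 * IQR rule is just one common outlier heuristic, not what the report itself uses:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; the values are made up for illustration.
df = pd.DataFrame({
    "age": [25, 38, 47, 52, 29, 33, 41, 95],
    "hours-per-week": [40, 45, np.nan, 38, 40, 60, np.nan, 40],
})

# 1. Missing data: share of missing values per column.
missing_share = df.isna().mean()

# 2. Distributions: summary statistics per numeric column.
summary = df.describe()

# 3. Outliers: values more than 1.5 * IQR away from the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_share)
print(summary)
print(outliers)
```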

### 1.1. Data profiling for time-series

Time-series datasets have different requirements when it comes to data profiling and exploratory data analysis. Don't worry: if you want to measure stationarity or check whether your data has seasonality, we've got you covered.

Check this tutorial on [Time-series EDA](https://towardsdatascience.com/how-to-do-an-eda-for-time-series-cbb92b3b1913).
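As a small, library-free illustration of a seasonality check, we can look at the autocorrelation of a series at a candidate seasonal lag. This is only a sketch on a toy series with an assumed 12-step cycle, and `autocorrelation` is a hypothetical helper, not the method used by the profiling report:

```python
import numpy as np


def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Toy monthly series with a 12-step seasonal cycle plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(240)
series = np.sin(2 * np.pi * t / 12) + 0.1 * rng.normal(size=t.size)

acf_12 = autocorrelation(series, 12)  # one full cycle apart
acf_6 = autocorrelation(series, 6)    # half a cycle apart
print(round(acf_12, 2), round(acf_6, 2))
```

A strongly positive value at lag 12 together with a strongly negative value at lag 6 is what a 12-step seasonal pattern looks like. A series without seasonality would show values close to zero at both lags.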


## 2. Synthetic data generation

In [None]:
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

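# 'fast' selects the quick, density-based synthesizer
synth = RegularSynthesizer(modelname='fast')
# Fit on the training split only, so the holdout stays unseen by the generator
synth.fit(data=X_train, num_cols=num_cols, cat_cols=cat_cols)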

In [18]:
synth_data = synth.sample(len(X_Hold))

In [None]:
synth_data.head()
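The `fast` model follows a density-based approach: estimate a probability density from the real data, then draw new rows from it. To make that idea concrete, here is a deliberately simplified sketch that fits a single multivariate Gaussian with NumPy. It illustrates the principle only and is not the library's implementation. The toy columns and all parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" data: two correlated numeric columns (say, age and hours worked).
real = rng.multivariate_normal(mean=[40.0, 38.0],
                               cov=[[100.0, 30.0], [30.0, 64.0]],
                               size=5000)

# Fit: estimate the density parameters from the real data.
mean_hat = real.mean(axis=0)
cov_hat = np.cov(real, rowvar=False)

# Sample: draw new rows from the fitted density rather than copying real rows.
synthetic = rng.multivariate_normal(mean_hat, cov_hat, size=1500)

print(np.round(mean_hat, 1), np.round(synthetic.mean(axis=0), 1))
```

The synthetic rows reproduce the means and the correlation of the real columns without being copies of any real record. That is the core promise of synthetic data described at the start of this tutorial.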

#### 2.1. Augmentation with Generative AI (CTGAN - Challenge)

Because training GANs can take from minutes to hours depending on the size of the dataset, and because they also require GPU acceleration, we have used a faster synthesis method based on density methods.

But what would the results be if we had used CTGAN to balance our data? Would it bring more variability and help our model generalize?

In [None]:
# Add here the code for the challenge.
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Defining the training parameters
batch_size = 500
epochs = 500+1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)

In [None]:
# Training our GAN-based synthetic data generation method
synth = RegularSynthesizer(modelname='ctgan',
                           model_parameters=ctgan_args)

# Fit on the training split only, so the holdout stays unseen by the generator
synth.fit(data=X_train, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

In [None]:
# Decide the size of the synthetic sample you need to generate to balance the original data
synth_data = synth.sample(len(X_Hold))
print(synth_data)
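One way to decide how many synthetic rows are needed for balancing is to count how far each class falls short of the largest one, then top it up with synthetic rows of that class. The sketch below uses toy frames in place of the real and synthetic tables. The variable names are illustrative, and this is one possible balancing strategy rather than a prescribed one:

```python
import pandas as pd

# Toy stand-ins: the class column is called "target", as in the dataset above.
real = pd.DataFrame({"target": [0] * 8 + [1] * 3})
synthetic = pd.DataFrame({"target": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]})

# How many rows is each class short of the largest class?
counts = real["target"].value_counts()
deficit = counts.max() - counts

# Top up each under-represented class with synthetic rows of that class.
extra = [
    synthetic[synthetic["target"] == label].head(int(n_missing))
    for label, n_missing in deficit.items()
    if n_missing > 0
]
balanced = pd.concat([real, *extra], ignore_index=True)

print(balanced["target"].value_counts().to_dict())
```

If the synthetic sample does not contain enough rows of the minority class, the class stays short, which is a sign that a larger sample should be generated.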

## 3. Synthetic data quality validation & iteration

The process of **synthetic data** generation is not complete until we evaluate the quality of the generated data and iterate until the expected results are achieved, whether the goal is privacy, augmentation, de-biasing the data, etc.

To understand one of the dimensions of synthetic data quality, we will once again use the ydata-profiling report.

### Remember the holdout?

Now that we have generated our synthetic data, it is time to evaluate its quality. We will use the ydata-profiling report once again, but this time to compare the original data against the synthetic data.



In [None]:
syntheticdata_report = ProfileReport(synth_data,
                                    title='Census Synthetic Data',
                                    minimal=True)

compare_report = report.compare(syntheticdata_report)

In [None]:
compare_report
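The comparison report is visual. To attach a number to how closely a categorical column was reproduced, one option is the total variation distance between the category frequencies of the real and synthetic data. This is a minimal sketch on toy columns. The helper `total_variation` is hypothetical and is not part of ydata-profiling:

```python
import pandas as pd


def total_variation(real_col, synth_col):
    """Total variation distance between two categorical columns (0 = identical)."""
    p = real_col.value_counts(normalize=True)
    q = synth_col.value_counts(normalize=True)
    # Align the categories; a category missing on one side gets probability 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


# Toy columns standing in for the real and the synthetic "sex" column.
real = pd.DataFrame({"sex": ["F", "M", "M", "F", "M", "M", "F", "M"]})
synth = pd.DataFrame({"sex": ["M", "M", "F", "M", "M", "M", "M", "F"]})

tvd = total_variation(real["sex"], synth["sex"])
print(tvd)
```

The distance ranges from 0 (identical frequencies) to 1 (no category in common), so it can be tracked per column across iterations of the generator.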

### The importance of data pipelines

Throughout this short tutorial we have seen how easily we can set up a flow for synthetic data generation. But for this flow to be reproducible and versionable we need something else: pipelines!

Data pipelines streamline the flow of data through various stages—from extraction and processing to modeling and storage—providing a structured environment for automating and monitoring each step. This structured approach is particularly crucial for synthetic data generation, which often involves complex simulations or algorithms to produce data that is both realistic and diverse.

Effective data pipelines enable the generation of high-quality synthetic data by ensuring that the input data is correctly preprocessed and that the generation algorithms are executed under controlled and reproducible conditions. This results in synthetic datasets that closely mimic real-world data distributions without exposing sensitive information, thereby facilitating more accurate and ethical AI models.

Moreover, scalable data pipelines are essential for handling large volumes of data, allowing organizations to generate and utilize synthetic data at a scale that matches their needs. By automating repetitive tasks, data pipelines also free up data scientists to focus on more strategic tasks such as improving data models and extracting valuable insights.
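To give a feel for what "reproducible and versionable" can mean in code, here is a toy pipeline written with the Python standard library only. Every step, name and config key is hypothetical, and the "synthesizer" is a trivial placeholder. The point is only that fixed seeds plus a hash of the configuration make a run repeatable and traceable:

```python
import hashlib
import json
import random


def extract(config):
    # Stand-in for reading real data; here it just generates toy numbers.
    rng = random.Random(config["seed"])
    return [rng.gauss(40, 10) for _ in range(config["n_rows"])]


def preprocess(rows, config):
    # Drop values outside a plausible range.
    return [r for r in rows if config["min_value"] <= r <= config["max_value"]]


def synthesize(rows, config):
    # Placeholder generator: resample with small noise, seeded for reproducibility.
    rng = random.Random(config["seed"] + 1)
    return [rng.choice(rows) + rng.gauss(0, 1) for _ in range(config["n_synth"])]


def run_pipeline(config):
    # The config hash identifies a run, so outputs can be traced to their settings.
    version = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]
    rows = preprocess(extract(config), config)
    return version, synthesize(rows, config)


config = {"seed": 123, "n_rows": 200, "min_value": 17, "max_value": 90, "n_synth": 50}
version_a, out_a = run_pipeline(config)
version_b, out_b = run_pipeline(config)
print(version_a, out_a == out_b)
```

Running the pipeline twice with the same configuration yields the same version tag and the same synthetic output. Changing any setting changes the tag.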

Check how this can be done in a drag-and-drop manner in **YData Fabric** (https://ydata.ai/register).

![ydata fabric pipelines](img/ydata_fabric_pipelines.png)
