# Data Generation Tutorial

## Introduction

In this tutorial, I'll walk you through generating synthetic data using state-of-the-art deep learning architectures in `Teras`.

**Model**: `CTGAN`

**Dataset**: Gemstone dataset (from Kaggle)

**Task**: Generating synthetic data using CTGAN

## Data Loading and Preprocessing

In [1]:
import pandas as pd

# We'll use the first 10000 instances
# We also drop the id column since that is useless
gem_df = pd.read_csv("./datasets/gemstone_dataset.csv").drop("id", axis=1)[:10000]
gem_df.head(3)

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,0.7,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.5,2772


In [2]:
categorical_feats = ["cut", "color", "clarity"]
numerical_feats = ["carat", "depth", "table", "x", "y", "z"]
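
If you would rather not hard-code these lists, they can usually be derived from the column dtypes. The sketch below is a minimal illustration on a tiny hand-made frame: `toy_df` and its values are made up for the example and are not taken from the gemstone data. Since the lists above leave `price` out, the sketch drops it as well.

```python
import pandas as pd

# Toy frame that mimics part of the gemstone schema; values are illustrative only.
toy_df = pd.DataFrame({
    "carat": [0.5, 1.2, 0.9],
    "cut": ["Ideal", "Premium", "Ideal"],
    "color": ["G", "F", "J"],
    "depth": [61.2, 62.0, 60.5],
    "price": [1500, 7200, 3100],
})

# The tutorial's feature lists do not include `price`, so drop it here too.
features = toy_df.drop(columns=["price"])

# Non-numeric columns are treated as categorical, numeric ones as numerical.
inferred_categorical = features.select_dtypes(exclude="number").columns.tolist()
inferred_numerical = features.select_dtypes(include="number").columns.tolist()

print(inferred_categorical)
print(inferred_numerical)
```

It is still worth checking the inferred lists by eye, since integer-coded categories would be picked up as numerical by a dtype-based split.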

### DataTransformer and DataSampler classes for generative models

Generative architectures such as `GAIN`, `PCGAIN`, `CTGAN` and `TVAE` need specialized data preprocessing and transformation, as well as a specific way of generating batches of data. To make this easier for users, `Teras` implements `DataTransformer` and `DataSampler` classes for each of these models.

These can be imported from the `teras.preprocessing.<architecture_name>` module.
For instance, for CTGAN, we import its `DataSampler` and `DataTransformer` classes as follows:

`from teras.preprocessing.ctgan import DataSampler, DataTransformer`

In [None]:
from teras.preprocessing.ctgan import DataTransformer, DataSampler

data_transformer = DataTransformer(numerical_features=numerical_feats,
                                   categorical_features=categorical_feats)
gem_transformed = data_transformer.fit_transform(gem_df)

data_sampler = DataSampler(meta_data=data_transformer.get_meta_data(),
                           categorical_features=categorical_feats,
                           numerical_features=numerical_feats,
                           batch_size=1024)
dataset = data_sampler.get_dataset(x_transformed=gem_transformed,
                                   x_original=gem_df)
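
To give a feel for why batch generation needs its own class, here is a rough, standard-library sketch of the training-by-sampling idea described in the CTGAN paper: pick a categorical column, pick one of its categories with log-frequency weighting, then draw a row that has that category. This is an assumption about the general technique, not a description of what the Teras `DataSampler` does internally; the `sample_conditioned_row` helper and the toy rows are hypothetical.

```python
import math
import random
from collections import Counter

# Toy rows; the values are made up for illustration.
rows = [
    {"cut": "Ideal", "color": "G"},
    {"cut": "Ideal", "color": "G"},
    {"cut": "Ideal", "color": "F"},
    {"cut": "Premium", "color": "G"},
    {"cut": "Fair", "color": "J"},
]
categorical_columns = ["cut", "color"]


def sample_conditioned_row(rows, categorical_columns, rng):
    """Pick a column, then a category by log-frequency, then a matching row."""
    column = rng.choice(categorical_columns)
    counts = Counter(row[column] for row in rows)
    categories = sorted(counts)
    # log(1 + count) flattens the distribution, so rare categories are
    # drawn more often than their raw frequency would suggest.
    weights = [math.log(1 + counts[c]) for c in categories]
    category = rng.choices(categories, weights=weights, k=1)[0]
    matching = [row for row in rows if row[column] == category]
    return column, category, rng.choice(matching)


rng = random.Random(0)
batch = [sample_conditioned_row(rows, categorical_columns, rng) for _ in range(4)]
for column, category, row in batch:
    print(column, category, row)
```

Every sampled row carries the category it was conditioned on, which is the property that lets the generator be trained on rare categories as well as common ones.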

## Training 

In [None]:
from teras.generative import CTGAN

Notice how we use `data_sampler.data_dim` instead of the dimensions of the original dataset.

That is because data transformation usually expands the data dimensions considerably, so it's safer to use the `DataSampler`'s `.data_dim` attribute.

Similarly, we pass `meta_data` using the `.get_meta_data()` method of the `DataTransformer` class.

In [5]:
ctgan = CTGAN(data_dim=data_sampler.data_dim,
              meta_data=data_transformer.get_meta_data())
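
To see why the transformed width can differ from the original column count, consider plain one-hot encoding. This is only a rough illustration using pandas' `get_dummies` on made-up values; the transformation Teras applies for CTGAN is more involved, so these numbers will not match `data_sampler.data_dim`.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy_df = pd.DataFrame({
    "carat": [0.5, 1.2, 0.9, 0.7],
    "cut": ["Ideal", "Premium", "Fair", "Ideal"],
    "color": ["G", "F", "J", "G"],
})

# Each categorical column becomes one column per distinct category.
encoded = pd.get_dummies(toy_df, columns=["cut", "color"])

print(toy_df.shape[1])   # 3 original columns
print(encoded.shape[1])  # 7 columns: carat + 3 cuts + 3 colors
```

Even in this tiny case the width more than doubles, which is why hard-coding the original column count as the model's input dimension would be a mistake.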

Highly customized architectures like this one employ custom loss functions, and `Teras` has default values in place for them.
It's therefore recommended to call the `compile` method without any parameters unless you understand the underlying structure.

Read more in **Section 4** of the *General Guidelines and FAQs* notebook in the tutorial directory.

In [6]:
ctgan.compile()

In [None]:
history = ctgan.fit(dataset, epochs=2)

## Generating new data

All generative models offered by `Teras` have a `.generate` method that can be used after training to generate new data samples.

The `.generate` method expects the `num_samples` parameter along with the `DataSampler` instance and the `DataTransformer` instance that was used to transform the original data during preprocessing. The `DataTransformer` instance is used to reverse transform the generated data back to the original data format.

In [8]:
generated_data = ctgan.generate(num_samples=1000,
                                data_sampler=data_sampler,
                                data_transformer=data_transformer,
                                reverse_transform=True)
generated_data.head()

Generating Data: 100%|██████████| 1/1 [00:01<00:00,  1.32s/it]


Unnamed: 0,carat,depth,table,x,y,z,cut,color,clarity
0,1.360215,60.123539,62.105259,6.442325,5.218063,3.226227,Fair,G,SI1
1,1.294181,61.860909,55.327568,6.760636,5.747765,5.011518,Very Good,G,SI1
2,0.891586,63.083645,54.066082,6.760636,5.747765,5.012003,Very Good,G,SI1
3,0.327514,60.526867,57.000118,7.525471,5.747765,5.011518,Very Good,G,SI1
4,1.495654,63.083645,53.911446,6.78633,5.747765,5.021211,Very Good,G,SI1
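
The reverse transform step mentioned above turns model output back into readable values. As a minimal sketch of the general idea for a single categorical column, assuming the model emits one score per category, the label can be recovered by taking the highest score in each row. The category order and the scores below are made up, and this is not how Teras necessarily implements it.

```python
import numpy as np

# Assumed category order used when the column was encoded.
cut_categories = ["Fair", "Ideal", "Premium"]

# Made-up generator output: one row per sample, one score per category.
generated_scores = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Take the highest-scoring category in each row and map it back to its label.
indices = generated_scores.argmax(axis=1)
decoded_cut = [cut_categories[i] for i in indices]

print(decoded_cut)  # ['Ideal', 'Fair', 'Premium']
```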


## Wrapping it up!

And that wraps up our data generation tutorial using Teras.

If you need more help, consult the documentation and other available resources. If that still leaves you with questions, feel free to raise an issue or email me at khawaja.abaid@gmail.com.

If you find `Teras` useful, please consider giving it a star on GitHub and sharing it with others!

Thank you!