# CTGAN Model

In this guide we present the `CTGAN` model: a GAN-based Deep Learning data synthesizer that can generate synthetic tabular data with high fidelity. It is based on our [CTGAN Library](https://github.com/sdv-dev/CTGAN).

<div class="alert alert-info">

**NOTE**

In this guide we will walk you through the specific functionalities of the `CTGAN` model.
For a more general view of the SDV Tabular Models and their common functionalities, please visit
the [Tabular Models](01_Tabular_Models.ipynb) guide.

</div>

## Modeling Tabular data using Conditional GAN

CTGAN is a Deep Learning based data synthesizer that uses Generative Adversarial Networks
to generate tabular data. It was presented at the NeurIPS 2019 conference in the
paper titled [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503).
For more details about it, please read the linked paper and visit the
[CTGAN library](https://github.com/sdv-dev/CTGAN).

Let's now see how to fit a model to a dataset and then generate synthetic data with the same
format and statistical properties using the `CTGAN` class from SDV.

## Introducing CTGAN

We will start by loading one of our demo datasets, `student_placements`, which we used in the
[Tabular Models](01_Tabular_Models.ipynb) guide.

In [1]:
# Set up logging and warnings - change ERROR to INFO for increased verbosity
import logging
logging.basicConfig(level=logging.ERROR)

logging.getLogger().setLevel(level=logging.WARNING)
logging.getLogger('sdv').setLevel(level=logging.ERROR)

import warnings
warnings.simplefilter("ignore")

In [2]:
from sdv.demo import load_tabular_demo

data = load_tabular_demo('student_placements')
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| student_id | 155368 | 155369 | 155370 | 155371 | 155372 |
| gender | M | M | M | M | M |
| second_perc | 67 | 79.33 | 65 | 56 | 85.8 |
| high_perc | 91 | 78.33 | 68 | 52 | 73.6 |
| high_spec | Commerce | Science | Arts | Science | Commerce |
| degree_perc | 58 | 77.48 | 64 | 52 | 73.3 |
| degree_type | Sci&Tech | Sci&Tech | Comm&Mgmt | Sci&Tech | Comm&Mgmt |
| work_experience | False | True | False | False | False |
| experience_years | 0 | 1 | 0 | 0 | 0 |
| employability_perc | 55 | 86.5 | 75 | 66 | 96.8 |


As we learned in the [Tabular Models](01_Tabular_Models.ipynb) guide, the first
step in using a tabular model like `CTGAN` is to import its class and create an
instance of it, passing the details about our data.

In this case, we only need to indicate that the primary key is the `student_id`
field and then call the model's `fit` method.

In [3]:
from sdv.tabular import CTGAN

model = CTGAN(
    primary_key='student_id',
)

In [4]:
model.fit(data)

After this is done, we can simply call its `sample` method to obtain
synthetically generated data from it:

In [5]:
new_data = model.sample(len(data))

In [6]:
new_data.head()

| | student_id | gender | second_perc | high_perc | high_spec | degree_perc | degree_type | work_experience | experience_years | employability_perc | mba_spec | mba_perc | salary | placed | start_date | end_date | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | F | 61.108031 | 52.981301 | Commerce | 66.319547 | Comm&Mgmt | True | 0 | 68.962617 | Mkt&HR | 80.480863 | | True | 2020-03-03 06:18:23.293556224 | NaT | |
| 1 | 1 | F | 36.36001 | 32.877947 | Arts | 68.147573 | Others | False | 0 | 45.740355 | Mkt&Fin | 63.067984 | | True | NaT | 2020-08-18 02:03:56.491553792 | 6.0 |
| 2 | 2 | M | 86.177856 | 63.062234 | Science | 67.478922 | Sci&Tech | True | 0 | 87.830416 | Mkt&HR | 53.663343 | 26539.203952 | True | 2020-01-21 14:51:32.047771136 | 2020-07-07 17:46:48.295203072 | 3.0 |
| 3 | 3 | F | 89.697271 | 51.764268 | Science | 48.069979 | Comm&Mgmt | False | 0 | 73.583232 | Mkt&HR | 56.094191 | 21495.804429 | True | 2020-01-05 13:14:18.874041344 | 2020-08-11 14:33:20.861589248 | 6.0 |
| 4 | 4 | M | 50.196195 | 52.490953 | Commerce | 70.737148 | Comm&Mgmt | True | 0 | 110.250433 | Mkt&Fin | 46.988378 | 20650.888524 | True | 2020-01-03 23:58:19.519774976 | 2020-08-14 16:16:14.812664832 | 12.0 |


## CTGAN Hyperparameters

Apart from the common Tabular Model arguments, `CTGAN` has a number of additional
hyperparameters that control its learning behavior and can impact the
performance of the model, both in terms of the quality of the generated data
and of computation time.

### epochs and batch size

The first hyperparameters that we see are the `epochs` and `batch_size` arguments,
which control the number of iterations that the model will perform to optimize
its parameters, as well as the number of samples used in each step.

Their default values are `300` and `500` respectively, and `batch_size` must
always be a multiple of `10`.
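As a minimal sketch, the multiple-of-10 rule can be checked before a model is created. The helper name `check_batch_size` is hypothetical and is not part of SDV; it only encodes the rule stated above.

```python
def check_batch_size(batch_size, multiple=10):
    """Return batch_size unchanged if it is a positive multiple of `multiple`.

    Hypothetical helper, not part of SDV.
    """
    if batch_size <= 0 or batch_size % multiple != 0:
        raise ValueError(
            f'batch_size must be a positive multiple of {multiple}, got {batch_size}'
        )
    return batch_size


# 100 is a multiple of 10, so the value is returned unchanged
batch_size = check_batch_size(100)
```

The validated value could then be passed as the `batch_size` argument when creating the model.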

These hyperparameters have a very direct effect on how long the training process lasts,
but also on the quality of the generated data.

For new datasets, you might want to start by setting a low value for both of them
to see how long the training process takes on your data, and later on increase them
to acceptable values in order to improve the performance.
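To get a rough feel for how these two values interact, the sketch below estimates how many optimization steps a training run performs. It assumes that one epoch consists of about `n_rows // batch_size` steps (and at least one), which is an assumption about the training loop and not something stated in this guide. Both `estimate_steps` and the table size are hypothetical.

```python
def estimate_steps(n_rows, epochs, batch_size):
    """Rough number of optimization steps for one training run.

    Hypothetical helper; assumes each epoch performs about
    n_rows // batch_size steps, and at least one.
    """
    steps_per_epoch = max(n_rows // batch_size, 1)
    return epochs * steps_per_epoch


n_rows = 1000  # illustrative table size, not the demo dataset

quick_run = estimate_steps(n_rows, epochs=10, batch_size=500)
default_run = estimate_steps(n_rows, epochs=300, batch_size=500)

print(quick_run, default_run)
```

Under this assumption, the number of steps grows linearly with `epochs`, while a smaller `batch_size` means more steps per epoch.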

### log_frequency

Whether to use log frequency of categorical levels in conditional sampling.

Defaults to `True`.

This argument affects how the model processes the frequencies of the categorical
values that are used to condition the rest of the values. In some cases,
changing it to `False` could lead to better performance.
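The following sketch illustrates the idea behind this option. It assumes that "log frequency" means weighting each category by `log(count + 1)` instead of by its raw count when choosing which category to condition on. This is our reading of the argument description, not a copy of the CTGAN implementation, and the category counts are made up for the example.

```python
import math


def category_probabilities(counts, log_frequency=True):
    """Turn category counts into sampling probabilities.

    Illustrative only: with log_frequency=True each category is
    weighted by log(count + 1), otherwise by its raw count.
    """
    if log_frequency:
        weights = {name: math.log(count + 1) for name, count in counts.items()}
    else:
        weights = dict(counts)

    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Made-up counts for a strongly imbalanced categorical column
counts = {'Commerce': 900, 'Science': 90, 'Arts': 10}

with_log = category_probabilities(counts, log_frequency=True)
without_log = category_probabilities(counts, log_frequency=False)

print(with_log)
print(without_log)
```

With the log weighting, the rare category is selected far more often than its raw frequency would suggest, which is the kind of difference this argument controls.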

### Neural Network dimensions

`CTGAN` has the following hyperparameters that allow you to control the
size of the different layers that compose its neural networks:

- `embedding_dim` (int): Size of the random sample passed to the Generator. Defaults to `128`.
- `gen_dim` (tuple or list of ints): Size of the output samples for each one of the Residuals.
  A Residual Layer will be created for each one of the values provided. Defaults to `(256, 256)`.
- `dis_dim` (tuple or list of ints): Size of the output samples for each one of the Discriminator
  Layers. A Linear Layer will be created for each one of the values provided. Defaults to `(256, 256)`.
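To make the effect of `gen_dim` more concrete, the sketch below lists the input and output size of each Generator layer. It assumes that every Residual Layer concatenates its output with its input, so the width grows after each layer, and it ignores any conditional vector appended to the random sample. Both points are simplifying assumptions on our side, and `generator_layer_shapes` and the `data_dim` value are hypothetical.

```python
def generator_layer_shapes(embedding_dim, gen_dim, data_dim):
    """List (input_size, output_size) for each Generator layer.

    Hypothetical helper. Assumes each Residual Layer concatenates its
    output with its input, and that a final Linear Layer maps to data_dim.
    """
    shapes = []
    dim = embedding_dim
    for width in gen_dim:
        shapes.append((dim, width))
        dim += width  # residual output is concatenated with its input

    shapes.append((dim, data_dim))  # final projection to the data width
    return shapes


# data_dim=40 is an illustrative width of the transformed table
default_shapes = generator_layer_shapes(128, (256, 256), data_dim=40)
deeper_shapes = generator_layer_shapes(128, (256, 256, 256), data_dim=40)

print(default_shapes)
print(deeper_shapes)
```

Under these assumptions, adding a value to `gen_dim` adds one Residual Layer and also widens the input of every layer that follows it.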

### l2scale

The `l2scale` argument, which defaults to `1e-6`, sets the weight decay of the Adam optimizer
used to optimize the neural networks.
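As a rough illustration of what weight decay does, the sketch below assumes the L2-penalty form, in which a term proportional to each weight is added to its gradient before the optimizer update. The helper `decayed_gradient` is hypothetical and only shows this single step, not the full Adam update.

```python
def decayed_gradient(weight, grad, l2scale=1e-6):
    """Gradient after adding the L2 weight decay term.

    Hypothetical helper; assumes the L2-penalty form of weight decay,
    where l2scale * weight is added to the raw gradient.
    """
    return grad + l2scale * weight


weight, grad = 2.0, 0.5

print(decayed_gradient(weight, grad, l2scale=1e-6))
print(decayed_gradient(weight, grad, l2scale=1e-7))
```

With values this small the correction is tiny, so changing `l2scale` from `1e-6` to `1e-7` only slightly reduces how strongly large weights are pulled towards zero.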

### verbose

Whether to print fit progress on stdout. Defaults to `False`.

<div class="alert alert-warning">
    
**WARNING**
    
The value that you set on the `batch_size` argument must always be
a multiple of `10`!

</div>

As an example, we will try to fit the `CTGAN` model with a higher number of epochs,
a smaller `batch_size`, one additional layer in each of the networks involved and a
smaller weight decay.

Before we start, we will evaluate the quality of the previously generated data using the
`sdv.evaluation.evaluate` function:

In [7]:
from sdv.evaluation import evaluate

evaluate(new_data, data)

-132.95529047945246

Afterwards, we create a new instance of the `CTGAN` model with the
hyperparameter values that we want to use:

In [13]:
model = CTGAN(
    primary_key='student_id',
    epochs=500,
    batch_size=100,
    gen_dim=(256, 256, 256),
    dis_dim=(256, 256, 256),
    l2scale=1e-07
)

And fit it to our data.

In [14]:
model.fit(data)

Finally, we are ready to generate new data and evaluate the results.

In [15]:
new_data = model.sample(len(data))

In [16]:
new_data

| | student_id | gender | second_perc | high_perc | high_spec | degree_perc | degree_type | work_experience | experience_years | employability_perc | mba_spec | mba_perc | salary | placed | start_date | end_date | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | M | 77.515473 | 116.747455 | Commerce | 73.592690 | Comm&Mgmt | True | 0 | 65.583448 | Mkt&Fin | 73.380323 | 31583.650209 | True | NaT | 2020-12-13 03:50:33.613757184 | 3.0 |
| 1 | 1 | M | 74.477053 | 102.006808 | Science | 69.975020 | Comm&Mgmt | False | 2 | 71.767375 | Mkt&Fin | 68.596593 | 32732.459478 | True | 2020-02-20 18:37:23.758584576 | NaT | |
| 2 | 2 | M | 94.315061 | 65.595406 | Commerce | 78.327972 | Comm&Mgmt | False | 1 | 106.978161 | Mkt&Fin | 66.399862 | 32353.245636 | True | 2020-02-28 06:51:15.425168384 | NaT | 3.0 |
| 3 | 3 | F | 83.955873 | 88.867142 | Commerce | 73.663499 | Comm&Mgmt | False | 1 | 105.845540 | Mkt&Fin | 67.277985 | 30465.561060 | True | NaT | 2020-12-19 21:23:16.062616576 | |
| 4 | 4 | M | 90.502278 | 94.096468 | Commerce | 52.351108 | Comm&Mgmt | False | 0 | 112.966196 | Mkt&HR | 62.794439 | | True | 2020-02-08 09:28:27.879909376 | 2020-12-10 17:11:43.361209344 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 210 | 210 | F | 86.579575 | 81.148341 | Commerce | 45.945317 | Comm&Mgmt | False | 0 | 72.403539 | Mkt&Fin | 63.897002 | | True | 2020-02-05 22:11:20.351433984 | 2020-10-24 08:25:34.983526912 | |
| 211 | 211 | F | 72.217464 | 72.405834 | Commerce | 64.917737 | Sci&Tech | True | 1 | 97.965395 | Mkt&Fin | 58.194729 | | False | 2020-03-08 23:59:00.229267712 | 2020-10-06 07:31:12.930507776 | 3.0 |
| 212 | 212 | F | 73.701105 | 72.239221 | Commerce | 47.869824 | Comm&Mgmt | True | 0 | 69.900527 | Mkt&HR | 61.086874 | | True | 2020-03-02 19:58:31.763453440 | 2021-03-01 17:38:19.501878016 | 3.0 |
| 213 | 213 | M | 91.916320 | 93.888259 | Arts | 52.732050 | Comm&Mgmt | True | 0 | 80.843845 | Mkt&Fin | 79.914218 | 33358.673626 | True | 2020-01-04 00:32:55.122470912 | 2020-08-31 05:23:08.366545920 | 6.0 |


In [17]:
from sdv.evaluation import evaluate

evaluate(new_data, data)

-140.10961406207406

As we can see, in this case these modifications changed the obtained results slightly,
but they did not introduce any dramatic change in the performance.