# CTGAN Model

In this guide we will go through a series of steps that will let you discover
functionalities of the `CTGAN` model, including how to:

* Create an instance of `CTGAN`.
* Fit the instance to your data.
* Generate synthetic versions of your data.
* Use `CTGAN` to anonymize PII.
* Customize the data transformations to improve the learning process.
* Specify column hyperparameters to improve the output quality.

## What is CTGAN?

The `sdv.tabular.CTGAN` model from `SDV` is based on the `ctgan.CTGANSynthesizer` class
from the [CTGAN library](https://github.com/sdv-dev/CTGAN), a deep-learning data synthesizer
that uses Generative Adversarial Networks to generate tabular data. It was presented at the
NeurIPS 2019 conference in the paper
[Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503).
For more details about the model, please read the linked paper and visit the
[CTGAN library](https://github.com/sdv-dev/CTGAN).

Let's now discover how to learn a dataset and later on generate synthetic data with the same
format and statistical properties by using the `CTGAN` class from SDV.

## Quick Usage

We will start by loading one of our demo datasets, the `student_placements`, which contains information
about MBA students that applied for placements during the year 2020.

<div class="alert alert-warning">

**WARNING**

In order to follow this guide you need to have `ctgan` installed on your system.
If you have not done it yet, please install `ctgan` now by executing the command
`pip install sdv[ctgan]` in a terminal.

</div>

In [3]:
# Setup logging and warnings - change ERROR to INFO for increased verbosity
import logging
logging.basicConfig(level=logging.ERROR)

logging.getLogger().setLevel(level=logging.WARNING)
logging.getLogger('sdv').setLevel(level=logging.ERROR)

import warnings
warnings.simplefilter("ignore")

In [4]:
from sdv.demo import load_tabular_demo

data = load_tabular_demo('student_placements')
data.head().T

Unnamed: 0,0,1,2,3,4
student_id,17264,17265,17266,17267,17268
gender,M,M,M,M,M
second_perc,67,79.33,65,56,85.8
high_perc,91,78.33,68,52,73.6
high_spec,Commerce,Science,Arts,Science,Commerce
degree_perc,58,77.48,64,52,73.3
degree_type,Sci&Tech,Sci&Tech,Comm&Mgmt,Sci&Tech,Comm&Mgmt
work_experience,False,True,False,False,False
experience_years,0,1,0,0,0
employability_perc,55,86.5,75,66,96.8


As you can see, this table contains information about students which includes, among other things:

- Their id and gender
- Their grades and specializations
- Their work experience
- The salary that they were offered
- The duration and dates of their placement

You will notice that there is data with the following characteristics:

- There are float, integer, boolean, categorical and datetime values.
- There are some variables that have missing data. In particular, all the data related to the
  placement details is missing in the rows where the student was not placed.

Let us use `CTGAN` to learn this data and then sample synthetic data about new students
to see how well the model captures the characteristics indicated above. In order to do this you will
need to:

- Import the `sdv.tabular.CTGAN` class and create an instance of it.
- Call its `fit` method passing our table.
- Call its `sample` method indicating the number of synthetic rows that you want to generate.

In [5]:
from sdv.tabular import CTGAN

model = CTGAN()
model.fit(data)

<div class="alert alert-info">

**NOTE**

Notice that the model `fitting` process took care of transforming the different fields using the
appropriate [Reversible Data Transforms](http://github.com/sdv-dev/RDT) to ensure that the data has
a format that the CTGANSynthesizer class can handle.

</div>

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the `sample` method
of your model, passing the number of rows that you want to generate.

In [6]:
new_data = model.sample(200)

This will return a table with the same columns as the one the model was fitted on, but filled with new data
which resembles the original.

In [7]:
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17353,M,46.382158,45.456456,Commerce,72.052872,Sci&Tech,False,1,105.797477,Mkt&Fin,70.656049,30002.17085,False,2020-02-24 17:13:40.795029504,2020-08-24 23:28:53.271384832,
1,17349,F,83.120215,47.732438,Commerce,75.903205,Comm&Mgmt,False,1,80.561992,Mkt&Fin,58.128451,,True,2020-03-13 10:36:26.176266496,NaT,3.0
2,17449,F,77.50298,68.061047,Science,54.526966,Sci&Tech,False,0,67.962156,Mkt&Fin,49.789638,70982.978168,True,2020-03-04 08:42:19.493672704,2020-03-26 19:23:02.405166336,
3,17384,M,52.983545,64.88669,Commerce,58.718478,Others,False,0,72.111335,Mkt&Fin,61.034151,31221.419186,True,NaT,2020-11-27 00:15:40.085506304,
4,17259,M,63.592512,83.374095,Science,72.034368,Comm&Mgmt,False,0,76.203421,Mkt&HR,51.853051,,True,NaT,2020-08-01 19:44:18.879604992,3.0


<div class="alert alert-info">

**NOTE**

You can control the number of rows by passing the desired number to
`model.sample(<num_rows>)`. To test, try `model.sample(10000)`. Note that the original
table only had ~200 rows.

</div>

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data
directly in systems that do not have access to the original data source. For example,
you may want to generate testing data on the fly inside a testing environment that
does not have access to your production database. In these scenarios, fitting the
model with real data every time that you need to generate new data is not feasible, so you
will need to fit a model in your production environment, save the fitted model into a
file, send this file to the testing environment and then load it there to be able to
`sample` from it.

Let's see how this process works.

#### Save and share the model

Once you have fitted the model, all you need to do is call its `save` method passing the
name of the file in which you want to save the model. Note that the extension of the filename
is not relevant, but we will be using the `.pkl` extension to highlight that the serialization
protocol used is [pickle](https://docs.python.org/3/library/pickle.html).

In [8]:
model.save('my_model.pkl')

This will have created a file called `my_model.pkl` in the same directory in which you are
running SDV.

<div class="alert alert-info">

**IMPORTANT**
    
If you inspect the generated file you will notice that its size is much smaller
than the size of the data that you used to generate it. This is because the serialized model
contains **no information about the original data**, other than the parameters it needs to
generate synthetic versions of it. This means that you can safely share this `my_model.pkl`
file without the risk of disclosing any of your real data!
    
</div>

#### Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be
generated. Once it is there, you can load it using the `CTGAN.load` method, and
then you are ready to sample new data from the loaded instance:

In [9]:
loaded = CTGAN.load('my_model.pkl')
new_data = loaded.sample(200)
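
Since the guide states that the model is serialized with pickle, the save and load round trip can be sketched with the standard library alone. The `toy_model` dictionary below is a hypothetical stand-in for a fitted model, and its values are made up for illustration; a real `CTGAN` instance holds far more state.

```python
import os
import pickle
import tempfile

# Hypothetical "fitted parameters"; a real model object would hold far more state.
toy_model = {"columns": ["second_perc", "high_perc"], "means": [67.3, 66.3]}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

# Save: serialize the object to a file with pickle.
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load: only unpickle files that come from a source you trust.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
print(os.path.getsize(path), "bytes")
```

Pickle records class instances by the import path of their class rather than by copying the class code, which is why the loading system needs the same packages installed. Because unpickling can execute arbitrary code, model files should only be loaded when they come from a trusted source.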

<div class="alert alert-warning">
    
**WARNING**
    
Notice that the system where the model is loaded needs to also have `sdv` and `ctgan`
installed, otherwise it will not be able to load the model and use it.
    
</div>

### Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo data
is that there is a `student_id` column which acts as the primary key of the table,
and which is supposed to have unique values. Indeed, if we look at the number of
times that each value appears, we see that all of them appear at most once:

In [10]:
data.student_id.value_counts().max()

1

However, if we look at the synthetic data that we generated, we observe that there
are some values that appear more than once:

In [11]:
new_data.student_id.value_counts().max()

4

In [12]:
new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
41,17379,M,80.704001,48.55603,Commerce,62.146339,Sci&Tech,False,1,56.995571,Mkt&Fin,57.644647,29772.210741,True,2020-03-04 00:25:18.201964544,NaT,6.0
58,17379,M,56.603675,57.150984,Commerce,76.039075,Sci&Tech,False,1,95.576293,Mkt&Fin,68.029574,28609.354434,True,2020-07-21 03:01:35.799904256,2020-11-09 09:12:48.531838464,
118,17379,F,71.765579,32.065834,Arts,47.496846,Comm&Mgmt,False,0,57.863569,Mkt&Fin,59.967604,,True,2020-03-01 10:19:17.022447872,NaT,
187,17379,F,44.379334,62.921519,Commerce,82.719495,Sci&Tech,True,0,58.81806,Mkt&Fin,49.606222,30981.168572,True,2020-03-04 12:46:37.198953472,2020-05-31 13:45:35.545154560,


This happens because the model was not notified at any point about the fact that the
`student_id` had to be unique, so when it generates new data it will produce collisions
sooner or later. In order to solve this, we can pass the argument `primary_key` to our
model when we create it, indicating the name of the column that is the primary key of the table.

In [13]:
model = CTGAN(
    primary_key='student_id'
)
model.fit(data)
new_data = model.sample(200)

As a result, the model will learn that this column must be unique and generate a unique
sequence of values for the column:

In [14]:
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,M,86.450397,58.721895,Commerce,75.014288,Comm&Mgmt,False,0,61.473889,Mkt&HR,58.692403,32893.810861,True,NaT,2020-05-09 13:44:34.654232576,
1,1,F,65.646807,59.248725,Commerce,83.659565,Comm&Mgmt,False,0,56.272349,Mkt&HR,76.306959,46842.743183,True,NaT,2020-08-13 04:59:44.944402944,3.0
2,2,M,58.055843,86.835433,Science,63.344503,Others,False,0,70.622931,Mkt&Fin,72.904164,19926.613285,False,2020-02-15 00:08:44.837243136,2020-06-30 03:55:16.218238464,3.0
3,3,M,92.088063,62.546539,Science,67.285421,Others,False,0,58.054804,Mkt&HR,63.208692,28067.791098,True,2019-12-29 19:43:30.730423552,2020-05-05 19:24:10.305772032,3.0
4,4,M,77.678651,53.132231,Arts,65.747072,Comm&Mgmt,False,0,66.506365,Mkt&Fin,58.935336,29121.33248,True,2020-09-10 04:20:47.193090304,2020-07-03 00:10:45.385397248,


In [15]:
new_data.student_id.value_counts().max()

1
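
To see why independently generated ids collide, here is a small standard-library sketch. It is not how `CTGAN` generates values; it only assumes that each id is drawn without any knowledge of the ids already produced. The pool of 215 ids is a hypothetical stand-in for the real column.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical pool of 215 real ids, mirroring the range seen in the demo table.
real_ids = list(range(17264, 17264 + 215))

# Drawing ids independently (with replacement) makes collisions very likely.
sampled_ids = [random.choice(real_ids) for _ in range(200)]
max_repeats = max(Counter(sampled_ids).values())

# Generating a fresh sequence guarantees uniqueness by construction.
unique_ids = list(range(200))

print(max_repeats)
print(max(Counter(unique_ids).values()))
```

The first value plays the role of `value_counts().max()` on the colliding column, and the second shows that a generated sequence, like the `0, 1, 2, ...` ids in the output above, never repeats.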

### Anonymizing Personally Identifiable Information (PII) 

There will be many cases where the data will contain Personally Identifiable Information
which we cannot disclose. In these cases, we will want our Tabular Models to replace the
information within these fields with fake, simulated data that looks similar to the real
data but does not contain any of the original values.

Let's load a new dataset that contains a PII field, the `student_placements_pii` demo, and
try to generate synthetic versions of it that do not contain any of the PII fields.

<div class="alert alert-info">
    
**NOTE**
    
The `student_placements_pii` dataset is a modified version of the `student_placements`
dataset with one new field, `address`, which contains PII information about the students.
Notice that this additional `address` field has been simulated and does not correspond to data
from the real users.

</div>

In [16]:
data_pii = load_tabular_demo('student_placements_pii')

In [17]:
data_pii.head().T

Unnamed: 0,0,1,2,3,4
student_id,17264,17265,17266,17267,17268
address,"70304 Baker Turnpike\nEricborough, MS 15086","805 Herrera Avenue Apt. 134\nMaryview, NJ 36510","3702 Bradley Island\nNorth Victor, FL 12268",Unit 0879 Box 3878\nDPO AP 42663,"96493 Kelly Canyon Apt. 145\nEast Steven, NC 3..."
gender,M,M,M,M,M
second_perc,67,79.33,65,56,85.8
high_perc,91,78.33,68,52,73.6
high_spec,Commerce,Science,Arts,Science,Commerce
degree_perc,58,77.48,64,52,73.3
degree_type,Sci&Tech,Sci&Tech,Comm&Mgmt,Sci&Tech,Comm&Mgmt
work_experience,False,True,False,False,False
experience_years,0,1,0,0,0


If we use our tabular model on this new data we will see how the synthetic
data that it generates discloses the addresses from the real students:

In [18]:
model = CTGAN(
    primary_key='student_id',
)
model.fit(data_pii)

In [19]:
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"752 Johnson Turnpike\nWrightfurt, NY 74928",M,91.499858,78.872534,Commerce,72.771089,Comm&Mgmt,False,0,59.974746,Mkt&Fin,78.548686,24200.820187,False,2020-02-26 07:17:45.558047744,2020-03-04 01:37:56.561632512,12.0
1,1,"530 Katrina Wall Suite 443\nJimmouth, WV 05020",M,80.850573,82.396021,Commerce,77.501456,Comm&Mgmt,True,0,58.937513,Mkt&HR,70.76086,24073.270239,False,2020-02-17 00:53:08.354403584,2021-01-07 01:13:38.550712576,
2,2,"PSC 2024, Box 1677\nAPO AP 99732",M,81.127493,68.665002,Commerce,48.030628,Comm&Mgmt,False,1,66.440086,Mkt&Fin,76.90987,28725.055145,False,NaT,2020-07-11 20:31:16.109528064,3.0
3,3,"8897 Brandon Ports\nNew Patriciachester, MS 76485",M,71.400642,66.740188,Science,59.55244,Comm&Mgmt,False,0,113.514815,Mkt&Fin,72.627793,,False,2020-01-10 02:36:44.947848448,2020-07-30 14:48:26.332177152,3.0
4,4,"814 Mcclain Walk\nNew Melissashire, MT 20272",M,75.065008,66.390104,Commerce,64.726468,Comm&Mgmt,False,1,100.525276,Mkt&HR,79.254538,22941.445477,False,2020-02-15 08:00:35.441444608,2020-09-15 16:15:59.047650048,3.0


In [20]:
new_data_pii.address.isin(data_pii.address).sum()

200

In order to solve this, we can pass an additional argument `anonymize_fields` to
our model when we create the instance.

This `anonymize_fields` argument will need to be a dictionary that contains:
- The name of the field that we want to anonymize.
- The category of the field that we want to use when we generate fake values for it.

The complete list of possible categories can be seen in the
[Faker Providers](https://faker.readthedocs.io/en/master/providers.html) page, and it
contains a huge list of concepts such as:

- name
- address
- country
- city
- ssn
- credit_card_number
- credit_card_expire
- credit_card_security_code
- email
- phone_number
- ...

In this case, since the field is a postal address, we will pass a dictionary indicating
the category `address`:

In [21]:
model = CTGAN(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)
model.fit(data_pii)

As a result, we can see how the real `address` values have been replaced by other fake
addresses that were not taken from the real data that we learned.

In [22]:
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"2541 Vargas Key\nCarrollport, CT 49868",M,63.628861,81.515809,Science,75.268115,Others,False,0,62.541468,Mkt&Fin,71.109248,27135.418718,True,2020-02-08 02:07:30.426697472,2020-08-25 14:08:27.326846976,
1,1,"504 Bailey Port Apt. 375\nKingbury, SC 17667",M,49.982513,62.900301,Commerce,61.496118,Comm&Mgmt,False,0,68.443574,Mkt&HR,69.018472,31934.401408,True,2020-04-04 07:47:47.085661184,2020-03-24 19:19:39.353576960,12.0
2,2,"6779 Karen Curve\nJeremystad, OH 66328",F,69.720897,66.930366,Arts,55.681328,Sci&Tech,False,0,59.392131,Mkt&HR,67.195895,,False,2020-02-02 17:42:32.028860928,2020-03-20 16:09:44.188886784,6.0
3,3,"92730 Angela Roads\nEast Michael, MD 32564",M,43.020849,72.917115,Commerce,70.28625,Comm&Mgmt,False,0,66.901252,Mkt&HR,52.98921,28127.077315,True,2020-02-16 13:59:04.988082944,2020-07-14 00:26:45.081564928,12.0
4,4,"207 Diane Parkways\nEast Erinmouth, MN 32943",F,71.638835,57.239479,Commerce,61.270519,Sci&Tech,False,0,76.555091,Mkt&HR,60.359768,24297.422515,True,2020-01-24 06:01:21.862124288,2020-08-26 02:01:19.091943424,


In [23]:
new_data_pii.address.isin(data_pii.address).sum()

0
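
The idea behind anonymization can be sketched without any external library: the fake values are built from scratch instead of being sampled from the real column, so none of the real values can leak. The `fake_address` helper and its small vocabularies are hypothetical and far simpler than what Faker produces.

```python
import random

random.seed(0)

# Two real values copied from the demo output above.
real_addresses = {
    "70304 Baker Turnpike\nEricborough, MS 15086",
    "3702 Bradley Island\nNorth Victor, FL 12268",
}

STREETS = ["Oak Street", "Maple Avenue", "Cedar Lane"]  # hypothetical vocabulary
CITIES = ["Springfield", "Riverton", "Lakeside"]        # hypothetical vocabulary


def fake_address():
    """Build an address from scratch, without looking at any real value."""
    number = random.randint(1, 9999)
    zip_code = random.randint(10000, 99999)
    return f"{number} {random.choice(STREETS)}\n{random.choice(CITIES)}, ZZ {zip_code}"


synthetic_addresses = [fake_address() for _ in range(200)]

# Same check as `new_data_pii.address.isin(data_pii.address).sum()`.
leaked = sum(address in real_addresses for address in synthetic_addresses)
print(leaked)  # prints 0
```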

### Specifying constraints

If you look closely at the data you may notice that some properties were not
completely captured by the model. For example, you may have seen that sometimes
the model produces an `experience_years` number greater than `0` while also
indicating that `work_experience` is `False`. These types of properties are what
we call `Constraints` and can also be handled using `SDV`. For further details
about them please visit the [Handling Constraints](03_Handling_Constraints.ipynb)
guide.
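
As a quick way to spot this kind of inconsistency, the check below flags rows that report years of experience without any work experience. The values are copied from the first five synthetic rows shown in the "Generate synthetic data from the model" output; the `violates` helper is only an illustrative name and not part of `SDV`.

```python
# (work_experience, experience_years) values copied from the first five
# synthetic rows shown earlier in this guide.
rows = [
    {"work_experience": False, "experience_years": 1},
    {"work_experience": False, "experience_years": 1},
    {"work_experience": False, "experience_years": 0},
    {"work_experience": False, "experience_years": 0},
    {"work_experience": False, "experience_years": 0},
]


def violates(row):
    """True when a row reports years of experience but no work experience."""
    return row["experience_years"] > 0 and not row["work_experience"]


violations = [index for index, row in enumerate(rows) if violates(row)]
print(violations)  # prints [0, 1]
```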

## Advanced Usage

Now that we have discovered the basics, let's go over a few more advanced usage examples
and see the different arguments that we can pass to our `CTGAN` Model in order to
customize it to our needs.

### CTGAN Hyperparameters

Apart from the common Tabular Model arguments, `CTGAN` has a number of additional
hyperparameters that control its learning behavior and can affect the
performance of the model, both in terms of quality of the generated data
and computational time.

#### epochs and batch size

The first hyperparameters that we see are the `epochs` and `batch_size` arguments,
which control the number of iterations that the model will perform to optimize
its parameters, as well as the number of samples used in each step.

Their default values are `300` and `500` respectively, and `batch_size` must
always be a multiple of `10`.

These hyperparameters have a very direct effect on how long the training process lasts,
but also on the quality of the generated data.

For new datasets, you might want to start by setting a low value on both of them
to see how long the training process takes on your data and later on increase them
to acceptable values in order to improve the quality.
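
To get a feeling for how these two values interact, the sketch below validates a `batch_size` and estimates the number of training steps. Both helpers are hypothetical, and the step count rests on an assumption that is not stated in this guide: one step per full batch, with at least one step per epoch.

```python
def check_batch_size(batch_size):
    """Raise if batch_size is not a positive multiple of 10, as the guide requires."""
    if batch_size <= 0 or batch_size % 10 != 0:
        raise ValueError(f"batch_size must be a positive multiple of 10, got {batch_size}")
    return batch_size


def estimated_steps(n_rows, epochs, batch_size):
    """Rough step count, ASSUMING one step per full batch and at least one per epoch."""
    check_batch_size(batch_size)
    return epochs * max(n_rows // batch_size, 1)


# Hypothetical table of 200 rows, close to the size of the demo data.
print(estimated_steps(200, epochs=300, batch_size=500))  # prints 300
print(estimated_steps(200, epochs=500, batch_size=100))  # prints 1000
```

Under this assumption, a smaller batch combined with more epochs roughly triples the number of steps on a table of this size, which is worth keeping in mind when estimating training time.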

#### log_frequency

Whether to use log frequency of categorical levels in conditional sampling.

Defaults to `True`.

This argument affects how the model processes the frequencies of the categorical
values that are used to condition the rest of the values. In some cases,
changing it to `False` could lead to better performance.
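
The effect of this option can be illustrated with a few lines of arithmetic. The sketch assumes that log frequency means replacing each category count `c` with `log(c + 1)` before normalizing; the function name and the counts are hypothetical.

```python
import math


def category_probabilities(counts, log_frequency=True):
    """Turn category counts into sampling probabilities.

    ASSUMPTION: log_frequency replaces each count c with log(c + 1) before normalizing.
    """
    weights = {
        category: math.log(count + 1) if log_frequency else float(count)
        for category, count in counts.items()
    }
    total = sum(weights.values())
    return {category: weight / total for category, weight in weights.items()}


# Hypothetical imbalanced column: 90 rows of one category, 10 of another.
counts = {"Mkt&Fin": 90, "Mkt&HR": 10}
raw = category_probabilities(counts, log_frequency=False)
logged = category_probabilities(counts, log_frequency=True)
print(raw["Mkt&HR"], logged["Mkt&HR"])
```

With log frequencies the rare category is picked as a condition more often than its raw share of the data, so the model sees more examples of it during training. Setting the option to `False` keeps the conditions proportional to the observed frequencies.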

#### Neural Network dimensions

`CTGAN` has the following hyperparameters that allow you to control the
size of the different layers that compose its neural networks:

- embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.
- gen_dim (tuple or list of ints): Size of the output samples for each one of the Residuals.
  A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
- dis_dim (tuple or list of ints): Size of the output samples for each one of the Discriminator
  Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
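
To make the effect of these tuples more concrete, the sketch below traces the input and output width of each layer. It relies on an assumption that this guide does not confirm: that each Residual layer concatenates its output with its input, so the following layer receives both. The function names and the discriminator input width of `100` are hypothetical.

```python
def generator_layer_shapes(embedding_dim, gen_dim):
    """List (input, output) widths of each Residual layer.

    ASSUMPTION: each Residual layer concatenates its output with its input,
    so the next layer receives `previous_input + output` features.
    """
    shapes, width = [], embedding_dim
    for out_features in gen_dim:
        shapes.append((width, out_features))
        width += out_features
    return shapes


def discriminator_layer_shapes(input_dim, dis_dim):
    """List (input, output) widths of each Linear layer in a plain stack."""
    shapes, width = [], input_dim
    for out_features in dis_dim:
        shapes.append((width, out_features))
        width = out_features
    return shapes


print(generator_layer_shapes(128, (256, 256)))       # [(128, 256), (384, 256)]
print(generator_layer_shapes(128, (256, 256, 256)))  # adds (640, 256)
print(discriminator_layer_shapes(100, (256, 256)))   # input_dim=100 is hypothetical
```

Under that assumption, each extra value in `gen_dim` adds a progressively wider layer, so deeper generators grow in parameter count faster than the tuple alone suggests.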

#### l2scale

The `l2scale` argument, which defaults to `1e-6`, sets the weight decay of the Adam optimizer
used to optimize the neural networks.
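
As a rough illustration of what this value does, the sketch below shows the L2 term that weight decay adds to a gradient. It assumes the classic, non-decoupled formulation in which `l2scale * weight` is added to the gradient before the optimizer update; the function name and the numbers are hypothetical.

```python
def gradient_with_weight_decay(gradient, weight, l2scale=1e-6):
    """Add the L2 penalty term `l2scale * weight` to a raw gradient.

    ASSUMPTION: weight decay is applied the way classic (non-decoupled) Adam
    implementations do it, by adding the term to the gradient before the update.
    """
    return gradient + l2scale * weight


# Hypothetical single parameter: weight 2.0 with a raw gradient of 0.5.
print(gradient_with_weight_decay(0.5, 2.0))                # default l2scale=1e-6
print(gradient_with_weight_decay(0.5, 2.0, l2scale=1e-7))  # smaller decay
```

With values this small the penalty barely changes a single update; its effect comes from nudging large weights toward zero over many steps.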

#### verbose

Whether to print fit progress on stdout. Defaults to `False`.

<div class="alert alert-warning">
    
**WARNING**
    
The value that you set on the `batch_size` argument must always be
a multiple of `10`!

</div>

As an example, we will try to fit the `CTGAN` model slightly increasing the number of epochs,
reducing the `batch_size`, adding one additional layer to the models involved and using a
smaller weight decay.

Before we start, we will evaluate the quality of the previously generated data using the
`sdv.evaluation.evaluate` function:

In [6]:
from sdv.evaluation import evaluate

evaluate(new_data, data)

-148.173551500529

Afterwards, we create a new instance of the `CTGAN` model with the
hyperparameter values that we want to use:

In [7]:
model = CTGAN(
    primary_key='student_id',
    epochs=500,
    batch_size=100,
    gen_dim=(256, 256, 256),
    dis_dim=(256, 256, 256),
    l2scale=1e-07
)

And fit it to our data.

In [8]:
model.fit(data)

Finally, we are ready to generate new data and evaluate the results.

In [9]:
new_data = model.sample(len(data))

In [10]:
new_data

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,F,34.949775,56.543753,Science,56.386929,Others,True,0,57.047205,Mkt&HR,70.880870,52487.953908,False,2020-02-11 14:51:51.540146176,NaT,
1,1,M,60.293816,63.604466,Commerce,57.213664,Comm&Mgmt,False,1,76.443608,Mkt&Fin,69.989944,49340.176244,False,NaT,2020-08-22 07:51:09.761925888,3.0
2,2,M,65.549707,67.099558,Science,67.883755,Comm&Mgmt,False,0,63.329463,Mkt&HR,77.980852,,False,NaT,NaT,6.0
3,3,M,31.861321,51.808897,Commerce,59.313310,Comm&Mgmt,False,4,91.136983,Mkt&Fin,84.226341,,False,NaT,2020-07-21 22:16:00.456667392,12.0
4,4,F,78.827052,75.281691,Commerce,64.924865,Comm&Mgmt,False,1,68.389091,Mkt&Fin,80.486051,51901.492908,False,2020-01-25 13:18:42.237984512,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,210,M,67.010665,43.334043,Science,58.679588,Comm&Mgmt,False,0,89.946393,Mkt&Fin,78.518156,,False,NaT,2020-10-02 00:22:02.203142656,
211,211,M,60.100223,68.346138,Science,72.964973,Sci&Tech,False,0,101.124253,Mkt&Fin,76.647721,32727.642571,False,2020-02-13 20:57:41.302214656,2021-01-03 03:54:04.681563904,12.0
212,212,M,54.662673,56.751646,Commerce,60.198562,Comm&Mgmt,True,0,50.684929,Mkt&HR,66.078017,,False,2020-03-01 10:52:01.466955776,2020-08-19 20:15:37.505603072,12.0
213,213,M,57.984654,51.599140,Science,69.113537,Comm&Mgmt,False,0,61.562135,Mkt&HR,72.756443,39093.193199,False,2020-02-08 20:32:23.699396864,NaT,


In [11]:
from sdv.evaluation import evaluate

evaluate(new_data, data)

-153.27980312716866

As we can see, in this case these modifications changed the obtained results slightly,
but they did not introduce dramatic changes in performance either.
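
To put a number on "slightly", the two scores reported above can be compared directly. This is plain arithmetic on the values printed in this guide and makes no claim about how the score itself is computed.

```python
# Scores reported by `evaluate` in the two runs above.
score_before = -148.173551500529
score_after = -153.27980312716866

absolute_change = score_after - score_before
relative_change = absolute_change / abs(score_before)

print(f"{absolute_change:.2f} ({relative_change:.1%})")  # prints -5.11 (-3.4%)
```

Since GAN training is stochastic, a difference of this size between two single runs may be partly due to random variation, so repeating each configuration a few times gives a more reliable comparison.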