# Description
This notebook briefly describes Generative Adversarial Networks (GANs) and explores the use of the SDV library to generate synthetic data for the enriched telco dataset and the Enriched dataset. It evaluates the models with the SDV metrics, which measure how similar the synthetic data is to the real data.

These models take quite long to fit: the CTGAN model took 10-15 minutes to fit the 10000 entries of the telco dataset. All the models originally took longer; tuning hyperparameters increased both speed and accuracy, and the Gaussian Copula model gave the best results throughout experimentation once field transformers were applied.

# Takeaways Summary
These models work very well for data generation. Although some are slow to fit, field transformers can speed them up, and once a model is trained, generation is almost instant.

Field transformers seem to be the most important feature of these models. However, as will be observed in the synthetic generation notebook, other constraints also exist, which make customisation for specific datasets simple.

**_Steps to implement on any data set:_** 
* **Import** the models and **load** the dataset
* Understand and establish the fields of the dataset and **set field transformers and constraints** if necessary
* **Initialize** model with constraints
* **Train/fit** model on real data
* **Sample/Generate** new data

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Takeaways-Summary" data-toc-modified-id="Takeaways-Summary-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Takeaways Summary</a></span></li><li><span><a href="#GAN-Defenitions--" data-toc-modified-id="GAN-Defenitions---3"><span class="toc-item-num">3&nbsp;&nbsp;</span>GAN Defenitions  <a class="anchor" id="GANexp"></a></a></span><ul class="toc-item"><li><span><a href="#Generative-Adversarial-Networks-" data-toc-modified-id="Generative-Adversarial-Networks--3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Generative Adversarial Networks <a class="anchor" id="GANdef"></a></a></span></li><li><span><a href="#Models-in-Synthetic-Data-Vault-(SDV)-library-" data-toc-modified-id="Models-in-Synthetic-Data-Vault-(SDV)-library--3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Models in Synthetic Data Vault (SDV) library <a class="anchor" id="sdv"></a></a></span><ul class="toc-item"><li><span><a href="#Gaussian-Copula-Model:-" data-toc-modified-id="Gaussian-Copula-Model:--3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Gaussian Copula Model: <a class="anchor" id="Gaussian_def"></a></a></span></li><li><span><a href="#Condition-Tabular-GAN-(CTGAN)-Model:-" data-toc-modified-id="Condition-Tabular-GAN-(CTGAN)-Model:--3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Condition Tabular GAN (CTGAN) Model: <a class="anchor" id="CTGAN_def"></a></a></span></li><li><span><a href="#CopulaGAN-Model:-" data-toc-modified-id="CopulaGAN-Model:--3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>CopulaGAN Model: <a class="anchor" id="Copula_Def"></a></a></span></li><li><span><a href="#Tabular-Variational-AutoEncoder-(TVAE)-Model-:" data-toc-modified-id="Tabular-Variational-AutoEncoder-(TVAE)-Model-:-3.2.4"><span 
class="toc-item-num">3.2.4&nbsp;&nbsp;</span>Tabular Variational AutoEncoder (TVAE) Model <a class="anchor" id="TVAE_def"></a>:</a></span></li></ul></li><li><span><a href="#Prerequisites-" data-toc-modified-id="Prerequisites--3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Prerequisites <a class="anchor" id="Prerequisites"></a></a></span></li></ul></li><li><span><a href="#Experimentation-on-enriched-Telco-" data-toc-modified-id="Experimentation-on-enriched-Telco--4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Experimentation on enriched Telco <a class="anchor" id="Experimentation_telco"></a></a></span><ul class="toc-item"><li><span><a href="#Loading-and-enriching-data-set-" data-toc-modified-id="Loading-and-enriching-data-set--4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Loading and enriching data set <a class="anchor" id="loading"></a></a></span></li><li><span><a href="#Fitting-the-models-" data-toc-modified-id="Fitting-the-models--4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Fitting the models <a class="anchor" id="fitting"></a></a></span><ul class="toc-item"><li><span><a href="#GaussianCopula-" data-toc-modified-id="GaussianCopula--4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>GaussianCopula <a class="anchor" id="gaussian_fit"></a></a></span></li><li><span><a href="#CTGAN-" data-toc-modified-id="CTGAN--4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>CTGAN <a class="anchor" id="CTGAN_fit"></a></a></span></li><li><span><a href="#CopulaGAN-" data-toc-modified-id="CopulaGAN--4.2.3"><span class="toc-item-num">4.2.3&nbsp;&nbsp;</span>CopulaGAN <a class="anchor" id="copula_fit"></a></a></span></li><li><span><a href="#TVAE--" data-toc-modified-id="TVAE---4.2.4"><span class="toc-item-num">4.2.4&nbsp;&nbsp;</span>TVAE  <a class="anchor" id="TVAE_fit"></a></a></span></li></ul></li><li><span><a href="#Evaluation-of-synthetic-data-" data-toc-modified-id="Evaluation-of-synthetic-data--4.3"><span 
class="toc-item-num">4.3&nbsp;&nbsp;</span>Evaluation of synthetic data <a class="anchor" id="eval"></a></a></span><ul class="toc-item"><li><span><a href="#Metric-types-" data-toc-modified-id="Metric-types--4.3.1"><span class="toc-item-num">4.3.1&nbsp;&nbsp;</span>Metric types <a class="anchor" id="metrics"></a></a></span><ul class="toc-item"><li><span><a href="#Statistical-metrics-" data-toc-modified-id="Statistical-metrics--4.3.1.1"><span class="toc-item-num">4.3.1.1&nbsp;&nbsp;</span>Statistical metrics <a class="anchor" id="stat"></a></a></span></li><li><span><a href="#Likelihood-Metrics-" data-toc-modified-id="Likelihood-Metrics--4.3.1.2"><span class="toc-item-num">4.3.1.2&nbsp;&nbsp;</span>Likelihood Metrics <a class="anchor" id="likelihood"></a></a></span></li><li><span><a href="#Detection-Metrics-" data-toc-modified-id="Detection-Metrics--4.3.1.3"><span class="toc-item-num">4.3.1.3&nbsp;&nbsp;</span>Detection Metrics <a class="anchor" id="detection"></a></a></span></li></ul></li><li><span><a href="#Gaussian-Copula--" data-toc-modified-id="Gaussian-Copula---4.3.2"><span class="toc-item-num">4.3.2&nbsp;&nbsp;</span>Gaussian Copula  <a class="anchor" id="gaussian_eval"></a></a></span></li><li><span><a href="#CTGAN-" data-toc-modified-id="CTGAN--4.3.3"><span class="toc-item-num">4.3.3&nbsp;&nbsp;</span>CTGAN <a class="anchor" id="CTGAN_eval"></a></a></span></li><li><span><a href="#CopulaGAN" data-toc-modified-id="CopulaGAN-4.3.4"><span class="toc-item-num">4.3.4&nbsp;&nbsp;</span>CopulaGAN<a class="anchor" id="copula_eval"></a></a></span></li><li><span><a href="#TVAE-" data-toc-modified-id="TVAE--4.3.5"><span class="toc-item-num">4.3.5&nbsp;&nbsp;</span>TVAE <a class="anchor" id="TVAE_eval"></a></a></span></li></ul></li><li><span><a href="#Saving-data" data-toc-modified-id="Saving-data-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Saving data</a></span></li></ul></li><li><span><a href="#Experimentation-on-enriched-dataset-" 
data-toc-modified-id="Experimentation-on-enriched-dataset--5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Experimentation on enriched dataset <a class="anchor" id="Experimentation"></a></a></span><ul class="toc-item"><li><span><a href="#Loading-enriched-dataset-" data-toc-modified-id="Loading-enriched-dataset--5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Loading enriched dataset <a class="anchor" id="loading2"></a></a></span></li><li><span><a href="#Fitting-the-models-" data-toc-modified-id="Fitting-the-models--5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Fitting the models <a class="anchor" id="fitting2"></a></a></span><ul class="toc-item"><li><span><a href="#GaussianCopula-" data-toc-modified-id="GaussianCopula--5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>GaussianCopula <a class="anchor" id="gaussian_fit2"></a></a></span></li><li><span><a href="#CTGAN-" data-toc-modified-id="CTGAN--5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>CTGAN <a class="anchor" id="CTGAN_fit2"></a></a></span></li><li><span><a href="#CopulaGAN-" data-toc-modified-id="CopulaGAN--5.2.3"><span class="toc-item-num">5.2.3&nbsp;&nbsp;</span>CopulaGAN <a class="anchor" id="copula_fit2"></a></a></span></li><li><span><a href="#TVAE--" data-toc-modified-id="TVAE---5.2.4"><span class="toc-item-num">5.2.4&nbsp;&nbsp;</span>TVAE  <a class="anchor" id="TVAE_fit2"></a></a></span></li></ul></li><li><span><a href="#Evaluation-of-synthetic-data-" data-toc-modified-id="Evaluation-of-synthetic-data--5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Evaluation of synthetic data <a class="anchor" id="eval2"></a></a></span><ul class="toc-item"><li><span><a href="#Gaussian-Copula--" data-toc-modified-id="Gaussian-Copula---5.3.1"><span class="toc-item-num">5.3.1&nbsp;&nbsp;</span>Gaussian Copula  <a class="anchor" id="gaussian_eval2"></a></a></span></li><li><span><a href="#CTGAN-" data-toc-modified-id="CTGAN--5.3.2"><span 
class="toc-item-num">5.3.2&nbsp;&nbsp;</span>CTGAN <a class="anchor" id="CTGAN_eval2"></a></a></span></li><li><span><a href="#CopulaGAN-" data-toc-modified-id="CopulaGAN--5.3.3"><span class="toc-item-num">5.3.3&nbsp;&nbsp;</span>CopulaGAN <a class="anchor" id="copula_eval2"></a></a></span></li><li><span><a href="#TVAE-" data-toc-modified-id="TVAE--5.3.4"><span class="toc-item-num">5.3.4&nbsp;&nbsp;</span>TVAE <a class="anchor" id="TVAE_eval2"></a></a></span></li></ul></li><li><span><a href="#Saving-data" data-toc-modified-id="Saving-data-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Saving data</a></span></li></ul></li></ul></div>

# GAN Definitions  <a class="anchor" id="GANexp"></a>

## Generative Adversarial Networks <a class="anchor" id="GANdef"></a>

A generative adversarial network (GAN) is a class of machine learning frameworks in which two neural networks contest with each other in a zero-sum game, where one agent's gain is another agent's loss. This can be applied and extended to generate data in many different fields, from fashion to video games. The two neural networks, the generator and the discriminator, both improve with each epoch.

The Generator learns to map a latent space to a distribution, i.e. that of the real data. The Discriminator then takes this generated data together with real data and attempts to distinguish between the two. Each model aims to increase the other's error rate while decreasing its own, so both networks improve together until the Generator's synthetic data is relatively indistinguishable from the real data.
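
As a rough illustration of this zero-sum game, the sketch below evaluates the value function of the original GAN formulation (Goodfellow et al., 2014), $V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$, which the Discriminator tries to maximise and the Generator tries to minimise. The function name `gan_value` and the example probabilities are made up for illustration, and the SDV models may use a different loss.

```python
import math


def gan_value(d_real, d_fake):
    """Value V(D, G) computed from discriminator outputs in (0, 1).

    d_real: D(x) for real rows, d_fake: D(G(z)) for generated rows.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator gives a value close to 0 ...
strong_d = gan_value([0.9, 0.95], [0.1, 0.05])
# ... while a discriminator that cannot tell the two apart outputs 0.5 everywhere.
fooled_d = gan_value([0.5, 0.5], [0.5, 0.5])
print(strong_d, fooled_d)
```

The value drops as the Generator gets better at fooling the Discriminator, which is the sense in which one network's gain is the other's loss.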

## Models in Synthetic Data Vault (SDV) library <a class="anchor" id="sdv"></a>





### Gaussian Copula Model: <a class="anchor" id="Gaussian_def"></a>



Intuitively, the Gaussian copula takes $d$ marginal distributions and combines them into a single multivariate distribution over $[0,1]^{d}$ using correlation coefficients. It uses this multivariate distribution to generate all values at once.

Mathematically, it is constructed from a multivariate normal distribution over $\mathbb {R} ^{d}$ by using the probability integral transform. For a given correlation matrix ${\displaystyle R\in [-1,1]^{d\times d}}$, the Gaussian copula with parameter matrix ${\displaystyle R}$ can be written as
> $C_{R}^{\text{Gauss}}(u)=\Phi _{R}\left(\Phi ^{-1}(u_{1}),\dots ,\Phi ^{-1}(u_{d})\right)$ 

where $ \Phi ^{-1}$ is the inverse cumulative distribution function of a standard normal and $\Phi _{R}$ is the joint cumulative distribution function of a multivariate normal distribution with mean vector zero and covariance matrix equal to the correlation matrix $R$. 

${\displaystyle C_{R}^{\text{Gauss}}(u)}$ is approximated using numerical integration, and its density ${\displaystyle c_{R}^{\text{Gauss}}(u)}$ can be written as:

>${\displaystyle c_{R}^{\text{Gauss}}(u)={\frac {1}{\sqrt {\det {R}}}}\exp \left(-{\frac {1}{2}}{\begin{pmatrix}\Phi ^{-1}(u_{1})\\\vdots \\\Phi ^{-1}(u_{d})\end{pmatrix}}^{T}\cdot \left(R^{-1}-I\right)\cdot {\begin{pmatrix}\Phi ^{-1}(u_{1})\\\vdots \\\Phi ^{-1}(u_{d})\end{pmatrix}}\right),}$

where $I$ is the identity matrix.
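
To make the construction more concrete, here is a minimal sketch that samples from a two-dimensional Gaussian copula using only the Python standard library. It is not how SDV implements the model. The correlation `rho`, the exponential marginals and all function names are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(rho, n, seed=0):
    """Draw n points (u1, u2) from a bivariate Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    phi = NormalDist()  # standard normal distribution
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal that has correlation rho with z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The probability integral transform maps each normal to a uniform on [0, 1].
        samples.append((phi.cdf(z1), phi.cdf(z2)))
    return samples


def exponential_inv_cdf(u, rate=1.0):
    """Inverse CDF of an exponential marginal, used here only as an example."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0)
    return -math.log(1.0 - u) / rate


uniforms = sample_gaussian_copula(rho=0.8, n=2000)
# Applying a marginal inverse CDF to each coordinate yields correlated values
# that follow the chosen marginals.
rows = [(exponential_inv_cdf(u1), exponential_inv_cdf(u2, rate=0.5))
        for u1, u2 in uniforms]
```

The dependence between the columns lives entirely in `rho`, while the shape of each column is set by its own inverse CDF, which is what lets the copula generate all values of a row at once.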


### Conditional Tabular GAN (CTGAN) Model: <a class="anchor" id="CTGAN_def"></a>


[Lei Xu et al., 2019](http://arxiv.org/abs/1907.00503)

CTGAN is a GAN-based method that models the distribution of tabular data and samples rows from it. It addresses problems that other GAN models have with tabular data: the need to model discrete and continuous columns simultaneously, the multi-modal, non-Gaussian values within each continuous column, and the severe imbalance of categorical columns. It does so by employing a conditional generator and training-by-sampling.

Let $k$ be the value from the $i^{th}$ discrete column $D_{i}$ that has to be matched by the generated samples $r$. The generator can then be interpreted as the conditional distribution of rows given that particular value at that particular column, i.e. $r \sim P_G(\text{row} \mid D_i = k)$.

> $P(\text{row}) = \sum_{k \in D_{i}} P_G(\text{row} \mid D_{i} = k) P(D_{i} = k)$

The output produced by the conditional generator must be assessed by the critic, which estimates the distance between the learned conditional distribution $P_G(\text{row} \mid \text{cond})$ and the conditional distribution on real data $P(\text{row} \mid \text{cond})$. The sampling of real training data and the construction of the cond vector must be consistent with each other for the critic to estimate this distance. Sampling the cond vector and the training data properly helps the model explore all possible values in the discrete columns evenly.
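
The sketch below illustrates the idea of training-by-sampling: a discrete column is picked uniformly, a category is drawn with probability proportional to the logarithm of its frequency, and the pair is one-hot encoded into a cond vector. It is a simplified illustration rather than the CTGAN implementation. The `+ 1` inside the logarithm, the function names and the made-up counts for `inCustType` are assumptions.

```python
import math
import random
from collections import Counter


def build_cond_sampler(discrete_columns, seed=0):
    """discrete_columns: dict mapping column name -> list of observed values."""
    rng = random.Random(seed)
    names = list(discrete_columns)
    # The categories of every column, in a fixed order, define the cond vector layout.
    categories = {name: sorted(set(discrete_columns[name])) for name in names}
    counts = {name: Counter(discrete_columns[name]) for name in names}

    def sample_cond():
        column = rng.choice(names)  # pick a discrete column uniformly
        cats = categories[column]
        # Log-frequency weights flatten the imbalance between categories.
        weights = [math.log(counts[column][c] + 1) for c in cats]
        value = rng.choices(cats, weights=weights, k=1)[0]
        # One-hot encode the chosen (column, value) pair across all columns.
        cond = [1 if (name == column and c == value) else 0
                for name in names for c in categories[name]]
        return column, value, cond

    return sample_cond


# Hypothetical, heavily imbalanced column: 99% 'Y' and 1% 'N'.
columns = {"inCustType": ["Y"] * 990 + ["N"] * 10}
sample_cond = build_cond_sampler(columns)
draws = [sample_cond() for _ in range(2000)]
rare_share = sum(1 for _, value, _ in draws if value == "N") / len(draws)
print(rare_share)
```

With these counts the rare category is requested far more often than its 1% share of the data, so the generator gets to practise it during training.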
 


### CopulaGAN Model: <a class="anchor" id="Copula_Def"></a>



CopulaGAN combines the cumulative distribution function based approach of the Gaussian Copula model with the conditional GAN model, making the underlying CTGAN model's task of learning the data easier.



### Tabular Variational AutoEncoder (TVAE) Model <a class="anchor" id="TVAE_def"></a>:


[Lei Xu et al., 2019](http://arxiv.org/abs/1907.00503)

TVAE is another neural network generative model. It uses two neural networks to model $p_\theta(r_j \mid z_j)$ and $q_\phi(z_j \mid r_j)$ and trains them using the evidence lower-bound (ELBO) loss. SDV uses the design presented in the paper above.
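
As a small illustration of the ELBO, the sketch below combines a reconstruction log-likelihood $\log p_\theta(r_j \mid z_j)$ with the KL divergence between the encoder distribution $q_\phi(z_j \mid r_j)$ and the prior. It assumes a diagonal Gaussian encoder and a standard normal prior. The function names and the numbers are invented for the example and are not taken from SDV.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def elbo(reconstruction_log_likelihood, mu, sigma):
    """Single-sample ELBO estimate: log p(r | z) minus the KL penalty."""
    return reconstruction_log_likelihood - kl_to_standard_normal(mu, sigma)


# Training minimises the negative ELBO.
loss = -elbo(reconstruction_log_likelihood=-3.2, mu=[0.5, -0.1], sigma=[0.9, 1.1])
print(loss)
```

The KL term is zero only when the encoder output matches the prior exactly, so it acts as a penalty that keeps the latent codes close to the distribution that is sampled from at generation time.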

## Prerequisites <a class="anchor" id="Prerequisites"></a>

In [21]:
# General imports
# Make sure to be in version 3.6-3.8 of python to import sdv, does not work on python 3.9.
import numpy as np
import pandas as pd
import copy
from sdv.tabular import GaussianCopula, CTGAN, CopulaGAN, TVAE
from sdv.metrics.tabular import (CSTest, KSTest, BNLikelihood, BNLogLikelihood,
                                 GMLogLikelihood, LogisticDetection,
                                 SVCDetection,
                                 MulticlassDecisionTreeClassifier, NumericalLR)
from sdv.evaluation import evaluate
from sdv.constraints import UniqueCombinations

from cadai.Binning import *
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.exceptions import ConvergenceWarning

# Included to increase evaluation accuracy
logistic = LogisticRegression(max_iter=1000)
# Each Pipeline step must be a (name, estimator) pair
pipe = Pipeline([('reduce_dim', 'passthrough'),
                 ('classify', LinearSVC(dual=False, max_iter=1000))])

import warnings

warnings.filterwarnings('ignore')
import yapf

# Experimentation on enriched Telco <a class="anchor" id="Experimentation_telco"></a>

## Loading and enriching data set <a class="anchor" id="loading"></a>

The code below is taken from the _Scoring - 1 - LogReg and GAM experimenting_ file to ensure understanding of the data being generated.

In [22]:
dfRawFile = pd.read_csv('./data/telco_appfrauddetect.csv',
                        sep=',',
                        error_bad_lines=False)
# A column will be removed if more than [threshold] (fraction) of its records are null
threshold = 0.7

# List columns that have more than 70% null values
na_values = dfRawFile.isnull().mean()
print(na_values[na_values > threshold])

# Drop columns with more than 70% null values
dfRawFile.dropna(thresh=dfRawFile.shape[0] * (1 - threshold),
                 how='all',
                 axis=1,
                 inplace=True)

del threshold, na_values
# Wrangle dataset
dfRawFile.loc[dfRawFile['timeSubmitSec'] < 0, 'timeSubmitSec'] = 0

dfRawFile['label'] = dfRawFile['app_status'].map({'valid': 0, 'invalid': 1})

dfRawFile.loc[dfRawFile['browser'] > 5, 'browser'] = 6
dfRawFile['browserType'] = dfRawFile['browser'].map({
    0: 'edge',
    1: 'chrome',
    2: 'safari',
    3: 'firefox',
    4: 'opera',
    5: 'vivaldi',
    6: 'others'
})
dfRawFile['inCustType'] = dfRawFile['inCust'].map({0: 'N', 1: 'Y'})
dfRawFile['packageType'] = dfRawFile['package'].map({
    0: 'A',
    1: 'B',
    2: 'C',
    3: 'D',
    4: 'E',
    5: 'F',
    6: 'G',
    7: 'H',
    8: 'I',
    9: 'J',
    10: 'K',
    11: 'L',
    12: 'M',
    13: 'N',
    14: 'O',
    15: 'P',
    18: 'Q',
    21: 'R',
    24: 'S',
    27: 'T',
    30: 'U'
})
dfRawFile['weekDayType'] = dfRawFile['weekDay'].map({
    0: 'Mon',
    1: 'Tue',
    2: 'Wed',
    3: 'Thu',
    4: 'Fri',
    5: 'Sat',
    6: 'Sun'
})
dfRawFile.drop(['app_id', 'app_status', 'score'], axis=1, inplace=True)

dfRawFile.drop(columns=['browser', 'package', 'inCust', 'weekDay'],
               inplace=True)
dfEnriched = dfRawFile.copy()

# Drop initial dataset
del dfRawFile
dfEnriched.head()

Series([], dtype: float64)


Unnamed: 0,ipRange,ipHop,timeSubmitSec,inList,appHour,label,browserType,inCustType,packageType,weekDayType
0,12307.0,0.0,20.0,0.0,11.0,0,firefox,Y,A,Sun
1,15384.0,0.0,18.0,0.0,11.0,0,safari,Y,G,Sun
2,9230.0,0.0,18.0,0.0,10.0,0,edge,N,D,Sun
3,6153.0,1.0,20.0,0.0,10.0,0,edge,Y,A,Sun
4,21538.0,1.0,21.0,0.0,11.0,0,chrome,N,J,Thu


## Fitting the models <a class="anchor" id="fitting"></a>

All features in the data set are object types. Field transformers help the models distinguish between the types during fitting and return the data to its initial type when sampling.

In [23]:
field_transformer = {
    'ipRange': 'integer',
    'ipHop': 'integer',
    'timeSubmitSec': 'integer',
    'inList': 'integer',
    'appHour': 'integer',
    'label': 'label_encoding',
    'browserType': 'categorical',
    'inCustType': 'categorical',
    'packageType': 'categorical',
    'weekDayType': 'categorical'
}
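
To give an intuition for what a transformer such as `'label_encoding'` does, the toy class below maps categories to integers before fitting and maps them back after sampling. It is a hypothetical, simplified stand-in written for this explanation and is not SDV's implementation.

```python
class ToyLabelEncoder:
    """Hypothetical, simplified stand-in for a label-encoding transformer."""

    def fit(self, values):
        # Assign each distinct category an integer code, in order of first appearance.
        self.codes = {}
        for value in values:
            self.codes.setdefault(value, len(self.codes))
        self.labels = {code: value for value, code in self.codes.items()}
        return self

    def transform(self, values):
        return [self.codes[value] for value in values]

    def reverse_transform(self, codes):
        return [self.labels[code] for code in codes]


browsers = ["firefox", "safari", "edge", "edge", "chrome"]
encoder = ToyLabelEncoder().fit(browsers)
encoded = encoder.transform(browsers)          # numeric form seen by the model
restored = encoder.reverse_transform(encoded)  # original categories after sampling
print(encoded, restored)
```

The model only ever sees the numeric form, and the reverse step is what returns sampled rows to their original type.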

Fitting all the models follows the same steps: initialise the model with constraints and transformers, fit the model, then sample (generate) the amount of data needed.

### GaussianCopula <a class="anchor" id="gaussian_fit"></a>

In [24]:
gaussian = GaussianCopula(field_transformers=field_transformer)
gaussian.fit(dfEnriched)

In [25]:
gaussianData = gaussian.sample(10000)
gaussianData.head()

Unnamed: 0,ipRange,ipHop,timeSubmitSec,inList,appHour,label,browserType,inCustType,packageType,weekDayType
0,21898.0,0.0,18.0,0.0,9.0,0,edge,Y,G,Fri
1,16742.0,2.0,19.0,0.0,13.0,0,firefox,Y,J,Sun
2,20400.0,0.0,29.0,0.0,12.0,0,chrome,N,G,Sat
3,13103.0,0.0,17.0,0.0,4.0,0,edge,Y,D,Fri
4,12909.0,1.0,20.0,0.0,11.0,0,safari,Y,G,Sat


### CTGAN <a class="anchor" id="CTGAN_fit"></a>

In [26]:
ctg = CTGAN(field_transformers=field_transformer)
ctg.fit(dfEnriched)

In [27]:
ctgData = ctg.sample(10000)
ctgData.describe()

Unnamed: 0,ipRange,ipHop,timeSubmitSec,inList,appHour,label
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,13494.2042,0.7782,14.8031,0.9229,13.2342,0.1009
std,5613.478282,0.73093,7.411479,1.290629,5.080151,0.301211
min,-178.0,0.0,-1.0,0.0,0.0,0.0
25%,9294.75,0.0,5.0,0.0,10.0,0.0
50%,12852.5,1.0,18.0,0.0,11.0,0.0
75%,17802.5,1.0,21.0,2.0,20.0,0.0
max,35223.0,3.0,25.0,3.0,24.0,1.0


### CopulaGAN <a class="anchor" id="copula_fit"></a>

In [28]:
copula = CopulaGAN(field_transformers=field_transformer)
copula.fit(dfEnriched)

In [29]:
copulaData = copula.sample(10000)
copulaData.describe()

Unnamed: 0,ipRange,ipHop,timeSubmitSec,inList,appHour,label
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,13293.7248,0.8233,15.3848,0.9184,12.5512,0.0
std,5377.27019,0.810643,6.820336,1.228207,4.97647,0.0
min,0.0,0.0,1.0,0.0,0.0,0.0
25%,9291.0,0.0,6.0,0.0,10.0,0.0
50%,14585.0,1.0,19.0,0.0,12.0,0.0
75%,15507.0,1.0,20.0,2.0,14.0,0.0
max,34384.0,3.0,26.0,9.0,23.0,0.0


### TVAE  <a class="anchor" id="TVAE_fit"></a>

In [30]:
tvae = TVAE(field_transformers=field_transformer)
tvae.fit(dfEnriched)

In [31]:
tvaeData = tvae.sample(10000)
tvaeData.head()

Unnamed: 0,ipRange,ipHop,timeSubmitSec,inList,appHour,label,browserType,inCustType,packageType,weekDayType
0,15449.0,1.0,1.0,3.0,22.0,0,vivaldi,N,M,Sun
1,15376.0,0.0,19.0,0.0,10.0,0,chrome,Y,A,Sun
2,6171.0,1.0,18.0,0.0,12.0,0,chrome,Y,G,Sun
3,6139.0,1.0,20.0,0.0,10.0,0,chrome,Y,D,Sun
4,12302.0,1.0,19.0,0.0,12.0,0,chrome,Y,D,Sun


## Evaluation of synthetic data <a class="anchor" id="eval"></a>

### Metric types <a class="anchor" id="metrics"></a>
#### Statistical metrics <a class="anchor" id="stat"></a>
These metrics compare individual columns from the real table with the corresponding column from the synthetic table, and at the end report the average outcome from the test.

_CS test_: This metric uses the Chi-Squared test to compare the distributions of two discrete columns. The output for each column is the CSTest p-value, which indicates the probability of the two columns having been sampled from the same distribution.

_KS test_: This metric uses the two-sample Kolmogorov–Smirnov test to compare the distributions of continuous columns using the empirical CDF. The output for each column is 1 minus the KS Test D statistic, which indicates the maximum distance between the expected CDF and the observed CDF values.
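
The KS-based score can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the SDV code: it builds the two empirical CDFs, takes their largest gap D and returns 1 - D. The `timeSubmitSec` values are the ones from the first rows of the real and Gaussian Copula tables shown above.

```python
def inverted_ks_statistic(real, synthetic):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    i = j = 0
    max_gap = 0.0
    # Walk through all observed values and compare the two empirical CDFs.
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        while i < len(real_sorted) and real_sorted[i] <= x:
            i += 1
        while j < len(synth_sorted) and synth_sorted[j] <= x:
            j += 1
        gap = abs(i / len(real_sorted) - j / len(synth_sorted))
        max_gap = max(max_gap, gap)
    return 1.0 - max_gap


real_times = [20, 18, 18, 20, 21]
synthetic_times = [18, 19, 29, 17, 20]
print(inverted_ks_statistic(real_times, synthetic_times))
```

Identical samples score 1 and samples with no overlap score 0, which matches the MAXIMIZE goal reported for this metric.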

#### Likelihood Metrics <a class="anchor" id="likelihood"></a>

The metrics of this family compare the tables by fitting a probabilistic model to the real data and then computing the likelihood of the synthetic data belonging to the learned distribution.


_Bayesian Network likelihood_: This metric fits a BayesianNetwork to the real data and then evaluates the average likelihood of the rows from the synthetic data on it.

_Bayesian Network log likelihood_: This metric fits a BayesianNetwork to the real data and then evaluates the average log likelihood of the rows from the synthetic data on it.

_Gaussian Mixture log likelihood_: This metric fits multiple GaussianMixture models to the real data and then evaluates the average log likelihood of the synthetic data on them.
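
The following sketch shows the general shape of a likelihood metric on a single numeric column. To keep it short, it fits one Gaussian instead of a Bayesian network or a Gaussian mixture, so it is only a stand-in for the metrics listed above. The function name and the example values are assumptions.

```python
import math
from statistics import NormalDist


def average_log_likelihood(real, synthetic):
    """Fit one Gaussian to the real column and score the synthetic values under it."""
    model = NormalDist.from_samples(real)
    return sum(math.log(model.pdf(x)) for x in synthetic) / len(synthetic)


real_hours = [11.0, 11.0, 10.0, 10.0, 11.0, 12.0, 9.0]
close = average_log_likelihood(real_hours, [10.0, 11.0, 12.0])
far = average_log_likelihood(real_hours, [2.0, 23.0, 30.0])
print(close, far)
```

Synthetic values that fall where the real data is dense receive a higher average log likelihood, and values far from it are penalised heavily, which helps explain the very large negative raw scores that can appear for this family.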

#### Detection Metrics <a class="anchor" id="detection"></a>
The metrics of this family evaluate how hard it is to distinguish the synthetic data from the real data by using a Machine Learning model. To do this, the metrics shuffle the real data and synthetic data together with flags indicating whether each row is real or synthetic, and then cross-validate a Machine Learning model that tries to predict this flag. The output of the metrics is 1 minus the average ROC AUC score across all the cross-validation splits, so the closer the value gets to 1, the harder it is for the classifier to separate the real data from the generated data.

_Logistic Detection_: Detection metric based on a LogisticRegression classifier.

_SVC Detection_: Detection metric based on an SVC classifier.
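
The last step of a detection metric, turning classifier outputs into 1 minus the ROC AUC, can be sketched as follows. The classifier and the cross-validation are left out: `predicted_scores` stands for whatever probabilities a fitted model would assign to "synthetic" on one validation split. The function name and the example scores are assumptions.

```python
def detection_score(is_synthetic, predicted_scores):
    """Return 1 - ROC AUC for a classifier's scores on one validation split."""
    synth = [s for flag, s in zip(is_synthetic, predicted_scores) if flag]
    real = [s for flag, s in zip(is_synthetic, predicted_scores) if not flag]
    # AUC equals the share of (synthetic, real) pairs ranked correctly; ties count half.
    correct = 0.0
    for s in synth:
        for r in real:
            if s > r:
                correct += 1.0
            elif s == r:
                correct += 0.5
    auc = correct / (len(synth) * len(real))
    return 1.0 - auc


flags = [1, 1, 1, 0, 0, 0]
# The classifier separates the two groups perfectly.
easy_to_spot = detection_score(flags, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
# The classifier gives every row the same score, i.e. it is guessing.
hard_to_spot = detection_score(flags, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
print(easy_to_spot, hard_to_spot)
```

A perfectly separating classifier drives the score to 0, while a classifier at chance level (AUC of 0.5) yields 0.5.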

### Gaussian Copula  <a class="anchor" id="gaussian_eval"></a>

In [32]:
evaluate(gaussianData, dfEnriched)

0.6520233466816736

In [33]:
evaluate(gaussianData, dfEnriched, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-5.775408,0.003103,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.6901693,0.690169,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,1.0,1.0,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-631449400000000.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.9647584,0.964758,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.8376167,0.837617,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.87294,0.87294,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.6069475,0.606947,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.8970135,0.897014,0.0,1.0,MAXIMIZE


### CTGAN <a class="anchor" id="CTGAN_eval"></a>

In [34]:
evaluate(ctgData, dfEnriched)

0.5722109776041407

In [35]:
evaluate(ctgData, dfEnriched, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-6.184887,0.00206,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.509825,0.509825,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.3736439,0.373644,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-55199530000.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.9736329,0.973633,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.8630167,0.863017,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.87256,0.87256,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.712338,0.712338,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.8428682,0.842868,0.0,1.0,MAXIMIZE


### CopulaGAN<a class="anchor" id="copula_eval"></a>

In [36]:
evaluate(copulaData, dfEnriched)

0.5242974541878012

In [37]:
evaluate(copulaData, dfEnriched, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-6.138766,0.002158,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.3300543,0.330054,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.3207198,0.32072,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-62441370000.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.9011471,0.901147,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.8545833,0.854583,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.88093,0.88093,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.6400494,0.640049,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.7914871,0.791487,0.0,1.0,MAXIMIZE


### TVAE <a class="anchor" id="TVAE_eval"></a>

In [38]:
evaluate(tvaeData, dfEnriched)

0.4323478762511248

In [39]:
evaluate(tvaeData, dfEnriched, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-4.059201,0.017263,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.08772479,0.087725,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.02649471,0.026495,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-24412670000.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.9019683,0.901968,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.8578333,0.857833,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.76861,0.76861,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.7157558,0.715756,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.5171183,0.517118,0.0,1.0,MAXIMIZE


## Saving data

In [40]:
Gauss_json = gaussianData.to_json(orient="values")
with open('data/generated/gaussian_telco.json', 'w') as file:
    file.write(Gauss_json)

In [41]:
ctgan_json = ctgData.to_json(orient="values")
with open('data/generated/ctgan_telco.json', 'w') as file:
    file.write(ctgan_json)

In [42]:
copula_json = copulaData.to_json(orient="values")
with open('data/generated/copula_telco.json', 'w') as file:
    file.write(copula_json)

In [43]:
tvae_json = tvaeData.to_json(orient="values")
with open('data/generated/tvae_telco.json', 'w') as file:
    file.write(tvae_json)

# Experimentation on enriched dataset <a class="anchor" id="Experimentation"></a>

This dataset involves more categorical data. Some of the numeric columns should not be generated from a distribution the way ordinary numeric data is. Some of them could be assigned as primary keys; however, for simplicity in this exploration, the customer, account, subscription and phone numbers are generated as if they were categorical data. There is no transformer that generates acceptable phone numbers, so one would need to be created (or the column anonymized).
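
One possible shape for such a phone number generator is sketched below. It is purely hypothetical: the `352691` prefix and the 12-digit length are only inferred from the sample rows of this dataset, and the function is not part of SDV.

```python
import random


def fake_phone_numbers(n, prefix="352691", digits=6, seed=0):
    """Generate n distinct numbers that share a prefix (hypothetical format)."""
    rng = random.Random(seed)
    # Sample the suffixes without replacement so that no number repeats.
    suffixes = rng.sample(range(10 ** digits), n)
    return [int(f"{prefix}{suffix:0{digits}d}") for suffix in suffixes]


numbers = fake_phone_numbers(5)
print(numbers)
```

Because the suffixes are drawn at random, the generated numbers are not tied to any real subscriber, which also covers the anonymization option mentioned above.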

## Loading enriched dataset <a class="anchor" id="loading2"></a>

In [44]:
dfen = pd.read_csv('./data/Enriched_dataset.csv',
                   sep=',',
                   error_bad_lines=False)
dfen.drop(columns=['Std_Data', 'Std_Message', 'Std_VoiceIn', 'Std_VoiceOut'],
          inplace=True)
dfen.head()

Unnamed: 0.1,Unnamed: 0,Company,Customer Number,Account Number,Subscription Number,Phone Number,Product Line,Product Family,Produit,Client,VOICE IN (Min),VOICE OUT (Min),Message,Data
0,0,.,21024170,21024170,199761,352691277566,MOBILE,smart,LUX_SMART_V7_S_SIM_ONLY,SL,571.533333,2001.116667,22,221.89308
1,1,.,22073297,22197893,1083622,352691277009,MOBILE,smart,LUX_SMART_V09_XL_PACKAGE,SL,1193.066667,5318.65,1429,245019.044338
2,2,AUTOCARS SALES - LENTZ S.A.,21024519,21024519,192883,352691211012,MOBILE,smart,LUX_SMART_V6_L_PACKAGE,SL,2415.383333,5557.983333,101,19492.130311
3,3,AUTOCARS SALES - LENTZ S.A.,21024519,21024519,205747,352691323527,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,0.0,4.233333,0,0.0
4,4,AUTOCARS SALES - LENTZ S.A.,21024519,21024519,1134547,352691277032,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,443.883333,1519.433333,3,0.0


In [45]:
dfen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2041 entries, 0 to 2040
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           2041 non-null   int64  
 1   Company              2041 non-null   object 
 2   Customer Number      2041 non-null   int64  
 3   Account Number       2041 non-null   int64  
 4   Subscription Number  2041 non-null   int64  
 5   Phone Number         2041 non-null   int64  
 6   Product Line         2041 non-null   object 
 7   Product Family       2041 non-null   object 
 8   Produit              2041 non-null   object 
 9   Client               2041 non-null   object 
 10  VOICE IN (Min)       2041 non-null   float64
 11  VOICE OUT (Min)      2041 non-null   float64
 12  Message              2041 non-null   int64  
 13  Data                 2041 non-null   float64
dtypes: float64(3), int64(6), object(5)
memory usage: 223.4+ KB


## Fitting the models <a class="anchor" id="fitting2"></a>

In [46]:
field_transformer_en = {
    'Company': 'categorical',
    'Customer Number': 'categorical',
    'Account Number': 'categorical',
    'Subscription Number': 'categorical',
    'Phone Number': 'categorical',
    'Product Line': 'categorical',
    'Product Family': 'categorical',
    'Produit': 'categorical',
    'Client': 'categorical',
    'VOICE IN (Min)': 'float',
    'VOICE OUT (Min)': 'float',
    'Message': 'integer',
    'Data': 'float'
}

### GaussianCopula <a class="anchor" id="gaussian_fit2"></a>

In [47]:
gaussian2 = GaussianCopula(primary_key='Unnamed: 0',
                           field_transformers=field_transformer_en)
gaussian2.fit(dfen)

In [48]:
gData2 = gaussian2.sample(10000)
gData2.head()

Unnamed: 0.1,Unnamed: 0,Company,Customer Number,Account Number,Subscription Number,Phone Number,Product Line,Product Family,Produit,Client,VOICE IN (Min),VOICE OUT (Min),Message,Data
0,0,SLA S.A. C/O CAMIONNETTE,27987470,27381866,4278919,352691960885,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,EW,157.361176,1209.688115,0,0.01609182
1,1,VOYAGES EMILE WEBER SÀRL,22443496,27381868,5895841,352691360208,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,693.560887,1726.239898,0,1452511000000000.0
2,2,VOYAGES EMILE WEBER SÀRL,22443496,27381868,6044254,352691992570,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,514.935335,1087.071265,0,26302.0
3,3,VOYAGES EMILE WEBER SÀRL,22443496,27381866,6121324,352691977593,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,-205.437498,188.078911,0,6169.014
4,4,SLA S.A. C/O CAMIONNETTE,27987470,27357080,5993832,352691971631,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,690.90015,414.037953,0,0.125418


### CTGAN <a class="anchor" id="CTGAN_fit2"></a>

In [49]:
ctgan2 = CTGAN(primary_key='Unnamed: 0',
               field_transformers=field_transformer_en)
ctgan2.fit(dfen)

In [50]:
ctData2 = ctgan2.sample(10000)
ctData2.head()

Unnamed: 0.1,Unnamed: 0,Company,Customer Number,Account Number,Subscription Number,Phone Number,Product Line,Product Family,Produit,Client,VOICE IN (Min),VOICE OUT (Min),Message,Data
0,0,SALES LENTZ,27683372,21023952,1139582,352691277050,MOBILE,Corporate,LUX_SMART_V6_XL_PACKAGE,SL,394.064512,494.778488,100,33404.691349
1,1,SLA S.A C/O BUS,27683372,21024519,5839261,352691276426,MOBILE,smart,LUX_SMART_V7_S_PACKAGE,SL,26.122108,28.625366,3,-979.281535
2,2,VOYAGES EMILE WEBER SÀRL,22443496,27381868,6041246,352691111249,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,EW,-14.748141,76.936122,-3,3839.905149
3,3,NORD-TAXI SARL,27987468,27183988,3645838,352691330050,FIXED,Mobile_Internet,LUX_POST_T2-VOIP,SL,39.174653,3052.395835,11,36079.348931
4,4,SLA S.A. C/O CAMIONNETTE,22443496,27381868,6016034,352691345914,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,-5.748331,19.124029,-3,1597.471201


### CopulaGAN <a class="anchor" id="copula_fit2"></a>

In [51]:
copulagan2 = CopulaGAN(primary_key='Unnamed: 0',
                       field_transformers=field_transformer_en)
copulagan2.fit(dfen)

In [52]:
cData2 = copulagan2.sample(10000)
cData2.head()

Unnamed: 0.1,Unnamed: 0,Company,Customer Number,Account Number,Subscription Number,Phone Number,Product Line,Product Family,Produit,Client,VOICE IN (Min),VOICE OUT (Min),Message,Data
0,0,SLA S.A C/O BUS,25769326,27381866,6030386,352691993532,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,-5.059053,33.630729,0,1605.069713
1,1,SALES LENTZ AUTOCARS SA,22443496,21024519,5543653,352691967724,MOBILE,smart,LUX_POST_TAN_ENTERPRISE,SL,-24.346474,12.072966,0,12515.412238
2,2,SLA S.A C/O BUS,27987468,26495802,1713234,352691119892,MOBILE,Mobile_Internet,LUX_MOBILE_ADSL_V6_EU_EXTRA_LARGE,SL,-5.89859,8.847995,0,8378.122048
3,3,SLA S.A C/O BUS,27987468,27381868,6055119,352691345983,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,-13.342319,-11.739683,0,2657.315147
4,4,SALES-LENTZ PROBUS,27987468,26495802,6083235,352691975710,MOBILE,Mobile_Internet,LUX_POSTPAID_M2MCARD,SL,200.211187,-7.33185,17,3314.038299


### TVAE  <a class="anchor" id="TVAE_fit2"></a>

In [53]:
tvae2 = TVAE(primary_key='Unnamed: 0', field_transformers=field_transformer_en)
tvae2.fit(dfen)

In [54]:
tData2 = tvae2.sample(10000)
tData2.head()

Unnamed: 0.1,Unnamed: 0,Company,Customer Number,Account Number,Subscription Number,Phone Number,Product Line,Product Family,Produit,Client,VOICE IN (Min),VOICE OUT (Min),Message,Data
0,0,VOYAGES EMILE WEBER SÀRL,22443496,21022650,1139954,352691384629,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,EW,1177.308308,771.503224,78,150.242561
1,1,SLA S.A. C/O CAMIONNETTE,27987470,27381868,6072415,352691994593,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,6.604915,-0.028761,-1,2952.430055
2,2,SLA S.A C/O BUS,27987468,27381866,5877543,352691969116,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,1.559783,-7.039284,0,7049.991365
3,3,SALES-LENTZ AUTOCARS S.A – EXECUTIVE LANE,21023952,22911009,6059424,352691330338,MOBILE,smart,LUX_SMART_V6_L_PACKAGE,SL,116.23433,3.340964,0,3947.47098
4,4,SLA S.A C/O BUS,27987468,27381866,5839024,352691927558,MOBILE,Enterprise,LUX_POST_TAN_ENTERPRISE,SL,-4.599274,-19.864924,0,6492.621145


## Evaluation of synthetic data <a class="anchor" id="eval2"></a>

### Gaussian Copula  <a class="anchor" id="gaussian_eval2"></a>

In [55]:
# Aggregate score across all metrics (higher is better)
evaluate(gData2, dfen)

0.6688430024609224

In [56]:
# Per-metric breakdown instead of a single aggregate score
evaluate(gData2, dfen, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-10.21925,3.6e-05,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,1.0,1.0,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.9989999,0.999,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-1.2024649999999999e+64,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.9547906,0.954791,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.7468943,0.746894,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.8025602,0.80256,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.8116512,0.811651,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.7037542,0.703754,0.0,1.0,MAXIMIZE
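
The `normalized_score` column can be related to `raw_score` by hand. The sketch below rests on assumptions checked only against the numbers in the table above, not on anything stated in the notebook: that metrics bounded on `(-inf, 0]` are normalized with `exp(raw_score)`, that metrics unbounded on both sides are normalized with a sigmoid, and that the aggregate returned by `evaluate` is the mean of the normalized scores.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # underflows to 0.0 for very negative x
    return z / (1.0 + z)

# BNLogLikelihood: range (-inf, 0], assumed normalization exp(raw_score)
bn_norm = math.exp(-10.21925)

# GMLogLikelihood: range (-inf, inf), assumed normalization sigmoid(raw_score)
gm_norm = sigmoid(-1.2024649999999999e+64)

# normalized_score column of the Gaussian Copula table above
normalized = [3.6e-05, 1.0, 0.999, 0.0, 0.954791,
              0.746894, 0.80256, 0.811651, 0.703754]
mean_norm = sum(normalized) / len(normalized)

print(f'BNLogLikelihood normalized: {bn_norm:.1e}')
print(f'GMLogLikelihood normalized: {gm_norm}')
print(f'mean of normalized scores:  {mean_norm:.4f}')
```

The mean lands close to the aggregate reported in the previous cell. The two `evaluate` calls are separate runs, so a small difference is plausible if some metrics are not deterministic. Under these assumptions the two log-likelihood metrics contribute almost nothing to the aggregate, which is then driven by the detection, statistical-test and KL-divergence metrics.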


### CTGAN <a class="anchor" id="CTGAN_eval2"></a>

In [57]:
evaluate(ctData2, dfen)

0.5167720651834209

In [58]:
evaluate(ctData2, dfen, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-11.38601,1.1e-05,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.3082401,0.30824,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.7016049,0.701605,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-29126100.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.7849435,0.784944,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.7085351,0.708535,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.7398639,0.739864,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.7378287,0.737829,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.4451199,0.44512,0.0,1.0,MAXIMIZE


### CopulaGAN <a class="anchor" id="copula_eval2"></a>

In [59]:
evaluate(cData2, dfen)

0.5110820942595674

In [60]:
evaluate(cData2, dfen, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-10.08782,4.2e-05,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.4700525,0.470052,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,1.0,1.0,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-313688000.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.8039081,0.803908,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.697915,0.697915,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.7400294,0.740029,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.7177073,0.717707,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.4499719,0.449972,0.0,1.0,MAXIMIZE


### TVAE <a class="anchor" id="TVAE_eval2"></a>

In [61]:
evaluate(tData2, dfen)

0.6191024064542601

In [62]:
evaluate(tData2, dfen, aggregate=False)

Unnamed: 0,metric,name,raw_score,normalized_score,min_value,max_value,goal
0,BNLogLikelihood,BayesianNetwork Log Likelihood,-3.56608,0.028266,-inf,0.0,MAXIMIZE
1,LogisticDetection,LogisticRegression Detection,0.358121,0.358121,0.0,1.0,MAXIMIZE
2,SVCDetection,SVC Detection,0.8895365,0.889537,0.0,1.0,MAXIMIZE
11,GMLogLikelihood,GaussianMixture Log Likelihood,-18763910.0,0.0,-inf,inf,MAXIMIZE
12,CSTest,Chi-Squared,0.9824874,0.982487,0.0,1.0,MAXIMIZE
13,KSTest,Inverted Kolmogorov-Smirnov D statistic,0.736671,0.736671,0.0,1.0,MAXIMIZE
14,KSTestExtended,Inverted Kolmogorov-Smirnov D statistic,0.817103,0.817103,0.0,1.0,MAXIMIZE
27,ContinuousKLDivergence,Continuous Kullback–Leibler Divergence,0.8429204,0.84292,0.0,1.0,MAXIMIZE
28,DiscreteKLDivergence,Discrete Kullback–Leibler Divergence,0.8532216,0.853222,0.0,1.0,MAXIMIZE
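
To compare the four models at a glance, the aggregate scores printed above can be collected and ranked. This is a small sketch that copies the reported values by hand; in the notebook the same dictionary could be filled by calling `evaluate` on each synthetic dataset.

```python
# Aggregate scores copied from the evaluate(...) outputs above
aggregate_scores = {
    'GaussianCopula': 0.6688430024609224,
    'CTGAN': 0.5167720651834209,
    'CopulaGAN': 0.5110820942595674,
    'TVAE': 0.6191024064542601,
}

# Sort from best to worst (all metrics have goal MAXIMIZE)
ranking = sorted(aggregate_scores.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranking:
    print(f'{name:<15}{score:.4f}')
```

On the enriched dataset GaussianCopula has the highest aggregate score, followed by TVAE. The per-metric tables show that TVAE scores higher on CSTest, KSTestExtended and both KL-divergence metrics, while GaussianCopula's lead comes from the two detection metrics.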


## Saving data

In [63]:
# orient="values" writes only the row values; column names are not stored
G_json = gData2.to_json(orient="values")
with open('data/generated/gaussian_enriched.json', 'w') as file:
    file.write(G_json)

In [64]:
ct_json = ctData2.to_json(orient="values")
with open('data/generated/ctgan_enriched.json', 'w') as file:
    file.write(ct_json)

In [65]:
c_json = cData2.to_json(orient="values")
with open('data/generated/copula_enriched.json', 'w') as file:
    file.write(c_json)

In [66]:
t_json = tData2.to_json(orient="values")
with open('data/generated/tvae_enriched.json', 'w') as file:
    file.write(t_json)
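
Files written with `orient="values"` contain a plain list of rows and no column names, so whoever reads them back has to supply the names. The sketch below mimics that layout with the standard library only. The three-column subset, the rounded values and the temporary file path are illustrative choices, not part of the notebook.

```python
import json
import os
import tempfile

# Hypothetical subset of the enriched dataset's columns, for illustration
columns = ['Company', 'Message', 'Data']
rows = [
    ['SALES LENTZ', 100, 33404.69],
    ['NORD-TAXI SARL', 11, 36079.35],
]

# Write a list of rows, the same shape DataFrame.to_json(orient="values") produces
path = os.path.join(tempfile.mkdtemp(), 'example_enriched.json')
with open(path, 'w') as f:
    json.dump(rows, f)

# Read the rows back; the file itself carries no column names
with open(path) as f:
    loaded = json.load(f)

# Reattach the column names to get one dict per row
records = [dict(zip(columns, row)) for row in loaded]
print(records[0])
```

In the notebook the equivalent step would be passing the loaded rows to `pd.DataFrame(loaded, columns=...)` with the original column list.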