# Tabular data generation performance demo

## Overview

In this notebook, we compare the sampling throughput of the tabular data generators available in the SynGen tool.

Available generators:

1. [KDE (Kernel Density Estimation)](#1)
1. [KDE (Kernel Density Estimation) from sklearn](#2)
1. [Uniform](#3)
1. [Gaussian](#4)
1. [CTGAN](#5)

### Imports

In [1]:
# preprocessing
from syngen.preprocessing.datasets.ieee import IEEEPreprocessing

# generators
from syngen.generator.tabular import (
    KDEGenerator,
    KDEGeneratorSK,
    UniformGenerator,
    GaussianGenerator,
    CTGANGenerator,
)

# Others
import time
import pandas as pd
from syngen.utils.types import MetaData

### Helper function

This function measures throughput in samples per second, averaged over `n` sampling runs of `samples` rows each.

In [2]:
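def measure_throughput(generator, n=10, samples=10000):
    """Return the average sampling throughput of `generator` in samples per second."""
    times = []
    for _ in range(n):
        # Time a single call that draws `samples` rows.
        start = time.perf_counter()
        generator.sample(samples)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    # Total samples drawn divided by total elapsed time, truncated to an integer.
    return int((samples * n) / sum(times))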
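
As a quick sanity check, the helper can be exercised without SynGen. The sketch below repeats the helper so that it runs on its own and calls it on `DummyGenerator`, a hypothetical stand-in that only mimics the `sample(n)` method the helper relies on. The reported number depends on the machine.

```python
import random
import time


class DummyGenerator:
    """Hypothetical stand-in: exposes only the `sample(n)` method the helper needs."""

    def sample(self, n):
        return [random.random() for _ in range(n)]


def measure_throughput(generator, n=10, samples=10000):
    """Return the average sampling throughput in samples per second."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        generator.sample(samples)
        times.append(time.perf_counter() - start)
    return int((samples * n) / sum(times))


throughput = measure_throughput(DummyGenerator(), n=3, samples=1000)
print(f"avg throughput: {throughput}")
```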

### Load tabular features

We use the `IEEEPreprocessing` class, which loads and prepares the entire dataset. We then extract the categorical edge-data columns, excluding the `user_id` and `product_id` identifiers.

In [3]:
preprocessing = IEEEPreprocessing(cached=False)

In [4]:
data = preprocessing.transform('/workspace/data/ieee-fraud/data.csv')

In [5]:
cols_to_drop = {'user_id', 'product_id'}
cat_cols = set(preprocessing.graph_info[MetaData.EDGE_DATA][MetaData.CATEGORICAL_COLUMNS]) - cols_to_drop
real = data[MetaData.EDGE_DATA][list(cat_cols)].reset_index(drop=True)

A utility dictionary stores the results:

In [6]:
results_dict = {}

<a id="1"></a>
## KDE (Kernel Density Estimation) Generator

A PyTorch implementation of [kernel density estimation (KDE)](https://en.wikipedia.org/wiki/Kernel_density_estimation).

In [7]:
kde_generator = KDEGenerator()
kde_generator.fit(real, categorical_columns=cat_cols)

kde_generator_throughput = measure_throughput(kde_generator)
results_dict['kde'] = kde_generator_throughput
print(f'avg throughput: {kde_generator_throughput}')

avg throughput: 1420600
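
To make the idea concrete: drawing from a Gaussian KDE amounts to picking a training point at random and perturbing it with kernel noise. The following is a minimal one-dimensional sketch in plain Python, not SynGen's PyTorch implementation. The function name `sample_gaussian_kde`, the toy data, and the bandwidth value are illustrative assumptions.

```python
import random


def sample_gaussian_kde(data, n_samples, bandwidth=0.5, seed=None):
    """Draw samples from a 1-D Gaussian KDE built on `data` (illustrative only)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        center = rng.choice(data)  # pick one kernel (training point) uniformly
        samples.append(rng.gauss(center, bandwidth))  # draw from that kernel
    return samples


data = [1.0, 1.2, 0.8, 5.0, 5.3]
kde_samples = sample_gaussian_kde(data, n_samples=1000, bandwidth=0.2, seed=0)
print(len(kde_samples))
```

Each sample costs one random choice and one normal draw, which may be part of why KDE-style generators sample quickly.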


<a id="2"></a>
## KDE (Kernel Density Estimation) Generator from sklearn

A wrapper around the scikit-learn [`KernelDensity`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html) estimator.

In [8]:
kde_sk_generator = KDEGeneratorSK()
kde_sk_generator.fit(real, categorical_columns=cat_cols)

kde_sk_generator_throughput = measure_throughput(kde_sk_generator)
results_dict['kde_sk'] = kde_sk_generator_throughput
print(f'avg throughput: {kde_sk_generator_throughput}')

avg throughput: 2640973


<a id="3"></a>
## Uniform Generator

Takes the data distribution from the real data and then samples uniformly from it.

In [9]:
uniform_generator = UniformGenerator()
uniform_generator.fit(real, categorical_columns=cat_cols)

uniform_generator_throughput = measure_throughput(uniform_generator)
results_dict['uniform'] = uniform_generator_throughput
print(f'avg throughput: {uniform_generator_throughput}')

avg throughput: 1768136
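
One plausible reading of this description, assumed here rather than taken from the SynGen source, is that fitting records the observed range of each numeric column and the set of observed values of each categorical column, and sampling then draws uniformly from each. `ToyUniformGenerator` and the toy rows below are hypothetical and only illustrate that reading.

```python
import random


class ToyUniformGenerator:
    """Hypothetical illustration; not the SynGen UniformGenerator."""

    def fit(self, rows, categorical_columns=()):
        self.categorical = set(categorical_columns)
        self.support = {}
        for col in rows[0]:
            values = [row[col] for row in rows]
            if col in self.categorical:
                # Categorical column: remember the distinct observed values.
                self.support[col] = sorted(set(values))
            else:
                # Numeric column: remember the observed (min, max) range.
                self.support[col] = (min(values), max(values))
        return self

    def sample(self, n, seed=None):
        rng = random.Random(seed)
        out = []
        for _ in range(n):
            row = {}
            for col, support in self.support.items():
                if col in self.categorical:
                    row[col] = rng.choice(support)
                else:
                    row[col] = rng.uniform(*support)
            out.append(row)
        return out


rows = [
    {"amount": 10.0, "card": "visa"},
    {"amount": 250.0, "card": "mastercard"},
    {"amount": 40.0, "card": "visa"},
]
toy_uniform = ToyUniformGenerator().fit(rows, categorical_columns=["card"])
uniform_samples = toy_uniform.sample(5, seed=0)
print(uniform_samples[0])
```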


<a id="4"></a>
## Gaussian Generator

Models the real data as normally distributed and samples from the fitted distribution.

In [10]:
gaussian_generator = GaussianGenerator()
gaussian_generator.fit(real, categorical_columns=cat_cols)

gaussian_generator_throughput = measure_throughput(gaussian_generator)
results_dict['gaussian'] = gaussian_generator_throughput
print(f'avg throughput: {gaussian_generator_throughput}')

avg throughput: 1509729
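
For a single numeric column, this fit-then-sample idea reduces to estimating a mean and a standard deviation and then drawing normal variates. The sketch below shows only that reduction, under the assumption of one independent numeric column. `fit_gaussian` and `sample_gaussian` are hypothetical helper names, and how SynGen handles categorical columns is not covered.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of one numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def sample_gaussian(mean, std, n, seed=None):
    """Draw `n` values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


values = [9.8, 10.1, 10.4, 9.9, 10.3]
mean, std = fit_gaussian(values)
gaussian_samples = sample_gaussian(mean, std, n=1000, seed=0)
print(f"fitted mean={mean:.2f}, std={std:.2f}")
```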


<a id="5"></a>
## CTGAN Generator

Implements the model from the paper [Modeling Tabular Data using Conditional GAN](https://arxiv.org/abs/1907.00503).

In [11]:
ctgan_generator = CTGANGenerator(epochs=1, batch_size=2000, verbose=True)
ctgan_generator.fit(real, categorical_columns=cat_cols)

ctgan_generator_throughput = measure_throughput(ctgan_generator)
results_dict['ctgan'] = ctgan_generator_throughput
print(f'avg throughput: {ctgan_generator_throughput}')

INFO:syngen.generator.tabular.ctgan:Epoch 1, Loss G:  0.4207, Loss D: -0.0177
avg throughput: 33202


## Results

In [12]:
pd.DataFrame(results_dict, index=['ieee'])

Throughput in samples per second:

| dataset | kde | kde_sk | uniform | gaussian | ctgan |
|---|---|---|---|---|---|
| ieee | 1420600 | 2640973 | 1768136 | 1509729 | 33202 |
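
The measurements above can be put on a common scale by dividing each throughput by that of the slowest generator. The snippet below copies the numbers from this run into a plain dictionary. The names `results` and `speedup` are illustrative, and the ratios apply to this run only.

```python
# Throughput values (samples per second) copied from the results above.
results = {
    "kde": 1420600,
    "kde_sk": 2640973,
    "uniform": 1768136,
    "gaussian": 1509729,
    "ctgan": 33202,
}

baseline = results["ctgan"]  # slowest generator in this run
speedup = {name: value / baseline for name, value in results.items()}

# Print the generators from fastest to slowest, relative to CTGAN.
for name, ratio in sorted(speedup.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>8}: {ratio:6.1f}x")
```

In this run, the sklearn KDE wrapper samples roughly 80 times faster than CTGAN, and the other statistical generators are roughly 43 to 53 times faster.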
