# Tabular data generation performance demo

## Overview

In this notebook we compare the throughput (samples generated per second) of the tabular data generators available in the SynGen tool.

Available generators:

1. [KDE (Kernel Density Estimation)](#1)
1. [Uniform](#2)
1. [Gaussian](#3)
1. [Random](#4)

### Imports

In [1]:
# preprocessing
from syngen.preprocessing.datasets import IEEEPreprocessing

# generators
from syngen.generator.tabular import (
    KDEGenerator,
    UniformGenerator,
    GaussianGenerator,
    RandomMVGenerator,
)

# Others
import time
import pandas as pd
from collections import defaultdict
from syngen.utils.types import MetaData



### Helper function

In [2]:
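def measure_throughput(generator, n=10, samples=100000, gpu=False):
    """Return the average throughput in samples per second over `n` runs."""
    times = []
    for _ in range(n):
        # Time a single call that generates `samples` rows.
        start = time.perf_counter()
        generator.sample(samples, gpu=gpu)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    # Total samples generated divided by total elapsed time.
    return int((samples * n) / sum(times))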
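
As a quick sanity check, the helper can be exercised without SynGen or any data. The sketch below is not part of the original notebook: `DummyGenerator` is a hypothetical stand-in that only mimics the `sample(num_samples, gpu=...)` call the helper relies on, and the helper is repeated so the snippet runs on its own.

```python
import time


def measure_throughput(generator, n=10, samples=100000, gpu=False):
    """Same helper as above: average samples per second over `n` runs."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        generator.sample(samples, gpu=gpu)
        times.append(time.perf_counter() - start)
    return int((samples * n) / sum(times))


class DummyGenerator:
    """Hypothetical generator used only to illustrate the helper."""

    def __init__(self):
        # Record every call so we can inspect how the helper drives it.
        self.calls = []

    def sample(self, num_samples, gpu=False):
        self.calls.append((num_samples, gpu))
        return [0.0] * num_samples


dummy = DummyGenerator()
throughput = measure_throughput(dummy, n=3, samples=1000)
print(f"calls: {len(dummy.calls)}, throughput: {throughput} samples/s")
```

The helper calls `sample` exactly `n` times and divides the total number of generated samples by the total elapsed time, so a single slow run (for example, a first GPU call that includes one-time setup) lowers the reported average.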

### Load tabular features

In [3]:
data_path = '/workspace/data/ieee-fraud'
preprocessed_path = '/workspace/data/ieee_preprocessed'

In [4]:
preprocessing = IEEEPreprocessing(source_path=data_path, destination_path=preprocessed_path)

In [5]:
feature_spec_original = preprocessing.transform(use_cache=True)

In [6]:
original_tabular_data, categorical_features = feature_spec_original.get_tabular_data(MetaData.EDGES, 'user-product', return_cat_feats=True)

In [7]:
results_dict = defaultdict(dict)

<a id="1"></a>
## KDE (Kernel Density Estimation) Generator


In [8]:
kde_generator = KDEGenerator()
kde_generator.fit(original_tabular_data, categorical_columns=categorical_features)

results_dict['kde-cpu'] = measure_throughput(kde_generator, gpu=False)
results_dict['kde-gpu'] = measure_throughput(kde_generator, gpu=True)
print(f"avg throughput: {results_dict['kde-cpu']}, {results_dict['kde-gpu']}")

avg throughput: 371296, 592132


<a id="2"></a>
## Uniform Generator

In [9]:
uniform_generator = UniformGenerator()
uniform_generator.fit(original_tabular_data, categorical_columns=categorical_features)
 
results_dict['uniform-cpu'] = measure_throughput(uniform_generator, gpu=False)
results_dict['uniform-gpu'] = measure_throughput(uniform_generator, gpu=True)
print(f"avg throughput: {results_dict['uniform-cpu']}, {results_dict['uniform-gpu']}")

avg throughput: 897421, 3621726


<a id="3"></a>
## Gaussian Generator

In [10]:
gaussian_generator = GaussianGenerator()
gaussian_generator.fit(original_tabular_data, categorical_columns=categorical_features)
 
results_dict['gaussian-cpu'] = measure_throughput(gaussian_generator, gpu=False)
results_dict['gaussian-gpu'] = measure_throughput(gaussian_generator, gpu=True)
print(f"avg throughput: {results_dict['gaussian-cpu']}, {results_dict['gaussian-gpu']}")

avg throughput: 530683, 983408


<a id="4"></a>
## Random Generator

In [11]:
random_generator = RandomMVGenerator()
random_generator.fit(original_tabular_data, categorical_columns=categorical_features)
 
results_dict['random-cpu'] = measure_throughput(random_generator, gpu=False)
results_dict['random-gpu'] = measure_throughput(random_generator, gpu=True)
print(f"avg throughput: {results_dict['random-cpu']}, {results_dict['random-gpu']}")

avg throughput: 440086, 6438646


## Results

In [12]:
pd.DataFrame(results_dict, index=['ieee'])

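Throughput in samples per second:

|      | kde-cpu | kde-gpu | uniform-cpu | uniform-gpu | gaussian-cpu | gaussian-gpu | random-cpu | random-gpu |
|------|---------|---------|-------------|-------------|--------------|--------------|------------|------------|
| ieee | 371296  | 592132  | 897421      | 3621726     | 530683       | 983408       | 440086     | 6438646    |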
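
To make the CPU and GPU columns easier to compare, the GPU-to-CPU speedup can be derived from the numbers above. The following sketch is not part of the original notebook: it hard-codes the throughput values reported in the table, and the `results` and `speedup` names are illustrative.

```python
# Throughput values (samples per second) copied from the results above.
results = {
    'kde-cpu': 371296, 'kde-gpu': 592132,
    'uniform-cpu': 897421, 'uniform-gpu': 3621726,
    'gaussian-cpu': 530683, 'gaussian-gpu': 983408,
    'random-cpu': 440086, 'random-gpu': 6438646,
}

# GPU-to-CPU throughput ratio for each generator.
speedup = {
    name: results[f'{name}-gpu'] / results[f'{name}-cpu']
    for name in ('kde', 'uniform', 'gaussian', 'random')
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

For this particular run, the random generator gains the most from the GPU (about 14.6x) and the KDE generator the least (about 1.6x). These ratios describe one dataset and one machine, so they may not carry over to other setups.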
