# GANs for Tabular Synthetic Data Generation

Typically, GANs are used to generate images. However, a GAN can also generate tabular data. In this part, we will use the Python tabgan utility to create synthetic data from an existing table. Specifically, we will train a GAN on a movie ratings dataset (ratings_small.csv) to generate synthetic ratings. [Cite:ashrapov2020tabular](https://arxiv.org/pdf/2010.00638.pdf)

## Installing Tabgan

PyTorch is the foundation of the tabgan neural network utility. The following code installs the software needed to run tabgan in Google Colab.

In [2]:
# HIDE OUTPUT
CMD = "wget https://raw.githubusercontent.com/Diyago/"\
  "GAN-for-tabular-data/master/requirements.txt"

!{CMD}
!pip install -r requirements.txt
!pip install tabgan

--2024-10-09 21:07:36--  https://raw.githubusercontent.com/Diyago/GAN-for-tabular-data/master/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-10-09 21:07:36 ERROR 404: Not Found.

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


In [17]:
# Only pandas and train_test_split are used in this cell
import pandas as pd
from sklearn.model_selection import train_test_split


# Load your dataset
rat_df = pd.read_csv('ratings_small.csv')

# Split into training and test sets
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
    rat_df.drop("rating", axis=1),
    rat_df["rating"],
    test_size=0.20,
    #shuffle=False,
    random_state=42,
)
# Create dataframe versions for tabular GAN
df_x_test, df_y_test = df_x_test.reset_index(drop=True), \
  df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)



## Training a GAN for the Ratings Dataset


In [18]:
from tabgan.sampler import GANGenerator

# Configure the generator: CTGAN training settings go in gen_params, and
# the LightGBM adversarial model settings go in adversarial_model_params.
generator = GANGenerator(
    gen_x_times=1.1,
    cat_cols=None,
    bot_filter_quantile=0.001,
    top_filter_quantile=0.999,
    is_post_process=True,
    adversarial_model_params={
        "metrics": "rmse",
        "max_depth": 2,
        "max_bin": 100,
        "learning_rate": 0.02,
        "random_state": 42,
        "n_estimators": 500,
    },
    pregeneration_frac=2,
    only_generated_data=False,
    gen_params={"batch_size": 500, "patience": 25, "epochs": 500},
)

# Train on the training split and generate synthetic features and targets.
gen_x, gen_y = generator.generate_data_pipe(
    df_x_train,
    df_y_train,
    df_x_test,
    deep_copy=True,
    only_adversarial=False,
    use_adversarial=True,
)

Fitting CTGAN transformers for each column:   0%|          | 0/4 [00:00<?, ?it/s]

Training CTGAN, epochs::   0%|          | 0/500 [00:00<?, ?it/s]

[LightGBM] [Info] Number of positive: 16001, number of negative: 16000
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001429 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 300
[LightGBM] [Info] Number of data points in the train set: 32001, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500016 -> initscore=0.000062
[LightGBM] [Info] Start training from score 0.000062
[LightGBM] [Info] Number of positive: 16000, number of negative: 16001
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001404 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 300
[LightGBM] [Info] Number of data points in the train set: 32001, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499984 -> initscore=-0.000062
[LightGBM] [Info] Start training from score -0.000062
[LightGBM] [Info] 

Note: If you receive an error when running the code above, you likely need to restart the runtime. The output of the second cell should include a "restart runtime" button. Once you restart the runtime, rerun all of the cells. This step is necessary because tabgan requires specific versions of some packages.
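Since the note above ties errors to package versions, it can help to list what is installed before and after a restart. The following is a minimal sketch using only the standard library. The package list is an assumption and is not taken from tabgan's requirements, so adjust it to your environment.

```python
from importlib import metadata

# Assumed list of packages worth checking; not taken from tabgan's requirements.
PACKAGES = ["tabgan", "torch", "lightgbm", "pandas", "numpy", "scikit-learn"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>14}: {version or 'not installed'}")
```

If a version changes after the install cell runs, that is a hint that the runtime needs a restart before the new version is picked up.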

## Evaluating the GAN Results

If we display the results, we can see that the GAN-generated data looks similar to the original. Some values, typically whole numbers in the original data, have fractional values in the synthetic data.
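Eyeballing the rows is a start, but a side-by-side comparison of summary statistics gives a slightly firmer check. The sketch below is one possible approach and is not part of tabgan. The helper `compare_summary` and the toy frames are hypothetical stand-ins, used so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def compare_summary(real, synthetic):
    """Side-by-side mean, std, min, and max for columns present in both frames."""
    shared = [c for c in real.columns if c in synthetic.columns]
    stats = ["mean", "std", "min", "max"]
    real_stats = real[shared].agg(stats).T.add_prefix("real_")
    synth_stats = synthetic[shared].agg(stats).T.add_prefix("synth_")
    return pd.concat([real_stats, synth_stats], axis=1)


# Toy stand-ins for the real and generated frames (hypothetical data).
rng = np.random.default_rng(0)
real_demo = pd.DataFrame({
    "userId": rng.integers(1, 672, 1000),
    "rating": rng.choice(np.arange(0.5, 5.5, 0.5), 1000),
})
# Simulate "synthetic" data by adding small noise to the real values.
synth_demo = real_demo + rng.normal(0, 0.1, real_demo.shape)

print(compare_summary(real_demo, synth_demo).round(2))
```

In this notebook, the analogous call would be `compare_summary(df_x_train, gen_x)`. Large gaps between the real and synthetic columns suggest the generator has not captured that feature well.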

In [19]:
gen_x


| | userId | movieId | timestamp |
|---|---|---|---|
| 0 | 510 | 133730 | 849588178 |
| 1 | 510 | 148672 | 846457615 |
| 2 | 513 | 146590 | 851162656 |
| 3 | 510 | 132772 | 846242746 |
| 4 | 514 | 146968 | 833662076 |
| ... | ... | ... | ... |
| 182987 | 624 | 142536 | 1473962172 |
| 182988 | 624 | 140725 | 1462187665 |
| 182989 | 624 | 144714 | 1474972061 |
| 182990 | 624 | 143257 | 1461356507 |


In [20]:
gen_y

| | rating |
|---|---|
| 0 | 2.097580 |
| 1 | 3.033221 |
| 2 | 2.529508 |
| 3 | 2.091977 |
| 4 | 0.333745 |
| ... | ... |
| 182987 | 2.000000 |
| 182988 | 3.000000 |
| 182989 | 1.500000 |
| 182990 | 2.500000 |
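The fractional ratings in the first rows could be snapped back to a valid rating grid if downstream code expects one. The sketch below assumes ratings fall on a half-point scale from 0.5 to 5.0. That scale is an assumption about this dataset, and `snap_to_grid` is a hypothetical helper, not a tabgan function.

```python
import numpy as np
import pandas as pd


def snap_to_grid(values, step=0.5, low=0.5, high=5.0):
    """Round values to the nearest multiple of step, then clip to [low, high].

    The default scale (0.5 to 5.0 in half-point steps) is an assumption.
    """
    snapped = np.round(np.asarray(values, dtype=float) / step) * step
    return np.clip(snapped, low, high)


# The first five generated ratings shown above.
demo = pd.Series([2.097580, 3.033221, 2.529508, 2.091977, 0.333745])
print(snap_to_grid(demo).tolist())
```

Applied to the notebook's output, this would look like `gen_y["rating"] = snap_to_grid(gen_y["rating"])`. Whether to snap at all is a design choice: it restores valid rating values but discards the small variations the GAN produced.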


In [21]:
# Concatenate gen_x and gen_y along the columns
combined_generated_data = pd.concat([gen_x, gen_y], axis=1)

# Save to CSV
combined_generated_data.to_csv('ratings_combined.csv', index=False)

In [22]:
combined_generated_data

| | userId | movieId | timestamp | rating |
|---|---|---|---|---|
| 0 | 510 | 133730 | 849588178 | 2.097580 |
| 1 | 510 | 148672 | 846457615 | 3.033221 |
| 2 | 513 | 146590 | 851162656 | 2.529508 |
| 3 | 510 | 132772 | 846242746 | 2.091977 |
| 4 | 514 | 146968 | 833662076 | 0.333745 |
| ... | ... | ... | ... | ... |
| 182987 | 624 | 142536 | 1473962172 | 2.000000 |
| 182988 | 624 | 140725 | 1462187665 | 3.000000 |
| 182989 | 624 | 144714 | 1474972061 | 1.500000 |
| 182990 | 624 | 143257 | 1461356507 | 2.500000 |

