<h3> Generating Synthetic Data using GANs</h3>


Typically GANs are used to generate images. However, we can also generate tabular data from a GAN [Cite:ashrapov2020tabular](https://arxiv.org/pdf/2010.00638.pdf)

In [2]:
# HIDE OUTPUT
CMD = "wget https://raw.githubusercontent.com/Diyago/"\
  "GAN-for-tabular-data/master/requirements.txt"

!{CMD}
!pip install -r requirements.txt
!pip install tabgan

--2023-07-15 22:35:26--  https://raw.githubusercontent.com/Diyago/GAN-for-tabular-data/master/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 183 [text/plain]
Saving to: ‘requirements.txt.1’


2023-07-15 22:35:26 (8.46 MB/s) - ‘requirements.txt.1’ saved [183/183]



In [None]:
def convert_arrays_to_csv(file_path):
    # Read the array of arrays from the .txt file
    with open(file_path, 'r') as file:
        array_of_arrays = np.array(eval(file.read()))

    # Convert the array of arrays to a DataFrame
    file_path = 'path/to/your/file.txt'
    df = convert_arrays_to_csv(file_path)

    # Save the DataFrame to a CSV file
    csv_file_path = file_path.replace('.txt', '.csv')
    df.to_csv(csv_file_path, index=False)

    # Assign df to the newly created CSV file
    df = pd.read_csv(csv_file_path)

    return df

In [10]:
# HIDE OUTPUT
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

df = pd.read_csv("sample_data/stl.csv")

COLS_USED = ["NUM_FINGERS",
    "PSI_MIN",
    "PSI_MAX",
    "PRINT_TIME_HOURS",
    "BEND_RANGE_ANGLE",
    "PERFORMANCE",
    "OBJ_CONTRACT",
    "OBJ_EXPAND",
    "OBJ_SPHERICAL",
    "OBJ_LINEAR",
    "UTZ_CONTROL",
    "UTZ_MANIPULATION",
    "OUTPUT"]

COLS_TRAIN = ["NUM_FINGERS",
    "PSI_MIN",
    "PSI_MAX",
    "PRINT_TIME_HOURS",
    "BEND_RANGE_ANGLE",
    "PERFORMANCE",
    "OBJ_CONTRACT",
    "OBJ_EXPAND",
    "OBJ_SPHERICAL",
    "OBJ_LINEAR",
    "UTZ_CONTROL",
    "UTZ_MANIPULATION",
              ]
df = df[COLS_USED]


# Split into training and test sets
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
    df.drop("OUTPUT", axis=1),
    df["OUTPUT"],
    test_size=10,
    #shuffle=False,
    random_state=42,
)

# Create dataframe versions for tabular GAN
df_x_test, df_y_test = df_x_test.reset_index(drop=True), \
  df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)

# Pandas to Numpy
x_train = df_x_train.values
x_test = df_x_test.values
y_train = df_y_train.values
y_test = df_y_test.values

# Build the neural network
model = Sequential()
# Hidden 1
model.add(Dense(50, input_dim=x_train.shape[1], activation='relu'))
model.add(Dense(25, activation='relu')) # Hidden 2
model.add(Dense(12, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')

# monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
#         patience=5, verbose=1, mode='auto',
#         restore_best_weights=True)
# model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor], verbose=2,epochs=1000)

We now evaluate the trained neural network to see the RMSE. We will use this trained neural network to compare the accuracy between the original data and the GAN-generated data

In [None]:
pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred,y_test))
print("Final score (RMSE): {}".format(score))

In [None]:
df_x_train

## Training a GAN

Next, we will train the GAN to generate fake data from the original  data

In [None]:
from tabgan.sampler import GANGenerator
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

gen_x, gen_y = GANGenerator(gen_x_times=1.1, cat_cols=None,
           bot_filter_quantile=0.001, top_filter_quantile=0.999, \
              is_post_process=True,
           adversarial_model_params={
               "metrics": "rmse", "max_depth": 2, "max_bin": 100,
               "learning_rate": 0.02, "random_state": \
                42, "n_estimators": 500,
           }, pregeneration_frac=2, only_generated_data=False,\
           gan_params = {"batch_size": 500, "patience": 25, \
          "epochs" : 500,}).generate_data_pipe(df_x_train, df_y_train,\
          df_x_test, deep_copy=True, only_adversarial=False, \
          use_adversarial=True)



In [None]:
# Predict
pred = model.predict(gen_x.values)
score = np.sqrt(metrics.mean_squared_error(pred,gen_y.values))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 9.083745225633098
