Generate synthetic data with GAN and plot distribution

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sdv.tabular import CopulaGAN
from sdv.evaluation import evaluate
from sdv.constraints import UniqueCombinations, GreaterThan


import os, glob

In [None]:
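def plot_corr(data, figsize=(15, 15)):
    '''
    Plot the lower triangle of the correlation matrix as an annotated heatmap.
    Args:
    - data: pd dataframe; non-numeric columns are ignored
    - figsize: figure size in inches
    '''
    # Restrict to numeric columns so string columns (e.g. 'type_cw') do not break corr()
    corr = data.select_dtypes(include='number').corr()
    sns.set(font_scale=1.2)
    # Mask the upper triangle (diagonal included) to avoid showing duplicate values
    mask = np.triu(np.ones_like(corr, dtype=bool))
    with sns.axes_style("white"):
        f, ax = plt.subplots(figsize=figsize)
        sns.heatmap(corr, mask=mask, square=True, cmap='RdBu_r', center=0,
                    vmin=-1, vmax=1,
                    annot=True,
                    annot_kws={'fontsize': 8},
                    ax=ax)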

In [None]:
data = pd.read_csv(os.path.join(os.path.dirname(os.getcwd()), '../Data/Merged_data/MERGE_FT_TEP_UT_on_ID.csv'),
                    index_col=0)

In [None]:
data.index = data.index.str.rstrip('-12345')
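
A caveat on the cell above: `str.rstrip` treats its argument as a set of characters, not as a literal suffix, so it keeps stripping for as long as the last character is `-` or one of `1` to `5`. The sketch below uses plain Python strings (pandas' `.str.rstrip` follows the same semantics) and made-up IDs to compare it with a regex that removes a single `-<digit>` suffix. The ID format and the choice of regex are assumptions, not something taken from the dataset.

```python
import re

# Hypothetical IDs for illustration only; the real ID format is not shown here.
ids = ["A-20-1", "B-15-3", "C-7-2"]

# rstrip removes every trailing character found in the set {'-', '1', '2', '3', '4', '5'}
stripped = [s.rstrip("-12345") for s in ids]

# The regex removes exactly one trailing "-<digit 1..5>" suffix and nothing else
suffix_removed = [re.sub(r"-[1-5]$", "", s) for s in ids]

print(stripped)
print(suffix_removed)
```

If the IDs in the real data can end in a digit between 1 and 5 before the replicate suffix, something like `data.index.str.replace(r'-[1-5]$', '', regex=True)` may be the safer choice.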

In [None]:
mean_df = data.groupby('ID').mean()
mean_df.dropna(how='any', inplace=True)
mean_df['type_cw'] = mean_df.index.astype('str')
mean_df.drop(['%C_IF_2.25MHz', '%C_IF_3.5MHz', '%C_BS'], axis=1, inplace=True)

In [None]:
mean_df.info()

# CopulaGAN

The `sdv.tabular.CopulaGAN` model is a variation of the CTGAN model. It first applies the CDF-based transformation used by Gaussian copulas to each column, which makes the data easier for the underlying CTGAN model to learn.
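
To make the idea of a CDF-based transformation concrete, here is a minimal sketch in plain Python. It maps a skewed column to roughly standard-normal scores by passing each value through an empirical CDF and then through the inverse normal CDF. This is an illustration only: the function name `to_gaussian_space` and the sample values are made up, and the empirical CDF stands in for the per-field distributions that the library fits.

```python
from statistics import NormalDist


def to_gaussian_space(values):
    """Map values to approximately standard-normal scores via their empirical CDF."""
    n = len(values)
    # Rank each value (1 = smallest); ties are not handled specially in this sketch
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    std_normal = NormalDist()
    # rank / (n + 1) keeps the CDF value strictly inside (0, 1)
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


# Made-up, heavily skewed values
skewed = [0.1, 0.2, 0.3, 0.5, 0.9, 2.0, 8.0, 40.0, 150.0]
z = to_gaussian_space(skewed)
print([round(v, 2) for v in z])
```

The transformed values keep the original ordering but are spread symmetrically around zero, which is the kind of well-behaved input a GAN tends to handle more easily than a long-tailed column.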

# Model the data

## Adding constraints (ignored for now)

The constraints below do not work yet: `GreaterThan` raises an error at sampling time with the `reject_sampling` strategy, and `UniqueCombinations` raises an error at fit time.

One thing to test is removing the rows that contain zeros, which may avoid these errors:
`mean_df = mean_df[(mean_df != 0).all(axis=1)]`
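
Before re-enabling the constraints, it may help to check whether the real data already satisfies them. The sketch below is written against plain dicts so that it runs without the dataset; with the DataFrame, `mean_df.to_dict('records')` gives rows of the same shape. The helper names `find_violations` and `drop_rows_with_zeros` and the sample values are hypothetical, and treating equality as a violation is an assumption about how strict the check should be.

```python
def find_violations(rows, low, high):
    """Return indices of rows where low < high does not hold (equality counts as a violation)."""
    return [i for i, row in enumerate(rows) if not row[low] < row[high]]


def drop_rows_with_zeros(rows):
    """Keep only rows in which no value equals zero."""
    return [row for row in rows if all(v != 0 for v in row.values())]


# Made-up values; the column names follow the constraints defined in this notebook
rows = [
    {"TEP_error": 0.1, "TEP_average": 2.0},
    {"TEP_error": 0.0, "TEP_average": 0.0},
    {"TEP_error": 0.3, "TEP_average": 1.5},
]

bad = find_violations(rows, low="TEP_error", high="TEP_average")
clean = drop_rows_with_zeros(rows)
print(bad, len(clean))
```

If the list of violating rows is not empty on the real data, those rows are a plausible place to start looking for the cause of the errors, though this has not been confirmed here.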

In [None]:
# Each constraint requires the `high` column to be greater than the `low` column
TEP_constraint = GreaterThan(low='TEP_error', high='TEP_average', handling_strategy='all')
PC225_constraint = GreaterThan(low='SE_%_IF_2.25MHz', high='IF_amp_2.25MHz', handling_strategy='all')
PC35_constraint = GreaterThan(low='SE_%_IF_3.5MHz', high='IF_amp_3.5MHz', handling_strategy='all')

constraints = [TEP_constraint, PC225_constraint, PC35_constraint]

## Tuning distributions and fitting the model

In [None]:
model = CopulaGAN(
    epochs=5000,
    #constraints=constraints,
    field_distributions={
        'KJIC':'gaussian_kde',
    }
)

In [None]:
model.fit(mean_df)

In [None]:
model.get_distributions()

# Generate synthetic data

In [None]:
samples = model.sample(1000)

In [None]:
samples.info()

## Evaluate

The `evaluate` function returns a number between 0 and 1 that indicates how similar the two tables are, with 0 the worst and 1 the best possible score.

It applies a collection of pre-configured metric functions and returns the average of the scores obtained on each of them. To explore the metrics in more detail, pass the additional argument `aggregate=False`.


- cstest: This metric compares the distributions of all the categorical columns of the table using a Chi-squared test and returns the average of the p-values obtained across all the columns. If the tables being evaluated contain no categorical columns, the result is nan.

- kstest: This metric compares the distributions of all the numerical columns of the table with a two-sample Kolmogorov–Smirnov test using the empirical CDF and returns the average of the p-values obtained across all the columns. If the tables being evaluated contain no numerical columns, the result is nan.

- logistic_detection: This metric uses a Logistic Regression classifier to try to detect whether each row is real or synthetic and then evaluates its performance with the area under the ROC curve. The returned score is 1 minus the ROC AUC score obtained by the classifier.

- svc_detection: This metric uses a Support Vector Classifier to try to detect whether each row is real or synthetic and then evaluates its performance with the area under the ROC curve. The returned score is 1 minus the ROC AUC score obtained by the classifier.
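
To illustrate what the Kolmogorov–Smirnov comparison measures, the sketch below computes the two-sample KS statistic, the largest gap between the two empirical CDFs, in plain Python. It is not the library's implementation: the function name `ks_statistic` and the sample values are made up, and only the statistic is computed, not the p-value. A statistic near 0 means the two columns are distributed similarly; a statistic near 1 means they barely overlap.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest absolute gap between the empirical CDFs of the two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # Both CDFs are step functions, so checking every observed value is sufficient
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


# Made-up columns: one synthetic sample close to the real one, one shifted away from it
real = [1.0, 2.0, 3.0, 4.0, 5.0]
close = [1.1, 2.1, 2.9, 4.2, 4.8]
shifted = [6.0, 7.0, 8.0, 9.0, 10.0]

print(ks_statistic(real, close), ks_statistic(real, shifted))
```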


In [None]:
evaluate(samples, mean_df)

In [None]:
evaluate(samples, mean_df, aggregate = False)

In [None]:
# Copy the selection so that adding a column later does not trigger SettingWithCopyWarning
short_mean = samples.loc[:, ['KJIC', 'MS_Avg', 'TEP_average',
                             'Beta_avg', 'IF_amp_2.25MHz', 'IF_amp_3.5MHz',
                             'BS_amp', 'type_cw']].copy()
short_mean.info()

In [None]:
short_mean['Type'] = short_mean.type_cw.str.split('-').str[0].astype('str')

In [None]:
short_mean.sort_values('Type', inplace=True)

In [None]:
samples.to_csv(os.path.join(os.path.dirname(os.getcwd()), '../Data/Merged_data/CopulaGAN_simulated_data_up.csv'), index=False)

In [None]:
plot_corr(short_mean, figsize=(5,5))

In [None]:
sns.pairplot(short_mean, hue='Type')

# Save and load the synthesizer

To save the trained CopulaGAN model, use

`model.save('my_model.pkl')`

To restore a saved synthesizer, use

`loaded = CopulaGAN.load('my_model.pkl')`

`new_data = loaded.sample(200)`

In [None]:
model.save('CopulaGAN_up.pkl')