# Package Usage Guide

This notebook walks through the steps needed to get the most out of this package and briefly explains its available modules, classes and methods.

## Objective

Health data is private and cannot be shared freely, which limits how much can be learned from the small amount of data that is publicly available. The HealthGAN neural network in this package generates a synthetic dataset from the original dataset, which can be shared without compromising privacy.

The package supplements the GAN with preprocessing and evaluation metrics, so each part can be used as needed.

## Using the package

Let's dive in and see how the package can be used.

### Processing

The first step is to have a training file and a testing file. Here we assume the training file is *train.csv* and the testing file is *test.csv*, both inside the folder *data_files*.
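
If the data starts out as a single CSV file, the two files can be produced with a simple random split. The following is a minimal sketch using only the standard library; the helper `split_csv`, the source file name *full.csv* and the 80/20 ratio are assumptions for illustration and are not part of the package.

```python
import csv
import random
from pathlib import Path


def split_csv(source, train_path, test_path, test_fraction=0.2, seed=0):
    """Shuffle the rows of `source` and write a train/test split, keeping the header."""
    with open(source, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    for path, subset in ((train_path, train_rows), (test_path, test_rows)):
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train_rows), len(test_rows)
```

A call such as `split_csv("data_files/full.csv", "data_files/train.csv", "data_files/test.csv")` would then create the two files used in the rest of this notebook.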

We will use the **processing** module to create an **Encoder()** object, which encodes the training and testing files into the SDV files that the GAN accepts, using the **encode_train()** and **encode_test()** methods respectively.

In [None]:
from synthetic_data.generators.processing import Encoder

en = Encoder()

The **encode_train()** method expects the training file and returns the SDV file along with **limits**, **min_max** and **cols** files which are used for encoding and decoding.

In [None]:
en.encode_train("data_files/train.csv")

The **encode_test()** method expects the testing file as the first argument and the original training file as the second argument. Note that the training file must be encoded before the testing file.

In [None]:
en.encode_test("data_files/test.csv", "data_files/train.csv")

These calls generate the SDV files inside the *data_files* folder, which can now be used to train the model.
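
Before moving on to training, it can be worth confirming that the encoder outputs are actually on disk. The sketch below assumes the outputs are named *train_sdv.csv* and *test_sdv.csv*, as the training step in this notebook suggests; `missing_files` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def missing_files(paths):
    """Return the entries of `paths` that do not exist as files on disk."""
    return [p for p in paths if not Path(p).is_file()]


# Assumed output names: the original file name with an "_sdv" suffix.
expected = ["data_files/train_sdv.csv", "data_files/test_sdv.csv"]
missing = missing_files(expected)
if missing:
    print("Missing encoder outputs:", missing)
else:
    print("All encoder outputs found.")
```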

### Using HealthGAN

The files are now ready for HealthGAN, so we import the **HealthGAN** class, create an instance and call its **train()** method. The GAN expects SDV-converted files, so pass the files generated by the encoder above (same names with the suffix *_sdv*).

In [None]:
from synthetic_data.generators.gan import HealthGAN

gan = HealthGAN(train_file="data_files/train_sdv.csv",
                test_file="data_files/test_sdv.csv",
                base_nodes=64,
                critic_iters=5,
                num_epochs=100)
gan.train()

The GAN produces the model values and 10 synthetic data files which are all saved in the folder *gen_data*.
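
The generated files can be collected programmatically instead of typing each path by hand. This is a minimal sketch that assumes the synthetic files follow the *synth_* plus number naming used later in this notebook; `list_synthetic_files` is a hypothetical helper, not a function of the package.

```python
import re
from pathlib import Path


def list_synthetic_files(folder="gen_data", prefix="synth_"):
    """Return paths like gen_data/synth_<n>.csv, sorted by their numeric suffix."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.csv$")
    found = []
    for path in Path(folder).glob(f"{prefix}*.csv"):
        match = pattern.match(path.name)
        if match:
            # Keep the number so that synth_10 sorts after synth_2.
            found.append((int(match.group(1)), path.as_posix()))
    return [name for _, name in sorted(found)]
```

The returned list can be passed wherever the notebook expects a list of synthetic file paths.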

### Evaluation

The package provides several evaluation metrics and plots: **Adversarial accuracy**, **Divergence score**, **Discrepancy score**, **PCA plot**, **6-subplot PCA plot** and **6-subplot t-SNE plot**.
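
To give an intuition for the first metric, here is a minimal pure-Python sketch of one common nearest-neighbour formulation of adversarial accuracy. It is an illustration only: whether the package computes exactly this quantity is an assumption, and the function names below are hypothetical. Each dataset must contain at least two rows.

```python
import math


def nearest_distance(point, others, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbour in `others`.

    `skip_index` excludes the point itself when `others` is its own dataset.
    """
    return min(
        math.dist(point, other)
        for j, other in enumerate(others)
        if j != skip_index
    )


def adversarial_accuracy(real, synthetic):
    """Nearest-neighbour adversarial accuracy between two lists of numeric rows."""
    # Fraction of real rows whose nearest synthetic row is farther away
    # than their nearest other real row.
    real_term = sum(
        nearest_distance(r, synthetic) > nearest_distance(r, real, skip_index=i)
        for i, r in enumerate(real)
    ) / len(real)
    # The same comparison seen from the synthetic rows.
    synth_term = sum(
        nearest_distance(s, real) > nearest_distance(s, synthetic, skip_index=i)
        for i, s in enumerate(synthetic)
    ) / len(synthetic)
    return 0.5 * (real_term + synth_term)
```

Under this formulation, a value near 0.5 suggests the real and synthetic rows are hard to tell apart, a value near 1 suggests the synthetic data sits far from the real data, and a value near 0 suggests the synthetic rows are near-copies of real rows.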

In [None]:
from synthetic_data.metrics.scores import Scores
from synthetic_data.metrics.plots import LossPlot, ComponentPlots

Here, we assume the generated synthetic files are named *synth_* followed by a unique number, and that the log file is *log.pkl*.

#### Adversarial accuracy, divergence and discrepancy scores

In [None]:
scores = Scores(train_file="data_files/train_sdv.csv",
                test_file="data_files/test_sdv.csv",
                synthetic_files=["gen_data/synth_0.csv",
                                 "gen_data/synth_1.csv"])

In [None]:
scores.calculate_accuracy()
scores.compute_divergence()
scores.compute_discrepancy()

#### Plots

In [None]:
lossPlot = LossPlot(log_file="gen_data/log.pkl")
lossPlot.plot()

In [None]:
componentPlots = ComponentPlots()

# The six synthetic files and their display names are shared by both combined plots.
synthetic_files = [f"gen_data/synth_{i}.csv" for i in range(6)]
names = [f"Data{i}" for i in range(1, 7)]

componentPlots.pca_plot(real_data="data_files/train_sdv.csv",
                        synthetic_data="gen_data/synth_0.csv")
componentPlots.combined_pca(real_data="data_files/train_sdv.csv",
                            synthetic_datas=synthetic_files,
                            names=names)
componentPlots.combined_tsne(real_data="data_files/train_sdv.csv",
                             synthetic_datas=synthetic_files,
                             names=names)

For each of these plots, the images are saved inside the *gen_data/plots* folder.