<a href="https://colab.research.google.com/github/RDGopal/IB9LQ0-GenAI/blob/main/Synthetic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Synthetic Data Generation

Synthetic data generation is the process of creating artificial data that mimics the characteristics of real-world data. It is used when real data is scarce, sensitive, or difficult to obtain, and it can protect privacy by producing datasets that resemble real data without containing any personally identifiable information. There are two broad classes of methods for generating synthetic data.

1. **Statistical Modeling Based Methods:**
This approach fits statistical distributions to the observed data and then samples from the fitted distributions to create new synthetic data points. In *independent feature modeling*, a suitable distribution is fitted to each feature separately and samples are drawn from each fitted distribution, which ignores any relationships between features. *Copula-based methods* model each feature's marginal distribution together with the dependence structure between features, so they can capture complex dependencies between variables, even non-linear ones.

2. **Machine Learning Based Methods:**
These methods use machine learning models to learn the underlying patterns in the data and generate new samples that resemble the original data. *Generative Adversarial Networks (GANs)* and *Variational Autoencoders (VAEs)* are useful in this context. GANs consist of two neural networks: a Generator that tries to create synthetic data and a Discriminator that tries to distinguish between real and synthetic data. They are trained in an adversarial manner. VAEs are generative models that learn a latent representation of the data. They consist of an Encoder that maps the input data to a lower-dimensional latent space (typically with a Gaussian prior) and a Decoder that reconstructs the original data from the latent space. New data points are generated by sampling from the latent space and passing the samples through the Decoder.
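
To make the difference between independent feature modeling and a copula concrete, here is a minimal sketch using only NumPy and the standard library. The toy data, the helper names (`to_normal_scores`, `sample_copula`, `sample_independent`) and the use of empirical quantiles for the marginals are all assumptions for illustration; this is not how SDV implements its synthesizers.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
std_normal = NormalDist()

# Hypothetical "real" data: two positively dependent, non-Gaussian columns.
n = 2000
x1 = rng.exponential(scale=2.0, size=n)
x2 = x1 + rng.exponential(scale=1.0, size=n)
real = np.column_stack([x1, x2])

def to_normal_scores(col):
    """Map a column to standard-normal scores through its ranks."""
    ranks = col.argsort().argsort() + 1          # ranks 1..n
    u = ranks / (len(col) + 1)                   # strictly inside (0, 1)
    return np.array([std_normal.inv_cdf(p) for p in u])

# Dependence structure: correlation of the normal scores.
z = np.column_stack([to_normal_scores(real[:, j]) for j in range(real.shape[1])])
corr = np.corrcoef(z, rowvar=False)

def sample_copula(num_rows):
    """Sample correlated normals, then map back through empirical quantiles."""
    z_new = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=num_rows)
    u_new = np.vectorize(std_normal.cdf)(z_new)
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(real.shape[1])]
    )

def sample_independent(num_rows):
    """Independent feature modeling: resample each column on its own."""
    return np.column_stack(
        [rng.choice(real[:, j], size=num_rows) for j in range(real.shape[1])]
    )

synthetic_copula = sample_copula(2000)
synthetic_indep = sample_independent(2000)

print("real corr:       ", np.corrcoef(real, rowvar=False)[0, 1])
print("copula corr:     ", np.corrcoef(synthetic_copula, rowvar=False)[0, 1])
print("independent corr:", np.corrcoef(synthetic_indep, rowvar=False)[0, 1])
```

Both samplers reproduce each column's marginal distribution, but only the copula sample should keep the strong positive correlation between the two columns.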





#Tabular GANs
Tabular GANs are a type of Generative Adversarial Network (GAN) specifically designed to generate synthetic tabular data (data organized in rows and columns, like a spreadsheet or a Pandas DataFrame) that closely resembles a real-world dataset. Traditional GANs were initially more successful in generating continuous data like images. Tabular data presents unique challenges due to the presence of:
* Mixed Data Types: Tables often contain both numerical (continuous or discrete) and categorical features.
* Complex Correlations: Features in a table can have intricate linear and non-linear relationships.
* Unbalanced Categories: Categorical features can have classes with highly varying frequencies.
* Discrete Values: Even numerical columns might represent discrete quantities.
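
The unbalanced-categories problem can be illustrated with a small sketch. The CTGAN paper describes a "training-by-sampling" step in which the category used to condition the generator is drawn with probability proportional to the logarithm of its frequency, so rare categories are seen more often than their raw share would allow. The counts below are hypothetical, and the code only illustrates that reweighting, not CTGAN's actual implementation.

```python
import numpy as np

# Hypothetical category counts for one imbalanced column.
categories = np.array(["S", "C", "Q"])
counts = np.array([640, 170, 80])

# Raw frequencies: the rare category is almost never picked.
raw_probs = counts / counts.sum()

# Log-frequency weights: probability mass proportional to log(count).
log_counts = np.log(counts)
log_probs = log_counts / log_counts.sum()

# Draw conditioning categories using the log-frequency weights.
rng = np.random.default_rng(0)
conditions = rng.choice(categories, size=10_000, p=log_probs)
share_q = np.mean(conditions == "Q")

for cat, raw, logp in zip(categories, raw_probs, log_probs):
    print(f"{cat}: raw={raw:.3f}  log-frequency={logp:.3f}")
print(f"share of 'Q' among sampled conditions: {share_q:.3f}")
```

The ordering of the categories is preserved, but the gap between the most and least frequent category is much smaller under the log-frequency weights.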


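CTGAN (Conditional Tabular Generative Adversarial Network) addresses these challenges with techniques such as mode-specific normalization for numerical columns and a conditional generator with training-by-sampling for imbalanced categorical columns. These techniques are built upon the standard GAN architecture, which has three parts: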
* Generator (G):
  * Takes random noise as input.
  * Its goal is to generate synthetic data samples that the discriminator cannot distinguish from real data.
  * It uses neural networks (typically Multi-Layer Perceptrons or MLPs) to transform the noise into synthetic tabular data.
* Discriminator (D):
  * Takes a batch of data as input, which can be a mix of real data samples from the original dataset and synthetic data samples generated by the generator.
  * Its goal is to correctly classify each input sample as either "real" or "synthetic."
  * It also uses neural networks (MLPs) for this classification task.
* Adversarial Training:
  * The generator and discriminator are trained in an adversarial manner.
  * The generator tries to fool the discriminator by producing increasingly realistic synthetic data.
  * The discriminator tries to become better at distinguishing real from synthetic data.
  * This competition drives both networks to improve, ideally leading the generator to produce synthetic data that is statistically very similar to the real data.
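
The two competing objectives can be written down in a few lines. The sketch below uses the classic binary cross-entropy GAN losses (with the common non-saturating generator loss) on hypothetical discriminator outputs. It is meant only to show what each network is trying to minimize; the CTGAN paper describes a Wasserstein loss with gradient penalty instead, so treat this as an illustration of the adversarial idea rather than CTGAN's exact loss.

```python
import numpy as np

def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    probs = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

def discriminator_loss(d_real, d_fake):
    """D wants to output 1 on real samples and 0 on synthetic samples."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """G wants D to output 1 on synthetic samples (non-saturating form)."""
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs (probability that a sample is real).
undecided = np.full(4, 0.5)              # D cannot tell real from synthetic
d_real = np.array([0.90, 0.95])          # D is confident and correct on real data
d_fake = np.array([0.10, 0.05])          # ... and on synthetic data

print("D loss, undecided D:", discriminator_loss(undecided, undecided))
print("G loss, undecided D:", generator_loss(undecided))
print("D loss, confident D:", discriminator_loss(d_real, d_fake))
print("G loss, confident D:", generator_loss(d_fake))
```

When the discriminator is confident and correct, its own loss is low and the generator's loss is high; training pushes the generator to reverse that, which is the competition described above.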


In essence, CTGAN aims to learn the underlying data generation process of a tabular dataset by training a generator to produce synthetic data that fools a discriminator trained to distinguish it from the real data.

In [None]:
!pip install sdv

In [None]:
import sdv
!pip show sdv


#Read Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/RDGopal/IB9LQ0-GenAI/main/Data/titanic.csv')

In [None]:
data

#Create and Validate Meta Data

In [None]:
from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframe(
    data=data)

In [None]:
metadata.validate_table(data=data)

In [None]:
metadata.visualize(
    show_table_details='full',
    show_relationship_labels=True,
    output_filepath='my_metadata.png'
)

#Correct Column Attributes

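Automatic detection can mislabel columns, so review the detected sdtypes and correct them where needed. See https://docs.sdv.dev/sdv/reference/metadata-spec/sdtypes for the available sdtypes.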

In [None]:
metadata.update_column(
    column_name='Name',
    sdtype='last_name',
    pii=True)

metadata.update_column(
    column_name='Ticket',
    sdtype='id')

metadata.validate()

In [None]:
metadata.visualize(
    show_table_details='full',
    show_relationship_labels=True,
    output_filepath='my_metadata.png'
)

##Save the Meta Data

In [None]:
metadata.save_to_json(filepath='my_metadata.json')

#Create Synthetic Data

##GaussianCopulaSynthesizer
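This is a statistical modeling based method: it fits a marginal distribution to each column and uses a Gaussian copula to capture the dependencies between columns.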

In [None]:
from sdv.single_table import GaussianCopulaSynthesizer

# Step 1: Create the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)

# Step 2: Train the synthesizer
synthesizer.fit(data)

# Step 3: Generate synthetic data
synthetic_data_GCS = synthesizer.sample(num_rows=10)

In [None]:
synthetic_data_GCS

##CTGANSynthesizer
The CTGAN Synthesizer uses GAN-based, deep learning methods to train a model and generate synthetic data.

In [None]:
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(data)

synthetic_data_CTGAN = synthesizer.sample(num_rows=10)

In [None]:
synthetic_data_CTGAN

## Analyze the synthesizer

In [None]:
synthesizer.get_parameters()

In [None]:
synthesizer.get_loss_values()

In [None]:
fig = synthesizer.get_loss_values_plot()
fig.show()

##TVAESynthesizer
The TVAE Synthesizer uses variational autoencoder (VAE)-based neural network techniques to train a model and generate synthetic data.
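
For reference, the objective a standard VAE minimizes is the negative evidence lower bound (ELBO), shown below in its textbook form. The symbols are the usual ones and are not taken from SDV: $q_\phi(z \mid x)$ is the Encoder, $p_\theta(x \mid z)$ is the Decoder, and $p(z) = \mathcal{N}(0, I)$ is the Gaussian prior on the latent space. The loss values reported for TVAE are presumably of this general kind, though the exact terms SDV uses may differ.

$$
\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

The first term rewards accurate reconstruction of the input; the second keeps the encoded distribution close to the prior, which is what makes sampling $z \sim p(z)$ and decoding it a sensible way to generate new rows.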

In [None]:
from sdv.single_table import TVAESynthesizer

synthesizer = TVAESynthesizer(metadata)
synthesizer.fit(data)

synthetic_data_TVAE = synthesizer.sample(num_rows=10)

In [None]:
synthetic_data_TVAE

In [None]:
synthesizer.get_parameters()

In [None]:
synthesizer.get_loss_values()

In [None]:
import matplotlib.pyplot as plt

# Get the loss values from the synthesizer.
# In current SDV versions this has one row per batch,
# with 'Epoch', 'Batch' and 'Loss' columns.
loss_values_df = synthesizer.get_loss_values()

# Average the batch losses within each epoch so the x-axis really is the epoch
epoch_loss = loss_values_df.groupby('Epoch')['Loss'].mean()

# Create the plot
plt.plot(epoch_loss.index, epoch_loss.values)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Synthesizer Loss over Epochs")
plt.show()

#Evaluation
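As a final step in your synthetic data project, evaluate and visualize the synthetic data against the real data. The cells below use the TVAE output. Note that only 10 synthetic rows were sampled above, so the scores will be noisy; sampling more rows (for example, as many as the real data has) gives a more stable comparison.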

In [None]:
from sdv.evaluation.single_table import run_diagnostic, evaluate_quality
from sdv.evaluation.single_table import get_column_plot

# perform basic validity checks
diagnostic = run_diagnostic(data, synthetic_data_TVAE, metadata)

In [None]:
# measure the statistical similarity
quality_report = evaluate_quality(data, synthetic_data_TVAE, metadata)
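
To give some intuition for what a statistical-similarity score measures, the sketch below hand-rolls two per-column metrics. According to the SDMetrics documentation, the quality report's column-shape score is built from metrics of this kind (a Kolmogorov-Smirnov complement for numerical columns and a total-variation complement for categorical ones). The functions `ks_complement` and `tv_complement` and the toy columns are illustrative assumptions, not the library's code.

```python
from collections import Counter
import numpy as np

def tv_complement(real, synthetic):
    """1 - total variation distance between two categorical columns."""
    p_real, p_syn = Counter(real), Counter(synthetic)
    cats = set(p_real) | set(p_syn)
    tvd = 0.5 * sum(
        abs(p_real[c] / len(real) - p_syn[c] / len(synthetic)) for c in cats
    )
    return 1.0 - tvd

def ks_complement(real, synthetic):
    """1 - Kolmogorov-Smirnov statistic between two numerical columns."""
    real, synthetic = np.sort(real), np.sort(synthetic)
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_syn = np.searchsorted(synthetic, grid, side="right") / len(synthetic)
    return float(1.0 - np.max(np.abs(cdf_real - cdf_syn)))

# Hypothetical categorical column: 60/40 split in real, 80/20 in synthetic.
real_sex = ["male"] * 6 + ["female"] * 4
syn_sex = ["male"] * 8 + ["female"] * 2
print("TV complement:", tv_complement(real_sex, syn_sex))

# Hypothetical numerical columns: identical samples vs. disjoint ranges.
print("KS complement (identical):", ks_complement(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))
print("KS complement (disjoint): ", ks_complement(np.array([0.0, 1.0, 2.0]), np.array([10.0, 11.0, 12.0])))
```

Both scores lie between 0 and 1, with 1 meaning the real and synthetic columns have the same empirical distribution.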


In [None]:
# plot the data
fig = get_column_plot(
    real_data=data,
    synthetic_data=synthetic_data_TVAE,
    metadata=metadata,
    column_name='Sex'
)

fig.show()

#Your Turn
Create synthetic data from `Wine.csv`. Build a predictive model to predict the outcome `Type`. Assess whether the prediction from the original data is similar to the prediction from the synthetic data.
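
One possible way to structure the comparison is "train on synthetic, test on real": fit the same model once on real training data and once on synthetic data, then score both on the same held-out real data. The sketch below shows only that protocol on hypothetical toy data. The data generator, the `fake_synthesizer` stand-in, and the nearest-centroid classifier are assumptions for illustration; for the exercise you would substitute `Wine.csv`, an SDV synthesizer, and a model of your choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_real(n_per_class):
    """Hypothetical stand-in for a real table: 2 numeric features, 2 classes."""
    x0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(n_per_class, 2))
    return np.vstack([x0, x1]), np.array([0] * n_per_class + [1] * n_per_class)

def fake_synthesizer(X, y, n_per_class):
    """Stand-in for a fitted synthesizer: sample from a per-class Gaussian fit."""
    parts, labels = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        parts.append(rng.normal(Xc.mean(axis=0), Xc.std(axis=0),
                                size=(n_per_class, X.shape[1])))
        labels.append(np.full(n_per_class, label))
    return np.vstack(parts), np.concatenate(labels)

def fit_centroids(X, y):
    """Nearest-centroid classifier: store one mean vector per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(model, X):
    """Assign each row to the class with the closest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

X_train, y_train = make_real(200)      # real training data
X_test, y_test = make_real(200)        # held-out real data
X_syn, y_syn = fake_synthesizer(X_train, y_train, 200)

acc_real = np.mean(predict(fit_centroids(X_train, y_train), X_test) == y_test)
acc_syn = np.mean(predict(fit_centroids(X_syn, y_syn), X_test) == y_test)

print("trained on real, tested on real:     ", acc_real)
print("trained on synthetic, tested on real:", acc_syn)
```

If the two accuracies are close, the synthetic data has preserved the relationship between the features and the outcome well enough for this model.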