<a href="https://colab.research.google.com/github/Emenike-Amara/Projects/blob/main/Generating_Synthetic_Data_with_SDV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ℹ Problem Statement


> Many analytics projects need synthetic data. Typical use cases include data modeling and algorithm development, and validating data pipelines, data integration processes, and data analytics workflows. There are several ways to generate synthetic data today, such as random sampling, data perturbation, and generative adversarial networks (GANs). However, some of these methods can require considerable time and computational resources.
This project shows how to use an SDV model to generate synthetic data in less time and with fewer computational resources.

Below is the Python script that achieves this:







# ▶ Step 1:  Install the sdv library

> Use the pip command that matches your Python environment

In [None]:
pip install sdv==1.0.0b0          
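Because the install above pins a beta release, it can help to confirm which version actually ended up in the environment before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of SDV.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report what is installed; the notebook above pins sdv==1.0.0b0.
sdv_version = installed_version("sdv")
if sdv_version is None:
    print("sdv is not installed in this environment")
else:
    print(f"sdv {sdv_version} is installed")
```

If the reported version differs from the pinned one, imports such as `sdv.lite` may not behave as shown in the later steps.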

In [None]:
import pandas as pd
import numpy as np

# ▶ Step 2: Load sample data to provide context to the model 
> The sample data is an .xlsx file stored on my local machine, so in Colab it first has to be uploaded to Google Drive and the drive mounted (the process is entirely different in a local Jupyter notebook)



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path='/content/drive/My Drive/test data.xlsx'
real_data = pd.read_excel(path)
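Since the loading step differs between Colab and a local Jupyter notebook, one option is to pick the file path based on where the code is running. This is only a sketch: the helper names and the `local_dir` parameter are assumptions for illustration, and the check relies on the common idiom that `google.colab` is already imported in a Colab kernel.

```python
import sys
from pathlib import Path


def running_in_colab():
    """Heuristic check: Colab kernels have google.colab loaded at startup."""
    return "google.colab" in sys.modules


def resolve_data_path(filename="test data.xlsx", local_dir="."):
    """Return the Drive path in Colab, otherwise a path under local_dir (hypothetical)."""
    if running_in_colab():
        return Path("/content/drive/My Drive") / filename
    return Path(local_dir) / filename


path = resolve_data_path()
print(path)
```

The resulting `path` can be passed to `pd.read_excel` in the same way as the hard-coded string above. In a local notebook the `drive.mount` cell is not needed.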

In [None]:
real_data.head()

Unnamed: 0,txn_date,account_number,email,debitcreditindicator,transactionamount,currency,current_balance
0,2023-02-01 06:05:18.223,1898340,michaelsanders@shaw.net,C,976.0,NGN,70840.0
1,2023-02-01 13:42:17.277,6708396,randy49@brown.biz,D,14493.23,ZAR,1610.36
2,2023-02-01 13:42:17.54,2218718,webermelissa@neal.com,C,96.71,USD,96.75
3,2023-02-01 13:42:28.61,1146525,gsims@terry.com,D,11296.04,USD,0.0
4,2023-02-01 05:55:45.07,6708196,misty33@smith.biz,C,14928.32,ZAR,16103.59


# ▶ Step 3: Creating the metadata
The metadata tells the synthesizer how to interpret each column (its data type and, where one exists, the primary key) so that it can replicate the data appropriately

In [None]:
from sdv.metadata import SingleTableMetadata        

metadata = SingleTableMetadata()

In [None]:
metadata.detect_from_dataframe(real_data)

In [None]:
python_dict = metadata.to_dict()

In [None]:
metadata

{
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "columns": {
        "txn_date": {
            "sdtype": "categorical"
        },
        "account_number": {
            "sdtype": "numerical"
        },
        "email": {
            "sdtype": "categorical"
        },
        "debitcreditindicator": {
            "sdtype": "categorical"
        },
        "transactionamount": {
            "sdtype": "numerical"
        },
        "currency": {
            "sdtype": "categorical"
        },
        "current_balance": {
            "sdtype": "numerical"
        }
    }
}
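Automatic detection is worth reviewing before fitting. In the output above, `txn_date` and `email` were both detected as `categorical`, which may not be what is wanted for a timestamp or for personal data. The sketch below works on a plain dictionary shaped like the one returned by `metadata.to_dict()` and groups the columns by detected type; the function name `columns_by_sdtype` is hypothetical.

```python
# Dictionary copied from the detected metadata shown above.
metadata_dict = {
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "columns": {
        "txn_date": {"sdtype": "categorical"},
        "account_number": {"sdtype": "numerical"},
        "email": {"sdtype": "categorical"},
        "debitcreditindicator": {"sdtype": "categorical"},
        "transactionamount": {"sdtype": "numerical"},
        "currency": {"sdtype": "categorical"},
        "current_balance": {"sdtype": "numerical"},
    },
}


def columns_by_sdtype(md):
    """Group column names by their detected sdtype, preserving column order."""
    grouped = {}
    for name, spec in md["columns"].items():
        grouped.setdefault(spec["sdtype"], []).append(name)
    return grouped


grouped = columns_by_sdtype(metadata_dict)
for sdtype, columns in grouped.items():
    print(f"{sdtype}: {', '.join(columns)}")

# Assumption to verify against the installed SDV version: a detected type
# can be corrected on the metadata object with something like
#   metadata.update_column(column_name='txn_date', sdtype='datetime')
```

Listing the columns this way makes it easier to spot types that need correcting before the synthesizer is created.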

# ▶ Creating a synthesizer

In [None]:
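from sdv.lite import SingleTablePreset

synthesizer = SingleTablePreset(
    metadata,
    name='FAST_ML'
)

# Fit the synthesizer on the real data; sample() needs a fitted synthesizer
synthesizer.fit(real_data)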

In [None]:
synthetic_data = synthesizer.sample(
    num_rows=500
)

synthetic_data.head()

Unnamed: 0,txn_date,account_number,email,debitcreditindicator,transactionamount,currency,current_balance
0,2023-02-28 13:28:08.72,913882,jonesernest@example.net,D,1.9,ZAR,0.0
1,2023-02-13 05:38:30.42,71366,sims@terry.com,D,1.9,ZAR,1631767.0
2,2023-02-16 13:54:08.067,336629,jonesernest@example.net,D,307396.272991,ZAR,0.0
3,2023-02-24 16:45:21.53,639642,misty33@smith.biz,D,1.9,ZAR,0.0
4,2023-02-22 16:24:13.317,438167,randy49@brown.biz,D,602582.288893,USD,1543423.0
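Some of the email addresses in the synthetic sample above (for example `misty33@smith.biz`) also appear in the real data, which is consistent with `email` having been detected as a categorical column. A quick way to measure this kind of reuse is a set comparison. The sketch below uses the values printed by the two `head()` calls; the function names are hypothetical.

```python
def reused_values(real_values, synthetic_values):
    """Return the distinct synthetic values that also occur in the real data."""
    return set(real_values) & set(synthetic_values)


def reuse_rate(real_values, synthetic_values):
    """Fraction of synthetic rows whose value occurs in the real data."""
    real_set = set(real_values)
    hits = sum(1 for value in synthetic_values if value in real_set)
    return hits / len(synthetic_values)


# Values copied from real_data.head() and synthetic_data.head() above.
real_emails = [
    "michaelsanders@shaw.net",
    "randy49@brown.biz",
    "webermelissa@neal.com",
    "gsims@terry.com",
    "misty33@smith.biz",
]
synthetic_emails = [
    "jonesernest@example.net",
    "sims@terry.com",
    "jonesernest@example.net",
    "misty33@smith.biz",
    "randy49@brown.biz",
]

reused = reused_values(real_emails, synthetic_emails)
rate = reuse_rate(real_emails, synthetic_emails)
print(sorted(reused))
print(f"{rate:.0%} of synthetic rows reuse a real email")
```

On the full tables the same functions could be called with `real_data['email']` and `synthetic_data['email']`. A high reuse rate on a personal-data column suggests the column type should be revisited in the metadata.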


# ▶ Generating synthetic data

In [None]:
synthetic_data = synthesizer.sample(
    num_rows=10000
)

synthetic_data.head(3)

Unnamed: 0,txn_date,account_number,email,debitcreditindicator,transactionamount,currency,current_balance
0,2023-02-13 06:03:34.813,641994,michaelsanders@shaw.net,C,533436.55037,ZAR,0.0
1,2023-02-22 07:10:57.113,125522,webermelissa@neal.com,C,952137.939639,ZAR,2505373.0
2,2023-02-22 14:58:23.75,384351,jonesernest@example.net,D,268545.933199,ZAR,0.0


In [None]:
len(synthetic_data)

10000
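Beyond checking the row count, it can be useful to compare the range of each numerical column in the synthetic data with the range in the real data. The following is a rough sketch with pandas; the function name `numeric_range_report` is hypothetical, and the toy frames hold a few amounts taken from the outputs above purely for illustration.

```python
import pandas as pd


def numeric_range_report(real, synthetic):
    """Compare min/max of each numeric column that exists in both frames."""
    rows = []
    for col in real.select_dtypes(include="number").columns:
        if col not in synthetic.columns:
            continue
        rows.append({
            "column": col,
            "real_min": real[col].min(),
            "real_max": real[col].max(),
            "synthetic_min": synthetic[col].min(),
            "synthetic_max": synthetic[col].max(),
        })
    columns = ["column", "real_min", "real_max", "synthetic_min", "synthetic_max"]
    report = pd.DataFrame(rows, columns=columns).set_index("column")
    # True only when every synthetic value lies inside the real min/max range.
    report["within_real_range"] = (
        (report["synthetic_min"] >= report["real_min"])
        & (report["synthetic_max"] <= report["real_max"])
    )
    return report


# Toy frames for illustration only (a few values from the printed samples).
real_demo = pd.DataFrame({
    "transactionamount": [976.0, 14493.23, 96.71],
    "currency": ["NGN", "ZAR", "USD"],
})
synthetic_demo = pd.DataFrame({
    "transactionamount": [1.9, 307396.27, 500.0],
    "currency": ["ZAR", "ZAR", "USD"],
})

report = numeric_range_report(real_demo, synthetic_demo)
print(report)
```

In the notebook, the same function could be called as `numeric_range_report(real_data, synthetic_data)`. The toy frames cover only a handful of rows, so their result says nothing about the full tables.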

# ▶ Data quality check
SDV can score the synthetic data against the real data, which is a very useful addition

In [None]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)

Creating report: 100%|██████████| 4/4 [00:01<00:00,  2.36it/s]



Overall Quality Score: 73.52%

Properties:
Column Shapes: 76.02%
Column Pair Trends: 71.03%
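The overall score above is close to the simple average of the two property scores. As far as I understand the quality report, the Column Shapes property scores categorical columns with the complement of the total variation distance between the real and synthetic category frequencies. The sketch below reproduces both ideas in plain Python. It is an illustration under that assumption, not the library's implementation, and the function name `tv_complement` is hypothetical.

```python
from collections import Counter

# Property scores copied from the report output above (percentages).
column_shapes = 76.02
column_pair_trends = 71.03
overall = (column_shapes + column_pair_trends) / 2
print(f"Mean of the two property scores: {overall:.3f}%")


def tv_complement(real_values, synthetic_values):
    """1 minus the total variation distance between two category distributions."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real = len(real_values)
    n_synth = len(synthetic_values)
    distance = 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )
    return 1.0 - distance


# Currency values from the first five rows of each table shown earlier.
real_currency = ["NGN", "ZAR", "USD", "USD", "ZAR"]
synthetic_currency = ["ZAR", "ZAR", "ZAR", "ZAR", "USD"]
score = tv_complement(real_currency, synthetic_currency)
print(f"TV complement for currency (5-row sample): {score:.2f}")
```

A score of 1.0 means the category frequencies match exactly, and lower values mean the distributions differ more. Five rows are far too few to judge the full tables, so the sample score only illustrates the calculation.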


# ▶ Saving and reloading the synthesizer

In [None]:
synthesizer.save('data_synthesized.pkl')

synthesizer = SingleTablePreset.load('data_synthesized.pkl')