# Welcome to Synthetic Data Generation with SDV!

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data.

In this notebook, we'll demo the basic features of SDV to get you started with creating synthetic data.

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/fake.jpg" width=1000>


In [3]:
!pip install sdv -q
# Restart the runtime once the installation has finished

<font color="maroon"><b>IMPORTANT!</b> When the installation finishes, <b>please restart the runtime</b> by clicking Runtime and then Restart runtime in the top menu bar.</font>

# 1. Loading the data
For this demo, we'll use a fake dataset that describes some fictional guests staying at a hotel.

In [4]:
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(modality='single_table',
                                    dataset_name='fake_hotel_guests'
                                    )

**Details**: The data is available as a single table.
- `guest_email` is a _primary key_ that uniquely identifies every row
- Other columns have a variety of data types, and some of the data may be missing.

In [None]:
real_data

Unnamed: 0,guest_email,has_rewards,room_type,amenities_fee,checkin_date,checkout_date,room_rate,billing_address,credit_card_number
0,michaelsanders@shaw.net,False,BASIC,37.89,27 Dec 2020,29 Dec 2020,131.23,"49380 Rivers Street\nSpencerville, AK 68265",4075084747483975747
1,randy49@brown.biz,False,BASIC,24.37,30 Dec 2020,02 Jan 2021,114.43,"88394 Boyle Meadows\nConleyberg, TN 22063",180072822063468
2,webermelissa@neal.com,True,DELUXE,0.00,17 Sep 2020,18 Sep 2020,368.33,"0323 Lisa Station Apt. 208\nPort Thomas, LA 82585",38983476971380
3,gsims@terry.com,False,BASIC,,28 Dec 2020,31 Dec 2020,115.61,"77 Massachusetts Ave\nCambridge, MA 02139",4969551998845740
4,misty33@smith.biz,False,BASIC,16.45,05 Apr 2020,,122.41,"1234 Corporate Drive\nBoston, MA 02116",3558512986488983
...,...,...,...,...,...,...,...,...,...
495,laurabennett@jones-duncan.net,False,BASIC,8.71,04 Jan 2021,06 Jan 2021,103.25,"5678 Office Road\nSan Francisco, CA 94103",3505516387300030
496,johnny71@cook.info,False,BASIC,16.31,24 Aug 2020,26 Aug 2020,115.81,"953 White Island\nChristopherside, TN 91366",2224524502892552
497,ygarcia@ballard-lopez.net,False,BASIC,30.59,11 Nov 2020,13 Nov 2020,141.61,"5678 Office Road\nSan Francisco, CA 94103",180096250673548
498,thomasdale@hall.com,False,BASIC,1.93,16 Jul 2020,18 Jul 2020,136.92,"5678 Office Road\nSan Francisco, CA 94103",4488223821722
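
The note that some values may be missing can be checked directly. The following is a minimal sketch, not part of the SDV API: it uses only the standard library and a three-row excerpt copied from the table above (the address and credit card columns are left out for brevity). The helper name `count_missing` is hypothetical.

```python
import csv
import io

# A few rows copied from the table above; address and card columns omitted for brevity.
EXCERPT = """guest_email,room_type,amenities_fee,checkin_date,checkout_date
michaelsanders@shaw.net,BASIC,37.89,27 Dec 2020,29 Dec 2020
gsims@terry.com,BASIC,,28 Dec 2020,31 Dec 2020
misty33@smith.biz,BASIC,16.45,05 Apr 2020,
"""


def count_missing(csv_text):
    """Return {column_name: number of empty cells} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return counts


missing_counts = count_missing(EXCERPT)
print(missing_counts)
```

On the full table, which is a pandas DataFrame, `real_data.isna().sum()` should give the equivalent per-column counts.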


The demo also includes **metadata**, a description of the dataset. It records the primary key as well as the data type of each column (called "sdtypes").

In [5]:
metadata

{
    "columns": {
        "guest_email": {
            "sdtype": "email",
            "pii": true
        },
        "has_rewards": {
            "sdtype": "boolean"
        },
        "room_type": {
            "sdtype": "categorical"
        },
        "amenities_fee": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "checkin_date": {
            "sdtype": "datetime",
            "datetime_format": "%d %b %Y"
        },
        "checkout_date": {
            "sdtype": "datetime",
            "datetime_format": "%d %b %Y"
        },
        "room_rate": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "billing_address": {
            "sdtype": "address",
            "pii": true
        },
        "credit_card_number": {
            "sdtype": "credit_card_number",
            "pii": true
        }
    },
    "primary_key": "guest_email",
    "METADATA_SPEC_VERSION": "SINGLE_TABL

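The `datetime_format` entries in the metadata use standard `strftime`-style codes. As a small illustration, the sketch below parses two date strings from the demo table with the format `%d %b %Y` using Python's standard library. The helper name `parse_hotel_date` is hypothetical, and the example assumes an English (or C) locale, since `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime

# Format string copied from the metadata for checkin_date and checkout_date.
DATETIME_FORMAT = "%d %b %Y"


def parse_hotel_date(text):
    """Parse a string such as '27 Dec 2020' into a datetime.date."""
    return datetime.strptime(text, DATETIME_FORMAT).date()


checkin = parse_hotel_date("27 Dec 2020")
checkout = parse_hotel_date("29 Dec 2020")
nights = (checkout - checkin).days  # length of the stay in days

print(checkin.isoformat(), checkout.isoformat(), nights)
```
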
# 2. Creating a synthesizer

An SDV **synthesizer** is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data.

Let's use the `FAST_ML` preset synthesizer, which is optimized for performance.

In [7]:
from sdv.lite import SingleTablePreset

synthesizer = SingleTablePreset(
                                metadata,
                                name='FAST_ML'
                                )

Next, we can **train** the synthesizer. We pass in the real data so it can learn patterns using machine learning.

In [8]:
synthesizer.fit(data=real_data)

Now the synthesizer is ready to use!

# 3. Generating synthetic data
Use the `sample` method and pass in the number of rows to synthesize.
The synthesizer generates synthetic guests in the **same format as the original data**.

In [9]:
synthetic_data = synthesizer.sample(num_rows=500)
synthetic_data

Unnamed: 0,guest_email,has_rewards,room_type,amenities_fee,checkin_date,checkout_date,room_rate,billing_address,credit_card_number
0,dsullivan@example.net,False,DELUXE,10.385184,03 Apr 2020,23 Apr 2020,149.354932,"90469 Karla Knolls Apt. 781\nSusanberg, CA 70033",5161033759518983
1,steven59@example.org,False,BASIC,,04 Jul 2020,24 Aug 2020,179.634314,"6108 Carla Ports Apt. 116\nPort Evan, MI 71694",4133047413145475690
2,brandon15@example.net,False,BASIC,22.700956,20 Apr 2020,14 Apr 2020,145.658788,86709 Jeremy Manors Apt. 786\nPort Garychester...,4977328103788
3,humphreyjennifer@example.net,False,BASIC,23.497404,20 May 2020,05 Jun 2020,187.945019,"8906 Bobby Trail\nEast Sandra, NY 43986",3524946844839485
4,joshuabrown@example.net,False,DELUXE,20.162318,05 Jan 2020,07 Jan 2020,190.691273,"732 Dennis Lane\nPort Nicholasstad, DE 49786",4446905799576890978
...,...,...,...,...,...,...,...,...,...
495,donnaodom@example.net,False,BASIC,32.615319,05 Jan 2020,07 Jan 2020,251.051663,"328 Danny Haven Apt. 575\nBrownbury, FM 31800",180069262819334
496,tammycarey@example.net,False,SUITE,21.749573,26 Apr 2020,05 Jun 2020,203.361819,"48578 Parker Underpass\nBradybury, VA 45723",4716059563401606
497,gwebb@example.org,False,BASIC,9.793593,18 Mar 2020,11 Apr 2020,194.660405,"590 Peterson Roads Apt. 892\nLake Phillip, NE ...",3564840847844629
498,carlnovak@example.com,False,BASIC,48.120000,15 Aug 2020,02 Jul 2020,210.275597,"2854 Steven Haven\nWest Michael, NE 22990",4707209056757331019


# 4. Evaluating real vs. synthetic data
The synthetic data replicates the **mathematical properties** of the real data.

## 4.1 Anonymization

In the original dataset, we had some sensitive columns such as the guest's email, billing address and credit card number. In the synthetic data, these columns are **fully anonymized** -- they contain entirely fake values that follow the format of the original.

In [10]:
sensitive_column_names = ['guest_email', 'billing_address', 'credit_card_number']

real_data[sensitive_column_names].head(3)

Unnamed: 0,guest_email,billing_address,credit_card_number
0,michaelsanders@shaw.net,"49380 Rivers Street\nSpencerville, AK 68265",4075084747483975747
1,randy49@brown.biz,"88394 Boyle Meadows\nConleyberg, TN 22063",180072822063468
2,webermelissa@neal.com,"0323 Lisa Station Apt. 208\nPort Thomas, LA 82585",38983476971380


In [11]:
synthetic_data[sensitive_column_names].head(3)

Unnamed: 0,guest_email,billing_address,credit_card_number
0,dsullivan@example.net,"90469 Karla Knolls Apt. 781\nSusanberg, CA 70033",5161033759518983
1,steven59@example.org,"6108 Carla Ports Apt. 116\nPort Evan, MI 71694",4133047413145475690
2,brandon15@example.net,86709 Jeremy Manors Apt. 786\nPort Garychester...,4977328103788


_Note that any repeated values between the real and synthetic data occur by random chance. This ensures that an attacker won't be able to guess the real, sensitive values based on these columns alone._
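
To see how many sensitive values actually repeat, the two columns can be compared as sets. This is a minimal sketch using the three email addresses shown in each of the tables above; the helper name `shared_values` is hypothetical and is not part of SDV.

```python
def shared_values(real_values, synthetic_values):
    """Return the set of values that appear in both collections."""
    return set(real_values) & set(synthetic_values)


# First three emails from the real and synthetic tables shown above.
real_emails = [
    "michaelsanders@shaw.net",
    "randy49@brown.biz",
    "webermelissa@neal.com",
]
synthetic_emails = [
    "dsullivan@example.net",
    "steven59@example.org",
    "brandon15@example.net",
]

overlap = shared_values(real_emails, synthetic_emails)
print(len(overlap))
```

The same helper should work on full columns, for example `shared_values(real_data['guest_email'], synthetic_data['guest_email'])`, because iterating over a pandas column yields its values.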

## 4.2 Data Quality

Other columns in our data are not sensitive. The synthetic data replicates the **mathematical properties** of these columns. To get more insight, we can use the `evaluation` module.

In [12]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
                                    real_data,
                                    synthetic_data,
                                    metadata
                                )

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 9/9 [00:00<00:00, 580.98it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 36/36 [00:00<00:00, 111.19it/s]

Overall Score: 89.53%

Properties:
- Column Shapes: 91.76%
- Column Pair Trends: 87.29%
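
The numbers in this report are consistent with the overall score being the plain average of the two property scores. The sketch below reproduces that arithmetic from the rounded values printed above; treat the averaging rule as an assumption to confirm against the SDMetrics documentation for your installed version.

```python
# Property scores (in percent) copied from the report output above.
property_scores = {
    "Column Shapes": 91.76,
    "Column Pair Trends": 87.29,
}

# Assumed rule: the overall score is the unweighted mean of the property scores.
overall = sum(property_scores.values()) / len(property_scores)
print(f"Overall score: {overall:.2f}%")
```

The result lands within rounding distance of the reported 89.53%. Any small gap comes from averaging the already rounded property scores rather than the unrounded ones.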


The report allows us to visualize the different properties that were captured. For example, the visualization below shows us _which_ individual column shapes were well-captured and which weren't.

In [13]:
quality_report.get_visualization('Column Shapes')

## 4.3 Visualizing the data

For even more insight, we can visualize the real vs. synthetic data.

Let's perform a 1D visualization comparing a column of the real data to the synthetic data.

In [14]:
from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
                    real_data=real_data,
                    synthetic_data=synthetic_data,
                    column_name='amenities_fee',
                    metadata=metadata
                    )
fig.show()

We can also visualize in 2D, comparing the correlations of a pair of columns.

In [15]:
from sdv.evaluation.single_table import get_column_pair_plot

fig = get_column_pair_plot(
                            real_data=real_data,
                            synthetic_data=synthetic_data,
                            column_names=['checkin_date', 'checkout_date'],
                            metadata=metadata
                            )
fig.show()
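
A pair plot shows the overall relationship between two columns, but a simple row-level check can complement it. In the synthetic output shown earlier in this notebook, some rows have a checkout date that falls before the check-in date. The sketch below counts such rows for four (check-in, checkout) pairs copied from that output. It uses only the standard library, the helper name `is_out_of_order` is hypothetical, and it assumes an English (or C) locale for the `%b` month abbreviations.

```python
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # same format as in the metadata

# (checkin_date, checkout_date) pairs copied from the synthetic output shown earlier.
synthetic_stays = [
    ("03 Apr 2020", "23 Apr 2020"),
    ("20 Apr 2020", "14 Apr 2020"),
    ("05 Jan 2020", "07 Jan 2020"),
    ("15 Aug 2020", "02 Jul 2020"),
]


def is_out_of_order(checkin, checkout):
    """Return True when the checkout date is earlier than the check-in date."""
    start = datetime.strptime(checkin, DATE_FORMAT)
    end = datetime.strptime(checkout, DATE_FORMAT)
    return end < start


out_of_order = [stay for stay in synthetic_stays if is_out_of_order(*stay)]
print(f"{len(out_of_order)} of {len(synthetic_stays)} stays end before they start")
```

Rows like these indicate that the synthesizer has not learned the rule that checkout follows check-in, which is the kind of pattern the 2D plot helps reveal.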

# 5. Saving and Loading
We can save the synthesizer to share with others and sample more synthetic data in the future.


In [17]:
synthesizer.save('my_synthesizer.pkl')

In [19]:
# Load the saved synthesizer from disk
synthesizer = SingleTablePreset.load('my_synthesizer.pkl')
synthesizer

SingleTablePreset(name=FAST_ML)

In [21]:
synthesizer.sample(num_rows=1000)

Unnamed: 0,guest_email,has_rewards,room_type,amenities_fee,checkin_date,checkout_date,room_rate,billing_address,credit_card_number
0,tsanchez@example.com,False,BASIC,18.357446,25 Oct 2020,27 Oct 2020,129.066196,"52424 Ashley Ridges\nLake Daniel, MP 27650",3582077138450885
1,bellshawn@example.com,False,BASIC,,23 Aug 2020,01 Sep 2020,83.800000,"18561 Thomas Canyon\nJoshuamouth, SD 22073",4142271383722418
2,iwhite@example.org,False,BASIC,,14 Aug 2020,10 Sep 2020,131.186479,"78944 Marie Harbor\nCynthiaton, IA 56020",6573028438398211
3,christophermiller@example.com,False,DELUXE,30.339632,26 Jul 2020,19 Aug 2020,102.233472,"47551 Hall Flats Apt. 315\nSouth James, WA 16993",30343480880655
4,dgarcia@example.org,False,DELUXE,,19 Jan 2020,16 Jan 2020,207.639230,"38378 Nicholas Mount\nWest Michael, CO 91475",4930915359735
...,...,...,...,...,...,...,...,...,...
995,valerie83@example.org,False,BASIC,6.835664,22 Sep 2020,24 Sep 2020,98.126477,"56780 Thompson Cliff\nPerezfurt, DE 18754",2551419503001787
996,hartdaniel@example.org,False,BASIC,16.895397,04 Sep 2020,23 Sep 2020,125.417431,"741 Donaldson Brook Suite 596\nWest Patrick, T...",30545507259735
997,fbaker@example.org,False,BASIC,17.133790,07 Jan 2021,08 Jan 2021,194.481600,"637 Susan Islands\nJessicaview, MA 91626",2258564950501684
998,jenniferbradley@example.org,False,BASIC,7.771249,24 Jul 2020,27 Jul 2020,99.052630,Unit 9208 Box 5837\nDPO AP 68605,4626487361085121933
