# Prepare synthetic data 
Development is much easier when you have realistic synthetic data.
Here we take a SQL query that generates a single tabular output.
We run that query once against the live identifiable data.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to fit a synthetic model of those data.
The code below serves as a vignette for that process, but it will need adjusting to match the exact contents of the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, and only once.

## Set-up, query and return the data as a dataframe
The query lives in `./src/api` where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run`, that query will be copied here automatically.

So the first steps should be (from _this_ directory):
```sh
make build
make run
```
Then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data.ipynb

In [1]:
import os
import pandas as pd

from pathlib import Path
from sqlalchemy import create_engine

In [2]:
# Construct the PostgreSQL connection
uds_host = os.getenv('UDS_HOST')
uds_user = os.getenv('UDS_USER')
uds_passwd = os.getenv('UDS_PWD')

emapdb_engine = create_engine(f'postgresql://{uds_user}:{uds_passwd}@{uds_host}:5432/uds')
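
The f-string URL above will not parse correctly if the password contains characters such as `@` or `/`, and an unset environment variable silently becomes the string `None`. The following is a minimal sketch of a more defensive way to assemble the same URL. `build_uds_url` is a hypothetical helper (not part of this project), the credentials are made up, and only the standard library is used.

```python
import os
from urllib.parse import quote_plus


def build_uds_url(env=os.environ, port=5432, database='uds'):
    """Build the PostgreSQL URL from the UDS_* variables (hypothetical helper)."""
    required = ('UDS_HOST', 'UDS_USER', 'UDS_PWD')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode user and password so characters like '@' or '/' are safe
    user = quote_plus(env['UDS_USER'])
    passwd = quote_plus(env['UDS_PWD'])
    return f"postgresql://{user}:{passwd}@{env['UDS_HOST']}:{port}/{database}"


# Example with made-up credentials (not real ones)
example_env = {'UDS_HOST': 'db.example.org', 'UDS_USER': 'alice', 'UDS_PWD': 'p@ss/word'}
example_url = build_uds_url(example_env)
print(example_url)  # postgresql://alice:p%40ss%2Fword@db.example.org:5432/uds
```

The returned string can be passed to `create_engine` in the same way as the f-string in the cell above.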

In [3]:
# Read the sql file into a query 'q' and the query into a dataframe
q = Path('query.sql').read_text()
df = pd.read_sql_query(q, emapdb_engine)

## Generate a synthetic version of the real data

Use the dataframe loaded above to generate the metadata needed for the synthetic data.

See the SDV guide on relational metadata: https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata

In [4]:
# Minimal imports
from sdv import Metadata, SDV

In [5]:
# PostgreSQL returns timezone-aware datetimes, which sdv does not seem to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Drop the timezone from df[col_name], keeping the local wall-clock time.

    Note: modifies df in place and also returns it.
    """
    df[col_name] = df[col_name].dt.tz_localize(None)
    return df
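
Note that `tz_localize(None)` drops the offset and keeps the local wall-clock time; it does not convert to UTC. The sketch below uses made-up timestamps to show the difference, in case the distinction matters for your query.

```python
import pandas as pd

# Two made-up timestamps with a +01:00 offset, similar to what PostgreSQL returns
aware = pd.Series(pd.to_datetime(['2022-05-19 01:46:33+01:00',
                                  '2022-05-19 14:37:01+01:00']))

# Option 1 (used in this notebook): drop the offset, keep the wall-clock time
wall_clock = aware.dt.tz_localize(None)

# Option 2: convert to UTC first, then drop the timezone
utc_naive = aware.dt.tz_convert('UTC').dt.tz_localize(None)

print(wall_clock[0])  # 2022-05-19 01:46:33
print(utc_naive[0])   # 2022-05-19 00:46:33
```

Either choice gives sdv a timezone-naive column; what matters is that the same convention is applied to every datetime column.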

### Prepare data and metadata

In [6]:
tz_cols = ['valid_from', 'scheduled_datetime', 'status_change_time', 'admission_time', 'discharge_time']
for col in tz_cols:
    df = remove_timezone(df, col)
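
The hard-coded `tz_cols` list has to be kept in step with the query. As an alternative, here is a sketch that finds the timezone-aware columns by dtype instead. `strip_all_timezones` is a hypothetical helper and the frame is made up; unlike `remove_timezone`, it works on a copy.

```python
import pandas as pd


def strip_all_timezones(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every tz-aware datetime column made tz-naive."""
    out = frame.copy()
    # 'datetimetz' selects only timezone-aware datetime columns
    for col in out.select_dtypes(include=['datetimetz']).columns:
        out[col] = out[col].dt.tz_localize(None)
    return out


# Made-up frame: one tz-aware column, one naive column, one text column
demo = pd.DataFrame({
    'admission_time': pd.to_datetime(['2022-05-19 05:16:51+01:00']),
    'date_of_birth': pd.to_datetime(['1954-07-04']),
    'dept_name': ['UCH SDEC'],
})
clean = strip_all_timezones(demo)
print(clean.dtypes)
```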

sdv doesn't always recognise the column types correctly. Here we specify `date_of_birth` explicitly as a date.

In [7]:
fields = {
    'date_of_birth': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
    }
}

Prepare the metadata

In [8]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
)

In [9]:
# Inspect the conversion that metadata.add_table did to the dataframe that you loaded
metadata.get_table_meta('tabpid')

{'fields': {'date_of_birth': {'type': 'datetime', 'format': '%Y-%m-%d'},
  'firstname': {'type': 'categorical'},
  'lastname': {'type': 'categorical'},
  'mrn': {'type': 'categorical'},
  'nhs_number': {'type': 'categorical'},
  'valid_from': {'type': 'datetime'},
  'cancelled': {'type': 'boolean'},
  'closed_due_to_discharge': {'type': 'boolean'},
  'comments': {'type': 'categorical'},
  'scheduled_datetime': {'type': 'datetime'},
  'status_change_time': {'type': 'datetime'},
  'name': {'type': 'categorical'},
  'admission_time': {'type': 'datetime'},
  'discharge_time': {'type': 'datetime'},
  'dept_name': {'type': 'categorical'}}}

Prepare the table(s)

In [10]:
tables = dict(tabpid=df)

Fit the model

In [11]:
sdv = SDV()
sdv.fit(metadata, tables)

Inspect the original data

Inspect the synthetic data

In [12]:
sdv.sample_all()['tabpid'].head()

| | date_of_birth | firstname | lastname | mrn | nhs_number | valid_from | cancelled | closed_due_to_discharge | comments | scheduled_datetime | status_change_time | name | admission_time | discharge_time | dept_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-08-09 | TIANYUN | WILLIAMS | 21010530 | 4667319151 | 2022-05-19 01:46:33 | False | False | | 2022-05-18 22:53:00 | 2022-05-19 01:46:20 | Inpatient consult to Haematology | 2022-05-19 05:16:51 | NaT | UCH SDEC |
| 1 | 1954-07-04 | DANIELLE | MONTOYA | 21506951 | 4085170767 | 2022-05-19 03:17:45 | False | False | | 2022-05-19 03:35:00 | 2022-05-19 03:18:21 | Inpatient consult to Urology | 2022-05-19 06:13:41 | NaT | UCH SDEC |
| 2 | 1961-11-27 | CHI | SUMANG | 21507592 | 4964860787 | 2022-05-18 22:36:18 | False | False | | 2022-05-18 22:08:00 | 2022-05-18 22:36:26 | Inpatient consult to Oncology | 2022-05-19 02:05:34 | 2022-05-19 03:48:01 | UCH EMERGENCY DEPT |
| 3 | 1949-01-10 | BEAU | BRERETON | 21508917 | 7158435865 | 2022-05-19 14:37:01 | False | False | | 2022-05-19 14:40:00 | 2022-05-19 14:37:00 | Inpatient consult to Acute Medicine | 2022-05-19 06:09:03 | NaT | UCH SDEC |
| 4 | 1970-12-20 | KISHOR | MONTOYA | 21507592 | 4324403708 | 2022-05-19 16:11:25 | False | False | | 2022-05-19 15:08:00 | 2022-05-19 16:11:16 | Inpatient consult to Mental Health Liaison Team | 2022-05-19 11:33:00 | NaT | UCH SDEC |
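
Because fields such as `mrn` and `nhs_number` are modelled as categorical, sampled values may be drawn from the real ones. One rough check, sketched here on made-up frames, is to measure how many synthetic values in an identifier column also occur in the real data. `overlap_fraction` is a hypothetical helper, and whether any overlap is acceptable depends on your own governance rules.

```python
import pandas as pd


def overlap_fraction(real: pd.DataFrame, synth: pd.DataFrame, col: str) -> float:
    """Fraction of synthetic values in `col` that also appear in the real data."""
    return float(synth[col].isin(real[col]).mean())


# Made-up stand-ins for `df` and `sdv.sample_all()['tabpid']`
real_demo = pd.DataFrame({'mrn': ['A1', 'A2', 'A3']})
synth_demo = pd.DataFrame({'mrn': ['A1', 'B7', 'A3', 'C9']})

print(overlap_fraction(real_demo, synth_demo, 'mrn'))  # 0.5
```

In this notebook the call would look like `overlap_fraction(df, sdv.sample_all()['tabpid'], 'mrn')`.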


### Save the synthetic data

Options:
- save the model and not the synthetic data (but then you need *sdv* to run the model)
- save the data (need some care with type conversions if you use csv etc.)
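
To illustrate the type-conversion care mentioned for CSV, here is a small sketch of a round trip on made-up rows. The column names are borrowed from the table above, the values are invented, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up rows shaped like a few columns of the synthetic table
demo = pd.DataFrame({
    'mrn': ['021010530', '21506951'],
    'cancelled': [False, True],
    'discharge_time': pd.to_datetime(['2022-05-19 03:48:01', None]),
})

buffer = io.StringIO()  # stands in for a file on disk
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Without these hints mrn would be read as an integer (losing the leading
# zero) and discharge_time would be read as text
loaded = pd.read_csv(
    buffer,
    dtype={'mrn': str},
    parse_dates=['discharge_time'],
)
print(loaded.dtypes)
```

Pickle avoids these hints because it stores the dtypes, at the cost of a Python-only file format.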

In [13]:
sdv.save('sdv_model.pkl')

In [14]:
sample = sdv.sample_all()
sample_df = sample['tabpid']
sample_df.to_pickle('sample_df')
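
Downstream code can then load the pickled sample with pandas alone, without needing *sdv*. A minimal sketch of that round trip, using a made-up frame and a temporary directory rather than the real `sample_df` file:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for `sample_df`
demo = pd.DataFrame({
    'mrn': ['21010530'],
    'admission_time': pd.to_datetime(['2022-05-19 05:16:51']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample_df'
    demo.to_pickle(path)
    # Pickle preserves dtypes, so no parsing hints are needed on reload
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

As with any pickle, only load files that come from a source you trust.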