# Prepare synthetic data
Development is much easier when you have realistic synthetic data.
Here we take a SQL query that generates a single tabular output.
We run that query once against the live identifiable data.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to fit a synthetic model of those data.
The code below serves as a vignette for that process, but it will need adjusting to match the exact contents of the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, and just once.

## Set-up, query and return the data as a dataframe
The query lives in `./src/api` where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run`, then that query will be automatically copied here.

So the first steps should be (from _this_ directory):
```sh
make build
make run
```
Then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data.ipynb

In [40]:
import os
import pandas as pd

from pathlib import Path
from sqlalchemy import create_engine

In [41]:
# Construct the PostgreSQL connection
uds_host = os.getenv('UDS_HOST')
uds_user = os.getenv('UDS_USER')
uds_passwd = os.getenv('UDS_PWD')

emapdb_engine = create_engine(f'postgresql://{uds_user}:{uds_passwd}@{uds_host}:5432/uds')
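The f-string above assumes that the user name and password contain no characters with a special meaning in a URL (such as `@` or `/`). A minimal sketch of escaping them first with the standard library follows; the helper name `build_postgres_url`, the example credentials and the host are hypothetical and only there for illustration.

```python
from urllib.parse import quote_plus


def build_postgres_url(user: str, password: str, host: str,
                       port: int = 5432, database: str = "uds") -> str:
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    # quote_plus escapes characters such as '@' and '/' that would otherwise
    # be read as URL delimiters
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
example_url = build_postgres_url("analyst", "p@ss/word", "db.example.org")
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the f-string used above.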

In [42]:
# Read the SQL file into a query string 'q', then run it and load the result into a dataframe
q = Path('query.sql').read_text()
df = pd.read_sql_query(q, emapdb_engine)

In [None]:
df.head()

## Generate a synthetic version of the real data

Use the table above to work out the metadata you need for the synthetic data; see the [SDV relational metadata guide](https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata).

In [44]:
# Minimal imports (Metadata and SDV are top-level classes in the pre-1.0 sdv API used here)
from sdv import Metadata, SDV

In [93]:
# PostgreSQL returns datetimes with tz info, which sdv does not seem to be able to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Convert a column to UTC and drop its tz info (modifies df in place)."""
    df[col_name] = pd.to_datetime(df[col_name], utc=True).dt.tz_localize(None)
    return df

### Prepare data and metadata

In [96]:
tz_cols = ['valid_from', 'scheduled_datetime', 'status_change_time', 'admission_time', 'discharge_time']
for col in tz_cols:
    print(col)
    df = remove_timezone(df, col)

valid_from
scheduled_datetime
status_change_time
admission_time
discharge_time
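The hard-coded list of column names could also be derived from the dataframe itself. A minimal sketch, assuming the affected columns already carry a tz-aware datetime dtype (columns stored as plain Python objects would not be picked up this way); the `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Toy dataframe with one tz-aware column and one plain text column
toy = pd.DataFrame({
    "admission_time": pd.to_datetime(
        ["2022-01-01 10:00:00+01:00", "2022-06-01 09:30:00+01:00"], utc=True
    ),
    "dept_name": ["A", "B"],
})

# Find every column whose dtype is a tz-aware datetime
detected_tz_cols = list(toy.select_dtypes(include="datetimetz").columns)

# Drop the tz info; the stored values are already in UTC, so the wall time is kept as UTC
for col in detected_tz_cols:
    toy[col] = toy[col].dt.tz_localize(None)

print(detected_tz_cols)
print(toy.dtypes)
```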


sdv doesn't always recognise the columns correctly. Here we specify `date_of_birth` explicitly as a date.

In [97]:
fields = {
    'date_of_birth': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
    }
}
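Before relying on an explicit format, it can help to confirm that the values really follow it. A minimal sketch using only the standard library; the helper `matches_format` and the sample strings are hypothetical and not part of sdv.

```python
from datetime import datetime

DOB_FORMAT = "%Y-%m-%d"  # same format string as in the fields dict


def matches_format(value: str, fmt: str = DOB_FORMAT) -> bool:
    """Return True if value can be parsed with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# Made-up sample values: one in ISO order, one in day/month/year order
samples = ["1956-04-30", "30/04/1956"]
results = {s: matches_format(s) for s in samples}
print(results)
```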

Prepare the metadata

In [98]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
)

In [99]:
# Inspect the field types that metadata.add_table inferred from the dataframe you loaded
metadata.get_table_meta('tabpid')

{'fields': {'date_of_birth': {'type': 'datetime', 'format': '%Y-%m-%d'},
  'firstname': {'type': 'categorical'},
  'lastname': {'type': 'categorical'},
  'mrn': {'type': 'categorical'},
  'nhs_number': {'type': 'categorical'},
  'valid_from': {'type': 'datetime'},
  'comments': {'type': 'categorical'},
  'scheduled_datetime': {'type': 'datetime'},
  'status_change_time': {'type': 'datetime'},
  'name': {'type': 'categorical'},
  'admission_time': {'type': 'datetime'},
  'discharge_time': {'type': 'datetime'},
  'dept_name': {'type': 'categorical'},
  'location_string': {'type': 'categorical'}}}
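With many columns, a grouped view of the inferred types can be easier to scan than the raw dictionary. A minimal sketch, assuming the dictionary has the `{'fields': {name: {'type': ...}}}` shape shown in the output; `table_meta` is a hand-copied excerpt of that output and `group_fields_by_type` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-copied excerpt of the table metadata shown in the cell output
table_meta = {
    "fields": {
        "date_of_birth": {"type": "datetime", "format": "%Y-%m-%d"},
        "mrn": {"type": "categorical"},
        "nhs_number": {"type": "categorical"},
        "valid_from": {"type": "datetime"},
        "dept_name": {"type": "categorical"},
    }
}


def group_fields_by_type(meta: dict) -> dict:
    """Map each inferred type to the list of field names that have it."""
    grouped = defaultdict(list)
    for name, spec in meta["fields"].items():
        grouped[spec["type"]].append(name)
    return dict(grouped)


by_type = group_fields_by_type(table_meta)
print(by_type)
```

In the notebook, the same helper could be applied to the result of `metadata.get_table_meta('tabpid')`.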

Prepare the table(s)

In [100]:
tables = dict(tabpid=df)

Fit the model

In [101]:
sdv = SDV()
sdv.fit(metadata, tables)

Inspect the original data

In [None]:
df.head()

Inspect the synthetic data

In [105]:
sdv.sample_all()['tabpid'].head()

Unnamed: 0,date_of_birth,firstname,lastname,mrn,nhs_number,valid_from,comments,scheduled_datetime,status_change_time,name,admission_time,discharge_time,dept_name,location_string
0,1956-04-30,JEFFERY,HAYFORD,21368822,4283098787,2021-11-05 07:57:34,,2020-06-07 14:55:00,2021-11-05 09:39:28,Inpatient consult to Obstetrics,2020-06-19 00:22:14,NaT,GWB L02 NORTH (L02N),ED^UCHED PAEDS TRIAGE^NONE
1,1974-11-25,MARION,BUTLER,21056762,6288467669,2022-03-06 16:30:20,,2022-07-22 14:25:00,2022-03-06 15:58:01,Inpatient consult to Neurology,2022-08-28 04:32:15,NaT,UCH EMERGENCY DEPT,WSU3^W03W BY04^BY04-315
2,1945-02-06,CHRISTINA,GUNNELL,41644720,6324283089,2022-01-30 23:35:06,,2021-11-30 06:27:00,2022-01-31 00:53:52,Inpatient consult to Dietetics (N&D) - Not TPN,2021-12-11 22:13:27,NaT,UCH T14 NORTH TRAUMA,LYNQ^Q1LAA SR02^SR02-02
3,1949-01-31,KUNVERJI,CHINN,03080886,4660936740,2022-04-02 07:23:52,,2022-12-20 15:22:00,2022-04-02 07:16:06,Inpatient consult to Acute Medicine,2022-12-17 17:32:05,NaT,UCH T13 NORTH ONCOLOGY,null^SDEC 18^18 SDEC
4,1934-02-27,AHMED,HALL,U/EZ1177,4683480654,2022-06-24 13:32:58,,2022-10-27 19:05:00,2022-06-24 15:16:10,Inpatient consult to Mental Health Liaison Team,2022-10-30 19:06:16,NaT,UCH SDEC,1020100166^SDEC 19^19 SDEC


### Save the synthetic data

Options:
- save the model and not the synthetic data (but then you need *sdv* installed to sample from the model)
- save the data (this needs some care with type conversions if you use CSV etc.)
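The type conversion issue with CSV can be sketched as follows. This is an illustration on a made-up two-row dataframe, not the notebook's data: identifiers with leading zeros are read back as integers, and datetimes as text, unless the types are given explicitly.

```python
import io

import pandas as pd

# Made-up rows: an identifier with a leading zero and a datetime column
toy = pd.DataFrame({
    "mrn": ["00012345", "21368822"],
    "admission_time": pd.to_datetime(
        ["2022-12-17 17:32:05", "2020-06-19 00:22:14"]
    ),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Naive read: pandas infers the identifier column as an integer
buffer.seek(0)
naive = pd.read_csv(buffer)

# Careful read: keep identifiers as strings and parse the datetime column
buffer.seek(0)
careful = pd.read_csv(
    buffer, dtype={"mrn": str}, parse_dates=["admission_time"]
)

print(naive["mrn"].tolist())
print(careful["mrn"].tolist())
print(careful.dtypes)
```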

In [106]:
sdv.save('sdv_model.pkl')

In [107]:
sample = sdv.sample_all()
sample_df = sample['tabpid']
sample_df.to_hdf('sample_df.hdf', 'data')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['firstname', 'lastname', 'mrn', 'nhs_number', 'comments', 'name',
       'dept_name', 'location_string'],
      dtype='object')]

  sample_df.to_hdf('sample_df.hdf', 'data')


In [108]:
pd.read_hdf('sample_df.hdf', 'data')

Unnamed: 0,date_of_birth,firstname,lastname,mrn,nhs_number,valid_from,comments,scheduled_datetime,status_change_time,name,admission_time,discharge_time,dept_name,location_string
0,2003-05-06,RAZAN,AL-KUWARI,21495267,4808270676,2021-08-12 11:52:44,,2020-06-17 15:55:00,2021-08-12 09:55:08,Inpatient consult to Obstetrics,2020-07-01 18:37:38,NaT,UCH EMERGENCY DEPT,WSU3^W03W BY03^BY03-310
1,1944-09-20,SHIRLEY,MEKOLLARI,91079225,4821915790,2021-11-08 08:12:05,,2021-08-22 10:00:00,2021-11-08 08:16:58,Inpatient consult to Acute Medicine,2021-08-17 10:10:49,NaT,UCH T12 NORTH (T12N),UCLHHOME^UCLHHOME POOL01^HOME
2,1963-04-05,MARIA,SINNOTT,21491112,4430676615,2021-12-16 17:18:20,,2021-11-10 03:29:00,2021-12-16 17:17:05,Inpatient consult to Dietetics (N&D) - Not TPN,2021-11-15 06:22:13,NaT,MCC G04 ONC SUPPORT,ED^UCH SDEC 23 CH^23-SDEC
3,1961-02-19,LORENZO,CABRAL,94045562,4828721452,2022-02-02 18:32:39,She would prefer a massage but she could have ...,2021-09-30 21:08:00,2022-02-02 17:45:22,Inpatient consult to ENT,2021-09-12 00:46:06,NaT,UCH T14 NORTH TRAUMA,ED^UCHED RAT CHAIR^RAT-CHAIR
4,1955-07-27,CHRISTINE,PONSFORD,40316701,4467251227,2022-08-03 07:40:47,,2023-06-14 08:29:00,2022-08-03 08:45:36,Inpatient consult to PERRT,2023-06-21 17:57:02,NaT,NHNN Q5 H JACKSON WARD,P02G^UCH P02 ENDO POOL^ENDO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1554,1977-03-01,EMMA,CORDARA,M/288957,6249600760,2021-06-27 01:21:29,She would prefer a massage but she could have ...,2021-06-20 05:06:00,2021-06-27 00:20:14,Inpatient consult to PERRT,2021-07-03 19:01:34,NaT,UCH T10 NORTH (T10N),1020100166^SDEC OTF^SDEC OTF
1555,1954-04-22,JAMES,OLAFSDOTTIR,21457472,7268522452,2021-08-19 05:11:25,,2021-05-22 16:46:00,2021-08-19 05:23:44,Inpatient consult to ENT,2021-05-10 23:44:37,NaT,UCH SDEC,T13S^T13S BY07^BY07-45
1556,1970-03-03,AINHOA,SIMAO,93018354,6301863631,2021-11-04 05:43:00,,2020-11-16 16:16:00,2021-11-04 06:39:20,Inpatient consult to Acute Medicine,2020-11-30 08:45:10,NaT,UCH EMERGENCY DEPT,1020100166^SDEC 24^24 SDEC
1557,2011-07-20,SHAKILA,ODUBANJO,41408276,6006904500,2022-01-03 12:40:35,She would prefer a massage but she could have ...,2021-02-26 04:07:00,2022-01-03 14:39:59,Inpatient Consult to Symptom Control and Palli...,2021-02-21 11:13:10,NaT,NHNN Q5 H JACKSON WARD,ONCSUP^MCCSUP IN TREAT^HOLDING
