# Prepare synthetic data
Development is much easier with realistic synthetic data.
Here we take a SQL query that generates a single tabular output.
We run that query once against the live identifiable data.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to fit a synthetic model of those data.
The code below serves as a vignette for that process, but it will need adjusting to match the exact contents of the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, and only once.

## Set-up, query and return the data as a DataFrame
The query lives in `./src/api`, where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run`, then that query will be copied here automatically.

So the first steps should be (from _this_ directory):
```sh
make build
make run
```
then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data.ipynb

In [1]:
import os
import pandas as pd

from pathlib import Path
from sqlalchemy import create_engine

In [2]:
# Construct the PostgreSQL connection
uds_host = os.getenv('UDS_HOST')
uds_user = os.getenv('UDS_USER')
uds_passwd = os.getenv('UDS_PWD')

emapdb_engine = create_engine(f'postgresql://{uds_user}:{uds_passwd}@{uds_host}:5432/uds')
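
If any of the `UDS_*` environment variables is unset, the f-string above silently produces a URL containing `None`, and a password containing characters such as `@` or `/` can break the URL. The sketch below is one possible stricter builder using only the standard library. The helper name `build_pg_url` and the example values are hypothetical; the port and database name are copied from the cell above.

```python
import os
from urllib.parse import quote


def build_pg_url(env=os.environ, port=5432, dbname="uds"):
    """Build a PostgreSQL URL from the UDS_* variables (hypothetical helper)."""
    required = ("UDS_HOST", "UDS_USER", "UDS_PWD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode the credentials so characters such as '@' or '/' are safe
    user = quote(env["UDS_USER"], safe="")
    passwd = quote(env["UDS_PWD"], safe="")
    return f"postgresql://{user}:{passwd}@{env['UDS_HOST']}:{port}/{dbname}"


# Example with made-up values (not real credentials)
example_env = {"UDS_HOST": "db.example.org", "UDS_USER": "alice", "UDS_PWD": "p@ss/word"}
example_url = build_pg_url(example_env)
print(example_url)
```

The returned string could then be passed to `create_engine` in place of the f-string.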

In [3]:
# Read the SQL file into a query string 'q', then run the query and load the result into a dataframe
q = Path('query.sql').read_text()
df = pd.read_sql_query(q, emapdb_engine)

In [None]:
df.head()

## Generate a synthetic version of the real data

Use the table above to generate the metadata you need for the synthetic data.
See the [relational metadata guide](https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata) for the available options.

In [5]:
# Minimal imports
from sdv import Metadata, SDV

In [6]:
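# PostgreSQL returns datetimes with tz info, which sdv does not seem to be able to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Convert a column to UTC and drop its timezone info (sdv does not like timezones)."""
    # Note: this modifies the dataframe passed in, and also returns it
    df[col_name] = pd.to_datetime(df[col_name], utc=True).dt.tz_localize(None)
    return df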
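
To make the effect of this conversion concrete, the sketch below applies the same expression to a toy column (the timestamp is made up). Note that the result is the UTC wall-clock time rather than the local one. If local times matter, one option would be to call `.dt.tz_convert(...)` with the desired zone before dropping the timezone.

```python
import pandas as pd

# Toy frame with one timezone-aware value (British Summer Time, UTC+01:00)
toy = pd.DataFrame({"valid_from": ["2022-05-30 10:00:30+01:00"]})

# Same conversion as remove_timezone: parse as UTC, then drop the tz info
toy["valid_from"] = pd.to_datetime(toy["valid_from"], utc=True).dt.tz_localize(None)

print(toy["valid_from"].dt.tz)    # the column is now timezone-naive
print(toy["valid_from"].iloc[0])  # the UTC wall-clock time
```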

### Prepare data and metadata

In [7]:
tz_cols = ['valid_from', 'scheduled_datetime', 'status_change_time', 'admission_time', 'discharge_time']
for col in tz_cols:
    print(col)
    df = remove_timezone(df, col)

valid_from
scheduled_datetime
status_change_time
admission_time
discharge_time


sdv doesn't always recognise the column types correctly. Here we specify `date_of_birth` explicitly as a date.

In [8]:
fields = {
    'date_of_birth': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
    }
}

Prepare the metadata

In [9]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
)

In [10]:
# Inspect the field types that metadata.add_table inferred from the dataframe you loaded
metadata.get_table_meta('tabpid')

{'fields': {'date_of_birth': {'type': 'datetime', 'format': '%Y-%m-%d'},
  'firstname': {'type': 'categorical'},
  'lastname': {'type': 'categorical'},
  'mrn': {'type': 'categorical'},
  'nhs_number': {'type': 'categorical'},
  'valid_from': {'type': 'datetime'},
  'comments': {'type': 'categorical'},
  'scheduled_datetime': {'type': 'datetime'},
  'status_change_time': {'type': 'datetime'},
  'name': {'type': 'categorical'},
  'admission_time': {'type': 'datetime'},
  'discharge_time': {'type': 'datetime'},
  'dept_name': {'type': 'categorical'},
  'location_string': {'type': 'categorical'}}}

Prepare the table(s)

In [11]:
tables = dict(tabpid=df)

Fit the model

In [12]:
sdv = SDV()
sdv.fit(metadata, tables)

Inspect the original data

In [None]:
df.head()

Inspect the synthetic data

In [13]:
sdv.sample_all()['tabpid'].head()

Unnamed: 0,date_of_birth,firstname,lastname,mrn,nhs_number,valid_from,comments,scheduled_datetime,status_change_time,name,admission_time,discharge_time,dept_name,location_string
0,1976-05-09,RAOUF,COMLEY,2047359,6249290400,2022-05-30 09:00:30,,2022-05-30 08:55:00,2022-05-30 08:57:45,Inpatient consult to PERRT,2022-05-29 07:58:00,NaT,UCH T15 NORTH DECANT,T10O^T10N BY05^BY05-23
1,1943-09-08,HEATHER,HERMANN,21478560,7120441620,2022-05-29 02:59:53,,2022-05-29 02:59:00,2022-05-29 03:01:43,Inpatient Consult to Symptom Control and Palli...,2022-05-26 07:16:00,NaT,UCH T16 NORTH (T16N),1021800027^GWB L03N SR08^SR08-08
2,1991-01-16,PAUL,CAMPBELL,21319735,6249023119,2022-05-28 20:42:45,72 yr old man presents with personality change...,2022-05-28 20:35:00,2022-05-28 20:41:03,Inpatient Consult to Acute Frailty Team,2022-05-31 17:07:00,NaT,UCH T13 NORTH ONCOLOGY,WST4^W04W BY02^BY02-03
3,1946-03-05,PETER,HAMMOND,21012138,6166708459,2022-05-25 01:03:24,,2022-05-25 01:06:00,2022-05-25 01:02:34,Inpatient consult to Dietetics (N&D) - Not TPN,2022-05-19 13:34:00,NaT,UCH T09 CENTRAL (T09C),T03^T03 BY02^BY02-19
4,1994-05-09,IORWERTH,SHASTRI,40714618,7227830284,2022-05-25 17:02:59,,2022-05-25 17:07:00,2022-05-25 17:04:10,Inpatient consult to General Surgery,2022-05-19 14:24:00,NaT,GWB L01 ELECTIVE SURG,T06H^T06H SR12^SR12-12
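
Because `firstname`, `lastname`, `mrn` and `nhs_number` were inferred as categorical, it may be worth checking whether the synthetic rows reuse identifier values from the real data before sharing the output. We have not verified here how sdv samples categorical fields, so treat this as a precaution. A minimal sketch on toy data; the frames and the helper name `shared_values` are hypothetical.

```python
import pandas as pd


def shared_values(real, synth, column):
    """Return the set of values in `column` that appear in both frames."""
    return set(real[column].dropna()) & set(synth[column].dropna())


# Made-up identifiers standing in for the real and synthetic tables
real_toy = pd.DataFrame({"mrn": ["A1", "B2", "C3"]})
synth_toy = pd.DataFrame({"mrn": ["B2", "Z9", "C3"]})

overlap = shared_values(real_toy, synth_toy, "mrn")
print(sorted(overlap))
```

With the notebook's own objects this could be called as `shared_values(df, sdv.sample_all()['tabpid'], 'mrn')`.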


### Save the synthetic data

Options:
- save the model and not the synthetic data (but then you need *sdv* installed to load the model and sample from it)
- save the data (this needs some care with type conversions if you use CSV etc.)
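
As an illustration of the care that CSV needs with type conversions, the sketch below round-trips a toy frame through CSV text. The values are made up and the column names are borrowed from the table above. Without hints, pandas reads the datetime back as text and infers the identifier as an integer, dropping its leading zero. Passing `dtype` and `parse_dates` restores the intended types.

```python
import io

import pandas as pd

# Toy data: an identifier with a leading zero and a datetime column
toy = pd.DataFrame({
    "mrn": ["01234567"],
    "admission_time": pd.to_datetime(["2022-05-29 07:58:00"]),
})

csv_text = toy.to_csv(index=False)

# Naive read: mrn is inferred as an integer and the datetime is not parsed
naive = pd.read_csv(io.StringIO(csv_text))

# Explicit read: keep mrn as text and parse the datetime column
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"mrn": str},
    parse_dates=["admission_time"],
)

print(naive.dtypes)
print(typed.dtypes)
```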

In [15]:
sdv.save('mock_model.pkl')

In [16]:
sample = sdv.sample_all()
sample_df = sample['tabpid']
sample_df.to_hdf('mock.hdf', 'data')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['firstname', 'lastname', 'mrn', 'nhs_number', 'comments', 'name',
       'dept_name', 'location_string'],
      dtype='object')]

  sample_df.to_hdf('mock.hdf', 'data')


In [17]:
pd.read_hdf('mock.hdf', 'data')

Unnamed: 0,date_of_birth,firstname,lastname,mrn,nhs_number,valid_from,comments,scheduled_datetime,status_change_time,name,admission_time,discharge_time,dept_name,location_string
0,1948-03-14,CHANDRAKANT,HALICIOGLU,21501353,6311196522,2022-05-27 03:16:48,Type 1 DM\nOn Insulin glargine 14 U in the nig...,2022-05-27 03:14:00,2022-05-27 03:17:13,Inpatient consult to PERRT,2022-05-19 20:33:00,NaT,NHNN C1 SURGICAL ITU,T01ECU^T01ECU BY02^BY02-11
1,1987-08-30,REBECCA,GRAHAM,40013050,6403728504,2022-05-27 13:17:03,He is known primary CNS lymphoma.\nHe is on PR...,2022-05-27 13:15:00,2022-05-27 13:16:16,Inpatient Referral to Tissue Viability,2022-05-24 15:11:00,NaT,NHNN C2 VICTOR HORSLEY,T09C^T09C SR41^SR41-41
2,1947-05-27,ANDREA,POLIANIDOU,QSC86233,4808996383,2022-05-28 14:15:56,,2022-05-28 14:18:00,2022-05-28 14:16:33,Inpatient consult to Acute Medicine,2022-05-22 01:34:00,NaT,UCH T12 NORTH (T12N),T08N^T08N SR22^SR22-22
3,1944-02-18,PETRA,KIPPS,21492510,7145941066,2022-05-27 12:33:48,,2022-05-27 12:36:00,2022-05-27 12:34:42,Inpatient Consult to Integrated Discharge Service,2022-05-20 05:25:00,NaT,UCH T08 SOUTH (T08S),T08N^T08N BY02^BY02-07
4,1964-01-09,ATHARI,KING,21506866,6283864767,2022-05-26 19:52:43,,2022-05-26 19:52:00,2022-05-26 19:51:18,Inpatient Consult to Symptom Control and Palli...,2022-06-02 12:16:00,NaT,NHNN C3 DAVID FERRIER,1021800028^GWB L03E SR28^SR28-28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400,1986-11-19,BRIDGET,BERDIBEK,21278501,4524730737,2022-05-25 21:01:22,,2022-05-25 21:03:00,2022-05-25 21:02:34,Inpatient consult to Neuro Ophthalmology,2022-05-31 04:15:00,NaT,GWB L03 EAST (L03E),T01ECU^T01ECU SR04^SR04-04
401,1969-12-27,SILVIA,VICKERS,21506521,4524730737,2022-05-27 17:50:11,,2022-05-27 17:52:00,2022-05-27 17:49:20,Inpatient consult to PERRT,2022-05-21 11:52:00,NaT,UCH T13 NORTH ONCOLOGY,LYNQ^Q1LAA BY04^BY04-11
402,1973-04-03,ALFRED,WILSON,21485314,4460354314,2022-05-23 22:17:07,,2022-05-23 22:16:00,2022-05-23 22:16:32,Inpatient consult to Dietetics (N&D) - Not TPN,2022-05-07 23:47:00,NaT,UCH T09 NORTH (T09N),T16N^T16N SR06^SR06-06
403,1986-07-22,EUGENE,WOJTECKA,41448436,6263522275,2022-05-27 22:41:02,,2022-05-27 22:38:00,2022-05-27 22:38:25,Inpatient consult to PERRT,2022-06-01 04:41:00,NaT,UCH T16 NORTH (T16N),1020100163^T13NO SR07^SR07-07
