# Prepare synthetic data 
Development is much easier if you have realistic synthetic data.
Here we take a SQL query that generates a single tabular output.
We run that query against the live identifiable data once.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to prepare a synthetic model of those data.
The code below serves as a vignette for that process but will need adjusting to match the exact contents of the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, just once.

## Set-up, query and return the data as a dataframe
The query lives in `./src/api` where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run` then that query will be automatically copied here.

So the first steps should be (from _this_ directory):
```sh
make build
make run
```
then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data.ipynb

In [5]:
import os
import pandas as pd

from pathlib import Path
from sqlalchemy import create_engine

In [6]:
# Construct the PostgreSQL connection
uds_host = os.getenv('UDS_HOST')
uds_user = os.getenv('UDS_USER')
uds_passwd = os.getenv('UDS_PWD')

emapdb_engine = create_engine(f'postgresql://{uds_user}:{uds_passwd}@{uds_host}:5432/uds')
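
If the password contains characters such as `@` or `/`, a URL built with a plain f-string may not parse correctly. The sketch below is one way to guard against that using only the standard library. `build_postgres_url` is a hypothetical helper (not part of this project), and the port and database name are simply copied from the cell above.

```python
from urllib.parse import quote_plus


def build_postgres_url(env) -> str:
    """Build a PostgreSQL URL from UDS_* settings (hypothetical helper)."""
    required = ("UDS_HOST", "UDS_USER", "UDS_PWD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Fail early with a clear message rather than connecting as 'None'
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode the credentials so '@', '/' etc. do not break the URL
    user = quote_plus(env["UDS_USER"])
    password = quote_plus(env["UDS_PWD"])
    return f"postgresql://{user}:{password}@{env['UDS_HOST']}:5432/uds"


# Demonstration with made-up values; in the notebook you would pass os.environ
demo_url = build_postgres_url(
    {"UDS_HOST": "dbhost", "UDS_USER": "me", "UDS_PWD": "p@ss/word"}
)
print(demo_url)
```

The result could then be passed to `create_engine` in place of the f-string.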

In [7]:
# Read the sql file into a query 'q' and the query into a dataframe
q = Path('query.sql').read_text()
df = pd.read_sql_query(q, emapdb_engine)

In [None]:
df.head()

## Generate a synthetic version of the real data

Use the table above to write the metadata that SDV needs to build the synthetic model.

See the [SDV relational metadata guide](https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata).

In [9]:
# Minimal imports
from sdv import Metadata, SDV

In [10]:
# PostgreSQL returns timezone-aware datetimes, which sdv does not seem to be able to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Convert a column to UTC, then drop the timezone (sdv does not like timezones)."""
    df[col_name] = pd.to_datetime(df[col_name], utc=True).dt.tz_localize(None)
    return df
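
`remove_timezone` modifies the dataframe it is given. A non-mutating variant is sketched below on a made-up one-row frame; `strip_timezones` and the demo values are illustrative and not taken from the query. The conversion goes through UTC, so the resulting naive timestamps are UTC wall-clock times.

```python
import pandas as pd


def strip_timezones(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of `frame` with the given columns as naive UTC datetimes."""
    out = frame.copy()
    for col in columns:
        # Normalise to UTC first, then drop the timezone information
        out[col] = pd.to_datetime(out[col], utc=True).dt.tz_localize(None)
    return out


# Made-up example: one timestamp recorded at UTC+01:00
demo = pd.DataFrame(
    {"admission_time": pd.to_datetime(["2022-06-01 10:00:00+01:00"])}
)
clean = strip_timezones(demo, ["admission_time"])
print(clean.dtypes)
```

Working on a copy keeps the original query result available for comparison if a later step goes wrong.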

### Prepare data and metadata

In [11]:
tz_cols = ['valid_from', 'scheduled_datetime', 'status_change_time', 'admission_time', 'discharge_time']
for col in tz_cols:
    print(col)
    df = remove_timezone(df, col)

valid_from
scheduled_datetime
status_change_time
admission_time
discharge_time


sdv doesn't always recognise the column types correctly. Below, `df.info()` shows `date_of_birth` stored as `object`, so we specify it explicitly as a datetime in the field metadata.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402 entries, 0 to 401
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   firstname           402 non-null    object        
 1   lastname            402 non-null    object        
 2   date_of_birth       402 non-null    object        
 3   mrn                 402 non-null    object        
 4   nhs_number          389 non-null    object        
 5   valid_from          402 non-null    datetime64[ns]
 6   comments            40 non-null     object        
 7   scheduled_datetime  402 non-null    datetime64[ns]
 8   status_change_time  402 non-null    datetime64[ns]
 9   name                402 non-null    object        
 10  admission_time      402 non-null    datetime64[ns]
 11  discharge_time      0 non-null      datetime64[ns]
 12  dept_name           402 non-null    object        
 13  location_string     402 non-null    object        

In [18]:
fields = {
    'date_of_birth': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
        'pii': True,
        # the 'pii_category' key names the Faker method used to generate values
        'pii_category': 'date_of_birth',
    },
    'firstname': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'first_name',
    },
    'lastname': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'last_name',
    },
    'mrn': {
        'type': 'categorical',
        'pii': True,
        # list form: the Faker method name followed by its arguments
        'pii_category': ['random_number', 8],
    },
    'nhs_number': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['numerify', '4## ### ####'],
    },
}
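
To make the `numerify` pattern above concrete, here is a minimal standard-library sketch of what it does with `'4## ### ####'`: each `#` is replaced by a random digit and every other character is kept. `numerify_sketch` is a hypothetical stand-in written for illustration, not Faker's implementation, and it ignores Faker's other placeholder characters.

```python
import random


def numerify_sketch(pattern: str, rng: random.Random) -> str:
    """Replace each '#' in `pattern` with a random digit (illustrative only)."""
    return "".join(
        str(rng.randint(0, 9)) if ch == "#" else ch
        for ch in pattern
    )


rng = random.Random(0)  # seeded so the demo is repeatable
example = numerify_sketch("4## ### ####", rng)
print(example)
```

Values generated this way only look like NHS numbers; nothing here checks that they are valid. Similarly, as far as I know Faker's `random_number(8)` may return fewer than eight digits, which would explain the seven-digit `mrn` values in the sample output further down.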

Prepare the metadata

In [19]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
)

In [20]:
# Inspect the conversion that metadata.add_table did to the dataframe that you loaded
metadata.get_table_meta('tabpid')

{'fields': {'date_of_birth': {'type': 'datetime',
   'format': '%Y-%m-%d',
   'pii': True,
   'pii_category': 'date_of_birth'},
  'firstname': {'type': 'categorical',
   'pii': True,
   'pii_category': 'first_name'},
  'lastname': {'type': 'categorical',
   'pii': True,
   'pii_category': 'last_name'},
  'mrn': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_number', 8]},
  'nhs_number': {'type': 'categorical',
   'pii': True,
   'pii_category': ['numerify', '4## ### ####']},
  'valid_from': {'type': 'datetime'},
  'comments': {'type': 'categorical'},
  'scheduled_datetime': {'type': 'datetime'},
  'status_change_time': {'type': 'datetime'},
  'name': {'type': 'categorical'},
  'admission_time': {'type': 'datetime'},
  'discharge_time': {'type': 'datetime'},
  'dept_name': {'type': 'categorical'},
  'location_string': {'type': 'categorical'}}}
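
Before fitting, it may be worth listing the categorical fields that are not flagged as PII. The sample output further down shows the same `comments` text appearing in several synthetic rows, which suggests that unflagged categorical values can be reproduced verbatim from the real data. The sketch below works on a plain dictionary shaped like the output above; `unflagged_categoricals` is a hypothetical helper and `table_meta` is a hand-copied subset of that output.

```python
def unflagged_categoricals(table_meta: dict) -> list:
    """Return the names of categorical fields that are not marked as PII."""
    return sorted(
        name
        for name, spec in table_meta["fields"].items()
        if spec.get("type") == "categorical" and not spec.get("pii", False)
    )


# Subset of the metadata shown above, copied by hand for the demo
table_meta = {
    "fields": {
        "firstname": {"type": "categorical", "pii": True, "pii_category": "first_name"},
        "valid_from": {"type": "datetime"},
        "comments": {"type": "categorical"},
        "name": {"type": "categorical"},
        "dept_name": {"type": "categorical"},
    }
}
to_review = unflagged_categoricals(table_meta)
print(to_review)
```

In the notebook the same function could be applied to the dictionary returned by `metadata.get_table_meta('tabpid')`; free-text fields such as `comments` are the ones most likely to need attention.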

Prepare the table(s)

In [21]:
tables = dict(tabpid=df)

Fit the model

In [22]:
sdv = SDV()
sdv.fit(metadata, tables)

Inspect the original data

In [None]:
df.head()

Inspect the synthetic data

In [24]:
sdv.sample_all()['tabpid'].head()

Unnamed: 0,date_of_birth,firstname,lastname,mrn,nhs_number,valid_from,comments,scheduled_datetime,status_change_time,name,admission_time,discharge_time,dept_name,location_string
0,1972-12-06,Laurie,Sullivan,83863567,422 819 1754,2022-06-03 18:53:54,,2022-06-03 18:53:00,2022-06-03 18:53:05,Inpatient Consult to Integrated Discharge Service,2022-06-11 10:45:00,NaT,UCH T09 NORTH (T09N),T12S^T12S BY06^BY06-34
1,1907-09-29,Randall,Palmer,35090382,486 674 7745,2022-06-02 00:34:37,,2022-06-02 00:34:00,2022-06-02 00:33:19,Inpatient consult to PERRT,2022-05-23 11:24:00,NaT,UCH T13 SOUTH (T13S),1020100178^UCHT15ND WAITING^WAIT
2,1913-02-05,Brandon,Bolton,14220815,492 655 2766,2022-06-05 02:23:28,,2022-06-05 02:23:00,2022-06-05 02:22:19,Inpatient consult to General Surgery,2022-06-01 01:50:00,NaT,UCH SDEC,T03^T03 SR07^SR07-07
3,1989-12-26,Aaron,Ramos,7196209,461 352 3032,2022-06-03 14:41:35,,2022-06-03 14:40:00,2022-06-03 14:43:05,Inpatient consult to PERRT,2022-05-25 20:41:00,NaT,UCH T08 SOUTH (T08S),T11S^T11S CB37^CB37-37
4,1958-05-24,Eric,Beasley,32362162,465 262 6649,2022-05-30 11:08:29,Nausea & vomting ++ Pain +\nInquiring if we ca...,2022-05-30 11:08:00,2022-05-30 11:11:08,Inpatient consult to PERRT,2022-06-01 01:18:00,NaT,UCH T16 NORTH (T16N),1021800031^GWB L04E SR31^SR31-31


### Save the synthetic data

Options:
- save the model rather than the synthetic data (but then you need *sdv* installed to sample from the model)
- save the data (take some care with type conversions if you use CSV etc.)
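
As an illustration of the type-conversion care needed with CSV, the sketch below round-trips a tiny made-up frame through an in-memory buffer. Without the `dtype` and `parse_dates` arguments, identifier columns would typically come back as integers (losing any leading zeros) and datetimes would come back as strings. The column names mirror the table above, but the values are invented for the demo.

```python
import io

import pandas as pd

# Made-up rows; note the leading zero in the first identifier
sample_demo = pd.DataFrame(
    {
        "mrn": ["07196209", "83863567"],
        "admission_time": pd.to_datetime(
            ["2022-06-11 10:45:00", "2022-05-23 11:24:00"]
        ),
    }
)

buffer = io.StringIO()
sample_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Read identifiers as text and parse the datetime column explicitly
restored = pd.read_csv(
    buffer,
    dtype={"mrn": str},
    parse_dates=["admission_time"],
)
print(restored.dtypes)
```

Binary formats such as HDF5 keep the dtypes for you, which is why this notebook uses `to_hdf` below.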

In [25]:
sdv.save('mock_model_consults.pkl')

In [26]:
sample = sdv.sample_all()
sample_df = sample['tabpid']
sample_df.to_hdf('mock_consults.h5', 'data')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block1_values] [items->Index(['firstname', 'lastname', 'mrn', 'nhs_number', 'comments', 'name',
       'dept_name', 'location_string'],
      dtype='object')]

  sample_df.to_hdf('mock_consults.h5', 'data')


In [27]:
pd.read_hdf('mock_consults.h5', 'data')

Unnamed: 0,date_of_birth,firstname,lastname,mrn,nhs_number,valid_from,comments,scheduled_datetime,status_change_time,name,admission_time,discharge_time,dept_name,location_string
0,1996-04-03,Louis,Torres,82099664,498 603 8066,2022-06-07 01:05:02,,2022-06-07 00:58:00,2022-06-07 01:01:11,Inpatient consult to PERRT,2022-06-15 22:50:00,NaT,NHNN Q5 H JACKSON WARD,MVNQ^Q6MV BY02^BY02-04
1,1992-06-17,Bruce,Wilson,80591699,474 215 6503,2022-06-02 12:45:22,Nausea & vomting ++ Pain +\nInquiring if we ca...,2022-06-02 12:46:00,2022-06-02 12:45:18,Inpatient consult to Dietetics (N&D) - Not TPN,2022-05-23 21:16:00,NaT,NHNN C3 DAVID FERRIER,1020100164^T14NT BY03^BY03-17
2,1933-08-01,Kyle,Phillips,50936010,439 172 3543,2022-06-03 07:48:56,,2022-06-03 07:49:00,2022-06-03 07:48:35,Inpatient consult to Adult Diabetes CNS,2022-05-30 02:33:00,NaT,UCH T09 SOUTH (T09S),T08N^T08N BY03^BY03-09
3,1958-09-21,Edgar,Williams,15805307,445 337 6636,2022-06-07 16:07:03,,2022-06-07 16:03:00,2022-06-07 16:07:22,Inpatient consult to Adult Endocrine & Diabetes,2022-06-04 00:49:00,NaT,UCH T13 NORTH ONCOLOGY,T09N^T09N BY04^BY04-22
4,1991-06-11,Michelle,Alexander,4097531,460 714 0920,2022-06-02 10:27:11,,2022-06-02 10:27:00,2022-06-02 10:26:55,Inpatient Consult to Integrated Discharge Service,2022-05-10 18:38:00,NaT,UCH T09 NORTH (T09N),1021800027^GWB L03N SR16^SR16-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397,1970-03-02,Cynthia,Hall,57552139,418 876 4753,2022-06-05 04:25:33,,2022-06-05 04:23:00,2022-06-05 04:25:57,Inpatient consult to Haematology,2022-06-05 01:26:00,NaT,NHNN C3 BERNARD SUNLEY,VYNQ^C2VH BY03^BY03-04
398,1950-12-08,Patricia,Brown,31987852,490 771 1840,2022-05-31 23:15:07,,2022-05-31 23:14:00,2022-05-31 23:15:04,Inpatient consult to Urology,2022-06-03 20:19:00,NaT,NHNN C3 BERNARD SUNLEY,ED^UCHED PAEDS05^05-PAEDS
399,1982-06-19,Daniel,Young,48078177,454 834 7924,2022-06-05 08:52:21,Nausea & vomting ++ Pain +\nInquiring if we ca...,2022-06-05 08:50:00,2022-06-05 08:52:26,Inpatient consult to Dietetics (N&D) - Not TPN,2022-06-08 17:05:00,NaT,UCH T01 ENHANCED CARE,1020100163^T13NO BY02^BY02-24
400,1947-01-31,William,Johnson,89675078,445 619 4912,2022-06-08 16:43:29,Review please for pain management,2022-06-08 16:46:00,2022-06-08 16:45:22,Inpatient consult to Obstetrics,2022-05-28 23:01:00,NaT,UCH T14 NORTH TRAUMA,T11S^T11S CB37^CB37-37
