# Prepare synthetic data (Vitals)
Development is much easier when you have realistic synthetic data.
Here we take a SQL query that returns a single table.
We run that query once against the live, identifiable data.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to fit a synthetic model of those data.
The code below is a vignette for that process and will need adjusting to match the exact columns returned by the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, and only once.

## Set-up, query and return the data as a dataframe
The query lives in `./src/api/sitrep/` where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run`, then that query will be copied here automatically.

So the first steps, run from _this_ directory, are:
```sh
make mock1build
make mock2copyin
make mock3run
```
Then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data_vitals.ipynb

In [121]:
import os
import pandas as pd
import numpy as np
import requests

from pathlib import Path
import sqlalchemy as sa

In [122]:
# Construct the PostgreSQL connection
emap_host = os.getenv('EMAP_DB_HOST')
emap_user = os.getenv('EMAP_DB_USER')
emap_passwd = os.getenv('EMAP_DB_PASSWORD')

emapdb_engine = sa.create_engine(f'postgresql://{emap_user}:{emap_passwd}@{emap_host}:5432/uds')
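
The f-string URL above breaks if the password contains reserved URL characters such as `@` or `/`. A minimal sketch of one way around that, using percent-encoding from the standard library; the helper name `build_postgres_url` and the example credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_postgres_url(user: str, password: str, host: str,
                       port: int = 5432, database: str = "uds") -> str:
    """Return a PostgreSQL URL with the credentials percent-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up credentials, purely for illustration
example_url = build_postgres_url("reader", "p@ss/word", "db.example.org")
print(example_url)
```

The resulting string could then be passed to `sa.create_engine` in place of the f-string.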

In [123]:
# Read the sql file into a query 'q' and the query into a dataframe
q = Path('vitals.sql').read_text()

In [124]:
df = pd.read_sql_query(sa.text(q), emapdb_engine)

Now rather than using a SQL query, this time we are going to use an existing API to populate a dataframe

In [125]:
df.shape

(4166, 17)

In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4166 entries, 0 to 4165
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   visit_observation_id    4166 non-null   int64              
 1   ob_tail_i               4166 non-null   int64              
 2   observation_datetime    4166 non-null   datetime64[ns, UTC]
 3   id_in_application       4166 non-null   object             
 4   value_as_real           2174 non-null   float64            
 5   value_as_text           1981 non-null   object             
 6   unit                    546 non-null    object             
 7   mrn                     4166 non-null   object             
 8   lastname                4166 non-null   object             
 9   firstname               4166 non-null   object             
 10  sex                     4166 non-null   object             
 11  date_of_birth           4166 non-null   obj

## Generate a synthetic version of the real data

Use the table above to generate the metadata you need for the synthetic data

https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata

In [127]:
# Minimal imports
from sdv import Metadata, SDV
from sdv.constraints import FixedCombinations

SDV does not handle timezone-aware datetimes well, so strip the timezone information first.

In [80]:
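# PostgreSQL returns datetimes with tz info which sdv does not seem to be able to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Convert col_name to UTC, then drop the tz info (sdv does not like timezones).

    Note that the dataframe passed in is modified in place as well as returned.
    """
    df[col_name] = pd.to_datetime(df[col_name], utc=True).dt.tz_localize(None)
    return df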

In [128]:
tz_cols = ['observation_datetime', 'bed_admit_dt', 'perrt_consult_datetime']
for col in tz_cols:
    print(col)
    df = remove_timezone(df, col)

observation_datetime
bed_admit_dt
perrt_consult_datetime
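
To see what the conversion does, here is a small sketch on a toy frame (the values are made up). It applies the same two steps as `remove_timezone`: normalise to UTC, then drop the timezone information.

```python
import pandas as pd

# Toy frame standing in for the query result (values are made up)
toy = pd.DataFrame({
    "observation_datetime": pd.to_datetime(
        ["2022-06-26 10:54:02+01:00", "2022-06-26 09:38:49+00:00"], utc=True
    )
})
before = str(toy["observation_datetime"].dtype)

# Same conversion as remove_timezone: normalise to UTC, then drop the tz info
toy["observation_datetime"] = (
    pd.to_datetime(toy["observation_datetime"], utc=True).dt.tz_localize(None)
)
after = str(toy["observation_datetime"].dtype)

print(before, "->", after)
print(toy)
```

The resulting naive timestamps are UTC wall-clock times, not local times, so the first toy value ends up one hour earlier than the `+01:00` string it was built from.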


### Define PII that must be faked and not modelled

Define the fields that contain PII and need faking. See the sketchy documentation [here](https://sdv.dev/SDV/developer_guides/sdv/metadata.html?highlight=pii#categorical-fields-data-anonymization) and the [Faker Documentation](https://faker.readthedocs.io/en/master/providers.html) for a full list of providers. Here is a brief example that specifies Fakers for [name](https://faker.readthedocs.io/en/master/providers/faker.providers.person.html#faker.providers.person.Provider.name) and [date of birth](https://faker.readthedocs.io/en/master/providers/faker.providers.date_time.html#faker.providers.date_time.Provider.date_of_birth). Note that you must pass arguments to a faker as a list.

NB: sdv does not always infer column types correctly. Here we specify date_of_birth explicitly as a date whilst working on the larger task of defining columns that contain PII. See [field details](https://sdv.dev/SDV/developer_guides/sdv/metadata.html#field-details)
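
The note above about passing arguments as a list can be sketched as follows. This assumes that the first element of the list is the Faker method name and the remaining elements are its positional arguments; the helper `split_pii_category` is hypothetical and is not part of SDV.

```python
from typing import Any, List, Tuple, Union

PiiCategory = Union[str, List[Any]]


def split_pii_category(spec: PiiCategory) -> Tuple[str, Tuple[Any, ...]]:
    """Split a pii_category value into (Faker method name, positional args)."""
    if isinstance(spec, str):
        return spec, ()
    name, *args = spec
    return name, tuple(args)


# A bare string names a Faker method that is called with no arguments
print(split_pii_category("last_name"))
# A list gives the method name first, then its arguments
print(split_pii_category(["random_number", 8]))
```

With Faker installed, `getattr(Faker(), name)(*args)` would then produce one fake value. Faker's `random_number(digits, fix_len=False)` returns a number of up to `digits` digits unless `fix_len` is true, so MRNs faked this way can vary in length.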

Now a full specification for the Vitals data

In [129]:
fields = {
    'visit_observation_id' : {
        'type': 'id',
        'subtype': 'integer',
    },
    'date_of_birth': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
        'pii': True,
        # the 'pii_category' key defines the Faker function name (method)
        'pii_category': "date_of_birth", 
    },
    'lastname': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'last_name'
    },
    'firstname': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'first_name'
    },
    'mrn': {
        'type': 'categorical',
        'pii': True,
        # a list gives the Faker method name followed by its arguments
        'pii_category': ['random_number', 8]
    },
}
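
Because the field specification has to match the columns returned by the query, it can help to check for misspelt or stale field names before building the metadata. A minimal sketch; the helper `missing_fields` and the toy inputs are made up for illustration.

```python
from typing import Dict, Iterable, List


def missing_fields(fields: Dict[str, dict], columns: Iterable[str]) -> List[str]:
    """Return field names that have no matching column, sorted for stable output."""
    return sorted(set(fields) - set(columns))


# Toy inputs; in the notebook these would be `fields` and `df.columns`
toy_fields = {"mrn": {"type": "categorical"}, "lastnam": {"type": "categorical"}}
toy_columns = ["mrn", "lastname", "firstname"]

print(missing_fields(toy_fields, toy_columns))
```

An empty list means every specified field has a matching column.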

Prepare the constraints

In [130]:
fixed_obs_constraint = FixedCombinations(
    column_names=['id_in_application', 'value_as_text', 'value_as_real', 'unit'],
    # the 'transform' strategy fails, possibly because the column data types vary
    handling_strategy='reject_sampling',
)
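
The intent of the constraint is that the synthetic data only contain combinations of these columns that also occur in the real data. A rough way to check that after sampling is sketched below; the helper `combinations_seen` is hypothetical and the toy values are made up.

```python
import pandas as pd


def combinations_seen(frame: pd.DataFrame, cols: list) -> set:
    """Return the distinct value combinations of `cols`, with missing values
    replaced by a sentinel so that they compare equal to each other."""
    sub = frame[cols].astype(object)
    sub = sub.where(sub.notna(), "<missing>")
    return set(sub.itertuples(index=False, name=None))


cols = ["id_in_application", "unit"]
real = pd.DataFrame({"id_in_application": ["10", "5"], "unit": ["%", None]})
synth = pd.DataFrame({"id_in_application": ["10", "5", "5"], "unit": ["%", None, "%"]})

# Combinations present in the synthetic data but never seen in the real data
unseen = combinations_seen(synth, cols) - combinations_seen(real, cols)
print(unseen)
```

If the constraint has worked, the same check on `dfs` and `df` should give an empty set.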

Prepare the metadata

In [131]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
    primary_key='visit_observation_id',
    constraints=[fixed_obs_constraint], # be patient; reject_sampling is slow
)

In [132]:
# Inspect the conversion that metadata.add_table did to the dataframe that you loaded
metadata.get_table_meta('tabpid')

{'fields': {'visit_observation_id': {'type': 'id', 'subtype': 'integer'},
  'date_of_birth': {'type': 'datetime',
   'format': '%Y-%m-%d',
   'pii': True,
   'pii_category': 'date_of_birth'},
  'lastname': {'type': 'categorical',
   'pii': True,
   'pii_category': 'last_name'},
  'firstname': {'type': 'categorical',
   'pii': True,
   'pii_category': 'first_name'},
  'mrn': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_number', 8]},
  'ob_tail_i': {'type': 'numerical', 'subtype': 'integer'},
  'observation_datetime': {'type': 'datetime'},
  'id_in_application': {'type': 'categorical'},
  'value_as_real': {'type': 'numerical', 'subtype': 'float'},
  'value_as_text': {'type': 'categorical'},
  'unit': {'type': 'categorical'},
  'sex': {'type': 'categorical'},
  'bed_admit_dt': {'type': 'datetime'},
  'dept_name': {'type': 'categorical'},
  'room_name': {'type': 'categorical'},
  'bed_hl7': {'type': 'categorical'},
  'perrt_consult_datetime': {'type': 'datetime'}},
 

Prepare the table(s)

In [133]:
tables = dict(tabpid=df)

Fit the model

In [134]:
sdv = SDV()
sdv.fit(metadata, tables)

In [135]:
dfs = sdv.sample_all()['tabpid']

In [136]:
dfs.loc[:5]

Unnamed: 0,visit_observation_id,date_of_birth,lastname,firstname,mrn,ob_tail_i,observation_datetime,id_in_application,value_as_real,value_as_text,unit,sex,bed_admit_dt,dept_name,room_name,bed_hl7,perrt_consult_datetime
0,0,1976-03-20,Adams,Suzanne,8887833,1,2022-06-26 10:54:02,3040109304,,Room air,,M,2022-06-30 00:41:47,UCH EMERGENCY DEPT,BY06,UTC TZ,NaT
1,1,1963-01-31,Scott,Stephen,55582361,1,2022-06-26 09:38:49,5,,,,M,2022-06-25 17:12:23,UCH T02 VASCULAR ANGIO,OTF,SR04-04,NaT
2,2,1925-05-19,Love,Willie,23882699,1,2022-06-26 09:48:55,5,,130/72,,M,2022-07-11 06:05:08,UCH T08 SOUTH (T08S),SR08,BY05-36,NaT
3,3,2009-03-21,Rice,Adam,29488187,1,2022-06-26 08:52:51,9,,,,F,2022-06-20 21:40:10,UCH T16 NORTH (T16N),SR34,SR32-32,2022-06-18 02:58:00
4,4,2018-05-12,Kline,Donald,65051604,1,2022-06-26 07:35:18,6,,,,M,2022-07-04 00:13:53,UCH T02 DAY SURG THR,BY03,BY05-24,NaT
5,5,1959-01-13,Roberts,William,77402084,1,2022-06-26 09:11:01,9,,,,M,2022-07-07 04:32:32,UCH SDEC,BY04,BY07-35,NaT


In [137]:
dfs['id_in_application'].value_counts().sort_index()

10              33
28315          411
28316           63
3040109304     858
5              390
6              310
6466          1508
8              370
9              223
Name: id_in_application, dtype: int64

Inspect the original data

In [138]:
df['id_in_application'].value_counts().sort_index()

10            546
28315         431
28316          28
3040109304    510
5             493
6             550
6466          524
8             544
9             540
Name: id_in_application, dtype: int64
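
The two sets of counts differ noticeably: code `10` makes up about 13% of the real rows (546 of 4166) but under 1% of the synthetic rows (33 of 4166). A side-by-side table of proportions makes such gaps easier to spot. A minimal sketch on toy data; the helper `compare_proportions` is made up for illustration.

```python
import pandas as pd


def compare_proportions(real: pd.Series, synth: pd.Series) -> pd.DataFrame:
    """Tabulate the share of each category in the real and synthetic columns."""
    out = pd.concat(
        [real.value_counts(normalize=True), synth.value_counts(normalize=True)],
        axis=1,
        keys=["real", "synthetic"],
    ).fillna(0.0)  # a category absent from one side counts as a share of zero
    out["difference"] = out["synthetic"] - out["real"]
    return out.sort_index()


toy_real = pd.Series(["10", "10", "5", "6"])
toy_synth = pd.Series(["5", "5", "5", "6"])
result = compare_proportions(toy_real, toy_synth)
print(result)
```

In the notebook the call would be `compare_proportions(df['id_in_application'], dfs['id_in_application'])`.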

In [139]:
# df.head()

Inspect the synthetic data

Finally transform admission_age_years to match the fake DoB
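
The table shown in this notebook has no `admission_age_years` column, so the following is only a sketch of how an age could be recomputed from a faked date of birth. The helper `age_in_years` is hypothetical, and which date counts as the admission date is an assumption.

```python
from datetime import date


def age_in_years(date_of_birth: date, on: date) -> int:
    """Completed years between date_of_birth and the reference date `on`."""
    had_birthday = (on.month, on.day) >= (date_of_birth.month, date_of_birth.day)
    return on.year - date_of_birth.year - (0 if had_birthday else 1)


# Dates taken from the first synthetic row shown above
print(age_in_years(date(1976, 3, 20), date(2022, 6, 30)))
```

A faked date of birth can fall after the admission date, which would give a negative age, so such rows need handling separately.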

### Save the synthetic data

Options
- save the model and not the synthetic data (but then you need *sdv* to run the model)
- save the data (need some care with type conversions if you use csv etc.)
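
The care needed with CSV can be illustrated on a toy frame (values made up): a naive read turns an identifier with a leading zero into an integer and leaves datetimes as text, whereas passing `dtype` and `parse_dates` to `pd.read_csv` restores both.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "mrn": ["08887833", "55582361"],
    "observation_datetime": pd.to_datetime(
        ["2022-06-26 10:54:02", "2022-06-26 09:38:49"]
    ),
})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Naive read: mrn is parsed as an integer, so the leading zero is lost
buffer.seek(0)
naive = pd.read_csv(buffer)

# Explicit read: keep mrn as text and parse the datetime column
buffer.seek(0)
typed = pd.read_csv(
    buffer, dtype={"mrn": str}, parse_dates=["observation_datetime"]
)

print(naive["mrn"].tolist())
print(typed["mrn"].tolist())
```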

In [140]:
# the fitted object is called `sdv` in this notebook (no `model` is defined above)
sdv.save('mock_vitals.pkl')

In [141]:
dfs.to_hdf('mock_vitals.h5', 'data')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block2_values] [items->Index(['lastname', 'firstname', 'mrn', 'id_in_application', 'value_as_text',
       'unit', 'sex', 'dept_name', 'room_name', 'bed_hl7'],
      dtype='object')]

  dfs.to_hdf('mock_vitals.h5', 'data')


In [142]:
pd.read_hdf('mock_vitals.h5', 'data')

Unnamed: 0,visit_observation_id,date_of_birth,lastname,firstname,mrn,ob_tail_i,observation_datetime,id_in_application,value_as_real,value_as_text,unit,sex,bed_admit_dt,dept_name,room_name,bed_hl7,perrt_consult_datetime
0,0,1976-03-20,Adams,Suzanne,8887833,1,2022-06-26 10:54:02,3040109304,,Room air,,M,2022-06-30 00:41:47,UCH EMERGENCY DEPT,BY06,UTC TZ,NaT
1,1,1963-01-31,Scott,Stephen,55582361,1,2022-06-26 09:38:49,5,,,,M,2022-06-25 17:12:23,UCH T02 VASCULAR ANGIO,OTF,SR04-04,NaT
2,2,1925-05-19,Love,Willie,23882699,1,2022-06-26 09:48:55,5,,130/72,,M,2022-07-11 06:05:08,UCH T08 SOUTH (T08S),SR08,BY05-36,NaT
3,3,2009-03-21,Rice,Adam,29488187,1,2022-06-26 08:52:51,9,,,,F,2022-06-20 21:40:10,UCH T16 NORTH (T16N),SR34,SR32-32,2022-06-18 02:58:00
4,4,2018-05-12,Kline,Donald,65051604,1,2022-06-26 07:35:18,6,,,,M,2022-07-04 00:13:53,UCH T02 DAY SURG THR,BY03,BY05-24,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4161,4161,2021-08-17,Romero,Christine,96344082,1,2022-06-26 10:05:28,9,,,,M,2022-06-24 15:49:13,UCH T06 HEAD (T06H),SR04,BY12-56,NaT
4162,4162,1933-12-08,Estrada,Timothy,90176637,1,2022-06-26 07:53:35,6466,,A,,M,2022-07-18 04:26:24,UCH T09 NORTH (T09N),RAT 06,BY01-02,NaT
4163,4163,1971-03-23,Harper,Nicholas,95254572,1,2022-06-26 09:10:59,6466,,A,,F,2022-06-03 11:24:15,UCH T09 NORTH (T09N),BY04,03 SDEC,2022-06-17 12:23:00
4164,4164,2001-05-13,Montgomery,Natasha,85394685,1,2022-06-26 08:19:52,3040109304,,Room air,,F,2022-06-10 15:09:12,UCH EMERGENCY DEPT,BY13,UTC TZ,2022-06-17 07:28:00
