# Prepare synthetic data
Development is much easier if you have realistic synthetic data.
Here we take a SQL query that generates a single tabular output.
We run that query once against the live identifiable data.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to fit a synthetic model of those data.
The code below serves as a vignette for that process, but it will need adjusting to match the exact contents of the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, and just once.

## Set-up, query and return the data as a dataframe
The query lives in `./src/api/sitrep/`, where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run`, then that query will be copied here automatically.

So the first steps should be (from _this_ directory):
```sh
make mock1build
make mock2copyin
make mock3run
```
Then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data_sitrep.ipynb

In [193]:
import os
import pandas as pd
import numpy as np
import requests

from pathlib import Path
from sqlalchemy import create_engine

Rather than using a SQL query, this time we use an existing API to populate a dataframe.

In [45]:
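# Fetch the live data from the existing API and load the 'data' payload
response = requests.get("http://uclvlddpragae07:5006/icu/live/T03/ui")
response.raise_for_status()  # fail early on an HTTP error
df = pd.DataFrame.from_dict(response.json()['data'])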
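
The call above assumes that the API returns JSON with the rows stored under a `data` key. A minimal sketch of a slightly more defensive loader follows. The helper name `payload_to_frame` and the toy payload are hypothetical and only stand in for `response.json()`; the real payload may be shaped differently.

```python
import pandas as pd


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Build a dataframe from a decoded JSON payload with rows under 'data'."""
    if "data" not in payload:
        raise KeyError("expected a 'data' key in the API response")
    frame = pd.DataFrame(payload["data"])
    if frame.empty:
        raise ValueError("the API returned no rows")
    return frame


# Toy payload standing in for response.json(); the values are made up
toy_payload = {
    "data": [
        {"episode_slice_id": 1, "ward_code": "T03", "sex": "F"},
        {"episode_slice_id": 2, "ward_code": "T03", "sex": "M"},
    ]
}
toy_df = payload_to_frame(toy_payload)
print(toy_df.dtypes)
```

Failing loudly here is cheaper than discovering an empty or malformed table after the model has been fitted.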

In [50]:
df.dtypes

episode_slice_id                 int64
csn                             object
admission_dt            datetime64[ns]
elapsed_los_td                 float64
bed_code                        object
bay_code                        object
bay_type                        object
ward_code                       object
mrn                             object
name                            object
dob                             object
admission_age_years              int64
sex                             object
is_proned_1_4h                    bool
discharge_ready_1_4h            object
is_agitated_1_8h                  bool
n_inotropes_1_4h                 int64
had_nitric_1_8h                   bool
had_rrt_1_4h                      bool
had_trache_1_12h                  bool
vent_type_1_4h                  object
avg_heart_rate_1_24h           float64
max_temp_1_12h                 float64
avg_resp_rate_1_24h            float64
wim_1                            int64
dtype: object

In [None]:
df.head()

## Generate a synthetic version of the real data

Use the dtypes listed above to write the metadata that SDV needs to model the data. The relational metadata guide describes the format:

https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata
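
As a starting point for hand-written metadata, the pandas dtypes can be mapped to the field types used in this notebook (`numerical`, `categorical`, `boolean`, `datetime`). The sketch below only illustrates that mapping; it is not SDV's own inference code, and the function name `guess_field_types` and the toy frame are hypothetical.

```python
import pandas as pd

# Map NumPy dtype kind codes to the field descriptions used in this notebook
KIND_TO_FIELD = {
    "i": {"type": "numerical", "subtype": "integer"},
    "f": {"type": "numerical", "subtype": "float"},
    "b": {"type": "boolean"},
    "M": {"type": "datetime"},
}


def guess_field_types(frame: pd.DataFrame) -> dict:
    """Return a first-pass field spec; anything unrecognised is 'categorical'."""
    return {
        col: dict(KIND_TO_FIELD.get(dtype.kind, {"type": "categorical"}))
        for col, dtype in frame.dtypes.items()
    }


# Toy frame with one column of each kind (made-up values)
toy = pd.DataFrame(
    {
        "episode_slice_id": [1, 2],
        "admission_dt": pd.to_datetime(["2022-06-01", "2022-06-02"]),
        "elapsed_los_td": [25500.0, 594300.0],
        "is_proned_1_4h": [False, True],
        "ward_code": ["T03", "T03"],
    }
)
for name, spec in guess_field_types(toy).items():
    print(name, spec)
```

Columns that hold PII still need an explicit entry, because nothing in the dtype marks them as sensitive.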

In [47]:
# Minimal imports
from sdv import Metadata, SDV

SDV does not handle timezones well, so remove them.

In [48]:
# PostgreSQL returns datetimes with tz info, which sdv does not seem to be able to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Convert a column to UTC and drop its timezone (sdv does not like timezones)."""
    df[col_name] = pd.to_datetime(df[col_name], utc=True).dt.tz_localize(None)
    return df

In [49]:
tz_cols = ['admission_dt',]
for col in tz_cols:
    print(col)
    df = remove_timezone(df, col)

admission_dt
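
To make the effect of the timezone removal concrete, here is a small sketch on made-up timestamps with a `+01:00` offset. The values are first converted to UTC and then stripped of timezone information, so the wall-clock time shifts by one hour.

```python
import pandas as pd

# Toy timestamps with a +01:00 offset (made-up values)
toy = pd.DataFrame(
    {"admission_dt": ["2022-06-15T23:08:00+01:00", "2022-05-23T00:44:00+01:00"]}
)

# Parse as UTC, then drop the timezone so the column holds naive datetimes
toy["admission_dt"] = pd.to_datetime(toy["admission_dt"], utc=True).dt.tz_localize(None)

print(toy["admission_dt"].dt.tz)     # None: the column is now timezone-naive
print(toy["admission_dt"].iloc[0])   # 2022-06-15 22:08:00
```

Note that the stored values are UTC times, not local times, once the timezone has been dropped.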


### Define PII that must be faked and not modelled

Define the fields that contain PII and need faking. See the (sketchy) SDV documentation [here](https://sdv.dev/SDV/developer_guides/sdv/metadata.html?highlight=pii#categorical-fields-data-anonymization) and the [Faker Documentation](https://faker.readthedocs.io/en/master/providers.html) for a full list of providers. Here is a brief example that specifies Fakers for [name](https://faker.readthedocs.io/en/master/providers/faker.providers.person.html#faker.providers.person.Provider.name) and [date of birth](https://faker.readthedocs.io/en/master/providers/faker.providers.date_time.html#faker.providers.date_time.Provider.date_of_birth). Note that arguments to a Faker provider must be passed as a list, with the provider name first.

In [139]:
fields = {
    'dob': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
        'pii': True,
        'pii_category': "date_of_birth", 
    },
    'name': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'name'
    },
    
}

NB: sdv does not always recognise column types correctly. Here we explicitly specify `dob` (date of birth) as a datetime while working on the larger task of defining the columns that contain PII. See [field details](https://sdv.dev/SDV/developer_guides/sdv/metadata.html#field-details).

Now a full specification for the Sitrep data:

In [178]:
fields = {
    'dob': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
        'pii': True,
        # the 'pii_category' key defines the Faker function name (method)
        'pii_category': "date_of_birth", 
    },
    'admission_age_years': {
        'type': 'numerical',
        'pii': True,
        'pii_category': ['random_number', 2 ]
    },
    'name': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'name'
    },
    'mrn': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['random_number', 8 ]
    },
    'csn': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['numerify', '10########' ]
    },
}
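
A hand-written specification like this is easy to get subtly wrong, for example through a misspelt column name. The following sketch checks a spec against a list of column names before it is passed on. The helper `check_fields` and the toy spec are hypothetical, and the check covers only the two mistakes shown.

```python
def check_fields(fields: dict, columns) -> dict:
    """Report spec entries that do not match the data or lack a Faker provider."""
    columns = set(columns)
    return {
        "not_in_data": sorted(name for name in fields if name not in columns),
        "missing_pii_category": sorted(
            name
            for name, spec in fields.items()
            if spec.get("pii") and "pii_category" not in spec
        ),
    }


# Toy spec with two deliberate mistakes (hypothetical values)
toy_fields = {
    "dob": {"type": "datetime", "format": "%Y-%m-%d", "pii": True,
            "pii_category": "date_of_birth"},
    "nmae": {"type": "categorical", "pii": True, "pii_category": "name"},  # typo
    "mrn": {"type": "categorical", "pii": True},  # no provider given
}
toy_columns = ["dob", "name", "mrn", "csn"]
report = check_fields(toy_fields, toy_columns)
print(report)
```

In this notebook the equivalent call would be `check_fields(fields, df.columns)`, and both lists in the report should come back empty.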

Prepare the metadata

In [179]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
)

In [180]:
# Inspect the conversion that metadata.add_table did to the dataframe that you loaded
metadata.get_table_meta('tabpid')

{'fields': {'dob': {'type': 'datetime',
   'format': '%Y-%m-%d',
   'pii': True,
   'pii_category': 'date_of_birth'},
  'admission_age_years': {'type': 'numerical',
   'pii': True,
   'pii_category': ['random_number', 2]},
  'name': {'type': 'categorical', 'pii': True, 'pii_category': 'name'},
  'mrn': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_number', 8]},
  'csn': {'type': 'categorical',
   'pii': True,
   'pii_category': ['numerify', '10########']},
  'episode_slice_id': {'type': 'numerical', 'subtype': 'integer'},
  'admission_dt': {'type': 'datetime'},
  'elapsed_los_td': {'type': 'numerical', 'subtype': 'float'},
  'bed_code': {'type': 'categorical'},
  'bay_code': {'type': 'categorical'},
  'bay_type': {'type': 'categorical'},
  'ward_code': {'type': 'categorical'},
  'sex': {'type': 'categorical'},
  'is_proned_1_4h': {'type': 'boolean'},
  'discharge_ready_1_4h': {'type': 'categorical'},
  'is_agitated_1_8h': {'type': 'boolean'},
  'n_inotropes_1_4h':

Prepare the table(s)

In [181]:
tables = dict(tabpid=df)

Fit the model

In [182]:
sdv = SDV()
sdv.fit(metadata, tables)

Write the original (identifiable) data to CSV so that it can be inspected

In [183]:
df.to_csv('sitrep_orig.csv')

Inspect the synthetic data

In [184]:
sdv.sample_all()['tabpid'].head()

Unnamed: 0,dob,admission_age_years,name,mrn,csn,episode_slice_id,admission_dt,elapsed_los_td,bed_code,bay_code,...,is_agitated_1_8h,n_inotropes_1_4h,had_nitric_1_8h,had_rrt_1_4h,had_trache_1_12h,vent_type_1_4h,avg_heart_rate_1_24h,max_temp_1_12h,avg_resp_rate_1_24h,wim_1
0,1925-10-21,51.0,Phillip Chen,50039936,1048914148,346233,2022-06-15 22:08:00,25500.0,SR07-07,BY04,...,False,0,False,False,False,Room air,90.421007,37.135396,19.031167,0
1,1994-10-27,19.0,William Oconnor,87708341,1010672303,347208,2022-05-22 23:44:00,1097780.0,SR08-08,SR09,...,False,0,False,False,True,Oxygen,71.629085,36.155943,17.612935,0
2,1911-06-04,61.0,Deborah Moran,74533030,1042127015,345456,2022-06-10 23:51:00,25500.0,BY03-25,BY03,...,False,0,False,False,False,Ventilated,76.960221,35.680047,18.985467,0
3,2019-02-03,56.0,Brandon Scott,87262209,1082230082,344197,2022-06-01 21:59:00,25500.0,BY02-17,BY03,...,False,0,False,False,False,Unknown,80.314818,37.289888,23.198947,2
4,1930-02-18,76.0,Christopher Casey,3012876,1094052118,348749,2022-06-04 16:12:00,594300.0,BY04-27,BY01,...,False,0,False,False,True,Oxygen,76.807276,36.792177,19.73903,1
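
When inspecting the sample, one useful check is that no real identifier has survived into the synthetic table. The sketch below lists the values shared between two frames for chosen columns. The helper `overlapping_values` and the toy frames are hypothetical and stand in for the real and sampled tables.

```python
import pandas as pd


def overlapping_values(orig: pd.DataFrame, synth: pd.DataFrame, cols) -> dict:
    """For each column, return the values that appear in both frames."""
    return {
        col: sorted(set(orig[col].astype(str)) & set(synth[col].astype(str)))
        for col in cols
    }


# Toy frames standing in for the real and sampled tables (made-up values)
toy_orig = pd.DataFrame(
    {"mrn": ["11111111", "22222222"], "name": ["A Patient", "B Patient"]}
)
toy_synth = pd.DataFrame(
    {"mrn": ["50039936", "22222222"], "name": ["Phillip Chen", "William Oconnor"]}
)
overlap = overlapping_values(toy_orig, toy_synth, ["mrn", "name"])
print(overlap)
```

A chance collision is possible for short random numbers, so a non-empty result is a prompt to look closer rather than proof of a leak. Running this on the real data means keeping a copy of the original dataframe before `df` is reassigned.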


Finally, recompute admission_age_years so that it matches the fake DoB

In [186]:
df = sdv.sample_all()['tabpid']

In [205]:
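# Age in whole years at admission, using the mean Gregorian year (365.2425 days);
# pd.Timedelta avoids the ambiguous 'Y' unit, which newer pandas versions may reject
df['admission_age_years'] = np.floor((df['admission_dt'] - df['dob']) / pd.Timedelta(days=365.2425)).astype(int)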

In [207]:
df.head()

Unnamed: 0,dob,admission_age_years,name,mrn,csn,episode_slice_id,admission_dt,elapsed_los_td,bed_code,bay_code,...,is_agitated_1_8h,n_inotropes_1_4h,had_nitric_1_8h,had_rrt_1_4h,had_trache_1_12h,vent_type_1_4h,avg_heart_rate_1_24h,max_temp_1_12h,avg_resp_rate_1_24h,wim_1
0,1960-04-06,62,Kayla Humphrey,72583775,1096467807,343812,2022-05-25 13:49:00,465270.0,SR04-04,SR04,...,False,0,False,False,False,Ventilated,73.298531,37.166279,22.798999,1
1,2018-07-28,3,Victoria Jones,72643979,1093922604,348749,2022-05-23 23:32:00,1231910.0,SR04-04,SR09,...,False,0,False,False,False,Oxygen,87.351178,35.493027,20.76373,0
2,1997-11-14,24,Mrs. Karen Miller,48679233,1083049076,348749,2022-05-09 07:59:00,2493210.0,BY02-21,BY04,...,False,0,False,False,True,Ventilated,79.301645,35.752834,21.158517,2
3,2008-12-31,13,April Patel,15704253,1020485525,347783,2022-06-02 12:52:00,254910.0,BY01-14,BY04,...,False,0,False,True,False,Ventilated,100.434768,35.79681,21.395963,2
4,1962-08-26,59,Abigail Curry,11607520,1013378594,348749,2022-05-29 02:22:00,1361650.0,SR08-08,SR07,...,False,0,False,False,False,Oxygen,89.540401,36.099054,22.14066,1
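
Dividing by an average year length can be off by one for admissions that fall very close to a birthday. An alternative sketch that counts whole calendar years is shown below. The function name `age_in_years` is hypothetical; the first two rows reuse dates from the output above and the third is a made-up edge case one day before a birthday.

```python
import pandas as pd


def age_in_years(dob: pd.Series, at: pd.Series) -> pd.Series:
    """Whole calendar years between dob and at (birthday-aware)."""
    dob = pd.to_datetime(dob)
    at = pd.to_datetime(at)
    # True where the birthday has not yet happened in the year of `at`
    before_birthday = (at.dt.month < dob.dt.month) | (
        (at.dt.month == dob.dt.month) & (at.dt.day < dob.dt.day)
    )
    return at.dt.year - dob.dt.year - before_birthday.astype(int)


toy = pd.DataFrame(
    {
        "dob": ["1960-04-06", "2018-07-28", "2000-06-15"],
        "admission_dt": [
            "2022-05-25 13:49:00",
            "2022-05-23 23:32:00",
            "2022-06-14 09:00:00",
        ],
    }
)
toy["admission_age_years"] = age_in_years(toy["dob"], toy["admission_dt"])
print(toy["admission_age_years"].tolist())  # [62, 3, 21]
```

For mock data either approach is adequate; the calendar version matters only if downstream code is sensitive to exact ages.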


In [208]:
df.to_csv('sitrep_synth.csv')

### Save the synthetic data

Options:
- save the model and not the synthetic data (but then you need *sdv* installed to sample from the model)
- save the data (this needs some care with type conversions if you use CSV etc.)
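
The type-conversion caveat for CSV can be seen in a small round trip. In this sketch on made-up values, an identifier with a leading zero is read back as an integer and the datetime column is read back as text unless the types are declared on read.

```python
import io

import pandas as pd

# Toy frame: a string identifier with a leading zero and a datetime column
toy = pd.DataFrame(
    {
        "mrn": ["03012876", "72583775"],
        "admission_dt": pd.to_datetime(["2022-06-04 16:12:00", "2022-05-25 13:49:00"]),
    }
)

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default read: types are guessed from the text
buf.seek(0)
naive = pd.read_csv(buf)

# Explicit read: declare the string and datetime columns
buf.seek(0)
typed = pd.read_csv(buf, dtype={"mrn": str}, parse_dates=["admission_dt"])

print(naive["mrn"].iloc[0])  # 3012876 (leading zero lost)
print(typed["mrn"].iloc[0])  # 03012876
print(typed.dtypes)
```

Passing `index=False` on write also avoids an extra unnamed index column when the file is read back.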

In [15]:
sdv.save('mock_model.pkl')

In [209]:
df.to_hdf('mock_sitrep.h5', 'data')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block4_values] [items->Index(['name', 'mrn', 'csn', 'bed_code', 'bay_code', 'bay_type', 'ward_code',
       'sex', 'discharge_ready_1_4h', 'vent_type_1_4h'],
      dtype='object')]

  df.to_hdf('mock_sitrep.h5', 'data')


In [210]:
pd.read_hdf('mock_sitrep.h5', 'data')

Unnamed: 0,dob,admission_age_years,name,mrn,csn,episode_slice_id,admission_dt,elapsed_los_td,bed_code,bay_code,...,is_agitated_1_8h,n_inotropes_1_4h,had_nitric_1_8h,had_rrt_1_4h,had_trache_1_12h,vent_type_1_4h,avg_heart_rate_1_24h,max_temp_1_12h,avg_resp_rate_1_24h,wim_1
0,1960-04-06,62,Kayla Humphrey,72583775,1096467807,343812,2022-05-25 13:49:00,465270.0,SR04-04,SR04,...,False,0,False,False,False,Ventilated,73.298531,37.166279,22.798999,1
1,2018-07-28,3,Victoria Jones,72643979,1093922604,348749,2022-05-23 23:32:00,1231910.0,SR04-04,SR09,...,False,0,False,False,False,Oxygen,87.351178,35.493027,20.76373,0
2,1997-11-14,24,Mrs. Karen Miller,48679233,1083049076,348749,2022-05-09 07:59:00,2493210.0,BY02-21,BY04,...,False,0,False,False,True,Ventilated,79.301645,35.752834,21.158517,2
3,2008-12-31,13,April Patel,15704253,1020485525,347783,2022-06-02 12:52:00,254910.0,BY01-14,BY04,...,False,0,False,True,False,Ventilated,100.434768,35.79681,21.395963,2
4,1962-08-26,59,Abigail Curry,11607520,1013378594,348749,2022-05-29 02:22:00,1361650.0,SR08-08,SR07,...,False,0,False,False,False,Oxygen,89.540401,36.099054,22.14066,1
5,1974-09-15,47,Amy Munoz,72314414,1051785327,345712,2022-05-17 22:40:00,1353890.0,BY02-18,SR09,...,False,0,False,False,False,Oxygen,94.171025,37.323651,22.422932,1
6,1969-04-28,53,Becky Richards,12250178,1043412595,348749,2022-05-25 05:07:00,1267160.0,BY04-28,BY01,...,False,0,False,False,True,Oxygen,79.404847,35.531953,20.806849,1
7,1994-10-05,27,Luis Solis,13165367,1097278851,346386,2022-06-13 05:17:00,25500.0,BY01-14,BY02,...,False,0,False,False,False,Ventilated,97.00568,36.422595,19.894819,0
8,1935-11-16,86,Tammy Reese,71939650,1081490529,348749,2022-05-28 00:43:00,1134870.0,BY04-27,BY04,...,False,0,False,False,True,Oxygen,63.834053,35.770413,16.480063,0
9,2016-05-16,6,Danielle Armstrong,19119666,1052885723,346776,2022-06-12 20:19:00,25500.0,SR08-08,BY02,...,False,0,False,False,False,Ventilated,92.483037,37.108701,17.509124,0
