# Prepare synthetic data (Elective surgery)
Development is much easier when you have realistic synthetic data.
Here we take a SQL query that generates a single tabular output.
We run that query against the live identifiable data once.
We then use the [Synthetic Data Vault](https://sdv.dev/SDV/index.html) to prepare a synthetic model of those data.
The code below serves as a vignette for that process but will need adjusting to match the exact contents of the original query.

More complex examples that include multiple tables with joins and dependencies are also possible.

This notebook should be run interactively, and only once.

## Set-up, query and return the data as a dataframe
The query lives in `./src/api/surgery/` where `.` represents the project root.
If you run this Jupyter notebook using the local **Makefile** and `make run`, that query is copied here automatically.

So the first steps should be (from _this_ directory):
```sh
make mock1build
make mock2copyin
make mock3run
```
Then navigate to http://uclvlddpragae07:8091/lab/tree/steve/work/synth_test_data_surgery.ipynb

In [1]:
import os
import pandas as pd
import numpy as np
import requests

from pathlib import Path
from sqlalchemy import create_engine

Read the SQL query that was copied into this directory, then run it against the database to populate a dataframe.

In [2]:
q = Path('surgery.sql').read_text()

In [3]:
# Construct the MSSQL connection
db_host = os.getenv('CABOODLE_DB_HOST')
db_user = os.getenv('CABOODLE_DB_USER')
db_password = os.getenv('CABOODLE_DB_PASSWORD')
db_port = os.getenv('CABOODLE_DB_PORT')
db_name = os.getenv('CABOODLE_DB_NAME')

In [4]:
connection_string = f"mssql+pyodbc://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}?driver=ODBC+Driver+17+for+SQL+Server"
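If the password contains characters such as `@` or `/`, the f-string above produces an ambiguous URL. The following is a minimal sketch, assuming the same `CABOODLE_DB_*` environment variables, that fails early when a variable is missing and percent-encodes the credentials with the standard library. The helper name `build_connection_string` and the example values are hypothetical.

```python
import os
from urllib.parse import quote_plus

REQUIRED_VARS = (
    "CABOODLE_DB_HOST",
    "CABOODLE_DB_USER",
    "CABOODLE_DB_PASSWORD",
    "CABOODLE_DB_PORT",
    "CABOODLE_DB_NAME",
)


def build_connection_string(env=os.environ) -> str:
    """Build the MSSQL URL, raising if a variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode credentials so '@', '/' and ':' cannot break the URL
    user = quote_plus(env["CABOODLE_DB_USER"])
    password = quote_plus(env["CABOODLE_DB_PASSWORD"])
    return (
        f"mssql+pyodbc://{user}:{password}"
        f"@{env['CABOODLE_DB_HOST']}:{env['CABOODLE_DB_PORT']}"
        f"/{env['CABOODLE_DB_NAME']}"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )


# Made-up values for illustration only (not real credentials)
example_env = {
    "CABOODLE_DB_HOST": "db.example.org",
    "CABOODLE_DB_USER": "example_user",
    "CABOODLE_DB_PASSWORD": "p@ss/word",
    "CABOODLE_DB_PORT": "1433",
    "CABOODLE_DB_NAME": "example_db",
}
example_url = build_connection_string(example_env)
```

In the notebook, `connection_string = build_connection_string()` would then replace the f-string.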

In [5]:
db_engine = create_engine(connection_string)
df = pd.read_sql_query(q, db_engine)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183 entries, 0 to 182
Data columns (total 37 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   PrimaryMrn                    183 non-null    object        
 1   AgeInYears                    183 non-null    int64         
 2   PlacedOnWaitingListDate       179 non-null    object        
 3   DecidedToAdmitDate            179 non-null    object        
 4   AdmissionService              179 non-null    object        
 5   ElectiveAdmissionType         179 non-null    object        
 6   IntendedManagement            179 non-null    object        
 7   Priority                      179 non-null    object        
 8   RemovalReason                 179 non-null    object        
 9   Status                        179 non-null    object        
 10  Subgroup                      179 non-null    object        
 11  SurgicalService               17

In [7]:
# df[['PrimaryMrn', 'PatientKey','PatientDurableKey']]

## Generate a synthetic version of the real data

Use the column listing above to write the metadata you need for the synthetic data.

See https://sdv.dev/SDV/user_guides/relational/relational_metadata.html#relational-metadata

In [8]:
# Minimal imports
from sdv import Metadata, SDV



SDV does not handle timezone-aware datetimes well, so strip the timezone information first.

In [9]:
# The query returns datetimes with tz info, which sdv does not seem to be able to handle
def remove_timezone(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """Convert a column to UTC, then drop the tz info (sdv does not like timezones).

    Note: modifies the dataframe in place and returns it.
    """
    df[col_name] = pd.to_datetime(df[col_name], utc=True).dt.tz_localize(None)
    return df

In [10]:
tz_cols = ['PlannedOperationStartInstant', 'PlannedOperationEndInstant', '_LastUpdatedInstant']
for col in tz_cols:
    print(col)
    df = remove_timezone(df, col)

PlannedOperationStartInstant
PlannedOperationEndInstant
_LastUpdatedInstant
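To make the effect of the conversion concrete, here is a small sketch on a toy dataframe with made-up values. It uses the same `pd.to_datetime(..., utc=True).dt.tz_localize(None)` call as above; the helper name `strip_timezones` is hypothetical, and unlike `remove_timezone` it works on a copy.

```python
import pandas as pd


def strip_timezones(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy in which the given columns are naive UTC datetimes."""
    out = df.copy()
    for col in columns:
        # Convert to UTC first so the wall-clock values are comparable,
        # then drop the tz info
        out[col] = pd.to_datetime(out[col], utc=True).dt.tz_localize(None)
    return out


# Toy input: one timestamp with a +01:00 offset and one missing value
toy = pd.DataFrame(
    {"PlannedOperationStartInstant": ["2022-06-22 16:56:00+01:00", None]}
)
clean = strip_timezones(toy, ["PlannedOperationStartInstant"])
```

The offset is applied before the timezone is dropped, so 16:56 at +01:00 is stored as 15:56. Missing values become `NaT`.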


### Define PII that must be faked and not modelled

Define the fields that contain PII and need faking (see the sketchy documentation [here](https://sdv.dev/SDV/developer_guides/sdv/metadata.html?highlight=pii#categorical-fields-data-anonymization) and the [Faker Documentation](https://faker.readthedocs.io/en/master/providers.html) for a full list of providers). Here is a brief example that specifies Fakers for [name](https://faker.readthedocs.io/en/master/providers/faker.providers.person.html#faker.providers.person.Provider.name) and [date of birth](https://faker.readthedocs.io/en/master/providers/faker.providers.date_time.html#faker.providers.date_time.Provider.date_of_birth). Note that arguments to a Faker provider must be passed as a list.

In [11]:
fields = {
    'date_of_birth': {
        'type': 'datetime',
        'format': '%Y-%m-%d',
        'pii': True,
        'pii_category': "date_of_birth", 
    },
    'name': {
        'type': 'categorical',
        'pii': True,
        'pii_category': 'name'
    },
    
}

NB: sdv also doesn't always recognise column types correctly. Here we specify date_of_birth explicitly as a date whilst working on the larger task of defining columns that contain PII. See [field details](https://sdv.dev/SDV/developer_guides/sdv/metadata.html#field-details)
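The same explicit typing could be applied to the date columns of the surgery query, which arrive as `object` columns. A minimal sketch, assuming those columns hold plain `%Y-%m-%d` dates (the format used in the brief example); the column list and the names `date_fields`, `pii_fields` and `example_fields` are illustrative, and the columns may also need converting with `pd.to_datetime` before fitting.

```python
# Columns assumed to hold plain dates; adjust to match the real query output
date_columns = [
    'PlacedOnWaitingListDate',
    'DecidedToAdmitDate',
    'SurgeryDate',
    'CancelDate',
]

# One explicit datetime entry per date column
date_fields = {
    col: {'type': 'datetime', 'format': '%Y-%m-%d'}
    for col in date_columns
}

# Illustrative PII entry; merging keeps both kinds of specification together
pii_fields = {
    'PrimaryMrn': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['random_number', 8],
    },
}
example_fields = {**pii_fields, **date_fields}
```

The merged dictionary has the same shape as the `fields` argument passed to `metadata.add_table`.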

Now a full specification for the surgery data

In [12]:
fields = {
    'PrimaryMrn': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['random_number', 8]
    },
    'PatientKey': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['random_number', 7]
    },
    'PatientDurableKey': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['random_number', 7]
    },
    'AgeInYears': {
        'type': 'categorical',
        'pii': True,
        'pii_category': ['random_int', 18, 99]
    },

}
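The list form of `pii_category` reads as a provider name followed by its arguments. The sketch below shows one way such a list could be turned into a method call. It is an assumption about the mechanism based on how the list is written, not a copy of SDV's internals. `StubProvider` is a stand-in with two Faker-style method names so the example runs without Faker installed, and `call_provider` is a hypothetical helper.

```python
import random


class StubProvider:
    """Stand-in exposing two Faker-style method names (not the real Faker)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def random_int(self, min=0, max=9999):
        # Integer between min and max, both inclusive
        return self._rng.randint(min, max)

    def random_number(self, digits=8):
        # Integer with at most `digits` digits
        return self._rng.randrange(10 ** digits)


def call_provider(provider, pii_category):
    """Split 'name' or ['name', arg1, ...] into a method name plus arguments."""
    if isinstance(pii_category, (list, tuple)):
        name, *args = pii_category
    else:
        name, args = pii_category, []
    return getattr(provider, name)(*args)


provider = StubProvider()
fake_age = call_provider(provider, ['random_int', 18, 99])
fake_mrn = call_provider(provider, ['random_number', 8])
```

With Faker installed, a `Faker()` instance could be passed in place of `StubProvider()`, since it offers methods with the same names.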

Prepare the metadata

In [13]:
metadata = Metadata()
metadata.add_table(
    name='tabpid',
    data=df,
    fields_metadata=fields,
)

In [14]:
# Inspect the conversion that metadata.add_table did to the dataframe that you loaded
metadata.get_table_meta('tabpid')

{'fields': {'PrimaryMrn': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_number', 8]},
  'PatientKey': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_number', 7]},
  'PatientDurableKey': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_number', 7]},
  'AgeInYears': {'type': 'categorical',
   'pii': True,
   'pii_category': ['random_int', 18, 99]},
  'PlacedOnWaitingListDate': {'type': 'categorical'},
  'DecidedToAdmitDate': {'type': 'categorical'},
  'AdmissionService': {'type': 'categorical'},
  'ElectiveAdmissionType': {'type': 'categorical'},
  'IntendedManagement': {'type': 'categorical'},
  'Priority': {'type': 'categorical'},
  'RemovalReason': {'type': 'categorical'},
  'Status': {'type': 'categorical'},
  'Subgroup': {'type': 'categorical'},
  'SurgicalService': {'type': 'categorical'},
  'Type': {'type': 'categorical'},
  '_LastUpdatedInstant': {'type': 'datetime'},
  'SurgeryDate': {'type': 'categorical'},
  'Primar

Prepare the table(s)

In [15]:
tables = dict(tabpid=df)

Fit the model

In [16]:
sdv = SDV()
sdv.fit(metadata, tables)

Inspect the original data

In [17]:
# df.head()

Inspect the synthetic data

In [18]:
sdv.sample_all()['tabpid'].head()

Unnamed: 0,PrimaryMrn,PatientKey,PatientDurableKey,AgeInYears,PlacedOnWaitingListDate,DecidedToAdmitDate,AdmissionService,ElectiveAdmissionType,IntendedManagement,Priority,...,CaseCancelReason,CaseCancelReasonCode,CancelDate,PlannedOperationStartInstant,PlannedOperationEndInstant,PostOperativeDestination,Name,PatientFriendlyName,RoomName,DepartmentName
0,535227,2465059,3899882,76,2022-04-08,2022-05-30,Thoracic Surgery,Elective Wait,Inpatient,Routine,...,*Unspecified,*Unspecified,2022-05-30,2022-06-22 16:56:00,2022-06-22 22:05:00,*Not Applicable,OPEN MYOMECTOMY,EXCISION OF LESION OF LUNG,UCH P3 TH03,UCH P03 THEATRE SUITE
1,95452563,2284290,6100074,65,2022-05-27,2022-05-25,Urology - Oncology Robotic,Elective Wait,Inpatient,Routine,...,*Unspecified,*Unspecified,2022-06-16,2022-06-24 04:08:00,2022-06-24 09:16:00,*Not Applicable,TOTAL REPLACEMENT OF HIP,TOTAL REPLACEMENT OF HIP,WMS TH06,UCH P03 THEATRE SUITE
2,91808832,2643028,8505398,31,2022-06-01,2022-05-16,Urology - Andrology,Elective Wait,Inpatient,Cancer Pathway,...,*Unspecified,581,,2022-06-23 20:59:00,2022-06-23 23:17:00,*Not Applicable,ROBOT ASSISTED LAPAROSCOPIC RADICAL PROSTATECTOMY,ROBOT ASSISTED LAPAROSCOPIC RADICAL PROSTATECTOMY,GWB TH 01,WMS W01 THEATRE SUITE
3,377915,9996072,9301563,70,2022-06-08,2022-06-16,Urology - Stones,Elective Wait,Inpatient,Cancer Pathway,...,*Unspecified,*Unspecified,,2022-06-26 04:12:00,2022-06-26 08:59:00,*Not Applicable,LUNG EXCISION,BULBAR URETHROPLASTY WITH BUCCAL MUCOSAL GRAFT...,WMS TH02,UCH P03 THEATRE SUITE
4,10590956,2256421,2147676,25,2022-06-16,2022-03-28,Orthopaedics - Lower Limb,Elective Wait,Inpatient,Urgent,...,*Unspecified,*Unspecified,,2022-06-21 14:16:00,2022-06-21 19:32:00,*Not Applicable,LUNG EXCISION,LUNG EXCISION,UCH P3 TH08,GWB B-1 THEATRE SUITE
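Because the identifiers are meant to be faked rather than modelled, it may be worth checking that no real identifier survives in the sample. A minimal sketch with made-up values; the helper name `leaked_identifiers` is hypothetical, and values are compared as strings so that integer and text identifiers match.

```python
def leaked_identifiers(real_ids, synthetic_ids):
    """Return the identifiers present in both collections, compared as strings."""
    real = {str(value) for value in real_ids}
    return sorted(real.intersection(str(value) for value in synthetic_ids))


# Made-up identifiers for illustration only
real_mrns = ['40000001', '40000002', '40000003']
synthetic_mrns = [535227, 95452563, '40000002']

overlap = leaked_identifiers(real_mrns, synthetic_mrns)
```

In the notebook, the call could look like `leaked_identifiers(df['PrimaryMrn'], sdv.sample_all()['tabpid']['PrimaryMrn'])`. A non-empty result does not prove a leak, since random numbers can collide with real ones by chance, but it is a prompt to look closer.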


### Save the synthetic data

Options:
- save the model and not the synthetic data (but then you need *sdv* to run the model)
- save the data (need some care with type conversions if you use csv etc.)
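The caution about type conversions in the CSV option can be illustrated with a small round trip. This sketch uses made-up values and an in-memory buffer instead of a file. It assumes identifiers are stored as text, in which case a plain `pd.read_csv` would silently turn them into integers and leave datetimes as strings.

```python
import io

import pandas as pd

# Toy synthetic data: a text identifier with leading zeros and a datetime
synthetic = pd.DataFrame(
    {
        "PrimaryMrn": ["00535227", "95452563"],
        "AgeInYears": [76, 65],
        "PlannedOperationStartInstant": pd.to_datetime(
            ["2022-06-22 16:56:00", "2022-06-24 04:08:00"]
        ),
    }
)

buffer = io.StringIO()
synthetic.to_csv(buffer, index=False)

# Naive read: types are inferred from the text
buffer.seek(0)
naive = pd.read_csv(buffer)

# Careful read: state the identifier and datetime types explicitly
buffer.seek(0)
careful = pd.read_csv(
    buffer,
    dtype={"PrimaryMrn": str},
    parse_dates=["PlannedOperationStartInstant"],
)
```

Here `naive` loses the leading zeros of the identifier, whereas `careful` keeps the text identifier and restores the datetime column.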

In [19]:
sdv.save('mock_surgery.pkl')

In [20]:
sdv.sample_all()['tabpid'].to_hdf('mock_surgery.h5', 'data')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->Index(['PrimaryMrn', 'PatientKey', 'PatientDurableKey', 'AgeInYears',
       'PlacedOnWaitingListDate', 'DecidedToAdmitDate', 'AdmissionService',
       'ElectiveAdmissionType', 'IntendedManagement', 'Priority',
       'RemovalReason', 'Status', 'Subgroup', 'SurgicalService', 'Type',
       'SurgeryDate', 'PrimaryService', 'Classification',
       'SurgeryPatientClass', 'AdmissionPatientClass', 'PrimaryAnesthesiaType',
       'ReasonNotPerformed', 'CaseScheduleStatus', 'CaseCancelReason',
       'CaseCancelReasonCode', 'CancelDate', 'PostOperativeDestination',
       'Name', 'PatientFriendlyName', 'RoomName', 'DepartmentName'],
      dtype='object')]

  sdv.sample_all()['tabpid'].to_hdf('mock_surgery.h5', 'data')


In [22]:
pd.read_hdf('mock_surgery.h5', 'data')

Unnamed: 0,PrimaryMrn,PatientKey,PatientDurableKey,AgeInYears,PlacedOnWaitingListDate,DecidedToAdmitDate,AdmissionService,ElectiveAdmissionType,IntendedManagement,Priority,...,CaseCancelReason,CaseCancelReasonCode,CancelDate,PlannedOperationStartInstant,PlannedOperationEndInstant,PostOperativeDestination,Name,PatientFriendlyName,RoomName,DepartmentName
0,7109189,2021015,5735465,66,2021-11-29,2021-11-29,Thoracic Surgery,Elective Wait,Inpatient,Routine,...,*Unspecified,*Unspecified,2022-06-20,2022-06-24 00:53:00,2022-06-24 03:27:00,*Not Applicable,UNIPORTAL VATS - VIDEO ASSISTED THORACOSCOPIC ...,UNIPORTAL VATS - VIDEO ASSISTED THORACOSCOPIC ...,GWB TH 06,GWB B-1 THEATRE SUITE
1,34877906,791145,2278186,31,2022-06-14,2022-06-14,Urology - Stones,Elective Planned,Inpatient,Routine,...,*Unspecified,589,2022-06-14,2022-06-23 22:32:00,2022-06-24 01:12:00,*Not Applicable,REVISION OF APPENDICOVESICOSTOMY,REVISION OF APPENDICOVESICOSTOMY,WMS TH02,GWB B-1 THEATRE SUITE
2,13790731,7163744,5117664,44,2022-06-15,2022-06-15,Orthopaedics - Lower Limb,Elective Wait,Inpatient,Routine,...,Hospital Cancel - Admin Error,581,2022-06-20,2022-06-25 11:39:00,2022-06-25 15:57:00,*Not Applicable,SELECTIVE NECK DISSECTION OF CERVICAL LYMPH NODES,SELECTIVE NECK DISSECTION OF CERVICAL LYMPH NODES,UCH P3 TH10,WMS W01 THEATRE SUITE
3,50780899,3247768,3015475,70,2022-05-27,2022-05-25,ENT - Ears,Elective Wait,Inpatient,Routine,...,Hospital Cancel - Admin Error,581,,2022-06-22 10:10:00,2022-06-22 16:41:00,*Not Applicable,OESOPHAGECTOMY,URETEROSCOPY,WMS TH02,UCH P03 THEATRE SUITE
4,36047041,4649803,9801571,24,2021-10-28,2021-10-28,General Surgery - Lower GI,Elective Planned,Inpatient,Routine,...,*Unspecified,*Unspecified,,2022-06-25 09:29:00,2022-06-25 12:04:00,*Not Applicable,SELECTIVE NECK DISSECTION OF CERVICAL LYMPH NODES,SELECTIVE NECK DISSECTION OF CERVICAL LYMPH NODES,WMS TH05,UCH P03 THEATRE SUITE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,91944610,6843087,8508487,41,2022-06-10,2022-06-10,General Surgery - Lower GI,Elective Wait,Inpatient,Routine,...,*Unspecified,*Unspecified,,2022-06-22 18:20:00,2022-06-22 23:38:00,*Not Applicable,TOTAL KNEE REPLACEMENT,TOTAL REPLACEMENT OF HIP,WMS TH03,GWB B-1 THEATRE SUITE
179,779831,369257,1377705,71,2022-06-15,2022-06-15,Urology - Andrology,Elective Wait,Inpatient,Cancer Pathway,...,*Unspecified,*Unspecified,2022-06-14,2022-06-24 21:37:00,2022-06-25 02:05:00,*Not Applicable,AMPUTATION OF GLANS PENIS,AMPUTATION OF GLANS PENIS,UCH P3 TH11,UCH P03 THEATRE SUITE
180,14684298,1968182,1379133,72,2021-05-06,2022-06-09,Thoracic Surgery,Elective Wait,Inpatient,Routine,...,*Unspecified,*Unspecified,,2022-06-25 16:47:00,2022-06-25 18:26:00,*Not Applicable,COMBINED APPROACH TYMPANOPLASTY,ROBOTIC ASSISTED SURGERY,WMS TH07,WMS W01 THEATRE SUITE
181,16171338,9482112,4439753,94,2022-05-27,2022-05-27,Urology - Andrology,Elective Planned,Inpatient,Cancer Pathway,...,Hospital Cancel - Admin Error,581,2022-06-20,2022-06-25 13:38:00,2022-06-25 17:36:00,*Not Applicable,TYMPANOPLASTY WITH MASTOIDECTOMY,TYMPANOPLASTY WITH MASTOIDECTOMY,UCH P3 TH04,WMS W01 THEATRE SUITE
