## Protecting Life Data Preparation Script

- Version 1.2.1 - NHS Pycom Version Built - 22/05/2022
- Version 1.1.3 - Current Version Demo delivered to Divisional Management - 14/04/2022
- Version 1.1.2 - Abstract Submitted to BSG Conference 2022 - 25/02/2022
- Version 1.1.1 - Basic MVP Built - 23/02/2022

#### Authors:

1. Matt Stammers - Consultant Gastroenterologist and Data Scientist @ AXIS, UHS
2. Michael George - Data Engineering Lead @ AXIS, UHS

What this Script Does:
- Finds patients who have suffered an acute gastrointestinal bleed and identifies their index admission.
- Obtains key risk factors for death and key information documenting the admission.
- Stratifies patient mortality risk using validated scoring systems (CCI and HFRS) via comorbidipy (see its documentation).
- Deposits the results into a flat file on the network for use by clinical teams or for further analysis.

In [None]:
# Datetime Packages
import datetime

# Database Connectors
import cx_Oracle as cxo
import sqlalchemy as sqla
from sqlalchemy import create_engine, MetaData, Table, and_
from sqlalchemy.sql import select

# Pandas
import numpy as np
import pandas as pd

# Encryption
import keyring

# Risk Scoring
from comorbidipy import comorbidity
from comorbidipy import hfrs

# Print version of sqlalchemy
print(sqla.__version__)  

# Print if the cx_Oracle is recognized
print(cxo.version)   

# Setup Connection to Client

cxo.init_oracle_client(lib_dir= "{path to client}/instantclient_11_2/")

# Print client version
print(cxo.clientversion())

#### A note on Flat Files

If you would rather work from flat files than connect to a database, load them here. For instance, if you have one or more .csv files containing the relevant data, you can import them as follows:

```python
df_comorb = pd.read_csv('{path to file}/comorbidity_data.csv')
```
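
As a minimal sketch of the flat-file route, the snippet below reads a small in-memory CSV and converts a date column after loading. The column names (`patient_id`, `code`, `diagnosis_date`) and the day-first date format are illustrative assumptions, not the layout of any real extract.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the columns are hypothetical.
csv_text = """patient_id,code,diagnosis_date
1001,K922,03/02/2021
1001,I10,15/07/2019
1002,K920,21/11/2020
"""

# Read identifiers and codes as strings so leading zeros are preserved.
df_comorb = pd.read_csv(io.StringIO(csv_text), dtype={"patient_id": str, "code": str})

# This sketch assumes day-first dates, so the format is stated explicitly.
df_comorb["diagnosis_date"] = pd.to_datetime(df_comorb["diagnosis_date"], format="%d/%m/%Y")
print(df_comorb.dtypes)
```

Stating the date format explicitly avoids a day/month swap on ambiguous values such as 03/02/2021.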

We would, however, recommend setting up a database connection, as it is far more scalable, as you will see below. You will need your local IT team's help to set this up, but once you have access credentials they can be stored using keyring as below:

```python
keyring.set_password('User', '1', 'jimminycrickets')
```

Once a credential has been stored, it can be retrieved by calling keyring.get_password with the same first two arguments (service name and username) that were passed to keyring.set_password.
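
A minimal sketch of retrieving several stored values in one go is shown here. The helper name `load_credentials` and its `getter` argument are hypothetical; the `getter` exists only so the function can be exercised without a real keyring backend. `keyring.get_password` returns `None` when nothing is stored, which the helper turns into an explicit error.

```python
def load_credentials(fields, profile, getter=None):
    """Fetch one stored value per field name; raise if any is missing."""
    if getter is None:
        import keyring  # imported lazily so the helper works without a backend in tests
        getter = keyring.get_password
    creds = {field: getter(field, profile) for field in fields}
    missing = [field for field, value in creds.items() if value is None]
    if missing:
        raise KeyError(f"No stored credential for: {', '.join(missing)}")
    return creds

# Example with a fake in-memory store standing in for the operating system keyring.
fake_store = {("User", "10"): "analyst", ("Password", "10"): "s3cret"}
creds = load_credentials(
    ["User", "Password"], "10", getter=lambda service, user: fake_store.get((service, user))
)
print(creds["User"])
```

Failing early on a missing value tends to give a clearer message than a later connection error caused by `None` ending up in the database URL.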

#### A note on connecting to a database

In this example we are connecting to an Oracle SQL database using the SQLAlchemy library.

One way to connect to your database via SQLAlchemy is to use an Engine. This is done by passing a 'Database URL' to the create_engine() function. Database URLs follow a set format and generally include the username, password, hostname and database name, plus optional keyword arguments for additional configuration (in our example we have included a service_name option to connect to the Oracle database).

A typical database url looks like this:

```python
dialect+driver://username:password@host:port/database
```

For more details, refer to the SQLAlchemy documentation on engines and on how to construct a database URL for your particular hospital's database: https://docs.sqlalchemy.org/en/14/core/engines.html
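
One practical detail when assembling the URL as a string: characters such as `@` or `/` in a username or password need to be URL-escaped, otherwise the URL is parsed incorrectly. The sketch below uses the standard library's `quote_plus` for this. The function name `build_oracle_url` and the example host, user and service values are hypothetical.

```python
from urllib.parse import quote_plus

def build_oracle_url(username, password, host, port, service_name):
    """Assemble an oracle+cx_oracle URL, escaping the username and password."""
    return (
        f"oracle+cx_oracle://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{int(port)}/?service_name={service_name}"
    )

# Illustrative values only; none of these refer to a real system.
url = build_oracle_url("analyst", "p@ss/word", "db.example.org", "1521", "ORCL")
print(url)
```

Converting the port with `int()` also catches a non-numeric value retrieved from the keyring before it reaches the engine.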

In [None]:
# Load in Connection Credentials

ora_user = keyring.get_password("User", "10")
ora_password = keyring.get_password("Password", "10")
ora_host = keyring.get_password("Host", "10")
ora_service = keyring.get_password("Service", "10")
ora_port = keyring.get_password("Port", "10")

# Set Key Connection Variables

DIALECT = 'oracle'
SQL_DRIVER = 'cx_oracle'
USERNAME = ora_user
PASSWORD = ora_password
HOST = ora_host
PORT = ora_port
SERVICE = ora_service

# Create Engine Authorisation String Without Exposing Credentials
ENGINE_PATH_WIN_AUTH = f'{DIALECT}+{SQL_DRIVER}://{USERNAME}:{PASSWORD}@{HOST}:{str(PORT)}/?service_name={SERVICE}'
# ENGINE_PATH_WIN_AUTH = DIALECT + '+' + SQL_DRIVER + '://' + USERNAME + ':' + PASSWORD +'@' + HOST + ':' + str(PORT) + '/?service_name=' + SERVICE

# Create and Connect to Engine

engine = create_engine(ENGINE_PATH_WIN_AUTH)
engine.connect()

#### A note when learning

When you are learning to do the above, we recommend breaking this block up into smaller components until you have mastered each of them. It is worth the effort. Once the connection works, you can move on to running your SQL queries.
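
One small component worth testing on its own is the connection itself. The sketch below opens a connection, runs a trivial query and closes the connection again. The helper name `check_connection` and its `probe_sql` parameter are hypothetical, and the demonstration uses an in-memory SQLite engine rather than Oracle so that it can run anywhere.

```python
from sqlalchemy import create_engine, text

def check_connection(engine, probe_sql="SELECT 1 FROM dual"):
    """Open a connection, run a trivial query, and close the connection again."""
    with engine.connect() as conn:
        return conn.execute(text(probe_sql)).scalar()

# Demonstration against an in-memory SQLite database, which has no DUAL table,
# so the probe statement is overridden here.
demo_engine = create_engine("sqlite://")
print(check_connection(demo_engine, probe_sql="SELECT 1"))
```

Using the connection as a context manager returns it to the pool once the check finishes, rather than leaving it open for the rest of the session.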

#### SQL Queries and Extract

These SQL queries are templates that illustrate the approach and the reasoning behind it. If you replace each placeholder in curly braces with your own SQL and pass the query to the engine, it will run. For simplicity we have written a separate query for each component. Some of the joins could be made database-side, but we deliberately assume little SQL knowledge in this script to keep the process easy to follow.

In [None]:
# Finds all the patients who have had a gastrointestinal bleed within the hospital

GI_Bleed_Query = """
SELECT {insert columns of interest}
FROM {endoscopy table}
{LEFT/INNER/OUTER} JOIN {insert tables of interest}
WHERE {filters of interest - in this case patients with melaena, haematemesis or 'coffee ground vomiting'}
"""

#### Now you can create your index cohort

From this you can filter the subsequent queries. This is often better done in SQL itself, but if you want to do it in Python this will work.

In [None]:
# This gets you the GI Bleed cohort

df_bleeding = pd.read_sql_query(GI_Bleed_Query, engine)

# Patients who bled

patients_who_bled = tuple(df_bleeding['patient_id'].to_list())
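
Interpolating a Python tuple into SQL text has two practical limits: a one-element tuple renders as `(1,)`, which is not valid SQL, and Oracle rejects `IN` lists of more than 1000 expressions (ORA-01795). A minimal sketch of an alternative, using named bind parameters and fixed-size chunks, is shown below. The helpers `chunked` and `in_clause` and the toy `patients` table are hypothetical, and SQLite stands in for the hospital database.

```python
import sqlite3
import pandas as pd

def chunked(values, size=1000):
    """Yield successive lists of at most `size` items."""
    values = list(values)
    for start in range(0, len(values), size):
        yield values[start:start + size]

def in_clause(values, prefix="p"):
    """Return ('(:p0, :p1, ...)', {'p0': v0, ...}) for named bind parameters."""
    params = {f"{prefix}{i}": value for i, value in enumerate(values)}
    placeholders = ", ".join(f":{name}" for name in params)
    return f"({placeholders})", params

# Toy table standing in for a demographics source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_identifier INTEGER, sex TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, "F"), (2, "M"), (3, "F")])

patient_ids = [1, 3]
frames = []
for chunk in chunked(patient_ids, size=1000):
    clause, params = in_clause(chunk)
    sql = f"SELECT patient_identifier, sex FROM patients WHERE patient_identifier IN {clause}"
    frames.append(pd.read_sql_query(sql, conn, params=params))

df_demo = pd.concat(frames, ignore_index=True)
print(df_demo)
```

Only the placeholder names are formatted into the SQL text; the identifier values travel separately as parameters. With a SQLAlchemy engine the statement would typically be wrapped in `sqlalchemy.text` before being passed to pandas.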

#### Now run the other queries in sequence 

This enables you to gather your cohort with the first query and then cross-filter the results with subsequent queries. See below:

In [None]:
# Other Queries

# Finds key patient demographics

Demographics_Query = """
SELECT {insert columns of interest - probably patient table or like}
FROM {main presumably patient table}
{LEFT/INNER/OUTER} JOIN {insert tables of interest}
WHERE {filters of interest - in this case filtered on the patient number from the Bleeding Query}
AND PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
""".format(patients_who_bled)

# Comorbidity Query

Comorbidity_Query = """
SELECT {insert columns of interest - likely ICD10/SNOMED codes}
FROM {main ICD10/SNOMED table}
{LEFT/INNER/OUTER} JOIN {other important ICD10/SNOMED tables}
WHERE {filters of interest - in this case filtered on the patient number from the Bleeding Query}
AND PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
ORDER BY {perhaps by date or code}
""".format(patients_who_bled)

# Admissions Query

Admissions_Query = """
SELECT {insert columns of interest - likely admissions data}
FROM {main discharge summary table}
{LEFT/INNER/OUTER} JOIN {other important tables if needed}
WHERE {filters of interest - in this case filtered on the patient number from the Bleeding Query}
AND PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
""".format(patients_who_bled)

# Physiology Query

Physiology_Query = """
SELECT {insert columns of interest - likely weight and height rather than BMI}
FROM {main physiology data table}
{LEFT/INNER/OUTER} JOIN {probably not required}
WHERE {filters of interest - in this case filtered on the patient number from the Bleeding Query and perhaps time}
AND PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
""".format(patients_who_bled)

#### You can now connect your engine to these queries to create the dataframes

By passing each query and the engine to pandas you can pull all the relevant data through.

In [None]:
# This makes it very easy to pull discrete up to date datasets from your EPR/PAS into the kernel

df_demographics = pd.read_sql_query(Demographics_Query, engine)
df_comorbidity = pd.read_sql_query(Comorbidity_Query, engine)
df_admissions = pd.read_sql_query(Admissions_Query, engine)
df_physiology = pd.read_sql_query(Physiology_Query, engine)

#### Building the Index Database

In this case we need to build the index dataframe first, as we are only interested in index (first) admissions. To get these we combine the bleed and admission dataframes and filter them to keep only the first relevant admission.

In [None]:
# Datetime function - converts any column whose name contains 'date' (but not 'age') into datetimes

def Datetime(series):
    name = series.name.lower()
    # 'datetime' also contains 'date', so a single check covers both cases
    if 'date' in name and 'age' not in name:
        series = pd.to_datetime(series, dayfirst=True)
    return series

# First convert all the relevant columns in the key dataframes

df_admissions = df_admissions.apply(Datetime)
df_bleeding = df_bleeding.apply(Datetime)

# Then join the tables so we can select only the relevant admissions

df_bleeds_and_admissions = pd.merge(df_bleeding, df_admissions, on='patient_id', how='left')

# Now filter to select only the bleed endoscopies which occurred during an admission

df_admitted_bleeds = df_bleeds_and_admissions[df_bleeds_and_admissions['date_of_endoscopy'].between(df_bleeds_and_admissions['admission_date'], df_bleeds_and_admissions['discharge_date'])]

# Finally select only the first (index) admission for each patient, as this is what we are interested in. Subsequent admissions are handled separately.
# Assumes one index admission per patient_id, taken from the filtered frame above.

df_first_adm = df_admitted_bleeds.loc[df_admitted_bleeds.groupby('patient_id')['admission_date'].idxmin()]
df_first_adm.shape
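
The merge, window filter and first-admission selection can be tried on toy data before running them against a real extract. The sketch below uses made-up patients and dates, with column names matching those in the cell above. It is an illustration of the technique, not the authors' production logic.

```python
import pandas as pd

bleeds = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "date_of_endoscopy": pd.to_datetime(["2021-03-02", "2021-09-10", "2021-05-05"]),
})
admissions = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "admission_date": pd.to_datetime(["2021-03-01", "2021-09-09", "2021-06-01"]),
    "discharge_date": pd.to_datetime(["2021-03-05", "2021-09-12", "2021-06-03"]),
})

# Every bleed is paired with every admission for the same patient.
paired = bleeds.merge(admissions, on="patient_id", how="left")

# Keep only pairs where the endoscopy falls inside the admission window.
in_window = paired["date_of_endoscopy"].between(
    paired["admission_date"], paired["discharge_date"]
)
admitted = paired[in_window]

# One row per patient: the admission with the earliest admission_date.
index_admissions = admitted.loc[admitted.groupby("patient_id")["admission_date"].idxmin()]
print(index_admissions)
```

In this toy data patient 2 drops out entirely, because their endoscopy does not fall inside any admission. It is worth checking how many patients are lost at this step on real data.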

#### Now we want to assess only the relevant BMI of these patients

We do not want all the BMIs, only those that pertain to the time of the gastrointestinal bleed. We can collect these using a similar strategy to the one above, filtering the values so that they occur only after the index admission.

In [None]:
# First extract the weights and heights from the data to create two separate dataframes

df_heights = df_physiology[df_physiology['test_code'] == 'HEIG']
df_weights = df_physiology[df_physiology['test_code'] == 'WEIG']

# Then aggregate them - we settled for mean in the end after checking for skewness
# numeric_only=True skips text columns such as test_code

df_heights2 = df_heights.groupby('patient_id').mean(numeric_only=True).reset_index()
df_weights2 = df_weights.groupby('patient_id').mean(numeric_only=True).reset_index()

# Rename the columns (assumes exactly one numeric result column remains)

df_weights2.columns = ['PATIENT_ID', 'MEAN_WEIGHT']
df_heights2.columns = ['PATIENT_ID', 'MEAN_HEIGHT']

# Join together side by side so each patient has height and weight on one row
df_bmi = pd.merge(df_heights2, df_weights2, on='PATIENT_ID', how='inner')

# Then convert height to meters (assumes heights are recorded in centimeters) and calculate BMI

df_bmi['MEAN_HEIGHT_M'] = df_bmi['MEAN_HEIGHT'] / 100
df_bmi['BMI'] = (df_bmi['MEAN_WEIGHT'] / df_bmi['MEAN_HEIGHT_M'] ** 2).round(2)
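
The time filter described in the text can be sketched as follows: attach each patient's index admission date to their measurements, drop anything recorded before it, then aggregate. The `result` and `test_date` column names and all values are hypothetical, and heights are assumed to be in centimetres.

```python
import pandas as pd

index_admissions = pd.DataFrame({
    "patient_id": [1, 2],
    "admission_date": pd.to_datetime(["2021-03-01", "2021-05-04"]),
})
physiology = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "test_code": ["HEIG", "WEIG", "WEIG", "HEIG", "WEIG", "WEIG"],
    "result": [170.0, 90.0, 72.0, 160.0, 64.0, 64.0],
    "test_date": pd.to_datetime(
        ["2021-03-02", "2015-01-10", "2021-03-02", "2021-05-05", "2021-05-05", "2021-05-06"]
    ),
})

# Attach each patient's index admission date, then drop earlier measurements.
dated = physiology.merge(index_admissions, on="patient_id", how="inner")
recent = dated[dated["test_date"] >= dated["admission_date"]]

# Mean per patient and test, pivoted so HEIG and WEIG become columns.
wide = recent.pivot_table(
    index="patient_id", columns="test_code", values="result", aggfunc="mean"
)
wide["BMI"] = (wide["WEIG"] / (wide["HEIG"] / 100) ** 2).round(2)
print(wide)
```

Here the 2015 weight for patient 1 is excluded, so it does not pull the mean away from the value recorded around the bleed.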

In [None]:
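# First merge the index admission cohort to the demographics and then add the BMI data
# df_bmi uses the upper-case PATIENT_ID key, so the join columns are named explicitly

df_merge1 = pd.merge(df_first_adm, df_demographics, on='patient_id', how='left')
df_merge2 = pd.merge(df_merge1, df_bmi, left_on='patient_id', right_on='PATIENT_ID', how='left')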

In [None]:
# First create a subframe with the key columns

comorbidities = df_merge2[['code', 'patient_id', 'age']].rename(columns={'patient_id': 'id'})

# Then calculate the scores. Comorbidipy needs them in this format to work properly

cci = comorbidity(comorbidities)
frail = hfrs(comorbidities)

# Then tidy up the outputs a bit

cci['survival_10yr'] = (cci['survival_10yr'] * 100).round(1).astype(str) + '%'

# Then join these tables into one and add back to the main dataframe to create merge3

comorb_merge = pd.merge(cci, frail, on='id', how='left')
df_merge3_added = pd.merge(df_merge2, comorb_merge, left_on='patient_id', right_on='id', how='left')
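
The scoring cell selects `code`, `patient_id` and `age` from a single frame. A minimal sketch of how such a long-format frame (one row per patient and diagnosis code) could be assembled is shown below, on the assumption that the codes come from the comorbidity extract and the age from the demographics extract. The toy frames `df_codes` and `df_ages` and their values are hypothetical.

```python
import pandas as pd

df_codes = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "code": ["K922", "I500", "K920"],
})
df_ages = pd.DataFrame({"patient_id": [1, 2], "age": [78, 54]})

# One row per (patient, code), with the patient's age repeated on each row.
comorbidities = (
    df_codes.merge(df_ages, on="patient_id", how="left")
    .rename(columns={"patient_id": "id"})
    [["id", "code", "age"]]
)
print(comorbidities)
```

Because there are several codes per patient, merging the scores back onto a one-row-per-patient cohort keeps the final table at one row per patient, provided the scoring output is itself one row per `id`.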
