## Learning From Death Script

- Version 1.4 - Clinical Narrative Added For NHS Pycom - 22/05/2022
- Version 1.3.1 - Bugs fixed and versioned such that source code can be safely made open source - 23/04/2022
- Version 1.2.3 - Changes made to improve cohort capture mechanism - 23/03/2022
- Version 1.2.2 - Process Automated - 02/03/2022
- Version 1.2.1 - Improved upon initial version with better filtering, output formatting and direct database pulls - 13/02/2022
- Version 1.1.2 - Soft Launch. Used alongside conventional M&M process to compare - 14/01/2022
- Version 1.1.1 - Most Data Pulls in Place - 10/12/2021

#### Authors:

1. Matt Stammers - Consultant Gastroenterologist and Data Scientist, UHS
2. Michael George - Data Engineering Lead, UHS

What this Script Does:
- Finds patients who have died during a particular admission.
- Obtains key risk factors for death and key information documenting the admission.
- Risk stratifies patient mortality by validated risk scoring systems (CCI and HFRS) using comorbidipy (see documentation).
- Deposits the results into a flat file on the network for use by the clinical teams involved in the direct care of the patients, for the purposes of morbidity and mortality estimation and to improve the quality and visibility of discussion.

#### Key Packages

Below are the key packages needed to run this script

In [None]:
# Datetime Packages
import datetime
from datetime import timedelta

# Database Connectors
import cx_Oracle as cxo
import sqlalchemy as sqla
from sqlalchemy import create_engine, MetaData, Table, and_
from sqlalchemy.sql import select

# Pandas
import numpy as np
import pandas as pd

# Encryption
import keyring

# Risk Scoring
from comorbidipy import comorbidity
from comorbidipy import hfrs

#### A note on Flat Files

If you want to work from flat files instead of connecting to a database, load them here. For instance, if you have a .csv file or a set of .csv files containing the relevant data, you can import them as follows:

```python
df_comorb = pd.read_csv('{path to file}/comorbidity_data.csv')
```

We would, however, recommend setting up a database connection, as it is far more scalable, as you will see below. You will need your local IT team's help to set this up, but once you have access credentials they can be stored using keyring as below:

```python
keyring.set_password('User', '1', 'jimminycrickets')
```

Once you have set this up, and as long as you remember the names you stored everything under, each value can easily be retrieved by calling get_password instead of set_password, as in the connection code further down.
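
A minimal sketch of retrieving several stored values at once and failing early if any is missing. It relies on the fact that `keyring.get_password` returns `None` when nothing is stored under the given names. The helper name `load_credentials` and its `getter` parameter are hypothetical, introduced here only for illustration.

```python
def load_credentials(services, account, getter=None):
    """Fetch one stored secret per service name, failing early if any is missing.

    `getter` defaults to keyring.get_password; any callable taking
    (service, account) can be passed instead, which makes the helper easy to try out.
    """
    if getter is None:
        import keyring  # imported lazily so the helper can be exercised without a keyring backend
        getter = keyring.get_password

    credentials = {}
    missing = []
    for service in services:
        value = getter(service, account)
        if value is None:  # nothing stored under this service/account pair
            missing.append(service)
        else:
            credentials[service] = value

    if missing:
        raise KeyError(f"No stored value for: {', '.join(missing)}")
    return credentials
```

With the naming used in this script, a call might look like `load_credentials(["User", "Password", "Host", "Service", "Port"], "10")`.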

#### A note on connecting to a database

In this example we are connecting to an Oracle SQL database using the SQLAlchemy library.

One method of connecting to your database via SQLAlchemy is to use an Engine. This is done by passing a 'Database URL' to the create_engine() function. These database URLs follow a particular format and generally include the username, password, hostname and database name, plus some optional keyword arguments for additional configuration (in our example we have included a service_name option to connect to the Oracle database).

A typical database url looks like this:

```python
dialect+driver://username:password@host:port/database
```

For more details you can refer to the SQLAlchemy documentation on engines and how to construct a database URL for your particular hospital's database: https://docs.sqlalchemy.org/en/14/core/engines.html
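
One practical detail when assembling the URL by hand: characters such as `@` or `/` in a username or password have to be percent-encoded, otherwise the URL is parsed incorrectly. Below is a minimal sketch using only the standard library. The function name `build_oracle_url` and all the values passed to it are placeholders invented for illustration, not real connection details.

```python
from urllib.parse import quote_plus


def build_oracle_url(username, password, host, port, service_name):
    """Assemble an oracle+cx_oracle database URL, percent-encoding the credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"oracle+cx_oracle://{user}:{pwd}@{host}:{port}/?service_name={service_name}"


# Placeholder values for illustration only
url = build_oracle_url("analyst", "p@ss/word", "dbhost.example", 1521, "ORCL")
print(url)
```

The resulting string can be passed to `create_engine()` in the same way as the f-string used in this script.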

In [None]:
# Print version of sqlalchemy
print(sqla.__version__)  

# Print if the cx_Oracle is recognized
print(cxo.version)   

# Setup Connection to Client

cxo.init_oracle_client(lib_dir= "{path to client}/instantclient_11_2/")

# Print client version
print(cxo.clientversion())

# Load in Connection Credentials
ora_user = keyring.get_password("User", "10")
ora_password = keyring.get_password("Password", "10")
ora_host = keyring.get_password("Host", "10")
ora_service = keyring.get_password("Service", "10")
ora_port = keyring.get_password("Port", "10")

# Set Key Connection Variables

DIALECT = 'oracle'
SQL_DRIVER = 'cx_oracle'
USERNAME = ora_user
PASSWORD = ora_password
HOST = ora_host
PORT = ora_port
SERVICE = ora_service

# Create Engine Authorisation String Without Exposing Credentials
ENGINE_PATH_WIN_AUTH = f'{DIALECT}+{SQL_DRIVER}://{USERNAME}:{PASSWORD}@{HOST}:{str(PORT)}/?service_name={SERVICE}'
# ENGINE_PATH_WIN_AUTH = DIALECT + '+' + SQL_DRIVER + '://' + USERNAME + ':' + PASSWORD +'@' + HOST + ':' + str(PORT) + '/?service_name=' + SERVICE

# Create and Connect to Engine

engine = create_engine(ENGINE_PATH_WIN_AUTH)
engine.connect()

#### A note when learning

When you are learning to do the above, we recommend breaking this block up into smaller subcomponents until you have mastered each of them. It will be worth the effort. Once the connection works, you can move on to your SQL queries.

#### SQL Queries and Extract

These SQL queries are pseudocode, intended to convey the approach and why it is structured this way. If you replace every descriptive placeholder in curly braces with real SQL for your own database and pass the result through the engine, it will work. Only the empty {} should remain, as that is where the patient identifiers are inserted. For the sake of simplicity I have written individual queries for the different components. Some of the joins could be made database-side, but we are deliberately assuming little SQL knowledge in this script to make the process easier to understand.

In [None]:
# Finds all the patients who have died under x team within x timeframe:

Mortality_Query = """
SELECT {insert columns of interest}
FROM {main table}
{LEFT/INNER/OUTER} JOIN {insert tables of interest}
WHERE {filters of interest - in this case patients admitted under my team who died within x months of now}
"""


#### Now you can create your index cohort

From this you can filter the subsequent queries. This is often better done in SQL itself, but if you do want to do it using Python this will work.

In [None]:
# This gets you the mortality cohort

df_died = pd.read_sql_query(Mortality_Query, engine)

# You can then clean up the patient identifiers if needed and turn them into a tuple as below.
# Dropping duplicates first means each patient appears only once in the index list:

patients_who_died = tuple(df_died['patient_id'].drop_duplicates().to_list())

# Now we have our index list

#### Now run the other queries in sequence 

This enables you to gather your cohort with the first query and then cross-filter the results with subsequent queries. See below:
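
One edge case is worth knowing about when a Python tuple is formatted straight into a query: a cohort of exactly one patient renders as `(101,)`, which is not valid SQL. The sketch below shows one way around this, assuming the identifiers are integers. The helper name `sql_in_list` is hypothetical.

```python
def sql_in_list(patient_ids):
    """Render integer identifiers as a SQL list such as "(101, 102)"."""
    # dict.fromkeys removes duplicates while preserving the original order
    unique_ids = list(dict.fromkeys(int(i) for i in patient_ids))
    if not unique_ids:
        raise ValueError("The cohort is empty, so there is nothing to filter on")
    return "(" + ", ".join(str(i) for i in unique_ids) + ")"


print(sql_in_list([101, 102, 101]))  # (101, 102)
print(sql_in_list([101]))            # (101)
```

Oracle also limits a literal IN list to 1,000 expressions, so a very large cohort would need batching or a database-side join instead.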

In [None]:
# Finds key patient demographics

Demographics_Query = """
SELECT {insert columns of interest - probably patient table or like}
FROM {main presumably patient table}
{LEFT/INNER/OUTER} JOIN {insert tables of interest}
WHERE {filters of interest - in this case filtered on the patient number from the Mortality Query}
PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
""".format(patients_who_died)

# Comorbidity Query

Comorbidity_Query = """
SELECT {insert columns of interest - likely ICD10/SNOMED codes}
FROM {main ICD10/SNOMED table}
{LEFT/INNER/OUTER} JOIN {other important ICD10/SNOMED tables}
WHERE {filters of interest - in this case filtered on the patient number from the Mortality Query}
PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
ORDER BY {perhaps by date or code}
""".format(patients_who_died)

# Narrative Query

Narrative_Query = """
SELECT {insert columns of interest - likely discharge summary free text}
FROM {main discharge summary table}
{LEFT/INNER/OUTER} JOIN {other important tables if needed}
WHERE {filters of interest - in this case filtered on the patient number from the Mortality Query}
PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
""".format(patients_who_died)

# Physiology Query

Physiology_Query = """
SELECT {insert columns of interest - likely weight and height rather than BMI}
FROM {main physiology data table}
{LEFT/INNER/OUTER} JOIN {probably not required}
WHERE {filters of interest - in this case filtered on the patient number from the Mortality Query and perhaps time}
PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
""".format(patients_who_died)

#### You can now connect your engine to these queries to create the dataframes

By passing each query and the engine to pandas, you can now pull all the relevant data through.

In [None]:
# This makes it very easy to pull discrete up to date datasets from your EPR/PAS into the kernel

df_demographics = pd.read_sql_query(Demographics_Query, engine)
df_comorbidity = pd.read_sql_query(Comorbidity_Query, engine)
df_narrative = pd.read_sql_query(Narrative_Query, engine)
df_physiology = pd.read_sql_query(Physiology_Query, engine)

#### Next I calculated BMI

Your source system might already provide BMI, in which case this step can be skipped.
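
As an alternative illustration of the reshaping involved, here is a small sketch on toy data that uses `pivot_table` to go from long-format observations to one row per patient before calculating BMI. The `result` column name and all the values are invented for the example. The `HEIG` and `WEIG` codes follow the ones used in this script, and heights are assumed to be in centimetres.

```python
import pandas as pd

# Toy long-format observations; 'result' is a hypothetical value column
obs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "test_code": ["HEIG", "WEIG", "WEIG", "HEIG", "WEIG"],
    "result": [170.0, 69.0, 71.0, 160.0, 64.0],
})

# One row per patient, one column per test code, averaging repeat measurements
wide = obs.pivot_table(index="patient_id", columns="test_code",
                       values="result", aggfunc="mean").reset_index()

# BMI = weight (kg) / height (m) squared, with height converted from centimetres
wide["BMI"] = (wide["WEIG"] / (wide["HEIG"] / 100) ** 2).round(2)
print(wide)
```

Because heights and weights end up side by side on the same row, the division lines up per patient.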

In [None]:
# First extract the weights and heights from the data to create two separate dataframes

df_heights = df_physiology[df_physiology['test_code'] == 'HEIG']
df_weights = df_physiology[df_physiology['test_code'] == 'WEIG']

# Then aggregate them - we settled for mean in the end after checking for skewness
# numeric_only=True skips text columns such as test_code

df_heights2 = df_heights.groupby('patient_id').mean(numeric_only=True).reset_index()
df_weights2 = df_weights.groupby('patient_id').mean(numeric_only=True).reset_index()

# Rename the columns, keeping patient_id in lower case so that later merges on it still work
# (this assumes a single numeric result column besides patient_id)

df_weights2.columns = ['patient_id', 'MEAN_WEIGHT']
df_heights2.columns = ['patient_id', 'MEAN_HEIGHT']

# Join together side by side on patient_id (pd.concat would stack the rows instead)
df_bmi = pd.merge(df_heights2, df_weights2, on='patient_id', how='inner')

# Then convert height to metres if currently in centimetres and calculate BMI

df_bmi['MEAN_HEIGHT_M'] = df_bmi['MEAN_HEIGHT'] / 100
df_bmi['BMI'] = (df_bmi['MEAN_WEIGHT'] / df_bmi['MEAN_HEIGHT_M'] ** 2).round(2)


#### Now join physiology to the core cohort and demographics table to make a first merged dataframe

Joining datasets that are in a one-to-one relationship keeps things tidy. In this script we assume that the demographics table contains one row per patient id. If this is not the case, you will need to clean that first.
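
If you would rather have pandas check the one-row-per-patient assumption than rely on it silently, `pd.merge` accepts a `validate` argument. The sketch below uses toy data invented for illustration and shows the merge refusing to run while duplicates are present, then succeeding once they are removed.

```python
import pandas as pd

cohort = pd.DataFrame({"patient_id": [1, 2]})
# Patient 2 appears twice, breaking the one-row-per-patient assumption
demographics = pd.DataFrame({"patient_id": [1, 2, 2], "age": [81, 67, 67]})

try:
    pd.merge(cohort, demographics, on="patient_id", how="left", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False  # duplicates detected before they can inflate the cohort

# De-duplicating first restores the one-to-one relationship
clean = demographics.drop_duplicates(subset="patient_id")
merged = pd.merge(cohort, clean, on="patient_id", how="left", validate="one_to_one")
print(merge_ok, len(merged))
```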

In [None]:
# First merge the core cohort to the demographics and then add the BMI data

df_merge1 = pd.merge(df_died, df_demographics, on='patient_id', how = 'left')
df_merge2 = pd.merge(df_merge1, df_bmi, on='patient_id', how = 'left')

#### Now calculate the comorbidity scores

We can now apply our risk algorithms, as we have age, comorbidities and unique identifiers available, along with BMI if needed (BMI acts as an independent marker of risk but doesn't feature in CCI or HFRS per se).

In [None]:
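# First create a subframe with the key columns
# Assumes the diagnosis codes sit in a 'code' column of df_comorbidity and the age in df_merge2

comorbidities = pd.merge(df_comorbidity[['patient_id', 'code']], df_merge2[['patient_id', 'age']], on='patient_id', how='left')
comorbidities = comorbidities.rename(columns={'patient_id': 'id'})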

# Then calculate the scores. Comorbidipy needs them in this format to work properly

cci = comorbidity(comorbidities)
frail = hfrs(comorbidities)

# Then tidy up the outputs a bit

cci['survival_10yr'] = (cci['survival_10yr'] * 100).round(1).astype(str) + '%'

# Then join these tables into one and add back to the main dataframe to create merge3

comorb_merge = pd.merge(cci, frail, on='id', how='left')
df_merge3 = pd.merge(df_merge2, comorb_merge, left_on='patient_id', right_on='id', how='left')

#### Now add the narrative data

The narrative data has been left till last because it is the most difficult to handle. It is many to one and thus needs to be filtered on the dates to make sure the correct narrative aligns with the correct admission. You can also add the cause of death but remember this information is highly sensitive so has to be stored very carefully.

#### DateTime

To make converting the datetimes easier, you can use a simple function like this one, which converts columns to datetime values based only on the name of the column. Be careful, though, that the dates are all formatted the same way or you might get odd results.
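
To illustrate the formatting caveat, here is a small sketch on invented values. Passing an explicit `format` removes any day/month ambiguity, and `errors="coerce"` turns anything that does not match into `NaT` so it can be counted and inspected rather than silently misread. The column name and the values are assumptions made for the example.

```python
import pandas as pd

raw = pd.Series(["03/04/2022", "13/04/2022", "not recorded"], name="admission_date")

# An explicit day/month/year format; values that do not match become NaT
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(parsed)
print(int(parsed.isna().sum()), "value(s) could not be parsed")
```

Here `03/04/2022` is read as 3 April 2022 rather than 4 March 2022.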

In [None]:
# Datetime Function - to turn all column strings containing 'date' or 'datetime' into datetimes

def Datetime(series):
    name = series.name.lower()
    # 'date' also matches column names containing 'datetime'
    if 'date' in name and 'age' not in name:
        series = pd.to_datetime(series, dayfirst=True)
    return series

df_narrative = df_narrative.apply(Datetime)
df_died = df_died.apply(Datetime)

#### Then Filter the narratives to make sure they match the correct admission narrative

Simply by joining the mortality to the narratives and then using the .between() method we can filter the narratives to make sure they match the correct admission.
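
A small worked example of this filter on toy data may help. The patient below has two admissions, and only the one whose dates span the date of death is kept. The column names mirror those used in this script, while the values are invented.

```python
import pandas as pd

narratives = pd.DataFrame({
    "patient_id": [1, 1],
    "date_of_death": pd.to_datetime(["2022-03-10", "2022-03-10"]),
    "admission_date": pd.to_datetime(["2021-11-01", "2022-03-02"]),
    "discharge_date": pd.to_datetime(["2021-11-05", "2022-03-10"]),
    "narrative": ["Earlier admission", "Final admission"],
})

# .between() is inclusive at both ends by default and compares row by row
in_final_admission = narratives["date_of_death"].between(
    narratives["admission_date"], narratives["discharge_date"]
)
final_narratives = narratives[in_final_admission]
print(final_narratives["narrative"].tolist())  # ['Final admission']
```

Rows with a missing admission or discharge date compare as False and are dropped, so it is worth checking for gaps in those columns first.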

In [None]:
# Left join them all together

df_narratives = pd.merge(df_died, df_narrative, on='patient_id', how='left')

In [None]:
# Now filter to select only the correct narratives out

df_filtered_narratives = df_narratives[df_narratives['date_of_death'].between(df_narratives['admission_date'], df_narratives['discharge_date'])]

#### Joining all together

This creates a final dataframe. By this point the data is probably a bit unwieldy and needs truncating. How you do this is up to you but a simple method for truncating / rearranging columns is given.

In [None]:

# Final Merge

df_merge4 = pd.merge(df_merge3, df_filtered_narratives, on='patient_id', how='left')

# Final Output

final = df_merge4[['patient_id', 'age', 'survival_10yr', 'hfrs', 'admission narrative', 'cause of death']]

#### Using the Output

Typically in hospitals this process will be managed by the team's consultant and perhaps a middle-grade doctor. We set up a system whereby the data could be output to a particular secure location and accessed only by those directly involved in preparing for the M&M meeting.
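
A minimal sketch of building a dated output path with the standard library, assuming a one-year reporting window. The function name `build_output_path`, the `secure_share` directory and the team label are hypothetical placeholders, not the location we actually use.

```python
from datetime import date, timedelta
from pathlib import Path


def build_output_path(output_dir, team, today=None, days_back=365):
    """Return a dated CSV path covering the `days_back` days up to `today`."""
    today = today or date.today()
    start = today - timedelta(days=days_back)
    filename = f"{team} Deaths Between {start.isoformat()} and {today.isoformat()}.csv"
    return Path(output_dir) / filename


# Fixed date used here so the example is reproducible
path = build_output_path("secure_share", "Team x", today=date(2022, 5, 22))
print(path.name)
```

The returned path can be passed directly to `to_csv`.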

In [None]:
# Now calculate timedeltas for the output files

year_ago_today = pd.to_datetime('today') - pd.Timedelta(days=365.25)
year_ago_today_string = year_ago_today.date()
date_today_string = pd.to_datetime('today').date()

# Output the file

final.to_csv('{path to output location}' + 'Team x Deaths Between {} and {}.csv'.format(year_ago_today_string, date_today_string))

#### What do you think?

We share this as an example of how Python can simplify a fairly arduous and complex task into something far more manageable and useful. Internally, this process cut the time for mortality meeting preparation by about 50% because it removed many of the data-gathering steps that were previously required. As you can see, it is not a particularly complicated task when using simple Python scripting.

Feel free to contact me at matt@reallyusefulmodels.com if you have any further queries or questions about this, need help or simply want to point out potential improvements / share how you have built upon this.