## Learning From Death Script

- Version 1.4 - Clinical Narrative Added For NHS Pycom - 22/05/2022
- Version 1.3.1 - Bugs fixed and versioned such that source code can be safely made open source - 23/04/2022
- Version 1.2.3 - Changes made to improve cohort capture mechanism - 23/03/2022
- Version 1.2.2 - Process Automated - 02/03/2022
- Version 1.2.1 - Improved upon initial version with better filtering, output formatting and direct database pulls - 13/02/2022
- Version 1.1.2 - Soft Launch. Used alongside conventional M&M process to compare - 14/01/2022
- Version 1.1.1 - Most Data Pulls in Place - 10/12/2021

#### Authors:

1. Matt Stammers - Consultant Gastroenterologist and Data Scientist, UHS
2. Michael George - Data Engineering Lead, UHS

What this Script Does:
- Finds patients who have died during a particular admission.
- Obtains key risk factors for death and key information documenting the admission.
- Risk stratifies patient mortality by validated risk scoring systems (CCI and HFRS) using comorbidipy (see documentation).
- Deposits the results into a flat file on the network so that clinical teams can use them for morbidity and mortality estimation. This helps improve the quality of discussion and gives visibility to those involved in the direct clinical care of the patients.

#### Key Packages

Below are the key packages needed to run this script:

In [700]:
# Pandas
import numpy as np
import pandas as pd

# Risk Scoring
from comorbidipy import comorbidity
from comorbidipy import hfrs

# Database
import sqlite3

## Data Extraction

#### A note on Flat Files

If you would rather work from flat files than connect to a database, load them here. For instance, if you have a .csv file or a set of .csv files containing the relevant data, you can import them as follows:

```python
df_comorb = pd.read_csv('{path to file}/comorbidity_data.csv')
```

We would, however, recommend setting up a database connection instead, as it is far more scalable, as you will see below. You will need your local IT team's help to set this up, but once you have access credentials they can be stored using keyring as below:

```python
import keyring
keyring.set_password('User', '1', 'jimminycrickets')
```

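The three arguments are the service name, the username and the password. Once this is set up, and as long as you remember the service name and username you stored the password under, it can be retrieved by calling `get_password` instead of `set_password` with those two values, e.g. `keyring.get_password('User', '1')`.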

#### A note on connecting to a database

In this example we are connecting to a SQLite database which, as the name suggests, is a very lightweight database engine. It is almost entirely self-contained, with all the data housed within a single file. This makes it very easy to connect to, but the process is very different from connecting to larger, more fully featured database engines such as Postgres and Oracle. We have therefore also provided an example of how to connect to an Oracle database in a separate notebook file, OracleDB_Example.ipynb.

In [701]:
DATABASE_PATH = "qipy.db"
con = sqlite3.connect(DATABASE_PATH)
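
For readers without access to `qipy.db`, the following is a minimal sketch of how a stand-in database could be built in memory for experimentation. The table and column names follow the queries used later in this notebook; the column types are assumptions and the rows are invented purely for illustration.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only for this connection
demo_con = sqlite3.connect(":memory:")

# Column names mirror the queries in this notebook; the types are assumed
demo_con.executescript("""
    CREATE TABLE Demographics (
        patient_id INTEGER PRIMARY KEY,
        deceased INTEGER,
        date_of_death TEXT
    );
    CREATE TABLE Admissions (
        patient_id INTEGER,
        speciality TEXT,
        admission_date TEXT,
        discharge_date TEXT
    );
""")

# Invented rows, purely for illustration
demo_con.executemany(
    "INSERT INTO Demographics VALUES (?, ?, ?)",
    [(1, 1, "2022-05-02"), (2, 0, None)],
)
demo_con.executemany(
    "INSERT INTO Admissions VALUES (?, ?, ?, ?)",
    [
        (1, "Gastroenterology", "2022-05-01", "2022-05-07"),
        (2, "Cardiology", "2022-03-01", "2022-03-04"),
    ],
)

# Join the two tables on patient_id and keep only deceased patients
demo_rows = demo_con.execute("""
    SELECT demo.patient_id, adm.speciality
    FROM Demographics AS demo
    JOIN Admissions AS adm ON demo.patient_id = adm.patient_id
    WHERE demo.deceased = 1
""").fetchall()
print(demo_rows)
```

Because everything lives in memory, the stand-in disappears when the connection is closed, which makes it convenient for trying queries out without touching real patient data.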

#### SQL Queries and Extract

These SQL queries are simplified examples intended to convey the idea behind the approach and why it is done this way. If you insert your own SQL in this format between the triple quotes and pass it through the engine, it will work. For the sake of simplicity I have written individual queries for the different components. Some of the joins could be made database-side, but we are deliberately assuming little SQL knowledge in this script to make the process easier to understand.

In [702]:
# Finds all the patients who have died under x team within x timeframe:

# In this particular query we are combining data from both the Demographics and Admission tables
Mortality_Query = """
  SELECT demo.patient_id, adm.speciality, adm.admission_date, adm.discharge_date
  FROM Demographics AS demo, Admissions AS adm
  WHERE demo.patient_id = adm.patient_id
  AND demo.deceased = 1
  AND adm.speciality = 'Gastroenterology'
  AND demo.date_of_death BETWEEN DATE('now', '-2 months') AND DATE('now');
"""

#### Now you can create your index cohort

From this you can filter the subsequent queries. This is often better done in SQL itself, but if you want to do it in Python this will work.

In [703]:
# This gets you a dataframe containing the mortality cohort
df_died = pd.read_sql_query(Mortality_Query, con)

In [704]:
df_died

Unnamed: 0,patient_id,speciality,admission_date,discharge_date
0,3,Gastroenterology,2012-08-10 00:00:00,2012-08-30 00:00:00
1,3,Gastroenterology,2022-05-01 00:00:00,2022-05-07 00:00:00
2,4,Gastroenterology,2022-04-12 00:00:00,2022-04-20 00:00:00
3,5,Gastroenterology,2016-08-09 00:00:00,2016-08-18 00:00:00
4,5,Gastroenterology,2018-02-10 00:00:00,2018-02-20 00:00:00
5,5,Gastroenterology,2022-05-09 00:00:00,2022-05-16 00:00:00


In [705]:
# This will create a de-duplicated list of patient identifiers from the dataframe
patients_who_died = tuple(set(df_died['patient_id']))

In [706]:
# Now we have our index list
patients_who_died

(3, 4, 5)

#### Now run the other queries in sequence 

This enables you to gather your cohort with the first query and then cross-filter the results with subsequent queries. See below:

In [707]:
# Query = """
  # SELECT {insert columns of interest - probably patient table or like}
  # FROM {table}
  # {LEFT/INNER/OUTER} JOIN {insert tables of interest}
  # WHERE {filters of interest - in this case filtered on the patient number from the Mortality Query}
  # AND PATIENT_IDENTIFIER IN {} /* inserts the patient identifiers */
# """.format(patients_who_died)

Demographics_Query = """
  SELECT patient_id, first_name, last_name, age, date_of_death, cause_of_death
  FROM Demographics
  WHERE patient_id IN {}
""".format(patients_who_died)

Comorbidity_Query = """
  SELECT patient_id, code
  FROM Comorbidity
  WHERE patient_id IN {}
""".format(patients_who_died)

Admission_Query = """
  SELECT patient_id, admission_date, discharge_date, speciality, discharge_summary
  FROM Admissions
  WHERE patient_id IN {}
  AND speciality = 'Gastroenterology'
""".format(patients_who_died)

Physiology_Query = """
  SELECT patient_id, height, weight
  FROM Physiology
  WHERE patient_id IN {}
""".format(patients_who_died)

#### You can now connect your engine to these queries to create the dataframes

Passing each query, together with the database connection, to pandas pulls all of the relevant data through.

In [708]:
# This makes it very easy to pull discrete up to date datasets from your EPR/PAS

df_demographics = pd.read_sql_query(Demographics_Query, con)
df_comorbidity = pd.read_sql_query(Comorbidity_Query, con)
df_admission = pd.read_sql_query(Admission_Query, con)
df_physiology = pd.read_sql_query(Physiology_Query, con)

In [709]:
df_admission

Unnamed: 0,patient_id,admission_date,discharge_date,speciality,discharge_summary
0,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology
1,3.0,2022-05-01 00:00:00,2022-05-07 00:00:00,Gastroenterology,Admitted to Gastroenterology
2,4.0,2022-04-12 00:00:00,2022-04-20 00:00:00,Gastroenterology,Admitted to Gastroenterology
3,5.0,2016-08-09 00:00:00,2016-08-18 00:00:00,Gastroenterology,Admitted to Gastroenterology
4,5.0,2018-02-10 00:00:00,2018-02-20 00:00:00,Gastroenterology,Admitted to Gastroenterology
5,5.0,2022-05-09 00:00:00,2022-05-16 00:00:00,Gastroenterology,Admitted to Gastroenterology
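
The cohort queries above are built with `.format()`, which works for the three-patient cohort shown but produces `(3,)` for a single-patient tuple, and that is not valid SQL. One possible alternative, sketched here against a throwaway in-memory table with invented rows, is to generate one `?` placeholder per identifier and pass the values separately. The helper name `in_clause_placeholders` is hypothetical.

```python
import sqlite3

def in_clause_placeholders(values):
    """Return a string such as '(?, ?, ?)' with one placeholder per value."""
    return "(" + ", ".join("?" for _ in values) + ")"

# Throwaway table with invented rows, purely for illustration
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE Physiology (patient_id INTEGER, height REAL, weight REAL)")
con_demo.executemany(
    "INSERT INTO Physiology VALUES (?, ?, ?)",
    [(3, 165, 65.9), (4, 192, 67.9), (9, 180, 90.0)],
)

# sorted() de-duplicates via the set and gives a stable order
cohort = sorted({3, 4, 3})

query = (
    "SELECT patient_id, height, weight FROM Physiology "
    "WHERE patient_id IN " + in_clause_placeholders(cohort) + " ORDER BY patient_id"
)
rows = con_demo.execute(query, cohort).fetchall()
print(rows)
```

With pandas, the same query string and values can be passed as `pd.read_sql_query(query, con, params=cohort)`.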


#### Next I calculated BMI

BMI may already be calculated for you in your source system. If not, it can be derived from height and weight as below.

In [710]:
# First extract the weights and heights from the data to create two separate pandas dataframes
df_weights = df_physiology[['patient_id','weight']]
df_heights = df_physiology[['patient_id','height']]

# Then aggregate them - we settled for mean in the end after checking for skewness
df_mean_weight = df_weights.groupby(['patient_id']).agg('mean').reset_index()
df_mean_height = df_heights.groupby(['patient_id']).agg('mean').reset_index()

# Join the two tables together
df_bmi = pd.merge(df_mean_height, df_mean_weight, on='patient_id')

# Then convert height to meters if currently in centimeters and calc BMI
df_bmi['height_m'] = df_bmi['height']/100
df_bmi['BMI'] = round(df_bmi['weight']/(df_bmi['height_m'])**2,2)

In [711]:
df_bmi

Unnamed: 0,patient_id,height,weight,height_m,BMI
0,3,165,65.94,1.65,24.22
1,4,192,67.91718,1.92,18.42
2,5,170,80.7438,1.7,27.94
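
The calculation above divides every height by 100, which assumes all heights are recorded in centimetres. A small sketch of a more defensive version follows. The function name `bmi_from_measurements` and the 3.0 cut-off used to guess the unit are assumptions for illustration, not part of the original script.

```python
def bmi_from_measurements(weight_kg, height):
    """Return BMI (kg/m^2), guessing whether height is in metres or centimetres."""
    # Assumption: nobody is taller than 3 m, so larger values must be centimetres
    height_m = height / 100 if height > 3.0 else height
    return round(weight_kg / height_m ** 2, 2)

# Values taken from the df_bmi output above, once in cm and once in m
print(bmi_from_measurements(65.94, 165))
print(bmi_from_measurements(65.94, 1.65))
```

Both calls give the same BMI as the first row of the table above, whichever unit the height arrives in.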


#### Now join physiology to the core cohort and demographics table

Joining datasets that are in a one-to-one relationship keeps things tidy. In this script we assume that the demographics table contains one row per patient. If this is not the case, you will need to clean it first.

In [712]:
# First merge the core cohort to the demographics and then add the BMI data

df_adm_demo = pd.merge(df_admission, df_demographics, on='patient_id', how = 'left')
df_adm_demo_bmi = pd.merge(df_adm_demo, df_bmi, on='patient_id', how = 'left')
df_all_comorb = pd.merge(df_adm_demo_bmi, df_comorbidity, on='patient_id', how='left')

In [713]:
df_all_comorb

Unnamed: 0,patient_id,admission_date,discharge_date,speciality,discharge_summary,first_name,last_name,age,date_of_death,cause_of_death,height,weight,height_m,BMI,code
0,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,165,65.9400,1.65,24.22,C20X
1,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,165,65.9400,1.65,24.22,C775
2,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,165,65.9400,1.65,24.22,Z510
3,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,165,65.9400,1.65,24.22,Z801
4,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,165,65.9400,1.65,24.22,Z511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,5.0,2022-05-09 00:00:00,2022-05-16 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,170,80.7438,1.70,27.94,R073
362,5.0,2022-05-09 00:00:00,2022-05-16 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,170,80.7438,1.70,27.94,M546
363,5.0,2022-05-09 00:00:00,2022-05-16 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,170,80.7438,1.70,27.94,R296
364,5.0,2022-05-09 00:00:00,2022-05-16 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,170,80.7438,1.70,27.94,R263
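
The 365 rows above come from only six admissions: joining the comorbidity table, which holds one row per code, repeats each admission once for every code the patient has. A minimal sketch of this effect on invented toy data follows, together with the `validate` argument of `pd.merge`, which raises an error when a join does not have the expected shape.

```python
import pandas as pd

# Invented toy data: two admissions for one patient, three comorbidity codes
admissions = pd.DataFrame({"patient_id": [3, 3], "admission_date": ["2012-08-10", "2022-05-01"]})
demographics = pd.DataFrame({"patient_id": [3], "age": [89]})
codes = pd.DataFrame({"patient_id": [3, 3, 3], "code": ["C20X", "C775", "Z510"]})

# many_to_one asserts that patient_id is unique in the right-hand table
adm_demo = pd.merge(admissions, demographics, on="patient_id", how="left", validate="many_to_one")

# A many-to-many join multiplies rows: 2 admissions x 3 codes = 6 rows
adm_codes = pd.merge(adm_demo, codes, on="patient_id", how="left")
print(len(adm_demo), len(adm_codes))

# The same check fails loudly when the right-hand table has duplicate keys
try:
    pd.merge(admissions, codes, on="patient_id", how="left", validate="many_to_one")
except pd.errors.MergeError as err:
    print("MergeError:", err)
```

Adding `validate="many_to_one"` to the demographics merge is one way to enforce the one-row-per-patient assumption stated above rather than relying on it silently.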


#### Now calculate the comorbidity scores

We can now apply our risk algorithms, as we have age, comorbidities and unique identifiers in one table, along with BMI if needed (BMI acts as an independent marker of risk but does not feature in CCI or HFRS per se).

In [714]:
# First create a subframe with the key columns
comorbidities = df_all_comorb[['code', 'patient_id', 'age']]

# The comorbidipy library expects the columns 'id', 'code' and 'age'
# Assigning the renamed copy avoids pandas' SettingWithCopyWarning on a slice
comorbidities = comorbidities.rename(columns={'patient_id': 'id'})
cci = comorbidity(comorbidities, weighting="charlson")
frail = hfrs(comorbidities)

# Then tidy up the outputs a bit
cci['survival_10yr'] = round(cci['survival_10yr']*100,1).astype(str) + '%'

# Then join these tables generated from comorbidipy into one
comorb_merge = pd.merge(cci, frail, on='id', how='left')

# Rename id back to patient_id to better manage merges
comorb_merge.rename(columns={'id':'patient_id'}, inplace=True)

# At this point the age column exists in both comorb_merge (from cci) and df_adm_demo, so
# we drop it from one of the dataframes before merging again
comorb_merge.drop(['age'], axis=1, inplace=True)

# Combine the patient's admission, demographics and comorbidity data into a single dataframe
df_all = pd.merge(df_adm_demo, comorb_merge, on='patient_id', how='left')

In [715]:
df_all

Unnamed: 0,patient_id,admission_date,discharge_date,speciality,discharge_summary,first_name,last_name,age,date_of_death,cause_of_death,...,mld,msld,pud,pvd,rend,rheumd,charlson_wt_charlson_icd10_quan,age_adj_charlson_wt_charlson_icd10_quan,survival_10yr,hfrs
0,3.0,2012-08-10 00:00:00,2012-08-30 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,...,0,0,0,0,0,0,6.0,10.0,0.0%,10.2
1,3.0,2022-05-01 00:00:00,2022-05-07 00:00:00,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02 00:00:00,GI Bleed,...,0,0,0,0,0,0,6.0,10.0,0.0%,10.2
2,4.0,2022-04-12 00:00:00,2022-04-20 00:00:00,Gastroenterology,Admitted to Gastroenterology,Tamaz,Asa,78,2022-04-16 00:00:00,GI Bleed,...,0,0,0,0,0,0,2.0,5.0,21.4%,26.5
3,5.0,2016-08-09 00:00:00,2016-08-18 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,...,0,0,0,0,0,0,2.0,6.0,2.2%,21.4
4,5.0,2018-02-10 00:00:00,2018-02-20 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,...,0,0,0,0,0,0,2.0,6.0,2.2%,21.4
5,5.0,2022-05-09 00:00:00,2022-05-16 00:00:00,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12 00:00:00,GI Bleed,...,0,0,0,0,0,0,2.0,6.0,2.2%,21.4


#### DateTime

To make the process of converting the datetimes easier you can use a simple function like this, which converts particular columns to datetime values based only on the name of the column. Be careful, though, that the dates are all formatted the same way or you might get odd results.

In [716]:
# to_datetime Function - to turn all column strings containing 'date' or 'datetime' into datetimes

def to_datetime(series):
    if ('date' in series.name.lower() or 'datetime' in series.name.lower()) and 'age' not in series.name.lower():
        series = pd.to_datetime(series, dayfirst = True)
    return series

df_all = df_all.apply(to_datetime)
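
Where the date format is known in advance, passing it explicitly avoids relying on `dayfirst` to guess. The following is a sketch, assuming the ISO-style `YYYY-MM-DD HH:MM:SS` strings shown in the outputs above. The function name `to_datetime_strict` is hypothetical and the toy dataframe is for illustration only.

```python
import pandas as pd

def to_datetime_strict(series, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert columns whose name contains 'date' using one explicit format."""
    if "date" in str(series.name).lower():
        # With the default errors="raise", any value not matching fmt stops the run
        return pd.to_datetime(series, format=fmt)
    return series

toy = pd.DataFrame({
    "patient_id": [3, 4],
    "date_of_death": ["2022-05-02 00:00:00", "2022-04-16 00:00:00"],
})
toy = toy.apply(to_datetime_strict)
print(toy.dtypes)
```

A badly formatted value then raises an error instead of being silently parsed with day and month swapped.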

#### Filtering Admissions to obtain the correct narrative

Because the patient to admissions relationship is one-to-many (i.e. one patient can have many admissions) we need to filter this data to ensure we obtain the correct admission and therefore correct narrative (discharge summary). You can also add the cause of death but remember this information is highly sensitive so has to be stored very carefully.

By using the `.between()` method we can keep only the admission during which the patient died.

In [717]:
df_filtered = df_all[df_all['date_of_death'].between(df_all['admission_date'], df_all['discharge_date'])]

In [718]:
df_filtered

Unnamed: 0,patient_id,admission_date,discharge_date,speciality,discharge_summary,first_name,last_name,age,date_of_death,cause_of_death,...,mld,msld,pud,pvd,rend,rheumd,charlson_wt_charlson_icd10_quan,age_adj_charlson_wt_charlson_icd10_quan,survival_10yr,hfrs
1,3.0,2022-05-01,2022-05-07,Gastroenterology,Admitted to Gastroenterology,Leopoldo,Vishal,89,2022-05-02,GI Bleed,...,0,0,0,0,0,0,6.0,10.0,0.0%,10.2
2,4.0,2022-04-12,2022-04-20,Gastroenterology,Admitted to Gastroenterology,Tamaz,Asa,78,2022-04-16,GI Bleed,...,0,0,0,0,0,0,2.0,5.0,21.4%,26.5
5,5.0,2022-05-09,2022-05-16,Gastroenterology,Admitted to Gastroenterology,Lucifer,Grigore,98,2022-05-12,GI Bleed,...,0,0,0,0,0,0,2.0,6.0,2.2%,21.4
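
`Series.between` compares element by element when given two Series and includes both endpoints by default, so a death on the day of admission or discharge is kept. A toy illustration, using the two admissions of patient 3 from the output above:

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": [3, 3],
    "admission_date": pd.to_datetime(["2012-08-10", "2022-05-01"]),
    "discharge_date": pd.to_datetime(["2012-08-30", "2022-05-07"]),
    "date_of_death": pd.to_datetime(["2022-05-02", "2022-05-02"]),
})

# True only for the admission whose dates span the date of death
died_in_admission = toy["date_of_death"].between(toy["admission_date"], toy["discharge_date"])
print(died_in_admission.tolist())
```

Only the 2022 admission is flagged, which is why the earlier admissions drop out of `df_filtered`.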


#### Joining all together

This creates a final dataframe. By this point the data is probably a bit unwieldy and needs truncating. How you do this is up to you but a simple method for truncating / rearranging columns is given.

In [719]:
# Final Output containing the columns of interest only
final = df_filtered[['patient_id', 'age', 'survival_10yr', 'hfrs', 'discharge_summary', 'cause_of_death']]

In [720]:
final

Unnamed: 0,patient_id,age,survival_10yr,hfrs,discharge_summary,cause_of_death
1,3.0,89,0.0%,10.2,Admitted to Gastroenterology,GI Bleed
2,4.0,78,21.4%,26.5,Admitted to Gastroenterology,GI Bleed
5,5.0,98,2.2%,21.4,Admitted to Gastroenterology,GI Bleed


#### Using the Output

Typically in hospitals this process will be managed by a team consultant and perhaps a middle-grade doctor. We set up a system whereby the data could be output to a particular secure location and accessed only by those directly involved in preparing for the M&M meeting.

In [721]:
# Now build the date range used in the output file name

from datetime import timedelta, datetime

TEAM = 'Gastroenterology'

date_today = datetime.now().strftime("%d-%m-%Y")
# Keep this window consistent with the date filter used in Mortality_Query
year_ago_today = (datetime.now() - timedelta(days=365.25)).strftime("%d-%m-%Y")

# Output the file
final.to_csv(f'M+M {TEAM} - Deaths Between {year_ago_today} and {date_today}.csv')
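
A sketch of one way to keep the file name and the destination folder in one place. The helper `build_output_path` and the folder name `secure_mm_share` are hypothetical, and the 60-day default window is an assumption chosen to roughly match the two-month filter in the mortality query.

```python
from datetime import date, timedelta
from pathlib import Path

def build_output_path(team, folder, end=None, days_back=60):
    """Return <folder>/M+M <team> - Deaths Between <start> and <end>.csv as a Path."""
    end = end or date.today()
    start = end - timedelta(days=days_back)
    name = f"M+M {team} - Deaths Between {start:%d-%m-%Y} and {end:%d-%m-%Y}.csv"
    return Path(folder) / name

# Hypothetical folder name; replace with the access-controlled location used locally
out_path = build_output_path("Gastroenterology", "secure_mm_share", end=date(2022, 5, 22))
print(out_path.name)
```

The dataframe could then be written with `final.to_csv(out_path, index=False)`, where `index=False` leaves the pandas row index out of the file.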

#### What do you think?

We share this as an example of how Python can dramatically simplify a fairly arduous and complex task into something far more manageable and useful. Internally this process cut the time for mortality meeting preparation by about 50% because it removed a lot of the data gathering steps that were previously required. As you can see, it is not a particularly complicated task using simple Python scripting.

Feel free to contact me at matt@reallyusefulmodels.com if you have any further queries or questions about this, need help or simply want to point out potential improvements / share how you have built upon this.