# 20240805 - Liver transplant analysis

This notebook contains the code needed to reproduce the data findings in the South Dakota News Watch article, ["35, dry and dying: The impact of booze on 1 man,"](TK) published Aug. 5, 2024.

This article includes findings from analyzing portions of the [UNOS STAR file](https://optn.transplant.hrsa.gov/data/view-data-reports/request-data/). See [README.md](README.md) for more details on the data.

- [Setup and summary stats](#Setup-and-summary-stats)
    - [Read in liver file code lookup](#Read-in-liver-file-code-lookup)
    - [Read in STAR file data dictionary](#Read-in-STAR-file-data-dictionary)
    - [Read in main liver data file](#Read-in-main-liver-data-file)
    - [Flag transplants](#Flag-transplants)
    - [Flag alcohol diagnoses](#Flag-alcohol-diagnoses)
- [Findings](#Findings)

## Setup and summary stats

In [74]:
import math
from datetime import date

from utils import get_df

import pandas as pd

In [75]:
pd.set_option('display.max_columns', None)

### Read in liver file code lookup

We can use this to look up pieces of encoded data.

In [76]:
df_code_dict = get_df(
    'docs/CODE DICTIONARY - FORMATS 202406/Liver/LIVER_FORMATS_FLATFILE.DAT',
    encoding='cp1252'
)

In [77]:
df_code_dict.head()

Unnamed: 0,LABEL,FMTNAME,TYPE,CODE
0,O,ABO,C,1
1,A,ABO,C,2
2,B,ABO,C,3
3,AB,ABO,C,4
4,A1,ABO,C,5


### Read in STAR file data dictionary

Mainly to check the variable start/end dates.

In [78]:
df_star_data_dict = pd.read_excel(
    'docs/STAR User Guide/STAR File Data Dictionary.xlsx',
    sheet_name='LIVER_DATA',
    skiprows=1
)

In [79]:
df_star_data_dict.head()

Unnamed: 0,VARIABLE NAME,DESCRIPTION,FORM,VAR START DATE,VAR END DATE,FORM SECTION,DATA TYPE,SAS ANALYSIS FORMAT,COMMENT
0,ABO,RECIPIENT BLOOD GROUP @ REGISTRATION,TCR,1987-10-01,NaT,CLINICAL INFORMATION,CHAR(3),ABO,
1,ABO_DON,DONOR BLOOD TYPE,DDR/LDR,1987-10-01,NaT,DONOR INFORMATION,CHAR(3),ABO,
2,ABO_MAT,DONOR-RECIPIENT ABO MATCH LEVEL,CALCULATED,NaT,NaT,,CHAR(1),ABOMAT,
3,ACADEMIC_LEVEL_TCR,ACADEMIC ACTIVITY LEVEL AT LISTING,TCR,2004-06-30,NaT,CANDIDATE INFORMATION,NUM,ACADLVLKI,
4,ACADEMIC_LEVEL_TRR,ACADEMIC ACTIVITY LEVEL AT TRANSPLANT,TRR,2004-06-30,NaT,PATIENT STATUS,NUM,ACADLVLKI,


### Read in main liver data file

In [80]:
# load the LIVER_DATA.DAT file
df = get_df(
    'data/liver/LIVER_DATA.DAT',
    parse_dates=[
        'PX_STAT_DATE',
        'INIT_DATE'
    ]
)

In [81]:
# how many records are missing a list year?
len(df[df['LISTYR'].isnull()])

1167

In [82]:
# We're only going to use records listed in or after the first full year of listing data: 1988
# (this also drops the records with no list year, since NaN > 1987 evaluates to False)
df = df[df['LISTYR'] > 1987]

In [83]:
# how many records total?
total_records = len(df)
print(f'{total_records:,} events')

367,012 events


In [84]:
# any nulls in patient ID column?
print(len(df[df['PT_CODE'].isnull()]), 'nulls')

0 nulls


In [85]:
# what are max and min dates (formatted)?
date_first = df['INIT_DATE'].min().strftime("%B %Y")
date_last = df['INIT_DATE'].max().strftime("%B %Y")

# there are a few records past June 30, 2024, but we're going to
# stick with the documentation and cap the data at the end of 6/24
date_last = 'June 2024'

print(f'Earliest event: {date_first}')
print(f'Most recent event: {date_last}')

Earliest event: January 1988
Most recent event: June 2024


In [86]:
# how many unique candidates?
# get the count of unique patient codes
total_candidates_unique_count = df['PT_CODE'].nunique()
print(f'{total_candidates_unique_count:,} candidates')

318,437 candidates


In [87]:
# how many unique donors?
# get the count of unique donor IDs
total_donors_unique_count = df['DONOR_ID'].nunique()
print(f'{total_donors_unique_count:,} donors')

213,641 donors


### Flag transplants

In [88]:
# how many candidates got a transplant? for this,
# we want to make sure that every record associated with a candidate
# shows whether they have (at some point) gotten a transplant

# step one: get a unique list of all PT_CODE values that are
# associated with a transplant record -- i.e., the TRR_ID_CODE
# value is not null
pt_codes_w_tx = df[df['TRR_ID_CODE'].notnull()]['PT_CODE'].unique()

# set a convenience flag on _every_ event record showing whether this
# candidate has ever been associated with a tx record
df['has_tx'] = df['PT_CODE'].isin(pt_codes_w_tx)

In [89]:
# checks out
df[df['has_tx']]['PT_CODE'].nunique() == df[df['TRR_ID_CODE'].notnull()]['PT_CODE'].nunique()

True
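
The flag logic above can be illustrated on a toy table. This is a minimal sketch with made-up values (the `toy` frame and its IDs are hypothetical, not drawn from the STAR file): a patient with two listing events, only one of which led to a transplant, ends up flagged on both rows.

```python
import pandas as pd

# Toy event-level data with hypothetical values.
# Patient 1 has two events; only the first has a transplant ID.
toy = pd.DataFrame({
    'PT_CODE': [1, 1, 2, 3],
    'TRR_ID_CODE': ['A100', None, None, 'A200'],
})

# Patients with at least one non-null transplant ID
pt_codes_w_tx = toy.loc[toy['TRR_ID_CODE'].notnull(), 'PT_CODE'].unique()

# Propagate the flag to every event belonging to those patients
toy['has_tx'] = toy['PT_CODE'].isin(pt_codes_w_tx)

print(toy['has_tx'].tolist())
```

Because the flag is set per patient rather than per event, counting unique `PT_CODE` values where `has_tx` is true gives the number of candidates who ever received a transplant.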


### Flag alcohol diagnoses

Candidates who were listed with a primary or secondary diagnosis associated with an alcohol-related disease will have one of these codes present in the `DGN_TCR`, `DGN2_TCR` or `DIAG` fields, according to UNOS:
- 4215 – Alcoholic Cirrhosis
- 4216 – Alcoholic Cirrhosis with Hepatitis C
- 4217 – Acute Alcoholic Hepatitis
- 4218 – Acute Alcohol-Associated Hepatitis with or without Cirrhosis
- 4219 – Alcohol-Associated Cirrhosis without Acute Alcohol-Associated Hepatitis

Here, we're going to check for the presence of these codes and set a flag on each record.

In [90]:
# check the variable start dates for those columns
df_star_data_dict[df_star_data_dict['VARIABLE NAME'].isin(['DGN_TCR', 'DGN2_TCR', 'DIAG'])]

Unnamed: 0,VARIABLE NAME,DESCRIPTION,FORM,VAR START DATE,VAR END DATE,FORM SECTION,DATA TYPE,SAS ANALYSIS FORMAT,COMMENT
87,DGN_TCR,PRIMARY DIAGNOSIS AT TIME OF LISTING,TCR,1994-04-01,NaT,CLINICAL INFORMATION,NUM,ALL_DGN,
88,DGN2_TCR,SECONDARY DIAGNOSIS AT TIME OF LISTING,TCR,1994-04-01,NaT,CLINICAL INFORMATION,NUM,ALL_DGN,
92,DIAG,RECIPIENT PRIMARY DIAGNOSIS,TRR>TCR,1987-10-01,NaT,PATIENT STATUS/CLINICAL INFORMATION,NUM,ALL_DGN,Primary diagnosis was not collected on TCR unt...


In [91]:
# check - see which code labels include the word 'alcohol'
df_code_dict[df_code_dict['LABEL'].fillna('').str.contains('alcohol', case=False)]

Unnamed: 0,LABEL,FMTNAME,TYPE,CODE
2582,ALCOHOLIC CIRRHOSIS,LI_DGN,N,4215
2583,ALCOHOLIC CIRRHOSIS WITH HEPATITIS C,LI_DGN,N,4216
2584,ACUTE ALCOHOLIC HEPATITIS,LI_DGN,N,4217
2585,ACUTE ALCOHOL-ASSOCIATED HEPATITIS WITH OR WIT...,LI_DGN,N,4218
2586,ALCOHOL-ASSOCIATED CIRRHOSIS WITHOUT ACUTE ALC...,LI_DGN,N,4219
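
As a side note, `str.contains` also accepts an `na` argument, which can stand in for the `fillna('')` step. A minimal sketch on a hypothetical lookup table (`toy_codes` and its rows are illustrative only):

```python
import pandas as pd

# Hypothetical slice of a code lookup, including one missing label
toy_codes = pd.DataFrame({
    'LABEL': ['ALCOHOLIC CIRRHOSIS', None, 'ACUTE ALCOHOL-ASSOCIATED HEPATITIS', 'O'],
    'CODE': ['4215', '0', '4218', '1'],
})

# na=False treats missing labels as non-matches, so no fillna is needed
mask = toy_codes['LABEL'].str.contains('alcohol', case=False, na=False)

# Collect the matching codes as integers
codes = {int(c) for c in toy_codes.loc[mask, 'CODE']}
print(sorted(codes))
```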


In [92]:
# collect the integer codes whose labels mention 'alcohol'
alcohol_mask = df_code_dict['LABEL'].fillna('').str.contains('alcohol', case=False)
alcohol_codes = {int(x) for x in df_code_dict[alcohol_mask]['CODE']}

In [93]:
alcohol_codes

{4215, 4216, 4217, 4218, 4219}

In [94]:
def has_alcohol_diagnosis(row):
    ''' check a row of liver data to see if any of the alcohol codes are present '''
    diag_codes = {
        row['DGN_TCR'],
        row['DGN2_TCR'],
        row['DIAG']
    }

    # a non-empty intersection means at least one alcohol code is present
    return bool(diag_codes & alcohol_codes)

In [95]:
df['alcohol_diagnosis'] = df.apply(
    has_alcohol_diagnosis,
    axis=1
)
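
The row-wise `apply` is fine at this data size, but the same check can be expressed column-wise, which is usually faster on large frames. A minimal sketch, assuming the three diagnosis columns are numeric; the `toy` frame and the codes 1111 and 9999 are placeholders, not real diagnosis codes:

```python
import pandas as pd

alcohol_codes_demo = [4215, 4216, 4217, 4218, 4219]

# Hypothetical rows: alcohol code in primary, in secondary, and none at all
toy = pd.DataFrame({
    'DGN_TCR': [4215.0, 1111.0, float('nan')],
    'DGN2_TCR': [float('nan'), 4217.0, float('nan')],
    'DIAG': [4215.0, 1111.0, 9999.0],
})

diag_cols = ['DGN_TCR', 'DGN2_TCR', 'DIAG']

# isin() tests every cell; any(axis=1) is True if any column matched
toy['alcohol_diagnosis'] = toy[diag_cols].isin(alcohol_codes_demo).any(axis=1)

print(toy['alcohol_diagnosis'].tolist())
```

Missing values never match a code, so rows with no diagnosis recorded come out as `False`, the same as in the row-wise version.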

In [96]:
# to guard against the possibility of a candidate record not reflecting the alcohol diagnostic
# code for that candidate, set the 'alcohol_diagnosis' flag on ALL records for candidates who had an
# alcohol diagnosis in ANY of their records

# (belt and suspenders -- our numbers didn't change after we tested against this approach)

pt_codes_w_alc = df[df['alcohol_diagnosis']]['PT_CODE'].unique()

# now set a convenience flag on each event record -- has this patient
# ever been associated with an alcohol diagnosis record?
df['alcohol_diagnosis'] = df['PT_CODE'].isin(pt_codes_w_alc)

In [97]:
# grab a unique list of candidates for use later - get latest record based on
# `PX_STAT_DATE`
df_uniq_cands = df.sort_values('PX_STAT_DATE', ascending=False).drop_duplicates(
    subset='PT_CODE',
    keep='first'
)

# .. and one for unique donors
df_uniq_donors = df.sort_values('PX_STAT_DATE', ascending=False).drop_duplicates(
    subset='DONOR_ID',
    keep='first'
)
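
To make the "latest record per candidate" step concrete, here is a minimal sketch on hypothetical data (the `toy` frame is illustrative only). With a descending sort, pandas places missing dates last by default, so a record with a date is kept in preference to one without.

```python
import pandas as pd

# Hypothetical events: two per patient, one with a missing status date
toy = pd.DataFrame({
    'PT_CODE': [1, 1, 2, 2],
    'PX_STAT_DATE': pd.to_datetime(['2001-03-01', '2005-07-15', None, '1999-01-10']),
    'INIT_AGE': [50, 54, 41, 40],
})

# Newest first (NaT sorts to the end), then keep the first row per patient
latest = (
    toy.sort_values('PX_STAT_DATE', ascending=False)
       .drop_duplicates(subset='PT_CODE', keep='first')
       .sort_values('PT_CODE')
)

print(latest[['PT_CODE', 'INIT_AGE']].values.tolist())
```

One consequence worth keeping in mind: for a patient with several listings, per-listing fields such as `INIT_AGE` come from whichever record is kept, so the choice of sort column affects the age statistics computed later.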

## Findings

### "Most liver transplant patients aren't this young. The average age of a person waiting for a liver is 54. Almost two-thirds ultimately receive an organ."

In the U.S., from January 1988 to June 2024, 201,973 of 318,437 liver transplant candidates received a transplant (63.43%).

Taylor is 35. Summary stats for "age at listing" for candidates and "donor age" values across the U.S. in this time period:
  - Candidate age mean: 49.79
  - Candidate age median: 54.0
  - Donor age mean: 37.75
  - Donor age median: 38.0

Last year (2023), the median age at listing for liver transplant candidates was 56.

In [101]:
df_cand_w_tx = df[df['has_tx']]
total_candidates_w_tx = df_cand_w_tx['PT_CODE'].nunique()

# and get the % of total
tx_pct = total_candidates_w_tx / total_candidates_unique_count

print(f'In the U.S., from {date_first} to {date_last}, {total_candidates_w_tx:,} of {total_candidates_unique_count:,} liver transplant candidates received a transplant ({tx_pct:.2%}).')
print('Of course, this is just a snapshot in time: Many people listed recently will ultimately get a transplant.')

In the U.S., from January 1988 to June 2024, 201,973 of 318,437 liver transplant candidates received a transplant (63.43%).
Of course, this is just a snapshot in time: Many people listed recently will ultimately get a transplant.


In [117]:
# break out tx rate by LISTYR

df_listyr_x_tx = pd.pivot_table(
    df_uniq_cands[['LISTYR', 'has_tx']],
    index='LISTYR',
    columns='has_tx',
    aggfunc=len
).reset_index()

In [119]:
df_listyr_x_tx.columns = ['LISTYR', 'tx_no', 'tx_yes']

In [124]:
df_listyr_x_tx['tx_pct'] = (df_listyr_x_tx['tx_yes'] / (df_listyr_x_tx['tx_yes'] + df_listyr_x_tx['tx_no'])) * 100

In [126]:
df_listyr_x_tx

Unnamed: 0,LISTYR,tx_no,tx_yes,tx_pct
0,1988.0,398,1276,76.224612
1,1989.0,552,1734,75.853018
2,1990.0,635,2335,78.619529
3,1991.0,850,2593,75.312228
4,1992.0,964,2896,75.025907
5,1993.0,1279,3095,70.759031
6,1994.0,1522,3459,69.443887
7,1995.0,2033,3935,65.934987
8,1996.0,2474,4153,62.667874
9,1997.0,2857,4202,59.526845


In [98]:
# how many records are missing `INIT_AGE`?
len(df[df['INIT_AGE'].isnull()])

0

In [40]:
# max/min for `INIT_AGE` (candidate age at listing)
# and `AGE_DON` (donor age)

print('Age range:')
print('- Candidates:', df['INIT_AGE'].min(), '-', df['INIT_AGE'].max())
print('- Donors:', df['AGE_DON'].min(), '-', df['AGE_DON'].max())

Age range:
- Candidates: 0.0 - 86.0
- Donors: 0.0 - 98.0


In [105]:
# grab mean/median age for candidates and donors
age_mean_candidates_us = df_uniq_cands['INIT_AGE'].mean()
age_median_candidates_us = df_uniq_cands['INIT_AGE'].median()

age_mean_donor_us = df_uniq_donors['AGE_DON'].mean()
age_median_donor_us = df_uniq_donors['AGE_DON'].median()

age_str = '\n  - '.join([
    f'Candidate age mean: {age_mean_candidates_us:.2f}',
    f'Candidate age median: {age_median_candidates_us:.2f}',
    f'Donor age mean: {age_mean_donor_us:.2f}',
    f'Donor age median: {age_median_donor_us:.2f}'
])

# FINDING
print(f'Nationally:\n  - {age_str}')

Nationally:
  - Candidate age mean: 49.79
  - Candidate age median: 54.00
  - Donor age mean: 37.75
  - Donor age median: 38.00


In [112]:
# look at mean/median candidate INIT_AGE broken out by listing year
df_uniq_cands[['LISTYR', 'INIT_AGE']].groupby('LISTYR').agg(['median', 'mean'])

LISTYR,INIT_AGE median,INIT_AGE mean
1988.0,40.0,34.934289
1989.0,43.0,38.57699
1990.0,45.0,40.600337
1991.0,46.0,41.632297
1992.0,46.0,42.134197
1993.0,47.0,42.602881
1994.0,46.0,42.624573
1995.0,48.0,44.815851
1996.0,48.0,45.226799
1997.0,48.0,45.259244


### "Doctors do approve those who struggle with alcohol for transplant, and the practice is growing. Thirty years ago, around 20% of candidates added to the transplant list were diagnosed with an alcohol-related liver disease. Last year, that number rose to 45%. Generally, the majority of these patients ended up with a liver, similar to patients without alcohol-related liver disease."

In [131]:
# how many candidates were listed with an alcohol diagnosis?
df_alcohol_diag = df[df['alcohol_diagnosis']]
alcohol_cand_total = df_alcohol_diag['PT_CODE'].nunique()

In [132]:
alcohol_cand_pct = alcohol_cand_total / total_candidates_unique_count

cands_alcohol_got_tx_count = df_alcohol_diag[df_alcohol_diag['has_tx']]['PT_CODE'].nunique()
cands_alcohol_got_tx_pct = cands_alcohol_got_tx_count / alcohol_cand_total

print(f'Of {total_candidates_unique_count:,} liver transplant candidates across the U.S., {alcohol_cand_total:,} were listed with a primary or secondary diagnosis of an alcohol-related disease ({alcohol_cand_pct:.2%}).')
print(f'Of those {alcohol_cand_total:,} candidates who were listed with an alcohol diagnosis, {cands_alcohol_got_tx_count:,} received a transplant ({cands_alcohol_got_tx_pct:.2%}).')

Of 318,437 liver transplant candidates across the U.S., 94,434 were listed with a primary or secondary diagnosis of an alcohol-related disease (29.66%).
Of those 94,434 candidates who were listed with an alcohol diagnosis, 58,448 received a transplant (61.89%).


In [133]:
# note: vetted against manually filtered data as well

# show unique candidates, LISTYR x alcohol diagnosis
df_alc_by_listyr_uniq = pd.pivot_table(
    df_uniq_cands[['LISTYR', 'alcohol_diagnosis']],
    index='alcohol_diagnosis',
    columns='LISTYR',
    aggfunc=len
).T.reset_index()

In [134]:
df_alc_by_listyr_uniq.head()

alcohol_diagnosis,LISTYR,False,True
0,1988.0,1548,126
1,1989.0,2016,270
2,1990.0,2506,464
3,1991.0,2924,519
4,1992.0,3240,620


In [135]:
df_alc_by_listyr_uniq.columns = ['LISTYR', 'alcohol_diag_no', 'alcohol_diag_yes']

In [136]:
df_alc_by_listyr_uniq.head()

Unnamed: 0,LISTYR,alcohol_diag_no,alcohol_diag_yes
0,1988.0,1548,126
1,1989.0,2016,270
2,1990.0,2506,464
3,1991.0,2924,519
4,1992.0,3240,620


In [137]:
df_alc_by_listyr_uniq['alcohol_diag_pct'] = (df_alc_by_listyr_uniq['alcohol_diag_yes'] / (df_alc_by_listyr_uniq['alcohol_diag_yes'] + df_alc_by_listyr_uniq['alcohol_diag_no'])) * 100

In [138]:
df_alc_by_listyr_uniq

Unnamed: 0,LISTYR,alcohol_diag_no,alcohol_diag_yes,alcohol_diag_pct
0,1988.0,1548,126,7.526882
1,1989.0,2016,270,11.811024
2,1990.0,2506,464,15.622896
3,1991.0,2924,519,15.074063
4,1992.0,3240,620,16.062176
5,1993.0,3623,751,17.169639
6,1994.0,3766,1215,24.392692
7,1995.0,4330,1638,27.446381
8,1996.0,4750,1877,28.323525
9,1997.0,5163,1896,26.859329


In [139]:
df_alc_by_listyr_uniq.to_clipboard(index=False)

In [140]:
# spot-checking 2023 numbers from source
test_23 = df_uniq_cands[df_uniq_cands['LISTYR'] == 2023]
test_23_has_alc = test_23[test_23['alcohol_diagnosis']]
test_23_no_alc = test_23[~test_23['alcohol_diagnosis']]
print(len(test_23_has_alc))
print(len(test_23_no_alc))

6141
7580
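
For reference, the 2023 share follows directly from the two spot-check counts printed above. A short worked check, using only those two numbers:

```python
# Counts from the 2023 spot check above
has_alc = 6141
no_alc = 7580

# Share of 2023 candidates with an alcohol-related diagnosis
share = has_alc / (has_alc + no_alc)

print(f'{share:.2%}')
```

Rounded to a whole number, this is the 45% figure cited in the article.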
