In [1]:
import numpy as np
import pandas as pd

In [2]:
lab_dict = pd.read_csv('D_LABITEMS.csv', sep = ',')
labevents = pd.read_csv('LABEVENTS.csv', sep = ',')

## Explore the LABEVENTS table
Note that we could do the same thing from MySQL.

In [3]:
labevents.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUE,VALUENUM,VALUEUOM,FLAG
0,281,3,,50820,2101-10-12 16:07:00,7.39,7.39,units,
1,282,3,,50800,2101-10-12 18:17:00,ART,,,
2,283,3,,50802,2101-10-12 18:17:00,-1,-1.0,mEq/L,
3,284,3,,50804,2101-10-12 18:17:00,22,22.0,mEq/L,
4,285,3,,50808,2101-10-12 18:17:00,0.93,0.93,mmol/L,abnormal


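We aggregate the lab records by test (`ITEMID`) and unit (`VALUEUOM`), then count the records in each group and compute the mean and median of their numeric values (`VALUENUM`). Missing units are replaced with `?` so that those records are kept.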

In [4]:
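# Replace missing units with '?' so that those records are not dropped by groupby
group = labevents.fillna({'VALUEUOM':'?'}).groupby(['ITEMID', 'VALUEUOM'])
# Number of records per (ITEMID, VALUEUOM) pair
lab_sum = group.size().to_frame(name = 'counts')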
lab_sum = (lab_sum
    .join(group.agg({'VALUENUM':'mean'}).rename(columns={'VALUENUM':'mean'}))
    .join(group.agg({'VALUENUM':'median'}).rename(columns={'VALUENUM':'median'}))
    .reset_index()
)
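
The same summary can also be written with pandas named aggregation, which avoids the two joins. This is a minimal sketch on a made-up toy frame (the `toy` values are illustrative, not taken from LABEVENTS), not the code used to produce the table below.

```python
import pandas as pd

# Toy stand-in for LABEVENTS; the values are made up for illustration.
toy = pd.DataFrame({
    'ITEMID':   [50802, 50802, 50802, 50804, 50804],
    'VALUEUOM': ['mEq/L', 'mEq/L', None, 'mEq/L', 'MEQ/L'],
    'VALUENUM': [-1.0, 1.0, 5.0, 22.0, 26.0],
})

toy_sum = (toy
    .fillna({'VALUEUOM': '?'})             # keep records with a missing unit
    .groupby(['ITEMID', 'VALUEUOM'])
    .agg(counts=('VALUENUM', 'size'),      # rows per group, NaN values included
         mean=('VALUENUM', 'mean'),
         median=('VALUENUM', 'median'))
    .reset_index()
)
print(toy_sum)
```

Note that `MEQ/L` and `mEq/L` form separate groups here, as they do in the real summary.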

In [5]:
lab_sum.head(n = 10)

Unnamed: 0,ITEMID,VALUEUOM,counts,mean,median
0,50800,?,404785,,
1,50801,?,5943,457.646524,475.0
2,50801,mm Hg,16073,478.099981,501.0
3,50802,?,20,5.1,5.0
4,50802,mEq/L,490631,-0.090816,0.0
5,50803,mEq/L,9246,23.847858,24.0
6,50804,?,20,33.5,35.0
7,50804,MEQ/L,20266,25.918928,26.0
8,50804,mEq/L,470355,26.046331,26.0
9,50805,%,2056,1.757571,1.0


The table above identifies lab tests by their local codes (the `ITEMID` column). We can join it with the lab code dictionary table to get the LOINC codes.

In [6]:
# show the lab dictionary table
lab_dict.head()

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE
0,546,51346,Blasts,Cerebrospinal Fluid (CSF),Hematology,26447-3
1,547,51347,Eosinophils,Cerebrospinal Fluid (CSF),Hematology,26451-5
2,548,51348,"Hematocrit, CSF",Cerebrospinal Fluid (CSF),Hematology,30398-2
3,549,51349,Hypersegmented Neutrophils,Cerebrospinal Fluid (CSF),Hematology,26506-6
4,550,51350,Immunophenotyping,Cerebrospinal Fluid (CSF),Hematology,


In [7]:
# show the lab summary data with LOINC code
lab_sum.merge(lab_dict, on = 'ITEMID', how = 'outer').head()

Unnamed: 0,ITEMID,VALUEUOM,counts,mean,median,ROW_ID,LABEL,FLUID,CATEGORY,LOINC_CODE
0,50800,?,404785.0,,,1,SPECIMEN TYPE,BLOOD,BLOOD GAS,
1,50801,?,5943.0,457.646524,475.0,2,Alveolar-arterial Gradient,Blood,Blood Gas,19991-9
2,50801,mm Hg,16073.0,478.099981,501.0,2,Alveolar-arterial Gradient,Blood,Blood Gas,19991-9
3,50802,?,20.0,5.1,5.0,3,Base Excess,Blood,Blood Gas,11555-0
4,50802,mEq/L,490631.0,-0.090816,0.0,3,Base Excess,Blood,Blood Gas,11555-0
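
Because the merge is an outer join, dictionary entries without any lab record appear with missing `counts`, which is why `counts` is shown as a float above. A small sketch with hypothetical item IDs (1, 2, 3) and made-up labels shows how the `indicator` option of `merge` can tell the two sides apart:

```python
import pandas as pd

# Hypothetical item IDs and labels, for illustration only.
toy_sum = pd.DataFrame({'ITEMID': [1, 2], 'counts': [10, 20]})
toy_dict = pd.DataFrame({'ITEMID': [2, 3], 'LABEL': ['test B', 'test C']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = toy_sum.merge(toy_dict, on='ITEMID', how='outer', indicator=True)
print(merged)

# Items that have lab records but no dictionary entry
left_only = merged.loc[merged['_merge'] == 'left_only', 'ITEMID'].tolist()
# Dictionary entries that never occur in the lab records
right_only = merged.loc[merged['_merge'] == 'right_only', 'ITEMID'].tolist()
```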


## Explore lab tests after loinc2hpo transformation.
Note that the transformed data has not yet been uploaded to the repo or the MySQL database.

In [9]:
mimic2hpo = pd.read_csv('lab2hpo.csv', sep = ',')

In [11]:
mimic2hpo.head()

Unnamed: 0,ROW_ID,NEGATED,MAP_TO
0,281,T,HP:0004360
1,282,U,ERROR 1: local id not mapped to loinc
2,283,T,HP:0032281
3,284,T,HP:0500164
4,285,F,HP:0002901


Note that the last two columns are the HPO representation of each lab record. `MAP_TO` holds the HPO term, or an error message when the conversion failed. `NEGATED` indicates whether the HPO term should be negated to represent the medical implication (`T` negates the term, `F` keeps it as is).

For some lab records, we were not able to transform them into HPO terms. We noted what kind of error caused each failure. Below we calculate the conversion success rate and the percentage of each type of failure.

In [14]:
pd.DataFrame({'percentage': 
              mimic2hpo.assign(cat = ['HPO' if x.startswith('HP') else x for x in mimic2hpo.MAP_TO])
              .groupby('cat').size() / len(mimic2hpo) })


cat,percentage
ERROR 1: local id not mapped to loinc,0.035987
ERROR 3: loinc code not annotated,0.068701
ERROR 4: interpretation code not mapped to hpo,0.000112
ERROR 5: unable to interpret,0.02471
HPO,0.87049


The result shows that we were able to transform 87.0% of lab records into HPO terms. This rate is close to that of the asthma dataset (88.6%). Note that the annotations were prioritized based on the LOINC frequencies in the asthma dataset, so comparisons of the loinc2hpo success rate across datasets may be biased. Still, 87.0% is not far from 88.6%, and annotations tailored to ICU patients may raise the rate further.

The top reason for failing to transform lab records into HPO terms is a missing annotation (`ERROR 3`), which affected 6.9% of lab records. We will be able to map these to HPO if we add more annotations.

The next reason is that some local lab codes are not mapped to LOINC (`ERROR 1`: 3.6%). We will simply skip these lab tests. We do not try to map them to LOINC ourselves because the original data provider might have had a good reason not to do so.

The third reason is `ERROR 5: unable to interpret` (2.5%). This error occurs when a lab result is of a nominal type, which we do not handle for now, or is not reported in the expected format, such as free text that we were not able to parse. We may consider addressing the nominal types, but that is low priority.

The last reason is `ERROR 4: interpretation code not mapped to hpo`. This happens when a lab result was interpreted with a code that we do not have annotations for. Because of the low frequency (0.01%), we will omit such lab records.
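
As a cross-check, the same category shares can be sketched with `value_counts(normalize=True)`. The toy series below reuses only the five `MAP_TO` values shown in the `head()` output above, so its shares are not the real ones.

```python
import pandas as pd

# The five MAP_TO values from the head() output above
toy_map = pd.Series([
    'HP:0004360',
    'ERROR 1: local id not mapped to loinc',
    'HP:0032281',
    'HP:0500164',
    'HP:0002901',
], name='MAP_TO')

# Collapse every HPO term into one 'HPO' category, keep error messages as they are
is_hpo = toy_map.str.startswith('HP', na=False)
cat = toy_map.where(~is_hpo, 'HPO')

# Fraction of records in each category
share = cat.value_counts(normalize=True)
print(share)
```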

Next, we look at how many HPO terms were assigned to each patient.

In [15]:
mimic2hpo = mimic2hpo.set_index(mimic2hpo.ROW_ID)
labevents = labevents.set_index(labevents.ROW_ID)

In [17]:
combined = labevents.join(mimic2hpo[['NEGATED', 'MAP_TO']], how = 'left')
combined.head()

ROW_ID,ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUE,VALUENUM,VALUEUOM,FLAG,NEGATED,MAP_TO
281,281,3,,50820,2101-10-12 16:07:00,7.39,7.39,units,,T,HP:0004360
282,282,3,,50800,2101-10-12 18:17:00,ART,,,,U,ERROR 1: local id not mapped to loinc
283,283,3,,50802,2101-10-12 18:17:00,-1,-1.0,mEq/L,,T,HP:0032281
284,284,3,,50804,2101-10-12 18:17:00,22,22.0,mEq/L,,T,HP:0500164
285,285,3,,50808,2101-10-12 18:17:00,0.93,0.93,mmol/L,abnormal,F,HP:0002901
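
An alternative to setting the index on both frames is to merge on `ROW_ID` directly and let pandas verify that the key is unique on both sides. The sketch below uses toy frames (the third event deliberately has no HPO row); `validate='one_to_one'` raises an error if either side contains duplicate keys.

```python
import pandas as pd

# Toy frames; row 283 has no entry in the HPO table on purpose.
toy_events = pd.DataFrame({'ROW_ID': [281, 282, 283], 'SUBJECT_ID': [3, 3, 3]})
toy_hpo = pd.DataFrame({
    'ROW_ID': [281, 282],
    'NEGATED': ['T', 'U'],
    'MAP_TO': ['HP:0004360', 'ERROR 1: local id not mapped to loinc'],
})

# Left join keeps every lab event; validate checks ROW_ID is unique on both sides
toy_combined = toy_events.merge(toy_hpo, on='ROW_ID', how='left',
                                validate='one_to_one')

# Events without an HPO row get NaN; na=False treats them as "not an HPO term"
unmapped = int(toy_combined['MAP_TO'].isna().sum())
is_hpo = toy_combined['MAP_TO'].str.startswith('HP', na=False)
print(toy_combined[is_hpo])
```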


In [19]:
# na = False treats records without a MAP_TO value as not being HPO terms
total_hp_count_per_patient = combined[combined.MAP_TO.str.startswith('HP', na = False)].groupby('SUBJECT_ID').size().reset_index()
total_hp_count_per_patient.rename(columns = {'SUBJECT_ID': 'patient', 0: 'hpo_n'}, inplace = True)

In [20]:
total_hp_count_per_patient.head()

Unnamed: 0,patient,hpo_n
0,2,40
1,3,1303
2,4,1433
3,5,17
4,6,1065


In [21]:
# cut the count of hpo into bins
bins = pd.cut(total_hp_count_per_patient.hpo_n, bins = [-1, 0, 100, 500, 1500, 4000, 30000])

In [22]:
total_hp_count_per_patient.groupby(bins).size()

hpo_n
(-1, 0]              0
(0, 100]         10247
(100, 500]       22261
(500, 1500]      10404
(1500, 4000]      2876
(4000, 30000]      463
dtype: int64

We can see that the largest group of patients (22,261 of 46,251, about 48%) has between 101 and 500 HPO terms. Note that this counts repeated HPO terms multiple times, and counts both normal and abnormal findings.
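
To make the binning concrete, here is a small sketch that applies the same bin edges to the five per-patient counts shown in the `head()` output above. `pd.cut` builds right-closed intervals by default, so a count of exactly 100 would fall into `(0, 100]`.

```python
import pandas as pd

# The five per-patient counts from the head() output above
counts = pd.Series([40, 1303, 1433, 17, 1065], name='hpo_n')
edges = [-1, 0, 100, 500, 1500, 4000, 30000]

# Assign each count to a right-closed interval such as (0, 100]
binned = pd.cut(counts, bins=edges)

# observed=False keeps empty bins in the result
per_bin = counts.groupby(binned, observed=False).size()
share = per_bin / per_bin.sum()
print(per_bin)
```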

Next, we look at abnormal findings only.

In [23]:
abnormal_hp_count_per_patient = combined[combined.NEGATED == 'F'].groupby('SUBJECT_ID').size().reset_index()



In [24]:
abnormal_hp_count_per_patient.groupby(pd.cut(abnormal_hp_count_per_patient.loc[:,0], bins = [0, 100, 500, 1000, 5000, 30000])).size()

0
(0, 100]         24032
(100, 500]       17490
(500, 1000]       2912
(1000, 5000]      1466
(5000, 30000]       17
dtype: int64

Just over half of the patients (24,032 of 45,917) have 100 or fewer abnormal findings. Note that repeated abnormal findings are still counted multiple times.

Next, we count only the unique abnormal findings for each patient.

In [25]:
patient_group = combined[combined.NEGATED == 'F'].groupby('SUBJECT_ID')

In [27]:
unique_abnormal_hp_count = pd.DataFrame({'patient_id': [group_id for group_id, _ in patient_group],
                                        'abnormal_hpo_n': [len(group.MAP_TO.unique()) for _, group in patient_group]})

In [28]:
unique_abnormal_hp_count = unique_abnormal_hp_count.set_index('patient_id')


In [29]:
unique_abnormal_hp_count.head()

patient_id,abnormal_hpo_n
2,14
3,42
4,66
5,5
6,54


In [31]:
unique_abnormal_hp_count.describe()

Unnamed: 0,abnormal_hpo_n
count,45917.0
mean,27.806586
std,15.708607
min,1.0
25%,16.0
50%,26.0
75%,38.0
max,100.0


The result shows that each patient has on average 27.8 unique abnormal findings in HPO terms. The interquartile range spans 16 to 38.
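
The per-patient unique count can also be sketched without an explicit loop by using `nunique()`. The toy frame below is made up for illustration. One difference to keep in mind: `nunique()` ignores missing values by default, while `len(x.unique())` counts `NaN` as a value.

```python
import pandas as pd

# Toy stand-in for the combined table; the rows are made up for illustration.
toy = pd.DataFrame({
    'SUBJECT_ID': [3, 3, 3, 4, 4],
    'NEGATED':    ['F', 'F', 'T', 'F', 'F'],
    'MAP_TO':     ['HP:0002901', 'HP:0002901', 'HP:0004360',
                   'HP:0002901', 'HP:0500164'],
})

# Keep abnormal findings only, then count distinct HPO terms per patient
unique_abnormal = (toy[toy.NEGATED == 'F']
    .groupby('SUBJECT_ID')['MAP_TO']
    .nunique()
    .rename('abnormal_hpo_n'))
print(unique_abnormal)
```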

Next, we are going to infer HPO terms based on the HPO hierarchy. Before we do that, we need to define a target and a window, such as 10 days before death?
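
To illustrate what this could look like, the sketch below restricts events to a 10-day window before a target time and then adds all ancestors of the observed terms. Everything in it is hypothetical: the `PARENTS` map is a toy hierarchy with placeholder term names (not real HPO relationships), and `target_time`, `window`, and `with_ancestors` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy hierarchy (child -> parents); NOT real HPO relationships.
PARENTS = {
    'T:leaf1': ['T:mid'],
    'T:leaf2': ['T:mid'],
    'T:mid':   ['T:root'],
    'T:root':  [],
}

def with_ancestors(term, parents=PARENTS):
    """Return the term together with every ancestor reachable via `parents`."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

# Toy events for one patient
events = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 1],
    'CHARTTIME': pd.to_datetime(['2101-10-01', '2101-10-12', '2101-10-19']),
    'MAP_TO': ['T:leaf1', 'T:leaf2', 'T:leaf1'],
})

target_time = pd.Timestamp('2101-10-20')   # hypothetical target event
window = pd.Timedelta(days=10)             # hypothetical look-back window

# Keep events in [target_time - window, target_time)
in_window = events[(events.CHARTTIME >= target_time - window) &
                   (events.CHARTTIME < target_time)]

# Union of the observed terms and all of their ancestors
inferred = set().union(*(with_ancestors(t) for t in in_window.MAP_TO))
print(sorted(inferred))
```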

to continue...