Generate text from a prompt that asks the model to identify causality between medical conditions. The `run` function replaces the `{test}` placeholder in the prompt with the actual test case, then prints and returns the generated text. Finally, the function is called once with a specific test case ("A: heart rate, B: sepsis").

# TODO
* get label and fluid
* merge these two together into DF
* add all admission data too?

Run the `rph.sh` script to SSH into the server.

In [1]:
import pandas as pd
import numpy as np

#datetime calculations
from datetime import datetime

In [2]:
import requests
import re

# For local streaming, the websockets are hosted without ssl - http://
HOST = 'localhost:5000'
URI = f'http://{HOST}/api/v1/generate'

# For reverse-proxied streaming, the remote will likely host with ssl - https://
# URI = 'https://your-uri-here.trycloudflare.com/api/v1/generate'


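def run(prompt, test):
    """Fill the {test} placeholder, query the local model, then print and return its reply."""
    request = {
        'prompt': prompt.replace("{test}", test),
        'max_new_tokens': 250,
        'temperature': 0.0,
        'top_p': 0.1,
        'length_penalty': 5,
        'early_stopping': True,
        'seed': 0,
    }

    response = requests.post(URI, json=request)
    response.raise_for_status()  # fail early on HTTP errors

    # parse the JSON body once and reuse the text
    text = response.json()['results'][0]['text']
    print(text)

    return text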


prompt = '''
[INST]
You are a medical expert. You are tasked to identify causality between two conditions. 
Return yes if A causes B. If B causes A or if you are not sure, return no. 
Keep your responses short and succinct. 
Input
{test}
[/INST]
'''
run(prompt,"A: heart rate, B: sepsis")

Yes, A (heart rate) can cause B (sepsis). Elevated heart rate can lead to increased cardiac output, which can increase the delivery of bacteria-carrying blood to the site of infection, leading to sepsis.


'Yes, A (heart rate) can cause B (sepsis). Elevated heart rate can lead to increased cardiac output, which can increase the delivery of bacteria-carrying blood to the site of infection, leading to sepsis.'

Use the local model to focus on cause and effect, for example which symptoms cause sepsis.

In [12]:
# Function to get the most recent value for a given item ID list from a DataFrame
def get_most_recent_value(df, itemids):
    filtered_df = df[df['itemid'].isin(itemids)].sort_values('charttime')
    if not filtered_df.empty:
        return filtered_df.iloc[-1]['value']
    else:
        return None
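
A minimal usage sketch for `get_most_recent_value` on a toy frame. The column names (`itemid`, `charttime`, `value`) follow the function above; the `toy` frame, its item IDs, and its values are invented for illustration, and the function is repeated so the block runs on its own.

```python
import pandas as pd

def get_most_recent_value(df, itemids):
    filtered_df = df[df['itemid'].isin(itemids)].sort_values('charttime')
    if not filtered_df.empty:
        return filtered_df.iloc[-1]['value']
    return None

# Toy data (invented): two readings for one item and one reading for another
toy = pd.DataFrame({
    'itemid': [50813, 50813, 50820],
    'charttime': pd.to_datetime([
        '2101-10-12 16:07:00',
        '2101-10-12 18:17:00',
        '2101-10-12 17:00:00',
    ]),
    'value': ['1.8', '2.4', '7.39'],
})

latest = get_most_recent_value(toy, [50813])   # value at the latest charttime for item 50813
missing = get_most_recent_value(toy, [99999])  # no rows match, so None
print(latest, missing)
```

Parsing `charttime` as a datetime before sorting keeps the ordering chronological even if the timestamp format is not consistent.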

In [5]:
symptom = ['uti', 'low oxygen', 'low neutrophils', 'low BP', 'high BP', 'age>50', 'pregnancy', 'age>70']
df = pd.DataFrame(symptom, columns=['symptom'])
df

Unnamed: 0,symptom
0,uti
1,low oxygen
2,low neutrophils
3,low BP
4,high BP
5,age>50
6,pregnancy
7,age>70


In [6]:
# Apply 'run' function to each row of DataFrame
df['relation'] = df.apply(lambda x: run(prompt, f"A: {x['symptom']}, B: sepsis"), axis=1)

Yes, A (uti) causes B (sepsis).
Yes, A (low oxygen) causes B (sepsis). Low oxygen levels can lead to sepsis, a life-threatening condition that can cause organ failure and death.
Yes, A (low neutrophils) causes B (sepsis). Low neutrophil counts can impair the body's ability to fight off infections, which can lead to the development of sepsis.
No. Sepsis can cause low blood pressure (BP), but low BP does not cause sepsis.
Yes, A (high BP) can cause B (sepsis). High blood pressure can lead to damage to blood vessels, which can increase the risk of bacterial infection and sepsis.
Yes, A (age > 50) causes B (sepsis). As people age, their immune systems weaken, making them more susceptible to infections like sepsis.
No. Sepsis can be caused by various factors, including infection, but pregnancy is not a known cause of sepsis.
Yes, A (age > 70) causes B (sepsis). As people age, their immune systems weaken, making them more susceptible to infections like sepsis.


In [7]:
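# classify by how the reply starts, so words that merely contain "yes" (e.g. "eyes") do not count
df['symptom_short'] = df['relation'].apply(lambda x: 'yes' if x.strip().lower().startswith('yes') else 'no')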

In [8]:
df

Unnamed: 0,symptom,relation,symptom_short
0,uti,"Yes, A (uti) causes B (sepsis).",yes
1,low oxygen,"Yes, A (low oxygen) causes B (sepsis). Low oxy...",yes
2,low neutrophils,"Yes, A (low neutrophils) causes B (sepsis). Lo...",yes
3,low BP,"No. Sepsis can cause low blood pressure (BP), ...",no
4,high BP,"Yes, A (high BP) can cause B (sepsis). High bl...",yes
5,age>50,"Yes, A (age > 50) causes B (sepsis). As people...",yes
6,pregnancy,"No. Sepsis can be caused by various factors, i...",no
7,age>70,"Yes, A (age > 70) causes B (sepsis). As people...",yes


# Identifying causation for input disease

In [10]:
prompt_top_indicators = '''
[INST]
You are a medical researcher with expertise in diagnosing diseases. 
Your task is to identify what biological markers, demographic data, or symptoms are most crucial for diagnosing a specific disease.

Input Disease: [Disease_Name]

Please list the top indicators that are essential for diagnosing [Disease_Name]. Please keep only the most crucial indicators. 

Return only a list of the indicators as your response to my query. The list should be in the following format:

['indicator_1', 'indicator_2', 'indicator_3', ... , 'indicator_n']

Input
{test}
[/INST]
'''
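
The prompt asks for the answer as a Python-style list, but the reply arrives as a single string, sometimes with extra text such as `Output:` in front. A minimal sketch of turning such a reply into a list, assuming the model keeps to the requested bracketed format. `parse_indicator_list` is a hypothetical helper name, and the sample string is the reply this notebook records for `uti`.

```python
import ast
import re

def parse_indicator_list(text):
    """Extract the first [...] literal from a model reply and return it as a list of strings."""
    match = re.search(r"\[.*?\]", text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        parsed = ast.literal_eval(match.group(0))  # safe evaluation of a Python literal
    except (ValueError, SyntaxError):
        return []  # bracketed text that is not a valid literal
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]

raw = ("Output:\n\n['frequency of urination', 'cloudy or strong-smelling urine', "
       "'fever', 'flank pain', 'blood in the urine']")
indicators = parse_indicator_list(raw)
print(indicators)
```

`ast.literal_eval` is used instead of `eval` so that a malformed or unexpected reply cannot execute code; unparseable replies simply yield an empty list.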

In [None]:
# prompt for Labevents. Including LABEL, FLUID
# is LABEL from FLUID 
prompt_labevents = '''
[INST]
You are a medical expert. You are tasked to identify causality between two conditions. 
Return yes if A causes B. If B causes A or if you are not sure, return no. 
Keep your responses short and succinct. 
Input
{test}
[/INST]
'''

In [11]:
run(prompt_top_indicators,"Input Disease: uti")

Output:

['frequency of urination', 'cloudy or strong-smelling urine', 'fever', 'flank pain', 'blood in the urine']


"Output:\n\n['frequency of urination', 'cloudy or strong-smelling urine', 'fever', 'flank pain', 'blood in the urine']"

In [66]:
df_lab

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUE,VALUENUM,VALUEUOM,FLAG
0,281,3,,50820,2101-10-12 16:07:00,7.39,7.39,units,
1,282,3,,50800,2101-10-12 18:17:00,ART,,,
2,283,3,,50802,2101-10-12 18:17:00,-1,-1.00,mEq/L,
3,284,3,,50804,2101-10-12 18:17:00,22,22.00,mEq/L,
4,285,3,,50808,2101-10-12 18:17:00,0.93,0.93,mmol/L,abnormal
...,...,...,...,...,...,...,...,...,...
27854050,27428435,96443,103219.0,50882,2109-12-30 01:40:00,26,26.00,mEq/L,
27854051,27428436,96443,103219.0,50885,2109-12-30 01:40:00,2.1,2.10,mg/dL,abnormal
27854052,27428437,96443,103219.0,50902,2109-12-30 01:40:00,97,97.00,mEq/L,
27854053,27428438,96443,103219.0,50911,2109-12-30 01:40:00,2,2.00,ng/mL,


# Exploring Labevents

In [20]:
lab_path = '../data/LABEVENTS.csv'
df_lab = pd.read_csv(lab_path)

lab_desc_path = '../data/D_LABITEMS.csv'
df_lab_desc = pd.read_csv(lab_desc_path)

In [21]:
df_lab_desc[df_lab_desc['ITEMID']==50802]

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE
129,3,50802,Base Excess,Blood,Blood Gas,11555-0


In [22]:
df_items_desc = pd.read_csv('../data/D_ITEMS.csv')

In [24]:
# filter for only microbiology events
df_mb_desc = df_items_desc[df_items_desc['LINKSTO']=='microbiologyevents']

We have to store both the label value and the fluid or the category value.
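
One way to keep both pieces of information is a combined key, sketched here on three GLUCOSE rows as they appear in `D_LABITEMS`. The column name `LABEL_FULL` and the ` | ` separator are assumptions made for illustration. For `D_ITEMS`, which has no `FLUID` column, `CATEGORY` alone would play the same role.

```python
import pandas as pd

# Three rows that share the same LABEL but differ in FLUID / CATEGORY
toy_desc = pd.DataFrame({
    'ITEMID': [50809, 50931, 51478],
    'LABEL': ['GLUCOSE', 'GLUCOSE', 'GLUCOSE'],
    'FLUID': ['Blood', 'Blood', 'Urine'],
    'CATEGORY': ['Blood Gas', 'Chemistry', 'Hematology'],
})

# Build one key from label, fluid and category so each ITEMID stays distinguishable
toy_desc['LABEL_FULL'] = (
    toy_desc['LABEL'].str.upper() + ' | '
    + toy_desc['FLUID'].str.upper() + ' | '
    + toy_desc['CATEGORY'].str.upper()
)

n_dupes_label = int(toy_desc['LABEL'].duplicated(keep=False).sum())
n_dupes_full = int(toy_desc['LABEL_FULL'].duplicated(keep=False).sum())
print(n_dupes_label, n_dupes_full)
```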

In [25]:
print("MB values:", len(df_mb_desc['LABEL'].values))
print("MB unique values:", len(df_mb_desc['LABEL'].unique()))

MB values: 436
MB unique values: 421


In [48]:
# show values
df_mb_desc[df_mb_desc.duplicated(subset='LABEL', keep=False)]

Unnamed: 0,ROW_ID,ITEMID,LABEL,ABBREVIATION,DBSOURCE,LINKSTO,CATEGORY,UNITNAME,PARAM_TYPE,CONCEPTID
10341,12340,80001,,,hospital,microbiologyevents,ORGANISM,,,
10607,12247,70001,,,hospital,microbiologyevents,SPECIMEN,,,
10638,12278,70032,Blood (EBV),,hospital,microbiologyevents,SPECIMEN,,,
10641,12281,70035,"FLUID,OTHER",,hospital,microbiologyevents,SPECIMEN,,,
10646,12286,70040,SWAB,,hospital,microbiologyevents,SPECIMEN,,,
10657,12297,70051,"FLUID,OTHER",,hospital,microbiologyevents,SPECIMEN,,,
10673,12313,70067,SWAB,,hospital,microbiologyevents,SPECIMEN,,,
10674,12314,70068,SWAB,,hospital,microbiologyevents,SPECIMEN,,,
10675,12315,70069,SWAB,,hospital,microbiologyevents,SPECIMEN,,,
10676,12316,70070,SWAB,,hospital,microbiologyevents,SPECIMEN,,,


In [49]:
# store values ?
df_mb_desc[df_mb_desc.duplicated(subset='LABEL', keep=False)]['LABEL'].unique()

array([nan, 'Blood (EBV)', 'FLUID,OTHER', 'SWAB', 'TISSUE', 'URINE',
       'PORPHYROMONAS SPECIES', '2ND ISOLATE', 'FUSOBACTERIUM SPECIES'],
      dtype=object)

# Lab values

In [42]:
df_lab_full = pd.merge(df_lab, df_lab_desc, on='ITEMID')

In [60]:
# getting label values from Labs. 
#todo: get all Fluid too...
df_lab_filtered = df_lab_desc[['ITEMID', 'LABEL', 'FLUID']]

In [61]:
# Convert all string types to uppercase
df_lab_filtered = df_lab_filtered.applymap(lambda x: x.upper() if isinstance(x, str) else x)

In [65]:
# medical tests, markers, and parameters commonly encountered in clinical settings
df_lab_filtered['LABEL']



0                          BLASTS
1                     EOSINOPHILS
2                 HEMATOCRIT, CSF
3      HYPERSEGMENTED NEUTROPHILS
4               IMMUNOPHENOTYPING
                  ...            
748               VOIDED SPECIMEN
749               VOIDED SPECIMEN
750               VOIDED SPECIMEN
751               VOIDED SPECIMEN
752      SURFACTANT ALBUMIN RATIO
Name: LABEL, Length: 753, dtype: object

In [50]:
df_item_labels['lactate']= df_item_labels['item_list'].apply(lambda x: 'lactate' if 'lactate' in x.lower() else 'no')

AttributeError: 'numpy.ndarray' object has no attribute 'apply'

In [None]:
df_item_labels['symptom_short'] = df['relation'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')

In [205]:
data_ref000 = {
'crucial_sepsis_predictors': [
    "High Lactate Levels",
    "Low Blood Pressure (Hypotension)",
    "Altered Mental Status",
    "High Heart Rate (Tachycardia)",
    "Rapid Breathing (Tachypnea)",
    "High WBC Count",
    "Low WBC Count",
    "Elevated Procalcitonin",
    "Elevated CRP",
    "Positive Blood Culture",
    "Fever or Hypothermia",
    "Low Platelet Count",
    "Elevated Bilirubin",
    "Elevated Creatinine",
    "Low Bicarbonate Levels",
    "High Blood Glucose",
    "Positive Urine Culture",
    "Elevated Liver Enzymes (AST, ALT)",
    "Elevated Troponin",
    "Low Hemoglobin",
    "High INR or PTT",
    "Low Albumin",
    "Elevated D-Dimer",
    "Elevated Ferritin",
    "Elevated LDH",
    "High BUN",
    "Low PaO2/FiO2 Ratio",
    "Age > 65 years",
    "Comorbidities (e.g., Diabetes, Heart Disease)",
    "Recent Surgery",
    "Immunosuppression",
    "Chronic Alcohol Abuse",
    "Intravenous Drug Use",
    "Recent Infection or Antibiotic Use",
    "Mechanical Ventilation",
    "Central Venous Catheter Use",
    "Urinary Catheter Use",
    "Recent Hospitalization",
    "Chronic Renal Failure",
    "Chronic Liver Disease",
    "Pregnancy",
    "Burns or Trauma",
    "Neutropenia",
    "Malnutrition",
    "Obesity",
    "Hyperglycemia on Admission",
    "Community-Acquired vs Hospital-Acquired Infection",
    "Skin Mottling",
    "Cyanosis (Bluish Skin)",
    "Oliguria (Low Urine Output)"
],
'Table': [
    "labevents",  # High Lactate Levels
    "chartevents",  # Low Blood Pressure (Hypotension)
    "chartevents",  # Altered Mental Status
    "chartevents",  # High Heart Rate (Tachycardia)
    "chartevents",  # Rapid Breathing (Tachypnea)
    "labevents",  # High WBC Count
    "labevents",  # Low WBC Count
    "labevents",  # Elevated Procalcitonin
    "labevents",  # Elevated CRP
    "microbiologyevents",  # Positive Blood Culture
    "chartevents",  # Fever or Hypothermia
    "labevents",  # Low Platelet Count
    "labevents",  # Elevated Bilirubin
    "labevents",  # Elevated Creatinine
    "labevents",  # Low Bicarbonate Levels
    "labevents",  # High Blood Glucose
    "microbiologyevents",  # Positive Urine Culture
    "labevents",  # Elevated Liver Enzymes (AST, ALT)
    "labevents",  # Elevated Troponin
    "labevents",  # Low Hemoglobin
    "labevents",  # High INR or PTT
    "labevents",  # Low Albumin
    "labevents",  # Elevated D-Dimer
    "labevents",  # Elevated Ferritin
    "labevents",  # Elevated LDH
    "labevents",  # High BUN
    ["chartevents", "labevents"],  # Low PaO2/FiO2 Ratio
    "patients",  # Age > 65 years
    "admissions",  # Comorbidities (e.g., Diabetes, Heart Disease)
    "admissions",  # Recent Surgery
    "admissions",  # Immunosuppression
    "admissions",  # Chronic Alcohol Abuse
    "admissions",  # Intravenous Drug Use
    "admissions",  # Recent Infection or Antibiotic Use
    "chartevents",  # Mechanical Ventilation
    "procedures_icd",  # Central Venous Catheter Use
    "procedures_icd",  # Urinary Catheter Use
    "admissions",  # Recent Hospitalization
    "admissions",  # Chronic Renal Failure
    "admissions",  # Chronic Liver Disease
    "admissions",  # Pregnancy
    "admissions",  # Burns or Trauma
    "labevents",  # Neutropenia
    "admissions",  # Malnutrition
    "admissions",  # Obesity
    "labevents",  # Hyperglycemia on Admission
    "admissions",  # Community-Acquired vs Hospital-Acquired Infection
    "chartevents",  # Skin Mottling
    "chartevents",  # Cyanosis (Bluish Skin)
    "outputevents"  # Oliguria (Low Urine Output)
],
    'Query': [
    ['Lactate'],
    ['Systolic BP', 'Diastolic BP'],
    ['Glasgow Coma Scale'],
    ['Heart Rate'],
    ['Respiratory Rate'],
    ['WBC'],
    ['WBC'],
    ['Procalcitonin'],
    ['C-Reactive Protein'],
    ['Blood Culture'],
    ['Temperature'],
    ['Platelet Count'],
    ['Bilirubin'],
    ['Creatinine'],
    ['Bicarbonate'],
    ['Blood Glucose'],
    ['Urine Culture'],
    ['AST', 'ALT'],
    ['Troponin'],
    ['Hemoglobin'],
    ['INR', 'PTT'],
    ['Albumin'],
    ['D-Dimer'],
    ['Ferritin'],
    ['LDH'],
    ['BUN'],
    ['PaO2', 'FiO2'],
    ['Age'],
    ['Comorbidities'],
    ['Surgery History'],
    ['Immunosuppression Status'],
    ['Alcohol Abuse History'],
    ['Intravenous Drug Use History'],
    ['Recent Infection or Antibiotic Use'],
    ['Mechanical Ventilation Status'],
    ['Central Venous Catheter Use'],
    ['Urinary Catheter Use'],
    ['Recent Hospitalization'],
    ['Chronic Renal Failure'],
    ['Chronic Liver Disease'],
    ['Pregnancy Status'],
    ['Burns or Trauma'],
    ['Neutropenia'],
    ['Nutritional Status'],
    ['BMI'],
    ['Blood Glucose on Admission'],
    ['Infection Source'],
    ['Skin Mottling'],
    ['Skin Color'],
    ['Urine Output']
]
}

df_ref000 = pd.DataFrame(data_ref000)
df_ref000

Unnamed: 0,crucial_sepsis_predictors,Table,Query
0,High Lactate Levels,labevents,[Lactate]
1,Low Blood Pressure (Hypotension),chartevents,"[Systolic BP, Diastolic BP]"
2,Altered Mental Status,chartevents,[Glasgow Coma Scale]
3,High Heart Rate (Tachycardia),chartevents,[Heart Rate]
4,Rapid Breathing (Tachypnea),chartevents,[Respiratory Rate]
5,High WBC Count,labevents,[WBC]
6,Low WBC Count,labevents,[WBC]
7,Elevated Procalcitonin,labevents,[Procalcitonin]
8,Elevated CRP,labevents,[C-Reactive Protein]
9,Positive Blood Culture,microbiologyevents,[Blood Culture]


In [301]:
# https://www.sciencedirect.com/science/article/pii/S2352914823000783#tbl1
# https://doi.org/10.1016/j.imu.2023.101236

data_ref001 = {
    'Category': [
        'Demographic data', 'Demographic data', 'Demographic data', 'Demographic data',
        'Clinical time series data', 'Clinical time series data', 'Clinical time series data', 'Clinical time series data',
        'Clinical time series data', 'Clinical time series data', 'Clinical time series data', 
        'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values',
        'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values',
        'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values',
        'Laboratory values', 'Laboratory values', 'Laboratory values', 'Laboratory values'
    ],
    'Variable': [
        'ICULOS', 'HospAdmTime', 'Age', 'Gender',
        'O2Sat', 'Resp', 'HR', 'Temp', 'DBP', 'MAP', 'SBP',
        'FiO2', 'pH', 'SIRS_Score', 'Base Excess', 'Glucose', 'PCO2', 'WBC', 'OXYGEN SATURATION',
        'Potassium', 'Calcium', 'Hematocrit', 'Hemoglobin', 'Magnesium', 'BUN', 'Phosphate', 'Bicarbonate',
        'Creatinine', 'Platelet', 'Chloride'
    ],
    'Description': [
        'ICU length of stay (hours since ICU admission)', 'The time between hospital and ICU admission', 'Age (years)', 'Female (0) or male (1)',
        'Pulse oximetry (%)', 'Respiration rate (breaths per minute)', 'Heart rate (beats per minute)', 'Temperature (deg C)',
        'Diastolic BP (mm Hg)', 'Mean arterial pressure (mm Hg)', 'Systolic BP (mm Hg)',
        'Fraction of inspired oxygen (%)', 'A blood pH test is a normal part of a blood gas test or arterial blood gas (ABG) test. It measures how much oxygen and carbon dioxide are in your blood',
        'Four SIRS criteria were defined, namely tachycardia (heart rate >90 beats/min), tachypnea (respiratory rate >20 breaths/min), fever or hypothermia (temperature >38 or <36 °C), and leukocytosis, leukopenia, or bandemia (white blood cells >12,000/mm3, <4,000/mm3 or bandemia ≥10%)',
        'Excess bicarbonate (mmol/L)', 'Serum glucose (mg/dL)', 'The partial pressure of carbon dioxide from arterial blood (mm Hg). Ranges 35 to 45 mmHg, or 4.7 to 6.0 kPa.', 'Leukocyte count (count/L)',
        'Oxygen saturation from arterial blood (%) SaO2', 'Potassium (mmol/L)', 'Calcium (mg/dL)', 'Hematocrit (%)', 'Hemoglobin (g/dL)',
        'Magnesium (mmol/dL)', 'Blood urea nitrogen (mg/dL)', 'Phosphate (mg/dL)', 'Bicarbonate (mmol/L)', 'Creatinine (mg/dL)',
        'Platelet count (count/mL)', 'Chloride (mmol/L)'
    ],
    'Type': [
        'Numeric', 'Date', 'Numeric', 'Categorical',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric'
    ],
    'Table': [
        'ICUSTAYS', 'ADMISSIONS', 'PATIENTS', 'PATIENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS',
        'CHARTEVENTS', 'LABEVENTS', 'Derived', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS'
    ]
    
}

df_ref001 = pd.DataFrame(data_ref001)
df_ref001

Unnamed: 0,Category,Variable,Description,Type,Table
0,Demographic data,ICULOS,ICU length of stay (hours since ICU admission),Numeric,ICUSTAYS
1,Demographic data,HospAdmTime,The time between hospital and ICU admission,Date,ADMISSIONS
2,Demographic data,Age,Age (years),Numeric,PATIENTS
3,Demographic data,Gender,Female (0) or male (1),Categorical,PATIENTS
4,Clinical time series data,O2Sat,Pulse oximetry (%),Numeric,CHARTEVENTS
5,Clinical time series data,Resp,Respiration rate (breaths per minute),Numeric,CHARTEVENTS
6,Clinical time series data,HR,Heart rate (beats per minute),Numeric,CHARTEVENTS
7,Clinical time series data,Temp,Temperature (deg C),Numeric,CHARTEVENTS
8,Clinical time series data,DBP,Diastolic BP (mm Hg),Numeric,CHARTEVENTS
9,Clinical time series data,MAP,Mean arterial pressure (mm Hg),Numeric,CHARTEVENTS


In [302]:
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
df_ref001_lab = df_ref001[df_ref001['Table']=='LABEVENTS'].copy()
df_ref001_lab['Exists'] = df_ref001_lab['Variable'].apply(lambda x: 'yes' if x.upper() in df_lab_desc['LABEL'].values else 'no')
df_ref001_lab


Unnamed: 0,Category,Variable,Description,Type,Table,Exists
12,Laboratory values,pH,A blood pH test is a normal part of a blood ga...,Numeric,LABEVENTS,yes
14,Laboratory values,Base Excess,Excess bicarbonate (mmol/L),Numeric,LABEVENTS,yes
15,Laboratory values,Glucose,Serum glucose (mg/dL),Numeric,LABEVENTS,yes
16,Laboratory values,PCO2,The partial pressure of carbon dioxide from ar...,Numeric,LABEVENTS,yes
17,Laboratory values,WBC,Leukocyte count (count/L),Numeric,LABEVENTS,yes
18,Laboratory values,OXYGEN SATURATION,Oxygen saturation from arterial blood (%) SaO2,Numeric,LABEVENTS,yes
19,Laboratory values,Potassium,Potassium (mmol/L),Numeric,LABEVENTS,yes
20,Laboratory values,Calcium,Calcium (mg/dL),Numeric,LABEVENTS,no
21,Laboratory values,Hematocrit,Hematocrit (%),Numeric,LABEVENTS,yes
22,Laboratory values,Hemoglobin,Hemoglobin (g/dL),Numeric,LABEVENTS,yes


In [280]:
ph = df_lab_desc[df_lab_desc['LABEL'] == 'PH']
# four ITEMIDs are labelled pH
ph['ITEMID'].values

array([50820, 50831, 51094, 51491])

In [299]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('Platelet', case=False)]

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,Exists,ref002
566,440,51240,LARGE PLATELETS,Blood,Hematology,34167-7,yes,yes
590,464,51264,PLATELET CLUMPS,Blood,Hematology,40741-1,yes,yes
591,465,51265,PLATELET COUNT,Blood,Hematology,777-3,yes,yes
592,466,51266,PLATELET SMEAR,Blood,Hematology,778-1,yes,yes


## Ref 002

Source: supplementary data of "Development and Evaluation of a Machine Learning Model for the Early Identification of Patients at Risk for Sepsis".

Table 1

In [231]:
data_ref002 = {
    'Variable': [
        'AGE', 'DOBUTAMINE  (Y/N)', 'DOPAMINE  (Y/N)', 'EPINEPHRINE  (Y/N)', 'NOREPINEPHRINE  (Y/N)',
        'SYSTOLIC BP x AGE', 'SHOCK INDEX (HR/SYSTOLIC BP)', 'SHOCK INDEX x AGE',
        'ANION GAP', 'ALBUMIN', 'ALKALINE PHOSPHATASE', 'ASPARATE AMINOTRANSFERASE (AST)', 'BANDS', 'BILIRUBIN',
        'UREA NITROGEN', 'CALCIUM', 'CREATININE', 'ESTIMATED GFR (MDRD EQUATION)', 'GLUCOSE', 'LACTIC ACID',
        'LYMPHOCYTE', 'MONOCYTE', 'NEUTROPHIL', 'PLATELET', 'WHITE BLOOD CELLS', 'WBC',
        'GLASGOW COMA SCALE', 'SUPPLEMENTAL OXYGEN (Y/N)', 'VENTILATOR', 'ALTERED MENTAL STATUS',
        'ABSCESS', 'ACUTE', 'ALTERED', 'BACTEREMIA', 'CELLULITIS', 'CYSTITIS', 'DIABETES',
        'FAILURE', 'LACTIC', 'LEUKOCYTOSIS', 'PNA', 'PNEUMONIA', 'PYELONEPHRITIS', 'RESPIRATORY',
        'SEPSIS', 'SEPTIC', 'UROSEPSIS', 'UTI',
        'HEART RATE', 'MEAN ARTERIAL PRESSURE', 'RESPIRATORY RATE', 'SYSTOLIC BLOOD PRESSURE',
        'OXYGEN SATURATION', 'TEMPERATURE', 'WEIGHT', 'HEIGHT', 'BMI'
    ],
    'TYPE_OF_VARIABLE': [
        'Demographic', 'Medication', 'Medication', 'Medication', 'Medication',
        'Engineered Feature', 'Engineered', 'Engineered',
        'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result',
        'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result',
        'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result',
        'Neurological Evaluation', 'Nursing Documentation', 'Nursing Documentation', 'Nursing Documentation',
        'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)',
        'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)',
        'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)',
        'Vital Sign', 'Vital Sign', 'Vital Sign', 'Vital Sign',
        'Vital Sign', 'Vital Sign', 'Physiological', 'Physiological', 'Physiological'
    ],
    'Table': [
        'PATIENTS', 'PRESCRIPTIONS', 'PRESCRIPTIONS', 'PRESCRIPTIONS', 'PRESCRIPTIONS',
        'Derived', 'Derived', 'Derived',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS',
        'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS',
        'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS',
        'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS', 'ED_EVENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'ADMISSIONS', 'ADMISSIONS', 'Derived'
    ]
}

df_ref002 = pd.DataFrame(data_ref002)

#     "Derived" means that the variable is often calculated from other variables
#    "ED_EVENTS" is a placeholder for the table containing Emergency Department events

In [232]:
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
df_ref002_lab = df_ref002[df_ref002['Table']=="LABEVENTS"].copy()
df_lab_desc['LABEL'] = df_lab_desc['LABEL'].str.upper()
df_ref002_lab['Exists'] = df_ref002_lab['Variable'].apply(lambda x: 'yes' if x in df_lab_desc['LABEL'].values else 'no')


In [264]:
df_ref002_lab[df_ref002_lab['Exists']=='yes']


df_ref002_lab['Variable'].values
df_ref002

Unnamed: 0,Variable,TYPE_OF_VARIABLE,Table
0,AGE,Demographic,PATIENTS
1,DOBUTAMINE (Y/N),Medication,PRESCRIPTIONS
2,DOPAMINE (Y/N),Medication,PRESCRIPTIONS
3,EPINEPHRINE (Y/N),Medication,PRESCRIPTIONS
4,NOREPINEPHRINE (Y/N),Medication,PRESCRIPTIONS
5,SYSTOLIC BP x AGE,Engineered Feature,Derived
6,SHOCK INDEX (HR/SYSTOLIC BP),Engineered,Derived
7,SHOCK INDEX x AGE,Engineered,Derived
8,ANION GAP,Laboratory Result,LABEVENTS
9,ALBUMIN,Laboratory Result,LABEVENTS


In [263]:
words_to_check = ['LYMPHOCYTE', 'MONOCYTE', 'PLATE', 'NEUTROPHIL', 'CALCIUM', 'UREA NITROGEN', 'LACTATE', 'WBC']
other = ['ANION GAP', 'ALBUMIN', 'ALKALINE PHOSPHATASE',
       'AMINOTRANSFERASE', 'BANDS', 'BILIRUBIN',
       'UREA NITROGEN', 'CALCIUM', 'CREATININE',
       'ESTIMATED GFR (MDRD EQUATION)', 'GLUCOSE', 'LACTATE',
       'LYMPHOCYTE', 'MONOCYTE', 'NEUTROPHIL', 'PLATELET',
       'WHITE BLOOD CELLS', 'WBC']



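def checkwords(df, words_to_check):
    """Return the rows of df whose LABEL contains any of the given words."""
    filtered_dfs = []
    for word in words_to_check:
        # use the df argument rather than the global df_lab_desc;
        # regex=False treats parentheses literally, na=False skips missing labels
        filtered_df = df[df['LABEL'].str.contains(word, regex=False, na=False)]
        filtered_dfs.append(filtered_df)
    df_query_labels = pd.concat(filtered_dfs).drop_duplicates()
    return df_query_labels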

checkwords(df_lab_desc, words_to_check).shape
checkwords(df_lab_desc, other).shape
#df_query_labels

#GLOMERULAR FILTRATION RATE


(106, 7)

In [268]:
df_lab_desc['ref002'] = df_lab_desc['LABEL'].apply(lambda x: 'yes' if any(feature in x for feature in df_ref002_lab['Variable']) else 'no')

df_lab_desc[df_lab_desc['ref002']=='yes']['LABEL'].values




array(['HYPERSEGMENTED NEUTROPHILS', 'MONOCYTES', 'ATYPICAL LYMPHOCYTES',
       'BANDS', 'LYMPHOCYTES', 'MONOCYTES', 'ATYPICAL LYMPHOCYTES',
       'BANDS', 'LYMPHOCYTES', 'ATYPICAL LYMPHOCYTES', 'BANDS',
       'LYMPHOCYTES', 'BILIRUBIN', 'BILIRUBIN CRYSTALS',
       'CALCIUM CARBONATE CRYSTALS', 'CALCIUM OXALATE CRYSTALS',
       'CALCIUM PHOSPHATE CRYSTALS', 'FREE CALCIUM', 'GLUCOSE',
       'ALBUMIN, ASCITES', 'BILIRUBIN, TOTAL, ASCITES',
       'CREATININE, ASCITES', 'GLUCOSE, ASCITES',
       'UREA NITROGEN, ASCITES', 'ALBUMIN', 'ALKALINE PHOSPHATASE',
       'ANION GAP', 'ANTI-NEUTROPHIL CYTOPLASMIC ANTIBODY',
       'ASPARATE AMINOTRANSFERASE (AST)', 'BILIRUBIN, DIRECT',
       'BILIRUBIN, INDIRECT', 'BILIRUBIN, TOTAL', 'CALCIUM, TOTAL',
       'CREATININE', 'ESTIMATED GFR (MDRD EQUATION)', 'GLUCOSE',
       'UREA NITROGEN', '<ALBUMIN>', 'BILIRUBIN, TOTAL, CSF',
       'GLUCOSE, CSF', 'ALBUMIN, JOINT FLUID', 'CREATININE, JOINT FLUID',
       'GLUCOSE, JOINT FLUID', 'ALBUMIN, B

In [236]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('PLATE', case=False)]
# There are 4 items relating to platelets, neutrophils

df_lab_desc[df_lab_desc['LABEL'].str.contains('MONOCYTE', case=False)]
# There are 5 monocytes. 
# We would likely be interested in those returned from Blood fluids
# 51253, 51254

df_lab_desc[df_lab_desc['LABEL'].str.contains('LYMPHOCYTE', case=False)]
# thirteen lymphocytes


df_lab_desc[df_lab_desc['LABEL'].str.contains('LACTATE', case=False)]
# cannot find any abbreviations of LACTIC ACID
# turns out it's just Lactate


df_lab_desc[df_lab_desc['LABEL'].str.contains('AMINOTRANSFERASE', case=False)]
# need to clean?
# should be ASPARTATE, not ASPARATE
#df_lab_desc['LABEL'] = df_lab_desc['LABEL'].replace('ASPARATE AMINOTRANSFERASE (AST)', 'ASPARTATE AMINOTRANSFERASE (AST)')


df_lab_desc[df_lab_desc['LABEL'].str.contains('GLUCOSE', case=False)]

# ESTIMATED GFR (MDRD EQUATION) 
# 50920


 





Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE
136,10,50809,GLUCOSE,Blood,Blood Gas,2339-0
169,43,50842,"GLUCOSE, ASCITES",Ascites,Chemistry,2347-3
258,132,50931,GLUCOSE,Blood,Chemistry,2345-7
340,214,51014,"GLUCOSE, CSF",Cerebrospinal Fluid (CSF),Chemistry,2342-4
348,222,51022,"GLUCOSE, JOINT FLUID",Joint Fluid,Chemistry,2348-1
360,234,51034,"GLUCOSE, BODY FLUID",Other Body Fluid,Chemistry,2344-0
379,253,51053,"GLUCOSE, PLEURAL",Pleural,Chemistry,2346-5
410,284,51084,"GLUCOSE, URINE",Urine,Chemistry,2350-7
677,678,51478,GLUCOSE,Urine,Hematology,5792-7
728,729,51529,ESTIMATED ACTUAL GLUCOSE,BLOOD,CHEMISTRY,
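
Since the blood measurements are the ones of interest, the keyword search can be combined with a fluid filter. A minimal sketch on four of the GLUCOSE rows shown above; `find_lab_items` is a hypothetical helper, not part of the notebook. The comparison on `FLUID` is case-insensitive because the table mixes `Blood` and `BLOOD`.

```python
import pandas as pd

def find_lab_items(desc, keyword, fluid=None):
    """Rows whose LABEL contains keyword, optionally restricted to one FLUID."""
    mask = desc['LABEL'].str.contains(keyword, case=False, regex=False, na=False)
    if fluid is not None:
        # compare in upper case so 'Blood' and 'BLOOD' are treated alike
        mask &= desc['FLUID'].str.upper().eq(fluid.upper())
    return desc.loc[mask, ['ITEMID', 'LABEL', 'FLUID', 'CATEGORY']]

# Four rows copied from the GLUCOSE listing above
desc = pd.DataFrame({
    'ITEMID': [50809, 50842, 50931, 51529],
    'LABEL': ['GLUCOSE', 'GLUCOSE, ASCITES', 'GLUCOSE', 'ESTIMATED ACTUAL GLUCOSE'],
    'FLUID': ['Blood', 'Ascites', 'Blood', 'BLOOD'],
    'CATEGORY': ['Blood Gas', 'Chemistry', 'Chemistry', 'CHEMISTRY'],
})

blood_glucose = find_lab_items(desc, 'glucose', fluid='Blood')
blood_glucose_ids = blood_glucose['ITEMID'].tolist()
print(blood_glucose_ids)
```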


In [None]:
# BMI
# Assuming height is in cm
df_admissions['BMI'] = df_admissions['WEIGHT'] / (df_admissions['HEIGHT'] / 100)**2

# Shock Index 
# assuming chartevents has 'HEART RATE' and 'SYSTOLIC BP'
chartevents['SHOCK INDEX'] = chartevents['HEART RATE'] / chartevents['SYSTOLIC BP']



def calc_gfr(scr, age, gender, ethnicity):
    """Calculate the estimated glomerular filtration rate (eGFR)
    based on the MDRD equation:
    GFR = 175 × (Scr)^-1.154 × (Age)^-0.203 × (0.742 if female) × (1.212 if AA)

    Arguments:
        scr: Patient's serum creatinine level in mg/dL as float
        age: Patient's age in years as integer
        gender: Patient's gender, "F" or "M", as string
        ethnicity: Patient's ethnicity as string ("African American" applies the 1.212 factor)

    Returns:
        egfr: Patient's eGFR in mL/min/1.73 m^2, rounded to two decimal places
              (0 if scr or age is 0).
    """
    # https://github.com/chapmanbe/shilpy_6018_2017_term_project
    if scr == 0.0 or age == 0.0:
        return 0

    # base term shared by every case, then the optional multipliers
    egfr = 175 * scr**(-1.154) * age**(-0.203)
    if gender == 'F':
        egfr *= 0.742
    if ethnicity == 'African American':
        egfr *= 1.212
    return round(egfr, 2)

# Test time calculations

Calculate, for each admission, how many tests are taken overall, the time between consecutive tests, and the average time between tests.

This differs from length of stay: it is the time between the first and the last test done.
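
A minimal sketch of these per-admission quantities on a toy frame. The `HADM_ID` and `CHARTTIME` column names follow `LABEVENTS`; the rows and the output column names (`n_tests`, `span_days`, `mean_gap_days`) are invented for illustration. The mean gap relies on the fact that the gaps between consecutive sorted times add up to the first-to-last span, so it equals the span divided by the number of gaps.

```python
import pandas as pd

# Toy lab events (invented values)
toy_lab = pd.DataFrame({
    'HADM_ID': [1, 1, 1, 2, 2, 3],
    'CHARTTIME': pd.to_datetime([
        '2101-10-12 16:00:00', '2101-10-13 16:00:00', '2101-10-15 16:00:00',
        '2101-11-01 08:00:00', '2101-11-01 20:00:00',
        '2101-12-01 09:00:00',
    ]),
})

grouped = toy_lab.groupby('HADM_ID')['CHARTTIME']
stats = pd.DataFrame({
    'n_tests': grouped.count(),                                          # rows with a chart time
    'span_days': (grouped.max() - grouped.min()) / pd.Timedelta(days=1),  # first to last test
})

# With a single test there is no span and no gap
stats.loc[stats['n_tests'] < 2, 'span_days'] = float('nan')
stats['mean_gap_days'] = stats['span_days'] / (stats['n_tests'] - 1)
print(stats)
```

Here `n_tests` counts rows, so several lab items drawn at the same `CHARTTIME` each count once; deduplicating the chart times first would instead count distinct draw times.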

In [15]:
!ls ../data

ADMISSIONS.csv                          MICROBIOLOGYEVENTS.csv
DIAGNOSES_ICD.csv                       PATIENTS.csv
D_CPT.csv                               datapro
D_ICD_DIAGNOSES.csv                     demographic.csv
D_ICD_PROCEDURES.csv                    labevents_transform_parquet
D_ITEMS.csv                             labevents_transform_sepsis_parquet
D_LABITEMS.csv                          labevents_transform_v3_202308260031.csv
LABEVENTS.csv                           output_csv


In [52]:
def computeTdiff(x):
    """Return the seconds between the first and last CHARTTIME (NaN if fewer than two)."""
    times = pd.to_datetime(x['CHARTTIME'].dropna())

    if len(times) > 1:
        return (times.max() - times.min()).total_seconds()
    else:
        return float("nan")

In [14]:
datetime.strptime('2101-10-12 16:06:00', '%Y-%m-%d %H:%M:%S').timestamp()


4158547560.0

In [58]:
#np.nanmean(df_lab.iloc[:10000].groupby('HADM_ID').apply(lambda x:computeTdiff(x)).values())/(24*3600)
df_lab.iloc[:10000].groupby('HADM_ID').apply(lambda x:computeTdiff(x.dropna()))/(24*3600)

HADM_ID
103251.0     0.000000
107064.0    15.914583
109235.0     2.923611
109451.0    12.885417
112213.0    12.672917
118037.0     0.000000
143045.0     6.463194
145834.0    10.510417
150750.0     4.939583
157681.0     4.754861
159514.0     3.875000
161087.0     3.961806
163353.0     0.015278
178980.0     0.000000
184167.0     6.697917
185777.0     6.616667
188822.0     1.839583
194023.0     3.597222
194540.0    25.347222
dtype: float64

In [5]:
# checking the HADM_ID shows that only one 
df_lab[df_lab.HADM_ID == 178980]

NameError: name 'df_lab' is not defined

In [None]:
def drop_nans()