# Positive Cases

## Explicit Coding of Sepsis using ICD9
* This code identifies explicitly coded sepsis using ICD-9 diagnosis codes
* Specifically, the two codes 995.92 (severe sepsis) and 785.52 (septic shock)
* These codes are highly specific to sepsis but have very low sensitivity
* Iwashyna et al. (validated against chart review): 100% PPV, 9.3% sensitivity, 100% specificity
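
To make the reported rates concrete, here is a minimal sketch of how PPV, sensitivity and specificity follow from confusion-matrix counts. The counts below are hypothetical, chosen only so that the ratios reproduce the figures quoted above; they are not the counts from Iwashyna et al.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, ppv) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true sepsis cases that carry the code
    specificity = tn / (tn + fp)  # share of non-sepsis cases without the code
    ppv = tp / (tp + fp)          # share of coded cases that are true sepsis
    return sensitivity, specificity, ppv

# Hypothetical counts (illustration only, not taken from the paper)
tp, fp, fn, tn = 93, 0, 907, 9000
sens, spec, ppv = diagnostic_metrics(tp, fp, fn, tn)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} ppv={ppv:.3f}")
```

With numbers like these, every coded admission is a true case, but most true cases are never coded, which is why the explicit codes are used here only to pick positive examples.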

In [1]:
from pyspark.sql.functions import array, lit
from pyspark.sql.types import *

# load diagnoses_icd table
diagnoses = spark.read.csv("s3://mimic-raw/mimic3/diagnoses_icd/DIAGNOSES_ICD.csv", header = True)

# initialize dataframe using ICD9 99592 (severe sepsis); codes are stored without the decimal point
df1 = diagnoses.filter("ICD9_CODE like '99592%'")
# rows coded with ICD9 78552 (septic shock)
df2 = diagnoses.filter("ICD9_CODE like '78552%'")
df = df1.union(df2) # concatenate to create our required df

In [6]:
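# collect distinct admission IDs as plain values (not Row objects) so they serialize cleanly to CSV
sepsis_admissions = [row['HADM_ID'] for row in df.select('HADM_ID').distinct().collect()]
print(len(sepsis_admissions)) # number of admissions with sepsis explicitly coded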

4085


In [12]:
import pandas
pandas.DataFrame({'HADM_ID':sepsis_admissions}).to_csv('sepsis_admissions.csv')

# Negative Cases

### ICD-9 codes for Angus criteria of sepsis

*Angus et al, 2001. Epidemiology of severe sepsis in the United States*

http://www.ncbi.nlm.nih.gov/pubmed/11445675

Select all acute care hospitalizations with ICD-9-CM codes for both:
* bacterial or fungal infectious process AND
* diagnosis of acute organ dysfunction (Appendix 2).
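
The rule above can be sketched in plain Python, independent of Spark. This is a minimal illustration, not the notebook's implementation: the prefix tuples are small subsets of the lists defined below, and `angus_admissions` and `toy_rows` are hypothetical names introduced for the example.

```python
# Subsets of the prefix lists used in this notebook, for illustration only
INFECTION_PREFIXES = ("038", "486", "5990")
DYSFUNCTION_PREFIXES = ("584", "7855", "99592")

def angus_admissions(rows):
    """rows: iterable of (hadm_id, icd9_code) pairs.

    Return the set of hadm_ids that have at least one infection code
    AND at least one organ-dysfunction code.
    """
    infected, dysfunctional = set(), set()
    for hadm_id, code in rows:
        if code is None:  # skip rows without a code
            continue
        if code.startswith(INFECTION_PREFIXES):
            infected.add(hadm_id)
        if code.startswith(DYSFUNCTION_PREFIXES):
            dysfunctional.add(hadm_id)
    return infected & dysfunctional

toy_rows = [
    ("100", "0389"), ("100", "5849"),  # infection + organ dysfunction
    ("200", "486"),                    # infection only
    ("300", "99592"),                  # organ dysfunction only
    ("400", "4019"),                   # neither
]
angus = angus_admissions(toy_rows)
print(sorted(angus))
```

Only admission `100` satisfies both conditions. The two codes may sit on different diagnosis rows, so the test is applied per admission rather than per row.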

## Get ICD codes for infections

In [11]:
# load diagnoses_icd table
diagnoses = spark.read.csv("s3://mimic-raw/mimic3/diagnoses_icd/DIAGNOSES_ICD.csv", header = True)

In [6]:
## generate list of ICD9 codes for infection
firstInfection = '001' 
infectionICD3 = ['002','003','004','005','008', # first 3 characters of ICD9_CODE
   '009','010','011','012','013','014','015','016','017','018',
   '020','021','022','023','024','025','026','027','030','031',
   '032','033','034','035','036','037','038','039','040','041',
   '090','091','092','093','094','095','096','097','098','100',
   '101','102','103','104','110','111','112','114','115','116',
   '117','118','320','322','324','325','420','421','451','461',
   '462','463','464','465','481','482','485','486','494','510',
   '513','540','541','542','566','567','590','597','601','614',
   '615','616','681','682','683','686','730']
infectionICD4 = ['5695','5720','5721','5750','5990','7110','7907','9966','9985','9993'] # first 4 characters of ICD9_CODE
infectionICD5 = ['49121','56201','56203','56211','56213','56983'] # first 5 characters of ICD9_CODE

In [7]:
from pyspark.sql.functions import array, lit
from pyspark.sql.types import *

In [8]:
# list of possible infection_codes excluding 001
list_infections = infectionICD3 + infectionICD4 + infectionICD5

In [9]:
# initialize dataframe using ICD9 001
infection_df = diagnoses.filter("ICD9_CODE like '" + firstInfection + "%'")

# concatenate dataframe for all other relevant ICD codes
for x in list_infections: # iterate through the list of relevant ICD codes
    temp_df = diagnoses.filter("ICD9_CODE like '"+str(x)+"%'") # filter on given code
    infection_df = infection_df.union(temp_df) # concatenate to create our required df

In [10]:
print(infection_df.count()) # number of diagnosis rows with an infection code (one admission can contribute several rows)

43110
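
The loop above adds one `union` per prefix. One possible alternative is to fold all prefixes into a single anchored regular expression and filter once. The sketch below builds and checks such a pattern in plain Python; `prefix_pattern` is a hypothetical helper, and the sample codes are made up for the demonstration.

```python
import re

def prefix_pattern(prefixes):
    """Build one anchored regex matching any code that starts with a listed prefix."""
    return "^(" + "|".join(re.escape(p) for p in prefixes) + ")"

pattern = prefix_pattern(["001", "5990", "49121"])
codes = ["0019", "59900", "49121", "4019", "10019"]
matches = [c for c in codes if re.match(pattern, c)]
print(pattern)
print(matches)
```

In Spark, a pattern of this form could presumably be passed to `Column.rlike`, for example `diagnoses.filter(diagnoses.ICD9_CODE.rlike(pattern))`. That call was not run as part of this notebook, so treat it as an assumption to verify. The `^` anchor matters: without it, `10019` would match on the embedded `001`.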


## Get ICD codes for Organ Dysfunction

In [13]:
## generate list of ICD9 codes for organ dysfunction
firstDysfunction = '458' 
dysfunctionICD3 = ['293','570','584'] # first 3 characters of ICD9_CODE
dysfunctionICD4 = ['7855','3483','3481','2874','2875','2869','2866','5734'] # first 4 characters of ICD9_CODE
dysfunctionICD5 = ['99592','78552'] # first 5 characters of ICD9_CODE

list_dysfunction = dysfunctionICD3 + dysfunctionICD4 + dysfunctionICD5

# initialize dataframe using ICD9 458
dysfunction_df = diagnoses.filter("ICD9_CODE like '" + firstDysfunction + "%'")

# concatenate dataframe for all other relevant ICD codes
for x in list_dysfunction: # iterate through the list of relevant ICD codes
    temp_df = diagnoses.filter("ICD9_CODE like '"+str(x)+"%'") # filter on given code
    dysfunction_df = dysfunction_df.union(temp_df) # concatenate to create our required df

In [15]:
print(dysfunction_df.count()) # number of diagnosis rows with an organ dysfunction code (one admission can contribute several rows)

37148


## Identify Hospitalizations with both Infection & Organ Dysfunction

In [20]:
infection_admissions = infection_df.select('HADM_ID').collect()
dysfunction_admissions = dysfunction_df.select('HADM_ID').collect()

In [25]:
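# build a set once so each membership test is a hash lookup instead of a scan of the whole list
dysfunction_set = set(dysfunction_admissions)

# keep admissions that have both infection and organ dysfunction
sepsis_admissions = [admission for admission in infection_admissions
                     if admission in dysfunction_set]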

In [1]:
sepsis_set = set(sepsis_admissions) # computed once, not on every loop iteration
all_admissions = set(diagnoses.select('HADM_ID').collect())
print(len(sepsis_set)) # number of admissions w/ severe sepsis
print(len(all_admissions)) # number of admissions

# every admission that does not meet the Angus criteria, stored as plain HADM_ID values
not_sepsis_admissions = sorted(row['HADM_ID'] for row in all_admissions - sepsis_set)

12441
58976
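
The negative set is the complement of the Angus-positive set within all admissions. A minimal sketch with toy IDs (all values below are made up except the two counts printed above):

```python
# Toy admission IDs, for illustration only
all_admissions = {"100", "200", "300", "400", "500"}
angus_positive = {"100", "300"}

# Negatives are every admission NOT flagged by the Angus criteria
not_sepsis = all_admissions - angus_positive
print(sorted(not_sepsis))

# Because the flagged set is a subset of all admissions, the sizes subtract
assert len(not_sepsis) == len(all_admissions) - len(angus_positive)

# Same arithmetic with the counts printed above
expected_negatives = 58976 - 12441
print(expected_negatives)
```

The second printed value, 46535, is the number of negative examples this construction should yield.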


In [2]:
import pandas

pandas.DataFrame({'HADM_ID':not_sepsis_admissions}).to_csv('not_sepsis_admissions.csv')

# Combine sepsis and not sepsis examples for set of labeled admissions

### Identify sepsis & non sepsis admissions

In [1]:
import pandas

# merge positive & negative cases and add labels
sepsis = pandas.read_csv("sepsis_admissions.csv")
sepsis['label'] = 1
not_sepsis = pandas.read_csv("not_sepsis_admissions.csv")
not_sepsis['label'] = 0
df = pandas.concat([sepsis, not_sepsis])

#### Retrieve relevant free text fields for NLP
The DIAGNOSIS column provides a preliminary, free-text diagnosis for the patient on hospital admission. The diagnosis is usually assigned by the admitting clinician and does not use a systematic ontology. As of MIMIC-III v1.0 there were 15,693 distinct diagnoses for 58,976 admissions. The diagnoses can be very informative (e.g., chronic kidney failure) or quite vague (e.g., weakness).

In [4]:
df.label.value_counts()

0    46535
1     4085
Name: label, dtype: int64
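
Positives come from the explicit codes, while negatives are the complement of the Angus set. An admission with an explicit sepsis code but no infection code could therefore, in principle, land in both files. Whether this happens in the data is not shown here. The sketch below is one way to check for IDs carrying both labels; `conflicting_ids` and the toy rows are hypothetical.

```python
from collections import defaultdict

def conflicting_ids(labeled_rows):
    """labeled_rows: iterable of (hadm_id, label). Return IDs seen with more than one label."""
    labels_by_id = defaultdict(set)
    for hadm_id, label in labeled_rows:
        labels_by_id[hadm_id].add(label)
    return sorted(h for h, labels in labels_by_id.items() if len(labels) > 1)

# Toy rows, for illustration only
toy = [("100", 1), ("200", 0), ("300", 1), ("300", 0)]
conflicts = conflicting_ids(toy)
print(conflicts)
```

On the real table, a pandas equivalent would be along the lines of `df.groupby('HADM_ID').label.nunique()` and looking for values above 1.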

In [7]:
print(df.columns.values)
df.to_csv("sepsis_and_not_sepsis_admissions.csv")

['Unnamed: 0' 'HADM_ID' 'label']
