## 1. This patient may have sepsis
<p>Sepsis is a deadly syndrome in which a patient has a severe infection that causes organ failure. The sooner septic patients are treated, the more likely they are to survive, but sepsis can be challenging to recognize. It may be possible to use hospital data to develop machine learning models that flag patients who are likely to be septic. However, before we develop predictive algorithms, we need a reliable method to identify which patients are septic. One component of sepsis is a severe infection.</p>
<p>In this project, we will use two weeks of hospital electronic health record (EHR) data to find out which patients had a severe infection according to four criteria. We will look into the data to see if a doctor ordered a blood test to look for bacteria (a blood culture) and gave the patient a series of intravenous antibiotics.</p>
<p>Let's get started!</p>

In [1]:
# Load packages
library(data.table)

# Read in the data
antibioticDT <- fread('datasets/antibioticDT.csv')
# Check the class of antibioticDT
class(antibioticDT)

# Look at the first 30 rows
head(antibioticDT , 30)

patient_id,day_given,antibiotic_type,route
1,2,ciprofloxacin,IV
1,4,ciprofloxacin,IV
1,6,ciprofloxacin,IV
1,7,doxycycline,IV
1,9,doxycycline,IV
1,15,penicillin,IV
1,16,doxycycline,IV
1,18,ciprofloxacin,IV
8,1,doxycycline,PO
8,2,penicillin,IV


## 2. Which antibiotics are "new"?
<p>These data represent all antibiotics administered in a hospital over two weeks. Each row represents one time a patient was given an antibiotic. The variables include the patient identification number, the day the drug was administered, the name of the antibiotic, and how it was administered. For example, patient "8" received doxycycline by mouth on the first day of their stay.</p>
<p>We will identify patients with a serious infection using the following criteria. </p>
<p><strong>Criteria for Suspected Infection</strong><a href="https://www.ncbi.nlm.nih.gov/pubmed/28903154">*</a></p>
<ol>
<li>The patient receives antibiotics for a sequence of four days, with gaps of one day allowed.</li>
<li>The sequence must start with a new antibiotic, defined as an antibiotic type that was not given in the previous two days.</li>
<li>The sequence must start within two days of a blood culture.</li>
<li>There must be at least one intravenous (I.V.) antibiotic within the +/-2 day window.</li>
</ol>
<p>Let's start with the second item by finding which rows represent "new antibiotics". We will determine if each antibiotic was given to the patient in the prior two days. We'll visualize this task by looking at the data sorted by id, antibiotic type, and day.</p>
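<p>As a minimal sketch of the idea, assuming a small made-up table (<code>toyDT</code>, its values, and the column names <code>previous_day</code> and <code>is_new</code> are hypothetical, not taken from the hospital data), the previous administration day of the same drug can be looked up with <code>shift()</code> inside each patient/drug group. An administration then counts as new when there is no earlier dose or the gap is longer than two days.</p>

```r
library(data.table)

# Hypothetical toy data: one patient, two drugs
toyDT <- data.table(
  patient_id      = c(1, 1, 1, 1),
  day_given       = c(2, 4, 9, 3),
  antibiotic_type = c("ciprofloxacin", "ciprofloxacin", "ciprofloxacin", "doxycycline")
)
setorder(toyDT, patient_id, antibiotic_type, day_given)

# Previous day on which the same patient received the same drug (NA if none)
toyDT[, previous_day := shift(day_given, 1), by = .(patient_id, antibiotic_type)]

# New unless the same drug was given within the previous two days
toyDT[, is_new := as.numeric(is.na(previous_day) | day_given - previous_day > 2)]
print(toyDT)
```

<p>Sorting before calling <code>shift()</code> matters here, because the lag is taken in row order within each group.</p>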

In [2]:
# Work on a copy so that sorting does not modify antibioticDT by reference
antibioticDT_copy <- copy(antibioticDT)

head(antibioticDT_copy)

# Sort the copy by patient, antibiotic type, and day, then examine the first 40 rows
setorder(antibioticDT_copy , patient_id , antibiotic_type , day_given)
head(antibioticDT_copy , 40)


patient_id,day_given,antibiotic_type,route
1,2,ciprofloxacin,IV
1,4,ciprofloxacin,IV
1,6,ciprofloxacin,IV
1,7,doxycycline,IV
1,9,doxycycline,IV
1,15,penicillin,IV


patient_id,day_given,antibiotic_type,route
1,2,ciprofloxacin,IV
1,4,ciprofloxacin,IV
1,6,ciprofloxacin,IV
1,18,ciprofloxacin,IV
1,7,doxycycline,IV
1,9,doxycycline,IV
1,16,doxycycline,IV
1,15,penicillin,IV
8,1,doxycycline,PO
8,3,doxycycline,IV


In [3]:
# Use shift to calculate the last day a particular drug was administered
antibioticDT_copy[ , last_administration_day := shift(day_given , 1),
                 by = .(patient_id , antibiotic_type)]

head(antibioticDT_copy , 30)

patient_id,day_given,antibiotic_type,route,last_administration_day
1,2,ciprofloxacin,IV,
1,4,ciprofloxacin,IV,2
1,6,ciprofloxacin,IV,4
1,18,ciprofloxacin,IV,6
1,7,doxycycline,IV,
1,9,doxycycline,IV,7
1,16,doxycycline,IV,9
1,15,penicillin,IV,
8,1,doxycycline,PO,
8,3,doxycycline,IV,1


In [4]:
# Calculate the number of days since the drug was last administered
antibioticDT_copy[ , days_since_last_admin := day_given - last_administration_day]
head(antibioticDT_copy , 200)

patient_id,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin
1,2,ciprofloxacin,IV,,
1,4,ciprofloxacin,IV,2,2
1,6,ciprofloxacin,IV,4,2
1,18,ciprofloxacin,IV,6,12
1,7,doxycycline,IV,,
1,9,doxycycline,IV,7,2
1,16,doxycycline,IV,9,7
1,15,penicillin,IV,,
8,1,doxycycline,PO,,
8,3,doxycycline,IV,1,2


In [5]:
# Create antibiotic_new_col with an initial value of one, then reset it to zero
# when the same drug was given within the previous two days
antibioticDT_copy[ , antibiotic_new_col := 1]
antibioticDT_copy[days_since_last_admin <= 2 , antibiotic_new_col := 0]

antibioticDT <- antibioticDT_copy
head(antibioticDT , 40)


patient_id,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new_col
1,2,ciprofloxacin,IV,,,1
1,4,ciprofloxacin,IV,2,2,0
1,6,ciprofloxacin,IV,4,2,0
1,18,ciprofloxacin,IV,6,12,1
1,7,doxycycline,IV,,,1
1,9,doxycycline,IV,7,2,0
1,16,doxycycline,IV,9,7,1
1,15,penicillin,IV,,,1
8,1,doxycycline,PO,,,1
8,3,doxycycline,IV,1,2,0


## 3. Looking at the blood culture data
<p>Now let's look at blood culture data from the same two-week period in this hospital. These data are in <code>blood_cultureDT.csv</code>. Let's start by reading it into the workspace and having a look at a few rows. </p>
<p>Each row represents one blood culture and gives the patient's id and the day the blood culture test occurred. For example, patient "8" had a blood culture on the second day of their hospitalization and again on the thirteenth day. Notice that some patients from the antibiotic dataset are not in this dataset and vice versa. Some patients are in neither because they received neither antibiotics nor a blood culture.</p>

In [6]:
# Read in blood_cultureDT.csv
blood_cultureDT <- fread('datasets/blood_cultureDT.csv')

# Print the first 30 rows
head(blood_cultureDT , 30)


patient_id,blood_culture_day
1,3
1,13
8,2
8,13
23,3
39,10
45,4
45,9
45,11
51,3


## 4. Combine the antibiotic data and the blood culture data
<p>To find which antibiotics were given close to a blood culture test, we need to combine the drug administration data with the blood culture data. We'll keep only patients that are still candidates for infection&mdash;only those in both data sets.</p>
<p>A challenge with the data is that some patients had blood cultures on several different days. For each of those days, we will see if there is a sequence of antibiotic days close to them. To accomplish this, in the merge we will match each blood culture to each antibiotic day.</p>
<p>After sorting the data following the merge, you will see that each patient's antibiotic sequence repeats for each blood culture day. This repetition allows us to look at each blood culture day and check if it is associated with a qualifying sequence of antibiotics.</p>
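<p>The repetition described above can be seen in a minimal sketch on made-up data (the tables <code>abxDT</code>, <code>bcxDT</code>, and <code>pairedDT</code> and their values are hypothetical). Joining on <code>patient_id</code> alone pairs every antibiotic day with every blood culture day for that patient; <code>allow.cartesian = TRUE</code> is included on the assumption that such a many-to-many match is intended.</p>

```r
library(data.table)

# Hypothetical toy data: one patient with three antibiotic days and two cultures
abxDT <- data.table(patient_id = c(1, 1, 1), day_given = c(2, 4, 6))
bcxDT <- data.table(patient_id = c(1, 1), blood_culture_day = c(3, 13))

# Inner join on patient_id; every antibiotic day is matched to every culture day
pairedDT <- merge(abxDT, bcxDT, by = "patient_id", allow.cartesian = TRUE)
setorder(pairedDT, patient_id, blood_culture_day, day_given)
print(pairedDT)
```

<p>With three antibiotic days and two cultures, the result has six rows: the antibiotic sequence appears once per blood culture day.</p>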

In [7]:
# Merge antibioticDT with blood_cultureDT
combinedDT <- merge(antibioticDT , blood_cultureDT , by = "patient_id")

# Sort by patient_id, blood_culture_day, day_given, and antibiotic_type
setorder(combinedDT , patient_id , blood_culture_day , day_given , antibiotic_type)

# Print and examine the first 30 rows
head(combinedDT , 30)

patient_id,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new_col,blood_culture_day
1,2,ciprofloxacin,IV,,,1,3
1,4,ciprofloxacin,IV,2,2,0,3
1,6,ciprofloxacin,IV,4,2,0,3
1,7,doxycycline,IV,,,1,3
1,9,doxycycline,IV,7,2,0,3
1,15,penicillin,IV,,,1,3
1,16,doxycycline,IV,9,7,1,3
1,18,ciprofloxacin,IV,6,12,1,3
1,2,ciprofloxacin,IV,,,1,13
1,4,ciprofloxacin,IV,2,2,0,13


## 5. Determine whether each row is in-window
<p>Now that we have the antibiotic and blood culture data combined, we can test each drug administration against each blood culture to see if it's "in the window," meaning that the drug was given within two days before or after the blood culture.</p>
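<p>A minimal sketch of such a window test, assuming the +/-2 day window from the criteria and using made-up values (<code>toyDT</code> and <code>in_window</code> are hypothetical names): the absolute difference between the two days is compared with two.</p>

```r
library(data.table)

# Hypothetical toy data: four antibiotic days compared with a culture on day 3
toyDT <- data.table(day_given = c(1, 2, 5, 6), blood_culture_day = 3)

# 1 if the drug was given within two days before or after the culture, else 0
toyDT[, in_window := as.numeric(abs(day_given - blood_culture_day) <= 2)]
print(toyDT)
```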

In [8]:
# Make a new variable called drug_in_bcx_window:
# 1 if the drug was given within two days of the blood culture, 0 otherwise
combinedDT[ , drug_in_bcx_window := as.numeric(abs(day_given - blood_culture_day) <= 2)]
head(combinedDT , 40)


patient_id,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new_col,blood_culture_day,drug_in_bcx_window
1,2,ciprofloxacin,IV,,,1,3,1
1,4,ciprofloxacin,IV,2,2,0,3,1
1,6,ciprofloxacin,IV,4,2,0,3,0
1,7,doxycycline,IV,,,1,3,0
1,9,doxycycline,IV,7,2,0,3,0
1,15,penicillin,IV,,,1,3,0
1,16,doxycycline,IV,9,7,1,3,0
1,18,ciprofloxacin,IV,6,12,1,3,0
1,2,ciprofloxacin,IV,,,1,13,0
1,4,ciprofloxacin,IV,2,2,0,13,0


## 6. Check the I.V. requirement
<p>Now let's look at the fourth item in the criteria. </p>
<p><strong>Criteria for Suspected Infection</strong><a href="https://www.ncbi.nlm.nih.gov/pubmed/28903154">*</a></p>
<ol>
<li>The patient receives antibiotics for a sequence of four days, with gaps of one day allowed.</li>
<li>The sequence must start with a new antibiotic, defined as an antibiotic type that was not given in the previous two days.</li>
<li>The sequence must start within two days of a blood culture.</li>
<li><em>There must be at least one intravenous (I.V.) antibiotic within the +/-2 day window.</em></li>
</ol>
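<p>One way to express the fourth criterion, sketched on made-up data (<code>toyDT</code>, <code>in_window</code>, <code>any_iv</code>, and <code>keptDT</code> are hypothetical names), is to call <code>any()</code> within each patient/blood culture group, so that every row of a group carries the same flag.</p>

```r
library(data.table)

# Hypothetical toy data: two patient/culture groups
toyDT <- data.table(
  patient_id        = c(1, 1, 2, 2),
  blood_culture_day = c(3, 3, 5, 5),
  route             = c("PO", "IV", "PO", "IV"),
  in_window         = c(1, 1, 1, 0)
)

# Flag groups that have at least one I.V. drug given inside the window
toyDT[, any_iv := as.numeric(any(route == "IV" & in_window == 1)),
      by = .(patient_id, blood_culture_day)]

# Keep only the groups that satisfy the requirement
keptDT <- toyDT[any_iv == 1]
print(keptDT)
```

<p>Patient 2 is dropped in this sketch because the only I.V. dose falls outside the window.</p>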

In [9]:
# Create a variable indicating if there was at least one I.V. drug given in the window
combinedDT[ , any_iv_in_bcx_window := as.numeric(any(route == "IV" & drug_in_bcx_window == 1)),
           by = .(patient_id , blood_culture_day)]


In [10]:
# Exclude rows in which the blood_culture_day does not have any I.V. drugs in window
combinedDT <- combinedDT[any_iv_in_bcx_window == 1]


In [11]:
head(combinedDT , 30)

patient_id,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new_col,blood_culture_day,drug_in_bcx_window,any_iv_in_bcx_window
1,2,ciprofloxacin,IV,,,1,3,1,1
1,4,ciprofloxacin,IV,2,2,0,3,1,1
1,6,ciprofloxacin,IV,4,2,0,3,0,1
1,7,doxycycline,IV,,,1,3,0,1
1,9,doxycycline,IV,7,2,0,3,0,1
1,15,penicillin,IV,,,1,3,0,1
1,16,doxycycline,IV,9,7,1,3,0,1
1,18,ciprofloxacin,IV,6,12,1,3,0,1
1,2,ciprofloxacin,IV,,,1,13,0,1
1,4,ciprofloxacin,IV,2,2,0,13,0,1


## 7. Find the first day of possible sequences
<p>We're getting close! Let's review the criteria again.</p>
<p><strong>Criteria for Suspected Infection</strong><a href="https://www.ncbi.nlm.nih.gov/pubmed/28903154">*</a></p>
<ol>
<li>The patient receives antibiotics for a sequence of four days, with gaps of one day allowed.</li>
<li>The sequence must start with a new antibiotic, defined as an antibiotic type that was not given in the previous two days.</li>
<li>The sequence must start within two days of a blood culture.</li>
<li>There must be at least one intravenous (I.V.) antibiotic within the +/-2 day window.</li>
</ol>
<p>Let's assess the first criterion by finding the first day of possible 4-day qualifying sequences.    </p>
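<p>A minimal sketch of this step on made-up data (<code>toyDT</code>, <code>is_new</code>, <code>in_window</code>, <code>first_new_day</code>, and <code>keptDT</code> are hypothetical names): within each patient/blood culture group, take the earliest day on which a new antibiotic was given inside the window, then keep only the rows on or after that day. The sketch assumes the rows are already sorted by day within each group.</p>

```r
library(data.table)

# Hypothetical toy data: one patient/culture group, sorted by day
toyDT <- data.table(
  patient_id        = 1,
  blood_culture_day = 3,
  day_given         = c(1, 2, 4, 5),
  is_new            = c(0, 1, 0, 1),
  in_window         = c(1, 1, 1, 1)
)

# Earliest in-window day with a new antibiotic (NA if the group has none)
toyDT[, first_new_day := day_given[is_new == 1 & in_window == 1][1],
      by = .(patient_id, blood_culture_day)]

# Drop administrations that precede the start of the candidate sequence
keptDT <- toyDT[day_given >= first_new_day]
print(keptDT)
```

<p>Groups without any new in-window antibiotic get <code>NA</code> and are dropped by the filter, since <code>data.table</code> treats <code>NA</code> in a row filter as <code>FALSE</code>.</p>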

In [12]:
# Create a new variable called day_of_first_new_abx_in_window:
# the earliest in-window day on which a new antibiotic was given
combinedDT[ , day_of_first_new_abx_in_window :=
             day_given[antibiotic_new_col == 1 & drug_in_bcx_window == 1][1],
           by = .(patient_id , blood_culture_day)]

# Remove rows where the day is before this first qualifying day
combinedDT <- combinedDT[day_given >= day_of_first_new_abx_in_window]
head(combinedDT , 30)

patient_id,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new_col,blood_culture_day,drug_in_bcx_window,any_iv_in_bcx_window,day_of_first_new_abx_in_window
1,2,ciprofloxacin,IV,,,1,3,1,1,2
1,4,ciprofloxacin,IV,2,2,0,3,1,1,2
1,6,ciprofloxacin,IV,4,2,0,3,0,1,2
1,7,doxycycline,IV,,,1,3,0,1,2
1,9,doxycycline,IV,7,2,0,3,0,1,2
1,15,penicillin,IV,,,1,3,0,1,2
1,16,doxycycline,IV,9,7,1,3,0,1,2
1,18,ciprofloxacin,IV,6,12,1,3,0,1,2
1,15,penicillin,IV,,,1,13,1,1,15
1,16,doxycycline,IV,9,7,1,13,0,1,15


## 8. Simplify the data
<p>The first criterion is: <em>The patient receives antibiotics for a sequence of four days, with gaps of one day allowed.</em></p>
<p>We've pinned down the first day of possible sequences in the previous task. Now we have to check for four-day sequences. We don't need the drug type (name); we need the days the antibiotics were administered.</p>

In [13]:
# Create a new data.table containing only patient_id, blood_culture_day, and day_given
simplified_data <- combinedDT[ , .(patient_id , blood_culture_day , day_given)]
head(simplified_data , 30)

patient_id,blood_culture_day,day_given
1,3,2
1,3,4
1,3,6
1,3,7
1,3,9
1,3,15
1,3,16
1,3,18
1,13,15
1,13,16


In [14]:
# Remove duplicate rows
simplified_data <- unique(simplified_data)
head(simplified_data , 40)

patient_id,blood_culture_day,day_given
1,3,2
1,3,4
1,3,6
1,3,7
1,3,9
1,3,15
1,3,16
1,3,18
1,13,15
1,13,16


## 9. Extract first four rows for each blood culture
<p>To check for four-day sequences, let's pull out the first four days (rows) for each patient/blood culture combination. Some patients will have fewer than four antibiotic days. We'll remove them first.</p>
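<p>As a minimal sketch on made-up data (<code>toyDT</code>, <code>group</code>, <code>n_days</code>, and <code>first_four</code> are hypothetical names), <code>.N</code> counts the rows in each group and <code>.SD[1:4]</code> takes the first four rows of each group.</p>

```r
library(data.table)

# Hypothetical toy data: group "a" has five antibiotic days, group "b" only three
toyDT <- data.table(
  group     = c(rep("a", 5), rep("b", 3)),
  day_given = c(1, 2, 3, 4, 5, 1, 2, 3)
)

# Count the days per group and drop groups with fewer than four
toyDT[, n_days := .N, by = group]
toyDT <- toyDT[n_days >= 4]

# .SD is the subset of rows for the current group; keep its first four rows
first_four <- toyDT[, .SD[1:4], by = group]
print(first_four)
```

<p>Filtering first matters: on a group with only three rows, <code>.SD[1:4]</code> would return a fourth row filled with <code>NA</code>.</p>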

In [15]:
# Count the antibiotic days within each patient/blood culture day combination
simplified_data[ , num_antibiotic_days := .N , by = .(patient_id , blood_culture_day)]

# Remove blood culture days with fewer than four rows
simplified_data <- simplified_data[num_antibiotic_days >= 4]

# Select the first four days for each blood culture
first_four_days <- simplified_data[ , .SD[1:4] , by = .(patient_id , blood_culture_day)]

# Preview the first patient/blood culture group
head(first_four_days , 4)

patient_id,blood_culture_day,day_given,num_antibiotic_days
1,3,2,8
1,3,4,8
1,3,6,8
1,3,7,8


## 10. Consecutive sequence
<p>Now we need to check whether each four-day sequence qualifies by having no gaps of more than one day.</p>
<!--"Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed."-->
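<p>A minimal sketch of the gap check on made-up data (<code>toyDT</code>, <code>group</code>, and the values are hypothetical): <code>diff()</code> returns the gaps between successive days, and a sequence with at most one skipped day between doses has a largest gap below three.</p>

```r
library(data.table)

# Hypothetical toy data: two four-day sequences
toyDT <- data.table(
  group     = c("a", "a", "a", "a", "b", "b", "b", "b"),
  day_given = c(2, 4, 6, 7, 1, 2, 10, 17)
)

# Gaps for "a" are 2, 2, 1 (qualifies); gaps for "b" are 1, 8, 7 (does not)
toyDT[, four_in_seq := as.numeric(max(diff(day_given)) < 3), by = group]
print(toyDT)
```

<p>Because <code>max()</code> reduces the gaps to a single value per group, the result is recycled to every row of that group, so no row is lost.</p>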

In [16]:
# Make the indicator for consecutive sequence.
# Within each patient/blood culture group, diff() gives the gaps between successive
# antibiotic days. The sequence qualifies when the largest gap is under three days,
# that is, when at most one day is skipped between doses.
first_four_days[ , four_in_seq := as.numeric(max(diff(day_given)) < 3),
                by = .(patient_id , blood_culture_day)]

head(first_four_days , 4)

patient_id,blood_culture_day,day_given,num_antibiotic_days,four_in_seq
1,3,2,8,1
1,3,4,8,1
1,3,6,8,1
1,3,7,8,1


## 11. Select the patients who meet criteria
<p>A patient meets the criteria if any of their blood cultures was accompanied by a qualifying sequence of antibiotics. Now that we've determined which blood cultures qualify, let's select the patients who meet the criteria.</p>

In [17]:
# Select the rows which have four_in_seq equal to 1
suspected_infection <- first_four_days[four_in_seq == 1]

# Retain only the patient_id column
suspected_infection <- suspected_infection[ , .(patient_id)]

# Remove duplicates
suspected_infection <- unique(suspected_infection)

# Make an infection indicator
suspected_infection[ , infection := 1]

head(suspected_infection , 1)

patient_id,infection
1,1


## 12. Find the prevalence of sepsis
<p>In this project, we used two EHR datasets to flag patients who were suspected of having a serious infection. We also got a <code>data.table</code> workout!</p>
<p>So far, we've been looking at records of all antibiotics administered and blood cultures that occurred over two weeks at a particular hospital. However, not all patients who were hospitalized over this period are represented in <code>combinedDT</code> because not all of them took antibiotics or had blood culture tests. We have to read in and merge the rest of the patient information to see what percentage of patients at the hospital might have had a serious infection.</p>
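<p>The calculation can be sketched on made-up data (<code>everyoneDT</code>, <code>flaggedDT</code>, <code>mergedDT</code>, and <code>prevalence_pct</code> are hypothetical names, and the sketch assumes both tables share a <code>patient_id</code> column): keep every patient in the merge, treat a missing flag as "not infected", and take the mean of the flag.</p>

```r
library(data.table)

# Hypothetical toy data: five patients, two of them flagged
everyoneDT <- data.table(patient_id = 1:5)
flaggedDT  <- data.table(patient_id = c(2L, 5L), infection = 1)

# Keep every patient; unflagged patients get NA for infection
mergedDT <- merge(everyoneDT, flaggedDT, by = "patient_id", all.x = TRUE)

# Replace the missing flags with 0, by reference
mergedDT[is.na(infection), infection := 0]

# Percentage of patients flagged
prevalence_pct <- mergedDT[, 100 * mean(infection)]
print(prevalence_pct)
```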

In [18]:
# Read in "all_patients.csv"
all_patientsDT <- fread('datasets/all_patients.csv')

# Merge this with the infection flag data, keeping every patient
all_patientsDT <- merge(suspected_infection , all_patientsDT , by = "patient_id" , all = TRUE)

# Set any missing values of the infection flag to 0
all_patientsDT[is.na(infection) , infection := 0]

# Calculate the percentage of patients who met the criteria for presumed infection
ans <- all_patientsDT[ , 100 * mean(infection == 1)]
ans