Electronic Health Records
=========================



## Introduction



In this second activity, you will work on all the concepts that we have been **practicing over the last weeks**. You will practice:

1.  SQL (5 points)
2.  Regular expressions (2 points)
3.  Bringing everything together (3 points)

Some of the exercises in each block share **a common theme**: the study of [**heart failure**](https://www.cdc.gov/heartdisease/heart_failure.htm), a disease that reduces the efficiency with which the heart pumps blood and oxygen to other organs. This disease is highly prevalent in the US and is associated with 13% of deaths across the country.

Some notes:

-   The **deadline** is set for Friday, **December 22nd at 23:59 CET**. No extensions will be allowed. Any **late submission** will be evaluated with a cap on the **maximum grade of 6**.
-   As you know, there are many ways to do a task. Here, **besides correctness, we will also evaluate efficiency**; if your query takes more than a minute, consider that there is something that you need to change.
-   You need to work in **teams of three**. The team may be different from the one for the first assignment, so you need to create it again.
-   In the **Software** section in Moodle you will find the details to create a Docker container that includes the drivers to connect to the MIMIC-III database. You should run and fill in this notebook within that container.
-   You can find the information regarding the database structure on the [**MIMIC-III webpage**](https://mimic.mit.edu/docs/iii/).
-   Just **one member of the team needs to submit the notebook** to the Moodle task. **The format of the delivery is `A2-assignment-groupID`**, using the group id assigned on Moodle.
-   You can use **any function available for [MariaDB](https://mariadb.com/kb/en/built-in-functions/)** to perform any of the requested queries. It is your job, using the available documentation, to find the one that best fits the task.
-   The R packages that you will need to use are defined at the beginning of the document. **YOU CANNOT ALTER OR USE ADDITIONAL PACKAGES NOT LISTED THERE**.
-   For complex code (more than a few lines), you must use **comments to explain your design decisions**.
-   **Variable naming is an important part of programming**. Use meaningful names in order to increase readability.
-   **Do not get stuck** on a given exercise. Most of the exercises are not incremental, so do the easy ones first.
-   To minimize correction time, **you must provide a screenshot with the results of every exercise**, adding a cell with the image at the end of it.



### Group information



Use this markdown cell to introduce the group data.



### Configuration



In this section, you will define some **parameters needed to properly execute this notebook**. Remember that if you stop either the kernel or the container you will need to run these cells again.



#### Loading libraries



These are the libraries that you can use to complete this activity. **You can use any of the functions available in them** (unless specified otherwise). **You cannot modify** this cell or load any other library during the completion of this assignment.



In [None]:
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)
library(readr)
library(stringr)
library(ggplot2)
library(data.table)
library(odbc)
library(RMariaDB)

#### Connecting to the database



As you know, to connect to the database, we must specify an object with the specific details of the connection. In this exercise, you will use the following object, `con`, to connect. **Place your username and password in the designated locations**.



In [None]:
con <- dbConnect(
  drv = RMariaDB::MariaDB(),
  username = "USERNAME",
  password = "PASSWORD",
  host = "ehr1.deim.urv.cat",
  dbname = "mimiciiiv14",
  port = 3306
)

Remember that **you must first be connected to the VPN** in order to do so. Also, remember that the VPN connection is reset every 24 hours, so you may need to run this cell again.

The following cell can be used to check if the connection works properly. The output should be the list of tables in the database.



In [None]:
dbListTables(con)

## SQL (5 Points)



In this first activity, you will review some of the techniques that you have learned about **the use of SQL**. In all the exercises, you need to input the SQL query in a predefined location, and then execute the next given cell to print the output.

**We strongly recommend that you start working on the queries in DBeaver**. Once you get the desired output, just copy the query into the specified place.



### Number of admissions (0.25 points)



In this first activity, you must create a table containing the **number of times that each patient has been admitted to the ICU**. The structure of the table has to be:

-   `SUBJECT_ID`
-   `N_ADM`: number of admissions

The table must be ordered in **descending order according to the number of admissions**.



In [None]:
sql <- "
YOUR SQL QUERY GOES HERE!
"

In [None]:
# DO NOT MODIFY THIS CELL
dbGetQuery(con, sql) %>% head(10)

### Length of stay (0.75 points)

One of the most important things to take into account when we analyze the evolution of patients admitted to the ICU is how long they stayed there. In this exercise, and **only using the table `ADMISSIONS`**, you need to provide **the average length of stay of each patient and the standard deviation of that length of stay**. The table must have the following structure:

-   `SUBJECT_ID`
-   `N_ADM`: number of admissions
-   `MEAN_LOS`: Average length of stay in number of days
-   `SD_LOS`: Standard deviation of the length of stay in number of days

Results must be ordered by **`MEAN_LOS` in descending order**.
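To sanity-check the output of your SQL query, it can help to reproduce the same aggregation in R on a tiny hand-made table. The following is only a sketch: the data frame `toy_admissions` and its values are hypothetical and are not taken from MIMIC-III. Also keep in mind that MariaDB's `STDDEV()` returns the population standard deviation, whereas R's `sd()` returns the sample standard deviation (the MariaDB equivalent is `STDDEV_SAMP()`).

```r
library(dplyr)

# Hypothetical toy admissions (NOT MIMIC-III data)
toy_admissions <- data.frame(
  SUBJECT_ID = c(1, 1, 2),
  ADMITTIME = as.POSIXct(
    c("2100-01-01 00:00:00", "2100-03-01 00:00:00", "2100-05-01 12:00:00"),
    tz = "UTC"
  ),
  DISCHTIME = as.POSIXct(
    c("2100-01-03 00:00:00", "2100-03-05 00:00:00", "2100-05-02 00:00:00"),
    tz = "UTC"
  )
)

toy_los <- toy_admissions %>%
  # Length of stay of each admission, in (fractional) days
  mutate(LOS = as.numeric(difftime(DISCHTIME, ADMITTIME, units = "days"))) %>%
  group_by(SUBJECT_ID) %>%
  # sd() is the sample standard deviation; it is NA for a single admission
  summarise(N_ADM = n(), MEAN_LOS = mean(LOS), SD_LOS = sd(LOS)) %>%
  arrange(desc(MEAN_LOS))

print(toy_los)
```

In this toy table, patient 1 has stays of 2 and 4 days (mean 3) and patient 2 has a single stay of half a day, so the standard deviation for patient 2 is missing.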



In [None]:
sql <- "
YOUR SQL QUERY GOES HERE!
"

In [None]:
# DO NOT MODIFY THIS CELL
dbGetQuery(con, sql) %>% head(10)

### Microbiology (2 points)



Microbiological studies are very common in clinical practice to assess the effect of a given microorganism on the health of patients. These studies are performed on sample cultures that are analyzed after an incubation period to check the composition of the microorganism population.

In this exercise, you will practice some of the concepts learned in epidemiology. The objective is to **study the relation between exposure to a microorganism and the diagnosis of a disease**. You must provide a table with the following information:

-   `LONG_TITLE`: Name of the primary (`SEQ_NUM = 1`) diagnosis in long format
-   `ORG_NAME`: Name of the organism
-   `N_DIAG`: Number of admissions with the same primary diagnosis
-   `N_TESTED`: Number of admissions with the specified primary diagnosis with a microbiological study (either positive or negative)
-   `PERCENT_TESTED`: Percentage of admissions diagnosed with the same primary diagnosis tested for microorganism
-   `N_POS`: Number of admissions with at least one positive test for the presence of the specified microorganism
-   `N_NEG`: Number of admissions without any positive microbiological test for that given disease
-   `PERCENT_POS`: Percentage of tested admissions with at least a positive test for the given disease and microorganism
-   `ODDS_RATIO`: The odds ratio of the exposure on the disease following the same formulation used previously in class

The results have to be **limited** to organisms with at least 200 positive tests in different admissions considering all the diagnoses, and to diagnosis-organism pairs with at least 50 positive tests in different admissions for the given disease. The results need to be sorted in **descending order by `N_DIAG`** and in ascending order by odds ratio.

There are several assumptions that you need to make:

-   For simplicity, you only need to take into account the **primary diagnosis** (`SEQ_NUM = 1`) ignoring any possible comorbidity
-   You also need to assume that the test applied for detecting a microorganism is **the same for all the existing types**, meaning that a negative test indicates the absence of any microorganism
-   You must assume that there is just one **unique microbiological test per admission**, considering all the different studies as part of the same test
-   As a simplification, you may consider **all the admissions independent of the patient**
-   Remember that **the order of the JOINs matters** and that each match should be unique if you want to avoid multiple matches for the same pair of keys
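The formulation of the odds ratio used in class is not reproduced in this notebook. As a reminder, the sketch below uses the standard definition for a 2x2 exposure/disease table, which we assume matches it. The function name `odds_ratio` and the counts are hypothetical and serve only as an illustration.

```r
# Standard odds ratio for a 2x2 table (assumed to match the class formulation):
#
#                 disease   no disease
#   exposed          a          b
#   not exposed      c          d
#
#   OR = (a * d) / (b * c)
odds_ratio <- function(exposed_cases, exposed_noncases,
                       unexposed_cases, unexposed_noncases) {
  (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)
}

# Hypothetical counts, not taken from MIMIC-III
toy_or <- odds_ratio(
  exposed_cases = 60, exposed_noncases = 140,
  unexposed_cases = 40, unexposed_noncases = 360
)
print(toy_or)
```

A value above 1 suggests that the exposure is more frequent among admissions with the disease, and a value below 1 suggests the opposite.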



In [None]:
sql <- "
YOUR SQL QUERY GOES HERE!
"

In [None]:
# DO NOT MODIFY THIS CELL
dbGetQuery(con, sql) %>% head(70)

#### Questions:



Which kind of epidemiological study is this? (Explain why.)



Comment on the results observed in the final table. What conclusions do you reach?



### Heart Failure: Comorbidities (1 point)



In the practice of medicine, one of the factors that you always need to consider is the presence of comorbidities, **diseases that co-occur with a primary condition**. In this exercise you will explore this concept by studying a specific disease, heart failure. As we told you in the introduction of the assignment, this disease has a large prevalence in the U.S. and it is present in the coroner's reports of almost 13% of deaths. Here, you must provide a table with the following information:

-   `ICD9_CODE`: The ICD-9 code of the disease
-   `LONG_TITLE`: Long description of the codified disease
-   `N`: Number of patients with that disease associated. You must consider only those diseases that are present on admissions where the primary condition is heart failure (ICD9 starting with 428)
-   `Prevalence`: Prevalence of that disease on the heart failure population (in percentage)

You must order the results by **`Prevalence` in descending order**.



In [None]:
sql <- "
YOUR SQL QUERY GOES HERE!
"

In [None]:
# DO NOT MODIFY THIS CELL
dbGetQuery(con, sql) %>% head(30)

#### Questions:



Compare the results that you are obtaining with the [description of the CDC](https://www.cdc.gov/heartdisease/heart_failure.htm):



What you did is an oversimplification of the process of calculating comorbidities. Which pitfalls do you think this approach has and how can we overcome them?



### Heart Failure: Building a cohort (1 point)

In the next step, you will build a **cohort of patients diagnosed with heart failure (HF) as primary condition**. For each patient, you must obtain some important clinical and demographic features. The resulting table must contain:

-   `SUBJECT_ID`
-   `GENDER`
-   `AGE_FIRST`: age at the first diagnosis of heart failure
-   `AGE_LAST`: age at the last diagnosis of heart failure
-   `ETHNICITY`
-   `DECEASED`: 1 if the patient has died, 0 otherwise
-   `AVG_LOS`: Average length-of-stay
-   `DM2`: 1 if the patient has been diagnosed with diabetes mellitus type II, 0 otherwise
-   `CAD`: 1 if the patient has been diagnosed with coronary artery disease, 0 otherwise
-   `CKD`: 1 if the patient has been diagnosed with chronic kidney disease, 0 otherwise
-   `HYPERTENSION`: 1 if the patient has been diagnosed with hypertension, 0 otherwise

The results must be ordered by age **at the last admission in descending order**.



In [None]:
sql <- "
YOUR SQL QUERY GOES HERE!
"

In [None]:
# DO NOT MODIFY THIS CELL
dbGetQuery(con, sql) %>% head(30)

## Regular expressions (2 points)

### Getting the text (0.25 points)

In this section, you will work with **regular expressions**. The first exercises will be conducted over a single medical report. In this first exercise, **you will get the report from the database**. First, connect to MIMIC-III and extract the `TEXT` field from the `NOTEEVENTS` table corresponding to the only record with `SUBJECT_ID` equal to `13702`, `CATEGORY` equal to `'Discharge summary'`, and `CHARTDATE` equal to `'2118-06-14'`. Store it in a string variable called `text`.



In [None]:
sql_diagnosis_all <- "
YOUR SQL QUERY GOES HERE!
"

diagnosis_all <- dbGetQuery(con, sql_diagnosis_all)
text <- diagnosis_all$TEXT[1]

In [None]:
# DO NOT MODIFY THIS CELL
writeLines(str_c('<', text ,'>'))

### Medications (0.75 points)

In `text`, there is a paragraph with the **list of medications** on admission. Using `stringr` functions, **extract it to a list of strings called `medications_admission`**. Remove the enumeration in front of each medication. Each string in the list **must consist of the name of the medication and its dosage**, just as they appear in the report.
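As a reminder of how an enumeration prefix can be removed with `stringr`, here is a minimal sketch. The strings in `toy_lines` are hypothetical and are not taken from the report; the real paragraph may use a different numbering format, so adapt the pattern to what you actually find in `text`.

```r
library(stringr)

# Hypothetical lines, NOT taken from the MIMIC-III report
toy_lines <- c(
  "1. Aspirin 81 mg daily",
  "2. Metoprolol 25 mg twice a day",
  "10. Furosemide 40 mg daily"
)

# Remove optional leading whitespace, a number, a period, and trailing whitespace
toy_clean <- str_remove(toy_lines, "^\\s*\\d+\\.\\s*")
print(toy_clean)
```

Anchoring the pattern with `^` ensures that only the number at the start of each string is removed, and not a dosage such as `81` later in the line.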



In [None]:
# YOUR CODE GOES HERE!

In [None]:
# DO NOT MODIFY THIS CELL
medications_admission

### Medications data frame (1 point)

Now that you have extracted the medication information, you need to **put this information in a structured object**. Specifically, you must generate a data.frame `medications` with the following columns, using regular expressions to segment the previously generated strings:

-   `medication`: name of the product
-   `dosage`: numerical value with the prescribed dosage
-   `units`: units in which the dosage is specified

Be as precise as possible, given the variety of formats in the original text.
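One possible way to segment a string into several fields is `str_match()` with capture groups. The sketch below assumes a very simple `name number unit` layout; the strings in `toy_meds` and the pattern are hypothetical and will certainly not cover every format that appears in the report.

```r
library(stringr)

# Hypothetical strings, NOT taken from the MIMIC-III report
toy_meds <- c("Aspirin 81 mg", "Metoprolol 12.5 mg", "Multivitamin")

# Capture groups:
#   1. name (lazy, so it stops before the optional dosage)
#   2. dosage (integer or decimal), optional
#   3. units (letters only), optional
parts <- str_match(
  toy_meds,
  "^(.*?)(?:\\s+(\\d+(?:\\.\\d+)?)\\s*([A-Za-z]+))?$"
)

# Column 1 of `parts` is the full match; columns 2 to 4 are the groups
toy_medications <- data.frame(
  medication = parts[, 2],
  dosage = as.numeric(parts[, 3]),
  units = parts[, 4],
  stringsAsFactors = FALSE
)
print(toy_medications)
```

When the optional dosage group does not match, `str_match()` returns `NA` for its capture groups, so a missing dosage or unit ends up as `NA` in the data frame instead of raising an error.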


In [None]:
# YOUR CODE GOES HERE!

In [None]:
# DO NOT MODIFY THIS CELL
print(medications)

#### Questions



**Explain the limitations of your method**: which formats are going to be recognized correctly and which are going to fail; in case of failure, explain how they fail; undesired flaws (e.g., the appearance of additional characters, such as commas or blank spaces, in any of the fields); explain how you handle the absence of quantity and/or units, or the presence of multiple quantities and/or units.



As you know, medications are already stored in a structured table in MIMIC-III. Compare the results obtained in this exercise with the ones stored for this particular admission in MIMIC-III:



## Bringing everything together (3 points)

The **ejection fraction measures the volumetric fraction of fluid ejected from a chamber with each contraction**. As you may imagine, this metric is deeply connected with heart function. Doctors typically measure the ejection fraction **using an echocardiogram**, which is an ultrasound imaging test that uses sound waves to create a picture of the heart. The test allows the doctor to see the size and shape of the heart and how well it is functioning. Doctors can measure the volume of blood in both the left ventricle and the right ventricle. In general, the **left ventricle is responsible for pumping oxygenated blood to the rest of the body**, while the right ventricle pumps blood to the lungs to be oxygenated.

It turns out that **left ventricular ejection fraction (LVEF) is closely related to the diagnosis of heart failure**, since it can be used to further **classify the disease**. According to the European Society of Cardiology, the types are:

-   Heart failure with preserved LVEF (HFpEF) [LVEF ≥ 50%]
-   Heart failure with mildly reduced LVEF (HFmrEF) [41% ≤ LVEF ≤ 49%]
-   Heart failure with reduced LVEF (HFrEF) [LVEF ≤ 40%]
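The classification above can be expressed as a small helper function. This is only a sketch: the function name `classify_lvef` is hypothetical, and the handling of the boundary values (50% counts as preserved, 41% as mildly reduced, anything below 41% as reduced) is an assumption that you should state explicitly if you adopt it.

```r
library(dplyr)

# Hypothetical helper: map LVEF percentages to a heart failure type.
# Boundary handling is an assumption: >= 50 preserved, >= 41 mildly reduced.
classify_lvef <- function(lvef) {
  case_when(
    is.na(lvef) ~ NA_character_,
    lvef >= 50 ~ "HFpEF",
    lvef >= 41 ~ "HFmrEF",
    TRUE ~ "HFrEF"
  )
}

toy_types <- classify_lvef(c(60, 45, 30, NA))
print(toy_types)
```

Because `case_when()` evaluates the conditions in order, each value is assigned to the first category whose condition it satisfies, and missing values stay missing.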

**Treatment for these types of heart failure may differ**, depending on the underlying cause and the specific symptoms a person is experiencing. In general, managing heart failure involves a combination of medications, lifestyle changes, and in some cases, medical procedures or surgeries.



### Heart Failure: Ejection Fraction (1.5 points)

In this exercise, you must **use regular expressions to extract the LVEF from `NOTEEVENTS`**, using the echocardiogram reports associated with admissions where the primary diagnosis is heart failure. Be aware that there are **many different ways in which this information can be recorded** (LV ejection fraction, LVEF, L.V.E.F...). **You must analyze those different scenarios** when you design the extraction strategy. You must provide a data frame with the following content:

-   `HADM_ID`: admission number
-   `LVEF`: Left ventricular ejection fraction (if there is more than one measurement per admission, you must provide the average)

In this activity you **may use any of the techniques taught in this course**. You can organize the code in as many cells as you need. You must properly describe any decision made either using markdown cells or code comments.
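To illustrate how several spellings can be covered by one case-insensitive pattern, here is a minimal sketch. The sentences in `toy_notes` are hypothetical and are not taken from `NOTEEVENTS`, and the pattern is deliberately incomplete: for example, it does not handle ranges such as `50-55%` or values written without a percent sign.

```r
library(stringr)

# Hypothetical sentences, NOT taken from NOTEEVENTS
toy_notes <- c(
  "LVEF 55%.",
  "The LV ejection fraction is about 35 %.",
  "L.V.E.F. = 40%",
  "No mention of systolic function."
)

# A label (three spellings), up to 20 non-digit characters, then a
# one- or two-digit value followed by a percent sign
lvef_pattern <- regex(
  str_c(
    "(?:l\\.?v\\.?e\\.?f\\.?",
    "|lv\\s+ejection\\s+fraction",
    "|left\\s+ventricular\\s+ejection\\s+fraction)",
    "\\D{0,20}?(\\d{1,2})\\s*%"
  ),
  ignore_case = TRUE
)

# Column 2 of the str_match() result holds the first capture group
toy_lvef <- as.numeric(str_match(toy_notes, lvef_pattern)[, 2])
print(toy_lvef)
```

Notes without a recognizable measurement yield `NA`, which makes it easy to count how many reports your strategy fails to cover before you refine the pattern.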



In [None]:
# YOUR CODE GOES HERE! (AND IN AS MANY CELLS AS YOU NEED)

### Heart Failure: Adding LVEF to our cohort (0.25 points)

In order to understand the relation of ejection fraction with other factors, you need to **merge the table `hf_cohort` with the data frame containing LVEF values** that you just obtained. If a patient has more than one associated LVEF value, you will use the average. Additionally, you will **add a new column `type`**, specifying the type of HF (HFrEF, HFmrEF or HFpEF).



In [None]:
# YOUR CODE GOES HERE!

### Heart Failure: LVEF distribution (1.25 points)

The distribution of LVEF values offers a nice description of HF. In this exercise, you must provide a **histogram showing the distribution of those values** (you need to take care of outliers, if there are any). Moreover, you also need to show this information **broken down by gender and by the presence of hypertension**.
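As a reminder of how `ggplot2` can split a histogram by two variables, the sketch below uses simulated data. The data frame `toy_cohort`, its values, the bin width, and the clamping range are all hypothetical choices made for illustration; your own plot must use the merged cohort and a justified treatment of outliers.

```r
library(ggplot2)

# Simulated toy data, NOT MIMIC-III values
set.seed(1)
toy_cohort <- data.frame(
  # Clamp the simulated values to a plausible 5-85% range
  LVEF = pmin(pmax(rnorm(200, mean = 45, sd = 12), 5), 85),
  GENDER = sample(c("F", "M"), 200, replace = TRUE),
  HYPERTENSION = sample(c(0, 1), 200, replace = TRUE)
)

# One panel per combination of hypertension (rows) and gender (columns)
lvef_histogram <- ggplot(toy_cohort, aes(x = LVEF)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  facet_grid(HYPERTENSION ~ GENDER, labeller = label_both) +
  labs(x = "LVEF (%)", y = "Number of patients")

print(lvef_histogram)
```

Using facets keeps the same axes across panels, which makes the distributions directly comparable between groups.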



In [None]:
# YOUR CODE GOES HERE!

#### Questions:



Do you observe any appreciable difference? Do these distributions show any remarkable characteristic?

