<a href="https://colab.research.google.com/github/MIT-LCP/bidmc-datathon/blob/master/01_explore_patients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# eICU Collaborative Research Database

# Notebook 1: Exploring the patient table

The aim of this notebook is to get set up with access to a demo version of the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The demo is a subset of the full database, limited to ~1000 patients.

We begin by exploring the `patient` table, which contains patient demographics and admission and discharge details for hospital and ICU stays. For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/

## Prerequisites

- If you do not have a Gmail account, please create one at http://www.gmail.com. 
- If you have not yet signed the data use agreement (DUA) sent by the organizers, please do so now to get access to the dataset.

## Load libraries and connect to the data

Run the following cells to import some libraries and then connect to the database.

In [None]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# Make pandas dataframes prettier
from IPython.display import display, HTML

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

Before running any queries, you need to authenticate by running the following cell. The first time you run it, you will be asked to follow a link, log in with your Gmail account, and accept the data access requests. You will then be given a verification code, which you should paste into the cell below before pressing Enter.

In [None]:
auth.authenticate_user()

We'll also set the project details.

In [None]:
project_id = 'bidmc-datathon'
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# "Querying" our database with SQL

Now we can start exploring the data. We'll begin by running a simple query to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:

```sql
SELECT <columns>
FROM <table>
WHERE <criteria, optional>
```

In the `SELECT` clause, `*` is a wildcard that selects all columns.
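
To see the three clauses in action without touching the real database, the sketch below builds a tiny in-memory table with Python's built-in `sqlite3` module. The rows are made up for illustration, and the numeric `age` column is an assumption of this toy table; only the column names are borrowed from the `patient` table.

```python
import sqlite3

# Build a throwaway in-memory database (made-up rows, not eICU data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patientunitstayid INTEGER, gender TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [(1, "Female", 70), (2, "Male", 45), (3, "Female", 52)],
)

# SELECT * returns every column of every row.
all_rows = conn.execute("SELECT * FROM patient").fetchall()

# Naming columns narrows the output; WHERE keeps only the matching rows.
older_rows = conn.execute(
    "SELECT patientunitstayid, age "
    "FROM patient "
    "WHERE age > 50 "
    "ORDER BY patientunitstayid"
).fetchall()

print(all_rows)
print(older_rows)
conn.close()
```

The same `SELECT ... FROM ... WHERE ...` pattern applies to the BigQuery queries in this notebook, although the exact SQL dialect differs slightly between database engines.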

# BigQuery

Our dataset is stored on BigQuery, Google's database engine. We can run our query on the database using some special ("magic") [BigQuery syntax](https://googleapis.dev/python/bigquery/latest/magics.html).

In [None]:
%%bigquery patient

SELECT *
FROM `physionet-data.eicu_crd_demo.patient`

We have now assigned the output of our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data.

In [None]:
# view the top few rows of the patient data
patient.head()
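
As an alternative to the `%%bigquery` magic, a query can also be sent from ordinary Python code through the `google-cloud-bigquery` client. The sketch below is one possible way to do this, not the notebook's own method: `build_patient_query` and `run_query` are hypothetical helper names introduced here, and `run_query` only works in a session that has already been authenticated.

```python
def build_patient_query(columns, limit=None):
    """Return a SQL string selecting `columns` from the demo patient table."""
    table = "`physionet-data.eicu_crd_demo.patient`"
    sql = f"SELECT {', '.join(columns)}\nFROM {table}"
    if limit is not None:
        # LIMIT caps the number of rows returned.
        sql += f"\nLIMIT {int(limit)}"
    return sql


def run_query(sql, project_id):
    """Run `sql` on BigQuery and return a DataFrame (needs authentication)."""
    # Imported lazily so the query builder can be used without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return client.query(sql).to_dataframe()


sql = build_patient_query(["uniquepid", "gender", "age"], limit=5)
print(sql)

# In an authenticated Colab session one could then call:
# preview = run_query(sql, project_id="bidmc-datathon")
```

Selecting only the columns you need, rather than `*`, keeps the result smaller and easier to inspect.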

## Questions

- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)
- What does `patienthealthsystemstayid` represent?
- What does `uniquepid` represent?

In [None]:
# select a limited number of columns to view
columns = ['uniquepid', 'patientunitstayid', 'gender', 'age', 'unitdischargestatus']
patient[columns].head()

- Try running the following query, which lists unique values in the age column. What do you notice?

In [None]:
# what are the unique values for age?
age_col = 'age'
patient[age_col].sort_values().unique()

- Try plotting a histogram of ages using the command in the cell below. What happens? Why?

In [None]:
# try plotting a histogram of ages
patient[age_col].plot(kind='hist', bins=15)

Let's create a new column named `age_num`, then try again.

In [None]:
# create a column containing numerical ages
# errors='coerce' sets values that cannot be parsed as numbers to NaN
agenum_col = 'age_num'
patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')
patient[agenum_col].sort_values().unique()
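
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a small, made-up Series of ages stored as text. The values are illustrative only; the point is that any entry that cannot be read as a number is silently replaced by `NaN`, so it is worth checking which entries were lost.

```python
import pandas as pd

# Made-up ages stored as text, including one non-numeric entry and one missing value.
ages = pd.Series(["45", "67", "> 89", "72", None])

# errors="coerce" turns anything that cannot be parsed into NaN.
ages_num = pd.to_numeric(ages, errors="coerce")

# Entries that were present as text but became NaN after conversion.
lost = ages[ages.notna() & ages_num.isna()]

print(ages_num.tolist())
print(lost.tolist())
```

Comparing the original and converted columns in this way shows exactly which values were excluded from later summaries such as the histogram or the mean.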

In [None]:
patient[agenum_col].plot(kind='hist', bins=15)

## Questions

- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?
- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?

In [None]:
adheight_col = 'admissionheight'
patient[adheight_col].describe()

In [None]:
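# treat heights below a plausibility threshold (10 cm) as missing,
# leaving the other columns of those rows untouched
adheight_col = 'admissionheight'
patient.loc[patient[adheight_col] < 10, adheight_col] = np.nan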
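
A boolean mask used on its own selects whole rows, so assigning through it affects every column of those rows. The sketch below, on a made-up DataFrame, shows one way to blank only the implausible heights by passing both the row mask and the column label to `.loc`. The threshold of 10 cm is taken from the cell above; the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; 1.8 looks like a height entered in metres rather than cm.
toy = pd.DataFrame({
    "patientunitstayid": [1, 2, 3],
    "admissionheight": [172.0, 1.8, 160.0],
})

# Row mask AND column label: only the height cell is set to missing.
too_small = toy["admissionheight"] < 10
toy.loc[too_small, "admissionheight"] = np.nan

print(toy)
```

After this step the identifiers of all three toy rows are still intact, and `describe()` on the height column ignores the missing value.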