<a href="https://colab.research.google.com/github/MIT-LCP/2019_tokyo_datathon/blob/master/01_explore_patients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# eICU Collaborative Research Database

# Notebook 2: Exploring the patient table

In this notebook we introduce the patient table, a key table in the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The patient table contains patient demographics and admission and discharge details for hospital and ICU stays. For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/

## Load libraries and connect to the data

Run the following cells to import some libraries and then connect to the database.

In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# Make pandas dataframes prettier
from IPython.display import display, HTML

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

As before, you first need to authenticate by running the following cell. The first time you run it, you will be asked to follow a link, log in with your Google account, and approve the data access request. This generates a verification code, which you should paste into the cell below before pressing Enter.

In [0]:
auth.authenticate_user()

We'll also set the project details.

In [0]:
project_id='datathonjapan2019'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id

## Load data from the `patient` table

Now we can start exploring the data. We'll begin by running a simple query that loads all columns of the `patient` table into a pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:

```sql
SELECT <columns>
FROM <table>
WHERE <criteria, optional>
```

In the `SELECT` clause, `*` is a wildcard that selects all columns.
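To make the `SELECT` / `FROM` / `WHERE` structure concrete without needing database access, here is a minimal sketch that runs the same kind of query against a toy in-memory SQLite table. The table contents and the `toy_query` and `toy_result` names are made up for illustration, and SQLite stands in for BigQuery here; the BigQuery dialect differs in some details.

```python
import sqlite3

import pandas as pd

# Toy in-memory table; the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patientunitstayid INTEGER, gender TEXT, age TEXT)"
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [(1, "Female", "70"), (2, "Male", "52"), (3, "Female", "> 89")],
)

# SELECT picks the columns, FROM names the table, WHERE filters the rows.
toy_query = """
SELECT patientunitstayid, age
FROM patient
WHERE gender = 'Female'
"""
toy_result = pd.read_sql_query(toy_query, conn)
conn.close()

print(toy_result)
```

Replacing the column list with `*` would return every column of the toy table, and dropping the `WHERE` line would return every row.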

In [0]:
# Helper function to read data from BigQuery into a DataFrame.
# The `verbose` argument is deprecated in read_gbq, so it is omitted here.
def run_query(query):
    return pd.read_gbq(query, project_id=project_id,
                       configuration={'query': {'useLegacySql': False}})

In [0]:
query = """
SELECT *
FROM `physionet-data.eicu_crd_demo.patient`
"""

patient = run_query(query)

We have now assigned the output of our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data.

In [0]:
# view the top few rows of the patient data
patient.head()

## Questions

- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)
- What does `patienthealthsystemstayid` represent?
- What does `uniquepid` represent?

In [0]:
# select a limited number of columns to view
columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']
patient[columns].head()
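One way to probe the identifier questions above is to count the distinct values in each ID column and compare the counts. The sketch below does this on a small invented DataFrame, assuming the three identifiers are nested within one another; the rows and the `toy`, `id_cols`, and `id_counts` names are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "uniquepid": ["p-001", "p-001", "p-001"],
    "patienthealthsystemstayid": [10, 10, 11],
    "patientunitstayid": [100, 101, 102],
})

id_cols = ["uniquepid", "patienthealthsystemstayid", "patientunitstayid"]

# nunique() counts the distinct values in each column.
id_counts = toy[id_cols].nunique()
print(id_counts)
```

Running `patient[id_cols].nunique()` on the real table gives the same kind of comparison; which identifier has the most distinct values is a useful clue to what each one represents.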

- Try running the following cell, which lists the unique values in the `age` column. What do you notice?

In [0]:
# what are the unique values for age?
age_col = 'age'
patient[age_col].sort_values().unique()

- Try plotting a histogram of ages using the command in the cell below. What happens? Why?

In [0]:
# try plotting a histogram of ages
patient[age_col].plot(kind='hist', bins=15)
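If the plot fails, checking how the column is stored is a reasonable first step. The sketch below uses an invented Series, assuming the ages arrive as text; `toy_age` is a hypothetical name.

```python
import pandas as pd

# Invented values stored as text, as a column of strings would be.
toy_age = pd.Series(["45", "67", "> 89"])

# A histogram needs numbers, so check whether pandas sees the column as numeric.
is_numeric = pd.api.types.is_numeric_dtype(toy_age)
print(toy_age.dtype, is_numeric)
```

The same check on the real data is `pd.api.types.is_numeric_dtype(patient[age_col])`.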

Let's create a new column named `age_num`, then try again.

In [0]:
# create a column containing numerical ages
# errors='coerce' sets values that cannot be parsed as numbers to NaN
agenum_col = 'age_num'
patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')
patient[agenum_col].sort_values().unique()

In [0]:
patient[agenum_col].plot(kind='hist', bins=15)
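Before relying on the coerced column, it can help to see exactly which entries were turned into `NaN`. The following is a minimal sketch on invented values, assuming the column mixes numeric strings with a text entry such as `> 89`; `toy_age`, `toy_age_num`, and `unparsed` are hypothetical names.

```python
import pandas as pd

# Invented values: numeric strings plus one entry that is not a plain number.
toy_age = pd.Series(["45", "67", "> 89", "23"])

# Entries that cannot be parsed become NaN instead of raising an error.
toy_age_num = pd.to_numeric(toy_age, errors="coerce")

# List the original entries that were lost in the conversion.
unparsed = toy_age[toy_age_num.isna()]
print(unparsed.tolist())

# mean() skips NaN by default, so the unparsed entries do not contribute.
toy_mean = toy_age_num.mean()
print(toy_mean)
```

Because the unparsed entries are silently excluded from `mean()`, any summary statistic computed on the coerced column describes only the rows that parsed cleanly.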

## Questions

- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?
- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?

In [0]:
adheight_col = 'admissionheight'
patient[adheight_col].describe()

In [0]:
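# set heights below the threshold to missing, in the height column only
# (indexing the whole DataFrame with the mask would blank every column of those rows)
adheight_col = 'admissionheight'
patient.loc[patient[adheight_col] < 10, adheight_col] = np.nan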
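One way to handle implausible values is to mark them as missing in that column alone, so the rest of each row is kept. The sketch below shows this on invented rows, using the same cutoff of 10 as the cell above; the `toy`, `min_plausible_cm`, and `too_small` names are hypothetical, and whether 10 is the right cutoff is a judgment call for the real data.

```python
import numpy as np
import pandas as pd

# Invented rows; the middle height is implausibly small for a value in cm.
toy = pd.DataFrame({
    "patientunitstayid": [100, 101, 102],
    "admissionheight": [172.0, 1.7, 160.0],
})

min_plausible_cm = 10  # cutoff used only for this sketch
too_small = toy["admissionheight"] < min_plausible_cm

# Blank only the height value; the other columns of the row stay intact.
toy.loc[too_small, "admissionheight"] = np.nan
print(toy)
```

Selecting both the rows and the column in `.loc` is what limits the change to the height values.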