# eICU Collaborative Research Database

# Notebook 4: Summary statistics

This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. Usage instructions for tableone are at: https://pypi.org/project/tableone/


## Load libraries and connect to the database

In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# Make pandas dataframes prettier
from IPython.display import display, HTML

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

In [0]:
# authenticate
auth.authenticate_user()

In [0]:
# Set up environment variables
project_id='tdothealthhack-team'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id

In [0]:
# Helper function to read data from BigQuery into a DataFrame.
def run_query(query):
    return pd.io.gbq.read_gbq(query, project_id=project_id, dialect="standard")

## Install and load the `tableone` package

The tableone package can be used to compute summary statistics for a patient cohort. Unlike the previous packages, it is not installed by default in Colab, so we will need to install it first.

In [0]:
!pip install tableone

In [0]:
# Import the tableone class
from tableone import TableOne

## Load the patient cohort

In this example, we will load data from the `patient` table and link it to the APACHE results in the `apachepatientresult` table to provide richer summary information.

In [0]:
# Link the patient and apachepatientresult tables on patientunitstayid
# using an inner join.
query = """
SELECT p.unitadmitsource, p.gender, p.age, p.ethnicity, p.admissionweight, 
    p.unittype, p.unitstaytype, a.acutephysiologyscore,
    a.apachescore, a.actualiculos, a.actualhospitalmortality,
    a.unabridgedunitlos, a.unabridgedhosplos
FROM `physionet-data.eicu_crd_demo.patient` p
INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a
ON p.patientunitstayid = a.patientunitstayid
WHERE a.apacheversion = 'IVa'
"""

cohort = run_query(query)

In [0]:
cohort.head()
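
As a rough illustration of what the inner join in the query does, the sketch below repeats the same idea in pandas on two tiny, made-up tables. The table contents are hypothetical and only the column names follow the query above. An inner join keeps a stay only if it appears in both tables, so stays without an APACHE result drop out of the cohort.

```python
import pandas as pd

# Toy stand-ins for the patient and apachepatientresult tables (values are made up).
patient = pd.DataFrame({
    "patientunitstayid": [1, 2, 3],
    "gender": ["Female", "Male", "Male"],
})
apache = pd.DataFrame({
    "patientunitstayid": [1, 3, 4],
    "apachescore": [52, 71, 40],
})

# An inner join keeps only the stays that are present in both tables.
linked = patient.merge(apache, on="patientunitstayid", how="inner")
print(linked)
```

Comparing `len(cohort)` with the number of rows in the patient table is one simple way to see how many stays the join excluded.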

## Calculate summary statistics

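Before summarizing the data, we need to convert the ages to numerical values. The `age` column is stored as text, and ages above 89 are recorded as `> 89`. With `errors='coerce'`, any value that cannot be parsed as a number becomes `NaN`.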

In [0]:
cohort['agenum'] = pd.to_numeric(cohort['age'], errors='coerce')
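
To see what the coercion does, here is a minimal sketch on a small, made-up series of age strings (the values are illustrative, not taken from the database). It also shows one possible alternative, replacing `> 89` with 90 before converting. That replacement is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

# Made-up age strings in the style of the eICU `age` column.
ages = pd.Series(["45", "> 89", None, "72"])

# errors='coerce' turns anything that is not a plain number into NaN.
agenum = pd.to_numeric(ages, errors="coerce")
print(agenum.tolist())
print("Values lost to coercion:", int(agenum.isna().sum()))

# Hypothetical alternative: treat "> 89" as 90 so that the oldest
# patients are not silently dropped from the age summary.
agenum_capped = pd.to_numeric(ages.replace("> 89", "90"), errors="coerce")
print(agenum_capped.tolist())
```

Whichever choice is made, the oldest patients are either missing or assigned an approximate age, so the reported mean age is affected.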

In [0]:
columns = ['unitadmitsource', 'gender', 'agenum', 'ethnicity',
          'admissionweight','unittype','unitstaytype',
          'acutephysiologyscore','apachescore','actualiculos',
          'unabridgedunitlos','unabridgedhosplos']

In [0]:
TableOne(cohort, columns=columns, labels={'agenum': 'age'}, 
         groupby='actualhospitalmortality',
         label_suffix=True, limit=4)

## Questions

- Are the severity of illness measures higher in the survival or non-survival group?
- What issues suggest that some of the summary statistics might be misleading?
- How might you address these issues?
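
One possible way to explore the last question is to compare the mean with the median and interquartile range for a skewed variable. The sketch below uses made-up length-of-stay values; the column names follow the cohort above, but the numbers are hypothetical and chosen only to show how a single long stay pulls the mean away from the median.

```python
import pandas as pd

# Made-up, right-skewed length-of-stay values (days) for two outcome groups.
toy = pd.DataFrame({
    "actualhospitalmortality": ["ALIVE"] * 5 + ["EXPIRED"] * 5,
    "actualiculos": [1.0, 1.5, 2.0, 2.5, 30.0, 2.0, 3.0, 4.0, 5.0, 60.0],
})

# Summarize each group with the mean alongside the median and quartiles.
grouped = toy.groupby("actualhospitalmortality")["actualiculos"]
summary = pd.DataFrame({
    "mean": grouped.mean(),
    "median": grouped.median(),
    "q25": grouped.quantile(0.25),
    "q75": grouped.quantile(0.75),
})
print(summary)
```

The tableone documentation describes options for reporting selected variables as non-normally distributed, which may be worth checking for the real cohort.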

## Visualizing the data

Plotting the distribution of each variable by group level, using histograms, kernel density estimates, and boxplots, is a crucial component of data analysis pipelines. Visualization is often the only way to detect problematic variables in real-life scenarios. We'll review a couple of the variables.

In [0]:
# Plot distributions to review possible multimodality
cohort[['acutephysiologyscore','agenum']].dropna().plot.kde(figsize=[12,8])
plt.legend(['APS Score', 'Age (years)'])
plt.xlim([-30,250])
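
The plot above pools all patients together. As a minimal sketch of plotting by group level, the example below overlays histograms for two groups. It uses synthetic scores drawn from normal distributions, so the group sizes, means, and spreads are assumptions and not properties of the eICU data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic severity scores for two outcome groups (not real data).
rng = np.random.default_rng(0)
scores = {
    "ALIVE": rng.normal(40, 15, 500),
    "EXPIRED": rng.normal(70, 25, 100),
}

fig, ax = plt.subplots(figsize=(12, 8))
bins = np.linspace(-30, 250, 57)  # shared bins so the groups are comparable
for label, values in scores.items():
    # density=True rescales each group so that unequal sizes can be compared
    ax.hist(values, bins=bins, alpha=0.5, density=True, label=label)
ax.set_xlabel("APS score (synthetic)")
ax.set_ylabel("Density")
ax.legend()
```

With the real cohort, the same loop could iterate over `cohort.groupby('actualhospitalmortality')` instead of the synthetic dictionary.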

## Questions

- Do the plots change your view on how these variables should be reported?