<a href="https://colab.research.google.com/github/MIT-LCP/2019_tokyo_datathon/blob/master/mimic_python/summary_stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MIMIC-III

# Summary statistics

This notebook shows how to compute summary statistics for a patient cohort using the `tableone` package, and how to inspect the distribution of the data. Usage instructions for `tableone` are at: https://pypi.org/project/tableone/

## Load libraries and connect to the database

In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# Make pandas dataframes prettier
from IPython.display import display, HTML

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

In [0]:
# authenticate
auth.authenticate_user()

In [0]:
# Set up environment variables
project_id = 'datathonjapan2019'
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package.

In [0]:
!pip install datathon2

In [0]:
import datathon2 as dtn

## Install and load the `tableone` package

The `tableone` package can be used to compute summary statistics for a patient cohort. Unlike the packages used so far, it is not installed by default in Colab, so we need to install it first.

In [0]:
!pip install tableone

In [0]:
# Import the tableone class
from tableone import TableOne

## Load the patient cohort

In this example, we will load data from the `admissions` table, taking the first hospital admission for each patient.

In [0]:
# Rank each patient's admissions by admission time (hospstay_seq)
# and keep only the first hospital admission.
query = """
WITH tmp AS (
SELECT a.subject_id, a.hadm_id, a.admission_type, a.admission_location, a.discharge_location,
      a.insurance, a.ethnicity, a.diagnosis, a.hospital_expire_flag,
      DENSE_RANK() OVER (PARTITION BY a.subject_id ORDER BY a.admittime) AS hospstay_seq,
      DATETIME_DIFF(a.dischtime, a.admittime, DAY) AS los_hospital_days,
      DATETIME_DIFF(a.edouttime, a.edregtime, HOUR) AS los_emergency_hrs
FROM `physionet-data.mimiciii_demo.admissions` a)
SELECT *
FROM tmp
WHERE hospstay_seq = 1;
"""

cohort = dtn.run_query(query, project_id)

In [0]:
cohort.head()
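
The window-function filter in the query can be mirrored in pandas, which may help when checking the result locally. The sketch below is not part of the original notebook: it uses a small made-up table (the `admissions` dataframe and its values are hypothetical), ranks each patient's admissions by `admittime`, and keeps rank 1.

```python
import pandas as pd

# Hypothetical toy admissions table; the values are made up for illustration.
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3],
    "hadm_id": [101, 102, 201, 301, 302],
    "admittime": pd.to_datetime([
        "2101-03-01", "2101-01-15", "2150-06-10", "2120-02-02", "2120-09-30",
    ]),
})

# Equivalent of DENSE_RANK() OVER (PARTITION BY subject_id ORDER BY admittime).
admissions["hospstay_seq"] = (
    admissions.groupby("subject_id")["admittime"]
    .rank(method="dense")
    .astype(int)
)

# Keep the first hospital admission for each patient.
first_admissions = admissions[admissions["hospstay_seq"] == 1]
print(first_admissions[["subject_id", "hadm_id", "hospstay_seq"]])
```

Note that a dense rank assigns the same rank to tied admission times, so a patient with two admissions recorded at exactly the same `admittime` would keep both rows.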

## Summary statistics

In [0]:
columns = ['admission_type', 'admission_location', 'discharge_location', 'insurance',
          'ethnicity','los_hospital_days','los_emergency_hrs']

categorical = ['admission_type', 'admission_location', 'discharge_location', 'insurance',
          'ethnicity']

In [0]:
TableOne(cohort, columns=columns, categorical=categorical,
         groupby='hospital_expire_flag',
         label_suffix=True, limit=4)
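
To get a feel for what the grouped table summarizes, the same kind of quantities can be computed by hand with pandas. This is only a rough sketch on a made-up dataframe (`toy` and its values are hypothetical), not a reimplementation of `tableone`: it counts categories within each outcome group and computes the mean and standard deviation of a continuous variable.

```python
import pandas as pd

# Hypothetical toy cohort; the values are made up for illustration.
toy = pd.DataFrame({
    "hospital_expire_flag": [0, 0, 0, 1, 1],
    "admission_type": ["EMERGENCY", "ELECTIVE", "EMERGENCY", "EMERGENCY", "URGENT"],
    "los_hospital_days": [3, 5, 10, 2, 8],
})

# Categorical variable: counts per category within each outcome group.
counts = pd.crosstab(toy["admission_type"], toy["hospital_expire_flag"])

# Continuous variable: mean and standard deviation within each outcome group.
los_summary = (
    toy.groupby("hospital_expire_flag")["los_hospital_days"].agg(["mean", "std"])
)

print(counts)
print(los_summary)
```

Comparing a hand-computed row like this against the `TableOne` output is a quick sanity check that the grouping variable and column types were specified as intended.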

## Visualizing the data

Plotting the distribution of each variable by group, using histograms, kernel density estimates, and boxplots, is a crucial part of a data analysis pipeline. In many real-life scenarios, visualization is often the only way to detect problematic variables. We'll review a couple of the variables below.

In [0]:
# Plot distributions to review possible multimodality
cohort[['los_emergency_hrs','los_hospital_days']].dropna().plot.kde(figsize=[12,8])
plt.legend(['ED stay (hours)', 'Hospital LOS (days)'])
plt.xlim([-30,50])
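
When reading a kernel density plot, it helps to remember that the curve is a smoothed estimate, so it can extend beyond the observed range of the data (for example, below zero on the x-axis). The sketch below illustrates this with a hand-written Gaussian kernel density estimate in NumPy. It is an illustration only: the `los_days` values are made up, `gaussian_kde_1d` is a hypothetical helper, and the bandwidth follows Scott's rule on the assumption that this matches the default used by `plot.kde`.

```python
import numpy as np

# Hypothetical lengths of stay in days; the values are made up for illustration.
los_days = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 13.0])


def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate on `grid` (Scott's rule bandwidth)."""
    n = samples.size
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))


grid = np.linspace(-30, 50, 801)
density = gaussian_kde_1d(los_days, grid)

# The estimate integrates to roughly 1 over the grid, and it is non-zero
# below 0 even though no observed stay is negative.
step = grid[1] - grid[0]
print(round(density.sum() * step, 3))
print(density[grid < 0].max() > 0)
```

Density below zero in such a plot therefore does not by itself mean the data contain negative values. Checking the minimum of each column is a more direct test.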