<img src="https://www.esicm.org/wp-content/uploads/2021/02/Plan-de-travail-4-copie-7@3x-150x150.png" alt="Logo" width=128px/>

<img src="https://github.com/AmsterdamUMC/AmsterdamUMCdb/blob/master/img/logo_amds.png?raw=1" alt="Logo" width=128px/>

# 3rd Critical Care Datathon 2021 on AmsterdamUMCdb - Freely Accessible ICU Database

version 1.0.2 March 2020  
Copyright &copy; 2003-2021 Amsterdam UMC - Amsterdam Medical Data Science

## Introduction
To make the most of your time during the datathon, access to AmsterdamUMCdb is provided through Google BigQuery, with Google Colaboratory as the main coding environment. This removes the need to download AmsterdamUMCdb, set up a database system, and install a coding environment.

This tutorial for datathons using AmsterdamUMCdb is based on the original Google BigQuery tutorial on [Colab](https://colab.research.google.com/notebooks/bigquery.ipynb).

## Before you Begin
Ensure you have a working Google account and verify that the e-mail address used when registering for the Datathon is associated with this account. If you already have a Google account, you can add secondary e-mail addresses [here](https://myaccount.google.com/alternateemail), or alternatively create another Google account using the e-mail address you used to register for the Datathon.

If you don't have any experience in using Jupyter notebooks and/or Python, it is recommended to familiarize yourself with the [basics](https://colab.research.google.com/notebooks/intro.ipynb).

## Running Colab
Open Colab with the **getting-started** notebook from the official AmsterdamUMCdb GitHub repository: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AmsterdamUMC/AmsterdamUMCdb/blob/master/datathons/2021-amsterdam/getting-started.ipynb)

**Important**: when following this tutorial, make sure to follow *all* steps and to run the **code cells** using the **Play** button or by pressing `Ctrl-Enter`.

## Provide your credentials to access the AmsterdamUMCdb dataset on Google BigQuery
Authenticate your credentials with Google Cloud Platform and set the default Google Cloud project id for running query jobs. Run the cell, follow the generated link, and paste the verification code in the provided box:

In [None]:
import os
from google.colab import auth

#sets the project id
PROJECT_ID = 'esicmdatathon2021'
DATASET_PROJECT_ID = 'amsterdamumcdb-data'
DATASET_ID = 'ams102'
LOCATION = 'eu'

#all libraries check this environment variable, so set it:
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

auth.authenticate_user()
print('Authenticated')

# Enable data table display

Colab includes the ``google.colab.data_table`` package that can be used to display large pandas dataframes as an interactive data table. This is especially useful when working with the `numericitems` table from AmsterdamUMCdb. It can be enabled with:

In [None]:
%load_ext google.colab.data_table

# Running your first query on AmsterdamUMCdb
BigQuery provides different ways to query the dataset:
- **magics**: the `google.cloud.bigquery` library includes a *magic* command which runs a query and either displays the result or saves it to a Pandas DataFrame. The main advantage of this technique is that syntax highlighting improves the readability of the SQL code. Its main limitation is that it requires a separate cell for the query, so it cannot be combined with other Python code in the same code cell.

Let's query the `admissions` table using magics.

### Sets the default query job configuration for magics

In [None]:
from google.cloud.bigquery import magics
from google.cloud import bigquery

#sets the default query job configuration
config = bigquery.job.QueryJobConfig(default_dataset=DATASET_PROJECT_ID + "." + DATASET_ID)
magics.context.default_query_job_config = config

#sets client options job configuration
client_options = {}
client_options['location'] = LOCATION
magics.context.bigquery_client_options = client_options
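
The default dataset configured above is what allows the queries in this tutorial to refer to tables by their short name, such as `admissions`. As a minimal sketch (the `qualified_table` helper is hypothetical and not part of any library), the equivalent fully qualified table name can be built from the identifiers defined in the authentication cell:

```python
# Values copied from the authentication cell earlier in this tutorial.
DATASET_PROJECT_ID = 'amsterdamumcdb-data'
DATASET_ID = 'ams102'


def qualified_table(table: str) -> str:
    """Return a backtick-quoted, fully qualified table name (hypothetical helper)."""
    return f"`{DATASET_PROJECT_ID}.{DATASET_ID}.{table}`"


# Without a default dataset, a query would need the full name spelled out.
query = f"SELECT * FROM {qualified_table('admissions')} LIMIT 100"
print(query)
```

With the default dataset in place, writing `FROM admissions` is sufficient, which keeps the SQL in the following cells short.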

### Queries the `admissions` table and displays the first 100 admissions without copying the data to a Pandas dataframe

In [None]:
%%bigquery
SELECT * FROM admissions
LIMIT 100

### Queries the `admissions` table and copies the data to the `admissions` Pandas dataframe

In [None]:
%%bigquery admissions
SELECT * FROM admissions

### Display the first 100 rows of the admissions dataframe.

In [None]:
admissions.head(100)

In [None]:
%%bigquery numericitems
SELECT * FROM numericitems
WHERE itemid = 6640
LIMIT 10

In [None]:
numericitems.head()

# Query AmsterdamUMCdb through google-cloud-bigquery

Alternatively, we can manually invoke the `bigquery` Python module. The examples use the previously defined `PROJECT_ID` (cell #4).

See [BigQuery documentation](https://cloud.google.com/bigquery/docs) and [library reference documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html).

### Sets the default query job configuration for google-cloud-bigquery client

In [None]:
from google.cloud import bigquery

#BigQuery requires a separate config to prevent the 'BadRequest: 400 Cannot explicitly modify anonymous table' error message
job_config = bigquery.job.QueryJobConfig()

#sets default client settings by re-using `config`, the default query job configuration defined earlier for magics
client = bigquery.Client(project=PROJECT_ID, location=LOCATION, default_query_job_config=config)

### Get all patients and group by age group

In [None]:
age_groups = client.query('''
SELECT agegroup
    , COUNT(*) AS Number_of_admissions -- COUNT(*) counts everything including NULL
FROM admissions
GROUP BY agegroup
ORDER BY agegroup ASC
''', job_config=job_config).to_dataframe()

age_groups

### Show a plot
Uses the Pandas built-in functions to plot a bar chart.

In [None]:
age_groups.plot(kind='bar', x='agegroup')

In [None]:
numids = client.query('''
SELECT itemid, item
    , COUNT(*) AS number_of_samples -- COUNT(*) counts everything including NULL
FROM numericitems
GROUP BY itemid, item
ORDER BY number_of_samples DESC
''', job_config=job_config).to_dataframe()

numids.head()

# Query AmsterdamUMCdb through pandas-gbq

The third option is to query the dataset using the `pandas-gbq` library. If you have already been using the `pandas.read_sql` function, it is relatively straightforward to modify your existing code to be compatible with BigQuery.

[Pandas GBQ Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html)
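
To illustrate the migration path described above, the following sketch runs a query with `pandas.read_sql` against an in-memory SQLite table. The table holds a few made-up rows (the rows and age group labels are hypothetical, not taken from AmsterdamUMCdb). On BigQuery, the same SQL string would instead be passed to `pd.read_gbq` together with a `configuration` dictionary, as the cells below do.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for the admissions table.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE admissions (admissionid INTEGER, agegroup TEXT)')
con.executemany(
    'INSERT INTO admissions VALUES (?, ?)',
    [(1, '18-39'), (2, '18-39'), (3, '40-49')],
)

sql = '''
SELECT agegroup, COUNT(*) AS number_of_admissions
FROM admissions
GROUP BY agegroup
ORDER BY agegroup ASC
'''

# Existing code based on pandas.read_sql:
age_groups_local = pd.read_sql(sql, con)
print(age_groups_local)

# On BigQuery, only the call changes (not run here, it requires authentication):
# age_groups = pd.read_gbq(sql, configuration=config_gbq)
```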

### Sets the default query job configuration for pandas-gbq

In [None]:
import pandas as pd

config_gbq = {'query': 
          {'defaultDataset': {
              "datasetId": DATASET_ID, 
              "projectId": DATASET_PROJECT_ID
              },
           'Location': LOCATION}
           }

### Get all lactate values

### Creates a dictionary of all numericitems

In [None]:
numericitems_itemids = pd.read_gbq('''
  SELECT DISTINCT itemid, item, unitid, unit 
  FROM numericitems
''', configuration=config_gbq)
numericitems_itemids.head()

### Get all itemids matching lactate

In [None]:
lactate_ids = numericitems_itemids[numericitems_itemids['item'].str.contains('lact', regex=True, case=False)]
lactate_ids
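
Because `str.contains` performs a substring match, the filter above can also return items that merely share the letters `lact`. The sketch below shows this on a made-up DataFrame (the itemids and the items other than `Lactaat (bloed)` are hypothetical), which is why the matches are worth reviewing before settling on an itemid.

```python
import pandas as pd

# Hypothetical toy dictionary; the itemids are placeholders, not real AmsterdamUMCdb ids.
items = pd.DataFrame({
    'itemid': [1, 2, 3],
    'item': ['Lactaat (bloed)', 'LACTULOSE', 'Natrium'],
})

# Same filter as in the tutorial: case-insensitive substring match on 'lact'.
matches = items[items['item'].str.contains('lact', regex=True, case=False)]
print(matches)
```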

### Get lactate values for all patients

In [None]:
lactate = pd.read_gbq('''
  SELECT *
  FROM numericitems
  WHERE itemid = 10053	--Lactaat (bloed)
''', configuration=config_gbq)
lactate.head()

### Plot lactate values using default pandas histogram function

In [None]:
lactate['value'].hist()

### Plot lactate values using outlier aware histogram from AmsterdamUMCdb library

In [None]:
#gets the amsterdamumcdb package from the PyPI repository for use in Colab
!pip install amsterdamumcdb
import amsterdamumcdb as adb

In [None]:
adb.outliers_histogram(data=lactate['value']).show()

In [None]:
adb.outliers_histogram(data=lactate['value'], z_threshold=16).show()

In [None]:
lactate[lactate['value'] > 15].sort_values('value', ascending=False)

This table suggests that the three highest values are most likely data entry errors. They were also documented manually, instead of being filed by the system (Dutch: 'Systeem').
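
The idea of flagging values by their distance from the mean can be sketched with the standard library alone. This is an illustration of z-score based flagging on made-up numbers, not the implementation used by `outliers_histogram`; the values and the threshold of 2 are assumptions chosen for the example.

```python
import statistics

# Hypothetical measurements with one implausibly high entry.
values = [1.0, 1.2, 0.9, 1.5, 2.0, 1.1, 1.3, 0.8, 1.4, 99.0]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation

# Assumed threshold: flag values more than 2 standard deviations from the mean.
Z_THRESHOLD = 2.0
outliers = [v for v in values if abs(v - mean) / sd > Z_THRESHOLD]
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so a flagged value still deserves a manual look at the underlying record, as done in the table above.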

# Finding relevant parameters

The `amsterdamumcdb` package provides the `get_dictionary()` function that retrieves a DataFrame containing all items and itemids in AmsterdamUMCdb. In combination with Colab's interactive data tables, it's possible to quickly locate an item of interest. Since AmsterdamUMCdb originated from a real Dutch ICU database, the original item names are in Dutch. For common ICU parameters, translations have been provided. Full mapping to LOINC and SNOMED CT is currently in progress.

In [None]:
dictionary = adb.get_dictionary()
dictionary

## Steroids

In [None]:
steroids = pd.read_gbq('''
  SELECT *
  FROM drugitems
  WHERE itemid IN (
    --intravenous
    7106	--Hydrocortison (Solu Cortef)
    ,6995	--Dexamethason
    ,6922	--Prednisolon (Prednison)
    ,8132	--Methylprednisolon (Solu-Medrol)	

    --non intravenous
    ,10628	--Fludrocortison (Florinef)
    ,6995	--Dexamethason
    ,7106	--Hydrocortison (Solu Cortef)
    ,9130	--Prednisonum
  )
''', configuration=config_gbq)
steroids.head()
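
In the query above, some itemids are listed under both administration routes. Repeating a value in an `IN` list does not change the result, so the list can also be generated from a de-duplicated mapping. The sketch below does this with plain Python; the `steroid_items` dictionary and the `in_list` and `sql` variables are illustrative names, with itemids and item names copied from the query above.

```python
# Itemids and names copied from the steroids query; each itemid appears once.
steroid_items = {
    7106: 'Hydrocortison (Solu Cortef)',
    6995: 'Dexamethason',
    6922: 'Prednisolon (Prednison)',
    8132: 'Methylprednisolon (Solu-Medrol)',
    10628: 'Fludrocortison (Florinef)',
    9130: 'Prednisonum',
}

# Build the comma-separated list for the IN clause, sorted for readability.
in_list = ', '.join(str(itemid) for itemid in sorted(steroid_items))

sql = f'''
  SELECT *
  FROM drugitems
  WHERE itemid IN ({in_list})
'''
print(sql)

# The string could then be passed to pd.read_gbq(sql, configuration=config_gbq).
```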

## Atrial Fibrillation

In [None]:
afib = pd.read_gbq('''
  SELECT *
  FROM listitems
  WHERE itemid = 6671	--Hartritme
  AND valueid = 13	--Atr fib
''', configuration=config_gbq)
afib.head()

## Conclusion
This concludes our tutorial on accessing AmsterdamUMCdb using BigQuery.

What next?
- Have a look at the AmsterdamUMCdb [wiki](https://github.com/AmsterdamUMC/AmsterdamUMCdb/wiki).
- Check the [table specific Jupyter Notebooks](https://github.com/AmsterdamUMC/AmsterdamUMCdb/tree/master/tables) for more in depth examples of the specific tables.