跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

JackieMium · 2018-08-01T06:09:27Z

SQL 算是学完了，结果回去看 mimic-code 发现大多数脚本根本看不懂！想起来小学做数学习题：

课本例题: 小明有 3 个苹果，吃了 1 个，请问小明还有几个苹果 ?

课后习题：小华前天买了 5 个橘子，昨天吃了 1 个梨，请问小红今天还剩下几个苹果？

我：卒.....

没有办法，我就把 mimic-code 翻来覆去地看，看看有没有什么我能看懂的。果然，MIMIC 很良心的，mimic-code/tutorials/ 里面就放了针对新人的几个简单的小课程，小课程搭配习题，答案也有，可以说是很好了。

一个个来看吧。

1. sql-intro

这个文档基本就是教我们 SQL 的了。基本上我就是泛泛地看了看。有几个值得记下来的：

How can we use temporary tables to help manage queries?

临时的表格可以用 WITH foo AS bar 这样的语法来存放。比如我们要得到从 patients 表格中得出年龄并做他用：

WITH patient_dates AS (
SELECT p.subject_id, p.dob, a.hadm_id, a.admittime,
    ( (cast(a.admittime as date) - cast(p.dob as date)) / 365.2 ) as age
FROM patients p
INNER JOIN admissions a
ON p.subject_id = a.subject_id
ORDER BY subject_id, hadm_id
)
SELECT *
FROM patient_dates;

另一个办法是使用 materialised views，即物化视图：

-- we begin by dropping any existing views with the same name
DROP MATERIALIZED VIEW IF EXISTS patient_dates_view;
CREATE MATERIALIZED VIEW patient_dates_view AS
SELECT p.subject_id, p.dob, a.hadm_id, a.admittime,
    ( (cast(a.admittime as date) - cast(p.dob as date)) / 365.2 ) as age
FROM patients p
INNER JOIN admissions a
ON p.subject_id = a.subject_id
ORDER BY subject_id, hadm_id;

CASE statement for if/else logic

CASE WHEN 是简单的逻辑判断语句。比如我们想对 icustays 中 ICU 住院时间长短 (los) 分组：

-- Use if/else logic to categorise length of stay
-- into 'short', 'medium', and 'long'
SELECT subject_id, hadm_id, icustay_id, los,
    CASE WHEN los < 2 THEN 'short'
         WHEN los >=2 AND los < 7 THEN 'medium'
         WHEN los >=7 THEN 'long'
         ELSE NULL END AS los_group
FROM icustays;

Window functions

Window functions 中文好像翻译为窗口函数，这个窗口其实是艾滋病感染潜伏期窗口那个意思，在 Bowtie2 之类的 RNA-Seq 数据比对之类的软件计算比对质量的时候也会用到这个这个概念。
不知道为什么 SQLBolt 竟然没有涉及到 Window functions，感觉很实用的功能。

Window functions 和 Aggregate 很像，但是 Aggregate 是聚合，会按照我们要求对相同的行进行合并，而 Window functions 则不同。用例子来看会很清楚，比如我们想要对同一个病人多次住 ICU 进行编号。这种情况下直接用 GROUP BY subject_id 会直接把同一个病人信息合并到一行，而我们想要的是每个病人每次入 ICU 的信息仍然单独是一行，顺序通过 admission_time 进行编号。这里的 窗 就是 subject_id ，每个病人为一个处理单位，RANK() 生成顺序编号。代码：

-- find the order of admissions to the ICU for a patient
SELECT subject_id, icustay_id, intime,
    RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays;

有了这样一个编号我们就可以很方便的筛选只住过一次 ICU 的病例了(这个在文献里经常看到)：

-- select patients from icustays who've stayed in ICU for only once
WITH icustayorder AS (
SELECT subject_id, icustay_id, intime,
  RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays
)
SELECT *
FROM icustayorder
WHERE rank = 1;

Multiple temporary views

多个临时视图，这个在 mimic-code 里简直不要太常见。

services 表格包含了病人接受治疗的情况（比如是在外科还是内科这种）：

-- find the care service provided to each hospital admission
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services;

但是这个表格里没有 icustay_id，我们只能通过 hadm_id 来 JOIN：

WITH serv as (
  SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
  FROM services
)
, icu as
(
  SELECT subject_id, hadm_id, icustay_id, intime, outtime
  FROM icustays
)
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, icu.intime, icu.outtime
, serv.transfertime, serv.prev_service, serv.curr_service
FROM icu
INNER JOIN serv
ON icu.hadm_id = serv.hadm_id

但是，这个过程其实中间是有一些猫腻的。INNER JOIN 是取交集的：

那么取完后的结果的行数肯定不多于之前的数据。但是我们看看我们的数据：

WITH serv as (
  SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
  FROM services
)
, icu as
(
  SELECT subject_id, hadm_id, icustay_id, intime, outtime
  FROM icustays
)
SELECT COUNT(*)
FROM icu
INNER JOIN serv
ON icu.hadm_id = serv.hadm_id

这个在我电脑上显示 78840 行，那我们再看 icustays 数据：

SELECT count(*)
FROM icustays;

61532行。哈哈，INNER JOIN 之后行数变多了，刺激！

下面很快给出了解释：

事实是，每个 hadm_id 可能对应了好几个 service 和好几个 icustay_id ，即一个病人院内转科和多次住 ICU 的情况。所以当通过 hadm_id 来 JOIN 两个表的时候，在 hadm_id 相同而 icustay_id 和 services 不同时每种组合都会在结果里作为单独的一行。专业的解释：

More technically, the first query joined two tables on non-unique keys: there may be multiple hadm_id with the same value in the services table, and there may be multiple hadm_id with the same value in the admissions table. For example, if the services table has hadm_id = 100001 repeated N times, and the admissions table has hadm_id = 100001 repeated M times, then joining these two on hadm_id will result in a table with NxM rows: one for every pair. With MIMIC, it is generally very bad practice to join two tables on non-unique columns: at least one of the tables should have unique values for the column, otherwise you end up with duplicate rows and the query results can be confusing.

所以最后，我们可以通过在 services 里对相同的hadm_id 利用窗口函数排序，只留下第一个 service 记录，这样 hadm_id 也就变成了 unique key 了。

 WITH serv as (
   SELECT subject_id, hadm_id, transfertime, prev_service, curr_service,
    RANK() OVER (PARTITION BY hadm_id ORDER BY transfertime) as rank
   FROM services
   )
   , icu as
   (
   SELECT subject_id, hadm_id, icustay_id, intime, outtime
   FROM icustays
   )
   SELECT COUNT(*)
   FROM icu
   INNER JOIN serv
   ON icu.hadm_id = serv.hadm_id
   AND serv.rank = 1;

本来打算只写一点点做个笔记，没想到已经这么长了，那干脆分篇好了。

Ref

关于窗口函数官方文档和中文文档
mimic-code 的 tutorails 。

THE END

The text was updated successfully, but these errors were encountered:

跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

JackieMium added a commit that referenced this issue Aug 1, 2018

#24 跟着 mimic-code 探索 MIMIC 数据之 tutorials （一）

8b9c4dc

跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

JackieMium added Linux Linux related 基础很基础的东西实践 Practice makes perfect Code Talk is chip, show me the code SQL MIMIC MIMIC 数据库相关的东西 and removed Linux Linux related labels Aug 3, 2018

JackieMium added this to ICU in 博文分类 Aug 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

JackieMium commented Aug 1, 2018

跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

跟着 mimic-code 探索 MIMIC 数据之 tutorials （一） #24

Comments

JackieMium commented Aug 1, 2018

1. sql-intro

How can we use temporary tables to help manage queries?

CASE statement for if/else logic

Window functions

Multiple temporary views

Ref