You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jan 12, 2019. It is now read-only.
How can we use temporary tables to help manage queries?
临时的表格可以用 WITH foo AS bar 这样的语法来存放。比如我们要得到从 patients 表格中得出年龄并做他用:
WITH patient_dates AS (
SELECTp.subject_id, p.dob, a.hadm_id, a.admittime,
( (cast(a.admittimeasdate) - cast(p.dobasdate)) /365.2 ) as age
FROM patients p
INNER JOIN admissions a
ONp.subject_id=a.subject_idORDER BY subject_id, hadm_id
)
SELECT*FROM patient_dates;
另一个办法是使用 materialised views,即物化视图:
-- we begin by dropping any existing views with the same name
DROP MATERIALIZED VIEW IF EXISTS patient_dates_view;
CREATE MATERIALIZED VIEW patient_dates_view ASSELECTp.subject_id, p.dob, a.hadm_id, a.admittime,
( (cast(a.admittimeasdate) - cast(p.dobasdate)) /365.2 ) as age
FROM patients p
INNER JOIN admissions a
ONp.subject_id=a.subject_idORDER BY subject_id, hadm_id;
CASE statement for if/else logic
CASE WHEN 是简单的逻辑判断语句。比如我们想对 icustays 中 ICU 住院时间长短 (los) 分组:
-- Use if/else logic to categorise length of stay-- into 'short', 'medium', and 'long'SELECT subject_id, hadm_id, icustay_id, los,
CASE WHEN los <2 THEN 'short'
WHEN los >=2AND los <7 THEN 'medium'
WHEN los >=7 THEN 'long'
ELSE NULL END AS los_group
FROM icustays;
-- find the order of admissions to the ICU for a patientSELECT subject_id, icustay_id, intime,
RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays;
有了这样一个编号我们就可以很方便的筛选只住过一次 ICU 的病例了(这个在文献里经常看到):
-- select patients from icustays who've stayed in ICU for only once
WITH icustayorder AS (
SELECT subject_id, icustay_id, intime,
RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays
)
SELECT*FROM icustayorder
WHERE rank =1;
Multiple temporary views
多个临时视图,这个在 mimic-code 里简直不要太常见。
services 表格包含了病人接受治疗的情况(比如是在外科还是内科这种):
-- find the care service provided to each hospital admissionSELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services;
但是这个表格里没有 icustay_id,我们只能通过 hadm_id 来 JOIN:
WITH serv as (
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services
)
, icu as
(
SELECT subject_id, hadm_id, icustay_id, intime, outtime
FROM icustays
)
SELECTicu.subject_id, icu.hadm_id, icu.icustay_id, icu.intime, icu.outtime
, serv.transfertime, serv.prev_service, serv.curr_serviceFROM icu
INNER JOIN serv
ONicu.hadm_id=serv.hadm_id
但是,这个过程其实中间是有一些猫腻的。INNER JOIN 是取交集的:
那么取完后的结果的行数肯定不多于之前的数据。但是我们看看我们的数据:
WITH serv as (
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services
)
, icu as
(
SELECT subject_id, hadm_id, icustay_id, intime, outtime
FROM icustays
)
SELECTCOUNT(*)
FROM icu
INNER JOIN serv
ONicu.hadm_id=serv.hadm_id
More technically, the first query joined two tables on non-unique keys: there may be multiple hadm_id with the same value in the services table, and there may be multiple hadm_id with the same value in the admissions table. For example, if the services table has hadm_id = 100001 repeated N times, and the admissions table has hadm_id = 100001 repeated M times, then joining these two on hadm_id will result in a table with NxM rows: one for every pair. With MIMIC, it is generally very bad practice to join two tables on non-unique columns: at least one of the tables should have unique values for the column, otherwise you end up with duplicate rows and the query results can be confusing.
WITH serv as (
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service,
RANK() OVER (PARTITION BY hadm_id ORDER BY transfertime) as rank
FROM services
)
, icu as
(
SELECT subject_id, hadm_id, icustay_id, intime, outtime
FROM icustays
)
SELECTCOUNT(*)
FROM icu
INNER JOIN serv
ONicu.hadm_id=serv.hadm_idANDserv.rank=1;
SQL 算是学完了,结果回去看 mimic-code 发现大多数脚本根本看不懂!想起来小学做数学习题:
没有办法,我就把 mimic-code 翻来覆去地看,看看有没有什么我能看懂的。果然,MIMIC 很良心的,
mimic-code/tutorials/
里面就放了针对新人的几个简单的小课程,小课程搭配习题,答案也有,可以说是很好了。一个个来看吧。
1. sql-intro
这个文档基本就是教我们 SQL 的了。基本上我就是泛泛地看了看。有几个值得记下来的:
How can we use temporary tables to help manage queries?
临时的表格可以用 WITH foo AS bar 这样的语法来存放。比如我们要得到从
patients
表格中得出年龄并做他用:另一个办法是使用 materialised views,即物化视图:
CASE statement for if/else logic
CASE WHEN 是简单的逻辑判断语句。比如我们想对
icustays
中 ICU 住院时间长短 (los
) 分组:Window functions
Window functions 中文好像翻译为窗口函数,这个窗口其实是艾滋病感染潜伏期窗口那个意思,在
Bowtie2
之类的 RNA-Seq 数据比对之类的软件计算比对质量的时候也会用到这个这个概念。不知道为什么 SQLBolt 竟然没有涉及到 Window functions,感觉很实用的功能。
Window functions 和 Aggregate 很像,但是 Aggregate 是聚合,会按照我们要求对相同的行进行合并,而 Window functions 则不同。用例子来看会很清楚,比如我们想要对同一个病人多次住 ICU 进行编号。这种情况下直接用
GROUP BY subject_id
会直接把同一个病人信息合并到一行,而我们想要的是每个病人每次入 ICU 的信息仍然单独是一行,顺序通过admission_time
进行编号。这里的窗
就是subject_id
,每个病人为一个处理单位,RANK()
生成顺序编号。 代码:有了这样一个编号我们就可以很方便的筛选只住过一次 ICU 的病例了(这个在文献里经常看到):
Multiple temporary views
多个临时视图,这个在 mimic-code 里简直不要太常见。
services
表格包含了病人接受治疗的情况(比如是在外科还是内科这种):但是这个表格里没有
icustay_id
,我们只能通过hadm_id
来 JOIN:但是,这个过程其实中间是有一些猫腻的。INNER JOIN 是取交集的:
那么取完后的结果的行数肯定不多于之前的数据。但是我们看看我们的数据:
这个在我电脑上显示
78840
行,那我们再看icustays
数据:61532
行。哈哈,INNER JOIN 之后行数变多了,刺激!下面很快给出了解释:
事实是,每个
hadm_id
可能对应了好几个 service 和好几个icustay_id
,即一个病人院内转科和多次住 ICU 的情况。所以当通过hadm_id
来 JOIN 两个表的时候,在hadm_id
相同而icustay_id
和services
不同时每种组合都会在结果里作为单独的一行。专业的解释:所以最后,我们可以通过在
services
里对相同的hadm_id
利用窗口函数排序,只留下第一个service
记录,这样hadm_id
也就变成了 unique key 了。本来打算只写一点点做个笔记,没想到已经这么长了,那干脆分篇好了。
Ref
THE END
The text was updated successfully, but these errors were encountered: