This repository has been archived by the owner on Jan 12, 2019. It is now read-only.
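This document shows how patients are tracked through MIMIC. It explains what subject_id, hadm_id, and icustay_id represent, how to extract a study cohort, and the difference between the service a patient receives and the patient's physical location (honestly, I am not sure yet what that last point means).
I use RStudio + PostgreSQL, so the code differs somewhat from the original notebook.
First, set up the database connection and some basic options: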
library(RPostgreSQL)
library(tidyverse)
# connect to PostgreSQL
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv = drv,
  dbname = "mimic",
  user = "postgres",
  password = .rs.askForPassword("Enter password for user postgres:")
)
# set the search path to the mimiciii schema
dbSendQuery(con, "SET search_path TO mimiciii, public;")
# to save typing, a small helper that runs a query and returns a tibble
query <- function(sql_text) {
  con %>%
    dbGetQuery(sql_text) %>%
    as_tibble()
}
query("SELECT subject_id, hadm_id, icustay_id FROM icustays LIMIT 10;")
# LIMIT 10 is a sensible default when you are only exploring the data
#-----
# A tibble: 10 x 3
#    subject_id hadm_id icustay_id
#  *      <int>   <int>      <int>
#  1        268  110404     280836
#  2        269  106296     206613
#  3        270  188028     220345
#  4        271  173727     249196
#  5        272  164716     210407
#  6        273  158689     241507
#  7        274  130546     254851
#  8        275  129886     219649
#  9        276  135156     206327
# 10        277  171601     272866
Compute the ICU length of stay:
query("SELECT subject_id, hadm_id, icustay_id , outtime - intime as icu_length_of_stay_interval , EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay FROM icustays LIMIT 10;")
#----
# A tibble: 10 x 5
#    subject_id hadm_id icustay_id icu_length_of_stay_interval icu_length_of_stay
#  *      <int>   <int>      <int> <chr>                                    <dbl>
#  1        268  110404     280836 3 days 05:58:33                         280713
#  2        269  106296     206613 3 days 06:41:28                         283288
#  3        270  188028     220345 2 days 21:27:09                         250029
#  4        271  173727     249196 2 days 01:26:22                         177982
#  5        272  164716     210407 1 day 14:53:09                          139989
#  6        273  158689     241507 1 day 11:40:06                          128406
#  7        274  130546     254851 8 days 19:32:32                         761552
#  8        275  129886     219649 7 days 03:09:14                         616154
#  9        276  135156     206327 1 day 08:06:29                          115589
# 10        277  171601     272866 17:33:02                                 63182
query("SELECT subject_id, hadm_id, icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay FROM icustays LIMIT 10;")
#---
# A tibble: 10 x 4
#    subject_id hadm_id icustay_id icu_length_of_stay
#  *      <int>   <int>      <int>              <dbl>
#  1        268  110404     280836              3.25
#  2        269  106296     206613              3.28
#  3        270  188028     220345              2.89
#  4        271  173727     249196              2.06
#  5        272  164716     210407              1.62
#  6        273  158689     241507              1.49
#  7        274  130546     254851              8.81
#  8        275  129886     219649              7.13
#  9        276  135156     206327              1.34
# 10        277  171601     272866              0.731
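As a cross-check on the arithmetic, the seconds-to-days conversion can be sketched in Python. This is only an illustration: the admission timestamp is invented, and only the duration (3 days 05:58:33, the first row of the output above) comes from the data.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 60 * 60 * 24


def length_of_stay_days(intime, outtime):
    """Stay length in fractional days, like EXTRACT(EPOCH FROM outtime - intime)/60/60/24."""
    return (outtime - intime).total_seconds() / SECONDS_PER_DAY


# Hypothetical admission time; the duration matches icustay_id 280836.
intime = datetime(2100, 1, 1, 0, 0, 0)
outtime = intime + timedelta(days=3, hours=5, minutes=58, seconds=33)

los_seconds = (outtime - intime).total_seconds()
los_days = length_of_stay_days(intime, outtime)
print(los_seconds, round(los_days, 2))
```

The seconds value should agree with the icu_length_of_stay column of the interval query, and the rounded days value with the query that divides by 60 * 60 * 24.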
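To filter on ICU length of stay as well, for example to keep only stays of at least 2 days, the calculation first has to be wrapped in a temporary table (a WITH clause). For example: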
query("WITH co AS ( SELECT subject_id, hadm_id, icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay FROM icustays LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay FROM co WHERE icu_length_of_stay >= 2;")
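#---
# A tibble: 6 x 4
#   subject_id hadm_id icustay_id icu_length_of_stay
# *      <int>   <int>      <int>              <dbl>
# 1        268  110404     280836               3.25
# 2        269  106296     206613               3.28
# 3        270  188028     220345               2.89
# 4        271  173727     249196               2.06
# 5        274  130546     254851               8.81
# 6        275  129886     219649               7.13
When selecting a study population, it is good practice to build a cohort table. The table should contain every icustay_id in the database, plus a binary flag that says whether each stay is to be excluded from the study population. Taking the same criterion as above (ICU length of stay of at least 2 days):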
query("WITH co AS ( SELECT subject_id, hadm_id, icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay FROM icustays LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los FROM co;")
#---
# A tibble: 10 x 5
#    subject_id hadm_id icustay_id icu_length_of_stay exclusion_los
#  *      <int>   <int>      <int>              <dbl>         <int>
#  1        268  110404     280836              3.25              0
#  2        269  106296     206613              3.28              0
#  3        270  188028     220345              2.89              0
#  4        271  173727     249196              2.06              0
#  5        272  164716     210407              1.62              1
#  6        273  158689     241507              1.49              1
#  7        274  130546     254851              8.81              0
#  8        275  129886     219649              7.13              0
#  9        276  135156     206327              1.34              1
# 10        277  171601     272866              0.731             1
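The difference between filtering rows and flagging them can be seen in a few lines of plain Python. This is a sketch on the ten length-of-stay values printed above, not a replacement for the SQL; the names MIN_LOS_DAYS, filtered, and cohort are made up for the example.

```python
# (icustay_id, ICU length of stay in days), copied from the ten rows shown above.
stays = [
    (280836, 3.25), (206613, 3.28), (220345, 2.89), (249196, 2.06),
    (210407, 1.62), (241507, 1.49), (254851, 8.81), (219649, 7.13),
    (206327, 1.34), (272866, 0.731),
]

MIN_LOS_DAYS = 2  # assumed threshold, same as the CASE WHEN in the query

# Filtering drops rows, so the excluded stays are no longer visible.
filtered = [(stay_id, los) for stay_id, los in stays if los >= MIN_LOS_DAYS]

# Flagging keeps every row and records the decision in a 0/1 column.
cohort = [
    {
        "icustay_id": stay_id,
        "icu_length_of_stay": los,
        "exclusion_los": 1 if los < MIN_LOS_DAYS else 0,
    }
    for stay_id, los in stays
]

n_excluded = sum(row["exclusion_los"] for row in cohort)
print(len(filtered), len(cohort), n_excluded)
```

With flags, the number of excluded stays stays available for reporting, and changing the threshold later does not require rebuilding the table.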
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay ,icu.intime - pat.dob AS age FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los FROM co;")
#---
# A tibble: 10 x 6
#    subject_id hadm_id icustay_id icu_length_of_stay age                 exclusion_los
#  *      <int>   <int>      <int>              <dbl> <chr>                       <int>
#  1          2  163353     243653             0.0918 21:20:07                        1
#  2          3  145834     211552             6.06   27950 days 19:10:11             0
#  3          4  185777     294638             1.68   17475 days 00:29:31             1
#  4          5  178980     214757             0.0844 06:04:24                        1
#  5          6  107064     228232             3.67   24084 days 21:30:54             0
#  6          7  118037     278444             0.268  15:35:29                        1
#  7          7  118037     236754             0.739  2 days 03:26:01                 1
#  8          8  159514     262299             1.08   12:36:10                        1
#  9          9  150750     220597             5.32   15263 days 13:07:02             0
# 10         10  184167     288409             8.09   11:39:05                        0
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , icu.intime - pat.dob AS age FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age , EXTRACT('year' FROM co.age) as age_extract_year , EXTRACT('year' FROM co.age) + EXTRACT('months' FROM co.age) / 12.0 + EXTRACT('days' FROM co.age) / 365.242 + EXTRACT('hours' FROM co.age) / 24.0 / 364.242 as age_extract_precise , EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los FROM co;")
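#---
# A tibble: 10 x 7
#    subject_id icu_length_of_stay age                 age_extract_year age_extract_precise age_extract_epoch exclusion_los
#  *      <int>              <dbl> <chr>                          <dbl>               <dbl>             <dbl>         <int>
#  1          2             0.0918 21:20:07                           0            0.00240           0.00243              1
#  2          3             6.06   27950 days 19:10:11                0           76.5              76.5                  0
#  3          4             1.68   17475 days 00:29:31                0           47.8              47.8                  1
#  4          5             0.0844 06:04:24                           0            0.000686          0.000693             1
#  5          6             3.67   24084 days 21:30:54                0           65.9              65.9                  0
#  6          7             0.268  15:35:29                           0            0.00172           0.00178              1
#  7          7             0.739  2 days 03:26:01                    0            0.00582           0.00587              1
#  8          8             1.08   12:36:10                           0            0.00137           0.00144              1
#  9          9             5.32   15263 days 13:07:02                0           41.8              41.8                  0
# 10         10             8.09   11:39:05                           0            0.00126           0.00133              0
The last two methods give essentially the same age. The first method only returns 0: the INTERVAL is actually stored in days (plus hours, minutes, and seconds), so extracting the year field does not yield an age. The conclusion is that the working methods barely differ, so pick whichever you prefer. From here on we use the simplest one, EXTRACT(EPOCH FROM ... ).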
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los ,CASE WHEN co.age < 16 then 1 ELSE 0 END as exclusion_age FROM co;")
#---
# A tibble: 10 x 7
#    subject_id hadm_id icustay_id icu_length_of_stay       age exclusion_los exclusion_age
#  *      <int>   <int>      <int>              <dbl>     <dbl>         <int>         <int>
#  1          2  163353     243653             0.0918  0.00243              1             1
#  2          3  145834     211552             6.06   76.5                  0             0
#  3          4  185777     294638             1.68   47.8                  1             0
#  4          5  178980     214757             0.0844  0.000693             1             1
#  5          6  107064     228232             3.67   65.9                  0             0
#  6          7  118037     278444             0.268   0.00178              1             1
#  7          7  118037     236754             0.739   0.00587              1             1
#  8          8  159514     262299             1.08    0.00144              1             1
#  9          9  150750     220597             5.32   41.8                  0             0
# 10         10  184167     288409             8.09    0.00133              0             1
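The epoch-based age calculation and the age flag can be reproduced by hand. The sketch below assumes the same constants as the query (365.242 days per year, adults are 16 or older) and uses two intervals copied from the age column shown earlier; the function name age_in_years is made up.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.242  # same constant as in the SQL
MIN_ADULT_AGE = 16


def age_in_years(age_interval):
    """Mirror EXTRACT('epoch' FROM intime - dob) / 60 / 60 / 24 / 365.242."""
    return age_interval.total_seconds() / 60.0 / 60.0 / 24.0 / DAYS_PER_YEAR


# subject_id 3 (an adult) and subject_id 2 (a newborn), as printed above.
adult = timedelta(days=27950, hours=19, minutes=10, seconds=11)
newborn = timedelta(hours=21, minutes=20, seconds=7)

flags = {}
for label, interval in [("adult", adult), ("newborn", newborn)]:
    age = age_in_years(interval)
    flags[label] = 1 if age < MIN_ADULT_AGE else 0
    print(label, round(age, 5), flags[label])
```

Both ages should match the age column above to the printed precision, and only the newborn is flagged.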
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los, CASE WHEN co.age < 16 then 1 ELSE 0 END as exclusion_age FROM co;")
#---
# A tibble: 10 x 8
#    subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age
#  *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>
#  1          2  163353     243653             0.0918  0.00243                 1             1             1
#  2          3  145834     211552             6.06   76.5                     1             0             0
#  3          4  185777     294638             1.68   47.8                     1             1             0
#  4          5  178980     214757             0.0844  0.000693                1             1             1
#  5          6  107064     228232             3.67   65.9                     1             0             0
#  6          7  118037     278444             0.268   0.00178                 1             1             1
#  7          7  118037     236754             0.739   0.00587                 2             1             1
#  8          8  159514     262299             1.08    0.00144                 1             1             1
#  9          9  150750     220597             5.32   41.8                     1             0             0
# 10         10  184167     288409             8.09    0.00133                 1             0             1
The patient with subject_id 7 has two ICU stays. So we add one more CASE WHEN to flag the repeat stay (this particular stay would also be excluded by the other criteria anyway):
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END AS exclusion_los, CASE WHEN co.age < 16 then 1 ELSE 0 END AS exclusion_age, CASE WHEN co.icustay_id_order != 1 THEN 1 ELSE 0 END AS exclusion_first_stay FROM co;")
#---
# A tibble: 10 x 9
#    subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age exclusion_first_stay
#  *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>                <int>
#  1          2  163353     243653             0.0918  0.00243                 1             1             1                    0
#  2          3  145834     211552             6.06   76.5                     1             0             0                    0
#  3          4  185777     294638             1.68   47.8                     1             1             0                    0
#  4          5  178980     214757             0.0844  0.000693                1             1             1                    0
#  5          6  107064     228232             3.67   65.9                     1             0             0                    0
#  6          7  118037     278444             0.268   0.00178                 1             1             1                    0
#  7          7  118037     236754             0.739   0.00587                 2             1             1                    1
#  8          8  159514     262299             1.08    0.00144                 1             1             1                    0
#  9          9  150750     220597             5.32   41.8                     1             0             0                    0
# 10         10  184167     288409             8.09    0.00133                 1             0             1                    0
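What the window function does can be mimicked in plain Python: group the stays by patient, sort each group by admission time, and number them. In this sketch the IDs come from the output above but the timestamps are invented, and enumerate() behaves like ROW_NUMBER(), which only matches RANK() when a patient has no two stays with the same intime.

```python
from collections import defaultdict
from datetime import datetime

# (subject_id, icustay_id, intime). IDs are from the output above;
# the timestamps are hypothetical and only fix the ordering.
stays = [
    (6, 228232, datetime(2175, 5, 30, 7, 15)),
    (7, 236754, datetime(2121, 5, 25, 3, 26)),
    (7, 278444, datetime(2121, 5, 23, 15, 35)),
    (8, 262299, datetime(2117, 11, 20, 12, 36)),
]

# PARTITION BY subject_id
by_subject = defaultdict(list)
for subject_id, icustay_id, intime in stays:
    by_subject[subject_id].append((intime, icustay_id))

cohort = []
for subject_id, subject_stays in sorted(by_subject.items()):
    # ORDER BY intime, then number the stays starting at 1
    for order, (intime, icustay_id) in enumerate(sorted(subject_stays), start=1):
        cohort.append({
            "subject_id": subject_id,
            "icustay_id": icustay_id,
            "icustay_id_order": order,
            "exclusion_first_stay": 1 if order != 1 else 0,
        })

first_stay_flags = {row["icustay_id"]: row["exclusion_first_stay"] for row in cohort}
print(first_stay_flags)
```

Only the later stay of subject_id 7 ends up flagged, which is the behaviour the extra CASE WHEN is meant to produce.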
query("SELECT subject_id, hadm_id, transfertime, prev_service, curr_service FROM services LIMIT 10;")
#---
# A tibble: 10 x 5
#    subject_id hadm_id transfertime        prev_service curr_service
#  *      <int>   <int> <dttm>              <chr>        <chr>
#  1        471  135879 2122-07-22 14:07:27 TSURG        MED
#  2        471  135879 2122-07-26 18:31:49 MED          TSURG
#  3        472  173064 2172-09-28 19:22:15 NA           CMED
#  4        473  129194 2201-01-09 20:16:45 NA           NB
#  5        474  194246 2181-03-23 08:24:41 NA           NB
#  6        474  146746 2181-04-04 17:38:46 NA           NBB
#  7        475  139351 2131-09-16 18:44:04 NA           NB
#  8        476  161042 2100-07-05 10:26:45 NA           NB
#  9        477  191025 2156-07-20 11:53:03 NA           MED
# 10        478  137370 2194-07-15 13:55:21 NA           NB
As shown above, curr_service is short for current service, and prev_service records the service the patient was transferred from, or null when there was no transfer. For example, the patient with subject_id 471 changed service at least twice: once from TSURG to MED, and once from MED to TSURG (note: LIMIT 10 may have hidden further records; run SELECT * FROM services WHERE subject_id = 471 to check).
CMED: Cardiac Medical - for non-surgical cardiac related admissions
CSURG: Cardiac Surgery - for surgical cardiac admissions
DENT: Dental - for dental/jaw related admissions
ENT: Ear, nose, and throat - conditions primarily affecting these areas
GU: Genitourinary - reproductive organs/urinary system
GYN: Gynecological - female reproductive systems and breasts
MED: Medical - general service for internal medicine
NB: Newborn - infants born at the hospital
NBB: Newborn baby - infants born at the hospital
NMED: Neurologic Medical - non-surgical, relating to the brain
NSURG: Neurologic Surgical - surgical, relating to the brain
OBS: Obstetrics - concerned with childbirth and the care of women giving birth
ORTHO: Orthopaedic - surgical, relating to the musculoskeletal system
OMED: Orthopaedic medicine - non-surgical, relating to musculoskeletal system
PSURG: Plastic - restoration/reconstruction of the human body (including cosmetic or aesthetic)
PSYCH: Psychiatric - mental disorders relating to mood, behaviour, cognition, or perceptions
SURG: Surgical - general surgical service not classified elsewhere
TRAUM: Trauma - injury or damage caused by physical harm from an external source
TSURG: Thoracic Surgical - surgery on the thorax, located between the neck and the abdomen
VSURG: Vascular Surgical - surgery relating to the circulatory system
To exclude patients who received surgical care, we need to exclude these services:
CSURG
NSURG
ORTHO
PSURG
SURG
TSURG
VSURG
This can be done by matching the wildcard pattern %SURG, or the literal ORTHO:
query("SELECT hadm_id, curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical FROM services se LIMIT 10;")
#---
# A tibble: 10 x 3
#    hadm_id curr_service surgical
#  *   <int> <chr>           <int>
#  1  135879 MED                 0
#  2  135879 TSURG               1
#  3  173064 CMED                0
#  4  129194 NB                  0
#  5  194246 NB                  0
#  6  146746 NBB                 0
#  7  139351 NB                  0
#  8  161042 NB                  0
#  9  191025 MED                 0
# 10  137370 NB                  0
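The CASE WHEN with a wildcard can be tried without a MIMIC installation by using Python's built-in sqlite3 module on a toy table. This is a sketch only: the table holds a handful of rows, the ORTHO row uses a made-up hadm_id, and SQLite's LIKE is case-insensitive for ASCII by default whereas PostgreSQL's is case-sensitive (this does not matter here because the service codes are upper case).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (hadm_id INTEGER, curr_service TEXT)")
conn.executemany(
    "INSERT INTO services VALUES (?, ?)",
    [
        (135879, "MED"),
        (135879, "TSURG"),
        (173064, "CMED"),
        (100006, "OMED"),
        (100099, "ORTHO"),  # hypothetical hadm_id, added to exercise the ORTHO branch
    ],
)

rows = conn.execute(
    """
    SELECT curr_service,
           CASE WHEN curr_service LIKE '%SURG' THEN 1
                WHEN curr_service = 'ORTHO' THEN 1
                ELSE 0 END AS surgical
    FROM services
    ORDER BY rowid
    """
).fetchall()
conn.close()

surgical_by_service = dict(rows)
print(surgical_by_service)
```

Note that OMED is not flagged: the pattern only matches codes ending in SURG, and ORTHO has to be listed explicitly.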
query("SELECT icu.hadm_id, icu.icustay_id, curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id LIMIT 10;")
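#----
# A tibble: 10 x 4
#    hadm_id icustay_id curr_service surgical
#  *   <int>      <int> <chr>           <int>
#  1  100001     275225 MED                 0
#  2  100003     209281 MED                 0
#  3  100006     291788 MED                 0
#  4  100006     291788 OMED                0
#  5  100007     217937 SURG                1
#  6  100009     253656 CSURG               1
#  7  100010     271147 GU                  0
#  8  100011     214619 TRAUM               0
#  9  100012     239289 SURG                1
# 10  100016     217590 MED                 0
Now a new problem appears: which service do we pick when one icustay_id maps to several? This is a question of cohort definition, not of how to write the code. Suppose we decide to exclude patients who were under a surgical service before coming to the ICU; the JOIN above then has to change: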
query("SELECT icu.hadm_id, icu.icustay_id, se.curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour LIMIT 10;")
#----
# A tibble: 10 x 4
#    hadm_id icustay_id curr_service surgical
#  *   <int>      <int> <chr>           <int>
#  1  100001     275225 MED                 0
#  2  100003     209281 MED                 0
#  3  100006     291788 MED                 0
#  4  100007     217937 SURG                1
#  5  100009     253656 CSURG               1
#  6  100010     271147 GU                  0
#  7  100011     214619 TRAUM               0
#  8  100012     239289 SURG                1
#  9  100016     217590 MED                 0
# 10  100017     258320 MED                 0
Finally, we collapse the result so that each ICU stay has a single service record. As before, we use RANK():
query("WITH serv AS ( SELECT icu.hadm_id, icu.icustay_id, se.curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour LIMIT 10 ) SELECT hadm_id, icustay_id, curr_service, surgical FROM serv WHERE rank = 1;")
#----
# A tibble: 10 x 4
#    hadm_id icustay_id curr_service surgical
#  *   <int>      <int> <chr>           <int>
#  1  100001     275225 MED                 0
#  2  100003     209281 MED                 0
#  3  100006     291788 MED                 0
#  4  100007     217937 SURG                1
#  5  100009     253656 CSURG               1
#  6  100010     271147 GU                  0
#  7  100011     214619 TRAUM               0
#  8  100012     239289 SURG                1
#  9  100016     217590 MED                 0
# 10  100017     258320 MED                 0
And last of all, we JOIN this to the cohort table built earlier:
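The two steps in the service queries (keep only services that started before ICU admission plus a 12-hour tolerance, then take the most recent one) can be written out in plain Python. This is a sketch with invented timestamps for a single admission; the helper name service_at_icu_admission is made up, and the tolerance is an assumption you may want to tighten.

```python
from datetime import datetime, timedelta

TOLERANCE = timedelta(hours=12)  # same window as "+ interval '12' hour"


def service_at_icu_admission(icu_intime, service_rows):
    """Return the latest curr_service with transfertime < icu_intime + 12 h, or None."""
    eligible = [
        (transfertime, curr_service)
        for transfertime, curr_service in service_rows
        if transfertime < icu_intime + TOLERANCE
    ]
    if not eligible:
        return None  # mirrors the NULL that the LEFT JOIN would produce
    # The latest transfertime wins, like rank = 1 with ORDER BY transfertime DESC.
    return max(eligible)[1]


# Hypothetical timestamps for one admission: MED before the ICU stay, OMED after it.
icu_intime = datetime(2108, 4, 6, 15, 50)
service_rows = [
    (datetime(2108, 4, 6, 15, 49), "MED"),
    (datetime(2108, 4, 11, 10, 0), "OMED"),
]

service = service_at_icu_admission(icu_intime, service_rows)

# A transfer recorded one hour after ICU admission still counts because of the tolerance.
late_entry = service_rows + [(icu_intime + timedelta(hours=1), "SURG")]
service_with_late_entry = service_at_icu_admission(icu_intime, late_entry)
print(service, service_with_late_entry)
```

The second call shows the effect of the tolerance: a surgical transfer logged shortly after ICU admission is still treated as the service at admission.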
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10), serv AS ( SELECT icu.hadm_id, icu.icustay_id, se.curr_service , CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END as surgical , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour ) SELECT co.subject_id, co.hadm_id, co.icustay_id , co.icu_length_of_stay, co.age, co.icustay_id_order , CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END AS exclusion_los , CASE WHEN co.age < 16 then 1 ELSE 0 END AS exclusion_age , CASE WHEN co.icustay_id_order != 1 THEN 1 ELSE 0 END AS exclusion_first_stay , CASE WHEN serv.surgical = 1 THEN 1 ELSE 0 END as exclusion_surgical FROM co LEFT JOIN serv ON co.icustay_id = serv.icustay_id AND serv.rank = 1;")
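#----
# A tibble: 10 x 10
#    subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age exclusion_first_stay exclusion_surgical
#  *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>                <int>              <int>
#  1          6  107064     228232             3.67   65.9                     1             0             0                    0                  1
#  2          7  118037     278444             0.268   0.00178                 1             1             1                    0                  0
#  3          7  118037     236754             0.739   0.00587                 2             1             1                    1                  0
#  4          3  145834     211552             6.06   76.5                     1             0             0                    0                  1
#  5          9  150750     220597             5.32   41.8                     1             0             0                    0                  0
#  6          8  159514     262299             1.08    0.00144                 1             1             1                    0                  0
#  7          2  163353     243653             0.0918  0.00243                 1             1             1                    0                  0
#  8          5  178980     214757             0.0844  0.000693                1             1             1                    0                  0
#  9         10  184167     288409             8.09    0.00133                 1             0             1                    0                  0
# 10          4  185777     294638             1.68   47.8                     1             1             0                    0                  0
We now have the patient cohort we need and can start extracting data.
Finally, a summary of the selection process (this last step could also be written in R, but that felt like too much trouble, so I pasted the Python as-is):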
import pandas as pd
import numpy as np
import psycopg2
from IPython.display import display, HTML

sqluser = 'postgres'
dbname = 'mimic'
schema_name = 'mimiciii'
con = psycopg2.connect(dbname=dbname, user=sqluser, password='not_shown_here')

query_schema = 'set search_path to ' + schema_name + ';'
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, first_careunit
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
  ON icu.hadm_id = se.hadm_id
  AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  , serv.curr_service
  , co.first_careunit
  , CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END AS exclusion_los
  , CASE WHEN co.age < 16 then 1 ELSE 0 END AS exclusion_age
  , CASE WHEN co.icustay_id_order != 1 THEN 1 ELSE 0 END AS exclusion_first_stay
  , CASE WHEN serv.surgical = 1 THEN 1 ELSE 0 END as exclusion_surgical
FROM co
LEFT JOIN serv
  ON co.icustay_id = serv.icustay_id
  AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)

# total number of ICU stays in the cohort table
print('{:20s} {:5d}'.format('Observations', df.shape[0]))

# count each exclusion flag and accumulate the union of all exclusions
idxExcl = np.zeros(df.shape[0], dtype=bool)
for col in df.columns:
    if "exclusion_" in col:
        print('{:20s} {:5d} ({:2.2f}%)'.format(col, df[col].sum(), df[col].sum()*100.0/df.shape[0]))
        idxExcl = (idxExcl) | (df[col]==1)

print('')
print('{:20s} {:5d} ({:2.2f}%)'.format('Total excluded', np.sum(idxExcl), np.sum(idxExcl)*100.0/df.shape[0]))
# --------
Observations            10
exclusion_los            6 (60.00%)
exclusion_age            6 (60.00%)
exclusion_first_stay     1 (10.00%)
exclusion_surgical       2 (20.00%)

Total excluded           9 (90.00%)
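The same tally can be checked without a database connection. The sketch below copies the four exclusion flags from the ten-row cohort table shown earlier into a plain dictionary (the names FLAG_NAMES, cohort, and included are made up for the example) and recomputes the per-criterion counts and the total.

```python
# Flag order: exclusion_los, exclusion_age, exclusion_first_stay, exclusion_surgical
FLAG_NAMES = ["exclusion_los", "exclusion_age", "exclusion_first_stay", "exclusion_surgical"]

# icustay_id -> flags, copied from the ten-row cohort table shown earlier.
cohort = {
    228232: (0, 0, 0, 1),
    278444: (1, 1, 0, 0),
    236754: (1, 1, 1, 0),
    211552: (0, 0, 0, 1),
    220597: (0, 0, 0, 0),
    262299: (1, 1, 0, 0),
    243653: (1, 1, 0, 0),
    214757: (1, 1, 0, 0),
    288409: (0, 1, 0, 0),
    294638: (1, 0, 0, 0),
}

# Per-criterion counts: a stay can be counted under several criteria.
counts = {
    name: sum(flags[i] for flags in cohort.values())
    for i, name in enumerate(FLAG_NAMES)
}

# A stay is included only if no flag is set.
included = [icustay_id for icustay_id, flags in cohort.items() if not any(flags)]
total_excluded = len(cohort) - len(included)

for name, n in counts.items():
    print('{:20s} {:5d} ({:2.2f}%)'.format(name, n, n * 100.0 / len(cohort)))
print('{:20s} {:5d} ({:2.2f}%)'.format('Total excluded', total_excluded,
                                       total_excluded * 100.0 / len(cohort)))
```

Because a stay can fail several criteria at once, the per-criterion counts add up to more than the total number of excluded stays.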
library(RPostgreSQL)
library(tidyverse)
query <- function(sql_text) {
  con %>%
    dbGetQuery(sql_text) %>%
    as_tibble()
}
# connect to DB -----------------------------------------------------------
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv = drv,
  dbname = "mimic",
  user = "postgres",
  password = .rs.askForPassword("Enter password for user postgres:")
)
# set the search path to the mimiciii schema
dbSendQuery(con, "SET search_path TO mimiciii, public;")
# being lazy
query("SELECT subject_id, hadm_id, icustay_id FROM icustays LIMIT 10")
query("SELECT subject_id, hadm_id, icustay_id , outtime - intime as icu_length_of_stay_interval , EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay FROM icustays LIMIT 10;")
query("SELECT subject_id, hadm_id, icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay FROM icustays LIMIT 10;")
query("WITH co AS ( SELECT subject_id, hadm_id, icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay FROM icustays LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay FROM co WHERE icu_length_of_stay >= 2;")
query("WITH co AS ( SELECT subject_id, hadm_id, icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay FROM icustays LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los FROM co;")
# age ---------------------------------------------------------------------
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , icu.intime - pat.dob AS age FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.icu_length_of_stay, co.age , EXTRACT('year' FROM co.age) as age_extract_year , EXTRACT('year' FROM co.age) + EXTRACT('months' FROM co.age) / 12.0 + EXTRACT('days' FROM co.age) / 365.242 + EXTRACT('hours' FROM co.age) / 24.0 / 364.242 as age_extract_precise , EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los FROM co;")
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los ,CASE WHEN co.age < 16 then 1 ELSE 0 END as exclusion_age FROM co;")
# readmission -------------------------------------------------------------
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END as exclusion_los, CASE WHEN co.age < 16 then 1 ELSE 0 END as exclusion_age FROM co;")
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10 ) SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order, CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END AS exclusion_los, CASE WHEN co.age < 16 then 1 ELSE 0 END AS exclusion_age, CASE WHEN co.icustay_id_order != 1 THEN 1 ELSE 0 END AS exclusion_first_stay FROM co;")
# service -----------------------------------------------------------------
query("SELECT subject_id, hadm_id, transfertime, prev_service, curr_service FROM services LIMIT 10;")
query("SELECT * FROM services WHERE subject_id = 471;")
query("SELECT hadm_id, curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical FROM services se LIMIT 10;")
query("SELECT icu.hadm_id, icu.icustay_id, curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id LIMIT 10;")
query("SELECT * FROM services WHERE hadm_id=100006;")
query("SELECT * FROM icustays WHERE hadm_id=100006;")
query("SELECT icu.hadm_id, icu.icustay_id, se.curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour LIMIT 10;")
query("WITH serv AS ( SELECT icu.hadm_id, icu.icustay_id, se.curr_service, CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END AS surgical, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour LIMIT 10 ) SELECT hadm_id, icustay_id, curr_service, surgical FROM serv WHERE rank = 1;")
# together ----------------------------------------------------------------
query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10), serv AS ( SELECT icu.hadm_id, icu.icustay_id, se.curr_service , CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END as surgical , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour ) SELECT co.subject_id, co.hadm_id, co.icustay_id , co.icu_length_of_stay, co.age, co.icustay_id_order , CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END AS exclusion_los , CASE WHEN co.age < 16 then 1 ELSE 0 END AS exclusion_age , CASE WHEN co.icustay_id_order != 1 THEN 1 ELSE 0 END AS exclusion_first_stay , CASE WHEN serv.surgical = 1 THEN 1 ELSE 0 END as exclusion_surgical FROM co LEFT JOIN serv ON co.icustay_id = serv.icustay_id AND serv.rank = 1;")
df<- query("WITH co AS ( SELECT icu.subject_id, icu.hadm_id, icu.icustay_id , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order FROM icustays icu INNER JOIN patients pat ON icu.subject_id = pat.subject_id LIMIT 10), serv AS ( SELECT icu.hadm_id, icu.icustay_id, se.curr_service , CASE WHEN curr_service like '%SURG' then 1 WHEN curr_service = 'ORTHO' then 1 ELSE 0 END as surgical , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank FROM icustays icu LEFT JOIN services se ON icu.hadm_id = se.hadm_id AND se.transfertime < icu.intime + interval '12' hour ) SELECT co.subject_id, co.hadm_id, co.icustay_id , co.icu_length_of_stay, co.age, co.icustay_id_order , CASE WHEN co.icu_length_of_stay < 2 then 1 ELSE 0 END AS exclusion_los , CASE WHEN co.age < 16 then 1 ELSE 0 END AS exclusion_age , CASE WHEN co.icustay_id_order != 1 THEN 1 ELSE 0 END AS exclusion_first_stay , CASE WHEN serv.surgical = 1 THEN 1 ELSE 0 END as exclusion_surgical FROM co LEFT JOIN serv ON co.icustay_id = serv.icustay_id AND serv.rank = 1;")
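Among the mimic-code tutorials there is also sql-crosstab. It is short, and after skimming it I did not find it very useful, so I set it aside. using_r_with_jupyter.ipynb just shows how to use Jupyter + R, nothing special. explore-items.Rmd uses MySQL + R; I could not quite work out what it is doing, I do not have MySQL, and although porting the code to Postgres should not be hard, I was too lazy. So I went straight to the last one, cohort-selection.ipynb. It uses Postgres + Python and covers some tricks for selecting a patient cohort, and it seems well written. This is the one, so let's start. The original uses Python, which I do not like. R is better, of course, so I simply reuse its sql statements.
Cohort selection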
The aim of this tutorial is to describe how patients are tracked in the MIMIC-III database. By the end of this notebook you should:
understand what subject_id, hadm_id, and icustay_id represent
Requirements:
pip install numpy pandas matplotlib psycopg2 jupyter
Cohort selection generally starts from three tables: patients, admissions, and icustays.
patients: information about a patient that does not change - e.g. date of birth, genotypical sex
admissions: information recorded on hospital admission - admission type (elective, emergency), time of admission
icustays: information recorded on intensive care unit admission - primarily admission and discharge time
MIMIC-III is primarily an ICU database, so we usually care about a patient's admissions to and discharges from the ICU. For the same reason, a cohort is usually not built around the patient (subject_id) but around the ICU stay, that is, the icustay_id in the icustays table.
The first step is to compute the ICU length of stay.
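EXTRACT(EPOCH FROM ... ) converts the INTERVAL between two TIMESTAMP values into seconds, so to get the length of stay in days we still have to divide by (60 * 60 * 24).
To filter on ICU length of stay as well, for example to keep only stays of at least 2 days, the calculation first has to be wrapped in a temporary table (a WITH clause).
This keeps only the stays with a length of stay of at least 2 days.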
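Many studies that use MIMIC focus on a specific population. For example, MIMIC contains ICU stays for both adults and newborns, but a study rarely covers both groups at once. So the first step in many studies is to select the study population from the icustays table, that is, to pick the appropriate icustay_id values. The example above selected ICU stays of at least 2 days.
When selecting a study population, it is good practice to build a cohort table. The table should contain every icustay_id in the database, plus a binary flag that says whether each stay is to be excluded from the study population. Take the same length-of-stay criterion again. In the earlier example only 6 rows were returned, because 4 rows had been filtered out. Here all 10 rows are kept, and the last column shows that 4 of them should not be part of the study population.
The advantage of this approach is that at the end of the study it is easy to summarise the exclusions for the whole population, and easy to change them when needed.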
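Recall the exclusion criterion mentioned earlier: non-adult patients should be flagged for exclusion. To do that we first need the patient's age on ICU admission, which is computed from the date of birth and the ICU admission time. The intime column in icustays records the ICU admission time, so we also need the date of birth from patients.
Once again, the computed age turns out to be an INTERVAL, so it has to be converted. There are 3 ways to do this: (1) use EXTRACT() on the INTERVAL, which has the form days + hours : minutes : seconds, and divide to get years (the approach used earlier); (2) use AGE() to get the exact age, then DATE_PART() to extract the number of years; (3) use AGE() to get the exact age, then DATE_PART() to extract years, months, and days separately and compute a precise age. Let's try all three.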
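The last two methods give essentially the same age. The first method only returns 0: the INTERVAL is actually stored in days (plus hours, minutes, and seconds), so extracting the year field does not yield an age. The conclusion is that the working methods barely differ, so pick whichever you prefer. From here on we use the simplest one, EXTRACT(EPOCH FROM ... ). We can then exclude newborns by requiring age >= 16 (this would also exclude adolescents, but MIMIC only contains newborns and adults).
Six rows are flagged for exclusion because the age is under 16, and most of them were already flagged by the length-of-stay criterion.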
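Next, another common exclusion criterion: ICU readmissions, whether within the same hospital admission or a later one. The reason is that many statistical analyses require independent observations. If several ICU stays from the same patient are kept, the strong correlation between them has to be accounted for (the same patient being admitted repeatedly for the same condition), which complicates the analysis unnecessarily. So we use RANK() to number each patient's ICU stays in order.
The patient with subject_id 7 has two ICU stays. So we add one more CASE WHEN to flag the repeat stay (this particular stay would also be excluded by the other criteria anyway). After that, the second stay of the patient with subject_id 7 is indeed flagged for exclusion.
Finally, we may also want to exclude some patients based on the service they were admitted under. Patients differ a lot between services, and excluding particular groups makes the study population more homogeneous.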
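The services table records which service a patient was admitted under.
In its output, curr_service is short for current service, and prev_service records the service the patient was transferred from, or null when there was no transfer. For example, the patient with subject_id 471 changed service at least twice: once from TSURG to MED, and once from MED to TSURG (note: LIMIT 10 may have hidden further records; run SELECT * FROM services WHERE subject_id = 471 to check).
All the service codes in the table are listed on the MIMIC website: http://mimic.physionet.org/mimictables/services/.
To exclude patients who received surgical care, we need to exclude the surgical services. This can be done by matching the wildcard pattern %SURG, or the literal ORTHO.
OK, everything that should be excluded is now flagged. But we only have hadm_id, while our cohort is built around icustay_id. So we JOIN to the icustays table on hadm_id to obtain icustay_id.
Now a new problem appears: which service do we pick when one icustay_id maps to several? This is a question of cohort definition, not of how to write the code. Suppose we decide to exclude patients who were under a surgical service before coming to the ICU; the JOIN then has to change.
Compared with the previous result, the row with hadm_id = 100006 and service = OMED is gone, because this patient's OMED service started after the ICU admission and is therefore not counted (although OMED is non-surgical anyway). Note the + interval '12' hour in the JOIN, which adds a little tolerance to the exclusion criterion. The timestamps in the data are entered by different people, in different parts of the hospital, at different moments, so some inconsistency is inevitable. For example, an ICU patient may be transferred because surgery is needed, yet the recorded transfer time falls an hour after the ICU admission time. This is administrative "noise", and the 12-hour window helps absorb it. Again, this is a question of how the cohort is defined: you may feel 12 h is too long and 2-4 h is more suitable. For our example it makes little difference, because 80% of the patients have no service transfer.
Finally, we collapse the result so that each ICU stay has a single service record. As before, we use RANK().
And last of all, we JOIN this to the cohort table built earlier.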
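We now have the patient cohort we need and can start extracting data.
Finally, a summary of the selection process (this last step could also be written in R, but that felt like too much trouble, so I pasted the Python as-is).
Because we built a cohort table with exclusion flags, reviewing the whole selection process at the end is very simple.
I found this notebook really useful. First, it shows how to write the query for each step, with detailed explanations. Second, and most importantly, it conveys the general approach to selecting a study cohort.
THE END
All the R code is also pasted at the end as a backup.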