Create MIMIC-III in a local apache-Drill query engine (standalone mode)
Instructions for use
For installation instructions, see: https://drill.apache.org/docs/installing-drill-in-embedded-mode
Once drill is installed, clone the mimic-code repository into a local directory using the following command:
$ git clone https://github.com/MIT-LCP/mimic-code.gitGUI
DBeaver [http://dbeaver.jkiss.org/download/](download dbeaver) works great with Drill.
- File>New>Choose Drill
- Fill JDBC URL = jdbc:drill:drillbit=localhost:31010
- Host = localhost
- Port = 31010
Install Mimic Data
Run drill:
$ cd /path/to/drill
$ bin/drill-embeddedConfigure Drill:
Default install has 2 problems: i) is it has a default storage in temporary folder => when reboot, data are lost ii) csv format does not use header by default. Let's configure it.
- edit /path/to/drill/conf/drill-override.conf
- it should look like :
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "localhost:2181",
sys.store.provider: {
class: "org.apache.drill.exec.store.sys.zk.ZkPStoreProvider",
zk: {
blobroot: "file:///var/log/drill"
},
local: {
path: "/path/to/drilldata/",
write: true
}
}
}
- edit /path/to/drill/conf/drill-env-sh add
-Duser.timezone=UTto DRILL_JAVA_OPTS in order understand mimic dates formats - You can specify the maximum memory there too thanks to DRILL_MAX_DIRECT_MEMORY
- restart drill (ctrl + d) in the console where drill was started; then restart it as described before.
- go to http://localhost:8047/
- go to onglet "storage" and update "dfs"
- add a mimiciii (path you want to save drill table as parquet files) location after the tmp location:
"tmp" : { "location" : "/tmp", "writable" : true, "defaultInputFormat" : null }, "mimiciii" : { "location" : "/path/to/drilldata", "writable" : true, "defaultInputFormat" : null } - make sure "csv" section has extractHeader:
"csv" : { "type" : "text", "extensions" : [ "csv" ], "extractHeader" : true, "delimiter" : "," } - validate
Create/Load the table/data
In DBeaver, connect and copy/paste:
- drill_create_data.sql
- create alias for all csv in temp (dfs.tmp = /tmp on linux) ln -s /path/to/mimic/csv/* /tmp
- create table one by one. Otherwize, it crashes when multi query are made in one run
Notes
- NOTEEVENTS : To be loaded there is two fixes:
- Drill does not accept newlines in text fields (it actually splits csv based on newlines to parallelize reading processes)then replace \n with
<b>by example or remove them before loading the csv - Drill has a bug when a double quote is the last character of a text field. Row_ID 387846, 982481, 1008470, 1036580 has in there DESCRIPTION field such case. Remove it and drill will be able to load NOTEEVENTS
- For now, Drill does not have a regex operator(not a ANSI SQL). However it exists a function that cover this needs at : https://github.com/parisni/drill-simple-contains
- Drill does not accept newlines in text fields (it actually splits csv based on newlines to parallelize reading processes)then replace \n with
- Example of query: SELECT * FROM dfs.mimiciii.
CHARTEVENTSLIMIT 10;
Create MIMIC-III in a local apache-Drill query engine (distributed mode)
to be done