Postgres CHARTEVENTS partition is no longer efficient #136
First, let me inform you that Postgres 9.6 was released today: https://www.postgresql.org/docs/9.6/static/release-9-6.html That being said, sorry, I was not aware of that new MIMIC feature. Questions:
Great news! Looking forward to trying some of the parallel processing features. No problem - I will update you, as it's also relevant to the issue at hand. Essentially, we added a large amount of data to CHARTEVENTS. This data is from TEXTSIGNALS in Metavision; it is structured drop-down box data. Personally, I am finding that the ventilation-durations.sql query is very slow now, but that is a very inefficient query. I think queries in general will be slowed because almost all of the Metavision data is in partition 14 (ITEMID 220074 and above).
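For anyone who wants to confirm the skew, the catalog tables make it easy to see how unevenly rows are spread across the child tables. A minimal sketch, assuming the MIMIC-III child tables inherit from `chartevents` (as in the standard build scripts):

```sql
-- List each CHARTEVENTS partition with its on-disk size and the
-- planner's approximate row count (reltuples is an estimate).
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS size,
       c.reltuples::bigint AS approx_rows
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'chartevents'::regclass
ORDER BY c.relname;
```

If partition 14 dominates this output, most itemid-filtered queries will end up scanning it.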
First, do you have a set of chartevents queries that are typically run? I am not sure I have the big picture of the variety of queries run against it. The approach I describe above works for the queries I wrote, but maybe not for all use cases. The way I use chartevents always begins with getting a subset of it, based on an itemid. If this does not represent the way all users query it, then I would rethink that approach.

My first reaction to storing those drop-down box values as text in chartevents was surprise. In general, such predefined values are stored as smallint ids referencing a dimension table around the fact table. But adding a new column (say, drop_down_box) to chartevents would be very costly, and it would mostly contain null values. So storing those values as text is maybe the best option.

In the general case, speeding up chartevents on PostgreSQL means reducing sequential scans. It is not possible to avoid them entirely, which is why partitioning is a good help: it limits sequential scan time in most cases. Partitioning data by data type can be a good option, because it allows different partitions to be indexed in specialized ways. That is why separating "text rows" from "num rows" would allow the indexes (say, on value/valuenum respectively) to be adapted. That said, I am not sure it is relevant to index value/valuenum for now, because itemid is already very disperse.

So maybe a better option would be to work on dispersion rather than data type. For example, put all itemids that are very disperse into a single huge partition; querying on itemid there would always result in an index scan (a small % of the table). The itemids that represent a huge portion of chartevents would go into other, smaller partitions. The table below shows the 100 largest itemids by row count. They represent 140M rows (roughly the middle of the chartevents table?). Separating them into 14 tables of 10M rows would mean those tables are used for sequential scans only, and 10M rows is a decent amount for a sequential scan - versus one very disparate table of 200M rows.

To be fair, on PostgreSQL I would implement a chartevents table with an FDW: cstore_fdw in the case of a standalone server (take a look at this well-explained post: http://blog.dbi-services.com/optimized-row-columnar-orc-format-in-postgresql/), or hive_fdw in case you have a Hadoop cluster (http://www1.bigsql.org/hadoopfdw/).
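The dispersion analysis described above can be sketched as a simple aggregation (run once, offline - it is itself a full sequential scan):

```sql
-- Rank itemids by row count: the top entries are candidates for their
-- own dedicated partitions, while the long tail of rarely-used itemids
-- can share a single large, index-scanned partition.
SELECT itemid, COUNT(*) AS n_rows
FROM chartevents
GROUP BY itemid
ORDER BY n_rows DESC
LIMIT 100;
```

The resulting counts are what the "14 tables of 10M rows" proposal would be derived from.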
@theonesp this is the issue that we discussed yesterday, in case you are interested in taking it on.
theonesp commented Nov 16, 2016
@tompollard Great! I'll have a look. Thanks
theonesp commented Nov 19, 2016
@tompollard Why did you decide to create 14 tables?
@theonesp 14 is just an arbitrary number and the aim was to improve performance. Not a great deal of thought went into the partitioning, so there is plenty of room for improvement :)
alistairewj commented Sep 30, 2016
With the addition of text data (a fix in MIMIC-III v1.4), the partitioning of CHARTEVENTS is no longer effectively distributing the data.
Essentially, the addition of the text data added ~100 million rows to ITEMIDs > 220000. I've noticed queries are going slower, and I'm sure users will start to notice this too. Perhaps we should take this opportunity to move to a smarter partitioning strategy - one based on the type of data stored with the ITEMID (e.g. the Metavision text data goes into a few partitions of its own, vital sign data goes into its own partition, etc.). I'd welcome thoughts from anyone who has any - @parisni @tompollard
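A type-based strategy could be sketched with PostgreSQL's inheritance-based partitioning, which is what the MIMIC-III build scripts use. The table names and itemid ranges below are illustrative only (the real scripts would need the full itemid-to-type mapping):

```sql
-- Hypothetical child tables routing rows by the kind of data stored.
-- CHECK constraints let constraint_exclusion skip irrelevant partitions
-- when a query filters on itemid.
CREATE TABLE chartevents_mv_text (
  CHECK (itemid >= 220074)   -- illustrative cutoff for Metavision text data
) INHERITS (chartevents);

CREATE TABLE chartevents_vitals (
  CHECK (itemid BETWEEN 220045 AND 220073)  -- illustrative vital-sign range
) INHERITS (chartevents);
```

With inheritance-based partitioning, inserts into the parent must be routed to the correct child by a trigger, and `constraint_exclusion` must be enabled for the planner to prune partitions.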