poor performance for whole-system search with _lastUpdated #1654
Labels: bug, P2, performance, schema-change, search
Describe the bug
A whole-system search like the following on the large PostgreSQL database took over 2 minutes, with over 700 resources matching:
?_lastUpdated=2020-10-20T19:39:06Z
Searching the whole system for any resource updated on or after a given date also took over 1 minute 21 seconds, and found 0 matching resources:
?_lastUpdated=ge2020-10-30
Expected behavior
I expected a whole-system search on _lastUpdated to be faster, since it is likely to be one of the more common searches.
Additional context
New Query Builder - Whole-System Search
Specification:
Thoughts
The current implementation builds a huge query that is a union of searches over every resource type in the model.
The current schema contains placeholder tables for system-level search parameters, but these are not populated during ingestion.
For certain use-cases, only a relatively small number of FHIR resource tables will contain data. Eliminating the empty tables from the query makes the optimizer's job easier.
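The idea of leaving empty tables out of the union can be sketched as below. This is a minimal illustration and not the project's query builder: the function name `build_whole_system_query`, the `RESOURCE_ID` and `LAST_UPDATED` column names, and the per-type `xx_LOGICAL_RESOURCES` table naming are assumptions made for the example.

```python
from typing import Iterable


def build_whole_system_query(resource_types_with_data: Iterable[str]) -> str:
    """Build one UNION ALL branch per resource type that actually holds data.

    Table and column names here are illustrative assumptions, not the real schema.
    """
    branches = []
    for resource_type in resource_types_with_data:
        # Resource type names are interpolated into SQL, so accept only
        # plain alphanumeric names (hypothetical safety check).
        if not resource_type.isalnum():
            raise ValueError(f"unexpected resource type name: {resource_type!r}")
        branches.append(
            f"SELECT '{resource_type}' AS RESOURCE_TYPE, LR.LOGICAL_ID, LR.CURRENT_RESOURCE_ID\n"
            f"  FROM {resource_type}_LOGICAL_RESOURCES LR\n"
            f"  JOIN {resource_type}_RESOURCES R ON R.RESOURCE_ID = LR.CURRENT_RESOURCE_ID\n"
            f" WHERE R.LAST_UPDATED >= ?"
        )
    if not branches:
        raise ValueError("no resource types contain data")
    return "\nUNION ALL\n".join(branches)


# Only two resource types hold data, so the union has two branches
# instead of one branch per resource type in the model.
query = build_whole_system_query(["Patient", "Observation"])
```

Each branch carries one `?` bind marker, so the caller would bind the same timestamp once per branch.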
Options:
In all cases (except option 5) the goal is to use the new query builder, allowing for removal of the old code and consolidation of things like string constants.
[...] LOGICAL_RESOURCES and do not require a union. All other searches require union based on new query structure.
[...] LOGICAL_RESOURCES.
[...] LOGICAL_RESOURCES.
[...] resource_type_id column in these parameter tables; cardinality estimates likely to be even more problematic, leading to poor query planning and performance issues; schema change to add columns to LOGICAL_RESOURCES.
FIRST CHOICE: 3
SECOND CHOICE: 2
It is currently unknown whether it will be necessary to add resource_type_id to the system-level parameter tables. The current design thinking for option 3 is to switch to the UNION join if _type is used to limit the number of resource types, e.g. &_type=Observation,Patient, even if all parameters in the search are available at the system level.
Whole-system search results may include multiple resource types. The union-style queries can join against the relevant xx_RESOURCES table to fetch the DATA column, but this is not possible for the optimized queries that use only the system-level LOGICAL_RESOURCES table. For those queries, additional queries must be run to fetch the data payloads.
To make the search implementation more consistent (and simplify the search joins in the process), search queries should not fetch the DATA payload but instead just return the list of {RESOURCE_TYPE_ID, LOGICAL_ID, CURRENT_RESOURCE_ID} tuples. Further queries can then be run to fetch the required resources from the relevant xx_RESOURCES tables (the code for this part is already available - it just needs to be called per resource type). The implementation should therefore group the resources by resource type to facilitate efficient retrieval of the payload from the DATA column.
Schema Change
Add the following columns to LOGICAL_RESOURCES:
Note: columns are added as nullable because they need to be populated with realistic data first.
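The two-step retrieval described earlier (the search returns {RESOURCE_TYPE_ID, LOGICAL_ID, CURRENT_RESOURCE_ID} tuples, then payloads are fetched per resource type) might look roughly like the sketch below. The helper names, the `type_names` lookup and the `RESOURCE_ID` key column are assumptions for illustration; the existing per-type fetch code mentioned above is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (RESOURCE_TYPE_ID, LOGICAL_ID, CURRENT_RESOURCE_ID)
SearchHit = Tuple[int, str, int]


def group_by_resource_type(hits: Iterable[SearchHit]) -> Dict[int, List[int]]:
    """Collect CURRENT_RESOURCE_ID values per RESOURCE_TYPE_ID, keeping hit order."""
    grouped: Dict[int, List[int]] = defaultdict(list)
    for resource_type_id, _logical_id, current_resource_id in hits:
        grouped[resource_type_id].append(current_resource_id)
    return dict(grouped)


def build_fetch_statements(
    hits: Iterable[SearchHit], type_names: Dict[int, str]
) -> List[Tuple[str, List[int]]]:
    """Return one (sql, bind_values) pair per resource type present in the hits.

    type_names maps RESOURCE_TYPE_ID to the table prefix (hypothetical lookup).
    """
    statements = []
    for type_id, resource_ids in group_by_resource_type(hits).items():
        table = f"{type_names[type_id]}_RESOURCES"
        placeholders = ", ".join("?" for _ in resource_ids)
        sql = f"SELECT RESOURCE_ID, DATA FROM {table} WHERE RESOURCE_ID IN ({placeholders})"
        statements.append((sql, resource_ids))
    return statements


hits = [(1, "patient-a", 10), (2, "obs-a", 20), (1, "patient-b", 11)]
statements = build_fetch_statements(hits, {1: "Patient", 2: "Observation"})
```

With the sample hits, two statements are produced, so the number of payload queries scales with the number of resource types in the result page rather than with the number of resources. The caller would still need to restore the original result order after fetching.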
Data Migration
Population of the above columns can be performed either:
Optimizations
_profile
This search parameter is likely to be used frequently in IG-related FHIR search queries, making it a good candidate for optimization. The parameter type is uri, which is stored as a string value in xx_STR_VALUES. The values are typically long strings and will often be the same for every resource associated with a given profile. This wastes space, and is particularly costly for indexes where the value string is not a leading member of the index.
One possible solution is to normalize the value and store it in common_token_values. The resource_token_refs table could be used to provide the mapping between COMMON_TOKEN_VALUE.TOKEN_VALUE and the resource, although a new table would need to be added to support this mapping at the system level.
Because storing multiple parameters in the same values table causes challenges with cardinality estimation (especially in PostgreSQL), a better solution is to create a new table at both the system and resource-specific levels for the sole purpose of handling the profile search parameter values. This is warranted because in many cases profile is unlikely to be a very selective predicate: it acts as a filter rather than as the selective part of the query, and should therefore be processed late in the execution plan.
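A dedicated, normalized profile table can be sketched as follows, using an in-memory SQLite database purely for illustration. The table names `profile_values` and `logical_resource_profiles`, their columns, and the helper functions are hypothetical and are not part of the existing schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each distinct profile URL is stored exactly once (hypothetical table).
    CREATE TABLE profile_values (
        profile_id  INTEGER PRIMARY KEY,
        profile_url TEXT NOT NULL UNIQUE
    );
    -- Narrow mapping from resource to profile (hypothetical table).
    CREATE TABLE logical_resource_profiles (
        logical_resource_id INTEGER NOT NULL,
        profile_id          INTEGER NOT NULL REFERENCES profile_values (profile_id)
    );
    CREATE INDEX idx_lrp_profile
        ON logical_resource_profiles (profile_id, logical_resource_id);
    """
)


def profile_id_for(db: sqlite3.Connection, url: str) -> int:
    """Return the id for a profile URL, inserting the URL if it is new."""
    db.execute("INSERT OR IGNORE INTO profile_values (profile_url) VALUES (?)", (url,))
    row = db.execute(
        "SELECT profile_id FROM profile_values WHERE profile_url = ?", (url,)
    ).fetchone()
    return row[0]


def tag_resource(db: sqlite3.Connection, logical_resource_id: int, url: str) -> None:
    """Record that a resource claims conformance to the given profile."""
    db.execute(
        "INSERT INTO logical_resource_profiles VALUES (?, ?)",
        (logical_resource_id, profile_id_for(db, url)),
    )


def resources_with_profile(db: sqlite3.Connection, url: str) -> list:
    """Filter resources by profile through the small integer key."""
    rows = db.execute(
        """
        SELECT lrp.logical_resource_id
          FROM logical_resource_profiles lrp
          JOIN profile_values pv ON pv.profile_id = lrp.profile_id
         WHERE pv.profile_url = ?
         ORDER BY lrp.logical_resource_id
        """,
        (url,),
    ).fetchall()
    return [r[0] for r in rows]
```

The long URL string appears once in `profile_values`, and the mapping table and its index hold only integers, which addresses the space concern above. A separate table also keeps the profile values out of the statistics for other parameters.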