Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

poor performance for whole-system search with _lastUpdated #1654

Closed
d0roppe opened this issue Oct 30, 2020 · 4 comments
Closed

poor performance for whole-system search with _lastUpdated #1654

d0roppe opened this issue Oct 30, 2020 · 4 comments
Assignees
Labels
bug Something isn't working P2 Priority 2 - Should Have performance performance schema-change a schema change search

Comments

@d0roppe
Copy link
Collaborator

d0roppe commented Oct 30, 2020

Describe the bug
A clear and concise description of what the bug is.
Doing a whole system search like the following on the large postgres DB took over 2 minutes with over 700 resources matching
?_lastUpdated=2020-10-20T19:39:06Z

Also looking for any resource updated since date for the whole system is taking over 1 min 21 seconds finding 0 resources that match.
?_lastUpdated=ge2020-10-30

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.
Was expecting the whole system search on _lastUpdated to go quicker as it might be a more common search.

Additional context

New Query Builder - Whole-System Search

Specification:

However, in some circumstances, a search is executed where there is no fixed type of resource:

Using search across all resource types (GET [base]?params...)
...

...the search criteria may need to specify one or more resource types that the search applies to. This can be done by using the _type parameter:

    GET [base]/?_type=Observation,Condition&other params...

If no type is specified, the only search parameters that be can be used in global search like this are the base parameters that apply to all resources. If multiple types are specified, any search parameters shared across the entire set of specified resources may be used (see search parameter registry).

Thoughts

Current implementation builds a huge query which is a union of searches for every resource in the model.

The current schema contains placeholder tables for system-level search parameters, but these are currently not populated during ingestion.

For certain use-cases, a relatively small number of FHIR resource tables will contain data. Eliminating the tables from the join makes it easier for the optimizer.

Options:

In all cases (except option 5) the goal is to use the new query builder, allowing for removal of the old code and consolidation of things like string constants.

# Description Pros Cons
1 Replace old query builder with new query builder. Build the same SQL Least effort; eliminates old code and improves future maintainability; No schema change No performance improvement.
2 Do not change parameter storage. Whole-system searches on _id and _lastUpdated query LOGICAL_RESOURCES and do not require a union. All other searches require union based on new query structure. Relatively simple to implement. No change to ingestion. Only improves performance for a subset of whole-system searches; schema change to add columns to LOGICAL_RESOURCES.
3 Store Resource-level parameters at both system and resource-specific levels. If the search uses only system-level parameters, use system-level tables. If the search uses common parameters, then revert to UNION. Because search parameters will be stored at the resource level, it is simple to build the search for a given resource type using the existing code. Should improve performance for a larger subset of whole-system searches. Requires update to ingestion; greater storage cost as some parameters stored and indexed twice; schema change to add columns to LOGICAL_RESOURCES.
4 Store Resource-level parameters only at the system level. Rework existing queries to use these tables instead. Only stores the search parameters once, saving storage; More complicated query writing; possible impact on performance when selecting parameters from resource-level tables; requires schema change to include resource_type_id column in these parameter tables; cardinality estimates likely to be even more problematic leading to poor query planning and performance issues; schema change to add columns to LOGICAL_RESOURCES.
5 Do nothing. Minimal effort; no schema change. High maintenance costs from keeping the old query builder; no performance benefit; potential difference in behavior between resource-specific and whole-system searches.

FIRST CHOICE: 3

SECOND CHOICE: 2

It is currently unknown if it will be necessary to introduce the resource_type_id to the system-level parameter tables. The current design thinking for option 3 is to switch to the UNION join if _type is used to limit the number of resources e.g. &type=Observation,Patient, even if all parameters in the search are available at the system level.


Whole-system search results may include multiple resource-types. The union-style queries can join against the relevant xx_RESOURCES table to fetch the DATA column, but this is not possible for the optimized queries using only the system-level LOGICAL_RESOURCES table. For those queries, additional queries must to be run to fetch the data payloads.

In order to make the search implementation more consistent (and simplifying the search joins in the process), all search queries should not fetch the DATA payload but instead just return the list of {RESOURCE_TYPE_ID, LOGICAL_ID, CURRENT_RESOURCE_ID} tuples. Further queries can then be run to fetch the required resources from the relevant xx_RESOURCES tables (the code for this part is already available - it just needs to be called per resource-type). The implementation should therefore group the resources by resource-type to facilitate efficient retrieval of the payload from the DATA column.

Schema Change

Add the following columns to LOGICAL_RESOURCES:

Column Type Nullable? Default
last_updated TIMESTAMP Y NULL
is_deleted CHAR(1) Y 'X'
version_id INT Y NULL

Note: columns are added as nullable because they need to be populated with realistic data first.

Data Migration

Population of the above columns can be performed either:

  1. during schema migration (migration steps within the schema tool). Long schema upgrade times can cause issues with container startup timeouts, so we'd prefer to avoid this if possible.
  2. during reindex. Requires code change to the reindex DAO to update the values in these columns.

Optimizations

_profile

This search parameter is likely to be used frequently in IG-related FHIR search queries making it a good candidate for optimization. The parameter type is uri which is stored as a string value in xx_STR_VALUES. The values are typically long strings and will often be the same value for every resource (associated with the given profile). This wastes space, and is particular costly for some indexes where the value string is not a leading member of the index.

One possible solution is to normalize the value and store it in common_token_values. The resource_token_refs table could be used to provide the mapping between the COMMON_TOKEN_VALUE.TOKEN_VALUE and the resource, although a new table would need to be added to support this mapping at the system-level. Because storing multiple parameters in the same values table causes challenges with cardinality estimation (especially in PostgreSQL), a better solution is to generate a new table at both the system- and resource-specific levels for the whole purpose of handling the profile search parameter values. This is warranted because in many cases, profile will just act as a filter, rather than as part of the selective part of a query and should therefore be processed late in the execution plan because it is unlikely to be a very selective predicate.

@d0roppe d0roppe added the bug Something isn't working label Oct 30, 2020
@lmsurpre
Copy link
Member

lmsurpre commented Nov 1, 2020

possibly related to #1331 (done) and #264 (not done)

@prb112 prb112 added the search label Nov 2, 2020
@kmbarton423 kmbarton423 added the performance performance label Nov 9, 2020
@kmbarton423 kmbarton423 added the P2 Priority 2 - Should Have label Feb 18, 2021
@lmsurpre
Copy link
Member

investigate migration of whole-system search to new query builder. there is related item for adding last_updated column to the global LOGICAL_RESOURCES table...that is separate.

@punktilious punktilious added this to the Sprint 2021-07 milestone May 11, 2021
@punktilious punktilious self-assigned this May 11, 2021
@punktilious punktilious added the schema-change a schema change label May 12, 2021
@michaelwschroeder
Copy link
Contributor

This item should have been addressed by issues #1921, #1922, #1923

@d0roppe
Copy link
Collaborator Author

d0roppe commented Jul 2, 2021

with the latest code the first search finding last updated for a specific date of over 700 resources, the search now comes back in 500 ms
Ran the second search with a changed date to more current to get 0 results. and it came back in 200 ms. Closing issue.

@d0roppe d0roppe closed this as completed Jul 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P2 Priority 2 - Should Have performance performance schema-change a schema change search
Projects
None yet
Development

No branches or pull requests

6 participants