Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

whole-system-search: use global parameter tables in whole system search join #1922

Closed
punktilious opened this issue Feb 8, 2021 · 3 comments
Assignees
Labels
P2 Priority 2 - Should Have performance performance persistence search

Comments

@punktilious
Copy link
Collaborator

punktilious commented Feb 8, 2021

Related to #264. Once resource ingestion populates the global parameter tables (parameter tables which are tied directly to the global LOGICAL_RESOURCES table, the whole system search query can be rewritten to greatly simplify the join.

The best implementation to retrieve the data payload for each resource type needs to be investigated. Options include a WITH+UNION or perhaps an outer join with a large coalesce statement. To simplify the join as much as possible, it may be necessary to hold the payload at the global layer, which would help to align the design with the ability to offload to the payload to external storage. These options should be evaluated for their impact on performance before any changes are made to the search query build code.

For example, the following structure may perform reasonably well:

        FROM fhirdata.resource_types rt,
             fhirdata.logical_resources lr,
             fhirdata.str_values AS param0
       WHERE rt.resource_type_id = lr.resource_type_id
         AND param0.parameter_name_id = 2582160
         AND param0.str_value = 'abc123'
         AND lr.logical_resource_id = param0.logical_resource_id
     ) 
         SELECT lres.logical_resource_id
           FROM driver
LEFT OUTER JOIN fhirdata.patient_logical_resources AS lres
             ON (lres.logical_resource_id = driver.logical_resource_id
            AND driver.resource_type = 'Patient')
UNION ALL
         SELECT lres.logical_resource_id
           FROM driver
LEFT OUTER JOIN fhirdata.observation_logical_resources AS lres
             ON (lres.logical_resource_id = driver.logical_resource_id
            AND driver.resource_type = 'Observation')
UNION ALL
         SELECT lres.logical_resource_id
           FROM driver
LEFT OUTER JOIN fhirdata.claim_logical_resources AS lres
             ON (lres.logical_resource_id = driver.logical_resource_id
            AND driver.resource_type = 'Claim')
;

Although this form requires the common table expression (CTE) result to be scanned multiple times even though it is evaluated only once. The 'resource_type = 'Claim' predicate in the outer join helps to avoid any logical reads for rows not matching the particular resource type for that sub-query which should keep CPU usage lower.

An alternative is to outer-join each of the resource-specific tables to the main whole-system-search subquery:

SELECT COALESCE(patients.logical_resource_id,
                claims.logical_resource_id,
                observations.logical_resource_id) AS logical_resource_id
  FROM (SELECT rt.resource_type, lr.logical_resource_id
        FROM fhirdata.resource_types rt,
             fhirdata.logical_resources lr,
             fhirdata.str_values AS param0
       WHERE rt.resource_type_id = lr.resource_type_id
         AND param0.parameter_name_id = 2582160
         AND param0.str_value = 'abc123'
         AND lr.logical_resource_id = param0.logical_resource_id
     ) AS driver
LEFT OUTER JOIN fhirdata.patient_logical_resources AS patients
             ON patients.logical_resource_id = driver.logical_resource_id
LEFT OUTER JOIN fhirdata.observation_logical_resources AS observations
             ON observations.logical_resource_id = driver.logical_resource_id
LEFT OUTER JOIN fhirdata.claim_logical_resources AS claims
             ON claims.logical_resource_id = driver.logical_resource_id
;

The downside of this is we make one logical read on every resource-specific logical-resources table for every matching row from the search (the driver sub-query). There are 139 resource-types in the FHIR-R4 model, requiring 139 logical reads per matching row (only one of which will actually match).

@lmsurpre lmsurpre added the P2 Priority 2 - Should Have label Mar 11, 2021
@lmsurpre lmsurpre added this to the Sprint 2021-08 milestone Jun 1, 2021
michaelwschroeder added a commit that referenced this issue Jun 9, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 10, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 14, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 15, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 16, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 18, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 18, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 22, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 23, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
punktilious added a commit that referenced this issue Jun 23, 2021
Issue #1922 - implement whole-system search in new query builder
michaelwschroeder added a commit that referenced this issue Jun 24, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 24, 2021
Signed-off-by: Mike Schroeder <mschroed@us.ibm.com>
michaelwschroeder added a commit that referenced this issue Jun 25, 2021
Issue #1922 - don't run reindex tests during search tests
@michaelwschroeder
Copy link
Contributor

There are four different scenarios for whole-system search, each of which will generate different SQL:

  1. _type parameter is not specified, and only global search parameters are specified
  2. _type parameter is specified, and only global search parameters are specified
  3. _type parameter is not specified, and non-global search parameters are specified (in this case would be user-defined whole-system search parameters)
  4. _type parameter is specified, and non-global search parameters are specified

Global search parameters are search parameters defined for all resources, and for which we have system-wide tables in which to store the indexed values. We currently support the following global search parameters:
_id
_lastUpdated
_profile
_security
_source
_tag

For scenarios 1 and 2, where only global search parameters are specified, we'll execute a "filter" query to get the list of logical resource IDs which match the search criteria, and then we'll execute a "data" query to get a page worth of resource data, which will be returned to the caller. Here's a filter query for scenario 1:

      SELECT LR0.RESOURCE_TYPE_ID, LR0.LOGICAL_RESOURCE_ID 
        FROM LOGICAL_RESOURCES AS LR0
       WHERE LR0.IS_DELETED = 'N'
         AND EXISTS (
      SELECT 1 
        FROM LOGICAL_RESOURCES AS LR1 
  INNER JOIN LOGICAL_RESOURCE_TAGS AS P2 ON P2.LOGICAL_RESOURCE_ID = LR1.LOGICAL_RESOURCE_ID
         AND ((P2.COMMON_TOKEN_VALUE_ID IN (9580,9517)))
       WHERE LR1.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)
    ORDER BY LR0.LOGICAL_RESOURCE_ID
 FETCH FIRST 10 ROWS ONLY

For scenario 2, the filter query is the same as above, only with an additional where clause to select only the resource types specified on the _type parameter:

      SELECT LR0.RESOURCE_TYPE_ID, LR0.LOGICAL_RESOURCE_ID 
        FROM LOGICAL_RESOURCES AS LR0
       WHERE LR0.IS_DELETED = 'N'
         AND LR0.RESOURCE_TYPE_ID IN (104,97)
         AND EXISTS (
      SELECT 1 
        FROM LOGICAL_RESOURCES AS LR1 
  INNER JOIN LOGICAL_RESOURCE_TAGS AS P2 ON P2.LOGICAL_RESOURCE_ID = LR1.LOGICAL_RESOURCE_ID
         AND ((P2.COMMON_TOKEN_VALUE_ID IN (9580,9517)))
       WHERE LR1.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)
    ORDER BY LR0.LOGICAL_RESOURCE_ID
 FETCH FIRST 10 ROWS ONLY

In both scenarios 1 and 2, the data query will be a UNION of type-specific resource tables based on the resource types of the logical resource IDs returned by the filter query. Since we're only getting one page worth of data, this will typically be a small number of resource types. Here's an example:

      SELECT RESOURCE_ID, LOGICAL_RESOURCE_ID, VERSION_ID, LAST_UPDATED, IS_DELETED, DATA, LOGICAL_ID 
        FROM (
      SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID 
        FROM (
      SELECT LR.LOGICAL_RESOURCE_ID, LR.LOGICAL_ID, LR.CURRENT_RESOURCE_ID 
        FROM Patient_LOGICAL_RESOURCES AS LR
       WHERE LR.IS_DELETED = 'N'
         AND LR.LOGICAL_RESOURCE_ID IN (4706,4718,4720,4722,4724)) AS LR 
  INNER JOIN Patient_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID
      UNION ALL
      SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID 
        FROM (
      SELECT LR.LOGICAL_RESOURCE_ID, LR.LOGICAL_ID, LR.CURRENT_RESOURCE_ID 
        FROM Observation_LOGICAL_RESOURCES AS LR
       WHERE LR.IS_DELETED = 'N'
         AND LR.LOGICAL_RESOURCE_ID IN (1560,1570,1572,1574,1576)) AS LR 
  INNER JOIN Observation_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID) AS COMBINED_RESULTS
    ORDER BY COMBINED_RESULTS.LOGICAL_RESOURCE_ID

For scenarios 3 and 4, we'll build one large UNION'd query to get the data, like we do today with the old query builder. We have to do it that way since the search parameter index values will be in the resource type-specific values tables rather than system-wide values tables. Here is an example query for scenario 4:

      SELECT RESOURCE_ID, LOGICAL_RESOURCE_ID, VERSION_ID, LAST_UPDATED, IS_DELETED, DATA, LOGICAL_ID 
        FROM (
      SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID 
        FROM (
      SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID 
        FROM Condition_LOGICAL_RESOURCES AS LR0
       WHERE LR0.IS_DELETED = 'N'
         AND EXISTS (
      SELECT 1 
        FROM Condition_LOGICAL_RESOURCES AS LR1 
  INNER JOIN Condition_RESOURCE_TOKEN_REFS AS P2 ON P2.LOGICAL_RESOURCE_ID = LR1.LOGICAL_RESOURCE_ID
         AND P2.PARAMETER_NAME_ID = 1396
         AND (P2.COMMON_TOKEN_VALUE_ID = 30)
       WHERE LR1.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)) AS LR 
  INNER JOIN Condition_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID
      UNION ALL 
      SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID 
        FROM (
      SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID 
        FROM Observation_LOGICAL_RESOURCES AS LR0
       WHERE LR0.IS_DELETED = 'N'
         AND EXISTS (
      SELECT 1 
        FROM Observation_LOGICAL_RESOURCES AS LR3 
  INNER JOIN Observation_RESOURCE_TOKEN_REFS AS P4 ON P4.LOGICAL_RESOURCE_ID = LR3.LOGICAL_RESOURCE_ID
         AND P4.PARAMETER_NAME_ID = 1396
         AND (P4.COMMON_TOKEN_VALUE_ID = 30)
       WHERE LR3.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)) AS LR 
  INNER JOIN Observation_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID
      UNION ALL 
      SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID 
        FROM (
      SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID 
        FROM Procedure_LOGICAL_RESOURCES AS LR0
       WHERE LR0.IS_DELETED = 'N'
         AND EXISTS (
      SELECT 1 
        FROM Procedure_LOGICAL_RESOURCES AS LR5 
  INNER JOIN Procedure_RESOURCE_TOKEN_REFS AS P6 ON P6.LOGICAL_RESOURCE_ID = LR5.LOGICAL_RESOURCE_ID
         AND P6.PARAMETER_NAME_ID = 1396
         AND (P6.COMMON_TOKEN_VALUE_ID = 30)
       WHERE LR5.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)) AS LR 
  INNER JOIN Procedure_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID) AS COMBINED_RESULTS
    ORDER BY COMBINED_RESULTS.LOGICAL_RESOURCE_ID
 FETCH FIRST 10 ROWS ONLY

For scenario 3, the query would be the same, except there would be a UNION'd sub-query for all supported resource types.

@d0roppe
Copy link
Collaborator

d0roppe commented Jul 2, 2021

Ran the search bucket against this latest code as suggested. Some new limitations in the search that used to be supported. Not sure if that was intentional, suspected a bug. Moving this back to in progress.
Search like

[base]/Organization?_id=id-abc-123&address=a

now gets

{
    "resourceType": "OperationOutcome",
    "id": "id-abc-123",
    "issue": [
        {
            "code": "invalid",
            "details": {
                "text": "Search parameter 'address' is not supported by read."
            },
            "expression": [
                "<empty>"
            ],
            "severity": "fatal"
        }
    ]
}

@d0roppe d0roppe modified the milestones: Sprint 2021-08, Sprint 2021-09 Jul 2, 2021
@d0roppe
Copy link
Collaborator

d0roppe commented Jul 2, 2021

My mistake I ran an outdated search bucket. I reran the current search bucket where those statements had been updated. and the tests pass now. Closing issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 Priority 2 - Should Have performance performance persistence search
Projects
None yet
Development

No branches or pull requests

5 participants