Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slow join query on concepts belonging to source #910

Closed
rkorytkowski opened this issue Aug 12, 2021 · 5 comments
Closed

Fix slow join query on concepts belonging to source #910

rkorytkowski opened this issue Aug 12, 2021 · 5 comments
Assignees
Milestone

Comments

@rkorytkowski
Copy link
Contributor

rkorytkowski commented Aug 12, 2021

SELECT "concepts"."id", "concepts"."internal_reference_id", "concepts"."public_access", "concepts"."created_at", "concepts"."updated_at", "concepts"."created_by_id", "concepts"."updated_by_id", "concepts"."is_active", "concepts"."extras", "concepts"."uri", "concepts"."version", "concepts"."released", "concepts"."retired", "concepts"."is_latest_version", "concepts"."name", "concepts"."full_name", "concepts"."default_locale", "concepts"."supported_locales", "concepts"."website", "concepts"."description", "concepts"."external_id", "concepts"."concept_class", "concepts"."datatype", "concepts"."comment", "concepts"."parent_id", "concepts"."versioned_object_id", "concepts"."mnemonic", T4."id", T4."internal_reference_id", T4."public_access", T4."created_at", T4."updated_at", T4."created_by_id", T4."updated_by_id", T4."is_active", T4."extras", T4."uri", T4."logo_path", T4."mnemonic", T4."version", T4."released", T4."retired", T4."is_latest_version", T4."name", T4."full_name", T4."default_locale", T4."supported_locales", T4."website", T4."description", T4."external_id", T4."custom_validation_schema", T4."organization_id", T4."user_id", T4."_background_process_ids", T4."canonical_url", T4."identifier", T4."contact", T4."jurisdiction", T4."publisher", T4."purpose", T4."copyright", T4."revision_date", T4."text", T4."snapshot", T4."experimental", T4."meta", T4."source_type", T4."content_type", T4."collection_reference", T4."hierarchy_meaning", T4."case_sensitive", T4."compositional", T4."version_needed", T4."hierarchy_root_id", "organizations"."id", "organizations"."internal_reference_id", "organizations"."public_access", "organizations"."created_at", "organizations"."updated_at", "organizations"."created_by_id", "organizations"."updated_by_id", "organizations"."is_active", "organizations"."extras", "organizations"."uri", "organizations"."logo_path", "organizations"."name", "organizations"."company", "organizations"."website", "organizations"."location", "organizations"."mnemonic", "organizations"."description", "organizations"."text" FROM "concepts" INNER JOIN "concepts_sources" ON ("concepts"."id" = "concepts_sources"."concept_id") INNER JOIN "sources" T4 ON ("concepts"."parent_id" = T4."id") LEFT OUTER JOIN "organizations" ON (T4."organization_id" = "organizations"."id") WHERE ("concepts_sources"."source_id" = 726 AND "concepts"."is_active" AND "concepts"."is_latest_version") ORDER BY "concepts"."id" DESC LIMIT 100 OFFSET 95700
takes 174971.665 ms on staging.

Possible improvements:

  1. Do not join on organizations and sources (do not prefetch). All concepts belong to the same org and source so it should be faster to lazy fetch org and source and have django use cache to assign the same object to each concept.
  2. Modify the query to no use join, but a subquery. JOIN is extremely slow on such a big result and it needs to perform the full join because of the OFFSET and ORDER BY. e.g.:
    SELECT * FROM "concepts" WHERE ("id" in (select "concept_id" from "concepts_sources" where "source_id" = 726 order by "concept_id" DESC LIMIT 100 OFFSET 95700) AND "concepts"."is_active" AND "concepts"."is_latest_version");
    returns in under 200 ms as compared to JOIN which takes seconds.
    The only problem is is_active and is_latest_version will further limit number of rows returned, which may make it nonviable option depending on the usage scenario...
  3. If it is an export and we know that we'll be fetching batches one after another in the same "transaction" then it makes more sense to use .iterator() instead of slicing (OFFSET) as the join query is only executed once and further iterations can be done instantly. Please note it cannot be combined with 2 in such a use case as IN will not perform well given thousands of values.
@rkorytkowski rkorytkowski changed the title Fix slow join query Fix slow join query on concepts belonging to source Aug 12, 2021
@paynejd paynejd added this to the Sprint 16 milestone Aug 12, 2021
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Aug 13, 2021
@snyaggarwal
Copy link
Contributor

@rkorytkowski I have updated the export code to not use joins (select_related) on source/orgs. Also, the concepts queryset is not joining concept_sources now and uses the exact sub query method you mentioned above.

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Aug 13, 2021
@rkorytkowski
Copy link
Contributor Author

I commented on the commit. Please tweak it.

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Aug 13, 2021
@snyaggarwal
Copy link
Contributor

@rkorytkowski Updated based on feedback

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Aug 13, 2021
@rkorytkowski
Copy link
Contributor Author

Cool! Let's test this out on staging! Thanks!

@jamlung-ri jamlung-ri modified the milestones: Sprint 16, Sprint 17 Aug 13, 2021
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Aug 15, 2021
@rkorytkowski
Copy link
Contributor Author

Great improvement! ~1000 times faster on staging. Thanks @snyaggarwal !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants