Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source summary slow queries #1008

Closed
snyaggarwal opened this issue Sep 29, 2021 · 11 comments
Closed

Source summary slow queries #1008

snyaggarwal opened this issue Sep 29, 2021 · 11 comments
Assignees
Labels
api2 OCL API v2

Comments

@snyaggarwal
Copy link
Contributor

2021-08-26 13:00:58 UTC:10.1.11.186(47126):postgres@postgres:[20203]:LOG: duration: 153187.007 ms statement: SELECT COUNT() AS "__count" FROM "mappings" INNER JOIN "mappings_sources" ON ("mappings"."id" = "mappings_sources"."mapping_id") WHERE ("mappings_sources"."source_id" = 746 AND "mappings"."is_active" AND NOT "mappings"."retired" AND "mappings"."is_latest_version")
and
2021-08-26 13:02:20 UTC:10.1.11.186(41156):postgres@postgres:[13448]:LOG: duration: 145315.597 ms statement: SELECT COUNT(
) AS "__count" FROM "concepts" INNER JOIN "concepts_sources" ON ("concepts"."id" = "concepts_sources"."concept_id") WHERE ("concepts_sources"."source_id" = 746 AND "concepts"."is_active" AND NOT "concepts"."retired" AND "concepts"."is_latest_version")

Possible improvements:
Let's consider moving or duplicating is_active, retired and is_latest_version to mappings_sources and concepts_sources to avoid the join.

@snyaggarwal snyaggarwal added the api2 OCL API v2 label Sep 29, 2021
@snyaggarwal snyaggarwal self-assigned this Sep 29, 2021
@rkorytkowski
Copy link
Contributor

We also need those on concepts/mappings_collections, which may soon become much larger table than concepts/mappings_sources. It will be extremely expensive to maintain that metadata in sync on concepts/mappings_collections and counts are inherently expensive on Postgres.

Taking a step back I feel that what we actually need is saving these active counts instead. Similarly the last_child_updated for sources/collections should be stored. These counts will only have to be updated for HEAD sources and collections. We should invalidate these values upon modifying concepts/mappings, but only set them when requesting. I would save them in DB for transaction safety and simplicity.

@rkorytkowski
Copy link
Contributor

Count queries now lead to time outs on staging. The more content we load into staging the more this issue escalates. @snyaggarwal is there any chance you could shuffle things on your plate to address it sooner than later?

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 8, 2021
@snyaggarwal
Copy link
Contributor Author

@rkorytkowski I have added active_concepts and active_mappings on Source/Collection/Version. This should be fixed now.

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 8, 2021
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 8, 2021
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 8, 2021
@rkorytkowski
Copy link
Contributor

Thanks for addressing it so quickly. I feel we need a few adjustments here. I'm not sure if I caught all cases from your code but could you please describe the final approach? Also a few lines of documentation on active_concepts and active_mappings fields would be great i.e. point to how/when they are populated.

@rkorytkowski
Copy link
Contributor

When doing bulk imports how do we update the fields? Is it every resource or in bulk?

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 9, 2021
@snyaggarwal
Copy link
Contributor Author

When doing bulk imports how do we update the fields? Is it every resource or in bulk?

For now, it updates after every resource import. I am looking at optimizing this.
The issue with a single update of counts at the end of the import is figuring out which source/collections to update. Right now we don't have restrictions on content belonging to multiple parents. Single import file may have concepts/mappings from multiple parents. It gets even more complicated with parallel imports.

@rkorytkowski
Copy link
Contributor

@snyaggarwal I would suggest to have separate fields to mark counts as outdated (super fast update) instead of updating them right away. By the end of bulk import we could do a simple query to check which counts need to be updated and do it. It could be even done synchronously in bulk import as it is already processed asynchronously.

snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 10, 2021
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 10, 2021
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 10, 2021
…ill only update concepts or mappings count when required
snyaggarwal added a commit to OpenConceptLab/oclapi2 that referenced this issue Nov 10, 2021
@rkorytkowski
Copy link
Contributor

We might still be missing something. When I go to https://app.staging.openconceptlab.org/#/orgs/CIEL/sources/CIEL/?q=&isTable=true&isList=false&isSplit=false&page=1&exactMatch=off

I still see in logs:

2021-11-12 14:52:56 UTC:10.1.11.167(42564):postgres@postgres:[15347]:LOG: duration: 1255.279 ms statement: SELECT COUNT(*) AS "__count" FROM "concepts" INNER JOIN "concepts_sources" ON ("concepts"."id" = "concepts_sources"."concept_id") INNER JOIN "sources" ON ("concepts_sources"."source_id" = "sources"."id") INNER JOIN "organizations" ON ("sources"."organization_id" = "organizations"."id") WHERE ("concepts"."is_active" AND "sources"."mnemonic" = 'CIEL' AND "sources"."version" = 'HEAD' AND "organizations"."mnemonic" = 'CIEL' AND NOT "concepts"."retired" AND "concepts"."is_latest_version" AND NOT ("concepts"."public_access" = 'None'))
2021-11-12 14:52:56 UTC:10.1.11.167(42564):postgres@postgres:[15347]:LOG: duration: 757.761 ms statement: SELECT COUNT(*) AS "__count" FROM "concepts" INNER JOIN "concepts_sources" ON ("concepts"."id" = "concepts_sources"."concept_id") INNER JOIN "sources" ON ("concepts_sources"."source_id" = "sources"."id") INNER JOIN "organizations" ON ("sources"."organization_id" = "organizations"."id") WHERE ("concepts"."is_active" AND "sources"."mnemonic" = 'CIEL' AND "sources"."version" = 'HEAD' AND "organizations"."mnemonic" = 'CIEL' AND NOT "concepts"."retired" AND "concepts"."is_latest_version" AND NOT ("concepts"."public_access" = 'None'))

It's as if the count was not taken from cache.

@snyaggarwal
Copy link
Contributor Author

Yea the listing API returns 'num_found' in response headers. It cant take the count from cached fields because the listing query may be of any type and combination (search/facets/scoped/global). It runs count on the queryset returned based on the query asked.

@snyaggarwal
Copy link
Contributor Author

the cached count is used in getting /summary/ of source/collection/version

@snyaggarwal
Copy link
Contributor Author

This is deployed on all environments. Closing this out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api2 OCL API v2
Projects
None yet
Development

No branches or pull requests

2 participants