Adding flag to disable Achilles cache - fixes #2034 #2172

Merged · 7 commits merged into master from issue-2034-disable-achilles-cache on Feb 7, 2023

Conversation

@anthonysena (Collaborator)

Adds the ability to turn off caching of Achilles results for all CDMs.
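
For context, a minimal sketch of how such a flag might be wired up in Spring. This is not the PR's actual code: the property name cdm.cache.achilles.enabled is an illustrative placeholder, and Source stands in for WebAPI's source entity; see the diff for the real flag and wiring.

    // Minimal sketch only, assuming a hypothetical property name.
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.stereotype.Service;

    @Service
    public class CDMCacheService {

        // "cdm.cache.achilles.enabled" is an illustrative placeholder name
        @Value("${cdm.cache.achilles.enabled:true}")
        private boolean achillesCacheEnabled;

        public void warmCache(Source source) { // Source: WebAPI's CDM source entity
            if (!achillesCacheEnabled) {
                return; // caching turned off for all CDM sources
            }
            // ... existing cache-warming logic ...
        }
    }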

@anthonysena anthonysena marked this pull request as draft December 1, 2022 22:01
@anthonysena anthonysena marked this pull request as ready for review December 5, 2022 21:02
@anthonysena anthonysena linked an issue Dec 13, 2022 that may be closed by this pull request
@chrisknoll (Collaborator) commented Dec 13, 2022

Another enhancement:

In CDMCacheService.java (function cacheRecords()), there is a loop of Java code:

        if (ids == null) {
            // Full cache
            // Make sure that query returns ordered collection of ids or sort it after query is executed
            PreparedStatementRenderer cpsr = getConceptPreparedStatementRenderer(source);
            ids = jdbcTemplate.query(cpsr.getSql(), cpsr.getSetter(), (rs, rowNum) -> rs.getInt(1));
            List<Pair<Integer, Integer>> minMaxPairs = new ArrayList<>();
            int start = 0;
            // Get ranges of minimal and maximum concept identifiers to use in query
            while (start < ids.size()) {
                int end = Math.min(ids.size() - 1, start + COUNTS_BATCH_SIZE - 1);
                // If we have small number of concepts for the next range - add them to the current range
                if (end + COUNTS_BATCH_THRESHOLD >= ids.size() - 1) {
                    end = ids.size() - 1;
                }
                minMaxPairs.add(new ImmutablePair<>(ids.get(start), ids.get(end)));
                start = end + 1;
            }
            // Clear list of identifiers so the GC can collect them
            ids.clear();
            minMaxPairs.forEach(pair -> {
                PreparedStatementRenderer psr = getBatchPreparedStatementRenderer(source, pair.getLeft(), pair.getRight());
                cacheRecords(source, psr, mapper, jdbcTemplate);
            });
        }

This can be replaced with a query that produces the min/max concepts directly in SQL:


select group_id, min(concept_id) as min_concept, max(concept_id) as max_concept FROM
(
	select concept_id, (ordinal - 1) / 100000 as group_id FROM ( -- here, 100000 would be replaced with COUNTS_BATCH_SIZE; (ordinal - 1) keeps each group at exactly that size
		select concept_id, row_number() over (order by concept_id) as ordinal
		FROM cdm.concept c
	) Q
) G
group by group_id
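
For illustration, the Java side could then consume that query to build the min/max pairs without first materializing every concept id in memory. This sketch extends the snippet above; getBatchRangePreparedStatementRenderer is a hypothetical helper that renders the grouping SQL with COUNTS_BATCH_SIZE substituted for the hard-coded 100000:

        // Sketch only: let the database compute the batch boundaries.
        PreparedStatementRenderer rangePsr = getBatchRangePreparedStatementRenderer(source);
        List<Pair<Integer, Integer>> minMaxPairs = jdbcTemplate.query(
                rangePsr.getSql(), rangePsr.getSetter(),
                (rs, rowNum) -> new ImmutablePair<>(rs.getInt("min_concept"), rs.getInt("max_concept")));
        minMaxPairs.forEach(pair -> {
            PreparedStatementRenderer psr = getBatchPreparedStatementRenderer(source, pair.getLeft(), pair.getRight());
            cacheRecords(source, psr, mapper, jdbcTemplate);
        });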

@anthonysena (Collaborator, Author)

> Another enhancement: In CDMCacheService.java (function cacheRecords()), there is a loop of Java code that pages through every concept id to build min/max ranges. This can be replaced with a query that produces the min/max concepts directly in SQL. [quoted from the comment above]

Thanks @chrisknoll - I've opened a separate issue for this change so that this PR stays focused on disabling the Achilles caching capability. I agree with what you have proposed above and will work on that change.

Commits:
Made achilles_result_concept_count a required table.
Modified DDL population for the record count table.
Records are cached from achilles_result_concept_count instead of achilles_results.
@chrisknoll (Collaborator)

The latest commit makes some changes:

- It incorporates the change to use achilles_result_concept_count, updates the query that populates this table, and stores zeroes in the WebAPI cache when counts are not returned from the source.
- I split some of the logic into separate functions, since some of the naming was confusing: there were multiple cacheRecords overloads, but one was updating the WebAPI cache while the other was fetching from the CDM source, so I renamed them to make their functions clearer.

The main change is the requirement of achilles_result_concept_count, so most likely we want to put this into the 2.13 release and not the 2.12.1 hotfix.
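
As a rough sketch of that split (ConceptRecordCount, fetchCountsFromSource, updateWebApiCache, getSourceJdbcTemplate, cacheRepository, and toCacheEntity are all hypothetical placeholders; the actual names and signatures are in the linked commit), the two responsibilities might look like:

    // Illustrative sketch only; actual names are in the commit.

    // Responsibility 1: fetch record counts for a concept-id range from the CDM source.
    private List<ConceptRecordCount> fetchCountsFromSource(Source source, int minConceptId, int maxConceptId) {
        PreparedStatementRenderer psr = getBatchPreparedStatementRenderer(source, minConceptId, maxConceptId);
        return getSourceJdbcTemplate(source).query(psr.getSql(), psr.getSetter(),
                (rs, rowNum) -> new ConceptRecordCount(rs.getInt("concept_id"), rs.getLong("record_count")));
    }

    // Responsibility 2: write the fetched counts into the WebAPI cache, storing a
    // zero count for any requested concept the source did not return.
    private void updateWebApiCache(Source source, List<ConceptRecordCount> counts) {
        counts.forEach(count -> cacheRepository.save(toCacheEntity(source, count)));
    }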

@chrisknoll (Collaborator)

Still working through validating the behavior in the caching layer. This is a report I'm seeing through the pgAdmin dashboard describing tuples in and tuples out:

[Screenshot: pgAdmin dashboard graph of tuples in vs. tuples out]

In this graph, we see the expected insert of 2,000 rows (tuple = row in PG), but we're also seeing a massive 800,000 tuples out. It is not clear whether this is some side effect of Hibernate or normal PG behavior, but it is suspicious, and updates are going very slowly (I estimate between 20-30 inserts per second, which will make the 11,000,000 inserts needed to copy the record count records into the cache extremely slow).

I'm trying to investigate whether it is something to do with our own PG configuration: before this type of caching, the WebAPI PG database would only insert data when an asset (e.g. a cohort definition or concept set) was saved, or to record a job any time an analytical task was executed. That amounted to a workload of a dozen or so updates/inserts per minute, while this caching change switches the load to thousands of updates per second if we ever hope to complete 11 million updates during WebAPI startup.

My current thinking is that Hibernate was the wrong technology on which to build a caching layer: there are transaction semantics and other variables (unknown to me) that IMO add overhead and complications that a pure caching mechanism should not be concerned with.
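
For illustration, the kind of plain-JDBC bulk load being hinted at, bypassing Hibernate's persistence context entirely, could be sketched with Spring's JdbcTemplate.batchUpdate. The table and column names below are assumptions for illustration, not WebAPI's actual schema:

    // Sketch only: batch-insert cache rows without per-entity Hibernate overhead.
    // Table/column names are illustrative.
    jdbcTemplate.batchUpdate(
            "INSERT INTO webapi.cdm_cache (source_id, concept_id, record_count) VALUES (?, ?, ?)",
            counts,   // the List<ConceptRecordCount> fetched from the source
            1000,     // JDBC batch size
            (ps, count) -> {
                ps.setInt(1, source.getSourceId());
                ps.setInt(2, count.getConceptId());
                ps.setLong(3, count.getRecordCount());
            });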

Commit: Limit the batch cache warming to only concepts that exist in the vocabulary.
@anthonysena (Collaborator, Author)

Making a note here for the release notes: the @results_schema.achilles_result_concept_count table will be REQUIRED for v2.13.
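
For reference, a minimal sketch of that table's likely shape, rendered with SqlRender. The column list is an assumption based on this discussion, so consult the 2.13 migration scripts for the authoritative DDL:

    // Illustrative only; the real DDL ships with the WebAPI 2.13 migrations.
    // "@results_schema" is a SqlRender placeholder for the results schema name.
    String ddl = SqlRender.renderSql(
            "CREATE TABLE @results_schema.achilles_result_concept_count ("
                    + "concept_id INTEGER NOT NULL, "
                    + "record_count BIGINT, "
                    + "descendant_record_count BIGINT)",
            new String[]{"results_schema"}, new String[]{"results"});
    jdbcTemplate.execute(ddl);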

@chrisknoll (Collaborator)

@anton-abushkevich, this PR is ready, but we wanted someone from outside our organization to approve it. Can you review?

@chrisknoll merged commit 8adbdd3 into master on Feb 7, 2023
The delete-merged-branch bot deleted the issue-2034-disable-achilles-cache branch on February 7, 2023 at 12:22