Adding flag to disable Achilles cache - fixes #2034 #2172

Merged · 7 commits merged into master from issue-2034-disable-achilles-cache on Feb 7, 2023

Conversation

@anthonysena (Collaborator)

Adds the ability to turn off caching of Achilles results for all CDMs.
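
For context, a minimal sketch of how such a flag might be wired up in Spring. This is not the PR's actual code: the property name cdm.cache.achilles.enabled is an illustrative placeholder, and Source stands in for WebAPI's source entity; see the diff for the real flag and wiring.

    // Minimal sketch only, assuming a hypothetical property name.
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.stereotype.Service;

    @Service
    public class CDMCacheService {

        // "cdm.cache.achilles.enabled" is an illustrative placeholder name
        @Value("${cdm.cache.achilles.enabled:true}")
        private boolean achillesCacheEnabled;

        public void warmCache(Source source) { // Source: WebAPI's CDM source entity
            if (!achillesCacheEnabled) {
                return; // caching turned off for all CDM sources
            }
            // ... existing cache-warming logic ...
        }
    }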

@anthonysena anthonysena marked this pull request as draft December 1, 2022 22:01
@anthonysena anthonysena marked this pull request as ready for review December 5, 2022 21:02
@anthonysena anthonysena linked an issue Dec 13, 2022 that may be closed by this pull request
@chrisknoll (Collaborator) commented Dec 13, 2022

Another enhancement:

In CDMCacheService.java (function cacheRecords()), there is a loop of Java code:

        if (ids == null) {
            // Full cache
            // Make sure that query returns ordered collection of ids or sort it after query is executed
            PreparedStatementRenderer cpsr = getConceptPreparedStatementRenderer(source);
            ids = jdbcTemplate.query(cpsr.getSql(), cpsr.getSetter(), (rs, rowNum) -> rs.getInt(1));
            List<Pair<Integer, Integer>> minMaxPairs = new ArrayList<>();
            int start = 0;
            // Get ranges of minimal and maximum concept identifiers to use in query
            while (start < ids.size()) {
                int end = Math.min(ids.size() - 1, start + COUNTS_BATCH_SIZE - 1);
                // If we have small number of concepts for the next range - add them to the current range
                if (end + COUNTS_BATCH_THRESHOLD >= ids.size() - 1) {
                    end = ids.size() - 1;
                }
                minMaxPairs.add(new ImmutablePair<>(ids.get(start), ids.get(end)));
                start = end + 1;
            }
            // Clear list of identifiers so the GC can collect them
            ids.clear();
            minMaxPairs.forEach(pair -> {
                PreparedStatementRenderer psr = getBatchPreparedStatementRenderer(source, pair.getLeft(), pair.getRight());
                cacheRecords(source, psr, mapper, jdbcTemplate);
            });
        }

This can be replaced with a query that produces the min/max concepts directly in SQL:


select group_id, min(concept_id) as min_concept, max(concept_id) as max_concept FROM
(
	select concept_id, (ordinal - 1) / 100000 as group_id FROM ( -- here, 100000 would be replaced with COUNTS_BATCH_SIZE; (ordinal - 1) keeps each group at exactly that size
		select concept_id, row_number() over (order by concept_id) as ordinal
		FROM cdm.concept c
	) Q
) G
group by group_id
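
For illustration, the Java side could then consume that query to build the min/max pairs without first materializing every concept id in memory. This sketch extends the snippet above; getBatchRangePreparedStatementRenderer is a hypothetical helper that renders the grouping SQL with COUNTS_BATCH_SIZE substituted for the hard-coded 100000:

        // Sketch only: let the database compute the batch boundaries.
        PreparedStatementRenderer rangePsr = getBatchRangePreparedStatementRenderer(source);
        List<Pair<Integer, Integer>> minMaxPairs = jdbcTemplate.query(
                rangePsr.getSql(), rangePsr.getSetter(),
                (rs, rowNum) -> new ImmutablePair<>(rs.getInt("min_concept"), rs.getInt("max_concept")));
        minMaxPairs.forEach(pair -> {
            PreparedStatementRenderer psr = getBatchPreparedStatementRenderer(source, pair.getLeft(), pair.getRight());
            cacheRecords(source, psr, mapper, jdbcTemplate);
        });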

@anthonysena (Collaborator, Author)

> Another enhancement: In CDMCacheService.java (function cacheRecords()), there is a loop of Java code that pages through every concept id to build min/max ranges. This can be replaced with a query that produces the min/max concepts directly in SQL. [quoted from the comment above]

Thanks @chrisknoll - I've opened a separate issue for this change so that this PR stays focused on disabling the Achilles caching capability. I agree with what you have proposed above and will work on that change.

Commits:
Made achilles_result_concept_count a required table.
Modified DDL population for the record count table.
Records are cached from achilles_result_concept_count instead of achilles_results.
@chrisknoll (Collaborator)

The latest commit makes some changes:

- It incorporates the change to use achilles_result_concept_count, updates the query that populates this table, and stores zeroes in the WebAPI cache when counts are not returned from the source.
- I split some of the logic into separate functions, since some of the naming was confusing: there were multiple cacheRecords overloads, but one was updating the WebAPI cache while the other was fetching from the CDM source, so I renamed them to make their functions clearer.

The main change is the requirement of achilles_result_concept_count, so most likely we want to put this into the 2.13 release and not the 2.12.1 hotfix.
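
As a rough sketch of that split (ConceptRecordCount, fetchCountsFromSource, updateWebApiCache, getSourceJdbcTemplate, cacheRepository, and toCacheEntity are all hypothetical placeholders; the actual names and signatures are in the linked commit), the two responsibilities might look like:

    // Illustrative sketch only; actual names are in the commit.

    // Responsibility 1: fetch record counts for a concept-id range from the CDM source.
    private List<ConceptRecordCount> fetchCountsFromSource(Source source, int minConceptId, int maxConceptId) {
        PreparedStatementRenderer psr = getBatchPreparedStatementRenderer(source, minConceptId, maxConceptId);
        return getSourceJdbcTemplate(source).query(psr.getSql(), psr.getSetter(),
                (rs, rowNum) -> new ConceptRecordCount(rs.getInt("concept_id"), rs.getLong("record_count")));
    }

    // Responsibility 2: write the fetched counts into the WebAPI cache, storing a
    // zero count for any requested concept the source did not return.
    private void updateWebApiCache(Source source, List<ConceptRecordCount> counts) {
        counts.forEach(count -> cacheRepository.save(toCacheEntity(source, count)));
    }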

@chrisknoll (Collaborator)

Still working through validating the behavior in the caching layer. This is a report I'm seeing through the pgAdmin dashboard describing tuples in and tuples out:

[Screenshot: pgAdmin dashboard graph of tuples in vs. tuples out]

In this graph, we see the expected insert of 2,000 rows (tuple = row in PG), but we're also seeing a massive 800,000 tuples out. It is not clear whether this is some side effect of Hibernate or normal PG behavior, but it is suspicious, and updates are going very slowly (I estimate between 20-30 inserts per second, which will make the 11,000,000 inserts needed to copy the record count records into the cache extremely slow).

I'm trying to investigate whether it is something to do with our own PG configuration: before this type of caching, the WebAPI PG database would only insert data when an asset (e.g. a cohort definition or concept set) was saved, or to record a job any time an analytical task was executed. That amounted to a workload of a dozen or so updates/inserts per minute, while this caching change switches the load to thousands of updates per second if we ever hope to complete 11 million updates during WebAPI startup.

My current thinking is that Hibernate was the wrong technology on which to build a caching layer: there are transaction semantics and other variables (unknown to me) that IMO add overhead and complications that a pure caching mechanism should not be concerned with.
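
For illustration, the kind of plain-JDBC bulk load being hinted at, bypassing Hibernate's persistence context entirely, could be sketched with Spring's JdbcTemplate.batchUpdate. The table and column names below are assumptions for illustration, not WebAPI's actual schema:

    // Sketch only: batch-insert cache rows without per-entity Hibernate overhead.
    // Table/column names are illustrative.
    jdbcTemplate.batchUpdate(
            "INSERT INTO webapi.cdm_cache (source_id, concept_id, record_count) VALUES (?, ?, ?)",
            counts,   // the List<ConceptRecordCount> fetched from the source
            1000,     // JDBC batch size
            (ps, count) -> {
                ps.setInt(1, source.getSourceId());
                ps.setInt(2, count.getConceptId());
                ps.setLong(3, count.getRecordCount());
            });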

Commit: Limit the batch cache warming to only concepts that exist in the vocabulary.
@anthonysena (Collaborator, Author)

Making a note here for the release notes: the @results_schema.achilles_result_concept_count table will be REQUIRED for v2.13.
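
For reference, a minimal sketch of that table's likely shape, rendered with SqlRender. The column list is an assumption based on this discussion, so consult the 2.13 migration scripts for the authoritative DDL:

    // Illustrative only; the real DDL ships with the WebAPI 2.13 migrations.
    // "@results_schema" is a SqlRender placeholder for the results schema name.
    String ddl = SqlRender.renderSql(
            "CREATE TABLE @results_schema.achilles_result_concept_count ("
                    + "concept_id INTEGER NOT NULL, "
                    + "record_count BIGINT, "
                    + "descendant_record_count BIGINT)",
            new String[]{"results_schema"}, new String[]{"results"});
    jdbcTemplate.execute(ddl);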

@chrisknoll (Collaborator)

@anton-abushkevich, this PR is ready, but we wanted someone from outside our organization to approve it. Can you review?

@chrisknoll merged commit 8adbdd3 into master on Feb 7, 2023
The delete-merged-branch bot deleted the issue-2034-disable-achilles-cache branch on February 7, 2023 at 12:22