Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache dependencies when running queries through FlowAPI #1284

Merged
merged 89 commits into from
Sep 23, 2019

Conversation

jc-harrison
Copy link
Member

Closes #1152

I have:

  • Formatted any Python files with black
  • Brought the branch up to date with master
  • Added any relevant Github labels
  • Added tests for any new additions
  • Added or updated any relevant documentation
  • Added an Architectural Decision Record (ADR), if appropriate
  • Added an MPLv2 License Header if appropriate
  • Updated the Changelog

Description

This pull request adds a new argument store_dependencies (default False) to the Query.store method. Setting store_dependencies=True will cause all the unstored dependencies of a query to be stored before the query itself is stored.

When running queries through the API the Flowmachine server will, by default, use store(store_dependencies=True) to execute and store a query. This behaviour can be switched off by setting the environment variable FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING=true, in which case only the top-level query will be cached.

A side-effect of this change is that the pre-cache script is no longer required in the docs build, so I have removed it.

The dependency storing is handled by a new method Query._store_dependencies_and_make_sql, which constructs a dependency graph of unstored dependencies and calls store() on each of them in an appropriate order, then waits for all of those to finish before returning the output of self._make_sql. When calling store with store_dependencies=True, _store_dependencies_and_make_sql is passed instead of _make_sql as the ddl_ops_func in write_query_to_cache.

I initially tried handling the dependency storing in an outer function which then calls query.store() after storing all the dependencies, but eventually decided pushing this down to run within write_query_to_cache is better because this way the top-level query is in "executing" state from the start, so a different thread can't start storing the same query before all the dependencies have started running.

I've added a function store_queries_in_order in utils.py, which takes a dependency graph and stores the queries in an order that ensures each query is stored after its dependencies. I've also added a method Query._unstored_dependencies_graph, which creates a dependency graph of only the unstored dependencies of a query (and excludes unstored dependencies of stored queries). This is a Query method rather than a function in utils.py primarily due to import issues with QueryStateMachine.

@codecov
Copy link

codecov bot commented Sep 18, 2019

Codecov Report

Merging #1284 into master will decrease coverage by 4.69%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #1284     +/-   ##
=========================================
- Coverage   94.17%   89.47%   -4.7%     
=========================================
  Files         155       18    -137     
  Lines        7499     1587   -5912     
  Branches      697       57    -640     
=========================================
- Hits         7062     1420   -5642     
+ Misses        331      159    -172     
+ Partials      106        8     -98
Flag Coverage Δ
#flowapi_unit_tests 82.36% <50%> (-0.19%) ⬇️
#flowauth_unit_tests 93.65% <ø> (ø) ⬆️
#flowclient_unit_tests 78.78% <ø> (ø) ⬆️
#flowetl_unit_tests ?
#flowkit_jwt_generator_unit_tests 100% <ø> (ø) ⬆️
#flowmachine_unit_tests ?
#integration_tests ?
Impacted Files Coverage Δ
...it_jwt_generator/flowkit_jwt_generator/fixtures.py 100% <ø> (ø) ⬆️
flowapi/flowapi/query_endpoints.py 79.71% <50%> (-9.85%) ⬇️
flowapi/flowapi/api_spec.py 25.53% <0%> (-68.09%) ⬇️
flowclient/flowclient/client.py 78.78% <0%> (-16.02%) ⬇️
flowapi/flowapi/user_model.py 89.28% <0%> (-5.96%) ⬇️
...e/server/query_schemas/joined_spatial_aggregate.py
...achine/core/server/query_schemas/daily_location.py
...e/flowmachine/features/raster/raster_statistics.py
...lowmachine/flowmachine/features/spatial/circles.py
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f4fe0d9...fceb96e. Read the comment docs.

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff!

Need to follow this up quite quickly with automatic cache shrinking I think - I'd also like just a little bit in the docs to mention this, because it could be a major trap for the unwary.

flowmachine/flowmachine/core/dummy_query.py Outdated Show resolved Hide resolved
from typing import NamedTuple


class FlowmachineServerConfig(NamedTuple):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. I did not know you could do that!

@jc-harrison jc-harrison added the ready-to-merge Label indicating a PR is OK to automerge label Sep 23, 2019
@mergify mergify bot merged commit 956469a into master Sep 23, 2019
@mergify mergify bot deleted the simple-orchestration branch September 23, 2019 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request FlowAPI Issues related to the FlowKit API FlowMachine Issues related to FlowMachine ready-to-merge Label indicating a PR is OK to automerge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Orchestration/scheduler/pre-cache
2 participants