Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast track strategy for searching single recent beliefs #179

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nhoening
Copy link
Contributor

@nhoening nhoening commented May 12, 2024

This PR adds a "ORDER_BY and LIMIT 1" strategy to search_session(), which drastically shortens the execution time for queries which really only want one (most recent) belief, e.g. to check connectivity status or show the latest measured value.

In this PR; we add two search filters most_recent_event_only and most_revent_belief_only (notice the missing "s").
Their addition retains backwards compatibility, so packages using timely beliefs should have an easy time starting to use them.

Also, this PR adds an index to make sure the search time is constant, no matter how far in the past the most recent event might be. The index only adds fields used in the ORDER BY which is the recommended practice.

The psql session below demonstrates its effect in such a case (most recent event quite a while in the past - for sensors where the most recent even is really recent, execution time is fast anyway). I repeated each search twice to make sure we see a consistent behavior.

flexmeasures_prod=> \timing on
Timing is on.
flexmeasures_prod=> SELECT
timed_belief.event_start,
timed_belief.belief_horizon,
timed_belief.source_id,
timed_belief.cumulative_probability,
timed_belief.event_value
FROM
timed_belief
JOIN data_source ON data_source.id = timed_belief.source_id
WHERE
timed_belief.sensor_id = 4
AND timed_belief.event_start <= '2024-05-03 16:14:33.262614+09:00'
ORDER BY
timed_belief.event_start DESC
LIMIT
1;
      event_start       | belief_horizon | source_id | cumulative_probability |      event_value      
------------------------+----------------+-----------+------------------------+-----------------------
 2023-07-02 14:50:00+00 | 00:00:00       |         5 |                    0.5 | 0.0011505681818181816
(1 row)

Time: 550.283 ms
flexmeasures_prod=> SELECT
timed_belief.event_start,
timed_belief.belief_horizon,
timed_belief.source_id,
timed_belief.cumulative_probability,
timed_belief.event_value
FROM
timed_belief
JOIN data_source ON data_source.id = timed_belief.source_id
WHERE
timed_belief.sensor_id = 4
AND timed_belief.event_start <= '2024-05-03 16:14:33.262614+09:00'
ORDER BY
timed_belief.event_start DESC
LIMIT
1;
      event_start       | belief_horizon | source_id | cumulative_probability |      event_value      
------------------------+----------------+-----------+------------------------+-----------------------
 2023-07-02 14:50:00+00 | 00:00:00       |         5 |                    0.5 | 0.0011505681818181816
(1 row)

Time: 580.948 ms
flexmeasures_prod=> CREATE INDEX timed_belief_simplewhere_idx ON "timed_belief" (sensor_id, event_start);
CREATE INDEX
Time: 6020.611 ms (00:06.021)
flexmeasures_prod=> SELECT                                                                               
timed_belief.event_start,
timed_belief.belief_horizon,
timed_belief.source_id,
timed_belief.cumulative_probability,
timed_belief.event_value
FROM
timed_belief
JOIN data_source ON data_source.id = timed_belief.source_id
WHERE
timed_belief.sensor_id = 4
AND timed_belief.event_start <= '2024-05-03 16:14:33.262614+09:00'
ORDER BY
timed_belief.event_start DESC
LIMIT
1;
      event_start       | belief_horizon | source_id | cumulative_probability |      event_value      
------------------------+----------------+-----------+------------------------+-----------------------
 2023-07-02 14:50:00+00 | 00:00:00       |         5 |                    0.5 | 0.0011505681818181816
(1 row)

Time: 1.480 ms
flexmeasures_prod=> SELECT
timed_belief.event_start,
timed_belief.belief_horizon,
timed_belief.source_id,
timed_belief.cumulative_probability,
timed_belief.event_value
FROM
timed_belief
JOIN data_source ON data_source.id = timed_belief.source_id
WHERE
timed_belief.sensor_id = 4
AND timed_belief.event_start <= '2024-05-03 16:14:33.262614+09:00'
ORDER BY
timed_belief.event_start DESC
LIMIT
1;
      event_start       | belief_horizon | source_id | cumulative_probability |      event_value      
------------------------+----------------+-----------+------------------------+-----------------------
 2023-07-02 14:50:00+00 | 00:00:00       |         5 |                    0.5 | 0.0011505681818181816
(1 row)

Time: 1.047 ms

I tested the case with most_recent_belief_only - the index doesn't apply for it, but the performance was decent already.
We also need to look into the space we are using with indices.

…strategy for fast access to single recent beliefs

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
Signed-off-by: Nicolas Höning <nicolas@seita.nl>
@nhoening nhoening marked this pull request as ready for review May 12, 2024 13:54
@nhoening nhoening requested a review from Flix6x May 12, 2024 13:54
Copy link
Collaborator

@Flix6x Flix6x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good feature!

@@ -487,6 +504,17 @@ def apply_belief_timing_filters(q):
sources: list = [source] if not isinstance(source, list) else source
q = q.join(source_class).filter(cls.source_id.in_([s.id for s in sources]))

if most_recent_belief_only:
# most recent belief only makes sense on one defined event
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the most recent belief is not necessarily about the most recent event.

I do like the simplifying assumption, so let's just provide clarity through documentation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it doesn't seem immediately clear to me why we couldn't cover this use case with a combination of most_recent_beliefs_only=True and most_recent_event_only=True.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still get multiple events if you don't limit to one source.

But foremost, it's really quite slow to use a subquery or even two subqueries. (also with our recent improvement of how the subqueries work)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I was not suggesting to just use the two old arguments, but to use only 1 of the new arguments, which already limits the query to a single event (and actually, also already limits it to a single belief). It's not obvious to me why the other new argument is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't see you used a new one here.

Two reasons come to mind:

  • old argument uses a subquery which is slow
  • I added the most_recent_belief_only argument also for consistency in arguments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old argument uses a subquery which is slow

We are free to internally disconnect that logic and switch over to the fast track in case of a specific combination of filters, if it results in identical search results.

I added the most_recent_belief_only argument also for consistency in arguments.

That would make sense, but it doesn't really do what its name suggests. My thinking about this is still evolving. Since my previous comment I realized that picking the most recent event but not the most recent belief about that event doesn't seem like a practical use case.

Current situation:

  • The most_recent_belief_only filter takes the most recent event and returns the most recent belief about that event that it finds. -> A practical use case.
  • The most_recent_event_only filter takes the most recent event and returns the first belief about that event that it finds. -> Not a practical use case imo.

My preferred situation involves switching to introducing only one new filter, and renaming it:

  • The most_recent_only filter takes the most recent event and returns the most recent belief about that event that it finds.

Here I'm disregarding any thoughts on n_most_recent_events and probabilistic beliefs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically your current thoughts are to

  • only add most_recent_only filter
  • internally switch over to its logic when the user selects one source together with most_recent_events_only=True
  • document the above point, but also the current state of non-deterministic beliefs being poorly supported here.

Correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, with one correction:

  • internally switch over to its logic when the user selects one source together with most_recent_events_only=True and most_recent_beliefs_only=True.

In conjunction, these filters should yield a single row (but only for a sensor holding only deterministic beliefs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a simple & fast way to tell if a sensor has only deterministic beliefs? Otherwise it's not a good idea to do this, maybe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way would be to count the unique cumulative_probability values. But since that would also be new functionality, I agree it is not a good idea to do this within this PR.

Comment on lines +584 to +585
# from sqlalchemy.dialects import postgresql
# print(q.compile(dialect=postgresql.dialect()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to keep these lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am considering it - they can come in quite handy whenever we work in this code.

Comment on lines 575 to 582
# apply fast-track most recent event|belief only approach
if most_recent_event_only:
if most_recent_belief_only:
q = q.order_by(cls.event_start.desc(), cls.belief_horizon.desc()).limit(
1
)
else:
q = q.order_by(cls.event_start.desc()).limit(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document that this is only meant to retrieve a deterministic belief. A probabilistic belief would have multiple rows (sharing the same event start and belief horizon).

@@ -311,6 +316,8 @@ def search_session( # noqa: C901
source: BeliefSource | list[BeliefSource] | None = None,
most_recent_beliefs_only: bool = False,
most_recent_events_only: bool = False,
most_recent_belief_only: bool = False,
most_recent_event_only: bool = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we should (also) use this fast track in case of a most_recent_events_only=Trueand a source filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean because that combination also leads to only one returned event?

:param most_recent_events_only: only return (post knowledge time) beliefs for the most recent event (maximum event start)
:param most_recent_events_only: only return (post knowledge time) beliefs for the most recent event (maximum event start) for each source
:param most_recent_belief_only: only return the most recent belief of the most recent event which fits all filter criteria (will also apply most_recent_event_only)
:param most_recent_event_only: only return the most recent event which fits all filter criteria
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which -> that

@@ -336,7 +351,9 @@ def search_session( # noqa: C901
:param horizons_at_most: only return beliefs with a belief horizon equal or less than this timedelta (for example, use timedelta(0) to get post knowledge time beliefs)
:param source: only return beliefs formed by the given source or list of sources
:param most_recent_beliefs_only: only return the most recent beliefs for each event from each source (minimum belief horizon)
:param most_recent_events_only: only return (post knowledge time) beliefs for the most recent event (maximum event start)
:param most_recent_events_only: only return (post knowledge time) beliefs for the most recent event (maximum event start) for each source
:param most_recent_belief_only: only return the most recent belief of the most recent event which fits all filter criteria (will also apply most_recent_event_only)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which -> that

timely_beliefs/beliefs/classes.py Show resolved Hide resolved
Co-authored-by: Felix Claessen <30658763+Flix6x@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants