add index on timed_beliefs for faster search #167

Merged

nhoening
Contributor

@nhoening nhoening commented Feb 27, 2024

This will speed up search_session(), as that query and its subquery are looking for the same fields.

Note that I removed the usage of has_inherited_table(), which blocked the existing UniqueConstraint from being applied. Let me know if there was a strong argument for using it that I did not find. This function tests whether one of the classes the model inherits from has a table associated with it. I believe you might have intended this as a protection of some sort? In FlexMeasures, we inherit from db.Model and from tb.TimedBeliefDBMixin, neither of which has a table specified.

As to the unique constraint: it was never applied, because has_inherited_table() returned False. If we apply it, we no longer allow beliefs for the same event (and from the same source...) with different probabilities (one failing test confirmed this). So I decided we don't need this constraint. The combined PK covers the same columns plus the probability, so it seems to me we are fine.
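The conflict between the four-column unique constraint and probabilistic beliefs can be reproduced in isolation. The following is a minimal sketch using Python's built-in sqlite3 module rather than Postgres. The column types, the primary key column order, the index name `quad_unique_idx`, and the probability and event values are assumptions made for illustration. Only the column names and the duplicated key come from the failing test output in this thread.

```python
import sqlite3


def insert_two_probabilistic_beliefs(with_quad_unique: bool) -> str:
    """Insert two beliefs that differ only in cumulative_probability.

    Returns "accepted" or "rejected". Schema and values are illustrative
    assumptions, not the actual timely-beliefs table definition.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE timed_beliefs (
            event_start TEXT NOT NULL,
            belief_horizon TEXT NOT NULL,
            cumulative_probability REAL NOT NULL,
            event_value REAL,
            sensor_id INTEGER NOT NULL,
            source_id INTEGER NOT NULL,
            -- primary key: the four search columns plus the probability
            PRIMARY KEY (event_start, belief_horizon, cumulative_probability,
                         sensor_id, source_id)
        )
        """
    )
    if with_quad_unique:
        # Hypothetical name; this mimics a unique index without the probability.
        conn.execute(
            "CREATE UNIQUE INDEX quad_unique_idx ON timed_beliefs "
            "(event_start, belief_horizon, sensor_id, source_id)"
        )
    rows = [
        ("2025-01-02 22:45:00+00", "02:00:00", 0.5, 10.0, 1, 1),
        ("2025-01-02 22:45:00+00", "02:00:00", 0.9, 12.0, 1, 1),
    ]
    try:
        conn.executemany(
            "INSERT INTO timed_beliefs VALUES (?, ?, ?, ?, ?, ?)", rows
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"
    finally:
        conn.close()


print(insert_two_probabilistic_beliefs(with_quad_unique=False))
print(insert_two_probabilistic_beliefs(with_quad_unique=True))
```

With only the primary key in place, both rows are stored, because they differ in cumulative_probability. Once the four-column unique index exists, the second row is rejected.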

…he same fields so searching is sped up

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
@nhoening nhoening added the Database support Dealing with databases label Feb 27, 2024
@nhoening nhoening requested a review from Flix6x February 27, 2024 16:47
Signed-off-by: Nicolas Höning <nicolas@seita.nl>
@nhoening
Contributor Author

Now one test is failing, example below, seemingly because it enters a belief which violates the unique constraint. I'll take a look later.

FAILED timely_beliefs/tests/test_belief_query.py::test_select_most_recent_probabilistic_beliefs - sqlalchemy.exc.IntegrityError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "timed_beliefs_quad_unique_and_search_idx"
DETAIL:  Key (event_start, belief_horizon, sensor_id, source_id)=(2025-01-02 22:45:00+00, 02:00:00, 1, 1) already exists.

[SQL: INSERT INTO timed_beliefs (event_start, belief_horizon, cumulative_probability, event_value, sensor_id, source_id) VALUES (%(event_start__0)s, %(belief_horizon__0)s, %(cumulative_probability__0)s, %(event_value__0)s, %(sensor_id__0)s, %(source_id__0) ... 4954 characters truncated ... on__37)s, %(cumulative_probability__37)s, %(event_value__37)s, %(sensor_id__37)s, %(source_id__37)s)]

Also a note from looking at results: we have 9867 warnings, some of which are DeprecationWarnings or FutureWarnings from Pandas, others are from us: UserWarning: <BeliefSource Source A> created from 'Source A'.

This seems useful: /home/runner/work/timely-beliefs/timely-beliefs/timely_beliefs/beliefs/classes.py:1086: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
… as well

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
Signed-off-by: Nicolas Höning <nicolas@seita.nl>
…red by primary key

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
Signed-off-by: Nicolas Höning <nicolas@seita.nl>
@nhoening nhoening changed the title from "fix application of unique index on timed_beliefs, also add index" to "add index on timed_beliefs for faster search" Feb 28, 2024
timely_beliefs/beliefs/classes.py
Comment on lines 189 to 192
"event_start",
"source_id",
"sensor_id",
"belief_horizon",
Collaborator

@Flix6x Flix6x Feb 29, 2024

As I understood it, the order matters here, and it should lead to some kind of a funnel from most unique values to fewest unique values? If that is the case, I'd expect the following ordering from most unique to fewest unique values in a typical database:

  • event start: data covers a large period and slowly but steadily grows over time
  • sensor: grows with the size of the system being serviced, but let's say with less than one (hourly) sensor per hour
  • belief horizon: not many unique values with respect to the previous two, and likely a quite constant number
  • source: may grow with the number of API users, but once you have the sensor, there are usually only a couple of sources

Just my two cents on the matter.

That said, I don't really understand why/how the order would matter and why the database doesn't take care of figuring out the best order of such things.
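On the question of why column order matters: a composite B-tree index is stored sorted by its first column, then by the second within equal values of the first, and so on. A filter on the leading column can therefore seek directly into the index, while a filter on a later column alone generally cannot. The planner only decides whether to use an index as it was defined; it does not reorder its columns. The sketch below shows this with Python's built-in sqlite3 module, which is an assumption made for convenience (the thread concerns Postgres, whose planner may behave differently in detail). The index name `search_idx` and the column types are hypothetical; the column order is the one under review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timed_beliefs ("
    "event_start TEXT, source_id INTEGER, sensor_id INTEGER, "
    "belief_horizon TEXT, event_value REAL)"
)
# Same column order as in the reviewed lines; the index name is made up.
conn.execute(
    "CREATE INDEX search_idx ON timed_beliefs "
    "(event_start, source_id, sensor_id, belief_horizon)"
)


def plan(sql: str, params: tuple = ()) -> str:
    """Return the human-readable query plan that SQLite reports."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Filter on the leading index column: the index can be searched.
leading = plan(
    "SELECT event_value FROM timed_beliefs WHERE event_start = ?",
    ("2025-01-02 22:45:00+00",),
)
# Filter on a non-leading column only: the index is not used here.
non_leading = plan(
    "SELECT event_value FROM timed_beliefs WHERE sensor_id = ?", (1,)
)

print(leading)
print(non_leading)
```

The first plan names `search_idx`, the second falls back to scanning the table. The exact wording of the plan text depends on the SQLite version.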

Contributor Author

Thanks. I read somewhere that range columns (like event_start) are a good place to lead with, as well.

To be honest, I will simply add the query that Mike suggested in the end, but not because I see a performance difference. After #166, queries for sensors in our dataset with little data are so fast that I suspect the effect of indexes is not visible. And for the sensor with > 50% of the data, Postgres seems to ignore the index by default.

Not listing belief_horizon in the index (but including it as a column) makes sense, as we are using min() on it.
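The idea that a min() over belief_horizon can be answered from the index alone can be sketched as follows. This uses Python's built-in sqlite3 module, which is an assumption made for convenience. SQLite has no equivalent of a non-key included column, so the sketch appends belief_horizon as a trailing key column instead. The index name `search_idx`, the integer horizons in seconds, and the sample rows are hypothetical, and the grouped query is only a guess at the general shape of such a search, not the library's actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timed_beliefs ("
    "event_start TEXT, sensor_id INTEGER, source_id INTEGER, "
    "belief_horizon INTEGER, event_value REAL)"
)
# Hypothetical index: grouping columns first, belief_horizon last.
conn.execute(
    "CREATE INDEX search_idx ON timed_beliefs "
    "(event_start, sensor_id, source_id, belief_horizon)"
)
conn.executemany(
    "INSERT INTO timed_beliefs VALUES (?, ?, ?, ?, ?)",
    [
        ("2025-01-02T22:45", 1, 1, 7200, 10.0),
        ("2025-01-02T22:45", 1, 1, 3600, 11.0),
        ("2025-01-02T23:00", 1, 1, 7200, 12.0),
    ],
)

# Smallest horizon per event, sensor and source.
query = (
    "SELECT event_start, sensor_id, source_id, MIN(belief_horizon) "
    "FROM timed_beliefs GROUP BY event_start, sensor_id, source_id"
)

# Every column the query touches is in the index, so no table read is needed.
plan_details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
result = sorted(conn.execute(query).fetchall())

print(plan_details)
print(result)
```

The plan reports a covering index, meaning the query is served without reading the table rows. In Postgres, the comparable effect would be an index-only scan, which is subject to its own conditions.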

This PR is an improvement over the status quo, but we might revisit indexing when we have much more data.

…ing min() on it)

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
Signed-off-by: Nicolas Höning <nicolas@seita.nl>
@nhoening nhoening requested a review from Flix6x March 1, 2024 09:03
@Flix6x Flix6x merged commit 9932242 into main Mar 1, 2024
5 checks passed
@Flix6x Flix6x deleted the fix/fix-application-of-unique--and-add-quad-index-for-search branch March 1, 2024 09:50
2 participants