add index on timed_beliefs for faster search #167

nhoening · 2024-02-27T16:46:49Z

This will speed up search_session(), as that query and its subquery are looking for the same fields.

Note that I removed the usage of has_inherited_table(), which blocked the existing UniqueConstraint from being applied. Let me know if there was a strong argument for using it that I did not find. This function tests whether one of the classes the model inherits from has a table associated with it. I believe you might have intended this as a protection of some sort? In FlexMeasures, we inherit from db.Model and from tb.TimedBeliefDBMixin, neither of which has a table specified.
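
To illustrate the point about has_inherited_table(), here is a minimal sketch using SQLAlchemy's declarative API. The names `Base` and `BeliefMixin` are hypothetical stand-ins for db.Model and tb.TimedBeliefDBMixin, and the column set is shortened; this is not the actual timely-beliefs code.

```python
from sqlalchemy import Column, DateTime, Float, Integer
from sqlalchemy.orm import declarative_base, has_inherited_table

# Hypothetical stand-in for db.Model: a declarative base without a table.
Base = declarative_base()


class BeliefMixin:
    """Hypothetical stand-in for tb.TimedBeliefDBMixin: columns only, no table."""

    event_start = Column(DateTime(timezone=True), primary_key=True)
    sensor_id = Column(Integer, primary_key=True)
    event_value = Column(Float)


class TimedBelief(Base, BeliefMixin):
    __tablename__ = "timed_beliefs"


# The check only looks at the classes TimedBelief inherits from.
# Neither Base nor BeliefMixin has a table, so nothing is "inherited".
inherits_table = has_inherited_table(TimedBelief)
print(inherits_table)
```

Under these assumptions, any constraint that is only added when the check is true would never be added for a model defined this way.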

As to the unique constraint: it was never applied, because has_inherited_table() returned False. If we apply it, we no longer allow beliefs for the same event (and from the same source...) with different probabilities (one failing test confirmed this, as well). So I decided we don't need this constraint. The combined PK covers the same fields plus the probability, so it seems to me we are fine.
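
The conflict between the four-column unique constraint and probabilistic beliefs can be reproduced in a small sketch. It uses SQLite from the Python standard library rather than Postgres, with simplified column types and a hypothetical index name `quad_unique`, so it only illustrates the principle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE timed_beliefs (
        event_start TEXT NOT NULL,
        belief_horizon INTEGER NOT NULL,
        cumulative_probability REAL NOT NULL,
        event_value REAL,
        sensor_id INTEGER NOT NULL,
        source_id INTEGER NOT NULL,
        PRIMARY KEY (event_start, belief_horizon, cumulative_probability,
                     sensor_id, source_id)
    )
    """
)

# Two beliefs about the same event, horizon, sensor and source that differ
# only in cumulative probability: the primary key accepts both.
rows = [
    ("2025-01-02T22:45:00+00:00", 7200, 0.5, 10.0, 1, 1),
    ("2025-01-02T22:45:00+00:00", 7200, 0.9, 12.0, 1, 1),
]
conn.executemany("INSERT INTO timed_beliefs VALUES (?, ?, ?, ?, ?, ?)", rows)

# A unique index that leaves out the probability cannot hold for this data.
try:
    conn.execute(
        "CREATE UNIQUE INDEX quad_unique "
        "ON timed_beliefs (event_start, belief_horizon, sensor_id, source_id)"
    )
    unique_index_created = True
except sqlite3.IntegrityError:
    unique_index_created = False

row_count = conn.execute("SELECT COUNT(*) FROM timed_beliefs").fetchone()[0]
print(unique_index_created, row_count)
```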

nhoening · 2024-02-27T17:02:06Z

Now one test is failing, example below, seemingly because it enters a belief which violates the unique constraint. I'll take a look later.

FAILED timely_beliefs/tests/test_belief_query.py::test_select_most_recent_probabilistic_beliefs - sqlalchemy.exc.IntegrityError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "timed_beliefs_quad_unique_and_search_idx"
DETAIL:  Key (event_start, belief_horizon, sensor_id, source_id)=(2025-01-02 22:45:00+00, 02:00:00, 1, 1) already exists.

[SQL: INSERT INTO timed_beliefs (event_start, belief_horizon, cumulative_probability, event_value, sensor_id, source_id) VALUES (%(event_start__0)s, %(belief_horizon__0)s, %(cumulative_probability__0)s, %(event_value__0)s, %(sensor_id__0)s, %(source_id__0) ... 4954 characters truncated ... on__37)s, %(cumulative_probability__37)s, %(event_value__37)s, %(sensor_id__37)s, %(source_id__37)s)]

Also a note from looking at results: we have 9867 warnings, some of which are DeprecationWarnings or FutureWarnings from Pandas, others are from us: UserWarning: <BeliefSource Source A> created from 'Source A'.

This seems useful: /home/runner/work/timely-beliefs/timely-beliefs/timely_beliefs/beliefs/classes.py:1086: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.

Flix6x · 2024-02-29T20:08:11Z

timely_beliefs/beliefs/classes.py

+                "event_start",
+                "source_id",
+                "sensor_id",
+                "belief_horizon",


As I understood it, the order matters here, and it should lead to some kind of a funnel from most unique values to fewest unique values? If that is the case, I'd expect the following ordering from most unique to fewest unique values in a typical database:

event start: data covers a large period and slowly but steadily grows over time

sensor: grows with the size of the system being serviced, but let's say with less than one (hourly) sensor per hour

belief horizon: not many unique values with respect to the previous two, and likely a quite constant number

source: may grow with the number of API users, but once you have the sensor, there are usually only a couple of sources

Just my two cents on the matter.

That said, I don't really understand why/how the order would matter and why the database doesn't take care of figuring out the best order of such things.
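
As a rough illustration of why column order can matter: a B-tree index is sorted by its first column, then the second, and so on, so a filter that skips the leading column usually cannot use it for a direct lookup. The sketch below uses SQLite from the Python standard library (not Postgres, whose planner differs in detail), and the index name `search_idx` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timed_beliefs ("
    "event_start TEXT, sensor_id INTEGER, source_id INTEGER, "
    "belief_horizon INTEGER, event_value REAL)"
)
conn.execute(
    "CREATE INDEX search_idx "
    "ON timed_beliefs (event_start, sensor_id, source_id, belief_horizon)"
)


def plan(where_clause: str) -> str:
    """Return SQLite's query plan for a filter on timed_beliefs."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT event_value FROM timed_beliefs WHERE "
        + where_clause
    ).fetchall()
    # The last field of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Filter on the leading index columns: the index can be searched.
plan_leading = plan("event_start = '2025-01-02' AND sensor_id = 1")
# Filter only on the last index column: SQLite falls back to a table scan.
plan_trailing = plan("belief_horizon = 7200")
print(plan_leading)
print(plan_trailing)
```

The database does pick the best plan for a given index, but it cannot reorder the columns of an index that already exists; that choice is made when the index is created.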

nhoening · 2024-03-01

Thanks. I read somewhere that range columns (like event_start) are a good place to lead with, as well.

To be honest, I will simply add the query that Mike suggested in the end, but not because I see a performance difference. After #166, queries for the sensors in our dataset with little data are so fast that I suspect the effect of an index is not visible. And for the sensor with > 50% of the data, Postgres seems to ignore the index by default.

Not listing belief_horizon in the index (but including it as column) makes sense, as we are using min() on it.
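
A minimal sketch of such an index in SQLAlchemy, assuming the `postgresql_include` option of `Index` (available for the PostgreSQL dialect). The index name and the order of the key columns are assumptions made for illustration, not necessarily what this PR ended up with.

```python
from sqlalchemy import (
    Column,
    DateTime,
    Float,
    Index,
    Integer,
    Interval,
    MetaData,
    Table,
)
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateIndex

metadata = MetaData()
timed_beliefs = Table(
    "timed_beliefs",
    metadata,
    Column("event_start", DateTime(timezone=True), primary_key=True),
    Column("belief_horizon", Interval, primary_key=True),
    Column("cumulative_probability", Float, primary_key=True),
    Column("event_value", Float),
    Column("sensor_id", Integer, primary_key=True),
    Column("source_id", Integer, primary_key=True),
)

# Hypothetical index name and key-column order, for illustration only.
# belief_horizon is stored in the index as a non-key (INCLUDE) column.
search_idx = Index(
    "timed_beliefs_search_idx",
    timed_beliefs.c.event_start,
    timed_beliefs.c.sensor_id,
    timed_beliefs.c.source_id,
    postgresql_include=["belief_horizon"],
)

# Render the DDL for Postgres without connecting to a database.
ddl = str(CreateIndex(search_idx).compile(dialect=postgresql.dialect()))
print(ddl)
```

The rendered statement should carry an INCLUDE (belief_horizon) clause: the column is not part of the search key, but it is stored in the index, so an aggregate like min() may be answered from the index alone.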

This PR is an improvement over the status quo, but we might revisit indexing when we have much more data.

fix application of unique index on timed_beliefs, also add index on t…

af4f30e

…he same fields so searching is sped up Signed-off-by: Nicolas Höning <nicolas@seita.nl>

nhoening added the "Database support" (Dealing with databases) label Feb 27, 2024

nhoening requested a review from Flix6x February 27, 2024 16:47

flake8

fee7f4f

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

nhoening mentioned this pull request Feb 27, 2024

Fix/add event filters to search subquery #166

Merged

nhoening added 5 commits February 28, 2024 13:18

the order of the fields matters, has to match the query's order

fc1587c

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

add unique constraints separately, as it needs the probability fields…

a90c932

… as well Signed-off-by: Nicolas Höning <nicolas@seita.nl>

flake8

0f0be0e

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

no UNIQUE index - without probability it makes no sense, with is cove…

d0c2c14

…red by primary key Signed-off-by: Nicolas Höning <nicolas@seita.nl>

remove unused import

93fd143

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

nhoening changed the title ~~fix application of unique index on timed_beliefs, also add index~~ add index on timed_beliefs for faster search Feb 28, 2024

nhoening mentioned this pull request Feb 28, 2024

db: improve belief search with new index FlexMeasures/flexmeasures#992

Merged

Flix6x approved these changes Feb 29, 2024

nhoening added 2 commits March 1, 2024 10:00

do not include belief_horizon in index, just add as column (we are us…

b89a696

…ing min() on it) Signed-off-by: Nicolas Höning <nicolas@seita.nl>

black

f0a2b2a

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

nhoening requested a review from Flix6x March 1, 2024 09:03

nhoening added 2 commits March 1, 2024 10:38

Merge branch 'main' into fix/fix-application-of-unique--and-add-quad-…

6f115ad

…index-for-search

black with version we use in pre-commit

e104c01

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

Flix6x approved these changes Mar 1, 2024

Flix6x merged commit 9932242 into main Mar 1, 2024
5 checks passed

Flix6x deleted the fix/fix-application-of-unique--and-add-quad-index-for-search branch March 1, 2024 09:50
