Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EVENT_PUBLICATION insert fails due to big SERIALIZED_EVENT #519

Open
razubuddy opened this issue Feb 26, 2024 · 12 comments
Open

EVENT_PUBLICATION insert fails due to big SERIALIZED_EVENT #519

razubuddy opened this issue Feb 26, 2024 · 12 comments
Assignees
Labels
in: event publication registry Event publication registry type: improvement Minor improvements
Milestone

Comments

@razubuddy
Copy link

Spring Modulith: 1.2.0-M2

After introducing indexes on ENVENT_PUBLICATION table (#403) events with bigger payload are not working due to index size limit.

INSERT INTO EVENT_PUBLICATION (ID, EVENT_TYPE, LISTENER_ID, PUBLICATION_DATE, SERIALIZED_EVENT)
VALUES (?, ?, ?, ?, ?)

raises error

ERROR: index row requires 36896 bytes, maximum size is 8191
@odrotbohm odrotbohm self-assigned this Feb 27, 2024
@odrotbohm odrotbohm added the meta: waiting for feedback Waiting for feedback of the original reporter label Feb 27, 2024
@odrotbohm
Copy link
Member

If you want to use bigger event payloads, you need to adapt the schema definitions declared here to scale up the size of the serialized event column.

@odrotbohm odrotbohm added the in: event publication registry Event publication registry label Feb 27, 2024
@razubuddy
Copy link
Author

Yes, I use default JDBC PostgreSQL schema
Default provided schema for PostgreSQL in 1.2.0-M2 looks like this:

CREATE TABLE IF NOT EXISTS event_publication
(
  id               UUID NOT NULL,
  listener_id      TEXT NOT NULL,
  event_type       TEXT NOT NULL,
  serialized_event TEXT NOT NULL,
  publication_date TIMESTAMP WITH TIME ZONE NOT NULL,
  completion_date  TIMESTAMP WITH TIME ZONE,
  PRIMARY KEY (id)
);
CREATE INDEX IF NOT EXISTS event_publication_by_listener_id_and_serialized_event_idx ON event_publication (listener_id, serialized_event);
CREATE INDEX IF NOT EXISTS event_publication_by_completion_date_idx ON event_publication (completion_date);

The index event_publication_by_listener_id_and_serialized_event_idx is justified by query in JdbcEventPublicationRepository:

UPDATE EVENT_PUBLICATION
SET COMPLETION_DATE = ?
WHERE
    LISTENER_ID = ?
    AND SERIALIZED_EVENT = ?

This make me think I should care SERIALIZED_EVENT to be relatively small if it is indexed and Spring Modulith is designed in this direction, so I think using TEXT datatype is misleading.
On the other hand if you plan to support bigger event payloads, then it shouldn't be indexed and updating COMPLETION_DATE query could be redesigned, am I wright?

@odrotbohm
Copy link
Member

I am not a Postgres expert, but it looks like we might need to optimize our index declaration. Would you mind adding USING GIN to the declaration and try again, as recommended here?

@razubuddy
Copy link
Author

GIN, GIST index types are dedicated for column types like tsvector, it is won't create for text column type

@matiwinnetou
Copy link

Also running into this issue, in my case I can make events smaller but not sure how long this trick will suffice.

@matiwinnetou
Copy link

@odrotbohm what if one introduces another field, a hash of serialised_event and then put index on it?

@matiwinnetou
Copy link

It seems like GIN indices are supported by using a trick:

image

@matiwinnetou
Copy link

image

@matiwinnetou
Copy link

Here is my solution, I had to split index into 2:

-- Create a GIN index on just the tsvector column
CREATE INDEX IF NOT EXISTS serialized_event_tsvector_idx 
ON event_publication 
USING gin (to_tsvector('simple', serialized_event));

-- Create a separate index for listener_id if needed
CREATE INDEX IF NOT EXISTS listener_id_idx 
ON event_publication 
(listener_id);

Unfortunately I could not use JSONB as it is recommended because this is a serialized_event is a TEXT field on JPA in Modulith.

@JakeTiritilli
Copy link

I'm not sure about other databases, but for PostgreSQL we found that the hash index works well given that Modulith is doing a full equality comparison on the serialized event instead of searching for a substring. And it's not restricted by the size limit that the b-tree index has. We saw a significant performance improvement in the setting of the completion timestamp after adding this index.

CREATE INDEX IF NOT EXISTS event_publication_serialized_event_hash_idx ON event_publication USING hash(serialized_event);

@dustinsand
Copy link

Hi @odrotbohm , in case you missed the post from @JakeTiritilli about the hash index for PostgreSQL, I wanted to bring it to your attention because it dramatically reduced the load on PostgreSQL when the event_publication table had a large volume of records (1+ million in our case during performance testing) for SELECT and UPDATE. The SQL commands are now doing an "Index Scan using event_publication_serialized_event_hash_idx on event_publication" with O(1) !

@odrotbohm
Copy link
Member

I'll make sure the optimization gets applied in the 1.2 GA release in two weeks and the corresponding service releases for 1.0.x and 1.1.x.

@odrotbohm odrotbohm added this to the 1.2 milestone May 7, 2024
@odrotbohm odrotbohm added type: improvement Minor improvements and removed meta: waiting for feedback Waiting for feedback of the original reporter labels May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: event publication registry Event publication registry type: improvement Minor improvements
Projects
None yet
Development

No branches or pull requests

5 participants