
Proposal: remold the storage layer to facilitate large bulk operations #8731

mwoodiupui opened this issue Mar 22, 2023 · 3 comments
@mwoodiupui (Member) commented Mar 22, 2023:

Neither a bug nor a new feature IMO, so I didn't know which to pick.

Is your feature request related to a problem? Please describe.
I have noticed some recurring patterns associated with bulk operations.

DAO methods which can return multiple results tend to do so as List. Very large lists may consume significant amounts of memory, and collecting them may defeat the DBMS's efforts to overlap storage, network, and compute operations.

Because Context encapsulates a single Session, large bulk operations have a serious issue with transaction lifetimes. Such an operation typically uses a DBMS query to produce a sequence of DSOs which are (possibly) modified one by one. Operating on a large number of DSOs can take minutes or hours, and the single transaction must be held open for the duration so as not to detach entities in the list which have not yet been considered. Yet the natural lifetime of a transaction for the actual modifications should be a single DSO (and perhaps its dependents) and is measured in milliseconds. The memory pressure from large, long-lived transactions has led to workarounds like DBConnection#uncacheEntity(), but this only shifts the burden from DSpace to the DBMS and does not address the possibility of long-held locks.
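For concreteness, the current single-Session pattern looks roughly like this (a sketch; findAll, update, uncacheEntity, and complete are existing DSpace APIs, while the per-item work is hypothetical):

```java
import java.sql.SQLException;
import java.util.Iterator;
import org.dspace.authorize.AuthorizeException;
import org.dspace.content.Item;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class BulkPassToday {
    // One Context wraps one Session and one transaction, which must stay
    // open for the entire bulk pass.
    void run(ItemService itemService) throws SQLException, AuthorizeException {
        Context context = new Context();
        Iterator<Item> items = itemService.findAll(context);
        while (items.hasNext()) {
            Item item = items.next();
            // ... per-item work (hypothetical) ...
            itemService.update(context, item);
            // Workaround: evict the entity so the Session cache doesn't grow
            // without bound. This shifts the burden to the DBMS; it does not
            // shorten the transaction or release any locks.
            context.uncacheEntity(item);
        }
        context.complete(); // the single commit, after minutes or hours
    }
}
```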

For the purpose of discussion, note that bulk operations divide into two categories. Operations which update the database are subject to both issues. Operations which merely produce reports, without altering the database, avoid the transaction-lifetime conflict.

Describe the solution you'd like
Since one almost always uses a List as a generator rather than fishing around in it at random, DAOs should return Iterable or Stream when multiple results are possible. The implementation can rely on the underlying DBMS driver to produce results on demand, freeing the DBMS and driver to optimize result delivery and minimizing memory pressure.
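As a sketch of what such a DAO method could look like (the method name streamAll is hypothetical; Query#getResultStream is the standard Hibernate/JPA API and pulls rows from an open cursor on demand):

```java
import java.util.stream.Stream;
import org.dspace.content.Item;
import org.hibernate.Session;
import org.hibernate.query.Query;

public class StreamingItemDAO {
    // Hypothetical DAO method: return a lazily populated Stream instead of
    // a fully materialized List, so the driver can fetch rows as consumed.
    public Stream<Item> streamAll(Session session) {
        Query<Item> query = session.createQuery("FROM Item", Item.class);
        query.setFetchSize(100); // hint to the driver: fetch in modest batches
        return query.getResultStream(); // caller should close, e.g. try-with-resources
    }
}
```

Note that the caller must consume the Stream while the owning Session and transaction are still open, which is exactly where the two-Session arrangement below comes in.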

Using a read-only Session to produce the sequence of DSOs for consideration, and a separate read-write Session to modify individual DSOs, addresses the tension over transaction lifetimes, since each has its own transaction. A read-only Session should require only read locks at most, and may be implemented by the DBMS entirely without long-term locks, or with advisory locks. All of that locking should be internal to the DBMS and of no concern to DSpace. Context could expose Sessions rather than keeping them hidden inside DBConnection, or Sessions could be separated entirely from Context. I have been unable to think of a use for more than two Sessions, so a Context method to return (and, if necessary, create) a second should be sufficient.
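A minimal sketch of that shape, assuming the two Session handles are obtained from Context by some new accessor (hypothetical here):

```java
import java.util.stream.Stream;
import org.dspace.content.Item;
import org.hibernate.Session;
import org.hibernate.Transaction;

public class TwoSessionBulkPass {
    // readSession streams the candidates under one long-lived, read-only
    // transaction; writeSession commits each modification in its own
    // short transaction.
    void run(Session readSession, Session writeSession) {
        readSession.setDefaultReadOnly(true);
        Transaction readTx = readSession.beginTransaction();
        try (Stream<Item> items = readSession
                .createQuery("FROM Item", Item.class)
                .getResultStream()) {
            items.forEach(item -> {
                Transaction tx = writeSession.beginTransaction();
                try {
                    // Load the DSO afresh in the read-write Session.
                    Item writable = writeSession.get(Item.class, item.getID());
                    // ... modify 'writable' (hypothetical per-item work) ...
                    tx.commit(); // write-transaction lifetime: one DSO, milliseconds
                } catch (RuntimeException e) {
                    tx.rollback();
                    throw e;
                }
            });
        }
        readTx.commit(); // read-only: nothing to flush, locking stays inside the DBMS
    }
}
```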

I have implemented the two-Session approach as a quick hack for a specific problem, and it worked well.

Describe alternatives or workarounds you've considered
A bulk operation could hand over each DSO to a new Thread which supplies its own Context. A minimal implementation would simply join() the single worker thread. Pooled implementations are conceivable.
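A minimal sketch of the single-worker variant (names hypothetical; the point is that each unit of work gets a fresh Context, hence its own Session and transaction):

```java
import java.util.UUID;
import org.dspace.content.Item;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class PerItemWorker implements Runnable {
    private final ItemService itemService;
    private final UUID itemId; // pass the ID across threads, never the entity

    PerItemWorker(ItemService itemService, UUID itemId) {
        this.itemService = itemService;
        this.itemId = itemId;
    }

    @Override
    public void run() {
        try {
            Context context = new Context(); // fresh Context: own Session and transaction
            Item item = itemService.find(context, itemId);
            // ... modify 'item' (hypothetical per-item work) ...
            itemService.update(context, item);
            context.complete(); // commit and release promptly
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The driving loop would start a Thread on each worker and join() it before dispatching the next DSO; a pooled variant would submit the same Runnables to an ExecutorService.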

A rough-and-ready alternative would be to immediately drain the results into a list of detached entities, which are then reloaded as required. This allows closing the transaction that wraps the query which produced the list. It seems rather inefficient, since a large result set would tend to start with entities which have already been pushed out of the Session cache and the second-level cache while the list was accumulating. This addresses the transaction-lifetime issue, but only incompletely addresses the memory pressure issue.
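A sketch of that variant, using Context#commit() to close the wrapping transaction and Context#reloadEntity() to re-attach each detached entity on demand (both existing DSpace APIs; the per-item work is hypothetical):

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.dspace.authorize.AuthorizeException;
import org.dspace.content.Item;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

public class DrainAndReload {
    void run(ItemService itemService) throws SQLException, AuthorizeException {
        Context context = new Context();

        // Drain the whole result set up front; the entities detach when the
        // wrapping transaction commits. The full list still sits in memory.
        List<Item> drained = new ArrayList<>();
        Iterator<Item> found = itemService.findAll(context);
        found.forEachRemaining(drained::add);
        context.commit(); // closes the query's transaction

        for (Item detached : drained) {
            Item item = context.reloadEntity(detached); // re-attach on demand
            // ... modify 'item' (hypothetical per-item work) ...
            itemService.update(context, item);
            context.commit(); // short, per-DSO transaction
        }
        context.complete();
    }
}
```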

@mwoodiupui added the new feature, needs discussion, backend: Hibernate, and needs triage labels on Mar 22, 2023
@mwoodiupui added this to the 8.0 milestone on Mar 22, 2023
@tdonohue (Member) commented:
Thanks @mwoodiupui. Good brainstorms.

I wonder if you could link to the example you mentioned above (if it's on GitHub somewhere):

I have implemented the two-Session approach as a quick hack for a specific problem, and it worked well.

That might provide more "clues" for future discussion. For lack of a better place to put this right now, I'm moving this to the "Needs Discussion / Analysis" column. I agree this sort of thing would be useful to analyze further; I'm just uncertain of its priority right now (but a volunteer or steward could help push this forward in 8.x or similar, if anyone gets time/interest).

@tdonohue removed the needs triage label on Mar 22, 2023
@amgciadev (Contributor) commented:
@mwoodiupui thanks a lot for adding this one!

@mwoodiupui (Member, Author) commented:
The above-mentioned hack, as a patch against dspace-api 7.4, is at https://gist.github.com/mwoodiupui/3a07b72f40b702c444335679ed0269fd

This illustrates both the two-Session approach and the use of a Stream for the result set. But I had to bypass the DAO layer to do either.
