Proposal: remold the storage layer to facilitate large bulk operations #8731
Labels: backend: Hibernate, needs discussion, new feature
**Is your feature request related to a problem? Please describe.**
I have noticed some recurring patterns associated with bulk operations.
DAO methods which can return multiple results tend to do so as a `List`. Very large lists may consume significant amounts of memory, and collecting them may defeat the DBMS's efforts to overlap storage, network, and compute operations.
Because `Context` encapsulates a single `Session`, large bulk operations have a serious issue with transaction lifetimes. Such an operation typically uses a DBMS query to produce a sequence of DSOs which are (possibly) modified one by one. Operating on a large number of DSOs can take minutes or hours, and the single transaction must be held open for the duration so as not to detach entities in the list which have not yet been considered. Yet the natural lifetime of a transaction for the actual modifications is a single DSO (and perhaps its dependents), measured in milliseconds. The memory pressure from large, long-lived transactions has led to work-arounds like `DBConnection#uncacheEntity()`, but that only shifts the burden from DSpace to the DBMS and does not address the possibility of long-held locks.

For the purpose of discussion, note that bulk operations divide into two categories: operations which update the database are subject to both issues, while operations which merely produce reports, without altering the database, do not have the transaction-lifetime conflict.
**Describe the solution you'd like**
Since one almost always uses a `List` as a generator rather than fishing around in it at random, DAOs should return `Iterable` or `Stream` when multiple results are possible. The implementation can rely on the underlying DBMS driver to produce results on demand, freeing the DBMS and driver to optimize result delivery and minimizing memory pressure.
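As a concrete sketch (assuming Hibernate 5.2 or later, where `Query#stream()` is available; the class, the method name, and the query are illustrative, not existing DSpace API), a DAO method might hand back a lazily-populated `Stream`:

```java
import java.util.stream.Stream;

import org.dspace.content.Item;
import org.hibernate.Session;

/** Illustrative DAO: return results as a lazily-populated Stream. */
public class ItemStreamDao {
    /**
     * Rows are fetched on demand from the JDBC driver's cursor instead of
     * being collected into one large List.
     */
    public Stream<Item> findAllItems(Session session) {
        return session.createQuery("FROM Item", Item.class)
                .setReadOnly(true)   // results will not be dirty-checked
                .setFetchSize(100)   // hint: fetch incrementally, not all at once
                .stream();           // Hibernate 5.2+: lazily-populated Stream
    }
}
```

The caller would consume it in a try-with-resources block (`try (Stream<Item> items = dao.findAllItems(session)) { ... }`) so that the underlying cursor is released promptly.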
Using a read-only `Session` to produce the sequence of DSOs for consideration, and a separate read-write `Session` to modify individual DSOs, resolves the tension over transaction lifetimes, since each has its own transaction. A read-only `Session` should require only read locks at most, and may be implemented by the DBMS entirely without long-term locks, or with advisory locks. All of that locking should be internal to the DBMS and of no concern to DSpace.
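In plain Hibernate, the pattern might look like the following sketch (Hibernate 5.2+ assumed; `bulkRetouch`, the `Item` query, and the per-item mutation are placeholders, not existing DSpace code):

```java
import java.util.stream.Stream;

import org.dspace.content.Item;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

/** Sketch of the two-Session pattern; not existing DSpace code. */
public class TwoSessionExample {
    public void bulkRetouch(SessionFactory factory) {
        try (Session reader = factory.openSession();    // produces the sequence of DSOs
             Session writer = factory.openSession()) {  // modifies them one at a time
            reader.setDefaultReadOnly(true);            // reader never dirty-checks or flushes
            Transaction readTx = reader.beginTransaction();
            try (Stream<Item> candidates = reader
                    .createQuery("FROM Item", Item.class)
                    .setFetchSize(100)                  // fetch incrementally from the driver
                    .stream()) {
                candidates.forEach(candidate -> {
                    Transaction writeTx = writer.beginTransaction();  // millisecond-scale
                    try {
                        // Re-fetch in the writer Session and modify there, so the change
                        // commits independently of the reader's long-lived transaction.
                        Item mine = writer.get(Item.class, candidate.getID());
                        // ... mutate 'mine' ...
                        writeTx.commit();
                    } catch (RuntimeException e) {
                        writeTx.rollback();
                        throw e;
                    }
                    reader.evict(candidate);            // keep the reader's cache small
                });
            }
            readTx.commit();
        }
    }
}
```

Note that the reader's transaction still spans the whole run, but since it is read-only it need hold no write locks, and evicting each entity after use keeps the reader's session cache from growing.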
`Context` could expose `Session`s rather than keeping them hidden inside `DBConnection`, or `Session`s could be separated entirely from `Context`. I have been unable to think of a use for more than two `Session`s, so a `Context` method to return (and, if necessary, create) a second should be sufficient. I have implemented the two-`Session` approach as a quick hack for a specific problem, and it worked well.
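For illustration only, the second-`Session` accessor on `Context` might look like this fragment (the method name, constructor, and semantics are invented here, not proposed in final form):

```java
import org.hibernate.Session;
import org.hibernate.SessionFactory;

/**
 * Hypothetical fragment of org.dspace.core.Context: lazily create and
 * hand out a single auxiliary read-only Session alongside the primary one.
 */
public class Context implements AutoCloseable {
    private final SessionFactory sessionFactory;
    private Session auxiliarySession;   // second Session, created on demand

    public Context(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    /** Return the auxiliary read-only Session, creating it on first use. */
    public Session getAuxiliarySession() {
        if (auxiliarySession == null) {
            auxiliarySession = sessionFactory.openSession();
            auxiliarySession.setDefaultReadOnly(true);
        }
        return auxiliarySession;
    }

    /** Closing the Context must also release the auxiliary Session. */
    @Override
    public void close() {
        if (auxiliarySession != null && auxiliarySession.isOpen()) {
            auxiliarySession.close();
        }
        // ... close the primary Session, commit or abort, etc. ...
    }
}
```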
**Describe alternatives or workarounds you've considered**
A bulk operation could hand over each DSO to a new `Thread` which supplies its own `Context`. A minimal implementation would simply `wait()` on the single worker thread. Pooled implementations are conceivable (a minimal sketch appears below).

A rough-and-ready alternative would be to drain the query's results immediately into a list of detached entities, which are then reloaded as required. This allows the transaction that wraps the query to be closed promptly. It seems rather inefficient, since a large result set would tend to begin with entities that were pushed out of the session cache and the second-level cache while the list was accumulating. This addresses the transaction-lifetime issue, but only incompletely addresses the memory pressure.
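A minimal sketch of the worker-thread alternative, assuming the candidate ids have been gathered up front (`retouchEach` and the mutation are placeholders; `ItemService#find` and `Context#complete()`/`abort()` are, to my understanding, existing DSpace 7 API):

```java
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.dspace.content.Item;
import org.dspace.content.factory.ContentServiceFactory;
import org.dspace.content.service.ItemService;
import org.dspace.core.Context;

/** Sketch of the per-DSO worker-thread workaround; not existing DSpace code. */
public class WorkerThreadExample {
    public void retouchEach(Iterable<UUID> candidateIds) throws InterruptedException {
        ItemService itemService = ContentServiceFactory.getInstance().getItemService();
        ExecutorService worker = Executors.newSingleThreadExecutor(); // one worker, as above
        for (UUID id : candidateIds) {
            worker.submit(() -> {
                Context taskContext = new Context();  // fresh Context (own transaction) per DSO
                try {
                    Item item = itemService.find(taskContext, id);  // re-load in this Context
                    // ... mutate 'item', then itemService.update(taskContext, item) ...
                    taskContext.complete();           // commit this task's transaction
                } catch (Exception e) {
                    taskContext.abort();              // roll back on failure
                }
            });
        }
        worker.shutdown();
        worker.awaitTermination(1, TimeUnit.HOURS);   // arbitrary bound for the sketch
    }
}
```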