Add @atdatabases/sqlite-pool module #289
Conversation
Signed-off-by: Matteo Collina <hello@matteocollina.com>
- @databases/sqlite-pool (unreleased → 1.0.0): Breaking Changes
- @databases/sqlite-sync (1.0.0 → 1.1.0): New Features

Packages With No Changes: the following packages have no user-facing changes, so won't be released.

Docs are missing, will add asap.
Signed-off-by: Matteo Collina <hello@matteocollina.com>
```ts
async query(query: SQLQuery): Promise<any[]> {
  const poolConnection = await this.#pool.getConnection();
  try {
    const res = poolConnection.connection.query(query);
```
Since this step, which runs the actual query, is synchronous, you cannot possibly get any parallel execution of queries with this approach.
To test this, you can run one very slow query (e.g. joining two very large tables) in parallel with one very fast query. If the slow query starts first, it will block the thread until it is finished, at which point the fast query will finally run.
To get actual parallel query execution using the synchronous SQLite client, you'll need to use worker threads (https://nodejs.org/api/worker_threads.html). You'd create one thread per connection and then pass queries to the threads to actually execute.
Using `worker_threads` would unfortunately be slower for the vast majority of cases, as the marshalling would wreck us. That's not the goal of what I'm trying to achieve with this anyway.
I have two goals:
- achieve the same interface as the pg/mysql/sqlite modules without having a mutex
- perform parallel read transactions.
You cannot have:
- Synchronous underlying database client/connection
- Single threaded
- Parallel queries
You must choose two. As soon as you submit one query via `poolConnection.connection.query(query)`, the JavaScript thread that submitted the query is blocked until it receives a response, meaning that it cannot submit a second query to run in parallel, even if it has a separate connection to the database to run that parallel query on.
To demonstrate this, please create a file called `index.js` with the following contents:
```js
function createFakeConnection() {
  return {
    query: async () => {
      const end = Date.now() + 1_000;
      while (Date.now() < end);
    },
  };
}

async function run() {
  const connectionCount = parseInt(process.argv[2], 10);
  console.log(`Running ${connectionCount} parallel queries...`);
  const connections = [];
  for (let i = 0; i < connectionCount; i++) {
    connections.push(createFakeConnection());
  }
  const start = Date.now();
  await Promise.all(connections.map((connection) => connection.query()));
  const end = Date.now();
  console.log(`Took ${end - start}ms`);
}

run().catch((ex) => {
  console.error(ex);
  process.exit(1);
});
```
Now try running the following:
- `node index.js 1` - will take 1000ms
- `node index.js 2` - will take 2000ms
- `node index.js 3` - will take 3000ms
The `async` keyword does not create a thread; it just says "this function should return a promise and it may use the `await` keyword to yield to promises during execution". If the underlying resource is still synchronous, it will still block the entire thread. I've made the underlying resource a JavaScript `while` loop here to try and make this more obvious, but it makes no difference whether it's in JavaScript or a native C++ module.
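For contrast, here is the same harness with a genuinely asynchronous fake resource. Because `setTimeout` yields the thread instead of blocking it, the timings overlap (this counter-example is mine, not from the original demo):

```javascript
// Same shape as the harness above, but the fake "query" is genuinely
// asynchronous: setTimeout yields the JavaScript thread, so two
// "parallel" queries overlap instead of running back to back.
function createAsyncFakeConnection() {
  return {
    query: () => new Promise((resolve) => setTimeout(resolve, 1_000)),
  };
}

async function run() {
  const connections = [createAsyncFakeConnection(), createAsyncFakeConnection()];
  const start = Date.now();
  await Promise.all(connections.map((connection) => connection.query()));
  console.log(`Took ${Date.now() - start}ms`); // ~1000ms, not ~2000ms
}

run();
```

This is exactly the behaviour users expect from an async pool API, and exactly what a synchronous client cannot provide on a single thread.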
Getting back to the three options that you can only pick 2 from:
- Synchronous underlying database client/connection
- Single threaded
- Parallel queries
We currently have:
- `@databases/sqlite-sync` - chooses 1 and 2, but not 3 (i.e. no parallel queries)
- `@databases/sqlite` - chooses 2 and 3, but not 1 (i.e. parallel read queries, but the underlying database client is async using threads within C++)
This proposal is still 1 and 2 but not 3, it's just hidden behind additional layers of complexity.
I agree that doing 1 and 3 may not be faster for a lot of use cases - it'll only help if the queries are slow but the amount of data is small, so marshalling is fast. The only alternative, though, is to continue to use `sqlite3`, which is actually async, or build something new at the native level that combines the async support of `sqlite3` with the better support for BIGINT etc. from `better-sqlite3`.
`better-sqlite3` does propose & have a somewhat decent example of how to do the multi-threaded approach here: https://github.com/WiseLibs/better-sqlite3/blob/HEAD/docs/threads.md You'd need to do some benchmarking to see how much slower this made things.
P.S. "achieve the same interface of the pg/mysql/sqlite modules" is fairly trivial, it just doesn't achieve the second goal of parallel queries.
P.S. You can somewhat minimise the impact of marshalling by handling the `SQLQuery` formatting steps in the primary thread, so that all that's passed through to the child threads is a `string` representation of the query and an array of values.
I'm aware of all you are saying, minus the fact that it actually works for the `tx` method. See https://github.com/ForbesLindesay/atdatabases/pull/289/files#r1161826962 for more details.
```js
  await db.dispose();
});

test('two parallel queries', async () => {
```
@ForbesLindesay this is the use case I’m talking about. Note that the two transactions are running in parallel.
Note that the db access is still synchronous, but if other asynchronous actions (e.g. an HTTP request) happen within the transaction, things can progress in parallel.
I don't want to merge something that only adds this pseudo-asynchronicity and does not add true parallel queries:
- Doing other async work inside a transaction is normally a very bad idea.
- In SQLite it means that no other database transactions or queries that write data can run until that async work completes. You probably don't want to block all database writes while waiting for some external HTTP service.
- Even in other database engines like Postgres, open transactions are an expensive/limited resource. Even very large database servers might be limited to a few hundred open connections/transactions before they stall.
- A transaction cannot be truly atomic if it needs to call external services other than the database. If a later step in the transaction fails, there is no way to "roll back" the external API call. This can often lead to inconsistent states. It's therefore normally better to make a decision to either do the db transaction, then call the external service, or call the external service then do the db transaction. This limits your possible failure modes - you only need to consider what happens if the server crashes between those two steps.
- I'm concerned people would assume this is more parallel than it is. For example, people would assume that `await db.query(...)` would not block the main JavaScript thread, but that assumption is broken here. That's a big part of why I preferred a clearly synchronous API for `sqlite-sync` over this pseudo-async API in the first place.
This test also does not demonstrate that you can even run a read only query in parallel while there is an active open transaction. I do think that may be possible with this code, but I don't think this test actually demonstrates that functionality. Each of these transactions is simply run sequentially, one after another.
If you do still want to do this (and again, I don't personally think it's a good idea), you only actually need two connections: primary (for transactions and queries that write data) and secondary (for queries that only read data). You don't need any of the connection pooling logic because you are not able to run more than one "secondary" query at a time. I'm not sure about including this in the core `@databases` project though, due to the reasons above.
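A rough sketch of that primary/secondary split, using a simple promise-chain lock in place of real connections (all names here - `tx`, `readQuery`, `writeQuery`, the fake connections - are invented for illustration, not the actual @databases API):

```javascript
// Hypothetical two-connection layout: one "primary" connection for
// transactions and writes, guarded by a promise-chain lock, and one
// "secondary" connection for read-only queries that may proceed even
// while a transaction is open on the primary.
function createTwoConnectionDb(primary, secondary) {
  let lock = Promise.resolve();
  const withPrimary = (fn) => {
    const result = lock.then(fn);
    lock = result.catch(() => {}); // keep the chain alive after errors
    return result;
  };
  return {
    // Transactions serialize on the primary connection.
    tx: (fn) => withPrimary(() => fn(primary)),
    // Reads go to the secondary connection, bypassing the lock.
    readQuery: (q) => secondary.query(q),
    // Writes queue behind any open transaction.
    writeQuery: (q) => withPrimary(() => primary.query(q)),
  };
}
```

Because there is exactly one secondary connection, only one read can actually execute at a time, which is the point made above: a pool adds nothing here.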
We could consider something like `@databases/sqlite-sync-compat`: a package that provides an API that looks like the API for `@databases/pg`/`@databases/mysql`/`@databases/sqlite` but is not actually async and still just uses a single connection. You could remove all the connection pooling stuff and simply take a lock on the entire connection for the duration of any transaction. I'm still not sure this is a good idea/something that should be part of the `@databases` core project.
An important non-goal of `@databases` is being able to reuse code between different database engines. I aim to build best-in-class libraries for connecting to and querying each of the supported databases. I want developers who have used `@databases` to query one database engine to feel instantly familiar when querying another database engine, and not have a significant learning curve to understand different APIs. That's why each library has such a similar structure. I do not want to hide genuine differences, though. Many ORM layers can end up being limited to only supporting the feature set that's common to all database engines, or they have very inefficient implementations of some functionality on some database engines while being very fast on others. `@databases` libraries are specific to each database engine, meaning you get the best that that specific engine has to offer. In the case of SQLite via `better-sqlite3`, that means a synchronous API, because the underlying database is not asynchronous. In the case of Postgres and MySQL, that means connection pooling, because the underlying database can support many concurrent connections/queries. My understanding of the platformatic project is that it's a higher-level tool, so it may make sense for you to hide these differences between engines at that level, but I'm not sure `@databases` should do that.
I'm going to release this as a different module, thanks for your feedback!
This module supports the same interface as `@databases/sqlite` but is based on `@databases/sqlite-sync`. The added benefit of this approach is that it supports multiple high-level transactions on top of the same SQLite database on disk.