
Add @atdatabases/sqlite-pool module #289

Closed
wants to merge 9 commits

Conversation

mcollina
Contributor

@mcollina mcollina commented Apr 6, 2023

This module supports the same interface as @databases/sqlite, but it is based on @databases/sqlite-sync. The added benefit of this approach is that it supports multiple high-level transactions on top of the same SQLite database on disk.

SQLite supports multiple simultaneous read transactions coming from separate database connections, possibly in separate threads or processes, but only one simultaneous write transaction (see the SQLite documentation on isolation).
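
A minimal sketch (not part of this PR) of the behaviour that claim relies on, using better-sqlite3 (the client underlying @databases/sqlite-sync) directly, with two independent connections to the same file in WAL mode; the file path is illustrative:

import Database from 'better-sqlite3';

const writer = new Database('/tmp/example.db');
writer.pragma('journal_mode = WAL'); // WAL lets readers run while a writer is active
writer.exec('CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)');

const reader = new Database('/tmp/example.db', { readonly: true });

writer.exec('BEGIN');
writer.prepare('INSERT INTO users (name) VALUES (?)').run('ada');

// The read connection sees a consistent snapshot and is not blocked by the
// open write transaction; it will not see the uncommitted 'ada' row.
console.log(reader.prepare('SELECT * FROM users').all());

writer.exec('COMMIT');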

Signed-off-by: Matteo Collina <hello@matteocollina.com>
@rollingversions

rollingversions bot commented Apr 6, 2023

@databases/sqlite-pool (unreleased → 1.0.0)

Breaking Changes

  • Initial implementation

@databases/sqlite-sync (1.0.0 → 1.1.0)

New Features

  • Expose the DatabaseOptions type

Packages With No Changes

The following packages have no user facing changes, so won't be released:

  • @databases/bigquery
  • @databases/cache
  • @databases/connection-pool
  • @databases/escape-identifier
  • @databases/expo
  • @databases/lock
  • @databases/migrations-base
  • @databases/mock-db
  • @databases/mock-db-typed
  • @databases/mysql
  • @databases/mysql-config
  • @databases/mysql-schema-cli
  • @databases/mysql-schema-introspect
  • @databases/mysql-schema-print-types
  • @databases/mysql-test
  • @databases/mysql-typed
  • @databases/pg
  • @databases/pg-bulk
  • @databases/pg-cluster
  • @databases/pg-config
  • @databases/pg-connection-string
  • @databases/pg-create
  • @databases/pg-data-type-id
  • @databases/pg-errors
  • @databases/pg-migrations
  • @databases/pg-schema-cli
  • @databases/pg-schema-introspect
  • @databases/pg-schema-print-types
  • @databases/pg-test
  • @databases/pg-typed
  • @databases/push-to-async-iterable
  • @databases/queue
  • @databases/shared
  • @databases/shared-print-types
  • @databases/split-sql-query
  • @databases/sql
  • @databases/sqlite
  • @databases/validate-unicode
  • @databases/website
  • @databases/websql
  • @databases/websql-core
  • @databases/with-container


@mcollina
Contributor Author

mcollina commented Apr 6, 2023

Docs are missing, will add asap

@mcollina mcollina marked this pull request as ready for review April 6, 2023 16:24
@mcollina mcollina changed the title Add @sqlite-pool module Add @atdatabases/sqlite-pool module Apr 6, 2023
async query(query: SQLQuery): Promise<any[]> {
  const poolConnection = await this.#pool.getConnection();
  try {
    const res = poolConnection.connection.query(query);
Owner
Since this step, which runs the actual query, is synchronous, you cannot possibly get any parallel execution of queries with this approach.

To test this, you can run one very slow query (e.g. joining two very large tables) in parallel with one very fast query. If the slow query starts first, it will block the thread until it is finished, at which point the fast query will finally run.

To get actual parallel query execution using the synchronous SQLite client, you'll need to use worker threads (https://nodejs.org/api/worker_threads.html). You'd create one thread per connection and then pass queries to the threads to actually execute.
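
A sketch (not code from this PR) of the shape that worker-per-connection approach could take, using Node's worker_threads and better-sqlite3 in a single self-contained file; names and the single-worker setup are illustrative:

import { Worker, isMainThread, parentPort, workerData } from 'worker_threads';
import Database from 'better-sqlite3';

if (isMainThread) {
  // Main thread: spawn one worker per connection (one shown here) and pass
  // queries to it. The synchronous better-sqlite3 calls then block the
  // worker's thread, not the main thread.
  const worker = new Worker(__filename, { workerData: { filename: '/tmp/example.db' } });

  function query(text: string, values: unknown[]): Promise<unknown[]> {
    return new Promise((resolve, reject) => {
      worker.once('message', (msg) =>
        msg.ok ? resolve(msg.rows) : reject(new Error(msg.error)),
      );
      worker.postMessage({ text, values });
    });
  }

  query('SELECT 1 + 1 AS two', []).then((rows) => {
    console.log(rows); // [ { two: 2 } ]
    return worker.terminate();
  });
} else {
  // Worker thread: owns its own synchronous connection for its lifetime.
  const db = new Database(workerData.filename);
  parentPort!.on('message', ({ text, values }) => {
    try {
      parentPort!.postMessage({ ok: true, rows: db.prepare(text).all(...values) });
    } catch (ex) {
      parentPort!.postMessage({ ok: false, error: String(ex) });
    }
  });
}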

Contributor Author

Using worker_threads would unfortunately be slower for the vast majority of cases, as the marshalling would wreck us. That's not the goal of what I'm trying to achieve here anyway.

I have two goals:

  1. achieve the same interface as the pg/mysql/sqlite modules without needing a mutex
  2. perform parallel read transactions.

Owner

@ForbesLindesay ForbesLindesay Apr 10, 2023

You cannot have:

  1. Synchronous underlying database client/connection
  2. Single threaded
  3. Parallel queries

You must choose two. As soon as you submit one query via poolConnection.connection.query(query), the JavaScript thread that submitted the query is blocked until it receives a response, meaning that it cannot submit a second query to run in parallel, even if it has a separate connection to the database to run that parallel query on.

To demonstrate this, please create a file called index.js with the following contents:

function createFakeConnection() {
  return {
    query: async () => {
      // Busy-wait for one second: this blocks the thread exactly the way a
      // synchronous native call would.
      const end = Date.now() + 1_000;
      while (Date.now() < end);
    },
  };
}

async function run() {
  const connectionCount = parseInt(process.argv[2], 10);

  console.log(`Running ${connectionCount} parallel queries...`);

  const connections = [];
  for (let i = 0; i < connectionCount; i++) {
    connections.push(createFakeConnection());
  }

  const start = Date.now();
  await Promise.all(connections.map((connection) => connection.query()));
  const end = Date.now();

  console.log(`Took ${end - start}ms`);
}

run().catch((ex) => {
  console.error(ex);
  process.exit(1);
});

Now try running the following:

  • node index.js 1 - will take 1000ms
  • node index.js 2 - will take 2000ms
  • node index.js 3 - will take 3000ms

The async keyword does not create a thread, it just says "this function should return a promise and it may use the await keyword to yield to promises during execution". If the underlying resource is still synchronous, it will still block the entire thread. I've made the underlying resource a JavaScript while loop here to try and make this more obvious, but it makes no difference if it's in JavaScript or a native C++ module.

Getting back to the three options that you can only pick 2 from:

  1. Synchronous underlying database client/connection
  2. Single threaded
  3. Parallel queries

We currently have:

  • @databases/sqlite-sync - chooses 1 and 2, but not 3 (i.e. no parallel queries)
  • @databases/sqlite - chooses 2 and 3, but not 1 (i.e. parallel read queries, but the underlying database client is async using threads within C++)

This proposal is still 1 and 2 but not 3, it's just hidden behind additional layers of complexity.

I agree that doing 1 and 3 may not be faster for a lot of use cases; it'll only help if the queries are slow but the amount of data is small, so marshalling is fast. The only alternative, though, is to continue to use sqlite3, which is actually async, or to build something new at the native level that combines the async support of sqlite3 with the better support for BIGINT etc. from better-sqlite3.

better-sqlite3 does propose and document a somewhat decent example of the multi-threaded approach here: https://github.com/WiseLibs/better-sqlite3/blob/HEAD/docs/threads.md. You'd need to do some benchmarking to see how much slower this makes things.

Owner

P.S. "achieve the same interface of the pg/mysql/sqlite modules" is fairly trivial, it just doesn't achieve the second goal of parallel queries.

Owner

P.S. You can somewhat minimise the impact of marshalling by handling the SQLQuery formatting steps in the primary thread, so that all that's passed through to the child threads is a string representation of the query and an array of values.
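
A sketch of that idea using the FormatConfig interface from @databases/sql (the placeholder style shown here is an assumption about what an SQLite formatter would look like; the exact formatting API may differ):

import sql, { FormatConfig, SQLQuery } from '@databases/sql';

// Assumed SQLite-style formatting: '?' placeholders, identifiers quoted
// with double quotes.
const sqliteFormat: FormatConfig = {
  escapeIdentifier: (name) => `"${name.replace(/"/g, '""')}"`,
  formatValue: (value) => ({ placeholder: '?', value }),
};

// Runs on the primary thread, so only plain, structured-clone-friendly data
// crosses the thread boundary.
function toWireFormat(query: SQLQuery): { text: string; values: unknown[] } {
  return query.format(sqliteFormat);
}

const q = sql`SELECT * FROM users WHERE id = ${42}`;
console.log(toWireFormat(q)); // { text: 'SELECT * FROM users WHERE id = ?', values: [ 42 ] }
// worker.postMessage(toWireFormat(q)) would then hand it to a worker thread.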

Contributor Author

I'm aware of everything you're saying, except that it does actually work for the tx method. See https://github.com/ForbesLindesay/atdatabases/pull/289/files#r1161826962 for more details.

await db.dispose();
});

test('two parallel queries', async () => {
Contributor Author

@ForbesLindesay this is the use case I’m talking about. Note that the two transactions are running in parallel.

Note that the db access is still synchronous, but if other asynchronous actions (e.g. an HTTP request) happen within the transaction, things can progress in parallel, as the sketch below shows.
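
For illustration (not code from this PR), the pattern being described would look roughly like this, assuming the pool exposes tx and query like the other @databases clients and re-exports sql (the entry point and signatures are hypothetical):

import createConnectionPool, { sql } from '@databases/sqlite-pool'; // hypothetical entry point

async function main() {
  const db = createConnectionPool('/tmp/example.db');

  // Two transactions started "in parallel". Each individual query is still
  // synchronous under the hood, but while one transaction awaits an external
  // HTTP call, the other can make progress.
  await Promise.all([
    db.tx(async (tx) => {
      const [user] = await tx.query(sql`SELECT * FROM users WHERE id = ${1}`);
      const res = await fetch(`https://example.com/enrich/${user.id}`); // external async work
      await tx.query(sql`UPDATE users SET extra = ${await res.text()} WHERE id = ${1}`);
    }),
    db.tx(async (tx) => {
      await tx.query(sql`SELECT COUNT(*) AS n FROM users`);
    }),
  ]);

  await db.dispose();
}

main().catch((ex) => {
  console.error(ex);
  process.exit(1);
});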

Owner

I don't want to merge something that only adds this pseudo-asynchrony, and does not add true parallel queries:

  • Doing other async work inside a transaction is normally a very bad idea.
    • In SQLite it means that no other database transactions or queries that write data can run until that async work completes. You probably don't want to block all database writes while waiting for some external HTTP service.
    • Even in other database engines like Postgres, open transactions are an expensive/limited resource. Even very large database servers might be limited to a few hundred open connections/transactions before they stall.
    • A transaction cannot be truly atomic if it needs to call external services other than the database. If a later step in the transaction fails, there is no way to "roll back" the external API call. This can often lead to inconsistent states. It's therefore normally better to make a decision to either do the db transaction, then call the external service, or call the external service then do the db transaction. This limits your possible failure modes - you only need to consider what happens if the server crashes between those two steps.
  • I'm concerned people would assume this is more parallel than it is. For example, people would assume that await db.query(...) would not block the main JavaScript thread, but that assumption is broken here. That's a big part of why I preferred a clearly synchronous API for sqlite-sync over this pseudo-async API in the first place.

This test also does not demonstrate that you can even run a read only query in parallel while there is an active open transaction. I do think that may be possible with this code, but I don't think this test actually demonstrates that functionality. Each of these transactions is simply run sequentially, one after another.

If you do still want to do this (and again, I don't personally think it's a good idea), you only actually need two connections: primary (for transactions and queries that write data) and secondary (for queries that only read data). You don't need any of the connection pooling logic because you are not able to run more than one "secondary" query at a time. I'm not sure about including this in the core @databases project though, due to the reasons above.
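
A minimal sketch of that two-connection layout (illustrative names; built directly on better-sqlite3 rather than on anything in this PR):

import Database from 'better-sqlite3';

const primary = new Database('/tmp/example.db'); // transactions + writes
primary.pragma('journal_mode = WAL');
const secondary = new Database('/tmp/example.db', { readonly: true }); // read-only queries

// Simple promise-chain lock so work on the primary connection never
// interleaves; fn is expected to close over `primary`.
let primaryLock: Promise<void> = Promise.resolve();
function withPrimary<T>(fn: () => Promise<T>): Promise<T> {
  const result = primaryLock.then(fn);
  primaryLock = result.then(
    () => undefined,
    () => undefined,
  );
  return result;
}

// Read-only queries bypass the lock entirely: in WAL mode they run against a
// consistent snapshot even while the primary has an open write transaction.
function readAll(text: string, values: unknown[] = []): unknown[] {
  return secondary.prepare(text).all(...values);
}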

We could consider something like @databases/sqlite-sync-compat to be a package that provides an API that looks like the API for @databases/pg/@databases/mysql/@databases/sqlite but is not actually async and still just uses a single connection. You could remove all the connection pooling stuff and simply take a lock on the entire connection for the duration of any transactions. I'm still not sure this is a good idea/something that should be part of the @databases core project.

An important non-goal of @databases is being able to reuse code between different database engines. I aim to build best-in-class libraries for connecting to and querying each of the supported databases. I want developers who have used @databases to query one database engine to feel instantly familiar when querying another database engine, without a significant learning curve to understand different APIs. That's why each library has such a similar structure.

I do not want to hide genuine differences, though. Many ORM layers end up limited to the feature set that's common to all database engines, or they have very inefficient implementations of some functionality on some engines while being very fast on others. @databases libraries are specific to each database engine, meaning you get the best that that specific engine has to offer. In the case of SQLite via better-sqlite3, that means a synchronous API, because the underlying database is not asynchronous. In the case of Postgres and MySQL, that means connection pooling, because the underlying database can support many concurrent connections/queries.

My understanding of the platformatic project is that it's a higher-level tool, so it may make sense for you to hide these differences between engines at that level, but I'm not sure @databases should do that.

@mcollina mcollina closed this Jun 16, 2023
@mcollina
Contributor Author

I'm going to release this as a different module. Thanks for your feedback!

@mcollina mcollina deleted the sqlite-pool branch June 16, 2023 08:10