
Add @atdatabases/sqlite-pool module #289

Closed
wants to merge 9 commits

Conversation

mcollina
Contributor

@mcollina mcollina commented Apr 6, 2023

This module supports the same interface as @databases/sqlite, but it is based on @databases/sqlite-sync. The added benefit of this approach is that it supports multiple high-level transactions on top of the same SQLite database on disk.

SQLite supports multiple simultaneous read transactions coming from separate database connections, possibly in separate threads or processes, but only one simultaneous write transaction (see the SQLite documentation on isolation).
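
A minimal sketch (not part of this PR) of the behaviour that claim relies on, using better-sqlite3 (the client underlying @databases/sqlite-sync) directly, with two independent connections to the same file in WAL mode; the file path is illustrative:

import Database from 'better-sqlite3';

const writer = new Database('/tmp/example.db');
writer.pragma('journal_mode = WAL'); // WAL lets readers run while a writer is active
writer.exec('CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)');

const reader = new Database('/tmp/example.db', { readonly: true });

writer.exec('BEGIN');
writer.prepare('INSERT INTO users (name) VALUES (?)').run('ada');

// The read connection sees a consistent snapshot and is not blocked by the
// open write transaction; it will not see the uncommitted 'ada' row.
console.log(reader.prepare('SELECT * FROM users').all());

writer.exec('COMMIT');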

Signed-off-by: Matteo Collina <hello@matteocollina.com>
@rollingversions

rollingversions bot commented Apr 6, 2023

@databases/sqlite-pool (unreleased → 1.0.0)

Breaking Changes

  • Initial implementation

@databases/sqlite-sync (1.0.0 → 1.1.0)

New Features

  • Expose the DatabaseOptions type

Packages With No Changes

The following packages have no user facing changes, so won't be released:

  • @databases/bigquery
  • @databases/cache
  • @databases/connection-pool
  • @databases/escape-identifier
  • @databases/expo
  • @databases/lock
  • @databases/migrations-base
  • @databases/mock-db
  • @databases/mock-db-typed
  • @databases/mysql
  • @databases/mysql-config
  • @databases/mysql-schema-cli
  • @databases/mysql-schema-introspect
  • @databases/mysql-schema-print-types
  • @databases/mysql-test
  • @databases/mysql-typed
  • @databases/pg
  • @databases/pg-bulk
  • @databases/pg-cluster
  • @databases/pg-config
  • @databases/pg-connection-string
  • @databases/pg-create
  • @databases/pg-data-type-id
  • @databases/pg-errors
  • @databases/pg-migrations
  • @databases/pg-schema-cli
  • @databases/pg-schema-introspect
  • @databases/pg-schema-print-types
  • @databases/pg-test
  • @databases/pg-typed
  • @databases/push-to-async-iterable
  • @databases/queue
  • @databases/shared
  • @databases/shared-print-types
  • @databases/split-sql-query
  • @databases/sql
  • @databases/sqlite
  • @databases/validate-unicode
  • @databases/website
  • @databases/websql
  • @databases/websql-core
  • @databases/with-container


@mcollina
Contributor Author

mcollina commented Apr 6, 2023

Docs are missing, will add asap

@mcollina mcollina marked this pull request as ready for review April 6, 2023 16:24
@mcollina mcollina changed the title Add @sqlite-pool module Add @atdatabases/sqlite-pool module Apr 6, 2023
async query(query: SQLQuery): Promise<any[]> {
  const poolConnection = await this.#pool.getConnection();
  try {
    const res = poolConnection.connection.query(query);
Owner
Since this step, which runs the actual query, is synchronous, you cannot possibly get any parallel execution of queries with this approach.

To test this, you can run one very slow query (e.g. joining two very large tables) in parallel with one very fast query. If the slow query starts first, it will block the thread until it is finished, at which point the fast query will finally run.

To get actual parallel query execution using the synchronous SQLite client, you'll need to use worker threads (https://nodejs.org/api/worker_threads.html). You'd create one thread per connection and then pass queries to the threads to actually execute.
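
A sketch (not code from this PR) of the shape that worker-per-connection approach could take, using Node's worker_threads and better-sqlite3 in a single self-contained file; names and the single-worker setup are illustrative:

import { Worker, isMainThread, parentPort, workerData } from 'worker_threads';
import Database from 'better-sqlite3';

if (isMainThread) {
  // Main thread: spawn one worker per connection (one shown here) and pass
  // queries to it. The synchronous better-sqlite3 calls then block the
  // worker's thread, not the main thread.
  const worker = new Worker(__filename, { workerData: { filename: '/tmp/example.db' } });

  function query(text: string, values: unknown[]): Promise<unknown[]> {
    return new Promise((resolve, reject) => {
      worker.once('message', (msg) =>
        msg.ok ? resolve(msg.rows) : reject(new Error(msg.error)),
      );
      worker.postMessage({ text, values });
    });
  }

  query('SELECT 1 + 1 AS two', []).then((rows) => {
    console.log(rows); // [ { two: 2 } ]
    return worker.terminate();
  });
} else {
  // Worker thread: owns its own synchronous connection for its lifetime.
  const db = new Database(workerData.filename);
  parentPort!.on('message', ({ text, values }) => {
    try {
      parentPort!.postMessage({ ok: true, rows: db.prepare(text).all(...values) });
    } catch (ex) {
      parentPort!.postMessage({ ok: false, error: String(ex) });
    }
  });
}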

Contributor Author

Using worker_threads would unfortunately be slower for the vast majority of cases, as the marshalling would wreck us. That's not the goal of what I'm trying to achieve here anyway.

I have two goals:

  1. achieve the same interface as the pg/mysql/sqlite modules without needing a mutex
  2. perform parallel read transactions.

Owner

@ForbesLindesay ForbesLindesay Apr 10, 2023

You cannot have:

  1. Synchronous underlying database client/connection
  2. Single threaded
  3. Parallel queries

You must choose two. As soon as you submit one query via poolConnection.connection.query(query), the JavaScript thread that submitted the query is blocked until it receives a response, meaning that it cannot submit a second query to run in parallel, even if it has a separate connection to the database to run that parallel query on.

To demonstrate this, please create a file called index.js with the following contents:

function createFakeConnection() {
  return {
    query: async () => {
      // Busy-wait for one second: this blocks the thread exactly the way a
      // synchronous native call would.
      const end = Date.now() + 1_000;
      while (Date.now() < end);
    },
  };
}

async function run() {
  const connectionCount = parseInt(process.argv[2], 10);

  console.log(`Running ${connectionCount} parallel queries...`);

  const connections = [];
  for (let i = 0; i < connectionCount; i++) {
    connections.push(createFakeConnection());
  }

  const start = Date.now();
  await Promise.all(connections.map((connection) => connection.query()));
  const end = Date.now();

  console.log(`Took ${end - start}ms`);
}

run().catch((ex) => {
  console.error(ex);
  process.exit(1);
});

Now try running the following:

  • node index.js 1 - will take 1000ms
  • node index.js 2 - will take 2000ms
  • node index.js 3 - will take 3000ms

The async keyword does not create a thread, it just says "this function should return a promise and it may use the await keyword to yield to promises during execution". If the underlying resource is still synchronous, it will still block the entire thread. I've made the underlying resource a JavaScript while loop here to try and make this more obvious, but it makes no difference if it's in JavaScript or a native C++ module.

Getting back to the three options that you can only pick 2 from:

  1. Synchronous underlying database client/connection
  2. Single threaded
  3. Parallel queries

We currently have:

  • @databases/sqlite-sync - chooses 1 and 2, but not 3 (i.e. no parallel queries)
  • @databases/sqlite - chooses 2 and 3, but not 1 (i.e. parallel read queries, but the underlying database client is async using threads within C++)

This proposal is still 1 and 2 but not 3, it's just hidden behind additional layers of complexity.

I agree that doing 1 and 3 may not be faster for a lot of use cases; it'll only help if the queries are slow but the amount of data is small, so marshalling is fast. The only alternative, though, is to continue to use sqlite3, which is actually async, or to build something new at the native level that combines the async support of sqlite3 with the better support for BIGINT etc. from better-sqlite3.

better-sqlite3 does propose and document a somewhat decent example of the multi-threaded approach here: https://github.com/WiseLibs/better-sqlite3/blob/HEAD/docs/threads.md. You'd need to do some benchmarking to see how much slower this makes things.

Owner

P.S. "achieve the same interface of the pg/mysql/sqlite modules" is fairly trivial, it just doesn't achieve the second goal of parallel queries.

Owner

P.S. You can somewhat minimise the impact of marshalling by handling the SQLQuery formatting steps in the primary thread, so that all that's passed through to the child threads is a string representation of the query and an array of values.
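
A sketch of that idea using the FormatConfig interface from @databases/sql (the placeholder style shown here is an assumption about what an SQLite formatter would look like; the exact formatting API may differ):

import sql, { FormatConfig, SQLQuery } from '@databases/sql';

// Assumed SQLite-style formatting: '?' placeholders, identifiers quoted
// with double quotes.
const sqliteFormat: FormatConfig = {
  escapeIdentifier: (name) => `"${name.replace(/"/g, '""')}"`,
  formatValue: (value) => ({ placeholder: '?', value }),
};

// Runs on the primary thread, so only plain, structured-clone-friendly data
// crosses the thread boundary.
function toWireFormat(query: SQLQuery): { text: string; values: unknown[] } {
  return query.format(sqliteFormat);
}

const q = sql`SELECT * FROM users WHERE id = ${42}`;
console.log(toWireFormat(q)); // { text: 'SELECT * FROM users WHERE id = ?', values: [ 42 ] }
// worker.postMessage(toWireFormat(q)) would then hand it to a worker thread.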

Contributor Author

I'm aware of everything you're saying, except that it does actually work for the tx method. See https://github.com/ForbesLindesay/atdatabases/pull/289/files#r1161826962 for more details.

await db.dispose();
});

test('two parallel queries', async () => {
Contributor Author

@ForbesLindesay this is the use case I’m talking about. Note that the two transactions are running in parallel.

Note that the db access is still synchronous, but if other asynchronous actions (e.g. an HTTP request) happen within the transaction, things can progress in parallel, as the sketch below shows.
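
For illustration (not code from this PR), the pattern being described would look roughly like this, assuming the pool exposes tx and query like the other @databases clients and re-exports sql (the entry point and signatures are hypothetical):

import createConnectionPool, { sql } from '@databases/sqlite-pool'; // hypothetical entry point

async function main() {
  const db = createConnectionPool('/tmp/example.db');

  // Two transactions started "in parallel". Each individual query is still
  // synchronous under the hood, but while one transaction awaits an external
  // HTTP call, the other can make progress.
  await Promise.all([
    db.tx(async (tx) => {
      const [user] = await tx.query(sql`SELECT * FROM users WHERE id = ${1}`);
      const res = await fetch(`https://example.com/enrich/${user.id}`); // external async work
      await tx.query(sql`UPDATE users SET extra = ${await res.text()} WHERE id = ${1}`);
    }),
    db.tx(async (tx) => {
      await tx.query(sql`SELECT COUNT(*) AS n FROM users`);
    }),
  ]);

  await db.dispose();
}

main().catch((ex) => {
  console.error(ex);
  process.exit(1);
});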

Owner

I don't want to merge something that only adds this pseudo-asynchrony, and does not add true parallel queries:

  • Doing other async work inside a transaction is normally a very bad idea.
    • In SQLite it means that no other database transactions or queries that write data can run until that async work completes. You probably don't want to block all database writes while waiting for some external HTTP service.
    • Even in other database engines like Postgres, open transactions are an expensive/limited resource. Even very large database servers might be limited to a few hundred open connections/transactions before they stall.
    • A transaction cannot be truly atomic if it needs to call external services other than the database. If a later step in the transaction fails, there is no way to "roll back" the external API call. This can often lead to inconsistent states. It's therefore normally better to make a decision to either do the db transaction, then call the external service, or call the external service then do the db transaction. This limits your possible failure modes - you only need to consider what happens if the server crashes between those two steps.
  • I'm concerned people would assume this is more parallel than it is. For example, people would assume that await db.query(...) would not block the main JavaScript thread, but that assumption is broken here. That's a big part of why I preferred a clearly synchronous API for sqlite-sync over this pseudo-async API in the first place.

This test also does not demonstrate that you can even run a read only query in parallel while there is an active open transaction. I do think that may be possible with this code, but I don't think this test actually demonstrates that functionality. Each of these transactions is simply run sequentially, one after another.

If you do still want to do this (and again, I don't personally think it's a good idea), you only actually need two connections: primary (for transactions and queries that write data) and secondary (for queries that only read data). You don't need any of the connection pooling logic because you are not able to run more than one "secondary" query at a time. I'm not sure about including this in the core @databases project though, due to the reasons above.
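
A minimal sketch of that two-connection layout (illustrative names; built directly on better-sqlite3 rather than on anything in this PR):

import Database from 'better-sqlite3';

const primary = new Database('/tmp/example.db'); // transactions + writes
primary.pragma('journal_mode = WAL');
const secondary = new Database('/tmp/example.db', { readonly: true }); // read-only queries

// Simple promise-chain lock so work on the primary connection never
// interleaves; fn is expected to close over `primary`.
let primaryLock: Promise<void> = Promise.resolve();
function withPrimary<T>(fn: () => Promise<T>): Promise<T> {
  const result = primaryLock.then(fn);
  primaryLock = result.then(
    () => undefined,
    () => undefined,
  );
  return result;
}

// Read-only queries bypass the lock entirely: in WAL mode they run against a
// consistent snapshot even while the primary has an open write transaction.
function readAll(text: string, values: unknown[] = []): unknown[] {
  return secondary.prepare(text).all(...values);
}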

We could consider something like @databases/sqlite-sync-compat to be a package that provides an API that looks like the API for @databases/pg/@databases/mysql/@databases/sqlite but is not actually async and still just uses a single connection. You could remove all the connection pooling stuff and simply take a lock on the entire connection for the duration of any transactions. I'm still not sure this is a good idea/something that should be part of the @databases core project.

An important non-goal of @databases is being able to reuse code between different database engines. I aim to build best-in-class libraries for connecting to and querying each of the supported databases. I want developers who have used @databases to query one database engine to feel instantly familiar when querying another database engine, without a significant learning curve to understand different APIs. That's why each library has such a similar structure.

I do not want to hide genuine differences, though. Many ORM layers end up limited to the feature set that's common to all database engines, or they have very inefficient implementations of some functionality on some engines while being very fast on others. @databases libraries are specific to each database engine, meaning you get the best that that specific engine has to offer. In the case of SQLite via better-sqlite3, that means a synchronous API, because the underlying database is not asynchronous. In the case of Postgres and MySQL, that means connection pooling, because the underlying database can support many concurrent connections/queries.

My understanding of the platformatic project is that it's a higher-level tool, so it may make sense for you to hide these differences between engines at that level, but I'm not sure @databases should do that.

@mcollina mcollina closed this Jun 16, 2023
@mcollina
Contributor Author

I'm going to release this as a different module. Thanks for your feedback!

@mcollina mcollina deleted the sqlite-pool branch June 16, 2023 08:10