
For discussion: Using a lib.sql to factor out many of the large hasql queries #1511

Closed

Conversation

@monacoremo (Member) commented Apr 30, 2020

This is not actually meant for merging, but I wanted to discuss with you whether this could be a viable approach.

The large queries, e.g. in DbStructure.hs, are partly so convoluted because we cannot define our own functions. PostgreSQL has a sparsely documented feature, pg_temp: a temporary per-session schema in which tables, views and also functions can be defined. pg_temp is an alias for a per-session schema named pg_temp_nnn, which is automatically created on first use. It's implemented in namespace.c and referenced in a few places in the Postgres docs.
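
As a minimal illustration (the function name here is just an example), in any session:

-- pg_temp resolves to this session's temporary schema, created on first use:
create function pg_temp.add_one(i integer) returns integer
language sql as $$ select i + 1 $$;

-- temporary functions have to be called schema-qualified:
select pg_temp.add_one(41); -- 42, visible only in this session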

The idea is to have a PostgREST lib.sql that is loaded in each new session that defines functions (and maybe views?) that are useful for the DbStructure queries and maybe also requests. This PR contains a small proof of concept where a part of the allColumns query is factored out into a function.

Benefits of the approach:

  • Quick iteration - the SQL file can be checked with test/with_tmp_db psql -f src/PostgREST/lib.sql and it's easy to try out or explain analyze individual functions.
  • Unit testing of queries/functions - we could set up a separate libtests.sql that tests all the functions using e.g. pgTAP (see the sketch after this list).
  • Maintainability - the huge queries, e.g. in DbStructure, can be factored into smaller functions.
  • Potentially better performance - the functions, and possibly also prepared statements and views, could be parsed once by PostgreSQL and reused for subsequent requests. On the other hand, we need to load lib.sql for each SQL session (although it could perhaps be split into separate libs for dbstructure and requests).
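
A libtests.sql could look roughly like this (just a sketch - it assumes the pg_temp.postgrest_tables view from the proof of concept below and a database where pgTAP is available):

begin;
create extension if not exists pgtap;
select plan(1);

\i src/PostgREST/lib.sql

select isnt_empty(
  'select * from pg_temp.postgrest_tables',
  'postgrest_tables should see at least one table'
);

select * from finish();
rollback;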

Issues that I think should be possible to fix:

  • I'm not sure where the best point to hook the lib.sql in would be - as a quick hack I plugged it into getDbStructure and made the transactions writable (it's a bit scary that this works...). There should be a better way to do this! Is there a way to hook into the creation of new connections with hasql-pool?
  • Cabal currently does not recompile when the SQL file is modified - maybe there is a cleaner way than the embed-file approach I picked?

Possibly more fundamental issues:

  • Will we have write access to pg_temp in all use cases?
    • default_transaction_read_only could be set to on, but begin read write; or set transaction read write; seems to override that in all cases
    • temporary is a privilege granted by default on databases. Users could have executed revoke temporary on database [...] from [role];. This could be fixed with grant temporary ... (see the sketch after this list). Would the case for increased maintainability and performance be enough to risk blocking the (presumably few) users who revoked this?
    • Would not work in recovery or with parallel workers, which should not be an issue
    • ...other ways that users might have blocked writing to pg_temp?
  • ... possibly others, to be discussed
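
To sketch the privilege issue from the list above (database and role names are placeholders):

-- a locked-down setup might have run:
revoke temporary on database app_db from public;

-- pg_temp is then unusable for roles without a direct grant; the fix would be:
grant temporary on database app_db to authenticator;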

I tested this with postgrest-test-spec-all; the proof of concept works on all PostgreSQL versions since 9.4.

@monacoremo marked this pull request as draft on April 30, 2020 08:48
@monacoremo force-pushed the monacoremo-sqllib branch 2 times, most recently from f320b2a to 5133f8d on April 30, 2020 11:05
@monacoremo (Member, Author) commented Apr 30, 2020

Example for refactoring allTables and accessibleTables here: https://github.com/monacoremo/postgrest/tree/monacoremo-sqllib-tables

This could also be great for debugging and diagnosing user issues, e.g. you can run in psql:

\i src/PostgREST/lib.sql

select * from pg_temp.postgrest_tables;

and see exactly what PostgREST "sees".

@@ -0,0 +1,18 @@
-- Make sure that we can create temporary functions if we are in a transaction.
set transaction read write;
Member

Oh... your proposal seems like a really nice idea for maintainability, and also for more ergonomic development.

But this one seems like a road block (not sure if there's a workaround). A big use case for PostgREST is being able to run on read-only replicas. This transaction level would reject the create function:

BEGIN ISOLATION LEVEL READ COMMITTED READ ONLY;
create or replace function pg_temp.postgrest_numeric_precision(a pg_attribute, t pg_type)
returns integer
language sql
as $$
  select
    information_schema._pg_char_max_length(
      information_schema._pg_truetypid(a.*, t.*),
      information_schema._pg_truetypmod(a.*, t.*)
    )
$$;
ERROR:  25006: cannot execute CREATE FUNCTION in a read-only transaction
LOCATION:  PreventCommandIfReadOnly, utility.c:246

Member

Oh, I'm sorry. You mentioned:

default_transaction_read_only could be set to on, but begin read write; or set transaction read write; seems to override that in all cases

Would set transaction read write work in read replicas? We would have to test that.
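
For reference, a minimal check on a hot standby would be (a sketch; if I remember the Postgres behavior correctly, the access mode cannot be changed during recovery):

begin;
set transaction read write;
-- ERROR:  cannot set transaction read-write mode during recovery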

@steve-chavez (Member)

Looks really nice for maintainability, but the unknown unknowns scare me. We could invalidate use cases. Not sure if read replicas would work.

We'd have to look into this with much care. If merged, a new major version would have to be released.

Also, I'd like to mention that I think it's possible to have one single query for the whole schema cache. This query was shared a long time ago: https://gist.github.com/steve-chavez/eae6a67ec81b195c133bcb9ff0c917fb. That could alleviate the need for so many separate SQL queries and replace some of the Haskell code. It would likely be faster as well.

There was also some discussion about making that a materialized view, which could speed up start-ups - though our query is fast for now.

I mention it because it might be another avenue for better maintainability and performance.

@monacoremo (Member, Author)

@steve-chavez You're right, for read-only replicas this would likely lead to problems. Too bad, but thank you for looking at this so quickly! Technically it would likely also work on those, as it's not so different from prepared statements, which also work and create server-side objects. But it still might not work by default, which would create too many issues.

Having everything in one query is definitely doable! I'm not sure that one huge query would be more maintainable, but there is definitely a lot of data wrangling going on on the Haskell side that would probably be easier to express in SQL. That's probably the best direction - maybe splitting up the tables/columns/pks side and the procs side.

Materialized views would also require write access; I think they might be held up by the same issues as the lib approach, no?

@monacoremo closed this Apr 30, 2020
@steve-chavez (Member)

@monacoremo It's a bummer that pg_temp needed write capabilities. It was a really interesting idea. Also, I didn't know about pg_temp until now.

Regarding maintainability, I think the best we can do for now is to use CTEs with meaningful names and comments. The idea of splitting them into files is certainly feasible (without using pg_temp), but I think we could try to join them and reduce some of the Haskell code first (if possible).

Materialized views would also require write access; I think they might be held up by the same issues as the lib approach, no?

Yes, it was mostly an idea: let the user specify a materialized view in a config option, and somehow provide them with our schema cache query so they can create the materialized view manually.

But the schema cache query has gotten really fast over the years, so we no longer have a strong need for that.

Making it faster would still be great, though - it would make #1512 more feasible.

@monacoremo (Member, Author)

@steve-chavez just played around with that - it's not even that difficult to get everything as JSON from one query: https://gist.github.com/monacoremo/baeff41fd7fde418d6684cfe4988c7d0 (pretty much a 1:1 translation of DbStructure.hs; I started off with sed 's/^/-- /g' < src/PostgREST/DbStructure.hs > dbstructure.sql).
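
The overall shape of the query is roughly this (a heavily simplified sketch, not the actual gist):

with
tables as (
  select n.nspname as table_schema, c.relname as table_name
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
  where c.relkind in ('r', 'v', 'm')
)
-- ...more CTEs for columns, primary keys, relationships, procs...
select json_build_object(
  'dbTables', (select coalesce(json_agg(t), '[]') from tables t)
) as dbstructure;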

Now if we could use Aeson to automatically derive a decoder for that output based on the type definitions... Should be a fun challenge :-)

@steve-chavez (Member)

@monacoremo Nice! You made that easy!

Just in case you didn't notice: in https://gist.github.com/monacoremo/baeff41fd7fde418d6684cfe4988c7d0#file-dbstructure-sql-L655-L656 there's a string that's conditionally passed in from Haskell code. You could implement that logic in SQL instead:

-- change below
with
pg_version as (
  SELECT
    current_setting('server_version_num')::integer as server_version_num,
    current_setting('server_version') as server_version
),
views as (
  select
    n.nspname   as view_schema,
    c.relname   as view_name,
    r.ev_action as view_definition
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
  join pg_rewrite r on r.ev_class = c.oid
  where (c.relkind in ('v', 'm')) and n.nspname = 'test'
),
removed_subselects as (
  select
    view_schema, view_name,
    regexp_replace(view_definition,
     -- this would be the change
      case when (select server_version_num from pg_version) < 100000
        then ':subselect {.*?:constraintDeps <>} :location \d+} :res(no|ult)'
        else ':subselect {.*?:stmt_len 0} :location \d+} :res(no|ult)'
      end,
     --
      '', 'g') as x
  from views
),
...

@monacoremo (Member, Author)

@steve-chavez working proof of concept here: https://github.com/monacoremo/postgrest/tree/monacoremo-megaquery?files=1

It's a little bit messy, but it's just a quick experiment and it works! I have yet to plug in your SQL solution for the conditional string, which is much nicer!

This could be very promising - getting everything in one query could be faster, but also easier to introspect, debug and improve further. And Aeson could replace all the manual decoders in a robust way (keys instead of order-dependent tuples).

@monacoremo deleted the monacoremo-sqllib branch May 18, 2020 08:05
@wolfgangwalther (Member)

@steve-chavez
Not sure whether this has been discussed before, but are there any good arguments against splitting out part of the PostgREST code into a Postgres extension that is installed into the database?

This could potentially ease development and increase maintainability, and maybe also extensibility - by allowing users to override/replace some of those PostgREST SQL functions. All the schema cache stuff could go in there, but potentially also (parts of) the main queries.

This extension could even be auto-installed by PostgREST, although a manual CREATE EXTENSION shouldn't be too hard for users who will write all their code in postgres anyways...

If those functions are pre-defined in the database, it shouldn't be a problem for the read-replica case etc.

Also, conceptually it makes sense to have parts of the code in Postgres itself - because that's exactly what PostgREST is all about ;)

I'm sure there is some blocker that I am missing?

@steve-chavez (Member)

@wolfgangwalther I think a pg extension would be a good idea! I've also talked about that with Remo.

All the schema cache stuff could go in there, but potentially also (parts of) the main queries.

Yes, I believe the schema cache should be the first step. We want to convert all of the DbStructure logic into pure SQL in #1513. I believe it's possible, but we need to put more work into it. If we can't make it a single query, perhaps we can export a set of views from the extension.
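
For example, the extension's install script could export views along these lines (hypothetical names, not an agreed design):

-- e.g. in the extension script pgrst--1.0.sql:
create view pgrst_tables as
select
  n.nspname as table_schema,
  c.relname as table_name,
  c.relkind in ('v', 'm') as is_view
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind in ('r', 'v', 'm');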

Making the schema cache an extension would also have the immediate benefit of letting us produce the OpenAPI output in SQL (which would help with many issues, like #1082 (comment)).

This extension could even be auto-installed by PostgREST, although a manual CREATE EXTENSION shouldn't be too hard for users who will write all their code in postgres anyways... If those functions are pre-defined in the database, it shouldn't be a problem for the read-replica case etc.

You're right about the read replica. I think once we have that extension, PostgREST could detect it and call the views (maybe functions later) from there; otherwise, keep working as usual. Advanced users who want to override some behavior would install the extension manually (so as not to be invasive and require high privileges from the authenticator role).

@LorenzHenk

What I really like about PostgREST is that it's very simple to use and non-invasive concerning the database. There's no special configuration needed on the database side for simple features.

PostgREST is stateless in the sense that it does not have to build up any database state.

As far as I understand, there is no technical need (e.g. wanting to save data) to create the views and functions in the database. It's for performance, simplification and customization.

Here are my thoughts on the custom extension idea:

Pro:

  • simpler queries on PostgREST side
  • simpler customization & overriding of functionality

Contra:

  • (unnecessary) state in the database
  • harder to set up
    • additional permissions on the database required
      • installing a custom extension is not possible for PostgreSQL running in the cloud
  • harder to upgrade
    • functions may be overwritten by an advanced user

This looks like a bad trade.

@steve-chavez
Copy link
Member

Fair points @LorenzHenk.

I was mostly interested in the pg extension because it could provide a way for users to refine their OpenAPI in SQL.

But now I'm thinking we should try to do this at the Haskell level.

@wolfgangwalther
Copy link
Member

But now I'm thinking we should try to do this at the Haskell level.

I have not completely given up on this idea yet, because IMHO the upside is a lot bigger than just the OpenAPI stuff. Going through various ideas on how to implement some features or fix bugs, I'm quite often at a point where I would just need a small PL/pgSQL function defined to make things possible at all.

Focusing on two of your contras, @LorenzHenk:

  • harder to set up
    • additional permissions on the database required
      • installing a custom extension is not possible for PostgreSQL running in the cloud

Yes, I came to the same conclusion regarding a real "extension". That won't work and is too complicated. If we just designated a pgrst schema (changeable via a config option) where PostgREST could create some functions, views, etc., that would already be enough, though.
Additional permissions could be as simple as CREATE privileges on that one schema, maybe even for a special pgrst admin user.
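
The setup could then be as small as this (names assumed):

create schema pgrst;
grant usage, create on schema pgrst to pgrst_admin;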

However, I agree that we need a good fallback for the case where those privileges are not there. It would be good to still be able to just put PostgREST in front of an existing database and have it work somehow (although I don't really see that happening without any intervention at all - you still need to create those authenticator and anonymous users, right?).

I'm not sure about pg_temp, which was mentioned upthread, but it could help. Maybe we could make it the default schema if no other is set via config. I'm not sure how much overhead it would be to create those pgrst objects once per database connection (connections are kept open for more than one transaction, right?). Also, read replicas (mentioned above already) and connection pooling might be challenges to solve here.

  • harder to upgrade
    • functions may be overwritten by an advanced user

We could do the following:

  • add all objects in the pgrst schema with another pgrst_ prefix
  • have the pgrst schema in the search path automatically
  • call all those objects unqualified, so without the schema, from PostgREST

This would allow users to create objects with the same names in a different schema - one placed before pgrst in the search path. The original functions would not be overwritten, but just... hidden, so updating should be no problem. PostgREST would still call the custom overrides, because they appear earlier in the search path.
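
A minimal sketch of that shadowing mechanism (all names are illustrative):

-- objects installed by PostgREST:
create schema pgrst;
create function pgrst.pgrst_format(body json) returns json
language sql as $$ select body $$;

-- a user override in a schema that comes first in the search path:
create schema my_overrides;
create function my_overrides.pgrst_format(body json) returns json
language sql as $$ select json_build_object('wrapped', body) $$;

set search_path to my_overrides, pgrst, public;
select pgrst_format('{}'); -- resolves to my_overrides.pgrst_format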
