
Add pgrst.accept setting to RPC #1582

Closed
wants to merge 5 commits

Conversation

steve-chavez
Member

@steve-chavez commented Sep 15, 2020

Allows setting a custom media type for the response format (Accept and Content-Type headers) by using the pgrst.accept setting on an RPC.

Basically, it serves as an escape hatch for when the user wants an application/msgpack or an image/png response. Example:

-- with this function
create or replace function ret_image(name text) returns bytea as $$
  select img from images where name = $1;
$$ language sql
set pgrst.accept = 'image/png';

-- you can obtain an `image/png` by doing:
curl -H "Accept: image/png" localhost:3000/rpc/ret_image?name=A
curl -H "Accept: */*" localhost:3000/rpc/ret_image?name=A
curl localhost:3000/rpc/ret_image?name=A

Currently it doesn't handle a wildcard media subtype (like image/*), but it does handle the all-media-types wildcard (*/*), which is good enough for cases like browser <img> requests or right click -> view image requests (check these tests and #1462).
image/* can be handled in a later enhancement.

Related issues:

* move json ops tests to JsonOperatorSpec
* move phfts operator tests to QuerySpec and RpcSpec
* add custom spec for pre-request header guc tests
* Add tests for browsers img and navigation requests
@wolfgangwalther
Member

Love it!

Couldn't find a test-case for it - maybe it's worth adding: What happens when you do curl -H "Accept: different/type" localhost:3000/rpc/ret_image?name=A? And what happens if you use application/json (or any of the "default" types)? I would expect both to throw.

Another thing is wildcard matching: I agree with you that wildcards in the request can be a later addition. However, I think we really need to allow wildcards in pgrst.accept. So if my function is not ret_image but ret_file (for arbitrary file types), we would at least need to allow pgrst.accept='*/*' now. I think it's rather unlikely that people will have a function that returns just e.g. image/png and not some other file types as well? In all the use-cases I had so far, I had to return different mimetypes.

++ rawContentTypes
++ [CTOpenAPI | tpIsRootSpec target]
ActionInvoke _ -> case procAccept of
Just acc -> [CTOther $ toS acc]

This looks like it could easily be extended to allow multiple media types, e.g. like pgrst.accept='image/png,image/jpeg' - would just need to convert this to a list of multiple CTOther?

@steve-chavez
Member Author

@wolfgangwalther Thanks! I've added some test cases for default and unknown types.

we would at least need to allow pgrst.accept='*/*' now.

Oh, that's a new one. On the single mimetype case I've added now, we're in fact accepting */*(the default when no Accept header is sent) and then responding with the particular mimetype.

(I got confused here with the accept keyword, since we also set the type. Maybe we should use pgrst.mime for the single mimetype case)

So with */* we'd be on the multiple mimetype case and we let the user set the mimetype on the function body.

In all the use-cases I had so far, I had to return different mimetypes.

How would the user decide which mimetype to set in the body? For example, if chrome sends text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, should the user parse that in SQL and order the mimetypes according to q parameters?

One thing that is great about being static and specific on the mimetype(single or list) is that it would allow us to document them on our OpenAPI output(check openapi mime types). Wildcards and setting the content type on the function body wouldn't allow us to do that.

@wolfgangwalther
Member

wolfgangwalther commented Sep 15, 2020

we would at least need to allow pgrst.accept='*/*' now.

Oh, that's a new one. On the single mimetype case I've added now, we're in fact accepting */*(the default when no Accept header is sent) and then responding with the particular mimetype.

But that's the other way around from what I suggested, correct? So client requests */* and we have pgrst.accept='single/type' set.

(I got confused here with the accept keyword, since we also set the type. Maybe we should use pgrst.mime for the single mimetype case)

I think we should keep it consistent: The single mime type case is just the simplest case of content negotiation (see below) - it's just a yes/no decision. But since it's still content negotiation it should be accept. After negotiating a specific mime type in postgrest (without wildcard), we can set that mimetype as the response header Content-Type. This could even be possible in some multi mimetype cases - see below for the application/xhtml+xml case (I'm writing this backwards.. :D).

So with */* we'd be on the multiple mimetype case and we let the user set the mimetype on the function body.

Yes. But a simple multiple mimetype case to start with, because for content negotiation you don't even need to make any decision. */* will just accept every request.

In all the use-cases I had so far, I had to return different mimetypes.

How would the user decide which mimetype to set in the body? For example, if chrome sends text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, should the user parse that in SQL and order the mimetypes according to q parameters?

This is the "Content Negotiation" part - this should be done in postgrest, not in the RPC. If we could extend that in the future, so that postgrest somehow tells the RPC (probably via GUC) on which of the possible mimetypes it has acted, that would be great. Example:

  • Request is sent with header Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  • The RPC on the endpoint has pgrst.accept='application/xhtml+xml,*/*' set
  • Postgrest decides that application/xhtml+xml is the best match and sets pgrst.accepted='application/xhtml+xml' (note the difference from accept - it's accepted here)
  • If the function had pgrst.accept='image/png,*/*' instead, postgrest would set pgrst.accepted='*/*' as the best match.
  • pgrst.accepted would always be one of pgrst.accept - so the user writing the RPC would exactly know which values to expect
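To illustrate (just a sketch; pgrst.accepted is the GUC proposed here and doesn't exist yet, and render_xhtml/render_raw are hypothetical helpers standing in for the real formatting logic), an RPC could then branch on the negotiated type:

-- sketch only: pgrst.accepted is the proposed GUC, render_* are hypothetical
CREATE FUNCTION ret_page(name TEXT) RETURNS TEXT
STABLE LANGUAGE SQL
SET pgrst.accept = 'application/xhtml+xml, */*' AS $$
  SELECT CASE current_setting('pgrst.accepted', true)
    WHEN 'application/xhtml+xml' THEN render_xhtml(name)
    ELSE render_raw(name)  -- the '*/*' case
  END
$$;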

The multiple mimetypes I referred to are entirely based on content: I am keeping files in the database. That could be images, pdf, ... whatever the user uploads. So once a specific file is requested I know exactly which mimetype to return (the type of the file). But I need to act on any accept header. I guess right now I could rely on the browser always sending */* as part of the accept header, just set pgrst.accept='application/octet-stream' and then set the proper header in the function body.

One thing that is great about being static and specific on the mimetype(single or list) is that it would allow us to document them on our OpenAPI output(check openapi mime types). Wildcards and setting the content type on the function body wouldn't allow us to do that.

I agree that would be great to put in the OpenAPI output. I don't think we should disallow overriding the content-type on the function body, however. Now if my function body overrides the Content-Type and the endpoint could return anything... I think an OpenAPI output like produces: */* would actually be better than e.g. produces: application/octet-stream (referring to my example above).

This seems to be allowed in OpenAPI 3.x:

[...] For responses that match multiple keys, only the most specific key is applicable. e.g. text/plain overrides text/*

@steve-chavez
Member Author

I think we should keep it consistent: The single mime type case is just the simplest case of content negotiation

I agree, it would be a matter of being clear about the single mime case on the docs.

This is the "Content Negotiation" part - this should be done in postgrest, not in the RPC
If we could extend that in the future, so that postgrest somehow tells the RPC (probably via GUC) on which of the possible mimetypes it has acted, that would be great.

Great idea! In fact the parseHttpAccept already orders (q params included) the mimes in a client Accept and returns a list. I think we can pick the best match (the first one in the list) for the GUC.

So, right now we have these cases:

  • Single mime: pgrst.accept = image/png (done)
  • Single wildcard: pgrst.accept = */*
  • Multiple mimes: pgrst.accept = image/png, image/jpeg
  • Multiple mimes and wildcard: pgrst.accept = application/xhtml+xml, text/html, */*

How about if we handle the 3 pending cases in this way:

  • Send the request.mime (name up for debate) GUC with the best match to let the user set the Content-Type according to their own logic.
  • The request.mime GUC will only be enabled for RPC calls with pgrst.accept. This is to avoid a bit of overhead on normal requests.
  • On this type of RPC, we won't set a Content-Type by default. This is allowed according to RFC 7231.

A sender that generates a message containing a payload body SHOULD generate a Content-Type header field in that message
SHOULD: This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item

I've been checking how to do this in the code and it looks feasible. Also, it would have the nice side effect of rejecting an invalid Accept without starting an empty commit on the db (which currently happens).

@wolfgangwalther What do you think?

@wolfgangwalther
Member

wolfgangwalther commented Oct 1, 2020

So, right now we have these cases:

Is the "wildcard" you mention here just about a full wildcard or does it extend to "half" wildcards like image/* as well? I assume it does.

Negotiation

I think we can pick the best match(first one in the list) for the GUC.

So, just for clarification, the negotiation algorithm you're suggesting would look like this?

def negotiate_content_type:
  for accepted_by_client in parseHttpAccept.ordered_list:
    for accepted_by_rpc in pgrst.accept:
      if accepted_by_client matches accepted_by_rpc:
        return most_specific_of(accepted_by_rpc, accepted_by_client)
  throw not acceptable error

Alternative approach

If parseHttpAccept were to return not only the ordered list, but also the q params (does it? I still can't read haskell that easily :/ ), an alternative algorithm could be to parse pgrst.accept with parseHttpAccept as well (so ordering would be handled, even including q as well!) and then:

  • create all combinations of accepted client and rpc types
  • filter out those that don't match
  • calculate client.q * rpc.q for every match (q defaults to 1 if not specified)
  • could even add a factor of 0.5 for "partial matches" because of wildcards, if we wanted to be really smart here
  • pick the match with the highest result and break a tie by order in the client header

This would be a bit like the algorithm that apache uses. Since this algorithm would be nicely encapsulated in a function anyway, I think it would not make the overall design more complex (just the negotiation algorithm of course :D).
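Just to make the arithmetic concrete, here is a toy illustration in plain SQL (the real negotiation would of course live in PostgREST, not the database; q values are made up, wildcard matching is simplified and the 0.5 partial-match factor is left out):

WITH client(mime, q, ord) AS (
  VALUES ('text/html', 1.0, 1),
         ('application/xhtml+xml', 1.0, 2),
         ('application/xml', 0.9, 3),
         ('*/*', 0.8, 4)
), rpc(mime, q) AS (
  VALUES ('image/png', 1.0),
         ('*/*', 1.0)
)
SELECT c.mime AS client_mime, r.mime AS rpc_mime, c.q * r.q AS score
FROM client c
JOIN rpc r ON c.mime = r.mime OR c.mime = '*/*' OR r.mime = '*/*'
ORDER BY score DESC, c.ord
LIMIT 1;
-- picks text/html matched against */*, score 1.0, tie broken by client order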

Return value

I think in any case we can do better than returning either always accepted_by_client or accepted_by_rpc. That's why I put the most_specific_of function in the example, that would do something like the following:

  • if the match is like client=*/* and rpc=image/png - the return should be image/png

  • if the match is like client=image/png and rpc=*/* - the return should be image/png as well

  • if the match is like client=image/png and rpc=image/png - of course image/png

  • if the match is like client=*/* and rpc=image/* - the return would be image/*. Also the other way around, of course.

  • and so on

Content-Type

The request.mime GUC will only be enabled for RPC calls with pgrst.accept. This is to avoid a bit of overhead on normal requests.

+1. That makes a lot of sense - RPC functions should not be concerned about the returned content-type being application/json or text/csv if postgrest handles that.

On this type of RPC, we'll not set a Content-Type by default. This is allowed according to RFC 7231.

What about:

  • if request.mime does not contain a wildcard -> set Content-Type to request.mime by default, but allow override

  • if request.mime does contain any kind of wildcard, don't set Content-Type by default

I think the first one would be a sane default, that would simplify a lot of RPCs, especially those that just accept a single specific mimetype.

@wolfgangwalther
Member

I just found this: https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions

It reads:

A [...] function call will be inlined if all of the following conditions are met:
[...]
the function has no SET clauses in its definition

That sounds like a show-stopper for the whole SET ...options... idea, because inlining is important for performance of RPCs.

What we need is another way of providing custom postgrest options for individual sql objects. Finding a more general solution here would also allow us to extend the concept here from RPCs to tables.

Maybe something like a config option db-config that points to a table. The schema cache queries could then query this table for additional config options. Optional, of course.

The table schema could be as simple as:

CREATE TABLE pgrst.config (
  oid OID,
  config JSON
);

in which one could insert:

INSERT INTO pgrst.config VALUES ('ret_image(text)'::regprocedure, '{"accept":"image/png"}');

Not sure if the oid approach is the best. It has the advantage that renaming database objects is possible. Not sure what happens on CREATE OR REPLACE FUNCTION, though. DROP FUNCTION; CREATE FUNCTION would probably not work without adding a new row to pgrst.config as well.

Another approach is to have type, schema and name columns, where type could be class, proc, ... according to the pg_ tables.
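A rough sketch of that variant (column names and values are only illustrative):

CREATE TABLE pgrst.config (
  type   TEXT,  -- 'class', 'proc', ... following the pg_* catalogs
  schema NAME,
  name   NAME,
  config JSON
);

INSERT INTO pgrst.config VALUES ('proc', 'api', 'ret_image', '{"accept":"image/png"}');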

@steve-chavez
Member Author

steve-chavez commented Nov 19, 2020

That sounds like a show-stopper for the whole SET ...options... idea, because inlining is important for performance of RPCs.

Hmm.. I guess we'd have to measure that to see if the perf we lose is considerable. But really the main use case for pgrst.accept was sending images (and removing raw-media-types) and those need to be cached anyway.

I think the pgrst.accept approach brings a good DX (the table approach is a bit more complicated) and is a good enough escape hatch for different media types. Lots of good things (experiments) can come out of this, but first we need to offer a simple interface.

So since flexibility was the original motivation and not max performance, maybe the inlining issue is not an actual show-stopper.

Still, it'd be interesting to clarify the real loss; I see things like "The rules set out here are believed to be correct for pg versions between 8.4 and 9.5" on the wiki and it looks like the info there is not conclusive.

(the wiki page is a bit complicated :O.. will need to revisit that later)

@wolfgangwalther
Member

wolfgangwalther commented Nov 19, 2020

Still, it'd be interesting to clarify the real loss; I see things like "The rules set out here are believed to be correct for pg versions between 8.4 and 9.5" on the wiki and it looks like the info there is not conclusive.

Forgot to mention that I checked the source. Unchanged in terms of the SET since 9.5. And that makes sense, because once the function is inlined in the main query, it's impossible to set some GUCs just for the inlined part - the context is lost. So this is very unlikely to have changed.

Hmm.. I guess we'd have to measure that to see if the perf we lose is considerable. But really the main use case for pgrst.accept was sending images (and removing raw-media-types) and those need to be cached anyway.

Without inlining, all the filters and limits etc. will be applied on the materialized result of the function call. This means, especially with big amounts of data like files, that the filters have to be re-implemented in the function itself. In most cases, this will be a simple PK lookup from a function argument, in which case it will not affect performance, that's true. But using any of the query syntax for filtering or using limit and offset will quickly not be possible anymore.

Re-implementing any of the filter behaviour will be much more complex than the config table :/

I am not sure how this affects performance for resource embedding.

We should be able to run a few tests with the current state of this PR, right? Eh... we would need a branch for comparison that allows inlining properly. Currently it's blocked in general, because of some other conditions not being met.

So maybe we should first make sure that we have queries that allow inlining for regular RPCs - could give a performance improvement as well. And once we have that, we can compare this in the specific use-case of pgrst.accept.

@wolfgangwalther
Member

See #1652 for a specific case where the consequences for performance of broken inlining are shown.

@steve-chavez marked this pull request as draft November 24, 2020 22:22
@wolfgangwalther
Member

wolfgangwalther commented Dec 8, 2020

TLDR: I'm proposing a solution that will:

  • allow us to use the SET pgrst.accept concept
  • allow us to inline RPC calls
  • solve the "accept custom mimetypes for tables and views" problem nicely

This solution is a bit of work to implement - but once we get this done, this should be a major improvement and really useful.

Before going into detail about what we should do, I will outline the inlining problem again, with a couple of examples that show why we need to support it for performance.


Inlining

What is inlining?

Why do we need to make sure that our RPC function calls can be inlined?

When a function call is inlined, the function body is put directly into the main query and the whole query is parsed as one big query. Here's an example:

CREATE FUNCTION search_client (query TEXT) RETURNS SETOF clients
STABLE LANGUAGE SQL AS $$
  SELECT * FROM clients WHERE name LIKE '%' || query || '%'
$$;

SELECT * FROM search_client('john doe') LIMIT 10;

Let's ignore for a second that to my knowledge it's not possible to add an index that would cover the name LIKE ... - the same can be done with full text search functions, but that would complicate matters too much here.

If the function call is inlined the query will be treated as:

SELECT * FROM (
  SELECT * FROM clients WHERE name LIKE '%' || 'john doe' || '%'
) AS search_client LIMIT 10;

This can be optimized as one query - so the LIMIT will be pushed inside the subquery and once 10 rows are found that match the condition, the query is done. On a very big table with many john does, this will be fast.

Now assume that the query can not be inlined. One way to do this would be to remove the STABLE from the function definition. In this case the function would be VOLATILE and can't be inlined. Now the function call acts as an optimization fence. That means that the query inside the function is executed, the whole table is scanned for all rows that match the condition, the result is put together (if I remember correctly, as an array) and only then is the LIMIT applied, so the first 10 rows of that resultset are taken. On a big table this will be slow.

See #1652 for a case where exactly this is happening. See https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions for all the conditions that must be satisfied for inlining.

Why do we need inlining for RPCs returning a custom mimetype?

Of course there are use-cases where inlining is not important. This is the case when the resultset inside the function call is not further reduced by other parameters. That basically means: without inlining, all the query parameters for filtering and limiting will come with a performance hit for RPCs. However if we pass in a PK column via function argument and query for exactly 1 row, inlining does not matter.

Here are some examples that could use custom mimetypes / accept header, but need query parameters for filtering / limiting:

  • Assume someone needs to output Line-delimited JSON (came up here: Streaming results #278 (comment), also see https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) or any other format not supported out of the box by PostgREST. They could use a custom mimetype and then handle that in an RPC. Of course, in this case they don't want to always transform the full table of records, before applying filtering and limiting. They need inlining.

  • Assume a table that holds files in a BYTEA column and has multiple other columns. The files are of different mimetype - so we need to call this through an RPC to set the Content-Type header correctly. The user should be able to filter on different columns and apply a LIMIT 1 to get a valid file back. If those filters were implemented as function arguments, we would need a lot of overloaded functions for all kind of filter combinations, to make up the final query, or we would need to dynamically create the final query. Inlining and pushing through conditions and limits is much easier.

For performance, we absolutely need to support inlining most of the time.


SET on the function definition prevents inlining

It's easy to understand why: Once SET is used on the function definition, the GUC that is set will have that value for the duration of running the function, so only "inside". Once you inline the query in the main query and treat everything as one, there is only one scope and the SET would apply to the whole query. This would result in different (and possibly unpredictable) behaviour, so those functions can not be inlined.

However, we would be using SET pgrst.accept just as a hint for our schema cache to know which function supports which mimetypes - we wouldn't even need the GUC inside the function body at all.

It seems like a serious waste of performance to prevent inlining just to use the nice way of hinting directly on the function body.

... but - what if we could do both? Read on! :)


Virtual / computed columns to our rescue!

One precondition for PostgREST returning a custom mimetype is always that this must be in the form of a TEXT or BYTEA column - because we need to just pass it right through to the client, without any post-processing (that is the whole point after all...). Not just any column, but exactly one column. What if... this column was a computed column instead?

Computed columns are defined through functions that take exactly 1 argument of the base table's type. We can identify all functions that provide computed columns in the schema cache and map them to their base tables / types. This would give us the positive side-effect, that we could add those columns to the OpenAPI output as well!

Now we do the following:

  • If the computed column's function has SET pgrst.accept and
  • the request's Accept header is matching that

we return the custom mime-type (just as discussed upwards in this PR and in other issues before). We do this by ignoring the ?select= query parameter and using just the equivalent of ?select=computed_column. The select parameter only makes sense when we post-process with PostgREST anyway. Remember: a custom mimetype is always 1 column returned. The computed column can handle merging multiple columns together. However, even if we're not making use of the select with PostgREST, we can still pass this parameter on to the computed column function via GUC, so that this function can apply the select internally, if there is a need for it.

This would allow us...

... to query tables or views with custom mime-types:

CREATE TABLE clients (
  name TEXT,
  budget MONEY,
  whatever OTHER
);

CREATE FUNCTION yaml (clients) RETURNS TEXT
IMMUTABLE LANGUAGE SQL
SET pgrst.accept='text/vnd.yaml' AS $$
  <some yaml transformation happening here>
$$;

Now you can request this with:

GET /clients?budget=gt.BIGNUMBER
Accept: text/vnd.yaml

And get a nice return of clients in yaml format.

... to create "generic" custom output formatters:

Not really generic, but we can add overloads for yaml (clients), yaml (projects), yaml (users)... great flexibility!

... to use the same on RPCs with composite return types ("smart views"):

Those RPCs have to return a base table type that can be used for computed columns. But they have to anyway, because otherwise inlining will again be prevented. See the inlining conditions mentioned above.

Those can then be queried just like the tables. This also allows returning different output formats for a single RPC, depending on which computed column is used!
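For illustration only (reusing the clients/yaml example from above), an RPC like this returns the base table type and could then be requested with Accept: text/vnd.yaml just like the table itself:

CREATE FUNCTION big_clients(min_budget MONEY) RETURNS SETOF clients
STABLE LANGUAGE SQL AS $$
  SELECT * FROM clients WHERE budget >= min_budget
$$;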

Note: RPCs that return any form of scalars are a bit special here. The best solution here is probably to just use SET pgrst.accept on the RPC itself as originally planned. All the filtering on the different columns doesn't apply here, so those filters do not need to be pushed inside anyway. The only thing that would need to be pushed in would be a LIMIT, but that can be added as an argument as well.


Why does this work?

"But.. if we do it like this, then the computed columns functions can not be inlined, right? And we need inlining!".

Yes and no. We don't need inlining for computed columns, because those will do a transformation from a row of the base table to a single value. There is nothing to gain from inlining here, as there can't be any conditions or limits pushed inside. This transformation has to happen in a function call. Most likely those functions would have other characteristics that would prevent inlining anyway...

Note: It's not entirely right that we never need inlining for computed columns, but not in this case. Whenever we use computed columns in the SELECT part of the query (and the whole custom mimetype stuff is really JUST that), we don't need inlining. When we use computed columns in other parts of the query, e.g. WHERE, we can very much benefit from inlining, because that might allow index usage. But this is completely unrelated here.


If we can pull this off, this would be a really powerful feature. We need quite a few parts to play together, to get this right, but in my opinion, it will be well worth it.

@wolfgangwalther
Member

A couple more notes:

Note: RPCs that return any form of scalars are a bit special here. The best solution here is probably to just use SET pgrst.accept on the RPC itself as originally planned. All the filtering on the different columns doesn't apply here, so those filters do not need to be pushed inside anyway. The only thing that would need to be pushed in would be a LIMIT, but that can be added as an argument as well.

It makes sense to implement pgrst.accept not only on computed columns, but also directly on the RPCs. Once for RPCs that return scalars (can't use computed columns here) and also because some scenarios (as demonstrated in this PR, the straightforward "select 1 row by PK" query) are considerably easier to handle with that and don't need inlining anyway.

Regarding inlining of RPCs in general: In the majority of cases, this is not possible right now, with the way the queries are constructed. So allowing inlining can improve performance in the future, but this PR will not lead to a degradation in performance given the way we call RPCs right now.

Therefore, I conclude that the approach in this PR can be continued as-is, without lowering our chances of implementing faster options on top of that. Once we have all the pieces in place, it will be mostly a documentation issue, to tell people how to write RPCs with pgrst.accept (and also without!) properly to have them run fast!

@wolfgangwalther mentioned this pull request Dec 8, 2020
@steve-chavez
Member Author

@wolfgangwalther The computed columns approach is genius! 💯 🥇 💥 🤯

So simple and yet so powerful 💪.

Couldn't think of a way to also make it work with an RPC returning a scalar. But as you mention, we can be clear about when not to do this in the docs.

So I'll revisit and finish this one later. I'll let you implement your great idea with computed columns on another PR.

Once we finish these we can call #1548 solved.

@wolfgangwalther
Member

I'll let you implement your great idea with computed columns on another PR.

Arghhhh :D

@steve-chavez
Member Author

Arghhhh :D

😆 I thought you might like to get full credit for that one. But I'll help in reviewing :D

@wolfgangwalther
Member

Just another idea here, so we don't forget.

... to create "generic" custom output formatters:

Not really generic, but we can add overloads for yaml (clients), yaml (projects), yaml (users)... great flexibility!

We can have a true generic solution by implementing the same for custom aggregate functions. Once we find an aggregate that has a finalfunc with SET pgrst.accept, we can call this instead of json_agg. This would allow e.g. generic transforms from json -> yaml or other formats. Maybe CSV output could even be implemented like that, because it would help nicely with adding headers or footers to multiple rows of custom mimetype. Generating a zip file out of multiple files would be another one... ah endless possibilities.
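A rough sketch of what such an aggregate could look like (to_yaml is a hypothetical conversion function; the pgrst.accept marker sits on the finalfunc, as suggested above):

-- state function: collect the rows into a jsonb array
CREATE FUNCTION yaml_agg_state(state JSONB, item JSONB) RETURNS JSONB
IMMUTABLE LANGUAGE SQL AS $$
  SELECT coalesce(state, '[]'::jsonb) || jsonb_build_array(item)
$$;

-- final function carries the marker; to_yaml is hypothetical
CREATE FUNCTION yaml_agg_final(state JSONB) RETURNS TEXT
IMMUTABLE LANGUAGE SQL
SET pgrst.accept = 'text/vnd.yaml' AS $$
  SELECT to_yaml(state)
$$;

CREATE AGGREGATE yaml_agg(JSONB) (
  SFUNC = yaml_agg_state,
  STYPE = JSONB,
  FINALFUNC = yaml_agg_final
);

PostgREST could then call something like yaml_agg(to_jsonb(postgrest_t)) instead of json_agg(...) when the negotiated type is text/vnd.yaml.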

@wolfgangwalther changed the base branch from master to main December 31, 2020 14:11
@wolfgangwalther
Member

The more I think about it, the more I think we should keep this simple and limit it to supporting aggregate functions only. No SET pgrst.accept on RPCs directly or on virtual columns. Aggregation is really the right stage to handle custom output formats - just like we do the transformation to json or csv at that stage right now, or use string concat for text and binary output formats.

All other approaches have serious drawbacks:

  • RPC: No support for tables/views, potentially serious performance issues.
  • Virtual Columns: Only work reliably for single entities. Imagine requesting a list of files with different mimetypes - one call to set the header would be overridden by the next and the result would be a mess. At the same time, we can't limit this to return a single row only, as this would not allow implementing other mimetypes that support multiple entities (e.g. YAML etc.).

This could be solved by using aggregate functions:

  • To qualify an aggregate function needs to have both SET pgrst.accept= for content negotiation and a return type of either TEXT or BYTEA.
  • To match aggregate functions with routes/database objects (tables, views, stored procedures) the type of the first argument is taken:
    • If it's a composite type as an argument, the aggregate will be matched with the table/view (if it's a table's row type) and any RPC that returns this type. To be able to pass the row to the aggregate, we need to change our select part of the query slightly to return the full row as one column like this: SELECT table FROM schema.table. This will not support any ?select= in the query string, because the aggregation function makes the decision which columns from the row to take.
    • If it's a scalar type it will match with any RPC returning that type. ?select= is not supported anyway in this case.
    • If it's JSON or JSONB any table/view/RPC is supported. This uses the regular query including ?select= and embedding support and allows for generic handlers (e.g. to transform to YAML). The aggregate is called somehow like this: SELECT yaml_agg(to_json(postgrest_t))

This approach does support everything we need:

  • Tables, views and RPCs
  • Inlining of RPCs
  • Works with RPCs that return scalar values, too
  • Throwing errors when multiple entities are requested for a mimetype that doesn't support multiple
  • Container mimetypes for multiple entities (e.g. YAML or even a ZIP-wrapper around multiple files...)
  • Generic mimetype handlers (e.g. better CSV support)

Another benefit is that we don't need the whole "select one output column only" magic. Even from a table you will be able to just set your Accept header and the aggregate function takes care of choosing the right column. Much better handling in code and the API, I think.

@wolfgangwalther
Member

One advantage of not implementing pgrst.accept for RPCs would be that we could allow setting it to enable Content-Type handling (so the other way around: parsing the body on e.g. POST requests). Basically, setting pgrst.accept on

  • an aggregate function would be for output
  • a regular function would be for input

Those functions would then have to take BYTEA or TEXT and return either JSON / JSONB to take the current path. Or they could return the target table's row type to avoid the intermediate conversion to json. An INSERT query should be straightforward in that case, I think. This would allow us to parse multipart form data requests. So uploading files could be easily possible, too!
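A hypothetical sketch of such an input handler (nothing of this exists yet; the toy body just parses a two-column "name,budget" CSV without a header row into JSONB, which PostgREST could then treat like a regular JSON request body):

CREATE FUNCTION parse_csv_body(body TEXT) RETURNS JSONB
IMMUTABLE LANGUAGE SQL
SET pgrst.accept = 'text/csv' AS $$
  SELECT jsonb_agg(jsonb_build_object(
           'name',   split_part(line, ',', 1),
           'budget', split_part(line, ',', 2)))
  FROM unnest(string_to_array(body, E'\n')) AS line
  WHERE line <> ''
$$;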

@wolfgangwalther
Member

Also, maybe I could start implementation by just limiting to a specific media type? Like only allow text/html domains right now, on aggregates and functions. No overriding default postgREST types and not doing anything on a */* domain.

If the user wants to also support application/xthml+xml it would have to create other functions/aggregates for now.

If you really want to implement one specific type only (not sure how much that would simplify the implementation, while still making progress towards the ultimate goal), then I would suggest you don't add html. I think text/csv would be better suited, because it should be much simpler to write some generic aggregate handlers representing any row type - and we would be solving most issues around the current text/csv support, because this could be user-implemented. We'd just need to add a good aggregate to the docs for that. This would then allow you to remove built-in text/csv support, which should simplify the code.

Next step would be doing the same for xml. This way we'd have a basic feature, with which we can test the general concepts without introducing breaking changes that don't have an alternative anymore.

Removing csv and xml support, but only providing html instead... would break usability for existing use-cases and could be a deal-breaker.

@aljungberg
Contributor

With the data reps solution, and the relevant synonymous no-op transform, we can find a solution such as text/html -> application/xhtml+xml -> application/*. With hardcoded generalisation in the negotiator, it would only recognise text/html and text/* as satisfiable.

Yeah, I think what we're supposing above with all the CAST based solutions is essentially:

  • built-in generalization of image/png -> image/* etc. - i.e. those required by the spec anyway.
  • custom-defined generalization / synonyms basically exactly the way you suggest.

Much less boiler-plate for those things that we need anyway to be spec compliant.

I do agree with the idea of reducing boilerplate and in fact I got closer to your point of view as I wrote the rest of that response. I like the conceptual simplicity of an A->B map defined explicitly, but you end up with a lot of boilerplate in the generalised system.

Using data reps has the advantage of working without requiring SUPERUSER privileges, unlike creating casts with CREATE CAST... WITH INOUT;, which does require SUPERUSER access.

Wait, what? WITH INOUT does not require superuser, afaik. WITHOUT FUNCTION (binary coercible casts) require superuser. I wanted to suggest them first, but that wouldn't work. For a WITH INOUT cast we only need to be owner of one of the types - which we always are, because we create those DOMAINS ourself.

You are correct! Just tested this on Google SQL. Could have sworn I had tested this, my bad.

The challenge is in how to deal with "multiple possible options on the database side". I.e. not in "interpreting the request-user's priorities" (those are defined by spec), but in "interpreting the developer's priorities" in case of a tie. This is because I can easily create multiple aggregates or overloaded RPCs that could serve the same request equally well - and we need to sort those cases.

From going over it quickly, your proposed solution to this seems to be "length of path" in point 4. I haven't been able to think about that too deeply, yet, as said above.

Yes, we're thinking about the same thing here. Another way to describe it is that the dev provides a set in their schema, POSSIBILITIES, the client provides another set, WANTS, and our job is to find the best fit member of the intersection. WANTS turns out to be a priority ordered set by definition of Accept:, so our only remaining job is to prioritise the POSSIBILITIES and I think my solution does that in a natural manner. If we always choose the shortest path, that will correspond with the highest degree of intentionality expressed by the developer.

Maybe you can have a look at my suggestion that I posted basically at the same time and see how close we are or which key points in our suggestions are different?

Right, these proposals are quite close in fact. I think these are the differences:

  • The data reps based solution doesn't use create aggregate at all the way I wrote it. In my proposal it basically treats SETOF rowtype as just one more type that can be transformed. Whether you want to transform a whole table into a HTML <table> or encode an RPC raw pixel output to image/png, it's all the same thing: create your transformation function, define a domain cast.
  • ...although it might need to support create aggregate too for efficient map reduce style performance.
  • We agree on q. The developer's job is to define what's available, the client's job is to say what they want, PostgREST's job is to match the two. q is an instruction provided by the client to elucidate their want.
  • In both our proposals, every RPC function provides a specific potential return type, as specific as it can manage (and there are cases like SELECT * FROM image_uploads where image/* indeed is as specific as it can get).
  • You have AcceptHandlers which you filter, sort by a combination of specificity and priority, and then pick the first one.
  • I have something very similar. Filter the set of possible solutions and return the first one. (The search order is such that if you find a solution, it's automatically the best fit for the given Accept: header. You can stop immediately.)
  • But there's a difference in how it interplays with data reps, and, in my humble opinion, simplicity.

Apologies for how long this got but here's a deeper analysis:

Another way to describe this whole thing is that content-negotiation is just search through a 2D space. The points in the space are "content types I can serve", and we place them such that X and Y axis are client and developer preference respectively, 0 being the most preferred. At first look it seems like we can't give a definitive answer for the correct solution by just saying "pick the point the shortest distance from the origin" because there may be multiple solutions at the same coordinate, and you may have equidistant solutions forming an arc. In this case we're choosing to minimise X because that's what the Accept: header calls for which solves the latter.

So that leaves the problem that there could be two Content Types at the same coordinates. Both your solution and mine resolve this by making y = dev_priority(handler) bijective, ensuring no overlapping solutions. You have the order of precedence table you described, and in fact I have the same, roughly, in my post because I adopted that idea from you (built-ins < AS IMPLICIT casts).

The main difference, I think, is that data reps rely on this "chain of transformations" thing for simplicity and developer expressiveness. It's hard to make a simple example, but I'll give it a go.

Suppose we have the following scenario:

  • Our input is table rows from submarine_propellers.
  • Client accepts text/html but is stuck in the times when XML was great and so prefers application/xml+html.
  • We have functions, csvgen and htmlgen which can process submarine_propellers rows into text/csv and text/html. We also have another function, crazybingo, that makes unstructured text of any row input. Its return content type is text/*.
  • And we have casts, automatically or explicitly generated, text/html -> text/*, text/html -> application/xml+html (via a data rep transformer that makes sure to close all tags). We also have text/csv -> text/html using a function which makes CSV into a table, htmltabulate.
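To make the scenario easier to follow, here is roughly what those pieces could look like in SQL, using the DOMAIN/CAST convention discussed above (the table and the function bodies are only illustrative stubs):

CREATE TABLE submarine_propellers (id INT PRIMARY KEY, blades INT, pitch NUMERIC);

CREATE DOMAIN "text/*"               AS TEXT;
CREATE DOMAIN "text/csv"             AS TEXT;
CREATE DOMAIN "text/html"            AS TEXT;
CREATE DOMAIN "application/xml+html" AS TEXT;

-- generators from the row type
CREATE FUNCTION csvgen(p submarine_propellers) RETURNS "text/csv"
IMMUTABLE LANGUAGE SQL AS $$ SELECT format('%s,%s,%s', p.id, p.blades, p.pitch)::"text/csv" $$;

CREATE FUNCTION htmlgen(p submarine_propellers) RETURNS "text/html"
IMMUTABLE LANGUAGE SQL AS $$ SELECT format('<tr><td>%s</td></tr>', p.id)::"text/html" $$;

-- "lateral" transformations between mimetypes, registered as casts
CREATE FUNCTION xmlize(doc "text/html") RETURNS "application/xml+html"
IMMUTABLE LANGUAGE SQL AS $$ SELECT doc::TEXT::"application/xml+html" $$;  -- stub: would close open tags

CREATE FUNCTION htmltabulate(doc "text/csv") RETURNS "text/html"
IMMUTABLE LANGUAGE SQL AS $$ SELECT ('<table>' || doc || '</table>')::"text/html" $$;  -- stub

CREATE CAST ("text/html" AS "application/xml+html") WITH FUNCTION xmlize;
CREATE CAST ("text/csv"  AS "text/html")            WITH FUNCTION htmltabulate;
CREATE CAST ("text/html" AS "text/*")               WITH INOUT;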

My solution (data reps):

  1. Define a set of (A, B, fn) tuples such that fn(A) -> B, forming a graph where each node represents a type, and the edges represent the transformations between types. Also add entry points for each type we can select from Postgres. There is only 1 edge between each node pair because when we load the data rep transforms, higher priority ones replace lower ones (like CAST ... AS IMPLICIT replaces a built-in transform).
  2. Consider the client's requested MIME types in order of preference, application/xml+html;text/html;text/*.
  3. For each requested MIME type, find a path from any entry point to the desired MIME type. If a path is found, it's a valid solution.
  4. Prefer the shortest path and search for shorter paths first. This ensures that the first solution found is always the best one. The best case search is a single A*, the worst (when there's no possible solution) is n such searches where n is the number of acceptable types.
  5. Since there is a path select submarine_propellers -[htmlgen]-> text/html -[xmlize]-> application/xml+html, we're done. We respond with SELECT xmlize(htmlgen(subquery)).
  6. Note we didn't choose to generate text/* because even though we can generate it and the client accepts it, we considered the client preferences in order.
  7. We also didn't choose xmlize(htmltabulate(csvgen(subquery))) because that's a longer path. This works well. htmlgen is a better handler because it specifically makes text/html. Even though the client requested application/xml+html, which is neither, we resolved an ambiguous situation the best way and likely the way the developer intended.

AcceptHandler solution:

  1. Create AcceptHandlers with different priority levels, including custom handlers created via CAST ... AS IMPLICIT, CAST ... AS ASSIGNMENT, and built-in handlers.
  2. Filter out handlers that don't support at least one of the requested MIME types. In this case, we keep csvgen, htmlgen, crazybingo, xmlize, htmltabulate.
  3. Filter out the ones that don't support the input. Let's say xmlize can accept any rows as input to make it a possible candidate still.
  4. Sort the remaining handlers by the specificity of their input, allowing for generic handlers to be overridden by more specific ones for special cases.
  5. If still tied, sort the handlers by their priority levels.
  6. Now, unless I've misunderstood, xmlize won't be chosen because it is less specific on its input. crazybingo also falls away. htmlgen, csvgen come out on top but tied, both able to specifically work on this input.
  7. htmlgen is selected because it has the highest q. Also csvgen makes text/csv which matches text/* but that's less specific.

In this example, both solutions aim to find the best match between the client's requested MIME types and the available content types provided by the developer, but the outcome is different.

Let me know if I misunderstood anything about your algorithm. If I did not, I propose that the data rep based solution finds a better solution with less ambiguity at each step.

@wolfgangwalther
Member

wolfgangwalther commented Mar 24, 2023

@aljungberg Thanks for explaining your approach in depth again - I have a much better understanding of it now. Just some random thoughts on your post at first.

  • The data reps based solution doesn't use create aggregate at all the way I wrote it. In my proposal it basically treats SETOF rowtype as just one more type that can be transformed. Whether you want to transform a whole table into a HTML <table> or encode an RPC raw pixel output to image/png, it's all the same thing: create your transformation function, define a domain cast.

  • ...although it might need to support create aggregate too for efficient map reduce style performance.

Aggregates are not only required for performance, but also to allow certain formats in the first place. We can't take SETOF rowtype as an argument to a function. This is what aggregation is about. Currently we need some kind of aggregation in all responses, because PostgREST expects one returned row. And just concatenating the final text/bytea output after transformation is not going to work because some mimetypes require headers and footers. json_agg is a simple example: You need the [] brackets around the whole response and , between the rows. That's what aggregation does.

  • But there's a difference in how it interplays with data reps, and, in my humble opinion, simplicity.

I think I begin to see that your approach will end up being simpler to use.

Apologies for how long this got but here's a deeper analysis:

Thanks a lot for that, it helped my understanding a lot!

Another way to describe this whole thing is that content-negotiation is just search through a 2D space. The points in the space are "content types I can serve", and we place them such that X and Y axis are client and developer preference respectively, 0 being the most preferred. At first look it seems like we can't give a definitive answer for the correct solution by just saying "pick the point the shortest distance from the origin" because there may be multiple solutions at the same coordinate, and you may have equidistant solutions forming an arc. In this case we're choosing to minimise X because that's what the Accept: header calls for which solves the latter.

After the first two sentences I was like "yeah and then we have multiple solutions at the same coordinate".. and then you just said that in your next sentence. This line of thought resonates very well with me.

The main difference, I think, is that data reps rely on this "chain of transformations" thing for simplicity and developer expressiveness. It's hard to make a simple example, but I'll give it a go.

It worked for me. I now understand what you meant by chain of transformations. In fact, one of the built-in handlers relies on this concept. application/json is handled by a chain of json_agg(to_json(row)), which is exactly that. It's basically using json as an intermediate type for %ROWTYPE --to_json--> json --json_agg--> "application/json".

  1. Define a set of (A, B, fn) tuples such that fn(A) -> B, forming a graph where each node represents a type, and the edges represent the transformations between types. Also add entry points for each type we can select from Postgres.

I see how this is much better for performance compared to my solution, because resolving the priorities happens during schema cache reload, not when serving the request.

  4. Prefer the shortest path and search for shorter paths first. This ensures that the first solution found is always the best one. The best case search is a single A*, the worst (when there's no possible solution) is n such searches where n is the number of acceptable types.

This is the key thing that I am not sure about. Is shorter always better? Am I able to express my intent as a developer this way? Following up this post I will try to put some actual real world cases that I have right now into code and try to test the concept.

  5. Since there is a path select submarine_propellers -[htmlgen]-> text/html -[xmlize]-> application/xml+html, we're done. We respond with SELECT xmlize(htmlgen(subquery)).

The xmlize part is really nice - not only allowing transformations between source data types and mimetypes.. but also "lateral" transformations between different mimetypes. That's the core of what you call "data representations". Elegant.

  6. Note we didn't choose to generate text/* because even though we can generate it and the client accepts it, we considered the client preferences in order.

AFAIK the order of mimetypes in the Accept header does not matter. As long as they don't have any q, they are all equally weighted. At least that's what the spec says, IIRC.

In this example, both solutions aim to find the best match between the client's requested MIME types and the available content types provided by the developer, but the outcome is different.

I didn't go through my own algorithm here in detail, but I'd assume that for both algorithms the same intent would just have to be expressed slightly differently. I assume it would be possible to achieve the same result, although maybe not as elegant as in your case with xmlize.

Let me know if I misunderstood anything about your algorithm.

I'm not sure whether there is a mis-understanding for one of us. I think we both haven't been 100% explicit about this, but after reading your post, I am left with the impression that there is one key difference in how both of us would use CAST ... AS IMPLICIT for overriding the default:

  • In both we basically have a built-in default that equates to this:

    -- "api" is the schema exposed by PostgREST
    CREATE DOMAIN "application/json" AS json;
    CREATE FUNCTION pg_catalog.to_json(any) RETURNS json;
    CREATE AGGREGATE pg_catalog.json_agg(json) RETURNS json;
    CREATE CAST (json AS "application/json") WITH INOUT AS BUILTIN; -- BUILTIN ofc doesn't exist, but is written here to highlight the slight difference to IMPLICIT or no AS.
  • In my proposal, we would use for example CREATE CAST (csv AS "text/csv") WITH INOUT AS IMPLICIT; to "promote" text/csv to the default over application/json.

  • I (don't) understand your proposal to instead do something like CREATE CAST (json AS "application/json") WITH FUNCTION my_custom_json_handler(); to override the built-in json handler. So basically: by writing the implicit built-in out explicitly, you are changing its properties. You could possibly change its "weight", too, by adding or removing AS IMPLICIT to that cast. I read this into what you wrote in:

    There is only 1 edge between each node pair because when we load the data rep transforms, higher priority ones replace lower ones (like CAST ... AS IMPLICIT replaces a built-in transform).

    But I don't see how this would allow us to replace a global default format, which is currently JSON. Or maybe.. after writing this down, I do?

Let's try. I want an API to return only text/csv and no json anymore. So, with your proposal, I'd try the following:

-- should this be required/forced by PostgREST to use BYTEA? Does it ever make sense to really use TEXT for `*/*`? BYTEA seems to be the only built-in type that can actually represent all mimetypes...
CREATE DOMAIN "*/*" AS TEXT;
CREATE DOMAIN "text/csv" AS TEXT;

-- does this mean text/csv is now my default return type?
CREATE CAST ("text/csv" AS "*/*") WITH INOUT;

CREATE AGGREGATE to_csv(any) RETURNS "text/csv";

With this, I would have the following paths available for a Accept: */* request:

  • */* <--builtin-- application/json <--json_agg-- json <--to_json-- any
  • */* <--cast-- text/csv <--to_csv-- any

In this case, the second path would be chosen, because it's shorter. But that seems more like a coincidence, because the default path just happens to have the to_json in it as a second step. At least I didn't require AS IMPLICIT to somehow "override the default" - and I wouldn't know where to put it.

Assuming we have a two-step process for CSV, too:

  • */* <--builtin-- application/json <--json_agg-- json <--to_json-- any
  • */* <--cast-- text/csv <--csv_agg-- csv <--to_csv-- any

How would I now tell PostgREST that text/csv should override the builtin default?

@wolfgangwalther
Member

Trying @aljungberg's transformation chains suggestion.

Example 1a: Generic files table with download function.

CREATE TABLE files (
  PRIMARY KEY (file),
  file UUID GENERATED ALWAYS AS (md5(data)::UUID) STORED,
  type TEXT GENERATED ALWAYS AS (byteamagic_mime(substr(data, 0, 4100))) STORED,
  data BYTEA NOT NULL,
  filename TEXT NOT NULL
);

CREATE DOMAIN "*/*" AS BYTEA;
CREATE FUNCTION download(file UUID) RETURNS "*/*"
STABLE
LANGUAGE plpgsql AS $$
BEGIN
  PERFORM
    set_config('response.headers', Jsonb_build_object('Content-Type', type)::TEXT, TRUE)
  FROM files
  WHERE files.file = download.file;

  RETURN (SELECT data FROM files WHERE files.file = download.file);
END$$;

For a request like the following:

GET /download?file=<md5> HTTP/1.1
Accept: */*

There is only one path available, I think: */*. I think this path doesn't have any edges and just a single node, because what the client requests is what the endpoint (RPC) returns. No transformers needed.

Example 1b: Extending example 1 to return multiple files in a zip archive

(Haven't implemented it like this, I'm using mod_zip for nginx instead, which provides a much better solution)

CREATE DOMAIN "application/zip" AS BYTEA;
CREATE AGGREGATE to_zip(files) RETURNS "application/zip";
-- pseudo: state transition function creates a zip file from each `$1.data` using `$1.filename`.

This should allow me to do both of these:

  1. List files via json api:
GET /files?select=file,type,filename HTTP/1.1
Accept: application/json, */*
  2. Download zip archive of some of them:
GET /files?filename=like.*important* HTTP/1.1
Accept: application/zip

But this is rather inconvenient in most cases, because I'd request the file list via javascript, where I can add the accept header, and download the file directly via URL, where I can't make the browser change the header. So it would be better, if for the files endpoint only, the default would be to return the zip archive when requesting */*.

Let's try:

CREATE CAST ("application/zip" AS "*/*") WITH INOUT;

What would this do? In the current example this could work, because to_zip is the only thing dealing with application/zip, so this path would only match the /files endpoint. But once we maybe add another to_zip(any) overload... suddenly application/zip would be the default for all endpoints.

Instead we could try to change the AGGREGATE above to:

CREATE AGGREGATE to_zip(files) RETURNS "*/*";

We don't need to create the application/zip DOMAIN at all here.

Now, the */* request would have those paths:

  • */* <--to_zip-- files
  • */* <--builtin-- application/json <--json_agg-- json <--to_json-- files

The first one is shorter and will be taken.

Assuming we really had the to_zip(any) overload - this would already be filtered out when loading the schema cache, because the */* <-- files edge has a more specific function with to_zip(files). Not that we could do anything here anyway, because PostgreSQL overloading will handle this case for us.

For the Accept: application/json request, we'd only have the built-in json path available.

I like this, so far.

Example 2: Provide a report in json, csv and excel

CREATE FUNCTION report(<filter arguments>) RETURNS TABLE (
  who NAME,
  what TEXT,
  "when" TIMESTAMPTZ
) ...;

CREATE DOMAIN "text/csv" AS TEXT;
CREATE AGGREGATE to_csv(anycompatible) RETURNS "text/csv" ...;

CREATE DOMAIN "application/vnd.ms-excel" AS BYTEA;
CREATE AGGREGATE to_xlsx(anycompatible) RETURNS "application/vnd.ms-excel" ...;

Short of implementing those aggregates, this should be simple and produce the desired result. This should even work, when using select=who,what in the query string, I think...?

Hm, my examples so far, seem to be too simple. I think they would work equally well with both approaches?

@aljungberg
Contributor

aljungberg commented Mar 24, 2023

Edit: sorry, didn't see your second comment until I already posted this. Hopefully the below is still on point.

Thanks for taking the time to try to find counterexamples! That's key to confirming this can work.

We can't take SETOF rowtype as an argument to a function.

Fair enough, I didn't know functions can't take record sets as input. Is that even if you explicitly define the setof row type as a complex type? It would be clean to make everything a simple function (which can then use aggregation inside if it wants to).

Throwing aggregation functions into the mix adds a bit of extra work in the discovery of the mappings and how to output the result, but luckily the broad idea remains the same. All we need to know is our (A, B, fn) tuples, giving us the map to build our paths, whether fn is a regular function or an aggregate.

Let's try. I want an API to return only text/csv and no json anymore. So, with your proposal, I'd try the following:

-- should this be required/forced by PostgREST to use BYTEA? Does it ever make sense to really use TEXT for `*/*`? BYTEA seems to be the only built-in type that can actually represent all mimetypes...
CREATE DOMAIN "*/*" AS TEXT;
CREATE DOMAIN "text/csv" AS TEXT;

-- does this mean text/csv is now my default return type?
CREATE CAST ("text/csv" AS "*/*") WITH INOUT;

Okay, I think my longer explanation will be more helpful but yes, something like that, although I don't think you should do this for reasons I'll state below.

But answering your question as stated: with data reps this cast would add a mapping of (text/csv, */*, noop), so yes, a client asking for */* would usually get text/csv if that's the type of an available input in the context, being the shortest path. The exception would be if there's an input literally of type */* available. Perhaps inadvisable, but technically possible; it would take precedence since that's an even shorter path.

CREATE AGGREGATE to_csv(any) RETURNS "text/csv";

With this, I would have the following paths available for a Accept: */* request:

  • */* <--builtin-- application/json <--json_agg-- json <--to_json-- any
  • */* <--cast-- text/csv <--to_csv-- any

In this case, the second path would be chosen, because it's shorter. But that seems more like a coincidence, because the default path just happens to have the to_json in it as a second step. At least I didn't require AS IMPLICIT to somehow "override the default" - and I wouldn't know where to put it.

Assuming we have a two-step process for CSV, too:

  • */* <--builtin-- application/json <--json_agg-- json <--to_json-- any
  • */* <--cast-- text/csv <--csv_agg-- csv <--to_csv-- any

How would I now tell PostgREST that text/csv should override the builtin default?

Right, so there are two mechanisms a developer can use to promote an alternative resolution. The first one you already mentioned and I believe is the most intuitive: just create casts making a shorter path. Turn csv_agg(to_csv(x)) into csvify(x) in a single function (or aggregate if that's what it takes), and now your shorter path trumps the default.

Let's say that's not possible, performant or convenient. Then that's where the order of mappings by priority that I mentioned plays in. I realise I didn't fully explain it. Basically, the resolver stops when it finds a solution, any solution, and it visits potential solutions in a certain inherent order: by path length, then by mapping order. So in our example at hand, and I'm going to simplify the path finding algorithm a little here (in reality we'd explore solutions in parallel for performance reasons):

  1. Consider solutions of length 1. No matter what order we try potentials here, there are none. [See note 1 and 2]
  2. Solutions of length 2? No. Proceed to look for length 3.
  3. Start by considering the mapping (any, csv, to_csv). Why this one first? It's user defined. The (any, json, to_json) edge won't be followed until we've ruled out that the priority edge can give us a length 3 solution. [Note 2]
  4. Since we find the noop(csv_agg(to_csv(x))) path satisfying, we return it and stop searching.

So you trumped the default solution when you did CREATE AGGREGATE to_csv(any) RETURNS "text/csv"; Being a user defined mapping, it was inserted earlier in our ordered set of mappings. So thinking of it in the graph sense, the edges have priorities (like costs in path finding). Getting down to the nitty gritty code, as a practical matter, the map that holds all data reps with key any has values [(csv, to_csv), (json, to_json)], the values being ordered. And as you rightfully pointed out this is all sorted during schema loading so there's no runtime cost. Lacking some kind of quantum computer, we have to visit some edge first, and it just so happens the edge we should visit is first in the list.
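
Just to make that ordering concrete, here is a toy sketch of "shortest path first, then user-defined edges before built-ins", expressed as plain SQL over a hypothetical edges table. This is only an illustration of the resolution order, not how PostgREST would actually implement it, and the edge values are made up:

WITH RECURSIVE edges(src, dst, fn, prio) AS (
  VALUES ('any',              'json',             'json_agg', 1),  -- built-in
         ('json',             'application/json', 'no-op',    1),
         ('application/json', '*/*',              'no-op',    1),
         ('any',              'text/csv',         'to_csv',   0),  -- user-defined
         ('text/csv',         '*/*',              'no-op',    0)
), paths AS (
  SELECT dst, ARRAY[fn] AS fns, ARRAY[prio] AS prios
  FROM edges WHERE src = 'any'
  UNION ALL
  SELECT e.dst, p.fns || e.fn, p.prios || e.prio
  FROM paths p JOIN edges e ON e.src = p.dst
  WHERE cardinality(p.fns) < 4               -- max path length
)
SELECT fns FROM paths
WHERE dst = '*/*'
ORDER BY cardinality(fns), prios             -- shortest first, then user-defined first
LIMIT 1;
-- => {to_csv,no-op}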

Sorry, it's getting a little late here, so I hope I didn't overlook any of your questions. You are right, there may be edge cases I didn't consider with my shortest-length and highest-priority edge resolution ordering, so this is a useful exercise. My intuition for why it's correct is that 2D search space I mentioned. Solutions don't overlap and we have a tie breaker if two solutions are the same distance from the origin: we pick the leftmost one as per the accept header. So the algorithm always finds the "best" solution, at least for my definition of best, and the way this fails is only if the developer doesn't have enough freedom to adjust x and y when 'positioning' solutions they want to 'win'. And there seems to be good freedom on both axes, so we should be OK.

(Disappointed if it's true that the Accept RFC doesn't specify how to break ties with equal q like I imagined, but as a practical matter grouping by q and then going left to right seems perfectly reasonable. What else would you expect as a user?)

[1]: We need to make sure not to make any of our built-ins be accidental "shortcuts" that skips over types. One step at a time is best, like in image/png -> image/* -> */*, even if it's tempting to skip the middle step. This ensures maximum composability and flexibility for the developer to inject alternative paths.
[2]: In reality we won't do length 1, 2, 3, 4, 5... in order, restarting each time, because a 5 step path starts with a 4 step path so we don't want to throw away our intermediate work. I imagine we'd use a "fill" style path finding algorithm, building on a set of working solutions, in the right order, until we find a solution or there are no more working solutions that didn't dead-end, or we reach the max path length constant.

@wolfgangwalther
Copy link
Member

Fair enough, I didn't know functions can't take record sets as input. Is that even if you explicitly define the setof row type as a complex type?

SETOF is not part of a type definition. It's part of CREATE FUNCTION. A function can either return a scalar value - or a set of values of a certain type.
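
To illustrate (a tiny sketch, assuming some projects table exists; all_projects is a made-up name): SETOF only ever appears in the function declaration, never in a type definition.

-- SETOF belongs to the function signature, not to any type
CREATE FUNCTION all_projects() RETURNS SETOF projects
LANGUAGE SQL AS $$ SELECT * FROM projects $$;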

Throwing aggregation functions into the mix adds a bit of extra work in the discovery of the mappings and how to output the result, but luckily the broad idea remains the same. All we need to know is our (A, B, fn) tuples, giving us the map to build our paths, whether fn is a regular function or an aggregate.

Yes. One thing to consider while searching all possible paths: we need a maximum of one aggregate function in each path. Sometimes, having zero aggregates could work, too - but most of the time we need exactly one aggregate.

Okay, I think my longer explanation will be more helpful but yes, something like that, although I don't think you should do this for reasons I'll state below.

I couldn't find those reasons. What exactly should I not do and why?

But answering your question as stated: with data reps this cast would add a mapping of (text/csv, */*, noop), so yes, a client asking for */* would usually get text/csv if that's the type of an available input in the context, being the shortest path. The exception would be if there's an input literally of type */* available. Perhaps inadvisable, but technically possible; it would take precedence since that's an even shorter path.

That's certainly wanted. In my example 1a in my other comment above, I would still want the download() function, which returns */* to work, even when my default is text/csv for other endpoints through this cast.

Let's say that's not possible, performant or convenient. Then that's where the order of mappings by priority that I mentioned plays in. I realise I didn't fully explain it. Basically, the resolver stops when it finds a solution, any solution, and it visits potential solutions in a certain inherent order: by path length, then by mapping order.

This makes it impossible to "bump" longer paths before shorter paths by manipulating anything remotely similar to "priorities" or "weights", right?

My gut feeling is that it's important to consider path length, because otherwise the number of possible solutions would just grow and it could lead to some strange solutions that the developer didn't really intend. However, at the same time, I feel like for most regular scenarios you'd be working in a range of maybe 1 to 4 nodes in a path - and it's not immediately clear to me that in this space, shorter paths are always better than longer paths. But I haven't found a counterexample yet, so maybe it just really works.

  1. Start by considering the mapping (any, csv, to_csv). Why this one first? It's user defined. The (any, json, to_json) edge won't be followed until we've ruled out that the priority edge can give us a length 3 solution. [Note 2]

  2. Since we find the noop(csv_agg(to_csv(x))) path satisfying, we return it and stop searching.

Thanks, that gave me a much better idea where you'd consider the order of the mappings. I will try and see to come up with examples where this is useful and find cases where we'd need to manipulate that order.

but as a practical matter grouping by q and then going left to right seems perfectly reasonable. What else would you expect as a user?

Correct, I think in practice that's what others do, anyway.

[1]: We need to make sure not to make any of our built-ins be accidental "shortcuts" that skips over types. One step at a time is best, like in image/png -> image/* -> */*, even if it's tempting to skip the middle step. This ensures maximum composability and flexibility for the developer to inject alternative paths.

So, you're saying we actually have */* <-- application/* <-- application/json <-- json <-- any as a built-in? I.e. with the intermediate application/* type?

(I just learned that json_agg(anyelement) takes anyelement already, so we actually don't need to consider to_json as another edge anymore as in a lot of previous examples)

Speaking of that, to maximize composability, we should not have this full path as a built-in - but rather each edge separately:

  • (application/*, */*, no-op)
  • (application/json, application/*, no-op)
  • (json, application/json, no-op)
  • (anyelement, json, json_agg)

This would allow a developer to replace just a single edge of that chain. Examples:

  • I can do CREATE CAST ("application/json" AS json) WITH FUNCTION json_strip_nulls; to remove all null values by default from my api response - but still use the same mimetype. How cool is that? Path length is the same, but the (json, application/json, ...) edge is replaced, because the custom one has priority.
  • Let's also consider (anyelement, jsonb, jsonb_agg) a built-in, even though we don't use it by default. This would allow us to simply do a CREATE DOMAIN "application/json" AS JSONB; to change our whole default json handling to jsonb types instead of json.
  • All other examples that I can think of.. would change the mimetype, so they would all boil down to CREATE CAST ("*/*" AS ...) ... in some way, if that new format was supposed to be my default format. The question is: Can I implement everything I'd like as a default format with 4 nodes or fewer? I assume the answer is yes.

I can't find a reason why we'd ever need AS IMPLICIT or anything like that to manage priorities. The only thing seems to be, that we should prioritize custom-made edges over built-ins.

@aljungberg
Copy link
Contributor

aljungberg commented Mar 28, 2023

Thank you for the thoughtful examples. It's encouraging to see that the model seems to work in the cases given so far.

SETOF is not part of a type definition. It's part of CREATE FUNCTION. A function can either return a scalar value - or a set of values of a certain type.

Yes I see that now, there are just row types. Regular functions accept only rows or scalars as parameters. Can the SETOF x be cast to an array like x[]? We could potentially do that automatically when building the transformation chain. (The reason I'm going on about this is that I really like the idea of every transform being just a plain function. It seems so simple for the dev.)

If you bear with me, let me play with this idea for a moment using a concrete example.

Let's say we want to implement a custom table->json representation. Maybe we have a settings table with unique names and instead of outputting [{"name": "value"}, {"name2": "value2"}, ...] we want {"name": "value", "name2": "value2"}. This can be implemented with Postgres' json_object_agg but PostgREST wouldn't know to use that. Sounds like a job for data reps.

CREATE TABLE config (
    key VARCHAR PRIMARY KEY,
    value VARCHAR
);

INSERT INTO config (key, value) VALUES
    ('key1', 'value1'),
    ('key2', 'value2'),
    ('key3', 'value3');

CREATE DOMAIN "application/json-dict" AS text;
CREATE OR REPLACE FUNCTION format_config(rows config[])
RETURNS "application/json-dict" AS $$
  WITH config_rows AS (SELECT * FROM unnest(rows))
  SELECT coalesce(json_object_agg(src.key, src.value), '{}')::varchar FROM config_rows AS src 
$$ LANGUAGE SQL;
CREATE CAST (config[] AS "application/json-dict") WITH FUNCTION format_config AS IMPLICIT;

-- Request comes in with `Accept: application/json-dict`
WITH pgrst_source AS (SELECT ARRAY(SELECT (key, value)::config FROM config) AS "inner")
SELECT format_config(_postgrest_t."inner")
FROM (SELECT "inner" FROM "pgrst_source") _postgrest_t;
-- { "key1" : "value1", "key2" : "value2", "key3" : "value3" }

So that works for my example, and now we have a nearly fully generalised top level query which is kind of cool! Sets and scalars are the same at the highest level. And now we have a clean obvious way for the developer to say "When asked to make REP out of THING[], call function x". And it's the same way as for THING essentially. Whether REP is a JSON array, or a dictionary, or even XML, CSV, protobuf, zip file.

Although even while writing this I got doubts.

  1. Does the ARRAY(subquery) -> unnest(x) round trip cause PostgreSQL to accumulate all the results in one go rather than streaming them through one row at a time?
  2. We'd have to move pagination, offsets and limits into the CTE since format_config (in this example) wouldn't know to do that. So that's a pretty major change.

In the pursuit of simplicity in one aspect, maybe I'm actually making other things more complex and potentially degrading performance at the same time. Hmm, yes, maybe we just press forward with aggregates despite all the time I took to write this up.

Yes. One thing to consider while search all possible paths: We need a maximum of one aggregate function in each path. Sometimes, having zero aggregates could work, too - but most of the time we need exactly one aggregate.

Yes that would be the case. I guess in theory you could have a mapping that turned an intermediate result back into a SETOF records? I don't see a use for that.

I couldn't find those reasons. What exactly should I not do and why?

Yes, sorry, it ran a little late when I was writing that and I never really finished the thought. I wanted to say that I think you generally want the types you return to be as specific as possible for the matching algorithm to find the best fit.

Like you said, in our downloads RPC example, */* may be the only choice for return type. But if the client requests image/*, we can't satisfy that request. Every image/* is a */* but not every */* is an image/*. The client asked for only images, we can't send them an audio file accidentally.

We can use our generalisation rules to select a more specific type like image/png and treat it as the less specific image/* or */*, but not the other way around. This holds even if the download with the given UUID is in fact an image. We don't know that at pathing time, which only relies on static information, and we aren't going to speculatively invoke the download function just to see if it returns an image.

The solution to this problem may be to have the dev make download return "image_or_any" and then alias that placeholder to both image/* and */*. Now it's like we have overloaded the return type of download(uuid) so it can return two different types, and we can route image/* requests to it. Then we would feel justified to invoke it. If the file turns out not to be an image and the function set a content type header not actually acceptable, we can throw away the response at the PostgREST level and instead return 406 Not Acceptable.

So this kind of combines static and dynamic routing, but only when explicitly described as possible by the developer, so we don't burn CPU speculatively invoking methods that can never make acceptable content.

But answering your question as stated: with data reps this cast would add a mapping of (text/csv, */*, noop), so yes, a client asking for */* would usually get text/csv if that's the type of an available input in the context, being the shortest path. The exception would be if there's an input literally of type */* available. Perhaps inadvisable, but technically possible; it would take precedence since that's an even shorter path.

That's certainly wanted. In my example 1a in my other comment above, I would still want the download() function, which returns */* to work, even when my default is text/csv for other endpoints through this cast.

Yep.

Let's say that's not possible, performant or convenient. Then that's where the order of mappings by priority that I mentioned plays in. I realise I didn't fully explain it. Basically, the resolver stops when it finds a solution, any solution, and it visits potential solutions in a certain inherent order: by path length, then by mapping order.

This makes it impossible to "bump" longer paths before shorter paths by manipulating anything remotely similar to "priorites" or "weights", right?

Right, shorter paths always win. But you can shorten your own paths if need be. Or just change your original return type and cast specifically on that. So I think in practice this apparent inflexibility in the scheme won't matter.

My gut feeling is that, it's important to consider path length, because otherwise the number of possible solutions would just grow and it could lead to some strange solutions that the developer didn't really intend. However, at the same time, I feel like for most regular scenarios you'd be working in a range of maybe 1 to 4 nodes in a path - and it's not immediately clear to me, that in this space, shorter paths are always better than longer paths. But, I haven't found a counter example, yet, so maybe it just really works.

Yes, the intuition for why shorter paths are "right" is because you get closer to the type actually requested. Fewer transformations is better.

We'll definitely want to set a maximum path length to prevent pathological behaviour, and perhaps that number is 4. As a dev using PostgREST, maybe a longer path length would allow you to implement a cleaner, more decomposed solution, but there's diminishing returns. And this limit only restricts your implementation, not outcome. You can still create an infinite number of domain types to effect any necessary transformation chain.

[1]: We need to make sure not to make any of our built-ins be accidental "shortcuts" that skips over types. One step at a time is best, like in image/png -> image/* -> */*, even if it's tempting to skip the middle step. This ensures maximum composability and flexibility for the developer to inject alternative paths.

So, you're saying we actually have */* <-- application/* <-- application/json <-- json <-- any as a built-in? I.e. with the intermediate application/* type?

Yes exactly, I think this makes it both easy to understand and easier to tinker with for the dev. Just as you go on to say below.

Speaking of that, to maximize composability, we should not have this full path as a built-in - but rather each edge seperately:

  • (application/*, */*, no-op)
  • (application/json, application/*, no-op)
  • (json, application/json, no-op)
  • (anyelement, json, json_agg)

This would allow a developer to replace just a single edge of that chain. Examples:

  • I can do CREATE CAST ("application/json" AS json) WITH FUNCTION json_strip_nulls; to remove all null values by default from my api response - but still use the same mimetype. How cool is that? Path length is the same, but the (json, application/json, ...) edge is replaced, because the custom one has priority.
  • Let's also consider (anyelement, jsonb, jsonb_agg) a built-in, even though we don't use it by default. This would allow us to simply do a CREATE DOMAIN "application/json" AS JSONB; to change our whole default json handling to jsonb types instead of json.
  • All other examples that I can think of.. would change the mimetype, so they would all boil down to CREATE CAST ("*/*" AS ...) ... in some way, if that new format was supposed to be my default format. The question is: Can I implement everything I'd like as a default format with 4 nodes or fewer? I assume the answer is yes.

Yes so I guess another way to conceptualise this is that since our generalisation rules go step by step, as a developer you select how "wide" your customisation is by where you insert your casts. Casting from */* is very powerful and can change the default behaviour of PostgREST completely. Meanwhile your json_strip_nulls is a significantly more tailored change.

I can't find a reason why we'd ever need AS IMPLICIT or anything like that to manage priorities. The only thing seems to be, that we should prioritize custom-made edges over built-ins.

Hmm yes, you might be right. When doing data reps we chose (it was with you I discussed it, I think) to always require AS IMPLICIT because it clearly signals you want these transformations to happen as automatically as possible. I think that's still true. But I can't think of a scenario where you'd need more than two priorities: built-in and custom.

What does PostgREST actually produce?

I know this discussion has become very wide and deep already but just one more thought that occurred to me: do we really output JSON by default or do we output text? Our top level expression for the output is json_agg(_postgrest_t)::character varying. Where does that final ::character varying fit into the picture? Where does it "come from" in the context of content negotiation? If we had a data rep that produced binary data, would we want something else?

Like we touched upon binary JSON like jsonb. Is that a different content type or is it a different encoding? We talked about encoding images with base64. That definitely feels like an encoding. But base64 doesn't fit into the Content-Encoding header because you can very well have compression on top of that (and you should, base64 compresses very well). I don't think Content-Encoding and the corresponding Accept-Encoding headers hold the answer.

So how do we choose our top level cast once we're done with our transformation chain? Is there an implicit transformation somewhere in */* <-- application/* <-- application/json <-- json <-- any from the gestalt/logical JSON document to its UTF-8 wire form? And how would you override it to be a binary wire format?

@steve-chavez
Copy link
Member Author

steve-chavez commented Mar 29, 2023

What does PostgREST actually produce?

Our top level expression for the output is json_agg(_postgrest_t)::character varying. Where does that final ::character varying fit into the picture?

I think that's safe to remove now and can be ignored for this feature. It was probably added to support an older libpq or postgres version (likely related to failed Hasql decoding).

Just to elaborate. character varying is an alias for VARCHAR(ref). pg recommends using TEXT over VARCHAR(ref). In fact we use text instead of character varying in another fragment:

asJsonSingleF :: Bool -> SqlFragment
asJsonSingleF returnsScalar
  | returnsScalar = "coalesce((json_agg(_postgrest_t.pgrst_scalar)->0)::text, 'null')"
  | otherwise     = "coalesce((json_agg(_postgrest_t)->0)::text, 'null')"

If we had a data rep that produced binary data, would we want something else?

On #2701, I experimented with adding a binary aggregation(ref) disregarding the character varying(ref) and it worked just fine.


To avoid confusion, I'll try to remove the character varying in another PR, if there's trouble I'll change it to text.


Edit: Done on #2726. Our Hasql decoding is bytea, which supports both bytea and text just fine.

So I don't think we should change any assumption made for the feature discussed here so far.

@steve-chavez
Copy link
Member Author

This is so awesome, I think I'm starting to grasp the idea.

  • (application/*, */*, no-op)
  • (application/json, application/*, no-op)
  • (json, application/json, no-op)
  • (anyelement, json, json_agg)

Right now we don't actually support an application/*, and we fail when the client requests that, but it can be added.

So besides the application/json builtin, we also have..

builtin application/vnd.pgrst.object+json

  • (json, application/vnd.pgrst.object+json, no-op)
  • (anyelement, json, json_first_agg)

json_first_agg really is json_agg(_postgrest_t)->0(ref), which reminds me a bit of the first_last_agg extension.
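
(If it ever were a real aggregate rather than the inlined json_agg(...)->0 fragment, a minimal sketch could look like the following; json_first_acc is a made-up helper name:)

-- keeps the JSON of the first row it sees and ignores the rest
CREATE FUNCTION json_first_acc(state json, elem anyelement) RETURNS json
LANGUAGE SQL AS $$ SELECT coalesce(state, to_json(elem)) $$;

CREATE AGGREGATE json_first_agg(anyelement) (
  SFUNC = json_first_acc,
  STYPE = json
);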

builtin application/geo+json

  • (json, application/geo+json, no-op)
  • (anyelement, json, geojson_agg)

The geojson_agg is a builtin(ref), but it is (or should be) an aggregate in the end. We've had an issue regarding modifying its format, so this could very well be done with this feature. We might also consider dropping it in a later release.

builtin text/csv

  • (text, text/csv, no-op)
  • (anyelement, text, csv_agg)

csv_agg is a more complicated builtin(ref). At one point I proposed implementing a csv_agg to pgsql-hackers, but this was rejected. Still, it sounds like it should be an aggregate.

custom text/html

So for my use case of a custom text/html, I have two options.

1 - Without overriding the default */*. In this case I'd have to define:

  • (text, text/html, no-op) // CREATE DOMAIN "text/html" + CAST
  • (projects, text, projects_html_agg) // CREATE AGGREGATE projects_html_agg(projects) RETURNS text

It seems it could also be done with just one edge?

  • (projects, text/html, projects_html_agg) // CREATE AGGREGATE projects_html_agg(projects) RETURNS "text/html"

2 - Overriding the default */*. It could be done like:

  • (text/html, */*, no-op) // CREATE DOMAIN "text/html" + CREATE DOMAIN "*/*" + CAST
  • (text, text/html, no-op) // CAST
  • (projects, text, projects_html_table_agg) // CREATE AGGREGATE projects_html_agg(projects) RETURNS text

Or just two edges?

  • (text/html, */*, no-op) // CREATE DOMAIN "text/html" + CREATE DOMAIN "*/*" + CAST
  • (projects, text/html, projects_html_table_agg) // CREATE AGGREGATE projects_html_agg(projects) RETURNS "text/html"

In both of these cases we wouldn't be touching any other resource because the aggregate only takes the projects table, but we could be more generic with an html_table_agg(anyelement), which would require different handling since it would affect all the other resources. This shouldn't be a problem though, it can be left to the developer.
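
Spelled out in real syntax (again, the RETURNS shorthand maps to the final function's return type), the most direct option above, (projects, text/html, projects_html_agg), could look roughly like this sketch. It assumes a projects(id int, name text) table, the helper names are made up, and no HTML escaping is attempted:

CREATE DOMAIN "text/html" AS TEXT;

-- state transition: append one table row
CREATE FUNCTION html_row_acc(state TEXT, p projects) RETURNS TEXT
LANGUAGE SQL AS $$
  SELECT coalesce(state, '')
      || format('<tr><td>%s</td><td>%s</td></tr>', (p).id, (p).name)
$$;

-- final function: wraps the rows and provides the "text/html" result type
CREATE FUNCTION html_table_fin(state TEXT) RETURNS "text/html"
LANGUAGE SQL AS $$
  SELECT ('<table>' || coalesce(state, '') || '</table>')::"text/html"
$$;

CREATE AGGREGATE projects_html_agg(projects) (
  SFUNC = html_row_acc,
  STYPE = TEXT,
  FINALFUNC = html_table_fin
);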


I can't find a reason why we'd ever need AS IMPLICIT or anything like that to manage priorities. The only thing seems to be, that we should prioritize custom-made edges over built-ins.

I really like that, not using the IMPLICIT clause makes this so much cleaner.


So far it seems this ticks all the boxes. I think we can start a draft implementation.

@aljungberg
Copy link
Contributor

This is so awesome, I think I'm starting to grasp the idea.

  • (application/*, */*, no-op)
  • (application/json, application/*, no-op)
  • (json, application/json, no-op)
  • (anyelement, json, json_agg)

Right now we don't actually support an application/*, and we fail when the client requests that, but it can be added.

Yes, for every specific content type we can produce using built-in rules, we would add built-in generalisation rules. So in this case, we'd go all the way from application/json to application/* and */*, and put that first in the resolution map section for built-ins, making application/json in effect the default return type of PostgREST, while allowing the dev to override that with ease.

As a practical matter, we can choose whether to make these generalisation transforms explicit or implicit. For the latter, we can just assume them in the content negotiator/data rep resolver. But I think the cleanest way is explicit: to literally add these "do nothing" mappings to the data rep type conversion maps. Then we have fewer special cases in the code and debuggability seems better because every step in the path will always be chosen from a list you can print and look at.

If using explicit rules we have to take some care so we also generate the same kind of explicit type relaxation rules for user-defined types. So if the user defines banana/yellow and banana/brown, in that order[1], we would find them when scanning for relevant domains and infer (banana/brown, banana/*, no-op) as well as (banana/*, */*, no-op) (you might ask the dev to remember to do this but we shouldn't give them needless busywork). Using implicit rules, any transform from x/y to x/* would just be a regex match, so no extra care needed with that approach.

(In my view, figuring out what you want versus what you have is 80% of programming, so making things concrete and inspectable is important.)

builtin application/vnd.pgrst.object+json

  • (json, application/vnd.pgrst.object+json, no-op)
  • (anyelement, json, json_first_agg)

json_first_agg really is json_agg(_postgrest_t)->0(ref), which reminds me a bit of the first_last_agg extension.

Oh yeah, perfect. So again I see this change as simplifying by formalising our data transformations. Having a built-in rule that says, "use json_first_agg in this case" is so much more discoverable and clear than finding two lines in SqlFragment.hs after manually tracing the code.

With this change, application/vnd.pgrst.object+json would barely even be a special case in our code. It could work with nothing more than a single addition to our built-in map list, and the corresponding json_first_agg implementation.

I guess in practice we also want to add LIMIT 1 automatically so we don't aggregate a whole table to just select the first row. Can that be done inside of json_first_agg? If it can, application/vnd.pgrst.object+json would truly stop being a special case at all as far as the Haskell code is concerned. Just another built-in data rep.

On that note: json_first_agg isn't a real function, like you said. So how do we make that work? I see 3 possibilities.

  1. PostgREST automatically creates this aggregate function in the database if it doesn't see it during schema cache loading. Normally PostgREST doesn't change your schema, so I'm not sure this feels right. We would namespace it of course, but now PostgREST would require a bunch of extra permissions to function, and leave artefacts behind. Plus if we have a bug in one of these functions we also need to mind automatic migration of them to implement fixes which will require even more permissions because now we have to delete the old functions too.
  2. We could ask the developer to insert our "library" of SQL functions as a prerequisite for some subset of our functionality. If they choose not to, PostgREST would degrade gracefully, but application/vnd.pgrst.object+json would be disabled. This would mean we'd actually be less powerful than before, out of the box, so also not great.
  3. We could still use SQL fragments, while still preserving the generalisation and unified approach to transformation. We'd have a predefined set of built-in operations which we inline as needed when building the final query, keyed on special sentinel names in the data rep maps. Like our built-in rule would say (anyelement, json, _postgrest_builtin.json_first_agg), _postgrest_builtin.json_first_agg being a placeholder rather than a real function. It gets replaced with canned code (json_agg(x)->0) when data rep is outputting the transformation chain.

builtin application/geo+json

  • (json, application/geo+json, no-op)
  • (anyelement, json, geojson_agg)

The geojson_agg is a builtin(ref), but it is(should be) an aggregate in the end. We've had an issue regarding modifying its format, so this could be very well done with this feature. We might also consider dropping it in a later release.

Right so this is a great example of this feature empowering PostgREST users. They want something that works differently than our bundled json_object ... ST_AsGeoJSON operation? They can just provide a custom function that overrides our (anyelement, json, geojson_agg) built-in mapping. So then we can close #2699. That issue and every future such issue just goes away because devs can easily self serve. They'll be happy, we'll be happy.

custom text/html

  • (projects, text/html, projects_html_agg) // CREATE AGGREGATE projects_html_agg(projects) RETURNS "text/html"

Yep. As a developer you can just go straight to it, (projects, text/html, project_html_agg). Simple, obvious.

2 - Overriding the default */*. It could be done like:

So in your example with projects_html_agg, you don't have to do anything at all to make */* produce text/html by default. For this example, let's assume you've got two mappings from projects: one to text/csv via one function, and another to text/html via another, both user-defined. You want to control what happens when Accept: */* is given while requesting projects. This is actually already defined: whichever mapping (CAST) from the projects content type you defined last wins[1].

[1]: Edges in the path algorithm have priorities. User-defined trump built-in. When there are multiple user-defined casts, we should always choose the last one because anything else would be unintuitive. In natural language if I say, "bananas should be represented in the style of Van Gogh" and then later say, "bananas should be represented by jazz improv", you'd assume I changed my mind. Also it fits conceptually. Built-in rules are defined "first", being present at "compile time", and replaced by user-defined rules at "run time". And just in the same way, additional user-defined rules trump what went before.[2] As a practical matter, the way we implement this is that the values in the map from type X are an insertion-ordered set which we seed with our built-in rules, then process in reverse order when pathing. E.g. {"banana/brown": [(banana/*, no-op), (banana/yellow, user_defined_banana_painter_fn)]}. This means this user defined banana refresher gets a chance to create a valid path before the built-in banana generaliser.
[2]: In the context of how projects get represented, text/html or text/csv, you as a developer have not actually replaced the first with the second, both exist at once and the first will be used if the user explicitly requests text/html. The reason you get text/csv for */* is because of the ordering mentioned[3], not because you replaced a rule. Replacing rules is also possible and in that case no ordering is necessary: if you give another (projects, text/csv, x) tuple, we overwrite the previous one because it is never useful for us to have two edges of the same priority between the same pair of types. If the output type works it will always end up working with the first edge, if it does not work it will never work with any edge for that pair.
[3]: the path finder starts from the entry point so for an accept of */* we're asking it to solve how to represent projects as */* and it will start and end with the path projects -> text/csv -> text/* -> */*. At that first step we could have chosen text/html but for the ordering rule.

@steve-chavez
Copy link
Member Author

Sorry for letting this languish. I had to prioritize other issues due to work.

I'm starting implementation on #2825. For now the plan is:

  • First stage: support only specific media types, no */* or <media>/*. No overriding the default json handling for */*. No path finding at this stage.

    • This should make it easier to start the feature while also covering many use cases.
    • I believe there are other unexplored options. Some frameworks for example rely on the endpoint suffix for determining the default handler (like /projects.json). This can be discussed later.
    • Having media as types makes a lot of sense but */* or <media>/* are not really media types. This could fit better as a server config (maybe using in-db config) since content negotiation is server behavior really.

    CREATE FUNCTION download(file UUID) RETURNS "*/*"

    The solution to this problem may be to have the dev make download return "image_or_any" and then alias that placeholder to both image/* and */*.
    If the file turns out not to be an image and the function set a content type header not actually acceptable, we can throw away the response at the PostgREST level and instead return 406 Not Acceptable.
    So this kind of combines static and dynamic routing, but only when explicitly described as possible by the developer, so we don't burn CPU speculatively invoking methods that can never make acceptable content.

    • I agree with the above and I think we should avoid dynamic media types if possible. I believe what the download case really needs is a sum type (no support on pg unfortunately) for the files.data column(ref); then we'd return that column type from the function instead of */*.

      But this can be left for later. It also seems like an edge case.

  • Second stage: media type aliasing. This might be done with CASTs or perhaps we should consider just detecting functions that go from media -> media. I've noticed that CASTs between domains log WARNINGs, which might scare some users.

  • Third stage: overriding default for */*. More complex content negotiation. Path finding.

So all things considered seems implementation can be started on the First stage without causing breaking changes on later stages. This way we avoid analysis paralysis while also closing some issues.


Seems we can use domain based media types for Content-Type too: #2826

@steve-chavez
Copy link
Member Author

Closing as this got implemented on #3076

Development

Successfully merging this pull request may close these issues.

Binary Endpoints