Enable embedding through multiple layers of views recursively #1625

wolfgangwalther · 2020-10-15T19:29:03Z

Resolves #1607.

Changes the allSourceColumns query to recurse, when the base relation is still a view or materialized view. The query now returns the "true" base columns through multiple view layers, enabling the embedding of views of views of... you know!
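The recursion described above can be sketched outside of SQL. This is only a minimal illustration of following view layers down to a base table, not the actual allSourceColumns query; the catalog dictionaries and all schema, view, and column names are made up for the example.

```python
# Hypothetical catalog data; schema, view, and column names are illustrative only.
# Maps (view, column) -> (relation, column) that the view column selects from.
VIEW_SOURCES = {
    ("api.films", "id"): ("internal.films_v", "id"),
    ("internal.films_v", "id"): ("data.films", "id"),
}
# Relations that are views or materialized views (base_kind 'v' or 'm').
VIEWS = {"api.films", "internal.films_v"}


def base_column(relation, column, seen=None):
    """Follow view layers until the relation is a real table."""
    if seen is None:
        seen = set()
    if relation not in VIEWS:
        return relation, column
    key = (relation, column)
    if key in seen or key not in VIEW_SOURCES:
        return None  # cycle guard, or a column with no traceable source
    seen.add(key)
    next_relation, next_column = VIEW_SOURCES[key]
    return base_column(next_relation, next_column, seen)


print(base_column("api.films", "id"))
```

With this sample data, the view-of-a-view column resolves to ('data.films', 'id'), which is the kind of "true" base column the recursive query is meant to return.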

steve-chavez · 2020-10-16T15:47:50Z

Interesting! 👀

One thing we need to check with changes to DbStructure is if they considerably slow down our startup time.

We currently test this manually (Remo and I) with a big sample schema that we share privately.
Would you like to test this yourself? If so, you can shoot me an email at stevechavezast@gmail.com and I'll share the instructions with you.

wolfgangwalther · 2020-10-16T19:07:16Z

I used the "big schema" and did some performance testing. Since I only changed the query, I ran it directly on the database. I used the following function to run 100 times - alternating both queries - and compute the average plan and execution time for each query.

-- Runs every query n times via EXPLAIN ANALYZE and returns, per query,
-- the average planning and execution time in milliseconds.
create or replace function timeit (n int, variadic queries text[], out plan_avg numeric[], out execution_avg numeric[])
language plpgsql as $$
declare
  i int;
  j int;
  explained json;
  plan_sum numeric[];
  execution_sum numeric[];
begin
  for i in 1..n loop
    -- alternate between the queries within each iteration
    for j in array_lower(queries, 1)..array_upper(queries, 1) loop
      execute 'explain (analyze, format json) ' || queries[j] into explained;
      -- accumulate the reported timings and keep a running average
      plan_sum[j] := coalesce(plan_sum[j], 0) + (explained->0->>'Planning Time')::numeric;
      plan_avg[j] := plan_sum[j] / i;
      execution_sum[j] := coalesce(execution_sum[j], 0) + (explained->0->>'Execution Time')::numeric;
      execution_avg[j] := execution_sum[j] / i;
    end loop;
  end loop;
end
$$;

select * from timeit(100,
$q$ ... old query ... $q$,
$q$ ... new query ... $q$);

The results are as follows:

query | plan_avg | execution_avg
old   | 0.56 ms  | 395.3 ms
new   | 0.63 ms  | 399.7 ms

I repeated that and had the same difference (~+5ms for the new query).

So, as expected the recursion is a bit slower. But really not much.

As a side note:

The old query returns 2208 rows
The new query returns 2181 rows
The new query without where base_kind not in ('v', 'm') returns 2490 rows

The new query returns fewer rows, because rows that have a view as a base table are removed - those are useless anyway. However, without removing those rows, the new query returns almost 300 rows more, so there are plenty of "views of views" in that schema. To quantify that, I ran the following query:

with
q1 as (...),
q2 as (...)
select * from q2
except
select * from q1

That's all base columns that have been detected by the new query, that had not been detected before. The query returns 188 rows.
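The except comparison is a plain set difference. A minimal sketch with made-up rows (the tuples stand in for whatever columns q1 and q2 return; none of these names come from the real schema):

```python
# Hypothetical (view, view_column, base_table, base_column) result rows.
old_rows = {("api.films", "id", "data.films", "id")}
new_rows = old_rows | {("api.top_films", "id", "data.films", "id")}


def newly_detected(q1, q2):
    """Rows returned by q2 but not by q1, like: select * from q2 except select * from q1."""
    return q2 - q1


print(newly_detected(old_rows, new_rows))
```

Like except in SQL, the set difference returns distinct rows only, so counting its result corresponds to counting newly detected base columns.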

Trading 5ms cache reload time for 188 new opportunities* to embed stuff. Anyone?

*well not really. not nearly all of those columns are unique or referenced at all in a fk constraint...

wolfgangwalther · 2020-10-16T20:45:08Z

> *well not really. not nearly all of those columns are unique or referenced at all in a fk constraint...

Thinking this further... we don't need to return all source columns from this query, but only those that are used in any foreign key constraint on either side. Correct?

I changed both the old and the new query to return only those columns:

The old query now returns 543 rows
The new query returns 519 rows (the same logic as above applies here)
The except query returns a difference of 24 columns

Those 24 columns now really represent new opportunities to embed.

And even better - those queries perform better than ever:

query        | plan_avg | execution_avg
master       | 0.54ms   | 387.7ms
recursive    | 0.62ms   | 393.0ms
master_fk    | 1.16ms   | 362.2ms
recursive_fk | 1.06ms   | 362.0ms

Now, the recursive query is just as fast as the one without recursion - but both are about 7% faster than before.
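For reference, the "about 7%" can be reproduced from the execution averages in the table above. The helper name is only for illustration.

```python
# Execution averages in ms, copied from the table above.
timings_ms = {
    "master": 387.7,
    "recursive": 393.0,
    "master_fk": 362.2,
    "recursive_fk": 362.0,
}


def speedup_percent(before, after):
    """Relative reduction in execution time, in percent."""
    return (before - after) / before * 100


master_gain = speedup_percent(timings_ms["master"], timings_ms["master_fk"])
recursive_gain = speedup_percent(timings_ms["recursive"], timings_ms["recursive_fk"])
print(f"master: {master_gain:.1f}%  recursive: {recursive_gain:.1f}%")
```

This gives roughly 6.6% for the non-recursive query and 7.9% for the recursive one.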

All tests were run on PG 12.

wolfgangwalther · 2020-10-17T14:46:17Z

Reordered the whole query a bit and replaced regexp_split_to_array with string_to_array, which is a lot faster. The query now comes in at around 220-230ms, so almost twice as fast.
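As a loose analogy in Python (not the PostgreSQL functions themselves): when the delimiter is a fixed string, a plain split produces the same parts as a regex split without involving the regex engine. The function names and the sample string are made up for the example.

```python
import re


def split_with_regex(text, delimiter):
    """Split on a literal delimiter using the regex engine."""
    return re.split(re.escape(delimiter), text)


def split_plain(text, delimiter):
    """Split on a literal delimiter without any regex machinery."""
    return text.split(delimiter)


sample = "id,film_id,actor_id"
print(split_with_regex(sample, ","))
print(split_plain(sample, ","))
```

Both calls return the same list here, so the swap only changes how the split is done, not what it returns.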

Should be good to merge.

steve-chavez · 2020-10-17T15:32:10Z

Mindblown! 💥 🤯

Amazing work here Wolfgang. I wonder what @monacoremo thinks 😄.

We might just call this new version The Performance Release.

I'm going to give it a review and then merge. It's all looking good though, I'm just going to digest and marvel at how the new queries work :)

monacoremo · 2020-10-17T17:27:57Z

Great stuff!!! Like where we are going with this :-)

wolfgangwalther · 2020-10-18T09:19:19Z

Rebased on current master and did some small renaming for less of a diff and better readability.

Glad you guys like it. I'm not sure whether I'm digging my own grave here with the performance stuff, because now it will be very hard for me to get my "transform to json" approach on the same level... ;)

wolfgangwalther · 2020-10-23T21:28:23Z

Rebased on master.

Added another test as well, which is currently still failing. This is about a chain of views, where one of the views in the middle is not in the public schema. Will have to see how best to fix this while keeping performance.

wolfgangwalther · 2020-10-23T23:02:49Z

Embedding is now possible through hidden views in the middle of the chain as well. This does have the same performance on the big schema, but that's mainly because there are no hidden views in there. The query, as it is now, has to parse all views in the database to detect relationships through hidden views. In a setup with many hidden, but unrelated views this could increase the time for this query.

steve-chavez · 2020-10-25T02:26:47Z

@wolfgangwalther I've finally understood your great work here!

The query is looking solid, but I think we can clarify the code a bit (for example, allSourceColumns is no more; it should be something like allSourceFkColumns).

I'll do some reviews, but first..

> Embedding is now possible through hidden views in the middle of the chain as well.

Would this cover a realistic use case?

IIUC, it would imply that a user creates public views for their API and then creates private views based on those public views.

That wouldn't make sense according to schema isolation. The idea is to work inside out, selecting private tables or views to be exposed as public.

> The query, as it is now, has to parse all views in the database to detect relationships through hidden views. In a setup with many hidden, but unrelated views this could increase the time for this query.

The price to pay seems a bit too high. PostgREST can be used on legacy databases and those can have a potentially large number of views.

wolfgangwalther · 2020-10-25T09:36:31Z

> The query is looking solid, but I think we can clarify the code a bit (for example, allSourceColumns is no more; it should be something like allSourceFkColumns).

I'll change that together with everything else you find in your review :)

> Would this cover a realistic use case?
>
> IIUC, it would imply that a user creates public views for their API and then creates private views based on those public views.
>
> That wouldn't make sense according to schema isolation. The idea is to work inside out, selecting private tables or views to be exposed as public.

I did this because this was exactly what led me to this issue. It is the other way around, though. My scenario:

hidden schema data with tables
hidden schema hidden with some views as an abstraction on the data tables (some code that is shared between some of my public views)
public schema public with views that use those in hidden

So this is exactly the inside-out approach you're describing.

Before my latest change the view detection would only take those views in the public schema, but not those in the hidden schema. The table in data would never be found.

> The price to pay seems a bit too high. PostgREST can be used on legacy databases and those can have a potentially large number of views.

I think, in fact, it probably only makes sense to do it like this. I would assume that it is more likely to have such a "view-chain" when using something like I did above (using views as another layer between data tables and public views).

But maybe we can do something in between: So far we have either looked for views only in the schemas mentioned in db-schema or in all schemas everywhere. We could just look for views that are in the search path, so db-schema + db-extra-search-path. In this case I could add hidden to the search path (which is something that I already did anyway). Users can then still control how many views are exposed like this.
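A rough sketch of that idea, assuming the set of allowed schemas is simply db-schema plus db-extra-search-path. The data structures and the public.films_api / hidden.films_v names are hypothetical; only the three-schema layout mirrors my scenario above.

```python
def schema_of(relation):
    """Return the schema part of a 'schema.name' identifier."""
    return relation.split(".", 1)[0]


def resolve(relation, column, sources, views, allowed_schemas):
    """Follow view layers, but only through views in the allowed schemas."""
    visited = set()
    while relation in views:
        key = (relation, column)
        if schema_of(relation) not in allowed_schemas or key in visited:
            return None  # view outside the search path (or a cycle): stop here
        visited.add(key)
        if key not in sources:
            return None  # no traceable source column
        relation, column = sources[key]
    return relation, column


# Hypothetical chain: public view -> hidden view -> data table.
sources = {
    ("public.films_api", "id"): ("hidden.films_v", "id"),
    ("hidden.films_v", "id"): ("data.films", "id"),
}
views = {"public.films_api", "hidden.films_v"}

db_schema = ["public"]
db_extra_search_path = ["hidden"]
print(resolve("public.films_api", "id", sources, views, set(db_schema + db_extra_search_path)))
print(resolve("public.films_api", "id", sources, views, set(db_schema)))
```

With hidden in the search path the chain resolves to the table in data; without it, the hidden view is not parsed and nothing is found.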

steve-chavez · 2020-11-01T03:49:59Z

> hidden schema data with tables
> hidden schema hidden
> public schema public with view that use those in hidden
> So this is exactly the inside-out approach you're describing.

Just got it! I was a bit misled by the tests. That definitely coincides with the schema isolation approach.

> We could just look for views that are in the search path, so db-schema + db-extra-search-path

That sounds right! I agree with that change.

wolfgangwalther · 2020-11-02T11:06:31Z

Wolfgang:

> Thinking this further... we don't need to return all source columns from this query, but only those that are used in any foreign key constraint on either side. Correct?

Steve:

> (for example allSourceColumns is no more, it should be something like allSourceFkColumns).

Looking at the code, my earlier assumption is wrong. It is not only the FK columns that we need, but also the PK columns of the base tables, because we have this line:

postgrest/src/PostgREST/DbStructure.hs

Line 58 in 4b4a622

keys' = addViewPrimaryKeys srcCols keys

We're missing a test-case here, because my change should have made at least one of them fail. I guess all relevant PKs in our test-schema are also used in some kind of foreign key relationship. I will try to make up a test-case to make sure we're not breaking anything.
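To illustrate the gap with made-up data (the table and column names are hypothetical, and this is not the DbStructure code): filtering on FK columns alone drops a primary key that takes part in no foreign key, which addViewPrimaryKeys would then be missing.

```python
# Hypothetical (table, column) pairs.
source_columns = [
    ("data.films", "id"),
    ("data.roles", "film_id"),
    ("data.settings", "key"),
    ("data.films", "title"),
]
fk_columns = {("data.films", "id"), ("data.roles", "film_id")}  # either side of an FK
pk_columns = {("data.films", "id"), ("data.settings", "key")}


def needed_columns(columns, fks, pks=frozenset()):
    """Keep only the columns that are part of a foreign key or a primary key."""
    return [c for c in columns if c in fks or c in pks]


print(needed_columns(source_columns, fk_columns))              # FK-only filter
print(needed_columns(source_columns, fk_columns, pk_columns))  # FK and PK filter
```

The FK-only result lacks ("data.settings", "key"), so a view on that table would lose its primary key information.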

wolfgangwalther · 2020-11-02T11:52:16Z

We just don't have any case to test inserting into a view and getting a location header back. At least I couldn't find any. I'm pretty sure this works already, so we need to add that.

steve-chavez · 2020-11-02T19:09:28Z

> it is not only the FK columns that we need, but also the PK columns of the base tables
> We just don't have any case to test inserting into a view and getting a location header back

@wolfgangwalther You're right, I also missed that one. A new test would be great.

wolfgangwalther · 2020-11-03T06:59:59Z

Rebased on #1644, which we should merge first. This shows that the new test is indeed breaking with the current changes here - will fix.

wolfgangwalther · 2020-11-06T08:30:19Z

Rebased on master, fixed the missing pks and did the todos from above.

steve-chavez · 2020-11-09T07:43:49Z

Sorry for the late response here @wolfgangwalther. I wanted to make sure we get a nightly for your PR, so I did #1648 first.

I'll review your latest changes tomorrow :)

steve-chavez · 2020-11-16T22:07:27Z

Just FYI. I've released a nightly (2020-11-15-14-58-6e04fe7) with this change.

Also, we definitely need to document the db-extra-search-path interaction with hidden/middle views. I think a subsection under Embedding Views would do.

wolfgangwalther · 2020-11-17T08:03:50Z

Sweet. Just started testing the nightly in my current project. The first thing that happened was that some embeddings that I had in place before suddenly returned 30x, because there were 5 additional m2m relationships found to embed on.... :D
So this is definitely something that we need to point out for the next release as well, just as discussed in #1593.

steve-chavez · 2020-11-17T15:26:23Z

> suddenly returned 30x, because there were 5 additional m2m relationships

Oh, that's bad. And #1593 is likely to bring more possibilities.

The m2m's are something I was thinking about some time ago.
Right now, many tables can be treated as junctions even when they're not really meant to be.

Perhaps we can select junctions for embedding in a stricter way? Like when they only have FKs and no additional columns?

We could even test this with a config option like db-strict-junctions.
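A minimal sketch of what such a stricter check could look like, assuming a junction counts as "strict" when it has at least two FK columns and no other columns. The function and the sample tables are hypothetical; db-strict-junctions is only a proposed option at this point.

```python
def is_strict_junction(columns, fk_targets):
    """columns: all column names of the table.
    fk_targets: maps each FK column name to the table it references."""
    only_fk_columns = all(col in fk_targets for col in columns)
    return only_fk_columns and len(fk_targets) >= 2


# Hypothetical tables: a pure junction and one with an extra column.
roles_fks = {"actor_id": "actors", "film_id": "films"}
print(is_strict_junction(["actor_id", "film_id"], roles_fks))
print(is_strict_junction(["actor_id", "film_id", "character"], roles_fks))
```

The first table qualifies, the second does not, because character is not part of any foreign key.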

wolfgangwalther · 2020-11-18T09:20:52Z

> Oh, that's bad. And #1593 is likely to bring more possibilities.

My reaction was more like: "YES! It worked :D". Wasn't that bad, as I have automated tests in place for all embeddings I'm using and it was just a simple hint to add.

> The m2m's are something I was thinking about some time ago.
> Right now, many tables can be treated as junctions even when they're not really meant to be.
>
> Perhaps we can select junctions for embedding in a stricter way? Like when they only have FKs and no additional columns?
>
> We could even test this with a config option like db-strict-junctions.

I thought about this a bit. I agree that something needs to be done about m2m relationships and junction tables. Not sure whether I'd go in the same direction, though.

In my scenarios so far, I almost always had some other columns on those junction tables. I can't use m2m embedding for those. I do want a "flat" output and not something nested, so I ended up creating views that join the junction and target table. This view is then easily (now!) embeddable.

However, if there was a way to do this "flat m2m embedding" in the query syntax, that would be even better.

So this boils down to: In the current implementation m2m embedding only makes sense with "strict" junction tables anyway - so the limitation would sure help to reduce ambiguity. But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead?

steve-chavez · 2020-11-18T16:26:04Z

> So this boils down to: In the current implementation m2m embedding only makes sense with "strict" junction tables anyway

Yes, I also believe that, the non-fk columns are lost anyway.

> But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead?

Maybe we could use the cardinality hint I mentioned on #1643 (comment).

So, by default only strict junctions are automatically detected, but extended junctions can be used for M2M embedding by adding an m2m hint.

wolfgangwalther · 2020-11-19T08:14:03Z

> But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead?
>
> Maybe we could use the cardinality hint I mentioned on #1643 (comment).
>
> So, by default only strict junctions are automatically detected, but extended junctions can be used for M2M embedding by adding an m2m hint.

I like that. It would reduce complexity for the most common cases, but still allow all of those we currently do.

Do we need to support adding multiple hints in this case? I think yes, because otherwise we can't tell which m2m to use.

To avoid name collisions with columns or fks named "m2m" / "o2m" / "m2o", maybe we should use another operator for the cardinality hint and separate the two types of hints? Could be easier to parse as well?

> Yes, I also believe that, the non-fk columns are lost anyway.

Do you think there'd be any way we can change this? I think this has been requested in the past as well. Maybe we can just allow selecting columns from the junction table inside a m2m relationship to flatten the output instead of having two nested embeds. Flattening could also be useful for m2o relationships.

Flattening would have to work differently for m2o and m2m relationships:

in a m2o relationship, we'd want to move the embedded columns to the "outer" level. I suggest something like this:
GET /films?select=title,director:directors.last_name,directors()
returns

[
  { "title": "Workers Leaving The Lumière Factory In Lyon",
    "director": "Lumière"
  },
  { "title": "The Dickson Experimental Sound Film",
    "director": "Dickson"
  },
  { "title": "The Haunted Castle",
    "director": "Méliès"
  }
]

in the m2m relationship, we could then use ".." instead of "directors." to qualify the column, so something like this:
GET /actors?select=first_name,last_name,films!m2m.roles(title,..character)
character is then taken from the junction table.
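To make the intended output shapes concrete, here is a small sketch of both kinds of flattening applied to already-fetched rows. It only illustrates the result format of the two suggestions above; nothing like this exists in PostgREST, and the helper names, the roles rows, and the "Narrator" value are invented for the example.

```python
def flatten_m2o(rows, embed, column, alias):
    """Move one column of an embedded m2o object to the outer level."""
    flattened = []
    for row in rows:
        outer = {k: v for k, v in row.items() if k != embed}
        outer[alias] = (row.get(embed) or {}).get(column)
        flattened.append(outer)
    return flattened


def flatten_m2m(junction_rows, targets, fk, target_columns, junction_columns):
    """Merge selected junction columns into each related target row."""
    result = []
    for link in junction_rows:
        target = targets[link[fk]]
        item = {c: target[c] for c in target_columns}
        item.update({c: link[c] for c in junction_columns})
        result.append(item)
    return result


# Nested m2o input, shaped like a regular embed of directors into films.
films = [{"title": "The Haunted Castle", "directors": {"last_name": "Méliès"}}]
print(flatten_m2o(films, "directors", "last_name", "director"))

# Hypothetical junction (roles) and target (films) rows for one actor.
roles = [{"film_id": 1, "character": "Narrator"}]
films_by_id = {1: {"title": "The Haunted Castle"}}
print(flatten_m2m(roles, films_by_id, "film_id", ["title"], ["character"]))
```

The first call yields the flat title/director object from the JSON above; the second yields one object per film that carries title together with character from the junction row.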

steve-chavez · 2020-11-19T22:15:52Z

Regarding flattening, I came up with a proposal here: #1233 (comment). Would that be enough for the use cases you have in mind?

> Do we need to support adding multiple hints in this case?

Yes, that's also something I considered on the first disambiguation proposal. It was like:

GET /<table>?select=*,<table>!<cardinality>!<fk>(*)

> To avoid name collisions with columns or fks named "m2m" / "o2m" / "m2o", maybe we should use another operator for the cardinality hint and separate the two types of hints?

I thought those names are rare enough to not be used as columns/fks. But we can also warn about this in the docs.

Not sure about more operators. I used the ! because we are short on URL-safe chars to use as operators (it also reminded me of the !important in css). IIRC, the only other char that was available was $, but pg identifiers can have $ so I discarded it.

wolfgangwalther · 2020-11-20T19:16:06Z

> Regarding flattening, I came up with a proposal here: #1233 (comment). Would that be enough for the use cases you have in mind?

If I read that correctly, that will only work for the embedding of m2o relationships. The example you gave there was a m2o->m2o relationship, right?

Regarding m2o relationships our proposals differ in two aspects:

Use of "->" vs ".": That's just a matter of choice, I guess. I feel like "." more closely resembles the SQL ("table.column"), but "->" would work as well, of course.
Your suggestion has the embedding and the flattening together like this: clients(*)->name, whereas I have two fields: clients.name,clients(). The empty embedding would be used to do the join, while the clients.name / alias.column would put one of the columns in the result. This could easily be extended to support multiple columns.

I have not looked in detail at the SQL, yet, but I think it should be possible to do this with multiple columns as well.

m2m relationships are different. I don't see how your proposal would work in this case. Taking the same example, but a slightly different request:

GET /tasks?select=name,clients!projects(*)

This would give me a list of clients through the m2m relationship. How would I go about adding the columns of the projects table / junction now?

> Yes, that's also something I considered on the first disambiguation proposal. It was like:
> GET /<table>?select=*,<table>!<cardinality>!<fk>(*)
>
> I thought those names are rare enough to not be used as columns/fks. But we can also warn about this in the docs.
>
> Not sure about more operators. I used the ! because we are short on URL-safe chars to use as operators (it also reminded me of the !important in css). IIRC, the only other char that was available was $, but pg identifiers can have $ so I discarded it.

You're right. Using multiple ! and making m2o, o2m and m2m keywords should be fine and straightforward.
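A quick sketch of how an embed with several ! hints could be split up if m2o, o2m and m2m become keywords. This is not PostgREST's parser, just an illustration under the assumption that the column list is flat (no nested embeds); the function name and the returned keys are made up.

```python
CARDINALITIES = {"m2o", "o2m", "m2m"}


def parse_embed(spec):
    """Split e.g. 'films!m2m!roles(title)' into target, hints, and columns."""
    head, _, rest = spec.partition("(")
    if not rest.endswith(")"):
        raise ValueError(f"not an embed: {spec!r}")
    inner = rest[:-1]
    columns = inner.split(",") if inner else []
    target, *hints = head.split("!")
    # Reserved keywords are read as the cardinality; anything else is an fk/table hint.
    cardinality = next((h for h in hints if h in CARDINALITIES), None)
    other_hints = [h for h in hints if h not in CARDINALITIES]
    return {
        "target": target,
        "cardinality": cardinality,
        "hints": other_hints,
        "columns": columns,
    }


print(parse_embed("films!m2m!roles(title)"))
print(parse_embed("clients!projects(*)"))
```

A column or fk that is itself named m2m, o2m or m2o could no longer be used as a hint with this approach, which is the collision discussed above.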

…EST#1625) * Include hidden views from the search path. Hidden views are views in unexposed schemas that are part of a view dependency chain. * Change allSourceColumns to only return pk and fk columns

wolfgangwalther force-pushed the embedding-views-recursively branch from 4ec7acb to 63a1420 Compare October 16, 2020 06:20

wolfgangwalther force-pushed the embedding-views-recursively branch 2 times, most recently from 6a3cd88 to 205052e Compare October 16, 2020 20:40

wolfgangwalther force-pushed the embedding-views-recursively branch from 205052e to f1f866f Compare October 17, 2020 14:43

wolfgangwalther force-pushed the embedding-views-recursively branch 2 times, most recently from 390d189 to b9d58c9 Compare October 17, 2020 14:50

wolfgangwalther force-pushed the embedding-views-recursively branch from b9d58c9 to 1d9f0b8 Compare October 18, 2020 09:16

wolfgangwalther mentioned this pull request Oct 19, 2020

Fix embedding through views with subqueries inside function calls #1632

Merged

wolfgangwalther force-pushed the embedding-views-recursively branch from 1d9f0b8 to 7ac9f8d Compare October 23, 2020 21:26

wolfgangwalther force-pushed the embedding-views-recursively branch from 7ac9f8d to a02e5dd Compare October 23, 2020 22:43

wolfgangwalther force-pushed the embedding-views-recursively branch 2 times, most recently from 1ed7090 to 65afac9 Compare November 3, 2020 06:56

wolfgangwalther mentioned this pull request Nov 3, 2020

Add test to insert into a view with location header returned #1644

Merged

Enable embedding through multiple layers of views recursively

3cde1cd

wolfgangwalther force-pushed the embedding-views-recursively branch from 65afac9 to 2d10ce2 Compare November 6, 2020 08:24

wolfgangwalther added 3 commits November 6, 2020 09:29

parse hidden views only in search-path

15ba497

return pk and fk columns

3c32de4

rename allSourceColumns

6e2b8ef

wolfgangwalther force-pushed the embedding-views-recursively branch from 2d10ce2 to 6e2b8ef Compare November 6, 2020 08:29

steve-chavez reviewed Nov 14, 2020

steve-chavez merged commit 6e04fe7 into PostgREST:master Nov 15, 2020

wolfgangwalther deleted the embedding-views-recursively branch November 17, 2020 08:00

steve-chavez mentioned this pull request Nov 17, 2020

Removing single column limit from join table M2M mapping detection. #1593

Merged

wolfgangwalther mentioned this pull request Nov 22, 2020

Add note about embedding of view-chains and the interaction with db-extra-search-path PostgREST/postgrest-docs#365

Merged

This was referenced Dec 5, 2020

Change SET LOCAL gucs to set_config #1600

Merged

Feature Request: Updating Columns to Default Values #1567

Closed

wolfgangwalther mentioned this pull request Dec 21, 2020

add hollywood.sql test fixtures #1702

Draft

61 tasks

wolfgangwalther mentioned this pull request Jan 11, 2021

Feature request: Being able to manually specify foreign keys in views #1179

Closed

LorenzHenk mentioned this pull request Jan 13, 2021

[Feature Request] Provide information on possible embeddings #1731

Closed

wolfgangwalther mentioned this pull request Apr 14, 2021

Add postgrest-loadtest based on vegeta #1812

Merged

6 tasks

steve-chavez mentioned this pull request May 22, 2021

Add request.spec to db-root-spec #1794

Merged

4 tasks

wolfgangwalther mentioned this pull request Nov 6, 2021

Config param for not scanning the full schemata of a database #2013

Closed
