-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable embedding through multiple layers of views recursively #1625
Enable embedding through multiple layers of views recursively #1625
Conversation
4ec7acb
to
63a1420
Compare
Interesting! 👀 One thing we need to check with changes to DbStructure is if they considerably slow down our startup time. We currently test this manually(me and Remo) with a sample big schema we share privately. |
I used the "big schema" and did some performance testing. Since I only changed the query, I ran it directly on the database. I used the following function to run 100 times - alternating both queries - and compute the average plan and execution time for each query. create or replace function timeit (n int, variadic queries text[], out plan_avg numeric[], out execution_avg numeric[])
language plpgsql as $$
declare
i int;
j int;
explained json;
plan_sum numeric[];
execution_sum numeric[];
begin
for i in 1..n loop
for j in array_lower(queries, 1)..array_upper(queries, 1) loop
execute 'explain (analyze, format json) ' || queries[j] into explained;
plan_sum[j] := coalesce(plan_sum[j], 0) + (explained->0->>'Planning Time')::numeric;
plan_avg[j] := plan_sum[j] / i;
execution_sum[j] := coalesce(execution_sum[j], 0) + (explained->0->>'Execution Time')::numeric;
execution_avg[j] := execution_sum[j] / i;
end loop;
end loop;
end
$$;
select * from timeit(100,
$q$ ... old query ... $q$,
$q$ ... new query ... $q$); The results are as follows:
I repeated that and had the same difference (~+5ms for the new query). So, as expected the recursion is a bit slower. But really not much. As a side note:
The new query returns fewer rows, because rows that have a view as a base table are removed - those are useless anyway. However, without removing those rows, the new query returns almost 300 rows more, so there are plenty of "views of views" in that schema. To quantify that, I ran the following query: with
q1 as (...),
q2 as (...)
select * from q2
except
select * from q1 That's all base columns that have been detected by the new query, that had not been detected before. The query returns 188 rows. Trading 5ms cache reload time for 188 new opportunities* to embed stuff. Anyone? *well not really. not nearly all of those columns are unique or referenced at all in a fk constraint... |
6a3cd88
to
205052e
Compare
Thinking this further... we don't need to return all source columns from this query, but only those that are used in any foreign key constraint on either side. Correct? I changed both the old and the new query to return only those columns:
Those 24 columns now really represent new opportunities to embed. And even better - those queries perform better than ever:
Now, the recursive query is just as fast as the one without recursion - but both are about 7% faster than before. All tests were run on PG 12. |
205052e
to
f1f866f
Compare
Reordered the whole query a bit and replaced Should be good to merge. |
390d189
to
b9d58c9
Compare
Mindblown! 💥 🤯 Amazing work here Wolfgang. I wonder what @monacoremo thinks 😄. We might just call this new version The Performance Release . I'm going to give it a review and then merge. It's all looking good though, I'm just going to digest and marvel at how the new queries work :) |
Great stuff!!! Like where we are going with this :-) |
b9d58c9
to
1d9f0b8
Compare
Rebased on current master and did some small renaming for less of a diff and better readability. Glad you guys like it. I'm not sure whether I'm digging my own grave here with the performance stuff, because now it will be very hard for me to get my "transform to json" approach on the same level... ;) |
1d9f0b8
to
7ac9f8d
Compare
Rebased on master. Added another test as well, which is currently still failing. This is about a chain of views, where one of the views in the middle is not in the public schema. Will have to see how I can fix this the best, while keeping performance. |
7ac9f8d
to
a02e5dd
Compare
Embedding is now possible through hidden views in the middle of the chain as well. This does have the same performance on the big schema, but that's mainly because there are no hidden views in there. The query, as it is now, has to parse all views in the database to detect relationships through hidden views. In a setup with many hidden, but unrelated views this could increase the time for this query. |
@wolfgangwalther I've finally understood your great work here! The query is looking solid, but I think we can clarify the code a bit(for example I'll do some reviews, but first..
Would this cover a realistic use case? IIUC, It would imply that a user creates public views for their API and then creates private views based on those public views. That wouldn't make sense according to schema isolation. The idea is to work inside out, selecting private tables or views to be exposed as public.
The price to pay seems a bit too high. PostgREST can be used on legacy databases and those can have a potentially large number of views. |
I'll change that together with everything else you find in your review :)
I did this, because this was exactly what lead me to this issue. It is the other way around, though. My scenario:
So this is exactly the inside-out approach you're describing. Before my latest change the view detection would only take those views in the
I think, in fact, it probably only makes sense to do it like this. I would assume that it is more likely to have such a "view-chain" when using something like I did above (using views as another layer between data tables and public views). But maybe we can do something inbetween: Right now we were either looking for views only in the schemas mentioned in |
Just got It! I was a bit mislead by the tests. That definitely coincides with the schema isolation approach.
That sounds right! I agree with that change. |
Wolfgang:
Steve:
Looking at the code, my earlier assumption is wrong. It is not only the FK columns that we need, but also the PK columns of the base tables, because we have this line: postgrest/src/PostgREST/DbStructure.hs Line 58 in 4b4a622
We're missing a test-case here, because my change should have made at least one of them fail. I guess all relevant PKs in our test-schema are also used in some kind of foreign key relationship. I will try to make up a test-case to make sure we're not breaking anything. |
We just don't have any case to test inserting into a view and getting a location header back. At least I couldn't find any. I'm pretty sure this works already, so we need to add that. |
@wolfgangwalther You're right, I also missed that one. A new test would be great. |
1ed7090
to
65afac9
Compare
Rebased on #1644, which we should merge first. This shows that the new test is indeed breaking with the current changes here - will fix. |
65afac9
to
2d10ce2
Compare
2d10ce2
to
6e2b8ef
Compare
Rebased on master, fixed the missing pks and did the todos from above. |
Sorry for the late response here @wolfgangwalther. I wanted to make sure we get a nightly for your PR, so I did #1648 first. I'll review your latest changes tomorrow :) |
Just FYI. I've released a nightly(2020-11-15-14-58-6e04fe7) with this change. Also, we definitely need to document the |
Sweet. Just started testing the nightly in my current project. The first thing that happenend was that some embeddings, that I had in place before, suddenly returned 30x, because there were 5 additional m2m relationships found to embed on.... :D |
Oh, that's bad. And #1593 is likely to bring more possibilities. The m2m's is something I was thinking sometime ago. Perhaps we can select junctions for embedding in a stricter way? Like when they only have FKs and no additional columns? We could even test this with a config option like |
My reaction was more like: "YES! It worked :D". Wasn't that bad, as I have automated tests in place for all embeddings I'm using and it was just a simple hint to add.
I thought about this a bit. I agree that something needs to be done about m2m relationships and junction tables. Not sure whether I'd go in the same direction, though. In my scenarios so far, I almost always had some other columns on those junction tables. I can't use m2m embedding for those. I do want a "flat" output and not something nested, so I ended up creating views that join the junction and target table. This view is then easily (now!) embeddable. However, if there was a way to do this "flat m2m embedding" in the query syntax, that would be even better. So this boils down to: In the current implementation m2m embedding only makes sense with "strict" junction tables anyway - so the limitation would sure help to decrease ambiguation. But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead? |
Yes, I also believe that, the non-fk columns are lost anyway.
Maybe we could use the So, by default only strict junctions are automatically detected, but extended junctions can be used for M2M embedding by adding an |
I like that. It would reduce complexity for the most common cases, but still allow all of those we currently do. Do we need to support adding multiple hints in this case? I think yes, because otherwise we can't tell which m2m to use. To avoid name collisions with columns or fks named "m2m" / "o2m" / "m2o", maybe we should use another operator for the cardinality hint and separate the two types of hints? Could be easier to parse as well?
Do you think there'd be any way we can change this? I think this has been requested in the past as well. Maybe we can just allow selecting columns from the junction table inside a m2m relationship to flatten the output instead of having two nested embeds. Flattening could also be useful for m2o relationships. Flattening would have to work differently for m2o and m2m relationships:
[
{ "title": "Workers Leaving The Lumière Factory In Lyon",
"director": "Lumière"
},
{ "title": "The Dickson Experimental Sound Film",
"director": "Dickson"
},
{ "title": "The Haunted Castle",
"director": "Méliès"
}
]
|
Regarding flattening, I came up with a proposal here: #1233 (comment). Would that be enough for the use cases you have in mind?
Yes, that's also something I considered on the first disambiguation proposal. It was like: GET /<table>?select=*,<table>!<cardinality>!<fk>(*)
I thought those names are rare enough to not be used as columns/fks. But we can also warn about this on docs. Not sure about more operators. I used the |
If I read that correctly, that will only work for the embedding of m2o relationships. The example you gave there was a m2o->m2o relationship, right? Regarding m2o relationships our proposals differ in two aspects:
I have not looked in detail at the SQL, yet, but I think it should be possible to do this with multiple columns as well. m2m relationships are different. I don't see how your proposal would work in this case. Taking the same example, but a slightly different request:
This would give me a list of clients through the m2m relationship. How would I go about adding the columns of the projects table / junction now?
You're right. Using multiple |
…EST#1625) * Include hidden views from the search path. Hidden views are views in unexposed schemas that are part of a view dependency chain. * Change allSourceColumns to only return pk and fk columns
Resolves #1607.
Changes the
allSourceColumns
query to recurse, when the base relation is still a view or materialized view. The query now returns the "true" base columns through multiple view layers, enabling the embedding of views of views of... you know!