Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable embedding through multiple layers of views recursively #1625

Merged

Conversation

wolfgangwalther
Copy link
Member

Resolves #1607.

Changes the allSourceColumns query to recurse, when the base relation is still a view or materialized view. The query now returns the "true" base columns through multiple view layers, enabling the embedding of views of views of... you know!

@steve-chavez
Copy link
Member

Interesting! 👀

One thing we need to check with changes to DbStructure is if they considerably slow down our startup time.

We currently test this manually(me and Remo) with a sample big schema we share privately.
Would you like to test this for your self? If so, you can shoot me an email at stevechavezast@gmail.com and I'll share you the instructions.

@wolfgangwalther
Copy link
Member Author

wolfgangwalther commented Oct 16, 2020

I used the "big schema" and did some performance testing. Since I only changed the query, I ran it directly on the database. I used the following function to run 100 times - alternating both queries - and compute the average plan and execution time for each query.

create or replace function timeit (n int, variadic queries text[], out plan_avg numeric[], out execution_avg numeric[])
language plpgsql as $$
declare
  i int;
  j int;
  explained json;
  plan_sum numeric[];
  execution_sum numeric[];
begin
  for i in 1..n loop
    for j in array_lower(queries, 1)..array_upper(queries, 1) loop
      execute 'explain (analyze, format json) ' || queries[j] into explained;
      plan_sum[j] := coalesce(plan_sum[j], 0) + (explained->0->>'Planning Time')::numeric;
      plan_avg[j] := plan_sum[j] / i;
      execution_sum[j] := coalesce(execution_sum[j], 0) + (explained->0->>'Execution Time')::numeric;
      execution_avg[j] := execution_sum[j] / i;
    end loop;
  end loop;
end
$$;

select * from timeit(100,
$q$ ... old query ... $q$,
$q$ ... new query ... $q$);

The results are as follows:

query | plan_avg | execution_avg
old   | 0.56 ms  | 395.3 ms
new   | 0.63 ms  | 399.7 ms

I repeated that and had the same difference (~+5ms for the new query).

So, as expected the recursion is a bit slower. But really not much.

As a side note:

  • The old query returns 2208 rows
  • The new query returns 2181 rows
  • The new query without where base_kind not in ('v', 'm') returns 2490 rows

The new query returns fewer rows, because rows that have a view as a base table are removed - those are useless anyway. However, without removing those rows, the new query returns almost 300 rows more, so there are plenty of "views of views" in that schema. To quantify that, I ran the following query:

with
q1 as (...),
q2 as (...)
select * from q2
except
select * from q1

That's all base columns that have been detected by the new query, that had not been detected before. The query returns 188 rows.

Trading 5ms cache reload time for 188 new opportunities* to embed stuff. Anyone?

*well not really. not nearly all of those columns are unique or referenced at all in a fk constraint...

@wolfgangwalther wolfgangwalther force-pushed the embedding-views-recursively branch 2 times, most recently from 6a3cd88 to 205052e Compare October 16, 2020 20:40
@wolfgangwalther
Copy link
Member Author

wolfgangwalther commented Oct 16, 2020

*well not really. not nearly all of those columns are unique or referenced at all in a fk constraint...

Thinking this further... we don't need to return all source columns from this query, but only those that are used in any foreign key constraint on either side. Correct?

I changed both the old and the new query to return only those columns:

  • The old query now returns 543 rows
  • The new query returns 519 rows (the same logic as above applies here)
  • The except query returns a difference of 24 columns

Those 24 columns now really represent new opportunities to embed.

And even better - those queries perform better than ever:

query        | plan_avg | execution_avg
master       | 0.54ms   | 387.7ms
recursive    | 0.62ms   | 393.0ms
master_fk    | 1.16ms   | 362.2ms
recursive_fk | 1.06ms   | 362.0ms

Now, the recursive query is just as fast as the one without recursion - but both are about 7% faster than before.

All tests were run on PG 12.

@wolfgangwalther
Copy link
Member Author

Reordered the whole query a bit and replaced regexp_split_to_array with string_to_array, which is a lot faster. The query now comes it at around 220-230ms, so almost twice as fast.

Should be good to merge.

@wolfgangwalther wolfgangwalther force-pushed the embedding-views-recursively branch 2 times, most recently from 390d189 to b9d58c9 Compare October 17, 2020 14:50
@steve-chavez
Copy link
Member

Mindblown! 💥 🤯

Amazing work here Wolfgang. I wonder what @monacoremo thinks 😄.

We might just call this new version The Performance Release .

I'm going to give it a review and then merge. It's all looking good though, I'm just going to digest and marvel at how the new queries work :)

@monacoremo
Copy link
Member

Great stuff!!! Like where we are going with this :-)

@wolfgangwalther
Copy link
Member Author

Rebased on current master and did some small renaming for less of a diff and better readability.

Glad you guys like it. I'm not sure whether I'm digging my own grave here with the performance stuff, because now it will be very hard for me to get my "transform to json" approach on the same level... ;)

@wolfgangwalther
Copy link
Member Author

Rebased on master.

Added another test as well, which is currently still failing. This is about a chain of views, where one of the views in the middle is not in the public schema. Will have to see how I can fix this the best, while keeping performance.

@wolfgangwalther
Copy link
Member Author

Embedding is now possible through hidden views in the middle of the chain as well. This does have the same performance on the big schema, but that's mainly because there are no hidden views in there. The query, as it is now, has to parse all views in the database to detect relationships through hidden views. In a setup with many hidden, but unrelated views this could increase the time for this query.

@steve-chavez
Copy link
Member

@wolfgangwalther I've finally understood your great work here!

The query is looking solid, but I think we can clarify the code a bit(for example allSourceColumns is no more, it should be something like allSourceFkColumns).

I'll do some reviews, but first..

Embedding is now possible through hidden views in the middle of the chain as well.

Would this cover a realistic use case?

IIUC, It would imply that a user creates public views for their API and then creates private views based on those public views.

That wouldn't make sense according to schema isolation. The idea is to work inside out, selecting private tables or views to be exposed as public.

The query, as it is now, has to parse all views in the database to detect relationships through hidden views. In a setup with many hidden, but unrelated views this could increase the time for this query.

The price to pay seems a bit too high. PostgREST can be used on legacy databases and those can have a potentially large number of views.

@wolfgangwalther
Copy link
Member Author

The query is looking solid, but I think we can clarify the code a bit(for example allSourceColumns is no more, it should be something like allSourceFkColumns).

I'll change that together with everything else you find in your review :)

Would this cover a realistic use case?

IIUC, It would imply that a user creates public views for their API and then creates private views based on those public views.

That wouldn't make sense according to schema isolation. The idea is to work inside out, selecting private tables or views to be exposed as public.

I did this, because this was exactly what lead me to this issue. It is the other way around, though. My scenario:

  • hidden schema data with tables
  • hidden schema hidden with some views as an abstraction on the data tables (some code that is shared between some of my public views)
  • public schema public with view that use those in hidden

So this is exactly the inside-out approach you're describing.

Before my latest change the view detection would only take those views in the public schema, but not those in the hidden schema. The table in data would never be found.

The price to pay seems a bit too high. PostgREST can be used on legacy databases and those can have a potentially large number of views.

I think, in fact, it probably only makes sense to do it like this. I would assume that it is more likely to have such a "view-chain" when using something like I did above (using views as another layer between data tables and public views).

But maybe we can do something inbetween: Right now we were either looking for views only in the schemas mentioned in db-schema or in all schemas everywhere. We could just look for views that are in the search path, so db-schema + db-extra-search-path. In this case I could add hidden to the search path in my case (which is something that I already did anyway). Users can still control how many views are exposed like this, then.

@steve-chavez
Copy link
Member

hidden schema data with tables
hidden schema hidden
public schema public with view that use those in hidden
So this is exactly the inside-out approach you're describing.

Just got It! I was a bit mislead by the tests. That definitely coincides with the schema isolation approach.

We could just look for views that are in the search path, so db-schema + db-extra-search-path

That sounds right! I agree with that change.

@wolfgangwalther
Copy link
Member Author

Wolfgang:

Thinking this further... we don't need to return all source columns from this query, but only those that are used in any foreign key constraint on either side. Correct?

Steve:

(for example allSourceColumns is no more, it should be something like allSourceFkColumns).

Looking at the code, my earlier assumption is wrong. It is not only the FK columns that we need, but also the PK columns of the base tables, because we have this line:

keys' = addViewPrimaryKeys srcCols keys

We're missing a test-case here, because my change should have made at least one of them fail. I guess all relevant PKs in our test-schema are also used in some kind of foreign key relationship. I will try to make up a test-case to make sure we're not breaking anything.

@wolfgangwalther
Copy link
Member Author

We just don't have any case to test inserting into a view and getting a location header back. At least I couldn't find any. I'm pretty sure this works already, so we need to add that.

@steve-chavez
Copy link
Member

it is not only the FK columns that we need, but also the PK columns of the base tables
We just don't have any case to test inserting into a view and getting a location header back

@wolfgangwalther You're right, I also missed that one. A new test would be great.

@wolfgangwalther
Copy link
Member Author

Rebased on #1644, which we should merge first. This shows that the new test is indeed breaking with the current changes here - will fix.

@wolfgangwalther
Copy link
Member Author

Rebased on master, fixed the missing pks and did the todos from above.

@steve-chavez
Copy link
Member

Sorry for the late response here @wolfgangwalther. I wanted to make sure we get a nightly for your PR, so I did #1648 first.

I'll review your latest changes tomorrow :)

@steve-chavez steve-chavez merged commit 6e04fe7 into PostgREST:master Nov 15, 2020
@steve-chavez
Copy link
Member

steve-chavez commented Nov 16, 2020

Just FYI. I've released a nightly(2020-11-15-14-58-6e04fe7) with this change.

Also, we definitely need to document the db-extra-search-path interaction with hidden/middle views. I think a subsection under Embedding Views would do.

@wolfgangwalther wolfgangwalther deleted the embedding-views-recursively branch November 17, 2020 08:00
@wolfgangwalther
Copy link
Member Author

Sweet. Just started testing the nightly in my current project. The first thing that happenend was that some embeddings, that I had in place before, suddenly returned 30x, because there were 5 additional m2m relationships found to embed on.... :D
So this is definitely something that we need to point out for the next release as well, just as discussed in #1593.

@steve-chavez
Copy link
Member

suddenly returned 30x, because there were 5 additional m2m relationships

Oh, that's bad. And #1593 is likely to bring more possibilities.

The m2m's is something I was thinking sometime ago.
Right now, many tables can be treated as junctions even when they're not really meant to be.

Perhaps we can select junctions for embedding in a stricter way? Like when they only have FKs and no additional columns?

We could even test this with a config option like db-strict-junctions.

@wolfgangwalther
Copy link
Member Author

Oh, that's bad. And #1593 is likely to bring more possibilities.

My reaction was more like: "YES! It worked :D". Wasn't that bad, as I have automated tests in place for all embeddings I'm using and it was just a simple hint to add.

The m2m's is something I was thinking sometime ago.
Right now, many tables can be treated as junctions even when they're not really meant to be.

Perhaps we can select junctions for embedding in a stricter way? Like when they only have FKs and no additional columns?

We could even test this with a config option like db-strict-junctions.

I thought about this a bit. I agree that something needs to be done about m2m relationships and junction tables. Not sure whether I'd go in the same direction, though.

In my scenarios so far, I almost always had some other columns on those junction tables. I can't use m2m embedding for those. I do want a "flat" output and not something nested, so I ended up creating views that join the junction and target table. This view is then easily (now!) embeddable.

However, if there was a way to do this "flat m2m embedding" in the query syntax, that would be even better.

So this boils down to: In the current implementation m2m embedding only makes sense with "strict" junction tables anyway - so the limitation would sure help to decrease ambiguation. But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead?

@steve-chavez
Copy link
Member

So this boils down to: In the current implementation m2m embedding only makes sense with "strict" junction tables anyway

Yes, I also believe that, the non-fk columns are lost anyway.

But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead?

Maybe we could use the cardinality hint I mentioned on #1643 (comment).

So, by default only strict junctions are automatically detected, but extended junctions can be used for M2M embedding by adding an m2m hint.

@wolfgangwalther
Copy link
Member Author

But maybe it would be possible to extend the m2m feature to make it more usable with extended junction tables instead?

Maybe we could use the cardinality hint I mentioned on #1643 (comment).

So, by default only strict junctions are automatically detected, but extended junctions can be used for M2M embedding by adding an m2m hint.

I like that. It would reduce complexity for the most common cases, but still allow all of those we currently do.

Do we need to support adding multiple hints in this case? I think yes, because otherwise we can't tell which m2m to use.

To avoid name collisions with columns or fks named "m2m" / "o2m" / "m2o", maybe we should use another operator for the cardinality hint and separate the two types of hints? Could be easier to parse as well?

Yes, I also believe that, the non-fk columns are lost anyway.

Do you think there'd be any way we can change this? I think this has been requested in the past as well. Maybe we can just allow selecting columns from the junction table inside a m2m relationship to flatten the output instead of having two nested embeds. Flattening could also be useful for m2o relationships.

Flattening would have to work differently for m2o and m2m relationships:

  • in a m2o relationship, we'd want to move the embedded columns to the "outer" level. I suggest something like this:
    GET /films?select=title,director:directors.last_name,directors()
    returns
[
  { "title": "Workers Leaving The Lumière Factory In Lyon",
    "director": "Lumière"
  },
  { "title": "The Dickson Experimental Sound Film",
    "director": "Dickson"
  },
  { "title": "The Haunted Castle",
    "director": "Méliès"
  }
]
  • in the m2m relationship, we could then use .. instead of directors. to qualify the column, so something like this:
    GET /actors?select=first_name,last_name,films!m2m.roles(title,..character)
    character is then taken from the junction table.

@steve-chavez
Copy link
Member

Regarding flattening, I came up with a proposal here: #1233 (comment). Would that be enough for the use cases you have in mind?

Do we need to support adding multiple hints in this case?

Yes, that's also something I considered on the first disambiguation proposal. It was like:

GET /<table>?select=*,<table>!<cardinality>!<fk>(*)

To avoid name collisions with columns or fks named "m2m" / "o2m" / "m2o", maybe we should use another operator for the cardinality hint and separate the two types of hints?

I thought those names are rare enough to not be used as columns/fks. But we can also warn about this on docs.

Not sure about more operators. I used the ! because we are short in urlsafe chars to use as operators(it also reminded me of the !important in css). IIRC, the only other char that was available was $, but pg identifiers can have $ so I discarded it.

@wolfgangwalther
Copy link
Member Author

Regarding flattening, I came up with a proposal here: #1233 (comment). Would that be enough for the use cases you have in mind?

If I read that correctly, that will only work for the embedding of m2o relationships. The example you gave there was a m2o->m2o relationship, right?

Regarding m2o relationships our proposals differ in two aspects:

  • use of -> vs .. That's just a matter of choice, I guess. I feel like . more closely resembles the SQL ("table.column"), but -> would work as well of course.

  • Your suggestion has the embedding and the flattening together like this: clients(*)->name, where I have two fields: clients.name,clients(). The empty embedding would be used to do the join, while the clients.name / alias.column would put one of the columns in the result. This could easily be extended to support multiple columns.

I have not looked in detail at the SQL, yet, but I think it should be possible to do this with multiple columns as well.


m2m relationships are different. I don't see how your proposal would work in this case. Taking the same example, but a slightly different request:

GET /tasks?select=name,clients!projects(*)

This would give me a list of clients through the m2m relationship. How would I go about adding the columns of the projects table / junction now?

Yes, that's also something I considered on the first disambiguation proposal. It was like:

GET /<table>?select=*,<table>!<cardinality>!<fk>(*)

I thought those names are rare enough to not be used as columns/fks. But we can also warn about this on docs.

Not sure about more operators. I used the ! because we are short in urlsafe chars to use as operators(it also reminded me of the !important in css). IIRC, the only other char that was available was $, but pg identifiers can have $ so I discarded it.

You're right. Using multiple ! and making m2o, o2m and m2m keywords should be fine and straightforward.

monacoremo pushed a commit to monacoremo/postgrest that referenced this pull request Jul 17, 2021
…EST#1625)

* Include hidden views from the search path. Hidden views are views in unexposed schemas that are part of a view dependency chain.

* Change allSourceColumns to only return pk and fk columns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Embedding views recursively
3 participants