Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filtering on main table with embedded table criteria(inner join) #1075

Closed
BrandonChubb opened this issue Feb 26, 2018 · 27 comments · Fixed by #1949
Closed

filtering on main table with embedded table criteria(inner join) #1075

BrandonChubb opened this issue Feb 26, 2018 · 27 comments · Fixed by #1949
Assignees
Labels
enhancement a feature, ready for implementation

Comments

@BrandonChubb
Copy link

Per issues like #696 and #1067 it seems there's no way to apply all filters of embedded tables to the main table -- though there would be a way to do that via SQL.

  • Would it be reasonable to file a enhancement request for this?
  • Is there a large technical barrier to implementing this (perhaps we would attempt it ourselves)?
  • Is it just functionality that you would consider a vary low priority?

Thanks for any advice!

@steve-chavez
Copy link
Member

This has been requested many times(check also #994 (comment) and this stackoverflow question) and all current workarounds seem lacking to me.

The feature seems doable, basically the current queries need to use explicit JOINs instead of subselects, for example:

-- GET "/clients?select=id,name,projects(id,name)"
-- currently generates this query
-- cost=21434.90..21434.92
WITH pg_source AS (
  SELECT "test"."clients"."id",
         "test"."clients"."name",
         COALESCE((
           SELECT json_agg("projects".*)
           FROM (
             SELECT "test"."projects"."id", "test"."projects"."name"
             FROM "test"."projects"
             WHERE "test"."projects"."client_id" = "test"."clients"."id") "projects"), '[]') AS "projects"
   FROM "test"."clients")
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t;

-- can be changed to this new query
-- cost=267.04..267.06
WITH pg_source AS (
   SELECT "test"."clients"."id",
          "test"."clients"."name",
          coalesce(nullif(json_agg("projects_projects")::text, '[null]'), '[]')::json AS "projects"
   FROM "test"."clients"
   LEFT JOIN LATERAL (
     SELECT "test"."projects"."id", "test"."projects"."name"
     FROM "test"."projects"
     WHERE "test"."projects"."client_id" = "test"."clients"."id"
   ) AS "projects_projects" ON TRUE
   GROUP BY "test"."clients"."id"
)
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t;

Some things to notice:

  • The new query is notably more efficient(check the cost), did a test on 1M rows and the new query is x30 faster.
  • GROUP BY addition, needed because of the upper level aggregate, I think this will not be a problem since we only allow joins based on FK, so we can always list this key columns here.
  • The nullif(json_agg("projects_projects")::text, '[null]' is needed because of a pg bug.

The many to many embeds work with an analogous query that includes a LEFT JOIN LATERAL and parent embeds already have it: https://github.com/begriffs/postgrest/blob/32c7e32bdfc9146a11c818d83af84953d43edd8e/src/PostgREST/QueryBuilder.hs#L263.

Once that it's done we can parametrize the LEFT/INNER join:

GET /clients?select=id,name,projects+inner(id,name)
GET /clients?select=id,name,projects+left(id,name) -- this is the default

Hopefully we can use +(PR pending) since it's clearer but otherwise it would be projects.inner or projects.left.

@steve-chavez steve-chavez added the enhancement a feature, ready for implementation label Apr 16, 2018
@ruslantalpa
Copy link
Contributor

ruslantalpa commented Apr 16, 2018

how about 3-4 levels deep, and 3-4 child types at the same time

@steve-chavez steve-chavez changed the title filtering on main table with embedded table criteria filtering on main table with embedded table criteria(inner join) Aug 14, 2018
@steve-chavez
Copy link
Member

@ruslantalpa Here's a more complex example that includes an M2M relationship.

This request(tasks-users is M2M):

curl "localhost:3000/clients" -G \
 -d select="id,name,projects(id,name,tasks(id,name,users(id,name)))"

Would generate:

WITH pg_source AS (
  SELECT 
    "test"."clients"."id",
    "test"."clients"."name",
    coalesce(nullif(json_agg("projects_projects")::text, '[null]'), '[]')::json AS "projects"
  FROM "test"."clients"
  LEFT JOIN LATERAL (
    SELECT 
      "test"."projects"."id", 
      "test"."projects"."name",
      coalesce(nullif(json_agg("tasks_tasks")::text, '[null]'), '[]')::json AS "tasks"
    FROM "test"."projects"
    LEFT JOIN LATERAL (
      SELECT
        "test"."tasks"."id",
        "test"."tasks"."name",
        coalesce(nullif(json_agg("users_users")::text, '[null]'), '[]')::json AS "users"
      FROM "test"."tasks"
      LEFT JOIN LATERAL (
        SELECT 
          "test"."users"."id",
          "test"."users"."name"
        FROM 
          "test"."users_tasks",
           "test"."users"
        WHERE 
          "test"."users"."id" = "test"."users_tasks"."user_id" AND
          "test"."tasks"."id" = "test"."users_tasks"."task_id"
      ) AS "users_users" ON TRUE
      WHERE
        "test"."tasks"."project_id" = "test"."projects"."id"
      GROUP BY "test"."tasks"."id"
    ) AS "tasks_tasks" ON TRUE
    WHERE "test"."projects"."client_id" = "test"."clients"."id"
    GROUP BY "test"."projects"."id"
  ) AS "projects_projects" ON TRUE
  GROUP BY "test"."clients"."id"
)
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t;

@steve-chavez
Copy link
Member

steve-chavez commented Mar 8, 2019

One problem with the GROUP BY.

When the query references a table, we can just GROUP BY the PKs of the table for the aggregate(json_agg) to work.
But when the query references a VIEW, we need to list all the non-aggregate columns of the VIEW in the GROUP BY.

See https://dba.stackexchange.com/a/88991/158470

@steve-chavez
Copy link
Member

Main idea for generating the above query would be to use the ?select=.. listed columns/computed-columns and add them to the GROUP BY. If a select=* is present the schema cache will be queried for all the columns of the table/view and all of these would be added to the GROUP BY.

@ruslantalpa
Copy link
Contributor

3 problems with this direction

  1. GROUP BY is always more expensive then joins (not sure how it compares current subselects) since it usually results in the use of temporary tables (which might result in disk usage, ... this is from experience long ago with MySQL but I think this logic about gropupby/temp disk still holds true ). So before even starting to go this direction and changing the shape of them main query there would have to be VERY VERY rigorous test, with 3-7 tables each with 1m+ rows and 5-10 columns, explain output, pg_bench tests that fetch 1K+ rows. I.e. my fear is to not break 90% of functionality for a 10% usecase feature (because making the main query slower is breaking postgrest).

  2. If the "test" from the point above prove the query speed is acceptable (at least as fast as current subselects), how much complexity this feature introduces. It's quite hard already to follow the code that generates the query and it's not a good thing when code that only 2-3 people understand gets even more complex.

  3. No one really though about and analyzed this situation to come up with a database solution (because it's easy to just say "hey postgrest, can you implement this?"). This is a useful feature but not that useful as to dedicate that much effort (1 & 2) when there is database solution that delivers an acceptable compromise. My feeling is that some kind of "virtual column" function that brings up the properties of the child to the parent might do the job and no one really researched this.

@steve-chavez
Copy link
Member

@ruslantalpa Regarding 2, I think this would greatly simplify the code that does the JOINs and the resulting query is easier to understand.

About 1, here #1075 (comment) I mentioned that the LATERAL JOIN query plan cost was much lower than the SUBSELECT one. It's likely that were making the query faster. Still, more tests could be useful.

@steve-chavez
Copy link
Member

New queries with LATERAL instead of SUBSELECT are much faster on big workloads, here are my results using pgbench and the same idea as in #978 (comment):

https://gist.github.com/steve-chavez/f79b5c3e777a435d024d44cebb8ac8f4.

  • New child query: x30 faster.
  • New m2m query: x7 faster.

@ruslantalpa
Copy link
Contributor

how about selecting 1k-10k rows?

@steve-chavez
Copy link
Member

@ruslantalpa I'd really appreciate your help with that.

It'd be better if you could replicate my results to be extra sure the queries are fast.

@Remi-C
Copy link

Remi-C commented Jul 15, 2019

Hey @steve-chavez and @ruslantalpa , what is the status on that enhancement?
I don't understand how other user get by without this feature.
The amajor strength of PostgREST is the ability to embed.
Not being able to filter based on the embedding severely restrain this ability imo.

I actually don't see any workaround when the filtering can be done on several embedded fields with various operators.

There might be a quick fix if you allow this kind of syntax :
http://localhost:3000/father?select=*,children(child_name, child_age)&children.age=eq.15 --> right now : will output all the fathers with a children list that might be empty.
We have no way to get only the parents with children of age 15

Simple syntax, the SQL postgREST will generate will most likely not be efficient, but that's better than nothing !
http://localhost:3000/father?select=*,children(child_name, child_age)&children.age=eq.15&father.children=not.is.null

@steve-chavez
Copy link
Member

steve-chavez commented Jul 15, 2019

@Remi-C The problem right now is that PostgREST only does LEFT JOINs when embedding childs, so the parents always remain in the result, syntax wise there isn't an issue.

There is a workaround with computed columns though, see PostgREST/postgrest-docs#211 (comment).

Edit: Copy/pasted the wrong link.

@Remi-C
Copy link

Remi-C commented Jul 15, 2019

@steve-chavez thanks for the answer.
I agree some form of inner join would be cool,
but here I was thinking something much simpler (and less efficient).
I was talking about filtering the parent result based on the {children} object.
Aka father.children=not.is.null (where children is the full json object), which would de facto filter at the parent level using the children level, without you having to rewrite the full software.

Computed columns/views are not going to work with 4 levels of nesting, and filtering happening on various columns with various operators.
Avoiding to write by hand these functions / views is kind of the main purpose of PostgREST in the first place in my situation.

So far the only workaround with realistic performances I can come up with is to go from my nice relational model with dozen of tables to a one table document oriented model.

That is having only one big table with a json type and abuse the jsonb indexing capabilites of postgres.

image

You know postgres very well, you can understand why this backup plan feels wrong.

I'm assuming there is no way to inject some of my own SQL into postgREST generated queries right?

@Iced-Sun
Copy link

Iced-Sun commented Aug 26, 2019

Per the comment of #1075 and the comment of #1018, the to-1 relationship may be a simlper but also demanded case.

Taking the example in comment, would it be possible to generate the following query:

WITH pg_source AS (
  SELECT "test"."clients"."id",
         "test"."clients"."name",
         COALESCE((
           SELECT json_agg("projects".*)
           FROM (
             SELECT "test"."projects"."id", "test"."projects"."name"
             FROM "test"."projects"
             WHERE "test"."projects"."client_id" = "test"."clients"."id") "projects"), '[]') AS "projects"
   FROM "test"."clients")
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t
-- add this line
WHERE projects IS NOT NULL;

Is it possible? Seems that it doesn't change much.

@Iced-Sun
Copy link

Iced-Sun commented Dec 29, 2019

Further investigation on subquery vs. join

The new-*-query in https://gist.github.com/steve-chavez/f79b5c3e777a435d024d44cebb8ac8f4 uses JOIN + GROUP BY to aggregate embedded resources. Although the LATERAL keyword presents, in essence the joining is vanilla, e.g.,

SELECT items.*, subitems_subitems.*
FROM items
LEFT JOIN LATERAL (
  SELECT id, name FROM subitems WHERE subitems.item_id = items.id
) AS subitems_subitems ON TRUE

is basically equivalent to

SELECT items.*, subitems.id, subitems.name
FROM items
LEFT JOIN subitems ON subitems.item_id = items.id

That leaves more opportunity for postgres to evaluate execution plans, but the vanilla JOIN requires GROUP BY to actually do the aggregation, which is concerned as the point 2 in #1075 (comment).

Another approach is to leverage the correlated join, which resembles the correlated subquery approach (but with the correlated parts in the range table instead of in the target list). The child query is:

WITH pg_source AS (
  SELECT
    "new_m2m_child_queries"."items"."id",
    "new_m2m_child_queries"."items"."name",
    "new_m2m_child_queries"."items"."c1",
    "new_m2m_child_queries"."items"."c2",
    "new_m2m_child_queries"."items"."c3",
    "new_m2m_child_queries"."items"."c4",
    coalesce(nullif(subitems::text, '[null]'), '[]')::json AS "subitems"
  FROM "new_m2m_child_queries"."items"
  LEFT JOIN LATERAL (
    SELECT json_agg(_) subitems FROM (
    SELECT
      "new_m2m_child_queries"."subitems"."id",
      "new_m2m_child_queries"."subitems"."name"
    FROM "new_m2m_child_queries"."subitems"
    WHERE
      "new_m2m_child_queries"."subitems"."item_id" = "new_m2m_child_queries"."items"."id") _
  ) AS "subitems_subitems" ON TRUE
  WHERE
    "new_m2m_child_queries"."items"."id" > :rid  AND
    "new_m2m_child_queries"."items"."id" <= (:rid + 100)
)
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t;

while the m2m-query is:

WITH pg_source AS (
   SELECT
     "new_m2m_child_queries"."subitems"."id",
     "new_m2m_child_queries"."subitems"."name",
     "new_m2m_child_queries"."subitems"."c1",
     "new_m2m_child_queries"."subitems"."c2",
     "new_m2m_child_queries"."subitems"."c3",
     "new_m2m_child_queries"."subitems"."c4",
     "pieces"
   FROM "new_m2m_child_queries"."subitems"
   LEFT JOIN LATERAL (
      SELECT json_agg(_) as pieces from (
      SELECT
        "new_m2m_child_queries"."pieces"."id",
        "new_m2m_child_queries"."pieces"."name"
      FROM
        "new_m2m_child_queries"."subitems_pieces",
        "new_m2m_child_queries"."pieces"
      WHERE
        "new_m2m_child_queries"."subitems"."id" = "new_m2m_child_queries"."subitems_pieces"."subitem_id" AND
        "new_m2m_child_queries"."pieces"."id" = "new_m2m_child_queries"."subitems_pieces"."piece_id"
   )_) AS "pieces_pieces" ON TRUE
   WHERE
    "new_m2m_child_queries"."subitems"."id" > :rid AND
    "new_m2m_child_queries"."subitems"."id" <= (:rid + 100)
)
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t;

These queries force Postgres to do the correlated join, hence the aggregation is performed per loop instead of an overal GroupAggregate. Indexes on the foreign key are vital to performances.

Here are some benchmarks.

child query (TPS)

Index: create index on subitems(item_id).

size approach w/o index with index
100 subquery 0.117936 (x1) 841.420923 (x7134)
join + group by 4.745253 (x40) 739.023114 (x6266)
correlated join 0.118051 (x1) 837.194807 (x7099)
1000 subquery 0.011767 (x1) 144.347582 (x12267)
join + group by 4.583183 (x389) 127.866821 (x10866)
correlated join 0.011838 (x1) 141.539364 (x12028)

m2m query (TPS)

Index: create index on subitems_pieces(subitem_id); create index on subitems_pieces(piece_id).

size approach w/o index with index
100 subquery 0.420336 (x1) 743.610854 (x1769)
join + group by 2.290528 (x5.45) 678.160037 (x1613)
correlated join 0.441805 (x1.05) 752.817718 (x1791)
1000 subquery 0.042927 (x1) 128.298963 (x2989)
join + group by 2.252303 (x52) 120.049724 (x2797)
correlated join 0.039652 (x0.92) 135.044025 (x3146)

It seems to me that JOIN + GROUP BY may be preferable in the absence of proper indexes, however, it is also acceptable even with the indexes present.

@kkwiatkowski
Copy link

Do we have any further update on this matter?

@LorenzHenk
Copy link

As mentioned by @sunbing81 here, for to-1 relations it would be sufficient to be able to filter whether an embedded row is NULL.

Example use case:
You have a list of all tasks and want to filter where the project owner is X and the task is not completed.
You'd use the following select: select=*,project(*)
So you'd define the following filters:

project.project_owner=eq.X
task_state=not.eq.completed
project=not.is.null         <-- this is not possible at the moment

Only is.null and not.is.null are useful filters for embedded rows, other filters should not be allowed.

If you use the foreign key column to configure embedding (e.g. /tasks?select=*,project_id(*)), in the resulting data the foreign key column property is replaced with the embedded row object / array.
How should one differ between "the embedded row is NULL" and "the foreign key column is NULL" - and is this difference even relevant?

@steve-chavez
Copy link
Member

steve-chavez commented Jan 27, 2021

Only is.null and not.is.null are useful filters for embedded rows, other filters should not be allowed.

Perhaps other types of filters could be useful if we add support for flattening: #1233 (comment)

So what we need is a way to filter the top level embeds.

The query that @sunbing81 proposed can work if we remove the coalesce in json_agg(projects:

WITH pg_source AS (
  SELECT "test"."clients"."id",
         "test"."clients"."name",
         (
           SELECT json_agg("projects".*)
           FROM (
             SELECT "test"."projects"."id", "test"."projects"."name"
             FROM "test"."projects"
             WHERE "test"."projects"."client_id" = "test"."clients"."id") "projects") AS "projects"
   FROM "test"."clients")
SELECT coalesce(json_agg(_postgrest_t), '[]')::character varying AS BODY
FROM (SELECT * FROM pg_source) _postgrest_t
-- add this line
WHERE projects IS NOT NULL
-- otherwise we'd have to do something like
-- WHERE json_array_length(projects) > 0;

This approach certainly looks easier than implementing the INNER JOIN. Haven't thought all the details though.

Edit:

The problem with this approach is that it's not composable for recursive embeds. The subquery must be in an upper level to use it in a WHERE clause(see here). I guess we could create many artificial levels but I'm sure that would heavily impact the perf of the query.

@volkandkaya
Copy link

As mentioned by @sunbing81 here, for to-1 relations it would be sufficient to be able to filter whether an embedded row is NULL.

Example use case:
You have a list of all tasks and want to filter where the project owner is X and the task is not completed.
You'd use the following select: select=*,project(*)
So you'd define the following filters:

project.project_owner=eq.X
task_state=not.eq.completed
project=not.is.null         <-- this is not possible at the moment

Only is.null and not.is.null are useful filters for embedded rows, other filters should not be allowed.

If you use the foreign key column to configure embedding (e.g. /tasks?select=*,project_id(*)), in the resulting data the foreign key column property is replaced with the embedded row object / array.
How should one differ between "the embedded row is NULL" and "the foreign key column is NULL" - and is this difference even relevant?

Did you ever solve this?

@LorenzHenk
Copy link

Did you ever solve this?

@volkandkaya no 😕

@drmnk
Copy link

drmnk commented May 20, 2021

@steve-chavez so as I can see so far, you planning to filter main table by "sweeping" from it null-ed embeddable resource. Am I right?

Right now I want to create a simple server, that would get postgrest-format requests from client, ask Postgrest server, sweep results with null embeds and give response back to client.

All I want to ask - will you bring "sweeping" logic right now to existed Postgrest queries?
Or will you add some modificators to queries so two versions of queries will be - one new with sweeping and one old without?

Thank you very much.

@steve-chavez
Copy link
Member

All I want to ask - will you bring "sweeping" logic right now to existed Postgrest queries?
Or will you add some modificators to queries so two versions of queries will be - one new with sweeping and one old without?

@drmnk IIUYC, the second one, it will be possible to keep null embeddings or discard them through the url: ?select=*,embed!<left|inner>(*).

Many users are asking for filtering nulls by default, so maybe there will be a config option to do it like that(embedding-mode="inner").

@drmnk
Copy link

drmnk commented May 20, 2021

oh, great, config option is must-have!

@colearendt
Copy link
Contributor

colearendt commented Sep 2, 2021

Any update or timing expectation on this one? I just hit this issue for performance reasons (too much data in the original table) and the possible workarounds are much less than ideal (build a stored procedure).

I don't know haskell, but if there is anything that ignorant not-yet-contributors can do to multiply or accelerate efforts (help with test coverage, use a prototype build for feedback, help with docs, etc.), then I would be happy to help!! (I imagine others would be as well 😄 )

@steve-chavez
Copy link
Member

Implemented in #1949 🎉 🎉

For anyone that would like to try it before a new release, you can download binaries here(at the bottom of the page).

@mstade
Copy link

mstade commented Oct 4, 2021

Awesome news @steve-chavez, congrats! Can't wait to use this in Supabase, we've got several scenarios where this is going to simplify things A LOT. Amazing work, huge thanks!! 🎉🎉🎉

@colearendt
Copy link
Contributor

Just want to pop in and say this is working great in my testing! Each time I have a new use case for this feature and I implement it, it makes me smile 😄 Thanks so much for the awesome work here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement a feature, ready for implementation
Development

Successfully merging a pull request may close this issue.