Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregating an Ordered Set in BigQuery — ARRAY_AGG(DISTINCT ... ORDER BY ...) is not support! 😭 #209

Closed
munro opened this issue Mar 9, 2022 · 2 comments

Comments

@munro
Copy link

munro commented Mar 9, 2022

I'm trying to generate a set ordered by a another value, but I'm really struggling to pull it off. I'm not sure how to work around BigQuery's limitations, after trying quite a few attempts. Without doing some clever work around for BigQuery not supporting ARRAY_AGG(DISTINCT ... ORDER BY ...), the only other way I see is JOINing another select after the GROUP BY is preformed, but I'm also struggling to figure out how to pull that off in Logica as well. Anyways, love the project! Overall it's been really cool! 🚀

Code

Here's the most straight forward attempt

%%logica Test

@Engine("bigquery");

OrderedSet(a) = SqlExpr("ARRAY_AGG(DISTINCT {value} ORDER BY {arg})", {value: a.value, arg: a.arg});

TestOrderedSet(g:, ids_by_t? Array= t -> id) distinct :-
  Data(g:, id:, t:);

Data(g: 1, id: 1, t: 5);
Data(g: 1, id: 2, t: 4);
Data(g: 1, id: 2, t: 4);
Data(g: 1, id: 3, t: 3);
Data(g: 1, id: 3, t: 3);
Data(g: 1, id: 4, t: 2);
Data(g: 1, id: 5, t: 1);

Data(g: 2, id: 10, t: 1);
Data(g: 2, id: 10, t: 1);
Data(g: 2, id: 20, t: 2);
Data(g: 2, id: 20, t: 0);
Data(g: 2, id: 30, t: 3);

Expected(g: 1, ids_by_t: [5, 4, 3, 2, 1]);
Expected(g: 2, ids_by_t: [20, 10, 30]);
Expected(g: 3, ids_by_t: [20, 10, 30]);

Test(g:, actual:, expected:) :-
  TestOrderedSet(g:, ids_by_t: actual),
  Expected(g:, ids_by_t: expected);

Output

And the error 😢

BadRequest: 400 An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY expressions that are arguments to the function at [77:39]

                   -----Query Job SQL Follows-----                   

    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:WITH t_2_Data AS (SELECT * FROM (
   2:  
   3:    SELECT
   4:      1 AS g,
   5:      1 AS id,
   6:      5 AS t
   7:   UNION ALL
   8:  
   9:    SELECT
  10:      1 AS g,
  11:      2 AS id,
  12:      4 AS t
  13:   UNION ALL
  14:  
  15:    SELECT
  16:      1 AS g,
  17:      2 AS id,
  18:      4 AS t
  19:   UNION ALL
  20:  
  21:    SELECT
  22:      1 AS g,
  23:      3 AS id,
  24:      3 AS t
  25:   UNION ALL
  26:  
  27:    SELECT
  28:      1 AS g,
  29:      3 AS id,
  30:      3 AS t
  31:   UNION ALL
  32:  
  33:    SELECT
  34:      1 AS g,
  35:      4 AS id,
  36:      2 AS t
  37:   UNION ALL
  38:  
  39:    SELECT
  40:      1 AS g,
  41:      5 AS id,
  42:      1 AS t
  43:   UNION ALL
  44:  
  45:    SELECT
  46:      2 AS g,
  47:      10 AS id,
  48:      1 AS t
  49:   UNION ALL
  50:  
  51:    SELECT
  52:      2 AS g,
  53:      10 AS id,
  54:      1 AS t
  55:   UNION ALL
  56:  
  57:    SELECT
  58:      2 AS g,
  59:      20 AS id,
  60:      2 AS t
  61:   UNION ALL
  62:  
  63:    SELECT
  64:      2 AS g,
  65:      20 AS id,
  66:      0 AS t
  67:   UNION ALL
  68:  
  69:    SELECT
  70:      2 AS g,
  71:      30 AS id,
  72:      3 AS t
  73:  
  74:) AS UNUSED_TABLE_NAME  ),
  75:t_0_TestOrderedSet AS (SELECT
  76:  Data.g AS g,
  77:  ARRAY_AGG(DISTINCT Data.id ORDER BY Data.t) AS ids_by_t
  78:FROM
  79:  t_2_Data AS Data
  80:GROUP BY g),
  81:t_3_Expected AS (SELECT * FROM (
  82:  
  83:    SELECT
  84:      1 AS g,
  85:      ARRAY[5, 4, 3, 2, 1] AS ids_by_t
  86:   UNION ALL
  87:  
  88:    SELECT
  89:      2 AS g,
  90:      ARRAY[20, 10, 30] AS ids_by_t
  91:   UNION ALL
  92:  
  93:    SELECT
  94:      3 AS g,
  95:      ARRAY[20, 10, 30] AS ids_by_t
  96:  
  97:) AS UNUSED_TABLE_NAME  )
  98:SELECT
  99:  TestOrderedSet.g AS g,
 100:  TestOrderedSet.ids_by_t AS actual,
 101:  Expected.ids_by_t AS expected
 102:FROM
 103:  t_0_TestOrderedSet AS TestOrderedSet, t_3_Expected AS Expected
 104:WHERE
 105:  (Expected.g = TestOrderedSet.g)
    |    .    |    .    |    .    |    .    |    .    |    .    |

Generate SQL

And the generated SQL for the lazy

WITH t_2_Data AS (SELECT * FROM (
  
    SELECT
      1 AS g,
      1 AS id,
      5 AS t
   UNION ALL
  
    SELECT
      1 AS g,
      2 AS id,
      4 AS t
   UNION ALL
  
    SELECT
      1 AS g,
      2 AS id,
      4 AS t
   UNION ALL
  
    SELECT
      1 AS g,
      3 AS id,
      3 AS t
   UNION ALL
  
    SELECT
      1 AS g,
      3 AS id,
      3 AS t
   UNION ALL
  
    SELECT
      1 AS g,
      4 AS id,
      2 AS t
   UNION ALL
  
    SELECT
      1 AS g,
      5 AS id,
      1 AS t
   UNION ALL
  
    SELECT
      2 AS g,
      10 AS id,
      1 AS t
   UNION ALL
  
    SELECT
      2 AS g,
      10 AS id,
      1 AS t
   UNION ALL
  
    SELECT
      2 AS g,
      20 AS id,
      2 AS t
   UNION ALL
  
    SELECT
      2 AS g,
      20 AS id,
      0 AS t
   UNION ALL
  
    SELECT
      2 AS g,
      30 AS id,
      3 AS t
  
) AS UNUSED_TABLE_NAME  ),
t_0_TestOrderedSet AS (SELECT
  Data.g AS g,
  ARRAY_AGG(DISTINCT Data.id ORDER BY Data.t) AS ids_by_t
FROM
  t_2_Data AS Data
GROUP BY g),
t_3_Expected AS (SELECT * FROM (
  
    SELECT
      1 AS g,
      ARRAY[5, 4, 3, 2, 1] AS ids_by_t
   UNION ALL
  
    SELECT
      2 AS g,
      ARRAY[20, 10, 30] AS ids_by_t
   UNION ALL
  
    SELECT
      3 AS g,
      ARRAY[20, 10, 30] AS ids_by_t
  
) AS UNUSED_TABLE_NAME  )
SELECT
  TestOrderedSet.g AS g,
  TestOrderedSet.ids_by_t AS actual,
  Expected.ids_by_t AS expected
FROM
  t_0_TestOrderedSet AS TestOrderedSet, t_3_Expected AS Expected
WHERE
  (Expected.g = TestOrderedSet.g);
@EvgSkv
Copy link
Owner

EvgSkv commented Mar 14, 2022

Hi @munro
that's right, you can't have DISTINCT and ORDER BY at the same time in BQ. Thus we need to pre-aggregate the data first, see the LeanData predicate:

@Engine("bigquery");

TestOrderedSet(g:, ids_by_t? Array= t -> id) distinct :-
  LeanData(g:, id:, t:);

LeanData(g:, id:, t? Min= t) distinct :-
  Data(g:, id:, t:);

Data(g: 1, id: 1, t: 5);
Data(g: 1, id: 2, t: 4);
Data(g: 1, id: 2, t: 4);
Data(g: 1, id: 3, t: 3);
Data(g: 1, id: 3, t: 3);
Data(g: 1, id: 4, t: 2);
Data(g: 1, id: 5, t: 1);

Data(g: 2, id: 10, t: 1);
Data(g: 2, id: 10, t: 1);
Data(g: 2, id: 20, t: 2);
Data(g: 2, id: 20, t: 0);
Data(g: 2, id: 30, t: 3);

Expected(g: 1, ids_by_t: [5, 4, 3, 2, 1]);
Expected(g: 2, ids_by_t: [20, 10, 30]);
Expected(g: 3, ids_by_t: [20, 10, 30]);

Test(g:, actual:, expected:) :-
  TestOrderedSet(g:, ids_by_t: actual),
  Expected(g:, ids_by_t: expected);

Let me know if it solves your problem.

@munro
Copy link
Author

munro commented Apr 27, 2022

@EvgSkv It looks like you solved the problem so easily with just 2 lines 🥹 thank you! It seems really obvious now— still makes me wondering what I was trying to do lmao. I hope to get a chance to play with this more in the future. again I love the work! ❤️‍🔥

@munro munro closed this as completed Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants