Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python aggregation does not create aggr_group when aggregating over all rows #6726

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Closed

Comments

@monetdb-team
Copy link

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2019-07-05 16:20:24 +0200
From: @swingbit
To: SQL devs <>
Version: 11.31.13 (Aug2018-SP2)
CC: martin.van.dinther, @PedroTadim

Last updated: 2019-12-20 15:36:33 +0100

Comment 27113

Date: 2019-07-05 16:20:24 +0200
From: @swingbit

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
Build Identifier:

start transaction;

CREATE AGGREGATE pycount(val STRING)
RETURNS INTEGER
LANGUAGE PYTHON {
unique = numpy.unique(aggr_group)
x = numpy.zeros(shape=(unique.size))
for i in range(0, unique.size):
x[i] = val[aggr_group==unique[i]].size
return(x)
};

sql>select count(term) from (values (1,'apple'), (2,'pear')) as x(id, term);
+------+
| L5 |
+======+
| 2 |
+------+
1 tuple

sql>select pycount(term) from (values (1,'apple'), (2,'pear')) as x(id, term);
Python exception

  1. def pyfun(val,_columns,_column_types,_conn):
  1. unique = numpy.unique(aggr_group)
  1. x = numpy.zeros(shape=(unique.size))
  2. for i in range(0, unique.size):
    global name 'aggr_group' is not defined

One could say that this is correct, as there are no groups for this aggregation.
In that case, probably the python aggregation should check whether aggr_group exists, and aggregate over all rows in case it doesn't.
Is that the correct interpretation? If so, maybe this should be better explained somewhere.
The original post (https://www.monetdb.org/blog/embedded-pythonnumpy-monetdb) does not mention this case.
Also, to simplify things, maybe aggr_group should also be passed, but with a null value. That would be a cleaner test to perform in the python code.

The other interpretation could be that there is 1 group, which includes all rows.
In that case, the absence of aggr_group is a bug.

Reproducible: Always

Comment 27124

Date: 2019-07-11 10:52:33 +0200
From: @sjoerdmullender

I suppose the documentation could be better, but your function could also catch the problem:

CREATE AGGREGATE pycount(val STRING)
RETURNS INTEGER
LANGUAGE PYTHON {
try:
unique = numpy.unique(aggr_group)
except NameError:
unique = numpy.unique([0])
x = numpy.zeros(shape=(unique.size))
for i in range(0, unique.size):
x[i] = val[aggr_group==unique[i]].size
return(x)
};

Comment 27125

Date: 2019-07-11 10:54:09 +0200
From: @swingbit

Indeed, I just was not sure whether this is how it was intended to be.

Comment 27146

Date: 2019-07-17 17:14:47 +0200
From: Martin van Dinther <<martin.van.dinther>>

FYI: https://www.monetdb.org/blog/embedded-pythonnumpy-monetdb has been updated with info that parameter aggr_group is only created when GROUP BY is used.

So instead of:
select pycount(term) from (values (1,'apple'), (2,'pear')) as x(id, term);
use:
select pycount(term) from (values (1,'apple'), (2,'pear')) as x(id, term) group by id;

Comment 27147

Date: 2019-07-17 17:43:55 +0200
From: @swingbit

Thanks for adding to the documentation.

To make it clear: I do agree that the problem was not in the code, but in the documentation.

However, the solution is not to use the GROUP BY. That would yield a different result (the original query was an aggregation with no grouping).

The correct solution is to check for the existence of aggr_group in the python function:

try:
aggr_group
except NameError:
aggr_group doesn't exist
no groups, aggregate on all data
else:
aggr_group exists
aggregate on groups

Note that this should be done on every python aggregate function defined, no exception.
If you want to make the docs even clearer, perhaps you could include something like that.

Comment 27443

Date: 2019-12-05 15:07:14 +0100
From: Martin van Dinther <<martin.van.dinther>>

I have updated the documentation. It now uses:
try:
unique = numpy.unique(aggr_group)
x = numpy.zeros(shape=(unique.size))
for i in range(0, unique.size):
x[i] = numpy.sum(val[aggr_group==unique[i]])
except NameError:
aggr_group doesn't exist. no groups, aggregate on all data
x = numpy.sum(val)
return(x)

This way it behaves similar to SQL aggregate functions.

I tested it with SQL queries:
SELECT groupnr, python_aggregate(value) FROM grouped_ints GROUP BY groupnr;
SELECT value, python_aggregate(groupnr) FROM grouped_ints GROUP BY value;
SELECT python_aggregate(groupnr) FROM grouped_ints;
SELECT python_aggregate(value) FROM grouped_ints;

and for the extended table (I called it grouped_ints2):
SELECT groupnr, groupnr2, python_aggregate(value) FROM grouped_ints2 GROUP BY groupnr, groupnr2;
SELECT python_aggregate(groupnr) FROM grouped_ints2;
SELECT python_aggregate(groupnr2) FROM grouped_ints2;
SELECT python_aggregate(value) FROM grouped_ints2;

These queries all work now and with the correct output.

Comment 27446

Date: 2019-12-05 16:16:07 +0100
From: MonetDB Mercurial Repository <>

Changeset ab919d359405 made by Martin van Dinther martin.van.dinther@monetdbsolutions.com in the MonetDB repo, refers to this bug.

For complete details, see https//devmonetdborg/hg/MonetDB?cmd=changeset;node=ab919d359405

Changeset description:

Add test for bug #6726
Note that query: SELECT value, python_aggregate(groupnr) FROM grouped_ints GROUP BY value;
does not give the correct answer. It only returned two rows instead of 5.

Comment 27447

Date: 2019-12-05 16:20:54 +0100
From: Martin van Dinther <<martin.van.dinther>>

Reopened as test reveals a new problem in one case where the grouping is done on a column with unique values. See query:
SELECT value, python_aggregate(groupnr) FROM grouped_ints GROUP BY value;

Comment 27460

Date: 2019-12-06 15:32:19 +0100
From: MonetDB Mercurial Repository <>

Changeset 2d8848e66425 made by Pedro Ferreira pedro.ferreira@monetdbsolutions.com in the MonetDB repo, refers to this bug.

For complete details, see https//devmonetdborg/hg/MonetDB?cmd=changeset;node=2d8848e66425

Changeset description:

Fix for bug #6726 failing test. Look for TYPE_void bats in pyapi and generate dense sequences from them. (I have to check for 32-bit architectures yet)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant