Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distinct multicolumn performance and implementation suggestion for count distinct #3307

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments

Comments

@monetdb-team
Copy link

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2013-06-16 23:13:52 +0200
From: @skinkie
To: SQL devs <>
Version: 11.15.3 (Feb2013-SP1)
CC: @njnes

Last updated: 2013-09-27 13:47:15 +0200

Comment 18849

Date: 2013-06-16 23:13:52 +0200
From: @skinkie

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.20 Safari/537.36
Build Identifier:

I'm currently trying to run some analysis on a table of the following structure; data is available.

CREATE TABLE "sys"."kv6_import" (
"receive" TIMESTAMP NOT NULL,
"message" TIMESTAMP NOT NULL,
"vehicle" TIMESTAMP NOT NULL,
"messagetype" VARCHAR(10) NOT NULL,
"operatingday" DATE NOT NULL,
"dataownercode" VARCHAR(10) NOT NULL,
"lineplanningnumber" VARCHAR(10),
"journeynumber" INTEGER NOT NULL,
"reinforcementnumber" INTEGER NOT NULL,
"userstopcode" VARCHAR(10),
"passagesequencenumber" INTEGER,
"distancesincelastuserstop" INTEGER,
"punctuality" INTEGER,
"rd_x" VARCHAR(11),
"rd_y" VARCHAR(11),
"blockcode" INTEGER,
"vehiclenumber" INTEGER,
"wheelchairaccessible" VARCHAR(5),
"source" VARCHAR(10) NOT NULL,
"numberofcoaches" INTEGER
);

At this moment I have loaded 10 * 10^6 rows - it is about 3 days of data. A very simple question I would like to ask the database:

How many lines were actually monitored given the days available in the database. Other SQL solutions offer SELECT COUNT(DISTINCT column1,columnN)), MonetDB doesn't, so I would rewrite my query in the following structure;

select count(*), dataownercode, lineplanningnumber, operatingday from (select distinct dataownercode, lineplanningnumber, operatingday from kv6_import) as x group by dataownercode, lineplanningnumber, operatingday;

The performance is poor - no results - I have killed the database server over it. Explain as attached. I have also tried to only run the subquery which also doesn't result in output given a reasonable time.

Since MonetDB at first sight doesn't support the count distinct on multiple columns, some TSQL website suggested to use column concat, with a slightly different syntax MonetDB gives a very acceptable performance 2826 tuples (13.2s).

select count(distinct dataownercode||lineplanningnumber||operatingday), dataownercode, lineplanningnumber, operatingday from kv6_import group by dataownercode, lineplanningnumber, operatingday;

Although my implementation cuts edges, adding a unique separator would in fact implement the more common syntax.

What remains is why the select distinct is slow in the first place.

Reproducible: Always

MonetDB 5 server v11.15.8 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2013 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 4 available cpu cores
Libraries:
libpcre: 8.32 2012-11-30 (compiled with 8.32)
openssl: OpenSSL 1.0.1e 11 Feb 2013 (compiled with OpenSSL 1.0.1e 11 Feb 2013)
libxml2: 2.9.0 (compiled with 2.9.0)
Compiled by: root@openkvk.nl (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64

Comment 18850

Date: 2013-06-16 23:15:07 +0200
From: @skinkie

Created attachment 203
Explain for slow select distinct

Attached file: explainselect.txt (text/plain, 9202 bytes)
Description: Explain for slow select distinct

Comment 18877

Date: 2013-06-22 17:17:51 +0200
From: @njnes

Could you send the ddl statements of your query. Also could you try the performance of the distinct, using first numbers (preferably almost unique first) and later strings?

Comment 18878

Date: 2013-06-22 17:19:23 +0200
From: @njnes

why is lineplanning'number' not a number? Anyway start with date followed by lineplanning.

Comment 18881

Date: 2013-06-22 23:39:11 +0200
From: @skinkie

Created attachment 207
Trace of date-first

Niels; I have just tried your suggestion by moving the date to the front. The performance is awesome. 2s cold, to 1.4s hot.

select distinct operatingday, dataownercode, lineplanningnumber from kv6_import limit 10;

I have attached the trace of this query. I have validated that moving operatingday to the left again kills the performance.

The reasoning regarding lineplanningnumber is the following: typically the LinePlanningNumber (in Dutch: systeemnummer) is used to identify an individual line. What actually occurs much is that this value is the denormalised representation of the organisational unit (typically the area) and the public linenumber. So for example for Connexxion in the Schiphol area the LinePlanningNumber would be M190.

The reason why the RD coordinates are not integers are sadly due to some very bad data quality. We now prefer to do analyses on what is actually received.

Attached file: trace-datefirst.txt (text/plain, 16626 bytes)
Description: Trace of date-first

Comment 18882

Date: 2013-06-22 23:45:25 +0200
From: @skinkie

Created attachment 208
Trace of date second

Sounds also pretty handly to be able to compare the performance of this query;

select distinct dataownercode, operatingday, lineplanningnumber from kv6_import limit 10;

Attached file: trace-datesecond.txt (text/plain, 16603 bytes)
Description: Trace of date second

Comment 18883

Date: 2013-06-23 10:00:49 +0200
From: @njnes

basically we should optimize the order of the distinct. This calls for an extension of the relational optimizer.

Comment 18884

Date: 2013-06-23 11:28:51 +0200
From: @skinkie

Are you thinking in the direction of finding/storing the column with the most unique values so the number of bins are known?

Comment 18885

Date: 2013-06-23 11:41:18 +0200
From: @njnes

For group by's we allready reorder the groupby expressions using the rules' of thumb

Comment 18886

Date: 2013-06-23 11:44:32 +0200
From: @njnes

For group by's we allready reorder the groupby expressions using the rules' of thumb
ordered columns first.

fixed size before variable sized
small size types before larger types

This could indeed be improved using statistics on the data.

Comment 18887

Date: 2013-06-23 11:45:02 +0200
From: @njnes

So for now I'm looking into rewriting the multi column distinct into a group by.

Comment 18888

Date: 2013-06-23 11:58:11 +0200
From: @njnes

it seems the (distinct c1, c2 etc) is a mysql extension. So for now at most a feature request.

Comment 18889

Date: 2013-06-23 12:00:27 +0200
From: @skinkie

Yes, and PostgreSQL. Should I file a separate bug for that?

Comment 18894

Date: 2013-06-26 13:48:40 +0200
From: @njnes

The distinct is now rewritten into a group by, for which we automatically reorder the group by expressions. This should solve the performance problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant