Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to improve the performance of SQL queries with a large list of members in IN clause. #6689

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments

Comments

@monetdb-team
Copy link

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2019-03-01 07:07:04 +0100
From: Akhilesh <<akhilesh.visouriya>>
To: SQL devs <>
Version: 11.27.13 (Jul2017-SP4)
CC: @aris-koning, arshad.super, martin.van.dinther, @njnes

Last updated: 2019-11-28 10:00:05 +0100

Comment 26913

Date: 2019-03-01 07:07:04 +0100
From: Akhilesh <<akhilesh.visouriya>>

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36
Build Identifier:

We are trying to improve performance of a particular pattern of queries that is going against a 100 million row, 50 column Monet table (all columns defined as int datatype).
We are hosting the table on a server that has 132 cores and close to 1 TB of memory and data sitting on a high speed SSD.
The pattern of queries as follows –

Select (column 2), (column 3), (column4) from table X where column 1 IN (x,y,z,…………..100k ids) ;

where column 1 is the primary key for the table.

Above query takes an hour and half to complete. So we attempted various optimization approaches – we partitioned the table, ran order index on primary key but were not able to bring it to less than 40 mins.

We did notice that this particular query pattern with large list of members in IN clause is not using more than 1 core when executing. Other query patterns are using multiple cores.

Any suggestions on what we can possibly tune or look into?
Thanks

Reproducible: Always

Steps to Reproduce:

1.We should have table of around 100 million rows and 50-100 columns(all columns defined as int datatype).

2.SQL queries with a large list of members in IN clause as per below example
Select (column 2), (column 3), (column4) from table X where column 1 IN (x,y,z,…………..100k ids) ;

Actual Results:

Select (column 2), (column 3), (column4) from table X where column 1 IN (x,y,z,…………..100k ids) ;

query takes an hour and half to complete.

So we attempted various optimization approaches – we partitioned the table, ran order index on primary key but were not able to bring it to less than 40 mins.

We are noticing that particular query pattern with large list of members in IN clause is not using more than 1 core when executing. Other query patterns are using multiple cores.

Comment 26916

Date: 2019-03-04 08:06:34 +0100
From: Arshad <<arshad.super>>

Hello,

I want to add few more details to what Akhilesh has provided. This is also seen by me on both "Jul2017-SP4" and "Aug2018-SP2".
Incidentally, The SQL is pretty much the same. That is Select with IN Clause with 150,000 IDS. Tried out with both sequential_pipe and default_pipe on both "Jul2017-SP4" and "Aug2018-SP2", and for both performance is same. Please do note that default_pipe on Aug2018-SP2 the performance does increase by mere 10 seconds not much.

Any hints/workaround for this kind of case. 150 Id's is for one of our smaller tables. The standard tables the size is pretty huge and it takes hours to complete. I have provided a 5000 Id's example in "reproducer" section, which takes greater than 2 mins. We tried debugging to see where it is getting slowed down and that info is provided below. We unfortunately cannot wait this long for results and would like your views/workaround on this.

Thanks,
Arshad

Debugging/System Details below

The SQL is very similar like below which is already given:
Select (column 2), (column 3), (column4) from table X where column 1 IN (x,y,z,…………..100k ids) ;

Tracing the SQL execution, what was seen was maximum time was taken in candscan_int(). The real wait is done under macro scan_sel(candscan, o = candlist++, w = (BUN) (((oid *) Tloc(s,q?(q - 1):0)) + 1)). I am not sure if this is tree/BUN build up or build up/processing both.

(gdb) info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x00007f69c3952200 in candscan_int at gdk_select.c:850
breakpoint already hit 74100 times
c
2 breakpoint keep y 0x00007f69b2b2aa90 in exp_bin at rel_bin.c:345
breakpoint already hit 74631 times
c
3 breakpoint keep y 0x00007f69c396a9e0 in BATselect at gdk_select.c:1186
breakpoint already hit 74113 times
c
(gdb)

(gdb) Breakpoint 1, candscan_int (b=b@entry=0x7f699ed8b880, s=s@entry=0x7f699e8ece70, bn=bn@entry=0x7f699f0c31d0, tl=tl@entry=0x7f691d2ee818, th=th@entry=0x7f691d2ee818, equi=equi@entry=1, anti=anti@entry=0, r=r@entry=0, q=q@entry=82831, cnt=cnt@entry=0, off=off@entry=0, dst=dst@entry=0x7f691d32cde0, candlist=candlist@entry=0x7f691d4bbe00, maximum=maximum@entry=82831, use_imprints=use_imprints@entry=0, hval=, lval=, hi=, li=) at gdk_select.c:850
850 scan_sel(candscan, o = candlist++, w = (BUN) (((oid *) Tloc(s,q?(q - 1):0)) + 1))

What was observed was that rel_bin() is first executed. Once that is finished candscan_int()->scan_sel() is called. The below top snapshot is taken when scan_sel() is running. This takes up 99% of the CPU.

top - 05:09:31 up 314 days, 6:26, 7 users, load average: 2.84, 2.86, 2.56
Threads: 7 total, 1 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 1.0 sy, 0.0 ni, 96.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 10562755+total, 982952 free, 56946828+used, 48582438+buff/cache
KiB Swap: 33554428 total, 9442496 free, 24111932 used. 48075398+avail Mem

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND

436748 monetad+ 20 0 26.939g 0.026t 0.026t R 99.3 2.6 1:06.53 mserver5
436482 monetad+ 20 0 26.939g 0.026t 0.026t S 0.3 2.6 0:00.07 mserver5
436463 monetad+ 20 0 26.939g 0.026t 0.026t S 0.0 2.6 0:00.06 mserver5

My Server/system details:

$ mserver5 --version
MonetDB 5 server v11.31.13 "Aug2018-SP2" (64-bit)
Copyright (c) 1993 - July 2008 CWI
Copyright (c) August 2008 - 2019 MonetDB B.V., all rights reserved
Visit https://www.monetdb.org/ for further information
Found 1007.3GiB available memory, 128 available cpu cores
Libraries:
libpcre: 8.32 2012-11-30 (compiled with 8.32)
openssl: OpenSSL 1.0.2k 26 Jan 2017 (compiled with OpenSSL 1.0.2k-fips 26 Jan 2017)
libxml2: 2.9.1 (compiled with 2.9.1)
Compiled by: monetadmin@lnx1490.ch3.prod.i.com (x86_64-pc-linux-gnu)
Compilation: gcc -std=gnu99 -g -O2
Linking : /bin/ld -m elf_x86_64 -Wl,-Bsymbolic-functions

Reproducible: Always

Steps to Reproduce:

/* Running SQL with 150K IN ids is very slow. Here query.sql has just 5000 IDS and it takes more than two minutes */
1.mclient -is -d DB -u USR < query.sql

Actual Results:

password:
+------+
| L3 |
+======+
| 7408 |
+------+
1 tuple (133.6s)

Expected Results:

It should be much quicker than 2 minutes.

Comment 26931

Date: 2019-03-19 09:13:34 +0100
From: @aris-koning

Hi Arshad,

This is fixed in the current default branch. The current situation is that for every value in the in-value-list a separate select query is executed and merged into a result holder. The new implementation uses join mechanics which is much faster.

Kind regards,

Aris

Comment 26940

Date: 2019-03-21 09:40:41 +0100
From: Arshad <<arshad.super>>

Hi Aris,

Thanks for the fix!!

Anytime soon new source tar ball will be released (SP3)? I currently do not have a hg setup with me, I always prefer a stable tar ball source release. So it will take some time before I could verify this. If you could please provide the bug ID against which the code went in, I can hand pick and patch the SP2 and test out.

Thanks
Arshad

Comment 26941

Date: 2019-03-21 15:52:47 +0100
From: Martin van Dinther <<martin.van.dinther>>

As a workaround (till the fix has been released) for the query
Select c2, c3, c4 from X
where c1 IN (1,2,3,4,5,6,7,8,9,10);
you could try to use following query construct:
Select c2, c3, c4 from X
join (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) as invalues(inval) on c1 = inval;

This will do a join instead of the IN (...).

If this workaround does not improve your query performance you could also use a (temporary) table to store the values to match and next join with that table:

create local temporary table invalues(inval int primary key) on commit preserve rows;
insert into invalues values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
select * from invalues;

Next you can do:
Select c2, c3, c4 from X
join invalues on c1 = inval;

Comment 26943

Date: 2019-03-22 14:51:36 +0100
From: @sjoerdmullender

(In reply to Arshad from comment 3)

Hi Aris,

Thanks for the fix!!

Anytime soon new source tar ball will be released (SP3)? I currently do not
have a hg setup with me, I always prefer a stable tar ball source release.
So it will take some time before I could verify this. If you could please
provide the bug ID against which the code went in, I can hand pick and patch
the SP2 and test out.

Thanks
Arshad

As Aris said, this is the the default branch. The plans are to release from the Apr2019 branch this coming April. Some as yet undetermined time after that we will start working on releasing from what is currently the default branch. So I'm afraid it's still quite some time until this change gets packaged in a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant