
Performance issue with lower(string) #7420

Closed
johnfouf opened this issue Nov 23, 2023 · 9 comments
@johnfouf

Describe the bug
lower() on large strings seems to be very slow.

To Reproduce
I have a table that includes the plaintext of 50K publications from arxiv (50K rows in the table). The size is about 2 GB.
The query
create temp table arxivlower as select lower(text) from arxiv on commit preserve rows;

finishes in 33 seconds. The trace follows:

+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| usec | statement |
+==========+=========================================================================================================================================================================+
| 9 | X_1=0@0:void := querylog.define("trace create temp table arxivlower as select lower(text) from arxiv on commit preserve rows;":str, "default_pipe":str, 18:int); |
| 4 | X_4=0:int := sql.mvc(); |
| 94 | X_13=0@0:void := sqlcatalog.create_table("tmp":str, "arxivlower":str, 0x7f9310156ba0:ptr, 1:int); |
| 41 | C_14=[50000]:bat[:oid] := sql.tid(X_4=0:int, "sys":str, "arxiv":str); |
| 375 | X_17=[50000]:bat[:str] := sql.bind(X_4=0:int, "sys":str, "arxiv":str, "text":str, 0:int); |
| 24 | X_24=[50000]:bat[:str] := algebra.projection(C_14=[50000]:bat[:oid], X_17=[50000]:bat[:str]); |
| 32041834 | X_25=[50000]:bat[:str] := batstr.toLower(X_24=[50000]:bat[:str]); # widen offset heap |
| 9 | X_28=50000:lng := aggr.count(X_25=[50000]:bat[:str]); |
| 32068883 | barrier X_90=false:bit := language.dataflow(); |
| 11 | (X_29=0@0:oid, X_30=nil:bat[:oid]) := sql.claim(X_4=0:int, "tmp":str, "lala":str, X_28=50000:lng); |
| 1646239 | X_34=0:int := sql.append(X_4=0:int, "tmp":str, "arxivlower":str, "v":str, X_29=0@0:oid, X_30=nil:bat[:oid], X_25=[50000]:bat[:str]); # copy vheap; widen empty offset heap; memcpy offsets |
| 11 | X_36=0@0:void := sql.exportOperation(); |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Expected behavior
The same query on the same table in, for example, SQLite finishes in 3 seconds, so it looks more like a bug than a performance issue.

Software versions

  • MonetDB version number v11.48.0
  • OS and version: [e.g. Ubuntu 22.04]
@njnes
Contributor

njnes commented Dec 2, 2023

Is this an installed version or locally compiled? If the latter, which cmake command line did you use?

@johnfouf
Author

johnfouf commented Dec 4, 2023

It is locally compiled. Here is the compilation script: https://github.com/athenarc/YeSQL/blob/main/exec.sh. In this repo an older MonetDB version is used (https://github.com/athenarc/YeSQL/tree/main/YeSQL_MonetDB); we have also used the same script with the latest version.

Some possible hints that I have:
In MonetDB the result of lower() differs from that of SQLite for Unicode characters. MonetDB's seems to be the more correct of the two. For example:
sqlite: select lower('ΔHello') --> 'Δhello'
monetdb: select lower('ΔHello') --> 'δhello'

This seems to add a lot of time to MonetDB's execution. It would be nice to be able to choose how to deal with Unicode (e.g., ignore it), which could speed up execution considerably. My dataset contains Unicode characters in several texts, and this slows execution down. However, if I remove all Unicode characters from the input dataset, the execution time of lower() in MonetDB stays the same (33 seconds); I would expect this to be much faster.
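
A minimal sketch of the kind of ASCII fast path this suggests (hypothetical, not MonetDB's actual code; Python for illustration): detect that a string contains no non-ASCII characters and, if so, lower it byte-wise instead of going through the full Unicode case-mapping tables.

def lower_with_ascii_fast_path(s: str) -> str:
    # Hypothetical fast path: a pure-ASCII string can be lowered
    # byte-wise (only 'A'-'Z' change), skipping the Unicode tables.
    if s.isascii():
        return s.encode('ascii').lower().decode('ascii')
    # Slow path: full Unicode case mapping, e.g. 'Δ' -> 'δ'.
    return s.lower()

print(lower_with_ascii_fast_path('Hello'))   # hello   (fast path)
print(lower_with_ascii_fast_path('ΔHello'))  # δhello  (Unicode path)

(CPython's own str.lower() already special-cases ASCII internally; the sketch only illustrates the dispatch an engine could do.)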

A further hint on this last point:

If I put the ASCII version into a file and run the following in Python:

input_file_path = 'arxivascii.txt'
with open(input_file_path, 'r') as input_file:
    content = input_file.read()
lowercase_content = content.lower()

I am getting:

real 0m3.698s

If I run it with the original Unicode content I get:
real 0m28.003s

which is pretty much in line with the difference between SQLite and MonetDB above. However, MonetDB takes the same time for both the ASCII and the Unicode version of the input text, so it seems it could do better when the input strings contain only ASCII characters.
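
For reference, a small harness along the lines of the measurement above; the second file name is a placeholder for the original Unicode dump:

import time

def time_lower(path: str) -> float:
    # Read the whole file, then time a single str.lower() over it.
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
    start = time.perf_counter()
    content.lower()
    return time.perf_counter() - start

print('ascii  :', time_lower('arxivascii.txt'))
print('unicode:', time_lower('arxivunicode.txt'))  # placeholder name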

@njnes
Contributor

njnes commented Dec 4, 2023

sqlite> select 'Ⱦ';
Ⱦ
sqlite> select lower('Ⱦ');
Ⱦ
sqlite> select lower('A');
a
sqlite> select lower('Ä');
Ä

SQLite doesn't handle all relevant codepoints.
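
For comparison, a full Unicode case mapping does lower these codepoints; checking with Python's str.lower(), which uses the Unicode tables:

for ch in ['Ⱦ', 'A', 'Ä', 'Δ']:
    print(ch, '->', ch.lower())
# Ⱦ -> ⱦ
# A -> a
# Ä -> ä
# Δ -> δ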

@njnes
Contributor

njnes commented Dec 4, 2023

I think the cmake should do the usual optimization. So indeed the expected difference is in the UTF-8 handling: we always attempt to lower codepoints, as we don't keep a property such as "ascii only".
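
A hypothetical sketch of what keeping such a property could buy: decide once per batch whether everything is ASCII, then dispatch to a cheap byte-wise path or the general per-codepoint path (Python stand-in, for illustration only):

def batch_lower(strings: list[str]) -> list[str]:
    # Hypothetical stand-in for a stored "ascii only" property:
    # one scan over the batch decides the code path for all values.
    if all(s.isascii() for s in strings):
        # Byte-wise lowering: no Unicode tables, output size == input size.
        return [s.encode('ascii').lower().decode('ascii') for s in strings]
    # General path: full Unicode case mapping per codepoint.
    return [s.lower() for s in strings]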

I have tested a bit with the code using the TPC-H lineitem comments. A small difference (improvement) could be made there, but not large enough to justify big changes. Could you pass me (or give some pointer to) these large strings, so that I can test this better?

@njnes
Contributor

njnes commented Dec 5, 2023

Seems my Azure accounts cannot access that file.

@johnfouf
Author

johnfouf commented Dec 5, 2023

@njnes
Contributor

njnes commented Dec 5, 2023

I've tested a bit. The files take 12 seconds on my local machine. With some changes to the code I can get that down to about 50%. It will need more testing before I can check in those changes. We will maintain correctness, i.e. we will still do the mapping of multi-byte codepoints (which can require larger result arrays), so we do a bit more than SQLite. Also, all of this is very single-threaded, i.e. in queries your performance may vary.
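
The point about larger result arrays can be seen with a single codepoint: some lowercase forms take more UTF-8 bytes than their uppercase counterparts, so the output heap cannot be assumed to match the input size. For example, in Python:

# 'Ⱦ' (U+023E) encodes in 2 UTF-8 bytes; its lowercase 'ⱦ' (U+2C66) needs 3.
s = 'Ⱦ'
print(len(s.encode('utf-8')), '->', len(s.lower().encode('utf-8')))  # 2 -> 3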

@njnes
Contributor

njnes commented Dec 5, 2023

The performance improvement fix was pushed to the default branch.

@mvdvm mvdvm changed the title Performance issue with large strings Performance issue with lower(string) Dec 7, 2023
@mvdvm mvdvm added the enhancement New feature or request label Dec 7, 2023
@njnes njnes self-assigned this Jan 10, 2024
@njnes njnes added this to the NEXTRELEASE milestone Jan 10, 2024
@njnes njnes closed this as completed Jan 10, 2024