Performance issue with lower(string) #7420
Comments
Is this an installed version or locally compiled? If the latter, which cmake command line did you use?
It is locally compiled. Here is the compilation script --> https://github.com/athenarc/YeSQL/blob/main/exec.sh. In this repo an older MonetDB version is used (here: https://github.com/athenarc/YeSQL/tree/main/YeSQL_MonetDB); we have also used the same script with the latest version.

Some possible hints: Unicode handling seems to add a lot of time to MonetDB's execution, so it would be nice to be able to select how Unicode is dealt with (e.g., ignore it); that could speed up execution considerably. My dataset contains Unicode characters in several texts, and this slows down execution. However, if I remove all Unicode characters from the input dataset, the execution time of lower in MonetDB remains the same (33 seconds). I would expect this to be much faster.

A hint also for the latest version: if I put the ASCII version into a file (input_file_path = 'arxivascii.txt') and lowercase it in Python, I am getting: real 0m3.698s. If I run it with the original Unicode content, the time is noticeably higher, which is pretty much in line with the previous difference between SQLite and MonetDB. However, MonetDB requires the same time for both the ASCII and Unicode versions of the input text, so it seems it could do better when the input strings include only ASCII characters.
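A minimal sketch of the kind of Python timing test described above (only the input path comes from the comment; the file-reading logic and script name are assumptions):

```python
# Hypothetical reconstruction of the timing test; only the input path
# is taken from the comment above, the rest is assumed.
input_file_path = 'arxivascii.txt'

with open(input_file_path, encoding='utf-8') as f:
    text = f.read()

# Lowercasing is the operation being timed, e.g. via:
#   time python3 lower_test.py
lowered = text.lower()
print(len(lowered))
```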
sqlite> select 'Ⱦ';
SQLite doesn't handle all relevant codepoints.
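To illustrate the point (a minimal sketch; SQLite's built-in lower() is documented to fold only ASCII characters unless the ICU extension is loaded):

```python
import sqlite3

con = sqlite3.connect(':memory:')

# SQLite's built-in lower() only folds ASCII A-Z, so the non-ASCII
# codepoint 'Ⱦ' (U+023E) passes through unchanged.
print(con.execute("select lower('Ⱦ')").fetchone()[0])  # prints 'Ⱦ'

# A full Unicode case mapping lowers it to 'ⱦ' (U+2C66),
# which is the extra work MonetDB's toLower performs.
print('Ⱦ'.lower())  # prints 'ⱦ'
```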
I think the cmake should do the usual optimization. So indeed the expected difference is in the UTF-8 handling and the fact that we always attempt to lower codepoints (as we don't keep an ASCII-only property or similar). I have tested a bit with the code using TPC-H lineitem comments. There a small difference (improvement) could be made, but not large enough to justify big changes. Could you pass (or give a pointer to) these large strings, such that I can test this better?
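A sketch of the ASCII-only fast path hinted at here, i.e., checking an ASCII-only property before choosing a case-mapping routine. This is only an illustration of the idea in Python, not MonetDB's actual code:

```python
# Hypothetical ASCII fast path for lower(); an illustration of the idea,
# not MonetDB's implementation.
ASCII_LOWER = bytes(
    c + 32 if 0x41 <= c <= 0x5A else c  # map 'A'-'Z' to 'a'-'z'
    for c in range(256)
)

def fast_lower(s: str) -> str:
    if s.isascii():
        # Pure-ASCII input: a single byte-table lookup per character,
        # and the result can never be longer than the input.
        return s.encode('ascii').translate(ASCII_LOWER).decode('ascii')
    # Otherwise fall back to the full Unicode case mapping.
    return s.lower()

print(fast_lower('Hello ARXIV'))  # hello arxiv
print(fast_lower('Ⱦest'))         # ⱦest (takes the Unicode path)
```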
Includes both versions of the input CSV file, Unicode and ASCII.
Seems my Azure accounts cannot access that file.
I've tested a bit. The files take 12 seconds on my local machine. With some changes to the code I can get that down to about 50%. It will need more testing before I can check in those changes. We will maintain correctness, i.e., we will still do the conversion of multi-byte codepoints (which can require larger result arrays), so we do a bit more than SQLite. Also, all of this is very single-threaded, i.e., in queries your performance may vary.
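For illustration, a case where correct lowercasing needs a larger result array (a property of Unicode case mapping in general, not specific to MonetDB):

```python
s = 'İ'  # U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = s.lower()  # 'i' followed by U+0307 COMBINING DOT ABOVE

print(len(s.encode('utf-8')))        # 2 bytes in UTF-8
print(len(lowered.encode('utf-8')))  # 3 bytes: the result grew
```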
The performance improvement fix was pushed to the default branch.
Describe the bug
lower() on large strings seems to be very slow.
To Reproduce
I have a table that includes the plaintext of 50K publications from arxiv (50K rows in the table). The size is about 2 GB.
The query
create temp table arxivlower as select lower(text) from arxiv on commit preserve rows;
finishes in 33 seconds. The trace follows:
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| usec | statement |
+==========+=========================================================================================================================================================================+
| 9 | X_1=0@0:void := querylog.define("trace create temp table arxivlower as select lower(text) from arxiv on commit preserve rows;":str, "default_pipe":str, 18:int); |
| 4 | X_4=0:int := sql.mvc(); |
| 94 | X_13=0@0:void := sqlcatalog.create_table("tmp":str, "arxivlower":str, 0x7f9310156ba0:ptr, 1:int); |
| 41 | C_14=[50000]:bat[:oid] := sql.tid(X_4=0:int, "sys":str, "arxiv":str); |
| 375 | X_17=[50000]:bat[:str] := sql.bind(X_4=0:int, "sys":str, "arxiv":str, "text":str, 0:int); |
| 24 | X_24=[50000]:bat[:str] := algebra.projection(C_14=[50000]:bat[:oid], X_17=[50000]:bat[:str]); |
| 32041834 | X_25=[50000]:bat[:str] := batstr.toLower(X_24=[50000]:bat[:str]); # widen offset heap |
| 9 | X_28=50000:lng := aggr.count(X_25=[50000]:bat[:str]); |
| 32068883 | barrier X_90=false:bit := language.dataflow(); |
| 11 | (X_29=0@0:oid, X_30=nil:bat[:oid]) := sql.claim(X_4=0:int, "tmp":str, "lala":str, X_28=50000:lng); |
| 1646239  | X_34=0:int := sql.append(X_4=0:int, "tmp":str, "arxivlower":str, "v":str, X_29=0@0:oid, X_30=nil:bat[:oid], X_25=[50000]:bat[:str]); # copy vheap; widen empty offset heap; memcpy offsets |
| 11 | X_36=0@0:void := sql.exportOperation(); |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Expected behavior
The same query on the same table in, for example, SQLite finishes in 3 seconds, so this looks more like a bug than a perf issue.
Software versions