
Performance issue with lower(string) #7420

Closed
johnfouf opened this issue Nov 23, 2023 · 9 comments
@johnfouf

Describe the bug
lower() on large strings seems to be very slow.

To Reproduce
I have a table that includes the plaintext of 50K publications from arxiv (50K rows in the table). The size is about 2 GB.
The query
create temp table arxivlower as select lower(text) from arxiv on commit preserve rows;

finishes in 33 seconds. The trace follows:

+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| usec | statement |
+==========+=========================================================================================================================================================================+
| 9 | X_1=0@0:void := querylog.define("trace create temp table arxivlower as select lower(text) from arxiv on commit preserve rows;":str, "default_pipe":str, 18:int); |
| 4 | X_4=0:int := sql.mvc(); |
| 94 | X_13=0@0:void := sqlcatalog.create_table("tmp":str, "arxivlower":str, 0x7f9310156ba0:ptr, 1:int); |
| 41 | C_14=[50000]:bat[:oid] := sql.tid(X_4=0:int, "sys":str, "arxiv":str); |
| 375 | X_17=[50000]:bat[:str] := sql.bind(X_4=0:int, "sys":str, "arxiv":str, "text":str, 0:int); |
| 24 | X_24=[50000]:bat[:str] := algebra.projection(C_14=[50000]:bat[:oid], X_17=[50000]:bat[:str]); |
| 32041834 | X_25=[50000]:bat[:str] := batstr.toLower(X_24=[50000]:bat[:str]); # widen offset heap |
| 9 | X_28=50000:lng := aggr.count(X_25=[50000]:bat[:str]); |
| 32068883 | barrier X_90=false:bit := language.dataflow(); |
| 11 | (X_29=0@0:oid, X_30=nil:bat[:oid]) := sql.claim(X_4=0:int, "tmp":str, "lala":str, X_28=50000:lng); |
| 1646239 | X_34=0:int := sql.append(X_4=0:int, "tmp":str, "arxivlower":str, "v":str, X_29=0@0:oid, X_30=nil:bat[:oid], X_25=[50000]:bat[:str]); # copy vheap; widen empty offset heap; memcpy offsets |
| 11 | X_36=0@0:void := sql.exportOperation(); |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Expected behavior
The same query on the same table in, for example, SQLite finishes in 3 seconds, so it looks more like a bug than a performance issue.

Software versions

  • MonetDB version number v11.48.0
  • OS and version: [e.g. Ubuntu 22.04]
@njnes
Contributor

njnes commented Dec 2, 2023

Is this an installed version or locally compiled? If the latter, which cmake command line did you use?

@johnfouf
Author

johnfouf commented Dec 4, 2023

It is locally compiled. Here is the compilation script: https://github.com/athenarc/YeSQL/blob/main/exec.sh. In this repo an older MonetDB version is used (https://github.com/athenarc/YeSQL/tree/main/YeSQL_MonetDB); we have also used the same script with the latest version.

Some possible hints that I have:
In MonetDB the result of lower() differs from that of SQLite for Unicode characters. MonetDB's seems to be the more correct of the two. For example:
sqlite: select lower('ΔHello') --> 'Δhello'
monetdb: select lower('ΔHello') --> 'δhello'

This seems to add a lot of time to MonetDB's execution. It would be nice to be able to choose how to deal with Unicode (e.g., ignore it), which could speed up execution considerably. My dataset contains Unicode characters in several texts, and this slows execution down. However, if I remove all Unicode characters from the input dataset, the execution time of lower() in MonetDB stays the same (33 seconds); I would expect this to be much faster.
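
A minimal sketch of the kind of ASCII fast path this suggests (hypothetical, not MonetDB's actual code; Python for illustration): detect that a string contains no non-ASCII characters and, if so, lower it byte-wise instead of going through the full Unicode case-mapping tables.

def lower_with_ascii_fast_path(s: str) -> str:
    # Hypothetical fast path: a pure-ASCII string can be lowered
    # byte-wise (only 'A'-'Z' change), skipping the Unicode tables.
    if s.isascii():
        return s.encode('ascii').lower().decode('ascii')
    # Slow path: full Unicode case mapping, e.g. 'Δ' -> 'δ'.
    return s.lower()

print(lower_with_ascii_fast_path('Hello'))   # hello   (fast path)
print(lower_with_ascii_fast_path('ΔHello'))  # δhello  (Unicode path)

(CPython's own str.lower() already special-cases ASCII internally; the sketch only illustrates the dispatch an engine could do.)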

A further hint on this last point:

If I put the ASCII version into a file and run the following in Python:

input_file_path = 'arxivascii.txt'
with open(input_file_path, 'r') as input_file:
    content = input_file.read()
lowercase_content = content.lower()

I am getting:

real 0m3.698s

If I run it with the original Unicode content I get:
real 0m28.003s

which is pretty much in line with the difference between SQLite and MonetDB above. However, MonetDB takes the same time for both the ASCII and the Unicode version of the input text, so it seems it could do better when the input strings contain only ASCII characters.
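
For reference, a small harness along the lines of the measurement above; the second file name is a placeholder for the original Unicode dump:

import time

def time_lower(path: str) -> float:
    # Read the whole file, then time a single str.lower() over it.
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
    start = time.perf_counter()
    content.lower()
    return time.perf_counter() - start

print('ascii  :', time_lower('arxivascii.txt'))
print('unicode:', time_lower('arxivunicode.txt'))  # placeholder name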

@njnes
Contributor

njnes commented Dec 4, 2023

sqlite> select 'Ⱦ';
Ⱦ
sqlite> select lower('Ⱦ');
Ⱦ
sqlite> select lower('A');
a
sqlite> select lower('Ä');
Ä

SQLite doesn't handle all relevant codepoints.
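
For comparison, a full Unicode case mapping does lower these codepoints; checking with Python's str.lower(), which uses the Unicode tables:

for ch in ['Ⱦ', 'A', 'Ä', 'Δ']:
    print(ch, '->', ch.lower())
# Ⱦ -> ⱦ
# A -> a
# Ä -> ä
# Δ -> δ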

@njnes
Contributor

njnes commented Dec 4, 2023

I think the cmake should do the usual optimization. So indeed the expected difference is in the UTF-8 handling: we always attempt to lower codepoints, as we don't keep a property such as "ascii only".
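
A hypothetical sketch of what keeping such a property could buy: decide once per batch whether everything is ASCII, then dispatch to a cheap byte-wise path or the general per-codepoint path (Python stand-in, for illustration only):

def batch_lower(strings: list[str]) -> list[str]:
    # Hypothetical stand-in for a stored "ascii only" property:
    # one scan over the batch decides the code path for all values.
    if all(s.isascii() for s in strings):
        # Byte-wise lowering: no Unicode tables, output size == input size.
        return [s.encode('ascii').lower().decode('ascii') for s in strings]
    # General path: full Unicode case mapping per codepoint.
    return [s.lower() for s in strings]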

I have tested a bit with the code using the TPC-H lineitem comments. A small difference (improvement) could be made there, but not large enough to justify big changes. Could you pass me (or give some pointer to) these large strings, so that I can test this better?

@njnes
Contributor

njnes commented Dec 5, 2023

Seems my Azure accounts cannot access that file.

@johnfouf
Author

johnfouf commented Dec 5, 2023

@njnes
Contributor

njnes commented Dec 5, 2023

I've tested a bit. The files take 12 seconds on my local machine. With some changes to the code I can get that down to about 50%. It will need more testing before I can check in those changes. We will maintain correctness, i.e. we will still do the mapping of multi-byte codepoints (which can require larger result arrays), so we do a bit more than SQLite. Also, all of this is very single-threaded, i.e. in queries your performance may vary.
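
The point about larger result arrays can be seen with a single codepoint: some lowercase forms take more UTF-8 bytes than their uppercase counterparts, so the output heap cannot be assumed to match the input size. For example, in Python:

# 'Ⱦ' (U+023E) encodes in 2 UTF-8 bytes; its lowercase 'ⱦ' (U+2C66) needs 3.
s = 'Ⱦ'
print(len(s.encode('utf-8')), '->', len(s.lower().encode('utf-8')))  # 2 -> 3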

@njnes
Contributor

njnes commented Dec 5, 2023

The performance improvement fix was pushed to the default branch.

@mvdvm mvdvm changed the title Performance issue with large strings Performance issue with lower(string) Dec 7, 2023
@mvdvm mvdvm added the enhancement New feature or request label Dec 7, 2023
@njnes njnes self-assigned this Jan 10, 2024
@njnes njnes added this to the NEXTRELEASE milestone Jan 10, 2024
@njnes njnes closed this as completed Jan 10, 2024