New zstd 1.5.5 version is two times slower in compression speed than older 1.4.5 version #3906

Dmitri555 · 2024-02-12T15:34:18Z

New 1.5.5 version is two times slower than older one 1.4.5 in compression speed

New Version 1.5.5

START=$(date +%s); time ./zstd1.5.5 -T0 -4 -o testnew core-fastdpi_wrk0.470908.11;END=$(date +%s); echo Elapsed time $((END-START)) Sec
core-fastdpi_wrk0.470908.11 : 0.15% ( 13.1 GiB => 20.5 MiB, testnew)

real 0m7.465s
user 0m6.487s
sys 0m7.535s
Elapsed time 8 Sec

Old version 1.4.5

START=$(date +%s); time ./zstd1.4.5 -T0 -4 -o testold core-fastdpi_wrk0.470908.11;END=$(date +%s); echo Elapsed time $((END-START)) Sec
core-fastdpi_wrk0.470908.11 : 0.15% (14083944448 => 21562128 bytes, testold)

real 0m2.975s
user 0m7.905s
sys 0m1.707s
Elapsed time 3 Sec

OS: CentOS
Version 8.1
Compiler gcc
Flags default
Other relevant hardware specs AMD 64 core
Build system cmake

Cyan4973 · 2024-02-12T17:45:17Z

The sample labelled 1.5.5 has a much higher sys time.
Not sure what it corresponds to, as if OS had a harder time doing something.
Moreover, since user < real, it doesn't even feel multi-threaded, which -T0 should achieve.

So, that gives a few possibilities to look into :

caching difference : if the target file is accessed for the first time, it's more difficult for OS to retrieve it, whereas in subsequent runs, the file is likely cached. Try accessing the file with 1.5.5 after accessing it with 1.4.5 to rule out this effect.
multithreading: maybe -T0 doesn't work, for reasons unknown, and default to single thread. In which case, try to manually set a nb of cores. For example, -T30.

Also : use -vv, or even -vvv, to extract more debug information.

Dmitri555 · 2024-02-12T21:10:38Z

I changed the order of execution and added -T32 option: same result

time ./zstd1.4.5 -4 -T32 core-file -c > /dev/null
core-file : 0.15% (14083944448 => 21562128 bytes, /stdout)
real 0m2.876s user 0m7.874s sys 0m1.433s

time ./zstd1.5.5 -4 -T32 core-file -c > /dev/null
real 0m6.740s user 0m6.407s sys 0m6.808s

Dmitri555 · 2024-02-12T21:20:36Z

START=$(date +%s); time ./zstd1.4.5 -vvv -4 -T32 core-file -o 3 ; END=$(date +%s); echo Elapsed time $((END-START)) Sec
*** zstd command line interface 64-bits v1.4.5, by Yann Collet ***
set nb workers = 32
core-file: 1199042560 bytes
core-file : 0.15% (14083944448 => 21562128 bytes, 3)
core-file : Completed in 9.34 sec (cpu load : 100%)

real 0m2.925s
user 0m7.919s
sys 0m1.427s
Elapsed time 3 Sec

START=$(date +%s); time ./zstd1.5.5 -vvv -4 -T32 core-file -o 4 ; END=$(date +%s); echo Elapsed time $((END-START)) Sec
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
--zstd=wlog=21,clog=18,hlog=18,slog=1,mml=5,tlen=0,strat=2
--format=.zst --no-sparse --block-size=0 --memory=134217728 --threads=32 --content-size
set nb workers = 32
core-file: 14083944448 bytes
Decompression will require 2097152 B of memory
core-file : 0.15% (14083944448 B => 21545116 B, 4) %
core-file : Completed in 8.58 sec (cpu load : 171%)

real 0m8.588s
user 0m6.731s
sys 0m7.952s
Elapsed time 9 Sec

For strange reason with -vvv option zstd even reports wrong execution time:
Completed in 9.34 sec VS Elapsed time 3 Sec

Cyan4973 · 2024-02-12T21:48:39Z

Some more advanced tests that could be attempted :

Try v1.5.3. It would help determine if the issue comes from some new feature added in v1.5.4.

I was also wondering if the definition of "level 4" has changed between v1.4.5 and v1.5.5, but after verification, it has not, so it should not behave vastly differently.

This could be confirmed by using the internal benchmark module, bypassing potential I/O bottleneck :
zstd -b4 core-file
zstd -b4 -T32 core-file

Dmitri555 · 2024-02-13T11:40:16Z

./zstd1.4.5 -b4 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 7883578 (360.36),2520.8 MB/s ,13232.4 MB/s

./zstd1.5.5 -b4 -T32 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 8076033 (x351.77), 7734.9 MB/s, 11526.4 MB/s

./zstd1.5.5 -b4 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 7879476 (x360.55), 6341.0 MB/s, 13338.3 MB/s

Where came from message "Not enough memory" ? Every new run comes with new bug :)

cat /proc/meminfo
MemTotal: 131575760 kB
MemAvailable: 116197980 kB

Cyan4973 · 2024-02-13T17:29:46Z

The benchmark module has an internal memory limit (8 GB, divided into 3 buffers, hence ~2.7 GB per buffer). If you want to use more memory, you can change the limit manually and recompile.
But I don't think that will be necessary. The point is just to benchmark the inner compression loop, and get some speed and ratio estimations. I suspect 2.7 GB of data is good enough to get a good idea of the behavior. And there is already enough information there to learn a few things :

The target file is unusual: zstd achieves extremely high compression ratios with it (x360!). This suggests very high redundancies. We typically don't calibrate for such extreme range, so changes may happen there that we did not notice.
The compression algorithm itself is extremely fast, as expected in such a case. At ~6 GB/s for a single core, it should take about 2 seconds to compress the whole 13 GB file. This is not consistent with the report that it takes > 7 sec to do that.
Also, oddly enough, 1.5.5 seems to be much faster than v1.4.5, rather than slower, on the tested sample. This is also not consistent with the previous report suggesting that v1.5.5 is slower.

All this seems to point towards I/O as the potential bottleneck. And it's logical, considering the extreme speeds requested.

The next test could employ a ramfs file system, like /tmp/, to remove the physical component of I/O limitations. But there will still be the File System itself, and the I/O component within zstd, both of which could become bottlenecks considering the extreme speed targeted.

After that, presuming the performance difference comes from the I/O component within zstd, it will be necessary to bisect to find the commit that resulted in this performance difference on this specific scenario.

Cyan4973 · 2024-02-13T19:22:03Z

I made a few local tests, to attempt to mimic the scenario, using a highly compressible synthetic data source of 13 GB.

On a macos laptop, it gives the following :

version	-4	-4 -T0
`v1.4.5`	6.2s	2.68s
`v1.5.5`	3.4s	2.59s

As can be seen in this measurement, v1.5.5 is indeed way faster than v1.4.5 in single-thread mode. But once -T0 is employed, they both converge towards the same limit, because I/O is now the bottleneck.

In case it would be macos specific, I also made the same test on a older Ubuntu Desktop :

version	-4	-4 -T0
`v1.4.5`	6.85s	2.59s
`v1.5.5`	3.85s	2.31s

Basically, same conclusion.

So the reported issue is not reproduced.

chschroeder · 2024-02-25T14:30:37Z

I observed a similar discrepancy. I measured zstd's compression speed on a weak VM (zstd v1.4.8) that has only 4 cores and 8GB RAM and repeated the experiment on a more powerful machine (zstd v1.5.5), which is a few years old but has 128GB of RAM and 64 cores. My expectation was that the latter should be faster (if only because of the more recent software version that promises speedups in the release notes since then).

In a small experiment, I compared 1) zstd, 2) zstd with the "--long" parameter, and 3) lrzip. Each strategy was restricted to only use 4 cores. Each of them was evaluated on compression speed and file size over several compression levels. The runs on v1.5.5 were about 10-15% slower than the runs using v1.4.8. Might not be as drastic as "two times slower", but this was contrary to my expectations and therefore seems suspicious to me.

Additional differences that might be Version 1.4.8 was installed via Debian sources, whereas v1.5.5 was manually compiled with make.

The data to be compressed was a small set of web crawling results, where the single files are of size up to 4GB. Unfortunately, I cannot share the files, but they are comparable to the common crawl web archive files.

Dmitri555 · 2024-02-26T15:22:56Z

We typically don't calibrate for such extreme range

As I said before this is a regular core file from crashed sofware. It can have large areas of unused/uninitialized memory. It is nothing unusual.

Probably you are trying to reproduce the problem on your notebook with SSD drive but I have an old-school server with lots of regular HDDs.
PS
Looks like --no-asyncio solved this performance problem

Cyan4973 · 2024-02-26T17:24:40Z

Looks like --no-asyncio solved this performance problem

In which case, I would assume the issue started happening between v1.5.2 and v1.5.4.

yoniko · 2024-03-11T22:29:11Z

Thank you for reporting @Dmitri555 .
I've been able to replicate some use-cases of slow-downs when using AsyncIO on an AMD machine (didn't replicate on an Intel server / M1 laptop). The numbers I was getting are similar to what @chschroeder has observed (10-15% slowdown).
It only reproduced for me in cases where very little writes happened (very high compression ratio) and compression speed and reads were very quick (around 2G/s). I believe that in those cases the additional overhead for AsyncIO's thread synchronization becomes meaningful.

However, I didn't manage to reproduce the x2 slow-down reported here. My suspicious are that it's either the HDDs or NUMA, but there are just guesses. @Dmitri555 if you can run the same experiment only ram to ram without going through the HDD it'd be helpful. Additionally, if you can make such the process is pinned to one socket (in case there are multiple CPUs on the machine) that'd allow us to rule out NUMA as well.

As for the 15% slowdown, I've spent some time debugging this and I believe this is caused by the additional overhead introduced by AsyncIO's thread synchronization. This should only manifest in cases where the read, write and compression workloads are extremely fast to the point where the added synchronization syscalls actually take a meaningful time of the runtime. Even so, it only reproduced for me on an AMD machine.

I don't think there's an easy fix here, one solution is to increase the size of our read buffers, but this could have negative results for other use-cases. The better solution would be to add an io_uring compatible asyncio implementation, that should allow us to remove most of the overhead. We've built the asyncio module with io_uring in mind, so the same API should work, but implementing and testing would still take some work.

Dmitri555 · 2024-04-24T13:44:20Z

Async I/O make performance worser on file reading
--no-asyncio helps but not as good as was older version

DISK DRIVE:

time zstd1.5.6 -o /dev/null -T0 -3 core-file
real 0m6.808s

time zstd1.5.6 -o /dev/null --no-asyncio -T0 -3 core-file
real 0m2.738s

time zstd1.4.5 -o /dev/null -T0 -3 core-file
real 0m2.477s

TMPFS (RAM DRIVE)

time zstd1.5.6 -o /dev/null -T0 -3 core-file
real 0m7.275s

time zstd1.5.6 --no-asyncio -o /dev/null -T0 -3 core-file
real 0m2.668s

time zstd1.4.5 -o /dev/null -T0 -3 core-file
real 0m2.242s

Cyan4973 self-assigned this Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New zstd 1.5.5 version is two times slower in compression speed than older 1.4.5 version #3906

New zstd 1.5.5 version is two times slower in compression speed than older 1.4.5 version #3906

Dmitri555 commented Feb 12, 2024

Cyan4973 commented Feb 12, 2024

Dmitri555 commented Feb 12, 2024

Dmitri555 commented Feb 12, 2024

Cyan4973 commented Feb 12, 2024 •

edited

Dmitri555 commented Feb 13, 2024

Cyan4973 commented Feb 13, 2024 •

edited

Cyan4973 commented Feb 13, 2024 •

edited

chschroeder commented Feb 25, 2024

Dmitri555 commented Feb 26, 2024 •

edited

Cyan4973 commented Feb 26, 2024

yoniko commented Mar 11, 2024 •

edited

Dmitri555 commented Apr 24, 2024

New zstd 1.5.5 version is two times slower in compression speed than older 1.4.5 version #3906

New zstd 1.5.5 version is two times slower in compression speed than older 1.4.5 version #3906

Comments

Dmitri555 commented Feb 12, 2024

Cyan4973 commented Feb 12, 2024

Dmitri555 commented Feb 12, 2024

Dmitri555 commented Feb 12, 2024

Cyan4973 commented Feb 12, 2024 • edited

Dmitri555 commented Feb 13, 2024

Cyan4973 commented Feb 13, 2024 • edited

Cyan4973 commented Feb 13, 2024 • edited

chschroeder commented Feb 25, 2024

Dmitri555 commented Feb 26, 2024 • edited

Cyan4973 commented Feb 26, 2024

yoniko commented Mar 11, 2024 • edited

Dmitri555 commented Apr 24, 2024

Cyan4973 commented Feb 12, 2024 •

edited

Cyan4973 commented Feb 13, 2024 •

edited

Cyan4973 commented Feb 13, 2024 •

edited

Dmitri555 commented Feb 26, 2024 •

edited

yoniko commented Mar 11, 2024 •

edited