Weak duplicate elimination in string heaps > 64KB #6138
Date: 2016-12-02 12:21:24 +0100
Last updated: 2019-09-13 08:42:15 +0200
Date: 2016-12-02 12:21:24 +0100
Build is default branch 8086d2d529f2
varchar (and other variable-length) columns store their actual values in a separate .theap file, which includes a hash structure used to eliminate duplicate values. This works well for heaps smaller than 64KB; however, larger heaps switch to an alternate set of heuristics whose implementation details make it much less effective than it should be. The result is unnecessarily high disk usage, with a corresponding performance decrease because more data needs to be accessed.
sql>create table foo as select cast(value as varchar(64)) as v from generate_series(cast(0 as int),2500) with data;
sql>select count(*) from foo;insert into foo values ('dupe');select heapsize from sys.storage() where "table"='foo';
The second line should be run repeatedly. This is the easiest possible case of duplicate elimination (inserting the exact same value as last time), yet the heapsize still increases over time. The de-duplication appears to get lost during bl_restart(), so running that line once every 30 seconds shows the heapsize increasing each time.
The duplicate elimination breaks because BATload_intern() calls strCleanHash(), which memsets the entire hash table to zero: any time the BAT is unloaded and reloaded, all de-duplication is lost. The change in 4faed73ce142 made unload/reload happen significantly more frequently than it used to. The explanation for the hash zeroing appears in a comment in e6d90b529745: "heap may have been mmaped-ed, appended-by-force, and then corrupted by crash".
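To make the failure mode concrete, here is a minimal toy model of it. It is not the code in gdk_atoms.c: the toy_* names, the table size, the fixed capacity and the djb2 hash are all assumptions made for illustration. It keeps one offset per bucket and shows that zeroing the table makes the next insert of an already-stored string append a second copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 1024          /* hypothetical table size */
#define TOY_HEAP_CAP (1 << 16)    /* hypothetical fixed capacity */

/* Toy stand-in for a string heap: a table of offsets plus raw bytes. */
typedef struct {
    size_t bucket[TOY_BUCKETS];   /* 0 = empty, else offset + 1 of a string */
    char data[TOY_HEAP_CAP];
    size_t used;                  /* bytes of data[] in use */
} toy_heap;

size_t toy_hash(const char *s)
{
    size_t h = 5381;              /* djb2, chosen for brevity only */
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % TOY_BUCKETS;
}

/* Return the offset of s, appending it only when the bucket does not
 * already point at an identical string. */
size_t toy_put(toy_heap *h, const char *s)
{
    size_t b = toy_hash(s);
    size_t len = strlen(s) + 1;

    if (h->bucket[b] != 0 && strcmp(h->data + h->bucket[b] - 1, s) == 0)
        return h->bucket[b] - 1;  /* duplicate found, heap does not grow */
    assert(h->used + len <= TOY_HEAP_CAP);
    memcpy(h->data + h->used, s, len);
    h->bucket[b] = h->used + 1;
    h->used += len;
    return h->bucket[b] - 1;
}

/* Models the effect attributed to strCleanHash(): zero the whole table. */
void toy_clean_hash(toy_heap *h)
{
    memset(h->bucket, 0, sizeof(h->bucket));
}
```

In this model, calling toy_put(&h, "dupe") twice returns the same offset and leaves h.used unchanged, whereas calling toy_clean_hash(&h) between the two calls makes the second one append a fresh copy, which matches the heapsize growth seen in the reproduction.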
If that is the only reason for the wipe then I reckon this might work instead:
diff -r 8086d2d529f2 gdk/gdk_atoms.c
If there are other reasons for the wipe (e.g. dealing with files from older versions which may be differently formatted) then that won't work. It will work for the case of a change to the hash function, although it won't be particularly effective until new values are inserted with the new hash function.
The only alternative I can currently think of is to rebuild the hash table based on the first 64KB of data and hope that the value distribution in the column hasn't changed too much over time. It may be worth, however, thinking about ways to avoid the cost of the 'else' branch too - rebuilding the whole hash table every time the BAT is loaded is not an insignificant amount of CPU usage.
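A rough sketch of that alternative, again on a toy model rather than the real heap layout: walk the NUL-terminated strings that start within the first `limit` bytes and re-enter them into the table. The toy_* names, the sizes and the hash are hypothetical, and the model assumes the data area is a plain sequence of NUL-terminated strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 1024          /* hypothetical table size */
#define TOY_HEAP_CAP (1 << 17)    /* hypothetical fixed capacity */

typedef struct {
    size_t bucket[TOY_BUCKETS];   /* 0 = empty, else offset + 1 of a string */
    char data[TOY_HEAP_CAP];      /* back-to-back NUL-terminated strings */
    size_t used;                  /* bytes of data[] in use */
} toy_heap;

size_t toy_hash(const char *s)
{
    size_t h = 5381;              /* djb2, chosen for brevity only */
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % TOY_BUCKETS;
}

/* Clear the table, then re-enter every string that starts below `limit`
 * bytes. Returns the number of strings entered. The cost is linear in
 * the number of bytes scanned, which is why a small limit is attractive. */
size_t toy_rebuild(toy_heap *h, size_t limit)
{
    size_t off = 0, n = 0;

    /* the last stored string must be terminated, or strlen would overrun */
    assert(h->used == 0 || h->data[h->used - 1] == '\0');
    memset(h->bucket, 0, sizeof(h->bucket));
    if (limit > h->used)
        limit = h->used;
    while (off < limit) {
        const char *s = h->data + off;
        h->bucket[toy_hash(s)] = off + 1;   /* later strings win a bucket */
        off += strlen(s) + 1;
        n++;
    }
    return n;
}
```

Passing 64 * 1024 as `limit` corresponds to the "first 64KB" idea, while passing h->used rebuilds from the whole heap at a proportionally higher CPU cost.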
Date: 2017-02-03 10:06:10 +0100
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=2ff5b4701e0e
Date: 2017-03-02 15:42:22 +0100
I implemented a different fix: only clean the hash table the first time a string heap is loaded after server restart.
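The idea can be sketched as follows. This is an illustration of the approach as described in this comment, not the contents of the changeset: the needs_clean flag and the toy_* functions are hypothetical names, and the flag is assumed to live only in memory so that it is set again at every server start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 1024          /* hypothetical table size */

typedef struct {
    size_t bucket[TOY_BUCKETS];   /* hash table, as found on disk */
    bool needs_clean;             /* hypothetical in-memory-only flag */
} toy_heap;

/* Run once per heap at server start: the on-disk table may be stale
 * after a crash, so mark it as untrusted. */
void toy_server_start(toy_heap *h)
{
    assert(h != NULL);
    h->needs_clean = true;
}

/* Run on every load. Wipes the table only on the first load after a
 * server start; later unload/reload cycles keep the entries.
 * Returns true when the table was wiped. */
bool toy_load(toy_heap *h)
{
    assert(h != NULL);
    if (!h->needs_clean)
        return false;
    memset(h->bucket, 0, sizeof(h->bucket));
    h->needs_clean = false;
    return true;
}
```

With this shape, the crash-safety argument behind the original wipe still holds for the first load, while the repeated unload/reload cycles described above no longer discard the table.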