Home
Intended for Database Administrators using Ubuntu version 16 or higher.
Persistent key-value stores (e.g. RocksDB, LevelDB) are a critical component in many data-intensive applications. However, the usual approach, based on Log-Structured Merge-Trees (LSM-trees), requires that keys and values be kept sorted together, which leads to high write amplification (as well as I/O amplification). If a storage engine separates keys from values, it can reduce write amplification with minimal impact on the high write performance associated with key-value stores based solely on LSM-trees. TRocks does exactly this. The keys are stored in the LSM-tree for efficient lookups, and although the LSM-tree still suffers from high write amplification, only the keys are affected. The values, on the other hand, are stored in a ring buffer, which does not need to be sorted for efficient lookups. TRocks is targeted at value sizes that are at least two times larger than the key size. The larger the value, the more beneficial TRocks can be over traditional key-value stores.
Supported Platforms:
- Ubuntu 16 and higher
- CentOS 7
- Windows (requires a separate set of installation instructions)
Once you have downloaded the TRocks repository, navigate to its folder and make sure that you have all of the necessary dependencies, which are listed in INSTALL.md. Then execute the command:
INDIRECT_VALUE_SUPPORT=1 CFLAGS=-DINDIRECT_VALUE_SUPPORT make -j24 release
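System Settings
File Handles: You need to raise the limit on simultaneous open file handles globally. Open root's shell startup file:
vim /root/.bashrc
and add the line:
ulimit -n unlimited
(Or 500000, or whatever limit you prefer. Some systems may reject unlimited for this limit, in which case a number is required.)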
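To confirm that the new limit is actually in effect, it can help to check it from a fresh shell. The following is a minimal sketch using Python's standard resource module (Unix only). The helper names open_file_limits and describe are made up for illustration and are not part of TRocks.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping the 'no limit' sentinel to a readable word."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft limit {describe(soft)}, hard limit {describe(hard)}")
```

The soft limit is the one that applies to a running process, so that is the number to compare against max_open_files.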
You can locate our default options files in the tools directory. “db_bench” is the default benchmarking utility for RocksDB and TRocks. MySQL and MariaDB are also supported, but their installation instructions are not listed here.
Simple Settings Instructions
- max_open_files=
- Should match the ulimit -n setting above.
- num_levels=6
- Recommended setting.
- This ensures there are enough files and storage space in the key-value store to hold all of the keys.
- target_file_size_base=
- The size of each SST file.
- 100MB * 40B / (Expected Value Size)
- Example: 100MB * 40B / 400B = 10MB
- level_compaction_dynamic_level_bytes=True
- Recommended setting.
- This ensures that the size of each level is proportional to the total size of the key-value store, so you never end up writing more keys than the key-value store can hold.
- vlogfile_size_max=
- The size of each VLog file.
- 10MB * compression_ratio * (Expected Value Size) / 40B
- Example: 10MB * 0.5 * 400B / 40B = 50MB
- write_buffer_size=
- Compaction Level L0 is the in-memory level.
- target_file_size_base + (vlogfile_size_max / compression_ratio)
- Example: 10MB + (50MB / 0.5) = 110MB
- TRocks-specific options:
- vlogring_activation_level=0
- Enables the TRocks ring buffer.
- Only 1 ring buffer is supported at this time.
- min_indirect_val_size=24
- The recommended minimum value size to store in the ring buffer.
- All values smaller than this size are stored with the keys instead of in the ring buffer.
- fraction_remapped_during_compaction=20
- fraction_remapped_during_active_recycling=15
- fragmentation_active_recycling_trigger=25
- fragmentation_active_recycling_klaxon=50
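The three sizing formulas in the simple settings (target_file_size_base, vlogfile_size_max, and write_buffer_size) can be worked through in a few lines. This is only a sketch of the arithmetic shown in the examples, not a tool that ships with TRocks. The function name simple_settings is hypothetical, and treating MB as 10^6 bytes is an assumption made so the numbers match the round figures in the examples.

```python
MB = 1_000_000  # assumption: decimal megabytes, to match the round example numbers


def simple_settings(value_size, key_entry_size=40, compression_ratio=0.5):
    """Apply the simple-settings formulas for one expected value size (in bytes).

    key_entry_size is the ~40B per-key footprint used in the examples.
    """
    # 100MB * 40B / (Expected Value Size)
    target_file_size_base = 100 * MB * key_entry_size / value_size
    # 10MB * compression_ratio * (Expected Value Size) / 40B
    vlogfile_size_max = 10 * MB * compression_ratio * value_size / key_entry_size
    # target_file_size_base + (vlogfile_size_max / compression_ratio)
    write_buffer_size = target_file_size_base + vlogfile_size_max / compression_ratio
    return {
        "target_file_size_base": round(target_file_size_base),
        "vlogfile_size_max": round(vlogfile_size_max),
        "write_buffer_size": round(write_buffer_size),
    }


if __name__ == "__main__":
    for name, size in simple_settings(value_size=400).items():
        print(f"{name}={size}")
```

For 400B values this gives the 10MB, 50MB, and 110MB figures from the examples.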
These settings are intended for developers.
- max_bytes_for_level_multiplier=10
- target_file_size_base= (TO-DO – Update based on the total number of expected keys)
- The size of each SST file.
- The total size stored per key (total_key_size) is key_size + key_overhead + value_reference
- Example: 10 + 16 + 10 = 36 bytes, or roughly 40 bytes per key-value pair
- total_key_size * number of key-value-pairs / max_bytes_for_level_multiplier^num_levels
- Example: 40B * 10^9 / 10^6 = 40KB
- Example: 40B * 25*10^9 / 10^6 = 1MB
- level_compaction_dynamic_level_bytes=True
- If level_compaction_dynamic_level_bytes is False, level targets are determined as follows: L1's target is max_bytes_for_level_base, and Target_Size(Ln+1) = Target_Size(Ln) * max_bytes_for_level_multiplier * max_bytes_for_level_multiplier_additional[n]. By default, every entry of max_bytes_for_level_multiplier_additional is 1.
- For example, with max_bytes_for_level_base = 16384, max_bytes_for_level_multiplier = 10, and max_bytes_for_level_multiplier_additional not set, the sizes of L1, L2, L3, and L4 will be 16384, 163840, 1638400, and 16384000, respectively.
- If True, the size of each level is proportional to the total size of the key-value store, so you never end up writing more keys than the key-value store can hold.
- max_bytes_for_level_base= (Honestly, just set it to 1g and let level_compaction_dynamic_level_bytes=True take care of it)
- L1 should hold at least <level0_file_num_compaction_trigger> SST Files
- However, 2x <level0_file_num_compaction_trigger> is recommended
- level0_file_num_compaction_trigger*target_file_size_base[*2 – optional]
- However, if max_bytes_for_level_base * (max_bytes_for_level_multiplier^(num_levels-1)) is less than Total keys * (key size + overhead), then you must use the formula:
- L1 should hold at least <level0_file_num_compaction_trigger> SST Files
- Total keys * (key size+overhead) / (max_bytes_for_level_multiplier^(num_levels-1))
- Example: 40KB * 10 = 400KB
- Example: 1MB * 4 = 4MB
- write_buffer_size=
- L0 is the in-memory level. As memtables fill up with new writes, they are marked immutable and moved to L0. From there they are compacted into L1. The values are split from the keys during the first compaction, so L0 files contain full key-value pairs and should be much larger than SST files at higher levels. The L0 should be large enough to hold full memtables that can be compacted into L1.
- (<key_size+overhead+value_size> * <max_bytes_for_level_base>) / ( <key_size+overhead> * level0_file_num_compaction_trigger )
- Example: (40+400) * 4MB / (40 * 4) = 11MB
- vlogfile_size_max= (TO-DO – Henry’s suggestion - Research SST:VLOG file ratio. 1:1, 1:10, 1:100, what’s best?)
- The size of each VLog file.
- (value_size) * total number of key-value-pairs / (max_bytes_for_level_multiplier)^(num_levels) * compression_ratio
- Example: 800B * 10^9 / 10^6 * 0.5 = 400KB
- Example: 400B * 25*10^9 / 10^6 * 0.5 = 5MB
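The developer formulas above are easier to compare when they are evaluated side by side. The sketch below encodes them as plain functions. All function names are hypothetical, the defaults (40B per key, 6 levels, multiplier of 10, trigger of 4, compression ratio of 0.5) are taken from the examples, and the compression_ratio factor in vlogfile_size_max is inferred from the 0.5 that appears in those examples.

```python
def target_file_size_base(num_keys, total_key_size=40, multiplier=10, num_levels=6):
    """total_key_size * number of key-value pairs / multiplier^num_levels."""
    return round(total_key_size * num_keys / multiplier ** num_levels)


def max_bytes_for_level_base(num_keys, sst_size, total_key_size=40, multiplier=10,
                             num_levels=6, l0_trigger=4, headroom=1):
    """L1 holds l0_trigger SST files (headroom=2 for the recommended 2x),
    raised if the levels could not otherwise hold every key."""
    base = l0_trigger * sst_size * headroom
    all_key_bytes = num_keys * total_key_size
    growth = multiplier ** (num_levels - 1)
    if base * growth < all_key_bytes:
        base = all_key_bytes / growth
    return round(base)


def write_buffer_size(value_size, level_base, total_key_size=40, l0_trigger=4):
    """(key+overhead+value) * max_bytes_for_level_base / ((key+overhead) * trigger)."""
    return round((total_key_size + value_size) * level_base
                 / (total_key_size * l0_trigger))


def vlogfile_size_max(num_keys, value_size, multiplier=10, num_levels=6,
                      compression_ratio=0.5):
    """value_size * number of key-value pairs / multiplier^num_levels * ratio."""
    return round(value_size * num_keys / multiplier ** num_levels * compression_ratio)


if __name__ == "__main__":
    keys = 10 ** 9
    sst = target_file_size_base(keys)
    print("target_file_size_base =", sst)
    print("max_bytes_for_level_base =", max_bytes_for_level_base(keys, sst))
    print("vlogfile_size_max =", vlogfile_size_max(keys, value_size=800))
```

Keeping each formula in its own function makes it easy to swap in a different SST:VLog ratio or level count while the open TO-DO items are being researched.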
- Still optimizing:
- L0 file counts:
- level0_file_num_compaction_trigger=4
- level0_slowdown_writes_trigger=12
- level0_stop_writes_trigger=20
- These options control how much old data is copied from the oldest end of the ring to the newest. Copying old data from the very end removes its fragmentation. However, copying old data that is not at the very end momentarily INCREASES fragmentation until the entire file can be deleted. Thus it will take some experimentation and good record-keeping to figure out the best settings. On top of that, the best settings will depend on the distribution of keys.
- fraction_remapped_during_compaction=20
- fraction_remapped_during_active_recycling=10
- fragmentation_active_recycling_trigger=30
- fragmentation_active_recycling_klaxon=50
- L0 file counts:
- Do not exceed 100K SSTs or 100K Vlog files
- Do not exceed 200K combined opened file handles.
- More than 200K file handles is certainly allowed, but may not be possible based on the Level 1 holding at least <level0_file_num_compaction_trigger> files and scaling that to the total size of the database.
- The last level of the database should be able to hold all of the files that contain the keys AND be large enough such that it can hold (key_size+overhead)*total number of keys.
- L1 holds up to 10 SST files
- Each VLog file should hold about the same number of values as there are keys in one SST
- (TO-DO – Henry’s suggestion - Research SST:VLOG file ratio. 1:1, 1:10, 1:100, what’s best?)
- A compression factor of 0.5 is considered typical.
- The size of values after compression will be
- Example:
- Vlog files on disk: 5MB per Vlog File * 100K Vlog Files = 500GB
- Total amount of data stored (compressed): 500GB /
- Example:
The last level can hold up to (total number of key-value-pairs) keys. Each higher level is a factor of <max_bytes_for_level_multiplier> smaller. So, assuming max_bytes_for_level_multiplier=10, level (num_levels-1) will have (total number of key-value-pairs)/10 keys and level 1 will have (total number of key-value-pairs)/(max_bytes_for_level_multiplier)^(num_levels-1) keys.
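Both ways of sizing levels can be sketched in a few lines: the per-level key counts from the paragraph above, and the fixed targets that apply when level_compaction_dynamic_level_bytes is False. The function names are hypothetical. The sketch assumes levels are numbered 1 through num_levels with the last level holding every key, which is how the paragraph above counts them.

```python
def keys_at_level(total_keys, level, num_levels, multiplier=10):
    """Keys held at `level` when each level is `multiplier` times smaller
    than the one below it and the last level holds all keys."""
    return total_keys // multiplier ** (num_levels - level)


def static_level_targets(level_base, multiplier, count, additional=None):
    """Target sizes of L1..L<count> when dynamic level bytes is False.

    Target_Size(Ln+1) = Target_Size(Ln) * multiplier * additional[n],
    where every additional[n] defaults to 1.
    """
    if additional is None:
        additional = [1] * (count + 1)
    targets = [level_base]  # L1's target is max_bytes_for_level_base
    for n in range(1, count):
        targets.append(targets[-1] * multiplier * additional[n])
    return targets


if __name__ == "__main__":
    print(static_level_targets(16384, 10, 4))
    for level in range(1, 7):
        print(f"L{level}: {keys_at_level(10 ** 9, level, num_levels=6)} keys")
```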
- L1 is special in that every L0->L1 compaction uses all the files in L1, because the keys come in random order and the keys in every L0 file span the entire keyspace. If L1 is way too big, every file in L1 will participate in every compaction and there will be no way to start multiple compactions because all the files will be busy. In addition, the compaction to L1 will be slow because it is dealing with so many files, and L0 will start to overfill because files can't get out to L1. This will lead to throttling the host path. [Need Citation]
- There will need to be some experimenting, and perhaps some new code, to get the size of L0 right. If compactions into L1 bog down, the system responds by compacting from L0 back into L0, which means some of the SSTs in L0 have had values split off but most haven't. I think this case is rare and we could avoid it if we need to with no code changes.
- Explaining Compaction:
- Each layer has a preferred maximum size, and once compactions are finished every layer except the last is at its maximum size. Thus adding a single file automatically causes L1 to be overfilled, and a file is selected to be compacted into L2, which is then overfilled, so a file is selected for compaction into L3, ...
- The only compactions that concern us here are compactions into the bottom level. TRocks always selects the file whose parents in the bottom level are pointing to the oldest values. As those files are compacted, any value in the 'old' section is read from disk and recopied at the front of the ring. The option fraction_remapped_during_compaction tells where the 'old' part ends.
- The smaller fraction_remapped_during_compaction is, the less data is copied. In my original design and in my simulations, I chose the largest file for compaction (following RocksDB). Now I choose the oldest, and I'm not sure what the best setting is. On average, writing 10% of the database will revisit 100% of the SSTs, so 15 (meaning remap the oldest 15%) seems reasonable, but we will have to see.
- If there are very old files that are not picked up by normal compaction (the easiest way for this to happen is if the key-range of the oldest files is not overwritten for a long time), we perform a 'compaction-lite' called Active Recycling that remaps all the files that refer to the oldest keys, whether they are in a contiguous key-range or not. The fragmentation_active_recycling_trigger tells how much fragmentation there has to be to start an Active Recycling operation. It can be thought of as the desired space amp limit.
- We will need to get an idea for how low we can drive the space amp without causing the Active Recycling load to become intolerably large.
- The size of L0 is measured in number of files, not bytes. Its nominal size is the compaction_trigger.
- Each file in L0 starts out as a memtable, whose size is write_buffer_size.
- memtables should be sized so that the set of them (compaction_trigger * write_buffer_size) is about the size of L1 after values are stripped; maybe a little less than that. That way the compaction into L1 will not make it grossly mis-sized.
- With 16B keys/800B values, each SST having 1e4 k/vs, L1 having 10 SSTs, and a compaction_trigger of 4, you want (compaction_trigger * write_buffer_size) = 1e4 * 10 * 900 = 9e7. A compaction_trigger of 4 suggests a write_buffer_size of about 2.2e7 = 22MB. But it needs to be at least 8x the L1 size to engage the new code that fixes the L0 compaction score, so make it, say, 40MB.
- We may use a bigger value later as we understand the tradeoffs better.
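The memtable sizing arithmetic from the 16B key / 800B value example can be reproduced as follows. This is a rough sketch: the function name is hypothetical, and the 900 bytes per entry is taken from the example itself as an approximation of key plus value plus overhead.

```python
def memtable_sizing(keys_per_sst, l1_sst_count, entry_size, compaction_trigger):
    """Return (total bytes across the L0 memtables, bytes per memtable).

    The total is sized so that the memtables together carry about as many
    key-value pairs as L1 holds keys.
    """
    total_bytes = keys_per_sst * l1_sst_count * entry_size
    return total_bytes, total_bytes // compaction_trigger


if __name__ == "__main__":
    total, per_memtable = memtable_sizing(
        keys_per_sst=10_000, l1_sst_count=10, entry_size=900, compaction_trigger=4
    )
    print(f"compaction_trigger * write_buffer_size = {total}")
    print(f"suggested write_buffer_size = {per_memtable}")
```

The result is only a starting point. The text above still recommends rounding up (to 40MB in that example) so that the L0 compaction score fix is engaged.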
- How do I find my write amp?
- TBD (currently the SUM write amp defines the total database write amp, covering both keys and values).
- The other write amplification metrics in compactions stats still need to be verified.
- Where do I change my options?
- db_bench file, command line, myrocks file, other?
- Release Notes (TBD - Wiki Page)
- Unanswered question?
- Contact maintainer: Remington Brasga at Remington.Brasga@kioxia.com