MySQLOnRocksDB/mysql-5.6
forked from facebook/mysql-5.6

Make MyRocksTablePropertiesCollector trigger range compaction if there are many delete-marked entries #71
Will we be able to remove tombstones before they reach the max level? Looking at the current compaction code and the call to KeyNotExistsBeyondOutputLevel, my guess is that most tombstones are not dropped prior to compaction into the max level, because the KeyNotExists check isn't exact. But compacting into the max level means we use much more IO and more write amplification.
https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L763
SuggestCompactRange() is the function to use.
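A minimal sketch of how that call might look from the MyRocks side, assuming the experimental API declared in rocksdb/experimental.h; the wrapper function and the key range arguments below are placeholders, not existing MyRocks code:

#include <string>

#include "rocksdb/db.h"
#include "rocksdb/experimental.h"

// Ask RocksDB to schedule a compaction for a key range known to contain many
// delete-marked entries. Unlike CompactRange(), this only marks the
// overlapping files as candidates; the actual work is done later by the
// background compaction threads.
rocksdb::Status suggest_tombstone_cleanup(rocksdb::DB* db,
                                          rocksdb::ColumnFamilyHandle* cf,
                                          const std::string& range_start,
                                          const std::string& range_end) {
  rocksdb::Slice begin(range_start);
  rocksdb::Slice end(range_end);
  return rocksdb::experimental::SuggestCompactRange(db, cf, &begin, &end);
}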
@mdcallag these special range compactions also reduce the size of the source levels and have the same effect as normal compactions, so they are not entirely extra cost on top of automatic compactions. L0->L1 compactions may be triggered more frequently, though, and that would be extra cost.
If they drop tombstones like normal compactions do, then my guess is that they are unlikely to drop tombstones prior to compacting into the max level.
Here is a reproducible test case.
- Start mysqld with the following my.cnf settings: a smaller write buffer size and sst file size, and no compression, so that compactions happen easily.
loose_rocksdb_default_cf_options=write_buffer_size=64k;target_file_size_base=64k;max_bytes_for_level_base=512k;compression_per_level=kNoCompression
- Create two tables, each with a primary key and a secondary key, and a dedicated CF for the secondary key:
create table r1 (
id1 int,
id2 int,
type int,
value varchar(100),
value2 int,
value3 int,
primary key (type, id1, id2),
index id1_type (id1, type, value2, value, id2) COMMENT 'cf1'
) engine=rocksdb collate latin1_bin;
create table r2 like r1;
- Insert 50,000 rows and compact these tables. Make sure to compact r1 before r2.
# generate 50,000 rows like below.
#!/usr/bin/perl
for (my $i = 1; $i <= 50000; $i++) {
  my $value = 'x' x 50;
  print "$i,$i,$i,$value,$i,$i\n";
}
Then
load data local infile 'foo' into table r1 fields terminated by ',';
optimize table r1;
load data local infile 'foo' into table r2 fields terminated by ',';
optimize table r2;
- Update the secondary index of the r1 table 10,000 times. You can generate the update statements like this:
#!/usr/bin/perl
for (my $i = 1; $i <= 10000; $i++) {
  print "update r1 set value2=value2+1 where id1=500;\n";
}
Then
mysql> source bar
- Parse all sst files with rocksdb/sst_dump. You will find that some sst files contain mostly delete-marked entries. Example:
for f in `ls /data/mysql/3306/data/.rocksdb/*.sst`
do
DELETED=`./sst_dump --command=scan --output_hex --file=$f | grep " : 0" | wc -l`
EXISTS=`./sst_dump --command=scan --output_hex --file=$f | grep " : 1" | wc -l`
echo "$f $DELETED $EXISTS"
done
=>
/data/mysql/3306/data/.rocksdb/000651.sst 289 1
Our goal is to eventually eliminate these files.
LinkBench showed that some (id1, link_type) pairs of the id1_type index had a huge number of delete-marked entries in sst files. This made point lookups on (id1, link_type) much slower, because Next() needs to scan a huge number of delete-marked keys. We need to optimize further so that delete-marked entries are compacted away.
I'm thinking of making MyRocksTablePropertiesCollector trigger range compaction asynchronously if some of the index prefixes have a significant number of delete-marked keys. MyRocksTablePropertiesCollector is called when new sst files are created, and it knows the index definitions (key parts), so it can determine which key ranges contain many delete-marked keys. It would then be easy to trigger CompactRange asynchronously. Siying suggested using an experimental API, MarkForCompaction(), for that (https://reviews.facebook.net/D37083).
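A rough sketch of the collector-side hook, assuming RocksDB's TablePropertiesCollector interface with the AddUserKey() and NeedCompact() methods; the class name and the deletion-ratio threshold are hypothetical, and this is not the actual MyRocks implementation:

#include <cstdint>

#include "rocksdb/table_properties.h"
#include "rocksdb/types.h"

// Counts delete-marked entries per sst file and asks RocksDB to mark the file
// for compaction when the deletion ratio crosses a threshold. NeedCompact()
// is the hook behind the MarkForCompaction() mechanism discussed in
// https://reviews.facebook.net/D37083.
class DeletionRatioCollector : public rocksdb::TablePropertiesCollector {
 public:
  explicit DeletionRatioCollector(double threshold) : threshold_(threshold) {}

  rocksdb::Status AddUserKey(const rocksdb::Slice& key,
                             const rocksdb::Slice& value,
                             rocksdb::EntryType type,
                             rocksdb::SequenceNumber seq,
                             uint64_t file_size) override {
    // The real MyRocks collector would also decode the index id from the key
    // prefix here, so that deletes can be tracked per key range.
    num_entries_++;
    if (type == rocksdb::kEntryDelete) {
      num_deletes_++;
    }
    return rocksdb::Status::OK();
  }

  rocksdb::Status Finish(rocksdb::UserCollectedProperties* properties) override {
    return rocksdb::Status::OK();
  }

  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return rocksdb::UserCollectedProperties();
  }

  const char* Name() const override { return "DeletionRatioCollector"; }

  // Returning true marks the newly created sst file for compaction.
  bool NeedCompact() const override {
    return num_entries_ > 0 &&
           static_cast<double>(num_deletes_) / num_entries_ >= threshold_;
  }

 private:
  const double threshold_;
  uint64_t num_entries_ = 0;
  uint64_t num_deletes_ = 0;
};

A factory producing this collector would be registered via ColumnFamilyOptions::table_properties_collector_factories, so it runs on every flush and compaction output; files for which NeedCompact() returns true then become candidates for the background compaction scheduler.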