MySQLOnRocksDB/mysql-5.6
forked from facebook/mysql-5.6

Range scan with equal predicates returns wrong results with bloom filter #109
@spetrunia In the above example, do you know if there is any way to determine the length of the equal conditions at ha_rocksdb::index_read_map()? ("id1=1 AND id2=1 AND id3>=2" => equal cond length is 4 + 8 + 4 = 16 (4-byte index_id + 8-byte id1 + 4-byte id2), and "id1=1 AND id2>=1 AND id3>=2" => equal cond length is 4 + 8 = 12 (4-byte index_id + 8-byte id1))
One can't see the end of the range from inside the ha_rocksdb::index_read_map() call.
We need to overload handler::read_range_first() instead. It sees both ends of the range, so we can compare the two endpoints and find the length of their common prefix.
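As a minimal sketch of that idea, assuming MySQL 5.6's key_range and KEY structures (the helper name eq_cond_len and the boundary handling are illustrative assumptions, not the actual change; range flags are ignored for brevity):

#include <cstring>  // memcmp

// A keypart belongs to the equality prefix iff both endpoints contain it
// and its packed bytes are identical in start_key and end_key. Adding the
// 4-byte index_id to the result gives the eq cond lengths discussed above.
static uint eq_cond_len(const key_range *start_key, const key_range *end_key,
                        const KEY *key_info)
{
  if (start_key == nullptr || end_key == nullptr)
    return 0;
  uint len= 0;
  for (uint i= 0; i < key_info->user_defined_key_parts; i++)
  {
    const uint part_len= key_info->key_part[i].store_length;
    const key_part_map bit= key_part_map(1) << i;
    if (!(start_key->keypart_map & bit) || !(end_key->keypart_map & bit) ||
        len + part_len > start_key->length ||
        len + part_len > end_key->length ||
        memcmp(start_key->key + len, end_key->key + len, part_len) != 0)
      break;  // the first keypart that differs ends the "=" prefix
    len+= part_len;
  }
  return len;
}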
Thanks @spetrunia! I'm going to fix it.
Handling the IN clause is also interesting.
Example: select * from r1 where id1=1 and id2 in (1,2,3);
read_range_first() is called three times, and both start_key and end_key match on both id1 and id2. However, can_use_bloom_filter() is called only once, at ha_rocksdb::setup_index_scan(), because the Iterator is reused within the statement. I think MyRocks needs one of the following (a sketch of the second option follows the list):
1. Do not include id2 in the bloom filter prefix; only (index_id + id1=1) should be used. The problem is that this is less efficient, and I'm not sure how to get that information from read_range_first().
2. Decide whether the bloom filter can be used on each setup_index_scan() call.
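A rough sketch of the second option, using hypothetical member names (scan_it, scan_it_uses_bloom) rather than the real MyRocks fields, and relying on RocksDB's ReadOptions::total_order_seek to bypass the prefix bloom filter:

#include "rocksdb/db.h"

// Hypothetical sketch, not the actual MyRocks implementation: decide bloom
// filter usability on every setup_index_scan() call instead of only once
// when the Iterator is first created.
struct IndexScanState
{
  rocksdb::DB *db;
  rocksdb::ColumnFamilyHandle *cf;
  rocksdb::Iterator *scan_it= nullptr;
  bool scan_it_uses_bloom= false;

  void setup_index_scan(bool can_use_bloom_for_this_range)
  {
    // With "id2 in (1,2,3)" the MRR layer issues one range per IN element;
    // each range may have a different equality prefix, so the decision must
    // be revisited here rather than cached for the whole statement.
    if (scan_it == nullptr ||
        scan_it_uses_bloom != can_use_bloom_for_this_range)
    {
      rocksdb::ReadOptions opts;
      // total_order_seek=true disables prefix-based seeking; reusing an
      // iterator that consults the bloom filter for a range that must not
      // use it would silently skip rows.
      opts.total_order_seek= !can_use_bloom_for_this_range;
      delete scan_it;
      scan_it= db->NewIterator(opts, cf);
      scan_it_uses_bloom= can_use_bloom_for_this_range;
    }
  }
};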
https://reviews.facebook.net/D46851
@spetrunia Could you please review this diff? It calls kd->pack_index_tuple() twice (once for the end key and once for the start key), and I'm still not 100% confident that it is safe.
// Pack the end key of the range first ...
end_key_packed_size= kd->pack_index_tuple(table, pack_buffer,
                                          end_key_packed_tuple, end_key->key,
                                          end_key->keypart_map);
// ... then the start (lookup) key; both calls write through table->record[0].
packed_size= kd->pack_index_tuple(table, pack_buffer,
                                  sec_key_packed_tuple, key,
                                  keypart_map);
@yoshinorim, reviewed, OK to push. pack_index_tuple() modifies table->record[0], but this should be OK because the column values will be decoded back from the storage format when the record is read.
Suppose there is a multi-column index on (id1, id2), where both id1 and id2 are BIGINT and the prefix bloom filter length is defined as 20, and rows are selected by "id1=1 AND id2>=2". In this case the bloom filter shouldn't be used, but it actually is, and the filtering rule becomes (id1=1 AND id2=2). Any SST file that doesn't contain the prefix "id1=1, id2=2" is potentially filtered out by the bloom filter, so the query results end up missing rows.
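The arithmetic behind this, as a minimal sketch (kPrefixLen and the function name are illustrative assumptions, not the actual MyRocks code):

#include <cstddef>

// Packed key layout for an index on (id1 BIGINT, id2 BIGINT):
//   4 (index_id) + 8 (id1) + 8 (id2) = 20 bytes == prefix bloom filter length.
static const size_t kPrefixLen= 20;

// The bloom filter is only safe when the equality conditions alone fix the
// first kPrefixLen bytes of every key in the scanned range.
static bool can_use_bloom_filter_sketch(size_t eq_cond_len)
{
  return eq_cond_len >= kPrefixLen;
}

// "id1=1 AND id2>=2": only id1 is an equality, so eq_cond_len is
// 4 + 8 = 12 < 20, and the check must fail. The bug computed eq_cond_len
// as 20 by including the ">=" column, so the filter was probed with the
// prefix (id1=1, id2=2), and SST files lacking that exact prefix were
// skipped even though they may hold matching rows with id2 > 2.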
Herman found the bug. Here is a test case.
$ cat mysql-test/suite/rocksdb/t/bloomfilter2-master.opt
--rocksdb_default_cf_options=write_buffer_size=64k;block_based_table_factory={filter_policy=bloomfilter:10:true;whole_key_filtering=0;};prefix_extractor=capped:24

$ cat mysql-test/suite/rocksdb/t/bloomfilter2.test
--source include/have_rocksdb.inc
CREATE TABLE t1 (id1 BIGINT, id2 INT, id3 BIGINT, value INT, PRIMARY KEY (id1, id2, id3)) ENGINE=rocksdb;
--disable_query_log
let $i = 1;
while ($i <= 10000)
{
  let $insert = INSERT INTO t1 VALUES(1, 1, $i, $i);
  inc $i;
  eval $insert;
}
--enable_query_log
SELECT COUNT(*) FROM t1 WHERE id1=1 AND id2=1 AND id3=1;
SELECT COUNT(*) FROM t1 WHERE id1=1 AND id2=1 AND id3>=2;
SELECT COUNT(*) FROM t1 WHERE id1=1 AND id2>=1 AND id3>=2;
DROP TABLE t1;

$ cat mysql-test/suite/rocksdb/r/bloomfilter2.result
CREATE TABLE t1 (id1 BIGINT, id2 INT, id3 BIGINT, value INT, PRIMARY KEY (id1, id2, id3)) ENGINE=rocksdb;
SELECT COUNT(*) FROM t1 WHERE id1=1 AND id2=1 AND id3=1;
COUNT(*)
1
SELECT COUNT(*) FROM t1 WHERE id1=1 AND id2=1 AND id3>=2;
COUNT(*)
9999
SELECT COUNT(*) FROM t1 WHERE id1=1 AND id2>=1 AND id3>=2;
COUNT(*)
9999
DROP TABLE t1;

In my environment, both SELECTs with id3>=2 returned 6600 rows, not 9999 rows.
can_use_bloom_filter() was called. It should have returned false, but it returned true. eq_cond.size was 24 bytes in both SELECTs (it should have been 16 and 12, respectively).
The problem was that eq_cond contained both the "=" and ">=" condition columns; it must contain only the "=" condition columns. The call stack was handler::multi_range_read_next => handler::read_range_first => ha_rocksdb::index_read_map => ha_rocksdb::setup_index_scan => can_use_bloom_filter. The "=" condition appears to be included in mrr_cur_range->end_key (in handler::multi_range_read_next), but that variable is not passed down to ha_rocksdb::index_read_map. I'm not sure if there is any way to pass the "=" column values without modifying the storage engine API.
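One possible shape for this without touching the storage engine API, purely as a hedged sketch (m_end_key_hint is a hypothetical member, not the actual D46851 change): overload read_range_first() in ha_rocksdb, stash end_key before delegating to the base class, and let setup_index_scan() read it.

// Sketch only: assumes MySQL 5.6's handler API, where the base class
// handler::read_range_first() internally performs the index_read_map() call.
int ha_rocksdb::read_range_first(const key_range *start_key,
                                 const key_range *end_key,
                                 bool eq_range_arg, bool sorted)
{
  // Stash the end of the range: index_read_map() never receives end_key as
  // a parameter, but setup_index_scan()/can_use_bloom_filter() could read
  // this member to compute the true "=" prefix (see eq_cond_len() above).
  m_end_key_hint= end_key;            // m_end_key_hint is hypothetical
  int rc= handler::read_range_first(start_key, end_key, eq_range_arg, sorted);
  m_end_key_hint= nullptr;            // only the initial seek needs the hint
  return rc;
}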