DML statements over reverse-ordered CFs are very slow after #86. #107

Closed
spetrunia opened this Issue · 5 comments

4 participants

@spetrunia
Collaborator

Finally figured out why some DELETE queries got very slow (about 100x slower) after the fix for #86.

create table t4 (
  id int, value int, value2 varchar(200), 
  primary key (id) comment 'rev:cf_i3', 
  index(value) comment 'rev:cf_i3'
) engine=rocksdb;
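
For context, the 'rev:' prefix in the key comments asks MyRocks to store the primary key and the secondary index in a reverse-ordered column family (cf_i3), i.e. a column family whose keys sort in descending byte order. At the RocksDB level that is roughly the following (a minimal sketch; the helper name is illustrative, not MyRocks code):

  #include <string>
  #include "rocksdb/db.h"
  #include "rocksdb/comparator.h"

  // Sketch: a "reverse-ordered" column family is one created with a comparator
  // that sorts keys in descending byte order.
  rocksdb::Status create_reverse_ordered_cf(rocksdb::DB *db, const std::string &name,
                                            rocksdb::ColumnFamilyHandle **handle)
  {
    rocksdb::ColumnFamilyOptions cf_opts;
    cf_opts.comparator = rocksdb::ReverseBytewiseComparator();
    return db->CreateColumnFamily(cf_opts, name, handle);
  }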

Consider a query:

delete from t4 where id <= 3000;

EXPLAIN is:

+----+-------------+-------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | t4    | range | PRIMARY       | PRIMARY | 4       | const |    1 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+-------------+

MySQL will use the following algorithm:

  h->index_init(PRIMARY);
  for (res= h->index_first(); res != EOF && table.id <= 3000; res= h->index_next())
  {
    h->delete_row(); // deletes the row we've just read
  }

The table uses reverse column families, so this translates into these RocksDB
calls:

  trx= db->BeginTransaction();
  iter= trx->GetIterator();
  iter->Seek(index_number);
  while (true)
  {
    if (!iter->Valid() || !key_value_in_table(iter->key()))
    {
      // No more rows in the range
      break;
    }

    rowkey= iter->key();
    trx->Delete(rowkey);  // (*)

    iter->Prev();  // (**)
  }

Note the lines (*) and (**).

include/rocksdb/utilities/transaction.h has this comment:

  // The returned iterator is only valid until Commit(), Rollback(), or
  // RollbackToSavePoint() is called.
  // NOTE: Transaction::Put/Merge/Delete will currently invalidate this iterator
  // until
  // the following issue is fixed:
  // https://github.com/facebook/rocksdb/issues/616
  virtual Iterator* GetIterator(const ReadOptions& read_options) = 0;

I assume it refers to this comment in issue #616:

  it is not safe to mutate the WriteBatchWithIndex while iterating through 
  the iterator generated by NewIteratorWithBase()

So I implemented 'class Stabilized_iterator', which wraps the iterator returned
by GetIterator(), but keeps itself valid across Put/Merge/Delete calls.

It does so by

  • remembering the key it is currently pointing at, and
  • calling backend_iter->Seek(last_key) when a Put or Delete operation happens (see the sketch below).
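
A minimal sketch of the wrapper (the names and details here are illustrative, not the actual MyRocks code):

  #include <string>
  #include "rocksdb/iterator.h"

  // Illustrative sketch: wraps the iterator obtained from Transaction::GetIterator()
  // and survives Put/Merge/Delete by re-seeking to the last key it was positioned at.
  class Stabilized_iterator
  {
   public:
    explicit Stabilized_iterator(rocksdb::Iterator *backend_iter)
        : backend_iter_(backend_iter) {}

    void Seek(const rocksdb::Slice &target)
    {
      backend_iter_->Seek(target);
      save_position();
    }

    void Prev()
    {
      restore_if_needed();
      backend_iter_->Prev();
      save_position();
    }

    bool Valid() const { return valid_; }
    rocksdb::Slice key() const { return rocksdb::Slice(last_key_); }

    // Called when the owning transaction performs a Put/Merge/Delete and the
    // backend iterator becomes invalid.
    void notify_write() { needs_restore_= true; }

   private:
    void save_position()
    {
      valid_= backend_iter_->Valid();
      if (valid_)
        last_key_= backend_iter_->key().ToString(); // copy; the Slice's memory may go away
    }

    void restore_if_needed()
    {
      if (needs_restore_)
      {
        backend_iter_->Seek(rocksdb::Slice(last_key_)); // this re-Seek is where the slowdown comes from
        needs_restore_= false;
      }
    }

    rocksdb::Iterator *backend_iter_;
    std::string last_key_;
    bool valid_= false;
    bool needs_restore_= false;
  };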

This works, but in the above scenario it is very slow.

Here's why. The table is in a reverse-ordered CF, so it stores the data in this physical order:

   TABLE 
  row10K
  ...
  row03
  row02
  row01
  row00
  another-table-row

However, DELETE works in logical order: first it deletes row00, then row01, and so on. Eventually, the Transaction's WriteBatchWithIndex contains:

   kDeletedRecord  row03
   kDeletedRecord  row02
   kDeletedRecord  row01
   kDeletedRecord  row00

We read row04. We call "trx->Delete(row04)", and the WriteBatchWithIndex now contains:

   kDeletedRecord  row04
   kDeletedRecord  row03
   kDeletedRecord  row02
   kDeletedRecord  row01
   kDeletedRecord  row00

Then, we call iter->Prev() (line (**)). Stabilized_iterator notices that its underlying iterator has been invalidated. In order to restore it, it calls

  backend_iter->Seek(row04).

This operation finds row04 in the table, but it also sees {kDeletedRecord, row04} in the WriteBatchWithIndex. Since the record is marked deleted, it keeps advancing both of its underlying iterators past the deleted rows until it reaches another-table-row.

Then, Stabilized_iterator calls backend_iter->Prev(). In this call, the iterator walks back through the entries for row00, row01, ..., row04 (each a base-table row paired with a tombstone in the WriteBatchWithIndex) until it finds row05 in the base table.

This works, but if one deletes N rows this way, the i-th delete has to skip over the i tombstones written so far, so the total cost is 1 + 2 + ... + N, i.e. O(N^2) operations.

@spetrunia spetrunia self-assigned this
@spetrunia
Collaborator

Possible solutions:

  1. Fix facebook/rocksdb#616. This might be complex.

  2. We only care about the scenario where the SQL layer does Deletes/Updates on the fly, i.e. it asks to delete the row that was just read. Rdb_transaction could catch the Delete operation and, before the iterator is invalidated, advance the iterator so it is no longer pointing at the row that is about to be deleted (see the sketch after this list).

  3. A storage engine may implement the Bulk Update/Delete API. This is not a full solution, because there are cases where bulk update/delete is not used.
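
For illustration, here is a rough sketch of option 2 (hypothetical names, not the actual Rdb_transaction code), reusing the Stabilized_iterator sketch from above:

  #include <string>
  #include "rocksdb/utilities/transaction.h"

  // Hypothetical sketch of option 2. The SQL layer only deletes the row it has
  // just read, so we can step the scan iterator off that row before the Delete
  // invalidates it. The later re-seek then lands directly on the next live row
  // instead of having to skip an ever-growing run of tombstones.
  rocksdb::Status delete_row_and_keep_scanning(rocksdb::Transaction *trx,
                                               Stabilized_iterator *scan_iter)
  {
    // Copy the key before mutating the transaction: the Slice returned by key()
    // may point into memory that the write invalidates.
    std::string rowkey= scan_iter->key().ToString();

    // Move to the next row of the (reverse-ordered) scan first...
    scan_iter->Prev();

    // ...then delete; restoring the iterator afterwards re-seeks to the row it
    // already sits on, not to a key that has just been tombstoned.
    return trx->Delete(rowkey);
  }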

@yoshinorim
Owner

@spetrunia : According to Siying, it is safe to mutate a WriteBatchWithIndex while iterating over it, if it is used in the right way (the most important thing is to operate on a copy of the key when changing its state). Here is a unit test that verifies this -- https://github.com/facebook/rocksdb/blob/3.13.fb/utilities/write_batch_with_index/write_batch_with_index_test.cc#L1247-L1366

@siying
Collaborator

@igorcanadi added a test case for mutating WriteBatchWithIndex: https://reviews.facebook.net/D39501

So you should be able to add/remove keys from it, but you need to be careful about how you do it. For example, this will not work:

rocksdb::Slice rowkey= iter->key();
trx->Delete(rowkey);  // (*)

iter->Prev();  // (**)

rowkey here is a reference to a memory location. In our implementation, issuing Delete() with it will cause problems, because the memory it points to changes in the middle of the deletion and produces wrong results. But if you do:

std::string rowkey= iter->key().ToString();
trx->Delete(rowkey);  // (*)

iter->Prev();  // (**)

It will work.

@spetrunia
Collaborator

@yoshinorim @siying thanks for the clarification. My example was overly simplified; in the actual code, MyRocks copies the key/value it gets from the iterator.

@spetrunia
Collaborator

https://reviews.facebook.net/D45873 (the patch for issue #86) has now been updated to make use of this.

@hermanlee hermanlee closed this