
Change locking system to use RocksDB's pessimistic transactions #86

Closed
spetrunia opened this issue Jun 17, 2015 · 15 comments

@spetrunia

RocksDB's pessimistic transaction system handles locking and also takes care of storing not-yet-committed changes made by the transaction. That is, it has two counterparts in MyRocks:

  1. Row locking module (rdb_locks.h: class LockTable)
  2. Table of transaction's changes (rdb_rowmods.h: class Row_table).

If we just replace #1, there will be data duplication (Row_table will have the same data as WriteBatchWithIndex).

Using the API to get SQL semantics

At start, we call
transaction->SetSnapshot()

this gives us:

+  // If SetSnapshot() is used, then any attempt to read/write a key in this
+  // transaction will fail if either another transaction has this key locked
+  // OR if this key has been written by someone else since the most recent
+  // time SetSnapshot() was called on this transaction.

Then reading, modifying, and writing a key can be done with a simple

rocksdb_trx->Get() 
... modify the record as needed
rocksdb_trx->Put()
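
For reference, here is a slightly fuller sketch of that flow (a sketch only, assuming the TransactionDB API from the pessimistic transactions patch; the DB path and key are placeholders):

    #include <string>
    #include "rocksdb/utilities/transaction.h"
    #include "rocksdb/utilities/transaction_db.h"

    void read_modify_write_sketch() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::TransactionDBOptions txn_db_options;
      rocksdb::TransactionDB* txn_db = nullptr;
      rocksdb::TransactionDB::Open(options, txn_db_options, "/tmp/txn_example",
                                   &txn_db);

      rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
      txn->SetSnapshot();  // reads/writes are conflict-checked against this snapshot

      std::string value;
      rocksdb::Status s = txn->Get(rocksdb::ReadOptions(), "key1", &value);
      // ... modify 'value' as needed ...
      s = txn->Put("key1", value);  // locks the key; fails if another transaction
                                    // holds it or wrote it since SetSnapshot()
      s = txn->Commit();            // atomically applies the buffered changes
      delete txn;
      delete txn_db;
    }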

Misc notes

  • Modifying the same row multiple times will work, it seems.

Open issues

  • The Transaction API doesn't allow creating iterators. How does one create an Iterator that reads from a snapshot that agrees with the transaction? Pass write_options.snapshot into Transaction::BeginTransaction()?
  • Statement rollback. If a statement within a transaction fails, it is rolled back: all of its changes are undone, and its locks should be released, while the transaction's changes/locks should remain. There seems to be no way to achieve this in the API.
  • SELECT ... LOCK IN SHARE MODE. There seems to be no way to achieve shared read locks in the API.
@agiardullo

  1. Yes, I still need to add iterator support. A hacky workaround would be to temporarily re-read keys that you iterate through until iterator support is present. (Fyi, we do not have plans to add gap-locking support right now.)

  2. From discussions with Yoshinori, I was under the impression that we did not need to support partial rollback of a transaction. But this is do-able if needed. How critical is this support in the short-term?

  3. Correct. We discussed not supporting shared locks in v1 and will likely add support for this in the future. The initial plan was to have LOCK IN SHARE MODE simply take an exclusive lock in the meantime.

@spetrunia

  2. From discussions with Yoshinori, I was under the impression that we did not need to support partial rollback of a transaction. But this is do-able if needed. How critical is this support in the short-term?

In order to have SQL semantics, one has to be able to roll back a failed statement. The statement may be part of a transaction; in that case the statement must be rolled back, but the transaction must remain open (i.e. neither committed nor rolled back).

If we only use the Pessimistic Transaction API for locking, we can achieve correct behavior by simply not releasing the failed statement's locks until the transaction has either committed or rolled back. This limits concurrency, but is correct behavior.

If we use the Pessimistic Transaction API to also hold the transaction's changes, then we need to be able to roll them back: if a statement inside a transaction changes a few rows and then fails, these changes must not be visible to the rest of the transaction.
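
For illustration, here is how statement rollback could look if the transaction exposed a savepoint-style interface (hypothetical sketch; SetSavePoint/RollbackToSavePoint are not something the current patch provides):

    #include "rocksdb/utilities/transaction.h"

    // Hypothetical savepoint-style API on rocksdb::Transaction, shown only to
    // illustrate the needed semantics.
    void run_statement_sketch(rocksdb::Transaction* txn) {
      txn->SetSavePoint();                // mark the start of the statement
      rocksdb::Status s = txn->Put("key1", "value1");
      if (!s.ok()) {
        txn->RollbackToSavePoint();       // undo only this statement's changes/locks
        // The transaction itself stays open; the SQL layer later decides whether
        // to commit or roll back the whole transaction.
      }
    }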

@spetrunia

I've read the new Pessimistic Transactions patch (posted at https://reviews.facebook.net/D40869). It looks like the patch provides everything that MyRocks needs.

@spetrunia

Getting close to having something working; found one thing that I forgot about.

In the current system, row lock waits are integrated into MySQL. SHOW PROCESSLIST shows waits as:

+----+------+-----------+------+---------+------+----------------------+-------------------------------------+---------------+-----------+------+
| Id | User | Host      | db   | Command | Time | State                | Info                                | Rows examined | Rows sent | Tid  |
+----+------+-----------+------+---------+------+----------------------+-------------------------------------+---------------+-----------+------+
|  1 | root | localhost | j8   | Sleep   |   87 |                      | NULL                                |             0 |         0 | 9827 |
|  3 | root | localhost | NULL | Query   |    0 | init                 | show processlist                    |             0 |         0 | 9907 |
|  4 | root | localhost | j8   | Query   |    9 | Waiting for row lock | insert into t1 values (1,'one-one') |             0 |         0 | 9830 |
+----+------+-----------+------+---------+------+----------------------+-------------------------------------+---------------+-----------+------+

and one can use KILL QUERY to immediately terminate the query that is waiting for the lock.

With my new code that uses the pessimistic RocksDB trx API, we have state=updating and the thread is not KILLable:

+----+------+-----------+------+---------+------+-------------+-----------------------------------------+---------------+-----------+------+
| Id | User | Host      | db   | Command | Time | State       | Info                                    | Rows examined | Rows sent | Tid  |
+----+------+-----------+------+---------+------+-------------+-----------------------------------------+---------------+-----------+------+
| 18 | root | localhost | j5   | Query   |    0 | cleaning up | show processlist                        |             0 |         0 | 5583 |
| 19 | root | localhost | j5   | Sleep   |   19 |             | NULL                                    |             1 |         0 | 5584 |
| 20 | root | localhost | j5   | Query   |   15 | updating    | update test set value = 12 where id = 1 |             0 |         0 | 7194 |
+----+------+-----------+------+---------+------+-------------+-----------------------------------------+---------------+-----------+------+

In order to achieve state="Waiting for row lock" and KILLability, MyRocks does the following in storage/rocksdb/rdb_locks.cc:

    thd= current_thd;
    thd_enter_cond(thd, &found_lock->cond, &found_lock->mutex,
                   &stage_waiting_on_row_lock, &old_stage);
    ...
    res= mysql_cond_timedwait(&found_lock->cond, &found_lock->mutex,
                              &wait_timeout);
    // Check if there was an SQL KILL command:
    killed= thd_killed(thd);

    thd_exit_cond(current_thd, &old_stage);

That is, we use MySQL's wrappers over pthread_cond_t and pthread_mutex_t, MySQL's wait function, and we also inform MySQL when we start/finish waiting.

Possible ways out:

  • Make "Pessimistic Transaction API" support all of the above (i.e. user-provided wait condition, mutex, and wait function)
  • Forget about all this for now.
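
To make the first option concrete, a rough sketch of what such an extension point could look like (entirely hypothetical; nothing like this exists in the current API):

    #include <cstdint>
    #include <functional>

    // Hypothetical hooks the transaction lock manager could accept, so that a
    // row-lock wait goes through MySQL's thd_enter_cond() /
    // mysql_cond_timedwait() / thd_exit_cond() and honours KILL.
    struct LockWaitHooks {
      std::function<void()> enter_wait;              // e.g. wraps thd_enter_cond()
      std::function<bool(int64_t timeout_us)> wait;  // one bounded wait; true = killed
      std::function<void()> exit_wait;               // e.g. wraps thd_exit_cond()
    };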

@spetrunia

Consider two cases:

Case 1.

trx1> txn1= txn_db->BeginTransaction(...);
trx2> txn2= txn_db->BeginTransaction(...); 

trx1> txn1->Put($key, $value1);  // This gets a lock

trx2> txn2->Put($key, $value2); // This waits for the lock

The last call times out and returns:

(gdb) p s
  $52 = {code_ = rocksdb::Status::kBusy, state_ = 0x0}

I have made MyRocks return ER_LOCK_WAIT_TIMEOUT to the SQL layer in this case.

Case 2:

trx2> txn2= txn_db->BeginTransaction(...); 

trx1> txn1= txn_db->BeginTransaction(...);
trx1> txn1->Put($key, $value1);
trx1> txn1->Commit();

trx2> txn2->Put($key, $value2);

The returned value is the same as in Case 1:

(gdb) print s
  $54 = {code_ = rocksdb::Status::kBusy, state_ = 0x0}

But the situation is different. In Case 1, it makes sense to wait longer. In Case 2, increasing the wait timeout or retrying won't help.

The return value is the same, though, so MyRocks returns ER_LOCK_WAIT_TIMEOUT in Case 2, too. For the SQL user, the error message is misleading.

I am not sure how big of a problem this is.

@agiardullo

This is good feedback. About distinguishing error cases, I can change the API to return a different status in each case. I believe there are 4 interesting cases:

Case 1 above: Transaction timed out waiting to acquire a lock. In this case, it is possible that a longer timeout would have succeeded (but we don't know, since we never got the lock).

Case 2 above: Detected a write conflict.

Case 3: Transaction has an expiration time set and is expired.

Case 4: We don't have enough memtable history to determine whether there are any conflicts (User could then choose to tune max_write_buffer_number_to_maintain).

Should we have a different Status for each of these 4 cases? How about:
case1: Status::LockWaitTimeout (new status code)
case2: Status::Busy
case3: Status::TimedOut (there is already precedent in rocksdb for using TimedOut to mean 'expired').
case4: Status::TryAgain (new status code)
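
If the codes end up like that, MyRocks could map each one to a distinct SQL error, roughly like this (sketch only; it assumes the two new codes eventually get matching predicates, and uses the existing HA_ERR_* handler error constants):

    #include "my_base.h"          // HA_ERR_* handler error codes
    #include "rocksdb/status.h"

    int map_txn_status_sketch(const rocksdb::Status& s) {
      if (s.IsBusy())   // case2: write conflict -- waiting longer will not help
        return HA_ERR_LOCK_DEADLOCK;
      // case1 (LockWaitTimeout), case3 (TimedOut/expired) and case4 (TryAgain)
      // would each get their own mapping once predicates for the new codes
      // exist; until then, fall back to what MyRocks returns today.
      return HA_ERR_LOCK_WAIT_TIMEOUT;
    }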

@agiardullo

Re MySQL locking: I think we could come up with an API that lets you override which mutex/condvar is used to do locking. But I will need to look into MySQL a bit and think about this some more.

@agiardullo

Sergei, how about we chat about the MyRocks locking API? I just sent you a FB message.

agiardullo added a commit to facebook/rocksdb that referenced this issue Aug 12, 2015
Summary:
Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures.

MySQLOnRocksDB/mysql-5.6#86 (comment)

Test Plan: unit tests

Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43323
@spetrunia

Got an interesting problem while rebasing the patch over the current tree.

The rocksdb.rocksdb test started to fail with an error like this:

MySQL [test]> create table t1 (
    ->   id int not null,
    ->   key1 int,
    ->   PRIMARY KEY (id),
    ->   index (key1) comment '$per_index_cf'
    -> ) engine=rocksdb;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]> insert into t1 values (1,1);
Query OK, 1 row affected (0.00 sec)

MySQL [test]> select * from t1;
+----+------+
| id | key1 |
+----+------+
|  1 |    1 |
+----+------+
1 row in set (0.00 sec)

MySQL [test]> alter table t1 add col2 int;
ERROR 1815 (HY000): Internal error: Operation failed. Try again.: Transaction
could not check for conflicts for opearation at SequenceNumber 9 as the
MemTable only contains changes newer than SequenceNumber 10.  Increasing the
value of the max_write_buffer_number_to_maintain option could reduce the
frequency of this er

while the expected error was:

ERROR 42000: This version of MySQL doesn't yet support 'ALTER TABLE on table with per-index CF'

The new error is caused by this scenario:

  • ALTER TABLE starts execution
  • ha_rocksdb::external_lock() gets a snapshot with the current sequence number (it's 9)
  • The code in Sequence_generator::get_and_update_next_number makes a write,
    which increases the current sequence number to 10 (NEW-WRITE)
  • A new column family is created for $per_index_cf. As part of that, a MemTable
    is created. The MemTable holds changes that start from sequence_no=10.
  • We attempt to write a row into the new column family. The transaction's
    snapshot has SequenceNumber=9, while the MemTable has data starting from
    SequenceNumber=10 (SEQ-CMP). We get an error.

The problem is not observable on the current repository, because old-style
locking doesn't care about Sequence Numbers.

The problem was not observable before the rebase, because the write (NEW-WRITE) didn't
happen, and the comparison (SEQ-CMP) compared two equal sequence numbers.

@spetrunia

  • Tried to disallow creation of "temporary" tables (temporary here means the name is #sql_nnnn) with $per_index_cf column families. There is no clear way to do this, it seems.
  • The problem only affects ALTER TABLE statements. CREATE TABLE ... SELECT is not affected, because it creates the destination table before it starts the transaction.

@yoshinorim

I recently committed https://reviews.facebook.net/D45963, which increments index_id and commits it into the data dictionary at Sequence_generator::get_and_update_next_number(). The internal begin->commit happens per index creation. I discussed this with Herman in the diff --

I think there are two ways -- one is using the Transaction API and doing begin ->
select for update -> update -> commit. The other is what Herman suggested
-- begin->update->commit at get_next_number(). I think the latter is easier and fine.
Performance is not a concern since get_next_number() is called at
index creation (DDL) only.

Maybe it's better to switch to select for update -> update at Sequence_generator::get_and_update_next_number(), within a transaction created at ha_rocksdb::create_key_defs(), and then commit everything together?
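
A rough sketch of that "select for update -> update" variant (the key name and integer encoding are placeholders, not the real data-dictionary format):

    #include <cstdint>
    #include <string>
    #include "rocksdb/utilities/transaction.h"

    // Sketch: read the counter with GetForUpdate() so it stays locked until the
    // surrounding DDL transaction commits, then write back the incremented value.
    uint32_t get_and_update_next_number_sketch(rocksdb::Transaction* txn) {
      std::string value;
      rocksdb::Status s =
          txn->GetForUpdate(rocksdb::ReadOptions(), "max_index_id", &value);
      uint32_t next = s.ok() ? static_cast<uint32_t>(std::stoul(value)) + 1 : 1;
      txn->Put("max_index_id", std::to_string(next));  // committed with the DDL
      return next;
    }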

@hermanlee

The select for update on the sequence_number would block other create table requests until the transaction is committed? If we're performing a restore of multiple databases where tables are created in parallel, could one of the table creates fail due to timing out on the select for update?

@agiardullo

Re: max_write_buffer_number_to_maintain:
When using a TransactionDB, this parameter will be set equal to max_write_buffer_number (note: the docblock needs updating, as it currently only mentions Optimistic Transactions). This means that regardless of flushing, there will always be max_write_buffer_number memtables' worth of history available for write-conflict checking. The amount of memory used here can be increased by setting this parameter explicitly. Currently, RocksDB transactions are not ideal for long-running transactions, so the solution is likely either to increase this option or to investigate whether a transaction is running for longer than expected.

Although, I suppose if someone explicitly kept flushing when the memtable was almost empty, then we could end up only maintaining N practically empty write buffers. This would not be ideal. But I believe this could only happen if MyRocks ever calls DB::Flush() explicitly.

(In the future, I want to consider an option to fall back to reading SST files if the recent writes are not present in the memtable. If we have problems tuning memtable_write_buffer_history, then this would be motivation to prioritize this. The biggest downside would be code complexity/maintenance.)

https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L279
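
For completeness, explicitly reserving more memtable history is just an options change on the MyRocks side (values below are arbitrary examples, not recommendations):

    #include "rocksdb/options.h"

    rocksdb::Options make_options_sketch() {
      rocksdb::Options options;
      options.max_write_buffer_number = 4;
      // Keep extra (flushed) memtables around for write-conflict checking:
      options.max_write_buffer_number_to_maintain = 8;
      return options;
    }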

@siying

siying commented Sep 8, 2015

@agiardullo do you feel like it would be easier to maintain if we consider the actual size of memtables when we decide how many memtables to maintain?

@agiardullo

It would not be easier to maintain, but we could make this change if it turns out it is actually a problem.
