The issue was detected while testing nbackup during a TPCC run with 64 concurrent connections.
The engine could hang immediately after begin/end backup, i.e. after a physical state change.
A few threads wait forever in RWLock::beginRead() for BackupManager::localStateLock.
The wait can never succeed because localStateLock has no owner.
Also, the lock value is -1, which should never happen.
All other threads wait for bdb latches already acquired by the threads above.
The problem is caused by a race condition:
- the backup thread acquires localStateLock in Write mode (see BackupManager::StateWriteGuard) and sets the TDBB_backup_write_locked flag (see BackupManager::lockStateWrite),
  then it marks the header page and sets the BDB_nbak_state_lock flag on its BufferDesc;
  note that this mark does not acquire localStateLock in Read mode because of BDB_nbak_state_lock (see CCH\set_diff_page() and BackupManager::lockStateRead);
  then the backup thread releases the header page (it does not release localStateLock)
- another thread commits and flushes dirty pages; it writes the dirty header page and releases localStateLock (see CCH\clear_dirty_flag_and_nbak_state),
  because the BufferDesc has the BDB_nbak_state_lock flag set and its tdbb is not marked with the TDBB_backup_write_locked flag
- the backup thread releases localStateLock in Write mode (see ~StateWriteGuard)
I.e. there is an excess RWLock::endRead call, which breaks the lock state and leads to the hang.
For the problem to occur, transactions must be very short, so that one fits (from start to finish) into the small time window
between the release of the header page and the release of localStateLock by the backup thread.
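
The interleaving described above can be replayed deterministically with a toy model. This is only a sketch, not the engine code: ToyRWLock, its counter encoding, WRITER_OFFSET, replayRace and the boolean variables are hypothetical stand-ins for localStateLock, BDB_nbak_state_lock and TDBB_backup_write_locked. It assumes a counter-based lock where readers increment the counter and a writer subtracts a large offset.

```cpp
#include <cassert>

// Toy single-threaded model of a counter-based read/write lock.
// NOT Firebird's RWLock: the encoding and WRITER_OFFSET are assumptions.
struct ToyRWLock
{
    static constexpr int WRITER_OFFSET = 50000; // hypothetical value

    // 0: free, > 0: number of readers, -WRITER_OFFSET: one writer
    int value = 0;

    void beginWrite() { assert(value == 0); value -= WRITER_OFFSET; }
    void endWrite() { value += WRITER_OFFSET; }
    void endRead() { --value; }

    // A reader may enter only when the counter records no writer.
    bool readerCanEnter() const { return value >= 0; }
};

// Replays the three steps of the race in the order given in the description.
ToyRWLock replayRace()
{
    ToyRWLock localStateLock;
    bool bdbNbakStateLock = false;           // models BDB_nbak_state_lock on the header page
    const bool committerWriteLocked = false; // committing thread has no TDBB_backup_write_locked

    // 1. Backup thread: takes the lock in Write mode, marks the header page.
    //    The mark sets the flag but takes no read lock.
    localStateLock.beginWrite();
    bdbNbakStateLock = true;
    //    The header page is released here; the write lock is still held.

    // 2. Committing thread: flushes the dirty header page. The flag is set
    //    and this thread is not marked as the write-lock holder, so it
    //    releases a read lock that was never acquired.
    if (bdbNbakStateLock && !committerWriteLocked)
    {
        bdbNbakStateLock = false;
        localStateLock.endRead(); // the excess call
    }

    // 3. Backup thread: releases the lock in Write mode.
    localStateLock.endWrite();

    return localStateLock;
}
```

Under these assumptions the counter ends at -1: it is negative, so new readers keep waiting, yet no thread holds the lock and nothing will ever release it. That is consistent with the symptoms listed at the top (lock value -1, no owner, readers stuck in beginRead).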