Skip to content
This repository has been archived by the owner on Apr 6, 2019. It is now read-only.

Hang on sync_commit() #32

Closed
aardvarkk opened this issue Jan 14, 2017 · 7 comments
Closed

Hang on sync_commit() #32

aardvarkk opened this issue Jan 14, 2017 · 7 comments

Comments

@aardvarkk
Copy link
Contributor

We have a sync_commit call that works fine most of the time, but on occasion it seems to hang on the following line:

m_sync_condvar.wait(lock_callback, [=] { return m_callbacks_running == 0 && m_callbacks.empty(); });

m_callbacks_running is 0 and m_callbacks is empty, but it looks like the condition variable is not being notified? It just waits forever and nothing makes progress. I'm not really sure how to debug this!

@aardvarkk
Copy link
Contributor Author

Unfortunately, the issue has yet to be reproduced with logging enabled (but is fairly consistently reproduced without logging enabled). As a result it's tough to see if there's anything different about the calls when the hang occurs.

@Cylix
Copy link
Owner

Cylix commented Jan 15, 2017

Hi,

Thanks for reporting it!
I think I already saw that issue a few time ago but do not remember exactly what was the issue.
From my memories, that was due to callback handling from network part, so it should be fixed when I'll integrate the new network part I'm working on (I nearly finished it, just coding the windows port).

Best

@aardvarkk
Copy link
Contributor Author

aardvarkk commented Jan 15, 2017 via email

@Cylix
Copy link
Owner

Cylix commented Jan 15, 2017

wow, really cool to hear that! Thanks a lot :)

@aardvarkk
Copy link
Contributor Author

aardvarkk commented Jan 15, 2017

Phew. It's a race condition. I'm able to reproduce it consistently with the following changes (assuming a short callback time from a Redis request). Take note of the added thread sleeps (this is what makes it easy to reproduce).

Change redis_client::sync_commit to:

redis_client&
redis_client::sync_commit(void) {
  try_commit();

	std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::unique_lock<std::mutex> lock_callback(m_callbacks_mutex);
  __CPP_REDIS_LOG(debug, "cpp_redis::redis_client waits for callbacks to complete");
  m_sync_condvar.wait(lock_callback, [=] {
		bool a = m_callbacks_running == 0;
		bool b = m_callbacks.empty();
		std::this_thread::sleep_for(std::chrono::milliseconds(200));
		return a && b; });
  __CPP_REDIS_LOG(debug, "cpp_redis::redis_client finished to wait for callbacks completion");

  return *this;
}

Change redis_client::connection_receive_handler to:

void
redis_client::connection_receive_handler(network::redis_connection&, reply& reply) {
  reply_callback_t callback = nullptr;

  __CPP_REDIS_LOG(info, "cpp_redis::redis_client received reply");
  {
    std::lock_guard<std::mutex> lock(m_callbacks_mutex);

    if (m_callbacks.size()) {
      callback = m_callbacks.front();
      m_callbacks_running += 1;
      m_callbacks.pop();
    }
  }

  if (m_before_callback_handler) {
    __CPP_REDIS_LOG(debug, "cpp_redis::redis_client executes reply callback through custom before callback handler");
    m_before_callback_handler(reply, callback);
  } else if (callback) {
    __CPP_REDIS_LOG(debug, "cpp_redis::redis_client executes reply callback");
    callback(reply);
  }

	std::this_thread::sleep_for(std::chrono::milliseconds(200));
  m_callbacks_running -= 1;
  m_sync_condvar.notify_all();
}

The root problem is that m_callbacks_running is decremented outside of m_callbacks_mutex. A hang will occur in the following scenario (with approximate timestamps based on the provided sleep times):

0ms: application thread calls sync_commit()
~0ms: redis thread acquires mutex lock, increments m_callbacks_running to non-zero, then drops mutex lock
100ms: application thread acquires mutex lock, evaluates m_callbacks_running as non-zero
200ms: redis thread (without lock on mutex!) decrements m_callbacks_running to zero, and calls notify (before application thread has started waiting!)
300ms: application thread begins wait and releases mutex, but notify has already occurred!

@aardvarkk
Copy link
Contributor Author

The problem should be fixed by acquiring the mutex in connection_receive_handler before altering m_callbacks_running. I will submit a pull request.

@Cylix
Copy link
Owner

Cylix commented Jan 15, 2017

Indeed, really good catch!

The mutex lock can be a quick way to fix it for now. Then I plan to reorganize a bit the code to reduce the number of locks in order to improve performance in the next weeks (first need to finish the new network part).

Thanks for your work, greatly appreciate it!

aardvarkk added a commit to aardvarkk/cpp_redis that referenced this issue Jan 15, 2017
aardvarkk added a commit to aardvarkk/cpp_redis that referenced this issue Jan 15, 2017
@Cylix Cylix closed this as completed in #34 Jan 15, 2017
Cylix pushed a commit that referenced this issue Jan 15, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants