
ideas for improving performance during checkpoints #60

zkasheff opened this Issue · 10 comments

5 participants


Users have mentioned that our performance dips during checkpoints. We need to run experiments to understand why that is, but here are two theories:

  • the amount of work done in the system sharply increases during a checkpoint
  • rebalancing a leaf node is done during the clone, which runs on the client thread. This may be an expensive operation. I recall that checkpoint variability became an issue (again) when we introduced promotion; the reason may be that promotion caused many more leaves to be cloned on the client thread.
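To make the second theory concrete, here is a minimal sketch of the suspected cost: when a checkpoint is pending and a client thread needs to dirty a leaf, it must clone the leaf first, and the rebalance runs inside that clone, i.e. on the client thread's critical path. All names (`leafnode`, `clone_for_checkpoint`, etc.) are illustrative, not the real ft-index API.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct leafnode {
    int nkeys;
    bool needs_rebalance;
} leafnode;

static int rebalance_calls = 0;

/* Stand-in for the real (potentially expensive) leaf rebalance. */
static void rebalance_leaf(leafnode *n) {
    rebalance_calls++;
    n->needs_rebalance = false;
}

/* Today's suspected behavior: the rebalance is paid for on the client
 * thread, before the snapshot copy is handed to the checkpoint writer. */
static leafnode clone_for_checkpoint(leafnode *n) {
    if (n->needs_rebalance)
        rebalance_leaf(n);
    return *n;
}
```

If many leaves are hot (as promotion would cause), this cost lands on many client-thread operations at once when a checkpoint begins.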

Assuming experiments verify that these are issues, here are things we can do to address them:

  • change internal nodes to always compress with quicklz (or maybe snappy, if experiments show snappy to be faster), and have the user's compression settings apply only to leaf nodes
  • on rebalancing, have a way to know whether the thread locking the node intends to redirty it. Client and cleaner threads will; the checkpoint thread will not. If the intent is to redirty the node, there is no point in rebalancing it before a clone; the rebalancing can be done after the clone.
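The second fix can be sketched as a lock-intent flag that gates the pre-clone rebalance. This is a hypothetical illustration of the idea, not the real locking code; `lock_intent`, `clone_with_intent`, and the struct fields are all assumed names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { INTENT_REDIRTY, INTENT_CLEAN } lock_intent;

typedef struct leafnode {
    int nkeys;
    bool needs_rebalance;
} leafnode;

static int rebalance_calls = 0;

static void rebalance_leaf(leafnode *n) {
    rebalance_calls++;
    n->needs_rebalance = false;
}

static leafnode clone_with_intent(leafnode *n, lock_intent intent) {
    /* Client and cleaner threads lock with INTENT_REDIRTY: the node will
     * be dirtied again soon, so rebalancing before the clone is wasted
     * work and can be deferred until after the clone. The checkpoint
     * thread locks with INTENT_CLEAN and still rebalances first. */
    if (intent == INTENT_CLEAN && n->needs_rebalance)
        rebalance_leaf(n);
    return *n;
}
```

Under this scheme the expensive rebalance moves off the client thread's path for the common (redirty) case.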
@zkasheff zkasheff was assigned

I cannot devise an experiment that shows rebalancing of a leaf node is an issue. On mork, serial insertion tests with perf_insert and regular perf_iibench tests show no performance drop during a checkpoint. This is likely because the machine has additional resources available to handle the background work of the checkpoint. So, the checkpoint does not seem to block client threads directly. Therefore, dips in performance are likely due to the cost of the resources (memory, CPU, I/O) that checkpoints consume. I will focus on changing internal nodes to quicklz and evaluating the experimental results.


I am not sure whether this is related to checkpoints, but I observe this problem, and it is rather severe.


We need to investigate. We do not know yet.


I don't think I want to mess around with the rebalancing. I cannot find a test that shows it is an issue. And clients don't ever write nodes; background threads do. Writing a node to disk does require a lock, but it is grabbed by the background thread that is writing out the clone. So I don't think there is an issue here.


We have no evidence of what is causing Vadim's symptoms. We need to investigate.


Here is an experiment: append an auto-increment to Vadim's original primary key and run with unique checks off for the primary key.
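A minimal sketch of that experiment's key transform: append a big-endian auto-increment suffix to the original key bytes, so that memcmp order matches insertion order and repeated inserts of the same original key always land to the right of the previous one (and, with unique checks off, no lookup of the original key is needed). `make_test_key` and `autoinc` are illustrative names, not part of the codebase.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t autoinc = 0;

/* Build a test key: the original key bytes followed by an 8-byte
 * big-endian auto-increment suffix. Big-endian encoding makes
 * lexicographic (memcmp) order agree with numeric suffix order. */
static size_t make_test_key(const void *orig, size_t origlen,
                            unsigned char *out) {
    memcpy(out, orig, origlen);
    uint64_t suffix = ++autoinc;
    for (int i = 7; i >= 0; i--) {
        out[origlen + i] = (unsigned char)(suffix & 0xff);
        suffix >>= 8;
    }
    return origlen + 8;
}
```

If the performance dip disappears under this workload, it would suggest the dip is tied to random (non-rightmost) leaf touches rather than to the checkpoint machinery itself.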
