Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regular file flushes whilst writing out scavenged chunk #1918

Merged
merged 1 commit into from Apr 30, 2019

Conversation

4 participants
@lscpike
Copy link
Contributor

commented Apr 25, 2019

After our scavenge performance improvements we've been experiencing intermittent commit timeouts and general spikes in write time during a scavenge. This is caused by the large file write direct from memory at the end of a chunk scavenge or chunks merge.

The file is written quickly and then flushed/closed resulting in up to 256MB being put on the disk write queue. We can see in our monitoring that the disk transfer latency spikes resulting in the slow writes for new events.

This PR adds a regular flush to the scavenge and merge process so that the individual flushes are smaller. Looking at stats when testing locally this results in avg write size of ~60KB instead of 1MB.

I've run a custom build with this change (plus #1914) and run a full scavenge on one of our large databases. There was still occasional impact, but there were no commit timeouts and write times were much more stable.

I played with adding a Thread.Sleep after each flush to throttle the write even more. This gave good results locally, but added many seconds to the chunk write. I left it out as I'm not sure it's necessary.

I'm sure there's a similar issue with the index scavenge and merge processes. I haven't touched them here.

Add regular file flushes whilst writing out scavenged chunk from memo…
…ry and when merging chunks. This avoids a single 256MB flush at the end that results in filling the disk write queue.
@gregoryyoung

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

This brings up some interesting questions of where this kind of thing should happen. As example its perfectly legitimate that they could end up on different disks. My guess is that overall more knowledge is required to implement this in a reasonable way.

@lscpike

This comment has been minimized.

Copy link
Contributor Author

commented Apr 26, 2019

Are you referring to raid set ups? On windows and hardware raid it appears the disk queues are per file system partition rather than physical disk. It could be misleading me though.

@gregoryyoung

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

No the possibility of delays from a source code perspective. There are a few such operations that could possibly want a delay/priority added. Scavenge, Index Merge, Checkpoints, Version Upgrade etc.

It is also that we don't really know enough such as if these things are actually on shared/remote disks etc and what might possibly be conflicting in terms of iops. In other words we might want to centralize the decision so we know what the index is doing when we are deciding what a chunk should do. Does that make more sense? There has been some work towards this already ... think IO scheduling.

@lscpike

This comment has been minimized.

Copy link
Contributor Author

commented Apr 26, 2019

I see what you mean. There's definitely value in that. I've already seen a commit timeout from an index dump to disk at the same time as a scavenge write.

In the meantime this change is very useful though!

@shaan1337 shaan1337 requested review from shaan1337 and jageall Apr 30, 2019

@jageall jageall merged commit 3496dae into EventStore:master Apr 30, 2019

9 checks passed

EventStore.EventStore Build #20190430.2 succeeded
Details
EventStore.EventStore (Centos 7 x64 Debug) Centos 7 x64 Debug succeeded
Details
EventStore.EventStore (Centos 7 x64 Release) Centos 7 x64 Release succeeded
Details
EventStore.EventStore (Ubuntu 14.04 x64 Debug) Ubuntu 14.04 x64 Debug succeeded
Details
EventStore.EventStore (Ubuntu 14.04 x64 Release) Ubuntu 14.04 x64 Release succeeded
Details
EventStore.EventStore (Windows x64 Debug) Windows x64 Debug succeeded
Details
EventStore.EventStore (Windows x64 Release) Windows x64 Release succeeded
Details
EventStore.EventStore (macOS x64 Debug) macOS x64 Debug succeeded
Details
EventStore.EventStore (macOS x64 Release) macOS x64 Release succeeded
Details

@lscpike lscpike deleted the lscpike:bugfix/scavenge-flushing branch Apr 30, 2019

shaan1337 added a commit that referenced this pull request May 13, 2019

Backport of #1918
Co-Authored-By: Laurence Pike <lscpike@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.