
Memory mapped read/address-space implementation #1040

Closed
wants to merge 1 commit into master from MemoryMapped

Conversation

@no2chem (Member) commented Dec 5, 2017

Overview

Description: This PR replaces the implementations of read and readAddressSpace with versions that use MappedByteBuffers, instead of allocating a secondary byte buffer, reading into it, and then constructing the protobuf. A new input stream, MappedByteBufInputStream, which automatically allocates a buffer and maintains a view over a file channel, is introduced; it allows the log unit to read over a file in a memory-mapped fashion (a simplified sketch follows). Verification in StreamLogFiles has been merged into StreamLogFiles::scanLogAndPerformAction, which unifies the code that needs to scan over an entire log file. I've tested the PR on a 6GB log file.
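For reference, here is a minimal sketch of the idea behind MappedByteBufInputStream: an InputStream backed by a memory-mapped view of a FileChannel, so reads go through the page cache rather than through a heap copy of the file. The class name MappedFileInputStream and its simplified shape are illustrative, not the PR's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative sketch: a read-only InputStream over a memory-mapped file window.
public class MappedFileInputStream extends InputStream {
    private final MappedByteBuffer view;

    public MappedFileInputStream(FileChannel channel, long position, long size)
            throws IOException {
        // Map a read-only window over the channel; no heap buffer holds the file.
        this.view = channel.map(FileChannel.MapMode.READ_ONLY, position, size);
    }

    @Override
    public int read() {
        return view.hasRemaining() ? view.get() & 0xFF : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (!view.hasRemaining()) {
            return -1;
        }
        int n = Math.min(len, view.remaining());
        view.get(b, off, n);
        return n;
    }
}
```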

Why should this be merged: Issue #986 reported that a Corfu server is unable to start up with a large log file (>800MB) due to heap space exhaustion. Usually this shouldn't be an issue, since log files are limited to 10k entries each, but the server places no limit on how large each entry may be, so large entries (for example, checkpoint entries of ~100KB each) could leave a Corfu server unable to start. This was due to StreamLogFiles::readAddressSpace and StreamLogFiles::getGlobalAddressTail requiring twice as much memory as the file just to read it. The change should also significantly reduce memory pressure during startup. Unfortunately, generating a memory mapping for each read is likely to be costly; #940 may help, but transferring a FileRegion (instead of converting into a protobuf) would be preferable, as sketched below.
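To illustrate the FileRegion remark (this is not implemented in the PR, just a hedged sketch of the alternative): Netty can transfer a byte range of the file directly to the wire, so the entry bytes never pass through the JVM heap.

```java
import io.netty.channel.Channel;
import io.netty.channel.DefaultFileRegion;
import java.nio.channels.FileChannel;

final class FileRegionTransfer {
    // Transfer [offset, offset + length) of the log file straight to the channel;
    // on NIO transports this can use sendfile(2) instead of copying into a ByteBuf.
    static void transferEntry(Channel ch, FileChannel file, long offset, long length) {
        ch.writeAndFlush(new DefaultFileRegion(file, offset, length));
    }
}
```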

Related issue(s) (if applicable): Fixes #1048, #986, related to #940

Checklist (Definition of Done):

  • There are no TODOs left in the code
  • Coding conventions (e.g. for logging, unit tests) have been followed
  • Change is covered by automated tests
  • Public API has Javadoc

@no2chem no2chem added the bug label Dec 5, 2017
@no2chem no2chem added this to the 0.2.0 milestone Dec 5, 2017
@coveralls (Coverage Status)

Coverage increased (+0.1%) to 70.761% when pulling 7bd4da3 on MemoryMapped into d94839a on master.

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 7bd4da3.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@coveralls (Coverage Status)

Coverage increased (+0.1%) to 70.763% when pulling 8399417 on MemoryMapped into d94839a on master.

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 8399417.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@no2chem (Member Author) commented Dec 6, 2017

Unfortunately, this issue is much deeper than previously thought. The CompactedLog class holds a reference to each LogData during the initial scan, which is unnecessary during startup, since only the maximum address needs to be collected (see the sketch below). Because a reference is held for each LogData, garbage collection cannot reclaim the memory. It would be easy enough to collect only what's needed during the scan, but the same code path is used to trim files, so the LogUnit would unexpectedly crash during trim in that case.
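A minimal sketch of that startup-only scan, keeping just a running maximum so that no entry reference outlives its loop iteration; the [long address][int length][payload] framing here is a hypothetical stand-in for the real log entry format:

```java
import java.io.DataInputStream;
import java.io.IOException;

final class TailScanner {
    static long globalAddressTail(DataInputStream in) throws IOException {
        long maxAddress = -1L;                          // sentinel: no entries seen
        while (in.available() > 0) {
            maxAddress = Math.max(maxAddress, readNextAddress(in));
        }
        return maxAddress;
    }

    // Hypothetical decoder: read one entry's framing, return its address, and
    // skip the payload so it is never materialized on the heap.
    static long readNextAddress(DataInputStream in) throws IOException {
        long address = in.readLong();
        int length = in.readInt();
        in.skipBytes(length);
        return address;
    }
}
```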

Looking at a potential solution that generates the compacted file in place.

@coveralls (Coverage Status)

Coverage increased (+0.1%) to 70.774% when pulling 4256ebf on MemoryMapped into d94839a on master.

@zalokhan (Member) commented Dec 6, 2017

I tested a restart of a server with a 1.8G log file.
Without this patch the server crashes on startup.
Now the restart is successful.
I will now try to test with a file >2GB and post my findings here.

@no2chem (Member Author) commented Dec 6, 2017

@zalokhan Thanks! I also re-implemented the trim function so it completes in a single pass without buffering, so that shouldn't crash the server either.

Unfortunately, it seems we will run into another problem: the JVM currently only supports mapping 2GB at a time, so I suspect we will crash on log files that take up more than 2GB of physical space.

@coveralls (Coverage Status)

Coverage increased (+0.07%) to 70.735% when pulling b1f60ac on MemoryMapped into d94839a on master.

@Maithem (Contributor) commented Dec 6, 2017

You can map more than 2GB, but I don't think this problem should be solved this way. File sizes should be smaller; in my tests, smaller segments exhibit better performance. The file sizes can be controlled by parameters like RPC wait time and batch writes. Another way to solve this problem is to allocate an array once and read/process multiple chunks at a time, as sketched below. In addition to that workaround, you can just increase the -Xmx value and you wouldn't see this issue. In other words, I don't think this change is necessary now. Also, @zalokhan only observed this problem when he was deliberately writing large files.
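A sketch of that chunked approach; the 4MB size and processChunk are placeholders, and real code would decode whole entries while carrying partial ones across reads:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ChunkedScanner {
    private static final int CHUNK_SIZE = 4 * 1024 * 1024; // one 4 MB buffer, reused

    static void scan(FileChannel channel) throws IOException {
        ByteBuffer chunk = ByteBuffer.allocate(CHUNK_SIZE); // allocated exactly once
        while (channel.read(chunk) != -1) {
            chunk.flip();
            processChunk(chunk);  // decode complete entries from the buffer
            chunk.compact();      // keep any partial entry for the next read
        }
    }

    static void processChunk(ByteBuffer chunk) {
        chunk.position(chunk.limit()); // stub: consume everything
    }
}
```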

The reason I don't think this PR is necessary right now is that (1) there are easy workarounds, (2) we haven't seen this in production, and (3) we want to introduce more memory-mapped buffers in the logging unit.

Because of (3), I think we should have another implementation that heavily uses memory-mapped files (i.e., a different StreamLog implementation), and this change would make sense in that new implementation.

@no2chem (Member Author) commented Dec 6, 2017

I disagree. Currently, file sizes cannot be controlled, because log entry sizes are not controlled, and just because you don't run into the issue now doesn't mean it won't happen in the future. Changing the segment size doesn't help much, because you can still have unpredictably large writes.

The JVM doesn't natively support mapping more than 2GB. There are unsafe hacks, or you can map the same file multiple times (as sketched below), but a single >2GB mapping is unsupported. This issue causes a Corfu instance to crash in an unrecoverable way. If you think that is a non-issue/non-bug, then perhaps you should comment on the issue (#986).
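For concreteness, a sketch of the map-the-file-multiple-times workaround: FileChannel.map caps each mapping at Integer.MAX_VALUE bytes, so a larger file gets covered by a series of read-only windows. Entries that straddle a window boundary would still need special handling, which this sketch omits.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

final class WindowedMapper {
    static List<MappedByteBuffer> mapWindows(FileChannel channel) throws IOException {
        List<MappedByteBuffer> windows = new ArrayList<>();
        long size = channel.size();
        for (long pos = 0; pos < size; pos += Integer.MAX_VALUE) {
            long len = Math.min(Integer.MAX_VALUE, size - pos); // each window < 2GB
            windows.add(channel.map(FileChannel.MapMode.READ_ONLY, pos, len));
        }
        return windows;
    }
}
```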

@coveralls (Coverage Status)

Coverage increased (+0.06%) to 70.727% when pulling b1f60ac on MemoryMapped into d94839a on master.

@Maithem (Contributor) commented Dec 7, 2017

My remarks with respect to segment size concern the size of the file, not the number of entries inside a file. We have ways of mitigating this risk; the underlying problem is that we don't control the maximum write size for a particular address. We've talked about this before but haven't concluded anything, so maybe you want to fix that first. Allowing arbitrary parameters and trying to "fix it" with the method you propose is an incomplete solution and currently not necessary. If you are solving this for the general case, then this PR doesn't work (for example, it won't work for files larger than 2GB, as you pointed out). If you are trying to solve the problem for the 2GB case, then this seems arbitrary, and we need to talk more about write sizes.

My argument is not that we don't see this today and therefore shouldn't worry about it. My point is that we have controls to mitigate this issue (and we have used these controls in production to fix this very problem before).

@no2chem (Member Author) commented Dec 7, 2017

Well, this PR is incomplete (if you notice, it is in the In Progress pipeline), and extending it to handle files over 2GB is not difficult. In fact, I'm almost done with that. Obviously, I didn't realize the Java 2GB issue until I started using MappedByteBuffers.

If your position is that this is not a bug, then we can discuss that in the issue; that appears to be what you are indicating.

If your position is that we shouldn't use memory-mapped files to solve this bug, then we can discuss that here. I don't believe we should have both a "memory-mapped" and a non-"memory-mapped" implementation. How do we tell users which one to pick?

@Maithem (Contributor) commented Dec 7, 2017

Incomplete in terms of the generality of the solution, not the status of this particular PR.

I don't know if "bug" is the right classification for this issue, since it is contingent on limits on a single write, which are not well defined. I think we should talk about write size limits, but if you want to ignore that for now, which is fine, then I suggest you do chunking instead of memory mapping, because chunking solves the general problem, whereas mapping solves only part of it.

Unless we have a complete implementation of StreamLog files that uses mapped files, we can't tell which one is better under different scenarios. I think creating a new StreamLog based on StreamLogFiles and morphing it over time is a better approach, because then you can cleanly compare the performance characteristics of the two implementations. Obviously, there is no silver bullet, so we might end up recommending different implementations for different workloads.

@no2chem (Member Author) commented Dec 7, 2017

> Incomplete in terms of the generality of the solution, not the status of this particular PR.

Once the PR supports >2GB, I'm not sure what you mean by "lack of generality".

> I don't know if "bug" is the right classification for this issue, since it is contingent on limits on a single write, which are not well defined.

The server starts up, encounters a log file of 1.8GB, and crashes. How is that not a bug? I'll add that I've had reports of this issue from outside VMware.

@Maithem (Contributor) commented Dec 7, 2017

If you will support >2GB, then I think this is fine.

> The server starts up, encounters a log file of 1.8GB, and crashes. How is that not a bug?

It can also be seen as misconfiguration (i.e., you need more memory), not necessarily a bug in the code. That's why I am saying it's not clear. Ultimately, segment sizes are a function of single-address writes, which are undefined. Since that's not defined, it's hard to claim this as a bug.

@no2chem (Member Author) commented Dec 7, 2017

> It can also be seen as misconfiguration (i.e., you need more memory), not necessarily a bug in the code.

Not really. It should tell the user they need more memory (rather than crash with a heap allocation error), so it's still a bug.

Besides, needing as much memory as the size of the file just to start up is a bug. Reading a file just to verify it should not require reading all of it into memory.

Worse yet, there's no way to recover from such a big file even if you accidentally caused the condition, since trim will fail as well. Even if we had write size controls, we would still need to be able to recover from large files. And 1.8GB is not a particularly large file; it's not even 2GB.

@Maithem (Contributor) commented Dec 7, 2017

> Not really. It should tell the user they need more memory (rather than crash with a heap allocation error), so it's still a bug.

It's nice to print "You need more memory" (many other systems don't), but if you think that's the fix, then you can do it.

> Besides, needing as much memory as the size of the file just to start up is a bug. Reading a file just to verify it should not require reading all of it into memory.

That's your opinion, which I think is an unreasonable generalization. Also, readAddressSpace is not just "verification"; it builds an index.

> Worse yet, there's no way to recover from such a big file even if you accidentally caused the condition, since trim will fail as well.

False. As I mentioned, it's not as simple as you put it; like I said before, you can bump up the -Xmx parameter. Also, if you fail after restart, trim doesn't require segments to be open for the trim to happen, as shown below:

```java
File dir = new File(logDir);
FileFilter fileFilter = new FileFilter() {
    public boolean accept(File file) {
        try {
            String segmentStr = file.getName().split("\\.")[0];
            return Long.parseLong(segmentStr) < endSegment;
        } catch (Exception e) {
            log.warn("trimPrefix: ignoring file {}", file.getName());
            return false;
        }
    }
};
```

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit b1f60ac.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@no2chem (Member Author) commented Dec 7, 2017

> That's your opinion, which I think is an unreasonable generalization. Also, readAddressSpace is not just "verification"; it builds an index.

The same applies to indexes as well. It's not an unreasonable generalization; I think that if SQL Server crashed on a log file larger than memory, it would be considered a bug.

> False. As I mentioned, it's not as simple as you put it; like I said before, you can bump up the -Xmx parameter.

Okay, that'll work for a file that fits in memory, perhaps, but what about one that does not? You can only bump -Xmx up so far. The comment about trim is irrelevant if the server crashes on startup, right?

This is an issue with a reasonable solution. I could understand if there were technical reasons why you don't think this is a good solution, but you seem to be against solving it altogether. Why is that?

@Maithem (Contributor) commented Dec 7, 2017

> The comment about trim is irrelevant if the server crashes on startup

It is irrelevant; I was just correcting a statement you made in the previous post.

> This is an issue with a reasonable solution. I could understand if there were technical reasons why you don't think this is a good solution, but you seem to be against solving it altogether. Why is that?

No, I think it is a problem that needs to be addressed. My previous posts were to make the point that this problem is a consequence of another pending problem, and solving that problem first could simplify the solution to this one.

Once you complete this PR, we can objectively evaluate it against other solutions and see if it is the best one.

@no2chem (Member Author) commented Dec 7, 2017

> It is irrelevant; I was just correcting a statement you made in the previous post.

You mean that trim will fail? The server can't start up, so you can't trim. What are you correcting? Also, even if the server manages to start up, memory allocation failures will start happening if you try to trim, won't they? Catching the exception at the trim doesn't help.

@Maithem (Contributor) commented Dec 7, 2017

> Also, even if the server manages to start up, memory allocation failures will start happening if you try to trim, won't they?

No, trimPrefix doesn't invoke readAddressSpace because it doesn't open segments, so the trim can succeed.

Also, if other requests result in an operation that invokes readAddressSpace, the allocation will fail on that request's thread, but I don't think that bricks the server such that subsequent trims are blocked, since they will be serviced on a different thread.

@no2chem (Member Author) commented Dec 7, 2017

@Maithem the problem is not just readAddressSpace but also getCompactedEntries, which is called by trimLogFile. If trimLogFile is called on a file with a large untrimmed portion, the server will run out of memory, because CompactedEntry buffers every entry before writing it out. Even if the server doesn't run out of memory, keeping the entries in a map generates memory pressure during compaction that could be avoided with a streaming pass, as sketched below.
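A sketch of a single-pass, constant-memory compaction, again with a hypothetical [long address][int length][payload] framing in place of the real entry format; trimmed entries are skipped and survivors are written out immediately instead of being buffered in a map:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class StreamingCompactor {
    static void compact(DataInputStream in, DataOutputStream out, long trimMark)
            throws IOException {
        while (in.available() > 0) {
            long address = in.readLong();
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);
            if (address >= trimMark) {  // survivor: copy it straight through
                out.writeLong(address);
                out.writeInt(length);
                out.write(payload);
            }                           // trimmed entries are simply dropped
        }
        out.flush();
    }
}
```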

@Maithem (Contributor) commented Dec 7, 2017

@no2chem Correct, but that's only for sparse trimming (which isn't used), not prefix trim. Also, if you are optimizing for memory pressure, then consider not reading the whole file to build the index; put the index at the top of the file instead, e.g. {header | index | records}, so the index can be read directly without reading all the records (see the sketch below). This would address both problems, memory pressure and large segments. Just consider it. Obviously, it requires an extra write (if you are not using mmap), but it might make more sense to do that.
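A sketch of reading such a layout; the [int magic][int indexLength] header framing is invented for illustration:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class IndexedSegmentReader {
    static ByteBuffer readIndexOnly(FileChannel channel) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(8);
        channel.read(header, 0);           // read the fixed-size header only
        header.flip();
        header.getInt();                   // magic, ignored in this sketch
        int indexLength = header.getInt();
        ByteBuffer index = ByteBuffer.allocate(indexLength);
        channel.read(index, 8);            // read just the index region
        index.flip();
        return index;                      // the records are never touched
    }
}
```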

@no2chem (Member Author) commented Dec 7, 2017

I'm just trying to fix the bug in issue #986. I've reproduced it locally; it seems that you need about 2x the file size in heap space for the last segment. So with a 900MB file (produced using 100KB raw entries), you need 2GB of heap space.

@no2chem (Member Author) commented Dec 7, 2017

Now updated to support files >2GB. I've tested with a 6GB log file, and though startup takes a few seconds (a potential improvement would be to scan in a multi-threaded fashion), the server no longer crashes, and I'm no longer seeing the buffer underflow errors from #1048 either.

I doubt this PR will have much of an effect on performance.

@coveralls (Coverage Status)

Coverage decreased (-0.2%) to 70.444% when pulling 70e9ce8 on MemoryMapped into d94839a on master.

@coveralls (Coverage Status)

Coverage decreased (-0.05%) to 70.615% when pulling 70e9ce8 on MemoryMapped into d94839a on master.

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 70e9ce8.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@Rtremblay modified the milestones: 0.2.0, 0.4.0 Dec 20, 2017
@no2chem closed this Jun 18, 2018
@zalokhan deleted the MemoryMapped branch November 7, 2019 17:45