
Memory mapped read/address-space implementation #1040

Closed
wants to merge 1 commit into master from MemoryMapped

Conversation

@no2chem (Member) commented Dec 5, 2017

Overview

Description: This PR replaces the implementations of read and readAddressSpace with versions that use MappedByteBuffers, instead of allocating a secondary byte buffer, reading into it, and then constructing the protobuf. A new input stream, MappedByteBufInputStream, which automatically allocates a buffer and maintains a view over a file channel, is introduced; it allows the log unit to read over a file in a memory-mapped fashion (a simplified sketch follows). Verification in StreamLogFiles has been merged into StreamLogFiles::scanLogAndPerformAction, which unifies the code that needs to scan over an entire log file. I've tested the PR on a 6GB log file.
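For reference, here is a minimal sketch of the idea behind MappedByteBufInputStream: an InputStream backed by a memory-mapped view of a FileChannel, so reads go through the page cache rather than through a heap copy of the file. The class name MappedFileInputStream and its simplified shape are illustrative, not the PR's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative sketch: a read-only InputStream over a memory-mapped file window.
public class MappedFileInputStream extends InputStream {
    private final MappedByteBuffer view;

    public MappedFileInputStream(FileChannel channel, long position, long size)
            throws IOException {
        // Map a read-only window over the channel; no heap buffer holds the file.
        this.view = channel.map(FileChannel.MapMode.READ_ONLY, position, size);
    }

    @Override
    public int read() {
        return view.hasRemaining() ? view.get() & 0xFF : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (!view.hasRemaining()) {
            return -1;
        }
        int n = Math.min(len, view.remaining());
        view.get(b, off, n);
        return n;
    }
}
```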

Why should this be merged: Issue #986 reported that a Corfu server is unable to start up with a large log file (>800MB) due to heap space exhaustion. Usually this shouldn't be an issue, since log files are limited to 10k entries each, but the server places no limit on how large each entry may be, so large entries (for example, checkpoint entries of ~100KB each) could leave a Corfu server unable to start. This was due to StreamLogFiles::readAddressSpace and StreamLogFiles::getGlobalAddressTail requiring twice as much memory as the file just to read it. The change should also significantly reduce memory pressure during startup. Unfortunately, generating a memory mapping for each read is likely to be costly; #940 may help, but transferring a FileRegion (instead of converting into a protobuf) would be preferable, as sketched below.
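To illustrate the FileRegion remark (this is not implemented in the PR, just a hedged sketch of the alternative): Netty can transfer a byte range of the file directly to the wire, so the entry bytes never pass through the JVM heap.

```java
import io.netty.channel.Channel;
import io.netty.channel.DefaultFileRegion;
import java.nio.channels.FileChannel;

final class FileRegionTransfer {
    // Transfer [offset, offset + length) of the log file straight to the channel;
    // on NIO transports this can use sendfile(2) instead of copying into a ByteBuf.
    static void transferEntry(Channel ch, FileChannel file, long offset, long length) {
        ch.writeAndFlush(new DefaultFileRegion(file, offset, length));
    }
}
```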

Related issue(s) (if applicable): Fixes #1048, #986, related to #940

Checklist (Definition of Done):

  • There are no TODOs left in the code
  • Coding conventions (e.g. for logging, unit tests) have been followed
  • Change is covered by automated tests
  • Public API has Javadoc

@no2chem no2chem added the bug label Dec 5, 2017
@no2chem no2chem added this to the 0.2.0 milestone Dec 5, 2017
@coveralls (Coverage Status)

Coverage increased (+0.1%) to 70.761% when pulling 7bd4da3 on MemoryMapped into d94839a on master.

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 7bd4da3.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@coveralls (Coverage Status)

Coverage increased (+0.1%) to 70.763% when pulling 8399417 on MemoryMapped into d94839a on master.

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 8399417.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@no2chem (Member Author) commented Dec 6, 2017

Unfortunately, this issue is much deeper than previously thought. The CompactedLog class holds a reference to each LogData during the initial scan, which is unnecessary during startup, since only the maximum address needs to be collected (see the sketch below). Because a reference is held for each LogData, garbage collection cannot reclaim the memory. It would be easy enough to collect only what's needed during the scan, but the same code path is used to trim files, so the LogUnit would unexpectedly crash during trim in that case.
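A minimal sketch of that startup-only scan, keeping just a running maximum so that no entry reference outlives its loop iteration; the [long address][int length][payload] framing here is a hypothetical stand-in for the real log entry format:

```java
import java.io.DataInputStream;
import java.io.IOException;

final class TailScanner {
    static long globalAddressTail(DataInputStream in) throws IOException {
        long maxAddress = -1L;                          // sentinel: no entries seen
        while (in.available() > 0) {
            maxAddress = Math.max(maxAddress, readNextAddress(in));
        }
        return maxAddress;
    }

    // Hypothetical decoder: read one entry's framing, return its address, and
    // skip the payload so it is never materialized on the heap.
    static long readNextAddress(DataInputStream in) throws IOException {
        long address = in.readLong();
        int length = in.readInt();
        in.skipBytes(length);
        return address;
    }
}
```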

Looking at a potential solution that generates the compacted file in place.

@coveralls (Coverage Status)

Coverage increased (+0.1%) to 70.774% when pulling 4256ebf on MemoryMapped into d94839a on master.

@zalokhan (Member) commented Dec 6, 2017

I tested a restart of a server with a 1.8G log file.
Without this patch the server crashes on startup.
Now the restart is successful.
I will now try to test with a file >2GB and post my findings here.

@no2chem (Member Author) commented Dec 6, 2017

@zalokhan Thanks! I also re-implemented the trim function so it completes in a single pass without buffering, so that shouldn't crash the server either.

Unfortunately, it seems we will run into another problem: the JVM currently only supports mapping 2GB at a time, so I suspect we will crash on log files that take up more than 2GB of physical space.

@coveralls (Coverage Status)

Coverage increased (+0.07%) to 70.735% when pulling b1f60ac on MemoryMapped into d94839a on master.

@Maithem (Contributor) commented Dec 6, 2017

You can map more than 2GB, but I don't think this problem should be solved this way. File sizes should be smaller; in my tests, smaller segments exhibit better performance. The file sizes can be controlled by parameters like RPC wait time and batch writes. Another way to solve this problem is to allocate an array once and read/process multiple chunks at a time, as sketched below. In addition to that workaround, you can just increase the -Xmx value and you wouldn't see this issue. In other words, I don't think this change is necessary now. Also, @zalokhan only observed this problem when he was deliberately writing large files.
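A sketch of that chunked approach; the 4MB size and processChunk are placeholders, and real code would decode whole entries while carrying partial ones across reads:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ChunkedScanner {
    private static final int CHUNK_SIZE = 4 * 1024 * 1024; // one 4 MB buffer, reused

    static void scan(FileChannel channel) throws IOException {
        ByteBuffer chunk = ByteBuffer.allocate(CHUNK_SIZE); // allocated exactly once
        while (channel.read(chunk) != -1) {
            chunk.flip();
            processChunk(chunk);  // decode complete entries from the buffer
            chunk.compact();      // keep any partial entry for the next read
        }
    }

    static void processChunk(ByteBuffer chunk) {
        chunk.position(chunk.limit()); // stub: consume everything
    }
}
```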

The reason I don't think this PR is necessary right now is that (1) there are easy workarounds, (2) we haven't seen this in production, and (3) we want to introduce more memory-mapped buffers in the logging unit.

Because of (3), I think we should have another implementation that heavily uses memory-mapped files (i.e., a different StreamLog implementation), and this change would make sense in that new implementation.

@no2chem (Member Author) commented Dec 6, 2017

I disagree. Currently, file sizes cannot be controlled, because log entry sizes are not controlled, and just because you don't run into the issue now doesn't mean it won't happen in the future. Changing the segment size doesn't help much, because you can still have unpredictably large writes.

The JVM doesn't natively support mapping more than 2GB. There are unsafe hacks, or you can map the same file multiple times (as sketched below), but a single >2GB mapping is unsupported. This issue causes a Corfu instance to crash in an unrecoverable way. If you think that is a non-issue/non-bug, then perhaps you should comment on the issue (#986).
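For concreteness, a sketch of the map-the-file-multiple-times workaround: FileChannel.map caps each mapping at Integer.MAX_VALUE bytes, so a larger file gets covered by a series of read-only windows. Entries that straddle a window boundary would still need special handling, which this sketch omits.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

final class WindowedMapper {
    static List<MappedByteBuffer> mapWindows(FileChannel channel) throws IOException {
        List<MappedByteBuffer> windows = new ArrayList<>();
        long size = channel.size();
        for (long pos = 0; pos < size; pos += Integer.MAX_VALUE) {
            long len = Math.min(Integer.MAX_VALUE, size - pos); // each window < 2GB
            windows.add(channel.map(FileChannel.MapMode.READ_ONLY, pos, len));
        }
        return windows;
    }
}
```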

@coveralls (Coverage Status)

Coverage increased (+0.06%) to 70.727% when pulling b1f60ac on MemoryMapped into d94839a on master.

@Maithem (Contributor) commented Dec 7, 2017

My remarks with respect to segment size concern the size of the file, not the number of entries inside a file. We have ways of mitigating this risk; the underlying problem is that we don't control the maximum write size for a particular address. We've talked about this before but haven't concluded anything, so maybe you want to fix that first. Allowing arbitrary parameters and trying to "fix it" with the method you propose is an incomplete solution and currently not necessary. If you are solving this for the general case, then this PR doesn't work (for example, it won't work for files larger than 2GB, as you pointed out). If you are trying to solve the problem for the 2GB case, then this seems arbitrary, and we need to talk more about write sizes.

My argument is not that we don't see this today and therefore shouldn't worry about it. My point is that we have controls to mitigate this issue (and we have used these controls in production to fix this very problem before).

@no2chem (Member Author) commented Dec 7, 2017

Well, this PR is incomplete (if you notice, it is in the In Progress pipeline), and extending it to handle files over 2GB is not difficult. In fact, I'm almost done with that. Obviously, I didn't realize the Java 2GB issue until I started using MappedByteBuffers.

If your position is that this is not a bug, then we can discuss that in the issue; that appears to be what you are indicating.

If your position is that we shouldn't use memory-mapped files to solve this bug, then we can discuss that here. I don't believe we should have both a "memory-mapped" and a non-"memory-mapped" implementation. How do we tell users which one to pick?

@Maithem (Contributor) commented Dec 7, 2017

Incomplete in terms of the generality of the solution, not the status of this particular PR.

I don't know if "bug" is the right classification for this issue, since it is contingent on limits on a single write, which are not well defined. I think we should talk about write size limits, but if you want to ignore that for now, which is fine, then I suggest you do chunking instead of memory mapping, because chunking solves the general problem, whereas mapping solves only part of it.

Unless we have a complete implementation of StreamLog files that uses mapped files, we can't tell which one is better under different scenarios. I think creating a new StreamLog based on StreamLogFiles and morphing it over time is a better approach, because then you can cleanly compare the performance characteristics of the two implementations. Obviously, there is no silver bullet, so we might end up recommending different implementations for different workloads.

@no2chem (Member Author) commented Dec 7, 2017

> Incomplete in terms of the generality of the solution, not the status of this particular PR.

Once the PR supports >2GB, I'm not sure what you mean by "lack of generality".

> I don't know if "bug" is the right classification for this issue, since it is contingent on limits on a single write, which are not well defined.

The server starts up, encounters a log file of 1.8GB, and crashes. How is that not a bug? I'll add that I've had reports of this issue from outside VMware.

@Maithem (Contributor) commented Dec 7, 2017

If you will support >2GB, then I think this is fine.

> The server starts up, encounters a log file of 1.8GB, and crashes. How is that not a bug?

It can also be seen as misconfiguration (i.e., you need more memory), not necessarily a bug in the code. That's why I am saying it's not clear. Ultimately, segment sizes are a function of single-address writes, which are undefined. Since that's not defined, it's hard to claim this as a bug.

@no2chem (Member Author) commented Dec 7, 2017

> It can also be seen as misconfiguration (i.e., you need more memory), not necessarily a bug in the code.

Not really. It should tell the user they need more memory (rather than crash with a heap allocation error), so it's still a bug.

Besides, needing as much memory as the size of the file just to start up is a bug. Reading a file just to verify it should not require reading all of it into memory.

Worse yet, there's no way to recover from such a big file even if you accidentally caused the condition, since trim will fail as well. Even if we had write size controls, we would still need to be able to recover from large files. And 1.8GB is not a particularly large file; it's not even 2GB.

@Maithem (Contributor) commented Dec 7, 2017

> Not really. It should tell the user they need more memory (rather than crash with a heap allocation error), so it's still a bug.

It's nice to print "You need more memory" (many other systems don't), but if you think that's the fix, then you can do it.

> Besides, needing as much memory as the size of the file just to start up is a bug. Reading a file just to verify it should not require reading all of it into memory.

That's your opinion, which I think is an unreasonable generalization. Also, readAddressSpace is not just "verification"; it builds an index.

> Worse yet, there's no way to recover from such a big file even if you accidentally caused the condition, since trim will fail as well.

False. As I mentioned, it's not as simple as you put it; like I said before, you can bump up the -Xmx parameter. Also, if you fail after restart, trim doesn't require segments to be open for the trim to happen, as shown below:

```java
File dir = new File(logDir);
FileFilter fileFilter = new FileFilter() {
    public boolean accept(File file) {
        try {
            String segmentStr = file.getName().split("\\.")[0];
            return Long.parseLong(segmentStr) < endSegment;
        } catch (Exception e) {
            log.warn("trimPrefix: ignoring file {}", file.getName());
            return false;
        }
    }
};
```

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit b1f60ac.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@no2chem (Member Author) commented Dec 7, 2017

> That's your opinion, which I think is an unreasonable generalization. Also, readAddressSpace is not just "verification"; it builds an index.

The same applies to indexes as well. It's not an unreasonable generalization; I think that if SQL Server crashed on a log file larger than memory, it would be considered a bug.

> False. As I mentioned, it's not as simple as you put it; like I said before, you can bump up the -Xmx parameter.

Okay, that'll work for a file that fits in memory, perhaps, but what about one that does not? You can only bump -Xmx up so far. The comment about trim is irrelevant if the server crashes on startup, right?

This is an issue with a reasonable solution. I could understand if there were technical reasons why you don't think this is a good solution, but you seem to be against solving it altogether. Why is that?

@Maithem (Contributor) commented Dec 7, 2017

> The comment about trim is irrelevant if the server crashes on startup

It is irrelevant; I was just correcting a statement you made in the previous post.

> This is an issue with a reasonable solution. I could understand if there were technical reasons why you don't think this is a good solution, but you seem to be against solving it altogether. Why is that?

No, I think it is a problem that needs to be addressed. My previous posts were to make the point that this problem is a consequence of another pending problem, and solving that problem first could simplify the solution to this one.

Once you complete this PR, we can objectively evaluate it against other solutions and see if it is the best one.

@no2chem (Member Author) commented Dec 7, 2017

> It is irrelevant; I was just correcting a statement you made in the previous post.

You mean that trim will fail? The server can't start up, so you can't trim. What are you correcting? Also, even if the server manages to start up, memory allocation failures will start happening if you try to trim, won't they? Catching the exception at the trim doesn't help.

@Maithem (Contributor) commented Dec 7, 2017

> Also, even if the server manages to start up, memory allocation failures will start happening if you try to trim, won't they?

No, trimPrefix doesn't invoke readAddressSpace because it doesn't open segments, so the trim can succeed.

Also, if other requests result in an operation that invokes readAddressSpace, the allocation will fail on that request's thread, but I don't think that bricks the server such that subsequent trims are blocked, since they will be serviced on a different thread.

@no2chem (Member Author) commented Dec 7, 2017

@Maithem the problem is not just readAddressSpace but also getCompactedEntries, which is called by trimLogFile. If trimLogFile is called on a file with a large untrimmed portion, the server will run out of memory, because CompactedEntry buffers every entry before writing it out. Even if the server doesn't run out of memory, keeping the entries in a map generates memory pressure during compaction that could be avoided with a streaming pass, as sketched below.
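A sketch of a single-pass, constant-memory compaction, again with a hypothetical [long address][int length][payload] framing in place of the real entry format; trimmed entries are skipped and survivors are written out immediately instead of being buffered in a map:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class StreamingCompactor {
    static void compact(DataInputStream in, DataOutputStream out, long trimMark)
            throws IOException {
        while (in.available() > 0) {
            long address = in.readLong();
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);
            if (address >= trimMark) {  // survivor: copy it straight through
                out.writeLong(address);
                out.writeInt(length);
                out.write(payload);
            }                           // trimmed entries are simply dropped
        }
        out.flush();
    }
}
```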

@Maithem (Contributor) commented Dec 7, 2017

@no2chem Correct, but that's only for sparse trimming (which isn't used), not prefix trim. Also, if you are optimizing for memory pressure, then consider not reading the whole file to build the index; put the index at the top of the file instead, e.g. {header | index | records}, so the index can be read directly without reading all the records (see the sketch below). This would address both problems, memory pressure and large segments. Just consider it. Obviously, it requires an extra write (if you are not using mmap), but it might make more sense to do that.
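A sketch of reading such a layout; the [int magic][int indexLength] header framing is invented for illustration:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class IndexedSegmentReader {
    static ByteBuffer readIndexOnly(FileChannel channel) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(8);
        channel.read(header, 0);           // read the fixed-size header only
        header.flip();
        header.getInt();                   // magic, ignored in this sketch
        int indexLength = header.getInt();
        ByteBuffer index = ByteBuffer.allocate(indexLength);
        channel.read(index, 8);            // read just the index region
        index.flip();
        return index;                      // the records are never touched
    }
}
```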

@no2chem (Member Author) commented Dec 7, 2017

I'm just trying to fix the bug in issue #986. I've reproduced it locally; it seems that you need about 2x the file size in heap space for the last segment. So with a 900MB file (produced using 100KB raw entries), you need 2GB of heap space.

@no2chem (Member Author) commented Dec 7, 2017

Now updated to support files >2GB. I've tested with a 6GB log file, and though startup takes a few seconds (a potential improvement would be to scan in a multi-threaded fashion), the server no longer crashes, and I'm no longer seeing the buffer underflow errors from #1048 either.

I doubt this PR will have much of an effect on performance.

@coveralls (Coverage Status)

Coverage decreased (-0.2%) to 70.444% when pulling 70e9ce8 on MemoryMapped into d94839a on master.

@coveralls (Coverage Status)

Coverage decreased (-0.05%) to 70.615% when pulling 70e9ce8 on MemoryMapped into d94839a on master.

@corfudb-performance (Collaborator)

Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 70e9ce8.

*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 10 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 1 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 5 threads, Disk mode
*** 0.0% transaction FAILURE rate for NonConflictingTx workload, 10 threads, Disk mode

An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Pull Request #1040 Graphs

@Rtremblay modified the milestones: 0.2.0, 0.4.0 Dec 20, 2017
@no2chem closed this Jun 18, 2018
@zalokhan deleted the MemoryMapped branch November 7, 2019 17:45