
Excessive memory use when uploading #233

Closed
ncw opened this issue Nov 27, 2020 · 16 comments

@ncw
Contributor

ncw commented Nov 27, 2020

Which version of the SDK was used?

v0.11.0

Which platform are you using? (ex: Windows, Linux, Debian)

Linux

What problem was encountered?

Excess memory usage compared to v0.8.0

How can we reproduce the problem in the simplest way?

Uploading lots of files with rclone reproduces it pretty reliably.

Have you found a mitigation/solution?

Setting GOGC=20 mitigates the problem. This is indicative of too much garbage being created.

Analysis

In c0650f8 #167 by @element-of-surprise the chunked uploader was refactored and a sync.Pool was introduced.

This sync.Pool appears to be causing memory problems according to the memory profiles below.

Testing with the azureblob backend in rclone produces the memory profile attached
memprofile-1.53.3.log

This looks like: (image: profile002)

Compared to the profile from the previous version of rclone, which shows no use of sync.Pool: (image: profile001)

This caused memory usage for rclone to go up from 400 MB to 1.4 GB between the two versions. The old version of rclone was using SDK v0.8.0.

What appears to be happening is that memory is getting stuck in the sync.Pool.

From the sync.Pool docs:

On the other hand, a free list maintained as part of a short-lived object is
not a suitable use for a Pool, since the overhead does not amortize well in
that scenario. It is more efficient to have such objects implement their own
free list.

I think the use of sync.Pool is falling into that case. The Pool isn't shared between connections so it isn't really helping.

Because a Pool is created and destroyed for each upload, the old Pools hang around in memory waiting to be garbage collected; I think that is why GOGC=20 works around the problem.

Since you are recycling buffers of a fixed size, a chan []byte might work better than the sync.Pool - you could then flush it at the end of the transfer (unlike a sync.Pool, which you can't flush).
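A minimal sketch of that suggestion (illustrative only; the names and sizes are invented, this is not SDK code): a fixed-size buffer pool built on a buffered channel, which can be explicitly drained when a transfer finishes:

```go
package main

import "fmt"

// bufferPool recycles fixed-size []byte buffers via a buffered
// channel. Unlike sync.Pool, it can be flushed on demand so the
// memory becomes garbage as soon as a transfer completes.
type bufferPool struct {
	c    chan []byte
	size int
}

func newBufferPool(size, max int) *bufferPool {
	return &bufferPool{c: make(chan []byte, max), size: size}
}

func (p *bufferPool) Get() []byte {
	select {
	case b := <-p.c: // reuse a recycled buffer
		return b
	default: // none available: allocate a fresh one
		return make([]byte, p.size)
	}
}

func (p *bufferPool) Put(b []byte) {
	select {
	case p.c <- b: // recycle
	default: // pool full: drop it and let the GC reclaim it
	}
}

// Flush drops every pooled buffer so the GC can reclaim them.
func (p *bufferPool) Flush() {
	for {
		select {
		case <-p.c:
		default:
			return
		}
	}
}

func main() {
	p := newBufferPool(4*1024*1024, 16)
	b := p.Get()
	p.Put(b)
	p.Flush()
	fmt.Println("pooled buffers after flush:", len(p.c)) // 0
}
```

The channel capacity doubles as the memory cap: at most max buffers of size bytes are ever retained.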

Or, alternatively, implement a global sync.Pool. You'll need one for each chunk size though, which is a disadvantage.
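The per-chunk-size bookkeeping that global approach would need might look something like this (a hypothetical sketch, not SDK code), with one process-wide sync.Pool per buffer size:

```go
package main

import (
	"fmt"
	"sync"
)

// pools holds one process-wide sync.Pool per chunk size, so every
// upload shares buffers of the same size instead of building and
// abandoning a private pool per call.
var pools sync.Map // map[int]*sync.Pool

func getBuffer(size int) []byte {
	v, _ := pools.LoadOrStore(size, &sync.Pool{
		New: func() interface{} { return make([]byte, size) },
	})
	return v.(*sync.Pool).Get().([]byte)
}

func putBuffer(b []byte) {
	// Return the buffer to the pool matching its size, if any.
	if v, ok := pools.Load(len(b)); ok {
		v.(*sync.Pool).Put(b)
	}
}

func main() {
	b := getBuffer(8 * 1024 * 1024)
	fmt.Println("got buffer of", len(b), "bytes")
	putBuffer(b)
}
```

Because the pools are global, their contents are still reclaimed only at GC time, but at least the memory is reused across uploads instead of being stranded in dead per-upload pools.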

This was originally discussed on the rclone forum.

@dcendents

Hi, as I said on the forum I'll be happy to test a release candidate for this fix, simply reach me here on this issue.

@element-of-surprise
Member

element-of-surprise commented Nov 28, 2020

Forgive me; since it's the holidays, I'll be responding from a slight bit of ignorance, as I haven't gone back to look at the code in question.

My feeling is that this problem is unfortunately not about memory use or the garbage generated; I think it's more a problem where various usage goals aren't aligning.

The end use of a network copy package has the primary goal of moving bytes across a network as fast as possible. As a secondary goal, in some situations you also want to constrain how much memory can be used at any moment (however, not always).

The achievement of those goals across multiple scenarios is where everything kinda goes bad. Are we:

  • copying millions of small files
  • copying one large file
  • copying thousands of large files
  • coping with a lossy network
  • running as a short-lived process or an always-copying service
  • ...

What I remember of the existing structure (my memory may be faulty) was that there were three knobs:

  • Buffer size
  • Number of buffers
  • Concurrency restriction

Again, my memory may be faulty on this, but I remember the design assuming that the common case was a single client transferring files to a single endpoint (reusing the sync.Pool).

The API provided restrictions through an options struct, if I remember correctly. I didn't want to make changes to the API without thinking about an overhaul of the package, so that led to some restrictions on what I was going to change.

For the use cases I tested (which revolved around Kusto and azcopy), these changes increased transfer speeds by up to 10x for many transfer types (though my main goal was to fix a couple of bugs).

So apparently I was sacrificing your use case to improve others.

I remember thinking when I was making the change that a good long term plan would be to have an interface that looked something like:

type TransferManager interface {
  Get() *bytes.Buffer
  Put(b *bytes.Buffer)
  Run(f func() error)
}

This would get passed in whenever someone made a streaming call.

There would be several pre-defined types that could be used, and users could define their own if none of those were satisfactory. The code using TransferManager would never use more than the size of an allocated buffer (unless it was 0, which would error). Run() would act as a thread-pool manager (which is more efficient than spinning off goroutines per copy).

So if static buffers were better for a use case, you could do that (the method before the change). Or a sync.Pool for others. Sometimes a combination of both is great (done this a few times). For long term services doing copies in BE qos transfer class, you could write something that responds to network back pressure (dialing up or down your attempted copy speed).

And because you would pass your concrete type to the call, you could choose to share this between all calls or create a new one per call.
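To illustrate the proposal (hypothetical code; this is not what the SDK later shipped), a static-buffer implementation of that interface, with Run acting as a bounded worker pool, might look like:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// staticTM is a hypothetical TransferManager: a fixed set of
// reusable buffers plus a bounded worker pool, shareable across
// uploads so no per-call pool garbage is created.
type staticTM struct {
	buffers chan *bytes.Buffer // pre-allocated, fixed count
	tasks   chan func() error
	wg      sync.WaitGroup
}

func newStaticTM(bufSize, bufCount, workers int) *staticTM {
	t := &staticTM{
		buffers: make(chan *bytes.Buffer, bufCount),
		tasks:   make(chan func() error, workers),
	}
	for i := 0; i < bufCount; i++ {
		b := &bytes.Buffer{}
		b.Grow(bufSize) // allocate capacity up front
		t.buffers <- b
	}
	for i := 0; i < workers; i++ {
		t.wg.Add(1)
		go func() {
			defer t.wg.Done()
			for f := range t.tasks {
				_ = f() // a real implementation would surface errors
			}
		}()
	}
	return t
}

func (t *staticTM) Get() *bytes.Buffer  { return <-t.buffers } // blocks: caps memory
func (t *staticTM) Put(b *bytes.Buffer) { b.Reset(); t.buffers <- b }
func (t *staticTM) Run(f func() error)  { t.tasks <- f }

// Close stops accepting work and waits for queued tasks to finish.
func (t *staticTM) Close() {
	close(t.tasks)
	t.wg.Wait()
}

func main() {
	tm := newStaticTM(4*1024*1024, 4, 2)
	var done sync.WaitGroup
	done.Add(1)
	tm.Run(func() error {
		b := tm.Get()
		defer tm.Put(b)
		b.WriteString("chunk")
		done.Done()
		return nil
	})
	done.Wait()
	tm.Close()
	fmt.Println("done")
}
```

Because the concrete type is passed in by the caller, a single instance can be shared between all calls (the rclone case) or created per call.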

I'll dig into this deeper next week (going to try to enjoy the holiday), but I am concerned specifically with the OOM mentioned. I'm not familiar with rclone, so if you could link me to the code where it's using the streaming code, and whatever you were running to pull the memory profile, I'd appreciate it.

Thanks for bringing this up and tagging me.

@ncw
Contributor Author

ncw commented Nov 30, 2020

The main upload code is here in rclone

https://github.com/rclone/rclone/blob/f7d9b157079bbf6c7145f94784a60f846f19c637/backend/azureblob/azureblob.go#L1441

Rclone calls azblob.UploadStreamToBlockBlob

// UploadStreamToBlockBlob copies the file held in io.Reader to the Blob at blockBlobURL.
// A Context deadline or cancellation will cause this to error.
func UploadStreamToBlockBlob(ctx context.Context, reader io.Reader, blockBlobURL BlockBlobURL,

which in turn calls copyFromReader, which is the code in question.

func copyFromReader(ctx context.Context, from io.Reader, to blockWriter, o UploadStreamToBlockBlobOptions) (*BlockBlobCommitBlockListResponse, error) {

I don't think there is any way of re-using the copier there which is what we'd need for the use of sync.Pool to be useful and not generate so much garbage.

Since we are making and throwing away a sync.Pool for each object uploaded, each sync.Pool retains its memory until some indeterminate time later - hence the memory leak.

It may well be that there is a better interface for rclone to be calling here - all it wants to do is upload a short blob of known size which doesn't need to be chunked. It has a separate multipart upload routine of its own and doesn't use the transfer manager (which I didn't know about until now!).

Here are docs on how to get memory profiles out of rclone: https://rclone.org/rc/#debugging-memory-use - you'll need the --rc flag and to upload lots of smallish objects, I think (@dcendents how would you describe the files you were uploading in terms of number and size distribution?)

Enjoy your holidays - no rush :-)

@element-of-surprise
Member

Thanks for getting back to me. I'm making some changes now that hopefully will give a more robust and customizable solution.

Though I feel we should probably consider rewriting the streaming interface into a client model, I'm going to keep this change API-compatible.

Back to you next day or so.

element-of-surprise pushed a commit to element-of-surprise/azure-storage-blob-go that referenced this issue Dec 2, 2020
See: Azure#233

There needs to be a more robust way of controlling memory reuse and thread-pooling for streaming copies (or maybe all copies; I will not speak to that here).

This introduces a new interface called TransferManager.  This provides 3 main functions:

- Control buffer allocation
- Control goroutine allocation
- Allow reuse

This allows a developer to control exactly how they wish to handle this for their use case.  We provide two implementations:
- Static buffer allocation similar to the original implementation
- Thread-pool allocation, similar to how it has worked since my change

Additionally, both of these support having shared pools between calls which neither did before.

It also allows for custom creation of implementations by users that better support their needs (a hybrid of the two above, one that adjusts buffer or concurrency based on transfer speed, ...)

Note: the default behavior here returns to the static buffer method.  If azcopy likes the current behavior, it should switch to using NewSyncPool()

Like the original change, this does not break API compatibility.  I think an API change is probably something that should be done, but not in this change.

There are 3 files here that were not part of my change but were not gofmt-formatted, so they have been updated by the go tool:
- bytes_writer.go
- zc_*

All tests that are expected to pass outside the end-to-end test system do pass (comparing the unchanged version against this one).
@element-of-surprise
Member

Added PR: #234

ajankovic added a commit to ajankovic/azure-storage-blob-go that referenced this issue Dec 18, 2020
Add option to UploadStreamToBlockBlob to specify custom BufferPool.

This is needed, for example, to reuse a pool between transfers.

Relates to Azure#233
ajankovic added a commit to ajankovic/azure-storage-blob-go that referenced this issue Dec 21, 2020
Add option to UploadStreamToBlockBlob to specify custom BufferPool.

This is needed, for example, to reuse a pool between transfers.

Relates to Azure#233
@ncw
Contributor Author

ncw commented Jan 9, 2021

Any news on this? It came up again in the rclone forum.

@ajankovic are you working on a fix for this? Your idea of passing in a pool looks like a great fix to me.

ncw added a commit to rclone/rclone that referenced this issue Jan 13, 2021
This fixes the excessive memory use in the Azure SDK by using the as-yet-unmerged PR to implement a custom memory allocator.

Azure/azure-storage-blob-go#242

This is to fix the underlying bug

Azure/azure-storage-blob-go#233

See: https://forum.rclone.org/t/ask-for-settings-recommendation-for-azureblob/21505/
@ajankovic

@ncw yes, the PR is one way to solve it, but it's not up to me to merge it.

Although I am not sure why the tests are failing, since the failures are not related to my changes in any way; they are probably flaky.

@element-of-surprise
Member

You need your environment set up to run some of the tests; I remember this from doing the original patch.

I left comments on your patch in your PR and linked them in my PR. I believe azblob has some kind of release cadence, so things get looked at as releases approach.

Btw, @ajankovic, thanks for the patch for what was a stupid oversight in my patch. I detangled a bunch of stuff and had to write benchmarks that didn't go into the repo to make sure certain things were at least as fast as before for certain apps. It appears I lost the forest for the trees.

@ncw
Contributor Author

ncw commented Jan 13, 2021

I've got a user testing your patch @ajankovic on the rclone forum. Hopefully it will fix their problem.

We need a fix for this ASAP, though; it is causing a lot of problems for rclone users in particular.

@ajankovic

@ncw we have our own fork of rclone that uses my patch, and it's definitely working in our tests. Having all of this upstream would be even better to cut down on maintenance.

You need your environment set up to run some of the tests; I remember this from doing the original patch.

But this failed on CI, where the environment is correctly set up. And my PR changes neither the default behaviour nor the integration tests; if a pool is not provided, it is not used. I'll check it out.

@element-of-surprise
Member

But this failed on CI, where the environment is correctly set up. And my PR changes neither the default behaviour nor the integration tests; if a pool is not provided, it is not used. I'll check it out.

I am 99% sure that certain tests never work in the CI and can only be tested locally if you have the right setup. I'm pretty sure only the devs have that setup.

zezha-msft pushed a commit that referenced this issue Jan 28, 2021
See: #233

(same commit message as the Dec 2 commit above)

Co-authored-by: John Doak <jdoak@DANEWMAN-LENOVO.redmond.corp.microsoft.com>
@zezha-msft
Contributor

@element-of-surprise @ajankovic Thanks a lot for the contribution! We really appreciate it. I also apologize for the long delay in reviewing the PR.

We really like #242 but chose to merge #234 instead, as it is more future-proof. Thanks for the detailed analysis, @ncw!

@zezha-msft
Contributor

We'll make a release soon.

ncw added a commit to rclone/rclone that referenced this issue Jan 28, 2021
This fixes the excessive memory use in the Azure SDK by using the dev
branch which contains a fix for it.

Rclone then implements a custom TransferManager which integrates with
rclone's memory pool.

This is to fix the underlying bug

Azure/azure-storage-blob-go#233

See: https://forum.rclone.org/t/ask-for-settings-recommendation-for-azureblob/21505/
@ncw
Contributor Author

ncw commented Jan 28, 2021

Thanks for fixing this :-)

If you could make the release in the next day or two I can fit it into rclone 1.54 which would be very convenient for me :-)

@zezha-msft
Contributor

It is done! Sorry again for the long delay.

ncw added a commit to rclone/rclone that referenced this issue Jan 30, 2021
…ransferManager

In the Azure SDK there was a bug which caused excessive memory use
when doing repeated transfers:

Azure/azure-storage-blob-go#233

This patch updates the SDK to v0.13.0 which allowed us to implement a
custom TransferManager which integrates with rclone's memory pool.

This fixes the excessive memory consumption.

See: https://forum.rclone.org/t/ask-for-settings-recommendation-for-azureblob/21505/
@ncw
Contributor Author

ncw commented Jan 30, 2021

Thank you very much for making the release @zezha-msft

I've put this fix into rclone now and it will be released with rclone 1.54 and many users will rejoice!

negative0 pushed a commit to negative0/rclone that referenced this issue Aug 13, 2021
…ransferManager

(same commit message as the Jan 30 commit above)