Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SpooledTemporaryFile in MemoryFS? #307

Closed
dargueta opened this issue Jul 5, 2019 · 8 comments
Closed

Use SpooledTemporaryFile in MemoryFS? #307

dargueta opened this issue Jul 5, 2019 · 8 comments

Comments

@dargueta
Copy link
Contributor

dargueta commented Jul 5, 2019

Currently MemoryFS maintains all files as io.BytesIO streams, which works fine until you have a lot of files or some particularly large ones. Eventually your code runs out of memory and crashes.

To alleviate this, I was wondering if we could modify MemoryFS to use tempfile.SpooledTemporaryFile which starts off as a BytesIO, but once it reaches a predefined it size gets flushed out to an actual temporary file on the disk. The change won't have any effect on performance for files under the limit we choose, and it'll allow a MemoryFS file system to scale better.

The maximum file size is configurable so I think we can either hard-code a sane limit (say 128 MiB or whatever) or allow the user to pass one in as an argument to the opener.

Thoughts?

@lurch
Copy link
Contributor

lurch commented Jul 6, 2019

If you know you're going to be creating files larger than available memory, why not use a TempFS in the first place?

IMHO given the way this would change the operation of MemoryFS, if it's added it should default to None (i.e. don't use it), unless the user specifically requests it. (and given it's intended purpose, maybe it'd be worth reading the size parameter as a megabytes value, rather than a bytes value? 🤷‍♂️ (or maybe that'd be too inconsistent with the rest of the API) )

@dargueta
Copy link
Contributor Author

dargueta commented Jul 6, 2019

If you know you're going to be creating files larger than available memory, why not use a TempFS in the first place?

In my case, unit tests. If you've got 1000+ unit tests, holding as much as you can in RAM makes a noticeable difference in performance. If only a few overrun RAM, then you've got to use TempFS for everything and the performance hit adds up.

Theoretically in our case we could use multiple fixtures but that's going to add complexity and more cognitive load on devs.

if it's added it should default to None (i.e. don't use it)

I completely agree. Fortunately, SpooledTemporaryFile doesn't have a default upper limit so we wouldn't need to special-case the default behavior.

or maybe that'd be too inconsistent with the rest of the API

Allow suffixes? Like default unit is bytes but the user can use "128K" instead of "131072" which is far more obvious to someone looking at the code. A little bit more work for us but I think it'd be worth it.

@lurch
Copy link
Contributor

lurch commented Jul 7, 2019

Fortunately, SpooledTemporaryFile doesn't have a default upper limit so we wouldn't need to special-case the default behavior.

Hmmm? The docs say "This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size", and max_size defaults to 0. Which (unless I'm misunderstanding it) suggests that any time the SpooledTemporaryFile (constructed without an explicit max_size) grows larger than 0 bytes, it gets converted (internally) to a TemporaryFile?

....

Hmm, although checking the source, it seems to imply otherwise https://github.com/python/cpython/blob/master/Lib/tempfile.py#L650
Looks like a max_size of 0 gets ignored? I might file this as a documentation bug.
EDIT: Haha, just found https://bugs.python.org/issue34446

Also, https://github.com/python/cpython/blob/master/Lib/tempfile.py#L748 does a comparison directly against max_size, which suggests it always has to be an integer (i.e. it can't be None).

Allow suffixes? Like default unit is bytes but the user can use "128K" instead of "131072"

IMHO adding suffixes to allow specifying filesize might be a bit of a slippery slope - would we need to allow each of "128K", "128KB", "128k", "128kb", "128Kb", etc? I was suggesting a size_megabytes parameter instead of a size parameter (as I can't imagine you'd ever need a finer granularity than this for this MemoryFS suggestion). But of course any decision here is up to @willmcgugan to make, as he knows best what "fits" with the rest of the API.

....

Although on second thoughts, I guess this still wouldn't work if the MemoryFS constructor specified a max_size of 5MB, and then you went ahead and created a million 4MB files? Perhaps the ideal behaviour would be for MemoryFS to track the size of its entire FS, and then do a "similar trick" to SpooledTemporaryFile to start converting its BytesIO objects to TemporaryFile objects when the entire FS's max_size got exceeded. Perhaps a SpooledMemoryFS subclass of MemoryFS makes sense? 😉

@dargueta
Copy link
Contributor Author

dargueta commented Jul 7, 2019

I guess this still wouldn't work if the MemoryFS constructor specified a max_size of 5MB, and then you went ahead and created a million 4MB files?

Rrrgh crap you're right. I'd still like to think this would be helpful, but yeah, good point.

IMHO adding suffixes to allow specifying filesize might be a bit of a slippery slope - would we need to allow each of "128K", "128KB", "128k", "128kb", "128Kb", etc?

I think case-insensitivity would make sense, but we'd only allow K, M, and G, and explicitly state that in the documentation.

Perhaps the ideal behaviour would be for MemoryFS to track the size of its entire FS

That sounds really hard to do accurately, though I think you might be onto something there.

@willmcgugan
Copy link
Member

It's not a bad idea. I probably wouldn't extend MemoryFS. It may be tidier to have another filesystem with this functionality.

That said, I'm not entirely convinced its necessary. If you were to use a TempFS the OS should cache most of it in memory, it may not even touch the physical disk at all. So MemoryFS might not be much faster than TempFS.

It's worth doing a little profiling before implementing this. @dargueta Are you able to switch your MemoryFS objects to TempFS objects and compare performance?

@dargueta
Copy link
Contributor Author

dargueta commented Jul 24, 2019

Are you able to switch your MemoryFS objects to TempFS objects and compare performance?

I'll see what I can do. It's pedal to the metal at work at the moment so probably not for some time.

@dargueta
Copy link
Contributor Author

Just realized #308 might've been an exacerbating factor.

@dargueta
Copy link
Contributor Author

I'm going to close this since it seems to have too many holes in the idea. Might think on it some more, but I also don't want to leave a ticket hanging around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants