
Issues with TarFS - slow to open large tar archives, and exception #275

Closed
davidparks21 opened this issue Mar 24, 2019 · 13 comments
@davidparks21

davidparks21 commented Mar 24, 2019

I'm trying to open a 500 GB tar archive in PyFilesystem. It takes about 45 minutes to open, because it has to read through the entire file; that opening time is the first issue I'm having. By contrast, I'm able to open and stream files via the standard tarfile package using tarfile.open and mytar.next() with no delay. It looks like the full read happens at tarfs.py:275:

    self._directory = OrderedDict(
        (relpath(self._decode(info.name)).rstrip("/"), info) for info in self._tar
    )

It seems like this could be lazily initialized, or, probably better still, the tarfile package could be queried directly on demand rather than maintaining a full dictionary of objects in PyFilesystem.
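For reference, this is roughly the streaming pattern I'm comparing against (a minimal sketch; the archive name is a placeholder):

import tarfile

# next() reads one member header at a time, so there is no up-front scan
# of the whole archive before processing can start.
tar = tarfile.open("mybig.tar.gz")
member = tar.next()
while member is not None:
    if member.isfile():
        data = tar.extractfile(member).read()  # process the member's bytes here
    member = tar.next()
tar.close()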

After I get it to open though, when I try to walk the files with mytarfs.walk.files() I am encountering this exception:

Traceback (most recent call last):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-5f315c5de15b>", line 1, in <module>
    next(g)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 362, in files
    for _path, info in self._iter_walk(fs, path=path):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 433, in _walk_breadth
    for info in _scan(fs, dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 294, in _scan
    for info in fs.scandir(dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/base.py", line 1256, in scandir
    for name in self.listdir(path)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 389, in listdir
    return list(OrderedDict.fromkeys(content))
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 388, in <genexpr>
    content = (parts(child)[1] for child in children if relpath(child))
IndexError: list index out of range

It appears the assumption that parts(child) returns at least 2 elements is not always correct. I have tried to reproduce the issue with a small tar file, but I was unable to do so; it only occurs on my large 500 GB tar file. Unfortunately, the time it takes to open is hindering efforts to debug the issue, so for now I can only report it.

@willmcgugan
Member

I suspect the size of the file is probably a red herring. The only thing I can see that would cause that is if there is a file encoded with a path of empty string or "/".

Could you try something along these lines, and tell me if either of those conditions could be true:

>>> import tarfile
>>> t = tarfile.open("mytar.tar")
>>> [f.path for f in t]

@willmcgugan
Member

As for the slowness, I think you're right about making the directory reading lazy, although you would then incur that delay the first time you list a directory, and potentially in a few other methods.
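Something along these lines, roughly (a sketch of the shape of the change, not the actual TarFS internals; LazyTarDirectory is just an illustrative stand-in):

import tarfile
from collections import OrderedDict

class LazyTarDirectory:
    """Illustrative only: defer the full scan until the listing is first needed."""

    def __init__(self, path):
        self._tar = tarfile.open(path)  # opening no longer scans the whole archive
        self._dir = None

    @property
    def directory(self):
        if self._dir is None:
            # The expensive full scan now happens here, on first use.
            self._dir = OrderedDict(
                (info.name.rstrip("/"), info) for info in self._tar
            )
        return self._dir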

@davidparks21
Author

davidparks21 commented Apr 3, 2019

I don't see any empty string or '/' names from the following code:

t = tarfile.open('mybig.tar.gz')
x = [f.path for f in t]
[a for a in x if a == '' or a == '/']
Out[9]: []

I do see '.' in there, which seems suspiciously similar to what you're asking about.

[a for a in x if a == '.']
Out[11]: ['.']

The tarfile was generated with this command:

sudo tar \
	--exclude /lost+found \
	--exclude /cdrom \
	--exclude /mnt \
	--exclude /proc \
	--exclude /media \
	-I pigz \
	-c \
	-f ${TARGET} ${SOURCE}

where ${SOURCE} is the root directory, /, and ${TARGET} is mybig.tar.gz.

@davidparks21
Author

davidparks21 commented Apr 3, 2019

As for the slowness, I think a full file scan is a reasonable expectation for a directory listing operation (the same occurs in tarfile, since tar is not indexed). But it seems reasonable for walk to proceed efficiently with a single file scan, in the order files appear in the tar file, not 2 file scans as is currently done. That's easy to implement as a generator over the tarfile package. Opening the tarfile should certainly not require a full file scan.
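Roughly what I mean, as a sketch (illustrative names only; it yields files in archive order rather than breadth- or depth-first):

import tarfile

def linear_walk(path):
    """Yield (member_name, fileobj) pairs in the order members appear in the archive."""
    tar = tarfile.open(path)
    try:
        member = tar.next()
        while member is not None:
            if member.isfile():
                # The file object is only valid while the archive is open.
                yield member.name, tar.extractfile(member)
            member = tar.next()
    finally:
        tar.close()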

@davidparks21
Author

Incidentally, [f.path for f in t] consumed 5 GB of RAM in the case of my 500 GB tarfile. So if there's any way to avoid holding the full file listing in memory, even when doing a list operation, that might be a reasonable consideration for large files too. If not, maybe it's worth not holding references to the file listing after returning it from the list operation.
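For what it's worth, most of that memory seems to come from tarfile itself caching every TarInfo it has seen in tar.members. For a one-pass scan that cache can be dropped as you go (a workaround sketch; it means getmembers()/getmember() won't see earlier entries on the same object):

import tarfile

with tarfile.open("mybig.tar.gz") as tar:
    member = tar.next()
    while member is not None:
        # ... inspect member.name / member.size here ...
        tar.members = []        # drop the cached TarInfo objects as we go
        member = tar.next()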

@willmcgugan
Member

I don't think scanning the full directory is avoidable unfortunately. TarFS creates the illusion that tar is structured for random access, when it is anything but. As it is essentially a flat list, there is no way to know if a file exists or not without scanning the entire thing. And you can't list just a single directory.

If TarFS was to discard the directory after an operation, it would have to re-scan it on the next operation and incur the delay again. So if your big tar takes 45 minutes to scan, it would take 45 minutes to do open("foo") and another 45 minutes to do open("bar").

not 2 file scans as is currently done.

Sorry, what 2 file scans? There is only one as far as I can tell. The directory is cached on first use.

To be honest, if your use case is satisfied with a linear scan of the tar, then TarFS may just add overhead you don't need, and you might want to just use the builtin tarfile module.

BTW, I am not the original author of TarFS; it was contributed by @althonos. I also don't claim to be particularly knowledgeable about the tar format, so I could well be mistaken about how it works.

I think the IndexError you found is caused by that "." file. I'm guessing it stores information about the directory that was compressed. Odd that we've never seen that before. I'll implement a workaround when I next have the time, or would accept a PR if you would like to tackle it yourself.
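For reference, the kind of filtering I have in mind (a sketch, not the final fix): skip any entry whose relative path normalises away to nothing, such as "." or "./".

import posixpath

def is_meaningful_name(name):
    """Return False for entries such as '.', './' or '/' that name no real file."""
    normalized = posixpath.normpath(name.lstrip("/"))
    return normalized not in ("", ".")

# is_meaningful_name(".")        -> False
# is_meaningful_name("./")       -> False
# is_meaningful_name("foo/bar")  -> True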

@davidparks21
Author

davidparks21 commented Apr 4, 2019

When walking the tar file there are two scans because:

  1. It first reads in the full directory structure, which necessitates a full scan and decompression of the tar file just to open the filesystem.
  2. The walk then performs a second scan to read the individual files, decompressing everything again.

In the case of a tar file that we only open and walk (arguably a common use case with tar files), #1 is redundant.

For my present use case I have switched to using tarfile for processing the large tar archive. That leaves me with one walk function for tar files in particular and another for all other filesystems, which seems like exactly what PyFilesystem strives to avoid.

I see your point about abstracting tar files as random-access filesystems, though you might consider a more efficient (key, value) lookup than keeping (long_filename_string, full_tarinfo_object_with_all_its_attributes) to index the tar file.

I don't have the bandwidth to tackle any of this myself, but if it remains a feature open to contribution, and I find bandwidth later, I might opt to take it on. It'd be nice to make TarFS() more large-file friendly. The two scans for a walk are the most limiting factor, though; 5 GB of RAM is probably available to those handling massive tar files.
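The kind of slimmer index I'm picturing, as a sketch (the field choice is a guess at what TarFS actually needs; offset_data is the TarInfo attribute giving the position of the member's data within the uncompressed archive):

import tarfile
from typing import Dict, NamedTuple

class Entry(NamedTuple):
    """Just enough to answer listing/stat questions without keeping a whole TarInfo."""
    offset_data: int
    size: int
    is_dir: bool

def build_index(path) -> Dict[str, Entry]:
    index: Dict[str, Entry] = {}
    tar = tarfile.open(path)
    info = tar.next()
    while info is not None:
        index[info.name.rstrip("/")] = Entry(info.offset_data, info.size, info.isdir())
        info = tar.next()
    tar.close()
    return index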

@willmcgugan
Member

In the case of a tar file where we only open and walk it (arguably a common use case with tar files), # 1 is redundant.

I'm afraid it may be unavoidable. Files in a tar may be stored in no particular order, and there are confounding factors such as a "foo/bar/baz" file existing without a directory called "foo". PyFilesystem also guarantees that you can walk the directory structure depth-first or breadth-first; if the walk didn't return files in a well-defined order, it would break much of the moving and copying code.

I think ultimately the abstraction means that it is never going to be as efficient as simply unpacking the tar.

With regard to storing the directory more efficiently: TarFS uses much of the data that tarfile returns in a TarInfo, but it may not need all of it, so there may be room for a more compact representation. I'll look into that...

Contributions are most welcome. You may find technical solutions that have escaped me!

@davidparks21
Author

The fact that walk defaults to "breadth" in fs.base is a little problematic, and perhaps, I could argue, a little over-specific for a general-purpose framework. A default of None (unspecified) would be better; the filesystem implementation would then be free to decide what order is most efficient when the user doesn't specify.

I agree, though, that if a user specifies "breadth" or "depth", then the first scan is unavoidable. But if the user does not specify, then I'd argue that simply lazily initializing the directory structure would avoid the redundant file scan, and the filesystem should be free to choose whatever walk order is most efficient.

In fact, an efficient implementation of TarFS might opt to provide a third walk option, "linear", which would make it explicit to users that there's an efficient option available. Filesystem implementations should arguably be free to extend this.

@willmcgugan
Member

In most filesystems directories are a first-class concept, so supporting a "no particular order" directory walk has not been a requirement. But I do see potential in your "linear" walk idea; in addition to archives, there is at least Amazon S3, which doesn't have true directories.

Copying to a true filesystem where the source walk is in an arbitrary order would be a more expensive operation, since PyFilesystem couldn't assume that directories are created before files are copied into them. For instance, if the source has a file called "foo/bar/baz", PyFilesystem would have to check for the existence of "foo" and "bar" in the destination before creating "baz". For that reason I think breadth is a better default.

Another possible solution to the double-scan problem may be to allow an implementation to advertise that it has a more efficient copy available. The fs.copy module could defer to a fscopy method if it is available, or default to the standard copy implementation. That feels more in the spirit of PyFilesystem, as the developer need not think about the nature of the filesystems involved in the copy.
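Roughly what I mean, as a sketch (fscopy is a hypothetical method name, not part of the current API):

from fs.copy import copy_fs  # the standard, walk-based implementation

def copy_fs_best_effort(src_fs, dst_fs):
    """Prefer an implementation-provided bulk copy when one is advertised."""
    fscopy = getattr(src_fs, "fscopy", None)  # hypothetical optimisation hook
    if callable(fscopy):
        fscopy(dst_fs)            # e.g. TarFS could stream members in archive order
    else:
        copy_fs(src_fs, dst_fs)   # fall back to the generic walk-based copy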

@lurch
Contributor

lurch commented Apr 7, 2019

https://docs.python.org/3/library/tarfile.html#tarfile-objects says "It is possible to store a file in a tar archive several times." and https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.getmember says "If a member occurs more than once in the archive, its last occurrence is assumed to be the most up-to-date version.".
I've not looked at the code in TarFS, but I wonder how the "linear walk" idea would deal with the same member appearing in the archive multiple times?

And WRT the "double scan": I guess if the tarfile is uncompressed and is being opened read-only, then the first scan only needs to (additionally) store the file offsets, and the walk code would then only need to seek rather than re-parse all the file headers. But I've not looked at the code to see a) whether the tarfile or TarFS modules do that already, or b) whether that would be possible using the tarfile module.
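A sketch of what that first scan could record, assuming an uncompressed archive opened read-only (offset_data is a TarInfo attribute giving the position of the member's data; keeping the last entry per name also handles members stored more than once):

import tarfile

def scan_offsets(path):
    """Map each member name to (data offset, size); later duplicates overwrite earlier ones."""
    offsets = {}
    with tarfile.open(path) as tar:
        info = tar.next()
        while info is not None:
            if info.isfile():
                offsets[info.name] = (info.offset_data, info.size)
            info = tar.next()
    return offsets

def read_member(path, offsets, name):
    """Re-read a member by seeking straight to its data, without re-parsing headers."""
    offset, size = offsets[name]
    with open(path, "rb") as raw:
        raw.seek(offset)
        return raw.read(size)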

@willmcgugan
Member

Will have a fix for that exception shortly. And I've made the directory loading lazy.

But the larger issue of the inefficient walk may have to be a wontfix for now. @lurch's point about duplicate filenames may mean that the linear walk idea is a non-starter.

Happy to revisit this at a later date.

@willmcgugan
Member

Fixed the exception in v2.4.5. @davidparks21, can you test?
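For reference, a quick way to check after upgrading (pip install -U "fs>=2.4.5"; the archive name is a placeholder):

from fs.tarfs import TarFS

with TarFS("mybig.tar.gz") as tar_fs:
    for path in tar_fs.walk.files():
        print(path)
        break  # just confirm the walk starts without raising IndexError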

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Dec 13, 2019
[2.4.11]:
Added
Added geturl for TarFS and ZipFS for 'fs' purpose. NoURL for 'download' purpose.
Added helpful root path in CreateFailed exception
Added Python 3.8 support

Fixed
Fixed tests leaving tmp files
Fixed typing issues
Fixed link namespace returning bytes
Fixed broken FSURL in windows
Fixed hidden exception at fs.close() when opening an absent zip/tar file URL
Fixed abstract class import from collections which would break on Python 3.8
Fixed incorrect imports of mock on Python 3
Removed some unused imports and unused requirements.txt file
Added mypy checks to Travis.
Fixed missing errno.ENOTSUP on PyPy.
Fixed bug in a decorator that would trigger an AttributeError when a class was created that implemented a deprecated method and had no docstring of its own.

Changed
Entire test suite has been migrated to pytest.
Style checking is now enforced using flake8; this involved some code cleanup such as removing unused imports.

[2.4.10]:
Fixed
Fixed broken WrapFS.movedir

[2.4.9]:
Fixed
Restored fs.path import
Fixed potential race condition in makedirs.
Added missing methods to WrapFS.

Changed
MemFS now immediately releases all memory it holds when close() is called, rather than when it gets garbage collected.
FTPFS now translates EOFError into RemoteConnectionError.
Added automatic close for filesystems that go out of scope.

[2.4.8]:
Changed
geturl will return URL with user/password if needed @zmej-serow

[2.4.7]:
Added
Flag to OSFS to disable env var expansion

[2.4.6]:
Added
Implemented geturl in FTPFS @zmej-serow

Fixed
Fixed FTP test suite when time is not UTC-0 @mrg0029
Fixed issues with paths in tarfs PyFilesystem/pyfilesystem2#284

Changed
Dropped Python3.3 support

[2.4.5]:
Fixed
Restored deprecated setfile method with deprecation warning to change to writefile
Fixed exception when a tarfile contains a path called '.' PyFilesystem/pyfilesystem2#275
Made TarFS directory loading lazy

Changed
Detect case insensitivity by writing a temp file

[2.4.4]:
Fixed
OSFS fail in nfs mounts

[2.4.3]:
Fixed
Fixed broken "case_insensitive" check
Fixed Windows test fails

[2.4.2]:
Fixed
Fixed exception when Python runs with -OO

[2.4.1]:
Fixed
Fixed hash method missing from WrapFS

[2.4.0]:
Added
Added exclude and filter_dirs arguments to walk

Micro-optimizations to walk
[2.3.1]:
Fixed
Add encoding check in OSFS.validatepath

[2.3.0]:
Fixed
IllegalBackReference had mangled error message

Added
FS.hash method