
Issues with TarFS - slow to open large tar archives, and exception #275

Closed
davidparks21 opened this issue Mar 24, 2019 · 13 comments
@davidparks21

davidparks21 commented Mar 24, 2019

I'm trying to open a 500 GB tar archive in PyFilesystem. It takes about 45 minutes to open, because it has to read through the entire file; that opening time is the first issue I'm having. By contrast, I'm able to open and stream files via the standard tarfile package using tarfile.open and mytar.next() with no delay. It looks like the full read happens at tarfs.py:275:

    self._directory = OrderedDict(
        (relpath(self._decode(info.name)).rstrip("/"), info) for info in self._tar
    )

It seems like this could be lazily initialized, or, probably better still, the tarfile package could be queried directly on demand rather than maintaining a full dictionary of objects in PyFilesystem.
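For reference, this is roughly the streaming pattern I'm comparing against (a minimal sketch; the archive name is a placeholder):

import tarfile

# next() reads one member header at a time, so there is no up-front scan
# of the whole archive before processing can start.
tar = tarfile.open("mybig.tar.gz")
member = tar.next()
while member is not None:
    if member.isfile():
        data = tar.extractfile(member).read()  # process the member's bytes here
    member = tar.next()
tar.close()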

After I get it to open though, when I try to walk the files with mytarfs.walk.files() I am encountering this exception:

Traceback (most recent call last):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-5f315c5de15b>", line 1, in <module>
    next(g)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 362, in files
    for _path, info in self._iter_walk(fs, path=path):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 433, in _walk_breadth
    for info in _scan(fs, dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 294, in _scan
    for info in fs.scandir(dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/base.py", line 1256, in scandir
    for name in self.listdir(path)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 389, in listdir
    return list(OrderedDict.fromkeys(content))
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 388, in <genexpr>
    content = (parts(child)[1] for child in children if relpath(child))
IndexError: list index out of range

It appears the assumption that parts(child) returns at least 2 elements is not always correct. I have tried to reproduce the issue with a small tar file, but I was unable to do so; it only occurs on my large 500 GB tar file. Unfortunately, the time it takes to open is hindering efforts to debug the issue, so for now I can only report it.

@willmcgugan
Member

I suspect the size of the file is probably a red herring. The only thing I can see that would cause that is if there is a file encoded with a path of empty string or "/".

Could you try something along these lines, and tell me if either of those conditions could be true:

>>> import tarfile
>>> t = tarfile.open("mytar.tar")
>>> [f.path for f in t]

@willmcgugan
Member

As for the slowness, I think you're right about making the directory reading lazy, although you would then incur that delay the first time you list a directory, and potentially in a few other methods.
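Something along these lines, roughly (a sketch of the shape of the change, not the actual TarFS internals; LazyTarDirectory is just an illustrative stand-in):

import tarfile
from collections import OrderedDict

class LazyTarDirectory:
    """Illustrative only: defer the full scan until the listing is first needed."""

    def __init__(self, path):
        self._tar = tarfile.open(path)  # opening no longer scans the whole archive
        self._dir = None

    @property
    def directory(self):
        if self._dir is None:
            # The expensive full scan now happens here, on first use.
            self._dir = OrderedDict(
                (info.name.rstrip("/"), info) for info in self._tar
            )
        return self._dir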

@davidparks21
Author

davidparks21 commented Apr 3, 2019

I don't see any empty string or '/' names from the following code:

t = tarfile.open('mybig.tar.gz')
x = [f.path for f in t]
[a for a in x if a == '' or a == '/']
Out[9]: []

I do see '.' in there, which seems suspiciously similar to what you're asking about.

[a for a in x if a == '.']
Out[11]: ['.']

The tarfile was generated with this command:

sudo tar \
	--exclude /lost+found \
	--exclude /cdrom \
	--exclude /mnt \
	--exclude /proc \
	--exclude /media \
	-I pigz \
	-c \
	-f ${TARGET} ${SOURCE}

where ${SOURCE} is the root directory, /, and ${TARGET} is mybig.tar.gz.

@davidparks21
Author

davidparks21 commented Apr 3, 2019

As for the slowness, I think a full file scan is a reasonable expectation for a directory listing operation (the same occurs in tarfile, since tar is not indexed). But it seems reasonable for walk to proceed efficiently with a single file scan, in the order files appear in the tar file, not 2 file scans as is currently done. That's easy to implement as a generator over the tarfile package. Opening the tarfile should certainly not require a full file scan.
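Roughly what I mean, as a sketch (illustrative names only; it yields files in archive order rather than breadth- or depth-first):

import tarfile

def linear_walk(path):
    """Yield (member_name, fileobj) pairs in the order members appear in the archive."""
    tar = tarfile.open(path)
    try:
        member = tar.next()
        while member is not None:
            if member.isfile():
                # The file object is only valid while the archive is open.
                yield member.name, tar.extractfile(member)
            member = tar.next()
    finally:
        tar.close()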

@davidparks21
Author

Incidentally, [f.path for f in t] consumed 5 GB of RAM in the case of my 500 GB tarfile. So if there's any way to avoid holding the full file listing in memory, even when doing a list operation, that might be a reasonable consideration for large files too. If not, maybe it's worth not holding references to the file listing after returning it from the list operation.
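For what it's worth, most of that memory seems to come from tarfile itself caching every TarInfo it has seen in tar.members. For a one-pass scan that cache can be dropped as you go (a workaround sketch; it means getmembers()/getmember() won't see earlier entries on the same object):

import tarfile

with tarfile.open("mybig.tar.gz") as tar:
    member = tar.next()
    while member is not None:
        # ... inspect member.name / member.size here ...
        tar.members = []        # drop the cached TarInfo objects as we go
        member = tar.next()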

@willmcgugan
Member

I don't think scanning the full directory is avoidable unfortunately. TarFS creates the illusion that tar is structured for random access, when it is anything but. As it is essentially a flat list, there is no way to know if a file exists or not without scanning the entire thing. And you can't list just a single directory.

If TarFS was to discard the directory after an operation, it would have to re-scan it on the next operation and incur the delay again. So if your big tar takes 45 minutes to scan, it would take 45 minutes to do open("foo") and another 45 minutes to do open("bar").

not 2 file scans as is currently done.

Sorry, what 2 file scans? There is only one as far as I can tell. The directory is cached on first use.

To be honest, if your use case is satisfied with a linear scan of the tar, then TarFS may just add overhead you don't need, and you might want to just use the builtin tarfile module.

BTW, I am not the original author of TarFS; it was contributed by @althonos. I also don't claim to be particularly knowledgeable about the tar format, so I could well be mistaken about how it works.

I think the IndexError you found is caused by that "." file. I'm guessing it stores information about the directory that was compressed. Odd that we've never seen that before. I'll implement a workaround when I next have the time, or would accept a PR if you would like to tackle it yourself.
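For reference, the kind of filtering I have in mind (a sketch, not the final fix): skip any entry whose relative path normalises away to nothing, such as "." or "./".

import posixpath

def is_meaningful_name(name):
    """Return False for entries such as '.', './' or '/' that name no real file."""
    normalized = posixpath.normpath(name.lstrip("/"))
    return normalized not in ("", ".")

# is_meaningful_name(".")        -> False
# is_meaningful_name("./")       -> False
# is_meaningful_name("foo/bar")  -> True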

@davidparks21
Author

davidparks21 commented Apr 4, 2019

When walking the tar file there are two scans because:

  1. It first reads in the full directory structure, which necessitates a full scan and decompression of the tar file just to open the filesystem.
  2. The walk then performs a second scan to read the individual files, decompressing everything again.

In the case of a tar file that we only open and walk (arguably a common use case with tar files), #1 is redundant.

For my present use case I have switched to using tarfile for processing the large tar archive. That leaves me with one walk function for tar files in particular and another for all other filesystems, which seems like exactly what PyFilesystem strives to avoid.

I see your point about abstracting tar files as random-access filesystems, though you might consider a more efficient (key, value) lookup than keeping (long_filename_string, full_tarinfo_object_with_all_its_attributes) to index the tar file.

I don't have the bandwidth to tackle any of this myself, but if it remains a feature open to contribution, and I find bandwidth later, I might opt to take it on. It'd be nice to make TarFS() more large-file friendly. The two scans for a walk are the most limiting factor, though; 5 GB of RAM is probably available to those handling massive tar files.
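The kind of slimmer index I'm picturing, as a sketch (the field choice is a guess at what TarFS actually needs; offset_data is the TarInfo attribute giving the position of the member's data within the uncompressed archive):

import tarfile
from typing import Dict, NamedTuple

class Entry(NamedTuple):
    """Just enough to answer listing/stat questions without keeping a whole TarInfo."""
    offset_data: int
    size: int
    is_dir: bool

def build_index(path) -> Dict[str, Entry]:
    index: Dict[str, Entry] = {}
    tar = tarfile.open(path)
    info = tar.next()
    while info is not None:
        index[info.name.rstrip("/")] = Entry(info.offset_data, info.size, info.isdir())
        info = tar.next()
    tar.close()
    return index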

@willmcgugan
Member

In the case of a tar file where we only open and walk it (arguably a common use case with tar files), # 1 is redundant.

I'm afraid it may be unavoidable. Files in a tar may be stored in no particular order, and there are confounding factors such as a "foo/bar/baz" file existing without a directory called "foo". PyFilesystem also guarantees that you can walk the directory structure depth-first or breadth-first; if the walk didn't return files in a well-defined order, it would break much of the moving and copying code.

I think ultimately the abstraction means that it is never going to be as efficient as simply unpacking the tar.

With regard to storing the directory more efficiently: TarFS uses much of the data that tarfile returns in a TarInfo, but it may not need all of it, so there may be room for a more compact representation. I'll look into that...

Contributions are most welcome. You may find technical solutions that have escaped me!

@davidparks21
Author

The fact that walk defaults to "breadth" in fs.base is a little problematic, and perhaps, I could argue, a little over-specific for a general-purpose framework. A default of None (unspecified) would be better; the filesystem implementation would then be free to decide what order is most efficient when the user doesn't specify.

I agree, though, that if a user specifies "breadth" or "depth", then the first scan is unavoidable. But if the user does not specify, then I'd argue that simply lazily initializing the directory structure would avoid the redundant file scan, and the filesystem should be free to choose whatever walk order is most efficient.

In fact, an efficient implementation of TarFS might opt to provide a third walk option, "linear", which would make it explicit to users that there's an efficient option available. Filesystem implementations should arguably be free to extend this.

@willmcgugan
Member

In most filesystems directories are a first-class concept, so supporting a "no particular order" directory walk has not been a requirement. But I do see potential in your "linear" walk idea; in addition to archives, there is at least Amazon S3, which doesn't have true directories.

Copying to a true filesystem where the source walk is in an arbitrary order would be a more expensive operation, since PyFilesystem couldn't assume that directories are created before files are copied into them. For instance, if the source has a file called "foo/bar/baz", PyFilesystem would have to check for the existence of "foo" and "bar" in the destination before creating "baz". For that reason I think breadth is a better default.

Another possible solution to the double-scan problem may be to allow an implementation to advertise that it has a more efficient copy available. The fs.copy module could defer to a fscopy method if it is available, or default to the standard copy implementation. That feels more in the spirit of PyFilesystem, as the developer need not think about the nature of the filesystems involved in the copy.
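Roughly what I mean, as a sketch (fscopy is a hypothetical method name, not part of the current API):

from fs.copy import copy_fs  # the standard, walk-based implementation

def copy_fs_best_effort(src_fs, dst_fs):
    """Prefer an implementation-provided bulk copy when one is advertised."""
    fscopy = getattr(src_fs, "fscopy", None)  # hypothetical optimisation hook
    if callable(fscopy):
        fscopy(dst_fs)            # e.g. TarFS could stream members in archive order
    else:
        copy_fs(src_fs, dst_fs)   # fall back to the generic walk-based copy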

@lurch
Contributor

lurch commented Apr 7, 2019

https://docs.python.org/3/library/tarfile.html#tarfile-objects says "It is possible to store a file in a tar archive several times." and https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.getmember says "If a member occurs more than once in the archive, its last occurrence is assumed to be the most up-to-date version.".
I've not looked at the code in TarFS, but I wonder how the "linear walk" idea would deal with the same member appearing in the archive multiple times?

And WRT the "double scan": I guess if the tarfile is uncompressed and is being opened read-only, then the first scan only needs to (additionally) store the file offsets, and the walk code would then only need to seek rather than re-parse all the file headers. But I've not looked at the code to see a) whether the tarfile or TarFS modules do that already, or b) whether that would be possible using the tarfile module.
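A sketch of what that first scan could record, assuming an uncompressed archive opened read-only (offset_data is a TarInfo attribute giving the position of the member's data; keeping the last entry per name also handles members stored more than once):

import tarfile

def scan_offsets(path):
    """Map each member name to (data offset, size); later duplicates overwrite earlier ones."""
    offsets = {}
    with tarfile.open(path) as tar:
        info = tar.next()
        while info is not None:
            if info.isfile():
                offsets[info.name] = (info.offset_data, info.size)
            info = tar.next()
    return offsets

def read_member(path, offsets, name):
    """Re-read a member by seeking straight to its data, without re-parsing headers."""
    offset, size = offsets[name]
    with open(path, "rb") as raw:
        raw.seek(offset)
        return raw.read(size)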

@willmcgugan
Member

Will have a fix for that exception shortly. And I've made the directory loading lazy.

But the larger issue of the inefficient walk may have to be a wontfix for now. @lurch's point about duplicate filenames may mean that the linear walk idea is a non-starter.

Happy to revisit this at a later date.

@willmcgugan
Member

Fixed the exception in v2.4.5. @davidparks21, can you test?
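For reference, a quick way to check after upgrading (pip install -U "fs>=2.4.5"; the archive name is a placeholder):

from fs.tarfs import TarFS

with TarFS("mybig.tar.gz") as tar_fs:
    for path in tar_fs.walk.files():
        print(path)
        break  # just confirm the walk starts without raising IndexError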

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Dec 13, 2019
[2.4.11]:
Added
Added geturl for TarFS and ZipFS for 'fs' purpose. NoURL for 'download' purpose.
Added helpful root path in CreateFailed exception
Added Python 3.8 support

Fixed
Fixed tests leaving tmp files
Fixed typing issues
Fixed link namespace returning bytes
Fixed broken FSURL in windows
Fixed hidden exception at fs.close() when opening an absent zip/tar file URL
Fixed abstract class import from collections which would break on Python 3.8
Fixed incorrect imports of mock on Python 3
Removed some unused imports and unused requirements.txt file
Added mypy checks to Travis.
Fixed missing errno.ENOTSUP on PyPy.
Fixed bug in a decorator that would trigger an AttributeError when a class was created that implemented a deprecated method and had no docstring of its own.

Changed
Entire test suite has been migrated to pytest.
Style checking is now enforced using flake8; this involved some code cleanup such as removing unused imports.

[2.4.10]:
Fixed
Fixed broken WrapFS.movedir

[2.4.9]:
Fixed
Restored fs.path import
Fixed potential race condition in makedirs.
Added missing methods to WrapFS.

Changed
MemFS now immediately releases all memory it holds when close() is called, rather than when it gets garbage collected.
FTPFS now translates EOFError into RemoteConnectionError.
Added automatic close for filesystems that go out of scope.

[2.4.8]:
Changed
geturl will return URL with user/password if needed @zmej-serow

[2.4.7]:
Added
Flag to OSFS to disable env var expansion

[2.4.6]:
Added
Implemented geturl in FTPFS @zmej-serow

Fixed
Fixed FTP test suite when time is not UTC-0 @mrg0029
Fixed issues with paths in tarfs PyFilesystem/pyfilesystem2#284

Changed
Dropped Python3.3 support

[2.4.5]:
Fixed
Restored deprecated setfile method with deprecation warning to change to writefile
Fixed exception when a tarfile contains a path called '.' PyFilesystem/pyfilesystem2#275
Made TarFS directory loading lazy

Changed
Detect case insensitivity by writing a temp file

[2.4.4]:
Fixed
OSFS fail in nfs mounts

[2.4.3]:
Fixed
Fixed broken "case_insensitive" check
Fixed Windows test fails

[2.4.2]:
Fixed
Fixed exception when Python runs with -OO

[2.4.1]:
Fixed
Fixed hash method missing from WrapFS

[2.4.0]:
Added
Added exclude and filter_dirs arguments to walk

Micro-optimizations to walk
[2.3.1]:
Fixed
Add encoding check in OSFS.validatepath

[2.3.0]:
Fixed
IllegalBackReference had mangled error message

Added
FS.hash method