
Tar.extract_file: extract content of single file in tarball #95

Open

fredrikekre wants to merge 2 commits into master

Conversation

fredrikekre

No description provided.

@fredrikekre changed the title from "Add a function to read the data of a single file in the tarball." to "Tar.extract_file: extract content of single file in tarball" on Jan 29, 2021
@codecov

codecov bot commented Jan 29, 2021

Codecov Report

Merging #95 (17797c7) into master (37766a2) will increase coverage by 0.08%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #95      +/-   ##
==========================================
+ Coverage   96.37%   96.46%   +0.08%     
==========================================
  Files           4        4              
  Lines         634      650      +16     
==========================================
+ Hits          611      627      +16     
  Misses         23       23              
Impacted Files Coverage Δ
src/extract.jl 96.43% <100.00%> (+0.17%) ⬆️


@StefanKarpinski
Member

Thoughts about API:

  1. Would it be better to extract to an IO handle instead of producing a byte vector? After all, it's easy to convert an IO handle to a byte vector (just call read on it), but if the file is large enough that you don't want to pull it all into memory, this API leaves you stuck.

  2. It might be more general to extract the contents of multiple files to an IO handle based on a predicate. There are times when you might just want to concatenate the contents of all the files (e.g. to search them). The pattern could be a function, a regex matched against the path name, or just a string, in which case it must be exactly the path name.
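To make the second point concrete, one way such a flexible predicate argument could be normalized is sketched below. This is a hypothetical helper, not part of Tar.jl or this PR; it assumes the predicate receives a Header, as Tar.extract's predicate does.

```jl
# Hypothetical normalization of a flexible predicate argument (not Tar.jl API):
to_predicate(p::Function)       = p                             # already a predicate
to_predicate(r::Regex)          = hdr -> occursin(r, hdr.path)  # regex matched against the path
to_predicate(s::AbstractString) = hdr -> hdr.path == s          # exact path match
```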

@StefanKarpinski
Member

If the path is a directory in the tarball, we could extract the contents of all files under that directory. Or we could error.

@fredrikekre
Author

Would it be better to extract to an IO handle instead of producing a byte vector?

Yeah, I thought about that too, so I will update it.

It might be more general to extract the contents of multiple files to an IO handle based on a predicate.

Okay. You might risk reading more files than you actually intended if the predicate is not strict enough, though.

@StefanKarpinski
Member

True, but if it's a string it will only ever match one file, which seems safe enough. We could return the number of files matched. Another thought I had was invoking a callback for each tarball entry with the IO handle of the data stream, but that may be a bit too low-level an API, even though it is maximally flexible.

@fredrikekre
Author

You mean that if you give str::String as a predicate it would only be allowed to match one thing? In

$ tree .
.
├── dir
│   ├── elif.txt
│   └── file.txt
└── file.txt

1 directory, 3 files

would predicate = "file.txt" only match /file.txt?

@fredrikekre
Author

fredrikekre commented Jan 29, 2021

Since the result is written to an io instead of being returned, perhaps the return value could be something like an array of matched files and byte counts:

ret = [
    ("file1", nbytes1),
    ("file2", nbytes2),
]

That could be useful if you want to split the returned stream yourself.

Edit: It should probably just return all the headers of the files that match the predicate.
Author

@fredrikekre fredrikekre left a comment


Rewrote to write to an IO and to return the header(s) of the matching files.

### `Tar.extract_file`

```jl
extract_file(predicate::Function, tarball, out) -> Vector{Header}
```

Author

Maybe this should be

Suggested change
extract_file(predicate::Function, tarball, out) -> Vector{Header}
extract_files(predicate::Function, tarball, out) -> Vector{Header}

?

@KristofferC
Member

To me, giving a single file to extract makes the most sense. If you want to extract multiple files, you just write a loop over the (possibly filtered) files and pass in the same io object?
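Assuming the extract_file(predicate, tarball, out) signature from this PR (with a Header predicate, as in Tar.extract), the multi-file case could then look like the sketch below; note that each call re-reads the tarball from the beginning.

```jl
# Hypothetical usage of the proposed (unmerged) API; the paths are made up for illustration.
io = IOBuffer()
for path in ("Project.toml", "src/Foo.jl")
    Tar.extract_file(hdr -> hdr.path == path, "package.tar", io)
end
bytes = take!(io)   # concatenated contents of the selected files
```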

@StefanKarpinski
Member

Fair enough. Is there an urgent need for this? I'm on parental leave until at least next month.

@KristofferC
Member

KristofferC commented Mar 12, 2021

No, there is no urgency.

Trying this out a bit in Pkg: it allocates 40 GB to read all files in the registry, and most of the time is spent allocating buffer vectors. There should be a few low-hanging fruits.

Edit: Lowest-hanging fruit removed; the next one seems to be the allocations for all the headers.

@KristofferC
Member

KristofferC commented Mar 12, 2021

This is what I came up with for use in Pkg: https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723

Here it is in action in Pkg: JuliaLang/Pkg.jl#2431. I ended up not using anything from this PR.

@KristofferC
Member

After trying to use this for a while, I don't think this API works very well. The problem is that to extract a single file you still need to parse through all the headers (or at least until you find the file). So I think there needs to be some intermediate object that does this for you and caches the result (basically https://github.com/JuliaLang/Pkg.jl/pull/2431/files#diff-81abdf925a6c6e19622f1607e7777ae3bd3caabc942b73a7b33cbc0b96a9edff).

@StefanKarpinski
Member

StefanKarpinski commented Mar 22, 2021

So my reading of that particular Pkg application is that rather than having a function to extract a single file at a time (calling that repeatedly makes the whole thing O(n^2)), you really want a way to iterate through the entire tarball and populate an in-memory representation of its contents. I'd also rather avoid exposing special types like InMemoryFileSystem and InMemoryFile, and maybe have an API that converts the tarball into a dict or pair-vector of paths to byte vectors or something like that. Or even return a Vector{Pair{Path, SubArray{UInt8, 1}}}: we can extract the entire tarball into a single memory buffer and then create subvectors that refer to parts of that buffer. The idea is that extracting the tarball into memory should be fast, and then all you're really doing is building a simple index structure that refers to that extracted memory.
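A rough sketch of that idea, assuming an already-decompressed tarball IO and leaning on the internal Tar.read_tarball/Tar.read_data helpers (the function name is made up, and duplicate paths would need extra handling):

```jl
using Tar

function tarball_views(tar::IO)
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    data = UInt8[]                          # single backing buffer holding all file contents
    ranges = Pair{String,UnitRange{Int}}[]  # entry path => byte range within `data`
    io = IOBuffer()
    Tar.read_tarball(p -> true, tar; buf=buf) do hdr, _
        if hdr.type == :file
            Tar.read_data(tar, io; size=hdr.size, buf=buf)
            start = length(data) + 1
            append!(data, take!(io))
            push!(ranges, hdr.path => start:length(data))
        end
    end
    # build the path => SubArray pairs without copying any of the bytes
    return [path => view(data, r) for (path, r) in ranges]
end
```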

@KristofferC
Member

The reason for the way it is done in Pkg is to avoid having to read data that isn't used. That's why it just reads the headers, remembers position + size, and only reads the data (and caches it) on demand.

Regarding the special types, they are not meant to be "public"; they were just a convenient way to structure the internals for the Pkg use case.

@StefanKarpinski
Member

When we decompress the tarball, we've already read all the data, so we've paid the cost of doing that already. We might as well keep it around in memory — the only cost of doing that is the size of a decompressed General tarball, which is currently 25MB, not so big these days. If we use the SubArray{UInt8,1} approach, we only need the original buffer where the tarball was extracted and we never need to copy it. I'll code it up and we can see how it performs.

@KristofferC
Member

When we decompress the tarball, we've already read all the data

The tarball is decompressed once (when the registry is downloaded). Not every time you read the registry.

@StefanKarpinski
Member

The tarball is decompressed once (when the registry is downloaded). Not every time you read the registry.

I was planning to store it compressed to save disk space (yes, we use a ton of disk space, but every byte counts). If we're going to decompress it on disk, then we could use mmap, access it like an in-memory array and let the kernel decide when to page the contents in/out of memory. That seems easier and more efficient than seeking around the tarball I/O handle.
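For reference, the mmap variant could look roughly like this (the file name is illustrative):

```jl
using Mmap

bytes = Mmap.mmap("registry.tar")   # Vector{UInt8} backed by the file; paged in/out by the kernel
tar_io = IOBuffer(bytes)            # seekable, read-only IO over the mapped bytes
```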

Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command. But when the registry is a single, immutable file, I think it makes more sense to cache the registry data structure more persistently. If the registry changes, it will be a different file entirely. So I think it makes sense to:

  1. Store the registry compressed and decompress on first access;
  2. Keep the decompressed data around for longer than a single Pkg operation;
  3. Cache the parsed TOML data as well, since that can't change either and we're likely to parse the same things over and over again.

We could use a WeakRef to store the registry cache so that it gets cleaned up only when there is GC pressure. We could either use the size+mtime+ctime+inode of the compressed registry file as a cache key and reload when that changes or just flush the cache when some Pkg operation in this process has actually updated the registry and not otherwise. Or both.
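A minimal sketch of such a cache, with all names hypothetical (this is not actual Pkg code):

```jl
# Cache keyed on the compressed registry file's identity, held via WeakRef so the GC is
# free to reclaim the parsed registry when there is memory pressure.
const REGISTRY_CACHE = Dict{Any,WeakRef}()

cache_key(path) = (st = stat(path); (st.size, st.mtime, st.ctime, st.inode))

function cached_registry(load::Function, path::AbstractString)
    key = cache_key(path)
    ref = get(REGISTRY_CACHE, key, nothing)
    reg = ref === nothing ? nothing : ref.value
    if reg === nothing
        reg = load(path)                    # decompress and parse the registry
        REGISTRY_CACHE[key] = WeakRef(reg)
    end
    return reg
end
```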

@KristofferC
Member

KristofferC commented Mar 23, 2021

Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command

After JuliaLang/Pkg.jl#2438 the registry retrieved from PkgServer is considered immutable and it is "cached" through Pkg operations. It caches things based on the git-tree-sha of the registry.

Cache the parsed data TOML data as well

That's been done for a while with

https://github.com/JuliaLang/Pkg.jl/blob/b242fc1ff5de910e600c63db5ba8e022e1e9b83f/src/Registry/registry_instance.jl#L10-L13

It is useful when the registry is a git repo, since we don't cache the registry itself then, but for pkg server registries (with the new cache) it doesn't really do anything, since it will never even attempt to read the same TOML file twice.

Regarding compression/decompression, I think I measured it at around 0.1 seconds to decompress, so perhaps it's fine to keep it decompressed.

@KristofferC
Member

KristofferC commented Mar 26, 2021

Here is an example implementation + benchmark of keeping things compressed:

using Downloads, Tar, Pkg

using Pkg: PlatformEngines.exe7z

reg_url = "https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/28c601505a512da4e5abebc75cd448d48b8dac9c"
reg_file = "registry.tar.gz"

Downloads.download(reg_url, reg_file)

function untar_memory(tar_gz::AbstractString)
    data = Dict{String, Vector{UInt8}}()
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    io = IOBuffer()
    open(`$(exe7z()) x $tar_gz -so`) do tar
        Tar.read_tarball(x->true, tar; buf=buf) do hdr, _
            if hdr.type == :file
                Tar.read_data(tar, io; size=hdr.size, buf=buf)
                data[hdr.path] = take!(io)
            end
        end
    end
    return data
end

using BenchmarkTools
@btime untar_memory(reg_file)

Takes about 100ms.

Just doing

@btime read(`$(exe7z()) x $reg_file -so`)

takes about 60 ms.

@StefanKarpinski
Member

That seems like a good approach then, doesn't it? Storing the data as a Dict{String,String} might be a little conceptually simpler and might be just as fast since String(take!(io)) is zero-copy.
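That variant is a one-line change to the snippet above (hypothetical name, same internal Tar functions and the same exe7z helper):

```jl
function untar_memory_strings(tar_gz::AbstractString)
    data = Dict{String, String}()
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    io = IOBuffer()
    open(`$(exe7z()) x $tar_gz -so`) do tar
        Tar.read_tarball(x -> true, tar; buf=buf) do hdr, _
            if hdr.type == :file
                Tar.read_data(tar, io; size=hdr.size, buf=buf)
                data[hdr.path] = String(take!(io))  # zero-copy: takes ownership of the buffer's bytes
            end
        end
    end
    return data
end
```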

danielmatz added a commit to danielmatz/PkgDeps.jl that referenced this pull request May 26, 2022
Starting with Julia 1.7, a registry can also be stored in a depot as a
gzipped tarball. This commit unpacks the tarball into a temporary
directory. This was the simplest way to support this new feature.

This implementation will unpack a fresh copy of the registry on every
call to `reachable_registries`. We could instead cache the temporary
directory for future calls.

There's an open PR for the Tar.jl package that will be of future
interest:

JuliaIO/Tar.jl#95

This would allow us to instead stream the individual files straight
out of the tarball, rather than having to unpack it into a temporary
location.
@StefanKarpinski
Member

I keep coming back to this seemingly simple feature and getting caught up in all the unfortunate corner cases.

What we want this to do is something equivalent to extracting the tarball somewhere temporary and then reading a single file out (or a set of files based on some criterion). Seems simple enough: read through the tarball, skipping over anything that's not the file you're interested in; when you find a header for the file that you want to extract, copy the data for that file to the given IO handle. And that simple strategy is what this PR implements.

So what's the problem? Unfortunately, tar files can have multiple entries for the same output path. If you're extracting the tarball, that's not an issue—you just extract each entry in order, overwriting the previous one with the latest, and the last one is left at the end. But what happens with this implementation? We end up extracting the contents of every version of the file to the IO handle, not just the last one. That's very much not the desired behavior of sending the contents of the file that would have been left after extraction. In order to make sure you extract only the last version of a file to the IO handle, you either have to be able to seek back and forth in the input tarball or you need to buffer the entire contents of the file you're extracting, which could take a lot of memory. If you don't want to tax your RAM, it might be preferable to extract a selected set of paths to a temporary location and then read what you're interested in from that temp location ... which is exactly what Tar.extract with a predicate does.

There's also the symlink issue, which is alluded to in the PR as a TODO note. This entails similar problems: if the symlink comes before the entry you want to extract in the tarball, then you can extract by scanning forward only, but if the symlink comes later in the tarball, then you have to either go back and find the target file or have it buffered. This situation is even worse, though, since the target could be any of the files in the tarball, so you'd either have to be able to seek or you'd have to buffer everything. Again, in order to avoid potentially blowing out memory, you'd want to extract to a temp location and then read the contents afterwards.

So in the case where your tarball is seekable, you can implement this efficiently by reading to the end, keeping track of where each file's data can be found in the tarball, and then extracting the correct path contents at the end. However, a seekable tarball is atypical, since tarballs are usually compressed, so input is being read either from a command doing decompression or from a decompressor stream object—neither of which is seekable. In the non-seekable case, the best you can do is extract the entire tarball to a temporary directory and stream the contents of the paths you're interested in to the output IO stream.
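In the seekable (uncompressed, on-disk) case, that strategy could look roughly like the sketch below. It leans on the internal Tar.read_tarball machinery, the function name is made up, and symlink resolution (the other complication discussed above) is ignored entirely.

```jl
using Tar

# Copy the contents of the *last* entry for `path` to `out`, matching extraction semantics.
function copy_last_entry(tarball::AbstractString, path::AbstractString, out::IO)
    open(tarball) do io
        buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
        found = nothing
        Tar.read_tarball(p -> true, io; buf=buf) do hdr, _
            if hdr.type == :file
                hdr.path == path && (found = (position(io), hdr.size))
                skip(io, 512 * cld(hdr.size, 512))   # skip the data plus 512-byte block padding
            end
        end
        found === nothing && error("no file entry for $path in $tarball")
        offset, size = found
        seek(io, offset)
        write(out, read(io, size))
    end
    return nothing
end
```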

@StefanKarpinski
Member

Given the complications above, a better API might be Tar.extract_dict(tarball), which extracts in memory to a Dict{String, String} like what Pkg uses. At least we know there's one very concrete use case for that. Not that there aren't concrete use cases for Tar.extract_file; it's just so challenging to implement efficiently because of the way the TAR format works. It's really not a good format.

@JBlaschke

JBlaschke commented Dec 21, 2022

Any chance that this might be included in Tar.jl soon? Part of my work involves sensitive data on systems with mainly shared file systems [1], so extracting this data to a temporary location on the file system is a no-go. I modified @KristofferC's gist (https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723) to handle compression: https://gist.github.com/JBlaschke/ec0f4ece417bdfe1428a85c78c0804f7 and it works pretty well.

I ran some benchmarks where I inflated 187 MB to 1.2 GB, analyzed a single file (~800 MB), and moved on to the next archive. My application's RAM footprint varied from ~3 GB to ~7 GB -- which is not great, but not terrible either.

I think I can safely say that the following workflow is fine by me: i) extract the entire archive to a Dict{String, String} (overwriting duplicate keys in the order the files were archived); ii) take the relevant data from (i); iii) move on to the next archive.

[1]: now that I describe it, it sounds way cooler than it actually is.

@StefanKarpinski
Member

Are you asking about the extract_dict functionality or the originally proposed extract_file functionality?

@JBlaschke

JBlaschke commented Dec 21, 2022

Are you asking about the extract_dict functionality or the originally proposed extract_file functionality?

@StefanKarpinski I am asking for any functionality that will let me extract a given file from a tar archive into an io buffer. I see that there are quite a few corner cases that might make extract_file impractical. So, since many tar archives are also compressed, and the tar format itself requires reading through the whole archive even to extract a single file, extract_dict seems like a reasonable compromise -- especially if it makes progress on this issue.

@tecosaur

I've just come across this, and do let me know if I'm misinterpreting the situation, but it looks like the two main challenges are:

  1. Having to scan the whole tarball to definitively find the record for a particular file
  2. Having to re-scan the tarball for each file when looking at multiple files one at a time, which is needlessly expensive

I may be missing something, but it seems to me that one potential approach is to have a Tar.scan(tarball) -> ScannedTarball method where ScannedTarball is a struct that records the positions and ranges of every file in the tarball. It would effectively be a wrapper type around a Dict{String, Pair{Int, Header}} (or similar). From this, one could then get an IOStream for a particular file, or many files, correctly and efficiently, after paying the scan-the-whole-tarball cost once.
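The shape of that could be something like the struct below (hypothetical names; the index could be populated by the same kind of header scan sketched in the comments above):

```jl
using Tar

struct ScannedTarball
    path::String                                  # on-disk, uncompressed tarball
    index::Dict{String,Tuple{Int64,Tar.Header}}   # entry path => (data offset, header)
end

# Hand back an IO positioned at one file's data; the whole-tarball scan is paid only once.
function file_io(st::ScannedTarball, path::AbstractString)
    offset, hdr = st.index[path]
    io = open(st.path)
    seek(io, offset)
    return io, hdr.size   # caller reads at most `size` bytes and then closes `io`
end
```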

@StefanKarpinski
Member

I do like the API of extracting the tarball to a dict, mapping paths that are files (or symlinks to files) to their content, and just not including non-files. That's already functionality that Pkg is using and it seems generally useful. I'm not sure about extracting an index; as pointed out, since tarballs are usually compressed, you'd have to decompress everything up to the point of the file you want to extract anyway. One option that @staticfloat and I have been kicking around is to have a footer block that contains some magic and an index of where to find each object in the tarball: if the footer exists, you can use it; if it doesn't, we could generate it once and then use that index, which I guess is roughly the proposed API. With regard to compression, it's actually possible to chunk the compression so that you can jump to the start of a specific item in a tarball without having to decode any of the previous chunks. It's a bit tricky, but we might be able to implement this special format, which has the advantage of allowing efficient random-access file extraction while still remaining readable by the classic gzcat and tar commands.

@tecosaur

tecosaur commented Apr 21, 2023

The footer idea is interesting, and a quick search turns up some prior art.

Of course, this only works when you're dealing with a modified tarball. I'd think a progressive enhancement approach may be a good fit here — look for such a footer if it exists; if not, fall back on scanning the tarball, with the option to create the footer as you suggest. For speed, I imagine an analogue of --fast-read could be supported for lazy indexing when you know you're dealing with a "well-formed" tarball, by supporting a "partial scan" state where, if a file is not found in the current index and the scan is incomplete, the scan continues until either the file is found or the end of the tarball is reached.

This seems straightforward to me with an uncompressed tarball, but as you note a compressed tarball is harder. Chunked compression would help, but I wonder if it's possible to create "decompression checkpoints" over the course of the scanning that could be jumped to (for the non-chunked case). A bit like an iterate-ish version of mark (with regard to returning the state instead of modifying the underlying object).

@KristofferC
Member

I think there should be a (very) strong use case before we start putting non-standard stuff in the tarballs. The number of people in this issue over two years suggests to me that there isn't really such a strong case.

In any case, functionality like that in https://github.com/devsnd/tarindexer, where you have an external index, seems preferable to me, since then you don't mess with the tarball at all.

@tecosaur

For reference, the use case that's brought me to this issue is wanting to support being given a tarball in the form of an IO, and then being able to get individual files within it as IO without resorting to temp-files, in a way that's not awfully slow when done repeatedly.

@StefanKarpinski
Member

StefanKarpinski commented Apr 21, 2023

@KristofferC, the tarball would work completely normally with other tools. You can insert what are effectively comment blocks in tarballs that contain this additional information but don't produce any files. The gzip format also has ways to insert "comments", although they're a little sneaky. Zstd has an official way to do it, which is even better.
