
Sparse files get de-sparsified #409

Closed
lukts30 opened this issue Nov 18, 2023 · 13 comments · Fixed by #723


@lukts30

lukts30 commented Nov 18, 2023

After execution, when output files are copied from the worker to the filesystem CAS directory, they lose their sparse-file properties and get fully de-sparsified (the data is copied byte by byte rather than moved or copied via reflink/copy_file_range). This obviously increases disk usage, and the entire process of copying and hashing the file takes considerably more time than a simple move followed by a sha256sum calculation on a sparse file.

https://github.com/TraceMachina/native-link/blob/f989e612715a7fe645e69c4c78a50e9b7262ad17/config/examples/basic_cas.json

genrule(
  name = "create_sparse_file",
  outs = ["re_large_file"],
  cmd = "fallocate -l 5G $(OUTS) "
)

lukas@PC6061B ~/Downloads/turbo-cache $ filefrag -v tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file
Filesystem type is: 9123683e
File size of tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file is 5368709120 (1310720 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 1863472064..1863537599:  65536:             unwritten
   1:    65536..  196607: 1872450496..1872581567: 131072: 1863537600: unwritten
   2:   196608..  327679: 1881560000..1881691071: 131072: 1872581568: unwritten
   3:   327680..  458751: 1882974994..1883106065: 131072: 1881691072: unwritten
   4:   458752..  524287: 1883657152..1883722687:  65536: 1883106066: unwritten
   5:   524288..  589823: 1883741638..1883807173:  65536: 1883722688: unwritten
   6:   589824..  655359: 1890763862..1890829397:  65536: 1883807174: unwritten
   7:   655360..  720895: 2028229568..2028295103:  65536: 1890829398: unwritten
   8:   720896..  917503: 2028884928..2029081535: 196608: 2028295104: unwritten
   9:   917504.. 1114111: 2030457792..2030654399: 196608: 2029081536: unwritten
  10:  1114112.. 1310719: 2030982080..2031178687: 196608: 2030654400: last,unwritten,eof
tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file: 11 extents found

File size of tmp/turbo_cache/data-worker-test/content_path-cas/7f06c62352aebd8125b2a1841e2b9e1ffcbed602f381c3dcb3200200e383d1d5-5368709120 is 5368709120 (1310720 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 2035009969..2035075504:  65536:            
   1:    65536..  131071: 2035592031..2035657566:  65536: 2035075505:
   2:   131072..  196607: 2037188178..2037253713:  65536: 2035657567:
   3:   196608..  262143: 2037273536..2037339071:  65536: 2037253714:
   4:   262144..  327679: 2038713195..2038778730:  65536: 2037339072:
   5:   327680..  458751: 2047055695..2047186766: 131072: 2038778731:
   6:   458752..  655359: 2047235008..2047431615: 196608: 2047186767:
   7:   655360..  851967: 2047497152..2047693759: 196608: 2047431616:
   8:   851968..  858862: 2047726528..2047733422:   6895: 2047693760:
   9:   858863..  859572: 2048257283..2048257992:    710: 2047733423:
  10:   859573..  859584: 1548714348..1548714359:     12: 2048257993:
  11:   859585..  859589: 1548714158..1548714162:      5: 1548714360:
  12:   859590..  925125: 2046972864..2047038399:  65536: 1548714163:
  13:   925126..  990661: 2047759296..2047824831:  65536: 2047038400:
  14:   990662.. 1187269: 2053067711..2053264318: 196608: 2047824832:
  15:  1187270.. 1310719: 2055361472..2055484921: 123450: 2053264319: last,eof
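
For reference, a hole-preserving copy on Linux can walk the data extents with lseek(SEEK_DATA/SEEK_HOLE) and copy only those ranges, leaving the holes unwritten. A minimal sketch using the libc crate (sparse_copy is an illustrative helper, not nativelink code; the destination file must be opened writable):

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

fn sparse_copy(src: &File, dst: &File) -> io::Result<u64> {
    let (src_fd, dst_fd) = (src.as_raw_fd(), dst.as_raw_fd());
    let len = src.metadata()?.len() as i64;
    dst.set_len(len as u64)?; // pre-size the destination; untouched ranges stay holes
    let (mut copied, mut pos) = (0u64, 0i64);
    while pos < len {
        // Find the next run of real data, then the hole that ends it.
        let data = unsafe { libc::lseek(src_fd, pos, libc::SEEK_DATA) };
        if data < 0 {
            break; // ENXIO: only a trailing hole remains
        }
        let hole = unsafe { libc::lseek(src_fd, data, libc::SEEK_HOLE) };
        let (mut off_in, mut off_out) = (data, data);
        let mut remaining = (hole - data) as usize;
        while remaining > 0 {
            // copy_file_range lets the kernel (and reflink-capable filesystems) move the bytes.
            let n = unsafe {
                libc::copy_file_range(src_fd, &mut off_in, dst_fd, &mut off_out, remaining, 0)
            };
            if n <= 0 {
                return Err(io::Error::last_os_error());
            }
            remaining -= n as usize;
            copied += n as u64;
        }
        pos = hole;
    }
    Ok(copied)
}
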
@MarcusSorealheis
Collaborator

@lukts30 We're looking into this one.

@allada
Collaborator

allada commented Nov 19, 2023

I'm trying to focus on the root problem and on how we can support this kind of feature without breaking other API requirements (like NFS or other filesystems/stores that don't support sparse files). Let me pose the problem back to you to check that this is the underlying issue:

Problem:
When the Worker & CAS are on the same machine, it requires a full copy of the data.

If that is the problem, a possible solution:
Make the Worker aware of when it is uploading to a local filesystem CAS and hardlink the file if it is on the same filesystem. This would make the copy cost only a hardlink (which is extremely cheap).

As a side note, you may also be interested in dedup_store and/or compression_store, which can be used together to reduce on-disk size for files that are often the same between builds.
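
A minimal sketch of that fast path (publish_to_cas is an illustrative name, not the actual nativelink API): try a hardlink first, and fall back to a plain copy when the CAS lives on a different filesystem (EXDEV):

use std::fs;
use std::io;
use std::path::Path;

fn publish_to_cas(worker_out: &Path, cas_path: &Path) -> io::Result<()> {
    match fs::hard_link(worker_out, cas_path) {
        // Same filesystem: the "upload" is effectively free.
        Ok(()) => Ok(()),
        // Cross-device link error: fall back to a byte copy.
        Err(e) if e.raw_os_error() == Some(libc::EXDEV) => {
            fs::copy(worker_out, cas_path).map(|_| ())
        }
        Err(e) => Err(e),
    }
}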

@lukts30
Author

lukts30 commented Nov 19, 2023

Thanks for looking into this issue.

When the Worker & CAS are on the same machine, it requires a full copy of the data.

That is an accurate summary of this issue.

Worker aware of when it is uploading to a local filesystem CAS and hardlink the file if it is on the same filesystem

Indeed, hardlinking/moving files or "copying" via reflink/copy_file_range (on supported filesystems) would all similarly avoid the expensive copy operation.

I am wondering if the suggested hardlink approach would also work in a scenario where the Workers are distributed, but the storage for both the CAS and worker filesystems is managed through the same shared NFS/SMB file system share?

allada added a commit that referenced this issue Nov 19, 2023
Adds `inner_store()` function to all stores that enables the
resolution of inner stores recursively to get an underlying store.
This is mostly for places where specific code paths can be
optimized for specific stores.

towards: #409
@allada
Collaborator

allada commented Nov 20, 2023

I am wondering if the suggested hardlink approach would also work in a scenario where the Workers are distributed, but the storage for both the CAS and worker filesystems is managed through the same shared NFS/SMB file system share?

NFS/SMB as a shared medium for distributed workers to write to is not currently supported. This is because we evict items and currently have no way for external sources to notify the FilesystemStore of changes.

Instead of using NFS, the preferred model is to use a remote store and transfer the data through some kind of pipe (e.g., TCP). Adding network filesystem support sounds like it would be very difficult to write and would likely add a lot of technical debt.

That said, we do currently plan on supporting a FUSE filesystem that will materialize files on demand. This would allow compression and deduplication of data over the network, at the cost of latency. In the NFS case I would suspect the latency to be about the same, so it "might" be what you are looking for.

@lukts30
Author

lukts30 commented Feb 16, 2024

Commit a0788fa did not change anything in this regard, right? It still copies and does not use a move/hardlink to relocate the file.


I tested again a bit, and RE through nativelink is still noticeably slower than what I would hope for.

My testing used a buck2 BUILD file, first building :random_data and then doing a separate build of :instant_copy, which utilizes the cached file for a reflink copy (the copy itself takes only 5 ms).
But :instant_copy took 70 seconds to complete, with more than 99% of the time spent "transferring" the file to the CAS. The observed operations were reading the entire $OUT file at a rate of 300 MB/s, followed by simultaneous reading and writing, each at 300 MB/s. (Monitored through iotop; does not include time/data for the RE client to fetch the result artifact.)

  • accumulated read: 20GB
  • accumulated written: 10GB

Because the file is read twice, the effective speed is only 150 MB/s. But even 300 MB/s is rather slow compared to other tools.
For comparison, both curl and wget can download a file from a local HTTP server (caddy) at 800 MB/s without any special parameters, and the built-in HTTP download rule in buck2 even achieves 1.2 GB/s by default.
Additionally, since I configured the hash to be BLAKE3, I also ran b3sum with a single thread on the 10 GB file, which took 4-6 s (avg > 1.6 GB/s read).

genrule(
    name = "random_data",
    out = "gen_large_file",
    cmd =  "dd if=/dev/urandom of=$OUT bs=1G count=10 iflag=fullblock",
)

genrule(
    name = "instant_copy",
    out = "gen_large_file",
    cmd =  "cp --reflink=always $(location :random_data) $OUT",
)

@allada
Collaborator

allada commented Feb 16, 2024

Yes, the commit I made was just lining things up to support special filesystem calls like sparse copies.

Can you confirm that you compiled with --release (-O3-ish)? There is a significant performance difference if you don't compile in release mode.

I saw that you are using basic_cas.json, but can you confirm that you did not make any changes to it (that would impact these results)?

@lukts30
Author

lukts30 commented Feb 16, 2024

I am using a release build and the tests were done with commit 2a89ce6 because of #665.

I used basic_cas.json and only changed the output paths, set the hash to blake3, and tried removing the memory store, but that did not change much.

@allada
Collaborator

allada commented Feb 18, 2024

I did some local testing, and yes, I do see it taking much longer than it should.

Here's my local results:

real 24.834s = time dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
real 3.31s   = cat ./dummy_foo | time b3sum --no-mmap --num-threads 1
real 8.040s  = time cp ./dummy_foo ./dummy_foo2

Nativelink results (modified source to capture timing):

24.854s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
47.4s   = computing hash
25.312s = uploading file (copy file)

If I change this line (DEFAULT_READ_BUFF_SIZE) to 1 MiB:

24.751s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
7.128s  = computing hash
10.7408s = uploading file (copy file)

I then wanted to see how much it would improve if I put the hashing function onto a different spawn (thread) from the spawn (thread) that reads the file contents:

24.7814s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
3.5946s  = computing hash
10.677s  = uploading file (copy file)

I then checked whether putting the uploading/copying part onto different spawns/threads would also help:

24.8469s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
3.2696s  = computing hash
7.2525s  = uploading file (copy file)

So the obvious low-hanging fruit here is to increase DEFAULT_READ_BUFF_SIZE to a larger default and possibly make it configurable in the global config.

I'm torn on whether we should support multiple spawns/threads in this section of code, since we intentionally try not to create any spawns per gRPC connection in order to keep each user on a single thread. This keeps extreme parallelism fairly cooperative; otherwise one user could create millions of small files and use lots of threads computing digests and such, starving everyone else. I'll think about this one a bit more.
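
For illustration, the spawn split measured above could look roughly like this (not the actual nativelink code; assumes tokio and blake3 as dependencies, and hash_file is a hypothetical helper): one spawn reads the file in 1 MiB chunks while a second spawn hashes them, so hashing no longer serializes behind each read.

use tokio::fs::File;
use tokio::io::AsyncReadExt;
use tokio::sync::mpsc;

const READ_BUFF_SIZE: usize = 1024 * 1024; // the 1 MiB buffer from the experiment above

async fn hash_file(path: &str) -> std::io::Result<blake3::Hash> {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(4);
    let mut file = File::open(path).await?;

    // Reader spawn: pull large chunks off disk and hand them to the hasher.
    let reader = tokio::spawn(async move {
        loop {
            let mut buf = vec![0u8; READ_BUFF_SIZE];
            let n = file.read(&mut buf).await?;
            if n == 0 {
                break;
            }
            buf.truncate(n);
            if tx.send(buf).await.is_err() {
                break;
            }
        }
        Ok::<(), std::io::Error>(())
    });

    // Hasher spawn: the CPU-bound digest work runs independently of the reads.
    let hasher = tokio::spawn(async move {
        let mut h = blake3::Hasher::new();
        while let Some(chunk) = rx.recv().await {
            h.update(&chunk);
        }
        h.finalize()
    });

    reader.await.unwrap()?;
    Ok(hasher.await.unwrap())
}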

@lukts30
Author

lukts30 commented Feb 19, 2024

I can confirm that using a 1 MiB buffer helped significantly and brought build times down from 17m45.5s to 10m50.7s (including a negligible few seconds for the RE client to fetch artifacts).
Local builds take 6m52.4s.

Based on these findings I would recommend making DEFAULT_READ_BUFF_SIZE configurable.

Also is there a way to log how long execution, hashing & uploading took?

@allada
Collaborator

allada commented Feb 20, 2024

Yes. @aaronmondal or @blakehatch, do one of you want to make this DEFAULT_READ_BUFF_SIZE a global config? (We also need to do some code searching to see if there are other places that could use the same config.)

As for the threading, I think we should optimize it. Right now we are already paying a very high cost because tokio actually uses the synchronous filesystem API on a separate threadpool. Since we are already paying that cost, we should instead stream the data from the synchronous filesystem API ourselves. One of the big advantages of doing it this way is that we could easily wire this up to mmap later, which would give even higher throughput.
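
A sketch of that idea (illustrative only; hash_file_blocking is a hypothetical name): do the blocking read and the hashing inside a single spawn_blocking call, rather than paying a blocking-threadpool hop for every async read.

use std::io::Read;

async fn hash_file_blocking(path: std::path::PathBuf) -> std::io::Result<blake3::Hash> {
    tokio::task::spawn_blocking(move || {
        let mut file = std::fs::File::open(path)?;
        let mut hasher = blake3::Hasher::new();
        let mut buf = vec![0u8; 1024 * 1024];
        loop {
            let n = file.read(&mut buf)?;
            if n == 0 {
                break;
            }
            hasher.update(&buf[..n]);
        }
        Ok(hasher.finalize())
    })
    .await
    .expect("hashing task panicked")
}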

We do not currently separate out the hashing time from other time in the worker/running_actions_manager... currently it's all lumped into upload_results. We could separate this out, I think. @aaronmondal, this might be an easy one for you :-)

As an FYI, @lukts30, we are currently working on some tooling around using ReClient/Chromium as a benchmark metric that will help us understand where we need to improve.

@allada
Collaborator

allada commented Feb 21, 2024

I spent a little more time on this. I have a local change I'll push up soon that fixes the hash time completely. It should bring the total hash time into parity with b3sum in single-threaded mode.

I'm currently looking at optimizing upload time. In doing so, there's a very high chance I'll also implement it as a file move. This should make local execution overhead nearly zero.

@allada
Collaborator

allada commented Feb 23, 2024

How about this: allada@cf25b31

It still needs some cleanup, and I decided to clean up some code along the way, so it'll be multiple PRs before it's all in.

It appears rayon + mmap and just mmap are about the same speed on my computer, so I may disable multi-threading.

24.7308s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
0.56695s = computing hash
0.00005s = uploading file (copy file)

I'm going to bet that we need to optimize localhost data xfer though, so this is likely only one step towards extreme speeds 😄
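
A heavily simplified sketch of that approach (not the code from the commit; assumes the memmap2 crate and blake3 built with its "rayon" feature):

use std::fs::File;
use std::path::Path;

fn digest_mmap(path: &Path, multi_threaded: bool) -> std::io::Result<blake3::Hash> {
    let file = File::open(path)?;
    // Safety: the mapped file must not be truncated or modified while the map is live.
    // (Empty files would need a non-mmap fallback.)
    let map = unsafe { memmap2::Mmap::map(&file)? };
    let mut hasher = blake3::Hasher::new();
    if multi_threaded {
        hasher.update_rayon(&map); // rayon + mmap
    } else {
        hasher.update(&map); // mmap only, single-threaded
    }
    Ok(hasher.finalize())
}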

allada added a commit to allada/nativelink-fork that referenced this issue Mar 4, 2024
Computing the digest now happens using mmap when using blake3 and
changes default read size to 16k instead of 4k.

Towards: TraceMachina#409
allada added a commit that referenced this issue Mar 5, 2024
Computing the digest now happens using mmap when using blake3 and
changes default read size to 16k instead of 4k.

Towards: #409
allada added a commit to allada/nativelink-fork that referenced this issue Mar 5, 2024
When using nativelink with a local worker/CAS setup, adds
optimizations which make it faster to upload files from the worker
to the CAS.

This is specifically useful for Buck2 for users that want to
build hermetically.

closes: TraceMachina#409
allada added a commit that referenced this issue Mar 5, 2024
When using nativelink with a local worker/CAS setup, adds
optimizations which make it faster to upload files from the worker
to the CAS.

This is specifically useful for Buck2 for users that want to
build hermetically.

closes: #409
@allada
Collaborator

allada commented Mar 5, 2024

I did not end up using the rayon hashing, but if it would help your use case a lot, could you open a new ticket? We can possibly make it something that can be enabled via a config.
