
Sparse files get de-sparsified #409

Closed
lukts30 opened this issue Nov 18, 2023 · 13 comments · Fixed by #723


@lukts30

lukts30 commented Nov 18, 2023

After execution, when output files are copied from the worker to the filesystem CAS directory, they lose their sparse-file properties and get fully de-sparsified (the data is copied byte by byte rather than moved or copied via reflink/copy_file_range). This obviously increases disk usage, and the entire process of copying and hashing the file takes considerably more time than a simple move followed by a sha256sum calculation on a sparse file.

https://github.com/TraceMachina/native-link/blob/f989e612715a7fe645e69c4c78a50e9b7262ad17/config/examples/basic_cas.json

genrule(
  name = "create_sparse_file",
  outs = ["re_large_file"],
  cmd = "fallocate -l 5G $(OUTS) "
)

lukas@PC6061B ~/Downloads/turbo-cache $ filefrag -v tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file
Filesystem type is: 9123683e
File size of tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file is 5368709120 (1310720 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 1863472064..1863537599:  65536:             unwritten
   1:    65536..  196607: 1872450496..1872581567: 131072: 1863537600: unwritten
   2:   196608..  327679: 1881560000..1881691071: 131072: 1872581568: unwritten
   3:   327680..  458751: 1882974994..1883106065: 131072: 1881691072: unwritten
   4:   458752..  524287: 1883657152..1883722687:  65536: 1883106066: unwritten
   5:   524288..  589823: 1883741638..1883807173:  65536: 1883722688: unwritten
   6:   589824..  655359: 1890763862..1890829397:  65536: 1883807174: unwritten
   7:   655360..  720895: 2028229568..2028295103:  65536: 1890829398: unwritten
   8:   720896..  917503: 2028884928..2029081535: 196608: 2028295104: unwritten
   9:   917504.. 1114111: 2030457792..2030654399: 196608: 2029081536: unwritten
  10:  1114112.. 1310719: 2030982080..2031178687: 196608: 2030654400: last,unwritten,eof
tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file: 11 extents found

File size of tmp/turbo_cache/data-worker-test/content_path-cas/7f06c62352aebd8125b2a1841e2b9e1ffcbed602f381c3dcb3200200e383d1d5-5368709120 is 5368709120 (1310720 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 2035009969..2035075504:  65536:            
   1:    65536..  131071: 2035592031..2035657566:  65536: 2035075505:
   2:   131072..  196607: 2037188178..2037253713:  65536: 2035657567:
   3:   196608..  262143: 2037273536..2037339071:  65536: 2037253714:
   4:   262144..  327679: 2038713195..2038778730:  65536: 2037339072:
   5:   327680..  458751: 2047055695..2047186766: 131072: 2038778731:
   6:   458752..  655359: 2047235008..2047431615: 196608: 2047186767:
   7:   655360..  851967: 2047497152..2047693759: 196608: 2047431616:
   8:   851968..  858862: 2047726528..2047733422:   6895: 2047693760:
   9:   858863..  859572: 2048257283..2048257992:    710: 2047733423:
  10:   859573..  859584: 1548714348..1548714359:     12: 2048257993:
  11:   859585..  859589: 1548714158..1548714162:      5: 1548714360:
  12:   859590..  925125: 2046972864..2047038399:  65536: 1548714163:
  13:   925126..  990661: 2047759296..2047824831:  65536: 2047038400:
  14:   990662.. 1187269: 2053067711..2053264318: 196608: 2047824832:
  15:  1187270.. 1310719: 2055361472..2055484921: 123450: 2053264319: last,eof
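
For reference, a hole-preserving copy on Linux can walk the data extents with lseek(SEEK_DATA/SEEK_HOLE) and copy only those ranges, leaving the holes unwritten. A minimal sketch using the libc crate (sparse_copy is an illustrative helper, not nativelink code; the destination file must be opened writable):

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

fn sparse_copy(src: &File, dst: &File) -> io::Result<u64> {
    let (src_fd, dst_fd) = (src.as_raw_fd(), dst.as_raw_fd());
    let len = src.metadata()?.len() as i64;
    dst.set_len(len as u64)?; // pre-size the destination; untouched ranges stay holes
    let (mut copied, mut pos) = (0u64, 0i64);
    while pos < len {
        // Find the next run of real data, then the hole that ends it.
        let data = unsafe { libc::lseek(src_fd, pos, libc::SEEK_DATA) };
        if data < 0 {
            break; // ENXIO: only a trailing hole remains
        }
        let hole = unsafe { libc::lseek(src_fd, data, libc::SEEK_HOLE) };
        let (mut off_in, mut off_out) = (data, data);
        let mut remaining = (hole - data) as usize;
        while remaining > 0 {
            // copy_file_range lets the kernel (and reflink-capable filesystems) move the bytes.
            let n = unsafe {
                libc::copy_file_range(src_fd, &mut off_in, dst_fd, &mut off_out, remaining, 0)
            };
            if n <= 0 {
                return Err(io::Error::last_os_error());
            }
            remaining -= n as usize;
            copied += n as u64;
        }
        pos = hole;
    }
    Ok(copied)
}
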
@MarcusSorealheis
Collaborator

@lukts30 We're looking into this one.

@allada
Collaborator

allada commented Nov 19, 2023

I'm trying to focus on the root problem and on how we can support this kind of feature without breaking other API requirements (like NFS or other filesystems/stores that don't support sparse files). Let me pose the problem back to you to check that this is the underlying issue:

Problem:
When the Worker & CAS are on the same machine, it requires a full copy of the data.

If that is the problem, a possible solution:
Make the Worker aware of when it is uploading to a local filesystem CAS and hardlink the file if it is on the same filesystem. This would make the copy cost only a hardlink (which is extremely cheap).

As a side note, you may also be interested in dedup_store and/or compression_store, which can be used together to reduce on-disk size for files that are often the same between builds.
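
A minimal sketch of that fast path (publish_to_cas is an illustrative name, not the actual nativelink API): try a hardlink first, and fall back to a plain copy when the CAS lives on a different filesystem (EXDEV):

use std::fs;
use std::io;
use std::path::Path;

fn publish_to_cas(worker_out: &Path, cas_path: &Path) -> io::Result<()> {
    match fs::hard_link(worker_out, cas_path) {
        // Same filesystem: the "upload" is effectively free.
        Ok(()) => Ok(()),
        // Cross-device link error: fall back to a byte copy.
        Err(e) if e.raw_os_error() == Some(libc::EXDEV) => {
            fs::copy(worker_out, cas_path).map(|_| ())
        }
        Err(e) => Err(e),
    }
}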

@lukts30
Author

lukts30 commented Nov 19, 2023

Thanks for looking into this issue.

When the Worker & CAS are on the same machine, it requires a full copy of the data.

That is an accurate summary of this issue.

Worker aware of when it is uploading to a local filesystem CAS and hardlink the file if it is on the same filesystem

Indeed, hardlinking/moving files or "copying" via reflink/copy_file_range (on supported filesystems) would all similarly avoid the expensive copy operation.

I am wondering if the suggested hardlink approach would also work in a scenario where the Workers are distributed, but the storage for both the CAS and worker filesystems is managed through the same shared NFS/SMB file system share?

allada added a commit that referenced this issue Nov 19, 2023
Adds `inner_store()` function to all stores that enables the
resolution of inner stores recursively to get an underlying store.
This is mostly for places where specific code paths can be
optimized for specific stores.

towards: #409
@allada
Collaborator

allada commented Nov 20, 2023

I am wondering if the suggested hardlink approach would also work in a scenario where the Workers are distributed, but the storage for both the CAS and worker filesystems is managed through the same shared NFS/SMB file system share?

NFS/SMB as a shared medium for distributed workers to write to is not currently supported. This is because we evict items and currently have no way for external sources to notify the FilesystemStore of changes.

Instead of using NFS, the preferred model is to use a remote store and transfer the data through some kind of pipe (e.g., TCP). Adding network filesystem support sounds like it would be very difficult to write and would likely add a lot of technical debt.

That said, we do currently plan on supporting a FUSE filesystem that will materialize files on demand. This would allow compression and deduplication of data over the network, at the cost of latency. In the NFS case I would suspect the latency to be about the same, so it "might" be what you are looking for.

@lukts30
Author

lukts30 commented Feb 16, 2024

Commit a0788fa did not change anything in this regard, right? It still copies and does not use a move/hardlink to relocate the file.


I tested again a bit, and RE through nativelink is still noticeably slower than what I would hope for.

My testing used a buck2 BUILD file, first building :random_data and then doing a separate build of :instant_copy, which utilizes the cached file for a reflink copy (the copy itself takes only 5 ms).
But :instant_copy took 70 seconds to complete, with more than 99% of the time spent "transferring" the file to the CAS. The observed operations were reading the entire $OUT file at a rate of 300 MB/s, followed by simultaneous reading and writing, each at 300 MB/s. (Monitored through iotop; does not include time/data for the RE client to fetch the result artifact.)

  • accumulated read: 20GB
  • accumulated written: 10GB

Because the file is read twice, the effective speed is only 150 MB/s. But even 300 MB/s is rather slow compared to other tools.
For comparison, both curl and wget can download a file from a local HTTP server (caddy) at 800 MB/s without any special parameters, and the built-in HTTP download rule in buck2 even achieves 1.2 GB/s by default.
Additionally, since I configured the hash to be BLAKE3, I also ran b3sum with a single thread on the 10 GB file, which took 4-6 s (avg > 1.6 GB/s read).

genrule(
    name = "random_data",
    out = "gen_large_file",
    cmd =  "dd if=/dev/urandom of=$OUT bs=1G count=10 iflag=fullblock",
)

genrule(
    name = "instant_copy",
    out = "gen_large_file",
    cmd =  "cp --reflink=always $(location :random_data) $OUT",
)

@allada
Collaborator

allada commented Feb 16, 2024

Yes, the commit I made was just lining things up to support special filesystem calls like sparse copies.

Can you confirm that you compiled with --release (-O3-ish)? There is a significant performance difference if you don't compile in release mode.

I saw that you are using basic_cas.json, but can you confirm that you did not make any changes to it (that would impact these results)?

@lukts30
Author

lukts30 commented Feb 16, 2024

I am using a release build and the tests were done with commit 2a89ce6 because of #665.

I used basic_cas.json and only changed the output paths, set the hash to blake3, and tried removing the memory store, but that did not change much.

@allada
Collaborator

allada commented Feb 18, 2024

I did some local testing, and yes, I do see it taking much longer than it should.

Here's my local results:

real 24.834s = time dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
real 3.31s   = cat ./dummy_foo | time b3sum --no-mmap --num-threads 1
real 8.040s  = time cp ./dummy_foo ./dummy_foo2

Nativelink results (modified source to capture timing):

24.854s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
47.4s   = computing hash
25.312s = uploading file (copy file)

If I change this line (DEFAULT_READ_BUFF_SIZE) to 1 MiB:

24.751s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
7.128s  = computing hash
10.7408s = uploading file (copy file)

I then wanted to see how much it would improve if I put the hashing function onto a different spawn (thread) from the spawn (thread) that reads the file contents:

24.7814s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
3.5946s  = computing hash
10.677s  = uploading file (copy file)

I then checked whether putting the uploading/copying part onto different spawns/threads would also help:

24.8469s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
3.2696s  = computing hash
7.2525s  = uploading file (copy file)

So the obvious low-hanging fruit here is to increase DEFAULT_READ_BUFF_SIZE to a larger default and possibly make it configurable in the global config.

I'm torn on whether we should support multiple spawns/threads in this section of code, since we intentionally try not to create any spawns per gRPC connection in order to keep each user on a single thread. This keeps extreme parallelism fairly cooperative; otherwise one user could create millions of small files and use lots of threads computing digests and such, starving everyone else. I'll think about this one a bit more.
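
For illustration, the spawn split measured above could look roughly like this (not the actual nativelink code; assumes tokio and blake3 as dependencies, and hash_file is a hypothetical helper): one spawn reads the file in 1 MiB chunks while a second spawn hashes them, so hashing no longer serializes behind each read.

use tokio::fs::File;
use tokio::io::AsyncReadExt;
use tokio::sync::mpsc;

const READ_BUFF_SIZE: usize = 1024 * 1024; // the 1 MiB buffer from the experiment above

async fn hash_file(path: &str) -> std::io::Result<blake3::Hash> {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(4);
    let mut file = File::open(path).await?;

    // Reader spawn: pull large chunks off disk and hand them to the hasher.
    let reader = tokio::spawn(async move {
        loop {
            let mut buf = vec![0u8; READ_BUFF_SIZE];
            let n = file.read(&mut buf).await?;
            if n == 0 {
                break;
            }
            buf.truncate(n);
            if tx.send(buf).await.is_err() {
                break;
            }
        }
        Ok::<(), std::io::Error>(())
    });

    // Hasher spawn: the CPU-bound digest work runs independently of the reads.
    let hasher = tokio::spawn(async move {
        let mut h = blake3::Hasher::new();
        while let Some(chunk) = rx.recv().await {
            h.update(&chunk);
        }
        h.finalize()
    });

    reader.await.unwrap()?;
    Ok(hasher.await.unwrap())
}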

@lukts30
Author

lukts30 commented Feb 19, 2024

I can confirm that using a 1 MiB buffer helped significantly and brought build times down from 17m45.5s to 10m50.7s (including a negligible few seconds for the RE client to fetch artifacts).
Local builds take 6m52.4s.

Based on these findings I would recommend making DEFAULT_READ_BUFF_SIZE configurable.

Also is there a way to log how long execution, hashing & uploading took?

@allada
Collaborator

allada commented Feb 20, 2024

Yes. @aaronmondal or @blakehatch, do one of you want to make this DEFAULT_READ_BUFF_SIZE a global config? (We also need to do some code searching to see if there are other places that could use the same config.)

As for the threading, I think we should optimize it. Right now we are already paying a very high cost because tokio actually uses the synchronous filesystem API on a separate threadpool. Since we are already paying that cost, we should instead stream the data from the synchronous filesystem API ourselves. One of the big advantages of doing it this way is that we could easily wire this up to mmap later, which would give even higher throughput.
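
A sketch of that idea (illustrative only; hash_file_blocking is a hypothetical name): do the blocking read and the hashing inside a single spawn_blocking call, rather than paying a blocking-threadpool hop for every async read.

use std::io::Read;

async fn hash_file_blocking(path: std::path::PathBuf) -> std::io::Result<blake3::Hash> {
    tokio::task::spawn_blocking(move || {
        let mut file = std::fs::File::open(path)?;
        let mut hasher = blake3::Hasher::new();
        let mut buf = vec![0u8; 1024 * 1024];
        loop {
            let n = file.read(&mut buf)?;
            if n == 0 {
                break;
            }
            hasher.update(&buf[..n]);
        }
        Ok(hasher.finalize())
    })
    .await
    .expect("hashing task panicked")
}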

We do not currently separate out the hashing time from other time in the worker/running_actions_manager... currently it's all lumped into upload_results. We could separate this out, I think. @aaronmondal, this might be an easy one for you :-)

As an FYI, @lukts30, we are currently working on some tooling around using ReClient/Chromium as a benchmark metric that will help us understand where we need to improve.

@allada
Collaborator

allada commented Feb 21, 2024

I spent a little more time on this. I have a local change I'll push up soon that fixes the hash time completely. It should bring the total hash time into parity with b3sum in single-threaded mode.

I'm currently looking at optimizing upload time. In doing so, there's a very high chance I'll also implement it as a file move. This should make local execution overhead nearly zero.

@allada
Collaborator

allada commented Feb 23, 2024

How about this: allada@cf25b31

It still needs some cleanup, and I decided to clean up some code along the way, so it'll be multiple PRs before it's all in.

It appears rayon + mmap and just mmap are about the same speed on my computer, so I may disable multi-threading.

24.7308s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
0.56695s = computing hash
0.00005s = uploading file (copy file)

I'm going to bet that we need to optimize localhost data xfer though, so this is likely only one step towards extreme speeds 😄
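
A heavily simplified sketch of that approach (not the code from the commit; assumes the memmap2 crate and blake3 built with its "rayon" feature):

use std::fs::File;
use std::path::Path;

fn digest_mmap(path: &Path, multi_threaded: bool) -> std::io::Result<blake3::Hash> {
    let file = File::open(path)?;
    // Safety: the mapped file must not be truncated or modified while the map is live.
    // (Empty files would need a non-mmap fallback.)
    let map = unsafe { memmap2::Mmap::map(&file)? };
    let mut hasher = blake3::Hasher::new();
    if multi_threaded {
        hasher.update_rayon(&map); // rayon + mmap
    } else {
        hasher.update(&map); // mmap only, single-threaded
    }
    Ok(hasher.finalize())
}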

allada added a commit to allada/nativelink-fork that referenced this issue Mar 4, 2024
Computing the digest now happens using mmap when using blake3 and
changes default read size to 16k instead of 4k.

Towards: TraceMachina#409
allada added a commit that referenced this issue Mar 5, 2024
Computing the digest now happens using mmap when using blake3 and
changes default read size to 16k instead of 4k.

Towards: #409
allada added a commit to allada/nativelink-fork that referenced this issue Mar 5, 2024
When using nativelink with a local worker/CAS setup, adds
optimizations which make it faster to upload files from the worker
to the CAS.

This is specifically useful for Buck2 for users that want to
build hermetically.

closes: TraceMachina#409
allada added a commit that referenced this issue Mar 5, 2024
When using nativelink with a local worker/CAS setup, adds
optimizations which make it faster to upload files from the worker
to the CAS.

This is specifically useful for Buck2 for users that want to
build hermetically.

closes: #409
@allada
Collaborator

allada commented Mar 5, 2024

I did not end up using the rayon hashing, but if it would help your use case a lot, could you open a new ticket? We can possibly make it something that can be enabled via a config.
