
retryable_download: preserve partial download on network errors#22048

Merged
MikeMcQuaid merged 1 commit into Homebrew:main from liukun:fix/preserve-partial-on-download-error on Apr 20, 2026

Contributor

@liukun liukun commented Apr 20, 2026

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests (excluding integration tests) for your changes? (happy to add one — see "Tests" section below)
  • Have you successfully run brew lgtm (style, typechecking and tests) with your changes locally?

  • AI was used to generate or assist with generating this PR.
    The root-cause analysis and patch were drafted with Claude (Opus 4.7) assistance. I then manually read retryable_download.rb, download_strategy.rb, and utils/curl.rb to verify the trace, reproduced the symptom, ran bin/brew style, bin/brew typecheck, and bin/brew tests --only=utils/curl locally before committing.

Summary

Fix a long-standing bug where bottle downloads on unstable networks always restart from zero instead of resuming from .incomplete, making large bottles effectively undownloadable over flaky links.

Closes #21518.

Observed symptom

On a slow/flaky connection (e.g. behind a proxy that drops TLS streams mid-transfer), the user sees:

  1. brew install <formula> downloads N MB, then the connection drops.
  2. brew prints Retrying download in 2s… and starts again from zero — the partial MB is gone.
  3. After exhausting retries, it errors out (e.g. Failed to download resource "llvm@21").
  4. If the user re-runs brew install …, the first request actually resumes (curl --continue-at - picks up from the current .incomplete size), gets partway further, and drops again.
  5. As soon as brew's own retry kicks in inside the same invocation, the cycle repeats from zero.

Net effect: every fresh brew install invocation downloads a slice, then throws the slice away. If no single streaming session can complete the full bottle, the install can never succeed regardless of retries, even though each attempt does make forward progress.

Root cause

RetryableDownload#fetch (Library/Homebrew/retryable_download.rb) wraps the download strategy with exponential-backoff retries. Before every retry it unconditionally calls:

rescue DownloadError, ChecksumMismatchError, Resource::BottleManifest::Error
  ...
  sleep wait
  downloadable.clear_cache    # ← problem
  retry
end

For CurlDownloadStrategy (Library/Homebrew/download_strategy.rb), clear_cache does:

def clear_cache
  super
  rm_rf(temporary_path)    # temporary_path = "<cached_location>.incomplete"
end

So the .incomplete file — which curl_download would otherwise pass to --continue-at - on the next attempt (Library/Homebrew/utils/curl.rb, line ~278) — is deleted before every retry. The try_partial machinery added in #5421 and #11381 is effectively bypassed for any download that hits a DownloadError more than once. The "kill the process and restart" workaround described in #21518 works precisely because SIGINT bypasses this rescue path and leaves the .incomplete file in place.
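The effect on the resume offset can be illustrated with a minimal sketch. This is not Homebrew code: `resume_offset` and `demo_offsets` are hypothetical helpers, the file name is made up, and the sketch only assumes that a resumed transfer starts from the current size of the partial file, which is how curl's `--continue-at -` picks its offset.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: the byte offset a resumed transfer would start from.
# A missing partial file means the transfer starts from zero.
def resume_offset(partial_path)
  File.exist?(partial_path) ? File.size(partial_path) : 0
end

# Returns [offset if the partial is kept, offset after it is removed].
def demo_offsets
  Dir.mktmpdir do |dir|
    partial = File.join(dir, "bottle.tar.gz.incomplete")

    # First attempt: 1024 bytes arrive, then the connection drops.
    File.binwrite(partial, "x" * 1024)
    kept = resume_offset(partial)

    # What the unconditional clear_cache did before every retry.
    FileUtils.rm_rf(partial)
    cleared = resume_offset(partial)

    [kept, cleared]
  end
end

kept, cleared = demo_offsets
puts "offset with partial kept: #{kept}, after clear_cache: #{cleared}"
```

With the partial in place the next attempt would start at byte 1024; once it is removed, the next attempt starts at byte 0 again.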

The three kinds of exceptions caught have different semantics, but clear_cache treats them identically:

| Exception | Meaning | .incomplete at catch time | Correct action |
| --- | --- | --- | --- |
| DownloadError | connection dropped mid-transfer | valid partial, resumable | keep, let next attempt --continue-at |
| ChecksumMismatchError | full download finished but hash is wrong | already renamed to cached_location; it's bad | clear |
| Resource::BottleManifest::Error | manifest mismatch on a finished download | same | clear |
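The mapping from exception kind to cache action can be written down as a small dispatch. This is a sketch only: the three exception classes below are local stand-ins so the snippet runs on its own, and `action_before_retry` is a hypothetical name that does not exist in Homebrew.

```ruby
# Local stand-ins for Homebrew's exception classes (assumed shapes only).
class DownloadError < StandardError; end
class ChecksumMismatchError < StandardError; end

class Resource
  class BottleManifest
    class Error < StandardError; end
  end
end

# Hypothetical helper: what should happen to the cache before the next retry.
def action_before_retry(error)
  case error
  when DownloadError
    :keep_partial   # valid partial on disk, let the next attempt resume
  when ChecksumMismatchError, Resource::BottleManifest::Error
    :clear_cache    # the finished file is known-bad, start over
  else
    :not_retried    # anything else is outside the rescue clause
  end
end

puts action_before_retry(DownloadError.new)
puts action_before_retry(ChecksumMismatchError.new)
puts action_before_retry(Resource::BottleManifest::Error.new)
```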

Change

Only clear the cache for the "finished but bad" cases. For DownloadError the partial is kept, so the next iteration's curl_download sees destination.exist?, takes the try_partial branch, and passes --continue-at -:

rescue DownloadError, ChecksumMismatchError, Resource::BottleManifest::Error => e
  ...
  sleep wait

  # Preserve the partial `.incomplete` file on network errors so the next
  # attempt can resume via `--continue-at`. Clear the cache only when the
  # fully-downloaded file is known-bad (checksum or manifest mismatch).
  downloadable.clear_cache unless e.is_a?(DownloadError)
  retry
end

Total diff: +5 / −2 on a single file.
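To see why the one-line change matters, the two policies can be compared in a toy simulation. Everything here is an assumption made for illustration: `FlakyDownloadable` and `simulate_retries` are hypothetical, sizes are abstract units, the `sleep` and logging of the real loop are left out, and the connection is assumed to drop after a fixed amount of data per session.

```ruby
class DownloadError < StandardError; end

# Hypothetical stand-in for a downloadable on a flaky link.
# `partial` plays the role of the size of the `.incomplete` file.
class FlakyDownloadable
  attr_reader :partial, :attempts

  def initialize(total:, per_session:)
    @total = total
    @per_session = per_session
    @partial = 0
    @attempts = 0
  end

  # Resume from `partial`, transfer at most `per_session` units, then drop.
  def fetch
    @attempts += 1
    @partial += [@total - @partial, @per_session].min
    raise DownloadError, "dropped at #{@partial}/#{@total}" if @partial < @total
  end

  def clear_cache
    @partial = 0
  end
end

# Simplified retry loop: returns true on success, false once tries run out.
def simulate_retries(downloadable, tries:, preserve_partial:)
  attempt = 0
  begin
    attempt += 1
    downloadable.fetch
    true
  rescue DownloadError
    return false if attempt >= tries

    downloadable.clear_cache unless preserve_partial
    retry
  end
end

old_behavior = FlakyDownloadable.new(total: 500, per_session: 200)
new_behavior = FlakyDownloadable.new(total: 500, per_session: 200)

puts simulate_retries(old_behavior, tries: 4, preserve_partial: false)
puts simulate_retries(new_behavior, tries: 4, preserve_partial: true)
```

Under these assumptions the clearing policy uses all four attempts and still fails, because no attempt ever gets past 200 units, while the preserving policy finishes on the third attempt (200, then 400, then 500).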

Self-healing against a corrupted partial

If a proxy writes garbage bytes into .incomplete (e.g. a captive-portal HTML response), a resumed download will eventually finish with a wrong sha256. That surfaces as ChecksumMismatchError on the next iteration, which does still call clear_cache, wiping the bad partial. The subsequent attempt downloads fresh. So the loop is still self-terminating under adversarial middleware — it just doesn't give up the first 200 MB of good data to an unrelated TCP hiccup.
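The self-healing sequence can be traced in a small sketch. It is illustrative only: `ResumingDownloadable` and `fetch_once_with_healing` are hypothetical, the payload is a short string standing in for a bottle, and the trace covers just the two iterations described above rather than the full retry loop.

```ruby
require "digest"

class ChecksumMismatchError < StandardError; end

PAYLOAD = "good-bottle-bytes"
EXPECTED_SHA256 = Digest::SHA256.hexdigest(PAYLOAD)

# Hypothetical downloadable: `partial` stands in for the `.incomplete` contents.
class ResumingDownloadable
  attr_reader :partial, :cache_clears

  def initialize(partial: "")
    @partial = partial
    @cache_clears = 0
  end

  # Resume from the current offset, then verify the finished file.
  def fetch
    @partial += PAYLOAD[@partial.bytesize..].to_s
    return if Digest::SHA256.hexdigest(@partial) == EXPECTED_SHA256

    raise ChecksumMismatchError, "sha256 mismatch"
  end

  def clear_cache
    @partial = ""
    @cache_clears += 1
  end
end

# Two-iteration trace: a bad finished file is cleared, then fetched fresh.
def fetch_once_with_healing(downloadable)
  downloadable.fetch
rescue ChecksumMismatchError
  downloadable.clear_cache
  downloadable.fetch
end

# The partial starts out holding a captive-portal style HTML fragment.
poisoned = ResumingDownloadable.new(partial: "<html>")
fetch_once_with_healing(poisoned)
puts poisoned.partial == PAYLOAD
puts poisoned.cache_clears
```

The first `fetch` resumes on top of the garbage, fails the checksum, and triggers exactly one `clear_cache`; the second `fetch` starts from an empty partial and verifies cleanly.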

Tests

bin/brew tests --only=utils/curl — all 65 specs pass.

No dedicated spec exists for RetryableDownload#fetch's exception-handling branches today. Happy to add one that mocks downloadable and asserts:

  • raising DownloadError during #fetch does not call clear_cache on retry;
  • raising ChecksumMismatchError does call clear_cache on retry.

Let me know if you'd like that folded into this PR or tracked separately.
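For reference, the two assertions could look roughly like the following dependency-free sketch. It does not use Homebrew's RSpec setup or the real `RetryableDownload` class: `SpyDownloadable` is a hypothetical test double, the exception classes are local stand-ins, and `fetch_with_retries` is a simplified copy of the patched rescue branch without `sleep` or logging.

```ruby
class DownloadError < StandardError; end
class ChecksumMismatchError < StandardError; end

# Hypothetical spy: raises the queued errors, one per #fetch call, then
# succeeds. Every call it receives is recorded in order.
class SpyDownloadable
  attr_reader :calls

  def initialize(*errors)
    @errors = errors
    @calls = []
  end

  def fetch
    @calls << :fetch
    error = @errors.shift
    raise error if error
  end

  def clear_cache
    @calls << :clear_cache
  end
end

# Simplified stand-in for the patched retry logic.
def fetch_with_retries(downloadable, tries: 3)
  attempt = 0
  begin
    attempt += 1
    downloadable.fetch
  rescue DownloadError, ChecksumMismatchError => e
    raise if attempt >= tries

    downloadable.clear_cache unless e.is_a?(DownloadError)
    retry
  end
end

network_drop = SpyDownloadable.new(DownloadError)
fetch_with_retries(network_drop)
p network_drop.calls

bad_checksum = SpyDownloadable.new(ChecksumMismatchError)
fetch_with_retries(bad_checksum)
p bad_checksum.calls
```

A real spec would presumably replace the spy with RSpec doubles and exercise `RetryableDownload#fetch` directly; the recorded call order is the property worth asserting either way.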

Prior art / related

Open reports of the same symptom

Merged PRs that built resume support but are silently defeated today

The machinery for resumable downloads already exists and works correctly for the single-attempt case. What's missing is that the outer retry loop in RetryableDownload#fetch deletes the .incomplete file before the next attempt, so none of the following take effect once brew's own retry kicks in:

  • #5421 (2018-12, merged, "Allow incomplete downloads to be resumed even when server rejects HEAD requests", utils/curl.rb): changed the range-support probe from HEAD to GET so servers that reject HEAD don't defeat resume. Working as intended, but it only reaches the point of emitting --continue-at - if an .incomplete file is still on disk on the next attempt.
  • #11381 (2021-05, merged, "Fix range requests with curl", utils/curl.rb): reworked how curl_download detects range support (previously it sent a range-request and checked Content-Range; now it does a proper HEAD and looks at Accept-Ranges). Same situation: correct probe, but the partial file it would probe against on retry has already been removed.
  • #13179 (2022-04, merged, "#curl_download: default try_partial to false"): made partial-download mode opt-in to save a HEAD when it isn't useful. CurlDownloadStrategy still opts in (@try_partial = true), so bottle fetches still request resume; this PR is what finally lets that opt-in actually produce savings on the second+ retry.

Recent touches to RetryableDownload that did not look at this path

Both PRs touched retryable_download.rb recently without noticing that clear_cache runs unconditionally between retries.

Longstanding feature requests this helps but does not fully resolve

Out of scope

The utils/curl.rb side could additionally pass --retry-all-errors (curl ≥ 7.71) inside the retries&.positive? block so curl's own retry can survive connection drops, reducing how often the outer RetryableDownload loop runs at all. I deliberately left that out of this PR to keep the change minimal and unambiguous — happy to open a follow-up if reviewers want it.
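A possible shape for that follow-up, sketched under assumptions: `curl_retry_args` is a hypothetical helper and not part of utils/curl.rb, the caller is assumed to already know the curl version as a string, and the only facts relied on are that `--retry` and `--retry-all-errors` are real curl options and that the latter first shipped in curl 7.71.0.

```ruby
require "rubygems"

# First curl release that understands --retry-all-errors.
MIN_RETRY_ALL_ERRORS_VERSION = Gem::Version.new("7.71.0")

# Hypothetical helper: build curl's retry-related arguments.
# Returns an empty list when no retries are requested.
def curl_retry_args(curl_version:, retries:)
  return [] unless retries&.positive?

  args = ["--retry", retries.to_s]
  # Older curl would reject the unknown option, so gate it on the version.
  if Gem::Version.new(curl_version) >= MIN_RETRY_ALL_ERRORS_VERSION
    args << "--retry-all-errors"
  end
  args
end

p curl_retry_args(curl_version: "8.7.1", retries: 3)
p curl_retry_args(curl_version: "7.64.1", retries: 3)
p curl_retry_args(curl_version: "8.7.1", retries: nil)
```

Comparing with `Gem::Version` rather than plain strings avoids ordering mistakes such as "7.9.0" sorting after "7.71.0".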

`RetryableDownload#fetch` cleared the download cache before every retry,
which removes the `.incomplete` file that `CurlDownloadStrategy` uses
for resumable downloads. As a result `--continue-at -` never kicked in:
each retry restarted from zero, making installs on unstable networks
effectively impossible once a bottle was large enough to never finish
within one uninterrupted session.

Only clear the cache when the fully-downloaded file is known-bad
(`ChecksumMismatchError` or `Resource::BottleManifest::Error`). For
network-level `DownloadError`, keep the partial `.incomplete` file so
the next attempt can resume via `--continue-at`. If that partial was
itself corrupted mid-flight, the post-download checksum check will
catch it on a later iteration and clear the cache via the existing
path, so the loop is self-healing.

Closes Homebrew#21518.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@liukun liukun force-pushed the fix/preserve-partial-on-download-error branch from e474a19 to a2d76a8 Compare April 20, 2026 09:07
Member

@MikeMcQuaid MikeMcQuaid left a comment

Makes sense, thanks!

@MikeMcQuaid MikeMcQuaid added this pull request to the merge queue Apr 20, 2026
Merged via the queue into Homebrew:main with commit cf25ab4 Apr 20, 2026
36 checks passed

Development

Successfully merging this pull request may close these issues.

Option to save and resume partial download

2 participants