Add retries/resume to snapshot fetching #2544

Closed
7 tasks done
LesnyRumcajs opened this issue Feb 14, 2023 · 0 comments · Fixed by #2571

LesnyRumcajs commented Feb 14, 2023

Issue summary

The mainnet snapshot hosting seems volatile. Forest has failed numerous times both in the CI (dockerized, with aria2) and locally (without aria2). The error is the same in both cases: the download fails partway through.

Dockerized (aria2)

2023-02-13T20:45:35.637704Z  INFO forest_cli_shared::cli::snapshot_fetch: Snapshot url: https://pub-fd31751bcb69400eb39e694385c19457.r2.dev/minimal/2599920_2023_02_13T16_00_00Z.car    
2023-02-13T20:45:35.637744Z  INFO forest_cli_shared::cli::snapshot_fetch: Snapshot will be downloaded to /volumes/forest_data/snapshots/mainnet/filecoin_snapshot_mainnet_2023-02-13_height_2599920.car (117.14 GiB)    
2023-02-13T20:46:28.219805Z ERROR forest_cli::cli: Error: Failed fetching the snapshot: error reading a body from connection: unexpected end of file    

No logs are available for the local build, but the failure is roughly the same.

This intermittently fails the CI and is annoying when setting up a server for an overnight job.

The goal is to implement a retry mechanism or, even better, retry with resume. With aria2, something like this should be available almost out of the box; for the vanilla downloader, we can be pragmatic and retry from scratch. The daemon should fail only if the download fails several times in a row.
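To illustrate what "retry with resume" could look like, here is a minimal sketch using only the standard library. It is not Forest's actual code: `RangedSource`, `fetch_with_resume` and `FlakySource` are hypothetical names, and the trait merely stands in for a server that can serve a file from a given byte offset (for example via HTTP `Range` requests). Each attempt continues from the number of bytes already received instead of starting over.

```rust
use std::io;

/// Hypothetical source that can serve bytes starting at a given offset,
/// standing in for a server that honors ranged requests.
trait RangedSource {
    /// Appends bytes starting at `offset` to `out`.
    /// May fail midway, after some bytes have already been appended.
    fn read_from(&mut self, offset: u64, out: &mut Vec<u8>) -> io::Result<()>;
}

/// Retry with resume: every attempt continues from the bytes already stored.
/// Gives up and returns the last error after `max_attempts` failed attempts.
fn fetch_with_resume<S: RangedSource>(source: &mut S, max_attempts: u32) -> io::Result<Vec<u8>> {
    let mut data = Vec::new();
    let mut last_err = None;
    for _ in 0..max_attempts {
        let offset = data.len() as u64;
        match source.read_from(offset, &mut data) {
            Ok(()) => return Ok(data),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.unwrap_or_else(|| io::Error::new(io::ErrorKind::Other, "no attempts were made")))
}

/// Test double: serves `payload`, but drops the connection after `chunk`
/// bytes for the first `failures` calls.
struct FlakySource {
    payload: Vec<u8>,
    chunk: usize,
    failures: u32,
    calls: u32,
}

impl RangedSource for FlakySource {
    fn read_from(&mut self, offset: u64, out: &mut Vec<u8>) -> io::Result<()> {
        self.calls += 1;
        let start = offset as usize;
        if self.failures > 0 {
            self.failures -= 1;
            // Deliver a partial chunk, then simulate the connection dropping.
            let end = (start + self.chunk).min(self.payload.len());
            out.extend_from_slice(&self.payload[start..end]);
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "unexpected end of file"));
        }
        out.extend_from_slice(&self.payload[start..]);
        Ok(())
    }
}
```

In a real downloader the buffer would be a partially written file on disk, and the offset would be that file's current length.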

Task summary

  • Add retry parameter to config. It doesn't have to be exposed as a flag. Set it to a sane default.
  • Add a retry interval parameter. Set it to a sane default.
  • Use the parameters above to implement retry and retry/resume for snapshot fetch. This should work for both the daemon trying to download the snapshot and the forest-cli snapshot fetch.
  • Implement tests validating the logic
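As a rough sketch of how the first three tasks might fit together, the snippet below defines a retry configuration with defaults and a generic helper that consumes it. The struct name, field names and default values (`SnapshotFetchRetry`, `max_retries`, `retry_interval`, 3 retries, 10 seconds) are assumptions made for illustration and are not taken from the Forest codebase.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical retry settings; names and defaults are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct SnapshotFetchRetry {
    /// Number of additional attempts after the first failure.
    max_retries: u32,
    /// Pause between consecutive attempts.
    retry_interval: Duration,
}

impl Default for SnapshotFetchRetry {
    fn default() -> Self {
        Self {
            max_retries: 3,
            retry_interval: Duration::from_secs(10),
        }
    }
}

/// Runs `op` once, then up to `max_retries` more times on failure,
/// sleeping for `retry_interval` between attempts.
/// Returns the first success, or the last error once retries are exhausted.
fn retry<T, E>(cfg: &SnapshotFetchRetry, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut retries_left = cfg.max_retries;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if retries_left == 0 => return Err(e),
            Err(_) => {
                retries_left -= 1;
                thread::sleep(cfg.retry_interval);
            }
        }
    }
}
```

Because `retry` takes a closure, the same helper could wrap both the daemon's download path and the CLI's fetch command, and tests can pass a closure that fails a fixed number of times together with a very short interval.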

Acceptance Criteria

  • Forest daemon no longer fails after a single failure in the upstream host
  • Feature is properly tested
  • Changelog is updated

Other information and links
