When the harvester hits a hard failure and crashes, it takes ~2.5-3 minutes to build a new machine and get the fetch process started. This used to take ~1 second, so jobs with many failures can tie up the harvester for hours.
How to reproduce
Expected behavior
Processes the ~2K datasets in less than 15 minutes
Actual behavior
Still running after 24 hours
Sketch
We could take the "mitigation" approach: run the harvest job as a sub-process that gets restarted on failure (using supervisor, as we do currently), or just run a bunch of fetch processes.
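For illustration, a minimal supervisor program along those lines; the `paster ... harvester fetch_consumer` command, paths, and process count are assumptions, not the actual deployment config:

```ini
; Sketch only: respawn the fetch process within seconds of a crash
; instead of rebuilding the machine (~2.5-3 minutes).
[program:harvester-fetch]
command=paster --plugin=ckanext-harvest harvester fetch_consumer --config=/etc/ckan/production.ini
; always restart when the process dies
autorestart=true
; a process that stays up 5 seconds counts as successfully started
startsecs=5
; run several fetch processes so one crash doesn't stall the queue
numprocs=4
process_name=%(program_name)s_%(process_num)02d
```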
The long-term fix is to make the harvests more robust: report failures rather than hard-failing. This will require much better error handling than is currently implemented, across both upstream code and GSA code.
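A rough sketch of that shape of error handling (the `fetch_dataset` callable is hypothetical, not the harvester's actual API): catch per-dataset exceptions, record them for the job report, and keep going.

```python
# Sketch only: isolate failures to the dataset that caused them,
# so one bad dataset no longer hard-fails the whole harvest job.
def run_fetch(datasets, fetch_dataset):
    errors = []
    for dataset in datasets:
        try:
            fetch_dataset(dataset)  # hypothetical per-dataset fetch
        except Exception as err:
            # report the failure instead of crashing the process
            errors.append((dataset, err))
    return errors
```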