Purge downloads that failed to index from Netkan cache #2526
Background
CKAN and Netkan each have a download cache, which is a folder on disk containing files with names like this:
C43B5474-BasicDeltaV-3.0.zip
A77B71AE-netkan-CraftManager.zip
The 8 hexadecimal digits are a portion of the hash of the origin URL. When we attempt to access any URL, we first calculate that hash and look for a matching file in the cache, and if found, we use that file instead of downloading it again.
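The lookup described above can be sketched roughly as follows. This is an illustration only, not the CKAN code: the function names are hypothetical, and the choice of SHA-1 with the first 8 hex digits upper-cased is an assumption, since the description says only that the prefix is a portion of the hash of the origin URL.

```python
import hashlib
from pathlib import Path
from typing import Optional


def url_hash_prefix(url: str) -> str:
    """Return an 8-digit upper-case hex prefix derived from the URL.

    Assumption: SHA-1 is used and the first 8 hex digits are kept.
    """
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:8].upper()


def find_cached(cache_dir: Path, url: str) -> Optional[Path]:
    """Return the cached file whose name starts with the URL's hash prefix."""
    prefix = url_hash_prefix(url)
    # Cached files are named "<PREFIX>-<description>", so match on the prefix.
    for path in sorted(cache_dir.glob(prefix + "-*")):
        return path
    return None  # Cache miss: the caller would download the URL instead.
```

A caller would try find_cached first and only download when it returns None, which is why a broken file that is already in the cache keeps being reused.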
Problem
Currently the Netkan bot can get "stuck" if there is a problem with a download. As an example, this just happened with @linuxgurugamer's NIMBY mod; the 1.1.1.1 version of that mod contained one file called NIMBY.version and another called just .version, which caused an error in Netkan and prevented that release from being indexed.
However, when the author corrected the download, the problem persisted. This was because the Netkan bot was not re-downloading the fixed file, but instead retrieving the broken file from its cache and re-processing it.
This is a recurring problem that periodically requires us to request @techman83 to delete specific files from the bot server. A more automated solution would be better.
#2337 is a related but different approach to this overall issue.
Cause
The cache object assumes that a successful Store action should last forever, and the only clean-up action that Netkan takes on failure to index is to print an error message. So once a file is downloaded, we'll never re-acquire that URL, even if there's a fatal problem with the file.
Changes
CachingHttpService keeps track of all the URLs you request from it during the current run. If indexing fails, the downloads for those URLs are purged from the cache.
This will ensure that if a module fails to index, its download will be re-acquired on subsequent passes until it finally succeeds. In the NIMBY example, it would have prevented the download from being cached, so the fixed file would have been acquired and indexed.
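A minimal sketch of the idea of tracking requested URLs so that their downloads can be discarded when indexing fails. It is not the actual CachingHttpService: the class TrackingCache, its methods, and the index_module helper are all hypothetical names invented for this illustration, and the fetch and process callbacks stand in for the real download and indexing steps.

```python
from pathlib import Path
from typing import Callable, Dict


class TrackingCache:
    """Hypothetical caching downloader that remembers what was requested."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # Stand-in for the real HTTP download.
        self.requested: Dict[str, Path] = {}  # URL -> cached file, this run only.

    def download(self, url: str, name: str) -> Path:
        path = self.cache_dir / name
        if not path.exists():  # Cache miss: fetch and store.
            path.write_bytes(self.fetch(url))
        self.requested[url] = path  # Record the request even on a cache hit.
        return path

    def purge_requested(self) -> None:
        """Delete every cached file requested during the current run."""
        for path in self.requested.values():
            if path.exists():
                path.unlink()
        self.requested.clear()


def index_module(cache: TrackingCache, url: str, name: str,
                 process: Callable[[Path], None]) -> bool:
    """Download and process one module; purge its downloads on any exception."""
    try:
        process(cache.download(url, name))
        return True
    except Exception as exc:
        print(f"Indexing failed: {exc}")
        cache.purge_requested()  # The next pass will re-download.
        return False
```

Because the purge happens only in the exception handler, a download that indexes without raising stays cached, which matches the limitation noted under Known limitations.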
Known limitations
Note that this does not fully solve the "stuck in bot's cache" problem. Specifically, if a problem occurs that does not prevent a module from being indexed, such as incorrect game version info, such a download will still persist and not be re-downloaded (unless it's from GitHub as per #2337). This pull request only helps in cases where some part of the Netkan process throws an exception.