Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically freeze modules after 3 years without updates #96

Merged
merged 3 commits into from Nov 20, 2019

Conversation

HebaruSan
Copy link
Member

@HebaruSan HebaruSan commented Nov 16, 2019

Motivation

Sorting http://status.ksp-ckan.space/ by Last Indexed, it's apparent that hundreds of indexed mods haven't been updated in 3 years or longer (maybe 40% of the total going by the scrollbar). These probably will never be updated (either because they don't need updates or because they're abandoned), yet we crawl them every day, some several times per day. This incurs extra CPU, network, and possibly RAM consumption.

Changes

Now once a week we run a new auto-freezer command that checks whether any of the currently indexed mods have a ModStatus.last_indexed value 1000 days ago or earlier (which is pretty close to 3 years). (None is treated as infinitely old, since it means the mod has not been indexed since we started tracking its status. This would be true, except that the legacy Perl webhooks never updated the status, so a None value is unreliable. See #97 for help with that.) All such mods are renamed to identifier.frozen on a freeze/auto branch, one commit per mod. This branch is pushed to GitHub and a pull request is created to allow team members to manually review the results and merge to master if appropriate. This will allow us to reduce the workload of the bot over time.

To support pushing to KSP-CKAN/NetKAN, its remote string is updated to SSH format.

In addition, a new ModStatus.frozen property exists. After its normal pass, the auto freezer command will set this property to True for mods that don't have an active .netkan file. This will allow us to create a filter on the status page to hide such mods.

Help wanted

Things I couldn't figure out:

  • The freeze/auto branch needs to have new mods added to it as they're added to master, ideally without losing previous work, something like git rebase master. But I couldn't figure out how to do that in GitPython.
  • If we decide not to freeze a particular idle mod, the script may keep re-submitting PRs to freeze it. Ideally we would create a way to suppress this.

Any help would be very much appreciated.

@HebaruSan HebaruSan added Enhancement New feature or request Help wanted Extra attention is needed labels Nov 16, 2019
@DasSkelett
Copy link
Member

What if it creates a new branch for each mod to freeze? This would solve the issue of the rebase.
And it could be used to solve the second problem too, though not very pretty:
Don't delete the freeze branches in the upstream repo (KSP_CKAN/NetKAN) if the PR is "denied", and let the bot use this as indication to not create another PR.

However this could clutter the repository with unused branches over time :/

@HebaruSan
Copy link
Member Author

What if it creates a new branch for each mod to freeze?

I don't relish the notion of spamming the repo with hundreds of pull requests on the first pass.

@techman83
Copy link
Member

What if it creates a new branch for each mod to freeze?

I don't relish the notion of spamming the repo with hundreds of pull requests on the first pass.

I'll review this properly when I wake up properly, but you do know my opinion already on making sacrifices to operational code relating to historical data 😉

We probably should list them, do 90% in single hand crafted PR and let the new task munch on the rest.

@DasSkelett
Copy link
Member

DasSkelett commented Nov 17, 2019

cat netkan.json | jq 'to_entries | .[] | select(.value.last_indexed == null) | .key' | tr -d '"' > never.list
cat netkan.json | jq 'to_entries | .[] | select(.value.last_indexed and (.value.last_indexed | startswith("2015"))) | .key ' | tr -d '"' > 2015.list
cat netkan.json | jq 'to_entries | .[] | select(.value.last_indexed and (.value.last_indexed | startswith("2016"))) | .key ' | tr -d '"' > 2016.list
cat netkan.json | jq 'to_entries | .[] | select(.value.last_indexed and (.value.last_indexed | startswith("2017"))) | .key ' | tr -d '"' > 2017.list
cat netkan.json | jq 'to_entries | .[] | select(.value.last_indexed and (.value.last_indexed | startswith("2018"))) | .key ' | tr -d '"' > 2018.list
cat netkan.json | jq 'to_entries | .[] | select(.value.last_indexed and (.value.last_indexed | startswith("2019"))) | .key ' | tr -d '"' > 2019.list

wc -l *
> 315  never.list
> 10   2015.list
> 243  2016.list
> 96   2017.list
> 132  2018.list
> 1000 2019.list

(Thanks @HebaruSan for letting me know of that tool, it's great!)

There are indeed quite a few mods to start with.
A PR beforehand makes sense. But we have to leave some for the auto-freezer ;)

Also could we use the auto-freeer to prune all the frozen netkans (including manually frozen ones) from the status db?

@HebaruSan
Copy link
Member Author

Nice! I forgot we could get the ModStatus data from the status page's netkan.json. This lists the modules with a date before a given date (apparently there are 253 before 2017-01-01):

jq '("2017-01-01T00:00:00Z"|fromdate) as $cutoff | with_entries(select(.value.last_indexed and (.value.last_indexed[0:19]+"Z"|fromdate)<$cutoff) | {key, value:.value.last_indexed})' < netkan.json

Also it would be nice to recover the last_indexed values for modules that were only indexed by the old web hooks. We could extract it from CKAN-meta's git log and then set it in ModStatus.

@HebaruSan
Copy link
Member Author

HebaruSan commented Nov 17, 2019

Added a one-off script to backfill last_indexed dates. This should probably run first, since having the timestamps filled in will make the auto-freezer more effective.

EDIT: Moved to #97 instead, since it's independent and should run first.

@HebaruSan
Copy link
Member Author

ModStatus.frozen property added in latest commit (rebased to get the timestamp import script so I could copy pieces of it).

@techman83
Copy link
Member

techman83 commented Nov 20, 2019

The freeze/auto branch needs to have new mods added to it as they're added to master, ideally without losing previous work, something like git rebase master. But I couldn't figure out how to do that in GitPython.

In hindsight, a single branch is probably ok. As long as our workflow is to delete the branch, the next time the Auto Freezer runs it'll pull a new copy of the repo and make a fresh branch from the latest version of master.

I'll give it a proper review after work.

@techman83
Copy link
Member

If we decide not to freeze a particular idle mod, the script may keep re-submitting PRs to freeze it. Ideally we would create a way to suppress this.

Do we really need to worry about this? I say it's not worth adding logic for a rather edge case problem. If someone revives a mod, unfreezing it will inflate it straight away and update last indexed.

@HebaruSan
Copy link
Member Author

Do we really need to worry about this?

Maybe, maybe not. I had a nagging sense at the back of my mind that I'd missed something, and that was an attempt to bring it to the foreground. If we make a hard assumption that we will always freeze these mods, then submitting a pull request is presenting a choice (merge or not) that isn't a choice at all, and we should just push to master instead.

@HebaruSan
Copy link
Member Author

Note to self, how to list the stale mods from the command line:

wget -q -O - http://status.ksp-ckan.space/status/netkan.json | jq -rS '("2017-01-01T00:00:00Z"|fromdate) as $cutoff | to_entries | map(select(.value.last_indexed and (.value.last_indexed[0:19]+"Z"|fromdate)<$cutoff) | .key) | sort_by(ascii_downcase) | .[]'

@techman83
Copy link
Member

Maybe, maybe not. I had a nagging sense at the back of my mind that I'd missed something, and that was an attempt to bring it to the foreground. If we make a hard assumption that we will always freeze these mods, then submitting a pull request is presenting a choice (merge or not) that isn't a choice at all, and we should just push to master instead.

I saw it as more of a sanity check rather than a choice. Ie something went wrong and it tried to freeze all the Netkans and we didn't notice for a few days.

Copy link
Member

@techman83 techman83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see anything that should be a problem and the consequences of it not working are either reversible (revert a commit) or will result in an exception that will get logged. So I'll merge it in and give it a run! We might be able to index even more frequently if we're only tracking active mods. Three years is quite generous as well, but we can always build a formalised policy around this.

Eg: No mod releases for x releases of KSP and older than X days will result in a netkan being frozen (possibly adding a x_netkan_never_freeze property or something if we really must)


def freeze_idle_mods(self, days_limit):
update_cutoff = datetime.datetime.now() - datetime.timedelta(days=days_limit)
self.netkan_repo.remotes.origin.pull('master', strategy_option='ours')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably unnecessary, but won't hurt anything. The container will clone the entire repo every run.

logging.info('Marking frozen: %s', mod.ModIdentifier)
mod.frozen = True
batch.save(mod)
logging.info('Done!')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bit of double handling, but rate limited DynamoDB writes are part of the free tier. We could potentially do this with a webhook, but I think doing this once a week is ok for now and will catch when we mark things as frozen by hand in the interim.

To note batch_write requires some special handling, but we're getting rate limiting on our writes as a consequence of our ModStatus scan.

logging.info('Unable to fetch %s', name)

(getattr(self.netkan_repo.heads, name, None)
or self.netkan_repo.create_head(name)).checkout()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably build a function for branch checkouts, one with tests, that we can use as they can be right finicky. Checking out the branch is one of the first things we do, so if it blows up it'll get logged and the container will bail out.

@techman83 techman83 merged commit ea05fd4 into KSP-CKAN:master Nov 20, 2019
@HebaruSan HebaruSan deleted the feature/auto-freezer branch November 21, 2019 18:22
@HebaruSan HebaruSan mentioned this pull request Nov 21, 2019
@HebaruSan HebaruSan added the Auto Freezer Finds idle mods and submits pull requests to freeze them label Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Auto Freezer Finds idle mods and submits pull requests to freeze them Enhancement New feature or request Help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants