This repository has been archived by the owner on Apr 13, 2021. It is now read-only.

Indexing appears to have come to a full stop #13

Closed
Dazpoet opened this issue Jun 1, 2015 · 11 comments

@Dazpoet
Member

Dazpoet commented Jun 1, 2015

As of yesterday (some 30 hours prior to this post), it seems the indexing robot is no longer inflating .netkans to create new metadata, which means some mods are no longer being updated in CKAN-meta. An example is InterstellarFuelSwitch, which is now a version behind in CKAN.

I'm not sure about the distinctions between the bots, but it seems that what we on IRC see as "netkan-bot" is still inflating newly added .netkan files and pushing to CKAN-meta, as can be seen e.g. here and originating from here. The entity known on IRC as "NetKAN inflator Robot", though, seems not to have done much (if anything) since KSP-CKAN/CKAN-meta@6809cc3.

@Dazpoet
Member Author

Dazpoet commented Jun 1, 2015

Hoping that KSP-CKAN/CKAN-meta#564 and KSP-CKAN/CKAN-meta#565 won't interfere with anything later; they just fix some mods that were requested.

@techman83
Member

That's weird, the bot literally just runs netkan.exe on all the netkans and outputs the results to the CKAN-meta repo, then pushes changes.

Unfortunately it doesn't capture the errors just yet, though there are code improvements in the works that will make this process more robust (and add better logging for when things don't work for some reason).

@techman83
Member

Oh I see what's happening!

Process 23329 timed out! at ./bin/netkan-indexer line 14.
Sending TERM to 23329 at ./bin/netkan-indexer line 14.

I don't know why or what, but it appears something is causing the inflation to take a long time. So it's likely that what you are seeing is that the process runs in alphabetical order and anything past a certain netkan isn't being run.

This is what caused the NetKarmgeddon of Saturday morning. So it's not really a bug in the indexer, rather the thing designed to prevent a re-occurrence of NetKarmgeddon.

After this run I'll ponder how best to diagnose what is going on. I think I'll also add some timestamped warnings to let us know when something takes longer than X to inflate. I wonder if Time::Limit can be lexically scoped, so we can limit how long an individual inflation takes - though I'm hesitant, because mods with large downloads might then never get inflated.
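The per-inflation limit and the "took longer than X" warning discussed above could be sketched roughly as follows. This is an illustration in Python, not the bot's Perl code: the function name `run_with_limit`, its parameters, and the stand-in commands are all hypothetical, and a real run would launch the inflator once per .netkan file instead.

```python
import subprocess
import sys
import time


def run_with_limit(cmd, limit_seconds, warn_seconds):
    """Run one command with its own time limit.

    Returns a (status, elapsed_seconds) tuple, where status is one of
    "ok", "slow", "failed" or "timeout".
    """
    start = time.monotonic()
    try:
        # subprocess.run kills the child and raises if the limit is exceeded.
        result = subprocess.run(cmd, timeout=limit_seconds, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout", time.monotonic() - start
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return "failed", elapsed
    if elapsed > warn_seconds:
        return "slow", elapsed
    return "ok", elapsed


if __name__ == "__main__":
    # Stand-in commands (assumption): a quick job and one that hangs.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_limit(quick, limit_seconds=30, warn_seconds=20))
    print(run_with_limit(stuck, limit_seconds=1, warn_seconds=0.5))
```

Because the limit applies to each item separately, one stuck mod costs at most `limit_seconds` and the files after it in alphabetical order still get processed. The large-download concern remains: a limit that is too tight would time out a slow but healthy inflation on every run.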

@pjf
Member

pjf commented Jun 2, 2015

I wonder if Time::Limit can be lexically scoped, so we can limit how long an individual inflation takes - though I'm hesitant, because mods with large downloads might then never get inflated.

This sounds like Time::Out, which I've not used myself, but definitely sounds like a great idea as our most likely situation for timeouts is one mod that's causing problems. (Time::Limit may not scale the way we want if we start processing tens of thousands of mods, for example.)

In a super ideal world we'd have worker processes that handle mods in parallel. :)
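As a rough illustration of handling mods in parallel, here is a minimal Python sketch. It is not the bot's code: `inflate_all` and `fake_inflate` are hypothetical names, and it uses a thread pool instead of separate worker processes, on the assumption that each job mostly waits on an external command or a download.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def inflate_all(paths, inflate_one, max_workers=4):
    """Apply inflate_one to every path using a pool of workers.

    Returns a dict mapping each path to its result, or to the exception
    it raised, so one bad mod does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(inflate_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                results[path] = exc
    return results


if __name__ == "__main__":
    def fake_inflate(path):
        # Stand-in (assumption) for running the inflator on one file.
        if path.startswith("Broken"):
            raise RuntimeError("could not inflate " + path)
        time.sleep(0.1)
        return path.replace(".netkan", ".ckan")

    names = ["A.netkan", "B.netkan", "Broken.netkan", "C.netkan"]
    for name, outcome in inflate_all(names, fake_inflate).items():
        print(name, "->", outcome)
```

Collecting exceptions per path keeps a failing mod visible in the results without aborting the whole run.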

@techman83
Member

In a super ideal world we'd have worker processes that handle mods in parallel. :)

See #12 :D

I've hacked in some debugging and disabled the cron job for now, I'll leave it running and see what is the hold up.

[Tue Jun  2 02:36:55 2015] bin/netkan-indexer:29172 (DEBUG) Downloading metadata for AnimatedDecouplers-x86...
[Tue Jun  2 02:36:59 2015] bin/netkan-indexer:29172 (DEBUG) NetKAN/AnimatedDecouplers-x86.netkan took 4 seconds to inflate

@techman83
Member

So we were averaging 1 second per metadata inflation; since Friday it looks like we're often > 5 seconds. With the number of netkans and the allowed time (3000 seconds), we're simply taking too long inflating metadata for the indexer to complete all updates in time.

I've no idea why this is suddenly an issue, but I think #2 would go a long way to making this better.
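The arithmetic behind this can be checked with a small Python sketch. The 3000 second budget and the 1 second and 5 second per-file figures come from the comment above; the function name is made up for illustration, and the thread does not state how many netkans there actually are.

```python
def netkans_within_budget(budget_seconds, seconds_per_netkan):
    """How many sequential inflations fit inside the time budget."""
    return int(budget_seconds // seconds_per_netkan)


BUDGET = 3000  # allowed run time in seconds (50 minutes)

print(netkans_within_budget(BUDGET, 1))  # 3000
print(netkans_within_budget(BUDGET, 5))  # 600
```

At 5 seconds each, a sequential run covers only a fifth of what it could at 1 second each, so any file sorting after roughly the 600th is never reached before the timeout.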

@techman83
Member

Output of the debug run that was chopped at the 50-minute mark; it only got up to SETI.

https://gist.github.com/techman83/4fb5d01be9382eb80e2f

@techman83
Member

Found the problem, we've used up all our initial CPU burst credits, so our instance is being throttled to baseline performance levels.

[Image: CPU Credits]

I think #2 will alleviate that significantly. For now we can set it to run 2 hourly and that will at least allow things to start working again.
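To see why a longer interval between runs helps, here is a deliberately simplified Python model of a burst-credit balance. Every number in it is invented for illustration: the thread does not state the instance type, the earn rate, or what a run costs, and real credit accounting is finer-grained than this.

```python
def simulate_credits(start, earn_per_hour, cost_per_run, interval_hours, runs, cap):
    """Return the credit balance after each run under a simple model.

    Credits accrue at a fixed hourly rate up to a cap, and every run
    spends a fixed amount. All parameters are hypothetical.
    """
    balance = start
    history = []
    for _ in range(runs):
        balance = min(cap, balance + earn_per_hour * interval_hours)
        balance = max(0, balance - cost_per_run)
        history.append(balance)
    return history


if __name__ == "__main__":
    # Made-up figures: earn 6 credits per hour, spend 10 credits per run.
    print(simulate_credits(30, 6, 10, interval_hours=1, runs=5, cap=144))
    print(simulate_credits(30, 6, 10, interval_hours=2, runs=5, cap=144))
```

With these made-up figures the hourly schedule drains the balance by 4 credits per run, while the two-hourly schedule gains 2 credits per run. The general point is that the balance recovers only when the credits earned between runs exceed what a run spends.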

@pjf
Member

pjf commented Jun 2, 2015

We can also redeploy onto an m3 or c4 instance, which would avoid the CPU throttling, and give us more grunt in general. In theory that should just be shutdown instance, change instance type, restart instance. (In practice it may be different.)

I suspect giving netkan.exe superpowers such that it can take a list of files to process would also significantly improve performance, simply because that avoids start-up/tear-down costs of the CLI itself.
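The start-up saving can be estimated with a simple cost model, sketched here in Python. The function name, the batch-size parameter, and the timings are assumptions for illustration only; the thread gives no measurement of how much of each inflation is CLI start-up.

```python
def total_seconds(n_files, startup_seconds, work_seconds, batch_size=1):
    """Estimated run time when each launch of the CLI handles batch_size files."""
    launches = -(-n_files // batch_size)  # ceiling division
    return launches * startup_seconds + n_files * work_seconds


if __name__ == "__main__":
    # Hypothetical split: half a second of start-up, half a second of work.
    print(total_seconds(600, 0.5, 0.5))                  # one launch per file
    print(total_seconds(600, 0.5, 0.5, batch_size=600))  # one launch for all
```

In this model the start-up cost is paid once per launch instead of once per file, so the saving grows with the share of each inflation that is start-up.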

@techman83
Member

Yeah, throwing grunt at it would also solve the problem. Though I'm working through the improvements to the bot, so should hopefully have something for review that implements what we currently have with a bit more sanity.

Taking a list would be great! I'd suggest there would be some work involved in making the exceptions easy to capture, though.

techman83 pushed a commit to techman83/NetKAN-bot that referenced this issue Jun 2, 2015
@techman83
Member

Changing the scheduling to run every 3 hours has contained the issue.

[Image: CPU Credits]

#2 will hopefully allow us to index much more frequently than that (I'm hoping every ~15 minutes).
