Indexing appears to have come to a full stop #13

Dazpoet · 2015-06-01T09:35:59Z

As of yesterday (some 30 hours prior to this post) it seems the Indexing robot is no longer inflating .netkans to create new metadata which means some mods are now not updated in CKAN-meta. An example is InterstellarFuelSwitch which is now a version behind in CKAN.

I'm not sure about the distinctions between the bots but it seems that what we on irc see as "netkan-bot" is still inflating newly added .netkan files and pushing to CKAN-meta as can be seen e.g. here and originating from here. The entity known on irc as "NetKAN inflator Robot" though seems to not have done much (if anything) since KSP-CKAN/CKAN-meta@6809cc3

Dazpoet · 2015-06-01T17:39:17Z

Hoping that KSP-CKAN/CKAN-meta#564 and KSP-CKAN/CKAN-meta#565 won't interfere with anything later, just fixing some mods that were requested.

techman83 · 2015-06-02T01:20:38Z

That's weird, the bot literally just runs netkan.exe on all the netkans and outputs the results to the CKAN-meta repo, then pushes changes.

Unfortunately it doesn't capture the errors just yet, though there are code improvements in the works that will make this process more robust (and give better logging when things don't work for some reason).

techman83 · 2015-06-02T02:19:23Z

Oh I see what's happening!

Process 23329 timed out! at ./bin/netkan-indexer line 14.
Sending TERM to 23329 at ./bin/netkan-indexer line 14.

I don't know why or what, but it appears something is causing the inflation to take a long time. So it's likely that what you are seeing is that the process runs in alphabetical order and anything past a certain netkan isn't being run.

This is what caused the NetKarmgeddon of Saturday morning. So it's not really a bug in the indexer, rather the thing designed to prevent a re-occurrence of NetKarmgeddon.

After this run I'll ponder how best to diagnose what is going on. I think I'll also add some timestamped warnings to let us know when something takes longer than X to inflate. I wonder if Time::Limit can be lexically scoped, so we can limit how long an individual inflation takes - though I'm hesitant due to mods with large downloads never getting inflated.
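A minimal sketch of the "warn when something takes longer than X" idea, written in Python rather than in the bot's own code. The names `timed_inflate` and `inflate`, the logger name, and the 5-second default threshold are illustrative assumptions, not part of the indexer.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("netkan-indexer-sketch")  # hypothetical logger name


def timed_inflate(
    path: str,
    inflate: Callable[[str], None],
    warn_after_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run inflate(path), log how long it took, and return the elapsed seconds.

    `inflate` stands in for whatever actually runs the inflation; it is a
    placeholder, not the real indexer call.
    """
    start = clock()
    inflate(path)
    elapsed = clock() - start
    if elapsed > warn_after_s:
        # Slow inflations are surfaced as warnings so they stand out in the log.
        log.warning("%s took %.1f seconds to inflate", path, elapsed)
    else:
        log.debug("%s took %.1f seconds to inflate", path, elapsed)
    return elapsed
```

The clock is passed in as a parameter only so the threshold logic can be exercised without actually waiting.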

pjf · 2015-06-02T02:26:07Z

> I wonder if Time::Limit can be lexically scoped, so we can limit how long an individual inflation takes - though I'm hesitant due to mods with large downloads never getting inflated.

This sounds like Time::Out, which I've not used myself, but definitely sounds like a great idea as our most likely situation for timeouts is one mod that's causing problems. (Time::Limit may not scale the way we want if we start processing tens of thousands of mods, for example.)
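As a rough illustration of a per-mod timeout (as opposed to one limit on the whole run), here is a sketch in Python, not Perl, and not using Time::Out itself. The helper names `run_one` and `run_all` and the command lists passed in are hypothetical; the real netkan.exe invocation is not shown in this thread, so it is left to the caller.

```python
import subprocess
from typing import Dict, List


def run_one(cmd: List[str], timeout_s: float) -> str:
    """Run a single inflation command and classify the outcome.

    Returns "ok", "failed" (non-zero exit), or "timeout".
    subprocess.run kills the child process when the timeout expires.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def run_all(commands: Dict[str, List[str]], timeout_s: float) -> Dict[str, str]:
    """Run every command with its own time limit.

    One slow item is recorded as "timeout" and the loop carries on, so the
    items after it in the ordering still get processed.
    """
    return {name: run_one(cmd, timeout_s) for name, cmd in commands.items()}
```

The trade-off raised above still applies: a limit that is too tight would mean mods with large downloads never finish, so the value of `timeout_s` would need to be chosen with those in mind.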

In a super ideal world we'd have worker processes that handle mods in parallel. :)
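A small sketch of what handling mods in parallel could look like, assuming each inflation is independent of the others. It is written in Python and uses a thread pool for brevity; `inflate_parallel`, `inflate`, and the worker count of 4 are illustrative assumptions rather than anything the bot currently does.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def inflate_parallel(
    paths: Iterable[str],
    inflate: Callable[[str], None],
    workers: int = 4,
) -> Dict[str, bool]:
    """Inflate each path on a pool of workers; map each path to success/failure.

    `inflate` is a placeholder for the real inflation step. An exception in
    one item is recorded as False and does not stop the others.
    """

    def attempt(path: str) -> bool:
        try:
            inflate(path)
            return True
        except Exception:
            return False

    items = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(items, pool.map(attempt, items)))
```

Threads are enough here if each `inflate` call mostly waits on an external process or a download; CPU-bound work would call for separate worker processes instead.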

techman83 · 2015-06-02T02:37:53Z

> In a super ideal world we'd have worker processes that handle mods in parallel. :)

See #12 :D

I've hacked in some debugging and disabled the cron job for now. I'll leave it running and see what the hold-up is.

[Tue Jun  2 02:36:55 2015] bin/netkan-indexer:29172 (DEBUG) Downloading metadata for AnimatedDecouplers-x86...
[Tue Jun  2 02:36:59 2015] bin/netkan-indexer:29172 (DEBUG) NetKAN/AnimatedDecouplers-x86.netkan took 4 seconds to inflate

techman83 · 2015-06-02T03:27:07Z

So we were averaging 1 second per metadata inflation; since Friday it looks like we're often > 5 seconds. With the number of NetKANs and the allowed time (3000 seconds), we're simply taking too long inflating metadata for the indexer to finish all the updates.
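To make the arithmetic concrete, here is a tiny Python sketch of how many inflations fit in the time limit at a given average speed. The function name is made up for illustration, and the actual number of NetKANs is not stated here, so only the capacity is computed.

```python
def inflations_within_budget(budget_s: float, seconds_per_netkan: float) -> int:
    """How many sequential inflations complete before the time limit is hit."""
    if seconds_per_netkan <= 0:
        raise ValueError("seconds_per_netkan must be positive")
    return int(budget_s // seconds_per_netkan)


print(inflations_within_budget(3000, 1))  # 3000
print(inflations_within_budget(3000, 5))  # 600
```

So going from roughly 1 second to roughly 5 seconds per inflation cuts the number of NetKANs a single run can cover by a factor of five, which fits with runs being cut off partway through the alphabet.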

I've no idea why this is suddenly an issue, but I think #2 would go a long way to making this better.

techman83 · 2015-06-02T03:38:26Z

Output of the debug run, which was cut off at the 50-minute mark. It only got up to SETI.

https://gist.github.com/techman83/4fb5d01be9382eb80e2f

techman83 · 2015-06-02T04:01:23Z

Found the problem, we've used up all our initial CPU burst credits, so our instance is being throttled to baseline performance levels.

I think #2 will alleviate that significantly. For now we can set it to run every 2 hours and that will at least allow things to start working again.

pjf · 2015-06-02T04:24:11Z

We can also redeploy onto an m3 or c4 instance, which would avoid the CPU throttling, and give us more grunt in general. In theory that should just be: shut down instance, change instance type, restart instance. (In practice it may be different.)

I suspect giving netkan.exe superpowers such that it can take a list of files to process would also significantly improve performance, simply because that avoids start-up/tear-down costs of the CLI itself.

techman83 · 2015-06-02T04:28:32Z

Yeah, throwing grunt at it would also solve the problem. Though I'm working through the improvements to the bot, so should hopefully have something for review that implements what we currently have with a bit more sanity.

Taking a list would be great! I'd suggest there would be some work in making the exception handling easy to capture though.

techman83 · 2015-06-08T02:57:45Z

Changing the scheduling to run every 3 hours has contained the issue.

#2 will hopefully allow us to index much more frequently than that (I'm hoping every ~15 minutes).

Dazpoet added the bug label Jun 1, 2015

Dazpoet mentioned this issue Jun 1, 2015

Adding a forced v to SCANSat KSP-CKAN/NetKAN#1469

Merged

techman83 mentioned this issue Jun 2, 2015

Warn if a NetKAN takes longer than X to inflate #14

Closed

iPeer mentioned this issue Jun 2, 2015

Add AGroupOnStage 2.0.9 KSP-CKAN/CKAN-meta#570

Closed

techman83 pushed a commit to techman83/NetKAN-bot that referenced this issue Jun 2, 2015

Allow run for 1 hour 50 minutes, add debugging closes KSP-CKAN#13

7b9fa61

techman83 closed this as completed Jun 8, 2015
