This repository has been archived by the owner on Apr 8, 2024. It is now read-only.

Sometimes the crawlers in the storage servers stop crawling #686

Open
exarkun opened this issue Jan 17, 2018 · 1 comment

Comments

@exarkun
Contributor

exarkun commented Jan 17, 2018

There are two crawlers. One is the "bucket" crawler; the other is the "accounting" crawler. They loop forever, inspecting the state of the storage system and performing various bookkeeping. Sometimes, however, they don't loop forever. They stop looping and stop doing their jobs.

This seems to be accompanied by an error like this (one per crawler):

2018-01-11T10:01:30+0000 [HTTP11ClientProtocol,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>,
 <twisted.python.failure.Failure twisted.web.http._DataLoss: >]

Apparently there's an errback missing somewhere. Once this happens, the crawlers won't crawl until the process is restarted.
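The real crawlers are Twisted-based, but the failure mode can be shown with a stdlib-only sketch (all names here are hypothetical, not Tahoe's actual code): a crawler pass reschedules itself when it finishes, so if a pass raises and nothing catches the error before the reschedule happens, the crawler silently dies while the rest of the process keeps running.

```python
import collections

# Hypothetical sketch of the failure mode. A crawl step re-registers itself
# on the queue after each pass; an uncaught error skips that reschedule and
# the crawler stops forever, just like a Deferred chain missing an errback.

def make_crawler(handle_errors, fail_on_pass):
    """Build a crawl step that reschedules itself on `queue` after each pass."""
    state = {"passes": 0}

    def step(queue):
        state["passes"] += 1
        try:
            if state["passes"] == fail_on_pass:
                # stands in for the non-clean ConnectionLost in the traceback
                raise ConnectionError("connection lost non-cleanly")
        except ConnectionError:
            if not handle_errors:
                raise  # no "errback": we never reach the reschedule below
        queue.append(step)  # reschedule the next pass
        return 1

    return step

def run(ticks, crawler_step):
    """Drive a toy event loop; return how many passes actually completed."""
    queue = collections.deque([crawler_step])
    completed = 0
    for _ in range(ticks):
        if not queue:
            break  # nothing rescheduled itself: the crawler is dead
        step = queue.popleft()
        try:
            completed += step(queue)
        except Exception:
            pass  # the reactor just logs "Unhandled Error" and moves on
    return completed

print(run(10, make_crawler(handle_errors=False, fail_on_pass=3)))  # 2: dead after the error
print(run(10, make_crawler(handle_errors=True, fail_on_pass=3)))   # 10: keeps crawling
```

With the in-step handler (the moral equivalent of an `addErrback` that reschedules), the error on pass 3 is absorbed and crawling continues; without it, passes 4 through 10 never happen.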

@exarkun
Contributor Author

exarkun commented Jan 17, 2018

Two read-throughs of the code that I think is relevant here didn't yield any enlightenment for me.

A mitigation strategy could be to teach Kubernetes to notice that at least one crawler has died, so that the affected storage server can be restarted automatically. This doesn't fix the fault, but it does fix the failure.
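One hedged sketch of that mitigation (everything here is hypothetical, not an existing Tahoe or deployment feature): have each crawler touch a heartbeat file on every pass, and give the container a liveness probe that fails when the file goes stale, prompting Kubernetes to restart the pod.

```yaml
# Hypothetical config fragment: assumes the crawler is patched to touch
# /var/run/crawler-heartbeat after each completed pass.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # exit non-zero if the heartbeat file is missing or older than 10 minutes
      - find /var/run/crawler-heartbeat -mmin -10 | grep -q .
  initialDelaySeconds: 60
  periodSeconds: 120
```

The probe periods would need tuning against the crawlers' actual pass intervals so a slow-but-healthy pass isn't mistaken for a dead crawler.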
