This repository has been archived by the owner on Apr 8, 2024. It is now read-only.

Sometimes the crawlers in the storage servers stop crawling #686

Open
exarkun opened this issue Jan 17, 2018 · 1 comment

Comments

@exarkun
Contributor

exarkun commented Jan 17, 2018

There are two crawlers. One is the "bucket" crawler; the other is the "accounting" crawler. They loop forever, inspecting the state of the storage system and performing various bookkeeping. Sometimes, however, they don't loop forever. They stop looping and stop doing their jobs.

This seems to be accompanied by an error like this (one per crawler):

2018-01-11T10:01:30+0000 [HTTP11ClientProtocol,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>,
 <twisted.python.failure.Failure twisted.web.http._DataLoss: >]

Apparently there's an errback missing somewhere. Once this happens, the crawlers won't crawl until the process is restarted.
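The real crawlers are Twisted-based, but the failure mode can be shown with a stdlib-only sketch (all names here are hypothetical, not Tahoe's actual code): a crawler pass reschedules itself when it finishes, so if a pass raises and nothing catches the error before the reschedule happens, the crawler silently dies while the rest of the process keeps running.

```python
import collections

# Hypothetical sketch of the failure mode. A crawl step re-registers itself
# on the queue after each pass; an uncaught error skips that reschedule and
# the crawler stops forever, just like a Deferred chain missing an errback.

def make_crawler(handle_errors, fail_on_pass):
    """Build a crawl step that reschedules itself on `queue` after each pass."""
    state = {"passes": 0}

    def step(queue):
        state["passes"] += 1
        try:
            if state["passes"] == fail_on_pass:
                # stands in for the non-clean ConnectionLost in the traceback
                raise ConnectionError("connection lost non-cleanly")
        except ConnectionError:
            if not handle_errors:
                raise  # no "errback": we never reach the reschedule below
        queue.append(step)  # reschedule the next pass
        return 1

    return step

def run(ticks, crawler_step):
    """Drive a toy event loop; return how many passes actually completed."""
    queue = collections.deque([crawler_step])
    completed = 0
    for _ in range(ticks):
        if not queue:
            break  # nothing rescheduled itself: the crawler is dead
        step = queue.popleft()
        try:
            completed += step(queue)
        except Exception:
            pass  # the reactor just logs "Unhandled Error" and moves on
    return completed

print(run(10, make_crawler(handle_errors=False, fail_on_pass=3)))  # 2: dead after the error
print(run(10, make_crawler(handle_errors=True, fail_on_pass=3)))   # 10: keeps crawling
```

With the in-step handler (the moral equivalent of an `addErrback` that reschedules), the error on pass 3 is absorbed and crawling continues; without it, passes 4 through 10 never happen.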

@exarkun
Contributor Author

exarkun commented Jan 17, 2018

Two read-throughs of the code that I think is relevant here didn't yield any enlightenment for me.

A mitigation strategy could be to teach Kubernetes to notice that at least one crawler has died, so that the affected storage server can be restarted automatically. This doesn't fix the fault, but it does fix the failure.
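One hedged sketch of that mitigation (everything here is hypothetical, not an existing Tahoe or deployment feature): have each crawler touch a heartbeat file on every pass, and give the container a liveness probe that fails when the file goes stale, prompting Kubernetes to restart the pod.

```yaml
# Hypothetical config fragment: assumes the crawler is patched to touch
# /var/run/crawler-heartbeat after each completed pass.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # exit non-zero if the heartbeat file is missing or older than 10 minutes
      - find /var/run/crawler-heartbeat -mmin -10 | grep -q .
  initialDelaySeconds: 60
  periodSeconds: 120
```

The probe periods would need tuning against the crawlers' actual pass intervals so a slow-but-healthy pass isn't mistaken for a dead crawler.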
