Fail elegantly when expected database schema is missing #31

Open · hectcastro opened this issue Aug 14, 2014 · 7 comments
@hectcastro (Contributor)

If the database is offline or the schema is not set up properly, the service crashes hard:

panic: pq: relation "treemap_itreecodeoverride" does not exist

goroutine 1 [running]:
runtime.panic(0x6b6160, 0xc2104537e0)
    /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
main.main()
    /var/jenkins/workspace/OTM2_Build_Release/cloudbuild/go/src/github.com/azavea/ecobenefits/main.go:140 +0x3b2

goroutine 3 [runnable]:
database/sql.(*DB).connectionOpener(0xc210154380)
    /usr/local/go/src/pkg/database/sql/sql.go:574 +0x3e
created by database/sql.Open
    /usr/local/go/src/pkg/database/sql/sql.go:436 +0x24d

goroutine 4 [syscall]:
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1394
@jwalgran (Contributor)

Agreed. All three endpoints should catch exceptions, return a 500, and write the exception details to stderr.
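
For illustration only (this is not the project's actual handler code, and the endpoint path and port below are invented), a sketch of that idea with the standard net/http package:

package main

import (
	"log"
	"net/http"
)

// recoverTo500 wraps a handler so a panic is logged (the standard log package
// writes to stderr by default) and the client receives a 500, instead of the
// panic taking down the whole process.
func recoverTo500(h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				log.Printf("panic serving %s: %v", r.URL.Path, v)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		h(w, r)
	}
}

func main() {
	// Hypothetical endpoint registration, just to show how the wrapper is applied.
	http.HandleFunc("/eco.json", recoverTo500(func(w http.ResponseWriter, r *http.Request) {
		panic("simulated failure") // stand-in for an endpoint that blows up
	}))
	log.Fatal(http.ListenAndServe(":5000", nil))
}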

@steventlamb (Contributor)

It makes sense to avoid an unexpected halt. If we catch these exceptions and return 500s instead, the service will stay up, continue trying to connect to the database for no reason, and continue returning 500s. Are we satisfied with that? Is that the best practice in this situation?

@steventlamb (Contributor)

The case of src/github.com/azavea/ecobenefits/main.go:140 looks like it occurs when the app is initializing, before starting the http event loop. So failing abruptly is probably a sane behavior, unless we code around the failures with restart logic.

I think it makes sense to panic and fail when the db schema doesn't match. Better to write ops code to restart the process after a db upgrade than to have it keep trying to discover the correct schema.

As for the database being offline, you could imagine the app trying repeatedly to reconnect and initialize, before passing control over to the http process. Or, since all it's doing is closing over and caching some db state for performance, we could punt that to happen on the next request, or the next request, or the next request, with each one returning 500 if it fails to connect, caching if it succeeds.
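
A rough sketch of that lazy option, with invented names (the real startup code would supply the load function that currently runs in main):

package eco

import "sync"

// lazyState holds data the service would normally build once at startup, but
// loads it on first use instead: requests that arrive while the database is
// unreachable get an error (and can return a 500), and a later request can
// still populate the cache once the database comes back.
type lazyState struct {
	mu   sync.Mutex
	data interface{}
	load func() (interface{}, error) // e.g. the query that panics in the trace above
}

func (s *lazyState) get() (interface{}, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data != nil {
		return s.data, nil
	}
	d, err := s.load()
	if err != nil {
		return nil, err // handler maps this to a 500
	}
	s.data = d
	return d, nil
}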

@hectcastro (Contributor, Author)

> The case of src/github.com/azavea/ecobenefits/main.go:140 looks like it occurs when the app is initializing, before starting the http event loop. So failing abruptly is probably a sane behavior, unless we code around the failures with restart logic.

I think that most web applications with a database dependency make use of database connections lazily. For example, I created a test Rails app locally that is pointed at a nonexistent MySQL database. When I start the app, it starts, but doesn't fail until I attempt to make a request.

In this application, it appears as though some data is loaded once at startup and then reused to serve subsequent requests. Panicking in this situation makes sense, but usually the argument to panic() is a descriptive string. I think making that change would go a long way toward clarifying what is going on when a failure occurs.
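
Something along these lines, for example; mustLoadOverrides and the query argument are hypothetical stand-ins for the loading that main.go does around line 140:

package eco

import (
	"database/sql"
	"fmt"
)

// mustLoadOverrides illustrates the suggestion: if the startup query fails,
// panic with a message that says what was being loaded and what to check,
// rather than panicking on the bare driver error.
func mustLoadOverrides(db *sql.DB, query string) *sql.Rows {
	rows, err := db.Query(query)
	if err != nil {
		panic(fmt.Sprintf("loading i-Tree code overrides failed "+
			"(is the database reachable and the treemap schema migrated?): %v", err))
	}
	return rows
}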

> I think it makes sense to panic and fail when the db schema doesn't match. Better to write ops code to restart the process after a db upgrade than to have it keep trying to discover the correct schema.

The Upstart script for this service has respawn and respawn limit stanzas.

> As for the database being offline, you could imagine the app trying repeatedly to reconnect and initialize, before passing control over to the http process. Or, since all it's doing is closing over and caching some db state for performance, we could punt that to happen on the next request, or the next request, or the next request, with each one returning 500 if it fails to connect, caching if it succeeds.

Are the API requests leading to additional database queries each time, or is it only using data pulled during the startup process?

@maurizi (Contributor) commented Oct 20, 2014

> @hectcastro: Are the API requests leading to additional database queries each time, or is it only using data pulled during the startup process?

Most of the API requests lead to additional queries against the DB. The data we get at startup is for things that are used on every request and do not change very frequently.

@hectcastro (Contributor, Author)

Got it. Then from my perspective, those requests should expose database connectivity failures to HTTP API consumers (via some HTTP status code) and to operators (via a log entry), but not attempt any restart logic. It's not clear whether any of that happens now, but none of the logs I've found for this service contain failures.
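
As a sketch of what that per-request behavior could look like (handler, query, and parameter names here are invented, not the service's actual code):

package eco

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
)

// benefitsHandler is a hypothetical per-request handler: a database failure is
// logged for operators and surfaced to the API consumer as a 500, with no
// retry or restart logic in the service itself.
func benefitsHandler(db *sql.DB, query string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var total float64
		if err := db.QueryRow(query, r.URL.Query().Get("otmcode")).Scan(&total); err != nil {
			log.Printf("benefits query failed: %v", err)
			http.Error(w, "database error", http.StatusInternalServerError)
			return
		}
		fmt.Fprintf(w, "%g\n", total)
	}
}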

@steventlamb (Contributor)

Status codes are problematic because the service is built on go-rest, which can only return 200 and 500. We'll have to rely on structured error text rather than status codes to disambiguate the types of errors.
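
A sketch of what that structured error text could look like (field names are invented, and go-rest's actual response API may differ from the plain net/http shown here):

package eco

import (
	"encoding/json"
	"net/http"
)

// apiError is a hypothetical structured error body so clients can tell a
// connectivity failure from a schema mismatch even though the status code is
// always 500.
type apiError struct {
	Type    string `json:"type"`    // e.g. "db_unreachable" or "schema_mismatch"
	Message string `json:"message"`
}

func writeError(w http.ResponseWriter, kind string, err error) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusInternalServerError)
	json.NewEncoder(w).Encode(apiError{Type: kind, Message: err.Error()})
}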

@steventlamb steventlamb self-assigned this Jan 12, 2015
@steventlamb steventlamb removed their assignment Jan 22, 2015
@jwalgran jwalgran added the bug label Aug 10, 2016