Fail elegantly when expected database schema is missing #31

Open · hectcastro opened this issue Aug 14, 2014 · 7 comments
@hectcastro (Contributor)

If the database is offline or the schema is not set up properly, the service crashes hard:

panic: pq: relation "treemap_itreecodeoverride" does not exist

goroutine 1 [running]:
runtime.panic(0x6b6160, 0xc2104537e0)
    /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
main.main()
    /var/jenkins/workspace/OTM2_Build_Release/cloudbuild/go/src/github.com/azavea/ecobenefits/main.go:140 +0x3b2

goroutine 3 [runnable]:
database/sql.(*DB).connectionOpener(0xc210154380)
    /usr/local/go/src/pkg/database/sql/sql.go:574 +0x3e
created by database/sql.Open
    /usr/local/go/src/pkg/database/sql/sql.go:436 +0x24d

goroutine 4 [syscall]:
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1394
@jwalgran (Contributor)

Agreed. All three endpoints should catch exceptions, return a 500, and write the exception details to stderr.
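
For illustration only (this is not the project's actual handler code, and the endpoint path and port below are invented), a sketch of that idea with the standard net/http package:

package main

import (
	"log"
	"net/http"
)

// recoverTo500 wraps a handler so a panic is logged (the standard log package
// writes to stderr by default) and the client receives a 500, instead of the
// panic taking down the whole process.
func recoverTo500(h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				log.Printf("panic serving %s: %v", r.URL.Path, v)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		h(w, r)
	}
}

func main() {
	// Hypothetical endpoint registration, just to show how the wrapper is applied.
	http.HandleFunc("/eco.json", recoverTo500(func(w http.ResponseWriter, r *http.Request) {
		panic("simulated failure") // stand-in for an endpoint that blows up
	}))
	log.Fatal(http.ListenAndServe(":5000", nil))
}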

@steventlamb (Contributor)

It makes sense to avoid an unexpected halt. If we catch these exceptions and return 500s instead, the service will stay up, continue trying to connect to the database for no reason, and continue returning 500s. Are we satisfied with that? Is that the best practice in this situation?

@steventlamb (Contributor)

The case of src/github.com/azavea/ecobenefits/main.go:140 looks like it occurs when the app is initializing, before starting the http event loop. So failing abruptly is probably a sane behavior, unless we code around the failures with restart logic.

I think it makes sense to panic and fail when the db schema doesn't match. Better to write ops code to restart the process after a db upgrade than to have it keep trying to discover the correct schema.

As for the database being offline, you could imagine the app trying repeatedly to reconnect and initialize, before passing control over to the http process. Or, since all it's doing is closing over and caching some db state for performance, we could punt that to happen on the next request, or the next request, or the next request, with each one returning 500 if it fails to connect, caching if it succeeds.
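
A rough sketch of that lazy option, with invented names (the real startup code would supply the load function that currently runs in main):

package eco

import "sync"

// lazyState holds data the service would normally build once at startup, but
// loads it on first use instead: requests that arrive while the database is
// unreachable get an error (and can return a 500), and a later request can
// still populate the cache once the database comes back.
type lazyState struct {
	mu   sync.Mutex
	data interface{}
	load func() (interface{}, error) // e.g. the query that panics in the trace above
}

func (s *lazyState) get() (interface{}, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data != nil {
		return s.data, nil
	}
	d, err := s.load()
	if err != nil {
		return nil, err // handler maps this to a 500
	}
	s.data = d
	return d, nil
}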

@hectcastro (Contributor, Author)

> The case of src/github.com/azavea/ecobenefits/main.go:140 looks like it occurs when the app is initializing, before starting the http event loop. So failing abruptly is probably a sane behavior, unless we code around the failures with restart logic.

I think that most web applications with a database dependency make use of database connections lazily. For example, I created a test Rails app locally that is pointed at a nonexistent MySQL database. When I start the app, it starts, but doesn't fail until I attempt to make a request.

In this application, it appears as though some data is loaded once at startup and then reused to serve subsequent requests. Panicking in this situation makes sense, but usually the argument to panic() is a descriptive string. I think making that change would go a long way toward clarifying what is going on when a failure occurs.
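
Something along these lines, for example; mustLoadOverrides and the query argument are hypothetical stand-ins for the loading that main.go does around line 140:

package eco

import (
	"database/sql"
	"fmt"
)

// mustLoadOverrides illustrates the suggestion: if the startup query fails,
// panic with a message that says what was being loaded and what to check,
// rather than panicking on the bare driver error.
func mustLoadOverrides(db *sql.DB, query string) *sql.Rows {
	rows, err := db.Query(query)
	if err != nil {
		panic(fmt.Sprintf("loading i-Tree code overrides failed "+
			"(is the database reachable and the treemap schema migrated?): %v", err))
	}
	return rows
}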

> I think it makes sense to panic and fail when the db schema doesn't match. Better to write ops code to restart the process after a db upgrade than to have it keep trying to discover the correct schema.

The Upstart script for this service has respawn and respawn limit stanzas.

> As for the database being offline, you could imagine the app trying repeatedly to reconnect and initialize, before passing control over to the http process. Or, since all it's doing is closing over and caching some db state for performance, we could punt that to happen on the next request, or the next request, or the next request, with each one returning 500 if it fails to connect, caching if it succeeds.

Are the API requests leading to additional database queries each time, or is it only using data pulled during the startup process?

@maurizi (Contributor) commented Oct 20, 2014

> @hectcastro: Are the API requests leading to additional database queries each time, or is it only using data pulled during the startup process?

Most of the API requests lead to additional queries against the DB. The data we get at startup is for things that are used on every request and do not change very frequently.

@hectcastro (Contributor, Author)

Got it. Then from my perspective, those requests should expose database connectivity failures to HTTP API consumers (via some HTTP status code) and to operators (via a log entry), but not attempt any restart logic. It's not clear whether any of that happens now, but none of the logs I've found for this service contain failures.
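
As a sketch of what that per-request behavior could look like (handler, query, and parameter names here are invented, not the service's actual code):

package eco

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
)

// benefitsHandler is a hypothetical per-request handler: a database failure is
// logged for operators and surfaced to the API consumer as a 500, with no
// retry or restart logic in the service itself.
func benefitsHandler(db *sql.DB, query string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var total float64
		if err := db.QueryRow(query, r.URL.Query().Get("otmcode")).Scan(&total); err != nil {
			log.Printf("benefits query failed: %v", err)
			http.Error(w, "database error", http.StatusInternalServerError)
			return
		}
		fmt.Fprintf(w, "%g\n", total)
	}
}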

@steventlamb (Contributor)

Status codes are problematic because the service is built on go-rest, which can only return 200 and 500. We'll have to rely on structured error text rather than status codes to disambiguate the types of errors.
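
A sketch of what that structured error text could look like (field names are invented, and go-rest's actual response API may differ from the plain net/http shown here):

package eco

import (
	"encoding/json"
	"net/http"
)

// apiError is a hypothetical structured error body so clients can tell a
// connectivity failure from a schema mismatch even though the status code is
// always 500.
type apiError struct {
	Type    string `json:"type"`    // e.g. "db_unreachable" or "schema_mismatch"
	Message string `json:"message"`
}

func writeError(w http.ResponseWriter, kind string, err error) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusInternalServerError)
	json.NewEncoder(w).Encode(apiError{Type: kind, Message: err.Error()})
}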

@steventlamb steventlamb self-assigned this Jan 12, 2015
@steventlamb steventlamb removed their assignment Jan 22, 2015
@jwalgran jwalgran added the bug label Aug 10, 2016