Problems with the EJB Timer (in production environment, specifically) #3672

Closed
landreev opened this issue Mar 7, 2017 · 10 comments
Labels
Component: Code Infrastructure formerly "Feature: Code Infrastructure" Feature: Harvesting

Comments

@landreev
Contributor

landreev commented Mar 7, 2017

This must be/probably is related to the issue where the EJB Timer's lock on the database prevents Glassfish from starting and/or the application from deploying (for example, #3669).
The issue with the timer, currently observed on our prod. dedicated timer service (dvn-sum1-app-2/picard), is that it just stops working completely. Even the top-level, master timer stops firing - and then none of the scheduled harvests and exports are happening.

We need to finally figure out what is going on with that timer. Part of the difficulty with diagnosing it is that the timer is a standalone EJB app, supplied with Glassfish. But it should still be possible to obtain its source and see what's going on there - if everything else fails.

An interesting observation is that nobody has ever seen these timer issues in their dev. environments. It only happens on "real" servers... but what does that mean exactly? It could be as trivial as Mac OS vs. Linux. Or is it something about running the database on localhost vs. over a non-local network, with more ports firewalled?

@landreev landreev self-assigned this Mar 7, 2017
@djbrooke djbrooke added the ready label Mar 8, 2017
@djbrooke djbrooke removed the ready label Mar 8, 2017
@pdurbin pdurbin added Feature: Harvesting Component: Code Infrastructure formerly "Feature: Code Infrastructure" labels Apr 25, 2017
@djbrooke djbrooke added Status: Backlog and removed Component: Code Infrastructure formerly "Feature: Code Infrastructure" Feature: Harvesting labels Sep 22, 2017
@donsizemore
Contributor

I noticed this in the notes from yesterday's community call. Odum's production Dataverse seems to be in the same boat (as best my untrained eyes can tell). We'll do some investigating on this end.

@donsizemore
Contributor

I found this https://dennis.gesker.com/2014/07/25/glassfish-4-0-1-expunge-timer/ and as a blind stab added this setting to a test server. Will report back.

@djbrooke
Contributor

In Sprint Planning 10/11, we decided to estimate the investigation of this as a 3.

@donsizemore any info about this?

@donsizemore
Contributor

Hi @djbrooke , I added

<ejb-timer-service>
  <property name="reschedule-failed-timer" value="true"></property>
</ejb-timer-service>

to a test server, and honestly didn't see a timer die, or anything otherwise interesting in the Glassfish logs. This particular server was used for test ingests and actively harvests, but is otherwise pretty quiet. So... nothing to report, yet anyway?
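For anyone who wants to confirm that the property is actually present in a given configuration, here is a minimal Python sketch. It only parses XML text; the function name `reschedule_enabled` is made up for illustration, and the sketch assumes nothing about where the element lives in a particular Glassfish configuration file.

```python
import xml.etree.ElementTree as ET

# The snippet from the comment above, used here as sample input.
SNIPPET = """
<ejb-timer-service>
  <property name="reschedule-failed-timer" value="true"></property>
</ejb-timer-service>
"""

def reschedule_enabled(xml_text: str) -> bool:
    """Return True if any <ejb-timer-service> sets reschedule-failed-timer to true.

    Hypothetical helper: works on the bare element or on a larger document
    that contains it somewhere in the tree.
    """
    root = ET.fromstring(xml_text.strip())
    # iter() also yields the root itself when its tag matches.
    for service in root.iter("ejb-timer-service"):
        for prop in service.findall("property"):
            if prop.get("name") == "reschedule-failed-timer":
                return prop.get("value", "").lower() == "true"
    return False

if __name__ == "__main__":
    print(reschedule_enabled(SNIPPET))
```

To check a real file, read its contents into a string and pass that to the function.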

@landreev
Contributor Author

Hmm, interesting.
@donsizemore could you please clarify, what are the symptoms you were seeing in your prod server? Was the timer working for a while, and then died? (It had to be working at some point, because you were harvesting from us - correct?).

What we are seeing in our production now is that the timer isn't even dying; it doesn't start at all anymore. We restart Glassfish, or redeploy the app on the "master" server, and we don't get the "I am the master timer..." in the logs at all.

And yes, it has to be somehow specific to this prod. server of ours. Because it looks like timers are working properly on our test boxes.

@donsizemore
Contributor

@landreev I hadn't seen (or at least, noticed) any problems with our EJB timers, in production or on test servers. I just popped the property above in place to see if I did catch anything and, for the past two weeks... nothing.

How much RAM do your production VMs have, what's the JVM heap, etc? (anything I can do on Odum's test machines to help troubleshoot further?)

@landreev
Contributor Author

landreev commented Oct 17, 2017

Mystery solved - at least with our prod. server. It was happening simply because the version of the Postgres jdbc driver (on the Glassfish side) got seriously out of sync with the actual version of Postgres.

Based on our experience with the rest of the app, we had assumed this driver version didn't really matter, since it appeared that you could use Postgres 9.3 with, say, version 8.4 of the driver and everything worked OK. The timer app (it's an EJB application of its own), however, relies on storing serialized Java "timer info" objects as byte arrays, and the serialization format may differ between versions. So the app could not deserialize and read the objects back from the timer table.

Upgrading the driver to the same version as the production database has solved this.

I'll update the installer script to match drivers to the running database more strictly. And we'll add a line to the next release notes, advising other installations to check and upgrade their drivers.
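The failure mode described above can be shown with a small, self-contained Python sketch. It is only an analogy: Python's `pickle` stands in for Java serialization, and the two byte-array encodings (`encode_hex` / `decode_legacy`) are hypothetical, not what the Postgres driver actually does. The point is that a reader and writer that disagree on how bytes are represented cannot read a serialized object back, even though ordinary values pass through fine.

```python
import pickle

def encode_hex(data: bytes) -> str:
    """Hypothetical writer format: backslash-x prefix followed by hex digits."""
    return "\\x" + data.hex()

def decode_hex(text: str) -> bytes:
    """Reader that agrees with the writer on the hex format."""
    if not text.startswith("\\x"):
        raise ValueError("not hex-encoded")
    return bytes.fromhex(text[2:])

def decode_legacy(text: str) -> bytes:
    """Hypothetical out-of-sync reader: takes the text as the raw bytes."""
    return text.encode("latin-1")

def try_load(raw: bytes):
    """Return the deserialized object, or None if the bytes are unreadable."""
    try:
        return pickle.loads(raw)
    except Exception:
        return None

# Made-up stand-in for a serialized "timer info" object.
timer_info = {"name": "harvest", "interval_minutes": 60}
stored = encode_hex(pickle.dumps(timer_info))

matched = try_load(decode_hex(stored))        # reader and writer agree
mismatched = try_load(decode_legacy(stored))  # reader and writer disagree

if __name__ == "__main__":
    print("matched reader:", matched)
    print("mismatched reader:", mismatched)
```

With the matching reader the object comes back intact; with the mismatched one the bytes are garbage to the deserializer and nothing can be loaded.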

landreev added a commit that referenced this issue Oct 19, 2017
@landreev
Contributor Author

Moving into review.
This is what was done:

The installer now comes with specific versions of the JDBC driver for Postgres versions 9.2, 9.3 and 9.4, plus driver version 42.1.4, which covers Postgres 9.5 and 9.6. It automatically installs the driver version that matches the running version of PostgresQL.

Added some extra text to the "Dataverse Application Timers" and "Troubleshooting" sections of the Admin guide.
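The driver selection described above could be sketched roughly as follows. This is not the installer's actual code: the function names are hypothetical, and the driver labels are descriptive placeholders rather than real jar file names. Only the version mapping itself comes from this comment.

```python
from typing import Tuple

# Server (major, minor) -> driver label. Labels are placeholders, not jar names.
DRIVER_FOR_SERVER = {
    (9, 2): "driver built for PostgreSQL 9.2",
    (9, 3): "driver built for PostgreSQL 9.3",
    (9, 4): "driver built for PostgreSQL 9.4",
    (9, 5): "driver 42.1.4",
    (9, 6): "driver 42.1.4",
}

def major_minor(server_version: str) -> Tuple[int, int]:
    """Parse '9.3.17' into (9, 3); raise ValueError on anything unexpected."""
    parts = server_version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"cannot parse PostgreSQL version {server_version!r}")
    return int(parts[0]), int(parts[1])

def pick_driver(server_version: str) -> str:
    """Return the label of the bundled driver matching the running server."""
    key = major_minor(server_version)
    if key not in DRIVER_FOR_SERVER:
        raise ValueError(f"no bundled driver for PostgreSQL {server_version}")
    return DRIVER_FOR_SERVER[key]

if __name__ == "__main__":
    for version in ("9.2.24", "9.3.17", "9.6.5"):
        print(version, "->", pick_driver(version))
```

Failing loudly on an unlisted server version, rather than falling back to some default driver, is a deliberate choice in this sketch, given that a silent mismatch is what caused the problem in the first place.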

@landreev
Contributor Author

PR: #4222

@pdurbin
Member

pdurbin commented Oct 20, 2017

Looks great. Moving to QA. I made a couple tweaks, including changing PostgresQL to PostgreSQL. 😄


6 participants