Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeouts in session closing and closed reminders #5999

Closed
damithc opened this issue Aug 18, 2016 · 22 comments
Closed

Timeouts in session closing and closed reminders #5999

damithc opened this issue Aug 18, 2016 · 22 comments
Assignees
Labels
p.Critical Would like to fix ASAP and release as a hot patch
Milestone

Comments

@damithc
Copy link
Contributor

damithc commented Aug 18, 2016

image

@damithc damithc added the p.Critical Would like to fix ASAP and release as a hot patch label Aug 18, 2016
@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

One common error I see is this:

A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)

Another is this:

/feedbackSessionClosedReminders
org.datanucleus.store.appengine.query.QueryExceptionWrappers$JDOQueryTimeoutException
    at org.datanucleus.store.appengine.query.QueryExceptionWrappers$2.get(QueryExceptionWrappers.java:88)
    at org.datanucleus.store.appengine.query.RuntimeExceptionWrappingIterator.hasNext(RuntimeExceptionWrappingIterator.java:105)
    at org.datanucleus.store.appengine.query.LazyResult$LazyAbstractListIterator.computeNext(LazyResult.java:210)
    at org.datanucleus.store.appengine.query.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
    at org.datanucleus.store.appengine.query.AbstractIterator.hasNext(AbstractIterator.java:127)
    at org.datanucleus.store.appengine.query.LazyResult$AbstractListIterator.hasNext(LazyResult.java:169)
    at teammates.storage.api.FeedbackSessionsDb.getNonPrivateFeedbackSessions(FeedbackSessionsDb.java:162)
    at teammates.logic.core.FeedbackSessionsLogic.getFeedbackSessionsClosedWithinThePastHour(FeedbackSessionsLogic.java:1643)
    at teammates.logic.core.FeedbackSessionsLogic.scheduleFeedbackSessionClosedEmails(FeedbackSessionsLogic.java:1664)
    at teammates.logic.automated.FeedbackSessionClosedRemindersServlet.doGet(FeedbackSessionClosedRemindersServlet.java:24)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
    at teammates.storage.datastore.DatastoreFilter.doFilter(DatastoreFilter.java:29)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.utils.servlet.ParseBlobUploadFilter.doFilter(ParseBlobUploadFilter.java:125)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.runtime.jetty.SaveSessionFilter.doFilter(SaveSessionFilter.java:37)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.utils.servlet.JdbcMySqlConnectionCleanupFilter.doFilter(JdbcMySqlConnectionCleanupFilter.java:60)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.utils.servlet.TransactionCleanupFilter.doFilter(TransactionCleanupFilter.java:50)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at com.google.apphosting.runtime.jetty.AppVersionHandlerMap.handle(AppVersionHandlerMap.java:260)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
    at com.google.apphosting.runtime.jetty.RpcRequestParser.parseAvailable(RpcRequestParser.java:78)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at com.google.apphosting.runtime.jetty.JettyServletEngineAdapter.serviceRequest(JettyServletEngineAdapter.java:148)
    at com.google.apphosting.runtime.JavaRuntime$RequestRunnable.run(JavaRuntime.java:504)
    at com.google.tracing.TraceContext$TraceContextRunnable.runInContext(TraceContext.java:446)
    at com.google.tracing.TraceContext$TraceContextRunnable$1.run(TraceContext.java:453)
    at com.google.tracing.CurrentContext.runInContext(CurrentContext.java:276)
    at com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContextNoUnref(TraceContext.java:312)
    at com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContext(TraceContext.java:304)
    at com.google.tracing.TraceContext$TraceContextRunnable.run(TraceContext.java:450)
    at com.google.apphosting.runtime.ThreadGroupPool$PoolEntry.run(ThreadGroupPool.java:235)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.datanucleus.store.appengine.DatastoreExceptionTranslator$1: The datastore operation timed out, or the data was temporarily unavailable.
    at org.datanucleus.store.appengine.DatastoreExceptionTranslator.wrapDatastoreTimeoutExceptionForQuery(DatastoreExceptionTranslator.java:65)
    at org.datanucleus.store.appengine.query.QueryExceptionWrappers$1.get(QueryExceptionWrappers.java:55)
    at org.datanucleus.store.appengine.query.QueryExceptionWrappers$2.get(QueryExceptionWrappers.java:86)
    ... 45 more
Caused by: com.google.appengine.api.datastore.DatastoreTimeoutException: The datastore operation timed out, or the data was temporarily unavailable.
    at com.google.appengine.api.datastore.DatastoreApiHelper.translateError(DatastoreApiHelper.java:64)
    at com.google.appengine.api.datastore.DatastoreApiHelper$1.convertException(DatastoreApiHelper.java:129)
    at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:97)
    at com.google.appengine.api.datastore.FutureHelper.getInternal(FutureHelper.java:76)
    at com.google.appengine.api.datastore.FutureHelper.quietGet(FutureHelper.java:36)
    at com.google.appengine.api.datastore.BaseQueryResultsSource.loadMoreEntities(BaseQueryResultsSource.java:197)
    at com.google.appengine.api.datastore.BaseQueryResultsSource.loadMoreEntities(BaseQueryResultsSource.java:171)
    at com.google.appengine.api.datastore.QueryResultIteratorImpl.ensureLoaded(QueryResultIteratorImpl.java:148)
    at com.google.appengine.api.datastore.QueryResultIteratorImpl.hasNext(QueryResultIteratorImpl.java:64)
    at com.google.appengine.api.datastore.PreparedMultiQuery$FilteredMultiQueryIterator.computeNext(PreparedMultiQuery.java:249)
    at com.google.appengine.api.datastore.PreparedMultiQuery$FilteredMultiQueryIterator.computeNext(PreparedMultiQuery.java:185)
    at com.google.appengine.api.datastore.AbstractIterator.tryToComputeNext(AbstractIterator.java:46)
    at com.google.appengine.api.datastore.AbstractIterator.hasNext(AbstractIterator.java:41)
    at com.google.appengine.api.datastore.QueryResultIteratorDelegator.hasNext(QueryResultIteratorDelegator.java:38)
    at org.datanucleus.store.appengine.query.RuntimeExceptionWrappingIterator$1.get(RuntimeExceptionWrappingIterator.java:50)
    at org.datanucleus.store.appengine.query.RuntimeExceptionWrappingIterator$1.get(RuntimeExceptionWrappingIterator.java:46)
    at org.datanucleus.store.appengine.query.QueryExceptionWrappers$1.get(QueryExceptionWrappers.java:51)
    ... 46 more

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

I suspect this is the problem FeedbackSessionsDb.getNonPrivateFeedbackSessions Most likely returns all 13k sessions.

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

As opening reminders have 0% error rate, I'm guessing the db query on that side is done in a more optimal way. We need to quickly replicate the same optimization on the closing reminders side.

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

Both submission edit and save requests are suffering from the first error (104).

@chowyb
Copy link
Contributor

chowyb commented Aug 18, 2016

Currently operating on the assumption that FeedbackSessionsDb.getNonPrivateFeedbackSessions is the source of all errors (including the 104, maybe due to exceeding memory on initialisation?), and I note the following:

3 methods call this method: one for sending session closing reminders, one for sending session closed emails, and a test (which is likely not relevant to this issue). As "session closing" reminders is not a particularly new feature, this may explain why a rollback does not solve the issue.

Currently I have two possible solutions in mind, although none of them are particularly "perfect":

1:
Directly use property filters on the datastore. As closing time is stored as a Date object, it should be possible (but I have not confirmed this, only searched to see if it was possible) to use filters on the closing date, and should be quick to implement. However, this will likely break abstraction, as well as be difficult to test reliably. This could be left in as a temporary solution while working on solution 2 below.

2:
Replicate the optimization for opening reminders by use of a second property (eg. booleans hasSentClosingEmail and hasSentClosedEmail), and querying for whether that property is false. A script will have to be used to change sessions which have already been closed for a sufficiently long period of time (eg. closed at least a week ago) to set the boolean values for these properties to true. This will leave a small set of sessions which may already have emails sent (between now and a week ago) to still be returned as part of the queries, but the script can be run again later (eg. a week, the same time as the window period) to set their properties to true as well. The task queues can then be updated to set the boolean values to true after successfully sending the emails.

Other considerations for this solution: editing the session to change its closing time will have to update these boolean values as well. Therefore, more time will be needed to ensure this does not break any existing features, and this solution will take longer to roll out as a result.

@wkurniawan07
Copy link
Member

@chowyb your observations are all correct; I've been working on it for the past hour (2nd solution since time is not really on our side), but still need to run some tests. Hope you can take a look later when it's done.

@damithc the submission edit page/save and the result page, what are the errors coming out from them? They can't be the non-private feedback sessions error.

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

This looks like a combination of

  1. Our dataset reaching a certain size
  2. Sudden increase in traffic
  3. Possibly some changes on GAE that makes things slower or more constrained (I think they switched to a new GAE version y'day)

I don't think all problems are caused by that query, but it's likely to fix one of the problems. I think retrieving sessions with deadlines in past 24 hours (instead of the entire list) should solve that problem?

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

Most common error for submission edit/save is

A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

Here's one example:
image

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

Another common warning (not an error) I see in submission edit/save as well as results page is this:

com.google.appengine.tools.appstats.MemcacheWriter logMaybe: Stats data was too big, all stack traces were removed.

This is not an error, but it's new and frequent.

@wkurniawan07 wkurniawan07 changed the title Many critical functions are failing on the live server at a high rate Timeouts in session closing and closed reminders Aug 18, 2016
@wkurniawan07 wkurniawan07 self-assigned this Aug 18, 2016
@wkurniawan07
Copy link
Member

Changing the issue title since we need to work on solving things one at a time.

@damithc /feedbackRemindEmailParticularUsersWorker has a high failure rate too, what is the cause for this one? (not going to be solved via this issue/PR, but still we need to know)

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

All the other high error rates are caused by 104.

@damithc
Copy link
Contributor Author

damithc commented Aug 18, 2016

I managed to reduce the severity of the problem by splitting the traffic between 5 versions of the app. 104 errors still happen, but a less than before. So it is related to the traffic level. That's why I suspect 104 error may be a memory issue.
Closing reminders problem is as same as before because it is not related to traffic.

@wkurniawan07 wkurniawan07 added this to the V5.87.01 milestone Aug 18, 2016
@damithc
Copy link
Contributor Author

damithc commented Aug 19, 2016

Looks like it's the results pages (with big result sets) that is causing the 104. Other pages could be collateral damage (i.e. when the server goes down, it takes all other ongoing requests with it). Note that when we load the results page of a big course, we still try to load responses of all the questions at the same time. This could be the thing that sends the server over the memory limit.

@damithc
Copy link
Contributor Author

damithc commented Aug 19, 2016

@chowyb can you look into the results page issue. As an interim measure, we can stop auto-loading the per-question view for any course with more than 30 respondents.

@damithc
Copy link
Contributor Author

damithc commented Aug 19, 2016

BTW, the problem with CourseRemind is that exceeds the deadline. Is it using task queues? If not, it should. Otherwise sending emails for a big course can take more than 60s.

@chowyb
Copy link
Contributor

chowyb commented Aug 19, 2016

@damithc I'll look into the issue, but from memory there is a preset limit on how many responses are present (10k?) before the system defaults to not auto-loading. It might be possible to reduce this limit instead.

@damithc
Copy link
Contributor Author

damithc commented Aug 19, 2016

Yes, let's try to reduce that limit. 10k is definitely too high. 1k?

@kanghj
Copy link
Member

kanghj commented Aug 20, 2016

@chowyb, for the questions view, the limit is based on the number of respondents (there's a constant set for that in InstructorFeedbackResultsPageData: RESPONDENTS_LIMIT_FOR_AUTOLOADING). For the non-questions view, it's based on the number of responses (DEFAULT_SECTION_QUERY_RANGE in InstructorFeedbackResultsPageAction). We could refactor this soon, to prevent further confusion.

@damithc, for the submission page errors, are the errors only from large courses? It is strange that we are getting an increased number of errors on the submit page.

@damithc
Copy link
Contributor Author

damithc commented Aug 20, 2016

The error rate for submission save is not high anymore after I spread the load between 10 servers. I think when many people are trying to submit, we go out of memory and kill all the parallel requests.

@damithc
Copy link
Contributor Author

damithc commented Aug 20, 2016

Closing/closed reminders seems to be fixed now (the fix is very recent and not reflected in the numbers below).
Here are the problem areas now:
image

@wkurniawan07
Copy link
Member

Fixed in #6001

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p.Critical Would like to fix ASAP and release as a hot patch
Projects
None yet
Development

No branches or pull requests

4 participants