Solr indexing sometimes fails #27

Closed
jnguyenx opened this issue Feb 27, 2017 · 13 comments
@jnguyenx (Contributor)

Not sure why this is happening, especially with a local Solr instance. It doesn't seem to index the file at all.

The first thing to do is to fail the process; for now this is a silent exception. Then, since the file is not indexed at all, a retry could be attempted, or at least the JSON file kept for a manual retry (see the retry sketch after the stack trace below).

INFO: Posting JSON genotype-phenotype.yaml.json to http://localhost:8983/solr/golr
Feb 27, 2017 6:07:32 AM org.monarch.golr.GolrWorker call
SEVERE: Failed to post JSON genotype-phenotype.yaml.json
java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126)
	at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138)
	at org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:169)
	at org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
	at org.apache.http.client.fluent.InternalFileEntity.writeTo(InternalFileEntity.java:75)
	at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:158)
	at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162)
	at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
	at org.apache.http.client.fluent.Executor.execute(Executor.java:262)
	at org.monarch.golr.GolrWorker.call(GolrWorker.java:79)
	at org.monarch.golr.GolrWorker.call(GolrWorker.java:25)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
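A minimal sketch of failing loudly and retrying instead of swallowing the exception, using the same Apache HttpClient fluent API that appears in the stack trace. The Solr URL, file name, and retry count are illustrative assumptions, not the project's actual configuration.

```java
import java.io.File;
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.fluent.Request;
import org.apache.http.entity.ContentType;

// Hypothetical retry wrapper: post the generated JSON to the golr core and
// retry a few times before failing the whole process, instead of logging and
// continuing silently. URL, file, and MAX_RETRIES are illustrative assumptions.
public class GolrPostRetry {
  private static final int MAX_RETRIES = 3;

  public static void postWithRetry(File json, String solrUrl) throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
      try {
        HttpResponse response = Request.Post(solrUrl)
            .bodyFile(json, ContentType.APPLICATION_JSON)
            .execute()
            .returnResponse();
        int status = response.getStatusLine().getStatusCode();
        if (status == 200) {
          return; // indexed successfully
        }
        last = new IOException("Solr returned HTTP " + status + " for " + json.getName());
      } catch (IOException e) {
        last = e; // e.g. SocketException: Broken pipe
      }
    }
    // Keep the JSON file on disk for a manual retry and fail the process.
    throw new IOException("Failed to post " + json.getName()
        + " after " + MAX_RETRIES + " attempts", last);
  }

  public static void main(String[] args) throws IOException {
    postWithRetry(new File("genotype-phenotype.yaml.json"),
        "http://localhost:8983/solr/golr/update?commit=true");
  }
}
```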
jnguyenx added the bug label Feb 27, 2017
jnguyenx self-assigned this Feb 27, 2017
@cmungall (Contributor)

@kltm have we seen anything like this for the owltools loader?

@kltm commented Feb 27, 2017

Yes, things like it.
At first blush, I think it looks like a timeout before finish--either slow over wire or too large. Are you still doing the single blob?

@jnguyenx (Contributor, Author) commented Feb 27, 2017 via email

@kltm commented Feb 27, 2017

You may be right at the limit then. Maybe try cutting out proxies or loading from a local copy? It may be hard to figure out without getting really good numbers on usage and time at each step.
If you see more failures on a smaller or more "remote" machine, that would also be an indication.

@cmungall (Contributor)

This sounds like a priority ticket - is it still happening?

@jnguyenx (Contributor, Author)

Yes, it still happens randomly. When it fails, the generated JSON can still be indexed afterwards, but that is a manual step.
@kshefchek is on it; we talked about it on Friday and this morning.

@kshefchek (Contributor) commented Jul 5, 2017

The PR here (#33) is merely duct tape to fix this. It looks like we've outgrown the current approach: the run requires 155 GB of memory, with 64 GB in cached memory, so we end up swapping, which likely causes these intermittent timeout issues. Some thoughts on a way forward:

  1. Fix it at the systems level:
  • Adjust swappiness
  • Throw more resources at it, i.e. a higher-memory machine

Pros: no additional development needed
Cons: short-sighted

  2. Optimize the current code:
  • Reduce the need to build huge mapDB objects. While these are stored on disk, there still appears to be some memory requirement (either that or there's a memory leak). The current approach is to cache subject/object pairs for a single query, then iterate over all paths per (subject, object) to generate the evidence, source, and is_defined_by fields. If we simply ordered the results by subject, we could generate the Solr docs, clear out the mapDB object, and move on (see the sketch after this list).

  3. Write a golr updater:
  • Instead of clearing the index and reloading, only update certain fields on each build. This might be complicated given all the moving parts, graph isomorphism, etc., but it is more traditional than the scorched-earth approach. Solr also supports concurrent updates.

@kshefchek (Contributor) commented Jul 5, 2017

On ordering the results: apparently you cannot apply an ORDER BY to a UNION of queries in Cypher. I remember this being discussed when we initially developed the stack, and it is likely the reason for the huge mapDB approach:

neo4j/neo4j#2725

I can check whether there have been any updates since then that could serve as a workaround.

@kltm commented Jul 5, 2017

@kshefchek Honestly, I think that 1b and 2 are the only real ways forward.

  • I think 1a is a non-starter; once issues of swap come up, the battle is already lost, no matter the "stickiness"
  • I think that updating the indexes starts getting a little weird--if anything falls out of phase, you can have dangling links and nonsense creep in very quickly

@kltm commented Jul 5, 2017

@kshefchek Practically speaking though, it seems like there is probably a fundamental non-scaling issue in the code. While technically 1b might work, it is likely only deferring the issue a little while. At the very least, I'd vote that 2 should be done until the exact nature of the scaling issue is known.

@kshefchek (Contributor)

Neo4j now has procedures that can be used as a workaround for post-union processing:
https://github.com/neo4j-contrib/neo4j-apoc-procedures
https://neo4j.com/developer/kb/post-union-processing/

On a test query it seems to work just fine; I can test this out further if it sounds like a good way forward.
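A hedged sketch of what post-union processing with APOC looks like, based on the linked KB article (it requires the APOC plugin to be installed). The node labels, relationship types, and embedded-database setup are illustrative assumptions, not the Monarch graph's actual schema or the loader's actual wiring.

```java
import java.io.File;
import java.util.Map;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Illustrative only: wrap a UNION in apoc.cypher.run so an ORDER BY can be
// applied to the combined results (post-union processing). Labels and
// relationship types are made up for the example.
public class PostUnionOrdering {
  public static void main(String[] args) {
    GraphDatabaseService graphDb =
        new GraphDatabaseFactory().newEmbeddedDatabase(new File("target/example-graph.db"));

    String query =
        "CALL apoc.cypher.run('"
        + " MATCH (s:Gene)-[:HAS_PHENOTYPE]->(o:Phenotype) RETURN s.iri AS subject, o.iri AS object"
        + " UNION"
        + " MATCH (s:Gene)-[:CAUSES]->(:Disease)-[:HAS_PHENOTYPE]->(o:Phenotype)"
        + "   RETURN s.iri AS subject, o.iri AS object"
        + "', {}) YIELD value "
        + "RETURN value.subject AS subject, value.object AS object "
        + "ORDER BY subject";

    try (Transaction tx = graphDb.beginTx();
         Result result = graphDb.execute(query)) {
      while (result.hasNext()) {
        Map<String, Object> row = result.next();
        System.out.println(row.get("subject") + " -> " + row.get("object"));
      }
      tx.success();
    }
    graphDb.shutdown();
  }
}
```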

@kshefchek (Contributor)

Taking a closer look, these timeouts occur when neo4j has a long GC pause, for example:

Golr loader:
Jul 11, 2017 8:22:52 AM org.monarch.golr.GolrLoader addAndCommitToSolr
WARNING: Caught: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8983/solr/golr

Neo4J:
2017-07-11 08:21:30.606+0000 WARN [o.n.k.i.c.MonitorGc] GC Monitor: Application threads blocked for 2048ms.
2017-07-11 08:22:52.318+0000 WARN [o.n.k.i.c.MonitorGc] GC Monitor: Application threads blocked for 60676ms.
2017-07-11 08:23:13.748+0000 WARN [o.n.k.i.c.MonitorGc] GC Monitor: Application threads blocked for 789ms.

I have a test branch that uses post-union processing to order the results, which lowers memory usage but produces longer GC pauses, so the timeouts become more frequent. Now that we know Neo4j is the issue, we could try tuning its memory and GC settings, but we may need to switch to a server with more memory as our graph grows.
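A hedged example of the kind of memory/GC tuning this suggests, as Neo4j 3.x neo4j.conf settings; the specific sizes are placeholders that would have to be derived from the machine's RAM and the graph store size, not recommendations.

```
# Pin the heap size so the JVM does not resize it mid-run (placeholder values).
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g

# Give the page cache enough room for the graph store to avoid churn (placeholder value).
dbms.memory.pagecache.size=32g

# Prefer G1 and ask for shorter pauses to reduce long "Application threads blocked" stalls.
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:MaxGCPauseMillis=200
```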

@kshefchek (Contributor)

This is resolved now. The fix was replacing UNION with UNION ALL, so Neo4j no longer has to uniquify the query results; that deduplication is done in the indexing code anyway. The loader now has a much smaller memory footprint.
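For illustration, the shape of the change, with deduplication handled on the client side; the Cypher text and the pair-key format are illustrative assumptions, not the actual loader queries.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: UNION makes Neo4j deduplicate the combined result set
// itself (expensive for large results), while UNION ALL streams duplicates
// through and lets the client deduplicate, as the indexing code already does.
// The query text and key format are assumptions for the example.
public class UnionAllDedup {

  static final String QUERY_WITH_UNION_ALL =
      "MATCH (s:Gene)-[:HAS_PHENOTYPE]->(o:Phenotype) RETURN s.iri AS subject, o.iri AS object"
      + " UNION ALL "
      + "MATCH (s:Gene)-[:CAUSES]->(:Disease)-[:HAS_PHENOTYPE]->(o:Phenotype)"
      + " RETURN s.iri AS subject, o.iri AS object";

  private final Set<String> seen = new HashSet<>();

  // Returns true the first time a subject/object pair is encountered.
  boolean isNewPair(String subject, String object) {
    return seen.add(subject + "|" + object);
  }
}
```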
