
Solr Internode Traffic denied by 401 Unauthorized #3770

Closed · 1 of 2 tasks
nickumia-reisys opened this issue Apr 4, 2022 · 15 comments
Assignees
Labels
bug (Software defect or bug), component/catalog (Related to catalog component playbooks/roles), component/inventory (Inventory playbooks/roles), component/solr-service (Related to Solr-as-a-Service, a brokered Solr offering), component/ssb, Testing

Comments

@nickumia-reisys (Contributor) commented Apr 4, 2022

Solr v8.11.1

How to reproduce

  1. Unknown

Expected behavior

Stable SolrCloud cluster operation

Actual behavior

[screenshots]

Background

Solr had been reindexed to serve catalog.data.gov traffic and seemed functional. Upon testing (visiting the /organization page), Solr returned an invalid-token error that we had never seen before.
[screenshot]

Upon inspection of the SolrCloud cluster, there were 401 errors occurring between the nodes, and no amount of debugging brought the internode communication under control. We suspected that the persistent volumes attached to the nodes returning 401s had been corrupted, since those nodes had trouble reading solr.xml and were failing authentication. Various PVs and Solr pods were deleted to get the cluster into a fresh, clean state; however, this made no difference.

Research showed that this issue is known upstream. The upstream fix has not yet been released. Through a friendly ping, a workaround was suggested that may be useful in the short term. There is no guarantee of how CKAN will respond to the workaround or whether it is compatible with our setup.

While there is a similar issue in Solr upstream, the causes are different. We've isolated the problem to the GSA-hardened EKS AMI that we were deploying, since using the default Amazon EKS AMI does not cause Solr to exhibit this issue.
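
For context on how one node can even 401 another: in a SolrCloud cluster secured with the BasicAuthPlugin, internode requests are normally authenticated by the PKIAuthenticationPlugin's short-lived signed tokens (unless forwardCredentials is enabled), so a node that cannot validate a peer's token rejects the request with a 401 / "invalid token". A minimal security.json along those lines, with illustrative values taken from the Solr reference guide rather than our actual configuration:

    {
      "authentication": {
        "blockUnknown": true,
        "class": "solr.BasicAuthPlugin",
        "credentials": {
          "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
        },
        "forwardCredentials": false
      },
      "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin",
        "permissions": [{ "name": "security-edit", "role": "admin" }],
        "user-role": { "solr": "admin" }
      }
    }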

Sketch

  • Attempt to implement workaround suggested in this issue.
  • Use the default EKS AMI for the EC2 nodes instead of the GSA-hardened AMI (see the sketch below)
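
We have not settled on how the AMI swap will be wired into our provisioning; purely as a sketch, if the node group were defined with eksctl, relying on the stock amiFamily (and omitting any custom AMI reference) would boot the default EKS-optimized Amazon Linux 2 image instead of the GSA-hardened one. All names and sizes below are placeholders:

    # Hypothetical eksctl config; cluster name, region, and instance sizing
    # are placeholders, not values from our infrastructure code.
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: ssb-solr-dev
      region: us-east-1
    managedNodeGroups:
      - name: solr-nodes
        amiFamily: AmazonLinux2   # default EKS-optimized AMI family
        instanceType: m5.xlarge
        desiredCapacity: 3
        volumeSize: 100
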
@nickumia-reisys nickumia-reisys added component/catalog Related to catalog component playbooks/roles component/inventory Inventory playbooks/roles bug Software defect or bug component/ssb labels Apr 4, 2022
@nickumia-reisys nickumia-reisys self-assigned this Apr 4, 2022
@nickumia-reisys (Contributor, Author)

Related issue (although I'm not sure of the direct impact): https://solr-user.lucene.apache.narkive.com/HOJFKqsh/overseer-could-not-get-tags. It does show up in the logs, but it's only a warning, not a real error.

[screenshot]

@mogul (Contributor) commented Apr 5, 2022

Are you absolutely sure this has nothing to do with the default-deny NetworkPolicy?

@nickumia-reisys (Contributor, Author)

Considering that the network policy was not enabled when the error showed up, I am confident it is not a factor. The network policy was enabled at the initial cluster startup, but I had deleted it to get the cluster healthy and, incidentally, never re-enabled it after that. So the network policy did not have a direct impact on the issue. That said, if the presence of the network policy at cluster startup caused timing issues or communication mishaps that compounded and only became a problem after days of uptime, that's a different story.
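
For future reference, if the default-deny policy is re-enabled later, the SolrCloud pods will also need an explicit allow rule so internode traffic is not dropped at the network layer. A rough example follows; the namespace, labels, and port are assumptions, not values from our actual manifests:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-solr-internode   # hypothetical name
      namespace: default           # assumed namespace
    spec:
      podSelector:
        matchLabels:
          solr-cloud: solrcloud1   # assumed label applied by the solr-operator
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  solr-cloud: solrcloud1
          ports:
            - protocol: TCP
              port: 8983           # default Solr port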

@nickumia-reisys (Contributor, Author) commented Apr 5, 2022

What I don't like is that, even with a fresh SolrCloud cluster, there is a plethora of errors and anomalies that seem very destructive in nature.

The ckan collection was successfully created, the collection is indexing, and the Solr nodes are reporting healthy, but these issues have already appeared.

[screenshots]

@nickumia-reisys (Contributor, Author)

This seems to be a new warning, but I don't know whether we would be waiting on CKAN to update this dependency. The way CKAN is using Solr seems dated, especially its reliance on old plugins... for example, the SynonymFilterFactory.

[screenshot]
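
For reference, the deprecation the warning points at is the synonym filter in the schema; the attribute values here are illustrative (taken from the Solr reference guide, not from CKAN's shipped schema):

    <!-- Deprecated factory that the stock CKAN schema still references: -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

    <!-- Upstream's suggested replacement; at index time it should be
         followed by FlattenGraphFilterFactory: -->
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>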

@nickumia-reisys (Contributor, Author)

This looks like the original issue popping up again, specifically from the HTTP authentication challenge:

[screenshot]

@nickumia-reisys (Contributor, Author)

Preliminary testing (~1.5 hours of indexing) while using the non-GSA-hardened AMI shows that this issue has not resurfaced.

[screenshot]

@nickumia-reisys (Contributor, Author)

It was confirmed that we are not setting any custom shardHandlerFactory on the /select requestHandler, so the upstream issue was caused by a different bug than this one (despite the resulting effects being the same).
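
For the record, this is the sort of configuration we checked for (and did not find) in our solrconfig.xml; the timeout values are illustrative only:

    <!-- A per-handler shardHandlerFactory nested inside /select, which is
         the setup the upstream authentication bug is tied to. We are not
         doing this. -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <shardHandlerFactory class="HttpShardHandlerFactory">
        <int name="socketTimeout">10000</int>
        <int name="connTimeout">5000</int>
      </shardHandlerFactory>
    </requestHandler>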

@hkdctol (Contributor) commented Apr 14, 2022

Staying in blocked until we can replicate the issue. @nickumia-reisys will update next steps.

@nickumia-reisys (Contributor, Author)

We have not yet been able to reproduce this error. #3783 is happening more reliably, so we are investigating that one before coming back to this. We can't hope to reproduce this issue if the Solr collection is going down first.

As it stands, the next steps are unknown. See #3783 (comment) for my best guess at the next steps in general.

@nickumia-reisys nickumia-reisys removed their assignment Apr 18, 2022
@nickumia-reisys (Contributor, Author)

(This is mainly speculation, but it's a fair point)

Not sure when this happened, but solr-99d4df8f264da830 seems to be exhibiting an error that occurred right before the 401 Unauthorized issue appeared last time. One of the shards in the collection is stuck in the recovering state and can't seem to catch up. Not sure of the cause, but if it doesn't improve, I won't be surprised if this evolves into the 401 Unauthorized issue, because the shard wouldn't receive authentication updates along with the rest of the cluster and would effectively be alienated.

[screenshot]
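
One hedged way to confirm which replica is stuck (the base URL and credentials are placeholders for however the cluster is actually reached, e.g. via a port-forward):

    # CLUSTERSTATUS reports the state of every replica, so the one stuck in
    # "recovering" should show up here.
    curl -u "$SOLR_USER:$SOLR_PASS" \
      "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=ckan"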

@jbrown-xentity (Contributor)

Waiting on the final GSA-hardened image and the Solr classic setup before testing to see whether this can be closed.

@nickumia-reisys (Contributor, Author)

This is crazy!! Someone else posted upstream that they saw the same error on k8s!! 😱

@nickumia-reisys (Contributor, Author) commented May 16, 2022

Newest results using the most recent GSA ISE-Hardened AMI (v2022-05-08)

  • Upon successful completion of the initial Solr re-index, one of the nodes is stuck in recovery_failed due to a No space left on device exception:
    [screenshot]
  • It should be noted that there is in fact space on the disk:
    [screenshot]
    [screenshot]
2022-05-16 11:00:31.284 WARN  (recoveryExecutor-11-thread-1-processing-n:default-solr-f8da61d49a8cfc70-solrcloud-2.solrcloud1.ssb-dev.data.gov:80_solr x:ckan_shard1_replica_n2 c:ckan s:shard1 r:core_node5) [c:ckan s:shard1 r:core_node5 x:ckan_shard1_replica_n2] o.a.s.h.IndexFetcher Error in fetching file: _eahj_Lucene84_0.doc (downloaded 288358400 of 650272919 bytes) => java.io.IOException: No space left on device
        at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
        at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) ~[?:?]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) ~[?:?]
        at sun.nio.ch.IOUtil.write(Unknown Source) ~[?:?]
        at sun.nio.ch.FileChannelImpl.write(Unknown Source) ~[?:?]
        at java.nio.channels.Channels.writeFullyImpl(Unknown Source) ~[?:?]
        at java.nio.channels.Channels.writeFully(Unknown Source) ~[?:?]
        at java.nio.channels.Channels$1.write(Unknown Source) ~[?:?]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:416) ~[?:?]
        at java.util.zip.CheckedOutputStream.write(Unknown Source) ~[?:?]
        at java.io.BufferedOutputStream.write(Unknown Source) ~[?:?]
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$DirectoryFile.write(IndexFetcher.java:1940) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1797) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1732) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1713) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1109) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:619) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:384) ~[?:?]
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:458) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:252) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:683) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:339) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318) ~[?:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180) ~[metrics-core-4.1.5.jar:4.1.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]
2022-05-16 11:00:31.701 ERROR (recoveryExecutor-11-thread-1-processing-n:default-solr-f8da61d49a8cfc70-solrcloud-2.solrcloud1.ssb-dev.data.gov:80_solr x:ckan_shard1_replica_n2 c:ckan s:shard1 r:core_node5) [c:ckan s:shard1 r:core_node5 x:ckan_shard1_replica_n2] o.a.s.h.IndexFetcher Error fetching file, doing one retry... => org.apache.solr.common.SolrException: Unable to download _eahj_Lucene84_0.doc completely. Downloaded 288358400!=650272919
        at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1863)
org.apache.solr.common.SolrException: Unable to download _eahj_Lucene84_0.doc completely. Downloaded 288358400!=650272919
        at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1863) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1743) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1713) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1109) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:619) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:384) ~[?:?]
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:458) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:252) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:683) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:339) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318) ~[?:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180) ~[metrics-core-4.1.5.jar:4.1.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]

There does not seem to be a way to retry a recovery after it has failed... but I will try to investigate further.
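
Two hedged follow-ups, neither of which is a verified fix (the pod and core names come from the log above; the data path and exec context are assumptions):

    # 1) "No space left on device" with apparently free blocks can also mean
    #    inode exhaustion, so compare both views on the data volume:
    kubectl exec default-solr-f8da61d49a8cfc70-solrcloud-2 -- df -h /var/solr/data
    kubectl exec default-solr-f8da61d49a8cfc70-solrcloud-2 -- df -i /var/solr/data

    # 2) The CoreAdmin REQUESTRECOVERY action asks a core to attempt recovery
    #    again; it is not guaranteed to pull a replica out of recovery_failed.
    curl -u "$SOLR_USER:$SOLR_PASS" \
      "http://localhost:8983/solr/admin/cores?action=REQUESTRECOVERY&core=ckan_shard1_replica_n2"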

@nickumia-reisys (Contributor, Author)

Similar to the following issue, we are not running SolrCloud (and do not have this issue in our current design)

@nickumia-reisys nickumia-reisys self-assigned this Jul 20, 2023
@nickumia-reisys nickumia-reisys added component/solr-service Related to Solr-as-a-Service, a brokered Solr offering Testing labels Oct 7, 2023
@nickumia-reisys nickumia-reisys moved this to 🗄 Closed in data.gov team board Oct 7, 2023
Projects: Archived in project
Development: No branches or pull requests
4 participants