
Solr Internode Traffic denied by 401 Unauthorized #3770

Closed · 1 of 2 tasks
nickumia-reisys opened this issue Apr 4, 2022 · 15 comments
Assignees
Labels
bug (Software defect or bug), component/catalog (Related to catalog component playbooks/roles), component/inventory (Inventory playbooks/roles), component/solr-service (Related to Solr-as-a-Service, a brokered Solr offering), component/ssb, Testing

Comments

@nickumia-reisys (Contributor) commented Apr 4, 2022

Solr v8.11.1

How to reproduce

  1. Unknown

Expected behavior

Stable SolrCloud cluster operation

Actual behavior

[screenshots]

Background

Solr had been reindexed to serve catalog.data.gov traffic and seemed functional. Upon testing (visiting the /organization page), Solr returned an invalid-token error that we had never seen before.
[screenshot]

Upon inspection of the SolrCloud cluster, there were 401 errors occurring between the nodes, and no amount of debugging brought the internode communication under control. We suspected that the persistent volumes attached to the nodes returning 401s had been corrupted, since those nodes had trouble reading solr.xml and were failing authentication. Various PVs and Solr pods were deleted to get the cluster into a fresh, clean state; however, this made no difference.

Research showed that this issue is known upstream. The upstream fix has not yet been released. Through a friendly ping, a workaround was suggested that may be useful in the short term. There is no guarantee of how CKAN will respond to the workaround or whether it is compatible with our setup.

While there is a similar issue in Solr upstream, the causes are different. We've isolated the problem to the GSA-hardened EKS AMI that we were deploying, since using the default Amazon EKS AMI does not cause Solr to exhibit this issue.
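
For context on how one node can even 401 another: in a SolrCloud cluster secured with the BasicAuthPlugin, internode requests are normally authenticated by the PKIAuthenticationPlugin's short-lived signed tokens (unless forwardCredentials is enabled), so a node that cannot validate a peer's token rejects the request with a 401 / "invalid token". A minimal security.json along those lines, with illustrative values taken from the Solr reference guide rather than our actual configuration:

    {
      "authentication": {
        "blockUnknown": true,
        "class": "solr.BasicAuthPlugin",
        "credentials": {
          "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
        },
        "forwardCredentials": false
      },
      "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin",
        "permissions": [{ "name": "security-edit", "role": "admin" }],
        "user-role": { "solr": "admin" }
      }
    }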

Sketch

  • Attempt to implement workaround suggested in this issue.
  • Use the default EKS AMI for the EC2 nodes instead of the GSA-hardened AMI (see the sketch below)
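
We have not settled on how the AMI swap will be wired into our provisioning; purely as a sketch, if the node group were defined with eksctl, relying on the stock amiFamily (and omitting any custom AMI reference) would boot the default EKS-optimized Amazon Linux 2 image instead of the GSA-hardened one. All names and sizes below are placeholders:

    # Hypothetical eksctl config; cluster name, region, and instance sizing
    # are placeholders, not values from our infrastructure code.
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: ssb-solr-dev
      region: us-east-1
    managedNodeGroups:
      - name: solr-nodes
        amiFamily: AmazonLinux2   # default EKS-optimized AMI family
        instanceType: m5.xlarge
        desiredCapacity: 3
        volumeSize: 100
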
@nickumia-reisys nickumia-reisys added component/catalog Related to catalog component playbooks/roles component/inventory Inventory playbooks/roles bug Software defect or bug component/ssb labels Apr 4, 2022
@nickumia-reisys nickumia-reisys self-assigned this Apr 4, 2022
@nickumia-reisys (Contributor, Author)

Related issue (although I'm not sure of the direct impact): https://solr-user.lucene.apache.narkive.com/HOJFKqsh/overseer-could-not-get-tags. It does show up in the logs, but it's only a warning, not a real error.

[screenshot]

@mogul (Contributor) commented Apr 5, 2022

Are you absolutely sure this has nothing to do with the default-deny NetworkPolicy?

@nickumia-reisys (Contributor, Author)

Considering that the network policy was not enabled when the error showed up, I am confident it is not a factor. The network policy was enabled at the initial cluster startup, but I had deleted it to get the cluster healthy and, incidentally, never re-enabled it after that. So the network policy did not have a direct impact on the issue. That said, if the presence of the network policy at cluster startup caused timing issues or communication mishaps that compounded and only became a problem after days of uptime, that's a different story.
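
For future reference, if the default-deny policy is re-enabled later, the SolrCloud pods will also need an explicit allow rule so internode traffic is not dropped at the network layer. A rough example follows; the namespace, labels, and port are assumptions, not values from our actual manifests:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-solr-internode   # hypothetical name
      namespace: default           # assumed namespace
    spec:
      podSelector:
        matchLabels:
          solr-cloud: solrcloud1   # assumed label applied by the solr-operator
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  solr-cloud: solrcloud1
          ports:
            - protocol: TCP
              port: 8983           # default Solr port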

@nickumia-reisys (Contributor, Author) commented Apr 5, 2022

What I don't like is that, even with a fresh SolrCloud cluster, there is a plethora of errors and anomalies that seem very destructive in nature.

The ckan collection was successfully created, the collection is indexing, and the Solr nodes are reporting healthy, but these issues have already appeared.

[screenshots]

@nickumia-reisys (Contributor, Author)

This seems to be a new warning, but I don't know whether we would be waiting on CKAN to update this dependency. The way CKAN is using Solr seems dated, especially its reliance on old plugins... for example, the SynonymFilterFactory.

[screenshot]
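
For reference, the deprecation the warning points at is the synonym filter in the schema; the attribute values here are illustrative (taken from the Solr reference guide, not from CKAN's shipped schema):

    <!-- Deprecated factory that the stock CKAN schema still references: -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

    <!-- Upstream's suggested replacement; at index time it should be
         followed by FlattenGraphFilterFactory: -->
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>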

@nickumia-reisys (Contributor, Author)

This looks like the original issue popping up again, specifically from the HTTP authentication challenge:

[screenshot]

@nickumia-reisys (Contributor, Author)

Preliminary testing (~1.5 hours of indexing) while using the non-GSA-hardened AMI shows that this issue has not resurfaced.

[screenshot]

@nickumia-reisys (Contributor, Author)

It was confirmed that we are not setting any custom shardHandlerFactory on the /select requestHandler, so the upstream issue was caused by a different bug than this one (despite the resulting effects being the same).
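
For the record, this is the sort of configuration we checked for (and did not find) in our solrconfig.xml; the timeout values are illustrative only:

    <!-- A per-handler shardHandlerFactory nested inside /select, which is
         the setup the upstream authentication bug is tied to. We are not
         doing this. -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <shardHandlerFactory class="HttpShardHandlerFactory">
        <int name="socketTimeout">10000</int>
        <int name="connTimeout">5000</int>
      </shardHandlerFactory>
    </requestHandler>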

@hkdctol (Contributor) commented Apr 14, 2022

Staying in blocked until we can replicate the issue. @nickumia-reisys will update next steps.

@nickumia-reisys (Contributor, Author)

We have not yet been able to reproduce this error. #3783 is happening more reliably, so we are investigating that one before coming back to this. We can't hope to reproduce this issue if the Solr collection is going down first.

As it stands, the next steps are unknown. See #3783 (comment) for my best guess at the next steps in general.

@nickumia-reisys nickumia-reisys removed their assignment Apr 18, 2022
@nickumia-reisys (Contributor, Author)

(This is mainly speculation, but it's a fair point)

Not sure when this happened, but solr-99d4df8f264da830 seems to be exhibiting an error that occurred right before the 401 Unauthorized issue appeared last time. One of the shards in the collection is stuck in the recovering state and can't seem to catch up. Not sure of the cause, but if it doesn't improve, I won't be surprised if this evolves into the 401 Unauthorized issue, because the shard wouldn't receive authentication updates along with the rest of the cluster and would effectively be alienated.

[screenshot]
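
One hedged way to confirm which replica is stuck (the base URL and credentials are placeholders for however the cluster is actually reached, e.g. via a port-forward):

    # CLUSTERSTATUS reports the state of every replica, so the one stuck in
    # "recovering" should show up here.
    curl -u "$SOLR_USER:$SOLR_PASS" \
      "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=ckan"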

@jbrown-xentity (Contributor)

Waiting on the final GSA-hardened image and the Solr classic setup before testing to see whether this can be closed.

@nickumia-reisys (Contributor, Author)

This is crazy!! Someone else posted upstream that they saw the same error on k8s!! 😱

@nickumia-reisys (Contributor, Author) commented May 16, 2022

Newest results using the most recent GSA ISE-Hardened AMI (v2022-05-08)

  • Upon successful completion of the initial Solr re-index, one of the nodes is stuck in recovery_failed due to a No space left on device exception:
    [screenshot]
  • It should be noted that there is in fact space on the disk:
    [screenshot]
    [screenshot]
2022-05-16 11:00:31.284 WARN  (recoveryExecutor-11-thread-1-processing-n:default-solr-f8da61d49a8cfc70-solrcloud-2.solrcloud1.ssb-dev.data.gov:80_solr x:ckan_shard1_replica_n2 c:ckan s:shard1 r:core_node5) [c:ckan s:shard1 r:core_node5 x:ckan_shard1_replica_n2] o.a.s.h.IndexFetcher Error in fetching file: _eahj_Lucene84_0.doc (downloaded 288358400 of 650272919 bytes) => java.io.IOException: No space left on device
        at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
        at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) ~[?:?]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) ~[?:?]
        at sun.nio.ch.IOUtil.write(Unknown Source) ~[?:?]
        at sun.nio.ch.FileChannelImpl.write(Unknown Source) ~[?:?]
        at java.nio.channels.Channels.writeFullyImpl(Unknown Source) ~[?:?]
        at java.nio.channels.Channels.writeFully(Unknown Source) ~[?:?]
        at java.nio.channels.Channels$1.write(Unknown Source) ~[?:?]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:416) ~[?:?]
        at java.util.zip.CheckedOutputStream.write(Unknown Source) ~[?:?]
        at java.io.BufferedOutputStream.write(Unknown Source) ~[?:?]
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$DirectoryFile.write(IndexFetcher.java:1940) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1797) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1732) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1713) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1109) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:619) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:384) ~[?:?]
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:458) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:252) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:683) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:339) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318) ~[?:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180) ~[metrics-core-4.1.5.jar:4.1.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]
2022-05-16 11:00:31.701 ERROR (recoveryExecutor-11-thread-1-processing-n:default-solr-f8da61d49a8cfc70-solrcloud-2.solrcloud1.ssb-dev.data.gov:80_solr x:ckan_shard1_replica_n2 c:ckan s:shard1 r:core_node5) [c:ckan s:shard1 r:core_node5 x:ckan_shard1_replica_n2] o.a.s.h.IndexFetcher Error fetching file, doing one retry... => org.apache.solr.common.SolrException: Unable to download _eahj_Lucene84_0.doc completely. Downloaded 288358400!=650272919
        at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1863)
org.apache.solr.common.SolrException: Unable to download _eahj_Lucene84_0.doc completely. Downloaded 288358400!=650272919
        at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1863) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1743) ~[?:?]
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1713) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1109) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:619) ~[?:?]
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:384) ~[?:?]
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:458) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:252) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:683) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:339) ~[?:?]
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318) ~[?:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180) ~[metrics-core-4.1.5.jar:4.1.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]

There does not seem to be a way to retry a recovery after it has failed... but I will try to investigate further.
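
Two hedged follow-ups, neither of which is a verified fix (the pod and core names come from the log above; the data path and exec context are assumptions):

    # 1) "No space left on device" with apparently free blocks can also mean
    #    inode exhaustion, so compare both views on the data volume:
    kubectl exec default-solr-f8da61d49a8cfc70-solrcloud-2 -- df -h /var/solr/data
    kubectl exec default-solr-f8da61d49a8cfc70-solrcloud-2 -- df -i /var/solr/data

    # 2) The CoreAdmin REQUESTRECOVERY action asks a core to attempt recovery
    #    again; it is not guaranteed to pull a replica out of recovery_failed.
    curl -u "$SOLR_USER:$SOLR_PASS" \
      "http://localhost:8983/solr/admin/cores?action=REQUESTRECOVERY&core=ckan_shard1_replica_n2"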

@nickumia-reisys (Contributor, Author)

Similar to the following issue, we are not running SolrCloud (and do not have this issue in our current design)

@nickumia-reisys nickumia-reisys self-assigned this Jul 20, 2023
@nickumia-reisys nickumia-reisys added component/solr-service Related to Solr-as-a-Service, a brokered Solr offering Testing labels Oct 7, 2023
@nickumia-reisys nickumia-reisys moved this to 🗄 Closed in data.gov team board Oct 7, 2023
Projects: Archived in project
Development: No branches or pull requests
4 participants