Solr Internode Traffic denied by 401 Unauthorized #3770
Comments
Related issue (although not sure of the direct impact): https://solr-user.lucene.apache.narkive.com/HOJFKqsh/overseer-could-not-get-tags. It does show up in the logs, but it's just a warning, not a real error.
Are you absolutely sure this has nothing to do with the default-deny NetworkPolicy?
Considering that the network policy was not enabled when the error showed up, I am very confident that it is not a factor. The network policy was enabled at the initial cluster startup, but I had deleted it to get the cluster healthy again, and I had not re-enabled it after that. Thus, the network policy did not have a direct impact on the issue. That said, if the presence of the network policy at the beginning of the cluster's life caused timing issues or communication mishaps that compounded and only became a problem after days of uptime, that's a different story.
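For what it's worth, here is a minimal sketch of how to confirm whether any NetworkPolicy is actually present when an error shows up. It assumes the Kubernetes Python client is installed and that the Solr pods run in a namespace named `solr`; both are assumptions, not taken from this issue.

```python
# Hypothetical check: list NetworkPolicies in the namespace running Solr to
# confirm whether a default-deny policy was actually in effect at the time
# the 401s appeared. The namespace name "solr" is an assumption.
from kubernetes import client, config

def list_network_policies(namespace: str = "solr") -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    netpols = client.NetworkingV1Api().list_namespaced_network_policy(namespace)
    if not netpols.items:
        print(f"No NetworkPolicies in namespace {namespace!r}")
    for np in netpols.items:
        print(np.metadata.name, np.spec.policy_types)

if __name__ == "__main__":
    list_network_policies()
```

Running this when the 401s start would at least pin down whether a policy was active at that moment, rather than relying on memory of when it was deleted.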
What I don't like is that, even with a fresh SolrCloud cluster, there is a plethora of errors and anomalies that seem very destructive in nature.
This seems to be a new warning, but I don't know if we would be waiting for CKAN to update this dependency. The way CKAN is using Solr seems dated, especially its reliance on old plugins such as SynonymFilterFactory.
It was confirmed that we are not setting any custom |
Staying in blocked until we can replicate the issue. @nickumia-reisys will update next steps. |
We have not yet been able to reproduce this error. #3783 is happening more reliably, so we are investigating that one before coming back to this; we can't hope to reproduce this one if the Solr collection is going down first. The next steps are unknown as it stands. See #3783 (comment) for my best guess at the next steps in general.
(This is mainly speculation, but it's a fair point.) Not sure when this happened, but solr-99d4df8f264da830 is exhibiting an error that occurred right before the 401 Unauthorized issue appeared last time: one of the shards in the collection is stuck in the recovering state and can't seem to catch up. Not sure of the cause, but if it doesn't improve, I won't be surprised if this evolves into the 401 Unauthorized, because the shard doesn't get authentication updates with the rest of the cluster and is then effectively alienated.
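One way to keep an eye on that shard is to poll the Collections API CLUSTERSTATUS action and print any replica that is not active. The sketch below assumes a service URL, a collection named `ckan`, and basic-auth credentials; all of those are placeholders, not values from this issue.

```python
# Hypothetical sketch: query CLUSTERSTATUS and list replicas that are not
# "active", to spot a shard stuck in "recovering".
import requests

SOLR_URL = "http://solr-headless:8983/solr"   # assumed service address
AUTH = ("admin", "admin-password")            # assumed basic-auth credentials

def replicas_not_active(collection: str = "ckan") -> None:
    resp = requests.get(
        f"{SOLR_URL}/admin/collections",
        params={"action": "CLUSTERSTATUS", "collection": collection, "wt": "json"},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    shards = resp.json()["cluster"]["collections"][collection]["shards"]
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            if replica["state"] != "active":
                print(f"{shard_name}/{replica_name} on "
                      f"{replica['node_name']}: {replica['state']}")

if __name__ == "__main__":
    replicas_not_active()
```

If the stuck replica's state never returns to active, that would support the theory that it is drifting out of sync with the rest of the cluster before the 401s start.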
Waiting on the final GSA hardened image and the Solr classic setup before testing to see if this can be closed.
This is crazy!! Someone else posted upstream that they saw the same error on k8s!! 😱 |
Similar to the following issue, we are not running SolrCloud (and do not have this issue in our current design).
Solr v8.11.1
How to reproduce
Expected behavior
Stable SolrCloud cluster operation
Actual behavior
Background
Solr had been reindexed to service catalog.data.gov traffic and it seemed functional. Upon testing (visiting the /organization page), Solr was returning an "invalid token" error which had never been seen before.
Upon inspection of the SolrCloud cluster, there were 401 errors happening between the nodes, and no amount of debugging seemed to get the cluster communication under control. We suspected that the persistent volumes attached to the nodes returning 401s had been corrupted, since those nodes were having issues reading solr.xml and having authentication issues. Various PVs and Solr pods were nuked to try to get the cluster into a fresh, clean state; however, this made no difference.
Research illuminated that this issue is known upstream. The upstream fix is not yet released. Through a friendly ping, a workaround has been suggested that may be useful in the short term; there is no guarantee how CKAN will respond to the workaround or whether it is compatible with our setup. While there is a similar issue in Solr upstream, the causes are different: we've isolated the problem to the GSA-hardened EKS AMI we were deploying, since using the default Amazon EKS AMI does not cause Solr to exhibit this issue.
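To narrow down which nodes were rejecting requests, something like the following could probe each node's system-info endpoint with the same credentials and report which ones answer 200 versus 401. The node addresses and credentials below are placeholders, not our actual configuration.

```python
# Hypothetical sketch: ping every node's system-info endpoint to see which
# nodes accept the shared credentials and which reject them with a 401.
import requests

NODES = [
    "http://solrcloud-0.solr-headless:8983/solr",  # assumed pod addresses
    "http://solrcloud-1.solr-headless:8983/solr",
    "http://solrcloud-2.solr-headless:8983/solr",
]
AUTH = ("admin", "admin-password")  # assumed basic-auth credentials

def check_nodes() -> None:
    for node in NODES:
        try:
            resp = requests.get(f"{node}/admin/info/system",
                                params={"wt": "json"}, auth=AUTH, timeout=10)
            print(node, resp.status_code)
        except requests.RequestException as exc:
            print(node, f"unreachable: {exc}")

if __name__ == "__main__":
    check_nodes()
```

A split result (some nodes 200, some 401) would point at per-node state such as a corrupted volume or stale security configuration rather than a cluster-wide misconfiguration.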
Sketch
Attempt to implement the workaround suggested in this issue.