Which shards are unassigned? For the above shard, what does this show?
Version
2.4.60
Installation Method
Security Onion ISO image
Description
configuration
Installation Type
Distributed
Location
on-prem with Internet access
Hardware Specs
Exceeds minimum requirements
CPU
60
RAM
128
Storage for /
256
Storage for /nsm
256
Network Traffic Collection
span port
Network Traffic Speeds
Less than 1Gbps
Status
No, one or more services are failed (please provide detail below)
Salt Status
No, there are no failures
Logs
Yes, there are additional clues in /opt/so/log/ (please provide detail below)
Detail
Hey guys!
I've been fighting this issue for the past week or so. At first everything works great, but then Elasticsearch shows as faulted in the grid console. This is accompanied by a shard failure error in Kibana and the same error in Securityonion.log.
So far, these shards have always shown up in the unassigned output after running sudo so-elasticsearch-query _cat/shards | grep UN
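For anyone hitting the same thing, the allocation explain API will usually say why a particular shard is unassigned. A rough sketch, assuming so-elasticsearch-query passes extra arguments through to curl (the index name, shard number, and primary flag below are placeholders for whatever shows up in the _cat/shards output):
sudo so-elasticsearch-query _cluster/allocation/explain -XPOST -H 'Content-Type: application/json' -d '{"index":"<index-name>","shard":0,"primary":false}'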
I have quite a few, and the list keeps growing. I will clear some out, only for new ones to take their place. The kicker is that the ones I have now often stay, regardless of whether I issue the command to resolve them. I have issued the commands as the appropriate users as well.
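For reference, the kind of retry I mean is along these lines (a sketch, not an exact invocation; again assuming extra arguments are passed through to curl):
sudo so-elasticsearch-query "_cluster/reroute?retry_failed=true" -XPOST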
What I have found works, very briefly, is issuing that command and then restarting my ENTIRE grid. Things will work for around 4 hours, then the process repeats with a new shard failure on a new index, which so far has always been unassigned (at least a replica of it, anyway).
Currently this shard is:
ds-logs-endpoint.events.network-default-2024.04.30-000086
Sometimes I have also noticed the issue will resolve itself, but only for roughly the 4-hour cycle described above.
I have also noticed that, throughout this, memory usage on the node steadily climbs well beyond its normal operating range.
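A quick way to watch that from the CLI is the _cat/nodes endpoint (a sketch; name, node.role, heap.percent, and ram.percent are standard _cat/nodes columns):
sudo so-elasticsearch-query "_cat/nodes?v&h=name,node.role,heap.percent,ram.percent"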
It might be worth noting that my grid has shown pending for quite some time, with most of those indices being older. I believe only the new indices that are being dropped in there are causing the issue.
Any help is much appreciated as always!
Salt-call fails on the elastic-wait
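To dig into that, re-running the highstate with info-level logging and tailing the minion log can show where elastic-wait stalls (a sketch; assuming the Salt minion log lives under /opt/so/log/salt/ on this install):
sudo salt-call state.highstate -l info
sudo tail -f /opt/so/log/salt/minion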