
Error on deletion of entities with many relationships #6357

Closed
iFrozenPhoenix opened this issue Mar 14, 2024 · 13 comments
Labels
bug use for describing something not working as expected solved use to identify issue that has been solved (must be linked to the solving PR)

@iFrozenPhoenix

Description

When trying to delete an entity, I receive the error "Fail to execute engine pagination - no reason provided".
It only happens on certain entities.

Environment

  1. OS (where OpenCTI server runs): debian
  2. OpenCTI version: 6.0.5
  3. OpenCTI client: web
  4. Other environment details:

Reproducible Steps

  1. Go to entities
  2. Find the entity and select it
  3. Try to delete it
  4. Go to tasks and see the error "Fail to execute engine pagination - no reason provided"

Expected Output

Deletion is successful

Actual Output

Deletion fails with the error message "Fail to execute engine pagination - no reason provided"

Additional information

Screenshots (optional)

@iFrozenPhoenix iFrozenPhoenix added bug use for describing something not working as expected needs triage use to identify issue needing triage from Filigran Product team labels Mar 14, 2024
@nino-filigran

hey @iFrozenPhoenix, do you happen to know for which entity type it happens, or does it seem to happen randomly? Additionally, are you trying to delete a bunch of them at once or only one by one? If it's one by one, do you delete them by entering the entity and then deleting it, or do you delete directly from the list?

@nino-filigran nino-filigran added needs more info Intel needed about the use case and removed needs triage use to identify issue needing triage from Filigran Product team labels Mar 14, 2024
@iFrozenPhoenix
Author

The entities I have problems with are of type Malware. If I enter the entity and try to delete it, nothing happens. If I try to delete the entity via bulk deletion, I see the mentioned error. It doesn't affect all Malware entities, only certain ones.

@iFrozenPhoenix
Author

I was able to isolate the OCTI error messages on the backend; see the following.

{"category":"APP","errors":[{"attributes":{"genre":"TECHNICAL","http_status":500,"query":{"_source":true,"body":{"query":{"bool":{"must":[{"bool":{"minimum_should_match":1,"should":[{"bool":{"minimum_should_match":1,"should":[{"multi_match":{"fields":["entity_type.keyword","parent_types.keyword"],"query":"RetentionRule"}}]}}]}}],"must_not":[]}},"size":500,"sort":[{"standard_id.keyword":"asc"}]},"index":["opencti_internal_objects*","opencti_stix_meta_objects*","opencti_stix_domain_objects*","opencti_stix_cyber_observables*","opencti_inferred_entities*"],"track_total_hits":true},"root_cause":"rejected_execution_exception"},"message":"Fail to execute engine pagination","name":"DATABASE_ERROR","stack":"DATABASE_ERROR: Fail to execute engine pagination\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n    at /opt/opencti/build/src/database/engine.js:2731:15\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at retentionHandler (/opt/opencti/build/src/manager/retentionManager.js:62:28)\n    at /opt/opencti/build/src/manager/retentionManager.js:90:9\n    at lTt.#runHandlerAndScheduleTimeout (/opt/opencti/build/node_modules/set-interval-async/dist/set-interval-async-timer.cjs:36:13)\n    at Timeout._onTimeout (/opt/opencti/build/node_modules/set-interval-async/dist/set-interval-async-timer.cjs:29:13)"},{"message":"search_phase_execution_exception: [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e261c74 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 87.9micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads 
= 30, queued tasks = 1392, completed tasks = 65924960]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7df037c7 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2a876b9 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7c5b247a on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 941.7micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924970]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@72905239 on QueueResizingOpenSearchThreadPoolExecutor[name = 
cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 418.6micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1389, completed tasks = 65924973]]","name":"ResponseError","stack":"ResponseError: search_phase_execution_exception: [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e261c74 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 87.9micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924960]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7df037c7 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2a876b9 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, 
task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7c5b247a on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 941.7micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924970]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@72905239 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 418.6micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1389, completed tasks = 65924973]]\n    at onBody (/opt/opencti/build/node_modules/@opensearch-project/opensearch/lib/Transport.js:425:23)\n    at IncomingMessage.onEnd (/opt/opencti/build/node_modules/@opensearch-project/opensearch/lib/Transport.js:340:11)\n    at IncomingMessage.emit (node:events:530:35)\n    at endReadableNT (node:internal/streams/readable:1696:12)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)"}],"level":"error","manager":"RETENTION_MANAGER","message":"Fail to execute engine pagination","timestamp":"2024-03-14T17:03:07.280Z","version":"6.0.5"}
{"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--89230b56-befd-4387-936c-5e4b0e266f15","relationship--570df7f7-f2d5-4de9-ba74-4adfa99e980f","de5f2e30-f2fa-4c43-915d-37b86ad981b3","f9357ba9-85ff-49ed-947e-d7b48a6d0b6a"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2884:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2892:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":1172,"time":26056,"timestamp":"2024-03-14T17:03:36.072Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["117df96d-c92f-4e18-8e22-be2f440859ca"],"ip":"::ffff:10.0.0.67","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.5"}
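The log entries above are dense; a small helper (a sketch, not part of OpenCTI) can pull out the error name, message, and root cause from each JSON log line, which is how the `rejected_execution_exception` above was spotted:

```python
import json

def summarize_octi_errors(log_line: str) -> list[dict]:
    """Extract name, message, and root cause from one OpenCTI JSON log line
    (structure as seen in the logs above)."""
    record = json.loads(log_line)
    summary = []
    for err in record.get("errors", []):
        summary.append({
            "name": err.get("name"),
            "message": err.get("message"),
            "root_cause": err.get("attributes", {}).get("root_cause"),
        })
    return summary

# Minimal sample shaped like the first log line above
sample = json.dumps({
    "category": "APP",
    "errors": [{
        "name": "DATABASE_ERROR",
        "message": "Fail to execute engine pagination",
        "attributes": {"root_cause": "rejected_execution_exception"},
    }],
})
print(summarize_octi_errors(sample))
```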

@iFrozenPhoenix
Author

iFrozenPhoenix commented Mar 14, 2024

I could find out a bit more. The problem seems to occur with any entity if it has more than 1,000 relations. If the relations are deleted in advance, the deletion of the entity also works. If I understand it correctly, mass selection and deletion deletes each object (relation/entity) one by one. To me it looks like there is a difference in how the relations are deleted when I delete the entity, compared to when I delete the relations by mass-selecting them.

With version 5.12.* it worked without problems.

@iFrozenPhoenix iFrozenPhoenix changed the title Bulk deletion error Error on deletion of entities with many relationships Mar 14, 2024
@nino-filigran

Thanks for the investigation! I'm reaching out internally to seek advice and help on this one.

@iFrozenPhoenix
Author

I tried adjusting my OpenSearch settings, especially the thread_pool.search.size parameter. I significantly increased it, and now the error message is a different one ("Execution timeout, too many concurrent call on the same entities - no reason provided") when I try to delete such an entity.

@nino-filigran

cc @richard-julien

@richard-julien
Member

Increasing thread_pool.search.size is a good move, as the error in your logs is generated when OpenSearch is under too much load. So basically we have to improve your cluster capacity or your OpenSearch memory/thread figures.

The error "Execution timeout, too many concurrent call on the same entities" is completely different. It's an error generated by OpenCTI when too many operations are attempted on the same element in a short period of time. A quicker OpenSearch can mitigate this problem, but if you have a lot of workers working on the same elements and you try to delete at the same time, you can definitely get this kind of error. However, after a bit of time you should be able to delete the element if no other process is working on it.
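Since this LOCK_ERROR is transient, a client-side workaround is to retry the deletion with exponential backoff until no other process holds the element. A minimal sketch, assuming a hypothetical `delete_entity` callable that raises `LockError` while the element is contended (the names are illustrative, not the pycti API):

```python
import time

class LockError(Exception):
    """Stand-in for OpenCTI's LOCK_ERROR ('too many concurrent call...')."""

def backoff_delays(retries: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, base*2, base*4, ..., capped."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def delete_with_retry(delete_entity, entity_id: str,
                      retries: int = 5, base: float = 2.0) -> bool:
    """Attempt the deletion, sleeping between attempts while the entity is locked."""
    for delay in backoff_delays(retries, base):
        try:
            delete_entity(entity_id)
            return True
        except LockError:
            time.sleep(delay)  # another worker holds the lock; wait and retry
    return False
```

This does not fix the underlying contention, but it matches the advice above: wait until other processes stop working on the element, then delete.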

@iFrozenPhoenix
Author

iFrozenPhoenix commented Mar 15, 2024

Currently I'm running OCTI with the following setup:
a machine with 16 GB RAM and 8 vCPUs (ARM Ampere);
a single-node OpenSearch with Xms6G/Xmx6G; 5-10 OCTI workers (sometimes I change this, depending on ingestion); 1 platform for the UI and 1 platform for ingestion/API. (If I increase the number of replicas (k8s), performance drops significantly; I don't know why, but that's another topic.)

The OpenSearch config (modifications only) I use is:

thread_pool.search.min_queue_size: 10000
thread_pool.search.max_queue_size: 10000
thread_pool.search.queue_size: 10000
thread_pool.search.size: 32

Before version 6 I didn't apply the custom OpenSearch config, had Xms3G and Xmx3G, and it ran very smoothly. Now, after applying the above config, it runs again at least nearly smoothly, but I still have the problem with the pagination error. (Possibly the "Execution timeout, too many concurrent requests" error was sent at a moment when the ingestion rate was very high.)

After I have applied the above config I see again the following ingestion metrics:

[Screenshot: ingestion metrics, 2024-03-15]

and the UI is again relatively smooth, apart from some dashboard widgets with relation filtering.

Could you possibly give me a suggestion for what else I could try?

I have the following metrics: disk throughput 2k IOPS (of 50-100k available), disk bandwidth 100 Mbps (of ca. 1,000 Mbps), OpenSearch CPU usage 105-230 % (500 % headroom of 800 % total), 4 GB RAM left.

@iFrozenPhoenix
Author

iFrozenPhoenix commented Mar 15, 2024

I have to add that I currently have a new connector attached that is ingesting 1.2 M new entities, hence the high number of bundles to be processed.
Under normal circumstances the platform is significantly more performant; a few hundred bundles are processed in no time.

@iFrozenPhoenix
Author

Additionally, OpenSearch's _cat/thread_pool?v query (showing active, queued, rejected) does not show a single rejection.
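For reference, the `_cat/thread_pool` check can be scripted so rejections are monitored over time rather than spot-checked. A sketch that parses the whitespace-separated tabular output of `GET _cat/thread_pool?v` (the column set varies by OpenSearch version; the sample rows below are illustrative, not real cluster output):

```python
def parse_cat_thread_pool(text: str) -> list[dict]:
    """Parse the tabular output of GET _cat/thread_pool?v into row dicts,
    keyed by the header line's column names."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    headers = lines[0].split()
    return [dict(zip(headers, line.split())) for line in lines[1:]]

# Illustrative sample output (pool stats invented)
sample = """\
name   active queue rejected
search 30     1392  0
write  4      0     0
"""
pools = parse_cat_thread_pool(sample)
print(pools[0])  # → {'name': 'search', 'active': '30', 'queue': '1392', 'rejected': '0'}
```

Summing the `rejected` column across polls would show whether the `rejected_execution_exception` from the earlier logs recurs.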

@SamuelHassine SamuelHassine added this to the Release 6.0.8 milestone Mar 20, 2024
@Jipegien Jipegien modified the milestones: Release 6.0.8, Release 6.0.9 Mar 20, 2024
@jborozco
Member

cc @richard-julien do you have any suggestions?

@richard-julien
Member

As your platform grows, and as the data you're ingesting may also evolve, you start to have some contention problems on your OpenSearch. A less effective OpenSearch will introduce latencies and "too many concurrent requests" errors when a lot of workers work on the same objects and OpenSearch is slower than expected to process everything.

Today the solution can be to decrease the number of workers and accept that you will be a bit slower to ingest your data, or to move to a more powerful OpenSearch, e.g. a cluster approach. Even if you move to a bigger OpenSearch, you can still get "too many concurrent" errors from time to time when the platform processes very highly connected data. It's something we are working on to limit this kind of locking.

@SamuelHassine SamuelHassine added solved use to identify issue that has been solved (must be linked to the solving PR) and removed needs more info Intel needed about the use case labels Mar 26, 2024