
Error on deletion of entities with many relationships #6357

Closed
iFrozenPhoenix opened this issue Mar 14, 2024 · 13 comments
Labels
bug use for describing something not working as expected solved use to identify issue that has been solved (must be linked to the solving PR)

@iFrozenPhoenix

Description

When trying to delete an entity, I receive the error "Fail to execute engine pagination - no reason provided".
It only happens on certain entities.

Environment

  1. OS (where OpenCTI server runs): debian
  2. OpenCTI version: 6.0.5
  3. OpenCTI client: web
  4. Other environment details:

Reproducible Steps

  1. Go to entities
  2. Find the entity and select it
  3. Try to delete it
  4. Go to tasks and see the error "Fail to execute engine pagination - no reason provided"

Expected Output

Deletion is successful

Actual Output

Deletion fails with the error message "Fail to execute engine pagination - no reason provided"

Additional information

Screenshots (optional)

@iFrozenPhoenix iFrozenPhoenix added bug use for describing something not working as expected needs triage use to identify issue needing triage from Filigran Product team labels Mar 14, 2024
@nino-filigran

hey @iFrozenPhoenix, do you happen to know for which entity type it happens, or does it seem to happen randomly? Additionally, are you trying to delete a bunch of them at once or only one by one? If it's one by one, do you delete them by entering the entity and then deleting it, or do you delete directly from the list?

@nino-filigran nino-filigran added needs more info Intel needed about the use case and removed needs triage use to identify issue needing triage from Filigran Product team labels Mar 14, 2024
@iFrozenPhoenix
Author

The entities I have problems with are of type Malware. If I enter the entity and try to delete it, nothing happens. If I try to delete the entity via bulk deletion, I see the mentioned error. It doesn't affect all Malware entities, only certain ones.

@iFrozenPhoenix
Author

I was able to isolate the OCTI error messages on the backend; see the following.

{"category":"APP","errors":[{"attributes":{"genre":"TECHNICAL","http_status":500,"query":{"_source":true,"body":{"query":{"bool":{"must":[{"bool":{"minimum_should_match":1,"should":[{"bool":{"minimum_should_match":1,"should":[{"multi_match":{"fields":["entity_type.keyword","parent_types.keyword"],"query":"RetentionRule"}}]}}]}}],"must_not":[]}},"size":500,"sort":[{"standard_id.keyword":"asc"}]},"index":["opencti_internal_objects*","opencti_stix_meta_objects*","opencti_stix_domain_objects*","opencti_stix_cyber_observables*","opencti_inferred_entities*"],"track_total_hits":true},"root_cause":"rejected_execution_exception"},"message":"Fail to execute engine pagination","name":"DATABASE_ERROR","stack":"DATABASE_ERROR: Fail to execute engine pagination\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at DatabaseError (/opt/opencti/build/src/config/errors.js:58:48)\n    at /opt/opencti/build/src/database/engine.js:2731:15\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at retentionHandler (/opt/opencti/build/src/manager/retentionManager.js:62:28)\n    at /opt/opencti/build/src/manager/retentionManager.js:90:9\n    at lTt.#runHandlerAndScheduleTimeout (/opt/opencti/build/node_modules/set-interval-async/dist/set-interval-async-timer.cjs:36:13)\n    at Timeout._onTimeout (/opt/opencti/build/node_modules/set-interval-async/dist/set-interval-async-timer.cjs:29:13)"},{"message":"search_phase_execution_exception: [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e261c74 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 87.9micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads 
= 30, queued tasks = 1392, completed tasks = 65924960]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7df037c7 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2a876b9 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7c5b247a on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 941.7micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924970]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@72905239 on QueueResizingOpenSearchThreadPoolExecutor[name = 
cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 418.6micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1389, completed tasks = 65924973]]","name":"ResponseError","stack":"ResponseError: search_phase_execution_exception: [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e261c74 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 87.9micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924960]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7df037c7 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2a876b9 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, 
task execution EWMA = 108.5micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1392, completed tasks = 65924967]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7c5b247a on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 941.7micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1390, completed tasks = 65924970]]; [rejected_execution_exception] Reason: rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@72905239 on QueueResizingOpenSearchThreadPoolExecutor[name = cti-opensearch-644d7ff846-jh47c/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 10000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 418.6micros, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@4fea578a[Running, pool size = 30, active threads = 30, queued tasks = 1389, completed tasks = 65924973]]\n    at onBody (/opt/opencti/build/node_modules/@opensearch-project/opensearch/lib/Transport.js:425:23)\n    at IncomingMessage.onEnd (/opt/opencti/build/node_modules/@opensearch-project/opensearch/lib/Transport.js:340:11)\n    at IncomingMessage.emit (node:events:530:35)\n    at endReadableNT (node:internal/streams/readable:1696:12)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)"}],"level":"error","manager":"RETENTION_MANAGER","message":"Fail to execute engine pagination","timestamp":"2024-03-14T17:03:07.280Z","version":"6.0.5"}
{"category":"APP","errors":[{"attributes":{"genre":"BUSINESS","http_status":500,"participantIds":["relationship--89230b56-befd-4387-936c-5e4b0e266f15","relationship--570df7f7-f2d5-4de9-ba74-4adfa99e980f","de5f2e30-f2fa-4c43-915d-37b86ad981b3","f9357ba9-85ff-49ed-947e-d7b48a6d0b6a"]},"message":"Execution timeout, too many concurrent call on the same entities","name":"LOCK_ERROR","stack":"LOCK_ERROR: Execution timeout, too many concurrent call on the same entities\n    at error (/opt/opencti/build/src/config/errors.js:8:10)\n    at LockTimeoutError (/opt/opencti/build/src/config/errors.js:104:51)\n    at createRelationRaw (/opt/opencti/build/src/database/middleware.js:2884:13)\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at createRelation (/opt/opencti/build/src/database/middleware.js:2892:16)\n    at addStixCoreRelationship (/opt/opencti/build/src/domain/stixCoreRelationship.js:91:19)"}],"inner_relation_creation":1,"level":"warn","message":"Execution timeout, too many concurrent call on the same entities","operation":"StixCoreRelationshipAdd","operation_query":"mutation StixCoreRelationshipAdd($input:StixCoreRelationshipAddInput!){stixCoreRelationshipAdd(input:$input){id standard_id entity_type parent_types}}","size":1172,"time":26056,"timestamp":"2024-03-14T17:03:36.072Z","type":"WRITE_ERROR","user":{"applicant_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","call_retry_number":"1","group_ids":["117df96d-c92f-4e18-8e22-be2f440859ca"],"ip":"::ffff:10.0.0.67","organization_ids":[],"socket":"query","user_id":"88ec0c6a-13ce-5e39-b486-354fe4a7084f","user_metadata":{}},"version":"6.0.5"}
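The log entries above are dense; a small helper (a sketch, not part of OpenCTI) can pull out the error name, message, and root cause from each JSON log line, which is how the `rejected_execution_exception` above was spotted:

```python
import json

def summarize_octi_errors(log_line: str) -> list[dict]:
    """Extract name, message, and root cause from one OpenCTI JSON log line
    (structure as seen in the logs above)."""
    record = json.loads(log_line)
    summary = []
    for err in record.get("errors", []):
        summary.append({
            "name": err.get("name"),
            "message": err.get("message"),
            "root_cause": err.get("attributes", {}).get("root_cause"),
        })
    return summary

# Minimal sample shaped like the first log line above
sample = json.dumps({
    "category": "APP",
    "errors": [{
        "name": "DATABASE_ERROR",
        "message": "Fail to execute engine pagination",
        "attributes": {"root_cause": "rejected_execution_exception"},
    }],
})
print(summarize_octi_errors(sample))
```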

@iFrozenPhoenix
Author

iFrozenPhoenix commented Mar 14, 2024

I could find out a bit more. The problem seems to occur with any entity if it has more than 1,000 relations. If the relations are deleted in advance, the deletion of the entity also works. If I understand it correctly, mass selection and deletion deletes each object (relation/entity) one by one. To me it looks like there is a difference in how the relations are deleted when I delete the entity, compared to when I delete the relations by mass-selecting them.

With version 5.12.* it worked without problems.

@iFrozenPhoenix iFrozenPhoenix changed the title Bulk deletion error Error on deletion of entities with many relationships Mar 14, 2024
@nino-filigran

Thanks for the investigation! I'm reaching out internally to seek advice and help on this one.

@iFrozenPhoenix
Author

I tried adjusting my OpenSearch settings, especially the thread_pool.search.size parameter. I significantly increased it, and now the error message is a different one ("Execution timeout, too many concurrent call on the same entities - no reason provided") when I try to delete such an entity.

@nino-filigran

cc @richard-julien

@richard-julien
Member

Increasing thread_pool.search.size is a good move, as the error in your logs is generated when OpenSearch is under too much load. So basically we have to improve your cluster capacity or your OpenSearch memory/thread figures.

The error "Execution timeout, too many concurrent call on the same entities" is completely different. It's an error generated by OpenCTI when too many operations are attempted on the same element in a short period of time. A quicker OpenSearch can mitigate this problem, but if you have a lot of workers working on the same elements and you try to delete at the same time, you can definitely get this kind of error. However, after a bit of time you should be able to delete the element if no other process is working on it.
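Since this LOCK_ERROR is transient, a client-side workaround is to retry the deletion with exponential backoff until no other process holds the element. A minimal sketch, assuming a hypothetical `delete_entity` callable that raises `LockError` while the element is contended (the names are illustrative, not the pycti API):

```python
import time

class LockError(Exception):
    """Stand-in for OpenCTI's LOCK_ERROR ('too many concurrent call...')."""

def backoff_delays(retries: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, base*2, base*4, ..., capped."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def delete_with_retry(delete_entity, entity_id: str,
                      retries: int = 5, base: float = 2.0) -> bool:
    """Attempt the deletion, sleeping between attempts while the entity is locked."""
    for delay in backoff_delays(retries, base):
        try:
            delete_entity(entity_id)
            return True
        except LockError:
            time.sleep(delay)  # another worker holds the lock; wait and retry
    return False
```

This does not fix the underlying contention, but it matches the advice above: wait until other processes stop working on the element, then delete.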

@iFrozenPhoenix
Author

iFrozenPhoenix commented Mar 15, 2024

Currently I'm running OCTI with the following setup:
a machine with 16 GB RAM and 8 vCPUs (ARM Ampere);
a single-node OpenSearch with Xms6G/Xmx6G; 5-10 OCTI workers (sometimes I change this, depending on ingestion); 1 platform for the UI and 1 platform for ingestion/API. (If I increase the number of replicas (k8s), performance drops significantly; I don't know why, but that's another topic.)

The OpenSearch config (modifications only) I use is:

thread_pool.search.min_queue_size: 10000
thread_pool.search.max_queue_size: 10000
thread_pool.search.queue_size: 10000
thread_pool.search.size: 32

Before version 6 I didn't apply the custom OpenSearch config, had Xms3G and Xmx3G, and it ran very smoothly. Now, after applying the above config, it runs again at least nearly smoothly, but I still have the problem with the pagination error. (Possibly the "Execution timeout, too many concurrent requests" error was sent at a moment when the ingestion rate was very high.)

After I have applied the above config I see again the following ingestion metrics:

[Screenshot: ingestion metrics, 2024-03-15]

and the UI is again relatively smooth, apart from some dashboard widgets with relation filtering.

Could you possibly give me a suggestion for what else I could try?

I have the following metrics: disk throughput 2k IOPS (of 50-100k available), disk bandwidth 100 Mbps (of ca. 1,000 Mbps), OpenSearch CPU usage 105-230 % (500 % headroom of 800 % total), 4 GB RAM left.

@iFrozenPhoenix
Author

iFrozenPhoenix commented Mar 15, 2024

I have to add that I currently have a new connector attached that is ingesting 1.2 M new entities, hence the high number of bundles to be processed.
Under normal circumstances the platform is significantly more performant; a few hundred bundles are processed in no time.

@iFrozenPhoenix
Author

Additionally, OpenSearch's _cat/thread_pool?v query (showing active, queued, rejected) does not show a single rejection.
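For reference, the `_cat/thread_pool` check can be scripted so rejections are monitored over time rather than spot-checked. A sketch that parses the whitespace-separated tabular output of `GET _cat/thread_pool?v` (the column set varies by OpenSearch version; the sample rows below are illustrative, not real cluster output):

```python
def parse_cat_thread_pool(text: str) -> list[dict]:
    """Parse the tabular output of GET _cat/thread_pool?v into row dicts,
    keyed by the header line's column names."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    headers = lines[0].split()
    return [dict(zip(headers, line.split())) for line in lines[1:]]

# Illustrative sample output (pool stats invented)
sample = """\
name   active queue rejected
search 30     1392  0
write  4      0     0
"""
pools = parse_cat_thread_pool(sample)
print(pools[0])  # → {'name': 'search', 'active': '30', 'queue': '1392', 'rejected': '0'}
```

Summing the `rejected` column across polls would show whether the `rejected_execution_exception` from the earlier logs recurs.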

@SamuelHassine SamuelHassine added this to the Release 6.0.8 milestone Mar 20, 2024
@Jipegien Jipegien modified the milestones: Release 6.0.8, Release 6.0.9 Mar 20, 2024
@jborozco
Member

cc @richard-julien do you have any suggestions?

@richard-julien
Member

As your platform grows, and as the data you're ingesting may also evolve, you start to have some contention problems on your OpenSearch. A less effective OpenSearch will introduce latencies and "too many concurrent requests" errors when a lot of workers work on the same objects and OpenSearch is slower than expected to process everything.

Today the solution can be to decrease the number of workers and accept that you will be a bit slower to ingest your data, or to move to a more powerful OpenSearch, e.g. a cluster approach. Even if you move to a bigger OpenSearch, you can still get "too many concurrent" errors from time to time when the platform processes very highly connected data. It's something we are working on to limit this kind of locking.

@SamuelHassine SamuelHassine added solved use to identify issue that has been solved (must be linked to the solving PR) and removed needs more info Intel needed about the use case labels Mar 26, 2024