
[BUG][MISP] Connector creating a runaway task loop - Leading to Platform Stall #958

Closed
MaxwellDPS opened this issue Dec 21, 2022 · 9 comments
Labels: bug, solved
Milestone: Release 5.5.2

Comments

@MaxwellDPS

Description

The MISP connector appears to be re-importing events it has already seen. This leads to a major queue backlog within minutes.

The odd part is that this seems to be new behavior as of ~72 hours ago.

Environment

  1. OS (where OpenCTI server runs): CentOS Stream 8
  2. OpenCTI version: 5.4.1
  3. OpenCTI client: Frontend
  4. Other environment details: Running as a Kubernetes Deployment

Reproducible Steps

Not entirely sure what is causing the events to be re-imported; it seems somewhat intermittent depending on the latest MISP events.

Steps to create the smallest reproducible scenario:

  1. Run MISP Connector 5.4.1
  2. Wait for the same event to be re-imported continuously

Expected Output

Each MISP event should be imported once.

Actual Output

The MISP connector is dropping thousands of tasks onto the queue every <10 minutes.

Additional information

MISP Connector Logs

The MISP event ID has been replaced with <redacted_misp_event_id(same)>; the event ID IS THE SAME for every occurrence below.

INFO:root:Listing Threat-Actors with filters null.
INFO:root:Listing Threat-Actors with filters null.
INFO:root:Connector registered with ID: <redacted_connector_id>
INFO:root:Starting ping alive thread
WARNING:pymisp:The version of PyMISP recommended by the MISP instance (2.4.166) is newer than the one you're using now (2.4.165.1). Please upgrade PyMISP.
INFO:root:Initiate work for <redacted_connector_id>
INFO:root:Connector last run: 2022-12-21T19:55:32.406301+00:00
INFO:root:Connector latest event: 2022-12-21T19:54:33+00:00
INFO:root:Fetching MISP events with args: {"timestamp": 1671652474, "limit": 50, "page": 1}
INFO:root:MISP returned 1 events.
INFO:root:Processing event <redacted_misp_event_id(same)>
INFO:root:Sending event STIX2 bundle
INFO:root:Update action expectations work_<redacted_connector_id>_2022-12-21T20:02:38.117Z - 29747
INFO:root:Fetching MISP events with args: {"timestamp": 1671652474, "limit": 50, "page": 2}
INFO:root:MISP returned 0 events.
INFO:root:Connector successfully run (1 events have been processed), storing state (last_run=2022-12-21T20:02:38.091939+00:00, last_event=2022-12-21T20:02:06+00:00, last_event_timestamp=1671652926)
INFO:root:Reporting work update_received work_<redacted_connector_id>_2022-12-21T20:02:38.117Z
INFO:root:Initiate work for <redacted_connector_id>
INFO:root:Connector last run: 2022-12-21T20:02:38.091939+00:00
INFO:root:Connector latest event: 2022-12-21T20:02:06+00:00
INFO:root:Fetching MISP events with args: {"timestamp": 1671652927, "limit": 50, "page": 1}
INFO:root:MISP returned 1 events.
INFO:root:Processing event <redacted_misp_event_id(same)>
INFO:root:{"errors":[{"message":"Cannot execute GraphQL operations after the server has stopped.","extensions":{"code":"INTERNAL_SERVER_ERROR"}}]}

ERROR:root:Error pinging the API
INFO:root:Sending event STIX2 bundle
INFO:root:Update action expectations work_<redacted_connector_id>_2022-12-21T20:10:18.351Z - 29786
ERROR:root:API Ping back to normal
INFO:root:Fetching MISP events with args: {"timestamp": 1671652927, "limit": 50, "page": 2}
INFO:root:MISP returned 0 events.
INFO:root:Connector successfully run (1 events have been processed), storing state (last_run=2022-12-21T20:10:18.327820+00:00, last_event=2022-12-21T20:10:07+00:00, last_event_timestamp=1671653407)
INFO:root:Reporting work update_received work_<redacted_connector_id>_2022-12-21T20:10:18.351Z
INFO:root:Initiate work for <redacted_connector_id>
INFO:root:Connector last run: 2022-12-21T20:10:18.327820+00:00
INFO:root:Connector latest event: 2022-12-21T20:10:07+00:00
INFO:root:Fetching MISP events with args: {"timestamp": 1671653408, "limit": 50, "page": 1}
INFO:root:MISP returned 1 events.
INFO:root:Processing event <redacted_misp_event_id(same)>
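
For context, here is a minimal, hypothetical sketch of the timestamp-based polling pattern visible in the logs above. This is not the connector's actual code; the client setup and the `run_once` / `process_and_send_bundle` helpers are made up for illustration. It shows why a filter of `last_event_timestamp + 1` keeps matching the same event whenever that event's own timestamp advances between runs, e.g. because it is modified or re-published on the MISP side:

```python
# Hypothetical sketch of the polling pattern above, not the connector's real code.
from pymisp import PyMISP

misp = PyMISP(url="https://misp.example.org", key="<api_key>", ssl=True)


def process_and_send_bundle(event) -> None:
    # Placeholder for "convert the event to a STIX2 bundle and push it to OpenCTI".
    print(f"Processing event {event.id}")


def run_once(state: dict) -> dict:
    # Ask MISP for everything newer than the last event timestamp we stored.
    since = state.get("last_event_timestamp", 0) + 1
    page = 1
    while True:
        events = misp.search(
            controller="events",
            timestamp=since,
            limit=50,
            page=page,
            pythonify=True,
        )
        if not events:
            break
        for event in events:
            process_and_send_bundle(event)
            # The event's own timestamp moves forward whenever it changes on the
            # MISP side, so the next run's "since" filter matches it again and the
            # same event is re-imported on every run.
            state["last_event_timestamp"] = int(event.timestamp.timestamp())
        page += 1
    return state
```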

Screenshots (optional)

Example of the continuous dumping of the same data from the MISP connector (screenshot).

@MaxwellDPS changed the title from "[MISP] MISP-Connector creating a runaway task loop" to "[BUG][MISP] Connector creating a runaway task loop - Leading to Platform Stall" on Dec 21, 2022
@SamuelHassine added the bug and solved labels on Jan 9, 2023
@SamuelHassine added this to the Release 5.5.2 milestone on Jan 9, 2023
@MaxwellDPS
Author

@SamuelHassine @richard-julien This is still a problem in 5.5.2, and I have confirmed it's the source of OpenCTI-Platform/opencti#2603 on my end.

@MaxwellDPS
Author

MaxwellDPS commented Jan 27, 2023

@SamuelHassine This is now becoming a source of major platform instability; since the 5.5.2 upgrade I am experiencing Redis OOM-based platform stalls.

This bug is still very much present - please re-open. If there is anything you need on my end, let me know / ping me in Slack.

Side note - the 5.5.2 workers are not processing tasks at nearly the same rate:

  • On 5.4.x we could process ~300k tasks in under 90 minutes.
  • On 5.5.2 we can't even process ~25k tasks in under 60 minutes.

Not trying to nitpick here, but this is a serious performance decrease, and resourcing has not changed on my end.

Between this issue and OpenCTI-Platform/opencti#2603, this is kind of a loss-of-cabin-pressure scenario.

@llid3nlq

I'm keen to get this resolved too, as I'm experiencing the exact same issue. I keep increasing RAM, but redis-server consumes everything I give it; I'm currently at 54 GB allocated to the VM. Let me know if I can provide any information to help resolve this issue. The platform can't survive 24 hours without crash looping.

@Beskletochnii

I'm experiencing the exact same issue at the moment: 32 GB of memory in total, with Redis consuming 20 GB.

@richard-julien
Member

Hi @llid3nlq and @Beskletochnii,
Two things you need to know. Without specific configuration, Redis retains all of the platform's events and therefore takes more and more memory. You can adapt that using the platform's trimming option. It is not limited by default because you need to think about how many events you want to retain: the more events you keep, the further back the connectors and internal processes that use the streams can recover from late processing. In practice, a lot of users look at their load and size the trimming to represent about 2 months of data on average.

If the trimming is in place and the Redis memory is still not stable, we will need to organize a call and deep-dive into your connectors/data to identify the culprit.
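
For readers unfamiliar with the mechanism, here is a small illustration of what length-based trimming means for a Redis stream. This is not OpenCTI's actual implementation, and the key name and cap value are made up; it only shows the underlying idea that once a MAXLEN cap is applied, the oldest entries are discarded as new ones arrive, so memory stops growing:

```python
# Illustration only: how a length-capped Redis stream behaves. The key name and
# cap value here are made up; OpenCTI's trimming option applies the same idea to
# the platform's own event stream.
import redis

r = redis.Redis(host="localhost", port=6379)
CAP = 1_000_000  # keep roughly the most recent 1M entries

# Adding with maxlen discards the oldest entries as new ones arrive.
r.xadd("stream.example", {"type": "create", "data": "..."}, maxlen=CAP, approximate=True)

# An existing, unbounded stream can also be trimmed in place.
r.xtrim("stream.example", maxlen=CAP, approximate=True)
print(r.xlen("stream.example"))
```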

@Beskletochnii

@richard-julien
Thank you very much for your answer. Could you link to the documentation on the Redis trimming option?

@richard-julien
Member

Yes, sure - you can find it in the dependencies section here: https://filigran.notion.site/Configuration-a568604c46d84f39a8beae141505572a#232c60ef21d9472298baf3c104ccd0c7

@llid3nlq


Thank you for this information - it is very helpful.
Can you tell me whether there are any negative implications to enabling this feature? e.g. some intel won't be searchable, or IoCs won't remain in the database.
I'm just wondering exactly what trimming means in the broader context of OpenCTI.
If there is no negative effect other than on cached stream data, then that sounds OK.

@richard-julien
Member

The stream is used in OpenCTI for multiple purposes, for example the "rule engine", the "notification system", the "live stream", etc. The only implication of the trimming is the time window within which all features that use the stream can recover if they are late or fail for some reason. So you need to set a number according to your capacity for monitoring / resolving a platform problem if one of these components fails. I would say 1 month of retention should be sufficient.

If you want to be precise about the number you put in the configuration, you can connect to your stream on the /stream URI, check the number of messages you already have, and do some maths to work out the number that represents 1 month of data.

If for now you just want to stabilize the size of your Redis, you can put 1 million in the configuration; that represents on average 4 GB of memory and an unknown period of time, because it depends on the volume of data you absorb every day.
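
To put numbers on the "do some maths" suggestion, here is a rough sketch. It assumes direct redis-py access to the platform's Redis and that the events stream key is `stream.opencti`; both are assumptions to verify against your own deployment. It samples the stream length twice and extrapolates a one-month trimming value:

```python
# Rough sizing sketch, not an official tool. Assumes direct redis-py access to
# the platform's Redis and that the events stream key is "stream.opencti"
# (check both against your deployment). Sample the stream length twice and
# extrapolate roughly one month of events.
import time

import redis

STREAM_KEY = "stream.opencti"  # assumed key name, adjust if needed
SAMPLE_SECONDS = 3600          # sample window: one hour

r = redis.Redis(host="localhost", port=6379)

before = r.xlen(STREAM_KEY)
time.sleep(SAMPLE_SECONDS)
after = r.xlen(STREAM_KEY)

events_per_hour = max(after - before, 0)
one_month = events_per_hour * 24 * 30
print(f"~{events_per_hour} events/hour -> trimming value for ~1 month: {one_month}")
```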
