
[BUG][MISP] Connector creating a runaway task loop - Leading to Platform Stall #958

Closed
MaxwellDPS opened this issue Dec 21, 2022 · 9 comments
Labels: bug, solved
Milestone: Release 5.5.2

Comments

@MaxwellDPS

Description

The MISP connector appears to be re-importing events it has already seen. This leads to a major queue backlog within minutes.

The odd part is that this seems to be new behavior as of ~72 hours ago.

Environment

  1. OS (where OpenCTI server runs): CentOS Stream 8
  2. OpenCTI version: 5.4.1
  3. OpenCTI client: Frontend
  4. Other environment details: Running as a Kubernetes Deployment

Reproducible Steps

Not entirely sure what is causing the events to be re-imported; it seems somewhat intermittent depending on the latest MISP events.

Steps to create the smallest reproducible scenario:

  1. Run MISP Connector 5.4.1
  2. Wait for the same event to be re-imported continuously

Expected Output

Each MISP event should be imported once.

Actual Output

The MISP connector is dropping thousands of tasks onto the queue every <10 minutes.

Additional information

MISP Connector Logs

The MISP event ID has been replaced with <redacted_misp_event_id(same)>; the event ID IS THE SAME for every occurrence below.

INFO:root:Listing Threat-Actors with filters null.
INFO:root:Listing Threat-Actors with filters null.
INFO:root:Connector registered with ID: <redacted_connector_id>
INFO:root:Starting ping alive thread
WARNING:pymisp:The version of PyMISP recommended by the MISP instance (2.4.166) is newer than the one you're using now (2.4.165.1). Please upgrade PyMISP.
INFO:root:Initiate work for <redacted_connector_id>
INFO:root:Connector last run: 2022-12-21T19:55:32.406301+00:00
INFO:root:Connector latest event: 2022-12-21T19:54:33+00:00
INFO:root:Fetching MISP events with args: {"timestamp": 1671652474, "limit": 50, "page": 1}
INFO:root:MISP returned 1 events.
INFO:root:Processing event <redacted_misp_event_id(same)>
INFO:root:Sending event STIX2 bundle
INFO:root:Update action expectations work_<redacted_connector_id>_2022-12-21T20:02:38.117Z - 29747
INFO:root:Fetching MISP events with args: {"timestamp": 1671652474, "limit": 50, "page": 2}
INFO:root:MISP returned 0 events.
INFO:root:Connector successfully run (1 events have been processed), storing state (last_run=2022-12-21T20:02:38.091939+00:00, last_event=2022-12-21T20:02:06+00:00, last_event_timestamp=1671652926)
INFO:root:Reporting work update_received work_<redacted_connector_id>_2022-12-21T20:02:38.117Z
INFO:root:Initiate work for <redacted_connector_id>
INFO:root:Connector last run: 2022-12-21T20:02:38.091939+00:00
INFO:root:Connector latest event: 2022-12-21T20:02:06+00:00
INFO:root:Fetching MISP events with args: {"timestamp": 1671652927, "limit": 50, "page": 1}
INFO:root:MISP returned 1 events.
INFO:root:Processing event <redacted_misp_event_id(same)>
INFO:root:{"errors":[{"message":"Cannot execute GraphQL operations after the server has stopped.","extensions":{"code":"INTERNAL_SERVER_ERROR"}}]}

ERROR:root:Error pinging the API
INFO:root:Sending event STIX2 bundle
INFO:root:Update action expectations work_<redacted_connector_id>_2022-12-21T20:10:18.351Z - 29786
ERROR:root:API Ping back to normal
INFO:root:Fetching MISP events with args: {"timestamp": 1671652927, "limit": 50, "page": 2}
INFO:root:MISP returned 0 events.
INFO:root:Connector successfully run (1 events have been processed), storing state (last_run=2022-12-21T20:10:18.327820+00:00, last_event=2022-12-21T20:10:07+00:00, last_event_timestamp=1671653407)
INFO:root:Reporting work update_received work_<redacted_connector_id>_2022-12-21T20:10:18.351Z
INFO:root:Initiate work for <redacted_connector_id>
INFO:root:Connector last run: 2022-12-21T20:10:18.327820+00:00
INFO:root:Connector latest event: 2022-12-21T20:10:07+00:00
INFO:root:Fetching MISP events with args: {"timestamp": 1671653408, "limit": 50, "page": 1}
INFO:root:MISP returned 1 events.
INFO:root:Processing event <redacted_misp_event_id(same)>
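
For context, here is a minimal, hypothetical sketch of the timestamp-based polling pattern visible in the logs above. This is not the connector's actual code; the client setup and the `run_once` / `process_and_send_bundle` helpers are made up for illustration. It shows why a filter of `last_event_timestamp + 1` keeps matching the same event whenever that event's own timestamp advances between runs, e.g. because it is modified or re-published on the MISP side:

```python
# Hypothetical sketch of the polling pattern above, not the connector's real code.
from pymisp import PyMISP

misp = PyMISP(url="https://misp.example.org", key="<api_key>", ssl=True)


def process_and_send_bundle(event) -> None:
    # Placeholder for "convert the event to a STIX2 bundle and push it to OpenCTI".
    print(f"Processing event {event.id}")


def run_once(state: dict) -> dict:
    # Ask MISP for everything newer than the last event timestamp we stored.
    since = state.get("last_event_timestamp", 0) + 1
    page = 1
    while True:
        events = misp.search(
            controller="events",
            timestamp=since,
            limit=50,
            page=page,
            pythonify=True,
        )
        if not events:
            break
        for event in events:
            process_and_send_bundle(event)
            # The event's own timestamp moves forward whenever it changes on the
            # MISP side, so the next run's "since" filter matches it again and the
            # same event is re-imported on every run.
            state["last_event_timestamp"] = int(event.timestamp.timestamp())
        page += 1
    return state
```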

Screenshots (optional)

Example of the continuous dumping of the same data from the MISP connector (screenshot).

@MaxwellDPS changed the title from "[MISP] MISP-Connector creating a runaway task loop" to "[BUG][MISP] Connector creating a runaway task loop - Leading to Platform Stall" on Dec 21, 2022
@SamuelHassine added the bug and solved labels on Jan 9, 2023
@SamuelHassine added this to the Release 5.5.2 milestone on Jan 9, 2023
@MaxwellDPS
Author

@SamuelHassine @richard-julien This is still a problem in 5.5.2, and I have confirmed it's the source of OpenCTI-Platform/opencti#2603 on my end.

@MaxwellDPS
Author

MaxwellDPS commented Jan 27, 2023

@SamuelHassine This is now becoming a source of major platform instability; since the 5.5.2 upgrade I am experiencing Redis OOM-based platform stalls.

This bug is still very much present - please re-open. If there is anything you need on my end, let me know / ping me in Slack.

Side note - the 5.5.2 workers are not processing tasks at nearly the same rate:

  • On 5.4.x we could process ~300k tasks in under 90 minutes.
  • On 5.5.2 we can't even process ~25k tasks in under 60 minutes.

Not trying to nitpick here, but this is a serious performance decrease, and resourcing has not changed on my end.

Between this issue and OpenCTI-Platform/opencti#2603, this is kind of a loss-of-cabin-pressure scenario.

@llid3nlq

I'm keen to get this resolved too, as I'm experiencing the exact same issue. I keep increasing RAM, but redis-server consumes everything I give it; I'm currently at 54 GB allocated to the VM. Let me know if I can provide any information to help resolve this issue. The platform can't survive 24 hours without crash looping.

@Beskletochnii

I'm experiencing the exact same issue at the moment: 32 GB of memory in total, with Redis consuming 20 GB.

@richard-julien
Member

Hi @llid3nlq and @Beskletochnii,
Two things you need to know. Without specific configuration, Redis retains all of the platform's events and therefore takes more and more memory. You can adapt that using the platform's trimming option. It is not limited by default because you need to think about how many events you want to retain: the more events you keep, the further back the connectors and internal processes that use the streams can recover from late processing. In practice, a lot of users look at their load and size the trimming to represent about 2 months of data on average.

If the trimming is in place and the Redis memory is still not stable, we will need to organize a call and deep-dive into your connectors/data to identify the culprit.
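
For readers unfamiliar with the mechanism, here is a small illustration of what length-based trimming means for a Redis stream. This is not OpenCTI's actual implementation, and the key name and cap value are made up; it only shows the underlying idea that once a MAXLEN cap is applied, the oldest entries are discarded as new ones arrive, so memory stops growing:

```python
# Illustration only: how a length-capped Redis stream behaves. The key name and
# cap value here are made up; OpenCTI's trimming option applies the same idea to
# the platform's own event stream.
import redis

r = redis.Redis(host="localhost", port=6379)
CAP = 1_000_000  # keep roughly the most recent 1M entries

# Adding with maxlen discards the oldest entries as new ones arrive.
r.xadd("stream.example", {"type": "create", "data": "..."}, maxlen=CAP, approximate=True)

# An existing, unbounded stream can also be trimmed in place.
r.xtrim("stream.example", maxlen=CAP, approximate=True)
print(r.xlen("stream.example"))
```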

@Beskletochnii

@richard-julien
Thank you very much for your answer. Could you link to the documentation on the Redis trimming option?

@richard-julien
Member

Yes, sure - you can find it in the dependencies section here: https://filigran.notion.site/Configuration-a568604c46d84f39a8beae141505572a#232c60ef21d9472298baf3c104ccd0c7

@llid3nlq


Thank you for this information - it is very helpful.
Can you tell me whether there are any negative implications to enabling this feature? e.g. some intel won't be searchable, or IoCs won't remain in the database.
I'm just wondering exactly what trimming means in the broader context of OpenCTI.
If there is no negative effect other than on cached stream data, then that sounds OK.

@richard-julien
Member

The stream is used in OpenCTI for multiple purposes, for example the "rule engine", the "notification system", the "live stream", etc. The only implication of the trimming is the time window within which all features that use the stream can recover if they are late or fail for some reason. So you need to set a number according to your capacity for monitoring / resolving a platform problem if one of these components fails. I would say 1 month of retention should be sufficient.

If you want to be precise about the number you put in the configuration, you can connect to your stream on the /stream URI, check the number of messages you already have, and do some maths to work out the number that represents 1 month of data.

If for now you just want to stabilize the size of your Redis, you can put 1 million in the configuration; that represents on average 4 GB of memory and an unknown period of time, because it depends on the volume of data you absorb every day.
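
To put numbers on the "do some maths" suggestion, here is a rough sketch. It assumes direct redis-py access to the platform's Redis and that the events stream key is `stream.opencti`; both are assumptions to verify against your own deployment. It samples the stream length twice and extrapolates a one-month trimming value:

```python
# Rough sizing sketch, not an official tool. Assumes direct redis-py access to
# the platform's Redis and that the events stream key is "stream.opencti"
# (check both against your deployment). Sample the stream length twice and
# extrapolate roughly one month of events.
import time

import redis

STREAM_KEY = "stream.opencti"  # assumed key name, adjust if needed
SAMPLE_SECONDS = 3600          # sample window: one hour

r = redis.Redis(host="localhost", port=6379)

before = r.xlen(STREAM_KEY)
time.sleep(SAMPLE_SECONDS)
after = r.xlen(STREAM_KEY)

events_per_hour = max(after - before, 0)
one_month = events_per_hour * 24 * 30
print(f"~{events_per_hour} events/hour -> trimming value for ~1 month: {one_month}")
```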
