[8.6] [Fleet] refactored bulk update tags retry (elastic#147594) (elastic#147839)

# Backport

This will backport the following commits from `main` to `8.6`:
- [[Fleet] refactored bulk update tags retry
(elastic#147594)](elastic#147594)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"90178898+juliaElastic@users.noreply.github.com"},"sourceCommit":{"committedDate":"2022-12-20T09:36:36Z","message":"[Fleet]
refactored bulk update tags retry (elastic#147594)\n\n## Summary\r\n\r\nFixes
elastic#144161
discussed\r\n[here](elastic#144161 (comment)
existing implementation of update tags doesn't work well with
real\r\nagents, as there are many conflicts with checkin, even when
trying to\r\nadd/remove one tag.\r\nRefactored the logic to make retries
more efficient:\r\n- Instead of aborting the whole bulk action on
conflicts, changed the\r\nconflict strategy to 'proceed'. This means, if
an action of 50k agents\r\nhas 1k conflicts, not all 50k is retried, but
only the 1k conflicts,\r\nthis makes it less likely to conflict on
retry.\r\n- Because of this, on retry we have to know which agents don't
yet have\r\nthe tag added/removed. For this, added an additional filter
to the\r\n`updateByQuery` request. Only adding the filter if there is
exactly one\r\n`tagsToAdd` or one `tagsToRemove`. This is the main use
case from the\r\nUI, and handling other cases would complicate the logic
more (each\r\nadditional tag to add/remove would result in another OR
query, which\r\nwould match more agents, making conflicts more
likely).\r\n- Added this additional query on the initial request as well
(not only\r\nretries) to save on unnecessary work e.g. if the user tries
to add a tag\r\non 50k agents, but 48k already have it, it is enough to
update the\r\nremaining 2k agents.\r\n- This improvement has the effect
that 'Agent activity' shows the real\r\nupdated agent count, not the
total selected. I think this is not really\r\na problem for update
tags.\r\n- Cleaned up some of the UI logic, because the conflicts are
fully\r\nhandled now on the backend.\r\n- Locally I couldn't reproduce
the conflict with agent checkins, even\r\nwith 1k horde agents. I'll try
to test in cloud with more real agents.\r\n\r\nTo verify:\r\n- Enroll
50k agents (I used 50k with create_agents script, and 1k with\r\nhorde).
Enroll 50k with horde if possible.\r\n- Select all on UI and try to
add/remove one or more tags\r\n- Expect the changes to propagate quickly
(up to 1m). It might take a\r\nfew refreshes to see the result on agent
list and tags list, because the\r\nUI polls the agents every 30s. It is
expected that the tags list\r\ntemporarily shows incorrect data because
the action is async.\r\n\r\nE.g. removed `test3` tag and added `add` tag
quickly:\r\n<img width=\"1776\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824481-411f0f70-d7e8-42a6-b73f-ed80e77b7700.png\">\r\n<img
width=\"422\"
alt=\"image\"\r\nsrc=\"https://user-images.githubusercontent.com/90178898/207824550-582d43fc-87db-45e1-ba58-15915447fefd.png\">\r\n\r\nThe
logs show the details of how many `version_conflicts` were there,\r\nand
it decreased with
retries.\r\n\r\n```\r\n[2022-12-15T10:32:12.937+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:12.981+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:16.477+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da7-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:16.537+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:22.893+01:00][DEBUG][plugins.fleet]
{\"took\":9886,\"timed_out\":false,\"total\":52000,\"updated\":41143,\"deleted\":0,\"batches\":52,\"version_conflicts\":10857,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:26.066+01:00][DEBUG][plugins.fleet]
{\"took\":9518,\"timed_out\":false,\"total\":52000,\"updated\":25755,\"deleted\":0,\"batches\":52,\"version_conflicts\":26245,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:27.401+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 10857
agents\r\n[2022-12-15T10:32:27.461+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:27.462+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:29.274+01:00][ERROR][plugins.fleet]
Action failed: version conflict of 26245
agents\r\n[2022-12-15T10:32:29.353+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:29.353+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:31.480+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:31.481+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:31.481+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:31.481+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:31.485+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:33.841+01:00][DEBUG][plugins.fleet]
{\"took\":2347,\"timed_out\":false,\"total\":10857,\"updated\":9857,\"deleted\":0,\"batches\":11,\"version_conflicts\":1000,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:34.556+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:34.557+01:00][DEBUG][plugins.fleet] Retry #1
of task
fleet:update_agent_tags:retry:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:34.557+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
29e9da7-7194-4e52-8004-2c1b19f6dfd5, total agents:
52000\r\n[2022-12-15T10:32:34.557+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:34.560+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:35.388+01:00][ERROR][plugins.fleet]
Retry #1 of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de
failed: version conflict of 1000
agents\r\n[2022-12-15T10:32:35.468+01:00][INFO ][plugins.fleet]
Scheduling task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:35.468+01:00][INFO
][plugins.fleet] Retrying in task:
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n{\"took\":5509,\"timed_out\":false,\"total\":26245,\"updated\":26245,\"deleted\":0,\"batches\":27,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:42.722+01:00][INFO
][plugins.fleet] processed 26245 agents, took
5509ms\r\n[2022-12-15T10:32:42.723+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:29e9da70-7194-4e52-8004-2c1b19f6dfd5\r\n[2022-12-15T10:32:46.705+01:00][INFO
][plugins.fleet] Running bulk action retry
task\r\n[2022-12-15T10:32:46.706+01:00][DEBUG][plugins.fleet] Retry #2
of task
fleet:update_agent_tags:retry:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:46.707+01:00][INFO
][plugins.fleet] Running action asynchronously, actionId:
90acd54-19ac-4738-b3d3-db32789233de, total agents:
52000\r\n[2022-12-15T10:32:46.707+01:00][INFO ][plugins.fleet] Completed
bulk action retry task\r\n[2022-12-15T10:32:46.711+01:00][INFO
][plugins.fleet] Scheduling task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n[2022-12-15T10:32:47.099+01:00][DEBUG][plugins.fleet]
{\"took\":379,\"timed_out\":false,\"total\":1000,\"updated\":1000,\"deleted\":0,\"batches\":1,\"version_conflicts\":0,\"noops\":0,\"retries\":{\"bulk\":0,\"search\":0},\"throttled_millis\":0,\"requests_per_second\":-1,\"throttled_until_millis\":0,\"failures\":[]}\r\n[2022-12-15T10:32:47.623+01:00][INFO
][plugins.fleet] processed 1000 agents, took
379ms\r\n[2022-12-15T10:32:47.623+01:00][INFO ][plugins.fleet] Removing
task
fleet:update_agent_tags:retry:check:90acd541-19ac-4738-b3d3-db32789233de\r\n```\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<42973632+kibanamachine@users.noreply.github.com>","sha":"687987aa9ce56ce359f722485330179a4807d79a","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.7.0","v8.6.1"],"number":147594,"url":"elastic#147594"},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"elastic#147594"},{"branch":"8.6","label":"v8.6.1","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
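
The log excerpt in the commit message shows how the `'proceed'` conflict strategy shrinks each retry. As a rough illustration, the sketch below replays the counters logged for the first action (52000 agents, then 10857, then 1000). The interface and helper names are hypothetical and exist only for this example; they are not part of Fleet.

```typescript
// Hypothetical subset of an updateByQuery response: only the counters used below.
interface UpdateByQueryCounts {
  total: number;
  updated: number;
  version_conflicts: number;
}

// Counters copied from the DEBUG log lines of the first action in the commit message.
const attempts: UpdateByQueryCounts[] = [
  { total: 52000, updated: 41143, version_conflicts: 10857 }, // initial request
  { total: 10857, updated: 9857, version_conflicts: 1000 }, // retry #1
  { total: 1000, updated: 1000, version_conflicts: 0 }, // retry #2
];

// In these logs every matched document is either updated or counted as a conflict.
function isFullyAccounted(counts: UpdateByQueryCounts): boolean {
  return counts.updated + counts.version_conflicts === counts.total;
}

// Each retry matches exactly as many agents as the previous attempt left in conflict.
function retriesShrink(chain: UpdateByQueryCounts[]): boolean {
  let previous: UpdateByQueryCounts | undefined;
  for (const current of chain) {
    if (previous !== undefined && current.total !== previous.version_conflicts) {
      return false;
    }
    previous = current;
  }
  return true;
}

// Sum of the agents updated across the initial request and both retries.
const totalUpdated = attempts.reduce((sum, counts) => sum + counts.updated, 0);
console.log(attempts.every(isFullyAccounted), retriesShrink(attempts), totalUpdated);
```

Across the three attempts the updated counts add up to the 52000 agents originally selected, while each retry touches only the agents that conflicted in the previous attempt.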
kibanamachine and juliaElastic committed Dec 20, 2022
1 parent 32265b7 commit 335b86a
Showing 9 changed files with 284 additions and 145 deletions.
@@ -292,7 +292,16 @@ describe('TagsAddRemove', () => {
 
     expect(mockBulkUpdateTags).toHaveBeenCalledWith(
       'query',
-      ['newTag2', 'newTag'],
+      ['newTag'],
+      [],
+      expect.anything(),
+      'Tag created',
+      'Tag creation failed'
+    );
+
+    expect(mockBulkUpdateTags).toHaveBeenCalledWith(
+      'query',
+      ['newTag2'],
       [],
       expect.anything(),
       'Tag created',
@@ -316,7 +325,16 @@ describe('TagsAddRemove', () => {
     expect(mockBulkUpdateTags).toHaveBeenCalledWith(
       '',
       [],
-      ['tag2', 'tag1'],
+      ['tag1'],
+      expect.anything(),
+      undefined,
+      undefined
+    );
+
+    expect(mockBulkUpdateTags).toHaveBeenCalledWith(
+      '',
+      [],
+      ['tag2'],
       expect.anything(),
       undefined,
       undefined

@@ -120,32 +120,10 @@ export const TagsAddRemove: React.FC<Props> = ({
         errorMessage
       );
     } else {
-      // sending updated tags to add/remove, in case multiple actions are done quickly and the first one is not yet propagated
-      const updatedTagsToAdd = tagsToAdd.concat(
-        labels
-          .filter(
-            (tag) =>
-              tag.checked === 'on' &&
-              !selectedTags.includes(tag.label) &&
-              !tagsToRemove.includes(tag.label)
-          )
-          .map((tag) => tag.label)
-      );
-      const updatedTagsToRemove = tagsToRemove.concat(
-        labels
-          .filter(
-            (tag) =>
-              tag.checked !== 'on' &&
-              selectedTags.includes(tag.label) &&
-              !tagsToAdd.includes(tag.label)
-          )
-          .map((tag) => tag.label)
-      );
-
       updateTagsHook.bulkUpdateTags(
         agents!,
-        updatedTagsToAdd,
-        updatedTagsToRemove,
+        tagsToAdd,
+        tagsToRemove,
         (hasCompleted) => handleTagsUpdated(tagsToAdd, tagsToRemove, hasCompleted),
         successMessage,
         errorMessage
6 changes: 4 additions & 2 deletions x-pack/plugins/fleet/server/services/agents/action_runner.ts
@@ -22,6 +22,8 @@ import { getAgentActions } from './actions';
 import { closePointInTime, getAgentsByKuery } from './crud';
 import type { BulkActionsResolver } from './bulk_actions_resolver';
 
+export const MAX_RETRY_COUNT = 3;
+
 export interface ActionParams {
   kuery: string;
   showInactive?: boolean;
@@ -110,8 +112,8 @@ export abstract class ActionRunner {
           `Retry #${this.retryParams.retryCount} of task ${this.retryParams.taskId} failed: ${error.message}`
         );
 
-        if (this.retryParams.retryCount === 3) {
-          const errorMessage = 'Stopping after 3rd retry. Error: ' + error.message;
+        if (this.retryParams.retryCount === MAX_RETRY_COUNT) {
+          const errorMessage = `Stopping after ${MAX_RETRY_COUNT}rd retry. Error: ${error.message}`;
           appContextService.getLogger().warn(errorMessage);
 
           // clean up tasks after 3rd retry reached
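
The retry cap introduced in `action_runner.ts` can be read as a small decision function. The sketch below is only an illustration under stated assumptions: `decideRetry` and `RetryDecision` are hypothetical names, the comparison uses `>=` rather than the `===` in the hunk, and the message wording avoids the hard-coded "rd" suffix, which only reads correctly while the constant is 3.

```typescript
// Same value as the MAX_RETRY_COUNT constant exported in the hunk above.
const MAX_RETRY_COUNT = 3;

// Hypothetical result type, introduced only for this example.
interface RetryDecision {
  stop: boolean;
  message: string;
}

// Hypothetical helper: decides whether a failed retry should be followed by another one.
function decideRetry(retryCount: number, errorMessage: string): RetryDecision {
  if (retryCount >= MAX_RETRY_COUNT) {
    // Cap reached: report the failure and stop scheduling further retries.
    return {
      stop: true,
      message: `Stopping after ${MAX_RETRY_COUNT} retries. Error: ${errorMessage}`,
    };
  }
  // Below the cap: the caller may schedule the next retry.
  return { stop: false, message: `Retry #${retryCount} failed: ${errorMessage}` };
}

console.log(decideRetry(1, 'version conflict of 1000 agents'));
console.log(decideRetry(3, 'version conflict of 1000 agents'));
```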
15 changes: 9 additions & 6 deletions x-pack/plugins/fleet/server/services/agents/action_status.ts
@@ -69,12 +69,15 @@ export async function getActionStatuses(
     const nbAgentsActioned = action.nbAgentsActioned || action.nbAgentsActionCreated;
     const cardinalityCount = (matchingBucket?.agent_count as any)?.value ?? 0;
     const docCount = matchingBucket?.doc_count ?? 0;
-    const nbAgentsAck = Math.min(
-      docCount,
-      // only using cardinality count when count lower than precision threshold
-      docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount,
-      nbAgentsActioned
-    );
+    const nbAgentsAck =
+      action.type === 'UPDATE_TAGS'
+        ? Math.min(docCount, nbAgentsActioned)
+        : Math.min(
+            docCount,
+            // only using cardinality count when count lower than precision threshold
+            docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount,
+            nbAgentsActioned
+          );
     const completionTime = (matchingBucket?.max_timestamp as any)?.value_as_string;
     const complete = nbAgentsAck >= nbAgentsActioned;
     const cancelledAction = cancelledActions.find((a) => a.actionId === action.actionId);
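
The acknowledgement count in `action_status.ts` now depends on the action type. The sketch below restates that branch as a standalone function so the two paths can be compared side by side. `ackCount` is a hypothetical name, and the `PRECISION_THRESHOLD` value used here is a placeholder assumption, since the real value is not shown in this diff.

```typescript
// Placeholder value for illustration only; the actual constant is defined elsewhere in Fleet.
const PRECISION_THRESHOLD = 40000;

// Hypothetical standalone version of the nbAgentsAck expression from the hunk above.
function ackCount(
  actionType: string,
  docCount: number,
  cardinalityCount: number,
  nbAgentsActioned: number
): number {
  if (actionType === 'UPDATE_TAGS') {
    // Update tags: the cardinality aggregation is ignored, only the document count is used.
    return Math.min(docCount, nbAgentsActioned);
  }
  return Math.min(
    docCount,
    // The cardinality count is only used while docCount is at or below the precision threshold.
    docCount > PRECISION_THRESHOLD ? docCount : cardinalityCount,
    nbAgentsActioned
  );
}

console.log(ackCount('UPDATE_TAGS', 100, 1, 100)); // document count wins
console.log(ackCount('UPGRADE', 100, 90, 100)); // cardinality count wins
```

With these inputs the `UPDATE_TAGS` branch returns the document count, while other action types fall back to the smaller cardinality count.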
5 changes: 4 additions & 1 deletion x-pack/plugins/fleet/server/services/agents/crud.ts
Original file line number Diff line number Diff line change
Expand Up @@ -155,7 +155,8 @@ export function getElasticsearchQuery(
kuery: string,
showInactive = false,
includeHosted = false,
-hostedPolicies: string[] = []
+hostedPolicies: string[] = [],
+extraFilters: string[] = []
): estypes.QueryDslQueryContainer | undefined {
const filters = [];

@@ -171,6 +172,8 @@ export function getElasticsearchQuery(
filters.push('NOT (policy_id:{policyIds})'.replace('{policyIds}', hostedPolicies.join(',')));
}

+filters.push(...extraFilters);
+
const kueryNode = _joinFilters(filters);
return kueryNode ? toElasticsearchQuery(kueryNode) : undefined;
}
153 changes: 145 additions & 8 deletions x-pack/plugins/fleet/server/services/agents/update_agent_tags.test.ts
@@ -10,6 +10,7 @@ import { elasticsearchServiceMock, savedObjectsClientMock } from '@kbn/core/serv

import { createClientMock } from './action.mock';
import { updateAgentTags } from './update_agent_tags';
+import { updateTagsBatch } from './update_agent_tags_action_runner';

jest.mock('../app_context', () => {
return {
@@ -28,6 +29,7 @@ jest.mock('../agent_policy', () => {
return {
agentPolicyService: {
getByIDs: jest.fn().mockResolvedValue([{ id: 'hosted-agent-policy', is_managed: true }]),
+list: jest.fn().mockResolvedValue({ items: [] }),
},
};
});
@@ -73,7 +75,7 @@ describe('update_agent_tags', () => {

expect(esClient.updateByQuery).toHaveBeenCalledWith(
expect.objectContaining({
-conflicts: 'abort',
+conflicts: 'proceed',
index: '.fleet-agents',
query: { terms: { _id: ['agent1'] } },
script: expect.objectContaining({
@@ -90,6 +92,9 @@ describe('update_agent_tags', () => {
});

it('should update action results on success', async () => {
+esClient.updateByQuery.mockReset();
+esClient.updateByQuery.mockResolvedValue({ failures: [], updated: 1, total: 1 } as any);
+
await updateAgentTags(soClient, esClient, { agentIds: ['agent1'] }, ['one'], []);

const agentAction = esClient.create.mock.calls[0][0] as any;
@@ -110,11 +115,32 @@ describe('update_agent_tags', () => {
expect(actionResults.body[1].error).not.toBeDefined();
});

-it('should write error action results for hosted agent when agentIds are passed', async () => {
+it('should update action results on success - kuery', async () => {
+await updateTagsBatch(
+soClient,
+esClient,
+[],
+{},
+{
+tagsToAdd: ['new'],
+tagsToRemove: [],
+kuery: '',
+}
+);
+
+const actionResults = esClient.bulk.mock.calls[0][0] as any;
+const agentIds = actionResults?.body
+?.filter((i: any) => i.agent_id)
+.map((i: any) => i.agent_id);
+expect(agentIds[0]).toHaveLength(36); // uuid
+expect(actionResults.body[1].error).not.toBeDefined();
+});
+
+it('should skip hosted agent from total when agentIds are passed', async () => {
const { esClient: esClientMock, agentInHostedDoc } = createClientMock();

esClientMock.updateByQuery.mockReset();
-esClientMock.updateByQuery.mockResolvedValue({ failures: [], updated: 0, total: '0' } as any);
+esClientMock.updateByQuery.mockResolvedValue({ failures: [], updated: 0, total: 0 } as any);

await updateAgentTags(
soClient,
@@ -130,13 +156,9 @@ describe('update_agent_tags', () => {
action_id: expect.anything(),
agents: [],
type: 'UPDATE_TAGS',
-total: 1,
+total: 0,
})
);
-
-const errorResults = esClientMock.bulk.mock.calls[0][0] as any;
-expect(errorResults.body[1].agent_id).toEqual(agentInHostedDoc._id);
-expect(errorResults.body[1].error).toEqual('Cannot modify tags on a hosted agent');
});

it('should write error action results when failures are returned', async () => {
@@ -152,6 +174,46 @@ describe('update_agent_tags', () => {
expect(errorResults.body[1].error).toEqual('error reason');
});

+it('should throw error on version conflicts', async () => {
+esClient.updateByQuery.mockReset();
+esClient.updateByQuery.mockResolvedValue({
+failures: [],
+updated: 0,
+version_conflicts: 100,
+} as any);
+
+await expect(
+updateAgentTags(soClient, esClient, { agentIds: ['agent1'] }, ['one'], [])
+).rejects.toThrowError('version conflict of 100 agents');
+});
+
+it('should write out error results on last retry with version conflicts', async () => {
+esClient.updateByQuery.mockReset();
+esClient.updateByQuery.mockResolvedValue({
+failures: [],
+updated: 0,
+version_conflicts: 100,
+} as any);
+
+await expect(
+updateTagsBatch(
+soClient,
+esClient,
+[],
+{},
+{
+tagsToAdd: ['new'],
+tagsToRemove: [],
+kuery: '',
+total: 100,
+retryCount: 3,
+}
+)
+).rejects.toThrowError('version conflict of 100 agents');
+const errorResults = esClient.bulk.mock.calls[0][0] as any;
+expect(errorResults.body[1].error).toEqual('version conflict on 3rd retry');
+});
+
it('should run add tags async when actioning more agents than batch size', async () => {
esClient.search.mockResolvedValue({
hits: {
@@ -180,4 +242,79 @@ describe('update_agent_tags', () => {

expect(mockRunAsync).toHaveBeenCalled();
});
+
+it('should add tags filter if only one tag to add', async () => {
+await updateTagsBatch(
+soClient,
+esClient,
+[],
+{},
+{
+tagsToAdd: ['new'],
+tagsToRemove: [],
+kuery: '',
+}
+);
+
+const updateByQuery = esClient.updateByQuery.mock.calls[0][0] as any;
+expect(updateByQuery.query).toEqual({
+bool: {
+filter: [
+{ bool: { minimum_should_match: 1, should: [{ match: { active: true } }] } },
+{
+bool: {
+must_not: { bool: { minimum_should_match: 1, should: [{ match: { tags: 'new' } }] } },
+},
+},
+],
+},
+});
+});
+
+it('should add tags filter if only one tag to remove', async () => {
+await updateTagsBatch(
+soClient,
+esClient,
+[],
+{},
+{
+tagsToAdd: [],
+tagsToRemove: ['remove'],
+kuery: '',
+}
+);
+
+const updateByQuery = esClient.updateByQuery.mock.calls[0][0] as any;
+expect(JSON.stringify(updateByQuery.query)).toContain(
+'{"bool":{"should":[{"match":{"tags":"remove"}}],"minimum_should_match":1}}'
+);
+});
+
+it('should write total from updateByQuery result if query returns less results', async () => {
+esClient.updateByQuery.mockReset();
+esClient.updateByQuery.mockResolvedValue({ failures: [], updated: 0, total: 50 } as any);
+
+await updateTagsBatch(
+soClient,
+esClient,
+[],
+{},
+{
+tagsToAdd: ['new'],
+tagsToRemove: [],
+kuery: '',
+total: 100,
+}
+);
+
+const agentAction = esClient.create.mock.calls[0][0] as any;
+expect(agentAction?.body).toEqual(
+expect.objectContaining({
+action_id: expect.anything(),
+agents: [],
+type: 'UPDATE_TAGS',
+total: 50,
+})
+);
+});
});
0 comments on commit 335b86a