Fix for multi agent job completion #282

msm-code · 2022-08-30T20:25:38Z

Your checklist for this pull request

I've read the contributing guideline.
I've tested my changes by building and running mquery, and testing changed functionality (if applicable)
I've added automated tests for my change (if applicable, optional)
I've updated documentation to reflect my change (if applicable)

What is the current behaviour?
With multiple agents, jobs sometimes don't complete properly.

What is the new behaviour?
Jobs complete properly. I hope, at least.

This is a draft, because I need to set-up a proper mquery testing lab (again). Daemon is notoriously hard to reason about, and I really should test this code before the release.

This is related to PR #271, but I think this is a more correct solution - I'll explain "why" when I test this properly.

Test plan
Create a setup with multiple agent groups and ursadb, and run a single query there. It should complete properly in all cases (indepentent of how many samples are in each agent group)

Closing issues

fixes #270
fixes #301

msm-code · 2022-12-04T18:09:03Z

version: '3'
services:
  web:
    restart: always
    build:
      context: .
      dockerfile: deploy/docker/web.Dockerfile
    ports:
    - "80:5000"
    links:
    - redis
    depends_on:
      - "redis"
      - "daemon1"
      - "daemon2"
      - "ursadb1"
      - "ursadb2"
  daemon1:
    restart: always
    build:
      context: .
      dockerfile: deploy/docker/daemon.Dockerfile
    command: dameon1
    volumes:
    - "/tmp/samples1:/mnt/samples"
    depends_on:
      - "redis"
      - "ursadb1"
    environment:
      - "MQUERY_BACKEND=tcp://ursadb1:9281"
  daemon2:
    restart: always
    build:
      context: .
      dockerfile: deploy/docker/daemon.Dockerfile
    command: dameon2
    volumes:
    - "/tmp/samples2:/mnt/samples"
    depends_on:
      - "redis"
      - "ursadb2"
    environment:
      - "MQUERY_BACKEND=tcp://ursadb2:9281"
  ursadb1:
    restart: always
    image: mqueryci/ursadb:v1.5.0
    ports:
    - "127.0.0.1:5001:9281"
    volumes:
    - "/tmp/samples1:/mnt/samples"
    - "/tmp/indexes1:/var/lib/ursadb"
  ursadb2:
    restart: always
    image: mqueryci/ursadb:v1.5.0
    ports:
    - "127.0.0.1:5002:9281"
    volumes:
    - "/tmp/samples2:/mnt/samples"
    - "/tmp/indexes2:/var/lib/ursadb"
  redis:
    restart: always
    image: redis

old reproducttion config

msm-code · 2022-12-19T23:27:37Z

Confirmed and tested. Sorry for making everyone wait!

I was taking time with this, because the old code used to work with my multi-cluster setup and I didn't want to accidentaly break something. I think it's ready for review.

msm-code · 2022-12-20T02:25:49Z

No, I was wrong, this did not fix the issue at all. I've tested with more agents and reproduced the issue.

Pushed a new version of this PR. There are actually two issues:

initial dataset creation is just wrong. This was the root cause for The same query yara rule results with different number of matches #301. If race happens (almost guarantees for 10+ agents) some agents finish the initial dateset/topology request before others even start. Because of this, job status changes from "new" to "processing" and the agents don't do anything at all
Not only start condition was wrong, end condition too :). But it was hard to debug (because our "database" are JSONs packed into redis):
- The original code checked if the whole job is over to decide if the agent is done (there are multiple agents per job). So not enough agents finished.
- My first PR checked if a single dataset was done in order to decide if the agent is done (there are multiple datasets per agent).
- We should actually check if all datasets for a given agent are done. This is implemented in the current version.

Overall this would be much simpler using a relational DB, but current level of complexity was not expected when the first version of mquery was released 🤔 . There's always #74 but I fear that migration to another database would be a huge burden for everyone involved (especially me). Current data model is bad but workable after all 🤔

src/daemon.py

Co-authored-by: Michał Praszmo <michalpr@cert.pl>

msm-code changed the title ~~Alleaged fix for the multi agent job completion~~ Fix for multi agent job completion Aug 31, 2022

Alleaged fix for the multi agent job completion

e0661c4

msm-code force-pushed the fix/multi-agent-job-completion branch from b6940cf to e0661c4 Compare December 19, 2022 23:03

msm-code marked this pull request as ready for review December 19, 2022 23:27

msm-code mentioned this pull request Dec 19, 2022

Fix #270: Finish job with several parallel daemon groups #271

Closed

4 tasks

msm-code requested a review from nazywam December 19, 2022 23:31

Actually fix multi agent job completoin at the source

3fd804e

nazywam approved these changes Dec 21, 2022

View reviewed changes

src/daemon.py Outdated Show resolved Hide resolved

Update src/daemon.py

3a2ce43

Co-authored-by: Michał Praszmo <michalpr@cert.pl>

msm-code merged commit fd028d2 into master Dec 21, 2022

msm-code deleted the fix/multi-agent-job-completion branch December 21, 2022 16:04

msm-code mentioned this pull request Feb 9, 2023

Release v1.4.0 #354

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for multi agent job completion #282

Fix for multi agent job completion #282

msm-code commented Aug 30, 2022 •

edited

msm-code commented Dec 4, 2022

msm-code commented Dec 19, 2022

msm-code commented Dec 20, 2022

Fix for multi agent job completion #282

Fix for multi agent job completion #282

Conversation

msm-code commented Aug 30, 2022 • edited

msm-code commented Dec 4, 2022

msm-code commented Dec 19, 2022

msm-code commented Dec 20, 2022

msm-code commented Aug 30, 2022 •

edited