Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for multi agent job completion #282

Merged
merged 3 commits into from Dec 21, 2022
Merged

Conversation

msm-code
Copy link
Contributor

@msm-code msm-code commented Aug 30, 2022

Your checklist for this pull request

  • I've read the contributing guideline.
  • I've tested my changes by building and running mquery, and testing changed functionality (if applicable)
  • I've added automated tests for my change (if applicable, optional)
  • I've updated documentation to reflect my change (if applicable)

What is the current behaviour?
With multiple agents, jobs sometimes don't complete properly.

What is the new behaviour?
Jobs complete properly. I hope, at least.

This is a draft, because I need to set-up a proper mquery testing lab (again). Daemon is notoriously hard to reason about, and I really should test this code before the release.

This is related to PR #271, but I think this is a more correct solution - I'll explain "why" when I test this properly.

Test plan
Create a setup with multiple agent groups and ursadb, and run a single query there. It should complete properly in all cases (indepentent of how many samples are in each agent group)

Closing issues

fixes #270
fixes #301

@msm-code msm-code changed the title Alleaged fix for the multi agent job completion Fix for multi agent job completion Aug 31, 2022
@msm-code
Copy link
Contributor Author

msm-code commented Dec 4, 2022

version: '3'
services:
  web:
    restart: always
    build:
      context: .
      dockerfile: deploy/docker/web.Dockerfile
    ports:
    - "80:5000"
    links:
    - redis
    depends_on:
      - "redis"
      - "daemon1"
      - "daemon2"
      - "ursadb1"
      - "ursadb2"
  daemon1:
    restart: always
    build:
      context: .
      dockerfile: deploy/docker/daemon.Dockerfile
    command: dameon1
    volumes:
    - "/tmp/samples1:/mnt/samples"
    depends_on:
      - "redis"
      - "ursadb1"
    environment:
      - "MQUERY_BACKEND=tcp://ursadb1:9281"
  daemon2:
    restart: always
    build:
      context: .
      dockerfile: deploy/docker/daemon.Dockerfile
    command: dameon2
    volumes:
    - "/tmp/samples2:/mnt/samples"
    depends_on:
      - "redis"
      - "ursadb2"
    environment:
      - "MQUERY_BACKEND=tcp://ursadb2:9281"
  ursadb1:
    restart: always
    image: mqueryci/ursadb:v1.5.0
    ports:
    - "127.0.0.1:5001:9281"
    volumes:
    - "/tmp/samples1:/mnt/samples"
    - "/tmp/indexes1:/var/lib/ursadb"
  ursadb2:
    restart: always
    image: mqueryci/ursadb:v1.5.0
    ports:
    - "127.0.0.1:5002:9281"
    volumes:
    - "/tmp/samples2:/mnt/samples"
    - "/tmp/indexes2:/var/lib/ursadb"
  redis:
    restart: always
    image: redis

old reproducttion config

@msm-code msm-code force-pushed the fix/multi-agent-job-completion branch from b6940cf to e0661c4 Compare December 19, 2022 23:03
@msm-code
Copy link
Contributor Author

Confirmed and tested. Sorry for making everyone wait!

I was taking time with this, because the old code used to work with my multi-cluster setup and I didn't want to accidentaly break something. I think it's ready for review.

@msm-code msm-code marked this pull request as ready for review December 19, 2022 23:27
@msm-code
Copy link
Contributor Author

No, I was wrong, this did not fix the issue at all. I've tested with more agents and reproduced the issue.

Pushed a new version of this PR. There are actually two issues:

  • initial dataset creation is just wrong. This was the root cause for The same query yara rule results with different number of matches #301. If race happens (almost guarantees for 10+ agents) some agents finish the initial dateset/topology request before others even start. Because of this, job status changes from "new" to "processing" and the agents don't do anything at all
  • Not only start condition was wrong, end condition too :). But it was hard to debug (because our "database" are JSONs packed into redis):
    • The original code checked if the whole job is over to decide if the agent is done (there are multiple agents per job). So not enough agents finished.
    • My first PR checked if a single dataset was done in order to decide if the agent is done (there are multiple datasets per agent).
    • We should actually check if all datasets for a given agent are done. This is implemented in the current version.

Overall this would be much simpler using a relational DB, but current level of complexity was not expected when the first version of mquery was released 🤔 . There's always #74 but I fear that migration to another database would be a huge burden for everyone involved (especially me). Current data model is bad but workable after all 🤔

src/daemon.py Outdated Show resolved Hide resolved
Co-authored-by: Michał Praszmo <michalpr@cert.pl>
@msm-code msm-code merged commit fd028d2 into master Dec 21, 2022
@msm-code msm-code deleted the fix/multi-agent-job-completion branch December 21, 2022 16:04
@msm-code msm-code mentioned this pull request Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants