Improve query performance during indexing #212

Open
1 of 4 tasks
msm-code opened this issue Feb 18, 2023 · 2 comments · May be fixed by #213
msm-code commented Feb 18, 2023

Currently, running a query gets very slow if an indexing operation is in progress at the same time. This is (probably) because of how disk queues work: indexing is very disk-heavy and saturates the disk with reads of new files to index.

In practice, indexing new files is less important than responding to queries quickly; ideally, running a query should always take priority. I think we can solve this with Linux's IO priorities: https://www.kernel.org/doc/html/latest/block/ioprio.html.

Things to do:

  • Create a benchmark: measure query performance without indexing and during indexing. It doesn't have to be very precise, but it must show that performance during indexing is significantly degraded.
  • Investigate whether the ioprio_set/ioprio_get syscalls can be used to work around this issue per worker.
  • Run the benchmark again and make sure query performance is better (and that indexing performance is not hugely impacted, though I don't expect it to be).
  • Hopefully this solves the issue; if not, we can consider other measures (for example, pausing all indexing workers while processing a query).
@msm-code msm-code self-assigned this Feb 18, 2023
@msm-code msm-code linked a pull request Feb 18, 2023 that will close this issue
@msm-code commented:

Benchmark: compacting a big dataset collection while querying the DB at the same time. All tests were done after dropping the VM cache, and repeated 3 times.

  1. Performance when not compacting (baseline, best-case performance):
  • 0:39
  • 0:40
  • 0:40
  2. Performance when compacting (master):
  • 1:03
  • 1:07
  • 1:11
  3. Performance when compacting (after Lower iopriority when indexing to IDLE #213):
  • 1:17
  • 1:13
  • 1:09

Yeah, on average the database got slower. But I realised that's because IO priority is applied per process, not per thread, and ursadb is a single (multi-threaded) process. So I can't actually do what I hoped to do.

But that's not all. I tried to work around this by running a second ursadb process (a "slow" process just for compacting), and the results are:

  4. Performance when compacting (after Lower iopriority when indexing to IDLE #213, with a separate process):
  • 1:11
  • 1:24
  • 1:16

And this is... even slower? This is surprising to me; I think this time all the priorities were set the way I wanted them to be.

But just to be sure, I ran a second DB again, this time changing the priority manually with ionice, and:

  5. Performance when compacting (after Lower iopriority when indexing to IDLE #213, with a separate process started by ionice -c 3):
  • 1:10
  • 1:10
  • 1:15

I have to say this is underwhelming.

I also suspect that a big part of the slowdown comes from the OS disk cache being filled with useless (never-used-again) data. Maybe I should experiment with MADV_DONTNEED instead?

Anyway, this approach looks more challenging than I suspected. I need to ponder this topic a bit more 🤔

@msm-code commented:

Initial tests suggest that adding fadvise may not help either:

  • 1:03
  • 1:16
  • 1:15
