
Priority doesn't work? #733

Closed
jbeaudoin11 opened this issue Oct 16, 2017 · 22 comments

Comments

@jbeaudoin11

Hi!

I'm having a hard time understanding how priorities work with the queue. I'm making a multimedia processing server, basically compression and resizing stuff. So I have 3 queues: image, audio and video, but I'm focusing my tests only on the image one for now.

The image queue has 2 types of jobs (for now): ResizeImage (priority: 1, high priority) and CompressImage (priority: 10, processed "after" ResizeImage). I'm doing batch jobs, so for each image those 2 jobs are added to the queue. I changed the order in which jobs are added so that it's now by priority; in this case, all ResizeImage jobs are added before any CompressImage jobs, roughly like the sketch below.
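
A minimal sketch of that setup (not my actual code), assuming Bull's standard queue.add(data, opts) API and using a "type" field to tell the two job kinds apart:

var Queue = require('bull');
var imageQueue = new Queue('image');

function enqueueBatch(imagePaths) {
  // All ResizeImage jobs first (priority 1 = highest priority)...
  imagePaths.forEach(function(path) {
    imageQueue.add({ type: 'ResizeImage', path: path }, { priority: 1 });
  });
  // ...then all CompressImage jobs (priority 10, expected to run "after").
  imagePaths.forEach(function(path) {
    imageQueue.add({ type: 'CompressImage', path: path }, { priority: 10 });
  });
}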

So what I'm expecting is that most ResizeImage jobs finish first, and only after that should the CompressImage jobs start. But it just doesn't work like that at all:

ResizeImage 1629
ResizeImage 1628
ResizeImage 1630
ResizeImage 1631
ResizeImage 1635
ResizeImage 1634
ResizeImage 1633
ResizeImage 1632
CompressImage 1679
CompressImage 1678
CompressImage 1677
CompressImage 1676
CompressImage 1674
CompressImage 1675
CompressImage 1673
CompressImage 1672
CompressImage 1671
CompressImage 1670
CompressImage 1669
CompressImage 1668
CompressImage 1667
CompressImage 1666
CompressImage 1665
CompressImage 1664
CompressImage 1663
CompressImage 1662
CompressImage 1661
CompressImage 1660
CompressImage 1659
CompressImage 1658
CompressImage 1657
CompressImage 1656
ResizeImage 1653
ResizeImage 1652
CompressImage 1655
ResizeImage 1651
ResizeImage 1650
ResizeImage 1649
ResizeImage 1648
ResizeImage 1647
CompressImage 1654
ResizeImage 1646
ResizeImage 1645
ResizeImage 1643
ResizeImage 1644
ResizeImage 1642
ResizeImage 1641
ResizeImage 1640
ResizeImage 1639
ResizeImage 1638
ResizeImage 1637
ResizeImage 1636

This is based on the queue.on("completed", ...) event with 4 workers, so I don't know how reliable that log is.
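
For reference, the log above comes from something like this (a sketch of the logging setup, not my exact code; processImageJob is a placeholder):

imageQueue.process(4, processImageJob); // 4 concurrent workers
imageQueue.on('completed', function(job) {
  console.log(job.data.type, job.id);
});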

I don't think it matters, but this is a small sample size for fast testing; we will probably get 1000+ images per batch. Also, my Node.js server is on a VM with Vagrant/VirtualBox on Ubuntu 16.04.

@manast
Member

manast commented Oct 16, 2017

I know it is quite lame, but we only have 1 test for priorities: https://github.com/OptimalBits/bull/blob/master/test/test_queue.js#L436
Is it possible that you could write a similar test that reproduces your issue? If so, I could take a look at it ASAP.

@jbeaudoin11
Author

Yeah sure, I'm going to work on that tomorrow!

@jbeaudoin11
Author

jbeaudoin11 commented Oct 17, 2017

Hmm, it's a pretty good challenge to write tests for this kind of stuff. Since it's not deterministic, I think a "probabilistic" approach is needed.

I didn't search much, but could you help me find a way to test it? Right now I've done this:

it('should process jobs by priority (non-deterministic)', function(done) {
  this.timeout(12000);

  // Each entry is the priority of a job, in completion order
  var jobsPriorityInCompletedOrder = [];
  var numJobsPerPriority = 500;

  queue.process(function(job, jobDone) {
    expect(job.id).to.be.ok;
    jobsPriorityInCompletedOrder.push(job.data.p);
    jobDone();
  });

  // Check once all jobs are completed
  var intervalId = setInterval(function() { // not sure about this implementation..
    if (jobsPriorityInCompletedOrder.length === numJobsPerPriority * 2) {
      clearInterval(intervalId);

      // TODO <---------- assert that "most" priority-1 jobs completed first
      done();
    }
  }, 10);

  // Make sure we remove the interval if there is a problem
  queue.on('failed', function() {
    clearInterval(intervalId);
    done(new Error('a job failed'));
  });

  // Add jobs to the queue ([p1, p2, p1, p2, ...])
  for (var i = 0; i < numJobsPerPriority; i++) {
    queue.add({ p: 1 }, { priority: 1 });
    queue.add({ p: 2 }, { priority: 2 });
  }
});

So I have an array where each value is the priority of a completed job; the order can be used to determine whether "most" priority-1 jobs were done before the priority-2 ones. I'm just not sure how to do this :/. What do you think? Maybe a simple correlation could work (see the sketch below), but I'm not sure. I'm going to try it and give you feedback.
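
By "a simple correlation" I mean something like this: a Pearson correlation between the completion index and the priority value (sketch only, I'm not sure yet it's the right measure; a clearly positive value would mean the priority-1 jobs mostly finished earlier):

function pearson(xs, ys) {
  var n = xs.length;
  var meanX = xs.reduce(function(a, b) { return a + b; }, 0) / n;
  var meanY = ys.reduce(function(a, b) { return a + b; }, 0) / n;
  var num = 0, dx2 = 0, dy2 = 0;
  for (var i = 0; i < n; i++) {
    var dx = xs[i] - meanX;
    var dy = ys[i] - meanY;
    num += dx * dy;
    dx2 += dx * dx;
    dy2 += dy * dy;
  }
  return num / Math.sqrt(dx2 * dy2);
}

// Correlate completion order (0, 1, 2, ...) with the priority of each job
var indices = jobsPriorityInCompletedOrder.map(function(_, i) { return i; });
var r = pearson(indices, jobsPriorityInCompletedOrder);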

And right now it doesn't test with more than one worker, which might not produce the same result.

@manast
Member

manast commented Oct 17, 2017

Why does it need to be non-deterministic? In fact it is important that the unit test is deterministic :). Can't you just use the existing test for priorities and modify it so that it reproduces your issue?

@jbeaudoin11
Author

jbeaudoin11 commented Oct 17, 2017

I feel like the existing test for priorities doesn't match reality: you wait for all jobs to be added before starting the "work", whereas in reality jobs are added on the fly.

In my example it's not deterministic, since jobsPriorityInCompletedOrder can change from test to test.

@manast
Member

manast commented Oct 17, 2017

Then you can set the process function first and add the jobs evenly, but make their completion times follow an array of delays on the order of 100 ms, since it normally takes less than 10 ms to process a single job without a delay.
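
Something along these lines (just a sketch; the delay values are only illustrative):

// Register the processor first, with an artificial ~100 ms of "work" per job,
// then add the jobs. The delays dominate the natural <10 ms processing time.
var delays = [100, 150, 200, 120, 180];
queue.process(function(job, jobDone) {
  var delay = delays[job.data.i % delays.length];
  setTimeout(jobDone, delay);
});

for (var i = 0; i < 20; i++) {
  queue.add({ p: 1, i: i }, { priority: 1 });
  queue.add({ p: 2, i: i }, { priority: 2 });
}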

@jbeaudoin11
Author

jbeaudoin11 commented Oct 17, 2017

You can go and see how I did it, it's pretty simple.
But there is a problem... the test passes :'(. The queue doesn't act the way I described in my first post.

So I'm going to try using the same technique in my code and see if there is a difference.

@manast
Member

manast commented Oct 17, 2017

Yeah, good luck. In any case these are good tests to complete the priority queue :)

@jbeaudoin11
Author

jbeaudoin11 commented Oct 17, 2017

So... I've done some quick tests in my code and this is the correlation that I got (25 jobs per priority, only 2 priorities):

[1, 1, ..., 2, 2] -> -0.36
[2, 2, ..., 1, 1] -> 0.68

I just don't understand... but at least it's consistent.

@jbeaudoin11
Author

jbeaudoin11 commented Oct 18, 2017

I feel like my problem looks a bit like #228, but the implementation has changed a lot since then, so I don't know.

I think it's more related to my environment. I'm not able to reproduce the bug outside of my server; if I test it in a new project, it works as expected.

@jbeaudoin11
Author

jbeaudoin11 commented Oct 18, 2017

So @manast... we made the choice to drop Bull. We were using Bee-Queue, and since we needed priorities we switched to Bull (it's pretty similar to Bee-Queue, so it was a simple migration).

That was a little premature; I took some time to evaluate Kue more in depth, and it doesn't fit the current architecture well, so for now we decided to stay on Bull and not use priorities.

Maybe in my free time I'll debug this, but for now I need to move forward. Sorry.

@manast
Member

manast commented Oct 18, 2017

Yeah, Bee was heavily based on an early Bull version, so it should be easy to go from Bee to Bull, but not the opposite, since the feature set has increased considerably.
Regarding issue #228: although the symptom is the same, the reason is not, since the priorities functionality has been vastly improved and simplified since then.
I wish I could do more to help you, but if you cannot create a reproducible case it is pretty much impossible to find the bug...

@jbeaudoin11
Author

jbeaudoin11 commented Oct 18, 2017

Yeah, like I said, I didn't find the source of the bug. I need to take a lot more time with this, maybe this weekend!

@jbeaudoin11
Author

Alright, I took my Friday night to look at this. I think I found where the problem is... For some reason there is a problem with addJob-6.lua.

This is my debugging code:

...
    -- Priority add: record the job in the priority zset, then work out where
    -- it belongs inside the wait list.
    redis.call("ZADD", KEYS[6], priority, jobId)
    -- count = number of jobs whose priority value is <= this job's priority
    local count = redis.call("ZCOUNT", KEYS[6], 0, priority)
    redis.call("RPUSH", "addJob:count", count) -- DEBUG

    local len = redis.call("LLEN", target)
    redis.call("RPUSH", "addJob:len", len) -- DEBUG

    -- Find the job currently sitting at the computed insertion point
    local id = redis.call("LINDEX", target, len - (count-1))
    redis.call("RPUSH", "addJob:id", id) -- DEBUG

    if id then
      local l = redis.call("LINSERT", target, "BEFORE", id, jobId)
      redis.call("RPUSH", "addJob:lenAfter", l) -- DEBUG
    else
      -- No job found at that index: append to the end of the wait list
      redis.call("RPUSH", target, jobId)
    end
...

When I add 12 jobs ([p1, p1, ..., p2, p2]) I get this output:

(screenshot: contents of the addJob debug lists, expected vs actual)

The left one is what is expected; the right one is what I get when I run my app.

If we look closely at addJob:len, there are 3 zeros at the start; that shouldn't be the case and I can't explain why it's happening... Those zeros probably break the LINDEX, because if you look at the sizes of addJob:id and addJob:lenAfter, there are only 8 elements instead of 11...

@manast
Member

manast commented Oct 22, 2017

Thanks for your effort, I am analysing this data to see if I can understand what is going on.

@manast
Member

manast commented Oct 22, 2017

I assume the example data above was generated with a client adding jobs while a processor was also consuming jobs, right? That could explain the zeroes in the addJob:len list. It would mean that the priority zset and the queue list have got out of sync.

@manast
Member

manast commented Oct 22, 2017

I have discovered that there is a slight hazard where the zset with the priorities and the queue list get out of sync. Of course I cannot know if this is what causes your issue; it seems like such a small hazard should not be so destructive. It happens because when we use BRPOPLPUSH, which atomically moves a job from the queue to its active state, we need to make a separate call to remove the item from the priority set. If a call to addJob gets interleaved between the BRPOPLPUSH and the ZREM, then the item will not be placed in its perfect position in the queue. However, this should not happen very often, and when it happens it should not completely mess up the priorities, just make a small "mistake".
It is not possible to remove this hazard in a nice way because of the limitation in Redis of not being able to use blocking commands inside Lua scripts, so what I can do is detect that the number of items in the zset differs from the number in the queue list and in that case rebuild the priority set.
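
As a rough illustration of the detection step (not the actual fix, and assuming Bull's default key layout of bull:<queue name>:priority for the zset and bull:<queue name>:wait for the list):

var Redis = require('ioredis');
var client = new Redis();

// Returns true when the priority zset and the wait list agree in size;
// when they disagree, the priority set would need to be rebuilt.
function prioritySetInSync(queueName, cb) {
  client.zcard('bull:' + queueName + ':priority', function(err, zsetSize) {
    if (err) return cb(err);
    client.llen('bull:' + queueName + ':wait', function(err, waitLen) {
      if (err) return cb(err);
      cb(null, zsetSize === waitLen);
    });
  });
}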

@jbeaudoin11
Author

jbeaudoin11 commented Oct 22, 2017

It's hard to tell if this is the issue...

(Sorry for the close and reopen, I was on my phone and accidentally touched "comment and close".)

@manast
Member

manast commented Oct 22, 2017

Well, since what I discovered is in fact a hazard that needs to be fixed, I will make a release, and after that you can test again; if it still fails I can continue investigating.

@jbeaudoin11
Author

jbeaudoin11 commented Oct 22, 2017

I just realized: this is a hazard, so it should happen at random moments, no? In my case the pattern is always the same; there is nothing random about it.

I'm thinking of removing the processing from my micro-service (app) and making a dedicated Node.js server for the processing, which could be on its own VM. Maybe the problem is coming from there... I'm interested in the difference between clustering the queue vs having more workers (process(nbOfWorkers, ...))?

@manast
Member

manast commented Oct 22, 2017

Instead of using cluster, use the sandboxed processors feature; it is easier to use, with similar results. You should use the CPU as much as possible before scaling horizontally to more workers.
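
The two options look roughly like this (a sketch; the processor path and helper are placeholders):

// Option 1: in-process concurrency, 4 jobs handled concurrently in this process.
imageQueue.process(4, function(job) {
  return processImage(job.data); // hypothetical helper
});

// Option 2 (instead of option 1): a sandboxed processor. The file below runs in
// separate child processes, so CPU-heavy work does not block the main event loop.
// imageQueue.process(4, __dirname + '/image-processor.js');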

@stale

stale bot commented Jul 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Jul 12, 2021
stale bot closed this as completed on Jul 19, 2021