Add timeout to index plays lock #912

Merged
merged 2 commits into master from jowlee-timeout-plays on Oct 9, 2020

Conversation

Contributor

@jowlee commented Oct 9, 2020

Trello Card Link

N/A

Description

The job to index plays on the discovery provider occasionally gets stuck because the lock cannot be acquired.
This change adds a 10-minute timeout to the Redis lock for plays, so indexing will be stuck for 10 minutes at most.
NOTE: if the job were to run twice concurrently, meaning a run for some reason takes over 10 minutes to make the queries that update the plays table, the play counts in the table could become inconsistent.
Something interesting to note: I queried DP 1 for the message Failed to acquire update_play_count_lock and it seems to show up for about 4 minutes every 3 hours. I think this is because the DP restarts in Kubernetes? There are logs for how long the job takes, but there was a bug with how it was logged, so I can't get the average or max time that indexing plays takes.

[Screenshot: Screen Shot 2020-10-09 at 2 14 54 PM]
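For reference, a minimal sketch of the locking pattern described above, assuming the discovery provider's redis-py client; the lock name matches the log message quoted above, but the surrounding task structure and the `update_plays_table` helper are illustrative placeholders rather than the actual indexing code:

```python
from redis import Redis

redis_client = Redis()

def update_plays_table():
    # Hypothetical stand-in for the queries that update the plays table.
    pass

def index_plays():
    # timeout=600 puts a 10-minute TTL on the lock, so a hung or crashed run
    # can block subsequent runs for at most 10 minutes.
    update_lock = redis_client.lock("update_play_count_lock", timeout=600)
    have_lock = update_lock.acquire(blocking=False)
    if not have_lock:
        print("Failed to acquire update_play_count_lock")
        return
    try:
        update_plays_table()
    finally:
        # Only release if this run still owns the lock; after the TTL Redis
        # may have expired it (or another run may now hold it).
        if update_lock.owned():
            update_lock.release()
```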

Services

Discovery Provider

Does it touch a critical flow like Discovery indexing, Creator Node track upload, Creator Node gateway, or Creator Node file system?

  • ✅ Nope

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide repro instructions & any configuration.
Include log analysis if applicable.

I ran it locally and added a time.sleep to the indexing job to force it to take a while, then checked that the lock was released.
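Roughly, the local test described above could look like the following sketch (the short TTL and sleep here are purely for illustration; the real change uses a 10-minute timeout):

```python
import time
from redis import Redis

redis_client = Redis()

# Acquire the lock with a short TTL, then sleep past it to simulate a slow
# indexing run and confirm Redis released the lock on its own.
update_lock = redis_client.lock("update_play_count_lock", timeout=5)
assert update_lock.acquire(blocking=False)
time.sleep(10)
# Once the TTL has passed, the lock key is gone and a new run can acquire it.
assert redis_client.get("update_play_count_lock") is None
assert not update_lock.owned()
```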

Member

@raymondjacobson left a comment

If the lock times out, shouldn't we try and cancel the db writes to the plays table?

Contributor Author

@jowlee commented Oct 9, 2020

If the lock times out, shouldn't we try and cancel the db writes to the plays table?

Good callout, I passed the lock through and checked if it's owned before trying to write. I tested this by placing wait calls so the lock would expire and confirming that the ownership check returned false.
There's also an extend method on the lock to extend the timeout while writing, but I wasn't sure it was necessary, since the next job could still start the moment after the lock was checked for ownership but before the write happened.
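A rough sketch of the ownership check being described, assuming redis-py's `Lock.owned()`; the function names and the `session`/`play_counts` arguments are hypothetical, not the actual indexing code:

```python
def write_play_counts(session, play_counts):
    # Hypothetical stand-in for the db writes to the plays table.
    pass

def save_play_counts(session, update_lock, play_counts):
    # If the 10-minute TTL has already expired, another run may now hold the
    # lock, so skip the write rather than risk double-counting plays.
    if not update_lock.owned():
        return
    # redis-py also exposes update_lock.extend(additional_time) to push the
    # TTL out before a long write, per the comment above, though it doesn't
    # close the window between this ownership check and the write itself.
    write_play_counts(session, play_counts)
```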

Contributor

@dmanjunath commented Oct 9, 2020

@jowlee so blocking_timeout is how long it spends trying to acquire the lock, and timeout is the TTL on the lock itself. So before, we could end up in a situation where we never released the lock if we got into a bad state; now it expires after 10 minutes regardless of whether the task finishes or not? Also, this gets run once a minute, right?
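For reference, a small sketch of the distinction being drawn here, assuming redis-py's lock API (the blocking_timeout value is illustrative; only the 10-minute timeout comes from this change):

```python
from redis import Redis

redis_client = Redis()

lock = redis_client.lock(
    "update_play_count_lock",
    blocking_timeout=25,  # how long acquire() will wait trying to take the lock
    timeout=600,          # TTL on the lock key itself; Redis expires it after 10 minutes
)
```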

Member

@raymondjacobson left a comment

Nice, this makes a lot of sense.

@dmanjunath you're correct -- we run every minute. On the whole, things should be fast enough, but during downtime or an outage the lock may never have gotten released, so this change means we fall at most 10 minutes behind.

I'm going to open a separate PR to add a health check for this indexing.

@raymondjacobson
Member

Going to merge this & deploy to staging so we can start testing with it!

@raymondjacobson merged commit d8f5a52 into master on Oct 9, 2020
@raymondjacobson deleted the jowlee-timeout-plays branch on October 9, 2020 at 22:59