
WP-190 Client Token Mutex #882

Open · wants to merge 5 commits into main

Conversation

happycodemonkey (Member)

Overview

First attempt was not quite right, so trying again.

Related

Changes

Moved lock to the class level as suggested

Testing

UI

Notes


codecov bot commented Oct 10, 2023

Codecov Report

Merging #882 (f235701) into main (e47ea96) will increase coverage by 0.00%.
Report is 1 commit behind head on main.
The diff coverage is 20.00%.


@@           Coverage Diff           @@
##             main     #882   +/-   ##
=======================================
  Coverage   63.44%   63.44%           
=======================================
  Files         427      427           
  Lines       12215    12218    +3     
  Branches     2510     2512    +2     
=======================================
+ Hits         7750     7752    +2     
- Misses       4259     4260    +1     
  Partials      206      206           
Flag        Coverage Δ
javascript  69.67% <ø> (ø)
unittests   57.21% <20.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files                               Coverage Δ
server/portal/apps/auth/models.py   67.92% <20.00%> (-0.08%) ⬇️

chandra-tacc (Collaborator) previously approved these changes Oct 10, 2023


LGTM. Auth worked and did not regress. I did not test the thread locking, since I only had a single-user scenario.

More testing is definitely needed. Some testing notes:

  • Ensure the lock does not funnel only one request to Tapis. Since the lock is defined at the same level where the model is associated to a user, I'm assuming not; but it's good to test when pounding on this with multiple users.
  • Make a lot of dispatch requests at expiry time and see if this works.

nathanfranklin (Member)

Testing is super tricky. With a single user, you should hit this issue when you come back to your local portal after the token has expired (so you have to have it up for a long time) and then have a bunch of requests (for the different endpoints the SPA needs) to process, which all realize the token is expired and that refresh_token needs to be called. That said, the Tapis v3 refresh_token call took up to ~10 seconds a few months ago, but I'm not certain of the current status in Tapis.

For the block

                        client.refresh_tokens()

do we need to include the logic that determines whether we need to refresh tokens inside the with lock:? Not certain, but it looks like currently all the requests would each wait for the lock and still end up calling refresh_tokens().

happycodemonkey (Member, Author)

@nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, do you think?

thanks for all the feedback on this, folks

rstijerina (Member) left a comment

To test this locally, you just need to expire your token:

    from django.contrib.auth.models import User

    u = User.objects.get(username='sal')
    u.tapis_oauth.expires_in = 1
    u.tapis_oauth.save()

@@ -81,7 +83,8 @@ def client(self):
        with transaction.atomic():
            if self.expired:
                try:
                    with lock:
                        client.refresh_tokens()
Member:

Right now this throws an error: NameError: name 'lock' is not defined. I believe it might need to be in the context of the transaction, but either way it does work if the lock is instantiated right before with lock is referenced. Can we move it here? i.e.:

                try:
                    lock = Lock()
                    with lock:
                        client.refresh_tokens()

Member:

the self is missing, right? with self.lock

Member:

Ah yep, that's it.

Member Author:

Oh geeze, thank you, let me fix that and push it up. Rookie mistake on my part. Thanks Sal!

Collaborator:

> either way it does work if the lock is instantiated right before with lock is referenced.

Just FYI for the future - we should not do this; it would not share the lock between threads. Always have it at the class level or in some shared resource.
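A minimal illustration of why the lock has to live at the class level (TokenModel is a hypothetical stand-in, not the portal's model):

```python
import threading

class TokenModel:
    # One Lock object, shared by every instance and every thread.
    lock = threading.Lock()

    def refresh_local_lock(self):
        # BUG: each call creates a brand-new Lock, so it never blocks anyone.
        lock = threading.Lock()
        with lock:
            pass

    def refresh_shared_lock(self):
        # self.lock resolves to the single class attribute.
        with self.lock:
            pass

a, b = TokenModel(), TokenModel()
print(a.lock is b.lock)  # True: both instances synchronize on the same object
```

A lock created inside the method is a fresh object per call, so `with lock:` always succeeds immediately and provides no mutual exclusion at all.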

Member:

Great point @chandra-tacc, thanks

rstijerina (Member)

> @nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, do you think?
>
> thanks for all the feedback on this, folks

Good point from Nathan, we should probably lock it when checking expiry

happycodemonkey (Member, Author)

> @nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, do you think?

> Good point from Nathan, we should probably lock it when checking expiry

Sounds good, I'll move the expired check and fix the other issue you pointed out.

happycodemonkey (Member, Author)

I've updated to address the various comments here, lmk how it looks.

            expires_in=client.access_token.expires_in().total_seconds())
        try:
            with self.lock:
                if self.expired:
Member:

I think select_for_update might need to be utilized here somehow for expired, and the self.update() would need to be moved within the with self.lock block as well. Then the behavior is that we lock things, check if it's expired, and if expired, we update and then release the lock.

Member Author:

Gotcha, I'll see if I can figure this out.

Collaborator:

This will acquire a lock even if the token is not expired. It is aggressive, but guarantees correctness (which is what we want).

Member:

True, good point 🤔. If it's too aggressive, could we only use lock/select_for_update after a first check of self.expired? So basically: we check if we think it is expired via self.expired, then get a lock and re-read expired via select_for_update to determine if it is really expired (and not that someone else has just updated it) before updating it.

Collaborator:

> self.update() would need to be moved within the with self.lock block as well.

Good catch - that is a bug: this code is updating even if it is not expired, which will cause the token to never expire on our end. A second related point: refresh_token is already doing self.update(), so we do not need to do it again here.

On checking self.expired twice (once before and once after locking): we could do this double-checked locking:

    if self.expired:
        with self.lock:
            if self.expired:

A side note: in many other instances we moved away from this double-checked locking due to bad compiler and runtime optimizations on this pattern. Hence, aggressive locking is sometimes OK, and even a redundant quick API call is OK if it avoids correctness issues.

Member Author:

Thanks all. I've updated this so that it checks expired both before and after the lock, and I removed that extra update call. Let me know if you notice anything else that should be modified or if I misunderstood anything.

@chandra-tacc chandra-tacc self-requested a review October 10, 2023 19:07
chandra-tacc (Collaborator)

Removing my approval - bad review on my part. Thanks folks for the reviews.

        try:
            if self.expired:
                with self.lock:
                    if self.expired:
Member:

Looks good, but something (refresh_from_db?) is needed to ensure that the model's expires_in and created have up-to-date values after the lock is acquired.

Collaborator:

I used refresh_from_db and requests waiting for the lock are still seeing old data.

Apart from that, I tested this by adding log lines. We will need unit tests in this area.

  1. Test with no expiry: nothing is updated in the db.
  2. Test with forced expiry of the token: the token is refreshed but the db is not updated.

The second test failed. The update below is needed to fix that:

    if self.expired:
        with self.lock:
            self.refresh_from_db()
            if self.expired:
                try:
                    client.refresh_tokens()
                except Exception:
                    logger.info('Tapis Token refresh failed User id %d', self.user.id)
                    raise
                self.update(created=int(time.time()),
                            access_token=client.access_token.access_token,
                            expires_in=client.access_token.expires_in().total_seconds())

There are 3 adjustments needed to make the test pass:

  • the try/except is specific to client.refresh_tokens() (keeps this on par with the existing logic)
  • self.update is needed to update the db
  • self.refresh_from_db is needed to update the model

Even with the above code, the multi-thread scenario still requested tokens. The failing test case:

  • force expiry
  • refresh the browser (sends multiple actions at once)
  • each action refreshes the token (see the created time in the db; the token creation time keeps updating)

The refresh_from_db does not update waiting threads. I'm looking further.

@chandra-tacc chandra-tacc dismissed their stale review October 16, 2023 15:47

reviewing new changes

chandra-tacc (Collaborator) left a comment


Need more changes based on my comments in: https://github.com/TACC/Core-Portal/pull/882/files#r1362251861


jarosenb commented Oct 27, 2023

Hi all! Sorry to barge into this conversation so late, but I think the thread-based locking approach will have issues in production. uWSGI runs multiple processes that are each multithreaded, so the lock may be bypassed if 2 consecutive requests hit 2 different workers. uWSGI actually includes its own cross-process lock: https://uwsgi-docs.readthedocs.io/en/latest/Locks.html. I don't want to rely on it, though, since it couples our business logic to our choice of server (and I want us to deploy stuff with gunicorn eventually, lol).

The lock should probably be acquired at the database level since we can use it as a bottleneck no matter how many threads/processes there are. Maybe something like:

with transaction.atomic():
    # SemaphoreModel can just be a standard db model with a couple attributes
    lock = SemaphoreModel.objects.select_for_update().filter(id="token_refresh_lock")
    lock.update(busy=True)
    # stuff we don't want to happen concurrently goes here
    lock.update(busy=False)

(disclaimer: haven't actually tested this, but I don't think the status of the lock matters as long as Django restricts access while it's acquired)

edit: I wonder if this would work if instead of having a global lock we add the username to the ID, so multiple calls can happen concurrently but only one per user.

chandra-tacc (Collaborator)

> Hi all! sorry to barge into this conversation so late, but I think the thread-based locking approach will have issues in production. [...]
>
> edit: I wonder if this would work if instead of having a global lock we add the username to the ID, so multiple calls can happen concurrently but only one per user.

Thanks @jarosenb for the context on multiple workers. Really nice solution with the db lock, and yes, we want to isolate the lock to a specific user rather than using a global lock (which would otherwise become a bottleneck).


jarosenb commented Oct 27, 2023

I actually really like the approach in django-db-mutex: https://github.com/ambitioninc/django-db-mutex/blob/develop/db_mutex/db_mutex.py#L98

Instead of wrapping the entire context in an atomic transaction, they just use one for writing to a "mutex" table with a uniqueness constraint on the ID. Then if another request tries to access the lock it throws an error due to that constraint. That makes a lot of sense for token refresh since it would be better to catch the exception and no-op if a refresh is in progress, rather than serially process a bunch of redundant calls.
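The uniqueness-constraint idea can be sketched with plain sqlite3. This is a hypothetical illustration (table name, lock-id format, and usernames are made up); django-db-mutex does the equivalent through the ORM with an atomic write to a mutex table.

```python
import sqlite3

# A mutex table whose primary key is the lock id: inserting an existing id
# violates the constraint, which signals "refresh already in progress".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_mutex (lock_id TEXT PRIMARY KEY)")

def acquire(username):
    try:
        conn.execute("INSERT INTO db_mutex (lock_id) VALUES (?)",
                     (f"token_refresh_{username}",))
        return True
    except sqlite3.IntegrityError:
        return False   # someone else holds the lock; caller can no-op

def release(username):
    conn.execute("DELETE FROM db_mutex WHERE lock_id = ?",
                 (f"token_refresh_{username}",))

print(acquire("sal"))  # True: first request wins and does the refresh
print(acquire("sal"))  # False: concurrent request no-ops instead of waiting
print(acquire("bob"))  # True: per-user lock ids keep users independent
release("sal")
release("bob")
```

Keying the lock id on the username gives the per-user behavior discussed above: concurrent refreshes for different users proceed, while a second refresh for the same user fails fast and can simply skip the redundant call.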
