
WP-190 Client Token Mutex #882

Open · wants to merge 5 commits into main

Conversation

happycodemonkey (Member)

Overview

First attempt was not quite right, so trying again.

Related

Changes

Moved lock to the class level as suggested

Testing

UI

Notes


codecov bot commented Oct 10, 2023

Codecov Report

Merging #882 (f235701) into main (e47ea96) will increase coverage by 0.00%.
Report is 1 commit behind head on main.
The diff coverage is 20.00%.


@@           Coverage Diff           @@
##             main     #882   +/-   ##
=======================================
  Coverage   63.44%   63.44%           
=======================================
  Files         427      427           
  Lines       12215    12218    +3     
  Branches     2510     2512    +2     
=======================================
+ Hits         7750     7752    +2     
- Misses       4259     4260    +1     
  Partials      206      206           
Flag        Coverage Δ
javascript  69.67% <ø> (ø)
unittests   57.21% <20.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files                               Coverage Δ
server/portal/apps/auth/models.py   67.92% <20.00%> (-0.08%) ⬇️

chandra-tacc (Collaborator) previously approved these changes Oct 10, 2023


LGTM. Auth worked and did not regress. I did not test the thread locking, since I only had a single-user scenario.

More testing is definitely needed. Some testing notes:

  • Ensure the lock does not funnel only one request to Tapis. Since the lock is defined at the same level where the model is associated to a user, I'm assuming not; but it's good to test when pounding on this with multiple users.
  • Make a lot of dispatch requests at expiry time and see if this works.

nathanfranklin (Member)

Testing is super tricky. With a single user, you should hit this issue when you come back to your local portal after the token has expired (so you have to have it up for a long time) and then have a bunch of requests (for the different endpoints the SPA needs) to process, which all realize the token is expired and that refresh_token needs to be called. That said, the Tapis v3 refresh_token call took up to ~10 seconds a few months ago, but I'm not certain of the current status in Tapis.

For the block

                        client.refresh_tokens()

do we need to include the logic that determines whether we need to refresh tokens inside the with lock:? Not certain, but it looks like currently all the requests would each wait for the lock and still end up calling refresh_tokens().

happycodemonkey (Member, Author)

@nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, do you think?

thanks for all the feedback on this, folks

rstijerina (Member) left a comment

To test this locally, you just need to expire your token:

    from django.contrib.auth.models import User

    u = User.objects.get(username='sal')
    u.tapis_oauth.expires_in = 1
    u.tapis_oauth.save()

@@ -81,7 +83,8 @@ def client(self):
        with transaction.atomic():
            if self.expired:
                try:
                    with lock:
                        client.refresh_tokens()
Member:

Right now this throws an error: NameError: name 'lock' is not defined. I believe it might need to be in the context of the transaction, but either way it does work if the lock is instantiated right before with lock is referenced. Can we move it here? i.e.:

                try:
                    lock = Lock()
                    with lock:
                        client.refresh_tokens()

Member:

the self is missing, right? with self.lock

Member:

Ah yep, that's it.

Member Author:

Oh geeze, thank you, let me fix that and push it up. Rookie mistake on my part. Thanks Sal!

Collaborator:

> either way it does work if the lock is instantiated right before with lock is referenced.

Just FYI for the future - we should not do this; it would not share the lock between threads. Always have it at the class level or in some shared resource.
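A minimal illustration of why the lock has to live at the class level (TokenModel is a hypothetical stand-in, not the portal's model):

```python
import threading

class TokenModel:
    # One Lock object, shared by every instance and every thread.
    lock = threading.Lock()

    def refresh_local_lock(self):
        # BUG: each call creates a brand-new Lock, so it never blocks anyone.
        lock = threading.Lock()
        with lock:
            pass

    def refresh_shared_lock(self):
        # self.lock resolves to the single class attribute.
        with self.lock:
            pass

a, b = TokenModel(), TokenModel()
print(a.lock is b.lock)  # True: both instances synchronize on the same object
```

A lock created inside the method is a fresh object per call, so `with lock:` always succeeds immediately and provides no mutual exclusion at all.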

Member:

Great point @chandra-tacc, thanks

rstijerina (Member)

> @nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, do you think?
>
> thanks for all the feedback on this, folks

Good point from Nathan, we should probably lock it when checking expiry

happycodemonkey (Member, Author)

> @nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, do you think?

> Good point from Nathan, we should probably lock it when checking expiry

Sounds good, I'll move the expired check and fix the other issue you pointed out.

happycodemonkey (Member, Author)

I've updated to address the various comments here, lmk how it looks.

            expires_in=client.access_token.expires_in().total_seconds())
        try:
            with self.lock:
                if self.expired:
Member:

I think select_for_update might need to be utilized here somehow for expired, and the self.update() would need to be moved within the with self.lock block as well. Then the behavior is that we lock things, check if it's expired, and if expired, we update and then release the lock.

Member Author:

Gotcha, I'll see if I can figure this out.

Collaborator:

This will acquire a lock even if the token is not expired. It is aggressive, but guarantees correctness (which is what we want).

Member:

True, good point 🤔. If it's too aggressive, could we only use lock/select_for_update after a first check of self.expired? So basically: we check if we think it is expired via self.expired, then get a lock and re-read expired via select_for_update to determine if it is really expired (and not that someone else has just updated it) before updating it.

Collaborator:

> self.update() would need to be moved within the with self.lock block as well.

Good catch - that is a bug: this code is updating even if it is not expired, which will cause the token to never expire on our end. A second related point: refresh_token is already doing self.update(), so we do not need to do it again here.

On checking self.expired twice (once before and once after locking): we could do this double-checked locking:

    if self.expired:
        with self.lock:
            if self.expired:

A side note: in many other instances we moved away from this double-checked locking due to bad compiler and runtime optimizations on this pattern. Hence, aggressive locking is sometimes OK, and even a redundant quick API call is OK if it avoids correctness issues.

Member Author:

Thanks all. I've updated this so that it checks expired both before and after the lock, and I removed that extra update call. Let me know if you notice anything else that should be modified or if I misunderstood anything.

@chandra-tacc chandra-tacc self-requested a review October 10, 2023 19:07
chandra-tacc (Collaborator)

Removing my approval - bad review on my part. Thanks folks for the reviews.

        try:
            if self.expired:
                with self.lock:
                    if self.expired:
Member:

Looks good, but something (refresh_from_db?) is needed to ensure that the model's expires_in and created have up-to-date values after the lock is acquired.

Collaborator:

I used refresh_from_db and requests waiting for the lock are still seeing old data.

Apart from that, I tested this by adding log lines. We will need unit tests in this area.

  1. Test with no expiry: nothing is updated in the db.
  2. Test with forced expiry of the token: the token is refreshed but the db is not updated.

The second test failed. The update below is needed to fix that:

    if self.expired:
        with self.lock:
            self.refresh_from_db()
            if self.expired:
                try:
                    client.refresh_tokens()
                except Exception:
                    logger.info('Tapis Token refresh failed User id %d', self.user.id)
                    raise
                self.update(created=int(time.time()),
                            access_token=client.access_token.access_token,
                            expires_in=client.access_token.expires_in().total_seconds())

There are 3 adjustments needed to make the test pass:

  • the try/except is specific to client.refresh_tokens() (keeps this on par with the existing logic)
  • self.update is needed to update the db
  • self.refresh_from_db is needed to update the model

Even with the above code, the multi-thread scenario still requested tokens. The failing test case:

  • force expiry
  • refresh the browser (sends multiple actions at once)
  • each action refreshes the token (see the created time in the db; the token creation time keeps updating)

The refresh_from_db does not update waiting threads. I'm looking further.

@chandra-tacc chandra-tacc dismissed their stale review October 16, 2023 15:47

reviewing new changes

chandra-tacc (Collaborator) left a comment


Need more changes based on my comments in: https://github.com/TACC/Core-Portal/pull/882/files#r1362251861


jarosenb commented Oct 27, 2023

Hi all! Sorry to barge into this conversation so late, but I think the thread-based locking approach will have issues in production. uWSGI runs multiple processes that are each multithreaded, so the lock may be bypassed if 2 consecutive requests hit 2 different workers. uWSGI actually includes its own cross-process lock: https://uwsgi-docs.readthedocs.io/en/latest/Locks.html. I don't want to rely on it, though, since it couples our business logic to our choice of server (and I want us to deploy stuff with gunicorn eventually, lol).

The lock should probably be acquired at the database level since we can use it as a bottleneck no matter how many threads/processes there are. Maybe something like:

with transaction.atomic():
    # SemaphoreModel can just be a standard db model with a couple attributes
    lock = SemaphoreModel.objects.select_for_update().filter(id="token_refresh_lock")
    lock.update(busy=True)
    # stuff we don't want to happen concurrently goes here
    lock.update(busy=False)

(disclaimer: haven't actually tested this, but I don't think the status of the lock matters as long as Django restricts access while it's acquired)

edit: I wonder if this would work if instead of having a global lock we add the username to the ID, so multiple calls can happen concurrently but only one per user.

chandra-tacc (Collaborator)

> Hi all! sorry to barge into this conversation so late, but I think the thread-based locking approach will have issues in production. [...]
>
> edit: I wonder if this would work if instead of having a global lock we add the username to the ID, so multiple calls can happen concurrently but only one per user.

Thanks @jarosenb for the context on multiple workers. Really nice solution with the db lock, and yes, we want to isolate the lock to a specific user rather than using a global lock (which would otherwise become a bottleneck).


jarosenb commented Oct 27, 2023

I actually really like the approach in django-db-mutex: https://github.com/ambitioninc/django-db-mutex/blob/develop/db_mutex/db_mutex.py#L98

Instead of wrapping the entire context in an atomic transaction, they just use one for writing to a "mutex" table with a uniqueness constraint on the ID. Then if another request tries to access the lock it throws an error due to that constraint. That makes a lot of sense for token refresh since it would be better to catch the exception and no-op if a refresh is in progress, rather than serially process a bunch of redundant calls.
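The uniqueness-constraint idea can be sketched with plain sqlite3. This is a hypothetical illustration (table name, lock-id format, and usernames are made up); django-db-mutex does the equivalent through the ORM with an atomic write to a mutex table.

```python
import sqlite3

# A mutex table whose primary key is the lock id: inserting an existing id
# violates the constraint, which signals "refresh already in progress".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_mutex (lock_id TEXT PRIMARY KEY)")

def acquire(username):
    try:
        conn.execute("INSERT INTO db_mutex (lock_id) VALUES (?)",
                     (f"token_refresh_{username}",))
        return True
    except sqlite3.IntegrityError:
        return False   # someone else holds the lock; caller can no-op

def release(username):
    conn.execute("DELETE FROM db_mutex WHERE lock_id = ?",
                 (f"token_refresh_{username}",))

print(acquire("sal"))  # True: first request wins and does the refresh
print(acquire("sal"))  # False: concurrent request no-ops instead of waiting
print(acquire("bob"))  # True: per-user lock ids keep users independent
release("sal")
release("bob")
```

Keying the lock id on the username gives the per-user behavior discussed above: concurrent refreshes for different users proceed, while a second refresh for the same user fails fast and can simply skip the redundant call.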
