WP-190 Client Token Mutex #882
base: main
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main     #882    +/-  ##
=======================================
  Coverage   63.44%   63.44%
=======================================
  Files         427      427
  Lines       12215    12218     +3
  Branches     2510     2512     +2
=======================================
+ Hits         7750     7752     +2
- Misses       4259     4260     +1
  Partials      206      206
LGTM. Auth worked and did not regress. I did not test the thread locking, since my case was a single-user scenario.
More testing is definitely needed. Some testing notes:
- Ensure the lock does not funnel only one request to Tapis. Since the lock is defined at the same level where the model is associated to a user, I'm assuming not. But it's good to test this when pounding on it with multiple users.
- Make a lot of dispatch requests at expiry time and see if this works.
Testing is super tricky. With a single user, you should hit this issue when you come back to your local portal after the token has expired (so you have to have it up for a long time) and you then have a bunch of requests to process (for the different endpoints the SPA needs), all of which realize the token is expired and refresh_token needs to be called. That said, the Tapis v3 refresh_token call took up to ~10 seconds a few months ago, but I'm not certain what the current status is in Tapis.
For the block: do we need to include the logic that determines whether we need to refresh tokens inside the locked block?
@nathanfranklin good question actually... I'm still trying to familiarize myself with the code, but it seems like it first checks if it's expired, and if so it checks the lock and then attempts a refresh. Should we change the order, you think? Thanks for all the feedback on this, folks.
To test this, locally you just need to expire your token:
from django.contrib.auth.models import User
u = User.objects.get(username='sal')
u.tapis_oauth.expires_in = 1
u.tapis_oauth.save()
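For context on why setting expires_in = 1 forces the refresh path, here is a minimal stand-in for the expiry check (the class is hypothetical, not the actual model; the created/expires_in field names are taken from the snippets in this thread):

```python
import time

class FakeTapisOAuth:
    """Hypothetical stand-in for the model, assuming `expired` compares
    created + expires_in against the clock (not the actual portal code)."""

    def __init__(self, created, expires_in):
        self.created = created          # epoch seconds when token was issued
        self.expires_in = expires_in    # token lifetime in seconds

    @property
    def expired(self):
        return time.time() > self.created + self.expires_in

tok = FakeTapisOAuth(created=int(time.time()), expires_in=1)
print(tok.expired)   # False: the 1-second lifetime has not elapsed yet
time.sleep(1.1)
print(tok.expired)   # True: setting expires_in=1 forces quick expiry
```

With a short expires_in, the very next request sees `expired` as true and exercises the refresh/locking code without waiting hours for a real token to lapse.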
server/portal/apps/auth/models.py
Outdated
@@ -81,7 +83,8 @@ def client(self):
         with transaction.atomic():
             if self.expired:
                 try:
-                    client.refresh_tokens()
+                    with lock:
Right now this throws an error: NameError: name 'lock' is not defined. I believe it might need to be in the context of the transaction, but either way it does work if the lock is instantiated right before with lock is referenced. Can we move it here? i.e.:
try:
    lock = Lock()
    with lock:
        client.refresh_tokens()
The self is missing, right? i.e. with self.lock
Ah yep that's it
Oh geez, thank you, let me fix that and push it up. Rookie mistake on my part. Thanks Sal!
> either way it does work if the lock is instantiated right before with lock is referenced.

Just an FYI for the future: we should not do this, because it would not share the lock between threads. Always keep the lock at the class level or on some other shared resource.
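This point can be illustrated with a plain-Python sketch (class names are made up for the demo): a Lock created inside the method is a fresh object on every call and is never contended, while a class-level Lock is a single object shared by every instance and thread in the process:

```python
import threading

class SharedLockToken:
    # Class-level: defined once, so every instance (and therefore every
    # request thread in this process) contends on the same Lock object.
    lock = threading.Lock()

    def refresh(self):
        with self.lock:       # attribute lookup resolves to the class attribute
            pass              # the actual token refresh would go here

class PerCallLockToken:
    def refresh(self):
        lock = threading.Lock()   # fresh Lock per call: never shared between
        with lock:                # threads, so it provides no mutual exclusion
            pass

a, b = SharedLockToken(), SharedLockToken()
print(a.lock is b.lock)   # True: one lock shared across instances
```

The per-call version always acquires an uncontended lock, which is why it "works" in a single-threaded test while protecting nothing under real concurrency.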
Great point @chandra-tacc , thanks
Good point from Nathan, we should probably lock it when checking expiry.
Sounds good, I'll move the expired check and fix the other issue you pointed out.
I've updated to address the various comments here; let me know how it looks.
server/portal/apps/auth/models.py
Outdated
            expires_in=client.access_token.expires_in().total_seconds())
        try:
            with self.lock:
                if self.expired:
I think select_for_update might need to be utilized here somehow for expired, and the self.update() would need to be moved within the with self.lock block as well. Then the behavior is that we lock things, check whether it's expired, and if it is expired we update and then release the lock.
Gotcha, I'll see if I can figure this out.
This will acquire a lock even if the token is not expired. It is aggressive, but guarantees correctness (which is what we want).
True, good point 🤔. If it's too aggressive, could we only use lock/select_for_update after a first check of self.expired? So basically, we check if we think it is expired via self.expired, then get a lock and re-read expired again via select_for_update to determine if it is really expired (and not that someone else has just updated it) before updating it?
> the self.update() would need to be moved within the with self.lock block as well.

Good catch, that is a bug: this code is updating even if the token is not expired, which will cause it to never expire on our end. A second, related point: refresh_token is already doing self.update(), so we do not need to do it again here.
On checking self.expired twice (once before and once after locking): we could do this double-checked locking.
if self.expired:
    with self.lock:
        if self.expired:
A side note: in many other instances we moved away from this double-checked locking due to bad compiler and runtime optimizations on this pattern. Hence, aggressive locking is sometimes OK, and even a redundant quick API call is OK if it avoids correctness issues.
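The double-checked pattern above can be exercised in plain Python (a sketch, not the portal code): the cheap unlocked pre-check plus the re-check under the lock means only one of many concurrent callers performs the refresh:

```python
import threading

class Token:
    lock = threading.Lock()           # shared, class-level lock

    def __init__(self):
        self.expired = True
        self.refresh_count = 0        # counts actual refreshes, for the demo

    def client(self):
        if self.expired:              # cheap check without holding the lock
            with self.lock:
                if self.expired:      # re-check: another thread may have won
                    self.refresh_count += 1
                    self.expired = False

token = Token()
threads = [threading.Thread(target=token.client) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(token.refresh_count)   # 1: the other seven threads fail the re-check
```

Callers that don't hit the first check skip the lock entirely, which is the "not too aggressive" property being discussed; the re-check is what keeps it correct.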
Thanks all. I've updated this so that it checks expired both before and after the lock, and I removed that extra update call. Let me know if you notice anything else that should be modified or if I misunderstood anything.
Removing my approval; bad review on my part. Thanks, folks, for the reviews.
        try:
            if self.expired:
                with self.lock:
                    if self.expired:
Looks good, but something (refresh_from_db?) is needed to ensure that the model's expires_in and created have up-to-date values after the lock is acquired.
I used refresh_from_db, and requests waiting for the lock are still seeing old data.
Apart from that, I tested this by adding log lines. We will need unit tests in this area.
- Test with no expiry: nothing is updated in the db.
- Test with forced expiry of the token: the token is refreshed, but the db is not updated.
The second test failed. The update below is needed to fix that.
if self.expired:
    with self.lock:
        self.refresh_from_db()
        if self.expired:
            try:
                client.refresh_tokens()
            except Exception:
                logger.info('Tapis Token refresh failed User id %d', self.user.id)
                raise
            self.update(created=int(time.time()),
                        access_token=client.access_token.access_token,
                        expires_in=client.access_token.expires_in().total_seconds())
There are 3 adjustments needed to make the test pass:
- The try/except is specific to client.refresh_tokens() (keeps this on par with the existing logic).
- self.update() is needed to update the db.
- self.refresh_from_db() is needed to update the model.
Even with the above code, the multi-thread scenario still requested tokens. The failing test case:
- Force expiry.
- Refresh the browser (which sends multiple actions at once).
- Each action refreshes the token (see the created time in the db; the token creation time keeps updating).
refresh_from_db does not update waiting threads. I'm looking further.
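For what it's worth, the intended check / lock / refresh_from_db / re-check flow does behave correctly for plain threads in one process, as this sketch shows (a dict stands in for the database row; all names are hypothetical). That suggests the repeated refreshes seen above come from something an in-process lock cannot cover, such as multiple worker processes or reads pinned to an open transaction snapshot, rather than from the pattern itself:

```python
import threading
import time

DB = {"created": 0, "expires_in": 1}      # stand-in for the shared token row
refreshes = []                            # records every actual refresh

class Token:
    lock = threading.Lock()               # class-level, shared by all instances

    def __init__(self):
        # Each "request" builds its own model instance from the db row.
        self.created = DB["created"]
        self.expires_in = DB["expires_in"]

    @property
    def expired(self):
        return time.time() > self.created + self.expires_in

    def refresh_from_db(self):
        self.created, self.expires_in = DB["created"], DB["expires_in"]

    def client(self):
        if self.expired:
            with self.lock:
                self.refresh_from_db()    # pick up a refresh another thread committed
                if self.expired:
                    refreshes.append(1)   # "client.refresh_tokens()" happens here
                    DB.update(created=int(time.time()), expires_in=3600)
                    self.refresh_from_db()

tokens = [Token() for _ in range(8)]      # eight concurrent "requests"
threads = [threading.Thread(target=t.client) for t in tokens]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(refreshes))   # 1: only the first thread into the lock refreshed
```

In Django the "db read" under the lock only helps if it actually sees the other request's committed write, which transaction boundaries and separate uWSGI workers can prevent.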
Need more changes based on my comments in: https://github.com/TACC/Core-Portal/pull/882/files#r1362251861
Hi all! sorry to barge into this conversation so late, but I think the thread-based locking approach will have issues in production. uWSGI runs multiple processes that are each multithreaded, so it may be possible for the lock to be bypassed if 2 consecutive requests hit 2 different workers. uWSGI actually includes its own cross-process lock: https://uwsgi-docs.readthedocs.io/en/latest/Locks.html. I don't want to rely on it though, since it couples our business logic to our choice of server (and I want us to deploy stuff with gunicorn eventually, lol). The lock should probably be acquired at the database level since we can use it as a bottleneck no matter how many threads/processes there are. Maybe something like:
(Disclaimer: I haven't actually tested this, but I don't think the status of the lock matters as long as Django restricts access while it's acquired.) Edit: I wonder if this would work if, instead of having a global lock, we add the username to the ID, so multiple calls can happen concurrently but only one per user.
Thanks @jarosenb for the context on multiple workers. Really nice solution with the db lock, and yes, we want to isolate the lock to a specific user instead of a global lock (which would otherwise become a bottleneck).
I actually really like the approach in django-db-mutex: https://github.com/ambitioninc/django-db-mutex/blob/develop/db_mutex/db_mutex.py#L98. Instead of wrapping the entire context in an atomic transaction, they just use one for writing to a "mutex" table with a uniqueness constraint on the ID. Then if another request tries to acquire the lock, it throws an error due to that constraint. That makes a lot of sense for token refresh, since it would be better to catch the exception and no-op if a refresh is already in progress, rather than serially process a bunch of redundant calls.
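The uniqueness-constraint idea can be sketched outside Django with sqlite3 (the table and lock ids here are made up for illustration): acquiring the mutex is just an INSERT, and a constraint violation means another request holds it, so the caller can no-op instead of queueing:

```python
import sqlite3

def try_acquire(conn, lock_id):
    """Take the mutex by inserting a row; a uniqueness violation means
    another request already holds it, so return False and let the caller no-op."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO mutex (lock_id) VALUES (?)", (lock_id,))
        return True
    except sqlite3.IntegrityError:
        return False

def release(conn, lock_id):
    with conn:
        conn.execute("DELETE FROM mutex WHERE lock_id = ?", (lock_id,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mutex (lock_id TEXT PRIMARY KEY)")

# A per-user lock id keeps refreshes for different users from serializing
# on one global mutex, per the suggestion above.
print(try_acquire(conn, "token-refresh:sal"))   # True: acquired
print(try_acquire(conn, "token-refresh:sal"))   # False: held, so no-op
release(conn, "token-refresh:sal")
print(try_acquire(conn, "token-refresh:sal"))   # True: free again after release
```

Because the database enforces the constraint, this works across threads and worker processes alike, which is exactly the property the thread-level Lock lacks.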
Overview
First attempt was not quite right so trying again.
Related
Changes
Moved lock to the class level as suggested
Testing
UI
Notes