
Fix Possible Deadlock caused by Thread Explosion #1175

Merged
jasoncoolmax merged 2 commits into dev from jason/fixThreadExplosionDeadlock on Dec 18, 2020

Conversation

@jasoncoolmax jasoncoolmax (Member) commented Dec 10, 2020

Proposed changes

(Main changes are in common core: AzureAD/microsoft-authentication-library-common-for-objc#911)

There are reports that a deadlock can occur in MSAL if too many threads call MSAL APIs at the same time.

After some investigation, we found that the deadlock can happen when more than 64 threads are calling MSAL APIs.
Root cause:
There is a limit on how many GCD threads can be created for an app (64 according to my testing; see also https://www.oreilly.com/library/view/high-performance-ios/9781491910993/ch04.html). This limit can cause a deadlock in our code. For example, if a dispatch_barrier_async block is at the front of a queue and waiting to be executed, it blocks all other dispatch_sync blocks. Imagine 64 threads (the limit) are already waiting at dispatch_sync: no new thread can be created to execute the dispatch_barrier_async block, so neither the dispatch_barrier_async block nor the dispatch_sync callers can proceed, and a deadlock occurs.
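To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is illustrative only: the ExampleCache class, the queue label, and the storage dictionary are made up for the example and are not the actual MSAL/common-core code. A concurrent queue guards shared state with dispatch_barrier_async writes and dispatch_sync reads.

```objc
#import <Foundation/Foundation.h>

// Hypothetical cache guarded by a concurrent queue (reader/writer pattern).
@interface ExampleCache : NSObject
- (void)setObject:(id)object forKey:(id<NSCopying>)key;
- (id)objectForKey:(id<NSCopying>)key;
@end

@implementation ExampleCache
{
    dispatch_queue_t _synchronizationQueue;
    NSMutableDictionary *_container;
}

- (instancetype)init
{
    if ((self = [super init]))
    {
        _synchronizationQueue = dispatch_queue_create("com.example.cache", DISPATCH_QUEUE_CONCURRENT);
        _container = [NSMutableDictionary new];
    }
    return self;
}

- (void)setObject:(id)object forKey:(id<NSCopying>)key
{
    // Writer: the barrier block is enqueued asynchronously, so GCD must find
    // (or spawn) a worker thread before it can start executing.
    dispatch_barrier_async(_synchronizationQueue, ^{
        self->_container[key] = object;
    });
}

- (id)objectForKey:(id<NSCopying>)key
{
    // Reader: blocks the calling thread until the queue reaches this block.
    // If a barrier block is queued ahead of it, the reader waits for the barrier.
    __block id result = nil;
    dispatch_sync(_synchronizationQueue, ^{
        result = self->_container[key];
    });
    return result;
}

@end
```

If roughly 64 GCD worker threads are all parked inside objectForKey:'s dispatch_sync while a barrier write is queued ahead of them, GCD cannot create another worker thread to run the barrier block, and the readers and the writer block each other indefinitely.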

Solutions:

  1. According to Apple's suggestion (https://developer.apple.com/videos/play/wwdc2015/718/?time=2185), use a serial queue instead of a concurrent queue. This sacrifices some performance.
  2. Change dispatch_barrier_async to dispatch_barrier_sync. The former requires a new thread to be created to execute the block, while the latter tries to use the current thread to execute the block. This avoids the "no more threads can be created" issue (a sketch of both options follows this list).
  • I adopted solution 1 for classes that don't strictly require data consistency.
  • I adopted solution 2 for classes that require data consistency.
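Continuing the hypothetical ExampleCache sketch above (again illustrative, not the actual change in common core), the two options would look roughly like this:

```objc
// Solution 1 (serial queue): change only the queue attribute at creation time.
// A serial queue orders reads and writes by itself, so no barrier is needed,
// at the cost of losing concurrent reads.
//
//   _synchronizationQueue = dispatch_queue_create("com.example.cache", DISPATCH_QUEUE_SERIAL);

// Solution 2 (synchronous barrier): keep the concurrent queue, but replace the
// writer's dispatch_barrier_async with dispatch_barrier_sync. The barrier block
// runs on the calling thread once the queue drains, so no extra GCD worker
// thread has to be created, avoiding the "no thread available to run the
// barrier" deadlock described above.
- (void)setObject:(id)object forKey:(id<NSCopying>)key
{
    dispatch_barrier_sync(_synchronizationQueue, ^{
        self->_container[key] = object;
    });
}
```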

Type of change

  • Feature work
  • Bug fix
  • Documentation
  • Engineering change
  • Test
  • Logging/Telemetry

Risk

  • High – Errors could cause MAJOR regression of many scenarios. (Example: new large features or high level infrastructure changes)
  • Medium – Errors could cause regression of 1 or more scenarios. (Example: somewhat complex bug fixes, small new features)
  • Small – No issues are expected. (Example: Very small bug fixes, string changes, or configuration settings changes)

Additional information

@hieunguyenmsft hieunguyenmsft (Contributor) left a comment

Just curious: if the GCD thread limit is 64, when Teams reported the issue, how many threads were held by us and how many were held by Teams?
With your new code, how can you verify the issue is fixed?

@jasoncoolmax jasoncoolmax (Member, Author) commented Dec 18, 2020

Just curious: if the GCD thread limit is 64, when Teams reported the issue, how many threads were held by us and how many were held by Teams?
With your new code, how can you verify the issue is fixed?

From the crash logs, all 64 GCD threads were waiting on dispatch_sync in our code, although Teams initiated all of those MSAL calls.

Teams doesn't have a repro on their side. I have a repro by calling [MSALPublicClientApplication accountsForParameters:error:] from 100 concurrent background threads, so I can only verify that it fixes my repro :P Teams will keep monitoring the crash to see if the fix addresses their issue.
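For reference, the repro looks roughly like the following sketch. The pre-configured `application` instance and the parameter setup are placeholders, not code from this PR:

```objc
#import <MSAL/MSAL.h>

// `application` is assumed to be an already-configured MSALPublicClientApplication.
static void RunRepro(MSALPublicClientApplication *application)
{
    dispatch_queue_t backgroundQueue = dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0);

    for (NSInteger i = 0; i < 100; i++)
    {
        dispatch_async(backgroundQueue, ^{
            MSALAccountEnumerationParameters *parameters = [MSALAccountEnumerationParameters new];
            NSError *error = nil;
            // Before the fix, enough of these concurrent calls could leave every
            // available GCD worker thread blocked in dispatch_sync, deadlocking the app.
            NSArray<MSALAccount *> *accounts = [application accountsForParameters:parameters error:&error];
            (void)accounts;
        });
    }
}
```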

@jasoncoolmax jasoncoolmax merged commit de43cc4 into dev Dec 18, 2020
@jasoncoolmax jasoncoolmax deleted the jason/fixThreadExplosionDeadlock branch December 18, 2020 23:48
@oldalton oldalton mentioned this pull request Jan 14, 2021