Atomic: SPINLOCK is not safe in iOS #2619

Closed
NachoSoto opened this Issue Dec 15, 2015 · 20 comments

@NachoSoto
Member

NachoSoto commented Dec 15, 2015

https://twitter.com/steipete/status/676851647042203648

I'll refrain from making any comments (I'll just say... if it's illegal on iOS, why is it a public API???)

Apparently pthread mutexes are faster now. We should create a benchmark to compare, and try to convert Atomic to this (also RACCompoundDisposable and RACSerialDisposable?)

@NachoSoto NachoSoto added the bug label Dec 15, 2015

@kastiglione

Member

kastiglione commented Dec 15, 2015

I'm not even sure what "illegal" means.

@joshaber

Member

joshaber commented Dec 15, 2015

@NachoSoto

Member

NachoSoto commented Dec 15, 2015

nope

@liscio

Contributor

liscio commented Dec 21, 2015

I'm just curious, as I wasn't around from the beginning, but why was OSSpinLock used for the Atomic type to begin with?

Was NSLock used and then shown to be too slow, then pthread_mutex_t was tested and found to be too slow as well? Unfortunately git doesn't have history before "move sources…"

Spinlocks are not something I'd think to use in a non-kernel, non-device-driver context, which is why I ask. If anything, I'd say that the bug here is that OSSpinLock is used at all. 😛

@liscio

Contributor

liscio commented Dec 21, 2015

To add a counterpoint to my comment above, this isn't as big a deal as the title might make it out to be: https://twitter.com/Catfish_Man/status/676854531615883265

Even under CPU starvation that critical section is so tiny it’s likely ok in real situations.
—@Catfish_Man on Twitter

As for the comment above from @kastiglione:

I'm not even sure what "illegal" means.

I don't have the answer, but I would guess that something in the ARM CPU doesn't allow the spinlock primitive to enjoy many of the same benefits/guarantees/assumptions as it does on Intel CPUs. (But likely not all of them, according to https://twitter.com/Catfish_Man/status/676851988596809728)

Anyway, there's likely "nothing to see here" in practice, and instead some investigation should be done to replace with the safer primitive and measure for any performance regressions, as @NachoSoto suggests.

@jspahrsummers

Member

jspahrsummers commented Dec 22, 2015

I do think we should replace our uses of OSSpinLock, but I think our/my reasoning for using them was valid given the documentation available to us.

@liscio

Contributor

liscio commented Dec 22, 2015

Yeah, and thanks for the relevant history commits. Not sure why GitHub doesn't do a git log --follow when history is requested for a given file.

At any rate, I will reiterate that "it's probably fine." Using a "now faster" mutex is probably a straightforward task, however it is less so without knowing how you profiled / tested them to begin with. Got any suggestions/pointers there?

As for a way forward if/when performance becomes a concern, lock-free queues are worth investigating. (Here's a pretty good article for readers unfamiliar with the concept: http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-queue-ring-buffer?page=0,0). I built something similar for an internal audio library a while ago, and I may be able to adapt it for general use like this.

@jspahrsummers

Member

jspahrsummers commented Dec 22, 2015

Using a "now faster" mutex is probably a straightforward task, however it is less so without knowing how you profiled / tested them to begin with. Got any suggestions/pointers there?

I was running GitHub Desktop in Instruments' Time Profiler and Allocations, and focusing on RAC-heavy stacks. You'll have to talk to @joshaber or @mdiep for that now. 😉

@liscio

Contributor

liscio commented Dec 22, 2015

OK good to know. I imagine the Swift-side stuff will require a similarly large/deep code base to test with. I have grown some of my own over the last few months¹ but I don't know whether I push RAC "hard enough" to get enough samples out of it. Plenty of other things tend to bubble up in the Instruments charts with apps like mine. 😄

¹ Plug! http://capoapp.com, with http://capoapp.com/neptune being the most recent, totally-designed-using-RAC4 component.

@kballard

kballard commented Dec 23, 2015

I wrote a blog post today that goes into more detail about why spinlocks are "illegal" on iOS: http://engineering.postmates.com/Spinlocks-Considered-Harmful-On-iOS/

@Adlai-Holler

Contributor

Adlai-Holler commented Jan 5, 2016

I broke off Atomic a little while ago and migrated it to pthread_mutex_lock. By all accounts it's the best lock there is – no dynamic dispatch like NSLock, instant lock if not contended like spinlock, doesn't waste energy unlike spin lock.

Feel free to copy-paste the source from my repo or do whatever you like with it.

https://github.com/Adlai-Holler/Atomic
https://github.com/Adlai-Holler/Atomic/blob/master/Atomic/Atomic.swift

@NachoSoto

Member

NachoSoto commented Jan 7, 2016

@kballard that's a great explanation, thanks!

I don't have strong opinions on this as I'm not an expert on the matter. I'll defer to others on the decision to change (or not change) what type of lock we use, though I'd be happy to make the change myself!

@kballard

kballard commented Jan 19, 2016

A benchmark like that is completely useless without source. What's more, you're only showing a single time for each construct, without documenting what that time actually represents. The various constructs will behave differently depending on whether they're under contention, the priorities of the threads that are contending on the lock, etc. @synchronized is also a curious beast, its performance will depend heavily on the access patterns, such as how many @synchronized blocks are being entered/exited at the same time, whether you're locking the same object repeatedly or locking new objects, etc.

@jspahrsummers

Member

jspahrsummers commented Jan 19, 2016

NB: dispatch_semaphore does not donate priority, which is another potential pitfall.

@russbishop

russbishop commented Jan 27, 2016

@jspahrsummers @kballard I have an implementation using atomic CAS. If there is any interest I'm happy to submit a PR.

edit: Looking at the use of Atomic I don't think my implementation will be useful. If you can do idempotent updates, CAS lets you do a kind of optimistic transaction in memory, which can be great, but that would require quite a bit of refactoring (though it works really well if everything is a CoW immutable; then you really can treat mutation as an in-memory transaction).

@RomanTruba

RomanTruba commented Jun 29, 2016

I found this thread today and investigated the different types of synchronization, running a benchmark on the iOS simulator and real devices.
The results are very interesting to me. In iOS 10 there is visible performance degradation of dispatch_semaphore, which probably changed its behaviour with respect to thread priorities.
Here is a summary diagram of the basic synchronization mechanisms available in iOS. All tests were run with the release configuration (Swift optimization -O).

[Screenshot: benchmark results diagram, 2016-06-29]

Benchmark code for SDK9
Benchmark code for SDK10

@kballard

kballard commented Jun 30, 2016

Your numbers look very suspect to me. How many times did you run the tests?

For comparison, I ran my own benchmark (source) on OS X 10.11.5 using a hacky benchmark harness I wrote a while back. I ran the entire benchmark 10 times, dropped the bottom and top 2 numbers from each test (to get rid of outliers), and the rest of the numbers produced the following ranges:

no sync: 22-41ns
spinlock: 24ns
semaphore: 29ns
NSLock: 45-71ns
mutex: 40-64ns
synchronized: 75-122ns
queue: 505-554ns

From this you can see that spinlock is by far the cheapest, followed by semaphore, then mutex, then NSLock just slightly behind mutex (since it's basically mutex + objc_msgSend), then synchronized is almost twice as expensive, and finally queues end up being extremely expensive, much more so than I expected.

@RomanTruba

RomanTruba commented Jun 30, 2016

@kballard as you can see, my benchmark is quite simple, but it uses XCTest's default mechanism for measuring execution time. The numbers are the average run time for each test, not for each sync operation. In general, your results and mine correlate, except for queues, which are not as bad as we used to think.

@kiddhmh

kiddhmh commented Sep 7, 2016

okok
