Remove data race caused by doing sample on rum thread #1177

Merged: 3 commits, Mar 1, 2023

Conversation

@cltnschlosser (Contributor)

What and why?

Seeing this crash after upgrading from 1.12.1 to 1.15.0:

Crashed: com.datadoghq.rum-monitor
0  MyApp                          0x24da04 VitalCPUReader.readVitalData() + 29 (VitalCPUReader.swift:29)
1  MyApp                          0x24db04 protocol witness for SamplingBasedVitalReader.readVitalData() in conformance VitalCPUReader + 4366867204 (<compiler-generated>:4366867204)
2  MyApp                          0x24df7c VitalInfoSampler.takeSample() + 86 (VitalInfoSampler.swift:86)
3  MyApp                          0x20982c specialized VitalInfoSampler.init(cpuReader:memoryReader:refreshRateReader:frequency:maximumRefreshRate:) + 4366587948 (<compiler-generated>:4366587948)
4  MyApp                          0x224808 specialized RUMViewScope.init(isInitialView:parent:dependencies:identity:path:name:attributes:customTimings:startTime:serverTimeOffset:) + 4366698504 (<compiler-generated>:4366698504)
5  MyApp                          0x2083dc RUMSessionScope.startView(on:context:) + 4366582748 (RUMViewScope.swift:4366582748)
6  MyApp                          0x207a10 RUMSessionScope.process(command:context:writer:) + 144 (RUMSessionScope.swift:144)
7  MyApp                          0x18e758 RUMApplicationScope.process(command:context:writer:) + 4366083928 (<compiler-generated>:4366083928)
8  MyApp                          0x1fa8e8 closure #1 in closure #1 in RUMMonitor.process(command:) + 597 (RUMMonitor.swift:597)
9  MyApp                          0x134de0 thunk for @callee_guaranteed () -> () + 4365716960 (<compiler-generated>:4365716960)
10 MyApp                          0x134e00 thunk for @escaping @callee_guaranteed () -> () + 4365716992 (<compiler-generated>:4365716992)
11 libdispatch.dylib              0x647c8 _dispatch_client_callout + 16
12 libdispatch.dylib              0x46b54 _dispatch_lane_barrier_sync_invoke_and_complete + 52
13 MyApp                          0x1fa7bc closure #1 in RUMMonitor.process(command:) + 4366526396 (<compiler-generated>:4366526396)
14 MyApp                          0x1fdc7c partial apply for closure #1 in RUMMonitor.process(command:) + 4366539900 (<compiler-generated>:4366539900)
15 MyApp                          0x1fe540 closure #1 in RUMMonitor.process(command:)partial apply + 4366542144
16 MyApp                          0x12f904 closure #1 in DatadogCoreFeatureScope.eventWriteContext(bypassConsent:forceNewBatch:_:) + 486 (DatadogCore.swift:486)
17 MyApp                          0x12a030 closure #1 in DatadogContextProvider.read(block:) + 105 (DatadogContextProvider.swift:105)
18 MyApp                          0x392b8 thunk for @escaping @callee_guaranteed () -> () + 4364686008 (<compiler-generated>:4364686008)
19 libdispatch.dylib              0x63850 _dispatch_call_block_and_release + 24
20 libdispatch.dylib              0x647c8 _dispatch_client_callout + 16
21 libdispatch.dylib              0x3f854 _dispatch_lane_serial_drain$VARIANT$armv81 + 604
22 libdispatch.dylib              0x402e4 _dispatch_lane_invoke$VARIANT$armv81 + 380
23 libdispatch.dylib              0x4a000 _dispatch_workloop_worker_thread + 612
24 libsystem_pthread.dylib        0x1b50 _pthread_wqthread + 284
25 libsystem_pthread.dylib        0x167c start_wqthread + 8

My initial suspicion was integer overflow with the UInt32s being used here (natural_t), but I tried to reproduce that and I think it would result in a slightly different stacktrace. So then I noticed this comment in VitalCPUReader:

    // TODO: RUMM-1276 appWillResignActive&appDidBecomeActive are called in main thread
    // IF readVitalData() is called from non-main threads, they must be synchronized

And I realized that it's crashing on the RUM thread.
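For context, here is a minimal, hypothetical sketch of the kind of shared state involved and one way it could be synchronized. The property and method names follow the code quoted in this thread; everything else, including the lock-based approach, is illustrative and not the SDK's actual implementation.

```swift
import Foundation

// Sketch only: state mutated from two threads without synchronization.
final class CPUReaderSketch {
    // Touched from the main thread (appWillResignActive / appDidBecomeActive handlers)
    // and from the RUM queue (readVitalData()).
    private var utilizedTicksWhenResigningActive: UInt64?
    private var totalInactiveTicks: UInt64 = 0
    private let lock = NSLock()   // one possible fix: funnel all access through a lock

    func appWillResignActive(currentTicks: UInt64) {     // main thread
        lock.lock(); defer { lock.unlock() }
        utilizedTicksWhenResigningActive = currentTicks
    }

    func appDidBecomeActive(currentTicks: UInt64) {      // main thread
        lock.lock(); defer { lock.unlock() }
        if let resigned = utilizedTicksWhenResigningActive {
            totalInactiveTicks &+= currentTicks &- resigned
            utilizedTicksWhenResigningActive = nil
        }
    }

    func readVitalData(currentTicks: UInt64) -> Double { // RUM queue
        lock.lock(); defer { lock.unlock() }
        let ongoing = currentTicks &- (utilizedTicksWhenResigningActive ?? currentTicks)
        return Double(currentTicks &- totalInactiveTicks &- ongoing)
    }
}
```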

How?

Remove the initial sample call that was happening on the RUM thread.
This passes current unit tests.

Alternatively, this call can be moved to RunLoop.main.perform {}, but that caused a test failure (could just be a test issue), so I went with this approach for now.
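A hedged sketch of that alternative: schedule the initial sample on the main run loop instead of taking it synchronously in the initializer (which may run on the RUM queue). The type and method names mirror the stack trace above; the real initializer takes several readers and a sampling frequency, so this is illustrative only.

```swift
import Foundation

final class VitalInfoSamplerSketch {
    private let readVitalData: () -> Double   // stand-in for a SamplingBasedVitalReader

    init(readVitalData: @escaping () -> Double) {
        self.readVitalData = readVitalData
        // The initial sample now runs on the main thread,
        // alongside the app-state notification handlers.
        RunLoop.main.perform { [weak self] in
            self?.takeSample()
        }
    }

    private func takeSample() {
        _ = readVitalData()
    }
}
```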

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration) - Existing tests pass
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests
  • Run integration tests
  • Run smoke tests

@cltnschlosser requested a review from a team as a code owner on February 24, 2023 at 19:58
@cltnschlosser (Contributor Author)

CC @ncreated and @maxep, you were both helpful before :)

@maciejburda (Contributor)

@cltnschlosser Thanks for contributing and explaining the issue so well!

We decided to follow the other approach you suggested and update the unit tests afterwards. This will ensure we start gathering the vitals as soon as we instantiate the object. We'll release this as a hotfix as soon as possible, and I'll keep the open issue posted with the updates.

@cltnschlosser (Contributor Author)

Looks like #1181 is actually crashing due to the overflow, so you'll want to fix both issues (maybe the memory issue is causing the overflow, not sure).

@cltnschlosser (Contributor Author)

cltnschlosser commented Feb 28, 2023

@maciejburda I updated this diff to take the initial sample, fixed the broken test, and added a new test as well. I also changed the internals of VitalCPUReader to use UInt64 instead of natural_t (UInt32), although it's a bit unclear if that will even help. According to the stack trace in #1181 (which is more detailed than anything I have), it looks like this happens on line 24. If that's true, then let ongoingInactiveTicks = ticks - (utilizedTicksWhenResigningActive ?? ticks) would be underflowing rather than overflowing, i.e. utilizedTicksWhenResigningActive > ticks. I can't find great documentation on the CPU tick counters in Darwin, but maybe that value just overflows and starts over at 0?

EDIT: It could also just be the memory corruption / data race issue causing utilizedTicksWhenResigningActive to have a bad value, and the other changes here would fix the overflow (underflow) issue as well.
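For illustration only, a small self-contained example of the trapping subtraction described above and one defensive way to avoid it; the values are made up and the fix actually adopted in this PR may differ.

```swift
// Hypothetical values, not SDK code. If utilizedTicksWhenResigningActive ever exceeds
// the current tick count (counter wrap, or a stale value left by a data race), the
// plain subtraction on unsigned integers traps at runtime.
let ticks: UInt64 = 1_000
let utilizedTicksWhenResigningActive: UInt64? = 1_500

// let ongoingInactiveTicks = ticks - (utilizedTicksWhenResigningActive ?? ticks)
// ^ crashes with an arithmetic overflow when the stored value is larger than `ticks`

// One defensive option: treat the impossible ordering as zero instead of trapping.
let previous = utilizedTicksWhenResigningActive ?? ticks
let ongoingInactiveTicks = ticks >= previous ? ticks - previous : 0
print(ongoingInactiveTicks)  // 0
```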

@maciejburda (Contributor) left a comment

I was about to jump on this issue, but I can see you've got it all right!
Thanks a ton for the great contribution (again!).

I feel like it's more likely that we are dealing with memory corruption / a data race.
I'll make sure it's merged and released ASAP.

@cltnschlosser (Contributor Author)

Unable to reproduce this locally:

/Users/vagrant/git/Tests/DatadogTests/Datadog/RUM/RUMVitals/VitalInfoSamplerTests.swift:38 - XCTAssertGreaterThan failed: ("1") is not greater than ("1")
/Users/vagrant/git/Tests/DatadogTests/Datadog/RUM/RUMVitals/VitalInfoSamplerTests.swift:40 - XCTAssertGreaterThan failed: ("1") is not greater than ("1")

When I run it locally the value is 6. Not sure what's happening; if this were 1, other tests should be failing too.

@maciejburda (Contributor)

maciejburda commented Mar 1, 2023

Same, it passes both locally and when using the local CI CLI. Maybe we can try increasing the wait time? 🤔

I can take a closer look later today.
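For illustration, a hedged sketch of what "increasing the wait time" can look like in a timing-sensitive XCTest. The names and intervals are made up; this is not the actual VitalInfoSamplerTests code, nor necessarily the change made in the commit referenced later in this thread.

```swift
import XCTest

final class SamplerTimingSketchTests: XCTestCase {
    func testTakesMoreThanOneSample() {
        var sampleCount = 0
        // Stand-in for a periodic sampler ticking every 0.1s on the main run loop.
        let timer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { _ in
            sampleCount += 1
        }

        // Waiting barely one interval can flakily capture a single tick on a slow CI
        // machine; waiting several intervals makes the "> 1" assertion robust.
        let exp = expectation(description: "several sampling ticks elapsed")
        DispatchQueue.main.asyncAfter(deadline: .now() + 0.5) { exp.fulfill() }
        waitForExpectations(timeout: 1)

        timer.invalidate()
        XCTAssertGreaterThan(sampleCount, 1)
    }
}
```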

@maciejburda (Contributor)

I ran some experiments, and this change seems to do the trick:
8948351

@ncreated (Collaborator) left a comment

Looks good 👍, but please make sure the test is not flaky before merging it.
