Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLite Crash on deleteEvents #183

Open
josephwegner opened this issue Feb 6, 2017 · 13 comments
Open

SQLite Crash on deleteEvents #183

josephwegner opened this issue Feb 6, 2017 · 13 comments

Comments

@josephwegner
Copy link

A customer reported an issue with Keen's interaction with SQLite causing their app to crash often. Given a crash log, it appears that the crash originates in [KIODBStore deleteEvent], or more specifically in the keen_io_sqlite3_bind_int64 statement.

They also reported an issue with duplicate events, similar to #170. This makes sense, because if we fail to delete an event from SQLite after uploading, it will be uploaded again after the app resumes.

@josephwegner
Copy link
Author

Anonymized crash log from customer:

CrashReporter Key:   602EC7B8-403B-446B-9070-69CDFEE531AA
Hardware Model:      iPhone7,2
Process:         APPNAME [279]
Path:            /var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/APPNAME
Identifier:      BUNDLEIDENTIFER
Version:         1.7.2 (10117)
Code Type:       ARM-64
Parent Process:  ??? [1]
Date/Time:       2017-02-03T11:08:30Z
Launch Time:     2017-02-03T09:15:37Z
OS Version:      iPhone OS 10.2.1 (14D27)
Report Version:  104
Exception Type:  SIGSEGV
Exception Codes: SEGV_ACCERR at 0x80000000101c3fae
Crashed Thread:  10
Thread 0:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   GraphicsServices                     0x000000018a0d2198 GSEventRunModal + 176
5   UIKit                                0x000000018e6657fc -[UIApplication _run] + 680
6   UIKit                                0x000000018e660534 UIApplicationMain + 204
7   APPNAME                            0x000000010005c06c main (main.m:19)
8   libdyld.dylib                        0x00000001876015b8 start + 0
Thread 1:
0   libsystem_kernel.dylib               0x0000000187712e1c __psynch_cvwait + 8
1   libc++.1.dylib                       0x00000001871013ec std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 52
2   JavaScriptCore                       0x000000018cf225d0 void std::__1::condition_variable_any::wait<std::__1::unique_lock<bmalloc::Mutex> >(std::__1::unique_lock<bmalloc::Mutex>&) + 108
3   JavaScriptCore                       0x000000018cf22544 bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>::threadRunLoop() + 164
4   JavaScriptCore                       0x000000018cf22424 bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>::threadEntryPoint(bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>*) + 8
5   JavaScriptCore                       0x000000018cf226d4 void* std::__1::__thread_proxy<std::__1::tuple<void (*)(bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>*), bmalloc::AsyncTask<bmalloc::Heap, void (bmalloc::Heap::*)()>*> >(void*) + 88
6   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
7   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
8   libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 2:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   WebCore                              0x000000018d224490 RunWebThread(void*) + 452
5   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
6   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
7   libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 3:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   Foundation                           0x000000018915b26c -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 300
5   Foundation                           0x000000018917bdd0 -[NSRunLoop(NSRunLoop) runUntilDate:] + 92
6   UIKit                                0x000000018efd9c38 -[UIEventFetcher threadMain] + 132
7   Foundation                           0x0000000189258e68 __NSThread__start__ + 1020
8   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
9   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
10  libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 4:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   CFNetwork                            0x0000000188e23a70 +[NSURLConnection(Loader) _resourceLoadLoop:] + 332
5   Foundation                           0x0000000189258e68 __NSThread__start__ + 1020
6   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
7   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
8   libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 5:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   Foundation                           0x000000018915b26c -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 300
5   Foundation                           0x00000001891afaa0 -[NSRunLoop(NSRunLoop) run] + 84
6   AFNetworking                         0x00000001009873b0 +[AFURLConnectionOperation networkRequestThreadEntryPoint:] (AFURLConnectionOperation.m:163)
7   Foundation                           0x0000000189258e68 __NSThread__start__ + 1020
8   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
9   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
10  libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 6:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   CFNetwork                            0x0000000188f7da40 _privateRunloopEmulationSet(void*) + 260
5   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
6   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
7   libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 7:
0   libsystem_kernel.dylib               0x00000001876f5188 mach_msg_trap + 8
1   CoreFoundation                       0x00000001886f25d0 __CFRunLoopServiceMachPort + 188
2   CoreFoundation                       0x00000001886f01ec __CFRunLoopRun + 1128
3   CoreFoundation                       0x000000018861e2b8 CFRunLoopRunSpecific + 440
4   CoreFoundation                       0x000000018866bb44 CFRunLoopRun + 108
5   CoreMotion                           0x000000018f4e1120 CLStartStopAdvertisingBeacon + 203192
6   libsystem_pthread.dylib              0x00000001877d9850 _pthread_body + 236
7   libsystem_pthread.dylib              0x00000001877d9760 _pthread_start + 280
8   libsystem_pthread.dylib              0x00000001877d6d94 thread_start + 0
Thread 8:
0   libsystem_kernel.dylib               0x0000000187713a88 __workq_kernreturn + 8
1   libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 9:
0   libsystem_kernel.dylib               0x0000000187713a88 __workq_kernreturn + 8
1   libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 10 Crashed:
0   KeenClient                           0x00000001010f0a5c vdbeUnbind (keen_io_sqlite3.c:66777)
1   KeenClient                           0x00000001010f0c10 keen_io_sqlite3_bind_int64 (keen_io_sqlite3.c:66875)
2   KeenClient                           0x000000010115af18 __26-[KIODBStore deleteEvent:]_block_invoke (KIODBStore.m:566)
3   libdispatch.dylib                    0x00000001875ce1fc _dispatch_call_block_and_release + 20
4   libdispatch.dylib                    0x00000001875ce1bc _dispatch_client_callout + 12
5   libdispatch.dylib                    0x00000001875dc3dc _dispatch_queue_serial_drain + 924
6   libdispatch.dylib                    0x00000001875d19a4 _dispatch_queue_invoke + 648
7   libdispatch.dylib                    0x00000001875de34c _dispatch_root_queue_drain + 568
8   libdispatch.dylib                    0x00000001875de0ac _dispatch_worker_thread3 + 120
9   libsystem_pthread.dylib              0x00000001877d72a0 _pthread_wqthread + 1284
10  libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 11:
0   libsystem_kernel.dylib               0x0000000187713a88 __workq_kernreturn + 8
1   libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 12:
0   libsystem_kernel.dylib               0x0000000187713a88 __workq_kernreturn + 8
1   libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 13:
0   libsystem_kernel.dylib               0x0000000187713a88 __workq_kernreturn + 8
1   libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 14:
0   libsystem_kernel.dylib               0x0000000187713a88 __workq_kernreturn + 8
1   libsystem_pthread.dylib              0x00000001877d6d8c start_wqthread + 0
Thread 10 crashed with ARM-64 Thread State:
    pc: 0x00000001010f0a5c     fp: 0x000000016df8ec70     sp: 0x000000016df8ec40     x0: 0x0000000000000000 
    x1: 0x0000000000000001     x2: 0x000000000000028a     x3: 0x000000018775922c     x4: 0x0000000000000032 
    x5: 0x0000000000000000     x6: 0x0000000000000000     x7: 0x0000000000000000     x8: 0x80000000101c3f96 
    x9: 0x0000000000000000    x10: 0x0000000000000184    x11: 0x0000000000000004    x12: 0x00000001886e068c 
   x13: 0x00000000000028a3    x14: 0x00000000000028a3    x15: 0x7000000000000000    x16: 0x00000001ae02d7f0 
   x17: 0x000000018866fee4    x18: 0x0000000000000000    x19: 0x0000000101c3b420    x20: 0x0000000000000001 
   x21: 0x000000000000028a    x22: 0x00000000000104d7    x23: 0x00000001700f9680    x24: 0x0000000000000000 
   x25: 0x0000000000000000    x26: 0x0000000000000014    x27: 0x000000016df8f0e0    x28: 0xffffffffffffffff 
    lr: 0x00000001010f0a34   cpsr: 0x0000000060000000 
Binary Images:
       0x10003c000 -        0x10075ffff +APPNAME arm64  <aa438f9c30b1326db7a44cf742389f66> /var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/APPNAME
       0x100978000 -        0x1009affff +AFNetworking arm64  <2a3607079b1c3b6b8bb39a7f876c267b> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/AFNetworking.framework/AFNetworking
       0x100a0c000 -        0x100a13fff +ColorUtils arm64  <161da403f1033bb18aed8d7eaad154f0> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/ColorUtils.framework/ColorUtils
       0x100aa8000 -        0x100ac7fff +BlocksKit arm64  <745684cb132935afb05441c03d61832d> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/BlocksKit.framework/BlocksKit
       0x100afc000 -        0x100c2bfff +SOMELIBRARYSDK arm64  <94cd1afbe4213d8480baadd645e43314> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/SOMELIBRARYSDK.framework/SOMELIBRARYSDK
       0x100e48000 -        0x100edbfff +SOMELIBRARYUI arm64  <5b13fd3f7b0d3ba8b252b820aa5428e5> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/SOMELIBRARYUI.framework/SOMELIBRARYUI
       0x10100c000 -        0x10102bfff +CocoaLumberjack arm64  <7b4d4f0e175031198f558e8ce50a7ed1> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/CocoaLumberjack.framework/CocoaLumberjack
       0x101058000 -        0x10105ffff +DTTJailbreakDetection arm64  <55ba159d98c9314f8074353cd8203eb7> /private/var/containers/Bundle/Application/18DA561D-6913-469A-AA58-30E7AE67ABBB/APPNAME.app/Frameworks/DTTJailbreakDetection.framework/DTTJailbreakDetection
       0x10106c000 -        0x101073fff +FXNotifications arm64  <4d6f3443e17137f2b005d7ca8247cc7c> /private/var/containe...```

@baumatron
Copy link
Contributor

baumatron commented Feb 8, 2017

Seems that the closeDB has been called before the dispatch worker thread processes the deleteEvent enqueued block has been executed, meaning the delete statement has been finalized, and would cause the segmentation fault seen in the callstack.

Previously I was thinking this could be due to a race condition due to client calls to closeDB, but that isn't actually exposed to clients. Seems that closeDB could be called if a sqlite call fails for whatever reason. Not sure if the keen client has logging enabled by default, but if it could be enabled that might shed some light on if/what is failing and causing the db to be closed.

@rist
Copy link

rist commented Feb 8, 2017

We never experienced the crashes in development and don't have logging enabled in production (and wouldn't be able to access those logs anyway).
As we see in the crash log the crash happens in keen_io_sqlite3_bind_int64 of deleteEvent - could you perhaps be more defensive while parsing eventIds in handleEventAPIResponse

Or even better - could you check your server logs? Perhaps we see there your servers response and then can understand what happens in handleEventAPIResponse. I can supply you with the exact time of lots of crashes

@josephwegner
Copy link
Author

I checked our server logs for your account a few days ago. There were no errors on writes in the time period we're looking at, so the response should have been a standard {"success": true} message.

@rist
Copy link

rist commented Feb 9, 2017

but the responsedata would also contain collectionNames - and just wondering if anything could go wrong here: NSNumber *eid = [[eventIds objectForKey:collectionName] objectAtIndex:count];

if a collectionName returned by your API, wouldn't be found in the passed eventIds dictionary then eid would be nil which could perhaps lead to any problems

@josephwegner
Copy link
Author

:( Our logs are unfortunately not that verbose as far as what is in the responses. We'd have to fill the every data center everywhere with {"success": true} if we did that. All I get to see if # of failures and # of successes (for writes).

@baumatron
Copy link
Contributor

@rist not sure if that could potentially lead to the state we're seeing that causes the crash, but the crash itself seems to be because the SQLite database used internally has been closed by the time the internal delete operation is processed through the KIODBStore queue. We might investigate what previous SQLite operation failed which is the only way the database gets closed. @josephwegner would it be possible that a failure is generated on that level due to a nil eventId passed to the SDK? I imagine there are checks in place for that, or otherwise wouldn't lead to a SQLite failure?

@rist
Copy link

rist commented Feb 9, 2017

Ok, if the theory ist that the DB was closed before we execute some delete statement - why do you check if the DB is open at the beginning of - (void)deleteEvent:(NSNumber *)eventId but not later in the async thread where the delete actually happens.

The DB could have been closed in the meantime

@masojus
Copy link
Collaborator

masojus commented Feb 9, 2017

Since the offending line is keen_io_sqlite3_mutex_enter(p->db->mutex); either the statement has been deleted or the db is trashed, it would seem. I don't see how that line would fail due to a nil eventId.

Almost all the code paths that lead to closeDB being called happen inside dispatch_*() blocks, maybe all but the openAndInit... code. Since KIODBStore uses a sequential queue, it seems to have to happen when something in the queue gets dispatched and causes a failure, which hits closeDB after the checkOpenDB but before the dispatch_*() block in deleteEvent() is executed. Scattering checkOpenDB messages all over to every dispatch_*() block could potentially recover or fail gracefully, but we'd want to figure out what caused the closeDB() call in the first place.

Also, if we fail to delete events from the local store, I don't know if there's code in KeenClient designed to handle that, so it might be the case that such events, which succeeded in being pushed, get pushed again later due to a failure upon attempting to delete them (@josephwegner mentioned this, and it's mentioned in Issue #170 and Issue #156). One would need a callback set on KIODBStore so that it could notifyCannotDelete(eventId) or something along those lines, and KeenClient could set a handler to then place those event IDs in a list of events to try to purge or at the least not re-push the next time we try to flush the local store. I'm not super familiar with this particular SDK, so I'm just brainstorming.

Either way, it's important to not just try to mask the original failure which caused the closeDB code to execute. If the DB is fundamentally in a bad state, closing and reopening all the time isn't going to necessarily mean that delete events will succeed in a future if doing so failed just now.

@rist
Copy link

rist commented Feb 14, 2017

yes #170 is definitely an issue for us as well - we see lots of duplicated event

@masojus
Copy link
Collaborator

masojus commented Feb 15, 2017

@baumatron has worked on creating some potential repro scenarios and identified a few issues that could be addressed around how DB corruption is handled. We could try to address those as well as how local delete failures are handled and how DB closed is handled by dispatched blocks as described above. Some combination of these patches could ameliorate the situation. Since none of that identifies the root cause, we'd still love to introduce some sort of better logging. @baumatron and I were tossing around the idea of adding some callback logging interface the client code could opt into in order to use a third-party logging solution that actually posts logs to somewhere useful. That could be turned on by client code for certain clients in order to help track down the root cause. Does that sound interesting @rist ?

@rist
Copy link

rist commented Feb 16, 2017

I was just clicking around in our crash reporting tool and found another overview screen about our Keen related crashes - do you see anything helpful/intersting there?
screen shot 2017-02-16 at 22 37 41

This was referenced Mar 21, 2017
@baumatron baumatron mentioned this issue Apr 4, 2017
@baumatron
Copy link
Contributor

Here's a bit of an update. The general issue here is that the db gets closed all over the place when encountering anything considered to be an error from the db, which is a bit dubious to begin with, but causes issues because this closing of the db isn't synchronized with code elsewhere that is responsible for opening the db again. So you can get into some call in KIODBStore that checks to be sure the db is open, then drops a block in the queue that will do some db operation, in the meantime some other operation "fails" and closes the db, and then the queued block executes and crashes since the db is now closed.

Since I last updated the bug, a couple of issues have been discovered and fixed where this "failure" condition is triggered erroneously, meaning the db is closed more often than it should be, making this issue observed more frequently. Those fixes should help mitigate this in the short term (#196, #200).

A long term fix would be to do an audit of how failures are handled in KIODBStore. I believe that in general these failures should not be closing the db, but instead surfacing a meaningful error to the SDK client when a call can't be handled. In cases where an error case should be explicitly handled because KIODBStore knows how to deal with it, such as is the case for existing corruption handling code-paths, that should all be done immediately where the error is encountered instead of partially where the error is encountered and then handled by checkOpenDB in random places salted throughout the code. I would suggest checkOpenDB gets removed and a single explicit initialization takes it's place, and all methods should be able to assume the db is open if called and should throw an exception if that isn't the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants