
Success and Failure blocks sometimes not called #307

Closed
FirstSOW opened this Issue April 18, 2012 · 146 comments
FirstSOW

I'm using AFJSONRequestOperation, and most of the time everything works fine. But sometimes neither the success block nor the failure block gets called. What happens when the HTTP call times out? Does the failure block get called? Or does the operation just silently fail?

Mattt Thompson
Owner
mattt commented April 19, 2012

All of the request operations are guaranteed to trigger a callback once the connection finishes, even on a timeout. If you can reproduce a case where this does not happen, please send that in. Thanks!

Fiel Guhit

I've experienced the same issue. This consistently happens when opening an app that has been hanging around in the background, and then immediately sending an AFHTTPRequestOperation.

John Z Wu

I've experienced something where no requests were returning shortly after reopening the app. I determined it was Testflight deadlocking when their log sending fails and that cripples all AFURLConnection stuff.

Mattt Thompson
Owner
mattt commented April 26, 2012

Dammit TestFlight. Now I remember: TestFlight's and Urban Airship's SDKs mess with the URL loading system and interfere with AFNetworking. I'll add a note about that in the documentation somewhere.

John Z Wu

@mattt Can you elaborate on that? Is there any way around it other than removing them?

Mattt Thompson
Owner
mattt commented April 26, 2012

@bONchON All I know is that I've had reports in the past about the TestFlight and UrbanAirship SDKs causing weird networking behavior. I have absolutely no idea what's going on with either of the frameworks, so my advice isn't anything useful.

If anyone has any actionable way to improve AFN's relationship to any of these or any other interfering frameworks, I'd love to have that documented.

Mattt Thompson mattt closed this April 26, 2012
FirstSOW

Well, this is unfortunate. I do use TestFlight and find it quite beneficial. But if I'm forced to choose between TestFlight and AFNetworking, I'll have to choose AFNetworking. The project is completed, and the app has been deployed to the field. Replacing AFNetworking would be a significant amount of work and would require us to test everything again. In short, what a bummer.

Mattt Thompson
Owner
mattt commented April 28, 2012

@FirstSOW I don't think the takeaway from all of this is that you need to choose between AFN and TestFlight. TestFlight isn't something that you should be shipping with your App Store build of your app, so any issues you encounter would be an inconvenience to developers and beta testers, which isn't great, but is far better than affecting end-users, for what it's worth.

I would love nothing more than for everything to work well together, so I encourage people who have and can reproduce this issue to help make things better by submitting a patch to TestFlight and/or AFN. All I'm saying is that I'm not in a position to do that.

Fiel Guhit

@bONchON I am using TestFlight as well. Could you provide some detail on how you were able to determine that TestFlight was the issue? Thanks.

Fiel Guhit

Update
The url connection is still not completing after removing the TestFlight SDK.

Mattt Thompson
Owner
mattt commented April 29, 2012

@fguhit I suspect then that the request is failing perhaps because of network connectivity taking a second to kick in after the app comes to the foreground. Or maybe not. Without code or anything else to go on, this is complete speculation. If you want this issue resolved, please provide some context as to what your code is doing.

John Z Wu

Hey guys. Here's a log from hockey app for my deadlock. http://www.2shared.com/file/r4XgbEfG/af_tf_deadlock.html

So the scenario is this: I returned to foreground, noticed that no network requests were returning, despite having internet. I force crashed my app to obtain the logs so I may see what's going on in other threads. I noticed that thread 6 seemed to be waiting to acquire a lock, and since it references TestFlight code, I could only infer...

Empirically, I haven't seen this problem since I took out TestFlight in my code...

I have submitted a bug report to testflight but still awaiting responses.

Fiel Guhit
fguhit commented May 01, 2012

@mattt I just discovered we're using a very old version of AFNetworking. I'll try to keep this issue updated.

Michael Frederick

Ah. I finally found some other people with my issue. It took me a while to figure out the issue was Test Flight and I have been trying to work with one of their developers to fix the issue.

TestFlight does indeed have a bug in it that sometimes causes a deadlock. When it occurs, no requests from AFNetworking ever finish.

Every time the deadlock occurs, TestFlight seems to be waiting on a mutex and [TFRunLoopOperation finishWithError:] has just been called (as seen in the log provided by @bONchON).

Here is a screenshot of one such stack trace from my app: http://i.imgur.com/Yy9Fg.png

Mattt Thompson
Owner
mattt commented May 24, 2012

For anyone who comes across this thread in the future, I wanted to share something I learned from a developer at TestFlight about a potential fix:

Apparently, there is a non-documented option in the SDK, completeNetworkOperationsOnMainThread, which may address deadlocking. So maybe give that a try.

Michael Frederick

@mattt I already went down that path, completeNetworkOperationsOnMainThread does not fix the issue.

Theo Ephraim

Wow! I can't tell you how happy I am to find this. We've been banging our heads against the wall for a week!
It's incredibly hard to reproduce reliably, but since disabling TestFlight, things seem to be working smoothly.

@mattt please do update us here again if you hear back from the TestFlight engineers!

Trystan Kosmynka

Hey guys,

We (TestFlight) don't actually mess with the url loading system. But regardless there is an issue here. We have had a heck of a time trying to get the feedback necessary to reproduce this issue. If anyone can reproduce it and provide a sample app that does, we'll fix it immediately. We've devoted quite a bit of time to this, came up with a few solutions, none of which have seemed to fly. So if you guys have anything that is able to replicate, please let us know.

Trystan

Michael Frederick

@tkosmynka It's extremely hard for us to debug without access to the TestFlight HTTP classes. I spent about a week straight on this issue and all I know is that the method [TFRunLoopOperation finishWithError:] is called every time that I have seen the error occur. Also, the thread that [TFRunLoopOperation finishWithError:] is called on ends up getting locked by a mutex and this seems pretty suspicious. I ended up determining that it wasn't possible for me to debug the issue any further without access to the TF classes.

Trystan Kosmynka

@mikefrederick do you have an application in which it occurs on a regular basis? Any chance you could send us a stripped down version where the app still has the issue? We have no problem debugging but to this point we have had a problem replicating it at all, making it difficult to debug.

Mattt Thompson
Owner
mattt commented June 12, 2012

@tkosmynka I heard from @jasongregori that either you or someone from TestFlight was here at WWDC. Send me a message if you'd be interested in talking this out in person.

Trystan Kosmynka

@mattt our Mac OSX developer @hjaltij is there, he is rock solid and I'm sure would be up for a meetup regardless of whether or not you guys are able to work through the issue together.

Michael Frederick

@tkosmynka no i don't have an application in which it occurs regularly, it is very hard to replicate

Theo Ephraim

We have an app that it is occurring in A TON. But it's a big app, and getting it cut down to something I can hand to you will take us some time, which we don't have at the moment. BUT I wanted to share some more info:

The slower the device, the more it occurs. It was barely happening on our iPad 3, occasionally on an iPad 2, and 3/4 calls on our iPad 1. We also noticed it occurring in only one specific spot in our app (it might have happened a few times in other places, but mostly at this one spot). The difference between that part of the app and the others is that we were making 4 concurrent requests, all of them GETs. In most other places we're only making 1 request at a time, with fewer GETs and more POSTs and PUTs. We also tried messing with keep-alive and caching headers but it made no difference.

Trystan Kosmynka

@theoephraim Thanks! This is a step in the right direction. We'll bump up some concurrent get requests, and I'll dust off my iPhone 3 and maybe we'll get lucky with replicating it.

Did you notice this on a particular version of iOS?

Theo Ephraim
Steve Madsen

I believe I may also be seeing this in the field for our betas, but I've never been able to reproduce it. This app is iPad-only, iOS 5 only. I have never seen it on my iPad 3. In the field it is occurring on iPad 2s.

The few recent times it's occurred, users have indicated that the network was slow. In addition to trying it on slower devices, maybe try running in the simulator with a bad network imposed by the Network Link Conditioner?

Lastly, please consider open sourcing libTestflight.a, or at least a stripped down version that only contains the core functionality. The value of TestFlight is the website. With more eyes on the problem, my bet is that this would be fixed very quickly.

Nagon
Nagon commented June 13, 2012

While I don't have any more insight regarding the problem than the other guys, I can echo that it seems more frequent on older devices (ie iPad 1) and that slow/lossy network connection also seems to increase the likelihood of the problem appearing.

Unfortunately I'm also having a hard time finding steps to reproduce the problem, but if you have any more questions I will try to help as much as I can.

Theo Ephraim

@tkosmynka things are wrapping up here and I should be able to get you a demo app by the end of the week. Let me know before then if you end up being able to recreate it consistently and won't need it.

Trystan Kosmynka

@theoephraim that is fantastic, we're trying to replicate here, hopefully we'll have something before you get to it, if we do I'll let you know.

Dale Buckley

We had this problem a few months ago and one of my colleagues reported it directly to testflight. We eventually had to remove testflight from our app because we had to release it and couldn't release it with a potentially app breaking bug inside.

To replicate the issue we created a very simple app that used testflight. All it did was hit a selection of URLs in a loop over a period of time (just a simple NSURLRequest), with a testflight checkpoint to say that each had been hit. Eventually the app would lock up and not be able to hit the URLs any more.

The time it took to lock up would vary drastically from a few minutes to over an hour, but eventually it would lock up.

I don't know if that helps any further at all.

Trystan Kosmynka

@dlbuckley thanks! We have tried a few things, but nothing that we let run for over an hour. We'll try this again.

mackinra

@tkosmynka We had this happening regularly on our iPhone app as well, and we were not even using AFNetworking, but rather our own implementation involving NSURLConnection. Our app uses background location services, fwiw. The problem would usually rear its head after the app had been in the background for some amount of time, particularly if using 3G (no WiFi) -- perhaps related to flaky connectivity? I had documented our issue here (before figuring out it was TestFlight SDK): http://stackoverflow.com/questions/10841641/nsurlconnection-timing-out

Brandon Fosdick
bfoz commented June 19, 2012

I'm working on an app that's experiencing the same problem. For us it happens about 1 in 15 attempts to load a set of concurrent requests. Running the app for extended periods seems to exacerbate the problem, but we've also had problems immediately after a cold start. WiFi vs 3G doesn't seem to make much of a difference, other than that the requests take longer over 3G which makes it easier to load up a number of concurrent requests.

FWIW we're considering switching to Crittercism to work around this (we're trying to submit the app this week). Any idea when you'll have a fix?

Trystan Kosmynka
Dale Buckley

The issue with this problem is that it's very very random. I've just been running the test app that I mentioned previously to see if I can replicate the issue again, but with no sign of it, whereas a few months ago it was throwing up the errors quite consistently. Just to note: I have not changed the code at all in that time frame; it's gone from having the problem to no longer having the problem.

I can strip it down (remove the stuff that signs into our servers etc) and send it over to you, but now that it's not showing the problem I don't know if it will be of any use to you guys.

Morten Perriard

Any news on this issue?

Zakay
Zakay commented July 09, 2012

Having the same issues, except I am not using TestFlight SDK.
Can't reproduce it all the time, happens sporadically.

Brandon Fosdick
bfoz commented July 12, 2012

FWIW, we're seeing the problem with Crittercism as well. It's still very random overall, but some of our test devices appear to be less random than others.

John Z Wu

@mattt @tksmynka may we get an update to your discussions at WWDC12?

Dale Buckley

I'm starting to think that this is possibly something not localized to Testflight, though it does involve libraries accessing the network. Just to note, after removing Testflight we had no problem with this issue at all.

We did think that it could be some kind of multi-threading issue, but it seems very unlikely that this wide selection of libraries are all having the same multi-threading issue, which puts this theory under question. We did try to set up some kind of meeting with the developers at testflight to try and give them a hand with this issue, but nothing has happened just yet.

Morten Perriard

In our app we were using AFNetworking and TestFlight Live SDK, and in normal testing the bug would happen maybe once a day, apparently at random. Taking TestFlight out solved it, except now we need to find a replacement for that functionality. (was hoping for a quick fix to be able to keep using TF).

Ben Scheirman

I have this problem on multiple apps. I don't think it's just TestFlight that causes this issue. I have an app that uses TestFlight, Flurry, and Crashlytics. I think perhaps including more of these launch-at-startup frameworks exacerbates the issue and causes it to happen more frequently.

Ideally TF Live would replace Flurry and Crashlytics, but it sounds like for the time-being we may have to remove all of them. :sob:

mcohnen

I removed Testflight and the bug seemed to disappear, but it is back again. I have crashlytics framework, so maybe that could be the issue.

Trystan Kosmynka

Our team has been trying to replicate the issue, so far no dice. Encouraging to see that the issue is occurring outside the TestFlight library as well.

Rob Fahnri recently posted this article http://blog.applecorelabs.com/2012/05/29/bug-llvm-optimizer-asihttprequest-arc/. Is it possible that everyone is suffering from the same issues?

Jonathan Lundell

I didn't get the impression that Rob's problem was intermittent...

Brandon Fosdick
bfoz commented July 13, 2012

@tkosmynka Thanks for the article link. Unfortunately we're not using ARC (yet) and we're still having problems.

mcohnen

No ARC here either. The reports I get from Apple are timeouts in didFinishLaunching because I use Apple's Reachability class. I can see some other threads waiting on the same mutex.

Michael Frederick

I have yet to experience this issue with Crittercism (as @bfoz reported) -- has anyone else experienced it? If so I would like to take Crittercism out of my app as well.

Also, something I never tested but am curious about -- has anyone tried submitting a normal NSURLRequest (i.e. a non-AFNetworking request) once the deadlock has occurred? I am not sure if the bug is just preventing AFNetworking from completing requests or if it is preventing ALL requests from being completed.

mcohnen

@mikefrederick Code in Reachability classes also gets stuck in the mutex, and there are no NSURLConnections involved there.

Jason Gregori

@mikefrederick @mcohnen i had the same bug and i didn't even use AFNetworking. Just regular NSURLConnections.

mcohnen

@jasongregori thanks for the info. Were you using Testflight, Crashlytics or any of these libraries?

Ben Scheirman
Trystan Kosmynka

@jasongregori there goes that theory then, it was too good to be true. What are the odds you can replicate it consistently and happen to have a sample app?

mackinra

I like @jasongregori had the same lockup problem with just regular NSURLConnections (no AFNetworking), and the problems disappeared when I removed TestFlight. As I reported earlier, I believe the problem is evoked by flaky network coverage. This may explain the difficulty in reproducing it, if one has solid connectivity.

Jason Gregori

@mcohnen Testflight.

@tkosmynka this was a while back (maybe 2 months) at zabbi. It was completely random but happened fairly often. Maybe 1/4 of the time. There was nothing I could find out at all from testing. It just seemed that URL connections would be frozen somehow. We were in a hurry so once we realized removing testflight fixed it we didn't investigate further. I don't work there anymore so I can't look into it much further.

I want to say poor Internet connections played a role but I think the problem also happened over good wifi. I don't totally remember but I want to say it even happened in the simulator a few times.

Herman Olsson

Back in april, when I first ran into this problem, the app I was working on did not use AFNetworking, but a set of NSURLConnections first fetching some JSON and then some thumbnails, so the number of connections fired was about 30-50 all in all right when the application started.

This might already have been discussed, but every time the lockup occurred, the TF lib had not logged "Team token recognized". Once this output was shown in the console, everything seemed to work fine again. At one point I waited for about 20 minutes, and once the log message was shown, it all went back to normal.

An observation I made that has not been mentioned here is that every time the lockup occurred it blocked connections that were initially called from my UIApplicationDidBecomeActiveNotification method. Since I figured the TF library also listens for this notification I guessed that this might be part of the problem.

Finally, when I figured out that this lockup did not happen without the TF lib, I kind of let it be for the time being since the app did not use TF for the App Store release, and I didn't have the time to look into it more at that time.

Hopefully this can be of any help to someone working to solve this issue.

Tony Million

To chime in on this issue, I used both AFNetworking and a "regular" NSURLConnection (including the new iOS5 Async stuff) and saw problems with all of them.

I tracked it as far as being an issue with backgrounding/foregrounding of my app - EITHER while a network operation was in progress OR when the app came to the foreground and it kicked off a bunch of network operations. I believe TestFlight does some network stuff during these back/foreground transitions too. I was using beginBackgroundTask etc. to make sure all the network operations finished when moving to the background.

I was curious if it was because TestFlight creates its own threaded runloop on which to perform the network IO as that could cause the mutex problem we see (if the thread already owns the mutex it shouldn't block itself), but never got a chance to look into it fully as (like everyone else) I had an app to release, so I just pulled the testflight SDK.
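The mutex hypothesis above can be modelled outside of iOS. What follows is a minimal, language-neutral sketch in Python, not TestFlight or NSURLConnection code: the `Connection` class, its `fail` and `cancel` methods, and the `cancel_from_callback` helper are all hypothetical names invented for illustration. The connection invokes its failure callback while holding a lock, and the callback then calls `cancel()` on the same connection. A timeout is used on the inner acquire only so the demo returns instead of hanging.

```python
import threading


class Connection:
    """Hypothetical connection that calls its delegate while holding a lock."""

    def __init__(self, lock, on_fail):
        self._lock = lock
        self._on_fail = on_fail
        self.cancelled = False

    def fail(self):
        # The failure callback runs while the lock is still held.
        with self._lock:
            self._on_fail(self)

    def cancel(self, timeout=0.1):
        # Real code would block forever here; the timeout only keeps the demo finite.
        if not self._lock.acquire(timeout=timeout):
            return False  # this call would have deadlocked
        try:
            self.cancelled = True
            return True
        finally:
            self._lock.release()


def cancel_from_callback(lock):
    """Return True if cancel() succeeds when called from inside the failure callback."""
    results = []
    conn = Connection(lock, lambda c: results.append(c.cancel()))
    conn.fail()
    return results[0]


if __name__ == "__main__":
    print("plain lock:", cancel_from_callback(threading.Lock()))
    print("re-entrant lock:", cancel_from_callback(threading.RLock()))
```

With a plain lock the re-entrant `cancel()` cannot acquire the lock its own thread already holds, so the helper returns `False`; with a re-entrant lock it returns `True`. Whether the SDK's deadlock actually has this shape is not established anywhere in this thread.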

Dale Buckley

@tonymillion we came to pretty much the same conclusion.

@tkosmynka it might be useful to note that this problem only started for us around the time of the v1.0 release (possibly v0.8, but I'm pretty sure 0.8 was fine). Before that we had no problems. Plus, we only use NSURLConnections; we don't use any other external library for our networking.

Barthelemy Menayas

Like many other people here, I only use NSURLConnection (through RestKit), and have been having this problem quite often. When it happens, it generally happens at start-up, which is when the application makes a larger number of simultaneous requests.

Mattt Thompson
Owner
mattt commented July 19, 2012

@tkosmynka This is a long shot, and perhaps previously mentioned, but do you think that TestFlight could be running into a bug with setDelegateQueue: described here: http://www.ddeville.me/2011/12/broken-NSURLConnection-on-ios/ ?

Emil
emilof commented July 23, 2012

We had this problem in an earlier project. Our testers started to report that the network connection suddenly stopped working. Unfortunately we couldn't reproduce this in a controlled manner. What we did was switch from AFNetworking to ASIHTTPRequest, and then this issue disappeared.

Hjalti Jakobsson

I've been investigating this and I was able to reproduce an issue where all network connections failed and only started working again after restarting the app. The failure handlers were still called so I'm wondering if that's another issue or if some of you guys ran into that?

Jason Gregori
dnstevenson

Anyone try the TestFlight Beta 1.1 and see the problem still?

Morten Perriard

Good question. We have removed TF SDK from our project and have no plan to put it back in before someone from TF confirms that it is working. Does anyone know if there has been any progress in finding/fixing this issue?

Tony Million

FYI: I implemented 1.1Beta in my app and everything seems to be working fine! Not noticed any network hangups at all.

(though it would be nice for someone from TF to confirm this)

Jeppe Vesterbæk

We also removed the TestFlight SDK from our apps and have not experienced the problem since doing that. We will not use it again before TestFlight explains what has caused this and confirms that it has been resolved.

PJ Gray

+1 on removing the Testflight SDK too. I used the 1.1 for a while and still had problems so finally just gave up on it.

ecerney

Finally I found a relevant thread! I have this same issue, and it is really causing problems freezing the user while data is synced with our server. Does anyone know if another crash report sdk works with afnetworking?

Jonathan Lundell

I was seeing hangs. Switched from TestFlight to Crashlytics and Flurry, and haven't seen them since. (At least I don't think so; hard to tell what goes on in the field, but my testers have not reported it.)

ecerney

Looks like you need to be invited to Crashlytics, sooo any other options out there? Client is looking for a free option, which I know is hard to come by, but if there is an option that would be great.

Tony Million

I've been using the 1.1Beta in a production application and it's been working perfectly (this app uses AFNetworking too)

Jeppe Vesterbæk

Problem is that if the TF team does not identify the core of the problem -- and explain this, it's hard to trust future TF releases, since this bug falls into the happens-rarely category (=> hard to identify during development/test). I especially find it strange that they in no way have warned existing users of this potentially critical problem on their website.

Catarino

@tonymillion 1.1b2? I wasn't able to use b1; it would not build. Anyway, can you confirm it's OK? I just had a build rejected by Apple and I suspect it's this problem.

I've managed to test a new build with our beta team and all is OK. I had a crash that, once again, I suspect comes from what you guys are discussing here (hence my presence here too).

would like that you confirm if all is going ok with TF SDK 1.1b2

thank you

Steven Fisher

Does anyone know if including TestFlight but not calling takeOff also causes the problem?

My plan is to include TestFlight, but only have it running if the user turns it on.

Dave Lee

@tewha I'm wondering the same thing, if anyone has the answer please do share.

Brandon Fosdick
Dave Lee

thanks for the help @bfoz

Ben Scheirman
Steven Fisher

This is still a problem, right?

Much props to @mattt and @AFNetworking; it's insane that TestFlight, with a closed source framework, is directing people here. The only people in the world with the power to diagnose and fix this are at TestFlight.

Jason Gregori

Hi Guys,

So I'm one of the people posting about this bug earlier in this thread. Last week, I started a new job at TestFlight. Of course my first order of business was fixing this bug. It's turned out to be a lot harder than I thought. Here is everything I know about the bug and have learned in the last week:

  1. The bug is incredibly difficult to reproduce. No one at TestFlight has ever been able to reproduce it. I have not been able to reproduce it since last spring.

  2. Developers have reported this problem less and less to TestFlight since it started back in Spring. No one has reported the problem in the last 2 months.

  3. Something to do with TestFlight caused this bug. I know for my case that the bug immediately disappeared when the SDK was removed (I wasn't even using AFNetworking).

  4. About the SDK's networking code:

    • The SDK uses C networking code to report crash reports, but I'm pretty certain it has nothing to do with the bug because it is called only in the case of a crash or on startup right after a crash it did not have time to report. The bug happened many many times to me when there were no crashes in sight.
    • The Obj-C networking code is NSOperation based. It is entirely based off of Apple's MVCNetworking sample project. It doesn't do anything crazy. It uses regular NSURLConnections to send data.
  5. The bug occurred primarily around last spring.

    • At the time, TestFlight's servers were having a hard time keeping up with demand. Calls to the server would often time out. Since then, TestFlight's servers have been tremendously beefed up.
    • 5.1 and 5.1.1 were the latest versions of iOS. I can confirm I experienced the bug on iOS 5.1.
  6. @bONchON and @mikefrederick's crash logs. These are the only stack traces we have from anywhere on this bug. (Thanks guys!)

    • @bONchON's crash report log file: http://www.2shared.com/file/r4XgbEfG/af_tf_deadlock.html
    • @mikefrederick's stack trace pic: http://i.imgur.com/Yy9Fg.png
    • I thought I had discovered the bug in these crash logs. Follow along in @bONchON's crash log if you want (and since the SDK's networking code closely mirrors MVCNetworking you can follow along in there as well)
      • On thread 3 (NSURLConnection's thread), line 32 there is a time out.
      • On thread 6 (TestFlight's networking thread), line 7 NSURLConnection is calling -connection:didFailWithError on its delegate. I'm not sure but I think this could be the time out from thread 3.
      • On thread 6, line 4 the SDK calls -cancel on the NSURLConnection that has just told it that it failed earlier up the stack.
      • A bit deeper down there is a deadlock.
    • My theory is that this is a deadlock because TestFlight is calling back into the NSURLConnection that is calling it because it failed.
    • Try as I might I was not able to reproduce a deadlock like this.
  7. While trying to reproduce the bug in item 6, I stumbled on to another bug that looks very similar to our bug. I'm not sure if this could be our bug, but it's very similar and could be the real culprit or at least related.

    • We're not able to start any new HTTP network connections if we have 5 running connections to a very slow HTTP URL. In code, the new NSURLConnections appear to have started. On Charles, I can't see any network connection actually go out until the slow ones finish. We're using Charles to see the network connections and artificially slow some of them down. I run into this limit on the simulator in iOS 5.1 and 6.0. Steps to reproduce:

      • Create and start 5 NSURLConnections to a very slow HTTP URL (something that will take minutes to return). These connections may have different paths but must have the same domain.
      • Create and start an NSURLConnection to any other HTTP URL.

      I expect the new connection to go out right away but it appears to be waiting on the slow ones. It can take a long time. If I change the slow URL to HTTPS, other connections can start fine while the slow ones are running. I have sample code if you'd like to see it.

    • I have a sample project that reproduces this bug
    • I was not able to reproduce this bug with the TestFlight SDK for two reasons:
      • The SDK uses HTTPS
      • The SDK limits itself to 4 connections at a time.

So, we know that at the time this bug was mostly being reported, TestFlight's servers were having a lot of trouble keeping up with demand. The potential bugs in items 6 and 7 look like they only happen when connections are very slow. I think this explains why the bug affected a lot of people last spring but doesn't seem to be affecting anyone now.

I haven't been able to reproduce the bug in item 6 at all. Either way the potential offending code will be removed from the next major build of the SDK.

I haven't been able to reproduce the bug in item 7 inside the SDK. We are filing a radar to see if Apple can hopefully shed any light on the situation.

Sorry for how long this post is. I just want to let you guys in on the situation over here and keep you in the loop.

This bug doesn't seem to be affecting anyone anymore, but I would like to get to the bottom of it. If you are running into this issue or have any more information you could provide to us please let me know here or on TestFlight's support site: http://help.testflightapp.com. Mention the "afnetworking bug" and it will be forwarded to me.

Eric Patey

Jason,
Wow. Thanks for the detailed and frank information. It's nice to see you're on the case. Based on this, I'll turn it back on in our dev deployment and see what happens.

Steven Fisher

Thanks for the detailed info.

For what it's worth, despite TestFlight's beefed-up servers I'm still seeing more than an order of magnitude performance difference when downloading a small app between it (79s) and HockeyApp (6s). So if your bug requires slow web servers, it's probably still there.

Steven Fisher

Wait, there's a limit of five simultaneous connections? Is that to a single server only?

mackinra

Jason,

Thanks for the info on how TF is doing things. It helps to possibly explain the behavior I was seeing. When you mentioned TF servers being busy and those connections timing out, it made me wonder a few things:
1) have you tried switching networking from 3G to WiFi or vice-versa, or moving around (on 3G) in spotty coverage areas, while experiencing a slow connection? Perhaps the combination of the two would leave connections in a wedged state.
2) does the TF code implement its own timeout method (via NSTimer)? If not, that might be something to do, to hopefully cleanup/prevent such wedged connections.
3) like @tewha, I'm wondering about this 5 simultaneous connection thing... could TF be opening multiple connections, particularly when the connections slow down?
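To make question 2 concrete, here is a rough sketch of the "own timeout" idea: an independent watchdog timer that forces a failure outcome if the transport never reports back. It is written in Python as a language-neutral model, with `threading.Timer` standing in for NSTimer; `WatchdogOperation` and its methods are hypothetical names, and nothing here reflects how the TF SDK is actually written.

```python
import threading


class WatchdogOperation:
    """Hypothetical operation that reports exactly one outcome: a result or a timeout."""

    def __init__(self, timeout, on_done):
        self._on_done = on_done
        self._lock = threading.Lock()
        self._finished = False
        # Independent timer, analogous to scheduling an NSTimer next to the request.
        self._timer = threading.Timer(timeout, self._finish, args=("timeout",))
        self._timer.daemon = True

    def start(self):
        self._timer.start()

    def complete(self, result):
        # Called by the transport if a response actually arrives.
        self._finish(result)

    def _finish(self, outcome):
        with self._lock:
            if self._finished:
                return  # the other path already reported an outcome
            self._finished = True
        # The callback is invoked after the lock is released.
        self._timer.cancel()
        self._on_done(outcome)


if __name__ == "__main__":
    done = threading.Event()
    op = WatchdogOperation(0.05, lambda outcome: (print(outcome), done.set()))
    op.start()      # no response ever arrives, so the watchdog fires
    done.wait(1.0)
```

Whichever of the two paths reaches `_finish` first wins, and the later one is ignored, so the caller always gets exactly one callback even if the underlying connection is wedged.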

Jason Gregori

@tewha

re: slow servers - we use cloud front for app downloads. its unrelated to the servers used for the sdk.

re: 5 connections - i have no idea what is going on there. it seems like a bug in iOS. we are trying to contact apple and see if they can help on this front. the sdk never creates more than 4 connections at a time so should never run into this. but if this is a bug, maybe it's related to ours.

Steven Fisher

If TestFlight uses 4 connections and AFNetworking uses 2 or more the app has collectively exceeded 5 connections. Isn't that what you're describing?

mackinra

Exactly my thought... my app has multiple other connections open at any one time.

Jason Gregori

@mackinra

1) we have. still haven't been able to reproduce.
2) no. it uses NSURLRequest's -setTimeoutInterval: method and the timeout is set to 30 seconds. there don't seem to be any issues with the requests not timing out. like i said, i haven't been able to show that these bugs are the same bug you were all experiencing. and in my testing i used very large timeouts.
3) TF can have a maximum of 4 connections open at a time.

mackinra

Is this limit on connections or active requests? I'm thinking it's the latter, as the former would be a much bigger problem (not just for TF).

mackinra

@jasongregori - I take it your requests are all GETs? If not, I assume you know about the super-long timeouts on POSTs (which ignore your setTimeoutInterval)?

Jason Gregori

there doesn't seem to be a limit on the number of NSURLConnections, but after 5 really slow connections show up in Charles, all other connections seem to be waiting on the slow ones

Jason Gregori

"If TestFlight uses 4 connections and AFNetworking uses more than two the app has collectively exceeded 5 connections. Isn't that what you're describing?"

good point

Jason Gregori

@mackinra they are POSTs. I didn't know about super long timeouts on POSTs. does that happen on HTTP and HTTPS?

Jason Gregori

the other thing to note about the 5 connection limit. it doesn't appear to affect HTTPS requests and all of testflight's connections are HTTPS.

Jason Gregori

@epatey thanks eric, let me know if you have any problems.

mackinra

@jasongregori - the slow timeouts on POSTs happen on either, to my knowledge. The 'forced' timeout value is something like 240 seconds!

Piotr Tomasik

@jasongregori Regarding bullet 7: did you try running your 5 slow HTTP connections (thus hitting the limit) and then calling the TestFlight SDK, which is over HTTPS?

Jason Gregori

@tomasikp yes that would block testflight but in that case whatever the slow connection is would be blocking all other connections so it wouldn't really be testflight's fault.

Jason Gregori

@mackinra hmmmm. that is valuable information. thank you. if there is a way to get this bug to happen on HTTPS requests then that might have something to do with it.

Stan Chang Khin Boon
Brandon Fosdick
Dale Buckley

There is a lot of talk about the 5 concurrent network connections, but this limit only applies to connections over wifi. If the connection comes through over 3G the limit is reduced to 2 (it's controlled by the cellular networks so it can be different but it's typically 2) and 1 connection for EDGE.

TestFlight should really be limiting itself to no more than a single connection (or at the very least scaling its concurrent connections depending on connection type) and queuing its requests. It's supposed to be a background service which stays out of the way of the main app, not one that takes all available resources (in this case network connection slots), which could hinder functionality in the main app.

The POST timeout issue has been around for a long time; it affects both HTTP and HTTPS connections according to our experiments. We had to add our own timeout system to solve this problem in our own app, so it's definitely something you should consider looking at, @jasongregori, if all of TestFlight's requests are POSTs.
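To make the "own timeout system" idea concrete (a timer that runs alongside the request, as mackinra suggested with NSTimer), here is a minimal sketch. It is written in Python rather than Objective-C, and `RequestWatchdog`, `on_timeout` and `finish` are hypothetical names for illustration only; nothing here is taken from AFNetworking, the TestFlight SDK, or the app described above.

```python
import threading


class RequestWatchdog:
    """Calls `on_timeout` exactly once unless `finish()` is called first.

    Hypothetical helper: the caller is expected to cancel the underlying
    request inside `on_timeout`.
    """

    def __init__(self, seconds, on_timeout):
        self._lock = threading.Lock()
        self._done = False
        self._on_timeout = on_timeout
        self._timer = threading.Timer(seconds, self._fire)
        self._timer.daemon = True  # never keep the process alive for a timer

    def start(self):
        self._timer.start()
        return self

    def finish(self):
        """Mark the request complete. Returns False if the timeout already fired."""
        with self._lock:
            if self._done:
                return False
            self._done = True
        self._timer.cancel()
        return True

    def _fire(self):
        # The lock guarantees that only one of finish() / _fire() wins.
        with self._lock:
            if self._done:
                return
            self._done = True
        self._on_timeout()
```

A caller would start the watchdog just before sending the POST, cancel the connection inside `on_timeout`, and call `finish()` from the completion handler. The point of the design is that the deadline no longer depends on the networking layer honouring the request's own timeout interval.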

It's great to hear that this bug has started to get some serious investigation into its roots again. I really want to use TestFlight, but I can't get the approval from the guys above until I can prove that this issue has been eradicated for good. Keep up the good work @jasongregori (and congrats on the job!). If you find anything else about this issue then let us know on here and I'm sure there will be someone around here to help.

Dale Buckley

Just to clarify, could the issues be directly linked to TestFlight taking up all of the available network connection slots with POST requests that aren't timing out in a reasonable amount of time?

This would fit in with the load problems the servers were having a while ago and why the issue seemed to ease after the servers were beefed up. Once the servers could handle more traffic, fewer POST requests would time out at peak times, so the issue would seem to vanish.

It's just a theory and it seems a bit too simple, but it seems to add up.

If this turns out to be the case, then adding in the proposed solutions in my other post could help solve the issue once and for all, having a play can't hurt.
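The theory can be checked with simple arithmetic. The sketch below is a toy model, not a measurement: it assumes a hard cap of 5 connection slots (the number observed earlier in the thread), first-come-first-served scheduling, and that a stalled request occupies its slot until it times out. `finish_times` is a hypothetical helper. The 240-second and 30-second figures are the timeout values mentioned earlier in the thread.

```python
import heapq


def finish_times(durations, slots):
    """Return the finish time of each request when at most `slots` run at once.

    Requests start in list order, each taking the earliest free slot.
    """
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    result = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        heapq.heappush(free_at, end)
        result.append(end)
    return result


# Five stalled requests followed by one quick 1-second request.
with_forced_timeout = finish_times([240] * 5 + [1], slots=5)  # ~240 s POST timeout
with_short_timeout = finish_times([30] * 5 + [1], slots=5)    # 30 s timeout honoured

print(with_forced_timeout[-1], with_short_timeout[-1])
```

Under these assumptions the quick request finishes at 241 seconds in the first case and 31 seconds in the second, which is consistent with the idea that shorter timeouts (or fewer slots taken by a background service) would make the symptom much less visible.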

Stan Chang Khin Boon

Post accidentally got deleted, reposting:

Hi @dlbuckley,

I have heard the argument that 3G networks have a limit of 2 concurrent connections, but I haven't gotten anyone to provide a citation for it. So I'm skeptical about it for now.

The closest I have seen is a pointer to the HTTP 1.1 spec, but that limit only applies per host (or "server").

http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.4

Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.

In fact, in my test, Mobile Safari has a limit of 5 concurrent connections per host (or "server") and 20 max concurrent connections. http://cl.ly/image/0O1B1h232a25

On desktop, Firefox sets this value to 15 concurrent connections per host (or "server") and 256 max concurrent connections. http://cl.ly/image/260G3N1Z3f0P

I'm really against implementing the limit anywhere higher than NSURLConnection until we can verify whether NSURLConnection or something below it (CFNetwork) already has such a limitation in place. We might be overriding an optimisation that is already there.

Also, there's HTTP pipelining: imposing a limit will reduce its effectiveness.

That being said, I agree that TestFlight shouldn't be competing for resources in the app.

Steve Madsen

@dlbuckley To echo @lxcid, any concurrent connection limitations are per-host, not system-wide.

As to detecting what kind of connection is in use, this is generally a bad idea. As Apple points out every year in the networking session at WWDC, just because the device says you're on Wi-Fi doesn't mean you can expect Wi-Fi performance characteristics. You could be on a bus or tethered, where the Wi-Fi connection is backed by cellular.

Now that cellular can mean EDGE, 3G or LTE, even knowing you are on cellular doesn't tell you nearly enough to impose an arbitrary concurrency limit.

Multiple NSURLConnection instances making GET requests to the same server will be pipelined if the server supports it and your app won't have any knowledge of it. POST requests should not be pipelined and I'll guess that NSURLConnection enforces it.

Jason Gregori

Let me reiterate that the bug I found doesn't seem to affect HTTPS which means it wouldn't directly affect the TestFlight SDK. At least in the form I have discovered. But perhaps it is related.

@lxcid

  1. We are only using 1 queue and its max concurrent operation count is set to 4.

  2. The number of concurrent connections per host isn't a problem. The problem is that there seems to be a maximum number of connections to anywhere.

@bfoz

That's true. The thing I need to figure out is what all the apps of those of us on this thread have in common that is making this bug happen. I know that the app I was working on last spring had a lot of network connections all the time, downloading lots of images and data basically every time a user went to a new view controller.

Everyone, and @bfoz: do the affected apps have a ton of network connections?

Brandon Fosdick
Jason Gregori

@dlbuckley that's one of my theories as well. do you have any more information about these limits? Did you find them by testing? Do you know how they affect HTTPS?

Jason Gregori

@dlbuckley @lxcid one thing I've noticed is that we seem to be limited to 5 HTTP connections per app. So even though in my test app the connections are all waiting, when I open safari there are no problems.

Every request in the SDK is a POST but it is over HTTPS. I don't think it's the same as HTTP pipelining, but requests do share the SSL connection. This might be something that also changed when the server was beefed up. Sharing the SSL connection makes things a lot faster. I think @lxcid is right that setting a smaller limit might actually slow it down more.

I definitely agree that TestFlight should not be competing for resources in your app. What I don't know is how limited are these resources and how should TestFlight limit itself? Should we take it down to 2 max connections?

@bfoz I think Flurry can make a lot of network calls. Question: does the bug show up right away/often? If I sent you a special build with 1 or 2 max connections to TestFlight do you think you would be able to see a difference?

Steve Madsen

@jasongregori What you're seeing is keep-alive. The existing connection to the server is re-used to avoid the TCP (and SSL/TLS for HTTPS) handshaking, which takes a couple of hundred milliseconds in the best case.

Keep-alive is different than pipelining. In keep-alive, the client sends a request and waits for a response. Then it can send another request over the same TCP connection and wait for the next response.

Pipelining allows the client to send multiple requests without waiting for responses. It improves latency because while the client is receiving one response, the server can begin processing the next request.
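The difference between the three modes can be put into a back-of-the-envelope model. The sketch below is a simplification under stated assumptions: a fixed handshake cost per new connection, a fixed round-trip time, a fixed server processing time per request, requests small enough that transmission time is negligible, and a server that processes pipelined requests one after another. The function names and the sample numbers are hypothetical; only the "couple of hundred milliseconds" handshake comes from the comment above.

```python
def new_connection_each_time(n, handshake_ms, rtt_ms, service_ms):
    """Every request pays for its own handshake, round trip and processing."""
    return n * (handshake_ms + rtt_ms + service_ms)


def keep_alive(n, handshake_ms, rtt_ms, service_ms):
    """One handshake, then each request waits for its response before the next."""
    return handshake_ms + n * (rtt_ms + service_ms)


def pipelined(n, handshake_ms, rtt_ms, service_ms):
    """One handshake; requests are sent back to back, so the round trip
    overlaps with processing and is paid roughly once."""
    return handshake_ms + rtt_ms + n * service_ms


# Illustrative numbers only: 4 requests, 200 ms handshake, 100 ms RTT, 50 ms processing.
args = (4, 200, 100, 50)
print(new_connection_each_time(*args), keep_alive(*args), pipelined(*args))
```

With these sample numbers the model gives 1400 ms, 800 ms and 500 ms respectively. Real connections will not match it exactly, but it shows why reusing a connection matters most when the handshake is expensive, and why pipelining helps most when round-trip time dominates.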

Jason Gregori

thanks for the explanation @sjmadsen

Brandon Fosdick
Jason Gregori

@bfoz same here. that would be great. trial and error may be the only way to fix this. does your app need armv6 support or is armv7 support and up ok?

Brandon Fosdick
Jason Gregori

@bfoz ive sent a special build to the email on your profile page. another question: are you using testflight checkpoints and if you are how many do you have in your app?

Brandon Fosdick
Jason Gregori

@bfoz that is really helpful to know because most of the connections going out of the SDK are checkpoints. are you using remote logging?

Jason Gregori

@bfoz thanks a lot. let me know if you need anything.

Brandon Fosdick
Jason Gregori

Hey everyone, I created a special build of the TestFlight SDK to see if it would make @bfoz stop experiencing the bug. He hasn't had a chance to try it out yet. I'm very eager to see if this helps solve problems for people. If anyone else would like to give it a shot and let me know how it works out, please do.

The only difference in this build and the current beta build is that only 2 connections are allowed at once instead of 4. NB: It also drops support for armv6. Here is the build:

http://cl.ly/1Z3d2R3R081K

Brandon Fosdick
Jason Gregori
Dale Buckley

I would have jumped on the opportunity to test this out but it took so long to find a solution we had no choice but to switch to another solution (hockeyapp in our case), so to switch everything back (all of our CI setup) just for a test isn't something we can do.

I will say that testflight should never use more than 2 simultaneous connections, so even if it's a test it should be deployed with that limit regardless.

Jason Gregori

@dlbuckley I totally understand. That's basically what happened at the last company I worked at. Hopefully we can fix this bug now and make it a viable alternative for you if you ever decide to try it again in the future.

Question: Do you think having 2 simultaneous connections is OK? Is that too much for you?

Dale Buckley

I also hope that Testflight is made a viable option again in the future as I prefer the administrative web UI, but at the moment HockeyApp provides us with a stable SDK for crash reporting and a few more options in general.

As I said previously, I don't think a background service should be taking over the majority of resources available to an app, but at the same time this can be more critical to some apps than others.

Take our app for example; it is completely network dependent for every action performed in the app, so when the Testflight SDK took up all available network connections it completely locked down the app and made it unusable. In our case I would be reluctant to allow any background service to make any more than a single connection, but if an app only downloads some information at launch, or if it's a game and doesn't download any information at all, then I would allow it to take more resources.

Due to the different app requirements I would personally make it a user-definable option. This way a developer can give as few or as many resources to testflight as they feel is sufficient depending on their solution.
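A user-definable limit of this kind is usually just a counting semaphore around each request. The sketch below shows the idea in Python; `ConnectionLimiter`, `max_connections` and `slot` are hypothetical names, and this is not how the TestFlight SDK is implemented (the thread only says it uses one queue with a maximum concurrent operation count).

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Allows at most `max_connections` requests to be in flight at once."""

    def __init__(self, max_connections=2):
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.active = 0  # requests currently holding a slot
        self.peak = 0    # highest value `active` has reached

    @contextmanager
    def slot(self):
        """Block until a slot is free, hold it for the `with` body, then release it."""
        self._slots.acquire()
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            yield
        finally:
            with self._lock:
                self.active -= 1
            self._slots.release()
```

A background service would wrap each upload in `with limiter.slot():`, and the host app would choose `max_connections` (for example 1 for a network-heavy app, more for a game that rarely touches the network). The `peak` counter exists only so the limit can be verified in tests.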

Just a quick question, have you looked at trying to implement a shorter timeout for your POST requests yet?

Jason Gregori

Thanks, that's good information to have. I'm looking into rewriting our entire networking code, so shorter POST timeouts will definitely be a part of that.

Dale Buckley

Sounds good!

I'm looking forward to using testflight again in the future :)

Taylor Halliday

As someone who is about to jump on the user testing bandwagon (never used testflight or HockeyApp), what is the status of its compatibility with AFNetworking? My project is heavily dependent on multiple 3rd party services (FB, our own rails API) for its content - a good comparison for the level of network pull would probably be FB or Instagram. I'm using AFNetworking for all calls except for the ones that FB makes on its own.

I've scanned this thread and it looks like the network failure instances are sparse - TestFlight can't replicate it?

Taking a step out of the technical weeds, does the compatibility just come down to whether or not I'm willing to accept a 1 in 1000 network failure instance while testing the app on testflight? All else is good? Trying to get a sense how big of a deal this really is prior to jumping into Testflight.

Jason Gregori

Hi @tayhalla, you're basically correct. This issue affects a tiny percentage of apps but when it does, it's pretty terrible. When it does affect an app it seems to affect it fairly regularly. We don't think the issue is directly related to AFNetworking. We think it has to do with an app that has a lot of slow network connections. I also think it had a lot to do with how slow TestFlight's servers were until the end of Spring. We have not been able to reproduce it. The current solution we are trying to test is limiting the SDK to 2 simultaneous network connections.

So to answer your question: it comes down more to this. You are unlikely to run into this issue, but on the chance that you do, it will probably affect more than 1 tester and you will want to turn the SDK off.

Also, it might help to know that this bug is a huge priority for me, and I won't let it fade away until it isn't affecting the people here on this thread.

Taylor Halliday

Hi @jasongregori, thanks for the response. Good to hear you guys are on it. I'll be sure to launch you anything that might be useful as I start to crack open Testflight for our purposes - assuming I run into connectivity issues / timeouts with AFN.

In the meantime, if people have suggestions for best practices to avoid these issues on the iOS side, those would be great to have on here as well.

Brandon Fosdick
Jason Gregori

That's great to hear. Just so everybody knows where we are: this fix is not in 1.2 but as soon as 1.2 is released, I will put this fix into the 1.2.1 beta and it will be in all subsequent releases.

Hopefully this puts this dreadful bug behind us once and for all. I will keep you all updated here when these fixes go into place. If you'd like to use the SDK, hopefully, when 1.2.1 hits, you will be able to start without having to worry about networking issues. Thank you all for all the helpful information you've given me and thank you @bfoz especially for testing this build of the SDK.

Dale Buckley

Good to hear that the simultaneous network connections limit appears to have fixed the issues.

What's the news on testflight's handling of the POST timeouts, @jasongregori? Is that part of the bigger change you have planned for the reworked networking code you talked about previously?

Jason Gregori
Chitradeep Dutta Roy

Yes, the bug is not specific to the AFNetworking library. In my project I open an NSURLConnection on a secondary thread from inside an NSOperation subclass, and when the app enters the foreground it often fails with a "Request Timed Out" error. When I was not letting the concurrent operation run on a secondary thread it seemed to work fine, but as soon as I changed the code to use another thread for downloading, the problem started occurring. My NSMutableURLRequest used HTTP pipelining, and for some reason disabling HTTP pipelining solved the issue.

Now even on a secondary thread with http pipelining disabled the timeout is not occurring.

Joseph Heenan

@jasongregori I've read the whole thread but am unclear on the current status - is there a version of testflight available that doesn't have this issue? I can only see 1.2 beta 2 on the testflight site, and your 17th December post seems to indicate that doesn't have the fix. Thanks!

Jason Gregori

@jogu you are correct, that version does not have the fix. But I have just released a new version. See my next post for more information.

Jason Gregori

Everyone: I've just released a new version of the SDK that has our fix in it. It's in beta but please try it out and give me some feedback. It's not quite so easy to get to because we are in the middle of transitioning from team tokens to app tokens and the transition is not quite complete. So to use this beta you will have to turn on app tokens for your testflight account if you haven't already (this will be switched for everyone soon). Steps:

Now when you go to download the SDK (https://testflightapp.com/sdk/download/) you will see 1.2 as the release version and 1.2.3 BETA 1 as the beta. 1.2.3 BETA 1 is the one with our fix in it. Please try it out!

Important: If you haven't switched to app tokens yet you will have to start using them instead of team tokens with the SDK. They only work with apps that have already been uploaded and not new apps right now (which is why it hasn't been rolled out to everyone yet).

Joseph Heenan

Thanks Jason, that's great - I'll take a look at this over the next week or so and let you know how I get on!

Joseph Heenan

@jasongregori We're using the new sdk etc in 2 apps, everything seems to be working fine.

Jason Gregori

@jogu glad to hear it!

Mingming Wang

I experienced this bug without using TF. The scenario is when the JSON response is empty (only "[]" in the response) and I use a dispatch_group similar to this SO answer. It happened 100% of the time.
