Error on finalization - FineUploader S3 Concurrent Chunk Uploads #1519

Closed
jfpaulin opened this Issue Jan 21, 2016 · 25 comments


@jfpaulin

We are using FineUploader to upload big files directly to our Amazon S3 buckets. Here is my configuration:

fineUploaderReference = $("#fine-uploader").fineUploaderS3({
    debug: true,
    uploaderType: 'basic',
    button: $('#attach-button'),
    maxConnections: 5,
    validation: { sizeLimit: 31457280000 },
    retry: { enableAuto: true, autoAttemptDelay: 5, maxAutoAttempts: 50 },
    chunking: { enabled: true, partSize: 5 * 1024 * 1024, concurrent: { enabled: true } }, // 5 MB parts
    resume: { enabled: true, recordsExpireIn: 1 },
    messages: {
        'onLeave': 'Files are still uploading, are you sure you want to leave this page and cancel the upload?',
        'sizeError': 'Free account are limited to: 29.297 GB, please login for more options',
        'emptyError': 'File is empty'
    },
    signature: { endpoint: '******' },
    iframeSupport: {
        localBlankPagePath: '/empty.html'
    },
    request: {
        endpoint: 'https://*****.s3.amazonaws.com',
        accessKey: '*******'
    }
});

Sometimes we get an error when trying to finalize the file on S3. Here is the error we are receiving:

[Fine Uploader 5.5.0] Submitting S3 complete multipart upload request for 0 s3.jque...s?r=3.6 (line 16)
[Fine Uploader 5.5.0] Sending POST request for 0 s3.jque...s?r=3.6 (line 16)
POST https://**********.s3.amazonaws.com/14533..._8j14EDjuVHXXfg6wNHuIai84IBn58Ig_GUhZg5hOLT3tvns    200 OK        78ms    s3.jque...s?r=3.6 (line 17)
[Fine Uploader 5.5.0] Complete response status 200, body =
<Error>
  <Code>InvalidPartOrder</Code>
  <Message>The list of parts was not in ascending order. Parts must be ordered by part number.</Message>
  <UploadId>F0UOQTefZmRcg0L_P9tEe9SLxza7mInpsiXPPiR58ARb6Zh4UOI_wkh_t_Cc8eL.ZTe1NGxQiNeRFNkD_8j14EDjuVHXXfg6wNHuIai84IBn58Ig_GUhZg5hOLT3tvns</UploadId>
  <RequestId>02F41891E40C60F2</RequestId>
  <HostId>+OSli4SW3A+ZKlJTk3EWSEcQHu4r6lGcbCfcLWKGcjOoaWC6h5hUTHCoLHEQap1VUPqUlXzVHFg=</HostId>
</Error> s3.jque...s?r=3.6 (line 16)
[Fine Uploader 5.5.0] Missing bucket and/or key in response to Complete Multipart Upload request for 0.

This seems to happen only if one of the chunks gets an error during the upload, even though it appears to resume/retry correctly. I was able to reproduce the problem only when trying to upload big files (i.e. 5 GB+).

Everything works fine if I turn off concurrent chunk uploads. Even if some chunks get an error during the upload, the resume/retry works correctly and the file is finalized at the end.

Let me know if you need more details.

@rnicholus
Member

Sounds like it could be a bug: when a chunk fails with concurrent chunking enabled, the result may be an unordered list of parts in the digest request sent to S3 after all chunks have been uploaded. Once all parts are uploaded, Fine Uploader sends a "complete multipart upload" request. The body of that request contains an ID (an ETag, to be specific) for each chunk, and the entries are supposed to be in ascending part-number order. I'll look into this in the near future.
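
For context, the body of S3's Complete Multipart Upload request is an XML manifest listing every part's number and ETag in ascending order. Below is a minimal sketch of building that manifest; it is illustrative only, not Fine Uploader's internals, and the buildCompleteManifest name and the parts shape are assumptions:

// Illustrative sketch only (not Fine Uploader's code): assembling the
// Complete Multipart Upload body from {partNumber, etag} records gathered
// as each chunk PUT succeeds. S3 requires the <Part> entries to be sorted
// by ascending PartNumber, and part numbers start at 1.
function buildCompleteManifest(parts) {
    var body = "<CompleteMultipartUpload>";

    parts
        .slice() // avoid mutating the caller's array
        .sort(function (a, b) { return a.partNumber - b.partNumber; })
        .forEach(function (part) {
            body += "<Part>" +
                "<PartNumber>" + part.partNumber + "</PartNumber>" +
                "<ETag>" + part.etag + "</ETag>" + // ETag values keep their surrounding quotes
                "</Part>";
        });

    return body + "</CompleteMultipartUpload>";
}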

@rnicholus
Member

@jfpaulin Can you provide me with Fine Uploader JS/browser logs for the point when the chunk upload fails and then successfully retries? I'm also interested in the request body for the Complete Multipart Upload POST request that Fine Uploader sends after all parts have been uploaded.

I don't see how the parts could be out of order, as they are explicitly ordered just before the Complete POST is sent. Perhaps something is going wrong with the numbering of the parts. I suspect that, after an out-of-order chunk fails, when its index is added back to the "remaining chunk indexes" array, something unexpected is happening before that chunk is retried. The logs I have requested from you may shed some light on this.

@rushimusmaximus

We've been getting reports recently of 5 GB+ files failing too (also with the S3/concurrent chunk setup). Unfortunately I haven't been able to personally recreate the issue after several attempts on multiple browsers. I did get some logs from someone who hit the problem, which I hope might be helpful. (This is on version 5.1.3.)

[Error] Failed to load resource: The network connection was lost. (...S3 URL...)
[Error] TypeError: undefined is not an object (evaluating 'e.upload')
_registerProgressHandler (anonymous function) s3.jquery.fine-uploader.min.js 17:5363
(anonymous function) s3.jquery.fine-uploader.min.js 18:24101
each s3.jquery.fine-uploader.min.js 15:8058
success s3.jquery.fine-uploader.min.js 15:12332

I'm going to keep testing, seeing if I can get it to happen for me.

@rnicholus
Member

5.1.3 is quite old. I'll either need to recreate this myself or get access to the logs I requested in my last comment in order to get to the bottom of this. I'll take a closer look sometime next week.

@rushimusmaximus

@rnicholus Yeah, I figured you'd say that ;) We're releasing an update to 5.5 "real soon now". The line numbers might not be helpful to you, but I was hoping the general error might provide some insight. Basically, it seems like this may happen when there are network errors. I wasn't able to reproduce it today, but I'm in contact with someone who thinks they can do it reliably, so I'm going to work on getting you some better information.

@rnicholus
Member

@rushimusmaximus The error message in your comment above seems to be related to the network connection issue on your end, and is otherwise benign as far as I can tell. I suspect that the original issue reported here is related to assigning part indexes to retried out-of-order chunks.

@jfpaulin
jfpaulin commented Feb 1, 2016

@rnicholus sorry for the delay, I was out of the office last week. I will try to get those logs for you this week.

@jfpaulin
jfpaulin commented Feb 1, 2016

@rnicholus I attached two files. The first one, 'all.logs.txt', contains the full console log for my failed upload; I included everything to be sure you have all you need, since this is hard to reproduce. You can see the first chunk error around line 9430.

The second file, 'complete.post.txt', contains the request headers and body for the complete multipart POST.

What was weird is that this time I had to force a resume on the file, because after uploading the last chunk everything stopped: no more progress, no more POST/PUT requests, and no more logs in the console. So I forced a resume by refreshing the page and selecting the same file, and then I was able to get the error on the complete POST.

I'll try to reproduce it again without interfering, but I'm not sure that changes anything.

all logs.txt
complete post.txt

@jfpaulin
jfpaulin commented Feb 1, 2016

@rnicholus I also noticed that when resuming the file, FineUploader re-uploaded the chunk that had failed, then did the complete multipart POST, which failed.

Let me know if you need more details.

@rnicholus
Member

Chunks 1344-1347 initially failed due to issues on S3 (these are surprisingly common), but then recovered. Looking at the complete POST request, part number 1349 is duplicated:

<Part>
   <PartNumber>1349</PartNumber>
   <ETag>"ea1968a50c26c099597724727c36abc8"</ETag>
</Part>
<Part>
   <PartNumber>1349</PartNumber>
   <ETag>"ea1968a50c26c099597724727c36abc8"</ETag>
</Part>

Seems like this is causing S3 to reject the complete POST.

It's not clear why that specific part number was repeated. That chunk did not fail, but it is close in range to the chunks that did and were retried. I'll have to look closer.

@jfpaulin
jfpaulin commented Feb 2, 2016

@rnicholus Do you think the fact that the file upload stopped after uploading the last chunk is also related to this problem?

@rnicholus
Member

Not sure. I don't see anything unusual in the logs. It may have just been a pending request to S3 that Fine Uploader was waiting for.

@jfpaulin
jfpaulin commented Feb 2, 2016

OK. I'm still running tests on this, and every time I hit that problem the upload stops and I have to force a resume. I can try to do it again and send you the POST/PUT logs for the last chunks?

@rnicholus
Member

@jfpaulin and @rushimusmaximus Is this something that can be easily reproduced by you or one of your customers? If yes, I'd like to consider giving you an updated version with some more logging so I can have an easier time determining why a part/ETag is duplicated in a failure scenario. Otherwise I'll just push out a pre-release with some code that removes any duplicates before the "complete multipart" REST call to S3.

@rnicholus
Member

You can disregard my last message; after looking at the logs even closer, I think I see what is causing the duplicate part number in the manifest. When an error on S3 forced a batch of concurrent upload requests to fail, one of those requests actually succeeded.

The batch consisted of parts 1344-1348 as far as I can tell, which makes sense, @jfpaulin, since you have the maxConnections option set to 5. Parts 1344-1347 failed but 1348 somehow succeeded (or at least that is what the logs indicate). I mentioned that the manifest contained a repeated part number of 1349. Amazon, for inexplicable reasons, decided that part numbers must start at 1 instead of 0, so while Fine Uploader's core logging code reports this part as 1348, AWS sees it as 1349. Anyway, due to some apparent error in Fine Uploader's concurrent chunking code, 1348/1349 was uploaded again when the batch was retried. I'm not certain why this happened, but it is what produced the duplicate entry, since the same part was uploaded twice.

Uploading the same part twice isn't a problem in itself, but reporting the same part twice appears to be. I should probably determine why a seemingly successfully uploaded part was re-uploaded; I imagine that will lead to a solution to this problem.
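
To make the off-by-one above concrete: Fine Uploader logs chunks by zero-based index, while S3 part numbers are one-based, so the mapping is simply an offset of one (illustrative snippet, not library code):

// Fine Uploader logs chunks by zero-based index; S3 part numbers start at 1.
var chunkIndex = 1348;             // the part as it appears in Fine Uploader's logs
var s3PartNumber = chunkIndex + 1; // 1349, the PartNumber S3 sees in the manifest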

@rnicholus
Member

This is looking to be caused by a very unusual race condition. I'm not entirely certain about the chain of events, and reproducing looks to be difficult. I have not had any luck yet. The concurrent chunking feature is complex, as is the code. I'm hesitant to make any changes to the logic in this code unless I am 100% certain of the outcome and am certain that the change needs to be made. I also thought about simply checking the internal array that maintains the part numbers yet to be uploaded for a match before adding a new part number, but this check would occur often, and for very large files, this array could be quite large and examining it for a duplicate could be costly. Unless someone has some thoughts or insight regarding reliable reproduction, I may just opt for an easy "fix" that involves removing a duplicate entry in the "complete multipart" manifest just before this request is sent.

I really don't think this has any relation to the size of the file (6GB vs 500 MB), but I can see how the issue would be frustrating when a very large file fails to upload at the last step and is not retryable. Just out of curiosity, how often are your customers seeing this issue?
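
For reference, the duplicate-stripping workaround mentioned above could look something like the sketch below. This is only a sketch; dedupeParts and the parts shape are hypothetical names, not Fine Uploader's API:

// Sketch of the proposed workaround (hypothetical helper, not Fine Uploader's
// API): drop repeated part numbers from the recorded parts just before the
// "complete multipart upload" manifest is built.
function dedupeParts(parts) {
    var seen = {};

    return parts.filter(function (part) {
        if (seen[part.partNumber]) {
            return false; // an ETag for this part number is already recorded
        }
        seen[part.partNumber] = true;
        return true;
    });
}

// e.g. two records for part 1349 collapse to one:
// dedupeParts([{partNumber: 1349, etag: '"ea19..."'}, {partNumber: 1349, etag: '"ea19..."'}])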

@rupert1073

Hi Ray, thanks for all your work. Our clients were having this issue daily with concurrent chunks on; the bigger the files, the more often the issue happens. I can reproduce it most of the time with files larger than 15 GB, but once in a while the file goes through. We'll be happy to help you with the tests.


@rnicholus
Member

If this truly does take that large of a file to reproduce, then I will have difficulty reproducing myself due to limited bandwidth at my current location. I'm still not convinced that file size is really a factor. A request error is required to reproduce, and this is more likely to occur with much larger files somewhere along the way. That is probably why you are seeing this with larger files. Did you say you are able to reproduce yourself very easily? If so I'd like to be able to send you updates for testing/verification in order to get to the bottom of this as quickly as possible. Will that work for you?

@rupert1073

Absolutely, it will be our pleasure to help you with this.


@rnicholus
Member

It will probably be easier to continue via email, then. Can you contact me at [redacted]? I'll have a build for you with some more logging and a first naive attempt at a fix sometime on Monday.

@rupert1073

Hi @rnicholus, did you receive my email?

@rnicholus
Member

Yes, I did. I'm currently GMT+7, so I will be sure to respond with details in the morning. Thank you for your patience.

@rnicholus rnicholus changed the title from FineUploader S3 Concurrent Chunk Uploads to Error on finalization - FineUploader S3 Concurrent Chunk Uploads Feb 13, 2016
@rnicholus rnicholus added this to the 5.5.1 milestone Feb 13, 2016
@rnicholus
Member

This issue now has my full attention, and as soon as I fix it, I'll push out a 5.5.1 release. This is a tricky one, so bear with me.

@rnicholus rnicholus added a commit that referenced this issue Feb 13, 2016
@rnicholus rnicholus fix(concurrent chunking): first attempt to fix S3 finalization issue
Also added some more logging in case this doesn't work.
#1519
8109d67
@rnicholus rnicholus added a commit that referenced this issue Feb 17, 2016
@rnicholus rnicholus fix(concurrent chunking): account for cancelled chunk uploads that are still waiting for signature

This should prevent a chunk upload that has not yet called xhr.send() from starting if it has been cancelled by error handling code.
#1519
bc22159
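
The commit description above suggests a guard along these lines; the snippet is a sketch under assumed names (chunkState, sendChunkIfStillWanted), not the actual patch:

// Illustrative guard (not the actual patch): a chunk whose signature request
// is still in flight may have been cancelled by error handling in the
// meantime, so check before the upload request is actually sent.
function sendChunkIfStillWanted(chunkState, xhr, payload) {
    if (chunkState.cancelled) {
        // Error handling already gave up on this chunk/batch; starting the
        // upload now could produce an extra (duplicate) part on S3.
        return false;
    }

    xhr.send(payload);
    return true;
}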
@rnicholus
Member

5.5.1-3 may have the fix for this issue. I have one person already testing, but was wondering if anyone else is interested in verifying as well? Since this change touches the most complex code in the library, more testers is a good thing.

@rnicholus rnicholus added a commit that closed this issue Feb 23, 2016
@rnicholus rnicholus docs(concurrent chunking): prepare for 5.5.1 release
Removed temporary logs.
fixes #1519
16f1476
@rnicholus rnicholus closed this in 16f1476 Feb 23, 2016
@rnicholus
Member

This has been released as 5.5.1, now on npm and also available via the website.
