IQSS/6829 - account for file failures #6857
Conversation
and the dataset pid being set multiple times.
for directUpload case
used in DatasetPage and editFilesFragment.xhtml
@qqmyers This is still failing for me on create. Server log:
[2020-04-30T18:55:37.467+0000] [Payara 5.201] [SEVERE] [] [javax.enterprise.resource.webcontainer.jsf.context] [tid: _ThreadID=87 _ThreadName=http-thread-pool::jk-connector(1)] [timeMillis: 1588272937467] [levelValue: 1000] [[
[2020-04-30T18:55:37.680+0000] [Payara 5.201] [SEVERE] [] [javax.enterprise.resource.webcontainer.jsf.context] [tid: _ThreadID=91 _ThreadName=http-thread-pool::jk-connector(5)] [timeMillis: 1588272937680] [levelValue: 1000] [[
[2020-04-30T18:55:37.822+0000] [Payara 5.201] [SEVERE] [] [javax.enterprise.resource.webcontainer.jsf.context] [tid: _ThreadID=89 _ThreadName=http-thread-pool::jk-connector(3)] [timeMillis: 1588272937822] [levelValue: 1000] [[
Browser console:
Hmm - the last console line with 3:3 in the middle shows that all three files are uploading successfully, which was not the case in the original bug. One other thing to check is which calls are being made in the 'Network' panel of the browser dev tools. After the calls to upload the files to S3, there should be three calls to EditFilesPage.handleExternalUpload(). If those are failing, the error code and response might help debug. FWIW: I did not have trouble with an EC2 instance spun up with the ansible script, which I think is/was still glassfish. If you're on payara, the NPEs in the log could be related to that - I just did a quick check on the web and found some mentions. If that's the problem, we might want a new bug and some help from people more familiar with the latest payara/J2EE/JSF updates.
@qqmyers I will recheck the browser and network logs.
That error means your server can't find the Amazon profile file. That could be because you don't have one, or because the Dataverse server is running as a different Unix user than the one you normally use, etc.
@qqmyers I am seeing some weird behavior. Things are not working at this point. I uploaded 3 small files, then on save, got an exception:
[2020-04-30T20:17:11.704+0000] [Payara 5.201] [WARNING] [] [edu.harvard.iq.dataverse.ingest.IngestServiceBean] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(3)] [timeMillis: 1588277831704] [levelValue: 900] [[
[2020-04-30T20:17:11.805+0000] [Payara 5.201] [WARNING] [] [edu.harvard.iq.dataverse.ingest.IngestServiceBean] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(3)] [timeMillis: 1588277831805] [levelValue: 900] [[
[2020-04-30T20:17:11.848+0000] [Payara 5.201] [WARNING] [] [edu.harvard.iq.dataverse.ingest.IngestServiceBean] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(3)] [timeMillis: 1588277831848] [levelValue: 900] [[
[2020-04-30T20:17:12.645+0000] [Payara 5.201] [SEVERE] [] [javax.enterprise.resource.webcontainer.jsf.application] [tid: _ThreadID=96 _ThreadName=http-thread-pool::jk-connector(2)] [timeMillis: 1588277832645] [levelValue: 1000] [[
The stack trace about file size is caused by the failure to get the file metadata from S3. It will break the display of that dataset unless you add a non-null file size in the db. I think you can still delete the dataset, though. It's less clear why you can't see the files on S3. The "New host dataverse id" line suggests that you changed the host dataverse at some point, which changes the DOI and file paths. Can you check S3 and see if the files are just at the wrong path? If that's the case, and you changed the dataverse after uploading files, the fix might be to clear out uploaded files when the dataverse changes (which could change the storage driver used too). Or we could keep the DOI when recreating the Dataset.
@qqmyers In the case of that one failed dataset, the directory does not exist on S3. I have been able to upload on create in a root directory. Just tried again in a newly created sub directory; all 3 files appeared to complete uploading but only appeared at the bottom, with a stack trace in the logs:
[2020-04-30T21:27:58.800+0000] [Payara 5.201] [SEVERE] [] [javax.enterprise.resource.webcontainer.jsf.context] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(3)] [timeMillis: 1588282078800] [levelValue: 1000] [[
I need to step out for a couple hours, but it seems that I'm getting wildly different results. I'll try to retest, being very methodical, but aside from that one time starting payara as root rather than as the dataverse user, everything else was a straightforward upload.
It's hard to tell, but I'd suspect you've identified two new issues: changing the dataverse when creating a dataset breaks direct file uploads (only if they are started prior to the dataverse change?), and payara breaks direct uploads. If this PR solves the problem on glassfish for cases where the dataverse doesn't change, it fixes an issue (or at least an issue, since I was able to see a problem on glassfish even when I didn't change the dataverse).
@qqmyers I'm not sure about changing the host dv; I did not do that. What has happened, and is a UX issue, is that focus lands on the host dv text box when the create dataset page loads, so often, when trying to enter the title, you end up altering the host dv name and then have to edit and reselect it to get the correct name back. This happens before any data or files are uploaded, and I believe I simply reselected the original dv, so I'm not sure why this would be triggered.
@qqmyers OK, I have tried testing more carefully and recorded the results. Here is the list of tests and results. I'll send you the stack traces outside of this ticket since they're large:
From looking in the logs: I'd suggest we break the Payara issue out into a new issue, and probably create one to track changing between dataverses with different stores. If we want to fix/work around #6371, I can submit a PR that reverts to using older code that doesn't randomly fail - we use that on QDR. W.r.t. this PR, I think it does fix the race condition I saw, now catches network errors that don't even return a status=0 on their own, and handles ~accidental toggling between dataverses. I'm not sure how you test the network error unless you pull the plug, but testing on glassfish and trying the toggle to another dataverse and back should be possible. If we want to wait on this one until the Payara issue is fixed, that's OK with me - I guess it depends on whether the things I've fixed are being seen/are a priority that would result in a pre-Payara patch.
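To illustrate the kind of network-error handling described above, here is a minimal, hypothetical sketch (not the actual PR code): it treats both a non-2xx response and a transport-level error that never returns a status as a failed upload, so the file is still counted toward completion. The function name uploadFileDirectly and the arguments passed to uploadFailure() are assumptions for illustration; only directUploadFinished() and uploadFailure() are named in this conversation.

```javascript
// Hypothetical sketch, not the actual PR code: count a file as "done" whether
// the S3 PUT succeeds, returns an error status, or fails at the network level.
function uploadFileDirectly(url, file) {
    var xhr = new XMLHttpRequest();
    xhr.open('PUT', url, true);
    // Fires whenever the PUT completes with any HTTP status.
    xhr.onload = function () {
        if (xhr.status >= 200 && xhr.status < 300) {
            directUploadFinished();                 // success path
        } else {
            uploadFailure(xhr.status, file.name);   // non-2xx response
        }
    };
    // Fires on transport errors that never produce a status (status stays 0),
    // which would otherwise leave the file uncounted and hang the overall upload.
    xhr.onerror = function () {
        uploadFailure(xhr.status, file.name);
    };
    xhr.send(file);
}
```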
This fixes some failing cases; the payara direct S3 upload failure on dataset create will be broken out as a separate ticket.
What this PR does / why we need it: This PR tracks failures in direct uploads and adds them to the total count of processed files, so that a direct upload correctly completes with the remaining (successfully uploaded) files.
Which issue(s) this PR closes:
Closes #6829
Special notes for your reviewer: Not sure if this closes the issue or not - it does not address the underlying possibility for Amazon AWS's 'eventual consistency' to take longer than 1 minute to recognize the creation of new files (so Dataverse can find a file, get its filesize, etc.). Assuming that's not the new norm, this PR just makes sure that when errors occur, the overall upload, which counts completed files to see when the total number of files has been processed, also counts failures, so that it successfully completes. That should make the successful files appear in the new panel of uploaded files (as with normal uploads), after which one should be able to save the dataset. The failure to finish an upload at all was reported for dataset create mode, but I think it was possible in both create and edit modes.
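As a rough illustration of that counting logic, here is a minimal sketch under assumed names (filesInProgress, filesFinished, checkAllDone, and showUploadedFilesPanel are hypothetical; the real upload script may organize this differently):

```javascript
// Minimal sketch of the completion counting described above, using
// hypothetical names; not the actual upload-script code.
var filesInProgress = 0;   // number of files selected for direct upload
var filesFinished = 0;     // successes and failures combined

function directUploadFinished() {
    filesFinished++;
    checkAllDone();
}

function uploadFailure(status, fileName) {
    // Before this PR, a failed file never incremented the finished count,
    // so the overall upload waited forever and never completed.
    filesFinished++;
    checkAllDone();
}

function checkAllDone() {
    if (filesFinished === filesInProgress) {
        // Every file is accounted for: show the successfully uploaded files
        // so the dataset can be saved.
        showUploadedFilesPanel();
    }
}
```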
Suggestions on how to test this: Could be hard to test if AWS isn't slow. I edited the fileuploads.js script and added a random failure when directUploadFinished() was called (some fraction of the time, it called uploadFailure() instead of continuing). That caused the processing to hang with some files not processed, which is now fixed by the PR since it calls directUploadFinished at the end of its processing. In my fake test, I didn't send values for the params required by uploadFailure, so I bypassed the steps that find the error code and display it. I can provide edits to that file if someone wants to replicate this.
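A hypothetical reconstruction of that test edit might look like the following; the 30% failure rate and the wrapper name are arbitrary choices for illustration, and calling uploadFailure() with no arguments mirrors the shortcut described above:

```javascript
// Hypothetical reproduction of the test edit: fail a fraction of uploads at
// random so the failure path can be exercised without a slow S3.
function directUploadFinishedOrRandomFailure() {
    if (Math.random() < 0.3) {
        // Calling uploadFailure() with no arguments skips the steps that look
        // up and display the error code, as noted above.
        uploadFailure();
    } else {
        directUploadFinished();
    }
}
```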
Does this PR introduce a user interface change?: no
Is there a release notes update needed for this change?: no
Additional documentation: FWIW: I also removed the unused datasetId variable - that was originally used but is only available for edit mode, so I ended up working around it to handle create mode and it just hadn't been deleted.