[ENG-330] [OATHPIT] Throw figshare into the Oath Pit #391
Conversation
# When a file has been published, the figshare download request returns a redirection URL
# which downloads the file directly from its backend storage (currently Amazon S3). The new
# request does not need any auth at all. More importantly, the default figshare auth header
# breaks the download since S3 API does not understand figshare API. Thus, the provider
# must use `no_auth_header=True` to inform `super().make_request()` to drop the header.
I didn't notice this problem until I started testing the view link for a published / public file. The content shown by MFR is an XML error response. I went to test the file download and got the same result.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidArgument</Code>
<Message>Unsupported Authorization Type</Message>
<ArgumentName>Authorization</ArgumentName>
<ArgumentValue>token ****</ArgumentValue>
<RequestId>****</RequestId>
<HostId>****</HostId>
</Error>
When I checked the WB log, it looked OK since the error response had a 200 status!
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://api.figshare.com/v2/account/projects/64916/articles/8263811
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://ndownloader.figshare.com/files/15460988
[waterbutler.providers.figshare.provider]: >>> resp status: 302
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/15460988/Dataset01.txt
[tornado.access]: 200 GET /v1/resources/nwstv/providers/figshare/8263811/15460988? (172.19.0.1) 11957.88ms
[waterbutler.core.remote_logging]: Callback for download_file request succeeded with {"status": "success"}
no_auth_header = kwargs.pop('no_auth_header', False)
if no_auth_header:
    kwargs['headers'].pop('Authorization')
Drop the auth header in `make_request()` instead of `build_headers()`. I don't have a preference for either approach and chose the former simply because it is easier.
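A minimal sketch of this approach (these are hypothetical stand-ins for WB's `BaseProvider` methods, not the actual implementation): `build_headers()` always injects the figshare token, and `make_request()` pops the `no_auth_header` flag before dispatching so it is never forwarded to the HTTP client.

```python
def build_headers(token, extra=None):
    # Default header builder: always injects the provider's auth token.
    headers = {'Authorization': 'token {}'.format(token)}
    headers.update(extra or {})
    return headers

def make_request(method, url, **kwargs):
    # Pop the flag first so it is never passed on to the HTTP client.
    no_auth_header = kwargs.pop('no_auth_header', False)
    kwargs.setdefault('headers', build_headers('fake-token'))
    if no_auth_header:
        # pop(..., None) so an already-missing header is not an error
        kwargs['headers'].pop('Authorization', None)
    return kwargs['headers']
```

With `no_auth_header=True`, the redirected S3 request goes out with no `Authorization` header at all, which is what S3 expects for a pre-signed URL.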
range_header = {'Range': self._build_range_header(range)}
resp = await super().make_request('GET', resp.headers['location'],
                                  headers=range_header, no_auth_header=True)
Use an extra local var to keep line-breaking easy.
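For context, here is a sketch of what the range header might look like; `build_range_header` below is a hypothetical stand-in for the provider's `_build_range_header`, assuming it takes a `(start, end)` tuple where either bound may be `None`.

```python
def build_range_header(byte_range):
    # Turn a (start, end) tuple into an RFC 7233 Range header value.
    # Either bound may be None, e.g. (None, 500) means "the last 500 bytes".
    start, end = byte_range
    return 'bytes={}-{}'.format('' if start is None else start,
                                '' if end is None else end)

# The extra local var keeps the make_request() call line short:
range_header = {'Range': build_range_header((0, 1023))}
```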
Force-pushed from e96da64 to 7f8422d
figshare works as expected except when copying / moving private files from figshare.
if settings['container_type'] in pd_settings.ARTICLE_CONTAINER_TYPES:
    return FigshareArticleProvider(
        auth, credentials, dict(settings, container_id=settings['container_id']), **kwargs
    )
This only fixes the unsupported-argument issue so that celery tasks can init the figshare provider. However, I discovered a new figshare copy / move issue, which is not related to oathpit but can't be reproduced on staging or prod since neither has this fix :( Below are some details.
- Copy / Move from OSFStorage (or any working provider other than figshare) to figshare works
- Copy / Move published files / datasets from figshare to OSFStorage (or any working provider including figshare) works
- Copy / Move private files / datasets from figshare to OSFStorage (or any working provider including figshare) fails
After debugging, I learned that the problem is that the downloaded stream from the src-provider (i.e. figshare in this case) has a `None` size during copy / move. This breaks the next upload-to-dest-provider step, which requires `stream.size`.
More digging reveals the root cause: the download response for private figshare files (which uses the figshare API) does not provide the `Content-Length` header, which is used to build the response stream and set the stream size. By contrast, the download response for published / public files (which uses figshare's backend storage S3 API) does provide the `Content-Length` header.
- [x] cc @felliott Not sure if we should fix this here in `oathpit` or fix `develop` / `master` first.
return streams.ResponseStreamReader(resp, size=file_size)
Finally found a solution to the missing-stream-size download issue: retrieve and save the file size in advance during the metadata request, then create the response stream using this size if the `Content-Length`
header is not provided. See https://github.com/CenterForOpenScience/waterbutler/pull/391/files#diff-1d2551564e2f17d27a564332e782cfeeR196-R203
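A minimal model of this fallback logic (a hypothetical simplification of WB's `ResponseStreamReader`, not its real signature): prefer `Content-Length` from the response headers, and fall back to the size saved at metadata time.

```python
class ResponseStreamReader:
    # Simplified sketch: size comes from Content-Length when present,
    # otherwise from the value captured during the metadata request.
    def __init__(self, headers, size=None):
        content_length = headers.get('Content-Length')
        self.size = int(content_length) if content_length is not None else size

# Private figshare download (figshare API): no Content-Length header,
# so the size saved from the metadata request is used.
private_stream = ResponseStreamReader({}, size=2048)

# Published file served by S3: Content-Length is present and wins.
public_stream = ResponseStreamReader({'Content-Length': '4096'}, size=2048)
```

Either way, `stream.size` is now set, so the upload-to-dest step of copy / move no longer breaks.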
Local regression tests done; QA notes and figshare quirks added. PR ready for CR!
🖼 🤝 🖼 🤝
figshare directly calls `aiohttp.request()` from aiohttp 0.18, which no longer works in aiohttp 3.5. Instead, the request-making process was rewritten to use the updated `super().make_request()`.
- As for the figshare provider: When a file has been published, the figshare download request returns a redirection URL which downloads the file directly from its backend storage S3. The new request does not need any auth at all. More importantly, the default figshare auth header breaks the download since the S3 API does not understand the figshare API. Thus, the provider must use `no_auth_header=True` to inform `super().make_request()` to drop the header.
- As for the core provider: Instead of modifying how headers are built with `build_headers()`, simply update `make_request()` to drop the `Authorization` header built into the headers.
Move and copy fail for the figshare provider since `__new__()` in the `FigshareProvider` class does not accept extra arguments such as `is_celery_task` which was recently added for celery tasks. The fix is simply adding the missing `**kwargs`. Here is the related commit: CenterForOpenScience@345b2d8
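A sketch of the failure mode and the `**kwargs` fix, using simplified hypothetical classes rather than the actual WB code: `FigshareProvider.__new__()` is a factory that dispatches to a concrete provider class, so any keyword argument it does not declare (such as the newly added `is_celery_task`) raises a `TypeError` unless `**kwargs` is accepted and forwarded.

```python
class BaseProvider:
    def __init__(self, auth, credentials, settings, is_celery_task=False):
        self.is_celery_task = is_celery_task

class FigshareProjectProvider(BaseProvider):
    pass

class FigshareArticleProvider(BaseProvider):
    pass

class FigshareProvider:
    """Factory: __new__ returns a concrete provider instance.  Before the
    fix, the signature lacked **kwargs, so passing is_celery_task=True
    from a celery task raised TypeError."""
    def __new__(cls, auth, credentials, settings, **kwargs):
        if settings.get('container_type') == 'article':
            return FigshareArticleProvider(auth, credentials, settings, **kwargs)
        return FigshareProjectProvider(auth, credentials, settings, **kwargs)
```

Because `__new__` returns an instance of a different class, Python skips `FigshareProvider.__init__`, and the extra keyword arguments flow straight through to the concrete provider.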
* Copy / Move private files / datasets from figshare to OSFStorage (or any working provider including figshare) fails. The problem is that the downloaded stream from the src-provider has a `None` size during copy / move, which breaks the upload-to-dest step, which requires the size.
* The root problem: the download response for private files using the figshare API does not provide the `Content-Length` header that is used to build the response stream and to set the stream size. By contrast, the download response for published files using figshare's backend S3 API provides the `Content-Length` header.
* The fix is to retrieve and save the file size in advance during the metadata request and then create the response stream using this size if the `Content-Length` header is not provided.
* Upgrading the pytest version revealed a couple of non-async tests that were being declared as async. Fix these and correct the tests so they pass.
* FigShare now returns a field in the response to upload requests that indicates whether hashing has completed. If this is `false`, skip the checksum verification step. Recent testing has shown that this completes quickly even for very large files.
* Update `path_from_metadata` to correctly handle files in projects. WB had been treating files as just a file entity under the project root, but they are actually stored in an implied article (which WB treats as a folder).
Force-pushed from 4f38829 to f671937
Ticket
https://openscience.atlassian.net/browse/ENG-330
Purpose
Enable and update the figshare provider for `aiohttp` 3.

Changes

`aiohttp` 0.18 -> 3

The figshare provider directly calls `aiohttp.request()` in `aiohttp` 0.18 for some downloads, which no longer works with `aiohttp` 3.5. Instead, use the updated `super().make_request()`. However, this only partially fixes the download issue.

`Authorization` Header

To fully fix the download issue, the figshare auth header is dropped for published / public files.

As for the figshare provider: When a file has been published, the figshare download request returns a redirection URL which downloads the file directly from its backend storage S3. The new request does not need any auth at all. More importantly, the default figshare auth header breaks the download since the S3 API does not understand the figshare API. Thus, the provider must use `no_auth_header=True` to inform `super().make_request()` to drop the header.

As for the core provider: Instead of modifying how headers are built with `build_headers()`, simply update `make_request()` to drop the `Authorization` header built into the headers.

Broken Celery Task (Existing Issue on `staging`)

Move / copy fails for the figshare provider since `__new__()` in the `FigshareProvider` class does not accept extra arguments such as `is_celery_task`, which was recently added for celery tasks. The fix is simply adding the missing `**kwargs`. Here is the related commit: add flag to BaseProvider to mark if it's running in celery.

Download Stream Size (Existing Issue on `staging`)

Fixed the missing stream size when copying / moving private figshare files.

Copy / Move private files / datasets from figshare to OSFStorage (or any working provider including figshare) fails. The problem is that the downloaded stream from the src-provider has a `None` size during copy / move, which breaks the upload-to-dest step, which requires the size.

The root problem: the download response for private files using the figshare API does not provide the `Content-Length` header that is used to build the response stream and to set the stream size. By contrast, the download response for published / public files using figshare's backend S3 API provides the `Content-Length` header.

The fix is to retrieve and save the file size in advance during the metadata request and then create the response stream using this size if the `Content-Length` header is not provided.

Side effects

Fixed move / copy, which is broken on both `staging` and `prod`.

QA Notes
About figshare and WB-figshare
figshare is very different from other providers due to the "article type" concept.
WB supports a figshare project or dataset as the OSF project root, but not a figshare collection. Both cases have been tested during local regression tests. Please note that a dataset can belong to a project or stand on its own.
Public / published figshare articles (folder and file) can only be read but not modified.
Uploading a file (no type) or creating a folder (type 3, Dataset) creates a private article of the respective type, even if the file is uploaded to a published folder.
Dev Tests
As usual, no comment indicates a PASS. Please test both project-as-root and dataset-as-root when eligible.
Getting metadata for a file / folder: tested along folder listing and file rendering
Downloading
Uploading
DAZ
Deleting (not available for published / public folders and files)
Folder
Rename files and folders: N / A, disabled by the front-end
Verifying non-root folder access for id-based folders: not necessary but tested anyway
Intra move / copy: N / A
Inter move / copy (light testing only)
Comments persist with moves (light testing only)
If enabled, test revisions: only seeing the latest, which is as expected.
Project root is storage root vs. a subfolder: not necessary but tested
Updating a file
Extra Notes for QA Testing
At the time of writing the QA notes, `prod` OSF does not have the figshare article type fix, while `staging1` and `staging2` have just been fixed. In short:
- `prod` figshare may be more broken than `staging1` and `staging2`
- On `feature/oathpit` (i.e. `staging3` after merge), figshare works almost perfectly

Deployment Notes

No