
[ENG-330] [OATHPIT] Throw figshare into the Oath Pit #391

Conversation

@cslzchen (Contributor) commented Dec 11, 2019

Ticket

https://openscience.atlassian.net/browse/ENG-330

Purpose

Enable and update the figshare provider for aiohttp3.

Changes

aiohttp 0.18 -> 3

The figshare provider directly calls aiohttp.request() in aiohttp 0.18 for some downloads, which no longer works with aiohttp 3.5. Instead, use the updated super().make_request(). However, this only partially fixes the download issue.
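The migration can be sketched roughly as follows. This is illustrative only: `BaseProvider` and `make_request()` below are simplified stand-ins for WaterButler's core provider API, and the file URL is made up.

```python
import asyncio

class BaseProvider:
    async def make_request(self, method, url, **kwargs):
        # The real helper wraps aiohttp 3.x's ClientSession machinery;
        # here it just records the call for demonstration.
        return (method, url, kwargs)

class FigshareProvider(BaseProvider):
    async def download(self, url):
        # aiohttp 0.18 allowed a bare `await aiohttp.request('GET', url)`;
        # that pattern no longer works for the provider under aiohttp 3.5,
        # so every request now goes through the shared helper instead.
        return await super().make_request('GET', url)

print(asyncio.run(FigshareProvider().download('https://ndownloader.figshare.com/files/123')))
```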

Authorization Header

To fully fix the download issue, the figshare auth header is dropped for published / public files.

  • As for the figshare provider: When a file has been published, the figshare download request returns a redirection URL which downloads the file directly from its backend storage (S3). The new request does not need any auth at all. More importantly, the default figshare auth header breaks the download since the S3 API does not understand the figshare auth scheme. Thus, the provider must use no_auth_header=True to inform super().make_request() to drop the header.

  • As for the core provider: Instead of modifying how headers are built with build_headers(), simply update make_request() to drop the Authorization header built into the headers.
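The header-dropping behavior can be illustrated with a minimal sketch; `make_request_headers` is a made-up helper standing in for the logic inside `make_request()`, not the actual WaterButler function.

```python
def make_request_headers(headers, no_auth_header=False):
    """Return the headers to send, optionally without Authorization."""
    headers = dict(headers)  # do not mutate the caller's dict
    if no_auth_header:
        # S3 rejects figshare's `token ...` scheme, so drop the header
        # before following the redirect to backend storage.
        headers.pop('Authorization', None)
    return headers

# Published file: follow the S3 redirect without auth
print(make_request_headers({'Authorization': 'token ****'}, no_auth_header=True))
# Private file: keep the figshare token
print(make_request_headers({'Authorization': 'token ****'}))
```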

Broken Celery Task (Existing Issue on staging)

Move / copy fails for the figshare provider since __new__() in the FigshareProvider class does not accept extra arguments such as is_celery_task, which was recently added for celery tasks. The fix is simply to add the missing **kwargs. Here is the related commit: add flag to BaseProvider to mark if it's running in celery.
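A minimal sketch of the `__new__()` fix, with the real provider-factory logic stripped down to the argument handling; the class body here is illustrative, not the actual WaterButler code.

```python
class FigshareProvider:
    """Stand-in for the real provider factory; illustrative only."""

    def __new__(cls, auth, credentials, settings, **kwargs):
        # Before the fix, the signature lacked **kwargs, so passing
        # is_celery_task=True (added for celery move / copy tasks)
        # raised TypeError and broke move / copy.
        return super().__new__(cls)

    def __init__(self, auth, credentials, settings, **kwargs):
        self.is_celery_task = kwargs.get('is_celery_task', False)

# celery tasks can now instantiate the provider with the extra flag
provider = FigshareProvider({}, {}, {}, is_celery_task=True)
```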

Download Stream Size (Existing Issue on staging)

Fixed the missing stream size when copying / moving private figshare files.

  • Copy / Move private files / datasets from figshare to OSFStorage or any working provider (including figshare) fails. The problem is that the downloaded stream from the src-provider has a None size during copy / move, which breaks the upload-to-dest step, which requires the size.

  • The root problem: the download response for private files using the figshare API does not provide the Content-Length header that is used to build the response stream and to set the stream size. In contrast, the download response for published / public files using figshare's backend S3 API provides the Content-Length header.

  • The fix is to retrieve and save the file size in advance during the metadata request and then create the response stream using this size if the Content-Length header is not provided.
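A rough sketch of the size fallback described above; this `ResponseStreamReader` is a stub mirroring the behavior, not the real `waterbutler.core.streams` class.

```python
class ResponseStreamReader:
    """Stub illustrating how the stream size is resolved."""

    def __init__(self, resp_headers, size=None):
        # Prefer Content-Length from the response; fall back to the
        # size saved earlier from the metadata request (the figshare
        # API omits Content-Length for private-file downloads).
        content_length = resp_headers.get('Content-Length')
        self.size = int(content_length) if content_length is not None else size

# Private file: no Content-Length, so the metadata size is used
private_stream = ResponseStreamReader({}, size=1024)
# Public file: the S3 response supplies Content-Length
public_stream = ResponseStreamReader({'Content-Length': '2048'}, size=1024)
```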

Side effects

Fixed move / copy which is broken on both staging and prod.

QA Notes

About figshare and WB-figshare

  • figshare is very different from other providers due to the "article type" concept.

    • From current WB's perspective, a figshare article can be either a folder (type 3 Dataset and the deprecated type 4 fileset) or a file (all other types).
    • From figshare's perspective, any article type is a set of files.
    • Articles created directly on figshare must have a type.
    • Files created via WB become figshare articles without a type.
    • Folders created via WB become figshare articles of type 3 (Dataset).
    • For a folder article, WB shows all its contents.
    • For a file article, WB shows only the first file of the article.
  • WB supports figshare project and dataset as OSF project root but not figshare collection. Both cases have been tested during local regression tests. Please note that a dataset can belong to a project or stand on its own.

    • As for figshare project as OSF project root, only one level of folders is allowed: each folder is an article of type 3 (Dataset).
    • As for figshare dataset as OSF project root, only files are allowed. A WB-figshare folder / figshare dataset cannot contain another folder / dataset.
  • Public / published figshare articles (folder and file) can only be read but not modified.

  • Uploading a file (no type) or creating a folder (type 3 Dataset) ends up creating a private article of respective type, even if the file is uploaded to a published folder.
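The article-type rules listed above could be summarized by a small helper. This is purely hypothetical: the helper name is made up, and the type numbers are assumptions (3 = Dataset per the notes above; the deprecated fileset is assumed to be type 4), not WaterButler code.

```python
# Assumed article types that WB treats as folders (3 = Dataset,
# 4 = deprecated fileset -- assumption, not confirmed by the PR).
FOLDER_ARTICLE_TYPES = {3, 4}

def wb_kind(defined_type):
    """Map a figshare article type to the kind WB presents (hypothetical)."""
    if defined_type is None:
        return 'file'    # files created via WB get no type
    if defined_type in FOLDER_ARTICLE_TYPES:
        return 'folder'  # WB shows all of the article's contents
    return 'file'        # WB shows only the first file of the article

print(wb_kind(3), wb_kind(None), wb_kind(1))
```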

Dev Tests

As usual, no comment indicates a PASS. Please test both figshare project as root and figshare dataset as root where applicable.

  • Getting metadata for a file / folder: tested along folder listing and file rendering

  • Downloading

    • Public files are downloaded using figshare's backend S3 API without auth
    • Private files are downloaded using the figshare API with auth
  • Uploading

    • One file
    • Multiple files
    • Contiguous < 10 MB
    • Chunked >= 10 MB
  • DAZ

    • Private dataset
    • Public dataset
  • Deleting (not available for published / public folders and files)

    • One file
    • Multiple files
    • One folder
    • Multiple folders
  • Folder

    • Creation (only when figshare project as root)
    • Upload: tested along with Uploading
    • Deletion: tested along with Deleting
  • Rename files and folders: N / A, disabled by the front-end

  • Verifying non-root folder access for id-based folders: not necessary but tested anyway

  • Intra move / copy: N / A

  • Inter move / copy (light testing only)

    • One and multiple files
    • One and multiple folders
    • From OSFStorage to figshare
    • From figshare to OSFStorage: must test both private and published
    • Within figshare (intra disabled, thus test with inter): must test both private and published
    • Trying to move a published file or folder ends up being a copy
  • Comments persist with moves (light testing only)

  • If enabled, test revisions: only seeing the latest, which is as expected.

  • Project root is storage root vs. a subfolder: not necessary but tested

  • Updating a file

Extra Notes for QA Testing

At the time of writing the QA notes, prod OSF does not have the figshare article type fix, while staging1 and staging2 have just been fixed. In short:

  • prod figshare may be more broken than staging1 and staging2
  • feature/oathpit (i.e. staging3 after merge) figshare works almost perfectly

Deployment Notes

No

Comment on lines +214 to +226
# When a file has been published, the figshare download request returns a redirection URL
# which downloads the file directly from its backend storage (currently Amazon S3). The new
# request does not need any auth at all. More importantly, the default figshare auth header
# breaks the download since S3 API does not understand figshare API. Thus, the provider
# must use `no_auth_header=True` to inform `super().make_request()` to drop the header.
@cslzchen (Contributor, Author) commented:

I didn't realize this problem until I started testing the view link for a published / public file. The content shown by MFR is an XML error response. I went to test the file download and got the same result.

<?xml version="1.0" encoding="UTF-8"?>
<Error>
    <Code>InvalidArgument</Code>
    <Message>Unsupported Authorization Type</Message>
    <ArgumentName>Authorization</ArgumentName>
    <ArgumentValue>token ****</ArgumentValue>
    <RequestId>****</RequestId>
    <HostId>****</HostId>
</Error>

When I checked the WB log, it looked OK since the error response had a 200 status !!!

[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://api.figshare.com/v2/account/projects/64916/articles/8263811
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://ndownloader.figshare.com/files/15460988
[waterbutler.providers.figshare.provider]: >>> resp status: 302
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/15460988/Dataset01.txt
[tornado.access]: 200 GET /v1/resources/nwstv/providers/figshare/8263811/15460988? (172.19.0.1) 11957.88ms
[waterbutler.core.remote_logging]: Callback for download_file request succeeded with {"status": "success"}

Comment on lines +273 to +275
no_auth_header = kwargs.pop('no_auth_header', False)
if no_auth_header:
kwargs['headers'].pop('Authorization')
@cslzchen (Contributor, Author) commented:

Drop the auth header in make_request() instead of build_headers(). I don't have a preference for either approach and chose the former simply because it is easier.

Comment on lines +222 to +232
range_header = {'Range': self._build_range_header(range)}
resp = await super().make_request('GET', resp.headers['location'],
headers=range_header, no_auth_header=True)
@cslzchen (Contributor, Author) commented:

Use an extra local var to keep line-breaking easy.


coveralls commented Dec 11, 2019

Coverage Status

Coverage increased (+6.8%) to 76.117% when pulling 4f38829 on cslzchen:feature/oathpit-figshare into be3b81f on CenterForOpenScience:feature/oathpit.

@cslzchen (Contributor, Author) left a comment

figshare works as expected except when copying / moving private files from figshare.

      )

  if settings['container_type'] in pd_settings.ARTICLE_CONTAINER_TYPES:
      return FigshareArticleProvider(
-         auth, credentials, dict(settings, container_id=settings['container_id'])
+         auth, credentials, dict(settings, container_id=settings['container_id']), **kwargs
@cslzchen (Contributor, Author) commented Dec 12, 2019

This only fixes the unsupported-argument issue so that celery tasks can init the figshare provider.

However, I discovered a new figshare copy / move issue, which is not related to oathpit but can't be reproduced on staging or prod since neither has this fix :( Below are some details.

  • Copy / Move from OSFStorage (or any working provider other than figshare) to figshare works
  • Copy / Move published files / datasets from figshare to OSFStorage (or any working provider including figshare) works
  • Copy / Move private files / datasets from figshare to OSFStorage (or any working provider including figshare) fails

After debugging, I learned that the problem is that the downloaded stream from the src-provider (i.e. figshare in this case) has a None size during copy / move. This breaks the next upload-to-dest-provider step, which requires stream.size.

More digging reveals the root cause: the download response for private figshare files (which uses the figshare API) does not provide the Content-Length header, which is used to build the response stream and set the stream size. In contrast, the download response for published / public files (which uses figshare's backend S3 API) provides the Content-Length header.

- [x] cc @felliott Not sure if we should fix this here in oathpit or fix develop / master first.


-        return streams.ResponseStreamReader(resp)
+        return streams.ResponseStreamReader(resp, size=file_size)
@cslzchen (Contributor, Author) commented:

Finally found a solution to the download missing stream size issue: retrieve and save the file size in advance during the metadata request and then create the response stream using this size if the Content-Length header is not provided. See https://github.com/CenterForOpenScience/waterbutler/pull/391/files#diff-1d2551564e2f17d27a564332e782cfeeR196-R203

@cslzchen (Contributor, Author) left a comment

Local regression tests done; QA notes and figshare quirks added. PR ready for CR!

🖼 🤝 🖼 🤝

cslzchen and others added 8 commits December 26, 2019 19:52
figshare directly calls `aiohttp.request()` with aiohttp 0.18, which
no longer works in aiohttp 3.5. Instead, rewrote the request-making
process to use the updated `super().make_request()`.
- As for the figshare provider: When a file has been published,
  the figshare download request returns a redirection URL which
  downloads the file directly from its backend storage S3. The
  new request does not need any auth at all. More importantly,
  the default figshare auth header breaks the download since S3
  API does not understand figshare API. Thus, the provider must
  use `no_auth_header=True` to inform `super().make_request()`
  to drop the header.
- As for the core provider: Instead of modifying how headers are
  built with `build_headers()`, simply update `make_request()` to
  drop the `Authorization` header built into the headers.
Move and copy fail for the figshare provider since `__new__()` in
the `FigshareProvider` class does not accept extra arguments such
as `is_celery_task` which was recently added for celery tasks. The
fix is simply adding the missing `**kwargs`.

Here is the related commit: CenterForOpenScience@345b2d8
* Copy / Move private files / datasets from figshare to OSFStorage
  or any working provider including figshare fails. The problem is
  that the downloaded stream from the src-provider has a None size
  during copy / move, which breaks the upload-to-dest step which
  requires the size.

* The root problem: the download response for private files using
  the figshare API does not provide the Content-Length header that
  is used to build the response stream and to set the stream size.
  On the contrary, the download response for published files using
  figshare's backend S3 API provides the Content-Length header.

* The fix is to retrieve and save the file size in advance during
  the metadata request and then create the response stream using
  this size if the Content-Length header is not provided.
 * Upgrading the pytest version revealed a couple of non-async tests
   that were being declared as async.  Fix these and correct the tests
   so they pass.
 * FigShare now returns a field in the response to upload requests
   that indicates whether hashing has completed.  If this is `false`,
   skip the checksum verification step.  Recent testing has shown that
   this completes quickly even for very large files.
 * Update `path_from_metadata` to correctly handle files in projects.
   WB had been treating files as just a file entity under the project
   root, but they are actually stored in an implied article (which WB
   treats as a folder).
felliott added a commit that referenced this pull request Jan 5, 2020
@felliott felliott merged commit f671937 into CenterForOpenScience:feature/oathpit Jan 5, 2020
@cslzchen cslzchen deleted the feature/oathpit-figshare branch January 13, 2020 18:34