
[ENG-330] [OATHPIT] Throw figshare into the Oath Pit #391

Conversation

@cslzchen (Contributor) commented Dec 11, 2019

Ticket

https://openscience.atlassian.net/browse/ENG-330

Purpose

Enable and update the figshare provider for aiohttp3.

Changes

aiohttp 0.18 -> 3

The figshare provider directly calls aiohttp.request() in aiohttp 0.18 for some downloads, which no longer works with aiohttp 3.5. Instead, use the updated super().make_request(). However, this only partially fixes the download issue.
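The migration can be sketched roughly as follows. This is illustrative only: `BaseProvider` and `make_request()` below are simplified stand-ins for WaterButler's core provider API, and the file URL is made up.

```python
import asyncio

class BaseProvider:
    async def make_request(self, method, url, **kwargs):
        # The real helper wraps aiohttp 3.x's ClientSession machinery;
        # here it just records the call for demonstration.
        return (method, url, kwargs)

class FigshareProvider(BaseProvider):
    async def download(self, url):
        # aiohttp 0.18 allowed a bare `await aiohttp.request('GET', url)`;
        # that pattern no longer works for the provider under aiohttp 3.5,
        # so every request now goes through the shared helper instead.
        return await super().make_request('GET', url)

print(asyncio.run(FigshareProvider().download('https://ndownloader.figshare.com/files/123')))
```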

Authorization Header

To fully fix the download issue, the figshare auth header is dropped for published / public files.

  • As for the figshare provider: When a file has been published, the figshare download request returns a redirection URL which downloads the file directly from its backend storage (S3). The new request does not need any auth at all. More importantly, the default figshare auth header breaks the download since the S3 API does not understand the figshare auth scheme. Thus, the provider must use no_auth_header=True to inform super().make_request() to drop the header.

  • As for the core provider: Instead of modifying how headers are built with build_headers(), simply update make_request() to drop the Authorization header built into the headers.
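The header-dropping behavior can be illustrated with a minimal sketch; `make_request_headers` is a made-up helper standing in for the logic inside `make_request()`, not the actual WaterButler function.

```python
def make_request_headers(headers, no_auth_header=False):
    """Return the headers to send, optionally without Authorization."""
    headers = dict(headers)  # do not mutate the caller's dict
    if no_auth_header:
        # S3 rejects figshare's `token ...` scheme, so drop the header
        # before following the redirect to backend storage.
        headers.pop('Authorization', None)
    return headers

# Published file: follow the S3 redirect without auth
print(make_request_headers({'Authorization': 'token ****'}, no_auth_header=True))
# Private file: keep the figshare token
print(make_request_headers({'Authorization': 'token ****'}))
```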

Broken Celery Task (Existing Issue on staging)

Move / copy fails for the figshare provider since __new__() in the FigshareProvider class does not accept extra arguments such as is_celery_task, which was recently added for celery tasks. The fix is simply to add the missing **kwargs. Here is the related commit: add flag to BaseProvider to mark if it's running in celery.
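A minimal sketch of the `__new__()` fix, with the real provider-factory logic stripped down to the argument handling; the class body here is illustrative, not the actual WaterButler code.

```python
class FigshareProvider:
    """Stand-in for the real provider factory; illustrative only."""

    def __new__(cls, auth, credentials, settings, **kwargs):
        # Before the fix, the signature lacked **kwargs, so passing
        # is_celery_task=True (added for celery move / copy tasks)
        # raised TypeError and broke move / copy.
        return super().__new__(cls)

    def __init__(self, auth, credentials, settings, **kwargs):
        self.is_celery_task = kwargs.get('is_celery_task', False)

# celery tasks can now instantiate the provider with the extra flag
provider = FigshareProvider({}, {}, {}, is_celery_task=True)
```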

Download Stream Size (Existing Issue on staging)

Fixed the missing stream size when copying / moving private figshare files.

  • Copy / Move private files / datasets from figshare to OSFStorage or any working provider (including figshare) fails. The problem is that the downloaded stream from the src-provider has a None size during copy / move, which breaks the upload-to-dest step, which requires the size.

  • The root problem: the download response for private files using the figshare API does not provide the Content-Length header that is used to build the response stream and to set the stream size. In contrast, the download response for published / public files using figshare's backend S3 API provides the Content-Length header.

  • The fix is to retrieve and save the file size in advance during the metadata request and then create the response stream using this size if the Content-Length header is not provided.
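A rough sketch of the size fallback described above; this `ResponseStreamReader` is a stub mirroring the behavior, not the real `waterbutler.core.streams` class.

```python
class ResponseStreamReader:
    """Stub illustrating how the stream size is resolved."""

    def __init__(self, resp_headers, size=None):
        # Prefer Content-Length from the response; fall back to the
        # size saved earlier from the metadata request (the figshare
        # API omits Content-Length for private-file downloads).
        content_length = resp_headers.get('Content-Length')
        self.size = int(content_length) if content_length is not None else size

# Private file: no Content-Length, so the metadata size is used
private_stream = ResponseStreamReader({}, size=1024)
# Public file: the S3 response supplies Content-Length
public_stream = ResponseStreamReader({'Content-Length': '2048'}, size=1024)
```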

Side effects

Fixed move / copy which is broken on both staging and prod.

QA Notes

About figshare and WB-figshare

  • figshare is very different from other providers due to the "article type" concept.

    • From current WB's perspective, a figshare article can be either a folder (type 3 Dataset and the deprecated type 4 fileset) or a file (all other types).
    • From figshare's perspective, any article type is a set of files.
    • Articles created directly on figshare must have a type.
    • Files created via WB become figshare articles without a type.
    • Folders created via WB become figshare articles of type 3 (Dataset).
    • For a folder article, WB shows all its contents.
    • For a file article, WB shows only the first file of the article.
  • WB supports figshare project and dataset as OSF project root but not figshare collection. Both cases have been tested during local regression tests. Please note that a dataset can belong to a project or stand on its own.

    • As for figshare project as OSF project root, only one level of folders is allowed: each folder is an article of type 3 (Dataset).
    • As for figshare dataset as OSF project root, only files are allowed. A WB-figshare folder / figshare dataset cannot contain another folder / dataset.
  • Public / published figshare articles (folder and file) can only be read but not modified.

  • Uploading a file (no type) or creating a folder (type 3 Dataset) ends up creating a private article of respective type, even if the file is uploaded to a published folder.
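The article-type rules listed above could be summarized by a small helper. This is purely hypothetical: the helper name is made up, and the type numbers are assumptions (3 = Dataset per the notes above; the deprecated fileset is assumed to be type 4), not WaterButler code.

```python
# Assumed article types that WB treats as folders (3 = Dataset,
# 4 = deprecated fileset -- assumption, not confirmed by the PR).
FOLDER_ARTICLE_TYPES = {3, 4}

def wb_kind(defined_type):
    """Map a figshare article type to the kind WB presents (hypothetical)."""
    if defined_type is None:
        return 'file'    # files created via WB get no type
    if defined_type in FOLDER_ARTICLE_TYPES:
        return 'folder'  # WB shows all of the article's contents
    return 'file'        # WB shows only the first file of the article

print(wb_kind(3), wb_kind(None), wb_kind(1))
```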

Dev Tests

As usual, no comment indicates a PASS. Please test both figshare project as root and figshare dataset as root where applicable.

  • Getting metadata for a file / folder: tested along folder listing and file rendering

  • Downloading

    • Public files are downloaded using figshare's backend S3 API without auth
    • Private files are downloaded using the figshare API with auth
  • Uploading

    • One file
    • Multiple files
    • Contiguous < 10 MB
    • Chunked >= 10 MB
  • DAZ

    • Private dataset
    • Public dataset
  • Deleting (not available for published / public folders and files)

    • One file
    • Multiple files
    • One folder
    • Multiple folders
  • Folder

    • Creation (only when figshare project as root)
    • Upload: tested along with Uploading
    • Deletion: tested along with Deleting
  • Rename files and folders: N / A, disabled by the front-end

  • Verifying non-root folder access for id-based folders: not necessary but tested anyway

  • Intra move / copy: N / A

  • Inter move / copy (light testing only)

    • One and multiple files
    • One and multiple folders
    • From OSFStorage to figshare
    • From figshare to OSFStorage: must test both private and published
    • Within figshare (intra disabled, thus test with inter): must test both private and published
    • Trying to move a published file or folder ends up being a copy
  • Comments persist with moves (light testing only)

  • If enabled, test revisions: only seeing the latest, which is as expected.

  • Project root is storage root vs. a subfolder: not necessary but tested

  • Updating a file

Extra Notes for QA Testing

At the time of writing the QA notes, prod OSF does not have the figshare article type fix, while staging1 and staging2 have just been fixed. In short:

  • prod figshare may be more broken than staging1 and staging2
  • feature/oathpit (i.e. staging3 after merge) figshare works almost perfectly

Deployment Notes

No

Comment on lines +214 to +226
# When a file has been published, the figshare download request returns a redirection URL
# which downloads the file directly from its backend storage (currently Amazon S3). The new
# request does not need any auth at all. More importantly, the default figshare auth header
# breaks the download since S3 API does not understand figshare API. Thus, the provider
# must use `no_auth_header=True` to inform `super().make_request()` to drop the header.
@cslzchen (Contributor, Author) commented:

I didn't realize this problem until I started testing the view link for a published / public file. The content shown by MFR is an XML error response. I went to test the file download and got the same result.

<?xml version="1.0" encoding="UTF-8"?>
<Error>
    <Code>InvalidArgument</Code>
    <Message>Unsupported Authorization Type</Message>
    <ArgumentName>Authorization</ArgumentName>
    <ArgumentValue>token ****</ArgumentValue>
    <RequestId>****</RequestId>
    <HostId>****</HostId>
</Error>

When I checked the WB log, it looked OK since the error response had a 200 status !!!

[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://api.figshare.com/v2/account/projects/64916/articles/8263811
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://ndownloader.figshare.com/files/15460988
[waterbutler.providers.figshare.provider]: >>> resp status: 302
[waterbutler.core.provider]: >>> request headers = {'Authorization': 'token ****'}
[waterbutler.core.provider]: >>> request URL = https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/15460988/Dataset01.txt
[tornado.access]: 200 GET /v1/resources/nwstv/providers/figshare/8263811/15460988? (172.19.0.1) 11957.88ms
[waterbutler.core.remote_logging]: Callback for download_file request succeeded with {"status": "success"}

Comment on lines +273 to +275
no_auth_header = kwargs.pop('no_auth_header', False)
if no_auth_header:
kwargs['headers'].pop('Authorization')
@cslzchen (Contributor, Author) commented:

Drop the auth header in make_request() instead of build_headers(). I don't have a preference for either approach and chose the former simply because it is easier.

Comment on lines +222 to +232
range_header = {'Range': self._build_range_header(range)}
resp = await super().make_request('GET', resp.headers['location'],
headers=range_header, no_auth_header=True)
@cslzchen (Contributor, Author) commented:

Use an extra local var to keep line-breaking easy.


coveralls commented Dec 11, 2019

Coverage Status

Coverage increased (+6.8%) to 76.117% when pulling 4f38829 on cslzchen:feature/oathpit-figshare into be3b81f on CenterForOpenScience:feature/oathpit.

@cslzchen (Contributor, Author) left a comment

figshare works as expected except when copying / moving private files from figshare.

      )

  if settings['container_type'] in pd_settings.ARTICLE_CONTAINER_TYPES:
      return FigshareArticleProvider(
-         auth, credentials, dict(settings, container_id=settings['container_id'])
+         auth, credentials, dict(settings, container_id=settings['container_id']), **kwargs
@cslzchen (Contributor, Author) commented Dec 12, 2019

This only fixes the unsupported-argument issue so that celery tasks can init the figshare provider.

However, I discovered a new figshare copy / move issue, which is not related to oathpit but can't be reproduced on staging or prod since neither has this fix :( Below are some details.

  • Copy / Move from OSFStorage (or any working provider other than figshare) to figshare works
  • Copy / Move published files / datasets from figshare to OSFStorage (or any working provider including figshare) works
  • Copy / Move private files / datasets from figshare to OSFStorage (or any working provider including figshare) fails

After debugging, I learned that the problem is that the downloaded stream from the src-provider (i.e. figshare in this case) has a None size during copy / move. This breaks the next upload-to-dest-provider step, which requires stream.size.

More digging reveals the root cause: the download response for private figshare files (which uses the figshare API) does not provide the Content-Length header, which is used to build the response stream and set the stream size. In contrast, the download response for published / public files (which uses figshare's backend S3 API) provides the Content-Length header.

- [x] cc @felliott Not sure if we should fix this here in oathpit or fix develop / master first.


-        return streams.ResponseStreamReader(resp)
+        return streams.ResponseStreamReader(resp, size=file_size)
@cslzchen (Contributor, Author) commented:

Finally found a solution to the download missing stream size issue: retrieve and save the file size in advance during the metadata request and then create the response stream using this size if the Content-Length header is not provided. See https://github.com/CenterForOpenScience/waterbutler/pull/391/files#diff-1d2551564e2f17d27a564332e782cfeeR196-R203

@cslzchen (Contributor, Author) left a comment

Local regression tests done; QA notes and figshare quirks added. PR ready for CR!

🖼 🤝 🖼 🤝

cslzchen and others added 8 commits December 26, 2019 19:52
figshare directly calls `aiohttp.request()` with aiohttp 0.18, which
no longer works in aiohttp 3.5. Instead, rewrote the request-making
process to use the updated `super().make_request()`.
- As for the figshare provider: When a file has been published,
  the figshare download request returns a redirection URL which
  downloads the file directly from its backend storage S3. The
  new request does not need any auth at all. More importantly,
  the default figshare auth header breaks the download since S3
  API does not understand figshare API. Thus, the provider must
  use `no_auth_header=True` to inform `super().make_request()`
  to drop the header.
- As for the core provider: Instead of modifying how headers are
  built with `build_headers()`, simply update `make_request()` to
  drop the `Authorization` header built into the headers.
Move and copy fail for the figshare provider since `__new__()` in
the `FigshareProvider` class does not accept extra arguments such
as `is_celery_task` which was recently added for celery tasks. The
fix is simply adding the missing `**kwargs`.

Here is the related commit: CenterForOpenScience@345b2d8
* Copy / Move private files / datasets from figshare to OSFStorage
  or any working provider including figshare fails. The problem is
  that the downloaded stream from the src-provider has a None size
  during copy / move, which breaks the upload-to-dest step which
  requires the size.

* The root problem: the download response for private files using
  the figshare API does not provide the Content-Length header that
  is used to build the response stream and to set the stream size.
  On the contrary, the download response for published files using
  figshare's backend S3 API provides the Content-Length header.

* The fix is to retrieve and save the file size in advance during
  the metadata request and then create the response stream using
  this size if the Content-Length header is not provided.
 * Upgrading the pytest version revealed a couple of non-async tests
   that were being declared as async.  Fix these and correct the tests
   so they pass.
 * FigShare now returns a field in the response to upload requests
   that indicates whether hashing has completed.  If this is `false`,
   skip the checksum verification step.  Recent testing has shown that
   this completes quickly even for very large files.
 * Update `path_from_metadata` to correctly handle files in projects.
   WB had been treating files as just a file entity under the project
   root, but they are actually stored in an implied article (which WB
   treats as a folder).
felliott added a commit that referenced this pull request Jan 5, 2020
@felliott felliott merged commit f671937 into CenterForOpenScience:feature/oathpit Jan 5, 2020
@cslzchen cslzchen deleted the feature/oathpit-figshare branch January 13, 2020 18:34