DMPTool Waterbutler provider #204

Closed
wants to merge 13 commits into from

@rdhyee rdhyee commented Mar 28, 2017

Member

@felliott felliott left a comment

Hey @rdhyee!

It's coming along nicely! I'm going to pause for now. Let me know when you've addressed these comments and the other cleanups you were planning, and I'll take another pass.

Cheers,
@felliott

class DmptoolFileMetadata(metadata.BaseFileMetadata):

    def __init__(self, raw):
        metadata.BaseFileMetadata.__init__(self, raw)
Member

Why not use super().__init__()?
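
For reference, a minimal sketch of the zero-argument form suggested here. `BaseFileMetadata` below is a hypothetical stand-in written for this example, not WaterButler's actual class.

```python
class BaseFileMetadata:
    """Hypothetical stand-in for metadata.BaseFileMetadata."""

    def __init__(self, raw):
        self.raw = raw


class DmptoolFileMetadata(BaseFileMetadata):

    def __init__(self, raw):
        # Python 3 works out the class and instance on its own, so the
        # base class does not have to be named explicitly.
        super().__init__(raw)


meta = DmptoolFileMetadata({'id': '26459'})
```

If the subclass `__init__` does nothing beyond calling the parent, it can usually be dropped altogether.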


@property
def extra(self):
    return super(DmptoolFileMetadata, self).extra.update({
Member

I don't think it's necessary to pass arguments to super() for 3.5 classes.
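
A small sketch of the zero-argument `super()` inside a property. The quoted line is truncated, so it is not clear from this page what the full expression returns; one thing worth checking is that `dict.update()` mutates in place and returns `None`. The base class and the `plan_id` key below are assumptions made for illustration only.

```python
class BaseFileMetadata:
    """Hypothetical stand-in for the real base class."""

    def __init__(self, raw):
        self.raw = raw

    @property
    def extra(self):
        return {'hashes': {}}


class DmptoolFileMetadata(BaseFileMetadata):

    @property
    def extra(self):
        # Build a new dict rather than returning the result of
        # update(), which is always None.
        return dict(super().extra, plan_id=self.raw.get('id'))
```

With this shape, `DmptoolFileMetadata({'id': 7}).extra` contains both the inherited keys and the added one.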


return result

async def plans_full(self, id_=None, format_='json'):
Member

Any particular reason for the underscores at the end of the kwarg names? Just curious, I haven't seen that before.

Member

This method, like plans(), should probably be split into two methods.

try:
    plan_id = path.parts[1].raw
    pdf = await self._dmptool_plan_pdf(plan_id)
except Exception as e:
Member

I tried downloading a "Test or Practice"-visible plan, but got the following message: {"response": "You are not authorized to look at this content."}. Are those not available via the API?

Author

"""
raise exceptions.ReadOnlyProviderError(self)

async def _do_intra_move_or_copy(self, dest_provider, src_path, dest_path):
Member

I don't think you need this method, it's not called by anything else.

Member

But you will want to add move() and copy() methods that respect the readonly-ness of DMPTool. See the bitbucket provider for examples: https://github.com/CenterForOpenScience/waterbutler/pull/198/files#diff-af4a691a976fbf59e9aabd632386d206R191
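
A rough sketch of what read-only move() and copy() could look like. Everything here is a stand-in written for this example: `ReadOnlyProviderError`, `BaseProvider`, and the method signatures are assumptions, and the linked bitbucket provider remains the authoritative model.

```python
import asyncio


class ReadOnlyProviderError(Exception):
    """Hypothetical stand-in for exceptions.ReadOnlyProviderError."""


class BaseProvider:
    """Hypothetical stand-in for a generic provider base class."""
    NAME = 'base'

    async def copy(self, dest_provider, src_path, dest_path):
        # Pretend the generic download-then-upload copy succeeded.
        return '{}:{}'.format(dest_provider.NAME, dest_path)


class ReadOnlyProvider(BaseProvider):
    NAME = 'dmptool'

    async def move(self, dest_provider, src_path, dest_path):
        # A move deletes the source, which a read-only provider cannot do.
        raise ReadOnlyProviderError(self.NAME)

    async def copy(self, dest_provider, src_path, dest_path):
        # Copying into this provider would be a write; copying out is fine.
        if dest_provider.NAME == self.NAME:
            raise ReadOnlyProviderError(self.NAME)
        return await super().copy(dest_provider, src_path, dest_path)
```

The idea is that copying out of the provider falls through to the generic path, while any operation that would write to or delete from it fails early with a clear error.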

else:
    resp = await self.get_url_async('plans/{}'.format(id_))
    r = await resp.json()
    result = r.get('plan')
Member

Minor: since the only commonality between these two paths is return result, how about splitting this into two methods, _get_plans() and _get_plan(id)?
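
One possible shape for that split, as a sketch. `FakeResponse` and `PlansClient` are hypothetical, `get_url_async` only mimics the helper visible in the diff, and the list payload (a list of `{'plan': {...}}` wrappers) is an assumption about the API response.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for an HTTP response with an async json()."""

    def __init__(self, payload):
        self._payload = payload

    async def json(self):
        return self._payload


class PlansClient:
    """Sketch only; not the provider class from this PR."""

    def __init__(self, responses):
        self._responses = responses  # maps URL path -> decoded JSON payload

    async def get_url_async(self, path):
        return FakeResponse(self._responses[path])

    async def _get_plans(self):
        resp = await self.get_url_async('plans')
        items = await resp.json()
        # Assumed shape: each item wraps the plan under a 'plan' key.
        return [item['plan'] for item in items]

    async def _get_plan(self, plan_id):
        resp = await self.get_url_async('plans/{}'.format(plan_id))
        payload = await resp.json()
        return payload.get('plan')
```

Each method then has a single return type (a list or one plan), which tends to make callers and tests simpler than one method that branches on whether an id was passed.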

def plans_templates(self):
    return self._unroll(self.get_url('plans_templates').json())

def institutions_plans_count(self):
Member

This and the previous two methods don't seem to be used anywhere. Are these stubs for future development, or just code that hasn't been trimmed yet?

Author

About using id_ and format_ as parameter names, I've gotten in the habit of trying to avoid names that conflict with functions in the builtins. Some discussion at https://stackoverflow.com/questions/16523789/naming-conflict-with-built-in-function. I'm ok with taking the _ out since the code should work without the underscore (I think!).
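
To illustrate the naming question, a small sketch with two hypothetical functions. Shadowing a builtin only causes trouble when the builtin is also called inside the same scope, which may or may not apply to the method under review.

```python
def plan_label(id_, format_='json'):
    # Trailing underscores keep the builtin format() usable here.
    return 'plan-{}.{}'.format(format(id_, '05d'), format_)


def plan_label_shadowed(id, format='json'):
    # Here `format` is the string argument, so calling it fails.
    return format(id, '05d')
```

Calling `plan_label(42)` works, while `plan_label_shadowed(42)` raises `TypeError` because a `str` is not callable.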

else:
return None

async def plans_owned(self):
Member

This is probably already on your todo list, but can you clarify the difference between plans(), plans_full(), and plans_owned()?

@rdhyee
Author

rdhyee commented Apr 12, 2017

I thought the Download-as-a-zip function was working, but although I got a valid zip file containing several files with the right names and a .pdf suffix, the PDF files themselves are invalid. I can download each PDF file separately. How should I go about debugging this problem? (I was under the impression that zip downloading should come "for free" once downloading works for each file.) Maybe I am missing some necessary metadata. For example, I'm setting the length of the file to 0 and not the actual size of the PDF. Might that be the problem?

@rdhyee
Author

rdhyee commented Apr 12, 2017

Another bug for which I'd appreciate help debugging: on a page for loading a specific DMPTool pdf (e.g., http://localhost:5000/kjayw/), the files sidebar doesn't finish loading:

[Screenshot: osf___copy_of_a_unified_approach_to_preserving_cultural_software_objects_and_their_development_histories__pdf]

The error in the console:

index.js:478 Uncaught TypeError: Cannot read property 'fullName' of undefined
    at Object.view (index.js:478)
    at Object.view (mithril.js:558)
    at redraw (mithril.js:638)
    at Function.m.redraw (mithril.js:624)
    at Function.m.endComputation (mithril.js:655)
    at endFirstComputation (mithril.js:662)
    at Function.m.mount.m.module (mithril.js:605)
    at Object.<anonymous> (file-page.js:9)
    at __webpack_require__ (bootstrap 08e8d500beb7e96abbbc:50)
    at webpackJsonpCallback (bootstrap 08e8d500beb7e96abbbc:21)
    at file-page.js:1
view @ index.js:478
view @ mithril.js:558
redraw @ mithril.js:638
m.redraw @ mithril.js:624
m.endComputation @ mithril.js:655
endFirstComputation @ mithril.js:662
m.mount.m.module @ mithril.js:605
(anonymous) @ file-page.js:9
__webpack_require__ @ bootstrap 08e8d500beb7e96abbbc:50
webpackJsonpCallback @ bootstrap 08e8d500beb7e96abbbc:21
(anonymous) @ file-page.js:1
l10n.js:108 Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user's experience. For more help, check https://xhr.spec.whatwg.org/.
xhrLoadText @ l10n.js:108
loadImport @ l10n.js:229
parseRawLines @ l10n.js:215
parseProperties @ l10n.js:235
(anonymous) @ l10n.js:244
xhr.onreadystatechange @ l10n.js:115
pdf.worker.min.js:39 The provided value 'moz-chunked-arraybuffer' is not a valid enum value of type XMLHttpRequestResponseType.
request @ pdf.worker.min.js:39
requestFull @ pdf.worker.min.js:39
c @ pdf.worker.min.js:872
(anonymous) @ pdf.worker.min.js:877
a.onmessage @ pdf.worker.min.js:8

When I dug into it, the problem boils down to osf.io/index.js at 732b9fff76492ea22867d41780f7cb3eb31c9505 · rdhyee/osf.io, in which storageAddons (for reasons I don't understand) does not have a dmptool key, though there are keys for box, dataverse, etc.

Any suggestion about how to debug this problem?

@felliott
Member

Sorry for replying out-of-band, but for some reason GH won't let me comment on #204 (comment) directly (the issue with DAZ, i.e. download-as-zip, not working). Yes, I suspect setting the size to 0 is your problem. When we build a DAZ zipfile, we use the streaming zipfile spec, which requires that we pass correct file size information in the zip central directory data structure.

If setting the correct file size doesn't work, PLMK, and I'll dig deeper.

@felliott
Member

Re: #204 (comment) (storageAddon) -- those are defined in https://github.com/CenterForOpenScience/osf.io/blob/develop-backup/website/static/storageAddons.json. You probably just need to add a key for dmptool, and it should start working.

@rdhyee
Author

rdhyee commented Apr 12, 2017

Re the need to calculate the length of the files: I will likely need to download a file to calculate its size. Is there a smart way to keep my addon from downloading a file multiple times during the construction of the zip file? That is, is there any caching mechanism available for me to use?

On further reflection: I'm going to look at whether doing an HTTP HEAD on the right API method will give me the length of the file.
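
A minimal sketch of the HEAD idea using only the standard library. `remote_size` is a hypothetical helper, and whether the DMPTool API answers HEAD requests or reports a Content-Length for generated PDFs is not established in this thread.

```python
import urllib.request


def remote_size(url, timeout=10):
    """Return the Content-Length reported for url, or None if absent."""
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get('Content-Length')
    return int(length) if length is not None else None
```

A `None` result would mean the server did not report a size, in which case the size would have to be measured while downloading.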

rdhyee added a commit to rdhyee/osf.io that referenced this pull request Apr 12, 2017
@felliott
Member

I may have spoken too soon about the file sizes. DAZ does download the file, and keeps a tally of its size as it goes, which I think is what is fed to the zipfile builder. For some reason, I was thinking you were overriding that and passing 0 as the size. I'm not sure where I got that idea from. I'll have another look.

@rdhyee
Author

rdhyee commented Apr 12, 2017

About file size: I think I have been setting the length to 0 somewhere (I can chase down the reference). Is it possible to not set the length at all?

@felliott
Member

You can override the Content-Length header when downloading a file, but we actually keep a separate tally when building the zip, since providers are pretty unreliable. Manually setting the length to zero sounds weird to me, but it shouldn't affect the building of the zipfile. I wonder if the files are getting truncated, or if maybe the deflater is corrupting them?
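
One way to tell truncation apart from corruption is to inspect the downloaded archive directly. The sketch below uses only the standard library; `check_zip_pdfs` is a hypothetical helper, and the header and trailer checks are rough heuristics rather than real PDF validation.

```python
import io
import zipfile
import zlib


def check_zip_pdfs(data):
    """Return {member name: problem} for suspicious .pdf members of a zip."""
    problems = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if not info.filename.lower().endswith('.pdf'):
                continue
            try:
                # Reading a member also verifies its stored CRC-32.
                body = archive.read(info)
            except (zipfile.BadZipFile, zlib.error) as exc:
                problems[info.filename] = 'corrupt entry: {}'.format(exc)
                continue
            if not body.startswith(b'%PDF-'):
                problems[info.filename] = 'missing %PDF- header'
            elif b'%%EOF' not in body[-1024:]:
                problems[info.filename] = 'no %%EOF near end (truncated?)'
    return problems
```

Comparing each member's `file_size` with the size of the same file downloaded on its own would be a further check along the same lines.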

@rdhyee
Author

rdhyee commented Apr 14, 2017

Re: adding copy and move, I used the Bitbucket addon as a model and committed DMPTool Waterbutler provider by rdhyee · Pull Request #204 · CenterForOpenScience/waterbutler. Unfortunately, it did not work when I tested the functionality by selecting a file in the DMPTool section and then dragging it to OSF Storage:

[Screenshot: osf___embryo_assessment_continental_navigation_hemoglobin_potential_spectrum_illumination__files]

A few things I don't understand:

  1. As I've highlighted in the screenshot, I got a "Moving" message instead of the "Copying" message I would expect.

  2. The file doesn't actually get copied. The exception I get:

[2017-04-13 20:34:24,336][ERROR][tornado.application]: Uncaught exception POST /v1/resources/8c6js/providers/dmptool/26459? (127.0.0.1)
HTTPServerRequest(protocol='http', host='localhost:7777', method='POST', uri='/v1/resources/8c6js/providers/dmptool/26459?', version='HTTP/1.1', remote_ip='127.0.0.1', headers={'Dnt': '1', 'Content-Length': '92', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36', 'Connection': 'keep-alive', 'Referer': 'http://localhost:5000/8c6js/files/', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Type': 'Application/json', 'Origin': 'http://localhost:5000', 'Cookie': 'csrftoken=mal4CtqoeFsomEL7ah6pqeYfJfpKZtQ8; _xsrf=2|0cf0ceea|f08f38f4abc0c83a2a848fddbcef1860|1489700718; username-localhost-8889="2|1:0|10:1491363387|23:username-localhost-8889|44:YjRhZDcwMmZmZGQ2NDU5M2FjNDY0MTAwYWI1N2IyYWU=|66998c0f03b2617b8961d6f7c3f51678c9bc518194029137bf8002a9591daeeb"; osf=58f042b61193c74ded5fd52b.s_E7xxoNfFznnt3WK_PObRoO0_c', 'Host': 'localhost:7777', 'Accept-Language': 'en-US,en;q=0.8', 'Accept': '*/*'})
Traceback (most recent call last):
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/web.py", line 1445, in _execute
    result = yield result
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "<string>", line 6, in _wrap_awaitable
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/server/api/v1/provider/__init__.py", line 104, in post
    return (await self.move_or_copy())
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/server/api/v1/provider/movecopy.py", line 102, in move_or_copy
    metadata, created = await tasks.wait_on_celery(result)
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 55, in wrapped
    return (await backgrounded(func, *args, **kwargs))
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 48, in backgrounded
    functools.partial(func, *args, **kwargs)
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/futures.py", line 358, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 33, in wrapped
    return ensure_event_loop().run_until_complete(func(*args, **kwargs))
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/base_events.py", line 337, in run_until_complete
    return future.result()
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 126, in wait_on_celery
    raise exceptions.WaitTimeOutError
waterbutler.tasks.exceptions.WaitTimeOutError
[2017-04-13 20:34:24,352][INFO][tornado.access]: 202 POST /v1/resources/8c6js/providers/dmptool/26459? (127.0.0.1) 15872.53ms

  3. I don't actually see any of the print messages that I embedded in the move and copy methods, suggesting that neither was called.

  4. I have a celery worker running, but could it be a celery configuration problem?

  5. I've not implemented the v2 API methods. Are they needed?

Any hints as to how to debug the problem would be appreciated.

@felliott
Member

Re: move() / copy():

1.) Good catch! I'm not sure where the wording for that is set, so I'll need to dig into the code to figure it out. I suspect it's something that needs to be changed in the osf PR.

2.) That exception is okay. When a process takes too long to complete, it gets shunted off into celery, and we throw the WaitTimeOutError. It's an expected error. We throw it internally to abort the normal request handling process and indicate that we should return a 202.
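
The pattern described here can be sketched with plain asyncio. This is not WaterButler's implementation (which hands the work to celery); `WaitTimeOutError` is redefined locally as a stand-in, and `wait_on_task` and `handle_copy` are hypothetical names.

```python
import asyncio


class WaitTimeOutError(Exception):
    """Stand-in for waterbutler.tasks.exceptions.WaitTimeOutError."""


async def wait_on_task(task, timeout):
    # shield() keeps the background task running after the wait gives up.
    try:
        return await asyncio.wait_for(asyncio.shield(task), timeout)
    except asyncio.TimeoutError:
        raise WaitTimeOutError


async def handle_copy(task, timeout=0.05):
    try:
        return 200, await wait_on_task(task, timeout)
    except WaitTimeOutError:
        # The copy continues in the background; report "accepted".
        return 202, None
```

Under this reading, the traceback followed by a 202 in the log above is the slow path working as intended, and the actual outcome of the copy has to be looked for in the worker's output.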

3+4.) You definitely need the waterbutler celery worker running for this. Once the process is sent to WB's celery, any debugging output will appear in its terminal window.

5.) Sorry, what v2 api methods are you referring to?

@rdhyee
Author

rdhyee commented Apr 14, 2017

  1. I just looked at the window for the WB celery worker. There's no error message, but I don't see any sign of the task ever being completed. (I reloaded the page and still don't see any copied file in the OSFStorage section.) Any tips on debugging celery tasks?

  2. About v2 api methods, I think I was confused by seeing _build_v2_repo_url in the bitbucket provider and thinking that it had something to do with validate_v1_path (which I do have implemented for dmptool) and how I might need a validate_v2_path -- which is not right. So strike that question.

  3. I'm surprised not to see output from the print statements I have in the move and copy methods when I drag a file from dmptool to OSFStorage. Is that expected behavior?

@felliott
Member

1.) debugging celery: does the request get logged in the celery worker window? Or does it just remain blank? I just tried it locally using your branch and was able to successfully copy a plan to osfstorage.

2.) Hah! ETOOMANYAPIS: that's referring to the OSF API, for which v2 is the latest version.

3.) They should be appearing. Did you restart the celery task after you added those?

@coveralls

Coverage Status

Changes Unknown when pulling b6e8254 on rdhyee:feature/dmptool into CenterForOpenScience:develop.

@felliott
Member

felliott commented Jun 1, 2017

Upstream DMPTool will be releasing a new version of their API soon. This PR has been archived as the feature/dmptool-integration branch in this repo and in https://github.com/felliott/waterbutler. Closing.

@felliott felliott closed this Jun 1, 2017
@felliott felliott mentioned this pull request Aug 1, 2017