DMPTool Waterbutler provider #204

Closed

rdhyee wants to merge 13 commits into CenterForOpenScience:develop from rdhyee:feature/dmptool

rdhyee commented Mar 28, 2017 •

edited

Paired with CenterForOpenScience/osf.io#7030


          extract dmptool parts from dmptool_evernote

This was referenced Mar 28, 2017

DMPTool Add-on CenterForOpenScience/osf.io#7030

Closed

[WIP] DMPTool add-on: first pass to get feedback CenterForOpenScience/osf.io#5681

Closed

rdhyee added 2 commits

April 2, 2017 16:21


          trying to filter out plans with 'test' visibility -- but this part of…

534f95b

… the code might not

be called by dmptool


          The dmptool client.py code was extraneous -- the API code is in provi…

1cc4d7c

…der.py and is fully async.

felliott requested changes

Member

felliott left a comment

It's coming along nicely! I'm going to pause for now. Let me know when you've addressed these comments and the other cleanups you were planning, and I'll take another pass.

Cheers,
@felliott

waterbutler/providers/dmptool/metadata.py Outdated

+              class DmptoolFileMetadata(metadata.BaseFileMetadata):
+                  def __init__(self, raw):
+                      metadata.BaseFileMetadata.__init__(self, raw)

Member

felliott Apr 4, 2017

Why not use super().__init__()?

waterbutler/providers/dmptool/metadata.py Outdated

+                  @property
+                  def extra(self):
+                      return super(DmptoolFileMetadata, self).extra.update({

Member

felliott Apr 4, 2017

I don't think it's necessary to pass arguments to super() for 3.5 classes.
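A minimal sketch of what the two comments above suggest. `BaseFileMetadata` here is a stand-in defined only for this example, not WaterButler's real class, and the `plan_id` key is made up. The sketch also guards against a related pitfall: `dict.update()` returns `None`, so a property that returns the result of `.update(...)` directly would return `None`.

```python
class BaseFileMetadata:
    """Stand-in base class for illustration only (not WaterButler's real class)."""

    def __init__(self, raw):
        self.raw = raw

    @property
    def extra(self):
        return {'base': True}


class DmptoolFileMetadata(BaseFileMetadata):
    def __init__(self, raw):
        # Python 3: zero-argument super() works out the class and instance itself.
        super().__init__(raw)

    @property
    def extra(self):
        # dict.update() mutates in place and returns None, so build the
        # dictionary first and return it explicitly.
        extra = dict(super().extra)
        extra.update({'plan_id': self.raw.get('id')})  # hypothetical key
        return extra
```

With this shape, `DmptoolFileMetadata({'id': 7}).extra` is a merged dictionary rather than `None`.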

waterbutler/providers/dmptool/provider.py Outdated


		return result

		async def plans_full(self, id_=None, format_='json'):

Member

felliott Apr 4, 2017

Any particular reason for the underscores at the end of the kwarg names? Just curious, I haven't seen that before.

Member

felliott Apr 4, 2017

This method, like plans(), should probably be split into two methods

waterbutler/providers/dmptool/provider.py Outdated

+                      try:
+                          plan_id = path.parts[1].raw
+                          pdf = await self._dmptool_plan_pdf(plan_id)
+                      except Exception as e:

Member

felliott Apr 4, 2017

I tried downloading a "Test or Practice"-visible plan, but got the following message: {"response": "You are not authorized to look at this content."}. Are those not available via the API?

Author

rdhyee Apr 5, 2017

A possible bug that won't be corrected: The DMPTool API does not return the PDF for plans with test visibility.

waterbutler/providers/dmptool/provider.py

+                      """
+                      raise exceptions.ReadOnlyProviderError(self)
+                  async def _do_intra_move_or_copy(self, dest_provider, src_path, dest_path):

Member

felliott Apr 4, 2017

I don't think you need this method, it's not called by anything else.

Member

felliott Apr 4, 2017

But you will want to add move() and copy() methods that respect the readonly-ness of DMPTool. See the bitbucket provider for examples: https://github.com/CenterForOpenScience/waterbutler/pull/198/files#diff-af4a691a976fbf59e9aabd632386d206R191
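A rough sketch of the read-only pattern referred to above. Everything here is a stand-in written for illustration: `ReadOnlyProviderError` mimics the exception raised in the diff, `BaseProvider` is a toy base class, and the choice to allow copying out to a different provider while refusing moves is an assumption about the desired behaviour, not a copy of the linked code.

```python
class ReadOnlyProviderError(Exception):
    """Stand-in for the ReadOnlyProviderError raised in the diff above."""

    def __init__(self, provider_name):
        super().__init__('Provider "{}" is read-only'.format(provider_name))


class BaseProvider:
    """Toy base class; its copy() only records what was requested."""
    NAME = 'base'

    async def copy(self, dest_provider, src_path, dest_path):
        return (self.NAME, dest_provider.NAME, src_path, dest_path)


class ReadOnlyProvider(BaseProvider):
    NAME = 'dmptool'

    async def move(self, *args, **kwargs):
        # Moving would delete the source, which a read-only provider forbids.
        raise ReadOnlyProviderError(self.NAME)

    async def copy(self, dest_provider, *args, **kwargs):
        # Assumption: copying out to another provider is fine, but copying
        # into this same read-only provider is not.
        if dest_provider.NAME == self.NAME:
            raise ReadOnlyProviderError(self.NAME)
        return await super().copy(dest_provider, *args, **kwargs)
```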

waterbutler/providers/dmptool/provider.py Outdated

+                      else:
+                          resp = await self.get_url_async('plans/{}'.format(id_))
+                          r = await resp.json()
+                          result = r.get('plan')

Member

felliott Apr 4, 2017

Minor: since the only commonality between these two paths is return result, how about splitting this into two methods, _get_plans() and _get_plan(id)?
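The suggested split could look roughly like this. `PlanClient` and `_fetch_json` are hypothetical names standing in for the provider and its HTTP call, and the shape of the list response (a list of `{'plan': ...}` objects) is an assumption; only the `'plan'` key of the single-plan response appears in the diff.

```python
import asyncio


class PlanClient:
    """Hypothetical client; `_fetch_json` stands in for the real HTTP call."""

    def __init__(self, responses):
        self._responses = responses  # canned JSON keyed by URL path

    async def _fetch_json(self, path):
        return self._responses[path]

    async def _get_plans(self):
        """Return every plan (assumes a list of {'plan': {...}} entries)."""
        data = await self._fetch_json('plans')
        return [entry.get('plan') for entry in data]

    async def _get_plan(self, plan_id):
        """Return a single plan by id."""
        data = await self._fetch_json('plans/{}'.format(plan_id))
        return data.get('plan')
```

Each method now has one return shape, so callers no longer need to pass `None` to select the listing behaviour.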

waterbutler/providers/dmptool/provider.py Outdated

+                  def plans_templates(self):
+                      return self._unroll(self.get_url('plans_templates').json())
+                  def institutions_plans_count(self):

Member

felliott Apr 4, 2017

This and the previous two methods don't seem to be used anywhere. Are these stubs for future development, or just code that hasn't been trimmed yet?

Author

rdhyee Apr 9, 2017

About using id_ and format_ as parameter names, I've gotten in the habit of trying to avoid names that conflict with functions in the builtins. Some discussion at https://stackoverflow.com/questions/16523789/naming-conflict-with-built-in-function. I'm ok with taking the _ out since the code should work without the underscore (I think!).
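A small illustration of the trade-off being discussed; the function names are made up for the example. Using `id` and `format` as parameter names is legal and only hides the builtins inside the function body, while the trailing underscore (the convention PEP 8 describes for keyword clashes, often applied to builtins too) keeps them usable.

```python
import builtins


def plan_label(id, format='json'):
    # The parameters shadow the builtins only inside this function body.
    # Calling format(...) here would fail because `format` is now a string,
    # so the builtin has to be reached through the builtins module.
    return builtins.format(id, '05d') + '.' + format


def plan_label_safe(id_, format_='json'):
    # Trailing underscores leave the builtin format() directly usable.
    return format(id_, '05d') + '.' + format_
```

Both versions behave the same from the caller's side; the difference only matters if the function body later needs the builtin.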

waterbutler/providers/dmptool/provider.py Outdated

+                          else:
+                              return None
+                  async def plans_owned(self):

Member

felliott Apr 4, 2017

This is probably already on your todo list, but can you clarify the difference between plans(), plans_full(), and plans_owned()?

rdhyee added 4 commits

April 6, 2017 17:26


          view plan in DMPTool enabled

5e448ec


          now we filter out plans that have test visibility from the DMPTool addon

673a17d


          document plans, plans_full, plans_owned_full

61bb0e2

removed unused method: plans_templates, institutions_plans_count


          change id_, format_ to id, format

951c0ba

Author

rdhyee commented Apr 12, 2017

I thought the Download-as-a-zip function was working, but although I got a valid zip file containing several files with the right names and a .pdf suffix, the PDF files themselves are invalid. I can download each PDF file separately. How should I go about debugging this problem? (I was under the impression that zip downloading should come "for free" once downloading works for each file.) Maybe I am missing some necessary metadata. For example, I'm setting the length of the file to 0 and not the actual size of the PDF. Might that be the problem?

Author

rdhyee commented Apr 12, 2017

Another bug for which I'd appreciate help debugging: on a page for loading a specific DMPTool pdf (e.g., http://localhost:5000/kjayw/), the files sidebar doesn't finish loading:

The error in the console:

index.js:478 Uncaught TypeError: Cannot read property 'fullName' of undefined
    at Object.view (index.js:478)
    at Object.view (mithril.js:558)
    at redraw (mithril.js:638)
    at Function.m.redraw (mithril.js:624)
    at Function.m.endComputation (mithril.js:655)
    at endFirstComputation (mithril.js:662)
    at Function.m.mount.m.module (mithril.js:605)
    at Object.<anonymous> (file-page.js:9)
    at __webpack_require__ (bootstrap 08e8d500beb7e96abbbc:50)
    at webpackJsonpCallback (bootstrap 08e8d500beb7e96abbbc:21)
    at file-page.js:1
view @ index.js:478
view @ mithril.js:558
redraw @ mithril.js:638
m.redraw @ mithril.js:624
m.endComputation @ mithril.js:655
endFirstComputation @ mithril.js:662
m.mount.m.module @ mithril.js:605
(anonymous) @ file-page.js:9
__webpack_require__ @ bootstrap 08e8d500beb7e96abbbc:50
webpackJsonpCallback @ bootstrap 08e8d500beb7e96abbbc:21
(anonymous) @ file-page.js:1
l10n.js:108 Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user's experience. For more help, check https://xhr.spec.whatwg.org/.
xhrLoadText @ l10n.js:108
loadImport @ l10n.js:229
parseRawLines @ l10n.js:215
parseProperties @ l10n.js:235
(anonymous) @ l10n.js:244
xhr.onreadystatechange @ l10n.js:115
pdf.worker.min.js:39 The provided value 'moz-chunked-arraybuffer' is not a valid enum value of type XMLHttpRequestResponseType.
request @ pdf.worker.min.js:39
requestFull @ pdf.worker.min.js:39
c @ pdf.worker.min.js:872
(anonymous) @ pdf.worker.min.js:877
a.onmessage @ pdf.worker.min.js:8

When I dug into the problem, it boiled down to osf.io/index.js at 732b9fff76492ea22867d41780f7cb3eb31c9505 · rdhyee/osf.io, in which storageAddons (for reasons I don't understand) does not have a dmptool key, though there are keys for box, dataverse, etc.

Any suggestion about how to debug this problem?

Member

felliott commented Apr 12, 2017

Sorry for replying out-of-band, but for some reason GH won't let me comment on #204 (comment) directly (the issue with DAZ, download-as-zip, not working). Yes, I suspect setting the size to 0 is your problem. When we build a DAZ zipfile, we use the streaming zipfile spec, which requires that we pass correct file size information in the zip central directory data structure.

If setting the correct file size doesn't work, PLMK, and I'll dig deeper.

Member

felliott commented Apr 12, 2017

Re: #204 (comment) (storageAddon) -- those are defined in https://github.com/CenterForOpenScience/osf.io/blob/develop-backup/website/static/storageAddons.json. You probably just need to add a key for dmptool, and it should start working.

Author

rdhyee commented Apr 12, 2017 •

edited

Re the need to calculate the length of the files: I will likely need to download a file to calculate its size. Is there a smart way to keep my addon from downloading a file multiple times during the construction of the zip file? That is, is there any caching mechanism available for me to use?

On further reflection: I'm going to look at whether doing an HTTP HEAD on the right API method will give me the length of the file.

rdhyee added a commit to rdhyee/osf.io that referenced this pull request


          addresses CenterForOpenScience/waterbutler#204 (comment)

d3944e6

Member

felliott commented Apr 12, 2017

I may have spoken too soon about the file sizes. DAZ does download the file, and keeps a tally of its size as it goes, which I think is what is fed to the zipfile builder. For some reason, I was thinking you were overriding that and passing 0 as the size. I'm not sure where I got that idea from. I'll have another look.

Author

rdhyee commented Apr 12, 2017

About file size: I think I have been setting the length to 0 somewhere (I can chase down the reference). Is it possible to not set the length at all?

Member

felliott commented Apr 12, 2017

You can override the Content-Length header when downloading a file, but we actually keep a separate tally when building the zip, since providers are pretty unreliable. Manually setting the length to zero sounds weird to me, but it shouldn't affect the building of the zipfile. I wonder if the files are getting truncated, or if maybe the deflater is corrupting them?
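To make the "separate tally" idea concrete, and to give a way to check for truncation, here is a small sketch using Python's standard `zipfile` module. It is not WaterButler's streaming zip code; the helper names are invented for the example. Each file is written chunk by chunk while the byte count is tracked independently of any length the provider reports.

```python
import io
import zipfile


def add_stream_to_zip(zf, name, chunks):
    """Write an iterable of byte chunks into `zf` as `name`; return bytes written."""
    total = 0
    with zf.open(name, mode='w') as entry:
        for chunk in chunks:
            entry.write(chunk)
            total += len(chunk)  # our own tally, independent of Content-Length
    return total


def build_zip(files):
    """Build a zip in memory from (name, chunks) pairs; return (bytes, sizes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        sizes = {name: add_stream_to_zip(zf, name, chunks) for name, chunks in files}
    return buf.getvalue(), sizes
```

Comparing the tallied size and the extracted bytes against a direct download of the same file is one way to tell truncation apart from corruption during compression.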

rdhyee added 2 commits

April 13, 2017 17:01


          progress in terms of creating a zip file: a valid zip file is downloa…

131b461

…ded but all the files have file names with the id (with no .pdf suffix)


          add move and copy methods to dmptool.provider --> doesn't work yet

e950279

Author

rdhyee commented Apr 14, 2017

Re: adding copy and move, I used the BitBucket addon as a model and committed the change to this PR (DMPTool Waterbutler provider by rdhyee · Pull Request #204 · CenterForOpenScience/waterbutler). Unfortunately, I ran into problems when I tried to test the functionality by selecting a file in the DMPTool section and then dragging it to OSF Storage:

A few things I don't understand:

As I've highlighted on screenshot, I got a "Moving" instead of a "Copying" message (as I would expect).
The file doesn't actually get copied. The exception I get:

[2017-04-13 20:34:24,336][ERROR][tornado.application]: Uncaught exception POST /v1/resources/8c6js/providers/dmptool/26459? (127.0.0.1)
HTTPServerRequest(protocol='http', host='localhost:7777', method='POST', uri='/v1/resources/8c6js/providers/dmptool/26459?', version='HTTP/1.1', remote_ip='127.0.0.1', headers={'Dnt': '1', 'Content-Length': '92', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36', 'Connection': 'keep-alive', 'Referer': 'http://localhost:5000/8c6js/files/', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Type': 'Application/json', 'Origin': 'http://localhost:5000', 'Cookie': 'csrftoken=mal4CtqoeFsomEL7ah6pqeYfJfpKZtQ8; _xsrf=2|0cf0ceea|f08f38f4abc0c83a2a848fddbcef1860|1489700718; username-localhost-8889="2|1:0|10:1491363387|23:username-localhost-8889|44:YjRhZDcwMmZmZGQ2NDU5M2FjNDY0MTAwYWI1N2IyYWU=|66998c0f03b2617b8961d6f7c3f51678c9bc518194029137bf8002a9591daeeb"; osf=58f042b61193c74ded5fd52b.s_E7xxoNfFznnt3WK_PObRoO0_c', 'Host': 'localhost:7777', 'Accept-Language': 'en-US,en;q=0.8', 'Accept': '*/*'})
Traceback (most recent call last):
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/web.py", line 1445, in _execute
    result = yield result
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/gen.py", line 1014, in run
    yielded = self.gen.throw(*exc_info)
  File "<string>", line 6, in _wrap_awaitable
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/server/api/v1/provider/__init__.py", line 104, in post
    return (await self.move_or_copy())
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/server/api/v1/provider/movecopy.py", line 102, in move_or_copy
    metadata, created = await tasks.wait_on_celery(result)
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 55, in wrapped
    return (await backgrounded(func, *args, **kwargs))
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 48, in backgrounded
    functools.partial(func, *args, **kwargs)
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/futures.py", line 358, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 33, in wrapped
    return ensure_event_loop().run_until_complete(func(*args, **kwargs))
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/base_events.py", line 337, in run_until_complete
    return future.result()
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/Users/raymondyee/anaconda/envs/waterbutler/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/Users/raymondyee/C/src/waterbutler/waterbutler/tasks/core.py", line 126, in wait_on_celery
    raise exceptions.WaitTimeOutError
waterbutler.tasks.exceptions.WaitTimeOutError
[2017-04-13 20:34:24,352][INFO][tornado.access]: 202 POST /v1/resources/8c6js/providers/dmptool/26459? (127.0.0.1) 15872.53ms

I don't actually see any print messages that I embedded in the move and copy methods, suggesting that neither was called.
I have a celery worker running but could it be a celery configuration problem?
I've not implemented the v2 api methods -- are they needed?

Any hints as to how to debug the problem would be appreciated

Member

felliott commented Apr 14, 2017

Re: move() / copy():

1.) Good catch! I'm not sure where the wording for that is set, so I'll need to dig into the code to figure it out. I suspect it's something that needs to be changed in the osf PR.

2.) That exception is okay. When a process takes too long to complete it gets shunted off into celery, and we throw the WaitTimeOutError. It's an expected error. We throw it internally to abort the normal request handling process and indicate that we should return a 202.

3+4.) You definitely need the waterbutler celery worker running for this. Once the process is sent to WB's celery, any debugging output will appear in its terminal window.

5.) Sorry, what v2 api methods are you referring to?
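The control flow described in point 2 can be sketched as follows. This is an illustration only: `WaitTimeOutError` is a local stand-in for the exception in the traceback, `wait_on_task` and `handle_request` are invented names, and the 200 status for the fast path is an assumption. Unlike a real celery worker, `asyncio.wait_for` cancels the slow coroutine on timeout rather than letting it finish in the background.

```python
import asyncio


class WaitTimeOutError(Exception):
    """Stand-in for waterbutler.tasks.exceptions.WaitTimeOutError."""


async def wait_on_task(coro, timeout):
    # Wait a bounded time for the work; signal a timeout with our own error.
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        raise WaitTimeOutError from None


async def handle_request(work, timeout=0.05):
    try:
        result = await wait_on_task(work, timeout)
    except WaitTimeOutError:
        # Expected path for slow operations: report "accepted", not a failure.
        return 202, None
    return 200, result  # assumed success status for the fast path
```

Seen this way, the traceback in the log is the mechanism for producing the 202 response, not evidence that the copy itself failed.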

Author

rdhyee commented Apr 14, 2017

I just looked at the window for the WB celery worker. There's no error message, but I don't see any sign of the task ever being completed. (I reloaded the page and still don't see any copied file in the OSFStorage section.) Any tips on debugging celery tasks?
About v2 api methods, I think I was confused by seeing _build_v2_repo_url in the bitbucket provider and thinking that it had something to do with validate_v1_path (which I do have implemented for dmptool), and that I might need a validate_v2_path, which is not right. So strike that question.
I'm surprised not to see output from the print statements I have in the move and copy methods when I drag a file from dmptool to OSFStorage. Is that expected behavior?

Member

felliott commented Apr 14, 2017

1.) debugging celery: does the request get logged in the celery worker window? Or does it just remain blank? I just tried it locally using your branch and was able to successfully copy a plan to osfstorage.

2.) Hah! ETOOMANYAPIS: that's referring to the OSF API, for which v2 is the latest version.

3.) They should be appearing. Did you restart the celery task after you added those?

rdhyee added 4 commits

April 17, 2017 23:24


          This commit is now based on a more solid understanding of WaterButler…

9f3c653

…Path. The zip file now has working pdf files with the right names.


          handle revisions

589db1e


          added a license for DMPTool waterbutler provider

3824d6d


          I think I've now fixed the file naming issues in the download of indi…

b6e8254

…vidual files from DMPTool, the zip file, and in the copy to OSF Storage

coveralls commented May 27, 2017

Changes Unknown when pulling b6e8254 on rdhyee:feature/dmptool into CenterForOpenScience:develop.

Member

felliott commented Jun 1, 2017

Upstream DMPTool will be releasing a new version of their API soon. This PR has been archived as the feature/dmptool-integration branch in this repo and in https://github.com/felliott/waterbutler. Closing.

felliott closed this

felliott mentioned this pull request

[WIP] Dmptool evernote #171

Closed
