Commit d60bc51: rsync upload docs - rough draft [#3348]

A rough draft of detailed rsync upload instructions. Needs more loving once we know more details about this feature.

dlmurphy committed Aug 10, 2017 (1 parent: 2b607a1)

Showing 1 changed file with 16 additions and 14 deletions: doc/sphinx-guides/source/user/dataset-management.rst

@@ -125,38 +125,40 @@ There are several advanced options available for certain file types.

.. _rsync_upload:

- Rsync Upload
+ rsync Upload
  ------------

- Rsync is typically used for synchronizing files and directories between two different systems, using SSH to connect rather than HTTP, to better facilitate large file transfers.
+ rsync is typically used for synchronizing files and directories between two different systems, using SSH to connect rather than HTTP. Some Dataverse installations allow uploads using rsync, to facilitate extremely large file transfers in a reliable and secure manner.

pdurbin (Member) commented Aug 11, 2017:

The word "extremely" strikes me as slightly over the top. I believe we're only talking about 100 GB or so.

pdurbin (Member) commented Aug 11, 2017:

This might be a good place to mention that rsync also allows you to resume uploads, a feature request we're tracking in #2960.

pameyer (Contributor) commented Aug 11, 2017:

"rsync" may be lower-level than they would need to worry about (other than dependencies). They'd care about: non-browser upload, preservation of file naming/directory structure, and (maybe) client-side checksums and resumability.

dlmurphy (Author, Contributor) commented Aug 11, 2017:

@pameyer I generally agree with this, but I'm inclined to use the term "rsync" only in the docs (not the UI) for these reasons:

  • In the UI, "rsync" is too technical, but if the user clicks the link in the UI to find out more, then I'm not as worried about throwing too much information at them

  • Using the term "rsync" gives the user a solid word to latch onto for this feature, rather than more generic phrases like "non-browser upload" or "upload via script"

  • This way, users can google "rsync" if they want more technical info than what we provide

  • This way, we can more easily explain why the feature doesn't work out of the box for Windows users

That being said, I am definitely trying not to get too 'in the weeds' in this docs section, the user isn't going to need to know exactly how this stuff works.


File Upload Script
~~~~~~~~~~~~~~~~~~

- Download the file upload script in order to upload files via a terminal window, to run the rsync script.
+ An rsync-enabled Dataverse installation has a file upload process that differs from the traditional browser-based upload process you may be used to. In order to transfer your data to Dataverse's storage, you will need to complete the following steps:

- Features
+ 1. Create your dataset. In rsync-enabled Dataverse installations, you cannot upload files until the dataset creation process is complete. After you hit "Save Dataset" on the Dataset Creation page, you will be taken to the page for your dataset.

- - File upload is disabled on dataset create because, in order to produce the upload script and have a container to store the files, the dataset needs to exist.
- - Instead of an upload page, you have an upload popup, with instructions to follow, as well as a script to download.
- - There are requirements for preparing your data before upload, like making sure all your files are in one directory. Anything else?
+ 2. On the dataset page, click the "+ Upload Files" button. This will open a box with instructions and a link to the file upload script.

- Upload In Progress
- ~~~~~~~~~~~~~~~~~~
+ 3. Make sure your files are ready for upload. You will need to have one directory that you can point the upload script to. All files in this directory and in any subdirectories will be uploaded. The directory structure will be preserved, and will be represented when your dataset is downloaded from Dataverse. Note that your data will be uploaded in the form of an rsync package, and each dataset can only host one such package. Be sure that all files you want to include are present before you upload.

pdurbin (Member) commented Aug 11, 2017:

I don't believe we want to call this an "rsync package". Rather, we just want to call it a "package" and we can imagine in the future having packages that were not created via rsync uploads.

pameyer (Contributor) commented Aug 11, 2017:

typo - "represented" -> "reproduced"

dlmurphy (Author, Contributor) commented Aug 11, 2017:

Both good suggestions, implementing them in my new draft.

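Step 3's "one directory" requirement can be sketched as follows; all directory and file names here are illustrative, not anything Dataverse prescribes.

```shell
# Everything to be uploaded lives under one directory; subdirectories
# are allowed, and their structure is preserved in the package.
mkdir -p mydataset/raw mydataset/processed
echo "reading 1" > mydataset/raw/day1.csv
echo "summary"   > mydataset/processed/summary.txt
# The upload script would be pointed at mydataset/; this lists
# exactly what would be included:
find mydataset -type f
```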

- Features
+ 4. Download the rsync file upload script using the link in the Upload Files instruction box. There are no requirements for where you save the script; put it somewhere you can find it.

+ 5. To begin the upload process, you will need to run the script you downloaded. For this, you will have to go outside your browser and open a terminal (AKA command line) window on your computer. Use the terminal to navigate to the directory where you saved the upload script, and run the command that the Upload Files instruction box provides. This will begin the upload script. Please note that this upload script will expire 7 days after you downloaded it. If it expires and you still need to use it, simply download the script from Dataverse again.

pdurbin (Member) commented Aug 11, 2017:

Typo: "script script".

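Step 5 boils down to changing into the directory where the script was saved and running it. Here is a stand-in sketch; the real script's name, contents, and invocation come from the Upload Files instruction box, so everything below is hypothetical.

```shell
# Create a hypothetical stand-in for the downloaded upload script
# (the real one is downloaded from Dataverse, not written by hand).
printf '#!/bin/sh\necho "upload script running"\n' > upload.bash
# Run it from the directory where it was saved:
sh ./upload.bash
```

A running script can be stopped with Ctrl-C in the same terminal window, the standard way to cancel a script mid-run.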

+ **Note:** Unlike other operating systems, Windows does not come with rsync installed by default. If you are using Windows, you may need to install rsync before the upload script will work. The developers of rsync recommend `cwRsync <https://www.itefix.net/cwrsync>`_ for Windows users.

+ 6. Follow the instructions provided by the upload script running in your terminal. If you need to cancel the upload, you can do so by canceling the script running in your terminal window.

pdurbin (Member) commented Aug 11, 2017:

I suppose we could mention Ctrl-c as the standard way to stop a running script.

pameyer (Contributor) commented Aug 11, 2017:

rsync isn't the only dependency; unless a Windows system has a few standard unix utilities also installed (and playing nicely together) the upload process won't work.

dlmurphy (Author, Contributor) commented Aug 11, 2017:

Seems like for now our best bet might be to just say something like "Unlike other operating systems, Windows does not come with rsync supported by default. We have not optimized this feature of Dataverse for Windows users, but you may be able to get it working if you install the right unix utilities. If you have found a way to get this feature working for you on Windows, please email support@dataverse.org with your solution and we'll add it to this guide."


+ 7. Once the upload script completes its job, Dataverse will begin ingesting your data upload. This may take some time depending on the file size of your upload. While your upload is ingesting, you will not be able to delete or publish your dataset, and you will not be able to upload more files. You will still be able to edit the dataset's metadata, though. Once ingest is complete, the disabled functions will be enabled again. During ingest, you will see a blue bar at the bottom of the dataset page that reads "Upload in progress..."

pdurbin (Member) commented Aug 11, 2017:

Ingest has a specific meaning in Dataverse and has to do with tabular files. The word ingest would be good to include in a glossary in the guides.

pameyer (Contributor) commented Aug 11, 2017:

Agreed - it might be more accurate to refer to it as checksum validation (or verification?); "ingest" has connotations of making changes to dataset files in addition to its Dataverse-specific meaning.

dlmurphy (Author, Contributor) commented Aug 11, 2017:

I see your point about the word "ingest". Changing it to "processing", unless that also has issues.

pdurbin (Member) commented Aug 14, 2017:

"Processing" is fine. Thanks @dlmurphy !

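The "checksum validation" the reviewers discuss above can be sketched locally; the record-then-verify pattern is the point, and whether Dataverse uses SHA-256 for this step is an assumption here, not something the draft states.

```shell
# Record a checksum when the data is packaged, then verify it
# after transfer to confirm the files arrived intact.
echo "payload" > part1.dat
sha256sum part1.dat > manifest.sha256   # record at upload time
sha256sum -c manifest.sha256            # verify after transfer
```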

- - Dataset locks, "upload in progress" msg displayed, some features like publish, delete, upload are disabled.
- - You can edit your metadata still.
- - Cancel upload by canceling script in terminal window.
+ 8. Once ingest is complete, you will be notified, and your data will be available for download on the dataset page. At this point, the upload feature for this dataset will be disabled. If you need to upload a new version of your data, you will need to delete the dataset's current data package and upload a new one.

pdurbin (Member) commented Aug 11, 2017:

My understanding is that data won't be available for download until the RSAL component is installed and configured. I wrote a bit about RSAL at https://github.com/IQSS/dataverse/blob/f8809c39f9b4f44566acccb964269cd95d051ca9/doc/sphinx-guides/source/developers/big-data-support.rst#repository-storage-abstraction-layer-rsal

dlmurphy (Author, Contributor) commented Aug 11, 2017:

Is this something that needs to be represented in the user guide? RSAL seems like a backend thing that a user uploading their data wouldn't need to concern themselves with.

pameyer (Contributor) commented Aug 11, 2017:

  • RSAL may be a bit too technical for the user guide; but data files aren't available for download until the dataset has been published.
  • The system was designed for one upload (and one resulting dataverse package file) per dataset; deleting a datafile package/directory and trying to re-create will lead to undefined behavior.

Dataverse Package
~~~~~~~~~~~~~~~~~

Features

- Instead of a bunch of files displayed, you have one file, a "Dataverse Package".
- Once you've uploaded your files, upload is disabled.
- There are Data Access locations, as well as Verify Data commands, displayed on the dataset and file pages.
- "If delete, delete the dataset"?

pdurbin (Member) commented Aug 11, 2017:

I'm confused by this. Deleting a package deletes the dataset?

dlmurphy (Author, Contributor) commented Aug 11, 2017:

I'm also confused about this. @mheppler, can you clarify?

pameyer (Contributor) commented Aug 11, 2017:

Once data files have been transferred, and the checksums have been verified, future uploads for that dataset are disabled.


