Lightweight versioned data

The rationale for lightweight versioned data is laid out in this paper. The goal is simple: to provide versioned data to the world in a way that allows coded access to all versions, new and old. Moreover, this project aims to provide the tools to do this for free, without onerous ongoing financial or time commitments, while maintaining curation of the underlying data.

This set of instructions relies on a basic knowledge of git and GitHub. If you're a bit rusty on these, see here for a general introduction. This tutorial frequently uses tools for setting up R packages. For an excellent general introduction to the topic, see Hadley Wickham's website/book.

Setting up a new lightweight versioned dataset

Setting up the software

Fork and clone this versioned_data_template repository.
Rename your repository (on GitHub this is currently done via the settings button) to reflect your dataset. This name should be the same as the R package that will distribute your data, so something short, precise, and memorable is preferred.
Add your data to the repository folder, preferably as a .csv file. (If your data is too complex for a csv file, you can still use this template; see the option for non-csv data structures below. The data may be, but does not have to be, pushed to the cloud repository.)
Install the R package devtools if you don't have it already.
Install a package called datastorr, which manages the interface between your computer and GitHub behind the scenes. For more on datastorr functionality see this repo. To install, in R run:

devtools::install_github("ropenscilabs/datastorr")

Renaming things for your specific dataset

In the dataset_access.R file, rename the main function called dataset_access_function to something specific to your dataset.
Also in the dataset_access.R file, find the dataset_info function and change 1) the name of the repository to your repository name and 2) the name of the file to the name of the file that contains your data.
Option for non-csv data structures: If your dataset is too complicated to fit into a csv file, you will have to write an input function that loads your data into R. Write this input function, include it in dataset_access.R, and replace read_csv with a call to your input function so that your dataset reads into R in a form that's convenient for your users.
Modify the DESCRIPTION and LICENSE files to reflect the title, authors, and date for your package (see here for more details).
The template is set up with an MIT licence, but this may be changed for different situations.
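As a rough sketch of the non-csv option above, assuming the template follows the pattern shown in the datastorr README (an info function built with datastorr::github_release_info that is given a read function): the names read_my_data and my_dataset_info, the file name my_data.rds, the list elements traits and sites, and the repository placeholder are all hypothetical and should be replaced with your own.

```r
# Hypothetical input function for data that does not fit in one csv file.
# Here the data are assumed to be stored as an .rds file holding a named
# list of data frames; adapt the body to your own file format.
read_my_data <- function(filename) {
  x <- readRDS(filename)
  # Fail early if the file does not have the structure users will expect.
  stopifnot(is.list(x), all(c("traits", "sites") %in% names(x)))
  x
}

# Hypothetical replacement for the template's dataset_info function.
# The repository string and file name are placeholders, not real values.
my_dataset_info <- function(path = NULL) {
  datastorr::github_release_info(
    repo = "<your_username>/<your_repository>",
    filename = "my_data.rds",
    read = read_my_data,  # used in place of read_csv
    path = path
  )
}
```

Because the input function takes a file name and returns an R object, it can be checked on a local copy of the data before any release is made.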

Documentation for your users (we have set up this template using roxygen2 for ease of use):

Include a description of the dataset in the documentation section of the function formerly known as dataset_access_function which you renamed above. This will show up as the R help file for users once they download and install your package.
It is also good practice to add an example of that same function to show users how to use the important features of your dataset.
Include the meta-data for your dataset. We recommend adding a zip file to the release with the appropriate meta-data. In this way the meta-data is both versioned and included in the DOI.
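A minimal sketch of what the roxygen2 documentation for the access function might look like is given below. The function name my_dataset, the file name my_data.csv, the repository placeholder, and the descriptive text are hypothetical; the function body assumes the datastorr pattern of github_release_info followed by github_release_get and may differ from what the template actually contains.

```r
#' Access the my_dataset data (hypothetical example)
#'
#' Describe here what the dataset contains, how it was collected,
#' and how it should be cited.
#'
#' @param version Version of the dataset to load, as a string such as
#'   "0.0.1". The default, NULL, uses the current version as determined
#'   by datastorr.
#' @param path Directory in which downloaded versions are stored. The
#'   default, NULL, leaves the choice of location to datastorr.
#' @return The dataset, as returned by the read function.
#' @export
#' @examples
#' \dontrun{
#' d <- my_dataset()
#' d_first <- my_dataset("0.0.1")
#' }
my_dataset <- function(version = NULL, path = NULL) {
  # Placeholders below must be replaced with your repository and file name.
  info <- datastorr::github_release_info(
    repo = "<your_username>/<your_repository>",
    filename = "my_data.csv",
    read = utils::read.csv,
    path = path
  )
  datastorr::github_release_get(info, version)
}
```

The example is wrapped in \dontrun{} because it needs network access and an existing release, which will not be available when the package is first built.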

Building your documentation and loading your package

In R, with the appropriate working directory set, call devtools::document()
Load your package locally by calling devtools::load_all() (or install it with devtools::install())

Upload to the cloud

Commit and push your changes to GitHub.
On your local machine, run <your_package_name>:::dataset_release("<description>"), where <description> is a description of what changed in the package. This should push version 0.0.1 to a GitHub release.

Testing your package

Test that everything works by calling the function formerly known as dataset_access_function. The data should download from GitHub and load into R without errors.
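One way to make this check repeatable is a small helper like the sketch below. The helper check_dataset is hypothetical and not part of the template or of datastorr. It assumes your access function takes the version as its first argument and returns a data frame, which will not hold if you wrote a custom input function that returns something else.

```r
# Hypothetical smoke test: call the access function and confirm that
# something table-like and non-empty comes back.
check_dataset <- function(access_fun, version = NULL) {
  d <- access_fun(version)
  stopifnot(is.data.frame(d), nrow(d) > 0, ncol(d) > 0)
  invisible(d)
}

# Usage, with your renamed access function in place of my_dataset:
# d <- check_dataset(my_dataset)
# str(d)
```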

Setting up Digital Object Identifier (DOI) assignment

The specifics of this depend on which DOI minter you use. We have used both Zenodo and figshare. Each has its own short tutorial for setting this up. The Zenodo/GitHub tutorial is here. All of the points made in the tutorial apply equally to code and to data.

That's it. You now have a package that is set up for distributing stable versioned data to the world.

Managing interactions with users of your database

We recommend asking users to flag issues using the issue tracker on GitHub. This allows specific questions to be asked, discussed, and resolved. Note: if you find an issue with this tutorial, please raise an issue on this repository! In some cases these queries may lead to improvements of the underlying dataset; in that case, it makes sense to release a new version of the database.

Maintaining a versioned dataset

When your dataset improves through error fixes or data additions, and you're happy with the changes, there are a few simple steps to release a new version.

Update the DESCRIPTION file to increase the version number. Semantic versioning is one way to manage these changes.
Update your data and meta-data files with all your improvements.
Commit the data, code, and DESCRIPTION changes and push them to GitHub.
With R in the package directory, run

<your_package_name>:::dataset_release("<description>")

where "<description>" is a brief description of the improvements to the dataset.
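The version bump in the first step can be done by hand, or with a small helper such as the sketch below. The function bump_version is hypothetical and not part of the template or of datastorr. It assumes a three-part major.minor.patch version in DESCRIPTION and uses only base R. Note that write.dcf may reformat long fields, so check the diff before committing.

```r
# Hypothetical helper: increase the Version field of a DESCRIPTION file.
bump_version <- function(desc_path = "DESCRIPTION",
                         part = c("patch", "minor", "major")) {
  part <- match.arg(part)
  desc <- read.dcf(desc_path)
  old <- as.integer(strsplit(desc[1, "Version"], ".", fixed = TRUE)[[1]])
  # Only plain three-part versions such as 0.0.1 are handled.
  stopifnot(length(old) == 3, !anyNA(old))
  new <- switch(part,
    major = c(old[1] + 1L, 0L, 0L),
    minor = c(old[1], old[2] + 1L, 0L),
    patch = c(old[1], old[2], old[3] + 1L)
  )
  desc[1, "Version"] <- paste(new, collapse = ".")
  write.dcf(desc, desc_path)
  invisible(unname(desc[1, "Version"]))
}

# Usage, from the package directory:
# bump_version(part = "patch")  # error fixes
# bump_version(part = "minor")  # new data added
```

Under semantic versioning, one possible convention is to use a patch bump for corrections, a minor bump for added data, and a major bump for changes that alter the structure of the dataset.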

Happy data versioning!
