dataset: Create Data Frames that are Easier to Exchange and Reuse #553

antaldaniel · 2022-08-15T11:06:08Z

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset/
Version submitted: 0.1.7
Submission type: Standard
Editor: @annakrystalli
Reviewers: @msperlin, @romanflury

Due date for @msperlin: 2022-09-19

Due date for @romanflury: 2022-09-21

Archive: TBD
Version accepted: TBD
Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Date: 2022-08-19
Version: 0.1.7.3
Authors@R: 
    person(given = "Daniel", family = "Antal", 
           email = "daniel.antal@dataobservatory.eu", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           )
Description: The aim of the 'dataset' package is to make tidy datasets easier to release, 
    exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced, 
    well-described, interoperable datasets into release and reuse ready form. A subjective 
    interpretation of the  W3C  DataSet recommendation and the datacube model  <https://www.w3.org/TR/vocab-data-cube/>, 
    which is also used in the global Statistical Data and Metadata eXchange standards, 
    the application of the connected Dublin Core <https://www.dublincore.org/specifications/dublin-core/dcmi-terms/> 
    and DataCite <https://support.datacite.org/docs/datacite-metadata-schema-44/> standards 
    preferred by European open science repositories to improve the findability, accessibility,
    interoperability and reusability of the datasets.
License: GPL (>= 3)
URL: https://github.com/dataobservatory-eu/dataset
BugReports: https://github.com/dataobservatory-eu/dataset/issues
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Depends: 
    R (>= 2.10)
LazyData: true
Imports: 
    assertthat,
    ISOcodes,
    utils
Suggests: 
    covr,
    declared,
    dplyr,
    eurostat,
    here,
    kableExtra,
    knitr,
    rdflib,
    readxl,
    rmarkdown,
    spelling,
    statcodelists,
    testthat (>= 3.0.0),
    tidyr
VignetteBuilder: knitr
Config/testthat/edition: 3
Language: en-US

You can find the package website on dataset.dataobservatory.eu. The article Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse will eventually be condensed into a JOSS paper. It has a major development dilemma.

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- data retrieval
- data extraction
- data munging
- [x ] data deposition
- data validation and testing
- workflow automation
- version control
- citation management and bibliometrics
- scientific software wrappers
- field and lab reproducibility tools
- database software bindings
- geospatial data
- text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Open science repositories and analysts' computers are full of datasets that have no provenance, structural or referential data. We believe that, whenever possible, metadata should be machine-recorded and should not be detached from an R object.
There are several R packages that have overlapping goals or functionality with dataset, but they use a different philosophy. When exporting to different files, the metadata should be written out at export time, but no sooner, and preferably into the file that contains the data.
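To make the point about detached metadata concrete, here is a minimal base R sketch. The attribute names (`Title`, `Creator`) and the example data are assumptions chosen for illustration; they are not the package's actual attribute layout.

```r
# Hypothetical example data; the attribute names are illustrative only.
df <- data.frame(geo = c("NL", "BE"), value = c(1.5, 2.5))
attr(df, "Title")   <- "Example dataset"
attr(df, "Creator") <- "Jane Doe"

# An RDS round trip keeps the attributes together with the data.
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

# A plain CSV round trip keeps only the table; the metadata is detached.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

attr(from_rds, "Title")  # the metadata survives
attr(from_csv, "Title")  # NULL: the metadata was lost on export
```

This is why an export step has to decide explicitly where the metadata goes whenever the target format cannot carry R attributes.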
Who is the target audience and what are scientific applications of this package?

This package is intended to give a common foundation to the rOpenGov reproducible research packages. It mainly serves communities that want to reuse statistical data (using the SDMX statistical (meta)data exchange sources, like Eurostat, IMF, World Bank, OECD...) or release new datasets from primary social sciences data that can be integrated into an SDMX compatible API or placed on a knowledge graph. Our main aim is to provide a clear publication workflow to the European open science repository Zenodo, and clear serialization strategies to RDF application.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
The dataspice package aims to create well-defined and referenced datasets, but follows a different schema and a different publication strategy. The dataset package follows the more restrictive W3C/SDMX "DataSet" definition within the datacube model, which is better suited to synchronize with statistical data sources. Unlike dataset, it uses a manual metadata entry from CSV files. (See the documentation of the dataspice package.)

The dataset package aims for a higher level of reproducibility, and does not detach the metadata from the R object's attributes (it is aimed to be used in other reproducible research packages that will directly record provenance and other transactional metadata into the attributes.) We aim to bind together dataspice and dataset by creating export functions to csv files that contain the same metadata that dataspice records. Generally, dataspice seems to be better suited to raw, observational data, while dataset for statistically processed data.

The intended use of dataset is to start correctly recording referential, structural and provenance metadata retrieved by various reproducible science packages that interact with statistical data (such as the rOpenGov packages eurostat and iotables, or the oecd package).

Neither dataset nor dataspice is very suitable for documenting social sciences survey data, which are usually held in datasets. Our aim is to connect dataset, declared and DDIwR to create such datasets with DDI codebook metadata. They will create a stable new foundation for the retroharmonize package to create new, well-documented and harmonized statistical datasets from the observational datasets of social sciences surveys.

The zen4R package provides reproducible export functionality to the Zenodo open science repository. Interacting with zen4R may be intimidating for the casual R user as it uses R6 classes. Our aim is to provide an export function that completely wraps the workings of zen4R when releasing the dataset.

In our experience, while the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database, the application of the DataSet definition and the datacube model with the information science metadata standards makes reuse more efficient when exchanging and combining the data with other data in different datasets.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Yes

If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

[x ] I have read the rOpenSci packaging guide.
[x ] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.

This package:

[x ] does not violate the Terms of Service of any service it interacts with.
[ x] has a CRAN and OSI accepted license.
[ x] contains a README with instructions for installing the development version.
[ x] includes documentation with examples for all functions, created with roxygen2.
[x ] contains a vignette with examples of its essential functions and uses.
[ x] has a test suite.
has continuous integration, including reporting of test coverage.

Publication options

[x ] Do you intend for this package to go on CRAN? -> Yes, I started the CRAN publication process, but opted to stop and get feedback from rOpenSci first
Do you intend for this package to go on Bioconductor? -> Don't know.
Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

The package is novel and will be of interest to the broad readership of the journal.
The manuscript describing the package is no longer than 3000 words.
You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
(Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
(Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
(Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

[ x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.


ropensci-review-bot · 2022-08-15T11:06:09Z

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot · 2022-08-15T11:06:11Z

🚀

The following problem was found in your submission template:

URL = [https://repourl] is not valid
The package could not be checked because of problems with the URL.
Editors: Please ensure these problems are rectified, and then call @ropensci-review-bot check package.

👋

adamhsparks · 2022-08-15T11:52:38Z

Hi, @antaldaniel, could you please fix the repo URL by providing a link to the package’s repository? 🙏

antaldaniel · 2022-08-15T14:52:10Z

@adamhsparks Apologies for the original issue problem, I hope all is fine now. I added both the github repo and the package website url

mpadge · 2022-08-15T15:50:46Z

@antaldaniel Then you can start the checks yourself by calling @ropensci-review-bot check package

antaldaniel · 2022-08-15T19:05:41Z

@ropensci-review-bot check package

ropensci-review-bot · 2022-08-15T19:05:41Z

Thanks, about to send the query.

ropensci-review-bot · 2022-08-15T19:05:44Z

🚀

Editor check started

👋

ropensci-review-bot · 2022-08-16T08:27:18Z

Checks for dataset (v0.1.7)

git hash: 2eb439b5

✔️ Package name is available
✖️ does not have a 'codemeta.json' file.
✖️ does not have a 'contributing' file.
✔️ uses 'roxygen2'.
✔️ 'DESCRIPTION' has a URL field.
✔️ 'DESCRIPTION' has a BugReports field.
✔️ Package has at least one HTML vignette
✖️ These functions do not have examples: [attributes_measures].
✖️ Function names are duplicated in other packages
✖️ Package has no continuous integration checks.
✖️ Package coverage is 67.8% (should be at least 75%).
✔️ R CMD check found no errors.
✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: GPL (>= 3)

1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type	package	ncalls
internal	base	159
internal	dataset	79
internal	stats	4
imports	utils	4
imports	rlang	1
imports	assertthat	NA
imports	ISOcodes	NA
suggests	declared	NA
suggests	dplyr	NA
suggests	eurostat	NA
suggests	here	NA
suggests	kableExtra	NA
suggests	knitr	NA
suggests	rdflib	NA
suggests	readxl	NA
suggests	rmarkdown	NA
suggests	spelling	NA
suggests	statcodelists	NA
suggests	testthat	NA
suggests	tidyr	NA
linking_to	NA	NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

names (26), data.frame (14), class (12), paste (9), rep (7), sapply (7), unlist (6), which (6), attr (5), lapply (5), length (5), ncol (5), subset (4), as.character (3), attributes (3), c (3), logical (3), seq_along (3), vapply (3), as.data.frame (2), as.numeric (2), cbind (2), file (2), inherits (2), matrix (2), nrow (2), round (2), args (1), date (1), deparse (1), for (1), gsub (1), ifelse (1), is.null (1), paste0 (1), rbind (1), tolower (1), union (1), unique (1), url (1), UseMethod (1)

dataset

dimensions (6), attributes_measures (5), measures (5), all_unique (3), dataset_title (3), related_item (3), creator (2), datacite (2), dataset (2), dataset_source (2), description (2), geolocation (2), identifier (2), language (2), metadata_header (2), publication_year (2), publisher (2), related_item_identifier (2), resource_type (2), add_date (1), add_relitem (1), arg.names (1), attributes_names (1), bibentry_dataset (1), datacite_add (1), dataset_download (1), dataset_download_csv (1), dataset_export (1), dataset_export_csv (1), dataset_local_id (1), dataset_title_create (1), dataset_uri (1), dimensions_names (1), document_package_used (1), dot.names (1), dublincore (1), dublincore_add (1), extract_year (1), is.dataset (1), measures_names (1), print (1), print.dataset (1), resource_type_general (1), rights (1), subject (1), time_var_guess (1), version (1)

stats

df (2), time (2)

utils

citation (1), object.size (1), read.csv (1), sessionInfo (1)

rlang

get_expr (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

code in R (100% in 26 files) and
1 authors
7 vignettes
no internal data file
4 imported packages
56 exported functions (median 10 lines of code)
82 non-exported functions in R (median 15 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

loc = "Lines of Code"
fn = "function"
exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure	value	percentile	noteworthy
files_R	26	87.0
files_vignettes	7	98.5
files_tests	27	97.6
loc_R	1000	68.2
loc_vignettes	676	84.7
loc_tests	371	68.8
num_vignettes	7	99.2	TRUE
n_fns_r	138	83.6
n_fns_r_exported	56	89.5
n_fns_r_not_exported	82	79.7
n_fns_per_file_r	3	55.0
num_params_per_fn	2	11.9
loc_per_fn_r	15	46.1
loc_per_fn_r_exp	10	22.2
loc_per_fn_r_not_exp	15	49.5
rel_whitespace_R	27	78.3
rel_whitespace_vignettes	36	88.3
rel_whitespace_tests	25	70.7
doclines_per_fn_exp	39	48.6
doclines_per_fn_not_exp	0	0.0	TRUE
fn_call_network_size	103	79.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks (click to open)

3b. `goodpractice` results

`R CMD check` with rcmdcheck

R CMD check generated the following check_fail:

no_description_date

Test coverage with covr

Package coverage: 67.81

The following files are not completely covered by tests:

file	coverage
R/creator.R	64.29%
R/datacite_attributes.R	0%
R/datacite.R	46.88%
R/dataset_uri.R	0%
R/dataset.R	48.36%
R/document_package_used.R	0%
R/dublincore.R	67.74%
R/publication_year.R	55.56%
R/related_item.R	66.67%

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function	cyclocomplexity
datacite_add	24
dublincore_add	23

Static code analyses with lintr

lintr found the following 383 potential issues:

message	number of times
Avoid 1:ncol(...) expressions, use seq_len.	4
Avoid library() and require() calls in packages	20
Avoid using sapply, consider vapply instead, that's type safe	4
Lines should not be more than 80 characters.	352
Use <-, not =, for assignment.	3
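As a brief illustration of why lintr flags the `1:ncol(...)` and `sapply()` patterns listed above, the following base R sketch uses an empty data frame as an edge case. The object names are made up for the example and do not come from the package.

```r
# A data frame with zero columns exposes both problems.
empty <- data.frame()

colon_idx <- 1:ncol(empty)         # counts down: 1 0
safe_idx  <- seq_len(ncol(empty))  # integer(0), so a loop body never runs

# sapply() changes its return type with the input; vapply() does not.
sapply_res <- sapply(empty, class)
vapply_res <- vapply(empty, class, character(1))

class(sapply_res)  # "list" for empty input
class(vapply_res)  # "character", as declared
```

With non-empty input both variants usually agree, which is why these issues tend to surface only in edge cases.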

4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

- dataset from assemblerr, febr, robis
- description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- dimensions from gdalcubes, openeo, sp, tiledb
- identifier from Ramble
- is.dataset from crunch
- language from sylly, wakefield
- measures from greybox, mlr3measures, tsibble
- size from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- subject from DGM, emayili, gmailr, sendgridr
- version from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter

Package Versions

package	version
pkgstats	0.1.1.20
pkgcheck	0.1.0.3

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

adamhsparks · 2022-08-16T23:48:09Z

Hi again, @antaldaniel. If you could please address the issues that the bot flagged with the ✖️, then I can proceed with your submission.

antaldaniel · 2022-08-17T10:32:05Z

Hi @adamhsparks I hope I managed to add these things, with the following exception.

✔️ does not have a 'codemeta.json' file -> added with codemetar.
✔️ does not have a 'contributing' file -> added CONTRIBUTING.md
✔️ These functions do not have examples: [attributes_measures]. -> added
✖️ Function names are duplicated in other packages

I tried to avoid duplications while keeping in mind the rOpenSci duplication guidelines, and at this point, I do not see which the duplications are and whether there is any sensible way to resolve them.

Your guidelines state "Avoid function name conflicts with base packages or other popular ones (e.g. ggplot2, dplyr, magrittr, data.table)". The package currently has no name conflict with any packages that I was thinking of to be used together, and I do not know how to test for this. (Apologies if this is somewhere in the 1.3 Package API)

✔️ Package has no continuous integration checks -> added
✖️ Package coverage is 67.8% (should be at least 75%)

I do not see a sensible way to achieve 75%+ codecov coverage with a metadata package that is in an early development phase and still has development questions open (see Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse, hence the submission here before the first CRAN release). For example, in the target category, other metadata management packages like codemetar have 42% coverage and EML has 65%, both below the current coverage before the first release of dataset.

mpadge · 2022-08-17T10:53:24Z

@antaldaniel You may indeed ignore the "Function names are duplicated in other packages." That will soon be changed from a failing check (:heavy_multiplication_x:) to an advisory note only. Sorry for any confusion there. @adamhsparks will comment further on the code coverage.

antaldaniel · 2022-08-17T15:43:08Z

@mpadge I do not seem to find the output where this information is coming from, but I think that it is nevertheless a very useful reminder, and it would be good to see what conflicts your bot has found. Again, apologies if I ask the obvious, but where can I check what duplicates were flagged by your bot?

mpadge · 2022-08-17T15:49:18Z

It's in the check results. Under "4. Other Checks", you'll see a "Details of other checks (click to open)". You can also generate those yourself by running:

library(pkgcheck)
checks <- pkgcheck("/<path>/<to>/<dataset-pkg>")
checks_md <- checks_to_markdown(checks, render = TRUE)

That will automatically open a HTML-rendered version of the checks, just like the above. You can use that repeatedly as you work through the issues highlighted above.

antaldaniel · 2022-08-17T16:11:56Z

@mpadge Oh, really, sorry for asking the obvious.

I would like to comment here on the substance of the issue then. The main development question of the package, which aims to turn R objects into standard datasets (as defined by W3C and SDMX) by adding structural and referential metadata, is whether the best way to do this is to create an S3 object or not (see the dilemma here.)

In the current stage, it is a pseudo object inherited from data.frame, but it can also be seen as a utility for any data.frame, tibble, and data.table (or similar tabular format) R objects. The functions that have duplicates in other packages follow a very simple naming convention. I think that this is the cleanest API that I can think of, for example, the

subject() gets the metadata attribute Subject and the subject<-() sets it. As DataCite, Dublin Core and schema.org have dozens of potential attributes, to me the easiest is to use the name of the attribute, in a slightly modified form, to set/get its value.

All these functions are lowercase to manipulate a camelCase standard attribute. Except for the SDMX attribute 'attribute', which would create a conflict with the base R 'attributes()' function.
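A minimal sketch of this getter/setter convention in base R could look as follows. It is an illustrative re-implementation that assumes the value is stored in an attribute literally named `Subject`; it is not the package's own code.

```r
# Illustrative re-implementation, not the package's own code.
# Assumption: the metadata lives in an attribute named "Subject".
subject <- function(x) {
  attr(x, "Subject")
}

# A replacement function must take `value` last and return the modified object.
`subject<-` <- function(x, value) {
  attr(x, "Subject") <- value
  x
}

df <- data.frame(a = 1:3)
subject(df) <- "Example subject"
subject(df)
```

The pair mirrors base R idioms such as `names(x)` and `names(x) <- value`, which is also why short generic names like these can collide with functions exported by other packages.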

adamhsparks · 2022-08-18T07:53:57Z

Hi @antaldaniel,
I can understand the difficulty in writing tests for such a non-standard package. But I've had a look at covr::report() for "dataobservatory-eu/dataset". I think that there is still low-hanging fruit here that can be covered to get your code-coverage up to 75% that we ask for.

For instance, Lines 40-43 are covered but Lines 44-45 aren't. These are seemingly the same except for checking on 2 or 3 letter ISO codes, unless I'm mistaken.

Or the message response within the stop() functions in the same file aren't checked.
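A sketch of what such additional tests might look like is given below. The validator `classify_language_code()` is a hypothetical stand-in with one branch per code length; it is not the function from the package, and the return labels are invented for the example.

```r
# Hypothetical validator with one branch per code length; not the package's code.
classify_language_code <- function(code) {
  if (nchar(code) == 2) {
    "alpha_2"
  } else if (nchar(code) == 3) {
    "alpha_3"
  } else {
    stop("Language code must have 2 or 3 letters: ", code)
  }
}

# The tests only run where testthat is installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("each code length has its own expectation", {
    testthat::expect_equal(classify_language_code("en"), "alpha_2")
    testthat::expect_equal(classify_language_code("eng"), "alpha_3")
  })
  testthat::test_that("the stop() message itself is checked", {
    testthat::expect_error(classify_language_code("english"), "2 or 3 letters")
  })
}
```

Giving each branch and each error message its own expectation is usually enough for covr to count the corresponding lines as covered.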

Could I ask that you have another look and see if you can't further improve the coverage a bit more?

antaldaniel · 2022-08-19T12:20:37Z

Hi @adamhsparks I went up to 71.27%, but further changes are not very productive. I did not extensively cover two areas: one is the constructor for the dataset() itself, where I expect potentially breaking changes, and the other is the file I/O areas, where I think I would like to come up with a more general solution, and also avoid tests being run on CRAN later. As the overwrite function and its messages make the most branches, this is a bit of a play with %, as the very same copied test is tested again and again.

Do you have a good solution to include download and file I/O tests that run fast enough or cause no disruption when later run on CRAN?

antaldaniel · 2022-08-19T17:43:01Z

@adamhsparks I am much above your threshold, and apologies for the trivial error. I wanted to omit some issues in the dataset() constructor, but I did not realize that it had some old code that had been rewritten - the tests were omitting them, of course, but they sat at the bottom of the file. It is now 81.2% covered, I know that it has to improve, but I'd prefer to do it when some issues are resolved in a clear direction (see my comment above.)

adamhsparks · 2022-08-20T03:27:30Z

Hi @antaldaniel, that's great to see. Thank you for rechecking everything and updating.

If you have tests that you feel are unconducive for CRAN, I'd just use (and do liberally use) skip_on_cran(). Reviewers should hopefully be able to help guide you on this more.
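A minimal sketch of a file I/O test guarded with skip_on_cran() is shown below. The helper `roundtrip_csv()` is hypothetical and only stands in for whichever export function is being tested.

```r
# Hypothetical helper standing in for a file I/O function under test.
roundtrip_csv <- function(df, path = tempfile(fileext = ".csv")) {
  on.exit(unlink(path), add = TRUE)  # clean up the temporary file
  utils::write.csv(df, path, row.names = FALSE)
  utils::read.csv(path)
}

# The test only runs where testthat is installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("CSV export can be read back", {
    # Skipped on CRAN; runs when the NOT_CRAN environment variable is "true".
    testthat::skip_on_cran()
    input <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
    testthat::expect_equal(roundtrip_csv(input), input)
  })
}
```

Writing to tempfile() keeps the test from touching the user's file space, which matters on CRAN regardless of whether the test is skipped there.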

adamhsparks · 2022-08-20T03:27:55Z

@ropensci-review-bot check package

ropensci-review-bot · 2022-08-20T03:27:56Z

Thanks, about to send the query.

ropensci-review-bot · 2022-08-20T03:27:59Z

🚀

Editor check started

👋

ropensci-review-bot · 2022-08-20T03:33:39Z

Checks for dataset (v0.1.7.0002)

git hash: 93c03c54

✔️ Package name is available
✔️ has a 'codemeta.json' file.
✔️ has a 'contributing' file.
✔️ uses 'roxygen2'.
✔️ 'DESCRIPTION' has a URL field.
✔️ 'DESCRIPTION' has a BugReports field.
✔️ Package has at least one HTML vignette
✔️ All functions have examples.
✖️ Function names are duplicated in other packages
✔️ Package has continuous integration checks.
✔️ Package coverage is 82.1%.
✔️ R CMD check found no errors.
✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: GPL (>= 3)

1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type	package	ncalls
internal	base	147
internal	dataset	66
internal	stats	2
imports	utils	2
imports	assertthat	NA
imports	ISOcodes	NA
suggests	covr	NA
suggests	declared	NA
suggests	dplyr	NA
suggests	eurostat	NA
suggests	here	NA
suggests	kableExtra	NA
suggests	knitr	NA
suggests	rdflib	NA
suggests	readxl	NA
suggests	rmarkdown	NA
suggests	spelling	NA
suggests	statcodelists	NA
suggests	testthat	NA
suggests	tidyr	NA
linking_to	NA	NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

names (21), class (12), data.frame (10), paste (9), vapply (9), rep (7), character (6), unlist (6), attr (5), lapply (5), length (5), ncol (5), subset (4), as.character (3), c (3), seq_along (3), as.data.frame (2), as.numeric (2), attributes (2), cbind (2), file (2), inherits (2), logical (2), matrix (2), nrow (2), round (2), which (2), date (1), for (1), ifelse (1), is.null (1), paste0 (1), rbind (1), seq_len (1), tolower (1), union (1), unique (1), url (1), UseMethod (1)

dataset

attributes_measures (5), dimensions (4), all_unique (3), dataset_title (3), measures (3), creator (2), datacite (2), dataset (2), dataset_source (2), description (2), geolocation (2), identifier (2), language (2), metadata_header (2), publication_year (2), publisher (2), related_item_identifier (2), resource_type (2), bibentry_dataset (1), datacite_add (1), dataset_download (1), dataset_download_csv (1), dataset_export (1), dataset_export_csv (1), dataset_local_id (1), dataset_title_create (1), dataset_uri (1), dublincore (1), dublincore_add (1), extract_year (1), is.dataset (1), print (1), print.dataset (1), related_item (1), resource_type_general (1), resource_type_general_allowed (1), rights (1), subject (1), time_var_guess (1), version (1)

stats

df (2)

utils

object.size (1), read.csv (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

code in R (100% in 24 files) and
1 authors
7 vignettes
no internal data file
3 imported packages
56 exported functions (median 10 lines of code)
66 non-exported functions in R (median 15 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

loc = "Lines of Code"
fn = "function"
exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure	value	percentile	noteworthy
files_R	24	85.5
files_vignettes	7	98.5
files_tests	28	97.7
loc_R	889	64.9
loc_vignettes	676	84.7
loc_tests	432	72.0
num_vignettes	7	99.2	TRUE
n_fns_r	122	81.1
n_fns_r_exported	56	89.5
n_fns_r_not_exported	66	74.6
n_fns_per_file_r	3	54.4
num_params_per_fn	2	11.9
loc_per_fn_r	11	32.3
loc_per_fn_r_exp	10	22.2
loc_per_fn_r_not_exp	15	49.5
rel_whitespace_R	27	75.4
rel_whitespace_vignettes	36	88.3
rel_whitespace_tests	28	76.4
doclines_per_fn_exp	39	48.6
doclines_per_fn_not_exp	0	0.0	TRUE
fn_call_network_size	103	79.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

GitHub Workflow Results

id	name	conclusion	sha	run_number	date
2891146042	pkgcheck	failure	93c03c	17	2022-08-19
2891146050	test-coverage	success	93c03c	20	2022-08-19

3b. `goodpractice` results

`R CMD check` with rcmdcheck

R CMD check generated the following check_fail:

no_description_date

Test coverage with covr

Package coverage: 82.12

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function	cyclocomplexity
datacite_add	24
dublincore_add	23

Static code analyses with lintr

lintr found the following 370 potential issues:

message	number of times
Avoid library() and require() calls in packages	20
Lines should not be more than 80 characters.	350

4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

- dataset from assemblerr, febr, robis
- description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- dimensions from gdalcubes, openeo, sp, tiledb
- identifier from Ramble
- is.dataset from crunch
- language from sylly, wakefield
- measures from greybox, mlr3measures, tsibble
- size from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- subject from DGM, emayili, gmailr, sendgridr
- version from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter

Package Versions

package	version
pkgstats	0.1.1.20
pkgcheck	0.1.0.3

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

adamhsparks · 2022-08-22T00:34:36Z

@ropensci-review-bot assign @melvidoni as editor

ropensci-review-bot · 2022-08-22T00:34:38Z

Assigned! @melvidoni is now the editor

msperlin · 2022-09-25T18:00:28Z

Hi @antaldaniel,

Thanks. I appreciate the chance to contribute in making datasets better.

As for the future, it seems you still have some structural decisions to make and I'm sure you'll sort it out with time.

best,

melvidoni · 2022-09-25T20:29:26Z

Hello all, and thank you. @antaldaniel we will keep this on hold until Q1, once you have completed the package. Please, let us know by then.

annakrystalli · 2023-01-13T08:36:51Z

@ropensci-review-bot assign @annakrystalli as editor

ropensci-review-bot · 2023-01-13T08:36:55Z

Assigned! @annakrystalli is now the editor

antaldaniel · 2023-01-13T08:40:43Z

Hi @annakrystalli , just wanted to give a short update. The small changes suggested in this thread were implemented, and the early version of the package was released on CRAN. I am devising a 2-year development plan for the package and have a clear overview of planned milestones. When done, I will contact the other mentioned package owners/maintainers with this plan. With the main developers, who are not software engineers, but statisticians with statistical software development expertise, we will have a kick-off meeting in the last week of January.

annakrystalli · 2023-01-13T09:05:09Z

Ok great! Thanks for the update @antaldaniel

annakrystalli · 2023-05-04T08:21:15Z

Hello @antaldaniel ! Was wondering whether you had any updates on progress on the package?

antaldaniel · 2023-05-04T09:06:05Z

Hi @annakrystalli , there has been very little change, only in documentation; I have secured development funding and will publish a more detailed development concept and look for paid and volunteer contributors in the coming weeks. I would like to ask you what would be an excellent way to do so; apart from adding this as a vignette to this early-stage package, would it be possible to raise attention by a blog post or something similar?

annakrystalli · 2023-05-04T12:43:02Z

Great to hear you have secured development funding! You are always welcome to advertise on the rOpenSci Slack, especially in the #jobs channel. Blog posts are always a good idea, but the rOpenSci blog is reserved for promoting packages once they have completed review, so it wouldn't be appropriate at this stage.

antaldaniel · 2023-11-09T15:37:41Z

After a very long time, here is a conceptual working paper on the development with far more detailed specification than before, and some code ideas:

Making Datasets Truly Interoperable in R is a working paper that accompanies the development of the package.

The working paper can be referenced with:

I am also looking for volunteer and potentially paid contributors to the package.

The source file is usually more recent: `dataset-working-paper.qmd`

annakrystalli · 2023-11-10T13:45:47Z

Thank you for the update @antaldaniel !

Good to hear you are making progress with the plans. Ultimately, I feel the package will still remain on hold until it has been developed enough to be considered, if not ready, at least close to release. That is when feedback from reviewers will be most useful, and it is also better aligned with what reviewers are expected to contribute.

Let us know when you feel you have reached that stage!

antaldaniel · 2023-12-08T15:28:38Z

@annakrystalli I think a review would be useful now, because I am currently implementing the working paper Making Datasets Truly Interoperable. I just sent a new version to CRAN, but there is still room for review. Also, if somebody wants to get involved in the development, I do have a public grant for it and could take on a co-developer.

The new version (which is an entire rewrite since the first review) is on the dataset.dataobservatory.eu/ website with the connecting GitHub repo. I do see a problem with your CI attached to the package, though: it throws errors that look to me like configuration errors rather than real errors, as the package builds fine on AppVeyor and R-hub.

ldecicco-USGS · 2024-02-28T15:32:31Z

@ropensci-review-bot check package

ropensci-review-bot · 2024-02-28T15:32:33Z

Thanks, about to send the query.

ropensci-review-bot · 2024-02-28T15:32:36Z

🚀

The following problem was found in your submission template:

HTML variable [due-dates-list] is missing
Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

ropensci-review-bot · 2024-02-28T15:49:07Z

Checks for dataset (v0.3.1)

git hash: b1dca41e

✔️ Package name is available
✔️ has a 'codemeta.json' file.
✔️ has a 'contributing' file.
✖️ The following functions have no documented return values: [provenance, subsetting, var_labels, xsd_convert]
✔️ uses 'roxygen2'.
✔️ 'DESCRIPTION' has a URL field.
✔️ 'DESCRIPTION' has a BugReports field.
✔️ Package has at least one HTML vignette
✖️ These functions do not have examples: [dataset_to_triples].
✔️ Package has continuous integration checks.
✔️ Package coverage is 79%.
✔️ R CMD check found no errors.
✔️ R CMD check found no warnings.
👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)
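
Both failing checks above concern roxygen2 documentation: every exported function needs a `@return` tag and an `@examples` section. A minimal sketch of the expected shape follows. It uses a hypothetical helper, `count_rows()`, which is an assumption for illustration and not a function of the dataset package.

```r
#' Count the rows of a data frame
#'
#' `count_rows()` is a hypothetical helper used only for illustration;
#' it is not part of the dataset package.
#'
#' @param x A data frame.
#' @return A single integer: the number of rows in `x`.
#' @examples
#' count_rows(iris)
#' @export
count_rows <- function(x) {
  # Fail early on anything that is not a data frame.
  stopifnot(is.data.frame(x))
  nrow(x)
}
```

After adding the tags, `roxygen2::roxygenise()` or `devtools::document()` regenerates the `.Rd` files so that the check can pick up the new sections.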

Package License: GPL (>= 3)

1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type	package	ncalls
internal	base	312
internal	dataset	178
internal	graphics	6
imports	assertthat	22
imports	utils	11
imports	stats	10
imports	ISOcodes	NA
suggests	dataspice	NA
suggests	covr	NA
suggests	declared	NA
suggests	dplyr	NA
suggests	eurostat	NA
suggests	here	NA
suggests	kableExtra	NA
suggests	knitr	NA
suggests	rdflib	NA
suggests	readxl	NA
suggests	rmarkdown	NA
suggests	spelling	NA
suggests	statcodelists	NA
suggests	testthat	NA
suggests	tidyr	NA
suggests	tibble	NA
suggests	nycflights13	NA
suggests	tsibble	NA
suggests	data.table	NA
linking_to	NA	NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

as.character (40), ifelse (40), is.null (38), list (30), c (16), data.frame (14), names (10), lapply (8), attr (7), paste0 (7), inherits (6), class (5), col (5), drop (4), invisible (4), seq_along (4), which (4), as.POSIXct (3), character (3), date (3), for (3), format (3), length (3), ncol (3), Sys.time (3), unlist (3), vapply (3), all (2), args (2), as.data.frame (2), as.numeric (2), dim (2), paste (2), rbind (2), round (2), substitute (2), t (2), url (2), with (2), apply (1), as.Date (1), cbind (1), comment (1), do.call (1), environment (1), get (1), if (1), max (1), nchar (1), new.env (1), range (1), rep (1), substr (1), switch (1), Sys.Date (1)

dataset

dataset_bibentry (28), dataset_title (10), dataset (8), rights (8), subject (8), creator (7), description (6), publisher (6), identifier (5), language (5), new_Subject (5), provenance (5), xsd_convert (5), DataStructure (4), convert_column (3), publication_year (3), as_bibentry (2), as_dublincore (2), dots_number (2), geolocation (2), get_type (2), getdata (2), idcol_find (2), is_person (2), is.dataset (2), provenance_add (2), related_item_identifier (2), size (2), subject_create (2), version (2), as_datacite (1), as_dataset (1), as_dataset.data.frame (1), datacite (1), dataset_download (1), dataset_download_csv (1), dataset_prov (1), dataset_title_create (1), dataset_to_triples (1), dataset_ttl_write (1), datasource_get (1), datasource_set (1), DataStructure_update (1), describe (1), describe.dataset (1), dublincore (1), get_prefix (1), get_resource_identifier (1), head.dataset (1), id_to_column (1), initialise_dsd (1), is.datacite (1), is.datacite.datacite (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), new_datacite (1), new_dataset (1), new_dublincore (1), old_function (1), print.dataset (1), related_item (1), set_var_labels (1), set_var_labels.dataset (1)

assertthat

assert_that (22)

utils

bibentry (3), data (2), person (2), citation (1), object.size (1), read.csv (1), tail (1)

stats

df (5), var (3), ar (1), family (1)

graphics

title (6)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

code in R (100% in 38 files) and
1 authors
12 vignettes
3 internal data files
4 imported packages
81 exported functions (median 7 lines of code)
117 non-exported functions in R (median 13 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

loc = "Lines of Code"
fn = "function"
exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure	value	percentile	noteworthy
files_R	38	92.7
files_vignettes	12	99.6
files_tests	37	98.6
loc_R	1621	79.9
loc_vignettes	805	87.5
loc_tests	567	77.3
num_vignettes	12	99.9	TRUE
data_size_total	3007	64.7
data_size_median	578	61.1
n_fns_r	198	89.7
n_fns_r_exported	81	93.6
n_fns_r_not_exported	117	86.6
n_fns_per_file_r	3	55.0
num_params_per_fn	3	33.6
loc_per_fn_r	11	32.3
loc_per_fn_r_exp	7	13.5
loc_per_fn_r_not_exp	13	42.7
rel_whitespace_R	25	85.3
rel_whitespace_vignettes	36	91.1
rel_whitespace_tests	28	81.2
doclines_per_fn_exp	38	47.0
doclines_per_fn_not_exp	0	0.0	TRUE
fn_call_network_size	128	83.0

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

GitHub Workflow Results

id	name	conclusion	sha	run_number	date
7677839674	pkgcheck	failure	b1dca4	126	2024-01-27
7677839676	R-CMD-check	failure	b1dca4	46	2024-01-27
7677839673	test-coverage	failure	b1dca4	129	2024-01-27

3b. `goodpractice` results

`R CMD check` with rcmdcheck

R CMD check generated the following check_fail:

no_description_date

Test coverage with covr

Package coverage: 78.97

Cyclocomplexity with cyclocomp

The following function has cyclocomplexity >= 15:

function	cyclocomplexity
[[.dataset	17

Static code analyses with lintr

lintr found the following 417 potential issues:

message	number of times
Avoid 1:length(...) expressions, use seq_len.	1
Avoid 1:ncol(...) expressions, use seq_len.	2
Avoid 1:nrow(...) expressions, use seq_len.	3
Avoid library() and require() calls in packages	23
Lines should not be more than 80 characters.	384
unexpected symbol	2
Use <-, not =, for assignment.	2
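
The `seq_len` messages above matter most for zero-length input. A small base-R sketch of the difference, with made-up variable names:

```r
# A zero-length input, e.g. an empty result set.
x <- character(0)

# 1:length(x) expands to 1:0, which counts down instead of being empty.
bad_index <- 1:length(x)

# seq_len() returns an empty index for a length of zero.
good_index <- seq_len(length(x))

# seq_along() behaves the same way, so the loop body is skipped.
visits <- 0L
for (i in seq_along(x)) {
  visits <- visits + 1L
}
```

With `bad_index`, a `for` loop would run twice on an empty input and typically fail on an out-of-range subscript, which is why lintr suggests `seq_len()` or `seq_along()`.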

4. Other Checks

Details of other checks (click to open)

✖️ The following 12 function names are duplicated in other packages:

- dataset from assemblerr, febr, robis
- describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
- description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- identifier from Ramble
- is.dataset from crunch
- language from sylly, wakefield
- provenance from provenance
- set_var_labels from xpose
- size from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- subject from DGM, emayili, gmailr, sendgridr
- var_labels from formatters, sjlabelled
- version from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter
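
Duplicated names are flagged because, when two attached packages export the same name, the one attached last masks the other. A minimal base-R sketch of that effect and of the `::` workaround follows. The local `var()` is a made-up stand-in for a conflicting export and is not code from any of the packages listed above.

```r
# A made-up function that reuses the name of stats::var().
var <- function(x) "masked"

# The unqualified call finds the local definition first.
unqualified <- var(c(1, 2, 3))

# A namespace-qualified call always reaches the intended function.
qualified <- stats::var(c(1, 2, 3))
```

Users who attach this package alongside one of the listed packages can avoid ambiguity the same way, for example by writing `dataset::subject()` rather than `subject()`.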

Package Versions

package	version
pkgstats	0.1.3.11
pkgcheck	0.1.2.15

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

ldecicco-USGS · 2024-02-28T16:58:29Z

Hi @antaldaniel, since you mentioned "The new version (which is an entire rewrite since the first review)", we're going to treat this as a new submission and get new reviewers. Can you work on the 2 outstanding issues above while I look for a new editor?

Thanks @annakrystalli for the initial work!

antaldaniel · 2024-02-28T17:20:15Z

@ldecicco-USGS thank you for the heads-up, and indeed, I will fix those issues.

ldecicco-USGS · 2024-03-08T14:22:41Z

Let me know when you've updated the package (or go ahead and rerun the "bot" command to check package). Once we've got that taken care of, I'll assign a new editor. Thanks!

antaldaniel changed the title ~~Create Data Frames that are Easier to Exchange and Reuse~~ datasetÉ Create Data Frames that are Easier to Exchange and Reuse Aug 15, 2022

antaldaniel changed the title ~~datasetÉ Create Data Frames that are Easier to Exchange and Reuse~~ dataset: Create Data Frames that are Easier to Exchange and Reuse Aug 15, 2022

adamhsparks added the 0/editorial-team-prep label Aug 20, 2022

ropensci-review-bot assigned melvidoni Aug 22, 2022

ropensci-review-bot added the 1/editor-checks label Aug 22, 2022

ropensci-review-bot assigned annakrystalli and unassigned melvidoni Jan 13, 2023

ropensci-review-bot added the 1/editor-checks label Jan 13, 2023

annakrystalli removed the 1/editor-checks label Jan 13, 2023

maelle mentioned this issue Jan 13, 2023

Re-assigning editors should not change the label if the submission is further in the process ropensci-org/buffy#97

Open

antaldaniel mentioned this issue Dec 4, 2023

dataset luckinet/ontologics#28

Open

ldecicco-USGS added 1/editor-checks and removed 4/review(s)-in-awaiting-changes labels Feb 28, 2024

ldecicco-USGS assigned ldecicco-USGS and unassigned annakrystalli Feb 28, 2024

ldecicco-USGS removed their assignment Feb 28, 2024
