
R Primer

Scope: text files with a “.R” or “.Rmd” extension that contain code for executing programs in the R language

| Topic | Description |
|---|---|
| File Extensions | .R (R script) |
| MIME Type | text/plain |
| Structure | unstructured |
| Version | 2.0 |
| Primary fields or areas of use | Data analytics, statistics, bioinformatics, text mining, data science, digital humanities |
| Source and affiliation | The R Foundation |
| Metadata | Not Applicable |
| Key questions for curation review | What is the purpose of the file? Are any data associated with the file? Are the referenced data present at the indicated location? |
| Tools for curation review | Any text editor, RStudio |
| Date Created | 2019-11-21 |
| Created by | Creators: Lynda Kellam, Cornell University; Katherine Koziar, University of California, Riverside; Stanislav Pejša, Purdue University. Mentor: Joel Herndon, Ph.D. |
| Date updated and summary of changes made | Please see README for complete listing of changes made. Notable changes: accessibility content added by Emily Oxford, University of Michigan |

Suggested Citation: Kellam, Lynda; Koziar, Katherine; Pejša, Stanislav. 2019. R Data Curation Primer. Data Curation Network GitHub Repository.

This work was created as part of the “Specialized Data Curation” Workshop #2 held at Johns Hopkins University in Baltimore, MD on April 17-18, 2019. These workshops have been generously funded by the Institute of Museum and Library Services # RE-85-18-0040-18.


Table of Contents

Description of format  

Ways in which fields may use this format

Examples of R code

Sample dataset citations

Software for viewing or analyzing data

Resources for reviewing data

Key questions to ask yourself

Key clarifications to get from researcher

Applicable metadata standards, core elements, and readme requirements

Accessibility considerations

Preservation actions

What to look for to make sure this file meets FAIR principles 

Documentation of curation process

Appendix A: filetype CURATED checklist

Appendix B: Long description of Figure 1

Description of Format

A file with the extension .R typically contains a script written in R, a programming language and environment for statistical computing and graphics. An R script file is a plain text file that stores R code. It contains only instructions for analysing, manipulating, and visualizing data and, ideally, comments explaining the code. The data are usually stored elsewhere and have to be loaded first, as do any add-on R packages that the script needs in order to run successfully.

A file with an R Markdown extension (.Rmd) contains R code in "code blocks" with text annotations in "text blocks" that typically explain the code or use the code to display visualizations and other supplemental components to support the text. While the file itself can be opened in a text editor, dynamically created output, such as the results of executed code, tables, and plots, appears only when the file is rendered in R (for example, with the rmarkdown package), most commonly from within RStudio.
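
To make the block structure concrete, here is a minimal sketch that builds a tiny R Markdown document in memory and lists its code blocks using base R only. The document text and the chunk labels (`load-data`, `plot-data`) are invented for illustration; the only thing assumed about real files is the usual R Markdown convention that a code block opens with three backticks followed by `{r ...}`.

```r
# The chunk delimiter is three backticks; it is built programmatically so
# that this example can itself sit inside a fenced block.
fence <- strrep("`", 3)

# A hypothetical, minimal .Rmd file held as a character vector of lines.
rmd_lines <- c(
  "---",
  "title: Example report",
  "output: html_document",
  "---",
  "",
  "Text block: this paragraph explains what the next chunk does.",
  "",
  paste0(fence, "{r load-data}"),
  "values <- c(2, 4, 6)",
  fence,
  "",
  "Text block: the plot below shows the values.",
  "",
  paste0(fence, "{r plot-data, echo=FALSE}"),
  "plot(values)",
  fence
)

# Lines that open an R code block start with the fence followed by "{r".
chunk_open <- grep(paste0("^", fence, "\\{r"), rmd_lines)

# Extract each chunk label (the first word after "r" in the chunk header).
chunk_labels <- sub(paste0("^", fence, "\\{r\\s*([^,} ]*).*$"), "\\1",
                    rmd_lines[chunk_open])
print(chunk_labels)
```

For a real submission, `rmd_lines` could be replaced with `readLines("file.Rmd")`, which gives a quick inventory of the code blocks without rendering the document.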

A screenshot of a typical .R file structure in a text editor.

Figure 1. A screenshot of a typical .R file structure (the script was edited). See Appendix B for a screen reader-friendly description. The red boxes indicate different sections of the code with annotations directly to the right of the boxes. Further information about reading R code is included in the Resources for Reviewing Data section of this primer. Data source: Hall et al. 2018, https://doi.org/10.4231/R77W69FM

Ways in which fields may use this format

R is used in a variety of ways. It is popular in statistics, bioinformatics, and data science. It is gaining traction in the digital humanities. R can be used for data analytics, calculations, and data visualization, as R is capable of producing high-quality graphs and plots. 

Examples

WUSTL University Libraries created a good sample R code file that includes common R syntax including assignment statements, arithmetic functions, basic vectors and structures, and how to access data. Information about reading R code is included in the Resources for reviewing data section of this primer.

Sample dataset citations

Hall, H., Ma, J., Shekhar, S., Leon-Salas, W. D., Weake, V. M. (2018). Blue light induces a neuroprotective gene expression program in Drosophila photoreceptors - Supporting data for Hall and Ma et al. (2018). Purdue University Research Repository. doi:10.4231/R77W69FM (https://purr.purdue.edu/publications/3003/1)

Seliger, C. S. (2018). Text Mining and Plotting Tools for KSA / DS / HEI Research Study. Purdue University Research Repository. doi:10.4231/R7MK6B49 (https://purr.purdue.edu/publications/3041/1)  

Software for viewing or analyzing data

An R script is a plain text file that can be opened in any text editor, but it must be run in the R environment to produce results. A simple text editor is sufficient for reviewing the file, but an editor with syntax highlighting, such as Atom, Sublime Text, or Notepad++, will make the review much easier. 

  • R: The R software environment is free to download and use.
  • RStudio: RStudio is an open source IDE (integrated development environment) for R. It is free for non-commercial use. For many users, RStudio is a more intuitive interface for viewing and analyzing .R script files; however, it does have its own accessibility issues. RStudio requires R to be installed separately. RStudio maintains a mirror of the Comprehensive R Archive Network (CRAN), from which precompiled binary distributions of the base system can be downloaded.

.Rmd files can be displayed and manipulated in RStudio; in a plain text editor, the content of R Markdown code and text blocks will be displayed, but none of the visualizations generated by the code (if any) will appear. Many R Markdown files created in RStudio can be saved in HTML format as well and displayed in a web browser.

Resources for reviewing data

R Tutorials

Key questions to ask yourself

  • Are all the packages available? Are all the listed packages used? Which versions of the listed packages were used? 

  • Are all data referenced in the script available? Typically, references to data appear in calls to functions such as ‘read’ or ‘load’ (for example, read.csv).

  • Often data files are stored separately from the script file. Can the data be accessed? Are they provided with the script or is a URL available?  

  • Does the dataset follow R accessibility considerations?

  • Are paths to other files absolute or relative? Absolute paths that are repeated throughout the script usually have to be edited before the script will run on another computer.
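
Several of these questions can be given a first pass mechanically. The sketch below scans the text of a script for package loads, data-reading calls, and quoted absolute paths. The script contents, file names, and regular expressions are all illustrative assumptions; the patterns are deliberately simple and will miss unusual coding styles, so the results should be treated as prompts for a manual read rather than a complete audit.

```r
# A hypothetical script, held as lines of text so the example is self-contained.
script_lines <- c(
  "library(edgeR)",
  "require(\"NOISeq\")",
  "# library(oldpkg) is commented out and should be ignored",
  "counts <- read.csv(\"data/counts.csv\")",
  "load(\"C:/Users/someone/project/results.RData\")",
  "meta <- readRDS('/home/someone/meta.rds')"
)

# Drop whole-line comments so commented-out calls are not reported.
code_lines <- script_lines[!grepl("^\\s*#", script_lines)]

# 1. Packages requested through library() or require().
pkg_pattern <- "(library|require)\\(\\s*[\"']?([A-Za-z0-9.]+)[\"']?"
pkg_calls <- regmatches(code_lines, regexpr(pkg_pattern, code_lines))
packages <- sub(pkg_pattern, "\\2", pkg_calls)

# 2. Lines that bring in data or other scripts: read*(), load(), source().
data_lines <- grep("\\b(read[A-Za-z0-9._]*|load|source)\\(", code_lines,
                   value = TRUE)

# 3. Quoted absolute paths: starting with "/", "~", or a drive letter.
abs_lines <- grep("[\"'](/|~|[A-Za-z]:[/\\\\])", code_lines, value = TRUE)

print(packages)
print(data_lines)
print(abs_lines)
```

To apply this to a deposited file, `script_lines` would come from `readLines()` on the script. Every path reported in step 3 is a candidate for a request to the researcher to switch to relative paths.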

Key clarifications to get from the researcher

  • Are all the packages used openly available? If so, where? (This can be a researcher’s website, but ideally it is a more general sharing platform, such as CRAN or GitHub.) If the packages are not on CRAN, GitHub, or another publicly available site, please include the source code for the packages. 

  • Did the researcher include in-line comment descriptions of blocks of code? If not, ask them to add them, even if the comments are brief. This will improve accessibility and readability for beginner and expert users alike.

  • Did the researcher include text descriptions of any data visualizations to improve accessibility? If not, ask them to add them (see Accessibility Considerations below).

Applicable metadata standards, core elements, and readme requirements

READMEs

There are currently (July 2019) two schools of thought regarding READMEs with scripts. The first is that there should be a README with all scripts that explicitly lists items about the script; we will call this “Best Practices”, since README files are best and standard practices in computer coding and scripting. The second school of thought is that much of this information is already in the script itself; we will call this “Good Enough Practices”. Please note, in order for Good Enough Practices to work, the script file must be properly commented so a user understands what the code is expected to do. A single README file is sufficient for a group of scripts that are grouped together as a single dataset. 

README requirements

README should provide information on  

  • The purpose of the script(s).

  • The version of R used when the script was developed. 

  • How to execute the script(s) in the .R or .Rmd file.

  • The expected output.

  • List of dependencies (things the script(s) need(s) to successfully run) such as

    • Other datasets or additional scripts that are called in the .R file  
    • Packages
  • Accessibility information - which part(s) of the code may introduce accessibility challenges (such as data visualizations), attempts to remediate those challenges, and who to contact for accessibility support if the dataset proves inaccessible to an interested user.

Optional

  • Citation: if there is no separate citation file, READMEs will sometimes include an example citation. 

  • License: if there is no separate license file, READMEs will sometimes include license information.

Notes on Packages

Since some R Packages are actively developed, it is a Best Practice to keep track of the version used. 

library(tidyverse)
library(janitor)
library(lubridate)
library(ApacheLogProcessor)
library(urltools)
library(rgeolocate)
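
One possible way to capture the versions behind a list of library() calls like the one above is to ask R directly and paste the result into the README or a comment. This is a minimal sketch, not a prescribed method; it uses base packages (`stats`, `utils`, `tools`) as stand-ins so that it runs on any installation, and the package names would be swapped for the ones the script actually loads.

```r
# Packages to report on. Base packages are used here so the example runs
# anywhere; substitute the packages your script loads.
pkgs <- c("stats", "utils", "tools")

versions <- data.frame(
  package = pkgs,
  version = vapply(pkgs, function(p) as.character(packageVersion(p)),
                   character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(R.version.string)  # the version of R itself
print(versions)          # one row per package

# sessionInfo() reports R, the platform, and all attached packages at once.
print(sessionInfo())
```

The output of `sessionInfo()` is plain text, so it can be saved next to the script as a small provenance record.
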
Dependency management systems for R

It is possible to encapsulate all dependencies in an environment using dependency management tools. These greatly simplify the management of installed packages and their versions while keeping the dependency list in a plain text file in the project directory.
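
The idea behind such tools can be illustrated in base R: record package versions in a plain text file, then compare that record with what is installed later. The sketch below is only an illustration of the "record and verify" principle; the file name `dependencies.csv` and its two-column layout are assumptions made for this example, and real tools (renv is one commonly used option) use their own file formats and can also reinstall the recorded versions.

```r
# Hypothetical file name and layout; real tools define their own formats.
lock_path <- file.path(tempdir(), "dependencies.csv")

# Record: package names and the versions installed right now.
pkgs <- c("stats", "utils")  # base packages, so the sketch runs anywhere
recorded <- data.frame(
  package = pkgs,
  version = vapply(pkgs, function(p) as.character(packageVersion(p)),
                   character(1)),
  stringsAsFactors = FALSE
)
write.csv(recorded, lock_path, row.names = FALSE)

# Verify: later, or on another machine, compare the record with what is
# installed. colClasses keeps version strings from being read as numbers.
wanted <- read.csv(lock_path, colClasses = "character")
installed <- vapply(wanted$package,
                    function(p) as.character(packageVersion(p)),
                    character(1))
wanted$matches <- installed == wanted$version
print(wanted)
```

A `FALSE` in the `matches` column signals that the curator's environment differs from the depositor's, which is worth noting in the curation log even when the script still runs.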

Metadata Standards

Metadata standards are not usual for most script submissions. Most programming languages do not have metadata standards in the way traditionally understood by librarians and data curators; however, the closest equivalents are coding conventions and standards within the language, and metadata standards for libraries or packages used to extend the language. The former are covered in the Styles for R section below. The latter may be found in any R package hosted on the official Comprehensive R Archive Network page. These standards are not expected in a Good Enough Practices script submission, but if one were looking for R metadata standards, packages are a good place to start. 

Styles for R

Style in terms of a programming language is the way the language is written in a script. There are certain items that are enforced by the code to allow the script to run; however, beyond that scripts may be written as the programmer wishes. It is not expected that a curator should enforce Style in R, but if a researcher wants to make their code more readable, they should follow good style practices. Below are two style guides used in the R community. 

Accessibility considerations

The accessibility of the R file format varies depending on its contents. Most proficient R users will be able to read, run, and broadly understand an R dataset if the relationships between all files are explicitly documented and in-line comments clearly explain the purpose behind each block of code. However, the software often used to create and run R files is inaccessible to many; beyond this, any data visualization generated in an R file will require further work to facilitate access.

RStudio

  • Ensure that no part of the code or accompanying documentation requires interaction with RStudio. In spite of recent updates, RStudio is generally not compatible with screen readers, and those who use them understandably prefer the native R IDE console or the command line interface.1

Data Visualizations

Does the file generate any visualizations? If so, make sure there are multiple ways to access and interpret them.

  • Include the data that created the visualization in the dataset.2 CSV is the preferred file format for accessibility. Be mindful that access to underlying data does not mean that all users will be able to understand the visualization at the same level, since visualizations illuminate patterns and nuances in a way that tabular data alone cannot.3 However, having the data available makes it easier for proficient R users to explore it with tools and methods that work for them, such as sonification.

  • Ask the dataset’s creator(s) to write a detailed description of the visualization.

    • If the visualization is generated in an .R file, include the description in an in-line comment, making sure to briefly state the comment’s purpose in the comment itself. In an .Rmd file, the description can be included either in an in-line comment or in a markdown text block.
    • As the expert on their own work, the creator is better placed than any curator to write a sufficiently detailed description of a visualization.4 The Diagram Center’s Image Description Guidelines are a useful starting point for anyone unsure of how to best describe a graph or diagram in writing.
    • Describe, but do not interpret the visualization for users. They should be able to draw their own conclusions about any patterns the visualization does or does not show.5
  • Ensure that any data visualizations follow accessible color contrast guidelines. There are a variety of automated contrast checkers you can use to see if the colors need to be altered.
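
The points above can be sketched in a few lines of R. The data frame, file name, and description text below are hypothetical. The contrast calculation follows the relative luminance and contrast ratio formulas published in the WCAG 2.x guidelines; it is offered as a rough self-check rather than a replacement for a dedicated contrast checker.

```r
# Hypothetical data behind a bar chart.
plot_data <- data.frame(group = c("control", "treated"),
                        mean_count = c(12.5, 30.1))

# Description of figure (for screen reader users): bar chart of mean_count
# by group; the "treated" bar (30.1) is taller than the "control" bar (12.5).
# In a script that draws the figure, the plotting call would follow here.

# Save the plotted values so they can be explored without the image.
csv_path <- file.path(tempdir(), "figure_data.csv")
write.csv(plot_data, csv_path, row.names = FALSE)

# WCAG 2.x relative luminance of a colour given by name or hex code.
relative_luminance <- function(colour) {
  channels <- as.numeric(col2rgb(colour)) / 255
  linear <- ifelse(channels <= 0.03928,
                   channels / 12.92,
                   ((channels + 0.055) / 1.055)^2.4)
  sum(c(0.2126, 0.7152, 0.0722) * linear)
}

# Contrast ratio between two colours: (lighter + 0.05) / (darker + 0.05).
contrast_ratio <- function(colour_a, colour_b) {
  lum <- sort(c(relative_luminance(colour_a), relative_luminance(colour_b)),
              decreasing = TRUE)
  (lum[1] + 0.05) / (lum[2] + 0.05)
}

print(contrast_ratio("black", "white"))   # the maximum possible ratio, 21
print(contrast_ratio("grey60", "white"))  # a lighter grey scores much lower
```

WCAG 2.1 asks for a ratio of at least 3:1 for graphical elements and 4.5:1 for normal-sized text, so colour pairs that fall below those values are candidates for a change request.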

Areas for Further Exploration

In 2018, Godfrey, Murrell, and Sorge presented a workflow in R for creating a more accessible version of data visualizations. Unfortunately, BrailleR - an R library necessary for this workflow - is now deprecated and will be difficult to implement until it is updated. However, the workflow they present is likely to be replicable:

  1. Create an SVG file from a data visualization.
  2. Edit the underlying XML of the SVG file to add or alter accessibility metadata.
  3. Use a JavaScript package to render the visualization in a browser; enable keystroke navigation to explore the different layers of the data.

An online demo of the final product is still available.
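
The first two steps of this workflow can be approximated in base R. What follows is a rough sketch under stated assumptions, not the authors' implementation: it uses the `svg()` graphics device (available only when R was built with cairo support, hence the fallback), and it adds the standard SVG `<title>` and `<desc>` elements by plain text substitution rather than with an XML library. The plot, file name, and description are invented for the example. Step 3 (rendering with JavaScript) is outside the scope of an R script.

```r
svg_path <- file.path(tempdir(), "example_plot.svg")

if (capabilities("cairo")) {
  # Step 1: draw a plot into an SVG file (needs a cairo-enabled R build).
  svg(svg_path, width = 5, height = 4)
  plot(c(1, 2, 3), c(2, 4, 6), main = "Example scatter plot")
  invisible(dev.off())
} else {
  # Fallback so the sketch still runs: a hand-written, minimal SVG file.
  writeLines(c(
    "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"100\" height=\"100\">",
    "<circle cx=\"50\" cy=\"50\" r=\"10\"/>",
    "</svg>"
  ), svg_path)
}

# Step 2: insert <title> and <desc> right after the opening <svg> tag.
svg_text <- paste(readLines(svg_path, warn = FALSE), collapse = "\n")
metadata <- paste0(
  "<title>Example scatter plot</title>\n",
  "<desc>Three points rising in a straight line from (1, 2) to (3, 6).</desc>"
)
svg_text <- sub("(<svg[^>]*>)", paste0("\\1\n", metadata), svg_text)
writeLines(svg_text, svg_path)
```

Opening the resulting file in a text editor shows the added elements near the top, which is also a convenient way for a curator to check whether a deposited SVG carries any description at all.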

Finally, data curators and depositors alike may benefit from the following readings:

1 Godfrey, A.J.R., Murrell, P., Sorge, V., 2018. An Accessible Interaction Model for Data Visualisation in Statistics, in: Miesenberger, K., Kouroupetroglou, G. (Eds.), Computers Helping People with Special Needs, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 593, https://doi.org/10.1007/978-3-319-94277-3_92

2 Walker, W., Keenan, T., 2015. Going Beyond Availability: Truly Accessible Research Data. Journal of Librarianship and Scholarly Communication 3, pp. 5. https://doi.org/10.7710/2162-3309.1223

3 Godfrey, A.J.R., Murrell, P., Sorge, V., 2018. An Accessible Interaction Model for Data Visualisation in Statistics. See note 1. Pp. 592, https://doi.org/10.1007/978-3-319-94277-3_92

4 Godfrey, A.J.R., Loots, M.T., 2015. Advice From Blind Teachers on How to Teach Statistics to Blind Students. Journal of Statistics Education 23, 4, pp. 10. https://doi.org/10.1080/10691898.2015.11889746

5 Conversation with U-M Accessibility Team (Darrell Williams)

Preservation actions

R files are plain text files, therefore the preservation stakes are not very high. Curators should be able to open .R files with any text editor, including Notepad, Notepad++, gedit, etc. One possible concern may be with character encoding. For that, UTF-8 encoding is the best option, but if no special or foreign language characters (such as Greek letters) or mathematical formulas are present, simple ASCII is adequate. All other character encodings should probably be converted to Unicode.
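
A quick encoding check can be done in R itself. The sketch below writes a small test file containing a Latin-1 byte, tests whether the file is pure ASCII or valid UTF-8, and converts it if it is neither. The file names are hypothetical, and the source encoding passed to `iconv()` (`"latin1"`) is an assumption made for the example; for a real deposit the original encoding should be confirmed with the researcher.

```r
# Write a test file containing the byte 0xE9 ("e" with an acute accent in
# Latin-1), which is not valid UTF-8 on its own. Content: "# caf<E9>\n".
path <- file.path(tempdir(), "latin1_script.R")
writeBin(as.raw(c(0x23, 0x20, 0x63, 0x61, 0x66, 0xE9, 0x0A)), path)

bytes <- readBin(path, what = "raw", n = file.size(path))
lines <- readLines(path, warn = FALSE)

is_ascii <- all(as.integer(bytes) < 128L)  # pure ASCII uses only bytes 0-127
is_utf8  <- all(validUTF8(lines))          # TRUE if every line is valid UTF-8
print(c(ascii = is_ascii, utf8 = is_utf8))

# If the file is not valid UTF-8, convert from the suspected encoding.
# "latin1" is an assumption; confirm the real encoding before converting.
if (!is_utf8) {
  converted <- iconv(lines, from = "latin1", to = "UTF-8")
  utf8_path <- file.path(tempdir(), "utf8_script.R")
  writeLines(converted, utf8_path, useBytes = TRUE)
}
```

Keeping the original file alongside the converted copy preserves the option to redo the conversion if the assumed source encoding turns out to be wrong.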

The .Rmd format is approximately as stable as a plain text file; much like with a plain R file, the functionality of the file will depend on the compatibility/versioning of the code and any packages used.

What to look for to make sure this file meets FAIR principles

In order to evaluate an .R script file under the FAIR principles1, it must be published in a manner that meets the following:

  • To be Findable: 

    • F1. metadata are assigned a globally unique and persistent identifier (e.g., DOI)
    • F2. .R file is described with rich metadata 
    • F3. metadata clearly and explicitly include the identifier of the data it describes 
    • F4. metadata are registered or indexed in a searchable resource 
  • To be Accessible2:

    • A1. metadata are retrievable by their identifier using a standardized communications protocol
    • A2. metadata are accessible, even when the .R file is no longer available
  • To be Interoperable: 

    • I1. metadata use a formal, accessible, shared, and broadly applicable language 
    • I2. metadata use vocabularies that follow FAIR principles 
    • I3. metadata include qualified references to other metadata
  • To be Reusable:

    • R1. metadata are richly described with a plurality of accurate and relevant attributes including usage license, provenance, and domain-relevant community standards 

1 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016) doi:10.1038/sdata.2016.18

2 The FAIR accessibility principle does not currently (August 2020) include a specific requirement for accessibility for people with disabilities. We urge curators to consider it a FAIR requirement nonetheless, in the spirit of the accessibility principle itself: "Once the user finds the required data, she/he[/they] need[s] to know how can they be accessed."

Documentation of curation process

R scripts can interact with a variety of formats, such as CSV, TSV, TXT, and Apache server log files, as well as proprietary ones such as XLSX, XLS, or Stata files. The more specialised or proprietary formats often require a corresponding R package (RStata, readxl, etc.). R can also store data in its native format, with the extension .RData (also written .rda). These files can store not only data but also the rest of the working environment, including the function and value objects created during an R session. They are binary files and cannot be read in a text editor.  
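
Because native R data files are opaque to a text editor, a curator may want to inspect one from within R. The following minimal sketch creates a small .RData file, loads it into a separate environment so that nothing in the current workspace is overwritten, lists what it contains, and exports any data frames to CSV. All object and file names are invented for the example.

```r
# Create a hypothetical .RData file holding a data frame and a function.
rdata_path <- file.path(tempdir(), "session.RData")
measurements <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
double_it <- function(x) 2 * x
save(measurements, double_it, file = rdata_path)

# Load into a separate environment; load() returns the object names.
contents <- new.env()
loaded_names <- load(rdata_path, envir = contents)
print(loaded_names)

# Export every data frame to CSV, an open format readable without R.
for (name in loaded_names) {
  object <- get(name, envir = contents)
  if (is.data.frame(object)) {
    write.csv(object, file.path(tempdir(), paste0(name, ".csv")),
              row.names = FALSE)
  }
}
```

Objects that are not data frames, such as the function in this example, are skipped by the export loop and would need to be documented or preserved in another way.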

In addition, researchers who use the RStudio IDE may submit files with extensions such as .Rproj or .Rmd along with their data. An .Rproj file contains project-oriented metadata: it stores the environment settings and variables pertinent to a given project.

Appendix A: filetype CURATED checklist

CHECK Step

CURATE Action Curator Checklist

Check data files and read documentation

  • Review the content of the data files (e.g., open and run the files or code). 
  • Verify all metadata provided by the author and review the available documentation.

  • Files open as expected
    • Issues __________  
  • Code runs as expected
    • Produces minor errors
    • Does not run and/or produces many errors
    • Did not try to run code
  • Metadata quality is rich, accurate, and complete
    • Metadata has issues _________ 
    • Metadata in comments 
  • Documentation Type (circle) 
    Readme / Codebook / Data Dictionary / Other:________________________
    • Missing/None 
    • Needs work 
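
For the "Code runs as expected" item, one possible approach is to run the script inside a wrapper that records warnings and errors instead of stopping at the first problem. The sketch below is an assumption-laden illustration: the helper name `check_script`, the test script, and the status wording are all invented here. It also assumes the curator trusts the code enough to execute it; untrusted scripts are better run in an isolated environment.

```r
# A hypothetical script with one problem: it reads a file that is missing.
script_path <- file.path(tempdir(), "analysis.R")
writeLines(c("x <- 1:10",
             "total <- sum(x)",
             "missing_data <- read.csv('file_that_does_not_exist.csv')"),
           script_path)

# Run a script in its own environment, collecting warnings and any error.
check_script <- function(path) {
  warnings_seen <- character(0)
  status <- tryCatch(
    withCallingHandlers(
      {
        source(path, local = new.env())
        "runs as expected"
      },
      warning = function(w) {
        warnings_seen <<- c(warnings_seen, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) paste("error:", conditionMessage(e))
  )
  list(status = status, warnings = warnings_seen)
}

result <- check_script(script_path)
print(result$status)
print(result$warnings)
```

The recorded messages can be copied into the checklist's "Issues" line or into the narrative sent to the researcher in the REQUEST step.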

UNDERSTAND Step

CURATE Action Curator Checklist

Understand the data (or try to)

  • Check for quality assurance and usability issues. 
  • Try to detect and extract any “hidden documentation” inherent to the data files that may facilitate reuse.
  • Determine if the documentation of the data is sufficient for a user with similar qualifications to the author’s to understand and reuse the data. If not, recommend or create additional documentation (e.g., a readme.txt template).
  • Verify that all files are as accessible as possible for all users, including users with disabilities.

  • The code's formatting aids readability
    • Spacing between code sections
    • Code within conditionals and loops (if and for statements) is indented 
    • Code looks like one block of text 
  • Clear variable names
    • Variable names self-defined
    • Comments describe variable names
    • Documentation describes variable names 
    • Missing/None
    • Needs work
  • Clear sections of code
    • Sections self-define code actions 
    • Comments describe code actions
    • Documentation describes code actions 
    • Missing/None
    • Needs work
  • Review Documentation (in previous step, CHECK) for completeness and clarity  
  • All accessibility considerations met to the extent possible

REQUEST Step

CURATE Action Curator Checklist

Request missing information or changes

  • Generate a list of questions for the data author to fix any errors or issues, including accessibility. 

Narrative describing the concerns, issues, and needed improvements to the data submission.

  • Inquiry sent to researcher 
  • Response received 
  • Additional follow up communication needed 

AUGMENT Step

CURATE Action Curator Checklist

Augment the submission

  • Enhance metadata to best facilitate discoverability. 
  • Create and apply metadata for the data record, including descriptive keywords.
  • When appropriate, structure and present metadata in domain-specific schemas to facilitate interoperability with other systems. 

  • Discoverability sufficient
    • Recommend (circle one) full-text index / file rename / file reorder / file descriptions / zip files into one archive Other ______________   
  • Keywords Sufficient
    • Suggestions _______________
  • Linkages Sufficient
    • Link to report/paper 
    • Link to related datasets 
    • Link to source data
    • Link to other ____________

TRANSFORM Step

CURATE Action Curator Checklist

Transform file formats

  • Identify specialized file formats and their restrictions (e.g., Is the software freely available? Link to it or archive it alongside the data). 
  • Transform files into open, non-proprietary file formats2 that broaden the potential audience for reuse and ensure that preservation actions might be taken by the repository in later steps. Retain original files if data transfer is not perfect.

  • Preferred file formats in use
    • Recommend conversion from _________ to _________
    • Retain original formats
  • Software needed is readily available 
    • Unclear version of software
    • Unclear software used
  • Visualization of data easily accessible
    • Recommend graphical representation ____________  
    • Recommend web-accessible surrogate ________________

2 See Cornell’s List of Preservation Format Recommendations.

EVALUATE Step

CURATE Action Curator Checklist

Evaluate and rate the overall data record for FAIRness.

  • Score the dataset and recommend ways to increase the FAIRness of the data and become “DCN approved.”

Findable-

  • Metadata exceeds author/ title/ date.
  • Unique PID (DOI, Handle, PURL, etc.).
  • Discoverable via web search engines 

Accessible-
  • Retrievable via a standard protocol (e.g., HTTP).
  • Free, open (e.g., download link).

Interoperable-
  • Metadata formatted in a standard schema (e.g., Dublin Core). 
  • Metadata provided in machine-readable format (OAI feed). 

Reusable-
  • Data include sufficient metadata about the data characteristics to reuse 
  • Contact info displayed if the direct assistance of the author needed. 
  • Clear indicators of who created, owns, and stewards the data.
  • Data are released with clear data usage terms (e.g., a CC License). 

DOCUMENT Step

CURATE Action Curator Checklist

Document throughout curation activities.

  • Record all necessary information capturing who did what to the dataset and when 

  • Accessioning & deposit records (Names, dates, contact information, submission agreements, etc) 
  • Repository collection metadata
  • Provenance logs
  • Service workflow
  • Preservation packaging
  •  Any additional requirements at your institution

Appendix B: Long description of Figure 1

Figure 1 contains the following code:

###EdgeR: GLM analysis blue light photoreceptors
#Mapping used gene data from ftp://ftp.ensembl.org/pub/release-89/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.89.gtf.gz

library("edgeR")
library("NOISeq")

...

#Import gene names from Biomart
Biomart <- read.csv("Biomart_gene.txt")
colnames(Biomart) <- c("FbgnID", "gene", "CG.ID")

#Define groups for analysis
groups = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5))

#create data structures
d=DGEList(counts=counttable, group=groups,
genes=rownames(counttable)) #DGE is form of data class for EdgeR

m=match(d$genes$genes, Biomart$FbgnID)

These segments of the code above are labelled as follows:

  • "loading package libraries into script":
library("edgeR")
library("NOISeq")
  • "loading data":
Biomart <- read.csv("Biomart_gene.txt")
  • "comment":
#Define groups for analysis