WISH: Built-in R session-specific universally unique identifier (UUID) #96

Open
HenrikBengtsson opened this issue May 20, 2019 · 2 comments
Labels
cc/SU on r-devel or r-pkg-devel mailing lists Issue has been raised on the R-devel or R-pkg-devel mailing lists r-project-sprint-candidate

Comments

@HenrikBengtsson
Owner

HenrikBengtsson commented May 20, 2019

Proposal

Provide a built-in mechanism for obtaining an identifier for the current R session, e.g.

> Sys.info()[["session_uuid"]]
[1] "4258db4d-d4fb-46b3-a214-8c762b99a443"

The identifier should be "unique" in the sense that the probability of two R sessions(*) having the same identifier should be extremely small. There is no need for reproducibility, i.e. the algorithm for producing the identifier may be changed at any time.

(*) Two R sessions running at different times (seconds, minutes, days, years, ...) or on different machines (locally or anywhere in the world).

Use cases

In parallel-processing workflows, R objects may be "exported" (serialized) to background R processes ("workers") for further processing. In other workflows, objects may be saved to file to be reloaded in a future R session. However, certain types of objects in R may only be relevant, or valid, in the R session that created them. Attempts to use them in other R processes may give an obscure error or, in the worst case, produce garbage results.

Having an identifier that is unique to each R process will make it possible to detect when an object is used in the wrong context. This can be done by attaching the session identifier to the object. For example,

obj <- 42L
attr(obj, "owner") <- Sys.info()[["session_uuid"]]

With this, it is easy to validate the "ownership" later:

stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))
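Since `Sys.info()[["session_uuid"]]` does not exist in current R, the check above cannot be run as written. Here is a minimal sketch of the same idea using a stand-in identifier. The names `.session_uuid()`, `tag_owner()` and `is_owned()` are hypothetical, and the stand-in (PID plus first-call time) is only for illustration, not a proposal for the real algorithm.

```r
# Hypothetical stand-in for the proposed Sys.info()[["session_uuid"]].
# It is computed once per session and cached in the closure.
.session_uuid <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      id <<- sprintf("%d-%.6f", Sys.getpid(), as.numeric(Sys.time()))
    }
    id
  }
})

# Attach the current session's identifier to an object
tag_owner <- function(obj) {
  attr(obj, "owner") <- .session_uuid()
  obj
}

# TRUE only if the object was tagged by the current session
is_owned <- function(obj) {
  identical(attr(obj, "owner"), .session_uuid())
}

obj <- tag_owner(42L)
is_owned(obj)

# The attribute survives serialization, so it travels with the object
copy <- unserialize(serialize(obj, connection = NULL))
is_owned(copy)

# Simulate an object that was created in another session
foreign <- obj
attr(foreign, "owner") <- "some-other-session"
is_owned(foreign)
```

The first two calls to `is_owned()` return TRUE and the last returns FALSE, which is the condition a package could turn into an informative error instead of an obscure one.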

I argue that such an identifier should be part of base R, both for easy access and to avoid each developer having to roll their own.

Possible implementation

One proposal would be to bring in Simon Urbanek's 'uuid' package (https://cran.r-project.org/package=uuid) into base R. This package provides:

> uuid::UUIDgenerate()
[1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"

based on Theodore Ts'o's libuuid (https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/). From man uuid_generate:

"The uuid_generate function creates a new universally unique identifier (UUID). The uuid will be generated based on high-quality randomness from /dev/urandom, if available. If it is not available, then uuid_generate will use an alternative algorithm which uses the current time, the local ethernet MAC address (if available), and random data generated using a pseudo-random generator.
[...]
The UUID is 16 bytes (128 bits) long, which gives approximately 3.4x10^38 unique values (there are approximately 10^80 elementary particles in the universe according to Carl Sagan's Cosmos). The new UUID can reasonably be considered unique among all UUIDs created on the local system, and among UUIDs created on other systems in the past and in the future."

An alternative that does not require adding a dependency on the libuuid library would be to roll a poor man's version based on a set of semi-unique attributes, e.g.

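make_id <- function(...) {
  args <- list(...)
  # tools::md5sum() only operates on files, so write the arguments to a temporary file first
  saveRDS(args, file = f <- tempfile())
  on.exit(file.remove(f))
  # Drop the file-path name that md5sum() attaches to its result
  unname(tools::md5sum(f))
}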

session_id <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      id <<- make_id(
        info    = Sys.info(),
        pid     = Sys.getpid(),
        tempdir = tempdir(),
        time    = Sys.time(),
        random  = sample.int(.Machine$integer.max, size = 1L)
      )
    }
    id
  }
})

Example:

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"
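If an identifier in the familiar UUID layout is preferred over an MD5 string, the same no-dependency idea can be sketched in base R. This is an illustration only: `random_uuid()` is a hypothetical helper, the version-4 bit layout follows RFC 4122 rather than anything stated in this thread, and the fallback branch uses R's own PRNG, which is not high-quality randomness and repeats if two sessions share a seed.

```r
# Hypothetical helper: a random (version 4) UUID without libuuid.
random_uuid <- function() {
  bytes <- if (file.exists("/dev/urandom")) {
    # Prefer OS-provided randomness where it is available
    as.integer(readBin("/dev/urandom", what = "raw", n = 16L))
  } else {
    # Fallback: R's PRNG (affected by set.seed(), so weaker)
    sample.int(256L, size = 16L, replace = TRUE) - 1L
  }
  # Set the version nibble to 4 and the variant bits to binary 10
  bytes[7L] <- bitwOr(bitwAnd(bytes[7L], 15L), 64L)
  bytes[9L] <- bitwOr(bitwAnd(bytes[9L], 63L), 128L)
  hex <- sprintf("%02x", bytes)
  # Group the 16 bytes as 4-2-2-2-6, separated by hyphens
  paste(
    paste(hex[1:4], collapse = ""),
    paste(hex[5:6], collapse = ""),
    paste(hex[7:8], collapse = ""),
    paste(hex[9:10], collapse = ""),
    paste(hex[11:16], collapse = ""),
    sep = "-"
  )
}

random_uuid()
```

Wrapping the call in a caching closure, as `session_id()` does above, would make it session-specific.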

PS. Having a built-in make_id() function would be handy too, e.g. when creating object-specific identifiers for other purposes.

PPS. It would be neat if there was an object, or connection, interface for tools::md5sum(), which currently only operates on files sitting on the file system. The digest package provides this functionality.
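To make the comparison with the digest package concrete, here is a sketch of an in-memory variant of `make_id()`. The name `make_id_mem()` is hypothetical, and the function assumes that digest is installed. Because `digest::digest()` hashes its own serialization of the object, the resulting strings will generally differ from those of the file-based `make_id()` above.

```r
# Hypothetical in-memory variant of make_id(); assumes the 'digest' package.
make_id_mem <- function(...) {
  if (!requireNamespace("digest", quietly = TRUE)) {
    stop("make_id_mem() requires the 'digest' package")
  }
  # Hash the serialized argument list directly; no temporary file is needed
  digest::digest(list(...), algo = "md5")
}

# Usage (run only if digest is installed):
# make_id_mem(pid = Sys.getpid(), time = Sys.time())
```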

See also

@gaborcsardi

gaborcsardi commented May 20, 2019

FWIW, the startup time and the pid pretty well identify a process. The ps package has code to get the startup time. If you want to use this across machines, then add the hostname or IP address, maybe.

ps::ps_create_time(ps::ps_handle())
[1] "2019-05-20 23:31:52 GMT"
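A rough sketch of how these pieces could be combined into one key. `process_key()` is a hypothetical name. The ps calls are the ones shown above, and when ps is not installed the sketch falls back to the time of the first call, which is only a proxy for the real process start time.

```r
# Hypothetical helper: hostname + PID + process start time as a session key.
process_key <- local({
  key <- NULL
  function() {
    if (is.null(key)) {
      start <- if (requireNamespace("ps", quietly = TRUE)) {
        # Process creation time as reported by the OS
        ps::ps_create_time(ps::ps_handle())
      } else {
        # Proxy only: time of the first call, not the true start time
        Sys.time()
      }
      key <<- paste(
        Sys.info()[["nodename"]],
        Sys.getpid(),
        sprintf("%.3f", as.numeric(start)),
        sep = ":"
      )
    }
    key
  }
})

process_key()
```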

@HenrikBengtsson
Owner Author

Good point. Didn't think of the process start time(*).

"... maybe", yeah, the hostname/IP address may not be very distinctive - I wonder how many systems end up with the same values there, e.g. ("pi", 192.168.0.2).

(*) In the 'startup' package, I record the "onLoad" time when the startup package is loaded to serve as a proxy for base R not providing such an attribute itself. But, asking the OS about the process startup time is definitely nicer.

@HenrikBengtsson HenrikBengtsson added the on r-devel or r-pkg-devel mailing lists Issue has been raised on the R-devel or R-pkg-devel mailing lists label May 21, 2019