Support for IEC (KiB, MiB, ...) and SI (kB, MB, ...) binary units #6

Open
4 of 8 tasks
HenrikBengtsson opened this issue Dec 30, 2015 · 12 comments

Comments

@HenrikBengtsson
Owner

HenrikBengtsson commented Dec 30, 2015

Background

There are a few standards [1] for binary prefixes for byte-size units:

  • IEC: KiB (1024 bytes), MiB (1024^2 bytes), GiB (1024^3 bytes), TiB (1024^4 bytes), ...
  • JEDEC & customary standard: KB (1024 bytes), MB (1024^2 bytes), GB (1024^3 bytes)

Note that for decimal prefixes, we have:

  • SI: kB (1000 bytes), MB (1000^2 bytes), GB (1000^3 bytes), TB (1000^4 bytes), ...
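
To make the difference between the binary and the decimal prefixes concrete, here is a small sketch in plain R arithmetic that expresses the same byte count both ways. The byte count and the variable names are arbitrary illustrations.

```r
## The same number of bytes expressed with a binary (IEC) and a decimal (SI) prefix
bytes <- 40000040

mib <- bytes / 1024^2   # IEC: 1 MiB = 1024^2 bytes
mb  <- bytes / 1000^2   # SI:  1 MB  = 1000^2 bytes

cat(round(mib, 1), "MiB\n")
cat(round(mb, 1), "MB\n")
```

The two results differ by about 5% at this scale, and the gap grows with each larger prefix.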

For byte versus bit, we have:

  • IEC & customary standard: 'B' for 'byte' and 'bit' for 'bit' [3,4].
  • IEEE: 'b' for 'bit' [3].

Problem

  • R uses Kb, Mb and Gb. None of these are part of the above byte standards. Note the lower case 'b' is typically used for bit and not byte.

For example,

> size <- object.size(1:1e7)
> size
40000040 bytes
> format(size, units="auto")
[1] "38.1 Mb"

This specific example illustrates a problem with utils:::format.object_size(). Another example is:

> base::gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 279622 15.0     592000 31.7   350000 18.7
Vcells 478234  3.7    1023718  7.9   786432  6.0
> str(base::gc())
 num [1:2, 1:6] 279638 478263 15 3.7 592000 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "Ncells" "Vcells"
  ..$ : chr [1:6] "used" "(Mb)" "gc trigger" "(Mb)" ...

The issue with non-standard byte units in R has been reported to R-devel [5].

Wish / Suggestion

  • Use units KiB, MiB, GiB, TiB, ... everywhere in R because they are unambiguous. UPDATE: ... or SI units?
  • Migrate smoothly by:
    • Add support for IEC, JEDEC and SI prefixes where applicable;
      • IEC units for utils:::format.object_size(), cf. PR #16649. Completed as of 2016-01-06 in r69879.
      • JEDEC units for utils:::format.object_size(), cf. PR #16657. UPDATE: See discussion in comments below.
      • SI units for utils:::format.object_size(). UPDATE: Added to R-devel on 2017-01-11 (r71960)
    • Add an option for the default unit standard used in R, e.g. getOption("byte.unit.standard", "legacy").
    • Make SI units (originally proposed: IEC units) the new default, e.g. in gc(), format.object_size(..., units="auto") and allocation error messages.
    • Deprecate invalid units (lower case b) with .Deprecated().
    • Eventually drop them using .Defunct().
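
The migration steps above could be sketched roughly as follows. This is only an illustration: the wrapper format_size() is hypothetical, the option byte.unit.standard is the one proposed in this issue and does not exist in R, and the standard argument of format() for object_size objects is assumed to be available (R 3.4.0 or later).

```r
## Hypothetical wrapper: 'format_size' and the option 'byte.unit.standard'
## illustrate the proposal; neither is part of base R.
format_size <- function(x, units = "auto",
                        standard = getOption("byte.unit.standard", "legacy")) {
  legacy_units <- c("Kb", "Mb", "Gb", "Tb", "Pb")
  if (units %in% legacy_units) {
    ## .Deprecated() signals a warning; .Defunct() would signal an error instead
    .Deprecated(msg = sprintf("Unit '%s' is deprecated; use an IEC or SI unit", units))
  }
  size <- structure(x, class = "object_size")
  format(size, units = units, standard = standard)
}

format_size(40000040)                 # uses the 'legacy' standard by default
options(byte.unit.standard = "SI")
format_size(40000040)                 # now uses the SI standard
options(byte.unit.standard = NULL)    # unset the option again
```

Routing the default through an option lets users opt in early or revert later without changing any calling code.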

Known functions / code affected:

Note, the out-of-memory errors in the native code cannot easily be tweaked to support a global option; if tried, there is a risk that this triggers another out-of-memory error.

Usages of IEC / SI elsewhere

  • The Ubuntu Linux distribution uses the IEC prefixes for base-2 units and SI prefixes for base-10 units [6].
  • Windows and Android use JEDEC prefixes.
  • Mac OS X has used decimal SI units (e.g. kB) since 2009.

References

  1. Binary prefix, Wikipedia, https://en.wikipedia.org/wiki/Binary_prefix
  2. Byte, Wikipedia, https://en.wikipedia.org/wiki/Byte#Unit_symbol
  3. Bit, Wikipedia, https://en.wikipedia.org/wiki/Bit#Unit_and_symbol
  4. Man page units(7), http://man7.org/linux/man-pages/man7/units.7.html
  5. R devel thread 'format(object.size(...), units): KB, MB, and GB instead of Kb, Mb, and Gb?' started on 2014-09-07
  6. UnitsPolicy, Ubuntu Wiki, Jan 2016, https://wiki.ubuntu.com/UnitsPolicy
  • UPDATE 2016-05-03: Added src/gnuwin32/malloc.c to the list of places that needs to be updated.
  • UPDATE 2017-01-01: Aim for SI to be the new standard.
  • UPDATE 2017-01-11: Propose option byte.unit.standard for smooth transition.
  • UPDATE 2017-05-17: Identified more (all?) locations in R and native code that require updating.
@HenrikBengtsson
Owner Author

As a first step, I just filed a backward-compatible patch to add support for IEC units in utils:::format.object_size(), cf. PR #16649.

UPDATE: This has been implemented as of 2016-01-06 in r69879.

@HenrikBengtsson
Owner Author

IEC units are now supported by R. As the next step, I filed a backward-compatible patch to add support for JEDEC units in utils:::format.object_size(), cf. PR #16657.

@mmaechler

  • Can you give an example and reference for "The Ubuntu Linux distribution uses the IEC prefixes since 2010"? Personally, I find the 'KiB' notation quite ugly. I see that df -h, du -h and ls -h all use the suffixes K, M, G, ... but no "iB" (or "B" or "b").
  • The real problem is that the SI standard really wants "KB" or "MB" to mean something different from "KiB" or "MiB", and JEDEC does not. But the SI system is really the world standard, whereas JEDEC is mainly "industry" and not science based (which the SI is). So, in principle, if we are willing to break backward compatibility, we should move towards the real world standard, i.e. the SI system, and consequently I'd be against endorsing JEDEC any more than we do now (by accepting it on "input").

@HenrikBengtsson
Owner Author

Thanks for the comments.

  • I got the "Ubuntu" statement from [1], but must have been sloppy. I've now clarified it to say: "The Ubuntu Linux distribution uses the IEC prefixes for base-2 units and SI prefixes for base-10 units" which reflects Ubuntu's official UnitsPolicy.
  • Searching the web, there are references starting ~2010 (around Ubuntu 10.10) saying Ubuntu will move to using decimal/base-10 units with SI prefixes throughout. I don't know where they are regarding that goal.
  • SI vs JEDEC confusion: If I understand your comment correctly, you're saying we'll introduce more confusion if we explicitly add support for JEDEC. If so, I agree with you. My idea was to introduce it properly, to make it explicit that the old R units are home brewed. I'm happy to skip JEDEC.
  • Long-term for R: If this is what you are saying, I agree: supporting both decimal/base-10 and binary/base-2 units, using SI and IEC prefixes respectively, would be ideal. Since R has only a single API entry point (utils:::format.object_size()), we could even introduce an argument base = getOption("object.size.base", 2) controlling whether base-2 or base-10 units are displayed (when units="auto"). It would also allow us to migrate smoothly from the current base 2 to base 10 (and allow users to undo this via the option), if that is where we are heading. BTW, gc() should utilize utils:::format.object_size().
  • To implement the transition from R's current base-2 units (Kb, Mb, Gb) to SI/base-10 units (kB, MB, GB), it might be less of a shock to do this over a few release cycles:
    1. Switch to using IEC/base-2 units (KiB, MiB, GiB, ...) for units="auto".
    2. Deprecate explicit usage of units="Kb", units="Mb", ...
    3. Switch to using SI/base-10 units (kB, MB, GB, ...) for units="auto".
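
A minimal sketch of the base argument idea mentioned above, written as a standalone function. Both format_bytes() and the option object.size.base are hypothetical; they only illustrate how a single switch could select between IEC (base 2) and SI (base 10) output, and the unit vectors are cut off at the tera prefix for brevity.

```r
## Hypothetical helper: 'base' selects binary (IEC) or decimal (SI) output.
## The option 'object.size.base' is an assumption and does not exist in R.
format_bytes <- function(x, base = getOption("object.size.base", 2), digits = 1L) {
  stopifnot(base %in% c(2, 10))
  step  <- if (base == 2) 1024 else 1000
  units <- if (base == 2) c("B", "KiB", "MiB", "GiB", "TiB")
           else           c("B", "kB",  "MB",  "GB",  "TB")
  ## Largest unit that keeps the value >= 1, capped at the last known unit
  power <- if (x <= 0) 0L else min(floor(log(x, base = step)), length(units) - 1L)
  paste(round(x / step^power, digits = digits), units[power + 1L])
}

format_bytes(40000040, base = 2)
format_bytes(40000040, base = 10)
```

Changing the option's default from 2 to 10 in a later release would then correspond to the last step of the transition.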

What do you think?

@HenrikBengtsson
Owner Author

Another approach that could work is to add support for units="IEC", units="SI" and units="legacy". That can be done without breaking backward compatibility. The units="auto" can equal units="legacy" and any future transitions can be in what units="auto" corresponds to.

UPDATE: The issue with this is that it's not possible to control whether units="MB" is meant to be current R "legacy" (base-2) units or SI (base-10) units.
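
The ambiguity can be seen with a size of two million bytes. The sketch below assumes an R version whose format() method for object_size objects accepts a standard argument (R 3.4.0 or later); the size object is constructed by hand so that the byte count is explicit.

```r
## 2,000,000 bytes, built directly so the byte count is known
size <- structure(2e6, class = "object_size")

## Without 'standard', "MB" is read as a legacy unit (base 1024)
format(size, units = "MB")

## With standard = "SI", the same unit string means base 1000
format(size, units = "MB", standard = "SI")
```

The first call reports 1.9 Mb and the second 2 MB, so the unit string alone cannot tell the two interpretations apart.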

@HenrikBengtsson HenrikBengtsson changed the title Use unambiguous KiB, MiB, GiB, ... binary units everywhere Support for IEC (KiB, MiB, ...) and SI (kB, MB, ...) binary units Feb 23, 2016
@HenrikBengtsson
Owner Author

HenrikBengtsson commented Jan 2, 2017

Here's my new proposal for supporting "legacy", IEC and SI units in a backward compatible way and such that it will be easy to switch from today's default "legacy" to SI units at some point in R's future.

The file to be updated in R is src/library/utils/R/object.size.R:

object.size <- function(x)
    structure(.Call(C_objectSize, x), class = "object_size")

format.object_size <- function(x, units = "b", standard = "auto", digits = 1L, ...)
{
    known_bases <- c(legacy = 1024, IEC = 1024, SI = 1000)
    known_units <- list(
        SI      =  c("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"),
        IEC     =  c("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"),
        legacy  =  c("b", "Kb", "Mb", "Gb", "Tb", "Pb"),
        LEGACY  =  c("B", "KB", "MB", "GB", "TB", "PB")
    )

    units <- match.arg(units, c("auto", unique(unlist(known_units, use.names = FALSE))))
    standard <- match.arg(standard, c("auto", names(known_bases)))

    ## Infer 'standard' from 'units'?
    if (standard == "auto") {
        standard <- "legacy"           ## default; to become "SI"
        if (units != "auto") {
            if (grepl("iB$", units)) {
                standard <- "IEC"
            } else if (grepl("b$", units)) {
                standard <- "legacy"   ## keep when "SI" is the default
            } else if (units == "kB") {
                ## SPECIAL: Drop when "SI" becomes the default
                stop("For SI units, please specify standard = \"SI\"")
            }
        }
    }

    base <- known_bases[[standard]]
    units_map <- known_units[[standard]]

    if (units == "auto") {
        power <- if (x <= 0) 0 else min(as.integer(log(x, base = base)), length(units_map) - 1L)
    } else {
        power <- match(toupper(units), toupper(units_map)) - 1L
        if (is.na(power)) {
            stop(gettextf("Unit %s is not part of standard %s", sQuote(units), sQuote(standard)))
        }
    }

    unit <- units_map[power + 1L]

    ## SPECIAL: Use suffix 'bytes' instead of 'b' for 'legacy'
    if (power == 0 && standard == "legacy") unit <- "bytes"
    
    paste(round(x / base^power, digits = digits), unit)
}

print.object_size <-
    function(x, quote = FALSE, units = "b", standard = "auto", digits = 1L, ...)
{
    y <- format.object_size(x, units = units, standard = standard, digits = digits)
    if(quote) print.default(y, ...) else cat(y, "\n", sep = "")
    invisible(x)
}

Examples and tests

assert_size <- function(x, ..., expected) {
    size <- structure(x, class = "object_size")
    res <- try(format(size, ...), silent = TRUE)
    if (expected == "error") {
        if (!inherits(res, "try-error"))
            stop(sprintf("Expected %s but got %s", sQuote(expected), sQuote(res)))
    } else if (res != expected) {
        stop(sprintf("Expected %s but got %s", sQuote(expected), sQuote(res)))
    }
}

## The default is the 'legacy' standard (backward compatibility)
assert_size(0,    expected = "0 bytes")
assert_size(1,    expected = "1 bytes")
assert_size(1023, expected = "1023 bytes")
assert_size(1024, expected = "1024 bytes")

## Standard inferred from 'legacy' units
assert_size(0,            units = "b",  expected = "0 bytes")
assert_size(1,            units = "B",  expected = "1 bytes")
assert_size(999,          units = "B",  expected = "999 bytes")
assert_size(1000,         units = "Kb", expected = "1 Kb")
assert_size(1024,         units = "KB", expected = "1 Kb")
assert_size(2.0 * 1000^2, units = "MB", expected = "1.9 Mb")
assert_size(3.1 * 1000^3, units = "GB", expected = "2.9 Gb")
assert_size(4.2 * 1000^8, units = "TB", expected = "3819877747446.3 Tb")
assert_size(4.2 * 1000^9, units = "Pb", expected = "3730349362740.5 Pb")

## Standard inferred from 'IEC' units
assert_size(1000,         units = "KiB", expected = "1 KiB")
assert_size(1024,         units = "KiB", expected = "1 KiB")
assert_size(2.0 * 1000^2, units = "MiB", expected = "1.9 MiB")
assert_size(3.1 * 1000^3, units = "GiB", expected = "2.9 GiB")
assert_size(4.2 * 1000^8, units = "TiB", expected = "3819877747446.3 TiB")
assert_size(4.2 * 1000^9, units = "PiB", expected = "3730349362740.5 PiB")

## Inferring standard from 'SI' units is not possible because they
## conflict with 'legacy' units (and it would be confusing to support
## high-range SI units not covered by the legacy units)
assert_size(3.1 * 1024^1, units = "kB", expected = "error")
assert_size(3.1 * 1024^6, units = "EB", expected = "error")
assert_size(3.1 * 1024^7, units = "ZB", expected = "error")
assert_size(3.1 * 1024^8, units = "YB", expected = "error")


## Automatic 'legacy' units (default)
assert_size(0,            units = "auto", expected = "0 bytes")
assert_size(1,            units = "auto", expected = "1 bytes")
assert_size(1023,         units = "auto", expected = "1023 bytes")
assert_size(1024,         units = "auto", expected = "1 Kb")
assert_size(2.0 * 1000^2, units = "auto", expected = "1.9 Mb")

## Automatic 'legacy' units
assert_size(0,            units = "auto", standard = "legacy", expected = "0 bytes")
assert_size(1,            units = "auto", standard = "legacy", expected = "1 bytes")
assert_size(1023,         units = "auto", standard = "legacy", expected = "1023 bytes")
assert_size(1024,         units = "auto", standard = "legacy", expected = "1 Kb")
assert_size(2.0 * 1000^2, units = "auto", standard = "legacy", expected = "1.9 Mb")
assert_size(3.1 * 1024^3, units = "auto", standard = "legacy", expected = "3.1 Gb")
assert_size(3.1 * 1024^4, units = "auto", standard = "legacy", expected = "3.1 Tb")
assert_size(3.1 * 1024^5, units = "auto", standard = "legacy", expected = "3.1 Pb")
assert_size(3.1 * 1024^6, units = "auto", standard = "legacy", expected = "3174.4 Pb")

## Automatic 'IEC' units
assert_size(0,            units = "auto", standard = "IEC", expected = "0 B")
assert_size(1,            units = "auto", standard = "IEC", expected = "1 B")
assert_size(1023,         units = "auto", standard = "IEC", expected = "1023 B")
assert_size(1024,         units = "auto", standard = "IEC", expected = "1 KiB")
assert_size(2.0 * 1000^2, units = "auto", standard = "IEC", expected = "1.9 MiB")
assert_size(3.1 * 1024^3, units = "auto", standard = "IEC", expected = "3.1 GiB")
assert_size(3.1 * 1024^4, units = "auto", standard = "IEC", expected = "3.1 TiB")
assert_size(3.1 * 1024^5, units = "auto", standard = "IEC", expected = "3.1 PiB")
assert_size(3.1 * 1024^6, units = "auto", standard = "IEC", expected = "3.1 EiB")
assert_size(3.1 * 1024^7, units = "auto", standard = "IEC", expected = "3.1 ZiB")
assert_size(4.2 * 1024^8, units = "auto", standard = "IEC", expected = "4.2 YiB")
assert_size(4.2 * 1024^9, units = "auto", standard = "IEC", expected = "4300.8 YiB")

## Automatic 'SI' units
assert_size(0,            units = "auto", standard = "SI", expected = "0 B")
assert_size(1,            units = "auto", standard = "SI", expected = "1 B")
assert_size(999,          units = "auto", standard = "SI", expected = "999 B")
assert_size(1000,         units = "auto", standard = "SI", expected = "1 kB")
assert_size(1024,         units = "auto", standard = "SI", expected = "1 kB")
assert_size(2.0 * 1000^2, units = "auto", standard = "SI", expected = "2 MB")
assert_size(3.1 * 1000^3, units = "auto", standard = "SI", expected = "3.1 GB")
assert_size(3.1 * 1000^4, units = "auto", standard = "SI", expected = "3.1 TB")
assert_size(3.1 * 1000^5, units = "auto", standard = "SI", expected = "3.1 PB")
assert_size(3.1 * 1000^6, units = "auto", standard = "SI", expected = "3.1 EB")
assert_size(3.1 * 1000^7, units = "auto", standard = "SI", expected = "3.1 ZB")
assert_size(4.2 * 1000^8, units = "auto", standard = "SI", expected = "4.2 YB")
assert_size(4.2 * 1000^9, units = "auto", standard = "SI", expected = "4200 YB")

UPDATE: 2017-01-01: Forgot that SI uses 'kB'; minor tweaks above.

@HenrikBengtsson
Owner Author

UPDATE: SI units are now supported in R-devel, see r71960.
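
For reference, a short usage sketch of the three standards with units="auto". It assumes an R version that includes the change above (R 3.4.0 or later); the size object is built by hand rather than with object.size() so that the byte count does not depend on how R stores the object internally.

```r
## 40,000,040 bytes, built directly so the byte count is known
size <- structure(40000040, class = "object_size")

format(size, units = "auto")                    # legacy: "38.1 Mb"
format(size, units = "auto", standard = "IEC")  # IEC:    "38.1 MiB"
format(size, units = "auto", standard = "SI")   # SI:     "40 MB"
```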

@HenrikBengtsson HenrikBengtsson added the "on r-devel or r-pkg-devel mailing lists" label (issue has been raised on the R-devel or R-pkg-devel mailing lists) Aug 29, 2018
@llrs

llrs commented Feb 24, 2020

I'll just add a link to a thread on twitter for your future references on this topic: https://twitter.com/henrikbengtsson/status/1231986947360354305

@HenrikBengtsson
Owner Author

Posted PR18297 titled 'Use standard file-size units everywhere in base R (e.g., Mb -> MiB)' on 2022-02-01.

@HenrikBengtsson
Owner Author

Filed PR18435 adding new SI prefixes RB (ronnabytes) and QB (quettabytes) to format() for object_size.

@HenrikBengtsson
Owner Author

SI prefixes RB (ronnabytes) and QB (quettabytes) have been added to R-devel (to become R 4.3.0), cf. wch/r-source@cd2d0ba
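
As a quick illustration of where the two new prefixes sit in the SI sequence, here is a sketch in plain R. The vector si_units is written out by hand for this example and is not taken from the R sources.

```r
## SI byte units in increasing order; RB and QB are the two newest entries
si_units <- c("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "RB", "QB")

## Exponent of 1000 for each unit: its position in the vector minus one
power <- match(c("RB", "QB"), si_units) - 1L

## Bytes per unit
data.frame(unit = c("RB", "QB"), power = power, bytes = 1000^power)
```

One ronnabyte is 1000^9 bytes and one quettabyte is 1000^10 bytes.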

@HenrikBengtsson
Owner Author

One more location to fix was just added to src/main/memory.c in R-devel, cf. wch/r-source@459492b.
