Fix Windows Parsing Issue #6150

joshhwuu · 2024-05-26T20:44:30Z

Closes #6141

Refactored the tests to build their strings with strrep() from Unicode escapes instead of literal non-ASCII strings, and wrapped them in a UTF-8 locale.

These changes still need to be run against the Windows GHA (GitHub Actions) job to confirm they work.
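As a rough illustration of the approach described here (a sketch only, not code taken from the PR diff), the snippet below builds test strings from `\u` escapes and `strrep()` so that the source stays ASCII-only. The variable names mirror the ones used in the test file; the character and byte counts are added for illustration.

```r
# Build test strings from Unicode escapes so the source text stays pure ASCII.
accented_a = "\u0061\u0301"  # Latin "a" followed by a combining acute accent
ja_ichi    = "\u4E00"        # the full-width character "ichi"

# strrep() is vectorised over `times`, so one call yields strings of length 1..3.
repeated = strrep(ja_ichi, 1:3)

# Strings created from \u escapes are marked as UTF-8, so these counts
# are taken from the UTF-8 representation.
chars = nchar(c(accented_a, ja_ichi), type = "chars")  # code points per string
bytes = nchar(c(accented_a, ja_ichi), type = "bytes")  # UTF-8 bytes per string
print(chars)
print(bytes)
```

The accented "a" is two code points that render as one glyph, which is why the tests treat combining characters and full-width characters as separate cases.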

tdhock · 2024-05-27T11:53:20Z

Great, thanks. Can you please ask for a review from @Anirban166?

inst/tests/tests.Rraw

MichaelChirico

LGTM, would be great if we could confirm it works 😅

joshhwuu · 2024-05-27T23:45:56Z

Looking forward to trying it, fingers crossed 😄

Anirban166 · 2024-05-28T04:42:44Z

> LGTM, would be great if we could confirm it works 😅

I wonder how you would run the tests to check for this. Given how long it takes to run everything using test.data.table() (which also led to some XQuartz-specific issues on my Mac), I'd presume one would have to isolate the tests to be checked, right?

For instance, I trimmed the tests.Rraw file down to a comparatively minimal version that includes the code from this PR, and it seems to be working:

> test.data.table("inst/tests/tests.Rraw")
getDTthreads(verbose=TRUE):
  OpenMP version (_OPENMP)       201811
  omp_get_num_procs()            10
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          10
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 5 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /Users/anirban166/data.table/inst/tests/tests.Rraw
Running test id 2253.19          
Mon May 27 21:28:15 2024  endian==little, sizeof(long double)==8, longdouble.digits==, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/Phoenix', Sys.getlocale()=='en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201811; omp_get_num_procs()==10; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==10; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 5 threads with throttle==1024. See ?setDTthreads.', zlibVersion()==1.2.11 ZLIB_VERSION==1.2.11
10 longest running tests took 0s (100% of 0s)
     ID  time nTest
1: 2253 0.034    19
All 19 tests (last 2253.19) in data.table/inst/tests/tests.Rraw completed ok in 0.045s elapsed (0.037s cpu)

tests.Rraw:

require(methods)

if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
  if (!identical(suppressWarnings(packageDescription("data.table")), NA)) {
    remove.packages("data.table")
    stop("This is dev mode but data.table was installed. Uninstalled it. Please q() this R session and try cc() again. The installed namespace causes problems in dev mode for the S4 tests.\n")
  }
  if ((tt<-compiler::enableJIT(-1))>0)
    cat("This is dev mode and JIT is enabled (level ", tt, ") so there will be a brief pause around the first test.\n", sep="")
  rm_all = function() {}
  DTfun = DT ## otherwise DT would be re-defined by many tests
} else {
  require(data.table)
  # Make symbols to the installed version's ::: so that we can i) test internal-only not-exposed R functions
  # in the test suite when user runs test.data.table() from installed package AND ii) so that in dev the same
  # tests can be used but in dev they test the package in .GlobalEnv. If we used ::: throughout tests, that
  # would pick up the installed version and in dev you'd have to reinstall every time which slows down dev.
  # NB: The string "data.table::" (which covers "data.table:::" too) should exist nowhere else in this file
  #      other than here inside this branch.

  all.equal.data.table = data.table:::all.equal.data.table
  allNA = data.table:::allNA
  any_na = data.table:::any_na
  as.data.table.array = data.table:::as.data.table.array
  as.data.table.default = data.table:::as.data.table.default
  as.IDate.default = data.table:::as.IDate.default
  as.ITime.default = data.table:::as.ITime.default
  binary = data.table:::binary
  bmerge = data.table:::bmerge
  brackify = data.table:::brackify
  Ctest_dt_win_snprintf = data.table:::Ctest_dt_win_snprintf
  chmatchdup = data.table:::chmatchdup
  compactprint = data.table:::compactprint
  cube.data.table = data.table:::cube.data.table
  dcast.data.table = data.table:::dcast.data.table
  DTfun = data.table:::DT
  endsWith = data.table:::endsWith
  endsWithAny = data.table:::endsWithAny
  forder = data.table:::forder
  forderv = data.table:::forderv
  format.data.table = data.table:::format.data.table
  format_col.default = data.table:::format_col.default
  format_list_item.default = data.table:::format_list_item.default
  getdots = data.table:::getdots
  groupingsets.data.table = data.table:::groupingsets.data.table
  guess = data.table:::guess
  INT = data.table:::INT
  is_na = data.table:::is_na
  is.sorted = data.table:::is.sorted
  isReallyReal = data.table:::isReallyReal
  isRealReallyInt = data.table:::isRealReallyInt
  is_utc = data.table:::is_utc
  melt.data.table = data.table:::melt.data.table  # for test 1953.4
  null.data.table = data.table:::null.data.table
  print.data.table = data.table:::print.data.table
  replace_dot_alias = data.table:::replace_dot_alias
  rollup.data.table = data.table:::rollup.data.table
  rss = data.table:::rss
  selfrefok = data.table:::selfrefok
  setcoalesce = data.table:::setcoalesce
  setdiff_ = data.table:::setdiff_
  setreordervec = data.table:::setreordervec
  shallow = data.table:::shallow # until exported
  .shallow = data.table:::.shallow
  split.data.table = data.table:::split.data.table
  if (!exists('startsWith', 'package:base', inherits=FALSE)) startsWith = data.table:::startsWith
  test = data.table:::test
  uniqlengths = data.table:::uniqlengths
  uniqlist = data.table:::uniqlist
  which_ = data.table:::which_
  which.first = data.table:::which.first
  which.last = data.table:::which.last
  `-.IDate` = data.table:::`-.IDate`
  haszlib = data.table:::haszlib

  # Also, for functions that are masked by other packages, we need to map the data.table one. Or else,
  # the other package's function would be picked up. As above, we only need to do this because we desire
  # to develop in .GlobalEnv with cc().
  # This should be retained even if these packages are removed from Suggests, because the test() in this file
  # checks against a data.table result which needs the data.table one to run. Otherwise the user can be
  # sure by using :: themselves.
                                        # masked by which package?
                                        # =================================
  setattr = data.table::setattr         # bit
  shift = data.table::shift             # IRanges, GenomicRanges
  between = data.table::between         # plm
  second = data.table::second           # S4Vectors
  dcast = data.table::dcast             # reshape2
  melt = data.table::melt               # reshape2
  last = data.table::last               # xts
  first = data.table::first             # xts, S4Vectors
  copy = data.table::copy               # bit64 v4; bit64 offered to rename though so this is just in case bit64 unoffers
  second = data.table::second           # lubridate #1135
  minute = data.table::minute           # lubridate
  hour = data.table::hour               # lubridate
  yday = data.table::yday               # lubridate
  wday = data.table::wday               # lubridate
  mday = data.table::mday               # lubridate
  week = data.table::week               # lubridate
  isoweek = data.table::isoweek         # lubridate
  month = data.table::month             # lubridate
  quarter = data.table::quarter         # lubridate
  year = data.table::year               # lubridate
  yearmon = data.table::yearmon         # zoo
  yearqtr = data.table::yearqtr         # zoo

  rm_all = function(env=parent.frame()) {
    tt = setdiff(ls(envir=env), .do_not_rm)
    rm(list=tt, envir=env)
    gc()
    invisible()
  }
}

# strrep is used many times in tests, but is from R 3.3.0, so use this equivalent if it is missing.
if (!exists("strrep", "package:base")) {
  strrep = function(x, times) mapply(function(x, times) paste(rep(x, times), collapse=""), rep_len(x, length(times)), times, USE.NAMES=FALSE)
}

local({
  lc_ctype = Sys.getlocale('LC_CTYPE')
  Sys.setlocale('LC_CTYPE', "en_US.UTF-8") # Japanese multibyte characters require utf8
  on.exit({Sys.setlocale('LC_CTYPE', lc_ctype)})
  accented_a = "\u0061\u0301"
  ja_ichi = "\u4E00"
  ja_ni = "\u4E8C"
  ja_ko = "\u3053"
  ja_n = "\u3093"
  dots = "..."
  clean_regex = "^\\d+:\\s+" # removes row numbering from beginning of output
  # Tests for combining character latin a and acute accent, single row
  DT = data.table(strrep(accented_a, 4L))
  test(2253.01, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(accented_a, 4L))
  test(2253.02, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(accented_a, 3L), dots))
  test(2253.03, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(accented_a, 1L), dots))
  # Tests for full-width japanese character ichi, single row
  DT = data.table(strrep(ja_ichi, 4L))
  test(2253.04, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(ja_ichi, 4L))
  test(2253.05, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(ja_ichi, 3L), dots))
  test(2253.06, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(ja_ichi, 1L), dots))
  # Tests for multiple, different length combining character rows
  DT = data.table(strrep(accented_a, 1L:4L))
  test(2253.07, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(accented_a, 1:4L))
  test(2253.08, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(strrep(accented_a, 1:3), paste0(strrep(accented_a, 3L), dots)))
  test(2253.09, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(accented_a, rep(paste0(accented_a, dots), 3L)))
  # Tests for multiple, different length full-width characters
  DT = data.table(strrep(ja_ichi, 1L:4L))
  test(2253.10, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(ja_ichi, 1:4L))
  test(2253.11, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(strrep(ja_ichi, 1:3), paste0(strrep(ja_ichi, 3L), dots)))
  test(2253.12, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(ja_ichi, rep(paste0(ja_ichi, dots), 3L)))
  # Tests for combined characters, multiple columns
  DT = data.table(paste0(ja_ichi), strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa")
  test(2253.13, options=list(datatable.prettyprint.char = 4L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa"))
  test(2253.14, options=list(datatable.prettyprint.char = 3L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa"))
  test(2253.15, options=list(datatable.prettyprint.char = 2L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2), paste0(strrep(ja_ko, 2), dots) , strrep(accented_a, 2), "aa..."))
  test(2253.16, options=list(datatable.prettyprint.char = 1L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, paste0(ja_ni, dots), paste0(ja_ko, dots), paste0(accented_a, dots), "a..."))
  # Tests for multiple columns, multiple rows
  DT = data.table(strrep(ja_ko, 1:3L), strrep(ja_n, 2:4L), strrep(accented_a, 3))
  test(2253.17, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
    c(paste0(ja_ko, "     ", strrep(ja_n, 2L), " ", strrep(accented_a, 3L)),
    paste0(strrep(ja_ko, 2L), "   ", strrep(ja_n, 3L), " ", strrep(accented_a, 3L)),
    paste(strrep(ja_ko, 3L), strrep(ja_n, 4L), strrep(accented_a, 3L))))
  test(2253.18, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
    c(paste0(ja_ko, "      ", strrep(ja_n, 2L), " ", strrep(accented_a, 3L)),
    paste0(strrep(ja_ko, 2L), "    ", strrep(ja_n, 3L), " ", strrep(accented_a, 3L)),
    paste(strrep(ja_ko, 3L), paste0(strrep(ja_n, 3L), dots), strrep(accented_a, 3L))))
  test(2253.19, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
    c(paste0(ja_ko, " ", paste0(ja_n, dots), " ", paste0(accented_a, dots)),
    paste0(c(ja_ko, ja_n, accented_a), dots, collapse=" "),
    paste0(c(ja_ko, ja_n, accented_a), dots, collapse=" ")))
})
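The locale handling in the listing above follows a save, set, restore pattern. A minimal sketch of that pattern as a standalone function is shown below; `with_ctype` is a hypothetical helper name and is not part of data.table. The demo switches to the "C" locale only because it is available everywhere, whereas the tests switch to a UTF-8 locale.

```r
# Hypothetical helper (not part of data.table): evaluate `expr` with LC_CTYPE
# temporarily set to `locale`, restoring the previous value afterwards.
with_ctype = function(locale, expr) {
  old = Sys.getlocale("LC_CTYPE")
  # Registered before the switch so the old locale is restored even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  force(expr)  # lazy evaluation: `expr` is only evaluated here, under the new locale
}

before = Sys.getlocale("LC_CTYPE")
inside = with_ctype("C", Sys.getlocale("LC_CTYPE"))
after  = Sys.getlocale("LC_CTYPE")
print(inside)
```

Because `expr` is a promise, it is not evaluated until `force()` is reached, so the caller's expression sees the temporary locale without needing to be wrapped in a function.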

joshhwuu added 2 commits May 21, 2024 20:04

refactor tests to no longer use explicit japanese characters, surround by setlocale

800d787

wrap failing tests to use utf8 locale

5cde4a5

joshhwuu requested a review from MichaelChirico as a code owner May 26, 2024 20:44

joshhwuu added 3 commits May 26, 2024 13:44

Merge branch 'master' into fixwindowsparse

8cab214

remove trailing whitespaces

c20ccab

Merge branch 'fixwindowsparse' of https://github.com/Rdatatable/data.table into fixwindowsparse

3f50bae

joshhwuu requested a review from Anirban166 May 27, 2024 14:45

MichaelChirico reviewed May 27, 2024

inst/tests/tests.Rraw

MichaelChirico reviewed May 27, 2024

inst/tests/tests.Rraw (outdated)

review suggestions

a9cea9d

MichaelChirico approved these changes May 27, 2024

move strrep to top of file

5eb7e04

MichaelChirico merged commit 10c7dd6 into master May 27, 2024
3 checks passed

joshhwuu deleted the fixwindowsparse branch May 27, 2024 23:51

joshhwuu mentioned this pull request May 29, 2024

Master List of data.table Issues for GSoC '24 (Josh) joshhwuu/gsoc-2024#1
