Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Windows Parsing Issue #6150

Merged
merged 7 commits into from
May 27, 2024
Merged

Fix Windows Parsing Issue #6150

merged 7 commits into from
May 27, 2024

Conversation

joshhwuu
Copy link
Member

@joshhwuu joshhwuu commented May 26, 2024

Closes #6141

Refactored tests to use strrep instead of strings, and wrap tests in UTF-8 locale.

Need to run against Windows GHA to test changes

@tdhock
Copy link
Member

tdhock commented May 27, 2024

great, thanks, can you please ask for a review from @Anirban166 ?

@joshhwuu joshhwuu requested a review from Anirban166 May 27, 2024 14:45
Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, would be great if we could confirm it works 😅

@joshhwuu
Copy link
Member Author

Looking forward to trying it, fingers crossed 😄

@MichaelChirico MichaelChirico merged commit 10c7dd6 into master May 27, 2024
3 checks passed
@joshhwuu joshhwuu deleted the fixwindowsparse branch May 27, 2024 23:51
@Anirban166
Copy link
Member

LGTM, would be great if we could confirm it works 😅

I wonder how you would run the tests to check for this - Given how long it takes to run everything using test.data.table() (which also led to some XQuartz-specific issues on my Mac), I'd presume one would have to isolate the tests to be checked right?

For instance, I just modified the tests.Rraw file to be comparatively minimal besides including the code from this PR and it seems to be working:

> test.data.table("inst/tests/tests.Rraw")
getDTthreads(verbose=TRUE):
  OpenMP version (_OPENMP)       201811
  omp_get_num_procs()            10
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          10
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 5 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /Users/anirban166/data.table/inst/tests/tests.Rraw
Running test id 2253.19          
Mon May 27 21:28:15 2024  endian==little, sizeof(long double)==8, longdouble.digits==, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/Phoenix', Sys.getlocale()=='en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201811; omp_get_num_procs()==10; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==10; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 5 threads with throttle==1024. See ?setDTthreads.', zlibVersion()==1.2.11 ZLIB_VERSION==1.2.11
10 longest running tests took 0s (100% of 0s)
     ID  time nTest
1: 2253 0.034    19
All 19 tests (last 2253.19) in data.table/inst/tests/tests.Rraw completed ok in 0.045s elapsed (0.037s cpu)

tests.Rraw:

require(methods)

if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
  if (!identical(suppressWarnings(packageDescription("data.table")), NA)) {
    remove.packages("data.table")
    stop("This is dev mode but data.table was installed. Uninstalled it. Please q() this R session and try cc() again. The installed namespace causes problems in dev mode for the S4 tests.\n")
  }
  if ((tt<-compiler::enableJIT(-1))>0)
    cat("This is dev mode and JIT is enabled (level ", tt, ") so there will be a brief pause around the first test.\n", sep="")
  rm_all = function() {}
  DTfun = DT ## otherwise DT would be re-defined by many tests
} else {
  require(data.table)
  # Make symbols to the installed version's ::: so that we can i) test internal-only not-exposed R functions
  # in the test suite when user runs test.data.table() from installed package AND ii) so that in dev the same
  # tests can be used but in dev they test the package in .GlobalEnv. If we used ::: throughout tests, that
  # would pick up the installed version and in dev you'd have to reinstall every time which slows down dev.
  # NB: The string "data.table::" (which covers "data.table:::" too) should exist nowhere else in this file
  #      other than here inside this branch.

  all.equal.data.table = data.table:::all.equal.data.table
  allNA = data.table:::allNA
  any_na = data.table:::any_na
  as.data.table.array = data.table:::as.data.table.array
  as.data.table.default = data.table:::as.data.table.default
  as.IDate.default = data.table:::as.IDate.default
  as.ITime.default = data.table:::as.ITime.default
  binary = data.table:::binary
  bmerge = data.table:::bmerge
  brackify = data.table:::brackify
  Ctest_dt_win_snprintf = data.table:::Ctest_dt_win_snprintf
  chmatchdup = data.table:::chmatchdup
  compactprint = data.table:::compactprint
  cube.data.table = data.table:::cube.data.table
  dcast.data.table = data.table:::dcast.data.table
  DTfun = data.table:::DT
  endsWith = data.table:::endsWith
  endsWithAny = data.table:::endsWithAny
  forder = data.table:::forder
  forderv = data.table:::forderv
  format.data.table = data.table:::format.data.table
  format_col.default = data.table:::format_col.default
  format_list_item.default = data.table:::format_list_item.default
  getdots = data.table:::getdots
  groupingsets.data.table = data.table:::groupingsets.data.table
  guess = data.table:::guess
  INT = data.table:::INT
  is_na = data.table:::is_na
  is.sorted = data.table:::is.sorted
  isReallyReal = data.table:::isReallyReal
  isRealReallyInt = data.table:::isRealReallyInt
  is_utc = data.table:::is_utc
  melt.data.table = data.table:::melt.data.table  # for test 1953.4
  null.data.table = data.table:::null.data.table
  print.data.table = data.table:::print.data.table
  replace_dot_alias = data.table:::replace_dot_alias
  rollup.data.table = data.table:::rollup.data.table
  rss = data.table:::rss
  selfrefok = data.table:::selfrefok
  setcoalesce = data.table:::setcoalesce
  setdiff_ = data.table:::setdiff_
  setreordervec = data.table:::setreordervec
  shallow = data.table:::shallow # until exported
  .shallow = data.table:::.shallow
  split.data.table = data.table:::split.data.table
  if (!exists('startsWith', 'package:base', inherits=FALSE)) startsWith = data.table:::startsWith
  test = data.table:::test
  uniqlengths = data.table:::uniqlengths
  uniqlist = data.table:::uniqlist
  which_ = data.table:::which_
  which.first = data.table:::which.first
  which.last = data.table:::which.last
  `-.IDate` = data.table:::`-.IDate`
  haszlib = data.table:::haszlib

  # Also, for functions that are masked by other packages, we need to map the data.table one. Or else,
  # the other package's function would be picked up. As above, we only need to do this because we desire
  # to develop in .GlobalEnv with cc().
  # This should be retained even if these packages are removed from Suggests, because the test() in this file
  # checks against a data.table result which needs the data.table one to run. Otherwise the user can be
  # sure by using :: themselves.
                                        # masked by which package?
                                        # =================================
  setattr = data.table::setattr         # bit
  shift = data.table::shift             # IRanges, GenomicRanges
  between = data.table::between         # plm
  second = data.table::second           # S4Vectors
  dcast = data.table::dcast             # reshape2
  melt = data.table::melt               # reshape2
  last = data.table::last               # xts
  first = data.table::first             # xts, S4Vectors
  copy = data.table::copy               # bit64 v4; bit64 offered to rename though so this is just in case bit64 unoffers
  second = data.table::second           # lubridate #1135
  minute = data.table::minute           # lubridate
  hour = data.table::hour               # lubridate
  yday = data.table::yday               # lubridate
  wday = data.table::wday               # lubridate
  mday = data.table::mday               # lubridate
  week = data.table::week               # lubridate
  isoweek = data.table::isoweek         # lubridate
  month = data.table::month             # lubridate
  quarter = data.table::quarter         # lubridate
  year = data.table::year               # lubridate
  yearmon = data.table::yearmon         # zoo
  yearqtr = data.table::yearqtr         # zoo

  rm_all = function(env=parent.frame()) {
    tt = setdiff(ls(envir=env), .do_not_rm)
    rm(list=tt, envir=env)
    gc()
    invisible()
  }
}

# strrep is used many times in tests, but is from R 3.3.0, so use this equivalent if it is missing.
if (!exists("strrep", "package:base")) {
  strrep = function(x, times) mapply(function(x, times) paste(rep(x, times), collapse=""), rep_len(x, length(times)), times, USE.NAMES=FALSE)
}

local({
  lc_ctype = Sys.getlocale('LC_CTYPE')
  Sys.setlocale('LC_CTYPE', "en_US.UTF-8") # Japanese multibyte characters require utf8
  on.exit({Sys.setlocale('LC_CTYPE', lc_ctype)})
  accented_a = "\u0061\u0301"
  ja_ichi = "\u4E00"
  ja_ni = "\u4E8C"
  ja_ko = "\u3053"
  ja_n = "\u3093"
  dots = "..."
  clean_regex = "^\\d+:\\s+" # removes row numbering from beginning of output
  # Tests for combining character latin a and acute accent, single row
  DT = data.table(strrep(accented_a, 4L))
  test(2253.01, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(accented_a, 4L))
  test(2253.02, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(accented_a, 3L), dots))
  test(2253.03, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(accented_a, 1L), dots))
  # Tests for full-width japanese character ichi, single row
  DT = data.table(strrep(ja_ichi, 4L))
  test(2253.04, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(ja_ichi, 4L))

test(2253.05, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(ja_ichi, 3L), dots))
  test(2253.06, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(ja_ichi, 1L), dots))
  # Tests for multiple, different length combining character rows
  DT = data.table(strrep(accented_a, 1L:4L))
  test(2253.07, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(accented_a, 1:4L))
  test(2253.08, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(strrep(accented_a, 1:3), paste0(strrep(accented_a, 3L), dots)))
  test(2253.09, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(accented_a, rep(paste0(accented_a, dots), 3L)))
  # Tests for multiple, different length full-width characters
  DT = data.table(strrep(ja_ichi, 1L:4L))
  test(2253.10, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(ja_ichi, 1:4L))
  test(2253.11, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(strrep(ja_ichi, 1:3), paste0(strrep(ja_ichi, 3L), dots)))
  test(2253.12, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(ja_ichi, rep(paste0(ja_ichi, dots), 3L)))
  # Tests for combined characters, multiple columns
  DT = data.table(paste0(ja_ichi), strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa")
  test(2253.13, options=list(datatable.prettyprint.char = 4L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa"))
  test(2253.14, options=list(datatable.prettyprint.char = 3L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa"))
  test(2253.15, options=list(datatable.prettyprint.char = 2L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2), paste0(strrep(ja_ko, 2), dots) , strrep(accented_a, 2), "aa..."))
  test(2253.16, options=list(datatable.prettyprint.char = 1L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, paste0(ja_ni, dots), paste0(ja_ko, dots), paste0(accented_a, dots), "a..."))
  # Tests for multiple columns, multiple rows
  DT = data.table(strrep(ja_ko, 1:3L), strrep(ja_n, 2:4L), strrep(accented_a, 3))
  test(2253.17, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
    c(paste0(ja_ko, "     ", strrep(ja_n, 2L), " ", strrep(accented_a, 3L)),
    paste0(strrep(ja_ko, 2L), "   ", strrep(ja_n, 3L), " ", strrep(accented_a, 3L)),
    paste(strrep(ja_ko, 3L), strrep(ja_n, 4L), strrep(accented_a, 3L))))
  test(2253.18, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
    c(paste0(ja_ko, "      ", strrep(ja_n, 2L), " ", strrep(accented_a, 3L)),
    paste0(strrep(ja_ko, 2L), "    ", strrep(ja_n, 3L), " ", strrep(accented_a, 3L)),
    paste(strrep(ja_ko, 3L), paste0(strrep(ja_n, 3L), dots), strrep(accented_a, 3L))))
  test(2253.19, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
    c(paste0(ja_ko, " ", paste0(ja_n, dots), " ", paste0(accented_a, dots)),
    paste0(c(ja_ko, ja_n, accented_a), dots, collapse=" "),
    paste0(c(ja_ko, ja_n, accented_a), dots, collapse=" ")))
})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

non-UTF8 characters fail to parse on Windows
4 participants