only coerce UTF8 factors if needed in memrecycle #7480

ben-schwen · 2025-12-16T12:52:16Z

Not sure about the test, since what we essentially want to check is the system.time of the test. Maybe it's better to use an atime test? With the regression, the added test bombs the runners by taking 2 hours :(

codecov · 2025-12-16T13:00:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.06%. Comparing base (b0c4ac3) to head (dc00c4e).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #7480   +/-   ##
=======================================
  Coverage   99.06%   99.06%           
=======================================
  Files          86       86           
  Lines       16618    16619    +1     
=======================================
+ Hits        16463    16464    +1     
  Misses        155      155


github-actions · 2025-12-16T13:09:25Z

HEAD=memrecycle_factor slower P<0.001 for memrecycle regression fixed in #5463
HEAD=memrecycle_factor slower P<0.001 for DT[by,verbose=TRUE] improved in #6296

Generated via commit dc00c4e

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 58 seconds
Installing different package versions	21 seconds
Running and plotting the test cases	2 minutes and 38 seconds

aitap

Excellent diagnosis, thank you. Indeed, the costly calls to need2utf8 can be gated on the levels being non-identical (which we need to test anyway). With the regression, the profiler shows almost all of the time being spent in need2utf8 → charIsASCII.

I think this is more suitable for an atime test than a normal regression test.
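A minimal sketch of the gating idea described above, in plain C with no R headers so that it compiles on its own. Everything in it is a hypothetical stand-in and not the code in src/assign.c: has_non_ascii() plays the role of the per-string scan, same_levels() uses pointer equality as a simplified substitute for the identical() check on the two level vectors, and the scanned counter exists only to show how much work the gate skips.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustration only: counts how many strings the expensive scan visited. */
static size_t scanned = 0;

/* Hypothetical stand-in for the per-string check:
   true if s contains any byte outside the ASCII range. */
static bool has_non_ascii(const char *s) {
    scanned++;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if (*p > 127) return true;
    return false;
}

/* Cheap check. Assumption: element-wise pointer equality is enough here,
   a simplification of a real identical() comparison. */
static bool same_levels(const char *const *a, size_t na,
                        const char *const *b, size_t nb) {
    if (na != nb) return false;
    for (size_t i = 0; i < na; i++)
        if (a[i] != b[i]) return false;
    return true;
}

/* Runs the expensive scan only when the cheap check says the levels differ. */
static bool needs_coercion(const char *const *src, size_t ns,
                           const char *const *tgt, size_t nt) {
    assert(src != NULL && tgt != NULL);
    if (same_levels(src, ns, tgt, nt)) return false;  /* gate: skip the scan */
    for (size_t i = 0; i < ns; i++)
        if (has_non_ascii(src[i])) return true;
    for (size_t i = 0; i < nt; i++)
        if (has_non_ascii(tgt[i])) return true;
    return false;
}
```

When both level vectors hold the same pointers, needs_coercion() returns false without incrementing scanned. Once any level differs, the levels are scanned in order until a non-ASCII one is found.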

aitap · 2025-12-16T13:31:25Z

src/assign.c

+      if (needUtf8Coerce) {
+        sourceLevels = PROTECT(coerceUtf8IfNeeded(sourceLevels)); protecti++;
        targetLevels = PROTECT(coerceUtf8IfNeeded(targetLevels)); protecti++;
+        if (sourceIsFactor && R_compute_identical(sourceLevels, targetLevels, 0)) needUtf8Coerce = false;


Is the needUtf8Coerce = false assignment covered? I've tried compiling with -Og and setting a breakpoint on the exact instruction setting the register to 0 and it didn't fire during test.data.table(). I think it might be unreachable.

The results of R_compute_identical() shouldn't change after coerceUtf8IfNeeded() because identical() takes encodings into account:

https://github.com/r-devel/r-svn/blob/96eee1cdda590de914d48fed05d8f0783f921da4/src/main/memory.c#L4978-L4997

This is quite convenient because a factor with levels enc2utf8('ø') and a factor with levels iconv('ø', to = 'latin1') will pass the first R_compute_identical() already, without any other string conversions.

ben-schwen

Good point! I removed the unreachable code. Will add an atime test as a separate PR. I guess our current "issue" with atime tests, and the reason we have to keep these branches alive, is that we squash when merging, so the commits disappear.

@tdhock @Anirban166 does this sound right?

aitap · 2025-12-17T11:08:58Z

A workaround discussed in a different issue is to merge the PR first and then, in a follow-up commit, record the SHA-1 hash of the merge commit instead of the HEAD of the PR branch.

ben-schwen · 2025-12-17T15:04:29Z

@TysonStanley this should probably also be picked for the release (since it is a regression fix for something included in 1.17.2)!

only coerce UTF8 factors if needed

4acabf0

ben-schwen requested review from HughParsonage and MichaelChirico as code owners December 16, 2025 12:52

ben-schwen requested review from aitap and removed request for HughParsonage and MichaelChirico December 16, 2025 12:52

aitap approved these changes Dec 16, 2025

ben-schwen added 3 commits December 17, 2025 11:40

remove unreachable code

d111dcc

remove unit test

7e3d3f1

add NEWS

dc00c4e

ben-schwen merged commit b6ad1a4 into master Dec 17, 2025
13 checks passed

ben-schwen deleted the memrecycle_factor branch December 17, 2025 11:10

ben-schwen mentioned this pull request Dec 17, 2025

add atime test for grouping by factor (encodings) #7482

Merged

ben-schwen mentioned this pull request Dec 21, 2025

allow to fail atime job in GHCI / use merge commits to allow branch deletion #7363

Open
