-
Notifications
You must be signed in to change notification settings - Fork 1k
only coerce UTF8 factors if needed in memrecycle #7480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7480 +/- ##
=======================================
Coverage 99.06% 99.06%
=======================================
Files 86 86
Lines 16618 16619 +1
=======================================
+ Hits 16463 16464 +1
Misses 155 155 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Generated via commit dc00c4e Download link for the artifact containing the test results: ↓ atime-results.zip
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent diagnosis, thank you. Indeed, the costly calls to need2utf8 can be gated by the levels being non-identical (which we need to test anyway). With the regression, profiler shows almost all the time spent in need2utf8 → charIsASCII.
I think this is more suitable for an atime test than a normal regression test.
src/assign.c
Outdated
| if (needUtf8Coerce) { | ||
| sourceLevels = PROTECT(coerceUtf8IfNeeded(sourceLevels)); protecti++; | ||
| targetLevels = PROTECT(coerceUtf8IfNeeded(targetLevels)); protecti++; | ||
| if (sourceIsFactor && R_compute_identical(sourceLevels, targetLevels, 0)) needUtf8Coerce = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the needUtf8Coerce = false assignment covered? I've tried compiling with -Og and setting a breakpoint on the exact instruction setting the register to 0 and it didn't fire during test.data.table(). I think it might be unreachable.
The results of R_compute_identical() shouldn't change after coerceUtf8IfNeeded() because identical() takes encodings into account:
This is quite convenient because a factor with levels enc2utf8('ø') and a factor with levels iconv('ø', to = 'latin1') will pass the first R_compute_identical() already, without any other string conversions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I removed the unreachable code. Will add an atime test as a separate PR. I guess our current "issue" with atime tests and why we have to keep these branches alive, is that we squash when merging and hence the commits will disappear
@tdhock @Anirban166 does this sound right?
|
A workaround discussed in a different issue is to first merge the PR and then record the SHA1-hash of the merge commit in a follow-up commit instead of the HEAD of the PR branch.
|
|
@TysonStanley this should probably be also picked for the release (since it is a regression fix for smth included in |

Closes #7404
Not sure about the test since we want to essentially test for
system.timeof the test. Maybe its better to use an atime test? The added tests bombs runners by taking 2 hours with regression :(