Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for .datatable.aware being FALSE #5655

Merged
merged 16 commits into from
Feb 19, 2024

Conversation

dvg-p4
Copy link
Contributor

@dvg-p4 dvg-p4 commented Jun 12, 2023

Building a quick implementation of #5654 based off the code added in #4909. Technically this won't check if the package sets .datatable.aware = FALSE in its top-level namespace, but I can't imagine a use case for that--why would you be importing data.table in NAMESPACE if you didn't want to use its syntax at all?

TODO:

  • Double-check https://rdatatable.gitlab.io/data.table/CONTRIBUTING.html to see if this list is missing any steps
  • See if devtools::check() passes on Linux (getting thread-related errors on my macbook)
  • Check if this works on the command line
  • Check if this works within a package function (as motivated)
  • Write a test case for this
  • Make sure the new test fails on master
  • Make sure all tests (including the new one) pass on the PR branch
  • Update NEWS
  • Mark as ready for review

Copy link

codecov bot commented Jan 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4e274ab) 97.48% compared to head (476049e) 97.48%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5655   +/-   ##
=======================================
  Coverage   97.48%   97.48%           
=======================================
  Files          80       80           
  Lines       14857    14857           
=======================================
+ Hits        14483    14484    +1     
+ Misses        374      373    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

R/cedta.R Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

BTW, for this case, feel free to confirm it works locally only. It will be cumbersome to write a test case (I reckon) since AIUI it will require creating & installing a test package. I don't think we have any tests like that so far, and this functionality is very close to existing & well-worn functionality, so I don't expect it to do anything surprising. Therefore just testing locally & reporting back is sufficient IMO as reviewer.

@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Jan 15, 2024

Therefore just testing locally & reporting back is sufficient IMO as reviewer.

Ok, that makes this a lot more straightforward--let me get on that and I'll see if I can get this ready for review today.

@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Jan 15, 2024

Test 2244.2 ran without errors but selfrefok(x) is FALSE

This is starting to go a bit over my head but I suspect this is a deeper error that I've revealed rather than caused, glancing at the comments above _selfrefok()'s definition:

data.table/src/assign.c

Lines 93 to 96 in cde7333

However, we still have problem (ii) above and it didn't pass tests involving base R copies.
We really need R itself to start setting TRUELENGTH to be the allocated length and then
for GC to release TRUELENGTH not LENGTH. Would really tidy this up.

@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Jan 15, 2024

And this works as intended in a test package! https://github.com/dvg-p4/DataTableAwareTestPkg/tree/master

> library(data.table)
data.table 1.14.99 IN DEVELOPMENT built 2024-01-15 20:22:22 UTC; dgealow using 48 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library(DataTableAwareTestPkg)
> dt <- data.table(foo = 1:3, bar = 4:6)
> df <- data.frame(foo = 1:3, bar = 4:6)
> get_first_row_of_datatable(dt)
     foo   bar
   <int> <int>
1:     1     4
> get_first_row_of_datatable(df)
  foo
1   1
2   2
3   3
> get_first_column_of_dataframe(dt)
     foo
   <int>
1:     1
2:     2
3:     3
> get_first_column_of_dataframe(df)
  foo
1   1
2   2
3   3

@dvg-p4 dvg-p4 marked this pull request as ready for review January 15, 2024 21:53
@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Jan 15, 2024

Yeah it looks like the issue with selfrefok() failing with datatable-unaware indexing is a preexisting problem, I can replicate it on master with

> dt = data.table(foo = 1:3, bar = 4:6, baz = 7:9)
> res <- `[.data.frame`(dt, 2:3, 1:2)
> res
     foo   bar
   <int> <int>
1:     2     5
2:     3     6
> data.table:::selfrefok(res)
[1] 0

Or, with the https://github.com/dvg-p4/DataTableAwareTestPkg/tree/default-unaware branch of my test package:

> dt = data.table(foo = 1:3, bar = 4:6, baz = 7:9)
> res = get_first_column_of_dataframe(dt) # returns dt[1] in a non-DT-aware environment
> res
     foo
   <int>
1:     1
2:     2
3:     3
> data.table:::selfrefok(res)
[1] 0

@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Jan 25, 2024

I was wondering, since I don't see a milestone assigned to this issue or PR, should I expect this to make it into the next release if it gets approved, or would it probably not be until the next patch after that? or next major release? Trying to plan out whether I ought to spend time on a workaround for our package if this won't be incorporated into a DT release for a while.

.datatable.aware = TRUE
test(2244.1, dt[1], data.table(foo = 1L, bar = 4L)) # Single index should be interpreted as row
.datatable.aware = FALSE
# selfrefok() fails on unaware-indexed datatables; the copy() here sidesteps that issue.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if selfrefok() becomes FALSE after returning from cedta(), isn't that a broader issue?

Now users who declare themselves unaware of data.table are getting obscure internal warnings from data.table....

WDYT? Worth addressing in this PR? Or kick the can until a further issue is reported downstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it actually propagates to a user-visible warning in normal use--it's specifically data.table::test() that checks selfrefok() and throws an error if not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's wait & see then.

@MichaelChirico
Copy link
Member

I was wondering, since I don't see a milestone assigned to this issue or PR, should I expect this to make it into the next release if it gets approved, or would it probably not be until the next patch after that? or next major release? Trying to plan out whether I ought to spend time on a workaround for our package if this won't be incorporated into a DT release for a while.

Right, 1.15.0 is sealed up already.

Going forward, "major" release will be the default, i.e., 1.16.0 is the next release. I believe the timeline we discussed is every 6 months for such releases.

@MichaelChirico MichaelChirico changed the base branch from master to 1-15-99 January 25, 2024 21:26
@MichaelChirico
Copy link
Member

Would you mind resolving the conflicts? I set the merge target to the upcoming dev version of R.

@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Feb 14, 2024

Would you mind resolving the conflicts? I set the merge target to the upcoming dev version of R.

Sorry for the delay---done

@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Feb 18, 2024

@MichaelChirico I'm guessing from the timestamps that this being closed was an automatic Github action due to you deleting the branch it was targeting--if so could you re-open and re-target whatever it should be targeting now?

@MichaelChirico
Copy link
Member

@dvg-p4 sorry about that, usually it auto-rebases to master. I think it failed to do so because the branch is on your fork... "Reopen and comment" is greyed out on my UI, so I'm not sure how to proceed...

@MichaelChirico MichaelChirico changed the base branch from 1-15-99 to master February 19, 2024 01:01
@dvg-p4
Copy link
Contributor Author

dvg-p4 commented Feb 19, 2024

Oh looks like you got it haha. I can take a look at those conflicts tomorrow.

@MichaelChirico
Copy link
Member

MichaelChirico commented Feb 19, 2024

Kinda stupid, but for future reference:

  1. Re-create the Rdatatable:1-15-99 branch that this PR was targeting
  2. Only now can I re-open this PR
  3. Only then can I edit the target branch for this PR back to Rdatatable:master
  4. (delete placeholder 1-15-99 branch)

dvg-p4 and others added 12 commits February 18, 2024 17:10
* Fix 5492 by limiting the costly deparse to `nlines=1`

* Implementing PR feedbacks

* Added  inside

* Fix typo in name

* Idiomatic use of  inside

* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
* Added my improvements to the intro vignette

* Removed two lines I added extra as a mistake earlier

* Requested changes
…able#5889)

* frollmax exact, buggy fast, no fast adaptive

* frollmax fast fixing bugs

* frollmax man to fix CRAN check

* frollmax fast adaptive non NA, dev

* froll docs, adaptive left

* no frollmax fast adaptive

* frollmax adaptive exact NAs handling

* PR summary in news

* copy-edit changes from reviews

Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>

* comment requested by Michael

* update NEWS file

* Apply suggestions from code review

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Apply suggestions from code review

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* add comment requested by Michael

* add comment about int iterator for loop over k-1 obs

* extra comments

* Revert "extra comments"

This reverts commit 03af0e3.

* add comments to frollmax and frollsum

* typo fix

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>
Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience! Should be ready to merge once CI passes.

@MichaelChirico MichaelChirico merged commit 94c83de into Rdatatable:master Feb 19, 2024
6 checks passed
@tdhock
Copy link
Member

tdhock commented Feb 21, 2024

hi @MichaelChirico about your comment "delete placeholder 1-15-99 branch" I would suggest to by default not delete branches of PRs that have been merged, please, unless you have a strong reason to do so. Leaving the branch (not deleting) is preferable for performance testing, where we want to actually run some code on old branches, and compare with current code, to make sure that current code is indeed faster then previous/regression code.
For performance testing we have had to go through PR history and click button to re-create branches, after getting error messages like "object not found" see DorisAmoakohene/data.table_test#3 for an example.

@ben-schwen
Copy link
Member

hi @MichaelChirico about your comment "delete placeholder 1-15-99 branch" I would suggest to by default not delete branches of PRs that have been merged, please, unless you have a strong reason to do so. Leaving the branch (not deleting) is preferable for performance testing, where we want to actually run some code on old branches, and compare with current code, to make sure that current code is indeed faster then previous/regression code.

Is there any upside of checking out a branch for performance testing over checking out a single a commit? Both even work with git checkout branchName/commitID.

@tdhock
Copy link
Member

tdhock commented Feb 21, 2024

actually, yes for performance testing we do checkout single commits.
but when the branch is deleted, the commits in that branch can no longer be checked out.

@MichaelChirico
Copy link
Member

I would suggest to by default not delete branches of PRs that have been merged

when the branch is deleted, the commits in that branch can no longer be checked out.

FWIW 1-15-99 is not a good example since it was rebase-merged and all of the PR commits now survive to master.

But anyway, deleting merged branches is the norm in every repo I've seen -- that's why GitHub lets us do so in one click. For your concern about checking out PR commits, I don't think your statement is accurate, e.g. taking Jan's #5889 (branch deleted circa Jan 12):

git checkout master
git checkout -b tmp
git cherry-pick cca5b7912aeee1826792d983f531f36736c9fb91
git cherry-pick 14555b2b3ef3498242110bfd64c3fc8570f4f7aa

Those SHAs are from https://github.com/Rdatatable/data.table/pull/5889/commits. I.e., I don't have any trouble applying specific commits even though the branch is deleted.

@jangorecki
Copy link
Member

If you want to benchmark PRs then merge commit is fine, if you want to benchmark individual commits inside PR then you cannot delete branches. IMO first should be enough.

@tdhock
Copy link
Member

tdhock commented Feb 22, 2024

anyway, if you delete a branch, we can always restore it later thanks to github.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants