fcase / case_when function for data.table #3823
Comments
I actually prefer the syntax like |
It would probably be easier to code from that, too. One way forward would be to S3 it and build a |
Yes, it's easier to read, especially when the test and value are short and inlined. But I'm not sure whether the formula style will have "significant" overhead, since it requires extra patching. If not, I'm OK with both styles... |
Question: how does this improve on (or differ from) subassignment by reference? You can do an equivalent to a case when using the following:
DT <- data.table(age = 0:100)
DT[, age_label := "65+"]
DT[age < 65, age_label := "35-65"]
DT[age < 35, age_label := "18-35"]
DT[age < 18, age_label := "0-18"]
for which we already get auto-indexing by default. |
Hugh, see the last paragraph. I had a fuller explanation written out but it disappeared when I pressed Comment 😢 |
Your example is a special case. |
I am thinking it would be more efficient if we didn't have to evaluate all the "LHS" values, but I need to think about how I would solve that problem. What do you think of something like that? |
@2005m I don't quite follow the intuition of the function signature? I have exploratory work on the
My thinking is to implement @shrektan's suggestion first as it'll be pretty straightforward (just need to figure out how to deal with Then later build the logic to interpret with |
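I think we can pass a list object to C, e.g.:
fcase = function(..., default = NA) {
  # pass the caller's default through instead of a hard-coded NA
  .Call(Cfcase, list(...), default)
}
/* `default` is a reserved word in C, so the second parameter is named `def` here */
SEXP fcase(SEXP x, SEXP def) {
...
}
In addition,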
|
fcase(variable, default, test1, value1, ..., testN, valueN)
With this construction one could test for missing(..1) at R level then
evaluate fcase recursively.
…On Fri, 13 Sep 2019 at 12:31 pm, Xianying Tan ***@***.***> wrote:
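I think we can pass a list object to C, e.g.:
fcase = function(..., default = NA) {
  # pass the caller's default through instead of a hard-coded NA
  .Call(Cfcase, list(...), default)
}
/* `default` is a reserved word in C, so the second parameter is named `def` here */
SEXP fcase(SEXP x, SEXP def) {
...
}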
|
@MichaelChirico , I was thinking to use it like this: |
since it only depends on a single variable |
The fewer checks at the R level, the better. I also prefer We should first agree on the API. As @shrektan said |
As of now we have

fcase(...)
fcase(..., default)
fcase(when1, value1, ..., default)
fcase(age < 18, '0-18', age < 35, '18-35', age < 65, '35-65', '65+')

fcase(...)
fcase(..., default)
fcase(x, ..., default)
fcase(age < 18 ~ '0-18', age < 35 ~ '18-35', age < 65 ~ '35-65', TRUE ~ '65+')

fcase(when, value, default)
fcase(list(age < 18, age < 35, age < 65),
      list('0-18', '18-35', '35-65'),
      '65+')

Any other suggestions? |
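To compare the three call styles side by side, here is a rough base R sketch. It is not the data.table implementation: `case_core`, `case_pairs` and `case_formula` are hypothetical names, the default is passed as a named argument (unlike the trailing unnamed default in the first style above) to keep the parsing trivial, and every condition is evaluated eagerly for all rows.

```r
# Hypothetical core: first matching test wins.
# `tests` is a list of logical vectors, `values` a list of vectors.
case_core <- function(tests, values, default = NA) {
  stopifnot(length(tests) == length(values), length(tests) > 0L)
  n <- max(lengths(tests))
  out <- rep(default, n)
  done <- logical(n)                     # rows that already have a value
  for (i in seq_along(tests)) {
    hit <- !done & !is.na(tests[[i]]) & tests[[i]]
    out[hit] <- rep_len(values[[i]], n)[hit]
    done <- done | hit
  }
  out
}

# Style 1: alternating test/value arguments.
case_pairs <- function(..., default = NA) {
  args <- list(...)
  stopifnot(length(args) >= 2L, length(args) %% 2L == 0L)
  odd <- seq(1L, length(args), by = 2L)
  case_core(args[odd], args[odd + 1L], default)
}

# Style 2: formulas, test on the left of `~`, value on the right.
case_formula <- function(..., default = NA) {
  fs <- list(...)
  tests  <- lapply(fs, function(f) eval(f[[2L]], environment(f)))
  values <- lapply(fs, function(f) eval(f[[3L]], environment(f)))
  case_core(tests, values, default)
}

age <- c(5, 20, 40, 70)
by_pairs   <- case_pairs(age < 18, "0-18", age < 35, "18-35",
                         age < 65, "35-65", default = "65+")
by_formula <- case_formula(age < 18 ~ "0-18", age < 35 ~ "18-35",
                           age < 65 ~ "35-65", TRUE ~ "65+")
# Style 3: two parallel lists plus a default.
by_lists   <- case_core(list(age < 18, age < 35, age < 65),
                        list("0-18", "18-35", "35-65"), "65+")
```

All three reduce to the same core, so the choice between them is mostly about readability and how much argument handling happens at the R level.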
Some food for thought: https://coolbutuseless.github.io/2018/09/06/strict-case_when/ |
The author of that post seems to be asking for a lookup table, a problem we solve with update on join. Case when has a somewhat different goal, as explained in the comments above. |
Question: Can |
in SQL it's usually the same length ( |
Maybe this is relevant. I just built a version that is just built on top of your very quick It can be found in the very developmental |
Thank you @TysonStanley . I actually finished writing the function in C. I was hoping to do a pull request tonight. I just need to finish writing the tests. I'll have a look at your function. |
@2005m That’s awesome! I’m excited to see it rolled out. Is the syntax similar to And yes, take a look and let me know what you think. It’s a pretty simple approach since I could rely on |
Here is a sneak peek:
The syntax is different to |
Thanks for the sneak peek. That performance is fun to see. And the syntax looks just as friendly either way. Personally I like the formulas but for most cases, it probably doesn’t matter much. Are you planning on supporting other vector types in the future? |
Does it evaluate every single case, or only those that need to be reached to provide the answer? |
@jangorecki , yes, it evaluates all cases, and that is why I am not happy with it. I am also not happy with the performance. If it can't be improved, I think @TysonStanley 's approach is better because it is simpler and the timings are similar. @TysonStanley , I also prefer the formulas, but it is probably subjective... Regarding other vector types, I don't know. It is up to the team. |
We are not in a hurry. I think laziness is crucial; otherwise it doesn't bring anything new (other than the API) compared to using a lookup table. Don't try to resolve everything in the first iteration. When you feel ready, submit a PR to get feedback. The final state can take multiple iterations or follow-up PR(s). |
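For what "lazy" could mean here, a minimal base R sketch, assuming the conditions are captured unevaluated and each one is only evaluated on the rows that no earlier condition has claimed. `lazy_case` and the `rows_evaluated` attribute are hypothetical and exist only for illustration; this is not how the C code in the thread works.

```r
# Hypothetical lazy case-when over the columns of a data.frame.
# Arguments after `data` alternate: test, value, test, value, ...
lazy_case <- function(data, ..., default = NA) {
  caller <- parent.frame()
  exprs <- as.list(substitute(list(...)))[-1L]   # unevaluated test/value pairs
  stopifnot(length(exprs) >= 2L, length(exprs) %% 2L == 0L)
  n <- nrow(data)
  out <- rep(default, n)
  todo <- seq_len(n)              # rows that still have no value
  evaluated <- integer(0)         # how many rows each test had to look at
  for (i in seq(1L, length(exprs), by = 2L)) {
    if (!length(todo)) break      # nothing left: remaining tests are skipped
    rest <- data[todo, , drop = FALSE]
    hit <- eval(exprs[[i]], rest, caller)
    hit <- !is.na(hit) & hit
    evaluated <- c(evaluated, length(todo))
    val <- eval(exprs[[i + 1L]], rest, caller)
    out[todo[hit]] <- rep_len(val, length(todo))[hit]
    todo <- todo[!hit]
  }
  attr(out, "rows_evaluated") <- evaluated
  out
}

df <- data.frame(age = c(5, 20, 40, 70))
res <- lazy_case(df,
                 age < 18, "0-18",
                 age < 35, "18-35",
                 age < 65, "35-65",
                 default = "65+")
```

With this input the three tests look at 4, 3 and 2 rows respectively, instead of 4 rows each as in an eager version. The subsetting itself has a cost, so whether this wins in practice would need benchmarking.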
Does |
yes it does
|
@jangorecki , I hope to be able to share my code this weekend. I need to write the Rd file and add more tests. |
Related/follow-up to #3657

case_when is a common tool in SQL for e.g. building labels based on conditions of flags, e.g. cutting age groups.

Our comrades at dplyr have implemented this as case_when with a formula interface. Using and interpreting formulas seems pretty natural for R; the only other thing I can think of would be something like the on syntax (case_when('age<18' = '0-18', ...)).

As for the back end, dplyr is doing a two-pass for loop at the R level, which e.g. requires evaluating age < 65 for all rows (whereas this is unnecessary for anything with labels '0-18' or '18-35').

I guess we can do much better with a C implementation. It will be interesting to contrast a proper lazy implementation with a parallel version (since IINM doing it in parallel will require first evaluating all the "LHS" values, e.g. as Jan met recently in frollapply).

Note that normally I'm doing case_when stuff with a look-up join, but it's not straightforward to implement a case_when as a join in general. Though I think the two are isomorphic, backing out what the corresponding join should be is, I think, a hard problem. E.g. in the example here, the order of truth values matters, since implicit in the later conditions are (x & !any(y)) for each condition x that was preceded by conditions y. This case is straightforward to cast as a join on cut(age), perhaps just using roll, but things can be much more complicated when several variables are involved in the truth conditions. So I don't think this route is likely to be fruitful.