New vignette -- Usages of .SD #3572

Merged
merged 5 commits into from May 22, 2019

@MichaelChirico MichaelChirico commented May 18, 2019

Closes #3412

Not sure if we should track the png on GH or not

jangorecki commented May 18, 2019

I haven't gone through yet but looks like a comprehensive guide on using .SD.

  • I would avoid such a title, as some people might see it as "cryptic symbols".
  • Also, as discussed in the linked issue, the scope of that vignette could be extended to other tricks in j. It would be very useful if you could leave placeholders for that in the document, so they can be filled in by others.
  • The png looks unnecessarily big. Not sure if 12KB will make a difference, but recently Matt was dealing with compiler flags to reduce the size of the package.

mattdowle commented May 18, 2019

Haven't looked either yet, but just on the png size: removing the -g compiler flag saved 1MB recently (.so reduced from 1.5MB to 0.5MB), so the package size is now approx. 4MB of the 5MB limit. 12KB is not an issue (1.2% of remaining). The Pitching.RData (1.3MB) file in vignettes/ is more of a concern, but from what I can gather only the vignette PDF is installed and counts towards the 5MB installed size limit, so that should be OK.

@mattdowle mattdowle added this to the 1.12.4 milestone May 22, 2019

@mattdowle mattdowle left a comment

Looks great!

codecov bot commented May 22, 2019

Codecov Report

Merging #3572 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3572   +/-   ##
=======================================
  Coverage   97.58%   97.58%           
=======================================
  Files          66       66           
  Lines       12695    12695           
=======================================
  Hits        12389    12389           
  Misses        306      306


mattdowle commented May 22, 2019

For future reference:
It seems that the Travis error when building vignettes, which suggests that rmarkdown is not available and should be installed, is spurious (seen that before, iirc). It just means that something is wrong with the vignette somewhere, and you have to reproduce it locally to find what's wrong.
In addition to changing the RData version from 3 to 2, I needed to change cache= from TRUE to FALSE. Otherwise it produced a warning about the version 3 format, meaning that the package would then depend on R 3.5+.
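
For anyone reproducing this locally, here is a minimal sketch of writing an `.RData` file in the version 2 format, so that it stays readable on R older than 3.5.0. The `pitching` data frame and the temporary path are hypothetical stand-ins, not the actual vignette data or the command used in this PR; the `cache=` change mentioned above refers to the knitr chunk option and is not shown here.

```r
# Hypothetical stand-in for the vignette's data set.
pitching <- data.frame(W = c(10L, 12L), ERA = c(3.5, 2.9))

rdata_path <- file.path(tempdir(), "pitching.RData")

# version = 2 selects the older serialization format;
# version 3 files can only be read by R >= 3.5.0.
save(pitching, file = rdata_path, version = 2)

# Load into a separate environment to check the round trip.
loaded <- new.env()
load(rdata_path, envir = loaded)
round_trip_ok <- identical(loaded$pitching, pitching)
```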

@mattdowle mattdowle merged commit c68e95e into master May 22, 2019
@mattdowle mattdowle deleted the sd_vignette branch May 22, 2019 03:58
@jangorecki

Why not use csv.gz instead of RData? Then there is no risk that the vignette can only be built on newer R due to format incompatibility.
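
A rough sketch of the csv.gz alternative suggested here, using only base R. The data frame and file name are hypothetical placeholders, not the vignette's actual data; the point is only that a gzipped CSV is plain text and does not depend on the R serialization version.

```r
# Hypothetical stand-in for the vignette's data set.
pitching <- data.frame(W = c(10L, 12L), ERA = c(3.5, 2.9))

csv_path <- file.path(tempdir(), "pitching.csv.gz")

# Write through a gzip connection so the file is compressed on disk.
con <- gzfile(csv_path, "w")
write.csv(pitching, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects gzip compression.
restored <- read.csv(csv_path)
```

The trade-off is that column types are re-inferred on reading, so anything that CSV cannot represent directly (factors, attributes, keys) would have to be restored in the vignette code.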


This vignette will explain the most common ways to use the `.SD` variable in your `data.table` analyses. It is an adaptation of [this answer](https://stackoverflow.com/a/47406952/3576984) given on StackOverflow.

# What is `.SD`?

https://rdatatable.gitlab.io/data.table/library/data.table/doc/datatable-sd-usage.html
When rendered from R, this results in "What is <code>.SD</code>?" as the tab name in the browser; it may be better to remove the code formatting and leave .SD as plain text.

Pitching[ , coef(lm(ERA ~ ., data = .SD))['W'], .SDcols = c('W', rhs)]
})
barplot(lm_coef, names.arg = sapply(models, paste, collapse = '/'),
main = 'Wins Coefficient\nWiith Various Covariates',

Wiith double i


## Conditional Joins

`data.table` syntax is beautiful for its simplicity and robustness. The syntax `x[i]` flexibly handles two common approaches to subsetting -- when `i` is a `logical` vector, `x[i]` will return those rows of `x` corresponding to where `i` is `TRUE`; when `i` is _another `data.table`_, a (right) `join` is performed (in the plain form, using the `key`s of `x` and `i`, otherwise, when `on = ` is specified, using matches of those columns).

there is also a case of DT["someid"]
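
To make the cases discussed above concrete, here is a small sketch of the three forms of `i`: a logical condition, another `data.table`, and a character value looked up against the key. The tables `x` and `i` are made up for illustration and are not from the vignette.

```r
library(data.table)

# Made-up example tables; x is keyed on id.
x <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
i <- data.table(id = c("c", "a"), w = c(10, 20))

# Logical i: keep the rows of x where the condition is TRUE.
filtered <- x[v > 1L]

# data.table i: join, returning one row per row of i, in the order of i.
joined <- x[i, on = "id"]

# Character i: matched against the key of x (requires x to be keyed).
by_key <- x["b"]
```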


Note that this approach can of course be combined with `.SDcols` to return only portions of the `data.table` for each `.SD` (with the caveat that `.SDcols` should be fixed across the various subsets)

_NB_: `.SD[1L]` is currently optimized by [_`GForce`_](https://jangorecki.gitlab.io/data.table/library/data.table/html/datatable-optimize.html) ([see also](https://stackoverflow.com/questions/22137591/about-gforce-in-data-table-1-9-2)), `data.table` internals which massively speed up the most common grouped operations like `sum` or `mean` -- see `?GForce` for more details and keep an eye on/voice support for feature improvement requests for updates on this front: [1](https://github.com/Rdatatable/data.table/issues/735), [2](https://github.com/Rdatatable/data.table/issues/2778), [3](https://github.com/Rdatatable/data.table/issues/523), [4](https://github.com/Rdatatable/data.table/issues/971), [5](https://github.com/Rdatatable/data.table/issues/1197), [6](https://github.com/Rdatatable/data.table/issues/1414)
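
As an illustration of the pattern described in this excerpt, the following sketch takes the first row of each group while restricting `.SD` to a subset of columns via `.SDcols`. The table `DT` and its columns are invented for the example.

```r
library(data.table)

# Invented example data: two groups, two value columns.
DT <- data.table(g = c("a", "a", "b"), x = 1:3, y = c(2.5, 1.5, 0.5))

# First row of each group; .SDcols limits .SD to column x only,
# so y does not appear in the result.
first_rows <- DT[, .SD[1L], by = g, .SDcols = "x"]
```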

Successfully merging this pull request may close these issues.

.SD vignette based on SO answer
3 participants