Vignettes #944

Open
12 of 33 tasks
arunsrinivasan opened this issue Nov 11, 2014 · 57 comments

Comments

Member

@arunsrinivasan arunsrinivasan commented Nov 11, 2014

HTML vignette series:

Planned for v1.9.8


Future releases

  • data.table internals, performance aspects and expressiveness
  • Reading multiple files (fread + rbindlist), ordering, ranking and set operations
  • IDateTime vignette
  • Document the difference between data.table() and data.frame() somewhere - relevant issues: #968, #877. Perhaps slightly more in detail in the FAQ.
  • Coursera FAQ
  • Advanced data.table usage:
    • NSE
    • ...
  • Timings vignette (moving #520 here to get everything in one place, but not sure we need it as a vignette, since we already have the Wiki with benchmarks/timings).
  • fread + fwrite vignette; also include the "Convenience features of fread" wiki page, and see #2855.

Finished:


Minor:

  • Operations using integer64, and promoting it for large integers.

Notes (to update current vignettes based on feedback): please let me know if I missed anything.

Introduction to data.table:

  • order in i.
  • Explain how to name columns in j while selecting/computing.
  • Emphasise that keyby is applied to the computed result after it has been obtained, not to the original data.table.
  • Mention new updates to .SDcols and cols in with=FALSE being able to select columns as colA:colB.
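
A rough sketch of what these four points could look like in the vignette. The table `DT` and its columns are made up for illustration, and this is not the planned vignette text:

```r
library(data.table)

# Hypothetical toy data, invented for this sketch
DT <- data.table(g = c("b", "a", "b", "a"),
                 x = c(4L, 3L, 2L, 1L),
                 y = c(10, 20, 30, 40),
                 z = 1:4)

# order() in i: sort by g ascending, then by x descending
sorted <- DT[order(g, -x)]

# Naming columns in j while computing
named <- DT[, .(sum_x = sum(x), mean_y = mean(y))]

# keyby sorts (and keys) the aggregated result; DT itself stays unkeyed
agg <- DT[, .(sum_x = sum(x)), keyby = g]

# .SDcols accepts a colA:colB range
rng <- DT[, lapply(.SD, sum), by = g, .SDcols = x:y]

# The same range form when selecting columns with with = FALSE
sel <- DT[, x:y, with = FALSE]
```

Here `key(agg)` is `"g"` while `key(DT)` remains `NULL`, which is the distinction the keyby note is asking for.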

Reference semantics:

  • Also explain all other relevant set* functions here (setnames, setcolorder, etc.).
  • Mainly set.
  • Explain that section 1b) on the := operator only illustrates the two ways of using it; the example there is not meant to run as-is, since it just shows the two forms side by side (following this comment).
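
A minimal sketch of the two := forms next to set(), setnames() and setcolorder(). The table and column names are invented here, and each statement is runnable on its own toy data rather than being taken from the vignette:

```r
library(data.table)

# Hypothetical toy data
DT <- data.table(a = 1:3, b = c(2.5, NA, 4))

# Form 1 of `:=`: character vector of names on the left, list of values on the right
DT[, c("c", "d") := .(a * 2L, b + 1)]

# Form 2 of `:=`: the functional form; assigning NULL deletes a column
DT[, `:=`(e = a + 1L, c = NULL)]

# set(): assign by reference without the overhead of [.data.table
set(DT, i = 2L, j = "b", value = 0)

# setnames() and setcolorder() also modify DT by reference
setnames(DT, "d", "b_plus_1")
setcolorder(DT, c("e", "a", "b", "b_plus_1"))
```

None of these calls needs an assignment back to `DT`; that is the point worth stressing for all set* functions.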

Keys and fast binary search based subsets:

  • Add an example of subset using integer/double keys.
  • Difference in "nomatch" default in binary search based subsets.
  • Is replacing NAs possible with binary search based subsets?
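
For the first two points, a small sketch on made-up data. It uses `nomatch = 0L` for the non-default case; the NA-replacement question is deliberately left out because I have not verified it:

```r
library(data.table)

# Hypothetical toy data with an integer and a double column
DT <- data.table(id = c(3L, 1L, 2L, 1L),
                 w  = c(0.5, 1.5, 0.5, 2.5),
                 v  = c(100, 10, 200, 40))

# Integer key: binary search subset on id
setkey(DT, id)
id_one <- DT[.(1L)]

# Default nomatch = NA: the unmatched id 5 comes back as a row of NAs
with_na <- DT[.(c(1L, 5L))]

# nomatch = 0L drops unmatched rows instead
matched_only <- DT[.(c(1L, 5L)), nomatch = 0L]

# Double key: the same subset syntax works after re-keying on w
setkey(DT, w)
w_half <- DT[.(0.5)]
```

`with_na` has three rows (the last one with `NA` in `w` and `v`), while `matched_only` has two.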

FAQ (most appropriate here, I think).

  • Update FAQ with issue on external pointer being NULL when reading an R object from file, for example, using readRDS(). Update this SO post.
  • Explain, with an example, over-allocating the data.table using alloc.col(): when to use it (when you need to create multiple columns) and why. Update this SO post.

Contributor

@matthieugomez matthieugomez commented Nov 14, 2014

I'm curious about what makes data.table's by faster than, say, tapply. One part of the answer is GForce, but what about user-written functions? I could not find anything about this. There's a nice post about pandas: http://wesmckinney.com/blog/?p=489
One could even compare it with sapply. For instance, suppose I start from a list of vectors. Is it ever worth appending all the vectors into one column of a data.table and using by instead of sapply?
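
To make the comparison concrete, here is a sketch of the three routes on a made-up list of vectors. It only shows that they give the same answer and makes no claim about timings; `mean` is, as far as I know, one of the functions GForce optimises, whereas a user-written function would not be:

```r
library(data.table)

# Hypothetical input: a named list of numeric vectors
vecs <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Route 1: sapply over the list
res_sapply <- sapply(vecs, mean)

# Route 2: stack everything into one long data.table, then group with by
long <- data.table(id    = rep(names(vecs), lengths(vecs)),
                   value = unlist(vecs, use.names = FALSE))
res_by <- long[, .(m = mean(value)), by = id]

# Route 3: tapply on the same long layout
res_tapply <- tapply(long$value, long$id, mean)
```

A benchmark would need much larger inputs and should include the cost of building `long`, which is part of the question being asked.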

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Nov 16, 2014
@arunsrinivasan arunsrinivasan self-assigned this Nov 26, 2014

@markdanese markdanese commented Nov 30, 2014

Being new to R and data.table (since March), I would say that there needs to be a basic outcome-oriented introduction as opposed to the current function-oriented one. In other words, it is one thing to read what each parameter in data.table does, but they often make little sense without having a use-case in mind. While there are examples of output, many people need to go the other direction. That is, they know what output they need, but they don't know what function/parameter/setting is most appropriate to use. It would be helpful to have a simple recipe approach to get them started.

How do I create subsets of my data?
How do I do an operation on subsets of my data to create a new or updated data set?
How do I add a new column?
How do I delete a column?
How do I create a single variable?
How do I create multiple variables?
How do I do different operations on different subsets of my data? (.BY)
How do I use data.table in a function and pass in data.table names and columns on which to operate?
How do I do multiple sequential operations on the same data.table?
Can I select a subset of data and do an operation on it at the same time?
When do I need to be careful about creating/updating variables by reference?
How do I select one observation per group (first, last)?
How do I set a key and how is it different from setting an index?
Under what conditions does my key get deleted when I do an operation on my data.table?
Can I just use the regular "merge" syntax or do I need to use data.table syntax (Y[X])?
How do I collapse a list of lists into one big data.table? What if the columns are in different order?

There are probably a ton of other items all on SO that could be edited into a simple compilation of questions and answers.
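
As a sketch of what such recipes might look like, here are short answers to a handful of the questions above. The data and column names are made up, and these are just one way of doing each task:

```r
library(data.table)

# Hypothetical toy data
DT <- data.table(id = c("a", "a", "b", "b", "c"),
                 t  = c(1L, 2L, 1L, 2L, 1L),
                 v  = c(10, 20, 30, 40, 50))

# How do I create subsets of my data?
sub <- DT[v > 15]

# How do I add a new column? Multiple columns? Delete a column?
DT[, v2 := v * 2]
DT[, c("lo", "hi") := .(v - 1, v + 1)]
DT[, lo := NULL]

# How do I select one observation per group (first, last)?
first_obs <- DT[, .SD[1L], by = id]
last_obs  <- DT[, .SD[.N], by = id]

# Can I select a subset and do an operation on it at the same time?
DT[id == "a", v := 0]

# How do I do multiple sequential operations? Chain the [ ] calls.
chained <- DT[v > 0][, .(total = sum(v)), by = id]

# How do I collapse a list of tables whose columns are in different order?
parts <- list(data.table(x = 1, y = "p"), data.table(y = "q", x = 2))
combined <- rbindlist(parts, use.names = TRUE)
```

Note that every `:=` line changes `DT` itself, which is the "careful about updating by reference" question in the list.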

@smartinsightsfromdata smartinsightsfromdata commented Jan 30, 2015

Great work on these vignettes!
My comments may be late or already covered:

  • I would like to see a variety of ways / examples of using dynamic rows and columns.
  • More extensive comparison on merge and joins.
  • Different / richer ways to use set. Also, it would be nice to see an explanation of why the following gives an error (see here):
for (j in valCols)
  set(dt_,
      i = which(is.na(dt_[[j]])),
      j = j,
      value = as.numeric(originTable[[j]]))
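
One plausible cause, which I have not checked against the linked post: `value` is the full column of originTable, while `i` selects only the NA rows, so the lengths do not line up (as far as I understand, set() expects `value` to have length 1 or one item per row in `i`). The sketch below subsets `value` with the same index. The data is made up, and it assumes dt_ and originTable have the same rows in the same order:

```r
library(data.table)

# Hypothetical stand-ins for dt_, originTable and valCols
dt_         <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6))
originTable <- data.table(a = c(10, 20, 30), b = c(40, 50, 60))
valCols     <- c("a", "b")

for (j in valCols) {
  # Rows of column j that are missing in dt_
  na_rows <- which(is.na(dt_[[j]]))
  # Supply exactly one replacement value per selected row
  set(dt_, i = na_rows, j = j,
      value = as.numeric(originTable[[j]][na_rows]))
}
```

After the loop, the NA cells in `dt_` hold the values from the matching rows of `originTable` and the other cells are untouched.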

@pakom pakom commented Nov 30, 2016

Thank you for the updated vignettes with the release of v1.9.8.
The "Reference semantics" vignette refers to the copy() function and to a planned capability to make shallow copies (especially inside functions, something that I am really interested in):

"However we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch up on this again in the data.table design vignette."

But the design vignette is missing and the link points to an old issue. The reference manual does not provide more information on copy() than what is in the vignette. The rest of the vignettes do not provide any information on copy().

Will this vignette become available soon?
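
For readers following along, the deep-copy behaviour that the quoted passage wants to improve on can be sketched like this. The two function names are hypothetical, and this shows only copy() as it exists today, not the shallow-copy feature being asked about:

```r
library(data.table)

# Hypothetical helper: adds a column by reference, so the caller's table changes too
add_col_inplace <- function(dt) {
  dt[, z := 1L]
  invisible(dt)
}

# Hypothetical helper: deep-copies first, so the caller's table is left alone
add_col_copy <- function(dt) {
  dt <- copy(dt)
  dt[, z := 1L]
  dt[]
}

DT <- data.table(a = 1:2)

out <- add_col_copy(DT)
untouched <- !("z" %in% names(DT))   # DT has no z column yet

add_col_inplace(DT)
modified <- "z" %in% names(DT)       # now DT has z as well
```

The cost of `add_col_copy()` is that every column is duplicated, which is what a shallow copy would avoid.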

Member

@MichaelChirico MichaelChirico commented Aug 11, 2017

+1 for internals vignette. I (and I guess a few others) am quite interested in contributing a bit on the C side of things, but am a bit intimidated by the (as it stands) 35k lines of C code... quite the learning curve to 'go it alone' -- an intro to internals could do wonders!

@zeomal zeomal commented Apr 24, 2020

Wanted to chime in and ask whether contributions to the vignettes are accepted from non-code contributors (like me). I am particularly interested in contributing to the joins vignette, as I had quite a bit of trouble with joins initially and was guided to solutions by Arun's answers on Stack Overflow. I'd like some guidance on how to contribute, if that is allowed.


@Henrik-P Henrik-P commented Apr 24, 2020

@arunsrinivasan I see that you have an "IDateTime vignette" item. Perhaps it could be included in the more general vignette suggested by @jangorecki: vignettes: timeseries - ordered observations?

In addition, I am preparing a first draft on some of the topics suggested by Jan. Perhaps parts of it may be relevant for a join vignette as well? I'm happy to share it if anyone finds it useful.

Member

@MichaelChirico MichaelChirico commented Apr 24, 2020

@zeomal such a contribution would be highly valuable and much appreciated!


@zeomal zeomal commented Apr 24, 2020

@MichaelChirico, thank you. @Henrik-P, will your brief on normal joins be comprehensive - i.e. will your focus be more on timeseries? If not, I can start work on it - I haven't used rolling joins yet, so no knowledge there. :)


@Henrik-P Henrik-P commented Apr 24, 2020

@zeomal Hopefully I will be able to upload the first draft soon, so you can have a look at it. In my draft, I provide a simple example of a "normal" join on a single variable, time, where there are non-matching rows. I use nomatch = NA. (maaaybe also a quick example with nomatch = NULL)

My idea was that this simple join could provide a context and a feeling for the problem, which I then treat more thoroughly in the following sections on rolling and non-equi joins et al.
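
Roughly the kind of example I have in mind, sketched on made-up data (the tables `obs` and `query` are invented here and are not taken from the draft). It assumes a data.table version that accepts `nomatch = NULL`:

```r
library(data.table)

# Hypothetical observations and the time points we want to look up
obs   <- data.table(time = c(1L, 3L, 5L), value = c(0.1, 0.3, 0.5))
query <- data.table(time = c(1L, 2L, 5L))

# "Normal" equi join, default nomatch = NA: time 2 has no match and gets NA
eq_na <- obs[query, on = "time"]

# nomatch = NULL drops the non-matching row instead
eq_drop <- obs[query, on = "time", nomatch = NULL]

# Rolling join: time 2 takes the last observation carried forward (from time 1)
rolled <- obs[query, on = "time", roll = TRUE]
```

The NA in `eq_na` is the "problem" that motivates the rolling join, where the same row is filled from the previous observation.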

Thanks a lot for your willingness to contribute!

@Henrik-P Henrik-P commented Apr 25, 2020

@zeomal If you wish to check how brief my treatment on normal (equi) joins is, I just want to let you know that I posted a PR on a timeseries vignette.

@jangorecki jangorecki removed the High label Jun 3, 2020