Showing 3 changed files with 40 additions and 71 deletions.
# What's new in 2.0.1

* Automatic package install & update feature. By default nothing changes: you need to install all the packages you need on all the nodes yourself. If you set the option `install.args` to a list, even an empty one, rmr will try to install missing packages using that list as additional arguments. The same logic applies to `update.args`. This feature is experimental and, as noted, off by default.
* Fixed `rmr.sample`. The function in 2.0 was broken due to a commit mishap.
* Hardened the `keyval` argument recycling feature against some edge cases (keys or values that are not `NULL` but have length 0).
* Local mode is again compatible with mixed-type input files. The local backend failed when multiple inputs were provided that represented different record types, which is typical of joins. No longer.
* Fixed `equijoin` with a new reduce default. The 2.0 version had a number of problems due to an incomplete port to the 2.0 API. The new reduce default does a `merge` of the left and right sides in most cases, and returns the left and right groups as they are when lists are involved.
* Hardened the streaming backend against user code or packages writing to `stdout`, a stream reserved for Hadoop streaming. We now capture all output at package load time and from the map and reduce functions and redirect it to standard error.
* Improved environment loading on the nodes. In some cases local variables could be overwritten by global environment variables; not anymore.
* Hardened the hdfs functions against some small platform variations. These seemed to produce only error messages without further consequences in most cases, but had been a source of alarm for users.

# What's new in 2.0

With 1.3 we added support for vectorized processing and structured data, and the feedback from users was encouraging. At the same time, we increased the complexity of the API. With this version we tried to define a synthesis between all the modes (record-at-a-time, vectorized and structured) present in 1.3, with the following goals:

* bring the footprint of the API back to 1.2 levels.
* make sure that no matter which corner of the API one is exercising, he or she can rely on simple properties and invariants; writing an identity mapreduce should be trivial.
* encourage writing the most efficient and idiomatic R code from the start, as opposed to writing against a simple API first and then developing a vectorized version for speed.
This is how we tried to get there:

* Hadoop data is no longer seen as a big list whose elements can be any pair of R objects (key and value), but as an on-disk representation of a variety of R data structures: lists, atomic vectors, matrices or data frames, split according to the key. The data type is determined by the type of the R variable passed to the `to.dfs` function or returned by map and reduce, or is assigned based on the format (csv files are read as data frames, text as character vectors, JSON TBD). Each key-value pair holds a subrange of the data (a range of rows where applicable).
* The `keyval` function is always vectorized. The data payload is in the value part of a key-value pair. The key is construed as an index used to split the data for its on-disk representation, particularly as it concerns the shuffle operation (the grouping that comes before the reduce phase). The model, albeit with some differences, is the R function `split`. So if `map` returns `keyval(1, matrix(...))`, the second argument of some reduce call will be another matrix that contains the matrix returned by map as a subrange of its rows. If you don't want that to happen because, say, you need to sum all the smaller matrices together rather than stack them, do not fret: have your map function return `keyval(1, list(matrix(...)))` and on the reduce side do a `Reduce("+", vv)`, where `vv` is the second argument to a reduce. Get the idea? In the first case one is building a large matrix from smaller ones; in the second, just collecting the matrices to sum them up. `keyval(NULL, x)` or, equivalently, `keyval(x)` means that we don't care how the data is split. This is not allowed in the map or combine functions, where defining the grouping is important.
* As a consequence, all lists of key-value pairs have been ostracized from the API. One `keyval` call is all that can and needs to be called in each map and reduce call.
* The `mapreduce` function is always vectorized, meaning that each `map` call processes a range of elements or, when applicable, rows of data, and each `reduce` call processes all the data associated with the same key. Please note that we are always talking in terms of R dimensions, not numbers of on-disk records, providing some independence from the exact format of the data.
* The `structured` option, which converted lists into data frames, has no reason to exist any more. What started as a list will be seen by the user as a list, data frames as data frames and so on throughout a mapreduce, removing the need for complex conversions.
* In 1.3 the default serialization switched from the very R-friendly `native` to the more efficient `sequence.typedbytes` when the `vectorized` option was on. Since that option no longer exists, we need to explain what happens to serialization. We thought that transparency was too important to give up, so R native serialization is used unless the alternative is totally transparent to the user. Right now an efficient alternative kicks in only for nameless atomic vectors, with the exclusion of character vectors. There is no need for you to know this unless you are using small key-value pairs (say 100 bytes or less) and performance is important. In that case you may find that using nameless non-character vectors gives you a performance boost. Further extensions of the alternate serialization will be considered based on use cases, but the goal is to keep them semantically transparent.

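The two grouping behaviors described above can be sketched in plain base R (this is not rmr code, just an illustration of what a reduce would see for two same-key chunks emitted by map calls):

```r
# Two value chunks emitted by map calls under the same key.
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)

# keyval(1, m) semantics: same-key matrices are stacked by rows, so the
# reduce sees one larger matrix built from the smaller ones.
stacked <- rbind(m1, m2)            # a 4x2 matrix

# keyval(1, list(m)) semantics: same-key values arrive as a list, so the
# reduce can combine them some other way, e.g. an element-wise sum.
summed <- Reduce("+", list(m1, m2)) # a 2x2 matrix
```
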
## Other improvements
* The source has been deeply refactored. The subdivision of the source into many files (`IO.R`, `extras.R`, `local.R`, `quickcheck-rmr.R`, `basic.R`, `keyval.R`, `mapreduce.R`, `streaming.R`) suggests a modularization that is not complete and not enforced by the language, but helps reduce the complexity of the implementation.
* The testing code (not the actual tests) has been factored out as a `quickcheck` package, inspired by the Haskell module of the same name. For now it is neither supported nor documented, but it belonged in a separate package. It is required only to run the tests.
* Added useful functions like `scatter`, `gather`, `rmr.sample`, `rmr.str` and `dfs.size`, following up on user feedback.
* When the `reduce` argument to `mapreduce` is `NULL`, the mapreduce job is map-only rather than having an identity reduce.
* The *war on boilerplate code* continues. `keyval(v)`, with a single argument, means `keyval(NULL, v)`. When you provide a value that is not a `keyval` return value where one is expected, one is generated with `keyval(v)`, where `v` is whatever argument was provided. For instance, `to.dfs(matrix(...))` means `to.dfs(keyval(NULL, matrix(...)))`.
* Prettified a lot of code so that it can be shown directly in documents and presentations using literate programming. This is important because the same code is shown everywhere: no copy and paste, no pseudo-code. The code in the [Tutorial] and other documents is always real, fresh and tested. Documents are generated with knitr, and presentations with knitr and pandoc. Thanks to Yihui Xie and John MacFarlane for these great pieces of software and for their direct support.
* The logistic regression test is no longer limited to two dimensions.
* The wordcount test now cleans up after itself.
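
A short sketch of the boilerplate shortcuts described above. It assumes the 2.x package loads as `rmr2` (the document calls it simply rmr) and that a backend is configured, so it is illustrative rather than guaranteed to run as-is:

```r
# Illustrative only: assumes the rmr2 package and a configured backend.
library(rmr2)

m <- matrix(1:4, nrow = 2)

# A single-argument keyval means a NULL key ("don't care how the data splits"):
kv <- keyval(m)              # equivalent to keyval(NULL, m)

# Plain R objects are promoted to key-value pairs where one is expected,
# so these two calls write the same data to the distributed file system:
to.dfs(m)
to.dfs(keyval(NULL, m))
```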