
rd second pass

1 parent 01672a5 commit 1e924ff54e69151ac3654b6abc52efe197ca332f @piccolbo piccolbo committed Sep 27, 2012
@@ -5,7 +5,7 @@
The big data object.}
\description{
-A stub representing data on disk that can be manipulated by other functions in rmr. "Stub" means that the data is not actually "there" or more concretely is not held in memory in the current process. This is a technique used in different programming languages when remote resources need to me made available. In this case the rationale is that we need to be able to process large data sets whose size is not compatible with them being held in memory at once. Nonetheless it is convenient to be able to refer to the complete data set in the language, albeit the set of operations we can perform on it is limited. Big data objects are returned by \code{\link{to.dfs}}, \code{\link{mapreduce}}, \code{\link{scatter}}, \code{\link{gather}}, \code{\link{equijoin}} and \code{\link{rmr.sample}}, and accepted as input by all of the above with the exception of \code{\link{to.dfs}} and the inclusion of \code{\link{from.dfs}}. Big data objects are NOT persistent, meaning that they are not meant to be saved beyond the limits of a session. They use temporary space and the space is reclaimed as soon as possible when the data can not be referred to any more, or at the end of a session. For data that needs to be accessible outside the current R session, you need to use paths to the file or directory where the data is or should be written to. Valid paths can be used interchangeable wherever big data objects are accepted}
+A stub representing data on disk that can be manipulated by other functions in rmr. "Stub" means that the data is not actually "there", or more concretely is not held in memory in the current process. This is a technique used in different programming languages when remote resources need to be made available. In this case the rationale is that we need to be able to process large data sets whose size is not compatible with holding them in memory all at once. Nonetheless it is convenient to be able to refer to the complete data set in the language, although the set of operations we can perform on it is limited. Big data objects are returned by \code{\link{to.dfs}}, \code{\link{mapreduce}}, \code{\link{scatter}}, \code{\link{gather}}, \code{\link{equijoin}} and \code{\link{rmr.sample}}, and accepted as input by all of the above with the exception of \code{\link{to.dfs}} and with the inclusion of \code{\link{from.dfs}}. Big data objects are NOT persistent, meaning that they are not meant to be saved beyond the limits of a session. They use temporary space, and that space is reclaimed as soon as possible once the data can no longer be referred to, or at the end of a session. For data that needs to be accessible outside the current R session, use paths to the file or directory where the data is or should be written. Valid paths can be used interchangeably wherever big data objects are accepted}
\examples{
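A minimal sketch of the lifecycle described above, assuming a working rmr backend; the sample data and the squaring map are made up for illustration:

small.ints <- to.dfs(keyval(NULL, 1:1000))    # returns a big data object, not the data itself
squares <- mapreduce(
  input = small.ints,                         # big data objects are valid inputs
  map = function(k, v) keyval(v, v^2))        # mapreduce also returns a big data object
from.dfs(squares)                             # materialize the result in the current session
# Passing a path, e.g. output = "/tmp/squares" (hypothetical), would instead make the
# result persistent beyond the session.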
@@ -17,14 +17,10 @@
\item{key}{the desired key or keys}
\item{val}{the desired value or values}}
-\details{The keyval function is used to create return values for the map and reduce functions passed as parameters to
-\code{mapreduce}, which can also return NULL. Key-value pairs are also appropriate arguments for the \code{to.dfs} function and are returned by
-\code{from.dfs}. \code{keys} and \code{values} extract keys and values resp. from a key value pair. A key value pair should be always considered vectorized, meaning that it defines a collection of key-value pairs. For the purposed of forming key-value pairs, the length of an object is considered its number of rows whene defined, that is for matrices and data frames, or its R \code{length} otherwise). Consistently with this definition, the n=th element of a key or value is its n-th row or a subrange including only the n-th element otherwise. Data types are preserved, meaning that, for instance, if the \code{key} is a matrix its n-th element is a matrix with only one row, the n-th row of the larger matrix (the behavior of the \code{[]} operator with \code{drop = FALSE}). The same is true for data frames, list and atomic vectors. When \code{key} and \code{val} have different lengths according to this definition, recycling is applied. The pairing between keys and values is modeled after the behavior of the function \code{\link{split}}, with some differences that will be detailed. This means that as many key value pairs are defined as there are \emph{distinct} keys. Each key will be associated with a subrange of \code{val} (subrange of rows when possible), specifically the subrange of \code{key} where each element is equal to \code{key}. The differences with split are the following:
-\enumerate{
-\item{\code{split} acts immediately whereas the generation of actual key-value pairs is deferred to serialization time for \code{keyval} and in general invisible to the user but for its effects on the behavior of mapreduce}
-\item{\code{keyval} treats matrices as matrices, whereas \code{split} turns them into vector}
-\item{\code{split} considers lists as data frames when provided as the grouping argument; \code{keyval} treats them as generic vectors when supplied as the key argument}}
-}
+\details{The keyval function is used to create return values for the map and reduce functions, themselves parameters to
+\code{mapreduce}. Key-value pairs are also appropriate arguments for the \code{to.dfs} function and are returned by
+\code{from.dfs}. \code{keys} and \code{values} extract keys and values respectively from a key-value pair. A key-value pair should always be considered vectorized, meaning that it defines a collection of key-value pairs. For the purpose of forming key-value pairs, the length of an object is considered to be its number of rows when defined (that is, for matrices and data frames), or its R \code{length} otherwise. Consistently with this definition, the n-th element of a key or value is its n-th row when rows are defined, or a subrange including only the n-th element otherwise. Data types are preserved, meaning that, for instance, if the \code{key} is a matrix its n-th element is a matrix with only one row, namely the n-th row of the larger matrix (the behavior of the \code{[} operator with \code{drop = FALSE}). The same is true for data frames, lists and atomic vectors. When \code{key} and \code{val} have different lengths according to this definition, recycling is applied. The pairing between keys and values is positional, meaning that the n-th key is associated with the n-th value.}
+
\examples{
#single key-val
@@ -33,8 +29,8 @@
values(keyval(1,2))
#10 kv pairs of the form (i,i)
keyval(1:10, 1:10)
-#2 kv pairs (1, c(1,3,5,6,9)) and (2, c(2,4,6,8,10))
+#10 kv pairs, keys recycled: (1, 2i-1) and (2, 2i) for i in 1:5
keyval(1:2, 1:10)
-# split mtcars data according to cyl column, create several kv pairs
+# mtcars is a data frame; each row is a value, with key set to the value of column cyl
keyval(mtcars$cyl, mtcars)
}
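An additional illustrative sketch of the vectorized semantics described in the details section; the comments describe expected behavior as it follows from that description, not verified output:

m <- matrix(1:6, nrow = 3)
kv <- keyval(1:3, m)     # three pairs; the i-th value is conceptually m[i, , drop = FALSE]
keys(kv)                 # the keys 1, 2, 3
values(kv)               # the values, preserving the matrix type, one row per pair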
@@ -15,17 +15,17 @@ make.output.format(format = make.native.output.format( keyval.length = rmr.opt
streaming.format = "org.apache.hadoop.mapred.SequenceFileOutputFormat", ...)}
\arguments{
- \item{format}{Either a string describing a predefined combination of IO settings (possibilities include: "text", "json", "csv", "native","sequence.typedbytes") or a function. For an input format, this function accepts a connection and a number of records and returns a key-value pair (see \code{\link{keyval}}. For an output format, this function accepts a key-value pair, a connection and writes the first to the connection.}
- \item{mode}{Mode can be either "text" or "binary", which tells R what type of connection to use when opening the IO connections.}
+ \item{format}{Either a string describing a predefined combination of IO settings (possibilities include: \code{"text"}, \code{"json"}, \code{"csv"}, \code{"native"}, \code{"sequence.typedbytes"}) or a function. For an input format, this function accepts a connection and a number of records and returns a key-value pair (see \code{\link{keyval}}). For an output format, this function accepts a key-value pair and a connection and writes the former to the latter.}
+ \item{mode}{Mode can be either \code{"text"} or \code{"binary"}, which tells R what type of connection to use when opening the IO connections.}
\item{streaming.format}{Class to pass to hadoop streaming as inputformat or outputformat option. This class is the first in the input chain to perform its duties on the input side and the last on the output side. Right now this option is not honored in local mode.}
- \item{\dots}{Additional arguments to the format function, for instance for the csv format they detail the specifics of the csv dialect to use, see \code{\link{read.table}} and \code{\link{write.table}} for details}
-}
+ \item{\dots}{Additional arguments to the format function. For the csv format they detail the specifics of the csv dialect to use and are the same as for \code{\link{read.table}} and \code{\link{write.table}} on the input and output side respectively. For \code{"json"}, only on the input side, one can specify a \code{key.class} and a \code{value.class} to allow more flexible mapping of the JSON data model to R's own. For the \code{"native"} and \code{"sequence.typedbytes"} output formats the user can specify a \code{keyval.length} that controls how many values to map to a single physical key-value pair when the key is \code{NULL}.}}
+
\details{
-The goal of these function is to encapsulate some of the complexity of the IO settings, providing meaningful defaults and predefined combinations. The input processing is the result of the composition of a Java class and an R function, and the same is true on the output side but in reverse order. If you don't want to deal with the full complexity of defining custom IO formats, there are pre-packaged combinations. "text" is free text, useful mostly on the input side for NLP type applications; "json" is one or two tab separated, single line JSON objects per record; "csv" is the csv format, configurable through additional arguments; "native.text" uses the internal R serialization in text mode, and was the default in previous releases, use only for backward compatibility; "native" uses the internal R serialization, offers the highest level of compatibility with R data types and is the default; "sequence.typedbytes" is a sequence file (in the Hadoop sense) where key and value are of type typedbytes, which is a simple serialization format used in connection with streaming for compatibility with other hadoop subsystems. Typedbytes is documented here \url{https://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/typedbytes/package-summary.html}
-}
+The goal of these functions is to encapsulate some of the complexity of the IO settings, providing meaningful defaults and predefined combinations. If you don't want to deal with the full complexity of defining custom IO formats, there are pre-packaged combinations. "text" is free text, useful mostly on the input side for NLP-type applications; "json" is one or two tab-separated, single-line JSON objects per record; "csv" is the csv format, configurable through additional arguments; "native.text" uses the internal R serialization in text mode, was the default in previous releases and should be used only for backward compatibility; "native" uses the internal R serialization, offers the highest level of compatibility with R data types and is the default; "sequence.typedbytes" is a sequence file (in the Hadoop sense) where key and value are of type typedbytes, a simple serialization format used in connection with streaming for compatibility with other Hadoop subsystems. Typedbytes is documented at \url{https://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/typedbytes/package-summary.html}. If you want to implement custom formats, the input processing is the result of the composition of a Java class and an R function, and the same is true on the output side but in reverse order; you can specify both as arguments to these functions.}
+
\value{
-Return a list of IO specifications, to be passed as \code{input.format} and \code{output.format} to \code{\link{mapreduce}}, and as \code{format} to \code{\link{from.dfs}} (input) and \code{\link{to.dfs}} (output).
-}
+Returns a list of IO specifications, to be passed as \code{input.format} and \code{output.format} to \code{\link{mapreduce}}, and as \code{format} to \code{\link{from.dfs}} (input) and \code{\link{to.dfs}} (output).}
+
\examples{
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
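Two hedged sketches of the interface described above: the first passes read.table-style arguments through the additional arguments of the csv format (the column names are made up); the second supplies a custom input format function honoring the stated contract, a connection and a number of records in, a key-value pair out (returning NULL at end of input is an assumption for illustration, not a documented convention):

csv.input <- make.input.format("csv", sep = ",", col.names = c("id", "value"))
lines.input <- make.input.format(
  format = function(con, nrecs) {
    lines <- readLines(con, n = nrecs)     # read up to nrecs records from the connection
    if (length(lines) == 0) NULL           # assumed to signal end of input
    else keyval(NULL, lines)               # no keys, the lines as values
  },
  mode = "text")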
@@ -20,13 +20,13 @@
\item{input}{Paths to the input folder(s) (on HDFS) or vector thereof
or the return value of another \code{mapreduce} or a \code{\link{to.dfs}} call
\item{output}{A path to the destination folder (on HDFS); if missing, a \code{\link{big.data.object}} is returned, see "Value" below}
-\item{map}{An optional R function of two arguments, returning either NULL or the return value of \code{\link{keyval}}, that specifies the map operation to execute as part of a mapreduce job. The two arguments represent multiple key-value pairs according to the definition of the mapreduce model. They can be any of the following: list, vector, matrix, data frame or NULL (the last one only allowed for keys). Keys are matched to the corresponding values by position, according to the second dimension if it is defined (that is rows in matrices and data frames, position otherwise), analogous to the behavior of \code{cbind}}
+\item{map}{An optional R function of two arguments, returning either NULL or the return value of \code{\link{keyval}}, that specifies the map operation to execute as part of a mapreduce job. The two arguments represent multiple key-value pairs according to the definition of the mapreduce model. They can be any of the following: list, vector, matrix, data frame or NULL (the last one only allowed for keys). Keys are matched to the corresponding values by position, according to the second dimension if it is defined (that is, rows in matrices and data frames, position otherwise), analogous to the behavior of \code{cbind}; see \code{\link{keyval}} for details.}
\item{reduce}{An optional R function of two arguments, a key and a data structure representing all the values associated with that key (the same type as returned by the map call, merged with \code{rbind} for matrices and data frames and \code{c} otherwise), returning either NULL or the return value of \code{\link{keyval}}, that specifies the reduce operation to execute as part of a mapreduce job. The default is no reduce phase, that is the output of the map phase is the output of the mapreduce job}
\item{combine}{A function with the same signature and possible return values as the reduce function, or TRUE, which means use the reduce function as combiner. NULL or FALSE means no combiner is used.}
-\item{input.format}{Input specification, see \code{\link{make.input.format}}}
-\item{output.format}{Output specification, see \code{\link{make.output.format}}}
+\item{input.format}{Input format specification, see \code{\link{make.input.format}}}
+\item{output.format}{Output format specification, see \code{\link{make.output.format}}}
\item{backend.parameters}{This option is for advanced users only and may be removed in the future. Specify additional, backend-specific
- options, as in \code{backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=1"), local = list())}. It is recommended not to use this argument to change the semantics of mapreduce (output should be independent of this argument). Each backend can only see the nested list named after the backend itself. The interpretationis the following: for the hadoop backend, generate an additional hadoop streaming command line argument for each element of the list, "-name value". If the value is TRUE generate "-name" only, if it is FALSE skip. One possible use is to specify the number of mappers and reducers on a per-job basis. It is not guaranteed that the generated streaming command will be a legal command. In particular, remember to put any generic options before any specific ones, as per hadoop streaming manual. For the local backend, the list is currently ignored.}
+ options, as in \code{backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=1"), local = list())}. It is recommended not to use this argument to change the semantics of mapreduce (output should be independent of this argument). Each backend can only see the nested list named after the backend itself. The interpretation is the following: for the hadoop backend, generate an additional hadoop streaming command line argument for each element of the list, "-name value". If the value is TRUE generate "-name" only; if it is FALSE skip it. One possible use is to specify the number of mappers and reducers on a per-job basis. It is not guaranteed that the generated streaming command will be a legal command. In particular, remember to put any generic options before any specific ones, as per the hadoop streaming manual. For the local backend, the list is currently ignored.}
\item{verbose}{Run hadoop in verbose mode. No effect on the local backend}}
\value{The value of \code{output}, or, when missing, a \code{\link{big.data.object}}}
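A minimal word-count style sketch built only from the arguments documented above; the input path is a placeholder and the "text" format is assumed to deliver lines of text as values:

wordcount <- mapreduce(
  input = "/tmp/some-text-input",                             # hypothetical HDFS path
  input.format = make.input.format("text"),
  map = function(k, v) keyval(unlist(strsplit(v, " ")), 1),   # one pair per word, value 1
  reduce = function(k, vv) keyval(k, sum(vv)),                # vv is the c() of all values for key k
  combine = TRUE,                                             # reuse the reduce function as combiner
  backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=1")))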
@@ -1,19 +1,18 @@
\name{rmr.sample}
\alias{rmr.sample}
+
\title{Sample large data sets}
+
\description{Sample large data sets}
+
\usage{rmr.sample(input, output = NULL, method = c("any", "Bernoulli"), ...)}
-%- maybe also 'usage' for other objects documented here.
+
\arguments{
\item{input}{The data set to be sampled as a file path or \code{\link{mapreduce}} return value}
\item{output}{Where to store the result. See \code{\link{mapreduce}}, output argument, for details}
- \item{method}{One of "any" or "Bernoulli". "any" will return some records out, optimized for speed, but with no statistical guarantees. "Bernoulli" is what is says, independent sampling according to the Bernoulli distribution}
+ \item{method}{One of "any" or "Bernoulli". "any" will return some of the records, optimized for speed, but with no statistical guarantees. "Bernoulli" implements independent sampling according to the Bernoulli distribution}
\item{\dots}{Additional arguments to fully specify the sample; they depend on the method selected. If it is "any" then the size of the desired sample should be provided as the argument \code{n}. If it is "Bernoulli" the argument \code{p} specifies the probability of picking each record}}
-\details{
-}
+
\value{
-The sampled data. See \code{\link{mapreduce}} for details.
-}
+The sampled data. See \code{\link{mapreduce}} for details.}
-\examples{
-}
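Hedged usage sketches of the two methods, with input standing for any valid path or big data object:

rmr.sample(input, method = "any", n = 100)          # fast, no statistical guarantees
rmr.sample(input, method = "Bernoulli", p = 0.01)   # each record kept independently with probability p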
