
a second pass for the Tutorial

1 parent a122889 · commit 28a04720186596edc9416be7b122d7eb19f70bb2 · @piccolbo committed Sep 25, 2012
Showing with 36 additions and 38 deletions.
  1. +12 −13 rmr2/docs/tutorial.Rmd
  2. +12 −12 rmr2/docs/tutorial.html
  3. +12 −13 rmr2/docs/tutorial.md
rmr2/docs/tutorial.Rmd
@@ -21,7 +21,7 @@ The example is trivial, just computing the first 10 squares, but we just want to
```
-This is all it takes to write your first mapreduce job in `rmr`. There are some differences that we will review, but the first thing to notice is that it isn't all that different, and just two lines of code. The first line puts the data into HDFS, where the bulk of the data has to reside for mapreduce to operate on. It is not possible to write out big data with `to.dfs`, not in a scalable way. `to.dfs` is nonetheless very useful for a variety of uses like writing test cases, learning and debugging. `to.dfs` can put the data in a file of your own choosing, but if you don't specify one it will create tempfiles and clean them up when done. The return value is something we call a *big data object*. You can assign it to variables, pass it to other `rmr` functions, mapreduce jobs or read it back in. It is a stub, that is the data is not in memory, only some information that helps finding and managing the data. This way you can refer to very large data sets whose sixe exceeds memory limits.
+This is all it takes to write your first mapreduce job in `rmr`. There are some differences that we will review, but the first thing to notice is that it isn't all that different, and just two lines of code. The first line puts the data into HDFS, where the bulk of the data has to reside for mapreduce to operate on. It is not possible to write out big data with `to.dfs`, not in a scalable way. `to.dfs` is nonetheless very useful for a variety of uses like writing test cases, learning and debugging. `to.dfs` can put the data in a file of your own choosing, but if you don't specify one it will create temp files and clean them up when done. The return value is something we call a *big data object*. You can assign it to variables, pass it to other `rmr` functions, mapreduce jobs or read it back in. It is a stub, that is the data is not in memory, only some information that helps finding and managing the data. This way you can refer to very large data sets whose size exceeds memory limits.
Now onto the second line. It has `mapreduce` replace `lapply`. We prefer named arguments with `mapreduce` because there's quite a few possible arguments, but it's not mandatory. The input is the variable `small.ints` which contains the output of `to.dfs`, that is a stub for our small number data set in its HDFS version, but it could be a file path or a list containing a mix of both. The function to apply, which is called a map function as opposed to the reduce function, which we are not using here, is a regular R function with a few constraints:
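
For concreteness, here is a minimal sketch of the two lines being discussed, plus a read-back step; the `1:10` input and the final `from.dfs` call are assumptions, only the `mapreduce` call appears verbatim later in this diff.

```r
library(rmr2)

# move a small in-memory vector into HDFS; the return value is a big data object (a stub)
small.ints = to.dfs(1:10)

# map-only job: for each value v emit the columns (v, v^2); no reduce is specified
out = mapreduce(input = small.ints,
                map = function(k, v) cbind(v, v^2))

# the result is small, so it is safe to read it back into memory for inspection
from.dfs(out)
```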
@@ -43,7 +43,7 @@ This creates a sample from the binomial and counts how many times each outcome o
First we move the data into HDFS with `to.dfs`. As we said earlier, this is not the normal way in which big data will enter HDFS; it is normally the responsibility of scalable data collection systems such as Flume or Sqoop. In that case we would just specify the HDFS path to the data as input to `mapreduce`. But in this case the input is the variable `groups` which contains a big data object, which keeps track of where the data is and does the clean up when the data is no longer needed. Since a map function is not specified it is set to the default, which is like an identity but consistent with the map requirements, that is `function(k,v) keyval(k,v)`. The reduce function takes two arguments, one is a key and the other is a collection of all the values associated with that key. It could be one of vector, list, data frame or matrix depending on what was returned by the map function. The idea is that if the user returned values of one class, we should preserve that through the shuffle phase. Like in the map case, the reduce function can return `NULL`, a key-value pair as generated by the function `keyval` or any other object `x` which is equivalent to `keyval(NULL, x)`. The default is no reduce, that is the output of the map is the output of mapreduce. In this case the keys are realizations of the binomial and the values are all `1` (please note recycling in action) and the only important thing is how many there are, so `length` gets the job done. Looking back at this second example, there are some small differences with `tapply` but the overall complexity is very similar.
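
A sketch consistent with this description follows; the sample size and binomial parameters are placeholders, and the actual chunk is only referenced, not shown, in the Rmd diff.

```r
library(rmr2)

# the binomial realizations become the keys; the single 1 is recycled so that every key
# gets the value 1, as noted in the text
groups = to.dfs(keyval(rbinom(n = 50, size = 32, prob = 0.4), 1))

# no map argument: the default identity-like map passes key-value pairs through;
# each reduce call sees one outcome k and all of its 1s, so length() is the count
from.dfs(
  mapreduce(input = groups,
            reduce = function(k, vv) keyval(k, length(vv))))
```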
-## Wordcount
+## Word count
The word count program has become a sort of "hello world" of the mapreduce world. For a review of how the same task can be accomplished in several languages, but always for mapreduce, see this [blog entry](http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html).
@@ -89,12 +89,12 @@ As you can see we have an input representing the training data. For simplicity w
We start by initializing the separating plane and defining the logistic function. As before, those variables will be used inside the map function, that is they will travel across interpreter and processor and network barriers to be available where the developer needs them and where a traditional, meaning sequential, R developer expects them to be available according to scope rules — no boilerplate code and familiar, powerful behavior.
-Then we have the main loop where computing the gradient of the loss function is the duty of a map reduce job, whose output is brought into main memory with a call to `from.dfs` — there are temp files being created and destroyed behind the scenes but you don't need to know. The only important thing you need to know is that the gradient is going to fit in memory so we can call `from.dfs` to get it without exceeding available resources.
+Then we have the main loop where computing the gradient of the loss function is the duty of a map reduce job, whose output is brought into main memory with a call to `from.dfs` — any intermediate result files are managed by the system, not you. The only important thing you need to know is that the gradient is going to fit in memory so we can call `from.dfs` to get it without exceeding available resources.
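
As a hedged sketch of that driver loop (the learning rate, the argument list and the `combine = TRUE` setting are assumptions; `lr.map` and `lr.reduce` are the chunks discussed next):

```r
logistic.regression =
  function(input, iterations, dims, alpha = 0.05) { # alpha is a placeholder learning rate
    plane = t(rep(0, dims))             # the separating plane, reachable from the map via scope
    g = function(z) 1 / (1 + exp(-z))   # the logistic function, likewise reachable via scope
    for(i in 1:iterations) {
      gradient =
        values(from.dfs(                # bring the (small) summed gradient into memory
          mapreduce(input,
                    map = lr.map,
                    reduce = lr.reduce,
                    combine = TRUE)))
      plane = plane + alpha * gradient} # plain gradient step, done in the driver
    plane}
```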
```{r logistic.regression-map}
```
-The map function simply computes the contribution of a subset of points to the gradient. Please note the variables `g` and `plane` making their necessary appearance here without any work on the developer's part. The access here is read only but you could even modify them if you wanted — the semantics is copy on assign, which is consistent with how R works and easily supported by hadoop. Since in the next step we just want to add everything together, we return a dummy, constant key for each value. Note the use of recycling in `keyval`.
+The map function simply computes the contribution of a subset of points to the gradient. Please note the variables `g` and `plane` making their necessary appearance here without any work on the developer's part. The access here is read only but you could even modify them if you wanted — the semantics is copy on assign, which is consistent with how R works and easily supported by Hadoop. Since in the next step we just want to add everything together, we return a dummy, constant key for each value. Note the use of recycling in `keyval`.
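
The chunk itself appears further down in this diff with its middle elided; the following is a plausible reconstruction, with the column layout of `M` (label first, features after) an explicit assumption:

```r
lr.map =
  function(dummy, M) {
    Y = M[, 1]                          # assumed: labels in the first column
    X = M[, -1]                         # assumed: features in the remaining columns
    keyval(1,                           # one constant dummy key, recycled over all rows
           Y * X * g(-Y * as.numeric(X %*% t(plane))))}  # per-point gradient contributions
```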
```{r logistic.regression-reduce}
```
@@ -112,11 +112,11 @@ We are talking about k-means. This is not a production ready implementation, but
```{r kmeans-dist.fun}
```
-This is simply a distance function, the only noteworthy proprty of which is that it can compute all the distance between a matrix of centers `C` and a matrix of points `P` very efficiently, on my laptop it can do 10^6 points and 10^2 centers in 5 dimensions in approx. 16s. The only explicit iteration is over the dimension, but all the other operations are vectorized (e.g. loops are pushed to the C library), hence the speed.
+This is simply a distance function, the only noteworthy property of which is that it can compute all the distance between a matrix of centers `C` and a matrix of points `P` very efficiently, on my laptop it can do 10^6 points and 10^2 centers in 5 dimensions in approx. 16s. The only explicit iteration is over the dimension, but all the other operations are vectorized (e.g. loops are pushed to the C library), hence the speed.
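
The distance chunk is only referenced here; this is a sketch in the same spirit, not necessarily the tutorial's exact code: the one explicit loop is over the dimensions, everything else is vectorized.

```r
dist.fun =
  function(C, P) {
    D = matrix(0, nrow = nrow(P), ncol = nrow(C))
    for(d in 1:ncol(P))                      # explicit loop over dimensions only
      D = D + outer(P[, d], C[, d], "-")^2   # add squared differences for coordinate d
    D}                                       # D[i, j] = squared distance from point i to center j
```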
```{r kmeans.map.1}
```
-The role of the map function is to compute distances between some points and all centers and return for each point the closest center. It has two flavors contolled by the main `if`: the first iteration when no candidate centers are available and all the following ones. Please note that while the points are stored in HDFS and provided to the map function as its second argument, the centers are simply stored in a matrix and available in the map function because of normal scope rules. In the first iteration, each point is randomly assigned to a center, whereas in the following ones a min distance criterion is used. Finally notice the vectorized use of keyval whereby all the center-point pairs are returned in one statement (the correspondence is positional, with the second dimension used when present).
+The role of the map function is to compute distances between some points and all centers and return for each point the closest center. It has two flavors controlled by the main `if`: the first iteration when no candidate centers are available and all the following ones. Please note that while the points are stored in HDFS and provided to the map function as its second argument, the centers are simply stored in a matrix and available in the map function because of normal scope rules. In the first iteration, each point is randomly assigned to a center, whereas in the following ones a min distance criterion is used. Finally notice the vectorized use of keyval whereby all the center-point pairs are returned in one statement (the correspondence is positional, with the second dimension used when present).
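
A sketch of a map function with those two flavors; `C` (the current centers, possibly `NULL`) and `num.clusters` are assumed to reach the function through normal scope rules, exactly as described above, and `max.col(-D)` is one way to take the per-row argmin.

```r
kmeans.map.1 =
  function(k, P) {
    nearest =
      if(is.null(C))
        # first iteration: no candidate centers yet, assign points at random
        sample(1:num.clusters, nrow(P), replace = TRUE)
      else {
        D = dist.fun(C, P)              # points x centers matrix of squared distances
        max.col(-D)}                    # index of the closest center for each point
    keyval(nearest, P)}                 # vectorized keyval: one pair per row of P
```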
```{r kmeans.reduce.1}
```
@@ -125,12 +125,12 @@ The reduce function couldn't be simpler as it just computes column averages of a
```{r kmeans-main}
```
-The main loop does nothing but bring into memory the result of a mapreduce job with the two above functions as mapper and reducer and the big data object with the points as input. Once the keys are discarded, the values form a matrix which become the new centers. The last two lines before the return value are a heuristic to keep the number of centers the desired one (when centers are nearest to no points, they are lost). To run this function we neef some data:
+The main loop does nothing but bring into memory the result of a mapreduce job with the two above functions as mapper and reducer and the big data object with the points as input. Once the keys are discarded, the values form a matrix which become the new centers. The last two lines before the return value are a heuristic to keep the number of centers the desired one (when centers are nearest to no points, they are lost). To run this function we need some data:
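
A sketch of that driver, with the call signature taken from the run shown later in this diff; the re-seeding heuristic in the last lines is an assumption standing in for the actual one.

```r
kmeans.mr =
  function(P, num.clusters, num.iter) {
    C = NULL
    for(i in 1:num.iter) {
      # new centers = per-cluster column averages computed by the map/reduce pair above
      C = values(from.dfs(
        mapreduce(input = P,
                  map = kmeans.map.1,
                  reduce = kmeans.reduce.1)))
      # heuristic (placeholder): if some centers lost all their points, top the set
      # back up with jittered copies of surviving centers
      if(nrow(C) < num.clusters)
        C = rbind(C, jitter(C[sample(1:nrow(C), num.clusters - nrow(C), replace = TRUE), , drop = FALSE]))}
    C}
```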
```{r kmeans-data}
```
- That is, create a large matrix with a few rows repeated many times and then add some noise.
+ That is, create a large matrix with a few rows repeated many times and then add some noise. Finally, the function call:
```{r kmeans-run}
```
@@ -139,10 +139,9 @@ With a little extra work you can even get pretty visualizations like [this one](
## Linear Least Squares
- We are going to build another example, LLS, that illustrates how to build map reduce reusable abstractions and how to combine them to solve a larger task. We want to solve LLS under the assumption that we have too many data points to fit in memory but not such a huge number of variables that we need to implement the whole process as map reduce job. This is sort of a hybrid solution that is made particularly easy by the seamless integration of `rmr` with R and an example of a pragmatic approach to big data. If we have operations A, B, and C in a cascade and the data sizes decrease at each step and we have already an in-memory solution to it, than we might get away by replacing only the first step with a big data solution and then continuing with tried and true function and pacakges. To make this as easy as possible, we need the in memory and big data worlds to integrate easily.
+This is an example of a hybrid mapreduce-conventional solution to a well known problem. We will start with a mapreduce job that results in a smaller data set that can be brought into main memory and processed in a single R instance. This is straightforward in rmr because of the simple primitive that transfers data into memory, `from.dfs`, and the R-native data model. This is in contrast with hybrid pig-java-python solutions where mapping data types from one language to the other is a time-consuming and error-prone chore the developer has to deal with.
-
-This is the basic equation we want to solve in the least square sense:
+Specifically, we want to solve LLS under the assumption that we have too many data points to fit in memory but not such a huge number of variables that we need to implement the whole process as map reduce job. This is the basic equation we want to solve in the least square sense:
**X** **β** = **y**
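
Once the crossproducts defined below have been computed, the in-memory half of the hybrid is just the usual normal-equation solve; a one-line sketch, using the `XtX` and `Xty` names from the chunks that follow:

```r
# X'X and X'y are small (variables x variables and variables x 1),
# so the final step runs comfortably in a single R instance
beta = solve(XtX, Xty)
```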
@@ -164,12 +163,12 @@ The next is a reusable reduce function that just sums a list of matrices, ignore
```{r LLS-sum}
```
-The big matrix is passed to the mapper in chunks of complete rows. Smaller crossproducts are computed for these submatrices and passed on to a single reducer, which sums them together. Since we have a single key a combiner is mandatory and since matrix sum is associative and commutatitve we certainly can use it here.
+The big matrix is passed to the mapper in chunks of complete rows. Smaller cross-products are computed for these submatrices and passed on to a single reducer, which sums them together. Since we have a single key a combiner is mandatory and since matrix sum is associative and commutative we certainly can use it here.
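
A sketch of the shape of that job, pieced together from the fragments visible later in this diff; the input name `X` and the way each chunk is handled in the map are placeholders.

```r
XtX =
  values(
    from.dfs(
      mapreduce(
        input = X,                        # placeholder: the big matrix, stored in HDFS
        map = function(., Xi)
          keyval(1, list(t(Xi) %*% Xi)),  # partial crossproduct for one chunk of rows
        reduce = Sum,                     # the reusable matrix-sum reducer above
        combine = TRUE)))[[1]]            # combiner on (single key); [[1]] unwraps the list
```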
```{r LLS-XtX}
```
-The same pretty much goes on also for vectory y, which is made available to the nodes according to normal scope rules.
+The same pretty much goes on also for vector y, which is made available to the nodes according to normal scope rules.
```{r LLS-Xty}
```
rmr2/docs/tutorial.html
@@ -195,7 +195,7 @@
mapreduce(input = small.ints, map = function(k,v) cbind(v,v^2))
</code></pre>
-<p>This is all it takes to write your first mapreduce job in <code>rmr</code>. There are some differences that we will review, but the first thing to notice is that it isn&#39;t all that different, and just two lines of code. The first line puts the data into HDFS, where the bulk of the data has to reside for mapreduce to operate on. It is not possible to write out big data with <code>to.dfs</code>, not in a scalable way. <code>to.dfs</code> is nonetheless very useful for a variety of uses like writing test cases, learning and debugging. <code>to.dfs</code> can put the data in a file of your own choosing, but if you don&#39;t specify one it will create tempfiles and clean them up when done. The return value is something we call a <em>big data object</em>. You can assign it to variables, pass it to other <code>rmr</code> functions, mapreduce jobs or read it back in. It is a stub, that is the data is not in memory, only some information that helps finding and managing the data. This way you can refer to very large data sets whose sixe exceeds memory limits. </p>
+<p>This is all it takes to write your first mapreduce job in <code>rmr</code>. There are some differences that we will review, but the first thing to notice is that it isn&#39;t all that different, and just two lines of code. The first line puts the data into HDFS, where the bulk of the data has to reside for mapreduce to operate on. It is not possible to write out big data with <code>to.dfs</code>, not in a scalable way. <code>to.dfs</code> is nonetheless very useful for a variety of uses like writing test cases, learning and debugging. <code>to.dfs</code> can put the data in a file of your own choosing, but if you don&#39;t specify one it will create temp files and clean them up when done. The return value is something we call a <em>big data object</em>. You can assign it to variables, pass it to other <code>rmr</code> functions, mapreduce jobs or read it back in. It is a stub, that is the data is not in memory, only some information that helps finding and managing the data. This way you can refer to very large data sets whose size exceeds memory limits. </p>
<p>Now onto the second line. It has <code>mapreduce</code> replace <code>lapply</code>. We prefer named arguments with <code>mapreduce</code> because there&#39;s quite a few possible arguments, but it&#39;s not mandatory. The input is the variable <code>small.ints</code> which contains the output of <code>to.dfs</code>, that is a stub for our small number data set in its HDFS version, but it could be a file path or a list containing a mix of both. The function to apply, which is called a map function as opposed to the reduce function, which we are not using here, is a regular R function with a few constraints:</p>
@@ -222,7 +222,7 @@
<p>First we move the data into HDFS with <code>to.dfs</code>. As we said earlier, this is not the normal way in which big data will enter HDFS; it is normally the responsibility of scalable data collection systems such as Flume or Sqoop. In that case we would just specify the HDFS path to the data as input to <code>mapreduce</code>. But in this case the input is the variable <code>groups</code> which contains a big data object, which keeps track of where the data is and does the clean up when the data is no longer needed. Since a map function is not specified it is set to the default, which is like an identity but consistent with the map requirements, that is <code>function(k,v) keyval(k,v)</code>. The reduce function takes two arguments, one is a key and the other is a collection of all the values associated with that key. It could be one of vector, list, data frame or matrix depending on what was returned by the map function. The idea is that if the user returned values of one class, we should preserve that through the shuffle phase. Like in the map case, the reduce function can return <code>NULL</code>, a key-value pair as generated by the function <code>keyval</code> or any other object <code>x</code> which is equivalent to <code>keyval(NULL, x)</code>. The default is no reduce, that is the output of the map is the output of mapreduce. In this case the keys are realizations of the binomial and the values are all <code>1</code> (please note recycling in action) and the only important thing is how many there are, so <code>length</code> gets the job done. Looking back at this second example, there are some small differences with <code>tapply</code> but the overall complexity is very similar.</p>
-<h2>Wordcount</h2>
+<h2>Word count</h2>
<p>The word count program has become a sort of &ldquo;hello world&rdquo; of the mapreduce world. For a review of how the same task can be accomplished in several languages, but always for mapreduce, see this <a href="http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html">blog entry</a>.</p>
@@ -283,7 +283,7 @@
<p>We start by initializing the separating plane and defining the logistic function. As before, those variables will be used inside the map function, that is they will travel across interpreter and processor and network barriers to be available where the developer needs them and where a traditional, meaning sequential, R developer expects them to be available according to scope rules &mdash; no boilerplate code and familiar, powerful behavior.</p>
-<p>Then we have the main loop where computing the gradient of the loss function is the duty of a map reduce job, whose output is brought straight into main memory with a call to <code>from.dfs</code> &mdash; there are temp files being created and destroyed behind the scenes but you don&#39;t need to know. The only important thing you need to know is that the gradient is going to fit in memory so we can call <code>from.dfs</code> to get it without exceeding available resources.</p>
+<p>Then we have the main loop where computing the gradient of the loss function is the duty of a map reduce job, whose output is brought into main memory with a call to <code>from.dfs</code> &mdash; any intermediate result files are managed by the system, not you. The only important thing you need to know is that the gradient is going to fit in memory so we can call <code>from.dfs</code> to get it without exceeding available resources.</p>
<pre><code class="r"> lr.map =
function(dummy, M) {
@@ -293,7 +293,7 @@
Y * X * g(-Y * as.numeric(X %*% t(plane))))}
</code></pre>
-<p>The map function simply computes the contribution of a subset of points to the gradient. Please note the variables <code>g</code> and <code>plane</code> making their necessary appearance here without any work on the developer&#39;s part. The access here is read only but you could even modify them if you wanted &mdash; the semantics is copy on assign, which is consistent with how R works and easily supported by hadoop. Since in the next step we just want to add everything together, we return a dummy, constant key for each value. Note the use of recycling in <code>keyval</code>.</p>
+<p>The map function simply computes the contribution of a subset of points to the gradient. Please note the variables <code>g</code> and <code>plane</code> making their necessary appearance here without any work on the developer&#39;s part. The access here is read only but you could even modify them if you wanted &mdash; the semantics is copy on assign, which is consistent with how R works and easily supported by Hadoop. Since in the next step we just want to add everything together, we return a dummy, constant key for each value. Note the use of recycling in <code>keyval</code>.</p>
<pre><code class="r"> lr.reduce =
function(k, Z) keyval(k, t(as.matrix(apply(Z,2,sum))))
@@ -317,7 +317,7 @@
ncol = length(x)))}
</code></pre>
-<p>This is simply a distance function, the only noteworthy proprty of which is that it can compute all the distance between a matrix of centers <code>C</code> and a matrix of points <code>P</code> very efficiently, on my laptop it can do 10<sup>6</sup> points and 10<sup>2</sup> centers in 5 dimensions in approx. 16s. The only explicit iteration is over the dimension, but all the other operations are vectorized (e.g. loops are pushed to the C library), hence the speed.</p>
+<p>This is simply a distance function, the only noteworthy property of which is that it can compute all the distance between a matrix of centers <code>C</code> and a matrix of points <code>P</code> very efficiently, on my laptop it can do 10<sup>6</sup> points and 10<sup>2</sup> centers in 5 dimensions in approx. 16s. The only explicit iteration is over the dimension, but all the other operations are vectorized (e.g. loops are pushed to the C library), hence the speed.</p>
<pre><code class="r"> kmeans.map.1 =
function(k, P) {
@@ -330,7 +330,7 @@
keyval(nearest, P) }
</code></pre>
-<p>The role of the map function is to compute distances between some points and all centers and return for each point the closest center. It has two flavors contolled by the main <code>if</code>: the first iteration when no candidate centers are available and all the following ones. Please note that while the points are stored in HDFS and provided to the map function as its second argument, the centers are simply stored in a matrix and available in the map function because of normal scope rules. In the first iteration, each point is randomly assigned to a center, whereas in the following ones a min distance criterion is used. Finally notice the vectorized use of keyval whereby all the center-point pairs are returned in one statement (the correspondence is positional, with the second dimension used when present).</p>
+<p>The role of the map function is to compute distances between some points and all centers and return for each point the closest center. It has two flavors controlled by the main <code>if</code>: the first iteration when no candidate centers are available and all the following ones. Please note that while the points are stored in HDFS and provided to the map function as its second argument, the centers are simply stored in a matrix and available in the map function because of normal scope rules. In the first iteration, each point is randomly assigned to a center, whereas in the following ones a min distance criterion is used. Finally notice the vectorized use of keyval whereby all the center-point pairs are returned in one statement (the correspondence is positional, with the second dimension used when present).</p>
<pre><code class="r"> kmeans.reduce.1 =
function(x, P) {
@@ -350,14 +350,14 @@
C}
</code></pre>
-<p>The main loop does nothing but bring into memory the result of a mapreduce job with the two above functions as mapper and reducer and the big data object with the points as input. Once the keys are discarded, the values form a matrix which become the new centers. The last two lines before the return value are a heuristic to keep the number of centers the desired one (when centers are nearest to no points, they are lost). To run this function we neef some data:</p>
+<p>The main loop does nothing but bring into memory the result of a mapreduce job with the two above functions as mapper and reducer and the big data object with the points as input. Once the keys are discarded, the values form a matrix which become the new centers. The last two lines before the return value are a heuristic to keep the number of centers the desired one (when centers are nearest to no points, they are lost). To run this function we need some data:</p>
<pre><code class="r"> input =
do.call(rbind, rep(list(matrix(rnorm(10, sd = 10), ncol=2)), 20)) +
matrix(rnorm(200), ncol =2)
</code></pre>
-<p>That is, create a large matrix with a few rows repeated many times and then add some noise.</p>
+<p>That is, create a large matrix with a few rows repeated many times and then add some noise. Finally, the function call:</p>
<pre><code class="r"> kmeans.mr(to.dfs(input), num.clusters = 12, num.iter= 5)
</code></pre>
@@ -366,9 +366,9 @@
<h2>Linear Least Squares</h2>
-<p>We are going to build another example, LLS, that illustrates how to build map reduce reusable abstractions and how to combine them to solve a larger task. We want to solve LLS under the assumption that we have too many data points to fit in memory but not such a huge number of variables that we need to implement the whole process as map reduce job. This is sort of a hybrid solution that is made particularly easy by the seamless integration of <code>rmr</code> with R and an example of a pragmatic approach to big data. If we have operations A, B, and C in a cascade and the data sizes decrease at each step and we have already an in-memory solution to it, than we might get away by replacing only the first step with a big data solution and then continuing with tried and true function and pacakges. To make this as easy as possible, we need the in memory and big data worlds to integrate easily.</p>
+<p>This is an example of a hybrid mapreduce-conventional solution to a well known problem. We will start with a mapreduce job that results in a smaller data set that can be brought into main memory and processed in a single R instance. This is straightforward in rmr because of the simple primitive that transfers data into memory, <code>from.dfs</code>, and the R-native data model. This is in contrast with hybrid pig-java-python solutions where mapping data types from one language to the other is a time-consuming and error-prone chore the developer has to deal with.</p>
-<p>This is the basic equation we want to solve in the least square sense: </p>
+<p>Specifically, we want to solve LLS under the assumption that we have too many data points to fit in memory but not such a huge number of variables that we need to implement the whole process as map reduce job. This is the basic equation we want to solve in the least square sense: </p>
<p><strong>X</strong> <strong>&beta;</strong> = <strong>y</strong></p>
@@ -392,7 +392,7 @@
<pre><code class="r">Sum = function(k, YY) keyval(1, list(Reduce(&#39;+&#39;, YY)))
</code></pre>
-<p>The big matrix is passed to the mapper in chunks of complete rows. Smaller crossproducts are computed for these submatrices and passed on to a single reducer, which sums them together. Since we have a single key a combiner is mandatory and since matrix sum is associative and commutatitve we certainly can use it here.</p>
+<p>The big matrix is passed to the mapper in chunks of complete rows. Smaller cross-products are computed for these submatrices and passed on to a single reducer, which sums them together. Since we have a single key a combiner is mandatory and since matrix sum is associative and commutative we certainly can use it here.</p>
<pre><code class="r">XtX =
values(
@@ -406,7 +406,7 @@
combine = TRUE)))[[1]]
</code></pre>
-<p>The same pretty much goes on also for vectory y, which is made available to the nodes according to normal scope rules.</p>
+<p>The same pretty much goes on also for vector y, which is made available to the nodes according to normal scope rules.</p>
<pre><code class="r">Xty =
values(
rmr2/docs/tutorial.md
@@ -29,7 +29,7 @@ The example is trivial, just computing the first 10 squares, but we just want to
-This is all it takes to write your first mapreduce job in `rmr`. There are some differences that we will review, but the first thing to notice is that it isn't all that different, and just two lines of code. The first line puts the data into HDFS, where the bulk of the data has to reside for mapreduce to operate on. It is not possible to write out big data with `to.dfs`, not in a scalable way. `to.dfs` is nonetheless very useful for a variety of uses like writing test cases, learning and debugging. `to.dfs` can put the data in a file of your own choosing, but if you don't specify one it will create tempfiles and clean them up when done. The return value is something we call a *big data object*. You can assign it to variables, pass it to other `rmr` functions, mapreduce jobs or read it back in. It is a stub, that is the data is not in memory, only some information that helps finding and managing the data. This way you can refer to very large data sets whose sixe exceeds memory limits.
+This is all it takes to write your first mapreduce job in `rmr`. There are some differences that we will review, but the first thing to notice is that it isn't all that different, and just two lines of code. The first line puts the data into HDFS, where the bulk of the data has to reside for mapreduce to operate on. It is not possible to write out big data with `to.dfs`, not in a scalable way. `to.dfs` is nonetheless very useful for a variety of uses like writing test cases, learning and debugging. `to.dfs` can put the data in a file of your own choosing, but if you don't specify one it will create temp files and clean them up when done. The return value is something we call a *big data object*. You can assign it to variables, pass it to other `rmr` functions, mapreduce jobs or read it back in. It is a stub, that is the data is not in memory, only some information that helps finding and managing the data. This way you can refer to very large data sets whose size exceeds memory limits.
Now onto the second line. It has `mapreduce` replace `lapply`. We prefer named arguments with `mapreduce` because there's quite a few possible arguments, but it's not mandatory. The input is the variable `small.ints` which contains the output of `to.dfs`, that is a stub for our small number data set in its HDFS version, but it could be a file path or a list containing a mix of both. The function to apply, which is called a map function as opposed to the reduce function, which we are not using here, is a regular R function with a few constraints:
@@ -59,7 +59,7 @@ This creates a sample from the binomial and counts how many times each outcome o
First we move the data into HDFS with `to.dfs`. As we said earlier, this is not the normal way in which big data will enter HDFS; it is normally the responsibility of scalable data collection systems such as Flume or Sqoop. In that case we would just specify the HDFS path to the data as input to `mapreduce`. But in this case the input is the variable `groups` which contains a big data object, which keeps track of where the data is and does the clean up when the data is no longer needed. Since a map function is not specified it is set to the default, which is like an identity but consistent with the map requirements, that is `function(k,v) keyval(k,v)`. The reduce function takes two arguments, one is a key and the other is a collection of all the values associated with that key. It could be one of vector, list, data frame or matrix depending on what was returned by the map function. The idea is that if the user returned values of one class, we should preserve that through the shuffle phase. Like in the map case, the reduce function can return `NULL`, a key-value pair as generated by the function `keyval` or any other object `x` which is equivalent to `keyval(NULL, x)`. The default is no reduce, that is the output of the map is the output of mapreduce. In this case the keys are realizations of the binomial and the values are all `1` (please note recycling in action) and the only important thing is how many there are, so `length` gets the job done. Looking back at this second example, there are some small differences with `tapply` but the overall complexity is very similar.
-## Wordcount
+## Word count
The word count program has become a sort of "hello world" of the mapreduce world. For a review of how the same task can be accomplished in several languages, but always for mapreduce, see this [blog entry](http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html).
@@ -140,7 +140,7 @@ As you can see we have an input representing the training data. For simplicity w
We start by initializing the separating plane and defining the logistic function. As before, those variables will be used inside the map function, that is they will travel across interpreter and processor and network barriers to be available where the developer needs them and where a traditional, meaning sequential, R developer expects them to be available according to scope rules &mdash; no boilerplate code and familiar, powerful behavior.
-Then we have the main loop where computing the gradient of the loss function is the duty of a map reduce job, whose output is brought straight into main memory with a call to `from.dfs` &mdash; there are temp files being created and destroyed behind the scenes but you don't need to know. The only important thing you need to know is that the gradient is going to fit in memory so we can call `from.dfs` to get it without exceeding available resources.
+Then we have the main loop where computing the gradient of the loss function is the duty of a map reduce job, whose output is brought into main memory with a call to `from.dfs` &mdash; any intermediate result files are managed by the system, not you. The only important thing you need to know is that the gradient is going to fit in memory so we can call `from.dfs` to get it without exceeding available resources.
```r
@@ -153,7 +153,7 @@ Then we have the main loop where computing the gradient of the loss function is
```
-The map function simply computes the contribution of a subset of points to the gradient. Please note the variables `g` and `plane` making their necessary appearance here without any work on the developer's part. The access here is read only but you could even modify them if you wanted &mdash; the semantics is copy on assign, which is consistent with how R works and easily supported by hadoop. Since in the next step we just want to add everything together, we return a dummy, constant key for each value. Note the use of recycling in `keyval`.
+The map function simply computes the contribution of a subset of points to the gradient. Please note the variables `g` and `plane` making their necessary appearance here without any work on the developer's part. The access here is read only but you could even modify them if you wanted &mdash; the semantics is copy on assign, which is consistent with how R works and easily supported by Hadoop. Since in the next step we just want to add everything together, we return a dummy, constant key for each value. Note the use of recycling in `keyval`.
```r
@@ -183,7 +183,7 @@ We are talking about k-means. This is not a production ready implementation, but
```
-This is simply a distance function, the only noteworthy proprty of which is that it can compute all the distance between a matrix of centers `C` and a matrix of points `P` very efficiently, on my laptop it can do 10^6 points and 10^2 centers in 5 dimensions in approx. 16s. The only explicit iteration is over the dimension, but all the other operations are vectorized (e.g. loops are pushed to the C library), hence the speed.
+This is simply a distance function, the only noteworthy property of which is that it can compute all the distance between a matrix of centers `C` and a matrix of points `P` very efficiently, on my laptop it can do 10^6 points and 10^2 centers in 5 dimensions in approx. 16s. The only explicit iteration is over the dimension, but all the other operations are vectorized (e.g. loops are pushed to the C library), hence the speed.
```r
@@ -198,7 +198,7 @@ This is simply a distance function, the only noteworthy proprty of which is that
keyval(nearest, P) }
```
-The role of the map function is to compute distances between some points and all centers and return for each point the closest center. It has two flavors contolled by the main `if`: the first iteration when no candidate centers are available and all the following ones. Please note that while the points are stored in HDFS and provided to the map function as its second argument, the centers are simply stored in a matrix and available in the map function because of normal scope rules. In the first iteration, each point is randomly assigned to a center, whereas in the following ones a min distance criterion is used. Finally notice the vectorized use of keyval whereby all the center-point pairs are returned in one statement (the correspondence is positional, with the second dimension used when present).
+The role of the map function is to compute distances between some points and all centers and return for each point the closest center. It has two flavors controlled by the main `if`: the first iteration when no candidate centers are available and all the following ones. Please note that while the points are stored in HDFS and provided to the map function as its second argument, the centers are simply stored in a matrix and available in the map function because of normal scope rules. In the first iteration, each point is randomly assigned to a center, whereas in the following ones a min distance criterion is used. Finally notice the vectorized use of keyval whereby all the center-point pairs are returned in one statement (the correspondence is positional, with the second dimension used when present).
```r
@@ -223,7 +223,7 @@ The reduce function couldn't be simpler as it just computes column averages of a
C}
```
-The main loop does nothing but bring into memory the result of a mapreduce job with the two above functions as mapper and reducer and the big data object with the points as input. Once the keys are discarded, the values form a matrix which become the new centers. The last two lines before the return value are a heuristic to keep the number of centers the desired one (when centers are nearest to no points, they are lost). To run this function we neef some data:
+The main loop does nothing but bring into memory the result of a mapreduce job with the two above functions as mapper and reducer and the big data object with the points as input. Once the keys are discarded, the values form a matrix which become the new centers. The last two lines before the return value are a heuristic to keep the number of centers the desired one (when centers are nearest to no points, they are lost). To run this function we need some data:
```r
@@ -233,7 +233,7 @@ The main loop does nothing but bring into memory the result of a mapreduce job w
```
- That is, create a large matrix with a few rows repeated many times and then add some noise.
+ That is, create a large matrix with a few rows repeated many times and then add some noise. Finally, the function call:
```r
@@ -245,10 +245,9 @@ With a little extra work you can even get pretty visualizations like [this one](
## Linear Least Squares
- We are going to build another example, LLS, that illustrates how to build map reduce reusable abstractions and how to combine them to solve a larger task. We want to solve LLS under the assumption that we have too many data points to fit in memory but not such a huge number of variables that we need to implement the whole process as map reduce job. This is sort of a hybrid solution that is made particularly easy by the seamless integration of `rmr` with R and an example of a pragmatic approach to big data. If we have operations A, B, and C in a cascade and the data sizes decrease at each step and we have already an in-memory solution to it, than we might get away by replacing only the first step with a big data solution and then continuing with tried and true function and pacakges. To make this as easy as possible, we need the in memory and big data worlds to integrate easily.
+This is an example of a hybrid mapreduce-conventional solution to a well known problem. We will start with a mapreduce job that results in a smaller data set that can be brought into main memory and processed in a single R instance. This is straightforward in rmr because of the simple primitive that transfers data into memory, `from.dfs`, and the R-native data model. This is in contrast with hybrid pig-java-python solutions where mapping data types from one language to the other is a time-consuming and error-prone chore the developer has to deal with.
-
-This is the basic equation we want to solve in the least square sense:
+Specifically, we want to solve LLS under the assumption that we have too many data points to fit in memory but not such a huge number of variables that we need to implement the whole process as map reduce job. This is the basic equation we want to solve in the least square sense:
**X** **&beta;** = **y**
@@ -277,7 +276,7 @@ Sum = function(k, YY) keyval(1, list(Reduce('+', YY)))
```
-The big matrix is passed to the mapper in chunks of complete rows. Smaller crossproducts are computed for these submatrices and passed on to a single reducer, which sums them together. Since we have a single key a combiner is mandatory and since matrix sum is associative and commutatitve we certainly can use it here.
+The big matrix is passed to the mapper in chunks of complete rows. Smaller cross-products are computed for these submatrices and passed on to a single reducer, which sums them together. Since we have a single key a combiner is mandatory and since matrix sum is associative and commutative we certainly can use it here.
```r
@@ -294,7 +293,7 @@ XtX =
```
-The same pretty much goes on also for vectory y, which is made available to the nodes according to normal scope rules.
+The same pretty much goes on also for vector y, which is made available to the nodes according to normal scope rules.
```r
