
Commit

manual anchoring not needed
piccolbo committed Jul 17, 2012
1 parent db41741 commit 7ca6c2b
Showing 1 changed file with 0 additions and 9 deletions.
9 changes: 0 additions & 9 deletions rmr/pkg/docs/tutorial.Rmd
@@ -8,7 +8,6 @@

# Mapreduce in R

<a name="myfirstmapreducejob"></a>
## My first mapreduce job

Conceptually, mapreduce is not very different from a combination of `lapply` and `tapply`: transform the elements of a list, compute an index &mdash; a key, in mapreduce jargon &mdash; and process the groups thus defined. Let's start with a simple `lapply` example:
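To fix ideas, here is a minimal sketch of the two versions side by side; it assumes the `rmr` API used throughout this tutorial (`to.dfs`, `mapreduce`, `keyval`, `from.dfs`) and is meant as an illustration rather than the exact code in the file:

```r
library(rmr)
small.ints = 1:10
# in-memory version: transform each element of a list
out.mem = lapply(small.ints, function(x) x^2)
# mapreduce version: same element-by-element transformation, on HDFS data
out.mr = from.dfs(
  mapreduce(input = to.dfs(small.ints),
            map = function(k, v) keyval(v, v^2)))
```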
@@ -37,7 +36,6 @@ function, which we are not using here, is a regular R function with a few constr
In this example, we are not using the key at all, only the value, but we still need both to support the general mapreduce case.
The return value is an object that you can pass as input to other jobs or read into memory with `from.dfs` (watch out: this is not good for big data). `from.dfs` is complementary to `to.dfs` and returns a list of key-value pairs, the most general data type that mapreduce can handle. If you prefer data frames to lists, you can instruct `from.dfs` to convert to a data frame, which covers many important use cases but is not as general as a list of pairs (the structured vs. unstructured case). `from.dfs` is useful in defining map reduce algorithms whenever a job produces something of reasonable size, like a summary, that fits in memory and needs to be inspected to decide on the next steps, or visualized.
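A hedged illustration of this round trip; the `key`/`val` components of a pair and the exact name of the data frame conversion argument are assumptions about the API:

```r
pairs = from.dfs(out.mr)     # a list of key-value pairs
pairs[[1]]$key               # assuming each pair exposes key and val components
pairs[[1]]$val
# structured case: ask for a data frame instead of a list of pairs
df = from.dfs(out.mr, to.data.frame = TRUE)
```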

<a name="mysecondmapreducejob"></a>
## My second mapreduce job

We've just created a simple job that is logically equivalent to an `lapply` but can run on big data. That job had only a map. Now to the reduce part. The closest R equivalent is arguably `tapply`, so here is the example from the R docs:
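That example reads roughly as follows (reproduced from memory, so treat the exact call and numbers as an assumption):

```r
# count how many times each outcome of a binomial sample occurs
groups = as.factor(rbinom(32, n = 5, prob = 0.4))
tapply(groups, groups, length)
```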
@@ -51,7 +49,6 @@ This creates a sample from the binomial and counts how many times each outcome o

First we move the data into HDFS with `to.dfs`. As we said earlier, this is not the normal way big data enters HDFS; that is normally the responsibility of scalable data collection systems such as Flume or Sqoop, in which case we would just specify the HDFS path to the data as input to `mapreduce`. Here, though, the input is the variable `groups`, which points to where the data is temporarily stored, and the naming and clean up are taken care of for you; all you need to know is how to use it. There is no map function, so it is set to the default, which is like an identity but consistent with the map requirements, that is `function(k,v) keyval(k,v)`. The reduce function takes two arguments: a key and a list of all the values associated with that key. As in the map case, the reduce function can return `NULL`, a key-value pair as generated by the function `keyval`, or a list thereof. The default is somewhat equivalent to an identity function, under the constraints of a reduce function, that is `function(k, vv) lapply(vv, function(v) keyval(k,v))`. In this case the key is one possible outcome of the binomial, the values are all `NULL`, and the only important thing is how many there are, so `length` gets the job done. Looking back at this second example, there are some small differences from `tapply`, but the overall complexity is very similar.
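A minimal sketch of the counting job just described, written with an explicit map so we don't have to rely on how `to.dfs` keys a plain vector (an assumption-laden illustration, not the exact code in the file):

```r
counts = from.dfs(
  mapreduce(input = to.dfs(rbinom(32, n = 5, prob = 0.4)),
            # re-key each record on its outcome; the value carries no information
            map = function(k, v) keyval(v, NULL),
            # the group size is all we need, so length does the counting
            reduce = function(k, vv) keyval(k, length(vv))))
```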

<a name="wordcount"></a>
## Wordcount

The word count program has become the "hello world" of mapreduce. For a review of how the same task can be accomplished in several languages, always for map reduce, see this [blog entry](http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html).
@@ -70,7 +67,6 @@ The map function, as we know already, takes two arguments, a key and a value. Th

The reduce function takes a key and a list of values as input, simply sums up all the counts, and returns the (word, count) pair using the same helper function, `keyval`. Finally, specifying the use of a combiner is necessary to guarantee the scalability of this algorithm.
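Putting the pieces together, a hedged sketch of the whole program; the splitting pattern, the text-input convention (each value is one line of text) and the exact `combine` flag are assumptions:

```r
wordcount = function(input, output = NULL, pattern = " ")
  mapreduce(input = input,
            output = output,
            # one pair per word: the word is the key, the count is 1
            map = function(k, v)
              lapply(strsplit(v, split = pattern)[[1]],
                     function(w) keyval(w, 1)),
            # sum the counts for each word
            reduce = function(k, vv) keyval(k, sum(unlist(vv))),
            # summing is associative, so the reduce doubles as a combiner
            combine = TRUE)
```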

<a name="logisticregression"></a>
## Logistic Regression

Now on to an example from supervised learning, specifically logistic regression by gradient descent. Again we are going to create a function that encapsulates this algorithm.
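A hedged sketch of what such a function can look like; the record layout (`v$x` for the feature vector, `v$y` for a ±1 label), the shape of what `from.dfs` returns, and the `combine` flag are assumptions:

```r
logistic.regression = function(input, iterations, dims, alpha) {
  plane = rep(0, dims)                       # current separating plane
  g = function(z) 1 / (1 + exp(-z))          # the logistic function
  for (i in 1:iterations) {
    gradient = from.dfs(
      mapreduce(input = input,
                # each point contributes its piece of the gradient under key 1
                map = function(k, v) {
                  d = g(-v$y * sum(plane * v$x))   # scalar weight for this point
                  keyval(1, v$y * v$x * d)},
                # summing gradient pieces is associative, so a combiner is safe
                reduce = function(k, vv) keyval(k, Reduce('+', vv)),
                combine = TRUE))
    plane = plane + alpha * gradient[[1]]$val      # gradient ascent step
  }
  plane}
```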
@@ -96,7 +92,6 @@ After the map reduce job is complete and `from.dfs` has copied the only record t

To make this example production-level one would need several things, like a convergence criterion instead of a fixed iteration count and an adaptive learning rate; but gradient descent probably requires too many iterations to be the right approach in a big data context. This example should nonetheless give you all the elements to implement, say, conjugate gradient instead. In general, when each iteration requires I/O of a large data set, the number of iterations needs to be modest, and algorithms with O(log(N)) iterations are natural candidates, even if the work in each iteration may be more substantial.

<a name="kmeans"></a>
## K-means

We are now going to cover a simple but significant clustering algorithm, and the complexity will go up just a little bit. To cheer yourself up, you can take a look at [this alternative implementation](http://www.hortonworks.com/new-apache-pig-features-part-2-embedding/), which requires three languages, Python, Pig and Java, to get the job done and is hailed as a model of simplicity.
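The core of one iteration can be sketched as follows; it assumes points are stored as numeric vectors and centers are kept in memory as a matrix with one center per row (initialization and the outer loop are omitted):

```r
# squared euclidean distance from each center (one per row of C) to point p
dist.fun = function(C, p) apply(C, 1, function(x) sum((x - p)^2))

kmeans.iter = function(points, centers)
  from.dfs(
    mapreduce(input = points,
              # key each point by the index of its nearest center
              map = function(k, v) keyval(which.min(dist.fun(centers, v)), v),
              # the new center of a cluster is the average of its points
              reduce = function(k, vv) keyval(k, colMeans(do.call(rbind, vv)))))
```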
@@ -131,7 +126,6 @@ And this is a simple test of what we've just implemented,
```
With a little extra work you can even get pretty visualizations like [this one](kmeans.gif).

<a name="linearleastsquares"></a>
## Linear Least Squares

We are going to build another example, LLS, that illustrates how to build reusable map reduce abstractions and how to combine them to solve a larger task. We want to solve LLS under the assumption that we have too many data points to fit in memory, but not so many variables that we need to implement the whole process as a map reduce job. This is a hybrid solution that is made particularly easy by the seamless integration of `rmr` with R, and an example of a pragmatic approach to big data. If we have operations A, B, and C in a cascade, the data sizes decrease at each step, and we already have an in-memory solution, then we might get away with replacing only the first step with a big data solution and continuing with tried and true functions and packages. To make this as easy as possible, we need the in-memory and big data worlds to integrate seamlessly.
@@ -184,7 +178,6 @@ Then we have the map reduce transpose job which is abstracted into a function tr
It takes an input and an optional output and returns the return value of the map reduce job: it simply passes the input, the output and the map function we've just defined on to `mapreduce`, and that's all.
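A hedged sketch of such a transpose, assuming matrices are stored as cell-level records `keyval(c(i, j), value)` (the representation is an assumption, not necessarily the one in the file):

```r
# swap the row and column indices of each cell; the value is untouched
transpose.map = function(k, v) keyval(c(k[2], k[1]), v)

transpose = function(input, output = NULL)
  mapreduce(input = input, output = output, map = transpose.map)
```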


<a name="relationaljoins"></a>
### Detour: Relational Joins

Now we would like to tackle matrix multiplication, but we need a short detour first. This takes us one step further in Hadoop mastery, as we need to combine and process two files in one map reduce job. By default `mapreduce` supports merging two inputs the way Hadoop does: one can specify multiple inputs, and the only guarantee is that every record will go through some mapper. No order or grouping of any sort is guaranteed as the mappers process the input files.
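Purely to fix ideas, here is a from-scratch sketch of an equijoin in this style; the library provides its own `equijoin`, so the names and the two-pass tagging below are hypothetical simplifications, and multiple inputs are assumed to combine with `c()` as described above:

```r
# tag every record of one input with its side, preserving the join key
tag.side = function(input, side)
  mapreduce(input = input,
            map = function(k, v) keyval(k, list(side = side, val = v)))

# join two inputs on their keys; join.fun turns one matching pair of
# records into a key-value pair (or a list of them)
equijoin.sketch = function(left, right, join.fun)
  mapreduce(
    input = c(tag.side(left, "L"), tag.side(right, "R")),
    reduce = function(k, vv) {
      ls = Filter(function(x) x$side == "L", vv)
      rs = Filter(function(x) x$side == "R", vv)
      # cartesian product of the matching records, as in a SQL equijoin
      unlist(lapply(ls, function(l)
               lapply(rs, function(r) join.fun(k, l$val, r$val))),
             recursive = FALSE)})
```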
@@ -213,7 +206,6 @@ Now to the interesting bits. This function is a bit relational join and a bit ma

The reduce can be specified as usual, but an alternate interface is offered with `reduce.all`, which is more SQL-like in that it is a join without a group by on the join key, whereas the reduce form implies a group by. This is a little advanced in a number of ways and also very reusable, so we made it part of the library even though it is built on top of `mapreduce`. Now we are going to see its application to perform a matrix multiplication.

<a name="linearleastsquarescontinued"></a>
### Linear Least Squares (continued)

Back to our matrix multiplication task, which we will implement as an application of the equijoin just shown.
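Building on the hypothetical `equijoin.sketch` above and the same cell-level representation, matrix multiplication can be sketched as two jobs: a join on the shared index producing partial products, and a sum per output cell (names and signatures are assumptions):

```r
# matrices as cell-level records keyval(c(row, col), value)
matmul.sketch = function(A, B)
  mapreduce(
    input = equijoin.sketch(
      # re-key A on its column index and B on its row index,
      # so the join matches A[i, j] with B[j, l]
      mapreduce(input = A,
                map = function(k, v) keyval(k[2], list(i = k[1], x = v))),
      mapreduce(input = B,
                map = function(k, v) keyval(k[1], list(j = k[2], x = v))),
      join.fun = function(k, a, b) keyval(c(a$i, b$j), a$x * b$x)),
    # second job: sum the partial products landing on each output cell
    reduce = function(k, vv) keyval(k, sum(unlist(vv))),
    combine = TRUE)
```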
@@ -253,7 +245,6 @@ The first step is a join on the column index for the left side and the row

We start with a transpose, compute the left and right sides of the normal equations, and call `solve` on the converted data.
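The final step needs no `rmr` at all; here is a small, self-contained, plain-R illustration of the normal equations being solved in memory once the big data jobs have produced their small outputs:

```r
X = matrix(rnorm(2000), ncol = 10)   # 200 observations, 10 variables
y = rnorm(200)
XtX = t(X) %*% X                     # 10 x 10; at scale, a map reduce product
Xty = t(X) %*% y                     # 10 x 1; likewise
beta = solve(XtX, Xty)               # the normal equations, solved in memory
max(abs(beta - lm.fit(X, y)$coefficients))  # agrees with the standard fit
```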

<a name="whatwehavelearned"></a>
## What we have learned

We will summarize here a few ways in which we have used the functions in the library.