
Commit

manual anchoring not needed
piccolbo committed Jul 17, 2012
1 parent db41741 commit 7ca6c2b
Showing 1 changed file with 0 additions and 9 deletions.
9 changes: 0 additions & 9 deletions rmr/pkg/docs/tutorial.Rmd
@@ -8,7 +8,6 @@

# Mapreduce in R

<a name="myfirstmapreducejob"></a>
## My first mapreduce job

Conceptually, mapreduce is not very different from a combination of `lapply` and `tapply`: transform the elements of a list, compute an index &mdash; a key, in mapreduce jargon &mdash; and process the groups thus defined. Let's start with a simple `lapply` example:
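To fix ideas, here is a minimal sketch of the two versions side by side; it assumes the `rmr` API used throughout this tutorial (`to.dfs`, `mapreduce`, `keyval`, `from.dfs`) and is meant as an illustration rather than the exact code in the file:

```r
library(rmr)
small.ints = 1:10
# in-memory version: transform each element of a list
out.mem = lapply(small.ints, function(x) x^2)
# mapreduce version: same element-by-element transformation, on HDFS data
out.mr = from.dfs(
  mapreduce(input = to.dfs(small.ints),
            map = function(k, v) keyval(v, v^2)))
```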
@@ -37,7 +36,6 @@ function, which we are not using here, is a regular R function with a few constr
In this example, we are not using the key at all, only the value, but we still need both to support the general mapreduce case.
The return value is an object that you can pass as input to other jobs or read into memory with `from.dfs` (watch out: this is not good for big data). `from.dfs` is complementary to `to.dfs` and returns a list of key-value pairs, the most general data type that mapreduce can handle. If you prefer data frames to lists, you can instruct `from.dfs` to convert to a data frame, which covers many important use cases but is not as general as a list of pairs (the structured vs. unstructured case). `from.dfs` is useful in defining map reduce algorithms whenever a job produces something of reasonable size, like a summary, that fits in memory and needs to be inspected to decide on the next steps, or visualized.
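A hedged illustration of this round trip; the `key`/`val` components of a pair and the exact name of the data frame conversion argument are assumptions about the API:

```r
pairs = from.dfs(out.mr)     # a list of key-value pairs
pairs[[1]]$key               # assuming each pair exposes key and val components
pairs[[1]]$val
# structured case: ask for a data frame instead of a list of pairs
df = from.dfs(out.mr, to.data.frame = TRUE)
```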

<a name="mysecondmapreducejob"></a>
## My second mapreduce job

We've just created a simple job that is logically equivalent to an `lapply` but can run on big data. That job had only a map. Now to the reduce part. The closest R equivalent is arguably `tapply`, so here is the example from the R docs:
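That example reads roughly as follows (reproduced from memory, so treat the exact call and numbers as an assumption):

```r
# count how many times each outcome of a binomial sample occurs
groups = as.factor(rbinom(32, n = 5, prob = 0.4))
tapply(groups, groups, length)
```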
@@ -51,7 +49,6 @@ This creates a sample from the binomial and counts how many times each outcome o

First we move the data into HDFS with `to.dfs`. As we said earlier, this is not the normal way big data enters HDFS; that is normally the responsibility of scalable data collection systems such as Flume or Sqoop, in which case we would just specify the HDFS path to the data as input to `mapreduce`. Here, though, the input is the variable `groups`, which points to where the data is temporarily stored, and the naming and clean up are taken care of for you; all you need to know is how to use it. There is no map function, so it is set to the default, which is like an identity but consistent with the map requirements, that is `function(k,v) keyval(k,v)`. The reduce function takes two arguments: a key and a list of all the values associated with that key. As in the map case, the reduce function can return `NULL`, a key-value pair as generated by the function `keyval`, or a list thereof. The default is somewhat equivalent to an identity function, under the constraints of a reduce function, that is `function(k, vv) lapply(vv, function(v) keyval(k,v))`. In this case the key is one possible outcome of the binomial, the values are all `NULL`, and the only important thing is how many there are, so `length` gets the job done. Looking back at this second example, there are some small differences from `tapply`, but the overall complexity is very similar.
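A minimal sketch of the counting job just described, written with an explicit map so we don't have to rely on how `to.dfs` keys a plain vector (an assumption-laden illustration, not the exact code in the file):

```r
counts = from.dfs(
  mapreduce(input = to.dfs(rbinom(32, n = 5, prob = 0.4)),
            # re-key each record on its outcome; the value carries no information
            map = function(k, v) keyval(v, NULL),
            # the group size is all we need, so length does the counting
            reduce = function(k, vv) keyval(k, length(vv))))
```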

<a name="wordcount"></a>
## Wordcount

The word count program has become the "hello world" of mapreduce. For a review of how the same task can be accomplished in several languages, always for map reduce, see this [blog entry](http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html).
@@ -70,7 +67,6 @@ The map function, as we know already, takes two arguments, a key and a value. Th

The reduce function takes a key and a list of values as input, simply sums up all the counts, and returns the (word, count) pair using the same helper function, `keyval`. Finally, specifying the use of a combiner is necessary to guarantee the scalability of this algorithm.
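Putting the pieces together, a hedged sketch of the whole program; the splitting pattern, the text-input convention (each value is one line of text) and the exact `combine` flag are assumptions:

```r
wordcount = function(input, output = NULL, pattern = " ")
  mapreduce(input = input,
            output = output,
            # one pair per word: the word is the key, the count is 1
            map = function(k, v)
              lapply(strsplit(v, split = pattern)[[1]],
                     function(w) keyval(w, 1)),
            # sum the counts for each word
            reduce = function(k, vv) keyval(k, sum(unlist(vv))),
            # summing is associative, so the reduce doubles as a combiner
            combine = TRUE)
```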

<a name="logisticregression"></a>
## Logistic Regression

Now on to an example from supervised learning, specifically logistic regression by gradient descent. Again we are going to create a function that encapsulates this algorithm.
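A hedged sketch of what such a function can look like; the record layout (`v$x` for the feature vector, `v$y` for a ±1 label), the shape of what `from.dfs` returns, and the `combine` flag are assumptions:

```r
logistic.regression = function(input, iterations, dims, alpha) {
  plane = rep(0, dims)                       # current separating plane
  g = function(z) 1 / (1 + exp(-z))          # the logistic function
  for (i in 1:iterations) {
    gradient = from.dfs(
      mapreduce(input = input,
                # each point contributes its piece of the gradient under key 1
                map = function(k, v) {
                  d = g(-v$y * sum(plane * v$x))   # scalar weight for this point
                  keyval(1, v$y * v$x * d)},
                # summing gradient pieces is associative, so a combiner is safe
                reduce = function(k, vv) keyval(k, Reduce('+', vv)),
                combine = TRUE))
    plane = plane + alpha * gradient[[1]]$val      # gradient ascent step
  }
  plane}
```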
@@ -96,7 +92,6 @@ After the map reduce job is complete and `from.dfs` has copied the only record t

To make this example production-level one would need several things, like a convergence criterion instead of a fixed iteration count and an adaptive learning rate; but gradient descent probably requires too many iterations to be the right approach in a big data context. This example should nonetheless give you all the elements to implement, say, conjugate gradient instead. In general, when each iteration requires I/O of a large data set, the number of iterations needs to be modest, and algorithms with O(log(N)) iterations are natural candidates, even if the work in each iteration may be more substantial.

<a name="kmeans"></a>
## K-means

We are now going to cover a simple but significant clustering algorithm, and the complexity will go up just a little bit. To cheer yourself up, you can take a look at [this alternative implementation](http://www.hortonworks.com/new-apache-pig-features-part-2-embedding/), which requires three languages, Python, Pig and Java, to get the job done and is hailed as a model of simplicity.
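The core of one iteration can be sketched as follows; it assumes points are stored as numeric vectors and centers are kept in memory as a matrix with one center per row (initialization and the outer loop are omitted):

```r
# squared euclidean distance from each center (one per row of C) to point p
dist.fun = function(C, p) apply(C, 1, function(x) sum((x - p)^2))

kmeans.iter = function(points, centers)
  from.dfs(
    mapreduce(input = points,
              # key each point by the index of its nearest center
              map = function(k, v) keyval(which.min(dist.fun(centers, v)), v),
              # the new center of a cluster is the average of its points
              reduce = function(k, vv) keyval(k, colMeans(do.call(rbind, vv)))))
```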
@@ -131,7 +126,6 @@ And this is a simple test of what we've just implemented,
```
With a little extra work you can even get pretty visualizations like [this one](kmeans.gif).

<a name="linearleastsquares"></a>
## Linear Least Squares

We are going to build another example, LLS, that illustrates how to build reusable map reduce abstractions and how to combine them to solve a larger task. We want to solve LLS under the assumption that we have too many data points to fit in memory, but not so many variables that we need to implement the whole process as a map reduce job. This is a hybrid solution that is made particularly easy by the seamless integration of `rmr` with R, and an example of a pragmatic approach to big data. If we have operations A, B, and C in a cascade, the data sizes decrease at each step, and we already have an in-memory solution, then we might get away with replacing only the first step with a big data solution and continuing with tried and true functions and packages. To make this as easy as possible, we need the in-memory and big data worlds to integrate seamlessly.
@@ -184,7 +178,6 @@ Then we have the map reduce transpose job which is abstracted into a function tr
It takes an input and an optional output and returns the return value of the map reduce job: it simply passes the input, the output and the map function we've just defined on to `mapreduce`, and that's all.
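A hedged sketch of such a transpose, assuming matrices are stored as cell-level records `keyval(c(i, j), value)` (the representation is an assumption, not necessarily the one in the file):

```r
# swap the row and column indices of each cell; the value is untouched
transpose.map = function(k, v) keyval(c(k[2], k[1]), v)

transpose = function(input, output = NULL)
  mapreduce(input = input, output = output, map = transpose.map)
```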


<a name="relationaljoins"></a>
### Detour: Relational Joins

Now we would like to tackle matrix multiplication, but we need a short detour first. This takes us one step further in Hadoop mastery, as we need to combine and process two files in one map reduce job. By default `mapreduce` supports merging two inputs the way Hadoop does: one can specify multiple inputs, and the only guarantee is that every record will go through some mapper. No order or grouping of any sort is guaranteed as the mappers process the input files.
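Purely to fix ideas, here is a from-scratch sketch of an equijoin in this style; the library provides its own `equijoin`, so the names and the two-pass tagging below are hypothetical simplifications, and multiple inputs are assumed to combine with `c()` as described above:

```r
# tag every record of one input with its side, preserving the join key
tag.side = function(input, side)
  mapreduce(input = input,
            map = function(k, v) keyval(k, list(side = side, val = v)))

# join two inputs on their keys; join.fun turns one matching pair of
# records into a key-value pair (or a list of them)
equijoin.sketch = function(left, right, join.fun)
  mapreduce(
    input = c(tag.side(left, "L"), tag.side(right, "R")),
    reduce = function(k, vv) {
      ls = Filter(function(x) x$side == "L", vv)
      rs = Filter(function(x) x$side == "R", vv)
      # cartesian product of the matching records, as in a SQL equijoin
      unlist(lapply(ls, function(l)
               lapply(rs, function(r) join.fun(k, l$val, r$val))),
             recursive = FALSE)})
```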
@@ -213,7 +206,6 @@ Now to the interesting bits. This function is a bit relational join and a bit ma

The reduce can be specified as usual, but an alternate interface is offered with `reduce.all`, which is more SQL-like in that it is a join without a group by on the join key, whereas the reduce form implies a group by. This is a little advanced in a number of ways and also very reusable, so we made it part of the library even though it is built on top of `mapreduce`. Now we are going to see its application to perform a matrix multiplication.

<a name="linearleastsquarescontinued"></a>
### Linear Least Squares (continued)

Back to our matrix multiplication task, which we will implement as an application of the equijoin just shown.
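Building on the hypothetical `equijoin.sketch` above and the same cell-level representation, matrix multiplication can be sketched as two jobs: a join on the shared index producing partial products, and a sum per output cell (names and signatures are assumptions):

```r
# matrices as cell-level records keyval(c(row, col), value)
matmul.sketch = function(A, B)
  mapreduce(
    input = equijoin.sketch(
      # re-key A on its column index and B on its row index,
      # so the join matches A[i, j] with B[j, l]
      mapreduce(input = A,
                map = function(k, v) keyval(k[2], list(i = k[1], x = v))),
      mapreduce(input = B,
                map = function(k, v) keyval(k[1], list(j = k[2], x = v))),
      join.fun = function(k, a, b) keyval(c(a$i, b$j), a$x * b$x)),
    # second job: sum the partial products landing on each output cell
    reduce = function(k, vv) keyval(k, sum(unlist(vv))),
    combine = TRUE)
```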
@@ -253,7 +245,6 @@ The first step is a join on the column index for the left side and the row

We start with a transpose, compute the left and right sides of the normal equations, and call `solve` on the converted data.
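The final step needs no `rmr` at all; here is a small, self-contained, plain-R illustration of the normal equations being solved in memory once the big data jobs have produced their small outputs:

```r
X = matrix(rnorm(2000), ncol = 10)   # 200 observations, 10 variables
y = rnorm(200)
XtX = t(X) %*% X                     # 10 x 10; at scale, a map reduce product
Xty = t(X) %*% y                     # 10 x 1; likewise
beta = solve(XtX, Xty)               # the normal equations, solved in memory
max(abs(beta - lm.fit(X, y)$coefficients))  # agrees with the standard fit
```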

<a name="whatwehavelearned"></a>
## What we have learned

We will summarize here a few ways in which we have used the functions in the library.