Fixed width format data #118

Merged
merged 5 commits into from

2 participants

@ryangarner

I think I did this right? Let me know what issues you find. I'm still getting the hang of github, markdown, and knitr.

@piccolbo
Owner

Looking awesome. Let me wait one day to see if we can close 1.3.1 without branching and then I will merge this in. I am only a little worried about the rhdfs dependency, which brings a number of other dependencies with it. Is it absolutely necessary? I am trying to keep the three packages independent.

@ryangarner

Let me see if I can get the example to work with "to.dfs()" instead of using "hdfs.put()"

@ryangarner

Removed the rhdfs dependency

@piccolbo piccolbo merged commit e29033d into from
Commits on Jul 25, 2012
  1. Fixed width format R code

    ryangarner authored
  2. Fixed width format example

    ryangarner authored
  3. Remove extra knitr end statement

    ryangarner authored
Commits on Jul 26, 2012
  1. Added fwf.writer and used to.dfs

    ryangarner authored
  2. Modified generation of fixed width data

    ryangarner authored
33 rmr/pkg/docs/getting-data-in-and-out.Rmd
@@ -81,6 +81,39 @@ You can then use the list names to directly access your column of interest for m
```{r getting-data.named.column.access}
```
+Another common `input.format` is fixed width formatted data:
+```{r getting-data.fwf.reader}
+```
+
+Using the text `output.format` as a template, we modify it slightly to write fixed width data without tab separation:
+```{r getting-data.fwf.writer}
+```
+
+Writing the `mtcars` dataset to a fixed width file with column widths of 6 bytes and putting it into HDFS:
+```{r getting-data.generate.fwf.data}
+```
+
+The key thing to note about `fwf.reader` is the global variable `fields`. In `fields`, we define the start and
+end byte locations for each field in the data:
+```{r getting-data.create.fields.list}
+```
+
+Sending 1 line at a time to the map function:
+```{r getting-data.from.dfs.one.line}
+```
+
+Sending more than 1 line at a time to the map function via the vectorized API:
+```{r getting-data.from.dfs.multiple.lines}
+```
+
+Frequency count on `cyl`:
+```{r getting-data.cyl.frequency.count}
+```
+
+Frequency count on `cyl` with vectorized API:
+```{r getting-data.cyl.vectorized.frequency.count}
+```
+
To get your data out: say you input a file, apply column transformations, add columns, and want to output a new csv file.
Just like input.format, one must define a textoutputformat
75 rmr/pkg/tests/getting-data-in-and-out.R
@@ -94,4 +94,77 @@ mapreduce(
#complicated function here
keyval(k, vv[[1]])})
## @knitr end
-
+## @knitr getting-data.fwf.reader
+fwf.reader <- function(con, nrecs) {
+ lines <- readLines(con, nrecs)
+ if (length(lines) == 0) {
+ NULL
+ }
+ else {
+ df <- as.data.frame(lapply(fields, function(x) substr(lines, x[1], x[2])), stringsAsFactors = FALSE)
+ keyval(NULL, df)
+ }
+}
+## @knitr end
+## @knitr getting-data.fwf.writer
+fwf.writer <- function(k, v, con, vectorized) {
+ ser <- function(k, v) paste(k, v, collapse = "", sep = "")
+ out <- if(vectorized) {
+ mapply(ser, k, v)
+ }
+ else {
+ ser(k, v)
+ }
+ writeLines(out, sep = "\n", con = con)
+}
+## @knitr end
+## @knitr getting-data.generate.fwf.data
+cars <- apply(mtcars, 2, function(x) format(x, width = 6))
+fwf.data <- to.dfs(cars, format = make.output.format(mode = "text", format = fwf.writer))
+## @knitr end
+## @knitr getting-data.create.fields.list
+fields <- list(mpg = c(1,6),
+ cyl = c(7,12),
+ disp = c(13,18),
+ hp = c(19,24),
+ drat = c(25,30),
+ wt = c(31,36),
+ qsec = c(37,42),
+ vs = c(43,48),
+ am = c(49,54),
+ gear = c(55,60),
+ carb = c(61,66))
+## @knitr end
+## @knitr getting-data.from.dfs.one.line
+out <- from.dfs(mapreduce(input = fwf.data,
+ input.format = make.input.format(mode = "text", format = fwf.reader)))
+out[[1]]
+## @knitr end
+## @knitr getting-data.from.dfs.multiple.lines
+out <- from.dfs(mapreduce(input = fwf.data,
+ input.format = make.input.format(mode = "text", format = fwf.reader),
+ vectorized = list(map = TRUE)))
+out[[1]]
+## @knitr end
+## @knitr getting-data.cyl.frequency.count
+out <- from.dfs(mapreduce(input = fwf.data,
+ input.format = make.input.format(mode = "text", format = fwf.reader),
+ map = function(key, value) keyval(value[,"cyl"], 1),
+ reduce = function(key, value) keyval(key, sum(unlist(value))),
+ combine = TRUE), structured = TRUE)
+df <- data.frame(out$key, out$val)
+names(df) <- c("cyl","count")
+df
+## @knitr end
+## @knitr getting-data.cyl.vectorized.frequency.count
+out <- from.dfs(mapreduce(input = fwf.data,
+ input.format = make.input.format(mode = "text", format = fwf.reader),
+ map = function(key, value) keyval(value[,"cyl"], 1, vectorized = TRUE),
+ reduce = function(key, value) keyval(key, sum(unlist(value))),
+ combine = TRUE,
+ vectorized = list(map = TRUE),
+ structured = list(map = TRUE)), structured = TRUE)
+df <- data.frame(out$key, out$val)
+names(df) <- c("cyl","count")
+df
+## @knitr end
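
The slicing step inside `fwf.reader` can be tried without a Hadoop cluster. The following is a minimal sketch in base R only, not part of this PR: `parse.fwf` and the local `lines` vector are hypothetical names introduced for illustration, and the rmr calls (`keyval`, `to.dfs`, `mapreduce`) are deliberately left out. It reuses the `fields` layout from the diff and checks that the start and end bytes line up with data padded to 6 characters per column.

```r
# Column layout from the diff: eleven 6-byte fields, given as c(start, end).
fields <- list(mpg = c(1, 6), cyl = c(7, 12), disp = c(13, 18), hp = c(19, 24),
               drat = c(25, 30), wt = c(31, 36), qsec = c(37, 42), vs = c(43, 48),
               am = c(49, 54), gear = c(55, 60), carb = c(61, 66))

# Build fixed width lines locally: pad every column to 6 characters, then
# paste the columns of each row together with no separator.
cars <- apply(mtcars, 2, function(x) format(x, width = 6))
lines <- unname(apply(cars, 1, paste, collapse = ""))

# Hypothetical helper (not in the PR): the same substr-based slicing that
# fwf.reader performs, but taking the field layout as an argument instead
# of reading a global variable.
parse.fwf <- function(lines, fields) {
  as.data.frame(lapply(fields, function(x) substr(lines, x[1], x[2])),
                stringsAsFactors = FALSE)
}

df <- parse.fwf(lines, fields)

# Fields come back as padded strings, so convert before counting or summing.
cyl <- as.numeric(df$cyl)
table(cyl)
```

Passing `fields` as an argument is only a convenience for local testing; the reader in the diff relies on the global because the input format function is called with `con` and `nrecs` only.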