Fixed width format data #118

Merged
merged 5 commits into from Jul 26, 2012
33 changes: 33 additions & 0 deletions rmr/pkg/docs/getting-data-in-and-out.Rmd
@@ -81,6 +81,39 @@ You can then use the list names to directly access your column of interest for m
```{r getting-data.named.column.access}
```

Another common `input.format` is fixed width formatted data:
```{r getting-data.fwf.reader}
```

Using the text `output.format` as a template, we modify it slightly to write fixed width data without tab separation:
```{r getting-data.fwf.writer}
```

Writing the `mtcars` dataset as fixed width data, with every column 6 bytes wide, and putting it into HDFS:
```{r getting-data.generate.fwf.data}
```
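
As a rough local check of the layout, the same padding can be sketched in base R without Hadoop. The name `fwf.lines` is made up for this illustration and is not part of rmr; the sketch only assumes that each record is the 11 padded fields concatenated with no separator, as `fwf.writer` does.

```r
# Pad every column of mtcars to a 6-character field (base R only, no rmr).
cars <- apply(mtcars, 2, function(x) format(x, width = 6))

# Concatenate the fields of each row with no separator.
# `fwf.lines` is a hypothetical name used only in this sketch.
fwf.lines <- apply(cars, 1, paste, collapse = "")

# 11 fields * 6 characters each should give 66-character records.
stopifnot(all(nchar(fwf.lines) == 66))
```

If the assertion holds, the byte offsets used for reading can be derived directly from the 6-character width.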

The key thing to note about `fwf.reader` is the global variable `fields`. In `fields`, we define the start and
end byte locations for each field in the data:
```{r getting-data.create.fields.list}
```
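
Typing the offsets by hand is error prone, so here is a minimal sketch of deriving them from a vector of widths and then slicing records with `substr`, the same call `fwf.reader` relies on. The `widths` vector and the two sample records are invented for illustration; only base R is assumed.

```r
# Hypothetical field widths; names become the column names.
widths <- c(mpg = 6, cyl = 6, disp = 6)

# End offsets are the running total; start offsets follow from the widths.
ends <- cumsum(widths)
starts <- ends - widths + 1
fields <- Map(function(s, e) c(s, e), starts, ends)

# Two made-up 18-byte records in the same layout.
lines <- c("  21.0     6 160.0", "  22.8     4 108.0")

# Slice each field out of every line, then convert the text to numbers.
df <- as.data.frame(lapply(fields, function(x) substr(lines, x[1], x[2])),
                    stringsAsFactors = FALSE)
df[] <- lapply(df, as.numeric)
```

Note that `substr` returns padded character strings, so a conversion step such as `as.numeric` is needed before doing arithmetic on the columns.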

Sending 1 line at a time to the map function:
```{r getting-data.from.dfs.one.line}
```

Sending more than 1 line at a time to the map function via the vectorized API:
```{r getting-data.from.dfs.multiple.lines}
```

Frequency count on `cyl`:
```{r getting-data.cyl.frequency.count}
```

Frequency count on `cyl` with vectorized API:
```{r getting-data.cyl.vectorized.frequency.count}
```
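
To sanity check the counts coming back from the job, the same frequency table can be computed locally. This is only a base R cross-check, not part of the rmr workflow, and `counts` is a name chosen for this sketch.

```r
# Count rows per cylinder value directly on the in-memory dataset.
counts <- as.data.frame(table(cyl = mtcars$cyl), responseName = "count")

# mtcars has 11 four-cylinder, 7 six-cylinder and 14 eight-cylinder cars.
stopifnot(counts$count == c(11, 7, 14))
```

The keys returned by the job are read from fixed width fields, so they may come back as padded strings rather than numbers; compare them after trimming or converting.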

To get your data out (say you read an input file, apply column transformations, add columns, and want to write a new csv file),
just like with `input.format`, you must define a textoutputformat

75 changes: 74 additions & 1 deletion rmr/pkg/tests/getting-data-in-and-out.R
@@ -94,4 +94,77 @@ mapreduce(
#complicated function here
keyval(k, vv[[1]])})
## @knitr end

## @knitr getting-data.fwf.reader
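fwf.reader <- function(con, nrecs) {
  # Read up to nrecs raw lines from the connection
  lines <- readLines(con, nrecs)
  if (length(lines) == 0) {
    NULL
  } else {
    # Cut each line into columns using the start/end offsets in the global `fields`
    df <- as.data.frame(lapply(fields, function(x) substr(lines, x[1], x[2])),
                        stringsAsFactors = FALSE)
    keyval(NULL, df)
  }
}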
## @knitr end
## @knitr getting-data.fwf.writer
fwf.writer <- function(k, v, con, vectorized) {
  # Concatenate key and value fields with no separator
  ser <- function(k, v) paste(k, v, collapse = "", sep = "")
  out <- if (vectorized) {
    mapply(ser, k, v)
  } else {
    ser(k, v)
  }
  # One record per line
  writeLines(out, sep = "\n", con = con)
}
## @knitr end
## @knitr getting-data.generate.fwf.data
cars <- apply(mtcars, 2, function(x) format(x, width = 6))
fwf.data <- to.dfs(cars, format = make.output.format(mode = "text", format = fwf.writer))
## @knitr end
## @knitr getting-data.create.fields.list
fields <- list(mpg  = c(1, 6),
               cyl  = c(7, 12),
               disp = c(13, 18),
               hp   = c(19, 24),
               drat = c(25, 30),
               wt   = c(31, 36),
               qsec = c(37, 42),
               vs   = c(43, 48),
               am   = c(49, 54),
               gear = c(55, 60),
               carb = c(61, 66))
## @knitr end
## @knitr getting-data.from.dfs.one.line
out <- from.dfs(mapreduce(input = fwf.data,
                          input.format = make.input.format(mode = "text", format = fwf.reader)))
out[[1]]
## @knitr end
## @knitr getting-data.from.dfs.multiple.lines
out <- from.dfs(mapreduce(input = fwf.data,
                          input.format = make.input.format(mode = "text", format = fwf.reader),
                          vectorized = list(map = TRUE)))
out[[1]]
## @knitr end
## @knitr getting-data.cyl.frequency.count
out <- from.dfs(mapreduce(input = fwf.data,
                          input.format = make.input.format(mode = "text", format = fwf.reader),
                          map = function(key, value) keyval(value[,"cyl"], 1),
                          reduce = function(key, value) keyval(key, sum(unlist(value))),
                          combine = TRUE), structured = TRUE)
df <- data.frame(out$key, out$val)
names(df) <- c("cyl","count")
df
## @knitr end
## @knitr getting-data.cyl.vectorized.frequency.count
out <- from.dfs(mapreduce(input = fwf.data,
                          input.format = make.input.format(mode = "text", format = fwf.reader),
                          map = function(key, value) keyval(value[,"cyl"], 1, vectorized = TRUE),
                          reduce = function(key, value) keyval(key, sum(unlist(value))),
                          combine = TRUE,
                          vectorized = list(map = TRUE),
                          structured = list(map = TRUE)), structured = TRUE)
df <- data.frame(out$key, out$val)
names(df) <- c("cyl","count")
df
## @knitr end