---
title: "replyr example"
author: "John Mount, Win-Vector LLC"
date: "3/4/2017"
output:
  md_document:
    variant: markdown_github
---
[`replyr`](https://github.com/WinVector/replyr) is an [`R`](https://cran.r-project.org) package
that contains extensions, adaptions, and work-arounds to make remote `R` `dplyr` data sources (including
big data systems such as `Spark`) behave more like local data. This allows the analyst to more easily develop
and debug procedures that simultaneously work on a variety of data services (in-memory `data.frame`,
`SQLite`, `PostgreSQL`, and `Spark2` currently being the primary supported platforms).
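
As a rough illustration of what "the same procedure on several data services" means, the sketch below runs one `dplyr` pipeline against a local `data.frame` and against an in-memory `SQLite` table. It uses only `dplyr`, not `replyr`, and assumes the `DBI`, `RSQLite`, and `dbplyr` packages are installed; the helper name `count_groups` and the toy data are made up for this example.

```r
library("dplyr")

# Toy data (hypothetical, for illustration only).
d_local <- data.frame(g = c("a", "a", "b"),
                      x = c(1, 2, 3),
                      stringsAsFactors = FALSE)

# An in-memory SQLite database stands in for a remote service.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d_remote <- dplyr::copy_to(con, d_local, "d_remote")

# One procedure, written once, applied to either kind of source.
count_groups <- function(d) {
  d %>%
    group_by(g) %>%
    summarize(n = n(), total = sum(x, na.rm = TRUE)) %>%
    arrange(g) %>%
    collect() %>%        # a no-op for local data, a download for remote data
    as.data.frame()
}

r_local <- count_groups(d_local)
r_remote <- count_groups(d_remote)
print(r_local)
print(r_remote)

DBI::dbDisconnect(con)
```

Simple pipelines like this one usually agree across back ends; the differences tend to show up in less common operations, which is where work-arounds become useful.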

## Example
Suppose we had a large data set hosted on a `Spark` cluster that we wished to work
with using `dplyr` and `sparklyr` (for this article we will simulate this using data loaded into `Spark` from
the `nycflights13` package).
We will work a trivial example: taking a quick peek at the data.
An analyst should always be able and willing to look at the data.
```{r setup, include=FALSE}
library("sparklyr")
library("dplyr")
library("nycflights13")
my_db <- sparklyr::spark_connect(version = '2.0.0',
                                 master = "local")
flts <- replyr::replyr_copy_to(my_db, flights)
```
It is easy to look at the top of the data, or any specific set of rows
of the data, either through `print()` (which is much safer with `tbl_df`-derived classes than with a base
`data.frame`):
```{r}
print(flts)
```
Or with `dplyr::glimpse()`:
```{r glimpse}
dplyr::glimpse(flts)
```
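
The remark above that `print()` is safer on `tbl_df`-derived classes can be checked on local data. The sketch below is only an illustration: it assumes the `tibble` package is installed, uses a made-up 1000-row table, and counts printed lines rather than showing them.

```r
# Assumes the tibble package is installed (it is a dependency of dplyr).
d <- data.frame(x = seq_len(1000))

# Base data.frame: print() emits every row (up to getOption("max.print")).
lines_df <- length(capture.output(print(d)))

# tbl_df: print() shows only the first few rows plus a short header and footer.
lines_tbl <- length(capture.output(print(tibble::as_tibble(d))))

cat("data.frame lines printed:", lines_df, "\n")
cat("tbl_df lines printed:    ", lines_tbl, "\n")
```

On a remote table the difference matters more, since printing every row would mean pulling the whole table back to the client.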
What `replyr` adds to the task of "looking at the data" is a rough
equivalent to `base::summary()`: a few per-column statistics.
```{r replyr}
# using dev version of replyr https://github.com/WinVector/replyr
replyr::replyr_summary(flts,
                       countUniqueNonNum = TRUE)
```
As we see, `replyr_summary` returns its results in a data frame, and can deal with multiple column types.
Note: the above summary has problems with `NA` in `character` columns with `Spark`, and thus is mis-reporting the `NA` count in the `tailnum` column. We are working on the issue. This is also one of the advantages of taking your work-arounds from a package: when they improve, you can bring the improvements into your own work with a mere package update.
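
To make the "one row per column" shape concrete, here is a minimal local-only sketch in base `R`. It is not how `replyr_summary` is implemented and does not work on remote data; the helper name `long_summary`, its output columns, and the toy data are all made up for illustration.

```r
# Hypothetical helper: one output row per input column, a few statistics each.
# Works on in-memory data.frames only.
long_summary <- function(d) {
  rows <- lapply(names(d), function(nm) {
    x <- d[[nm]]
    isNum <- is.numeric(x)
    data.frame(
      column = nm,
      class = class(x)[[1]],
      nrows = length(x),
      nna = sum(is.na(x)),
      min = if (isNum) min(x, na.rm = TRUE) else NA_real_,
      max = if (isNum) max(x, na.rm = TRUE) else NA_real_,
      mean = if (isNum) mean(x, na.rm = TRUE) else NA_real_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data with one numeric and one character column, each containing an NA.
d <- data.frame(x = c(1, 2, NA),
                y = c("a", NA, "c"),
                stringsAsFactors = FALSE)
s <- long_summary(d)
print(s)
```

Because each input column becomes a row, numeric and non-numeric columns can share one result table, with `NA` filling the statistics that do not apply.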
We could also use `dplyr::summarize_each` for the task, but it has the minor downside of returning
the data in a wide form.
```{r summarizeeach}
# currently throws if tailnum left in column list
vars <- setdiff(colnames(flts), 'tailnum')
flts %>% summarize_each(funs(min, max, mean, sd),
                        one_of(vars))
```
```{r gatehr, eval=FALSE, include=FALSE}
# show the kind of work needed to gather this result
flts %>% summarize_each(funs(min, max, mean, sd),
                        one_of(vars)) -> dz
library("tidyr")
# a crude gather-like operation
summarizeV <- function(v) {
  # get the column type
  colClass <- flts %>%
    head(n=1) %>%
    select_(v) %>%
    collect() %>%
    .[[1]] %>%
    is.numeric %>%
    ifelse(., 'num', 'str')
  # limit down to summaries from this col
  dzi <- dz %>% select(starts_with(v))
  oldNames <- colnames(dzi)
  # strip the column-name prefix, then tag each summary with the column type
  newNames <- paste(gsub(paste0('^', v, '_'), '', oldNames),
                    colClass,
                    sep= '_')
  dzi %>%
    rename_(.dots= setNames(oldNames, newNames)) %>%
    mutate(column= v, colClass= colClass) %>%
    select_(.dots= c('column', 'colClass', newNames))
}
summaries <- lapply(vars, summarizeV)
# dplyr::bind_rows works only on local data
summaries <- lapply(summaries, collect)
bind_rows(summaries)
```
```{r sume, error=TRUE}
flts %>% summarize_each(funs(min, max, mean, sd))
```
Special code for remote data is needed because none of the obvious "one-liner" candidates (`base::summary()`
or `broom::glance()`) is currently (as of March 4, 2017) intended to work
with remote data sources.
```{r otheropts, error=TRUE}
summary(flts)
str(flts)
packageVersion('broom')
broom::glance(flts)
```
The source for the examples can be found [here](https://github.com/WinVector/Examples/blob/master/replyr/example.Rmd).
## Conclusion
`replyr_summary` is not the only service `replyr` supplies; `replyr` includes many more
adaptions, [including my own version of case-completion](http://www.win-vector.com/blog/2017/02/the-zero-bug/).
Roughly, `replyr` is where I collect my adaptions so they don't infest application code. `replyr` is
a way you can use heavy-duty big-data machinery while keeping your fingers out of the gears.
```{r cleanup, include=FALSE}
rm(list=ls())
gc()
```