---
title: "replyr example"
author: "John Mount, Win-Vector LLC"
date: "3/4/2017"
output:
  md_document:
    variant: markdown_github
---

[`replyr`](https://github.com/WinVector/replyr) is an [`R`](https://cran.r-project.org) package
that contains extensions, adaptions, and work-arounds to make remote `R` `dplyr` data sources (including
big data systems such as `Spark`) behave more like local data.  This allows the analyst to more easily develop
and debug procedures that simultaneously work on a variety of data services (in-memory `data.frame`, 
`SQLite`, `PostgreSQL`, and `Spark2` currently being the primary supported platforms).

![](replyrs.png)

## Example

Suppose we had a large data set hosted on a `Spark` cluster that we wished to work 
with using `dplyr` and `sparklyr` (for this article we will simulate this with data loaded into `Spark` from
the `nycflights13` package).

We will work a trivial example: taking a quick peek at the data.
The analyst should always be able and willing to look at the data.

```{r setup, include=FALSE}
library("sparklyr")
library("dplyr")
library("nycflights13")
my_db <- sparklyr::spark_connect(version='2.0.0', 
   master = "local")
flts <- replyr::replyr_copy_to(my_db, flights)
```


It is easy to look at the top of the data, or at any specific set of rows,
either through `print()` (which is much safer with `tbl_df`-derived classes than with base
`data.frame`):

```{r}
print(flts)
```

Or with `dplyr::glimpse()`:

```{r glimpse}
dplyr::glimpse(flts)
```

What `replyr` adds to the task of "looking at the data" is a rough 
equivalent to `base::summary()`: a few per-column statistics.

```{r replyr}
# using dev version of replyr https://github.com/WinVector/replyr
replyr::replyr_summary(flts, 
                       countUniqueNonNum= TRUE)
```

As we see, `replyr_summary()` returns its results in a data frame, and can deal with multiple column types.

Note: the above summary has problems with `NA` in `character` columns with `Spark`, and thus is mis-reporting the `NA` count in the `tailnum` column.  We are working on the issue. That is also one of the advantages of taking your work-arounds from a package: when they do improve, you can bring the improvements into your own work with a mere package update.

We could also use `dplyr::summarize_each` for the task, but it has the minor downside of returning
the data in a wide form.

```{r summarizeeach}
# currently throws if tailnum left in column list 
vars <- setdiff(colnames(flts), 'tailnum')
flts %>% summarize_each(funs(min, max, mean, sd), 
                        one_of(vars))
```
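
To make the wide versus long distinction concrete, here is a minimal local sketch of a long-form summary: one row per column, one column per statistic. It runs on an in-memory `data.frame` in base `R` only. The helper name `long_summary()` and the toy data are hypothetical, and this is not how `replyr_summary()` is implemented; it only illustrates the shape of result we are after.

```r
# Hypothetical helper (not part of replyr): build one summary row per column.
long_summary <- function(d) {
  rows <- lapply(names(d), function(v) {
    x <- d[[v]]
    isNum <- is.numeric(x)
    data.frame(
      column = v,
      class = class(x)[1],
      nrows = length(x),
      nna = sum(is.na(x)),
      # numeric statistics are left NA for non-numeric columns
      min = if (isNum) min(x, na.rm = TRUE) else NA_real_,
      max = if (isNum) max(x, na.rm = TRUE) else NA_real_,
      mean = if (isNum) mean(x, na.rm = TRUE) else NA_real_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data (an assumption for illustration): one numeric and one character column.
d <- data.frame(x = c(1, 2, NA),
                y = c("a", NA, "c"),
                stringsAsFactors = FALSE)
s <- long_summary(d)
print(s)
```

In this layout adding a statistic adds a column and adding a data column adds a row, so the result stays readable however many columns the data has.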

```{r gather, eval=FALSE, include=FALSE}
# show the kind of work needed to gather this result
flts %>% summarize_each(funs(min, max, mean, sd), 
                        one_of(vars)) -> dz
library("tidyr")
# a crude gather-like operation
summarizeV <- function(v) {
  # get the column type
  colClass <- flts %>% 
    head(n=1) %>% 
    select_(v) %>% 
    collect() %>% 
    .[[1]] %>% 
    is.numeric %>%
    ifelse(.,'num', 'str')
  # limit down to summaries from this col
  dzi <- dz %>% select(starts_with(v))
  oldNames <- colnames(dzi)
  newNames <- paste(gsub(paste0('^',v,'_'), '', oldNames),
                    colClass, 
                    sep= '_')
  dzi %>% 
    rename_(.dots= setNames(oldNames, newNames)) %>%
    mutate(column= v, colClass= colClass) %>%
    select_(.dots= c('column', 'colClass', newNames))
}
summaries <- lapply(vars, summarizeV)
# dplyr::bind_rows works only on local data
summaries <- lapply(summaries, collect)
bind_rows(summaries)
```

```{r sume, error=TRUE}
flts %>% summarize_each(funs(min, max, mean, sd))
```

Special code for remote data is needed, as none of the obvious "one-liner" candidates (`base::summary()`
 or `broom::glance()`) is currently (as of March 4, 2017) intended to work
with remote data sources. 

```{r otheropts, error=TRUE}
summary(flts)
str(flts)

packageVersion('broom')
broom::glance(flts)
```
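
The general shape of such special code is to express each statistic as a `dplyr` aggregation, let the back end compute it, and bring back only the small result. Below is a minimal sketch of that idea, assuming the `DBI`, `RSQLite`, and `dbplyr` packages are installed and using an in-memory `SQLite` database as a stand-in for a remote source. The table and column names are made up for illustration, and this is not a description of `replyr`'s internals.

```r
library("dplyr")

# In-memory SQLite stands in for a remote data source (an assumption of this sketch).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(x = c(1, 2, 3),
                y = c("a", NA, "c"),
                stringsAsFactors = FALSE)
rd <- dplyr::copy_to(con, d, "d")

# The aggregation is translated to SQL and run by the database;
# collect() brings back only the one-row result.
res <- rd %>%
  summarize(
    nrows = n(),
    mean_x = mean(x, na.rm = TRUE),
    nna_y = sum(ifelse(is.na(y), 1L, 0L), na.rm = TRUE)
  ) %>%
  collect()

print(res)
DBI::dbDisconnect(con)
```

Only expressions the back end can translate will work this way, which is one reason per-platform work-arounds accumulate.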

The source for the examples can be found [here](https://github.com/WinVector/Examples/blob/master/replyr/example.Rmd).

## Conclusion

`replyr_summary` is not the only service `replyr` supplies; `replyr` includes many more
adaptions, [including my own version of case-completion](http://www.win-vector.com/blog/2017/02/the-zero-bug/).

Roughly, `replyr` is where I collect my adaptions so they don't infest application code.  `replyr` is
a way you can use heavy-duty big-data machinery while keeping your fingers out of the gears.

```{r cleanup, include=FALSE}
rm(list=ls())
gc()
```