---
title: "Introduction to Apache Arrow framework"
output: rmarkdown::html_vignette
author: Fernando Mayer, Niamh Cahill
vignette: >
%\VignetteIndexEntry{Introduction to Apache Arrow framework}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## The Apache Arrow framework

The Apache Arrow framework is best described by its own website:
> Apache Arrow is a software development platform for building high
> performance applications that process and transport large data sets. It
> is designed to both improve the performance of analytical algorithms and
> the efficiency of moving data from one system or programming language to
> another.
>
> A critical component of Apache Arrow is its in-memory columnar format, a
> standardized, language-agnostic specification for representing
> structured, table-like datasets in-memory. This data format has a rich
> data type system (including nested and user-defined data types) designed
> to support the needs of analytic database systems, data frame libraries,
> and more.

In other words, the Apache Arrow framework was designed to deal with
large datasets (including those larger than the available memory), using
efficient in-memory analytics. Computations on "Arrow datasets" are
therefore very fast, making analyses feasible that would be slow or
impossible with standard in-memory R objects.

The Apache Arrow framework can be used from many different programming
languages, each with its own dedicated library. In R, the [arrow][]
package is available to load and manipulate Arrow datasets. Arrow
objects are manipulated through [dplyr][] verbs, which should feel
familiar to most users. Not all **dplyr** verbs are available for Arrow
datasets, but the vast majority of the most common ones have already
been "translated" to work with Arrow. A list of such functions can be
found in [Functions available in Arrow dplyr queries][]. A general
introduction to using **dplyr** verbs with Arrow can be found in [Data
analysis with dplyr syntax][].
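
For a first taste of this "translation", here is a minimal sketch using
a built-in data frame rather than GESLA data: a **dplyr** query run
against a small in-memory Arrow Table.

```r
library(arrow)
library(dplyr)

## Build a small in-memory Arrow Table from a built-in data frame
tab <- arrow_table(mtcars)

## dplyr verbs build a lazy query against the Arrow data;
## collect() materialises the result as a tibble
tab |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg)) |>
  collect()
```

The same pattern (a lazy query, followed by `collect()`) is what we will
use with the full GESLA dataset below.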

## Using Apache Arrow with geslaR

The **geslaR** package uses the [Apache Arrow][] framework to deal with
the [GESLA][] dataset in R.

In this tutorial, we will use the `download_gesla()` function to
download the full GESLA dataset, and show some basic data manipulation
with the **arrow** package and **dplyr** verbs.

The first time you load the **geslaR** package, it will automatically
load both the **arrow** and **dplyr** packages.
```r
library(geslaR)
#> Loading required package: arrow
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
```

To download the full GESLA dataset, one can simply use:
```r
download_gesla()
```

This will create a directory called `gesla_dataset` in the current
working directory (as given by `getwd()`) and download the full dataset
locally. The download may take some time (expect around 5 to 10 minutes,
depending on your internet connection). Note that the full dataset
requires at least 7 GB of disk storage, so make sure this is feasible.
Once downloaded, however, you will have access to the full dataset, and
you only need to do this once.

You will notice that the full dataset is composed of 5119 [Apache
Parquet][] files, ending in `.parquet`:
```r
## Number of downloaded files
length(list.files("gesla_dataset"))
#> [1] 5119
## Check the first files
head(list.files("gesla_dataset"))
#> [1] "a121-a12-nld-cmems.parquet" "a2-a2-bel-cmems.parquet"
#> [3] "aalesund-aal-nor-cmems.parquet" "aarhus-aar-dnk-cmems.parquet"
#> [5] "aasiaat-aas-grl-gloss.parquet" "abashiri-347a-jpn-uhslc.parquet"
```

These files are the same as those originally distributed in the GESLA
database: each one corresponds to the site where the data was collected.
To load the full dataset in R, use the `arrow::open_dataset()` function,
specifying the directory containing the `.parquet` files. Although there
are many files, this function recognizes them as a single dataset,
because they all share the same structure (or "Schema").
```r
## Open dataset
da <- open_dataset("gesla_dataset")
## Check the object
da
#> FileSystemDataset with 5119 Parquet files
#> date_time: timestamp[us]
#> year: int64
#> month: int64
#> day: int64
#> hour: int64
#> country: string
#> site_name: string
#> lat: double
#> lon: double
#> sea_level: double
#> qc_flag: int64
#> use_flag: int64
#> file_name: string
#>
#> See $metadata for additional Schema metadata
## Verify class
class(da)
#> [1] "FileSystemDataset" "Dataset" "ArrowObject"
#> [4] "R6"
```

Since this is an `ArrowObject`, the full dataset is not actually loaded
into memory (as it would be if it were a standard R object, such as a
`tibble` or `data.frame`). Some basic information, such as `dim()` and
`names()`, can nevertheless be retrieved simply with
```r
dim(da)
#> [1] 1172435674 13
names(da)
#> [1] "date_time" "year" "month" "day" "hour" "country"
#> [7] "site_name" "lat" "lon" "sea_level" "qc_flag" "use_flag"
#> [13] "file_name"
```
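
If you want to try the `open_dataset()` workflow without the 7 GB
download, a toy multi-file Parquet dataset can be written locally. This
is a minimal sketch; the directory and file names are arbitrary.

```r
library(arrow)

## Write two small Parquet files with the same schema into one directory
tmp <- file.path(tempdir(), "toy_dataset")
dir.create(tmp, showWarnings = FALSE)
write_parquet(data.frame(x = 1:3, y = c("a", "b", "c")),
              file.path(tmp, "part-1.parquet"))
write_parquet(data.frame(x = 4:6, y = c("d", "e", "f")),
              file.path(tmp, "part-2.parquet"))

## open_dataset() treats both files as a single dataset
toy <- open_dataset(tmp)
dim(toy)
#> [1] 6 2
```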

Any other manipulation of the dataset should be done using **dplyr**
verbs. For example, to count the number of observations by country, one
could use
```r
da |>
count(country)
#> FileSystemDataset (query)
#> country: string
#> n: int64
#>
#> See $.data for the source Arrow object
```

Note, however, that the output is just a query against the full dataset.
To actually perform the computation, use `dplyr::collect()`, which
returns the result as a standard `tibble`:
```r
da |>
count(country) |>
collect()
#> # A tibble: 113 × 2
#> country n
#> <chr> <int>
#> 1 BEL 1263467
#> 2 JPN 74580447
#> 3 NOR 28452201
#> 4 DNK 36726648
#> 5 USA 243838504
#> 6 GBR 76038844
#> 7 SLV 812230
#> 8 MEX 10755284
#> 9 CAN 67019312
#> 10 NLD 133699230
#> # ℹ 103 more rows
```
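
A related option, useful for intermediate results, is `dplyr::compute()`:
like `collect()`, it evaluates the query, but it keeps the result as an
Arrow Table rather than converting it to a `tibble`. A minimal sketch on
a small in-memory table (the behavior is the same on the full dataset):

```r
library(arrow)
library(dplyr)

tab <- arrow_table(data.frame(g = c("a", "a", "b"), v = 1:3))

## compute() evaluates the query but keeps the result in Arrow memory,
## so it can feed further Arrow queries without converting to a tibble
res <- tab |>
  count(g) |>
  compute()

class(res)
```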

This lazy evaluation is intentional: you can manipulate, filter, and
summarise the dataset while taking advantage of the Arrow in-memory
analytics engine, which keeps computations fast, and call
`dplyr::collect()` only when the final result is needed as an R object.
For example, we could calculate the mean sea level for Ireland per year
as
```r
da |>
filter(country == "IRL", use_flag == 1) |>
group_by(year) |>
summarise(mean = mean(sea_level)) |>
arrange(year) |>
collect()
#> # A tibble: 63 × 2
#> year mean
#> <int> <dbl>
#> 1 1958 3.11
#> 2 1959 3.12
#> 3 1960 3.16
#> 4 1961 3.16
#> 5 1962 3.09
#> 6 1963 3.13
#> 7 1964 3.13
#> 8 1965 3.11
#> 9 1966 3.15
#> 10 1967 3.14
#> # ℹ 53 more rows
```

Any other query can be made, as long as the **dplyr** verbs used are
supported by the **arrow** package. For example, we could ask for the
minimum, mean, and maximum sea level values for Ireland per year:
```r
da |>
filter(country == "IRL", use_flag == 1) |>
group_by(year) |>
summarise(
min = min(sea_level),
mean = mean(sea_level),
max = max(sea_level)) |>
collect()
#> # A tibble: 63 × 4
#> year min mean max
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 -4.09 0.0462 11.6
#> 2 2019 -5.80 0.0436 3.18
#> 3 2003 -1.13 -0.00629 0.9
#> 4 2004 -1.11 0.0391 4.07
#> 5 2005 -1.32 1.32 4.83
#> 6 2006 -2.47 0.692 4.94
#> 7 2007 -3.48 -0.0374 4.94
#> 8 2008 -3.41 -0.00481 4.85
#> 9 2009 -3.00 0.0108 2.94
#> 10 2010 -3.71 -0.0192 3.11
#> # ℹ 53 more rows
```

This same query could then be used to produce graphics with **ggplot2**,
for example. In this case, note that the call to `dplyr::collect()` is
mandatory before using **ggplot2** functions, as they only accept
standard R objects (such as `tibble` or `data.frame`).
```r
library(ggplot2)
da |>
filter(country == "IRL", use_flag == 1) |>
group_by(year) |>
summarise(
min = min(sea_level),
mean = mean(sea_level),
max = max(sea_level)) |>
collect() |>
tidyr::pivot_longer(cols = c(min, mean, max)) |>
ggplot(aes(x = year, y = value, colour = name)) +
geom_line() +
theme(legend.position = "top") +
labs(colour = "")
```
<img src="fig-gg-irl-1.png" width="90%" style="display: block; margin: auto;" />
[dplyr]: https://dplyr.tidyverse.org/index.html
[arrow]: https://arrow.apache.org/docs/r/index.html
[GESLA]: https://gesla787883612.wordpress.com
[Apache Arrow]: https://arrow.apache.org
[Apache Parquet]: https://parquet.apache.org
[Functions available in Arrow dplyr queries]: https://arrow.apache.org/docs/r/reference/acero.html
[Data analysis with dplyr syntax]: https://arrow.apache.org/docs/r/articles/data_wrangling.html