---
title: "Introduction to Apache Arrow framework"
output: rmarkdown::html_vignette
author: Fernando Mayer, Niamh Cahill
vignette: >
%\VignetteIndexEntry{Introduction to Apache Arrow framework}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## The Apache Arrow framework

The Apache Arrow framework is best described by its own website:
> Apache Arrow is a software development platform for building high
> performance applications that process and transport large data sets. It
> is designed to both improve the performance of analytical algorithms and
> the efficiency of moving data from one system or programming language to
> another.
>
> A critical component of Apache Arrow is its in-memory columnar format, a
> standardized, language-agnostic specification for representing
> structured, table-like datasets in-memory. This data format has a rich
> data type system (including nested and user-defined data types) designed
> to support the needs of analytic database systems, data frame libraries,
> and more.

In other words, the Apache Arrow framework was designed to deal with
large datasets (including those larger than the available memory), using
efficient in-memory analytics. Computations on "Arrow datasets" are
therefore very fast, making analyses feasible that would be slow or
impossible with standard in-memory R objects.

The Apache Arrow framework can be used from many different programming
languages, each with its own dedicated library. In R, the [arrow][]
package is available to load and manipulate Arrow datasets. Arrow
objects are manipulated through [dplyr][] verbs, which should feel
familiar to most users. Not all **dplyr** verbs are available for Arrow
datasets, but the vast majority of the most common ones have already
been "translated" to work with Arrow. A list of such functions can be
found in [Functions available in Arrow dplyr queries][]. A general
introduction to using **dplyr** verbs with Arrow can be found in [Data
analysis with dplyr syntax][].
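
For a first taste of this "translation", here is a minimal sketch using
a built-in data frame rather than GESLA data: a **dplyr** query run
against a small in-memory Arrow Table.

```r
library(arrow)
library(dplyr)

## Build a small in-memory Arrow Table from a built-in data frame
tab <- arrow_table(mtcars)

## dplyr verbs build a lazy query against the Arrow data;
## collect() materialises the result as a tibble
tab |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg)) |>
  collect()
```

The same pattern (a lazy query, followed by `collect()`) is what we will
use with the full GESLA dataset below.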

## Using Apache Arrow with geslaR

The **geslaR** package uses the [Apache Arrow][] framework to deal with
the [GESLA][] dataset in R.

In this tutorial, we will use the `download_gesla()` function to
download the full GESLA dataset, and show some basic data manipulation
with the **arrow** package and **dplyr** verbs.

The first time you load the **geslaR** package, it will automatically
load both the **arrow** and **dplyr** packages.
```r
library(geslaR)
#> Loading required package: arrow
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
```

To download the full GESLA dataset, one can simply use:
```r
download_gesla()
```

This will create a directory called `gesla_dataset` in the current
working directory (as given by `getwd()`) and download the full dataset
locally. The download may take some time (expect around 5 to 10 minutes,
depending on your internet connection). Note that the full dataset
requires at least 7 GB of disk storage, so make sure this is feasible.
Once downloaded, however, you will have access to the full dataset, and
you only need to do this once.

You will notice that the full dataset is composed of 5119 [Apache
Parquet][] files, ending in `.parquet`:
```r
## Number of downloaded files
length(list.files("gesla_dataset"))
#> [1] 5119
## Check the first files
head(list.files("gesla_dataset"))
#> [1] "a121-a12-nld-cmems.parquet" "a2-a2-bel-cmems.parquet"
#> [3] "aalesund-aal-nor-cmems.parquet" "aarhus-aar-dnk-cmems.parquet"
#> [5] "aasiaat-aas-grl-gloss.parquet" "abashiri-347a-jpn-uhslc.parquet"
```

These files are the same as those originally distributed in the GESLA
database: each one corresponds to the site where the data was collected.
To load the full dataset in R, use the `arrow::open_dataset()` function,
specifying the directory containing the `.parquet` files. Although there
are many files, this function recognizes them as a single dataset,
because they all share the same structure (or "Schema").
```r
## Open dataset
da <- open_dataset("gesla_dataset")
## Check the object
da
#> FileSystemDataset with 5119 Parquet files
#> date_time: timestamp[us]
#> year: int64
#> month: int64
#> day: int64
#> hour: int64
#> country: string
#> site_name: string
#> lat: double
#> lon: double
#> sea_level: double
#> qc_flag: int64
#> use_flag: int64
#> file_name: string
#>
#> See $metadata for additional Schema metadata
## Verify class
class(da)
#> [1] "FileSystemDataset" "Dataset" "ArrowObject"
#> [4] "R6"
```

Since this is an `ArrowObject`, the full dataset is not actually loaded
into memory (as it would be if it were a standard R object, such as a
`tibble` or `data.frame`). Some basic information, such as `dim()` and
`names()`, can nevertheless be retrieved simply with
```r
dim(da)
#> [1] 1172435674 13
names(da)
#> [1] "date_time" "year" "month" "day" "hour" "country"
#> [7] "site_name" "lat" "lon" "sea_level" "qc_flag" "use_flag"
#> [13] "file_name"
```
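
If you want to try the `open_dataset()` workflow without the 7 GB
download, a toy multi-file Parquet dataset can be written locally. This
is a minimal sketch; the directory and file names are arbitrary.

```r
library(arrow)

## Write two small Parquet files with the same schema into one directory
tmp <- file.path(tempdir(), "toy_dataset")
dir.create(tmp, showWarnings = FALSE)
write_parquet(data.frame(x = 1:3, y = c("a", "b", "c")),
              file.path(tmp, "part-1.parquet"))
write_parquet(data.frame(x = 4:6, y = c("d", "e", "f")),
              file.path(tmp, "part-2.parquet"))

## open_dataset() treats both files as a single dataset
toy <- open_dataset(tmp)
dim(toy)
#> [1] 6 2
```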

Any other manipulation of the dataset should be done using **dplyr**
verbs. For example, to count the number of observations by country, one
could use
```r
da |>
count(country)
#> FileSystemDataset (query)
#> country: string
#> n: int64
#>
#> See $.data for the source Arrow object
```

Note, however, that the output is just a query against the full dataset.
To actually perform the computation, use `dplyr::collect()`, which
returns the result as a standard `tibble`:
```r
da |>
count(country) |>
collect()
#> # A tibble: 113 × 2
#> country n
#> <chr> <int>
#> 1 BEL 1263467
#> 2 JPN 74580447
#> 3 NOR 28452201
#> 4 DNK 36726648
#> 5 USA 243838504
#> 6 GBR 76038844
#> 7 SLV 812230
#> 8 MEX 10755284
#> 9 CAN 67019312
#> 10 NLD 133699230
#> # ℹ 103 more rows
```
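
A related option, useful for intermediate results, is `dplyr::compute()`:
like `collect()`, it evaluates the query, but it keeps the result as an
Arrow Table rather than converting it to a `tibble`. A minimal sketch on
a small in-memory table (the behavior is the same on the full dataset):

```r
library(arrow)
library(dplyr)

tab <- arrow_table(data.frame(g = c("a", "a", "b"), v = 1:3))

## compute() evaluates the query but keeps the result in Arrow memory,
## so it can feed further Arrow queries without converting to a tibble
res <- tab |>
  count(g) |>
  compute()

class(res)
```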

This lazy evaluation is intentional: you can manipulate, filter, and
summarise the dataset while taking advantage of the Arrow in-memory
analytics engine, which keeps computations fast, and call
`dplyr::collect()` only when the final result is needed as an R object.
For example, we could calculate the mean sea level for Ireland per year
as
```r
da |>
filter(country == "IRL", use_flag == 1) |>
group_by(year) |>
summarise(mean = mean(sea_level)) |>
arrange(year) |>
collect()
#> # A tibble: 63 × 2
#> year mean
#> <int> <dbl>
#> 1 1958 3.11
#> 2 1959 3.12
#> 3 1960 3.16
#> 4 1961 3.16
#> 5 1962 3.09
#> 6 1963 3.13
#> 7 1964 3.13
#> 8 1965 3.11
#> 9 1966 3.15
#> 10 1967 3.14
#> # ℹ 53 more rows
```

Any other query can be made, as long as the **dplyr** verbs used are
supported by the **arrow** package. For example, we could ask for the
minimum, mean, and maximum sea level values for Ireland per year:
```r
da |>
filter(country == "IRL", use_flag == 1) |>
group_by(year) |>
summarise(
min = min(sea_level),
mean = mean(sea_level),
max = max(sea_level)) |>
collect()
#> # A tibble: 63 × 4
#> year min mean max
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 -4.09 0.0462 11.6
#> 2 2019 -5.80 0.0436 3.18
#> 3 2003 -1.13 -0.00629 0.9
#> 4 2004 -1.11 0.0391 4.07
#> 5 2005 -1.32 1.32 4.83
#> 6 2006 -2.47 0.692 4.94
#> 7 2007 -3.48 -0.0374 4.94
#> 8 2008 -3.41 -0.00481 4.85
#> 9 2009 -3.00 0.0108 2.94
#> 10 2010 -3.71 -0.0192 3.11
#> # ℹ 53 more rows
```

This same query could then be used to produce graphics with **ggplot2**,
for example. In this case, note that the call to `dplyr::collect()` is
mandatory before using **ggplot2** functions, as they only accept
standard R objects (such as `tibble` or `data.frame`).
```r
library(ggplot2)
da |>
filter(country == "IRL", use_flag == 1) |>
group_by(year) |>
summarise(
min = min(sea_level),
mean = mean(sea_level),
max = max(sea_level)) |>
collect() |>
tidyr::pivot_longer(cols = c(min, mean, max)) |>
ggplot(aes(x = year, y = value, colour = name)) +
geom_line() +
theme(legend.position = "top") +
labs(colour = "")
```
<img src="fig-gg-irl-1.png" width="90%" style="display: block; margin: auto;" />
[dplyr]: https://dplyr.tidyverse.org/index.html
[arrow]: https://arrow.apache.org/docs/r/index.html
[GESLA]: https://gesla787883612.wordpress.com
[Apache Arrow]: https://arrow.apache.org
[Apache Parquet]: https://parquet.apache.org
[Functions available in Arrow dplyr queries]: https://arrow.apache.org/docs/r/reference/acero.html
[Data analysis with dplyr syntax]: https://arrow.apache.org/docs/r/articles/data_wrangling.html