---
title: "Introduction to MazamaLocationUtils"
author: "Jonathan Callahan"
date: "2023-10-24"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to MazamaLocationUtils}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width = 7, fig.height = 5)
```
# Background
This package is intended for use in data management activities
associated with fixed locations in space. The motivating fields include
air and water quality monitoring where fixed sensors report at regular time
intervals.
When working with environmental monitoring time series, one of the first things
you have to do is create unique identifiers for each individual time series. In
an ideal world, each environmental time series would have both a
`locationID` and a `deviceID` that uniquely identify the specific instrument
making measurements and the physical location where measurements are made. A
unique `timeseriesID` could be produced as `locationID_deviceID` (see the
sketch after the list below). Metadata associated with each
`timeseriesID` would contain basic information needed for downstream analysis
including at least:
`timeseriesID, locationID, deviceID, longitude, latitude, ...`
* An extended time series for a mobile sensor would group by `deviceID`.
* Multiple sensors placed at a single location could be grouped by `locationID`.
* Maps would be created using `longitude, latitude`.
* Time series measurements would be accessed from a secondary `data` table with
`timeseriesID` column names.
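As a minimal sketch of this naming scheme (the `deviceID` below is hypothetical;
`location_createID()` is this package's coordinate-based ID generator):
```{r timeseriesID_sketch, eval = FALSE, echo = TRUE}
library(MazamaLocationUtils)
# Generate a spatial locationID from coordinates
locationID <- location_createID(longitude = -122.33, latitude = 47.61)
# Combine with a hypothetical instrument identifier
deviceID <- "sensor_0042"
timeseriesID <- paste(locationID, deviceID, sep = "_")
```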
Unfortunately, we are rarely supplied with a truly unique and truly spatial
`locationID`. Instead, we often use `deviceID` or an associated non-spatial
identifier as a stand-in for `locationID`.
Complications we have seen include:
* GPS-reported longitude and latitude can have _jitter_ in the fourth or fifth
decimal place, making it challenging to use them to create a unique
`locationID`. (See the sketch after this list.)
* Sensors are sometimes _repositioned_ in what the scientist considers the "same
location".
* Data from a single sensor goes through different processing pipelines using
different identifiers and is later brought together as two separate time series.
* The spatial scale of what constitutes a "single location" depends on the
instrumentation and scientific question being asked.
* Deriving location-based metadata from spatial datasets is computationally
intensive unless the results are saved and associated with a unique `locationID`.
* Automated searches for spatial metadata occasionally produce incorrect results
because of the finite resolution of spatial datasets and must be corrected
by hand.
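To see why GPS jitter makes raw coordinates poor identifiers, here is a minimal
sketch (the coordinates are made up):
```{r jitter_sketch, eval = FALSE, echo = TRUE}
# Two reports from the same stationary sensor, jittered in the 5th decimal place
lon1 <- -122.33061; lat1 <- 47.60987
lon2 <- -122.33063; lat2 <- 47.60985
# A naive coordinate-based ID treats these as two different locations
paste0(lon1, "_", lat1) == paste0(lon2, "_", lat2)
```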
# Functionality
A solution to all these problems is possible if we store spatial metadata in
simple tables in a standard directory. These tables will be referred to as
_collections_. Location lookups can be performed with
geodesic distance calculations where a longitude-latitude pair is assigned to a pre-existing
_known location_ if it is within `distanceThreshold` meters of that location.
These lookups will be extremely fast.
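Conceptually, each lookup is just a geodesic distance test against every known
location. Here is a minimal sketch using the **geodist** package (the
`knownLocations` table and `distanceThreshold` value are assumed to exist):
```{r lookup_sketch, eval = FALSE, echo = TRUE}
library(geodist)
# A single incoming longitude-latitude pair
target <- data.frame(longitude = -122.33, latitude = 47.61)
# Geodesic distances (in meters) from the target to every known location
distances <- geodist::geodist(target, knownLocations, measure = "geodesic")
# Assign the target to a known location only if one is within the threshold
matchIndex <- which(distances <= distanceThreshold)
```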
If no existing _known location_ is found, the relatively slow (seconds)
creation of a new _known location_ metadata record can be performed and the
result added to the growing collection.
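For example, a single new location can be added with `table_addSingleLocation()`
(a sketch, assuming spatial data have been initialized as described in
"Standard Setup" below):
```{r add_single_sketch, eval = FALSE, echo = TRUE}
# Slow: performs spatial queries to build the new metadata record
locationTbl <- table_addSingleLocation(
  locationTbl,
  longitude = -122.33,
  latitude = 47.61,
  distanceThreshold = 500
)
```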
For collections of stationary environmental monitors that number only in the
thousands, an entire _collection_ can be stored as either an
`.rda` or a `.csv` file and will be under a megabyte in size, making it fast to
load. This small size also makes it possible to store multiple _known locations_
files, each created with different locations and different distance thresholds
to address the needs of different scientific studies.
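As a sketch of this workflow (the collection names here are hypothetical; files
are written to the directory set with `setLocationDataDir()`):
```{r save_load_sketch, eval = FALSE, echo = TRUE}
# Save two collections built with different distance thresholds
table_save(monitors_100, collectionName = "monitors_100")
table_save(monitors_1000, collectionName = "monitors_1000")
# Later, reload a collection by name
monitors_100 <- table_load("monitors_100")
```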
# Example Usage
The package comes with some example _known locations_ tables.
Let's take some metadata we have for air quality monitors in Washington state and
create a _known locations_ table for them.
```{r load_data}
wa <- get(data("wa_airfire_meta", package = "MazamaLocationUtils"))
names(wa)
```
```{r load_data_hidden, eval = TRUE, echo = FALSE}
library(MazamaLocationUtils)
wa_monitors_500 <-
get(data("wa_monitors_500", package = "MazamaLocationUtils")) %>%
dplyr::mutate(elevation = as.numeric(NA))
```
## Creating a Known Locations table
We can create a _known locations_ table for them with a minimum 500 meter
separation between distinct locations. _(NOTE: This will take some time to
perform all the spatial queries.)_
To speed things up, we call `table_addLocation()` with the defaults
`elevationService = NULL` and `addressService = NULL`. This skips the slow
web service requests and leaves `NA` in those columns.
```{r create_table, eval = FALSE, echo = TRUE}
library(MazamaLocationUtils)
# Initialize with standard directories
initializeMazamaSpatialUtils()
setLocationDataDir("./data")
wa_monitors_500 <-
table_initialize() %>%
table_addLocation(wa$longitude, wa$latitude, distanceThreshold = 500)
```
At this point, our _known locations_ table contains only automatically generated
spatial metadata.
```{r basic_columns}
dplyr::glimpse(wa_monitors_500, width = 75)
```
## Merging external metadata
Perhaps we would like to import some of the original metadata into our new
table. This is a very common use case where non-spatial metadata like uniform
identifiers or owner information for a monitor can be added.
Just to make it interesting, let's assume that our _known locations_ table is
already large and we are only providing additional metadata for a subset of the
records.
```{r import_columns}
# Use a subset of the wa metadata
wa_indices <- seq(5, 65, 5)
wa_sub <- wa[wa_indices, ]
# Use a generic name for the location table
locationTbl <- wa_monitors_500
# Find the location IDs associated with our subset
locationID <- table_getLocationID(
locationTbl,
longitude = wa_sub$longitude,
latitude = wa_sub$latitude,
distanceThreshold = 500
)
# Now add the "AQSID" column for our subset of locations
locationData <- wa_sub$AQSID
locationTbl <- table_updateColumn(
locationTbl,
columnName = "AQSID",
locationID = locationID,
locationData = locationData
)
# Let's see how we did
locationTbl_indices <- table_getRecordIndex(locationTbl, locationID)
locationTbl[locationTbl_indices, c("city", "AQSID")]
```
Very nice. We have added `AQSID` to our known locations table for a more
detailed description of each monitor's location.
## Finding known locations
The whole point of a known locations table is to speed up access to spatial
and other metadata. Here's how we can use it with a set of longitudes and
latitudes that are not currently in our table.
```{r new_locations}
# Create new locations near our known locations
lons <- jitter(wa_sub$longitude)
lats <- jitter(wa_sub$latitude)
# Any known locations within 50 meters?
table_getNearestLocation(
wa_monitors_500,
longitude = lons,
latitude = lats,
distanceThreshold = 50
) %>% dplyr::pull(city)
# Any known locations within 250 meters?
table_getNearestLocation(
wa_monitors_500,
longitude = lons,
latitude = lats,
distanceThreshold = 250
) %>% dplyr::pull(city)
# How about 5000 meters?
table_getNearestLocation(
wa_monitors_500,
longitude = lons,
latitude = lats,
distanceThreshold = 5000
) %>% dplyr::pull(city)
```
# Standard Setup
Before using **MazamaLocationUtils** you must first install
**MazamaSpatialUtils** and then install core spatial data with:
```{r MSU_setup, echo = TRUE, eval = FALSE}
library(MazamaSpatialUtils)
setSpatialDataDir("~/Data/Spatial")
installSpatialData("EEZCountries")
installSpatialData("OSMTimezones")
installSpatialData("NaturalEarthAdm1")
installSpatialData("USCensusCounties")
```
The `initializeMazamaSpatialUtils()` function by default assumes spatial data are
installed in the standard location and is just a wrapper for:
```{r standard_setup, echo = TRUE, eval = FALSE}
MazamaSpatialUtils::setSpatialDataDir("~/Data/Spatial")
MazamaSpatialUtils::loadSpatialData("EEZCountries.rda")
MazamaSpatialUtils::loadSpatialData("OSMTimezones.rda")
MazamaSpatialUtils::loadSpatialData("NaturalEarthAdm1.rda")
MazamaSpatialUtils::loadSpatialData("USCensusCounties.rda")
```
Once the required datasets have been installed, the easiest way to set things
up each session is with:
```{r easy_setup, echo = TRUE, eval = FALSE}
library(MazamaLocationUtils)
initializeMazamaSpatialUtils()
setLocationDataDir("~/Data/KnownLocations")
```
Every time you `table_save()` your location table, a backup of any existing
version is created, so you can experiment without losing your work. File sizes
are tiny, so you don't have to worry about filling up your disk.
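For example (a sketch; the collection name is arbitrary):
```{r table_save_sketch, eval = FALSE, echo = TRUE}
# Saving over an existing file creates a backup of the previous version
table_save(wa_monitors_500, collectionName = "wa_monitors_500")
```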
----
Best wishes for well-organized spatial metadata!