---
title: "Read and clean data"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Read and clean data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
This vignette demonstrates how the functions included in this package can be used to read and clean different data formats.
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup, message=FALSE, results='hide'}
library(camtrapviz)
library(dplyr)
```
## Write data in tempfile
```{r}
# records and cameras in separate files ------------------------------------------
data(recordTableSample, package = "camtrapR")
data(camtraps, package = "camtrapR")
# Create subfolder
dir.create(paste0(tempdir(), "/csv"))
# Write files
recordfile <- paste0(tempdir(), "/csv/records.csv")
camtrapfile <- paste0(tempdir(), "/csv/camtraps.csv")
write.csv(recordTableSample, recordfile,
          row.names = FALSE)
write.csv(camtraps, camtrapfile,
          row.names = FALSE)
```
```{r}
# records and cameras in same file ------------------------------------------
# Create file
recordcam <- recordTableSample |>
  dplyr::left_join(camtraps, by = "Station")
# Create subfolder
dir.create(paste0(tempdir(), "/csvcam"))
# Write file
recordcamfile <- paste0(tempdir(), "/csvcam/recordcam.csv")
write.csv(recordcam, recordcamfile,
          row.names = FALSE)
```
## Records and cameras in separate csv files
First, we see how data import and cleaning are performed with two csv files (one for records, one for cameras):
### Read data
```{r}
dat <- read_data(path_rec = recordfile,
                 path_cam = camtrapfile,
                 sep_rec = ",", sep_cam = ",")
```
```{r}
head(dat$data$observations) |>
  knitr::kable()
head(dat$data$deployments) |>
  knitr::kable()
```
The imported object is a list with one component, `$data`, which contains two dataframes:
+ `$observations` contains the records
+ `$deployments` contains the camera information
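To make this structure concrete, here is a minimal mock of such an object built by hand. The column names and values are invented for illustration; only the nesting (`$data$observations` and `$data$deployments`) mirrors what is described above, and the real object returned by `read_data` may carry more columns.

```r
# Mock object with the same nesting as described above (all values are invented)
mock <- list(
  data = list(
    observations = data.frame(Station = c("StationA", "StationB"),
                              Species = c("species1", "species2")),
    deployments = data.frame(Station = c("StationA", "StationB"),
                             utm_x = c(1000, 2000))
  )
)

names(mock)       # top-level component
names(mock$data)  # the two tables
str(mock, max.level = 2)
```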
### Clean data
This step casts the specified columns to the desired type and moves them to the beginning of the table.
To cast data to the appropriate type, `clean_data` takes two arguments, created below: `rec_type` (for the records table) and `cam_type` (for the cameras table).
```{r}
rec_type <- list(Station = "as.character",
                 Date = list("as_date",
                             format = "%Y-%m-%d"),
                 Time = "times",
                 DateTimeOriginal = list("as.POSIXct",
                                         tz = "Etc/GMT-8"))
cam_type <- list(Station = "as.character",
                 Setup_date = list("as.Date",
                                   format = "%d/%m/%Y"),
                 Retrieval_date = list("as.Date",
                                       format = "%d/%m/%Y"))
```
These lists describe how to convert column types.
+ The values give the casting function to apply (e.g. `"as.Date"` will translate to `as.Date(x)`). Values can also be lists to provide additional arguments: for instance, `list("as.Date", format = "%d/%m/%Y")` will translate to `as.Date(x, format = "%d/%m/%Y")`.
+ The names of the list give the corresponding columns of the data that should be cast.
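The translation from a specification to a function call can be sketched with `do.call()` in base R. The helper `apply_cast()` below is hypothetical and only illustrates the idea; it is not the implementation used by `clean_data`, and the input values are made up.

```r
# Hypothetical helper (not part of camtrapviz): apply one casting
# specification to a vector x.
apply_cast <- function(x, spec) {
  if (is.list(spec)) {
    fun <- spec[[1]]   # first element: name of the casting function
    args <- spec[-1]   # remaining elements: extra named arguments
  } else {
    fun <- spec        # a plain string: function name without extra arguments
    args <- list()
  }
  # Equivalent to calling fun(x, <extra arguments>)
  do.call(fun, c(list(x), args))
}

# list("as.Date", format = "%d/%m/%Y") becomes as.Date(x, format = "%d/%m/%Y")
dates <- apply_cast(c("01/04/2009", "15/05/2009"),
                    list("as.Date", format = "%d/%m/%Y"))
class(dates)

# "as.character" becomes as.character(x)
ids <- apply_cast(c(1, 2), "as.character")
class(ids)
```

Storing the function name as a string keeps the specification a plain list that is easy to write by hand, at the cost of the function having to be found by name when the cast is applied.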
```{r}
dat_clean <- clean_data(dat,
                        rec_type = rec_type,
                        cam_type = cam_type)
```
```{r}
head(dat_clean$data$observations) |>
  knitr::kable()
head(dat_clean$data$deployments) |>
  knitr::kable()
```
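A note on the `tz = "Etc/GMT-8"` value used in `rec_type`: in the `Etc/` area of the time zone database the sign is inverted relative to the usual notation, so `Etc/GMT-8` is 8 hours ahead of UTC. The sketch below checks this with base R on a made-up timestamp; it assumes the system's time zone database includes the `Etc/` zones, which is normally the case.

```r
# Made-up timestamp, interpreted as clock time in the Etc/GMT-8 zone
x <- as.POSIXct("2009-04-21 00:40:00", tz = "Etc/GMT-8")

# The same instant displayed in UTC: the clock reads 8 hours earlier
in_utc <- format(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC", usetz = TRUE)
in_utc
```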
If the cameras in the records and in the cameras file do not match, `clean_data` has an option to keep only the shared cameras. We create a new dataset where the observations table has stations A, B and C and the deployments table has stations B, C and D:
```{r}
# Initialize new data
dat_diffcam <- dat
# Replace a camera in deployments
newcam <- dat_diffcam$data$deployments[1, ]
newcam$Station <- "StationD"
dat_diffcam$data$deployments <- dat_diffcam$data$deployments |>
  filter(Station != "StationA") |>
  bind_rows(newcam)
unique(dat_diffcam$data$observations$Station)
unique(dat_diffcam$data$deployments$Station)
```
Cleaning the data with `only_shared_cam = TRUE` keeps only the data for cameras that are common to the two tables (B and C):
```{r}
dat_diffcam_clean <- clean_data(dat_diffcam,
                                rec_type = rec_type,
                                cam_type = cam_type,
                                cam_col_dfrec = "Station",
                                cam_col_dfcam = "Station",
                                only_shared_cam = TRUE)
unique(dat_diffcam_clean$data$observations$Station)
unique(dat_diffcam_clean$data$deployments$Station)
```
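The effect of keeping only shared cameras can be pictured as an intersection of the station IDs of the two tables. The base R sketch below uses small made-up tables and is only an illustration of the idea; it is not how `clean_data` is implemented.

```r
# Made-up tables: observations cover stations A, B, C; deployments cover B, C, D
obs <- data.frame(Station = c("StationA", "StationB", "StationC", "StationB"))
dep <- data.frame(Station = c("StationB", "StationC", "StationD"))

# Stations present in both tables
shared <- intersect(obs$Station, dep$Station)

# Keep only the rows whose station is shared
obs_shared <- obs[obs$Station %in% shared, , drop = FALSE]
dep_shared <- dep[dep$Station %in% shared, , drop = FALSE]

unique(obs_shared$Station)
unique(dep_shared$Station)
```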
## Records and cameras in the same csv (1 csv file)
Next, we see how data import and cleaning are performed with a single csv file containing both records and camera information:
### Read data
```{r}
dat <- read_data(path_rec = recordcamfile,
                 sep_rec = ",")
```
```{r}
head(dat$data$observations) |>
  knitr::kable()
head(dat$data$deployments) |>
  knitr::kable()
```
Again, the imported object is a list with one component `$data`:
+ `$data$observations` contains both the camera and the record information
+ `$data$deployments` is `NULL` (because only one file was imported)
### Clean data
In this step, `clean_data` splits the information from the observations table between observations and deployments. To do this, it moves all columns listed in `cam_cols` to the deployments table. The column containing camera IDs must be specified in the `cam_col_dfrec` argument (so that this column is kept in the observations table).
Since all columns are initially in the observations dataframe, the casting specifications should all be given in the `rec_type` argument.
```{r}
cam_cols <- c("Station", "Setup_date", "Retrieval_date",
              "utm_y", "utm_x", "Problem1_from", "Problem1_to")
rec_type2 <- list(Station = "as.character",
                  Date = list("as_date",
                              format = "%Y-%m-%d"),
                  Time = "times",
                  DateTimeOriginal = list("as.POSIXct",
                                          tz = "Etc/GMT-8"),
                  Setup_date = list("as.Date",
                                    format = "%d/%m/%Y"),
                  Retrieval_date = list("as.Date",
                                        format = "%d/%m/%Y"),
                  Problem1_from = list("as.Date",
                                       format = "%d/%m/%Y"),
                  Problem1_to = list("as.Date",
                                     format = "%d/%m/%Y"))
```
```{r}
dat_clean <- clean_data(dat,
                        rec_type = rec_type2,
                        cam_col_dfrec = "Station",
                        cam_cols = cam_cols,
                        split = TRUE)
```
```{r}
head(dat_clean$data$observations) |>
  knitr::kable()
head(dat_clean$data$deployments) |>
  knitr::kable()
```
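The split described in this section can be pictured with a few lines of base R. This is a sketch under assumptions: the combined table and its values are made up, and reducing the deployments table to one row per distinct combination of camera columns is our guess at a sensible behaviour, not a statement about what `clean_data` does internally.

```r
# Made-up combined table: camera columns are repeated on every record
combined <- data.frame(
  Station = c("StationA", "StationA", "StationB"),
  Species = c("species1", "species2", "species1"),
  utm_x = c(1000, 1000, 2000),
  utm_y = c(5000, 5000, 6000)
)

cam_cols <- c("Station", "utm_x", "utm_y")  # columns describing the cameras
cam_id <- "Station"                         # camera ID column kept in both tables

# Deployments: camera columns only, one row per distinct camera (assumption)
deployments <- unique(combined[, cam_cols, drop = FALSE])

# Observations: camera ID plus every column that is not a camera column
rec_cols <- union(cam_id, setdiff(names(combined), cam_cols))
observations <- combined[, rec_cols, drop = FALSE]

names(observations)
names(deployments)
```

Keeping the camera ID column in both tables is what allows the observations to be joined back to their deployment afterwards.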
## CamtrapDP format (json file)
Finally, we see how data import and cleaning are performed with a dataset in [camtrapDP](https://tdwg.github.io/camtrap-dp/) format.
### Read data
The `read_data` function can also read json files corresponding to the camtrapDP datapackage.
```{r}
camtrap_dp_file <- system.file(
  "extdata", "mica", "datapackage.json",
  package = "camtraptor"
)
dat <- read_data(path_rec = camtrap_dp_file)
# dat <- read_data(path_rec = "https://raw.githubusercontent.com/tdwg/camtrap-dp/main/example/datapackage.json")
```
Internally, we use the function `read_camtrap_dp` from the `camtraptor` package (here, calling this function directly would give the same result).
The imported object is a `list` with several slots, and the observations and deployments info are in the `$data` slot.
```{r}
class(dat)
names(dat)
head(dat$data$observations) |>
  knitr::kable()
head(dat$data$deployments) |>
  knitr::kable()
```
### Clean data
Here, the data follows the camtrapDP standard and does not need cleaning. However, for this demonstration we convert the `timestamp` column to character:
```{r}
dat$data$observations$timestamp <- as.character(dat$data$observations$timestamp)
class(dat$data$observations$timestamp)
```
```{r}
rec_type <- list(timestamp = list("as.POSIXct",
                                  tz = "UTC"))
dat_clean <- clean_data(dat,
                        rec_type = rec_type)
```
In the cleaned data, `timestamp` is converted back to `POSIXct`:
```{r}
class(dat_clean$data$observations$timestamp)
```
The timezone is UTC, as we specified in the casting function:
```{r}
attr(dat_clean$data$observations$timestamp, "tzone")
```
Otherwise, the data is unchanged.
```{r}
head(dat_clean$data$observations) |>
  knitr::kable()
head(dat_clean$data$deployments) |>
  knitr::kable()
```
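The timestamp round trip shown in this section can be reproduced with base R alone, which may help when checking a casting specification before passing it to `clean_data`. The timestamps below are made up for illustration.

```r
# Made-up timestamps stored as character, as in the demonstration above
ts_chr <- c("2020-07-29 05:29:41", "2020-07-30 12:00:00")

# Same call that the specification list("as.POSIXct", tz = "UTC") describes
ts <- as.POSIXct(ts_chr, tz = "UTC")

class(ts)          # includes "POSIXct"
attr(ts, "tzone")  # time zone stored on the vector
```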