vignettes/read-and-clean-data.Rmd

---
title: "Read and clean data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Read and clean data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette demonstrates how the functions included in this package can be used to read and clean different data formats.

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE, results='hide'}
library(camtrapviz)
library(dplyr)
```

## Write data in tempfile

```{r}
# records and cameras in separate files ------------------------------------------
data(recordTableSample, package = "camtrapR")
data(camtraps, package = "camtrapR")

# Create subfolder
dir.create(paste0(tempdir(), "/csv"))

# Write files
recordfile <- paste0(tempdir(), "/csv/records.csv")
camtrapfile <- paste0(tempdir(), "/csv/camtraps.csv")

write.csv(recordTableSample, recordfile, 
          row.names = FALSE)
write.csv(camtraps, camtrapfile, 
          row.names = FALSE)
```


```{r}
# records and cameras in same file ------------------------------------------
# Create file
recordcam <- recordTableSample |>
  dplyr::left_join(camtraps, by = "Station")

# Create subfolder
dir.create(paste0(tempdir(), "/csvcam"))

# Write file
recordcamfile <- paste0(tempdir(), "/csvcam/recordcam.csv")
write.csv(recordcam, recordcamfile, 
          row.names = FALSE)
```


## Records and cameras in separate csv files

First, we see how data import and cleaning is performed with two csv files (records and cameras):

### Read data

```{r}
dat <- read_data(path_rec = recordfile,
                 path_cam = camtrapfile,
                 sep_rec = ",", sep_cam = ",")
```

```{r}
head(dat$data$observations) |> 
  knitr::kable()
head(dat$data$deployments) |> 
  knitr::kable()
```

The imported file is a list with one component `$data` containing 2 dataframes:

+ `$observations` contains the records
+ `$deployments` contains the cameras information

### Clean data

This step ensures all columns have the desired type. It will also move these columns to the beginning of the table.

To cast data to the appropriate type, this function has two arguments, created below: `rec_type` (for the records table) and `cam_type` (for the cameras table).

```{r}
rec_type <- list(Station = "as.character",
                 Date = list("as_date",
                             format = "%Y-%m-%d"),
                 Time = "times",
                 DateTimeOriginal = list("as.POSIXct",
                                         tz = "Etc/GMT-8"))

cam_type <- list(Station = "as.character",
                 Setup_date = list("as.Date",
                                   format = "%d/%m/%Y"), 
                 Retrieval_date = list("as.Date",
                                       format = "%d/%m/%Y"))
```

These lists contain the information about how to convert column types.

+ Values contain the casting function to apply (e.g. `"as.Date"` will translate to `as.Date(x)`). Values cal also be lists to provide additional arguments: for instance, `list("as.Date", format = "%d/%m/%Y")` will translate to `as.Date(x,  format = "%d/%m/%Y")`.
+ the names of the list give the corresponding column of the data that should be casted.

```{r}
dat_clean <- clean_data(dat, 
                        rec_type = rec_type,
                        cam_type = cam_type)
```


```{r}
head(dat_clean$data$observations) |> 
  knitr::kable()
head(dat_clean$data$deployments) |> 
  knitr::kable()
```

In case cameras in records and in the cameras file do not match, `clean_data` has an option allowing to keep only shared cameras. We create a new dataset where the observations table has Stations A, B and C and the deployments table has stations B, C and D:


```{r}
# Initialize new data
dat_diffcam <- dat

# Replace a camera in deployments
newcam <- dat_diffcam$data$deployments[1, ]
newcam$Station <- "StationD"

dat_diffcam$data$deployments <- dat_diffcam$data$deployments |> 
  filter(Station != "StationA") |> 
  bind_rows(newcam)

unique(dat_diffcam$data$observations$Station)
unique(dat_diffcam$data$deployments$Station)
```

Cleaning the data will keep only data with cameras that are common between the two datasets (B and C);

```{r}
dat_diffcam_clean <- clean_data(dat_diffcam, 
                                rec_type = rec_type,
                                cam_type = cam_type,
                                cam_col_dfrec = "Station",
                                cam_col_dfcam = "Station", 
                                only_shared_cam = TRUE)

unique(dat_diffcam_clean$data$observations$Station)
unique(dat_diffcam_clean$data$deployments$Station)
```


## Records and cameras in the same csv (1 csv file)

Then, we see how data import and cleaning is performed with a unique csv file containing records and cameras information:

### Read data

```{r}
dat <- read_data(path_rec = recordcamfile,
                 sep_rec = ",")
```

```{r}
head(dat$data$observations) |> 
  knitr::kable()
head(dat$data$deployments) |> 
  knitr::kable()
```

Again the imported file is a list with one component `$data`:

+ `$data$observations` contains the cameras and records information
+ `$data$deployments` is `NULL` (because only one file was imported)

### Clean data

In this step, will split the information from the observations table between observations and deployments. To do this, `clean_data` will move all columns listed in `cam_cols` in the deployments table. The column containing cameras IDs must be specified in the `cam_col_dfrec` argument (so that this column is kept in the observations table).

Since at the beginning, all columns are in the observations dataframe, the casting specifications should be in the `rec_type` argument.

```{r}
cam_cols <- c("Station", "Setup_date", "Retrieval_date", 
              "utm_y", "utm_x", "Problem1_from", "Problem1_to")

rec_type2 <- list(Station = "as.character",
                  Date = list("as_date",
                             format = "%Y-%m-%d"),
                  Time = "times",
                  DateTimeOriginal = list("as.POSIXct",
                                         tz = "Etc/GMT-8"),
                  Setup_date = list("as.Date",
                                    format = "%d/%m/%Y"), 
                  Retrieval_date = list("as.Date",
                                        format = "%d/%m/%Y"),
                  Problem1_from = list("as.Date",
                                       format = "%d/%m/%Y"),
                  Problem1_to = list("as.Date",
                                     format = "%d/%m/%Y"))
```


```{r}
dat_clean <- clean_data(dat, 
                        rec_type = rec_type2,
                        cam_col_dfrec = "Station",
                        cam_cols = cam_cols,
                        split = TRUE)
```

```{r}
head(dat_clean$data$observations) |> 
  knitr::kable()
head(dat_clean$data$deployments) |> 
  knitr::kable()
```

## CamtrapDP format (json file)

Then, we see how data import and cleaning is performed with a dataset in [camtrapDP](https://tdwg.github.io/camtrap-dp/) format.

### Read data

The `read_data` function can also read json files corresponding to the camtrapDP datapackage.

```{r}
camtrap_dp_file <- system.file(
  "extdata", "mica", "datapackage.json", 
  package = "camtraptor"
)
dat <- read_data(path_rec = camtrap_dp_file)

# dat <- read_data(path_rec = "https://raw.githubusercontent.com/tdwg/camtrap-dp/main/example/datapackage.json")
```

Internally, we use the function `read_camtrap_dp` from the `camtraptor` package (here, it would give the same result to use use directly this function). 

The imported object is a `list` with several slots, and the observations and deployments info are in the `$data` slot.

```{r}
class(dat)
names(dat)

head(dat$data$observations) |> 
  knitr::kable()
head(dat$data$deployments) |> 
  knitr::kable()
```

### Clean data

Here, the data follows the camtrapDP standard and does not need cleaning. However, for this demonstration we change the time stamp type to character:

```{r}
dat$data$observations$timestamp <- as.character(dat$data$observations$timestamp)

class(dat$data$observations$timestamp)
```


```{r}
rec_type <- list(timestamp = list("as.POSIXct",
                                  tz = "UTC"))

dat_clean <- clean_data(dat, 
                        rec_type = rec_type)
```

In the cleaned data, `timestamp` is converted back to POSIX:
```{r}
class(dat_clean$data$observations$timestamp)
```

The timezone is UTC, as we specified in the casting function:
```{r}
attr(dat_clean$data$observations$timestamp, "tzone")
```

Else, the data is unchanged.
```{r}
head(dat_clean$data$observations) |> 
  knitr::kable()
head(dat_clean$data$deployments) |> 
  knitr::kable()
```