---
title: "Prepare data from textfiles"
author: "Reto Stauffer"
output:
  html_document:
    toc: true
    toc_float: true
    theme: flatly
bibliography: annex.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{annex: Annex86 Data Analysis Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteDepends{Formula}
  %\VignetteKeywords{Annex86}
  %\VignettePackage{annex}
---

```{r setup, include = FALSE}
suppressPackageStartupMessages(library('zoo'))
suppressPackageStartupMessages(library('Formula'))
# Make a copy when building the article
file <- system.file("demos/demo_Bedroom_config.TXT", package = "annex", mustWork = TRUE)
file.copy(file, "demo_Bedroom_config.TXT")
file <- system.file("demos/demo_Bedroom.txt", package = "annex", mustWork = TRUE)
file.copy(file, "demo_Bedroom.txt")
```

## Introduction

This article explains how to import and process data with the `annex` package
when the required data is available as tabular text files (e.g., CSV).

To demonstrate this, two files are used:
`r xfun::embed_file("demo_Bedroom.txt", text = "demo_Bedroom.txt")`
(containing the measurement data) and
`r xfun::embed_file("demo_Bedroom_config.TXT", text = "demo_Bedroom_config.TXT")`
(containing the configuration; see article [Config file](configfile.html)).


Both files can be read using base R functions, namely `read.table()` and its
wrapper functions such as `read.csv()` and `read.delim()`
(see `?read.table` for more details).
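
As a minimal sketch of how these functions relate, the following example reads
the same hypothetical CSV content (not the demo file) once with `read.csv()` and
once with `read.table()` using the defaults that `read.csv()` sets. The variable
names `csv_text`, `a`, and `b` are purely illustrative.

```r
# Hypothetical CSV content, not the demo file shipped with annex
csv_text <- "datetime,T,RH
2023-01-01 00:00:00,20.5,41
2023-01-01 00:10:00,20.4,42"

# read.csv() is read.table() with CSV-friendly defaults
a <- read.csv(text = csv_text)
b <- read.table(text = csv_text, header = TRUE, sep = ",",
                quote = "\"", dec = ".", fill = TRUE, comment.char = "")

# Both calls yield the same data.frame
identical(a, b)
```

If a file uses a different separator or decimal mark, calling `read.table()`
directly with the matching `sep` and `dec` arguments is usually the most
transparent option.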

## Reading the data

The first step is to import both (i) the measurement data (stored in `raw_df`)
and (ii) the configuration (stored in `config`):

```{r}
raw_df <- read.csv("demo_Bedroom.txt")
# see ?read.table for details
config <- read.table("demo_Bedroom_config.TXT",
                     comment.char = "#", sep = "",
                     header = TRUE, na.strings = c("NA", "empty"))

# Class and dimension of the objects
c("raw_df" = is.data.frame(raw_df), "config" = is.data.frame(config))
cbind("raw_df" = dim(raw_df), "config" = dim(config))
```
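
To illustrate what `comment.char` and `na.strings` do in the call above, here is
a small sketch on hypothetical, whitespace-separated content. The column names
and values in `cfg_text` are assumptions made up for this example and do not
reproduce the demo config file.

```r
# Hypothetical config-like content; column names are illustrative only
cfg_text <- "# comment lines are skipped
column  variable  unit
X       datetime  empty
T1      T         C
RH1     RH        NA"

# sep = "" splits on any whitespace; "NA" and "empty" both become missing values
cfg <- read.table(text = cfg_text, comment.char = "#", sep = "",
                  header = TRUE, na.strings = c("NA", "empty"))

cfg
is.na(cfg$unit)
```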

Both objects are of class `data.frame`
with dimensions $`r paste(dim(raw_df), collapse = " \\times ")`$ (`raw_df`)
and $`r paste(dim(config), collapse = " \\times ")`$ (`config`), respectively.

The first few observations (rows) of the two objects look as follows:

```{r}
head(raw_df[, 1:4], n = 3) # First four columns only
head(config, n = 3)
```

The object `raw_df` contains variables (columns) named
`r paste(sprintf("\"%s\"", names(raw_df)[1:4]), collapse = ", ")` which
are the original names from the text file. The `config` object defines
what the columns in `raw_df` contain and where they belong.
For more details read the article about the [Config file](configfile.html).

### Checking the config object

To check whether the `config` object is as expected by the `annex` package,
the function `annex_check_config()` can be used. If problems are
detected, an error is thrown (see [Config file](configfile.html)).
Otherwise the function is silent, as in this example:

```{r}
library("annex")
annex_check_config(config)
```

No errors are reported, so the `config` object meets the `annex` requirements.
Note that this step is _not_ necessary as it is performed automatically when
calling `annex_prepare()`, but it can be handy during development.


## Preparing data

While `raw_df` contains the raw data set, the `config` object
contains the information on how to rename the columns and where the
observations belong. `annex_prepare()` is a helper function
to prepare the data set for further steps.

```{r, error = TRUE}
prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
```

At this point we get an error as the variable containing the date and time
information is not a proper datetime object (object of class `POSIXt`) but
a character vector. As the information comes in a proper ISO 8601 format, we simply
convert the column (column `X` in `raw_df`) and call `annex_prepare()` again.

```{r}
# see ?as.POSIXct for details and options
raw_df <- transform(raw_df, X = as.POSIXct(X, tz = "UTC"))
class(raw_df$X)
prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
head(prepared_df)
```
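
If the timestamps in a text file are not in ISO format, `as.POSIXct()` needs an
explicit `format` argument. The following sketch uses hypothetical timestamps
(not taken from the demo file) in a day.month.year layout; adapt the format
string to the file at hand (see `?strptime` for the available conversion
specifications).

```r
# Hypothetical timestamps in day.month.year format (not from the demo file)
x <- c("31.01.2023 23:50", "01.02.2023 00:00")

# Without 'format', as.POSIXct() only tries ISO-like layouts and would fail
# for these strings; an explicit format string describes the layout
dt <- as.POSIXct(x, format = "%d.%m.%Y %H:%M", tz = "UTC")

class(dt)
diff(dt)
```

Checking the result for missing values (e.g., `anyNA(dt)`) is a cheap way to
spot rows where the format string did not match.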

`annex_prepare()` performs a series of tasks:

* checks the `config` object (calls `annex_check_config()` internally) and
  only continues if it is valid,
* renames the variables (columns) in `raw_df` and checks that they are of the
  correct class,
* informs the user about columns in `raw_df` that are not included in
  `config` (just a note) and about columns defined in `config` that
  do not occur in `raw_df`,
* ensures that `datetime` is a proper datetime object (`POSIXt`),
* returns the modified (possibly subsetted) object.

The checks of missing/additional definitions in `config` are intended to inform
the user about possible misspecifications and will not result in an error.
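
The idea behind these informational checks can be sketched with `setdiff()`.
This is only an illustration with hypothetical column names (`data_cols`,
`config_cols`), not the code used inside `annex_prepare()`.

```r
# Hypothetical column names; this is not the code used inside annex_prepare()
data_cols   <- c("X", "T1", "RH1", "battery")
config_cols <- c("X", "T1", "RH1", "CO2")

# Columns in the data without a definition in the config
not_in_config <- setdiff(data_cols, config_cols)
# Columns defined in the config but absent from the data
not_in_data <- setdiff(config_cols, data_cols)

# Inform the user instead of raising an error
if (length(not_in_config) > 0)
    message("Columns not in config: ", paste(not_in_config, collapse = ", "))
if (length(not_in_data) > 0)
    message("Defined in config but not in data: ", paste(not_in_data, collapse = ", "))
```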


## Next steps

After performing the data preparation,
the following steps can be performed:

* [Calculating statistics (analysis)](calculate_stats.html)
* [Visualizing the data (optional)](visualization.html)
* [Write annex stats to final file](write_and_validate.html#write)
* [Update meta data in final file](write_and_validate.html#meta)
* [Validate final file](write_and_validate.html#validate)


```{r, include = FALSE}
files <- c("demo_Bedroom_config.TXT", "demo_Bedroom.txt")
for (f in files) {
    if (file.exists(f)) file.remove(f)
}
```