04-reference-data.Rmd

---
title: "Reference data"
author: "Kamil Foltyński"
date: "11/04/2019"
output:
  html_document:
    fig_caption: yes
    keep_md: no
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: yes
runtime: shiny
---

```{r, echo=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = ">")
```

# What is reference data? And why should we care?

Along with development process, code changes, but its behaviour and results should not - 
and if they must change, let them change into a way we control. 

Reference data is used as input for unit tests. It's frozen, so it ensures **reproducibility** of results.
During development process **code changes**, but thanks to **reproducibility** it is possible to run 
code (with reference data as input) and **always get same, unchanged results**. However if results vary, then
it's a sign something went wrong and code is probably broken.

# How to prepare reference data?

* `dput` - the easiest way to prepare reference data is to use `dput` function. It prints object's structure to terminal, so
it is just about copy-pasting from there to our test script, see example:

```{r}
library(dplyr)
result <- band_members %>% inner_join(band_instruments)
dput(result)
```

so unit test can be witten as:

```{r}
library(dplyr)
ref_data <- structure(
  list(
    name = c("John", "Paul"),
    band = c("Beatles",
             "Beatles"),
    plays = c("guitar", "bass")
  ),
  row.names = c(NA, -2L),
  class = c("tbl_df", "tbl", "data.frame")
)

testthat::expect_equal(band_members %>% inner_join(band_instruments), ref_data)
testthat::expect_identical(band_members %>% inner_join(band_instruments), ref_data)
```
  
  BTW: `dput` is also recommended way to provide Minimal Working Example (MWE) when 
  asking on [stackoverflow.com](https://stackoverflow.com).
  

* `.rds`/`.RData` - `dput` is useful form small _handy_ data, but imagine reference `data.frame` counting thousands of rows
and pasted into test script...


  <br>
  <details>
  <summary>Click to expand</summary>
  ```{r}
  dput(iris)
  ```
  </details>
  <br>

  In such cases it's worth keeping reference data in R formats such as `.rds`/`.RData`. 
  
  But when we work with R package test data should always go with package, because when installing R package 
  on multiple environments we can run unit tests to see whether everything works properly or not. 
  **So where test data should be stored**? Every package is installed together with `inst` directory. 
  The recommended location for test data is `inst/testdata` directory in R package. 
  
  To get path to reference data `system.file` functionshould be used:
  
```{r,eval=FALSE}
ref_data <- system.file(file.path("testdata", "test_data.rds"), package = "SpendWorx")
testthat::expect_equal(functionTotest(), ref_data)
```
  
* `snapshot`/`screenshot` - sometimes we want to would like to test output from plotting function. 
The simplest way to do that is comparing screenshots of plot:

```{r,eval=FALSE}
testthat::context("Testing plot(s) output")

testthat::test_that("plotly renders correctly", {
library(plotly)

target.path <- file.path("inst", "testdata", "plot1.png") 

p <- economics %>% plot_ly(x = ~date, y = ~unemploy/pop) %>% add_lines()

# compareImages function already contains testing function testthat::expect_lte
unitTestsWorkshopPackage::compareImages(takeScreenshot(p), target.path)
})
```

# Recreating reference data

> _Along with development process, code changes, but its behaviour and results should not - 
and **if they must change, let them change into a way we control**_. 

Code is being improved, it evolves and there will always be a moment when 
results, outputs change (but we konw about that, **it is a conscious change**), 
so reference data must be updated. I really like approach where developer 
puts few lines of **code to easily recreate reference data**.

For example, if we go back to plotly unit test, the whole test could look like this:

```{r,eval=FALSE}
  testthat::context("Testing plot(s) output")
  
  testthat::test_that("plotly renders correctly", {
    library(plotly)
    
    target.path <- system.file(file.path("testdata", "plot1.png"), package = "unitTestsWorkshopPackage")
    
    p <- economics %>% plot_ly(x = ~date, y = ~unemploy/pop) %>% add_lines()
    
    # compareImages function already contains testing function testthat::expect_lte
    unitTestsWorkshopPackage::compareImages(takeScreenshot(p), target.path)
    
    # following code recreates reference screenshot
    if (exists("RECREATE_REF_DATA") && isTRUE(RECREATE_REF_DATA)) {
      
      message("Recreating plotly reference screenshot...")
      
      testdata_dir <- file.path("unitTestsWorkshopPackage", "inst", "testdata")
      
      if (!dir.exists(file.path(testdata_dir)))
        dir.create(file.path(testdata_dir), recursive = TRUE)
      
      target.path <- file.path(testdata_dir, "plot1.png") 
      
      unitTestsWorkshopPackage::takeScreenshot(p, target.path)
    }
  })
```
  
  so to recreate reference screenshot just set `RECREATE_REF_DATA` to `TRUE` and run test.

# Practice

## Task #1
  
Test output of function `unitTestsWorkshopPackage::renderBoxplot`. 

  * Use following code as input data:
```{r,eval=FALSE}
dat <- data.frame(xval = sample(100, 1000, replace = TRUE),
                  group = as.factor(sample(c("a", "b", "c"), 1000, replace = TRUE)))
```
  * Use `unitTestsWorkshopPackage::takeScreenshot` to prepare reference plot
  * Use function `unitTestsWorkshopPackage::compareImages` to test output