Cache auto-loaded data during load.project() (#160)
* added in auto cache when data loaded directly, and also made migrate.project able to migrate old projects smoothly.  Regression tests pass, but no tests for new functionality written yet

* changed the set up of the version tests which were creating a test_project directory in the working directory instead of in a temp directory.  Tests themselves not changed and still work

* Added some tests for migration and load.project

* Updated some variables in migrate.project in order to pass R CMD checks

* Changed caching logic to inspect global environment before and after data load to determine what should be cached.  Created some tests to test this also.

* updated behaviour for how project.info is calculated and also expanded the scope of what cache.project does

* create a temporary migrate.project file to merge manually

* updated migrate.project with test for cache_loaded_data flag warning

* fixed bug which caused two migration tests to fail (wrong config variable name)

* changed default missing value to FALSE and added some website documentation
connectedblue authored and KentonWhite committed Nov 3, 2016
1 parent 62937da commit 2ff97d6
Showing 10 changed files with 204 additions and 12 deletions.
6 changes: 5 additions & 1 deletion R/cache.project.R
Original file line number Diff line number Diff line change
@@ -20,7 +20,11 @@
#' cache.project()}
cache.project <- function()
{
-for (dataset in get.project()[['data']])
+# get all data related to the project
+project_data <- unique(c(get.project()[['data']], .cached.variables()))
+
+# and cache each one (already cached items will be re-cached if they have changed)
+for (dataset in project_data)
{
message(paste('Caching', dataset))
cache(dataset)
38 changes: 36 additions & 2 deletions R/load.project.R
@@ -107,6 +107,10 @@ load.project <- function(override.config = NULL)
}

# Then we consider loading things from data/.

# First save the variables already in the global env
before.data.load <- .var.diff.from()

if (config$data_loading)
{
message('Autoloading data')
@@ -122,7 +126,24 @@ load.project <- function(override.config = NULL)

.convert.to.data.table(my.project.info$data)
}


# If we have just loaded data from the data directory, cache it straight away
# if the cache_loaded_data config is TRUE.
new.vars <- .var.diff.from(before.data.load)
if (config$cache_loaded_data && (length(new.vars)>0))
{
sapply(new.vars, cache)
}

# update project.info$data with any additional datasets generated during autoload
if (length(new.vars) > 0)
my.project.info$data <- unique(c(my.project.info$data, new.vars))

# remove any items in project.info$data which are not in the global environment
remove <- setdiff(my.project.info$data, .var.diff.from())
my.project.info$data <- my.project.info$data[! (my.project.info$data %in% remove)]


if (config$munging)
{
message('Munging data')
@@ -216,7 +237,7 @@ load.project <- function(override.config = NULL)
ignore.case = TRUE,
perl = TRUE))

-# If this variable already exists in cache, don't load it from data.
+# If this variable already exists in global env, don't load it from data.
if (variable.name %in% ls(envir = .TargetEnv))
{
next()
@@ -337,3 +358,16 @@ load.project <- function(override.config = NULL)
if(sum(file.exists(check_files))==length(check_files)) return(TRUE)
return(FALSE)
}

# Compare the variables (excluding functions) in the global env with a passed
# in string of names and return the difference
.var.diff.from <- function(given.var.list="", env=.TargetEnv) {
# Get variables in the target environment and determine whether each is a function
# (look each name up in `env` itself, not in the calling scope)
current.var.list <- sapply(ls(envir = env), function(x) is.function(get(x, envir = env)))
current.var.list <- names(current.var.list[current.var.list==FALSE])

# return those not in list
setdiff(current.var.list, given.var.list)
}


12 changes: 12 additions & 0 deletions R/migrate.project.R
@@ -108,6 +108,18 @@ migrate.project <- function()

# Specific logic here for new config items that need special migration treatment

if(any(grepl("cache_loaded_data", config_warnings))) {
# switch the setting to FALSE so as to not mess up any existing
# munge script, but warn the user
loaded.config$cache_loaded_data <- FALSE
message(paste0(c(
"\n",
"There is a new config item called cache_loaded_data which auto-caches data",
"after it has been loaded from the data directory. This has been switched",
"off for this project in case it breaks your scripts. However you can switch",
"it on manually by editing global.dcf"),
collapse="\n"))
}

}

1 change: 1 addition & 0 deletions inst/defaults/config/default.dcf
@@ -10,3 +10,4 @@ libraries: reshape, plyr, ggplot2, stringr, lubridate
as_factors: TRUE
data_tables: FALSE
attach_internal_libraries: TRUE
cache_loaded_data: FALSE
1 change: 1 addition & 0 deletions inst/defaults/full/config/global.dcf
@@ -10,3 +10,4 @@ libraries: reshape, plyr, dplyr, ggplot2, stringr, lubridate
as_factors: TRUE
data_tables: FALSE
attach_internal_libraries: FALSE
cache_loaded_data: TRUE
2 changes: 1 addition & 1 deletion man/default.config.Rd

2 changes: 1 addition & 1 deletion man/new.config.Rd

93 changes: 93 additions & 0 deletions tests/testthat/test-load.R
@@ -23,3 +23,96 @@ test_that('Dont load when not in ProjectTemplate directory', {
expect_message(load.project(), "is not a ProjectTemplate directory")

})

test_that('auto loaded data is cached by default', {
test_project <- tempfile('test_project')
suppressMessages(create.project(test_project, minimal = FALSE))
on.exit(unlink(test_project, recursive = TRUE), add = TRUE)

oldwd <- setwd(test_project)
on.exit(setwd(oldwd), add = TRUE)


test_data <- data.frame(Names=c("a", "b", "c"), Ages=c(20,30,40))

# save test data as a csv in the data directory
write.csv(test_data, file="data/test.csv", row.names = FALSE)

suppressMessages(load.project())

# check that the cached file loads without error
expect_error(load("cache/test.RData", envir = environment()), NA)

# and check that the loaded data from the cache is what we saved
expect_equal(test, test_data)
})

test_that('auto loaded data is not cached when cache_loaded_data is FALSE', {
test_project <- tempfile('test_project')
suppressMessages(create.project(test_project, minimal = FALSE))
on.exit(unlink(test_project, recursive = TRUE), add = TRUE)

oldwd <- setwd(test_project)
on.exit(setwd(oldwd), add = TRUE)


test_data <- data.frame(Names=c("a", "b", "c"), Ages=c(20,30,40))

# save test data as a csv in the data directory
write.csv(test_data, file="data/test.csv", row.names = FALSE)

# Read the config data and set cache_loaded_data to FALSE
config <- as.data.frame(read.dcf("config/global.dcf"), stringsAsFactors = FALSE)
expect_error(config$cache_loaded_data <- FALSE, NA)
write.dcf(config, "config/global.dcf" )

suppressMessages(load.project())

# check that the test variable has not been cached
expect_error(load("cache/test.RData", envir = environment()), "cannot open the connection")


})



test_that('auto loaded data from an R script is cached correctly', {
test_project <- tempfile('test_project')
suppressMessages(create.project(test_project, minimal = FALSE))
on.exit(unlink(test_project, recursive = TRUE), add = TRUE)

oldwd <- setwd(test_project)
on.exit(setwd(oldwd), add = TRUE)

# clear the global environment
rm(list=ls(envir = .TargetEnv), envir = .TargetEnv)

# create some variables in the global env that shouldn't be cached
test_data11 <- data.frame(Names=c("a", "b", "c"), Ages=c(20,30,40))
test_data21 <- data.frame(Names=c("a1", "b1", "c1"), Ages=c(20,30,40))

# Create some R code and put in data directory
CODE <- paste0(deparse(substitute({
test_data12 <- data.frame(Names=c("a", "b", "c"), Ages=c(20,30,40))
test_data22 <- data.frame(Names=c("a1", "b1", "c1"), Ages=c(20,30,40))

})), collapse ="\n")

# save R code in the data directory
writeLines(CODE, "data/test.R")

# load the project and R code
suppressMessages(load.project())

# check that the test variables have been cached correctly
expect_error(load("cache/test_data12.RData", envir = environment()), NA)
expect_error(load("cache/test_data22.RData", envir = environment()), NA)

# check that the other test variables have not been cached
expect_error(load("cache/test_data11.RData", envir = environment()),
"cannot open the connection")
expect_error(load("cache/test_data21.RData", envir = environment()),
"cannot open the connection")
})


49 changes: 48 additions & 1 deletion tests/testthat/test-migration.R
@@ -47,15 +47,62 @@ test_that('migrating a project which doesnt need config update results in an Up
})


-test_that('migrating a project with a missing config file results in a message to user', {

test_that('projects without the cache_loaded_data config have their migrated config set to FALSE', {

test_project <- tempfile('test_project')
suppressMessages(create.project(test_project, minimal = FALSE))
on.exit(unlink(test_project, recursive = TRUE), add = TRUE)

oldwd <- setwd(test_project)
on.exit(setwd(oldwd), add = TRUE)

# Read the config data and remove the cache_loaded_data flag
config <- as.data.frame(read.dcf("config/global.dcf"))
expect_error(config$cache_loaded_data <- NULL, NA)
write.dcf(config, "config/global.dcf" )

# should get a warning because of the missing cache_loaded_data
expect_warning(suppressMessages(load.project()), "missing the following entries")

test_data <- data.frame(Names=c("a", "b", "c"), Ages=c(20,30,40))

# save test data as a csv in the data directory
write.csv(test_data, file="data/test.csv", row.names = FALSE)


# run load.project again and check that the test variable has not been cached,
# because the default should be FALSE if cache_loaded_data is missing before
# migrate.project is called
suppressMessages(load.project())
expect_error(load("cache/test.RData", envir = environment()), "cannot open the connection")

# Migrate the project
expect_message(migrate.project(), "new config item called cache_loaded_data")

# Read the config data and check cache_loaded_data is FALSE
config <- as.data.frame(read.dcf("config/global.dcf"), stringsAsFactors=FALSE)
expect_equal(config$cache_loaded_data, "FALSE")

# Should be a clean load.project
expect_warning(suppressMessages(load.project()), NA)

# check that the test variable has not been cached
expect_error(load("cache/test.RData", envir = environment()), "cannot open the connection")


})


test_that('migrating a project with a missing config file results in a message to user', {

test_project <- tempfile('test_project')
suppressMessages(create.project(test_project, minimal = FALSE))
on.exit(unlink(test_project, recursive = TRUE), add = TRUE)

oldwd <- setwd(test_project)
on.exit(setwd(oldwd), add = TRUE)

# remove the config file
unlink('config/global.dcf')

12 changes: 6 additions & 6 deletions website/configuring.markdown
@@ -10,12 +10,18 @@ Both types are stored in the `config` object accessible from the global environm
The current `ProjectTemplate` configuration settings exist in the `config/global.dcf` file:

* `data_loading`: This can be set to 'on' or 'off'. If `data_loading` is on, the system will load data from both the `cache` and `data` directories with `cache` taking precedence in the case of name conflict. By default, `data_loading` is on.
* `cache_loading`: This can be set to 'on' or 'off'. If `cache_loading` is on, the system will load data from the `cache` directory before any attempt to load from the `data` directory. By default, `cache_loading` is on.
* `recursive_loading`: This can be set to 'on' or 'off'. If `recursive_loading` is on, the system will load data from the `data` directory and all its subdirectories recursively. By default, `recursive_loading` is off.
* `munging`: This can be set to 'on' or 'off'. If `munging` is on, the system will execute the files in the `munge` directory sequentially using the order implied by the `sort()` function. If `munging` is off, none of the files in the `munge` directory will be executed. By default, `munging` is on.
* `logging`: This can be set to 'on' or 'off'. If `logging` is on, a logger object using the `log4r` package is automatically created when you run `load.project()`. This logger will write to the `logs` directory. By default, `logging` is off.
* `logging_level`: The value of `logging_level` is passed to a logger object using the `log4r` package during logging when you run `load.project()`. By default, `logging_level` is INFO.
* `load_libraries`: This can be set to 'on' or 'off'. If `load_libraries` is on, the system will load all of the R packages listed in the `libraries` field described below. By default, `load_libraries` is off.
* `libraries`: This is a comma separated list of all the R packages that the user wants to automatically load when `load.project()` is called. These packages must already be installed before calling `load.project()`. By default, the reshape, plyr, ggplot2, stringr and lubridate packages are included in this list.
* `as_factors`: This can be set to 'on' or 'off'. If `as_factors` is on, the system will convert every character vector into a factor when creating data frames; most importantly, this automatic conversion occurs when reading in data automatically. If 'off', character vectors will remain character vectors. By default, `as_factors` is on.
* `data_tables`: This can be set to 'on' or 'off'. If `data_tables` is on, the system will convert every data set loaded from the `data` directory into a `data.table`. By default, `data_tables` is off.
* `attach_internal_libraries`: This can be set to 'on' or 'off'. If `attach_internal_libraries` is on, then every time a new package is loaded into memory during `load.project()` a warning will be displayed to say that this has happened. By default, `attach_internal_libraries` is off.
* `cache_loaded_data`: This can be set to 'on' or 'off'. If `cache_loaded_data` is on, then data loaded from the `data` directory during `load.project()` will be automatically cached (so it won't need to be reloaded next time `load.project()` is called). By default, `cache_loaded_data` is on for newly created projects. Existing projects created without this configuration setting will default to off. Similarly, when `migrate.project()` is called in those cases, the default will be off.
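
The idea behind `cache_loaded_data` can be illustrated with a minimal base R sketch: take a snapshot of the non-function variables before loading, take another afterwards, and cache only the difference. This is an illustration rather than the package's actual code. The helper name `non_function_vars`, the scratch environment, and the use of `save()` into a temporary directory are all assumptions made so the example runs on its own, without `ProjectTemplate` installed.

```r
# Hypothetical helper (not part of ProjectTemplate): names of the non-function
# objects in `env` that are not already listed in `known`.
non_function_vars <- function(known = character(0), env = globalenv()) {
  all_names <- ls(envir = env)
  is_fun <- vapply(all_names,
                   function(x) is.function(get(x, envir = env)),
                   logical(1))
  setdiff(all_names[!is_fun], known)
}

# Use a scratch environment so the sketch does not touch the global environment.
env <- new.env()
assign("helper", function(x) x + 1, envir = env)  # functions are ignored
assign("already_here", 1:3, envir = env)          # exists before the "load"

# Snapshot taken before any data is loaded
before <- non_function_vars(env = env)

# Stand-in for autoloading data/test.csv
assign("test",
       data.frame(Names = c("a", "b", "c"), Ages = c(20, 30, 40)),
       envir = env)

# Only variables that appeared since the snapshot are candidates for caching
new_vars <- non_function_vars(before, env = env)

# Stand-in for cache(): write each new variable to <cache_dir>/<name>.RData
cache_dir <- tempfile("cache")
dir.create(cache_dir)
for (v in new_vars) {
  save(list = v, envir = env,
       file = file.path(cache_dir, paste0(v, ".RData")))
}
```

In this sketch only `test` ends up in the cache directory; `already_here` predates the snapshot and `helper` is a function, so both are skipped. In a real project the behaviour is controlled by the `cache_loaded_data` line in `config/global.dcf`.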


The project specific configuration is specified in the `lib/globals.R` file using the `add.config` function. This will contain whatever is relevant for your project, and will look something like this:

@@ -31,9 +37,3 @@ To use project specific configuration in any `lib`, `munge` or `src` script, si
`ProjectTemplate` will automatically load project specific content in `lib/globals.R` before any other file in `lib`, so the filename should not be changed.

The `add.config()` function can also be used anywhere in the project. So if a particular analysis in `src` wanted to override the value in `globals.R`, you can simply add the relevant `add.config()` command to the top of that script.


-The following configuration settings still require documentation:
-* `cache_loading`
-* `recursive_loading`
-* `attach_internal_libraries`
