Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lack of support for data.table? #10

Open
simon-lowe opened this issue Apr 27, 2020 · 8 comments
Open

Lack of support for data.table? #10

simon-lowe opened this issue Apr 27, 2020 · 8 comments

Comments

@simon-lowe
Copy link

Hi,

I think I like this package, but it does seem to not support data.table.
The code below does identical things, except that once it uses data.table and once data.frame. The latter works while the former does not.

rm(list = ls())

N_pop <- 100
dat_frame <- data.frame(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- dat_frame[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)


N_pop <- 100
dat_table <- data.table(x = rnorm(N_pop, 10, 1))
mean(dat_table$x)

func_table <- function(N_pop, N_samp){
  tmp <- dat_table[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp$x)))
}

func_table(N_pop, 10)

MC <- MonteCarlo(func = func_table, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)
@simon-lowe
Copy link
Author

Ok, well a basic hack is simply to convert the data.frame to a data.table, but that's a little annoying:

N_pop <- 100
dat_frame <- data.frame(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- as.data.table(dat_frame)
  tmp <- tmp[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp$x)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)

@simon-lowe
Copy link
Author

OK, the problem gets stranger because while the above works, if I use basic functionalities of data.table it stops working:

N_pop <- 100
dat_frame <- data.frame(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- as.data.table(dat_frame)
  tmp <- tmp[x < 5]
  tmp <- dat_frame[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)

Remove the line tmp[x < 5] and it works. The error says something about x being unknown ...

@simon-lowe
Copy link
Author

Correction, it looks like the same problem applies if you use dplyr and data.frames, so ...

@FunWithR
Copy link
Owner

Hi! you can circumvent the issue if you add data.table as an argument to export_also. Here is an example.
`rm(list = ls())

library(MonteCarlo)
library(data.table)

N_pop <- 100
dat_table <- data.table(x = rnorm(N_pop, 10, 1))
mean(dat_table$x)

func_table <- function(N_pop, N_samp){
tmp <- dat_table[sample(seq(1:N_pop), N_samp),]
return(list("mean" = mean(tmp$x)))
}

func_table(N_pop, 10)

MC <- MonteCarlo(func = func_table, nrep = 100,
param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4,
export_also = list("packages" = "data.table"))
`
Parallel execution always requires the workers to be set up correctly. That means that for example external libraries have to be loaded. MonteCarlo tries to do that automatically for you, by analyzing the functions used in your code and loading all the packages into the workers that contain the functions you are using.

Your code does something that is a little unusual, because you create all the random variables in advance and then sample from them again inside of the func_table function. You then sample again inside of func_table and use properties of the data.table object that is created outside of it. MonteCarlo would therefore have to analyze the objects as well as the functions. That would probably be a nice feature if I manage to do a new iteration.

Do you want to use the package for bootstrapping? Otherwise the way to structure it that you have chosen does not make too much sense.

If you want to do a simple Monte Carlo simulation you can do it like this:

`

rm(list = ls())

library(MonteCarlo)
library(data.table)

func_table <- function(N_samp){
tmp <- data.table(x = rnorm(N_samp, 10, 1))
return(list("mean" = mean(tmp$x)))
}

func_table(10)

MC <- MonteCarlo(func = func_table, nrep = 100,
param_list = list("N_samp" = c(10, 100)), ncpus = 4,
export_also = list("packages" = "data.table"))

`

@simon-lowe
Copy link
Author

Hi,
Thank you for your answer. I should say I really like the idea of this package! The structure of the Monte-Carlo is actually on purpose. Essentially it is the randomization inference approach, where the actual "state of the world" is fixed, the only thing that changes is the random treatment assignment at each iteration.

Concerning your solution I don't think that the problem was the incorrect import of data.table, I think for some reason because of the syntax of the data.table package (but as I've said you have the same issue with the maybe more standard tidyverse) where you use names of columns of the data.table without adding dat$ (which is a big plus of these packages).

For example:

library(MonteCarlo)
library(data.table)

N_pop <- 100
dat_frame <- data.table(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- dat_frame[x < 9.9]
  tmp <- dat_frame[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp$x)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4,
                 export_also = list("packages" = "data.table"))

This code returns the error:

r in snowfall::sfExport("func2", "func", "libloc_strings", "dat_frame", :
Unknown/unfound variable x in export. (local=TRUE)

And it does so independently of whether the export_also option is used.
I've taken a cursory look at the code and I think it is simply looking to deeply in what variables to declare for export to the workers. This error would come from declaring x for export even though it is actually a column of the data.table.
I have reverted to a manual and it works indeed fine when you simply export the data.table.

Anyway, thank you for your answer and efforts!

@danielwojtaszek
Copy link

danielwojtaszek commented Jun 1, 2020

I have the same issue as @simon-lowe . In my case, I use a data.table to store the bounds on uniformly distributed random numbers, which are then sampled in the function using runif().

I've pasted my code below, but unfortunately cannot attach the input file needed for it to run.

`library(Smisc)
library(DBI)
library(odbc)
library(RSQLite)
library(MonteCarlo)
library(data.table)

db = dbConnect(RSQLite::SQLite(), "LWR.sqlite")
inv = dbGetQuery(db, "select Time, Prototype, InventoryName, Quantity as 'U235' From explicitinventory inner join AgentEntry on explicitinventory.AgentId=AgentEntry.AgentId where (AgentEntry.Kind = 'Facility')and(explicitinventory.NucId = 922350000) group by Time, Prototype, InventoryName")

dta = dbGetQuery(db, "SELECT Time, Sender, Receiver, Commodity, sum(Quantity)*MassFrac as 'U235' From Compositions inner join (SELECT Time, Sender, Receiver, Commodity, Quantity, QualId from resources inner join (SELECT Time, Commodity, Sender, Prototype as 'Receiver', Resourceid, Commodity From AgentEntry inner join (SELECT Time, Commodity, Prototype as 'Sender', Receiverid, Resourceid, Commodity From transactions INNER Join AgentEntry on AgentEntry.AgentId = transactions.SenderId) as ij1 on AgentEntry.AgentId = ij1.ReceiverId) as ij2 on resources.Resourceid = ij2.Resourceid) as ij3 on Compositions.QualId = ij3.QualId where (Compositions.NucId = 922350000) group by Time, Receiver, Commodity")

dbDisconnect(db)
flowTable <- setDT(dta)
invyTable <- setDT(inv)
ship_EU <- flowTable[Commodity == "Fresh_UOX_comm" & Receiver == "LWR", .(EU = sum(U235)),by = "Time"]
setkey(ship_EU,Time)
receive <- flowTable[Commodity == "U_nat_comm", .(NU = sum(U235)),by = "Time"]
setkey(receive,Time)
new_table <- merge(ship_EU,receive, all=TRUE)
DU_out <- flowTable[Commodity == "Enrich_Tails_comm", .(DU = sum(U235)),by = "Time"]
setkey(DU_out,Time)
new_table <- merge(new_table,DU_out, all=TRUE)
E_invy <- invyTable[Prototype == "EnrichPlant" & InventoryName == "inventory", .(Inventory = U235), by = "Time"]
setkey(E_invy,Time)
new_table <- merge(new_table,E_invy, all=TRUE)

init_inv_Time = 2
pct_err = 0.05
abs_err = 5.0
##################################

assign("bounds", new_table[,
.(NU_LB=NU-NUpct_err, NU_UB=NU+NUpct_err,
EU_LB=EU-EUpct_err, EU_UB=EU+EUpct_err,
DU_LB=DU-DUpct_err, DU_UB=DU+DUpct_err,
inv_LB=Inventory-abs_err, inv_UB=Inventory+abs_err)
, by = "Time"],envir=.GlobalEnv)
setkey(bounds,Time)

K = 10.0
CL = 50.0

cusum_test <- function(K, CL){
meas <- bounds[,
list(NU=runif(1,NU_LB,NU_UB),
EU=runif(1,EU_LB,EU_UB),
DU=runif(1,DU_LB,DU_UB),
Inventory=runif(1,inv_LB,inv_UB)
), by = "Time"]

init_inv = meas[Time==init_inv_Time,Inventory]
meas = meas[Time > init_inv_Time, ]
sum_meas <- meas[order(Time), list(Time=Time,R=cumsum(NU),S=cumsum(EU+DU),Inventory=Inventory), ]
MUF <- sum_meas[,list(MUF=R-S+init_inv-Inventory), by = "Time"]

y = cusum(MUF[,MUF], K, CL)
return(list("alarm"=length(signal(y))>0))
}

K_grid = numeric(10)+K
CL_grid = c(CL)
param_list=list("K"=K_grid,"CL"=CL_grid)
result <- MonteCarlo(func=cusum_test, nrep=10, param_list=param_list, ncpus = 2, export_also=list("packages"="data.table"))`

The code gives the following error:

Error in snowfall::sfExport("func2", "func", "libloc_strings", "bounds", :
Unknown/unfound variable DU in export. (local=TRUE)

Simon mentioned that simply exporting the data.table solved his problem. How would I do that?

Cheers.

@danielwojtaszek
Copy link

I've answered my own question, I think. The "data" element in export_also parameter list can be used to export the data.table I want to access in the function. Using export_also = list("data"=bounds,"packages"=data.table" changes the error to

Error in snowfall::sfExport("func2", "func", "libloc_strings", "1:359", :
Unknown/unfound variable 1:359 in export. (local=TRUE)

I don't know how 1:359 is being seen as a variable. Maybe it has something to do with referencing rows in one of the data.tables, since there are 359 rows in the 'bounds' data.table.

@danielwojtaszek
Copy link

I got it working by putting in explicit references to column names using $ and setting export_also = list("data"="bounds","packages"=data.table")

I think it would improve the montecarlo package a lot if you could make it work better with data.tables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants