New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] slow when function returning R6 object #539

Closed
BlindApe opened this Issue May 22, 2017 · 35 comments

Comments

Projects
None yet
3 participants
@BlindApe

BlindApe commented May 22, 2017

I'll give a detailed enviroment sheet and a reproducible example, but meanwhile this is the problem:

d.level1 <- lgb.Dataset(data = data.matrix(level1))
lightgbm::setinfo(d.level1, 'label', level1.response)
Error in paste(as.character(obj), collapse = " ") :
long vectors not supported yet: memory.c:1652
length(level1.response)
[1] 4214186

level1 has 4214186 rows and 1152 columns.

@guolinke

This comment has been minimized.

Member

guolinke commented May 22, 2017

I think 4214186 isn't long. And this issue is strange.
Did you run it immediately after installing LightGBM ? you can try to restart R and run it again.

@BlindApe

This comment has been minimized.

BlindApe commented May 22, 2017

The problem is reproducible after restarting the machine.
With a few columns it works, so I think the problem could be with 4212186 * 1152 that is greater than 2^32.
I've noticed lightgbm::setinfo in R takes a long long time with this big dataset, and finally crash. I've tried the same dataset with xgboost and setinfo is a blink.

@guolinke

This comment has been minimized.

Member

guolinke commented May 22, 2017

@BlindApe
SetInfo have nothing to do with number of columns. It only use one column.
And paste(as.character(obj), collapse = " ") isn't called by lightgbm.
Did you check the type of level1.response ?
can you try following code?

d.level1 <- lgb.Dataset(data = data.matrix(level1))
label <- level1.response
print(typeof(label))
print(label)
lgb.setinfo(d.level1, "label", label)
@guolinke

This comment has been minimized.

Member

guolinke commented May 22, 2017

BTW, SetInfo only copy ( copy the reference ) the passed array. So, It should be very fast.

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 22, 2017

@BlindApe label must be a vector like in xgboost.

@BlindApe

This comment has been minimized.

BlindApe commented May 22, 2017

@guolinke , @Laurae2 , thank for your support.

As you see this code takes 1 minute with a matrix of 20 columns, nearly 5 minutes with 100 columns and an eternity (I'm still waiting a crash) with 1300 columns.
lgb.setinfo isn't visible so I need to use setinfo().
The type of label is 'double'.

This is the sessionInfo():
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lightgbm_0.1 R6_2.2.1 data.table_1.10.4

loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.3.1

It's a dual xeon server with 2 x E5-2690 v2 and 256 Gb RAM.

Code:

data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
<lgb.Dataset>
Public:
construct: function ()
create_valid: function (data, info = list(), ...)
dim: function ()
finalize: function ()
get_colnames: function ()
getinfo: function (name)
initialize: function (data, params = list(), reference = NULL, colnames = NULL,
save_binary: function (fname)
set_categorical_feature: function (categorical_feature)
set_colnames: function (colnames)
setinfo: function (name, info)
set_reference: function (reference)
slice: function (idxset, ...)
update_params: function (params)
Private:
categorical_feature: NULL
colnames: NULL
free_raw_data: TRUE
get_handle: function ()
handle: NULL
info: list
params: list
predictor: NULL
raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
reference: NULL
set_predictor: function (predictor)
used_indices: NULL
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
0.96 mins

data <- matrix(1, 4000000, 100)
label <- rep(0, 4000000)
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
<lgb.Dataset>
Public:
construct: function ()
create_valid: function (data, info = list(), ...)
dim: function ()
finalize: function ()
get_colnames: function ()
getinfo: function (name)
initialize: function (data, params = list(), reference = NULL, colnames = NULL,
save_binary: function (fname)
set_categorical_feature: function (categorical_feature)
set_colnames: function (colnames)
setinfo: function (name, info)
set_reference: function (reference)
slice: function (idxset, ...)
update_params: function (params)
Private:
categorical_feature: NULL
colnames: NULL
free_raw_data: TRUE
get_handle: function ()
handle: NULL
info: list
params: list
predictor: NULL
raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
reference: NULL
set_predictor: function (predictor)
used_indices: NULL
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
4.79 mins

data <- matrix(1, 4000000, 1300)
label <- rep(0, 4000000)
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)

@BlindApe

This comment has been minimized.

BlindApe commented May 22, 2017

With xgboost (dataset creation with 1300 columns takes a while, but setinfo is a blink):

rm(list = ls(all = TRUE))
require(data.table)
Loading required package: data.table
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
require(xgboost)
Loading required package: xgboost

data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- xgb.DMatrix(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
[1] TRUE
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
0 mins

data <- matrix(1, 4000000, 100)
label <- rep(0, 4000000)
train <- xgb.DMatrix(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
[1] TRUE
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
0 mins

data <- matrix(1, 4000000, 1300)
label <- rep(0, 4000000)
train <- xgb.DMatrix(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
[1] TRUE
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
0 mins

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 22, 2017

I can't reproduce it.

I am getting this:

> library(R.utils)
> library(lightgbm)
> data <- matrix(1, 4000000, 20)
> label <- rep(0, 4000000)
> train <- lgb.Dataset(data = data, label = label)
> Laurae::timer_func_print(setinfo(train, 'label', label))
The function ran in 2.001 milliseconds.
[1] 2.000977
> library(R.utils)
> library(lightgbm)
> data <- matrix(1, 4000000, 100)
> label <- rep(0, 4000000)
> label2 <- rep(1, 4000000)
> train <- lgb.Dataset(data = data, label = label)
> Laurae::timer_func_print(setinfo(train, 'label', label2))
The function ran in 3.004 milliseconds.
[1] 3.00415

From a fully fresh session:

> library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.21.0 (2016-10-30) successfully loaded. See ?R.oo for help.

Attaching package:R.ooThe following objects are masked frompackage:methods:

    getClasses, getMethods

The following objects are masked frompackage:base:

    attach, detach, gc, load, save

R.utils v2.5.0 (2016-11-07) successfully loaded. See ?R.utils for help.

Attaching package:R.utilsThe following object is masked frompackage:RevoMods:

    timestamp

The following object is masked frompackage:utils:

    timestamp

The following objects are masked frompackage:base:

    cat, commandArgs, getOption, inherits, isOpen, parse, warnings

> library(lightgbm)
Loading required package: R6
Warning message:
packageR6was built under R version 3.3.3 
> data <- matrix(1, 4000000, 100)
> label <- rep(0, 4000000)
> label2 <- rep(1, 4000000)
> train <- lgb.Dataset(data = data)
> Laurae::timer_func_print(setinfo(train, 'label', label2))
The function ran in 1.001 milliseconds.
[1] 1.000977

Using a laptop here.

Did I miss something from your script?

@BlindApe

This comment has been minimized.

BlindApe commented May 22, 2017

The problem is setinfo after lgb.Dataset. Doing data and label at same time isn't problem:

data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data, label = label)
tic = proc.time()[3]
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
0 mins

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 22, 2017

@BlindApe My last example was done without doing data and label at the same time. It is still taking about 1 millisecond.

However, when I replace all 1s by 0s, I get the issue you mentioned (but it takes only 10 seconds here, not 5 minutes).

@guolinke any reason why it is slower when using 0s instead of 1s? It seems to duplicate the data also, which duplicates the memory used.

@BlindApe

This comment has been minimized.

BlindApe commented May 22, 2017

This is in other server, running different R version under Windows:

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lightgbm_0.1 R6_2.2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.0 magrittr_1.5

rm(list = ls(all = TRUE))
require(data.table)
Loading required package: data.table
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
require(lightgbm)
Loading required package: lightgbm
Loading required package: R6

data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
<lgb.Dataset>
Public:
construct: function ()
create_valid: function (data, info = list(), ...)
dim: function ()
finalize: function ()
get_colnames: function ()
getinfo: function (name)
initialize: function (data, params = list(), reference = NULL, colnames = NULL,
save_binary: function (fname)
set_categorical_feature: function (categorical_feature)
set_colnames: function (colnames)
set_reference: function (reference)
setinfo: function (name, info)
slice: function (idxset, ...)
update_params: function (params)
Private:
categorical_feature: NULL
colnames: NULL
free_raw_data: TRUE
get_handle: function ()
handle: NULL
info: list
params: list
predictor: NULL
raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
reference: NULL
set_predictor: function (predictor)
used_indices: NULL
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
1.79 mins

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 22, 2017

Can you change label <- rep(0, 4000000) with label <- rep(1, 4000000) and report the time?

@BlindApe

This comment has been minimized.

BlindApe commented May 22, 2017

I was doing:

data <- matrix(1, 4000000, 20)
label <- rep(1, 4000000)
print(typeof(label))
[1] "double"
print(head(label))
[1] 1 1 1 1 1 1
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
setinfo(train, 'label', label)
<lgb.Dataset>
Public:
construct: function ()
create_valid: function (data, info = list(), ...)
dim: function ()
finalize: function ()
get_colnames: function ()
getinfo: function (name)
initialize: function (data, params = list(), reference = NULL, colnames = NULL,
save_binary: function (fname)
set_categorical_feature: function (categorical_feature)
set_colnames: function (colnames)
set_reference: function (reference)
setinfo: function (name, info)
slice: function (idxset, ...)
update_params: function (params)
Private:
categorical_feature: NULL
colnames: NULL
free_raw_data: TRUE
get_handle: function ()
handle: NULL
info: list
params: list
predictor: NULL
raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
reference: NULL
set_predictor: function (predictor)
used_indices: NULL
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')
1.83 mins

@guolinke

This comment has been minimized.

Member

guolinke commented May 23, 2017

@BlindApe
can you try to replace setinfo(train, 'label', label) with train$setinfo(`label`, label) (Doing data and label at the same time will call this) ?
@Laurae2 I check the code for setinfo: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L413-L451
I think the most time consuming part is the data conversion: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L423-L428 .
The LGBM_DatasetSetField_R won't be called here since we use lazy init.

@guolinke

This comment has been minimized.

Member

guolinke commented May 23, 2017

@BlindApe
can you also test the time cost of label <- as.numeric(label) in your machine ?

@BlindApe

This comment has been minimized.

BlindApe commented May 23, 2017

@guolinke

Seems the problem isn't in as.numeric().
Replacing the function with directly invoke the method has the same behavior.

In linux server:

rm(list = ls(all = TRUE))
require(lightgbm)

data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
0 secs
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n')
57.443 secs

In Windows server:

rm(list = ls(all = TRUE))
require(lightgbm)

data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
0.05 secs
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n')
93.71 secs

@BlindApe

This comment has been minimized.

BlindApe commented May 23, 2017

Time is proportional to rows and columns, and the behavior is the same both in Windows and Linux

data <- matrix(1, 4000, 20)
label <- rep(0, 4000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
0.002 secs
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n')
0.124 secs

data <- matrix(1, 4000000, 100)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
0.001 secs
print(typeof(label))
[1] "double"
print(head(label))
[1] 0 0 0 0 0 0
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n')
287.985 secs

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 23, 2017

81.27s:

require(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
print(typeof(label))
print(head(label))
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n') // 81.27 secs 

but running this is really strange as it does exactly the same thing but faster!:

library(profvis)

profvis({
  library(lightgbm)
  data <- matrix(1, 4000000, 20)
  label <- rep(1, 4000000)
  train <- lgb.Dataset(data = data)
  train$setinfo("label", label)
})

str(train$getinfo("label")) // CORRECT ANSWER: num [1:4000000] 1 1 1 1 1 1 1 1 1 1 ...

image

The behavior is really strange when out of a function and running in the global environment.

@BlindApe

This comment has been minimized.

BlindApe commented May 23, 2017

I think the message I obtained first time I tried with 1152 columns could give a tip:

Error in paste(as.character(obj), collapse = " ") :
long vectors not supported yet: memory.c:1652

It's like internally all data were pasted in some way...

@guolinke

This comment has been minimized.

Member

guolinke commented May 23, 2017

@Laurae2 @BlindApe
maybe the reason is the memory relocation of R6 Class (Or R?).
When set this : https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L431, R6 may copy the data to a new memory address. Thus, when data is large, it will cost more time.

And it calls paste(as.character(obj), collapse = " ") . in memory.c, this seems is for some memory operations as well.

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 23, 2017

I ran this:

library(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(1, 4000000)
train <- lgb.Dataset(data = data)

debugonce(train$setinfo)
train$setinfo("label", label)

It seems there is a problem with return(self). It takes forever from that point.

This post shows an issue with assignments on R6 objects and type checking leading to 30x slower assignments on small stuff: https://cran.r-project.org/web/packages/R6/vignettes/Performance.html

@BlindApe

This comment has been minimized.

BlindApe commented May 23, 2017

@Laurae2 I've reproduced the behaviour doing inside a naive function:

rm(list = ls(all = TRUE))
require(lightgbm)
Loading required package: lightgbm
Loading required package: R6
nothing <- function() {
data <- matrix(1, 4000000, 100)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
print(typeof(label))
print(head(label))
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n')
}

nothing()
0 secs
[1] "double"
[1] 0 0 0 0 0 0
0 secs

So a temporary work around is call it inside a function.

@guolinke

This comment has been minimized.

Member

guolinke commented May 26, 2017

@Laurae2 What is the state of this issue now ?

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 26, 2017

@guolinke Currently can't seem to fix this, I managed to delay the issue but this is not a good workaround.

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 26, 2017

@BlindApe Your/Our workaround works until we print the train model. You can test this code to see the issue:

lgb.unloader(wipe = TRUE)
require(lightgbm)
nothing <- function() {
  data <- matrix(1, 4000000, 20)
  label <- rep(0, 4000000)
  tic = proc.time()[3]
  label <- as.numeric(label)
  cat(proc.time()[3] - tic, 'secs', '\n')
  print(typeof(label))
  print(head(label))
  train <- lgb.Dataset(data = data)
  tic = proc.time()[3]
  train$setinfo('label', label)
  cat(proc.time()[3] - tic, 'secs', '\n')
  return(train)
}
train <- nothing() # FAST
train # SLOW
@guolinke

This comment has been minimized.

Member

guolinke commented May 26, 2017

@Laurae2
So R console will print the return object, and the print will have some serialization, which causes the slow-down ?

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 26, 2017

@guolinke The cause is in the screenshot:

image

I have no idea what this as.character is actually doing here. But it is directly related to print and it explains why the duration scales with the size of the matrix(ces).

In RStudio, doing this reproduces exactly the issue:

library(profvis)

profvis({
  library(lightgbm)
  data <- matrix(1, 4000000, 20)
  label <- rep(1, 4000000)
  train <- lgb.Dataset(data = data)
  train$setinfo("label", label)
  print(train)
})

@guolinke guolinke changed the title from Problem with setinfo (in R): long vectors not supported yet to [R] slow when function returning R's object May 26, 2017

@guolinke

This comment has been minimized.

Member

guolinke commented May 26, 2017

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 26, 2017

@guolinke the issue is still here when returning invisibly. I start to think it is the fault of the environment / R6 object, and not of LightGBM. The print function is 100% unrelated to LightGBM, even though LightGBM seems to create the issue.

@guolinke guolinke changed the title from [R] slow when function returning R's object to [R] slow when function returning R6 object May 27, 2017

@guolinke

This comment has been minimized.

Member

guolinke commented May 27, 2017

@Laurae2 after using invisible, the problem only show up when manually print the R6 object.

@guolinke

This comment has been minimized.

Member

guolinke commented May 27, 2017

@BlindApe can you try the latest code and see what happen ?

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented May 27, 2017

@guolinke It defers the issue when printing. We may add a warning on docs about it.

@BlindApe

This comment has been minimized.

BlindApe commented May 31, 2017

@guolinke @Laurae2

I confirm the problem is still there when printing the object (running the Laurae2 example code).
I can live using invisible and without printing the objects.
Adding a warning on docs would be nice.

@Laurae2

This comment has been minimized.

Collaborator

Laurae2 commented Jun 4, 2017

@BlindApe Warning added in #555

@Laurae2 Laurae2 added the bug label Jan 21, 2018

@guolinke

This comment has been minimized.

Member

guolinke commented Jun 13, 2018

close since it is R6`s bug.

@guolinke guolinke closed this Jun 13, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment