Random Forest is extremely slow #749

Closed
Laurae2 opened this Issue Jul 29, 2017 · 4 comments


Laurae2 commented Jul 29, 2017

Specs:

  • R 3.4.0
  • MinGW 7.1
  • Windows Server 2012 R2
  • 2x 10 core Xeon (total of 40 threads)

Random Forest can be extremely slow for unknown reasons.

To reproduce the issue (requires Bosch dataset), run the following:

setwd("E:/datasets")
sparse <- FALSE
rf <- TRUE
zero_as_missing <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = 6,
                 num_leaves = 63,
                 max_bin = 255,
                 zero_as_missing = zero_as_missing)
} else {
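  # Random Forest mode: no shrinkage (learning_rate = 1), row bagging every
  # round, and roughly sqrt(p)/p feature subsampling, as in a classic random forest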
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4097,
                 max_bin = 255,
                 zero_as_missing = zero_as_missing,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}

library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)

data <- fread(file = "bosch_data.csv")


# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}


# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
train$construct()
test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 500,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

# temp_model$best_iter
perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)

guolinke commented Jul 30, 2017

@Laurae2
It seems normal, since you are using num_leaves = 4097.
You can try comparable parameters to the normal (GBT) mode and test whether it is still slow.
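
For example, something like the following (illustrative values only, not the exact lists from this issue) keeps num_leaves identical in both modes, so any remaining slowdown comes from boosting_type rather than tree size:

# Illustrative parameter lists with matched tree size (num_leaves = 63)
params_gbt <- list(num_threads = 40,
                   learning_rate = 0.05,
                   num_leaves = 63,
                   max_bin = 255)
params_rf <- list(num_threads = 40,
                  learning_rate = 1,
                  num_leaves = 63,
                  max_bin = 255,
                  boosting_type = "rf",
                  bagging_freq = 1,
                  bagging_fraction = 0.632,
                  feature_fraction = ceiling(sqrt(970)) / 970)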


Laurae2 commented Jul 30, 2017

@guolinke There is a massive issue with MinGW on Windows. See the performance comparison below:

Compiler        LightGBM   AUC         Best Iter   Time (ms)
Visual Studio   GBT        0.6599964   25           87060.578
Visual Studio   RF         0.6539715   24           40204.351
MinGW 7.1       GBT        0.6590533   25          279344.248
MinGW 7.1       RF         0.6541649   24          198392.400
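
From those timings, MinGW 7.1 is roughly 3x slower than Visual Studio for GBT and roughly 5x slower for RF:

# Slowdown of MinGW 7.1 relative to Visual Studio, from the table above
279344.248 / 87060.578  # GBT: ~3.21x
198392.400 / 40204.351  # RF:  ~4.93x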

new code:

setwd("E:/datasets")
sparse <- TRUE # keep this true for reproducing my results
rf <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255)
} else {
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}

library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)

data <- fread(file = "bosch_data.csv")


# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}


# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
# train$construct()
# test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 25,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)

Laurae2 closed this Jul 30, 2017


guolinke commented Jul 30, 2017

@Laurae2 I don't know why MinGW is so slow ...


Laurae2 commented Jul 30, 2017

@guolinke Now we have a good reproducible example in case someone wants to check the performance discrepancy between Visual Studio and MinGW for LightGBM.
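
If Laurae::timer_func_print is not available, base R's system.time() is enough to reproduce the timing comparison; run the same call once under each compiled build and compare the elapsed times (a sketch, assuming params, train, and test were created as in the script above):

# Sketch: time the same training call under each build (Visual Studio vs. MinGW)
timing <- system.time({
  temp_model <- lgb.train(params = params,
                          data = train,
                          nrounds = 25,
                          valids = list(test = test),
                          objective = "binary",
                          metric = "auc",
                          verbose = 2)
})
print(timing["elapsed"])  # run once per build and compare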
