Random Forest is extremely slow #749

Closed
Laurae2 opened this Issue Jul 29, 2017 · 4 comments


Laurae2 commented Jul 29, 2017

Specs:

  • R 3.4.0
  • MinGW 7.1
  • Windows Server 2012 R2
  • 2x 10 core Xeon (total of 40 threads)

Random Forest can be extremely slow for unknown reasons.

To reproduce the issue (requires Bosch dataset), run the following:

setwd("E:/datasets")
sparse <- FALSE
rf <- TRUE
zero_as_missing <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = 6,
                 num_leaves = 63,
                 max_bin = 255,
                 zero_as_missing = zero_as_missing)
} else {
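  # Random Forest mode: no shrinkage (learning_rate = 1), row bagging every
  # round, and roughly sqrt(p)/p feature subsampling, as in a classic random forest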
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4097,
                 max_bin = 255,
                 zero_as_missing = zero_as_missing,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}

library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)

data <- fread(file = "bosch_data.csv")


# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}


# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
train$construct()
test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 500,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

# temp_model$best_iter
perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)

guolinke commented Jul 30, 2017

@Laurae2
It seems normal, since you are using num_leaves = 4097.
You can try comparable parameters to the normal (GBT) mode and test whether it is still slow.
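
For example, something like the following (illustrative values only, not the exact lists from this issue) keeps num_leaves identical in both modes, so any remaining slowdown comes from boosting_type rather than tree size:

# Illustrative parameter lists with matched tree size (num_leaves = 63)
params_gbt <- list(num_threads = 40,
                   learning_rate = 0.05,
                   num_leaves = 63,
                   max_bin = 255)
params_rf <- list(num_threads = 40,
                  learning_rate = 1,
                  num_leaves = 63,
                  max_bin = 255,
                  boosting_type = "rf",
                  bagging_freq = 1,
                  bagging_fraction = 0.632,
                  feature_fraction = ceiling(sqrt(970)) / 970)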


Laurae2 commented Jul 30, 2017

@guolinke There is a massive issue with MinGW on Windows. See the performance comparison below:

Compiler        LightGBM   AUC         Best Iter   Time (ms)
Visual Studio   GBT        0.6599964   25           87060.578
Visual Studio   RF         0.6539715   24           40204.351
MinGW 7.1       GBT        0.6590533   25          279344.248
MinGW 7.1       RF         0.6541649   24          198392.400
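
From those timings, MinGW 7.1 is roughly 3x slower than Visual Studio for GBT and roughly 5x slower for RF:

# Slowdown of MinGW 7.1 relative to Visual Studio, from the table above
279344.248 / 87060.578  # GBT: ~3.21x
198392.400 / 40204.351  # RF:  ~4.93x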

new code:

setwd("E:/datasets")
sparse <- TRUE # keep this true for reproducing my results
rf <- TRUE
if (rf == FALSE) {
  params <- list(num_threads = 40,
                 learning_rate = 0.05,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255)
} else {
  params <- list(num_threads = 40,
                 learning_rate = 1,
                 max_depth = -1,
                 num_leaves = 4095,
                 max_bin = 255,
                 boosting_type = "rf",
                 bagging_freq = 1,
                 bagging_fraction = 0.632,
                 feature_fraction = ceiling(sqrt(970)) / 970)
}

library(data.table)
library(Matrix)
library(lightgbm)
library(R.utils)

data <- fread(file = "bosch_data.csv")


# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}


# For LightGBM
train <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
# train$construct()
# test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 25,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)

Laurae2 closed this Jul 30, 2017


guolinke commented Jul 30, 2017

@Laurae2 I don't know why MinGW is so slow ...


Laurae2 commented Jul 30, 2017

@guolinke Now we have a good reproducible example in case someone wants to check the performance discrepancy between Visual Studio and MinGW for LightGBM.
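
If Laurae::timer_func_print is not available, base R's system.time() is enough to reproduce the timing comparison; run the same call once under each compiled build and compare the elapsed times (a sketch, assuming params, train, and test were created as in the script above):

# Sketch: time the same training call under each build (Visual Studio vs. MinGW)
timing <- system.time({
  temp_model <- lgb.train(params = params,
                          data = train,
                          nrounds = 25,
                          valids = list(test = test),
                          objective = "binary",
                          metric = "auc",
                          verbose = 2)
})
print(timing["elapsed"])  # run once per build and compare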
