## CatBoost vs. Random Forest and Gradient Boosting Machine: A Comparison
* Data : Churn [>>Link](https://github.com/yhat/demo-churn-pred/blob/master/model/churn.csv)
* Parameter : default, only seed = 1234
* Check AUC, Logloss
* Unlike in the Python package, it was hard to find a parameter for specifying a validation set, so the data is split Train:Test = 7:3

#### Package installation (on Windows 10)
* Methods tried
    * Installing CatBoost directly with devtools::install_github.
    * Downloading the zip file from [CatBoost GitHub](https://github.com/catboost/catboost/) and installing from it
    * Installing CatBoost with devtools::install_url (binary install) (__the only method that succeeded__)

In [1]:
install.packages("devtools")
devtools::install_url("https://github.com/catboost/catboost/releases/download/v0.8.1/catboost-R-Windows-0.8.1.tgz")

Installing package into 'C:/Users/hsw/Documents/R/win-library/3.4'
(as 'lib' is unspecified)


package 'devtools' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\hsw\AppData\Local\Temp\RtmpsNU66t\downloaded_packages


Downloading package from url: https://github.com/catboost/catboost/releases/download/v0.8.1/catboost-R-Windows-0.8.1.tgz
Installing catboost
"C:/Users/hsw/Anaconda3/envs/R/lib/R/bin/x64/R" --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  "C:/Users/hsw/AppData/Local/Temp/RtmpsNU66t/devtools3aac517976c6/catboost"  \
  --library="C:/Users/hsw/Documents/R/win-library/3.4" --install-tests 
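After the install finishes, it can be useful to confirm that the package is actually visible to the current R library paths before moving on. The following is a minimal sketch using only base R; the variable name `has_catboost` is introduced here for illustration.

```r
# Check whether catboost is installed, without attaching it.
has_catboost <- requireNamespace("catboost", quietly = TRUE)

if (has_catboost) {
  # packageVersion() reads the installed package's DESCRIPTION file.
  print(paste("catboost version:", as.character(utils::packageVersion("catboost"))))
} else {
  print("catboost is not installed in the current library paths")
}
```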



#### Print the current R session info (for reference when installing on another machine)

In [32]:
sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949   
[3] LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C                
[5] LC_TIME=Korean_Korea.949    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] h2o_3.19.0.4306      catboost_0.8.1       RevoUtils_10.0.8    
[4] RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] withr_2.1.1     digest_0.6.13   crayon_1.3.4    bitops_1.0-6   
 [5] IRdisplay_0.4.4 repr_0.12.0     R6_2.2.2        jsonlite_1.5   
 [9] magrittr_1.5    evaluate_0.10.1 httr_1.3.1      stringi_1.1.7  
[13] curl_3.1        uuid_0.1-2      IRkernel_0.8.11 devtools_1.13.3
[17] tools_3.4.3     stringr_1.3.1   RCurl_1.95-4.9  compiler_3.4.3 
[21] memoise_1.1.0   pbdZMQ_0.2-6   

#### Load CatBoost

In [1]:
library(catboost)

#### Define a function that splits a data frame into train, valid, and test sets

In [2]:
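# Split `dt` into train / valid / test.
# ratio[1] is the train fraction and ratio[2] the valid fraction;
# whatever remains after both draws becomes the test set.
dt_splitFrame <- function(dt, ratio, seed){
  set.seed(seed)
  # Draw the train rows first
  train_index <- sample(nrow(dt), as.integer(nrow(dt)*ratio[1]))
  train <- dt[train_index, ]
  valid_test <- dt[-train_index, ]
  # Draw the valid rows from what is left; the size is relative to the full data
  valid_index <- sample(nrow(valid_test), as.integer(nrow(train)/ratio[1]*ratio[2]))
  valid <- valid_test[valid_index, ]
  test <- valid_test[-valid_index, ]
  rm(valid_test)
  return(list(train, valid, test))
}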

#### Load the data

In [3]:
churn <- read.csv("../data/churn.csv")

#### The target must be converted to 0/1 for the model to run

In [4]:
churn[,20] <- ifelse(churn[,20] == "True.", 1, 0)
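The conversion above can be sanity-checked on a small vector before it is applied to the real column. This is a minimal sketch; the `labels` vector is hypothetical, and the `"False."` value is an assumption about how the non-churn class is spelled (only `"True."` appears in the code above).

```r
# Hypothetical labels in the same style as the Churn. column.
labels <- c("False.", "True.", "False.", "False.", "True.")

# Same recoding as above: "True." becomes 1, everything else becomes 0.
target <- ifelse(labels == "True.", 1, 0)

# Class counts; a quick way to see the balance between 0 and 1.
print(table(target))
```

Note that any value other than the exact string `"True."` is mapped to 0, so a typo or a differently formatted label would silently end up in the negative class.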

#### All categorical variables must be converted to factors

In [5]:
cat_features <- c(1,4,5)
for(i in cat_features){
  churn[,i] <- as.factor(churn[,i])
}
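The same conversion can be written without an explicit loop. The sketch below uses a toy data frame; its column names, values, and the `cat_cols` positions are made up for illustration and are not the churn data.

```r
# Toy data frame (hypothetical columns and values).
df <- data.frame(
  state   = c("KS", "OH", "NJ"),
  minutes = c(265.1, 161.6, 243.4),
  plan    = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Column positions to treat as categorical.
cat_cols <- c(1, 3)

# lapply() converts each selected column; assigning back keeps the data frame shape.
df[cat_cols] <- lapply(df[cat_cols], as.factor)

print(sapply(df, class))
```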

#### Train:Test = 7:3 (the valid and test parts returned by the function are merged into one test set)

In [6]:
splits <- dt_splitFrame(dt = churn, ratio = c(0.7, 0.3), seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]
test <- rbind(valid, test)
print(paste("Train : ", nrow(train), sep = ""))
print(paste("Test : ", nrow(test), sep = ""))

[1] "Train : 2333"
[1] "Test : 1000"
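The row counts above follow from the integer truncation inside `dt_splitFrame`. The arithmetic can be reproduced on its own; the total of 3333 rows is inferred from the printed 2333 + 1000.

```r
n <- 3333                 # total rows, inferred from the printed train and test sizes
ratio <- c(0.7, 0.3)

# Same expressions as in dt_splitFrame, applied to row counts only.
n_train <- as.integer(n * ratio[1])
n_valid <- as.integer(n_train / ratio[1] * ratio[2])
n_test  <- n - n_train - n_valid

print(c(train = n_train, valid = n_valid, test = n_test))
# train = 2333, valid = 999, test = 1
```

With ratios that sum to 1, the function leaves only a single row for its own test part, which is why valid and test are combined with `rbind` before evaluation.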


#### Find the index of the target column ("Churn.")

In [7]:
target_idx <- grep("Churn.", colnames(churn), fixed = TRUE)  # fixed = TRUE: treat "." literally, not as a regex wildcard
target_idx
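When a column is located by name, a regular-expression match and a literal match can give different results, because `.` matches any character in a regex. A small illustration with hypothetical column names:

```r
# Hypothetical column names; only "Churn." is the intended target.
cols <- c("Churn.", "ChurnX", "Area.Code")

regex_hits <- grep("Churn.", cols)                # "." is a wildcard: positions 1 and 2
fixed_hits <- grep("Churn.", cols, fixed = TRUE)  # literal match: position 1 only
exact_hits <- which(cols == "Churn.")             # exact equality: position 1 only

print(regex_hits)
print(fixed_hits)
print(exact_hits)
```

In the churn data the difference may not matter, but an index vector of length two would break the later `train[, -target_idx]` step in a way that is easy to miss.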

#### Convert the data to Pool objects to run CatBoost. The index positions of the categorical variables are passed as cat_features

In [8]:
train_pool <- catboost.load_pool(data = train[,-target_idx], label = train[,target_idx], cat_features = cat_features)
test_pool <- catboost.load_pool(data = test[,-target_idx], label = test[,target_idx], cat_features = cat_features)

#### Parameter description
* loss_function : the loss function (Logloss here, since this is a classification model)
* logging_level : whether to print results at every iteration during training
    * Silent : do not print.
    * Verbose : print.
    * See the [official documentation](https://tech.yandex.com/catboost/doc/dg/concepts/r-reference_catboost-train-docpage/) for details
* random_seed : seed number
* custom_loss : additional metrics to compute during training (the results are written to files in the directory given by train_dir)
* train_dir : directory in which the training output is saved

In [11]:
start.time <- Sys.time()
fit_params <- list(
                   loss_function = 'Logloss',
                   logging_level = "Silent",
                   random_seed = 1234,
                   custom_loss = "AUC",
                   train_dir = "../data/CatBoost_R_output"
                   )
model <- catboost.train(learn_pool = train_pool, test_pool = test_pool, params = fit_params)
print(paste("Time : ", Sys.time() - start.time, sep = ""))

[1] "Time : 29.7614259719849"
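`Sys.time() - start.time` returns a `difftime` whose units (seconds, minutes, and so on) are chosen automatically, so the pasted number carries no unit. A minimal sketch of a timing pattern that fixes the unit to seconds; `Sys.sleep(0.1)` is only a stand-in for the training call.

```r
start.time <- Sys.time()

Sys.sleep(0.1)  # placeholder for the work being timed

# difftime() with an explicit unit always reports seconds, even for long runs.
elapsed <- as.numeric(difftime(Sys.time(), start.time, units = "secs"))
print(paste0("Time (sec) : ", round(elapsed, 2)))
```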


#### Read the "test_error.tsv" file from the directory specified above
* The file contains the following information.
    * iteration
    * Logloss
    * AUC

In [12]:
test_error <- read.table("../data/CatBoost_R_output/test_error.tsv", sep = "\t", header = TRUE)

#### Metrics at the last iteration
* Logloss : 0.1565074
* AUC : 0.9258676

In [13]:
tail(test_error)

|      | iter | Logloss   | AUC       |
| ---- | ---- | --------- | --------- |
| 995  | 994  | 0.1563895 | 0.9259737 |
| 996  | 995  | 0.1563093 | 0.9261042 |
| 997  | 996  | 0.1563762 | 0.9259655 |
| 998  | 997  | 0.1564997 | 0.9258268 |
| 999  | 998  | 0.1565012 | 0.9258758 |
| 1000 | 999  | 0.1565074 | 0.9258676 |
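The last iteration is not necessarily the one with the lowest test Logloss. As a sketch, the six rows printed above are re-entered by hand below (the name `test_error_tail` is introduced here); on the full file the same `which.min` call could be applied to `test_error` directly.

```r
# The six rows shown in the output above, typed in as a small data frame.
test_error_tail <- data.frame(
  iter    = 994:999,
  Logloss = c(0.1563895, 0.1563093, 0.1563762, 0.1564997, 0.1565012, 0.1565074),
  AUC     = c(0.9259737, 0.9261042, 0.9259655, 0.9258268, 0.9258758, 0.9258676)
)

# Row with the lowest Logloss among these six, and the final row.
best <- test_error_tail[which.min(test_error_tail$Logloss), ]
last <- tail(test_error_tail, n = 1)

print(best)
print(last)
```

Among these six rows the minimum is at iteration 995; whether an earlier iteration in the full file is lower still cannot be told from the output shown here.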


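#### Load h2o. The error below appears because the H2O cluster version shipped with Anaconda (3.16.0.2) does not match the h2o R package version (3.19.0.4306); it can be ignored without affecting the modeling.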

In [14]:
library(h2o)
h2o.init()


----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: 'h2o'

The following objects are masked from 'package:stats':

    cor, sd, var

The following objects are masked from 'package:base':

    %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc



 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 hours 58 minutes 
    H2O cluster timezone:       
    H2O data parsing timezone:  
    H2O cluster version:        3.16.0.2 
    H2O cluster version age:    6 months and 15 days !!! 
    H2O cluster name:           H2O_started_from_R_hsw_gge519 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   6.91 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.3 (2017-11-30) 


"
Your H2O cluster version is too old (6 months and 15 days)!
Please download and install the latest version from http://h2o.ai/download/"




ERROR: Error in h2o.init(): Version mismatch! H2O is running version 3.16.0.2 but h2o-R package is version 3.19.0.4306.
         Install the matching h2o-R version from - http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/2/index.html


#### Specify the indices of the categorical variables, then convert them to factors

In [15]:
cat_features <- c(1,4,5,20)

churn_hex <- as.h2o(churn)

for(i in cat_features){
  churn_hex[,i] <- as.factor(churn_hex[,i])
}



#### Data split. (Train:Test = 7:3)

In [16]:
splits_hex <- h2o.splitFrame(
  churn_hex,          
  c(0.7),   
  seed=1234) 

In [17]:
train.hex <- h2o.assign(splits_hex[[1]], "train.hex")   
test.hex <- h2o.assign(splits_hex[[2]], "test.hex")     
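The H2O test frame used below ends up with 985 rows (841 + 144 in the confusion matrices) rather than 1000. As far as the H2O documentation describes it, `h2o.splitFrame` assigns rows probabilistically instead of producing an exact split. The base-R sketch below only illustrates that idea with a uniform random draw per row; it is not H2O's actual implementation and will not reproduce the same 985 rows.

```r
set.seed(1234)
n <- 3333                # same number of rows as the churn data

# Each row independently goes to train with probability 0.7.
u <- runif(n)
is_train <- u < 0.7

n_train <- sum(is_train)
n_test  <- sum(!is_train)

# The sizes are close to 7:3 but generally not exactly 2333 and 1000.
print(c(train = n_train, test = n_test))
```

One consequence is that the CatBoost model and the two H2O models in this notebook are evaluated on different test sets, which is worth keeping in mind when reading the final table.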

#### Fit a Random Forest model (default parameters)
* Then extract the AUC and Logloss values on the test data

In [18]:
start.time <- Sys.time()
ml_rf <- h2o.randomForest(         ## h2o.randomForest function
  training_frame = train.hex,        ## the H2O frame for training
  x=1:19,                        ## the predictor columns, by column index
  y=20,  
  seed = 1234)               

h2o.performance(ml_rf, newdata = test.hex)
print(paste("Time : ", Sys.time() - start.time, sep = ""))



H2OBinomialMetrics: drf

MSE:  0.05451486
RMSE:  0.2334842
LogLoss:  0.3607438
Mean Per-Class Error:  0.1195295
AUC:  0.9170754
Gini:  0.8341508

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         0   1    Error     Rate
0      821  20 0.023781  =20/841
1       31 113 0.215278  =31/144
Totals 852 133 0.051777  =51/985

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.360000 0.815884  29
2                       max f2  0.300000 0.818554  33
3                 max f0point5  0.480000 0.858209  23
4                 max accuracy  0.380000 0.948223  28
5                max precision  1.000000 1.000000   0
6                   max recall  0.000000 1.000000  66
7              max specificity  1.000000 1.000000   0
8             max absolute_mcc  0.360000 0.786656  29
9   max min_per_class_accuracy  0.160000 0.868056  40
10 max mean_per_class_accuracy  0.30

[1] "Time : 1.57041788101196"


#### Fit a GBM model (default parameters)
* Then extract the AUC and Logloss values on the test data

In [19]:
start.time <- Sys.time()
ml_gbm <- h2o.gbm(         
  training_frame = train.hex,        
  x=1:19,                        
  y=20,  
  seed = 1234)               

h2o.performance(ml_gbm, newdata = test.hex)
print(paste("Time : ", Sys.time() - start.time, sep = ""))



H2OBinomialMetrics: gbm

MSE:  0.05007391
RMSE:  0.223772
LogLoss:  0.1930566
Mean Per-Class Error:  0.1386742
AUC:  0.9159978
Gini:  0.8319956

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         0   1    Error     Rate
0      818  23 0.027348  =23/841
1       36 108 0.250000  =36/144
Totals 854 131 0.059898  =59/985

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.219443 0.785455 115
2                       max f2  0.151270 0.784044 133
3                 max f0point5  0.437263 0.857143  92
4                 max accuracy  0.399330 0.944162  95
5                max precision  0.986585 1.000000   0
6                   max recall  0.004118 1.000000 396
7              max specificity  0.986585 1.000000   0
8             max absolute_mcc  0.364223 0.760869  97
9   max min_per_class_accuracy  0.068371 0.854167 203
10 max mean_per_class_accuracy  0.151

[1] "Time : 1.69335508346558"


### CatBoost vs. Random Forest vs. GBM comparison (with default parameters)

|           | CatBoost | Random Forest | GBM    |
| --------- | -------- | ------------- | ------ |
| Time(sec) | 29.76    | 1.57          | 1.69   |
| AUC       | 0.9259   | 0.9171        | 0.9160 |
| Logloss   | 0.1565   | 0.3607        | 0.1931 |
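For reference, the two metrics in the table can be computed from true labels and predicted probabilities with a few lines of base R. This is a minimal sketch on hypothetical vectors `y` and `p`, using the standard binary Logloss formula and the rank-based (Mann-Whitney) form of AUC; it is not how CatBoost or H2O compute them internally.

```r
# Hypothetical true labels and predicted probabilities of class 1.
y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)

# Binary Logloss: mean negative log-likelihood of the true labels.
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))

# AUC via ranks: the share of (positive, negative) pairs ranked correctly.
n_pos <- sum(y == 1)
n_neg <- sum(y == 0)
auc <- (sum(rank(p)[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(Logloss = logloss, AUC = auc))
```

For this toy input the AUC is 0.75: of the four positive/negative pairs, three are ordered correctly. Lower Logloss and higher AUC are better, which is the direction used when reading the table.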