## CatBoost vs (Random Forest & Gradient Boosting Machine) Comparison
* Data: Churn [>>Link](https://github.com/yhat/demo-churn-pred/blob/master/model/churn.csv)
* Parameters: defaults, with only the seed fixed (seed = 1234)
* Metrics: AUC, Logloss

In [2]:
from catboost import CatBoostClassifier, Pool
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

import time

In [4]:
churn = pd.read_csv("../data/churn.csv")

#### X holds the data with the 'Churn.' column removed. CatBoost does not convert the target automatically, so it has to be mapped to 0/1 by hand.

In [3]:
X = churn.drop(['Churn.'], axis=1)
y = churn['Churn.']
y = np.where(y == "True.", 1, 0)
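Note that `np.where(y == "True.", 1, 0)` maps every value that is not exactly `"True."` to 0, including typos or missing values. A stricter encoding can be sketched as below. The helper `encode_churn` is hypothetical (it is not part of the notebook), and the `"False."` label is an assumption about what the CSV contains for non-churners.

```python
# Hypothetical helper; the "False." label is an assumption about the CSV.
LABEL_TO_INT = {"True.": 1, "False.": 0}

def encode_churn(labels):
    """Map churn label strings to 0/1, failing loudly on unknown values."""
    encoded = []
    for value in labels:
        if value not in LABEL_TO_INT:
            raise ValueError("unexpected churn label: {!r}".format(value))
        encoded.append(LABEL_TO_INT[value])
    return encoded

sample = ["False.", "True.", "False."]  # made-up values for illustration
print(encode_churn(sample))
```

Failing on an unknown label makes a silent mislabeling visible before training starts.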

#### Train:Valid:Test = 6:2:2

In [4]:
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=0.6, test_size = 0.4, random_state=1234)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, train_size = 0.5, test_size = 0.5, random_state = 1234)
print("Train : {}".format(X_train.shape[0]))
print("Valid : {}".format(X_valid.shape[0]))
print("Test : {}".format(X_test.shape[0]))

Train : 1999
Valid : 667
Test : 667
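The 6:2:2 ratio comes from splitting twice: first 60% / 40%, then the remaining 40% in half. The arithmetic can be checked with the sketch below. It assumes that scikit-learn rounds the training size down and the test size up when both are given as fractions; `split_sizes` is a hypothetical helper, and the total row count is taken from the sizes printed above.

```python
import math

def split_sizes(n_samples, train_frac, test_frac):
    """Return (n_train, n_test): train size rounded down, test size rounded up."""
    n_train = math.floor(train_frac * n_samples)
    n_test = math.ceil(test_frac * n_samples)
    return n_train, n_test

n_total = 1999 + 667 + 667  # row counts printed above

# Stage 1: 60% train, 40% held out.
n_train, n_tmp = split_sizes(n_total, 0.6, 0.4)
# Stage 2: the held-out part is halved into valid and test.
n_valid, n_test = split_sizes(n_tmp, 0.5, 0.5)

print(n_train, n_valid, n_test)
```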


#### Store the positions (indices) of the categorical variables in categorical_features_indices.

In [5]:
categorical_features_indices = np.where(X.dtypes == object)[0]
X.iloc[:,categorical_features_indices].head()

|   | State | Int.l.Plan | VMail.Plan |
| - | ----- | ---------- | ---------- |
| 0 | KS    | no         | yes        |
| 1 | OH    | no         | yes        |
| 2 | NJ    | no         | no         |
| 3 | OH    | yes        | no         |
| 4 | OK    | yes        | no         |


#### Train the model (default parameters)

In [6]:
start_time = time.time()
ml_cb = CatBoostClassifier(random_seed = 1234)

ml_cb_output = ml_cb.fit(X = X_train, 
                         y = y_train, 
                         cat_features = categorical_features_indices, 
                         eval_set=(X_valid, y_valid),
                         verbose=False,
                         plot=False)
print("Time : {}".format(time.time() - start_time))

Time : 22.502959966659546
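The same start/stop timing pattern is repeated for every model in this notebook. One way to avoid the repetition is a small context manager, sketched here. The name `timed` is hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals; the workload inside the `with` block is a stand-in for a call such as `ml_cb.fit(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("demo", timings):
    # Stand-in workload; in the notebook this would be the model fit call.
    total = sum(i * i for i in range(100000))

print("Time : {}".format(timings["demo"]))
```

Collecting all durations in one dictionary also makes it easy to build the final comparison table from them.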


#### Convert to the catboost.core.Pool type in order to compute AUC and Logloss

In [7]:
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)
type(test_pool)

catboost.core.Pool

#### Check AUC and Logloss on the test data

In [96]:
eval_metrics = ml_cb_output.eval_metrics(test_pool, ['AUC', 'Logloss'], plot=False)

In [97]:
print("catboost AUC : {}".format(eval_metrics['AUC'][-1]))
print("catboost Logloss : {}".format(eval_metrics['Logloss'][-1]))

catboost AUC : 0.9538802233776237
catboost Logloss : 0.11481557354303855
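`eval_metrics` returns a list of values per metric, one per evaluated iteration, which is why `[-1]` is used to read the value for the final model. To make the two metrics concrete, here is a minimal pure-Python sketch of how they are defined. It is not CatBoost's implementation, and the toy labels and probabilities are made up for illustration.

```python
import math

def logloss(y_true, p_pred):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def auc(y_true, p_pred):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_toy = [1, 0, 1, 0]          # made-up labels
p_toy = [0.9, 0.2, 0.6, 0.7]  # made-up predicted probabilities of class 1

# 3 of the 4 positive/negative pairs are ordered correctly.
print("AUC : {}".format(auc(y_toy, p_toy)))
print("Logloss : {}".format(logloss(y_toy, p_toy)))
```

AUC depends only on how the predictions are ranked, while Logloss also penalizes poorly calibrated probabilities, so the two metrics can order models differently.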


---

#### Compare against the GBM and Random Forest estimators built into h2o

In [10]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [11]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


| Property                   | Value                      |
| -------------------------- | -------------------------- |
| H2O cluster uptime         | 2 hours 21 mins            |
| H2O cluster timezone       | Asia/Seoul                 |
| H2O data parsing timezone  | UTC                        |
| H2O cluster version        | 3.18.0.5                   |
| H2O cluster version age    | 2 months and 16 days       |
| H2O cluster name           | H2O_from_python_hsw_5f1e6p |
| H2O cluster total nodes    | 1                          |
| H2O cluster free memory    | 6.904 Gb                   |
| H2O cluster total cores    | 4                          |
| H2O cluster allowed cores  | 4                          |


#### As with CatBoost, split Train:Valid:Test = 6:2:2

In [12]:
churn_hex = h2o.H2OFrame(churn)
train, valid, test = churn_hex.split_frame([0.6, 0.2], seed=1234)

Parse progress: |█████████████████████████████████████████████████████████| 100%
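As far as the H2O documentation describes it, `split_frame` assigns rows probabilistically, so the resulting frames only approximate the requested ratios. It is also independent of the `train_test_split` call used for CatBoost, which means the H2O test frame is not the same set of rows as the CatBoost test set. The sketch below is a simplified illustration of probabilistic splitting, not H2O's actual implementation; `probabilistic_split` is a hypothetical helper and 3333 is the total row count from the earlier split.

```python
import random

def probabilistic_split(n_rows, ratios, seed):
    """Assign each row to a bucket by comparing a uniform draw
    against the cumulative ratios; the last bucket takes the remainder."""
    rng = random.Random(seed)
    counts = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        draw = rng.random()
        cumulative = 0.0
        bucket = len(ratios)  # default: remainder bucket
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            if draw < cumulative:
                bucket = i
                break
        counts[bucket] += 1
    return counts

train_n, valid_n, test_n = probabilistic_split(3333, [0.6, 0.2], seed=1234)
print(train_n, valid_n, test_n)
```

The bucket sizes land near 60% / 20% / 20% but are generally not exact. Checking `train.nrow`, `valid.nrow`, and `test.nrow` on the H2O frames shows the sizes actually obtained.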


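#### Specify the x and y column names

In [13]:
# X and y are rebound here: they now hold column names, not the pandas objects used for CatBoost.
X = churn_hex.col_names[:-1]  # every column except the last is a predictor
y = churn_hex.col_names[-1]   # the last column ('Churn.') is the target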

#### Train Random Forest (default parameters)

In [14]:
start_time = time.time()
ml_rf = H2ORandomForestEstimator(seed = 1234)
ml_rf.train(X, y, training_frame=train, validation_frame=valid)
print("Time : {}".format(time.time() - start_time))

drf Model Build progress: |███████████████████████████████████████████████| 100%
Time : 0.7278151512145996


#### Check AUC and Logloss on the test data

In [18]:
performance_rf = ml_rf.model_performance(test_data=test)

print("RandomForest AUC is : {}".format(performance_rf.auc()))
print("RandomForest Logloss is : {}".format(performance_rf.logloss()))

RandomForest AUC is : 0.9152839978692762
RandomForest Logloss is : 0.3038702620531364


#### Repeat the same steps for GBM

In [19]:
start_time = time.time()
ml_gbm = H2OGradientBoostingEstimator(seed = 1234)
ml_gbm.train(X, y, training_frame=train, validation_frame=valid)
print("Time : {}".format(time.time() - start_time))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
Time : 0.49232959747314453


In [20]:
performance_gbm = ml_gbm.model_performance(test_data=test)

print("GBM AUC is : {}".format(performance_gbm.auc()))
print("GBM Logloss is : {}".format(performance_gbm.logloss()))

GBM AUC is : 0.9118117071438436
GBM Logloss is : 0.19535343851362647


#### CatBoost vs Random Forest vs GBM comparison (with default parameters)

|           | CatBoost | Random Forest | GBM    |
| --------- | -------- | ------------- | ------ |
| Time(sec) | 22.50    | 0.73          | 0.49   |
| AUC       | 0.9539   | 0.9153        | 0.9118 |
| Logloss   | 0.1148   | 0.3039        | 0.1954 |
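
To read the table as relative differences, the figures can be plugged into a short calculation, sketched below with the numbers copied from the table. Part of the timing gap probably reflects the defaults themselves: CatBoost builds 1000 trees by default, whereas H2O's GBM and Random Forest default to 50.

```python
# Values copied from the comparison table above.
results = {
    "CatBoost": {"time": 22.50, "auc": 0.9539, "logloss": 0.1148},
    "Random Forest": {"time": 0.73, "auc": 0.9153, "logloss": 0.3039},
    "GBM": {"time": 0.49, "auc": 0.9118, "logloss": 0.1954},
}

base = results["CatBoost"]
comparisons = {}
for name in ("Random Forest", "GBM"):
    other = results[name]
    comparisons[name] = {
        # How many times longer CatBoost took to train.
        "slowdown": base["time"] / other["time"],
        # Absolute AUC improvement of CatBoost.
        "auc_gain": base["auc"] - other["auc"],
        # Relative Logloss reduction achieved by CatBoost.
        "logloss_cut": 1 - base["logloss"] / other["logloss"],
    }
    print("{}: CatBoost is {:.1f}x slower, AUC +{:.4f}, Logloss -{:.1%}".format(
        name,
        comparisons[name]["slowdown"],
        comparisons[name]["auc_gain"],
        comparisons[name]["logloss_cut"],
    ))
```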

