In [1]:
import numpy as np
import catboost

In [2]:
!uname -a

Linux 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 20:16:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


In [3]:
print('catboost', catboost.__version__)
print('numpy', np.__version__)

catboost 0.11.1
numpy 1.15.4


### Data creation and model training.

In [4]:
n_features, n_objects = 100, 100000

In [5]:
features = np.random.randn(n_objects, n_features).astype(np.float32)

In [6]:
target = np.random.randint(2, size=n_objects)

In [7]:
cb_clf = catboost.CatBoostClassifier(
    iterations=2000, depth=10,
    metric_period=250, boosting_type='Plain'
)

In [8]:
%%time
cb_clf = cb_clf.fit(features[0:1000], target[0:1000])

0:	learn: 0.6904036	total: 105ms	remaining: 3m 30s
250:	learn: 0.2499386	total: 9.84s	remaining: 1m 8s
500:	learn: 0.1250157	total: 19.7s	remaining: 58.8s
750:	learn: 0.0759547	total: 29.4s	remaining: 49s
1000:	learn: 0.0516550	total: 39.1s	remaining: 39.1s
1250:	learn: 0.0380106	total: 48.8s	remaining: 29.2s
1500:	learn: 0.0295658	total: 58.6s	remaining: 19.5s
1750:	learn: 0.0239441	total: 1m 8s	remaining: 9.71s
1999:	learn: 0.0199772	total: 1m 17s	remaining: 0us
CPU times: user 24min 12s, sys: 8.94 s, total: 24min 21s
Wall time: 1min 22s


In [9]:
pool_100000 = catboost.Pool(features[0:100000])
pool_10000 = catboost.Pool(features[0:10000])
pool_10001 = catboost.Pool(features[0:10001])

### Let's run prediction on 100k objects with different thread_count values.

In [10]:
%%timeit
out = cb_clf.predict(pool_100000, thread_count=1)

1.09 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
%%timeit
out = cb_clf.predict(pool_100000, thread_count=2)

591 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
%%timeit
out = cb_clf.predict(pool_100000, thread_count=10)

177 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
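
To put numbers on the scaling, the three mean timings above can be turned into speedup and parallel efficiency relative to the single-thread run. This is only arithmetic on the measured values; the helper name `speedup_table` is made up for illustration.

```python
# Mean timings (ms) copied from the %%timeit outputs above, keyed by thread_count.
timings_ms = {1: 1090.0, 2: 591.0, 10: 177.0}


def speedup_table(timings):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread timing."""
    base = timings[1]
    rows = []
    for threads in sorted(timings):
        speedup = base / timings[threads]
        efficiency = speedup / threads  # 1.0 would mean perfect linear scaling
        rows.append((threads, speedup, efficiency))
    return rows


for threads, speedup, efficiency in speedup_table(timings_ms):
    print(f"{threads:>2} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

For these measurements this gives roughly 1.8x with 2 threads and roughly 6.2x with 10 threads.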


### We see good performance and good speedups from parallelism. Let's do the same with 10k objects.

In [13]:
%%timeit
out = cb_clf.predict(pool_10000, thread_count=1)

108 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [14]:
%%timeit
out = cb_clf.predict(pool_10000, thread_count=2)

111 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
%%timeit
out = cb_clf.predict(pool_10000, thread_count=10)

110 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### We see no speedup at all. `top` shows 100% CPU load during all calls, so threading does not work.
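
Instead of watching `top`, the CPU load can be estimated from inside Python by comparing process CPU time with wall time, the same two figures `%%time` prints in cell 8. The sketch below is one possible way to do this; `cpu_to_wall_ratio` and `busy` are hypothetical helpers, and `busy` is only a single-threaded stand-in workload so the block runs without CatBoost.

```python
import time


def cpu_to_wall_ratio(fn, *args, **kwargs):
    """Run fn once and return (process CPU time) / (wall time).

    time.process_time() sums CPU time over all threads of the process,
    so a ratio near 1.0 suggests one busy core and near 2.0 two busy cores.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else float("nan")


def busy(n=2_000_000):
    """Stand-in single-threaded workload (hypothetical, for illustration only)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


ratio = cpu_to_wall_ratio(busy)
print(f"CPU/wall ratio: {ratio:.2f}")
```

With the objects from this notebook, the call would look like `cpu_to_wall_ratio(cb_clf.predict, pool_10000, thread_count=10)`; a result close to 1.0 would match the 100% load seen in `top`.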

### Let's run the same task with 10k + 1 objects.

In [16]:
%%timeit
out = cb_clf.predict(pool_10001, thread_count=1)

108 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [17]:
%%timeit
out = cb_clf.predict(pool_10001, thread_count=2)

58.7 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [18]:
%%timeit
out = cb_clf.predict(pool_10001, thread_count=10)

64.5 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### We see that for thread_count=2 there is the same great speedup as in the 100k case, but with 10 threads there is no further speedup. `top` shows 200% CPU load for both the thread_count=2 and thread_count=10 cases.
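
One pattern that would be consistent with all three experiments is that prediction work is split into blocks of at most 10,000 objects, with at most one thread per block. This is purely an assumption inferred from the timings and the `top` readings above, not something confirmed from the CatBoost source; the constant `ASSUMED_BLOCK_SIZE` and the function `effective_threads` are hypothetical.

```python
import math

# Hypothetical block size, inferred only from the measurements in this notebook.
ASSUMED_BLOCK_SIZE = 10_000


def effective_threads(n_objects, thread_count, block_size=ASSUMED_BLOCK_SIZE):
    """Threads that could be kept busy if work were split into fixed-size blocks."""
    n_blocks = max(1, math.ceil(n_objects / block_size))
    return min(thread_count, n_blocks)


for n_objects in (10_000, 10_001, 100_000):
    for thread_count in (1, 2, 10):
        used = effective_threads(n_objects, thread_count)
        print(f"{n_objects:>7} objects, thread_count={thread_count:>2}: "
              f"up to {used} busy thread(s)")
```

Under this assumption, 10,000 objects never use more than one thread and 10,001 objects never use more than two, which lines up with the 100% and 200% CPU loads reported above.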