## Description - `np.where()`
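Passing Koalas data to NumPy seems to cause a slowdown. `np.where()` needs an in-memory array, so the Koalas Series must first be converted with `.to_numpy()`, which collects the data to the driver.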

In [1]:
import pandas as pd
import numpy as np
import collections.abc
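# Koalas still refers to collections.Iterable and collections.Callable, which were
# removed in Python 3.10, so alias them from collections.abc before importing it.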
collections.Iterable = collections.abc.Iterable
collections.Callable = collections.abc.Callable
import databricks.koalas as ks
import utils

# Setting this option appears to be necessary for performance.
# See: https://github.com/databricks/koalas/issues/1769 (the issue does not seem to be limited to apply())
ks.set_option('compute.default_index_type', 'distributed')

koalas_df = ks.read_csv('../datasets/iris.csv')
pandas_df = pd.read_csv('../datasets/iris.csv')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/28 02:43:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
%%time_cell
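# np.where() needs NumPy input, so the Koalas data has to be converted with .to_numpy().
# I tried both versions below. Both were slow; the one that compares in Koalas and
# converts once was the faster of the two, so it is the one used here.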
# koalas_arr = np.where(koalas_df['sepal_width'].to_numpy() < koalas_df['sepal_length'].to_numpy(), 10, 20)
koalas_arr = np.where((koalas_df['sepal_width'] < koalas_df['sepal_length']).to_numpy(), 10, 20)
print(koalas_arr)

[10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10]


In [3]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 0.2s


In [4]:
%%time_cell
pandas_arr = np.where(pandas_df['sepal_width'] < pandas_df['sepal_length'], 10, 20)
print(pandas_arr)

[10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10]


In [5]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.3f}s")

Pandas time: 0.001s


In [6]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 202.3x slower.
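
A single run that takes about a millisecond is sensitive to noise, so the ratio above is best read as an order of magnitude. The sketch below shows one way to get a steadier figure for the pandas/NumPy side by repeating the call with `timeit`. It uses synthetic arrays as stand-ins for the two iris columns (the values, the seed, and the `run_where` helper are assumptions for illustration, not part of this notebook).

```python
import timeit

import numpy as np

# Synthetic stand-ins for the two iris columns (150 rows, made-up values).
rng = np.random.default_rng(0)
sepal_width = rng.uniform(2.0, 4.5, size=150)
sepal_length = rng.uniform(4.0, 8.0, size=150)


def run_where():
    """Hypothetical helper: the same np.where() call as in the cells above."""
    return np.where(sepal_width < sepal_length, 10, 20)


# Run 1000 calls per measurement, repeat 5 times, and keep the best run
# to reduce the influence of one-off noise.
calls_per_run = 1000
runs = timeit.repeat(run_where, number=calls_per_run, repeat=5)
best_per_call = min(runs) / calls_per_run
print(f"Best time per call: {best_per_call:.2e}s")
```

The same pattern could be applied to the Koalas cell, although each repetition would then include the cost of collecting the data again.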

In [7]:
assert (koalas_arr == pandas_arr).all()
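
Because every element of the result here is 10, the `==` check above would also pass if one of the arrays had the wrong length, since NumPy broadcasts the comparison. A minimal sketch of a stricter check with `np.array_equal`, which also compares shapes, is shown below. The arrays `full` and `truncated` are made-up examples, not the notebook's results.

```python
import numpy as np

# Made-up arrays: one with 150 elements, one accidentally holding a single value.
full = np.full(150, 10)
truncated = np.array([10])

# The elementwise comparison broadcasts the single value against all 150 elements,
# so this check passes even though the shapes differ.
loose = bool((full == truncated).all())

# np.array_equal requires the shapes to match as well as the values.
strict = bool(np.array_equal(full, truncated))

print(f"Broadcast comparison: {loose}, np.array_equal: {strict}")
```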