## Note on Printing
The following experiments print a lot of output. I tried to reduce it by printing only part of each result, e.g., the first 10 elements with `x[:10]`, but this slowed the experiments down by an order of magnitude.
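
To see how much of a measurement comes from printing rather than from computation, the two can be timed separately. The sketch below is a minimal, standard-library illustration: the `timed` helper and the list-based stand-in data are hypothetical, and it is not the `%%time_cell` magic used in this notebook. It only separates the two costs for eager computation. Koalas builds its results lazily, so there the work is triggered by the print itself.

```python
import io
import time
from contextlib import contextmanager, redirect_stdout


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
a = list(range(100_000))  # stand-in data, not the taxi dataset
b = list(range(100_000))

# Time the element-wise addition on its own.
with timed("compute", results):
    total = [x + y for x, y in zip(a, b)]

# Time the printing on its own; write into a buffer to keep the output short.
buffer = io.StringIO()
with timed("print", results), redirect_stdout(buffer):
    print(total)

print(f"compute: {results['compute']:.4f}s, print: {results['print']:.4f}s")
```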

In [1]:
import pandas as pd
import numpy as np
import collections.abc
# Koalas still looks up these aliases on `collections`; they were removed in Python 3.10, so restore them
collections.Iterable = collections.abc.Iterable
collections.Callable = collections.abc.Callable
import databricks.koalas as ks
import utils

# Setting this option appears to be necessary for performance.
# See: https://github.com/databricks/koalas/issues/1769 (the issue does not seem to be limited to apply())
ks.set_option('compute.default_index_type', 'distributed')

koalas_df = ks.read_csv('../datasets/yellow_tripdata_2015-01.csv')
pandas_df = pd.read_csv('../datasets/yellow_tripdata_2015-01.csv')


## Example 1 - Math Ops Series to Series

In [2]:
%%time_cell
x = koalas_df['pickup_longitude'] + koalas_df['pickup_latitude']
print(x)

0     -33.243786
1     -33.277405
2     -33.160553
3     -33.295269
4     -33.208748
5     -33.100327
6     -33.257267
7     -33.268520
8     -33.138687
9     -33.217640
10    -33.265514
11    -33.242363
12    -33.303986
13    -33.213497
14    -33.274944
15    -33.236614
16    -33.214458
17    -33.093479
18    -33.166119
19    -33.100449
20    -33.224705
21    -33.249878
22    -33.253876
23    -33.284885
24    -33.261097
25    -33.128284
26    -33.241047
27    -33.142448
28    -33.252220
29    -33.241207
30    -33.184162
31      0.000000
32    -33.228104
33    -33.204002
34    -33.232216
35    -33.278320
36    -33.275333
37    -33.213924
38    -33.267071
39    -33.259174
40    -33.236713
41    -33.267288
42    -33.166325
43    -33.171787
44    -33.171093
45    -33.215206
46    -33.256138
47    -33.264549
48    -33.268845
49    -33.226212
50    -33.166000
51    -33.260696
52    -33.185463
53    -33.233635
54    -33.291344
55    -33.268070
56    -33.183220
57    -33.268822
58    -33.2353

In [3]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 0.3s


In [4]:
%%time_cell
y = pandas_df['pickup_longitude'] + pandas_df['pickup_latitude']
print(y)

0          -33.243786
1          -33.277405
2          -33.160553
3          -33.295269
4          -33.208748
              ...    
12748981   -33.165771
12748982   -33.254559
12748983   -33.229774
12748984   -33.261082
12748985   -33.193951
Length: 12748986, dtype: float64


In [5]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [6]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 9.4x slower.
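
The times above are rounded to one decimal place, so a pandas time of `0.0s` is not actually zero: with a Koalas time of about 0.3 s and a slowdown of 9.4x, the pandas time is roughly 0.3 / 9.4 ≈ 0.03 s. Below is a minimal sketch of a ratio helper that prints more digits and guards against a genuinely zero denominator. `slowdown_ratio` is a hypothetical name, and the inputs are the rounded figures shown above, not the exact measurements.

```python
def slowdown_ratio(slow_time, fast_time):
    """Return slow_time / fast_time, or infinity if fast_time is not positive."""
    if fast_time <= 0:
        return float("inf")
    return slow_time / fast_time


# Rounded figures from the cells above, not the exact measurements.
koalas_time_rounded = 0.3
reported_slowdown = 9.4

# Back out the approximate pandas time implied by the reported ratio.
implied_pandas_time = koalas_time_rounded / reported_slowdown
print(f"Implied pandas time: {implied_pandas_time:.3f}s")
print(f"Ratio: {slowdown_ratio(koalas_time_rounded, implied_pandas_time):.1f}x")
```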

## Example 2 - Math Ops Series to Constant

In [7]:
%%time_cell
x = koalas_df['pickup_longitude'] * 2
print(x)

0     -147.987793
1     -148.003296
2     -147.926682
3     -148.018173
4     -147.942352
5     -147.748749
6     -147.966553
7     -148.005325
8     -147.566086
9     -147.971176
10    -147.977234
11    -147.987564
12    -148.016724
13    -147.947891
14    -148.013443
15    -147.952850
16    -147.937408
17    -147.726120
18    -147.891083
19    -147.748917
20    -147.953201
21    -147.989914
22    -148.001877
23    -148.005554
24    -147.994919
25    -147.904556
26    -147.982254
27    -147.573151
28    -147.987335
29    -147.970581
30    -147.939545
31       0.000000
32    -147.970184
33    -147.932327
34    -147.956848
35    -147.996704
36    -148.006104
37    -147.954193
38    -147.989624
39    -147.971878
40    -147.959976
41    -147.977936
42    -147.897369
43    -147.891022
44    -147.867889
45    -147.966049
46    -147.981918
47    -147.983276
48    -147.985275
49    -147.972321
50    -147.896957
51    -148.003265
52    -147.950684
53    -147.959763
54    -147.985657
55    -147

In [8]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 0.1s


In [9]:
%%time_cell
y = pandas_df['pickup_longitude'] * 2
print(y)

0          -147.987793
1          -148.003296
2          -147.926682
3          -148.018173
4          -147.942352
               ...    
12748981   -147.903976
12748982   -147.965485
12748983   -147.958649
12748984   -147.999130
12748985   -147.920700
Name: pickup_longitude, Length: 12748986, dtype: float64


In [10]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [11]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 6.5x slower.

## Example 3 - Compare Series to Series

In [12]:
%%time_cell
x = koalas_df['pickup_longitude'] < koalas_df['pickup_latitude']
assert x.any()

In [13]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 1.8s


In [14]:
%%time_cell
y = pandas_df['pickup_longitude'] < pandas_df['pickup_latitude']
assert y.any()

In [15]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [16]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 85.3x slower.

## Example 4 - Compare Series to Constant

In [17]:
%%time_cell
x = koalas_df['pickup_longitude'] < 2.3
assert x.any()

In [18]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 1.0s


In [19]:
%%time_cell
y = pandas_df['pickup_longitude'] < 2.3
print(y)

0           True
1           True
2           True
3           True
4           True
            ... 
12748981    True
12748982    True
12748983    True
12748984    True
12748985    True
Name: pickup_longitude, Length: 12748986, dtype: bool


In [20]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [21]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 72.0x slower.

## Example 5 - Unary Reductions 1

In [22]:
%%time_cell
x = koalas_df['pickup_longitude'].std()
print(x)

10.125103592972916

In [23]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 1.2s


In [24]:
%%time_cell
y = pandas_df['pickup_longitude'].std()
print(y)

10.125103592972902


In [25]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.1s


In [26]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 12.2x slower.
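
The two results differ in their last digits (`10.125103592972916` from Koalas versus `10.125103592972902` from pandas). Differences of this size are typical when floating-point values are accumulated in a different order, although the actual cause is not verified here. Below is a minimal sketch of comparing the two values with a tolerance instead of `==`. The tolerance is an assumed value.

```python
import math

koalas_std = 10.125103592972916  # value printed by the Koalas cell
pandas_std = 10.125103592972902  # value printed by the pandas cell

# Exact equality is too strict because the last digits differ.
print(koalas_std == pandas_std)

# A relative tolerance of 1e-12 (an assumption) treats the values as equal.
print(math.isclose(koalas_std, pandas_std, rel_tol=1e-12))
```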

## Example 6 - Unary Reductions 2

Koalas fails here, with the same effect and for the same reason as in `value_counts.ipynb`. We again use the `VendorID` column.

In [27]:
%%time_cell
x = koalas_df["VendorID"].unique()
print(x)

0    1
1    2
Name: VendorID, dtype: int32


In [28]:
koalas_time = _TIMED_CELL
print(f"Koalas time: {koalas_time:.1f}s")

Koalas time: 1.7s


In [29]:
%%time_cell
y = pandas_df["VendorID"].unique()
print(y)

[2 1]


In [30]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.1s


In [31]:
slowdown = koalas_time / pandas_time
utils.print_md(f"### Koalas is {slowdown:.1f}x slower.")

### Koalas is 33.9x slower.
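
The two libraries return the unique values in different containers and in a different order: a Series holding `1, 2` from Koalas and an array `[2 1]` from pandas. Below is a minimal sketch of an order-insensitive comparison. It uses plain Python lists as stand-ins for the two results so that it runs without Spark.

```python
koalas_unique = [1, 2]  # stand-in for the values printed by the Koalas cell
pandas_unique = [2, 1]  # stand-in for the values printed by the pandas cell

# List equality is order-sensitive, so this reports a mismatch.
print(koalas_unique == pandas_unique)

# Comparing as sets ignores the order of the values.
print(set(koalas_unique) == set(pandas_unique))
```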