## Summary

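PolaRS is roughly on par with Pandas for these operations and occasionally faster (Example 5), although it is slower in most of the measurements below, by up to 7.6x in Example 1. I think the slowdowns come from the overhead of crossing into Rust and from the PolaRS query optimizer. A realistic cell will usually do more computation, which PolaRS can evaluate lazily rather than eagerly and can therefore optimize, so I expect PolaRS to be at least on par with Pandas in practice.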

Interestingly, for these simple operations the PolaRS API matches the Pandas API.

In [1]:
# `!export` only affects a throwaway subshell; `%env` sets the variable in the kernel itself.
%env POLARS_MAX_THREADS=12

In [2]:
import pandas as pd
import numpy as np
import polars as pl
import utils

polars_df = pl.read_csv("../datasets/yellow_tripdata_2015-01.csv")
pandas_df = pd.read_csv('../datasets/yellow_tripdata_2015-01.csv')

## Example 1 - Math Ops Series to Series

In [3]:
%%time_cell
x = polars_df['pickup_longitude'] + polars_df['pickup_latitude']
print(x[:4], x[-4:])

shape: (4,)
Series: 'pickup_longitude' [f64]
[
	-33.243786
	-33.277405
	-33.160553
	-33.295269
] shape: (4,)
Series: 'pickup_longitude' [f64]
[
	-33.254559
	-33.229774
	-33.261082
	-33.193951
]


In [4]:
polars_time = _TIMED_CELL
print(f"PolaRS time: {polars_time:.1f}s")

PolaRS time: 0.2s


In [5]:
%%time_cell
y = pandas_df['pickup_longitude'] + pandas_df['pickup_latitude']
print(y)

0          -33.243786
1          -33.277405
2          -33.160553
3          -33.295269
4          -33.208748
              ...    
12748981   -33.165771
12748982   -33.254559
12748983   -33.229774
12748984   -33.261082
12748985   -33.193951
Length: 12748986, dtype: float64


In [6]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [7]:
slowdown = polars_time / pandas_time
utils.print_md(f"### PolaRS is {slowdown:.1f}x slower.")

### PolaRS is 7.6x slower.
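
Because the times above are printed with one decimal place, the Pandas time shows as `0.0s` even though the ratio is 7.6x. One way to get steadier numbers for such short operations is to repeat the measurement and keep the best run. The helper below is a hypothetical sketch using only the standard library; it is not the implementation of `%%time_cell` or `utils`, and the names `best_of` and `repeats` are made up here.

```python
import time


def best_of(fn, repeats=5):
    """Return the shortest wall-clock time (in seconds) over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Stand-in workload; in the notebook this would be the Series operation under test.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed * 1000:.3f} ms")
```

Printing in milliseconds with a few decimals keeps sub-second timings readable.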

## Example 2 - Math Ops Series to Constant

In [8]:
%%time_cell
x = polars_df['pickup_longitude'] * 2
print(x[:4], x[-4:])

shape: (4,)
Series: 'pickup_longitude' [f64]
[
	-147.987793
	-148.003296
	-147.926682
	-148.018173
] shape: (4,)
Series: 'pickup_longitude' [f64]
[
	-147.965485
	-147.958649
	-147.99913
	-147.9207
]


In [9]:
polars_time = _TIMED_CELL
print(f"PolaRS time: {polars_time:.1f}s")

PolaRS time: 0.1s


In [10]:
%%time_cell
y = pandas_df['pickup_longitude'] * 2
print(y)

0          -147.987793
1          -148.003296
2          -147.926682
3          -148.018173
4          -147.942352
               ...    
12748981   -147.903976
12748982   -147.965485
12748983   -147.958649
12748984   -147.999130
12748985   -147.920700
Name: pickup_longitude, Length: 12748986, dtype: float64


In [11]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [12]:
slowdown = polars_time / pandas_time
utils.print_md(f"### PolaRS is {slowdown:.1f}x slower.")

### PolaRS is 2.6x slower.

## Example 3 - Compare Series to Series

In [13]:
%%time_cell
x = polars_df['pickup_longitude'] < polars_df['pickup_latitude']
assert x.any()

In [14]:
polars_time = _TIMED_CELL
print(f"PolaRS time: {polars_time:.1f}s")

PolaRS time: 0.0s


In [15]:
%%time_cell
y = pandas_df['pickup_longitude'] < pandas_df['pickup_latitude']
assert y.any()

In [16]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [17]:
slowdown = polars_time / pandas_time
utils.print_md(f"### PolaRS is {slowdown:.1f}x slower.")

### PolaRS is 1.2x slower.

## Example 4 - Compare Series to Constant

In [18]:
%%time_cell
x = polars_df['pickup_longitude'] < 2.3
assert x.any()

In [19]:
polars_time = _TIMED_CELL
print(f"PolaRS time: {polars_time:.1f}s")

PolaRS time: 0.0s


In [20]:
%%time_cell
y = pandas_df['pickup_longitude'] < 2.3
print(y)

0           True
1           True
2           True
3           True
4           True
            ... 
12748981    True
12748982    True
12748983    True
12748984    True
12748985    True
Name: pickup_longitude, Length: 12748986, dtype: bool


In [21]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [22]:
slowdown = polars_time / pandas_time
utils.print_md(f"### PolaRS is {slowdown:.1f}x slower.")

### PolaRS is 1.0x slower.

## Example 5 - Unary Reductions 1

In [23]:
%%time_cell
x = polars_df['pickup_longitude'].std()
print(x)

10.12510359296947


In [24]:
polars_time = _TIMED_CELL
print(f"PolaRS time: {polars_time:.1f}s")

PolaRS time: 0.1s


In [25]:
%%time_cell
y = pandas_df['pickup_longitude'].std()
print(y)

10.125103592972902


In [26]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.1s


In [27]:
slowdown = polars_time / pandas_time
utils.print_md(f"### PolaRS is {slowdown:.1f}x slower.")

### PolaRS is 0.9x slower.
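
The two standard deviations printed above agree to ten decimal places and differ after that, presumably because the libraries accumulate floating-point values in a different order (an assumption; nothing in this notebook confirms the cause). A tolerance-based comparison, sketched here with the standard library and the two printed values, is one way to check that the results agree.

```python
import math

# Values copied from the PolaRS and Pandas outputs above.
polars_std = 10.12510359296947
pandas_std = 10.125103592972902

# Exact equality fails, but the values agree within a tight relative tolerance.
exactly_equal = polars_std == pandas_std
close_enough = math.isclose(polars_std, pandas_std, rel_tol=1e-9)
print(exactly_equal, close_enough)
```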

## Example 6 - Unary Reductions 2

Koalas fails here! See `value_counts.ipynb`. Same effect, same reason. We will again use `VendorID`.

In [28]:
%%time_cell
x = polars_df["VendorID"].unique()
print(x)

shape: (2,)
Series: 'VendorID' [i64]
[
	1
	2
]


In [29]:
polars_time = _TIMED_CELL
print(f"PolaRS time: {polars_time:.1f}s")

PolaRS time: 0.2s


In [30]:
%%time_cell
y = pandas_df["VendorID"].unique()
print(y)

[2 1]


In [31]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.1s


In [32]:
slowdown = polars_time / pandas_time
utils.print_md(f"### PolaRS is {slowdown:.1f}x slower.")

### PolaRS is 3.1x slower.
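
Note that the two `unique()` outputs above list the same values in a different order (`1, 2` versus `[2 1]`). When checking that the results match, an order-insensitive comparison helps. The following is a minimal sketch with plain Python lists standing in for the two results; in the notebook they could be obtained with something like `x.to_list()` and `y.tolist()`, which is an assumption about how one would convert them.

```python
# Values copied from the outputs above.
polars_unique = [1, 2]
pandas_unique = [2, 1]

# A direct comparison is order-sensitive; sorting first compares only the values.
same_order = polars_unique == pandas_unique
same_values = sorted(polars_unique) == sorted(pandas_unique)
print(same_order, same_values)
```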