# Polars VS Pandas

Polars is a framework to manipulate tabular data. It is similar to Pandas in many aspects, but has a slightly different syntax (often cleaner when chaining operations). It is written in Rust, using Apache Arrow as its memory model, and does multi-threading efficiently. It handles streaming for larger-than-RAM tables.

Polars has `LazyFrame` objects, which are conceptually similar to how Spark operates. You can build dataframes by applying operations to other dataframes, without the operations actually being performed. Then you call `.collect()` at the end, which runs an optimised version of the chain of operations you built. For instance, Polars might spot that the same expression is computed twice, or skip from the start the columns or rows that never end up being used (projection and predicate pushdown).

See also:

https://docs.pola.rs/user-guide/migration/pandas/

https://github.com/mdh266/PolarsDuckDBPlayGround

http://michael-harmon.com/blog/PolarsDuckDB.html

In [1]:
import pandas as pd
import polars as pl

csv_file_name = 'gaia_dr3_lite_random_1M.csv'

## Pandas

In [2]:
%%time
df_pandas = pd.read_csv(csv_file_name)

CPU times: user 1.77 s, sys: 236 ms, total: 2.01 s
Wall time: 2.01 s


In [3]:
%%time
df_pandas.head(10)

CPU times: user 178 μs, sys: 38 μs, total: 216 μs
Wall time: 181 μs


Unnamed: 0,l,b,ra,dec,pmra,pmra_error,pmdec,pmdec_error,parallax,parallax_error,...,vbroad,g,bp_rp,has_xp_continuous,ruwe,non_single_star,astrometric_matched_transits,visibility_periods_used,hpx2,hpx5
0,34.939058,-3.306332,286.716913,0.276195,2.549028,0.030442,-4.075544,0.026224,1.084924,0.033431,...,,15.244129,1.522128,True,1.033783,0,26,15,118.0,7580.0
1,286.532346,-6.648696,152.625068,-64.267699,-2.606806,2.147447,4.414332,1.408532,0.854356,1.282721,...,,20.906347,0.702095,False,0.912624,0,12,10,145.0,9330.0
2,104.935184,-15.490523,348.434982,43.940704,-1.729704,0.514036,-3.353289,0.558126,1.042008,0.633289,...,,20.531225,1.649284,False,1.051422,0,47,23,53.0,3442.0
3,346.09665,4.141483,252.729717,-37.917215,-3.566062,1.053221,-4.132508,0.71286,0.58766,0.633092,...,,20.145899,2.006773,False,1.083541,0,32,16,165.0,10607.0
4,344.190022,-4.748318,260.678787,-44.788772,0.209898,0.773631,-3.433852,0.594055,-0.293376,0.619885,...,,19.787357,1.57341,False,1.251593,0,34,19,165.0,10575.0
5,334.929013,17.429027,231.361017,-35.700316,,,,,,,...,,21.05777,0.809551,False,,0,6,6,166.0,10682.0
6,298.619882,11.891354,186.180736,-50.75659,-8.274602,0.201844,0.95555,0.18576,0.43694,0.216148,...,,19.050842,1.396774,False,1.03738,0,59,24,170.0,10881.0
7,341.483871,7.543388,245.577837,-39.059397,-10.72246,1.084326,-2.04772,0.705856,2.379563,0.831078,...,,20.33379,1.788349,False,0.999621,0,20,13,167.0,10690.0
8,292.814115,-18.093064,144.849128,-77.016786,-3.334796,0.230657,2.677913,0.220169,0.352183,0.185059,...,,18.942072,1.227215,False,1.215692,0,38,27,144.0,9243.0
9,12.763693,11.052306,263.432898,-12.341759,0.176833,1.254534,-2.42647,0.85644,-0.283236,1.247819,...,,20.702047,1.673468,False,0.913886,0,23,14,115.0,7394.0


## Polars 

In [4]:
%%time
df_polars = pl.read_csv(csv_file_name)

CPU times: user 1.24 s, sys: 113 ms, total: 1.35 s
Wall time: 218 ms


In [5]:
%%time
df_polars.head(20)

CPU times: user 274 μs, sys: 140 μs, total: 414 μs
Wall time: 387 μs


l,b,ra,dec,pmra,pmra_error,pmdec,pmdec_error,parallax,parallax_error,radial_velocity,radial_velocity_error,vbroad,g,bp_rp,has_xp_continuous,ruwe,non_single_star,astrometric_matched_transits,visibility_periods_used,hpx2,hpx5
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,bool,f64,i64,i64,i64,f64,f64
34.939058,-3.306332,286.716913,0.276195,2.549028,0.030442,-4.075544,0.026224,1.084924,0.033431,,,,15.244129,1.5221281,true,1.0337826,0,26,15,118.0,7580.0
286.532346,-6.648696,152.625068,-64.267699,-2.606806,2.147447,4.414332,1.4085325,0.854356,1.2827208,,,,20.906347,0.702095,false,0.912624,0,12,10,145.0,9330.0
104.935184,-15.490523,348.434982,43.940704,-1.729704,0.514036,-3.353289,0.5581261,1.042008,0.6332887,,,,20.531225,1.6492844,false,1.0514224,0,47,23,53.0,3442.0
346.09665,4.141483,252.729717,-37.917215,-3.566062,1.0532212,-4.132508,0.71286,0.58766,0.6330921,,,,20.145899,2.006773,false,1.0835409,0,32,16,165.0,10607.0
344.190022,-4.748318,260.678787,-44.788772,0.209898,0.7736309,-3.433852,0.594055,-0.293376,0.619885,,,,19.787357,1.57341,false,1.2515925,0,34,19,165.0,10575.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
3.492412,-6.058198,274.391063,-28.888978,0.565778,0.111975,3.520154,0.086621,0.242796,0.089933,,,,17.278742,1.2051601,true,1.0057354,0,27,15,112.0,7193.0
347.768133,-40.719769,314.982677,-50.876891,2.366481,0.193474,-1.405522,0.213664,0.59973,0.231076,,,,19.126814,1.672224,false,1.0187637,0,56,22,179.0,11507.0
21.620563,-2.93419,280.300772,-11.401042,1.987209,0.21463,-5.339681,0.176381,0.058791,0.1837972,,,,18.44229,1.7789154,false,1.0381606,0,21,15,113.0,7294.0
39.948557,7.763874,279.075733,9.748366,0.420828,0.08779,0.581696,0.079152,0.330182,0.073009,,,,17.185175,1.1714039,true,0.9583767,0,42,16,124.0,7957.0


Pandas took about 2 s to load the file, and Polars about 0.2 s, roughly ten times faster.

Displaying the first rows takes well under a millisecond in both libraries, as the `head()` timings above show.

## Aggregate quantities

Here Polars is about 1.5 times as fast (roughly 49 ms against 73 ms of wall time in the cells below).

With its slightly different syntax, it also lets you write operations directly inside the `.agg()` method:

    df_polars.group_by("hpx5").agg(
        pl.col("l").mean().alias("meanL"),
        pl.col("b").mean().alias("meanB"),
        # Parentheses make .alias() name the whole product, not just the mean.
        (1000 * pl.col("pmra").mean()).alias("meanPMRA"),
        (1000 * pl.col("pmdec").mean()).alias("meanPMDEC"),
        pl.col("l").count().alias("count"),
    ).sort("hpx5")

where the pandas syntax would be:

    ...
    meanPMRA=('pmra', lambda x: 1000*x.mean()),
    ...

Returning a dataframe with only a subset of columns also has a different syntax, and actually takes extra time for pandas.

In [6]:
%%time
df_pandas.groupby(
  ['hpx5']
).agg(
  meanL=('l', 'mean'),
  meanB=('b', 'mean'),
  meanPMRA=('pmra', 'mean'),
  meanPMDEC=('pmdec', 'mean'),
  count=('l','count')
).sort_values("hpx5")[['meanPMRA','meanPMDEC','count']]

CPU times: user 56.6 ms, sys: 19.3 ms, total: 75.8 ms
Wall time: 73.3 ms


hpx5,meanPMRA,meanPMDEC,count
0.0,6.009433,-3.094447,5
1.0,16.423222,-17.684266,10
2.0,1.419328,-4.925885,6
3.0,8.614509,-6.011953,5
4.0,4.918780,-5.167080,5
...,...,...,...
12283.0,1.154579,-6.805211,34
12284.0,1.460014,-7.063548,29
12285.0,0.948179,-5.556799,26
12286.0,-0.748367,-4.266005,36


In [7]:
%%time
df_polars.group_by("hpx5").agg(
    pl.col("l").mean().alias("meanL"),
    pl.col("b").mean().alias("meanB"),
    pl.col("pmra").mean().alias("meanPMRA"),
    pl.col("pmdec").mean().alias("meanPMDEC"),
    pl.col("l").count().alias("count"),    
).sort("hpx5").select('meanPMRA','meanPMDEC','count')

CPU times: user 190 ms, sys: 88.9 ms, total: 278 ms
Wall time: 48.6 ms


meanPMRA,meanPMDEC,count
f64,f64,u32
6.009433,-3.094447,5
16.423222,-17.684266,10
1.419328,-4.925885,6
8.614509,-6.011953,5
4.91878,-5.16708,5
…,…,…
1.154579,-6.805211,34
1.460014,-7.063548,29
0.948179,-5.556799,26
-0.748367,-4.266005,36


## Chaining operations with polars VS intermediate steps with pandas

Polars has the `select()` method, which accepts expressions: you can select columns by name, but also the sum, product, or any function of combinations of columns, and you can choose what the resulting columns are called. In pandas, the most direct way to do the same is an intermediate step, because an expression like `df['XYZ']` needs to know what the dataframe is called. There is no direct equivalent of `pl.col("XYZ")` meaning "the column named XYZ in the current dataframe"; the closest is passing a callable (such as a lambda) to `.assign()` or `.loc[]`.

Here I compute the sum of two of the aggregate columns (using `.select()` in polars and using `.assign()` in pandas), then filter the resulting table according to this sum. 

In [8]:
%%time
intermediate_df = df_pandas.groupby(
  ['hpx5']
).agg(
  meanL=('l', 'mean'),
  meanB=('b', 'mean'),
  meanPMRA=('pmra', 'mean'),
  meanPMDEC=('pmdec', 'mean'),
  count=('l','count')
).sort_values("hpx5")[['meanPMRA','meanPMDEC','count']]

third_df = intermediate_df.assign(
    totalPM=intermediate_df['meanPMRA']+intermediate_df['meanPMDEC']
)[['meanPMRA','meanPMDEC','totalPM','count']]

third_df[third_df['totalPM']>0]

CPU times: user 75.2 ms, sys: 20 ms, total: 95.2 ms
Wall time: 93.7 ms


hpx5,meanPMRA,meanPMDEC,totalPM,count
0.0,6.009433,-3.094447,2.914987,5
3.0,8.614509,-6.011953,2.602555,5
6.0,7.081828,-3.887544,3.194283,5
7.0,6.280178,-3.867381,2.412797,4
8.0,6.895964,-6.735954,0.160010,2
...,...,...,...,...
12121.0,8.262829,-2.947541,5.315288,10
12125.0,5.110782,-2.660810,2.449973,4
12140.0,8.190770,-6.337214,1.853556,14
12143.0,10.329461,-9.651335,0.678127,18


In [9]:
%%time
df_polars.group_by("hpx5").agg(
    pl.col("l").mean().alias("meanL"),
    pl.col("b").mean().alias("meanB"),
    pl.col("pmra").mean().alias("meanPMRA"),
    pl.col("pmdec").mean().alias("meanPMDEC"),
    pl.col("l").count().alias("count"),    
).sort("hpx5").select(
    'meanPMRA',
    'meanPMDEC',
    (pl.col('meanPMRA')+pl.col('meanPMDEC')).alias("totalPM"),
    'count').filter(pl.col('totalPM')>0)

CPU times: user 202 ms, sys: 43.7 ms, total: 246 ms
Wall time: 61.7 ms


meanPMRA,meanPMDEC,totalPM,count
f64,f64,f64,u32
6.009433,-3.094447,2.914987,5
8.614509,-6.011953,2.602555,5
7.081828,-3.887544,3.194283,5
6.280178,-3.867381,2.412797,4
6.895964,-6.735954,0.16001,2
…,…,…,…
8.262829,-2.947541,5.315288,10
5.110782,-2.66081,2.449973,4
8.19077,-6.337214,1.853556,14
10.329461,-9.651335,0.678127,18
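
For completeness, the pandas steps can also be written as one chain without naming the intermediate dataframes, by passing callables to `.assign()` and `.loc[]`. This is a sketch on a hypothetical four-row table, not a re-run of the cell above; the lambda argument `d` is the dataframe as it exists at that point of the chain.

```python
import pandas as pd

# Hypothetical toy data standing in for the real table.
df = pd.DataFrame(
    {
        "hpx5": [0, 0, 1, 1],
        "pmra": [6.0, 8.0, 1.0, 3.0],
        "pmdec": [-3.0, -5.0, -4.0, -6.0],
    }
)

chained = (
    df.groupby("hpx5")
    .agg(
        meanPMRA=("pmra", "mean"),
        meanPMDEC=("pmdec", "mean"),
        count=("pmra", "count"),
    )
    .sort_index()
    # The lambda receives the aggregated dataframe built so far.
    .assign(totalPM=lambda d: d["meanPMRA"] + d["meanPMDEC"])
    # Filter rows and pick columns in one step, again via a callable.
    .loc[lambda d: d["totalPM"] > 0, ["meanPMRA", "meanPMDEC", "totalPM", "count"]]
)
print(chained)
```

This keeps the chain intact, although the lambdas are arguably less readable than the `pl.col(...)` expressions in the Polars version.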


# Query plans

Using `scan_csv` rather than `read_csv` returns a `LazyFrame` (which is a query plan) instead of a `DataFrame`.

You can do `.show_graph()` on it to see an optimised query plan, or `.profile()` to see the timing of each layer of operations.

And you can do `.collect()` to materialise the result as a dataframe.

In the present case, doing:

    pl.scan_csv("...")
    .some_operations...
        ...
    .collect()

takes about 150ms, while the non-optimised:

    pl.read_csv("...")
    .some_operations...
        ...

takes 250ms. The pandas version takes two entire seconds!

In [10]:
%%time
pl.scan_csv(csv_file_name).group_by("hpx5").agg(
    pl.col("l").mean().alias("meanL"),
    pl.col("b").mean().alias("meanB"),
    pl.col("pmra").mean().alias("meanPMRA"),
    pl.col("pmdec").mean().alias("meanPMDEC"),
    pl.col("l").count().alias("count"),    
).sort("hpx5").select(
    'meanPMRA',
    'meanPMDEC',
    (pl.col('meanPMRA')+pl.col('meanPMDEC')).alias("totalPM"),
    'count').filter(pl.col('totalPM')>0)

CPU times: user 1.65 ms, sys: 1.71 ms, total: 3.36 ms
Wall time: 2.07 ms


In [11]:
%%time
pl.scan_csv(csv_file_name).group_by("hpx5").agg(
    pl.col("l").mean().alias("meanL"),
    pl.col("b").mean().alias("meanB"),
    pl.col("pmra").mean().alias("meanPMRA"),
    pl.col("pmdec").mean().alias("meanPMDEC"),
    pl.col("l").count().alias("count"),    
).sort("hpx5").select(
    'meanPMRA',
    'meanPMDEC',
    (pl.col('meanPMRA')+pl.col('meanPMDEC')).alias("totalPM"),
    'count').filter(pl.col('totalPM')>0).profile()

CPU times: user 630 ms, sys: 96.3 ms, total: 726 ms
Wall time: 145 ms


(shape: (2_286, 4)
 ┌───────────┬───────────┬──────────┬───────┐
 │ meanPMRA  ┆ meanPMDEC ┆ totalPM  ┆ count │
 │ ---       ┆ ---       ┆ ---      ┆ ---   │
 │ f64       ┆ f64       ┆ f64      ┆ u32   │
 ╞═══════════╪═══════════╪══════════╪═══════╡
 │ 6.009433  ┆ -3.094447 ┆ 2.914987 ┆ 5     │
 │ 8.614509  ┆ -6.011953 ┆ 2.602555 ┆ 5     │
 │ 7.081828  ┆ -3.887544 ┆ 3.194283 ┆ 5     │
 │ 6.280178  ┆ -3.867381 ┆ 2.412797 ┆ 4     │
 │ 6.895964  ┆ -6.735954 ┆ 0.16001  ┆ 2     │
 │ …         ┆ …         ┆ …        ┆ …     │
 │ 8.262829  ┆ -2.947541 ┆ 5.315288 ┆ 10    │
 │ 5.110782  ┆ -2.66081  ┆ 2.449973 ┆ 4     │
 │ 8.19077   ┆ -6.337214 ┆ 1.853556 ┆ 14    │
 │ 10.329461 ┆ -9.651335 ┆ 0.678127 ┆ 18    │
 │ 14.164176 ┆ -2.600706 ┆ 11.56347 ┆ 12    │
 └───────────┴───────────┴──────────┴───────┘,
 shape: (7, 3)
 ┌─────────────────────────────────┬────────┬────────┐
 │ node                            ┆ start  ┆ end    │
 │ ---                             ┆ ---    ┆ ---    │
 │ str            

## Misc

* The lazy loading works on parquet files too (`scan_parquet(...)`).

* Sometimes even the result of a LazyFrame is too large to fit in memory. Instead of `.collect()` you can write it to a file, with `.sink_csv()` or `.sink_parquet()`, and the output will be streamed into the file. For instance, to convert a very large CSV file to a very large parquet file without ever loading all of it into memory:

        lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")  
        lf.sink_parquet("out.parquet")  

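* When debugging queries, you can use `.fetch(5)` (or any number N of rows, default is 500) rather than `.collect()`. Note that recent Polars releases deprecate `.fetch()`; `.head(N).collect()` is the suggested replacement.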

* Polars can read directly from a database:

        uri = "postgresql://username:password@server:port/database"
        query = "SELECT * FROM foo"
        pl.read_database_uri(query=query, uri=uri)

