In [None]:
from string import ascii_letters
import time

import pandas as pd
import polars as pl
import plotly
import numpy as np

So, by now I have probably been hyping up this Polars package quite a bit. But how good can it really be? Let's do some timings.

First, let's initialize a random data set.

In [None]:
import random

some_ints = [random.randint(0, 1000) for _ in range(100000)]
some_floats = [random.uniform(0.0, 1000.0) for _ in range(100000)]
some_strs = [''.join(random.choices(ascii_letters, k=random.randint(1, 10))) for _ in range(100000)]
some_gauss = [random.gauss(mu=50, sigma=5) for _ in range(100000)]
some_cats = ["Spicy", "Very spicy", "Mild", "Ultra Mild", "Bomboclat 😮"] * 20000

data_dict = {
    "integers": some_ints,
    "floats": some_floats,
    "strings": some_strs,
    "gauss": some_gauss,
    "categories": some_cats,
}

pandas_df = pd.DataFrame(data_dict)
polars_df = pl.DataFrame(data_dict)

print("Pandas DataFrame and description: \n\n", pandas_df.head(), '\n'*3, pandas_df.describe(), "\n================================\n")
print("Polars DataFrame and description: \n\n ",  polars_df.head(), '\n'*3, polars_df.describe())

So far so good? As you can see, there is not much difference yet. At least the data itself should look the same, so we know calculations are performed similarly. Now, just from these print statements alone, we can see that Polars is a lot prettier. However, I don't really care about looks and this comparison is about performance, so let's dive a little deeper:

# Exercise 1a
Let us perform some typical DataFrame transformations: implement the following two functions and adhere to the type hints. Try to write them as efficiently as you can, because we will time them later.

In [None]:
def do_some_pandas_aggregations(df: pd.DataFrame) -> pd.DataFrame:
    """ To be implemented: group the data by its categories and perform the following data aggregations:
    - the minimum and maximum float values per group
    - the average integer value per group
    - the standard deviation of the gaussian sample per group
    """
    df = _
    return df

def do_some_pandas_filtering(df: pd.DataFrame) -> pd.DataFrame:
    """ To be implemented: keep only the rows whose string starts with a or b, in either lower or upper case (a, A, b or B)
    """
    df = _
    return df

The following code shows you how to do this in Polars. Take a quick look and determine what is different.

In [None]:
def do_some_polars_aggregations(df: pl.DataFrame) -> pl.DataFrame:
    return df.group_by("categories").agg(
        pl.col("floats").min(),
        pl.col("floats").max(),
        pl.col("integers").mean(),
        pl.col("gauss").std(),
    )

def do_some_polars_filtering(df: pl.DataFrame) -> pl.DataFrame:
    return df.filter(
        pl.col("strings").str.starts_with('a') | 
        pl.col("strings").str.starts_with('A') |
        pl.col("strings").str.starts_with('b') | 
        pl.col("strings").str.starts_with('B')
    )

do_some_polars_aggregations(polars_df)

# Exercise 1b
Oopsie whoopsie, I made a mistake. Clearly, this code is not finished yet. Inspect the *clear* error message, and fix the code.

Now that you have written your own Polars code, you will see that it is a little more verbose than trusty old pandas. There is also this recurring pl.col() function, but more on that later. If you implemented your own code correctly, we can now run the following code for a fair comparison.

In [None]:
import plotly.express as px

pd_agg_time = %timeit -o do_some_pandas_aggregations(pandas_df)
pl_agg_time = %timeit -o do_some_polars_aggregations(polars_df)
pd_filt_time = %timeit -o do_some_pandas_filtering(pandas_df)
pl_filt_time = %timeit -o do_some_polars_filtering(polars_df)

In [None]:
time_data = {
    "best_time": [pd_agg_time.best, pl_agg_time.best, pd_filt_time.best, pl_filt_time.best],
    "avg_time": [pd_agg_time.average, pl_agg_time.average, pd_filt_time.average, pl_filt_time.average],
    "worst_time": [pd_agg_time.worst, pl_agg_time.worst, pd_filt_time.worst, pl_filt_time.worst],
    "function": ["agg", "agg", "filter", "filter"],
    "lib": ["pandas", "polars", "pandas", "polars"]
}
fig = px.bar(time_data, x='function', y='best_time', barmode='group', color='lib')
fig.show()

fig = px.bar(time_data, x='function', y='avg_time', barmode='group', color='lib')
fig.show()

fig = px.bar(time_data, x='function', y='worst_time', barmode='group', color='lib')
fig.show()

Apparently, all those extra words are good for something! That's a pretty significant speedup if you ask me, especially when we get into big data territory.
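One side note: the `%timeit -o` magic used above only exists in IPython and Jupyter. If you want to repeat the comparison in a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `summarize_timings` and the `workload` function are made up for illustration; swap in something like `lambda: do_some_polars_aggregations(polars_df)` to time the real functions.

```python
import timeit


def summarize_timings(func, repeat=5, number=10):
    """Time func() and return the best, average and worst seconds per call."""
    # timeit.repeat returns one total duration per repetition,
    # where each repetition calls func() `number` times.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return {
        "best": min(per_call),
        "average": sum(per_call) / len(per_call),
        "worst": max(per_call),
    }


def workload():
    # Hypothetical stand-in for one of the DataFrame functions above.
    return sum(i * i for i in range(1_000))


timings = summarize_timings(workload)
print(timings)
```

The resulting dictionary mirrors the `best`, `average` and `worst` attributes read from the `%timeit -o` results, so it can feed the same bar charts. The numbers will not match `%timeit` exactly, because the magic picks its own loop counts.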