# Polars


In [None]:
import datetime as dt

import polars as pl

## Motivation

Polars is a project that began in 2020 as an alternative to Pandas.

The Polars team's goal is

> ... to provide a lightning fast DataFrame library that:
> * Utilizes all available cores on your machine
> * Optimizes queries to reduce unneeded work/memory allocations
> * Handles datasets much larger than your available RAM
> * A consistent and predictable API
> * Adheres to a strict schema (data-types should be known before running the query)


## Pandas vs Polars

Both projects provide a "DataFrame environment" that can be used within Python.

Pandas has been around for much longer (since 2008) and so has gone through a much longer development cycle. This is both a good thing and a bad thing.

The good thing for Pandas is that years of wide use mean that a great deal of code has already been written with it and that bugs are more likely to have already been found. Pandas is very stable.

The bad thing for Pandas is that its long history constrains its design, whereas the Polars team started fresh and could choose the most modern technologies for certain patterns. The Polars team takes advantage of that freedom with more parallelization than Pandas and with features like "lazy evaluation".

## A quick start in using Polars

We are going to do a very quick introduction to using Polars so that people are familiar with it.

Datasets:

* [May taxi data](https://rice.box.com/s/ymo3mfat1fr8arkhd6n3u7bsnui87aet)
* [June taxi data](https://rice.box.com/s/uyz48onda82up04ab28u99o3crgltjle)

We recommend that you download these files by hand and put them in your working directory so that we can demonstrate some of the advantages that come from lazy evaluation. I have placed these in `./data/taxi/`, so you may need to change the paths in the code if you put them somewhere else.

### Reading and writing data

One can read/write data in Polars using methods named much like their Pandas counterparts.

The available methods for reading include:

* `pl.read_csv`
* `pl.read_excel`
* `pl.read_json`
* `pl.read_parquet`

Similarly, the methods for writing include:

* `df.write_csv`
* `df.write_excel`
* `df.write_json`
* `df.write_parquet`

In [None]:
taxi_data = pl.read_parquet("data/taxi/yellow_tripdata_2024-06.parquet")

In [None]:
taxi_data.head(10)

In [None]:
taxi_data.tail(3)

In [None]:
taxi_data.write_csv("test.csv")

### Selecting/combining columns

Just as in Pandas, we can select different columns of data or combine columns to create a new one.

The way that we do this feels a bit different from Pandas, however.

If we want to select a certain group of columns and combine them, we can use the `select` method and the `pl.col` function.

In [None]:
taxi_data.select(
    # `.alias` allows us to assign a new column name to a column
    pl.col("tpep_pickup_datetime").alias("dt"),
    # We can also select columns based on name and leave it the same
    pl.col("trip_distance"),
    pl.col("fare_amount"),
    pl.col("mta_tax"),
    pl.col("tolls_amount"),
    pl.col("tip_amount"),
    pl.col("total_amount"),
    pl.col("payment_type"),
    # Here we combine two columns and then alias them to a new name
    (pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")).alias("duration")
).filter(
    pl.col("duration") > dt.timedelta(hours=12)
)

`df.with_columns` acts similarly to `df.select` except that it automatically keeps all existing columns.

In [None]:
taxi_data.with_columns(
    pl.col("tpep_pickup_datetime").dt.date().alias("trip_day")
)

### Filtering data

Just as we can select certain rows of data in Pandas, we can select subsets of data in Polars using the `df.filter` method.

Again, note that this uses the `pl.col` function to create expressions that the `filter` method then interprets (just like `select` and `with_columns`).

In [None]:
taxi_data.filter(
    # Between two dates
    pl.col("tpep_pickup_datetime").is_between(dt.date(2024, 6, 1), dt.date(2024, 6, 7)),
    # Only rides between 5 and 25 miles
    pl.col("trip_distance").is_between(5, 25),
    # Only credit card payments
    pl.col("payment_type") == 1
)

### Group by methods

We can do the typical group by -> aggregation style of operation that we did in Pandas.

In [None]:
# Count number of observations
(
    taxi_data.select(
        pl.all().exclude("tpep_dropoff_datetime")
    )
    .group_by(
        pl.col("tpep_pickup_datetime").dt.date()
    )
    .agg(
        pl.col("VendorID").count().alias("number_of_rides")
    )
    .sort(
        pl.col("tpep_pickup_datetime")
    )
)

We can also do "window functions" which allow us to apply an operation within a group.

For example, below let's put taxi rides within a day in order and label them 1, 2, 3, 4, ...

In [None]:
(
    taxi_data
    .select(
        pl.all(),
        (
            pl.col("tpep_dropoff_datetime")
            .rank("dense", descending=False)
            .over(pl.col("tpep_pickup_datetime").dt.date())
            .alias("ride_number")
        )
    )
    .filter(
        pl.col("tpep_pickup_datetime").is_between(dt.date(2024, 6, 1), dt.date(2024, 6, 2))
    )
    .sort("tpep_dropoff_datetime")
)

### Joining data

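Joining data works in much the same way as it does in Pandas: we call `join` on one DataFrame, pass the other, and specify the key columns with `on` and the join type with `how`.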

In [None]:
df1 = (
    taxi_data.select(
        pl.col("tpep_pickup_datetime"),
        pl.col("tpep_dropoff_datetime"),
        pl.col("passenger_count"),
        pl.col("trip_distance"),
        pl.col("PULocationID"),
        pl.col("DOLocationID")
    )
)

df2 = (
    taxi_data.select(
        pl.col("tpep_pickup_datetime"),
        pl.col("tpep_dropoff_datetime"),
        pl.col("passenger_count"),
        pl.col("trip_distance"),
        pl.col("total_amount")
    )
)

In [None]:
df1.join(
    df2,
    on=["tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count", "trip_distance"],
    how="left"
)

We can also stack datasets together vertically with `pl.concat`.

In [None]:
taxi_data_may = pl.read_parquet("data/taxi/yellow_tripdata_2024-05.parquet")

In [None]:
pl.concat([taxi_data, taxi_data_may], how="vertical")

## Lazy evaluation

Everything that we have done so far has used Polars' "eager" mode of operation.

Eager mode is similar to how Pandas operates (but more performant). It takes each operation in order and performs it step by step.

Lazy mode defers all execution until after all steps have been specified.

Why would lazy evaluation be something that we want? It allows Polars to construct a "graph" of all operations that are going to be done and then Polars is able to optimize within that graph by choosing the "optimal"(ish) order for the calculations to happen! This empowers Polars to be even faster than it already was.

### Lazy example

Let's load the data, count the rides on each calendar day, and then calculate the average number of rides per day for Fridays, Saturdays, and Sundays.

In [None]:
q = (
    pl.scan_parquet("data/taxi/yellow_tripdata_2024-06.parquet")
    .with_columns(
        pl.col("tpep_pickup_datetime").dt.date().alias("trip_day"),
        pl.col("tpep_pickup_datetime").dt.weekday().alias("day_of_week")
    )
    # First count the rides on each calendar day ...
    .group_by(
        pl.col("trip_day"), pl.col("day_of_week")
    )
    .agg(
        pl.col("tpep_dropoff_datetime").count().alias("number_of_rides")
    )
    # ... then average those daily counts within each day of the week
    .group_by(
        pl.col("day_of_week")
    )
    .agg(
        pl.col("number_of_rides").mean()
    )
    # Friday/Saturday/Sunday (Polars numbers Monday as 1 and Sunday as 7)
    .filter(
        pl.col("day_of_week").is_between(5, 7)
    )
    .sort(
        pl.col("day_of_week")
    )
)

In [None]:
q

In [None]:
q.show_graph()

In [None]:
q.collect()

### If lazy is so great, why don't we always do lazy calculations?

Polars docs recommend

> In general, the lazy API should be preferred unless you are either interested in the intermediate results or are doing exploratory work and don't know yet what your query is going to look like.

In [None]:
print(q.explain())

In [None]:
import pandas as pd
import polars as pl

In [None]:
import datetime as dt

may_1 = dt.date(2024, 5, 1)
july_1 = dt.date(2024, 7, 1)

In [None]:
%%time

taxi_data_may = pd.read_parquet("data/taxi/yellow_tripdata_2024-05.parquet")
taxi_data_june = pd.read_parquet("data/taxi/yellow_tripdata_2024-06.parquet")

taxi_data = pd.concat(
    [taxi_data_may, taxi_data_june], ignore_index=True, axis=0
)

taxi_data["dt"] = taxi_data["tpep_pickup_datetime"].dt.date


result = (
    taxi_data
    .query("dt >= @may_1 & dt < @july_1")
    .groupby("dt")
    ["trip_distance"]
    .mean()
)

In [None]:
result.head()

In [None]:
%%time

taxi_data_may = pl.read_parquet("data/taxi/yellow_tripdata_2024-05.parquet")
taxi_data_june = pl.read_parquet("data/taxi/yellow_tripdata_2024-06.parquet")

taxi_data = pl.concat(
    [taxi_data_may, taxi_data_june], how="vertical"
)

result = (
    taxi_data
    .with_columns(
        pl.col("tpep_pickup_datetime").dt.date().alias("dt")
    )
    .filter(
        # closed="left" excludes July 1, matching `dt < @july_1` in the Pandas version
        pl.col("dt").is_between(may_1, july_1, closed="left")
    )
    .group_by(pl.col("dt"))
    .agg(pl.col("trip_distance").mean())
    .sort("dt")
)


In [None]:
result.head()

In [None]:
%%time

q = (
    pl.concat(
        [
            pl.scan_parquet("data/taxi/yellow_tripdata_2024-05.parquet"),
            pl.scan_parquet("data/taxi/yellow_tripdata_2024-06.parquet")
        ], how="vertical"
    )
    .with_columns(
        pl.col("tpep_pickup_datetime").dt.date().alias("dt")
    )
    .filter(
        # closed="left" excludes July 1, matching `dt < @july_1` in the Pandas version
        pl.col("dt").is_between(may_1, july_1, closed="left")
    )
    .group_by(pl.col("dt"))
    .agg(pl.col("trip_distance").mean())
    .sort("dt")
)

result = q.collect()


In [None]:
result.head()

In [None]:
result.to_pandas()