In [1]:
import polars as pl, pandas as pd
import seaborn as sns

In [2]:
# Path to a sample dataset. This dataset has a good variety of numeric / categorical columns. 
dataset_path = "../datasets/taxi_data.csv"

# Basics

Going through the very basics of `polars`. This assumes some familiarity with `pandas` syntax.

[Reference](https://pola-rs.github.io/polars/user-guide/concepts/contexts/#select)

[Full API Reference](https://pola-rs.github.io/polars/py-polars/html/reference/index.html)

## Lazy / Eager API

This is a brief summary of the two main operation modes of `polars`: Lazy and Eager.

In short, the _Lazy_ API is the preferred mode of `polars` for most use cases. Execution of the query isn't carried out line-by-line (as in the _Eager_ mode); the query is only executed once its result is requested.

This has several benefits:

1. Allowing automatic query optimization
2. The ability to work with larger-than-memory datasets using streaming
3. Catching schema errors before data processing

Find out more [here](https://pola-rs.github.io/polars/user-guide/lazy/using/)

### Eager API
Example use case with the eager API:

In [3]:
df = pl.read_csv(dataset_path)
# Keep only trips with at most one passenger
df = df.filter(pl.col('passengers') <= 1)
# Get the total fares by payment
df = df.group_by("payment").agg(pl.col("fare").sum())

In [4]:
df

| payment | fare |
| --- | --- |
| str | f64 |
| null | 444.5 |
| "credit card" | 46939.87 |
| "cash" | 14825.5 |


Observe that every step was executed immediately and returned an intermediate result. This is wasteful: `read_csv` loaded every row and column into memory, although the query only needs three columns and a subset of the rows.

### Lazy API

Recreating the same example, but with the _Lazy_ API.

In [5]:
q = (
    # scan_csv does not read the entire dataset into memory (unlike read_csv)
    pl.scan_csv(dataset_path)
    .filter(pl.col('passengers')<=1)
    .group_by('payment')
    .agg(pl.col("fare").sum())
)
df = q.collect()

For large datasets this can significantly lower the load on memory & CPU, allowing bigger datasets to be processed at greater speeds.

Crucially, defining the query does not run it; we call `.collect()` to execute it.

## Contexts

A __context__ refers to the context (haha) in which an expression is evaluated. There are three main contexts:

1. Selection - `df.select([...])`, `df.with_columns([...])`
2. Filtering - `df.filter()`
3. Group By / Aggregation - `df.group_by(...).agg(...)`

### Selection

In this context, expressions are applied over columns. Each expression must produce a `Series`, and all results must either have the same length or have a length of 1, in which case the value is broadcast to match the height of the `DataFrame`.

In [6]:
q = (
    pl.scan_csv(dataset_path)
    .select(
        # Sum the fares. Results in ONE value -> broadcast
        pl.sum('fare')
        # Sort a column of values and change the column name
        , pl.col('pickup_borough').sort(nulls_last=True).alias('borough_pickup')
        # Aggregate a column to one value and do an operation on it -> broadcast
        , (pl.mean('distance')*10).alias('mean_distance')
    )
)
df = q.collect()
df.head(5)

| fare | borough_pickup | mean_distance |
| --- | --- | --- |
| f64 | str | f64 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |


Within the `select` context, any manipulation of a `Series` can be performed so long as it meets either one of the aforementioned conditions.


### Filtering & Group By

In the `filter` context you filter the existing dataframe based on any expression that evaluates to the Boolean data type.

The `group_by` context is slightly more complex. The data is first grouped by one or more specified columns, then a chained `.agg()` call does the aggregation, producing a single result for each group.

In [8]:
tdf = pd.read_csv(dataset_path)
tdf.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


In [16]:
q = (
    pl.scan_csv(dataset_path)
    .group_by('payment').agg(
        pl.sum('tip')
        , pl.col('dropoff_zone')
        , pl.col('distance')
    )
)
df = q.collect()
df

| payment | tip | dropoff_zone | distance |
| --- | --- | --- | --- |
| str | f64 | list[str] | list[f64] |
| "credit card" | 12732.32 | ["UN/Turtle Bay South", "West Village", … "Windsor Terrace"] | [1.6, 1.37, … 3.85] |
| null | 0.0 | ["Flatiron", "Columbia Street", … "Long Island City/Hunters Point"] | [1.4, 1.3, … 0.7] |
| "cash" | 0.0 | ["Upper West Side South", "Astoria", … "Bushwick North"] | [0.79, 3.9, … 4.14] |


All expressions are applied to the groups defined by the `group_by` context. Simply selecting a column in the `agg()` portion, without aggregating it, results in a list of that column's values for each group, because `pl.col()` returns the column's `Series` restricted to the group.

## Expressions

Expressions in `polars` __are a mapping from a series to a series__ (or mathematically `Fn(Series) -> Series`). As an expression takes a Series as input and returns a Series as output, it is straightforward to apply a sequence of expressions (similar to method chaining in `pandas`).

Essentially, any `polars` method call that transforms a series into another series is an expression. Every expression produces a new expression, and __expressions can be piped together__. 

In [22]:
q = (
    pl.scan_csv(dataset_path)
    .select(
        # Two expressions chained together: sort, then head
        pl.col('pickup').sort().head(2)
        # Another chain: filter, then sum. This also exercises the select context:
        # the single summed value is broadcast to match the two rows of the first column
        , pl.col('fare').filter(pl.col('pickup')=='2019-02-28 23:29:03').sum().alias("dummy")
    )
)
df = q.collect()
print(df)

shape: (2, 2)
┌─────────────────────┬───────┐
│ pickup              ┆ dummy │
│ ---                 ┆ ---   │
│ str                 ┆ f64   │
╞═════════════════════╪═══════╡
│ 2019-02-28 23:29:03 ┆ 5.0   │
│ 2019-03-01 00:03:29 ┆ 5.0   │
└─────────────────────┴───────┘


Separate expressions within a context are independent of one another, so `polars` can evaluate them in parallel (they are embarrassingly parallel). Note that within a single expression there may be further parallelization going on.