In [15]:
import polars as pl, pandas as pd
import seaborn as sns

In [6]:
# Path to a sample dataset. This dataset has a good variety of numeric / categorical columns. 
dataset_path = "../datasets/taxi_data.csv"

# Basics

Going through the very basics of `polars`. This assumes some familiarity with `pandas` syntax.

[Reference](https://pola-rs.github.io/polars/user-guide/concepts/contexts/#select)

[Full API Reference](https://pola-rs.github.io/polars/py-polars/html/reference/index.html)

# Lazy / Eager API

This is a brief summary of the two main operating modes of `polars` (Lazy / Eager).

In short, `polars` offers both modes, and its user guide recommends _Lazy_ for most work. In _Lazy_ mode a query isn't carried out line-by-line (as in the _Eager_ mode) but is only executed once its result is _'needed'_.

This has several benefits:

1. Automatic query optimization
2. The ability to work with larger-than-memory datasets using streaming
3. Schema errors are caught before any data is processed

Find out more [here](https://pola-rs.github.io/polars/user-guide/lazy/using/)

## Eager API
Example use case with the eager API:

In [12]:
df = pl.read_csv(dataset_path)
# Filter out trips with more than one passenger
df = df.filter(pl.col('passengers')<=1)
# Get the total fares by payment
df = df.group_by("payment").agg(pl.col("fare").sum())

In [13]:
df

| payment | fare |
| --- | --- |
| str | f64 |
| "credit card" | 46939.87 |
| "cash" | 14825.5 |
| null | 444.5 |


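Observe that every step was executed immediately, producing an intermediate result each time. This is wasteful: `read_csv` loaded the entire file into memory, even though the query only needs three columns (`passengers`, `payment`, `fare`) and a subset of the rows.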

## Lazy API

Recreating the same example but with the _Lazy_ API.

In [14]:
q = (
    # SCAN does not read in the entire dataset into memory (unlike READ)
    pl.scan_csv(dataset_path)
    .filter(pl.col('passengers')<=1)
    .group_by('payment')
    .agg(pl.col("fare").sum())
)
df = q.collect()

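For large datasets this can significantly lower the load on memory & CPU, because the optimizer can apply the filter and read only the columns the query uses while scanning the file. This allows bigger datasets to be processed at greater speeds.

Crucially, nothing is executed while the query is being defined. Only when we call `.collect()` is the query run.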

## Contexts

A __context__ refers to the context (haha) in which an expression is evaluated. There are three main contexts:

1. Selection - `df.select([...])`, `df.with_columns([...])`
2. Filtering - `df.filter()`
3. Group By / Aggregation - `df.group_by(...).agg(...)`

### Selection

In this context, expressions are applied over columns. Each expression must produce a `Series`, and these must either all be of equal length or have a length of 1 (in which case the value is broadcast to match the height of the `DataFrame`).

In [28]:
q = (
    pl.scan_csv(dataset_path)
    .select(
        # Sum the fares. Results in ONE value -> Broadcasted
        pl.sum('fare') 
        # Sort a column of values and change the column name
        , pl.col('pickup_borough').sort(nulls_last=True).alias('borough_pickup')
        # Aggregate a column to produce one value and do an operation on it -> Broadcasted
        , (pl.mean('distance')*10).alias('mean_distance')
    )
)
df = q.collect()
df.head(5)

| fare | borough_pickup | mean_distance |
| --- | --- | --- |
| f64 | str | f64 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |
| 84214.87 | "Bronx" | 30.246168 |


Within the `select` context, any manipulation of a `Series` can be performed so long as it meets either one of the aforementioned conditions.
