## Polars - Getting to know the Syntax

Polars is a lightning-fast DataFrame library. The [key features](https://docs.pola.rs/) of Polars are:

**Fast and Accessible**: Written from scratch in Rust, designed close to the machine and without external dependencies. It also has Python and R bindings.

**I/O**: First class support for all common data storage layers: local, cloud storage & databases.

**Handle Datasets** larger than RAM

**Intuitive API**: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute using its query optimizer.

**Out of Core**: The streaming API allows you to process your results without requiring all your data to be in memory at the same time.

**Parallel**: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.


The philosophy of Polars is to provide a DataFrame library that utilises the available cores, has an intuitive API and is performant. To that end it adheres to a strict schema: data types should be known before running the query.


Polars has a domain-specific language that is designed to be human readable while performing common data manipulation tasks. It describes its operations with [Contexts](https://docs.pola.rs/user-guide/concepts/contexts/) and [Expressions](https://docs.pola.rs/user-guide/concepts/expressions/). A context is the setting in which an expression is evaluated, i.e. selecting, filtering and group-by aggregations. The main contexts are:


`select` : select columns

`filter` : filter rows

`with_columns` : create / do something with columns

`group_by` : group by a factor and follow with

`agg` : aggregation
    

In [1]:
import polars as pl
from datetime import datetime

# Data Frames and Series are supported by polars
df = pl.DataFrame(
    {
        "integer": [1, 2, 3,4],
        "date": [
            datetime(2025, 1, 1),
            datetime(2025, 1, 2),
            datetime(2025, 1, 3),
            datetime(2025, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0,12],
        "string": ["a", "b", "c","b"],
    }
)
# All values in a Series must share the same data type
sr = pl.Series([1, 2, 3, 4, 500])
# You can specify the data type explicitly for better performance
sr = pl.Series([1, 2, 3, 4, 500], dtype=pl.Int64)


df2 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)

In [2]:
# To inspect the data you can use head(), tail() and print(),
# sample() to select random rows, and describe() for summary statistics
df.describe()
df.head(3)

integer,date,float,string
i64,datetime[μs],f64,str
1,2025-01-01 00:00:00,4.0,"""a"""
2,2025-01-02 00:00:00,5.0,"""b"""
3,2025-01-03 00:00:00,6.0,"""c"""


## Expressions 

Polars has a powerful concept called expressions that is central to its very fast performance.

Expressions are at the core of many data science operations. Essentially they represent data transformations within the above contexts (selecting, filtering, aggregating). Common expressions are taking a sample of rows from a column, multiplying values in a column, extracting a column of years from dates ... and so on.

The important thing about expressions in Polars, and the reason they are central to its performance, is that Polars performs:

1. automatic query **optimization** on each expression

2. automatic **parallelization** of expressions on many columns

For more information and examples, see the [expressions](https://docs.pola.rs/user-guide/expressions/) section of the user guide.

**All expressions are run in parallel, meaning that separate Polars expressions are embarrassingly parallel. Note that within an expression there may be more parallelization going on.**

## Learning the Syntax
Basic examples of expressions and contexts are below.

In [3]:

df.select(pl.col("float")) # selecting a column

df.select(pl.col("date","string")) # selecting multiple columns 
#(use "*" or pl.all() for all columns)

df.filter(pl.col("integer") >= 2)  #filtering rows

df.filter((pl.col("integer") >=2) & 
          (pl.col("float") == 5.0)) #filtering rows with multiple conditions (| = or & = and)

df.with_columns((pl.col("integer") + 3).alias("new_column")) # creating column and naming it

df.group_by("string").agg(pl.col("integer").sum().alias("sum"),
                          pl.col("date").sort().first().alias("earliest"), 
                          pl.col("float") / pl.col("integer")) 
                                

string,sum,earliest,float
str,i64,datetime[μs],list[f64]
"""a""",1,2025-01-01 00:00:00,[4.0]
"""b""",6,2025-01-02 00:00:00,"[2.5, 3.0]"
"""c""",3,2025-01-03 00:00:00,[2.0]


There are many ways to [select columns](https://docs.pola.rs/user-guide/expressions/column-selections/).

In [4]:
# Can combine expressions for compactness
# Can use regular expressions and helper functions to select columns in different ways
import polars.selectors as cs
df3 = (
    df.with_columns((pl.col("float") * pl.col("integer")).alias("product"))
    .select(pl.all().exclude("integer"))
    .select(cs.contains("uct"), cs.string())
)
                
df3

product,string
f64,str
4.0,"""a"""
10.0,"""b"""
18.0,"""c"""
48.0,"""b"""


### Other info:


Polars supports traditional [data transformations](https://docs.pola.rs/user-guide/transformations/) such as join, concatenation, pivot and unpivot.

Polars provides many [functions for expressions](https://docs.pola.rs/api/python/stable/reference/expressions/functions.html/). Check out the [API documentation](https://docs.pola.rs/api/python/stable/reference/index.html) for more information.

### Data types

Under the hood, Polars uses Arrow data types and memory arrays, and offers support for String, Numeric, Nested, Temporal and other types. Most data types are specified by the Arrow [syntax](https://docs.pola.rs/user-guide/concepts/data-types/overview/), with the exception of String, Categorical and Object types.

Categorical data represents string data where the values in the column come from a finite set (although, for performance, it is implemented differently from strings). Polars supports both the **Enum** data type, where categories are known up front, and the more flexible **Categorical** data type, where values are not known beforehand. Conversion between them is trivial. Relying on Polars to infer the categories with Categorical types comes at a performance cost (as the encoding is a dictionary-like object). See the [Categorical](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/#categorical-data-type) page for more information.

In [5]:
# Use Enum where categories are known
cat_types = pl.Enum(["polar","panda","teddy"])
animals = pl.Series(["polar","polar","teddy","panda"],dtype= cat_types)
# Use Categorical otherwise
fictional_animals = pl.Series(["poobear","minimouse","teddy","poobear"],dtype= pl.Categorical)


## Lazy / Eager API

Polars supports two modes of operation: lazy and eager. In the eager API the query is executed immediately while in the lazy API the query is only evaluated once it is 'needed'. Deferring the execution to the last minute can have significant performance advantages.

Before we explore the two, run this code to fetch the iris dataset.

In [6]:
# fetch iris dataset and save as csv in local path
from ucimlrepo import fetch_ucirepo 
import pandas as pd
iris = fetch_ucirepo(id=53) 
X = iris.data.features 
y = iris.data.targets 
iris_df = pd.concat([X, y], axis=1)
iris_df.rename(columns={'class': 'species'}, inplace=True)
iris_df.to_csv("iris_data.csv", index=False)

In this example we use the eager API to:
1. Read the iris dataset.
2. Filter the dataset based on sepal length
3. Calculate the mean of the sepal width per species

Every step is executed immediately returning the intermediate results. This can be very wasteful as we might do work or load extra data that is not being used.


In [7]:
df = pl.read_csv("iris_data.csv")
df_small = df.filter(pl.col("sepal length") > 5)
df_agg = df_small.group_by("species").agg(pl.col("sepal width").mean())

If we instead used the lazy API and waited on execution until all the steps are defined then the query planner could perform various optimizations. In this case:

1. Predicate pushdown: Apply filters as early as possible while reading the dataset, thus only reading rows with sepal length greater than 5.

2. Projection pushdown: Select only the columns that are needed while reading the dataset, thus removing the need to load additional columns (e.g. petal length & petal width)


In [8]:
q = (
    pl.scan_csv("iris_data.csv")  # lazy scan: the file is not read in full before the other operations are planned
        .filter(pl.col("sepal length") > 5)
        .group_by("species").agg(pl.col("sepal width").mean())
)

q # a lazyframe

df_agg = q.collect() # inform polars that you want to execute the query

One additional benefit of the lazy API is that it allows queries to be executed in a streaming manner. Instead of processing the data all at once, Polars can execute the query in batches, allowing you to process datasets that are larger than memory. To tell Polars we want to execute a query in streaming mode, we pass the `streaming=True` argument to `collect` (newer Polars releases spell this `collect(engine="streaming")`).

When to use Lazy versus Eager:

In general, the lazy API should be preferred unless you are either interested in the intermediate results or are doing exploratory work and don't know yet what your query is going to look like.

### Typical pipeline Demo 

Given your new knowledge of Polars, here is an example of how to integrate it into a typical pipeline consisting of:

1. Data ingestion and manipulation.

2. Data visualization

3. Model Preparation - Training and Testing

4. Prediction 

In [9]:

# Do your aggregation and manipulation to clean the data,
# keeping the query "as lazy" as possible
q = (
    pl.scan_csv("iris_data.csv") # data manipulation stuff - trivial example
        .filter(pl.col("sepal length") > 2)
)

iris_df = q.collect()  # trigger the computation and store the result as a Polars DataFrame

## Ecosystem

On [this](https://docs.pola.rs/user-guide/ecosystem/#key-features) page you can find a non-exhaustive list of libraries and tools that support Polars. As the data ecosystem is evolving fast, more libraries will likely support Polars in the future. One of the main drivers is that Polars adheres to the Apache Arrow spec for its memory layout.





In [10]:
# Polars DataFrames can be used for machine learning

# Example preprocessing with Polars: Adding a new feature
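# Build on the iris_df collected above and use expressions rather than Series arithmetic
iris_df = iris_df.with_columns(
    (pl.col('sepal length') * pl.col('sepal width')).alias('sepal area')
)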

# Separate features and target for scikit-learn
X = iris_df.select(['sepal length', 'sepal width', 'petal length', 'petal width', 'sepal area'])
y = iris_df.select('species').to_series()


In [12]:
# Possible demo in  either using scikit-learn or an easier ML package to do classification
# Tom.M
iris_df

sepal length,sepal width,petal length,petal width,species,sepal area
f64,f64,f64,f64,str,f64
5.1,3.5,1.4,0.2,"""Iris-setosa""",17.85
4.9,3.0,1.4,0.2,"""Iris-setosa""",14.7
4.7,3.2,1.3,0.2,"""Iris-setosa""",15.04
4.6,3.1,1.5,0.2,"""Iris-setosa""",14.26
5.0,3.6,1.4,0.2,"""Iris-setosa""",18.0
…,…,…,…,…,…
6.7,3.0,5.2,2.3,"""Iris-virginica""",20.1
6.3,2.5,5.0,1.9,"""Iris-virginica""",15.75
6.5,3.0,5.2,2.0,"""Iris-virginica""",19.5
6.2,3.4,5.4,2.3,"""Iris-virginica""",21.08
