## Polars - Getting to know the Syntax

Polars is a lightning-fast DataFrame library. The [key features](https://docs.pola.rs/) of Polars are:

**Fast and Accessible**: Written from scratch in Rust, designed close to the machine and without external dependencies. It also has Python and R bindings!

**I/O**: First class support for all common data storage layers: local, cloud storage & databases.

**Handle Datasets** larger than RAM

**Intuitive API**: Write your queries the way they were intended. Internally, Polars determines the most efficient way to execute them using its query optimizer.

**Out of Core**: The streaming API allows you to process your results without requiring all your data to be in memory at the same time.

**Parallel**: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.


The philosophy of Polars is to provide a DataFrame library that utilises the available cores, has an intuitive API and is performant. To that end it adheres to a strict schema: data types should be known before the query runs.


Polars provides a domain-specific language that is designed to be human-readable while performing common data manipulation tasks. It describes its operations with [Contexts](https://docs.pola.rs/user-guide/concepts/contexts/) and [Expressions](https://docs.pola.rs/user-guide/concepts/expressions/). A context is the setting in which an expression is evaluated, e.g. selecting, filtering and group-by aggregation. The main contexts are:


`select` : select columns

`filter` : filter rows

`with_columns` : create / do something with columns

`group_by` : group by a factor and follow with

`agg` : aggregation
    

In [59]:
import polars as pl
from datetime import datetime

# Polars supports both DataFrames and Series
df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "date": [
            datetime(2025, 1, 1),
            datetime(2025, 1, 2),
            datetime(2025, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0],
        "string": ["a", "b", "c"],
    }
)
# All values in a Series must share the same data type
sr = pl.Series([1, 2, 3, 4, 500])

# The data type can be specified explicitly, which can improve performance
sr = pl.Series([1, 2, 3, 4, 500], dtype=pl.Int64)


df2 = pl.DataFrame(
    {
        "x": range(8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)

In [60]:
# To inspect the data you can use head(), tail() and print().
# Use sample() to select random rows and describe() for summary statistics.
df.describe()
df.head(3)

| integer | date | float | string |
| --- | --- | --- | --- |
| i64 | datetime[μs] | f64 | str |
| 1 | 2025-01-01 00:00:00 | 4.0 | "a" |
| 2 | 2025-01-02 00:00:00 | 5.0 | "b" |
| 3 | 2025-01-03 00:00:00 | 6.0 | "c" |


## Expressions 

Polars has a powerful concept called expressions that is central to its very fast performance.

Expressions are at the core of many data science operations. Essentially, they represent data transformations within the above contexts (selecting, filtering, aggregating). Common examples include taking a sample of rows from a column, multiplying the values in a column, and extracting a column of years from dates.

The important thing about expressions in Polars, and the reason they are central to its very fast performance, is that Polars performs:

1. automatic query **optimization** on each expression

2. automatic **parallelization** of expressions on many columns

For more information and examples, see the [expressions](https://docs.pola.rs/user-guide/expressions/) section of the user guide.


## Learning the Syntax
Basic examples of expressions and contexts are below.

In [61]:

# Selecting a column
df.select(pl.col("float"))

# Selecting multiple columns (use "*" or pl.all() for all columns)
df.select(pl.col("date", "string"))

# Filtering rows
df.filter(pl.col("integer") >= 2)

# Filtering rows with multiple conditions (| is "or", & is "and")
df.filter((pl.col("integer") >= 2) & (pl.col("float") == 5.0))

# Creating a column and naming it
df.with_columns((pl.col("integer") + 3).alias("new_column"))

# Aggregating after a group_by
df2.group_by("y").agg(
    pl.col("x").sum().alias("sum"),
    pl.col("x").count().alias("count"),
)


| y | sum | count |
| --- | --- | --- |
| str | i64 | u32 |
| "X" | 13 | 2 |
| "C" | 5 | 1 |
| "A" | 3 | 3 |
| "B" | 7 | 2 |


In [62]:
# Contexts and expressions can be chained for compactness
df3 = (
    df.with_columns((pl.col("float") * pl.col("integer")).alias("product"))
    .select(pl.all().exclude("integer"))
)
                




Polars supports the traditional join strategies through the `join()` method on a DataFrame. The `how` argument selects the join type. See [here](https://docs.pola.rs/user-guide/transformations/joins/) for examples.



In [63]:
# Tables can be joined in the usual way. The join type is set via the "how" parameter, with "inner" as the default.

joined = df.join(df2, left_on="integer", right_on="x")

### Data types

Under the hood, Polars uses Arrow data types and memory arrays, and it supports String, Numeric, Nested, Temporal and other types. Most data types follow the Arrow [implementation](https://docs.pola.rs/user-guide/concepts/data-types/overview/), with the exception of the String, Categorical and Object types.

Categorical data represents string data whose values come from a finite set. For performance reasons it is implemented differently from plain strings. Polars supports both the **Enum** data type, where the categories are known up front, and the more flexible **Categorical** data type, where the values are not known beforehand. Conversion between them is trivial. Relying on Polars to infer the categories with the Categorical type comes at a performance cost (the encoding is a dictionary-like object). See the [Categorical](https://docs.pola.rs/user-guide/concepts/data-types/categoricals/#categorical-data-type) page for more information.

In [64]:
# Use Enum where the categories are known up front
cat_types = pl.Enum(["polar", "panda", "teddy"])
animals = pl.Series(["polar", "polar", "teddy", "panda"], dtype=cat_types)
# Use Categorical otherwise
fictional_animals = pl.Series(["poobear", "minimouse", "teddy", "poobear"], dtype=pl.Categorical)
