### Introduction to Polars for Data Processing

Polars is a high-performance DataFrame library implemented in Rust with Python bindings. Created by Ritchie Vink in 2020, it is designed from the ground up for speed, building on the Apache Arrow memory format, parallel execution, and query optimization.


#### Architecture: Why Polars is Fast

1. Apache Arrow Memory Format: Columnar, cache-friendly layout with zero-copy reads
2. Written in Rust: Memory-safe systems language, so the engine is not constrained by Python's GIL (Global Interpreter Lock)
3. SIMD Vectorization: Single Instruction Multiple Data operations on modern CPUs
4. Parallel Execution: Automatically multi-threaded using Rayon (work-stealing scheduler)
5. Query Optimization: The lazy API builds an execution plan and applies predicate pushdown and projection pushdown

In [1]:
import polars as pl
import numpy as np

# Polars excels at large-scale operations
df = pl.DataFrame({
    'id': range(10_000_000),
    'value': np.random.randn(10_000_000)
})

# Automatic parallelization across all CPU cores
result = df.filter(pl.col('value') > 0).group_by('id').agg(pl.col('value').sum())
# This runs in parallel without any explicit configuration!
result

id,value
i64,f64
3,1.141362
4,0.663457
5,0.975381
6,0.118117
7,1.860039
…,…
9999993,0.516047
9999996,0.518794
9999997,2.008294
9999998,0.701666


#### Eager vs Lazy Execution: A Critical Distinction

##### Eager Mode (Immediate Execution)

Use for:
- Small datasets (< 1GB)
- Interactive exploration
- When you need immediate results

In [2]:
# Operations execute immediately
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
result = df.filter(pl.col('a') > 1)  # Executes now
result

a,b
i64,i64
2,5
3,6


##### Lazy Mode (Query Optimization)

In [3]:
# Build a query plan without executing
lazy_df = pl.scan_csv("data/employees.csv")

# Chain operations - nothing executes yet
query = (
    lazy_df
    .filter(pl.col('salary') > 50000)
    .select(['name', 'department', 'salary'])
    .group_by('department')
    .agg(pl.col('salary').mean())
)

# View optimized query plan
print(query.explain())

AGGREGATE[maintain_order: false]
  [col("salary").mean()] BY [col("department")]
  FROM
  Csv SCAN [data/employees.csv]
  PROJECT 2/6 COLUMNS
  SELECTION: [(col("salary")) > (50000)]
  ESTIMATED ROWS: 100


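The plan shows:
- Predicate pushdown (the filter is applied during the CSV read)
- Projection pushdown (only the needed columns are read, here 2 of 6)
- The aggregation step, where `maintain_order: false` leaves Polars free to return groups in any order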

In [4]:
# Execute with optimization
result = query.collect()
result

department,salary
str,f64
"""HR""",107561.619048
"""Engineering""",100800.96
,83998.142857
"""Marketing""",97095.956522
"""Sales""",89187.166667


#### Creating DataFrames

The simplest way to create a DataFrame is from a dictionary that maps column names to lists of values. Pass it to `pl.DataFrame` and Polars infers the data type of each column.

In [5]:
data = {
    "name": ["Alice", "Bob", "Charlie", "Jane"],
    "age": [25, 30, 35, 45],
    "city": ["NY", "LA", "SF", "TX"]
}

df = pl.DataFrame(data)
df

name,age,city
str,i64,str
"""Alice""",25,"""NY"""
"""Bob""",30,"""LA"""
"""Charlie""",35,"""SF"""
"""Jane""",45,"""TX"""


One useful way to ensure type safety, and to make sure downstream operations behave as expected, is to set the schema explicitly and declare the data type of each column. Below is an example of how to achieve this.

In [6]:
# Explicit schema (type safety)
df = pl.DataFrame(
    data,
    schema={
        'name': pl.Utf8,
        'age': pl.Int32,  # More memory-efficient than Int64
        'city': pl.Categorical  # Automatic string interning
    }
)

df

name,age,city
str,i32,cat
"""Alice""",25,"""NY"""
"""Bob""",30,"""LA"""
"""Charlie""",35,"""SF"""
"""Jane""",45,"""TX"""


##### Creating a DataFrame from Pandas

Since pandas came first, most people will be familiar with it. Polars offers a simple API for converting pandas DataFrames to Polars DataFrames and back.

In [7]:
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 
                    'b': [3, 1, 2, 4, 5, 3]})

# Convert to Polars
df = pl.from_pandas(pdf)
df

a,b
i64,i64
1,3
2,1
3,2
4,4
5,5
6,3


##### Reading Data: Polars' Strengths

There are two main ways to read data: eagerly and lazily. Let's begin with the eager reader, `read_csv()`.

In [8]:
# reading employees.csv
df = pl.read_csv("data/employees.csv")
df.head()

id,name,age,department,salary,join_date
i64,str,f64,str,i64,str
1,"""User_1""",,,123355,"""2020-01-01"""
2,"""User_2""",28.0,"""Sales""",118399,"""2020-01-02"""
3,"""User_3""",29.0,"""Sales""",88727,"""2020-01-03"""
4,"""User_4""",48.0,"""Engineering""",71572,"""2020-01-04"""
5,"""User_5""",22.0,"""Engineering""",81849,"""2020-01-05"""


##### Lazy Reading

Alternatively, we can load data lazily, which is particularly useful for large datasets. `scan_csv()` does not read the file right away; it returns a LazyFrame that records how the data should be read and processed, and the work happens only when the query is collected.

In [9]:
# Lazy scan (recommended for large files)
df = pl.scan_csv("data/employees.csv")
df

In [10]:
# Lazy benefits:
result = (
    df
    .filter(pl.col('salary') > 50000)  # Pushed down to file scan
    .select(['name', 'salary'])         # Only these columns read
    .collect()                          # Execute optimized plan
)
result

name,salary
str,i64
"""User_1""",123355
"""User_2""",118399
"""User_3""",88727
"""User_4""",71572
"""User_5""",81849
…,…
"""User_96""",62415
"""User_97""",89151
"""User_98""",122927
"""User_99""",106714


##### Reading JSON Datasets

We often work with web data in JSON format, and Polars offers a JSON API for reading JSON files. Below is an example of reading a JSON file.

In [11]:
# Flat JSON
df_json = pl.read_json("../data/employees.json")

df_json.head()

id,name,age,department,salary,join_date
i64,str,f64,str,i64,i64
1,"""User_1""",,,123355,1577836800000
2,"""User_2""",28.0,"""Sales""",118399,1577923200000
3,"""User_3""",29.0,"""Sales""",88727,1578009600000
4,"""User_4""",48.0,"""Engineering""",71572,1578096000000
5,"""User_5""",22.0,"""Engineering""",81849,1578182400000


In the next section, we will go over expressions, which are the power tools for processing data.