# Session 03: Intro to polars

In this notebook we introduce polars, a fast DataFrame library that is similar to pandas but implemented in Rust. We will cover the following topics:

- Creating a DataFrame
- Reading and writing data
- Selecting and filtering data
- Grouping and aggregating data

## Installing polars

You can install polars using pip:

```bash
pip install polars
```

In [1]:
!pip install polars

Defaulting to user installation because normal site-packages is not writeable
Collecting polars
  Downloading polars-1.8.2-cp38-abi3-win_amd64.whl.metadata (14 kB)
Downloading polars-1.8.2-cp38-abi3-win_amd64.whl (32.4 MB)

## Polars basic components

Just like pandas, polars has two main components: `Series` and `DataFrame`. A `Series` is a single column of data, while a `DataFrame` is a collection of `Series` forming a table.

Most of the main pandas commands have a close counterpart in polars:

- `pd.read_csv` -> `pl.read_csv`
- `pd.DataFrame` -> `pl.DataFrame`
- `pd.Series` -> `pl.Series`
- `df.groupby` -> `df.group_by`
- `df.join` -> `df.join`
- `df[mask]` / `df.query` -> `df.filter`
- ...

## Creating a DataFrame

Let's start by creating a DataFrame. We can create a DataFrame from a dictionary or a list of tuples, in a very similar way to pandas.

```python
import polars as pl

data = {
    "column1": [1, 2, 3, 4],
    "column2": [5, 6, 7, 8],
}

df = pl.DataFrame(data)
```

In [2]:
import polars as pl

data = {
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
}

df = pl.DataFrame(data)

df

A,B
i64,i64
1,5
2,4
3,3
4,2
5,1


The first difference we can see is that polars DataFrames have no index: they are just a collection of columns.

Let's see the usual commands to inspect the data:

In [3]:
df.head()

A,B
i64,i64
1,5
2,4
3,3
4,2
5,1


In [4]:
df.tail()

A,B
i64,i64
1,5
2,4
3,3
4,2
5,1


In [5]:
df.describe()

statistic,A,B
str,f64,f64
"""count""",5.0,5.0
"""null_count""",0.0,0.0
"""mean""",3.0,3.0
"""std""",1.581139,1.581139
"""min""",1.0,1.0
"""25%""",2.0,2.0
"""50%""",3.0,3.0
"""75%""",4.0,4.0
"""max""",5.0,5.0


In [6]:
# something new: `schema` returns the column names and data types of the DataFrame

df.schema

Schema([('A', Int64), ('B', Int64)])

Examples of column slices in polars:

* `df["column1"]` -> Select a single column
* `df[["column1", "column2"]]` -> Select multiple columns

To select rows by condition, we can use the `filter` method (covered later).

In [7]:
df["A"]

A
i64
1
2
3
4
5


## Polars expressions and context

In polars, an expression is a lazy representation of a data transformation: it describes what to compute, but nothing is computed until the expression is evaluated in a context. This gives polars room to optimize the work, which can make it more efficient than pandas.

Let's create a dataframe to see how expressions work:

In [8]:
from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘


In [9]:
# creating an expression: calculating the BMI

bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
bmi_expr

As you can see, evaluating the expression on its own returns only a description of the operation, not the result. This is because polars expressions are lazy: they are only computed when they are run inside a context.

If we want this operation to be executed, we need to provide a context. Depending on the context, the results might be different.

There are more contexts available, but these are the most common ones:

* `select`
* `with_columns`
* `filter`
* `group_by`


### `select` context

The `select` context is used to apply expressions over columns. It also allows us to extract columns from the DataFrame.


In [10]:
# select

result = df.select(
    [
        pl.col("name"),
        pl.col("birthdate"),
        pl.col("weight"),
        pl.col("height"),
        bmi_expr.alias("bmi"), # apply the expression under the alias "bmi"
        bmi_expr.mean().alias("mean_bmi"), # apply the mean of the expression under the alias "mean_bmi"
    ]
)

print(result)

shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ mean_bmi  │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┘


In [11]:
# another example
result = df.select(deviation=(bmi_expr - bmi_expr.mean()) / bmi_expr.std())
print(result)

shape: (4, 1)
┌───────────┐
│ deviation │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.115645  │
│ -0.097471 │
│ -1.22912  │
│ 1.210946  │
└───────────┘


### `with_columns` context

The `with_columns` context is used to add new columns to the DataFrame, and it's similar to the `select` context.

The main difference is that `with_columns` returns a new DataFrame containing all the columns of the original DataFrame plus the new columns defined by its input expressions, whereas `select` returns only the columns produced by its input expressions.

In [12]:
result = df.with_columns(
    bmi=bmi_expr,
    avg_bmi=bmi_expr.mean(),
    ideal_max_bmi=25,
)
print(result)

shape: (4, 7)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘


### `filter` context

The `filter` context keeps only the rows of the DataFrame that satisfy a condition. Where `select` picks columns, `filter` picks rows, and the condition can combine several columns.

Rows where the filter does not evaluate to True are discarded, including nulls: a comparison involving a null value evaluates to null, and such rows are dropped just like rows that evaluate to false.

In [13]:
df.filter(pl.col("height") > 1.6)

name,birthdate,weight,height
str,date,f64,f64
"""Ben Brown""",1985-02-15,72.5,1.77
"""Chloe Cooper""",1983-03-22,53.6,1.65
"""Daniel Donovan""",1981-04-30,83.1,1.75


In [14]:
# multiple conditions combined with AND (&)
df.filter((pl.col("height") > 1.6) & (pl.col("weight") < 60))

name,birthdate,weight,height
str,date,f64,f64
"""Chloe Cooper""",1983-03-22,53.6,1.65


In [15]:
# multiple conditions combined with OR (|)
df.filter((pl.col("height") > 1.6) | (pl.col("weight") < 60))

name,birthdate,weight,height
str,date,f64,f64
"""Alice Archer""",1997-01-10,57.9,1.56
"""Ben Brown""",1985-02-15,72.5,1.77
"""Chloe Cooper""",1983-03-22,53.6,1.65
"""Daniel Donovan""",1981-04-30,83.1,1.75


In [16]:
df.filter(~pl.col("name").is_in(["Alice Archer", "Ben Brown"]))

name,birthdate,weight,height
str,date,f64,f64
"""Chloe Cooper""",1983-03-22,53.6,1.65
"""Daniel Donovan""",1981-04-30,83.1,1.75


In [17]:
df.filter(name="Chloe Cooper")

name,birthdate,weight,height
str,date,f64,f64
"""Chloe Cooper""",1983-03-22,53.6,1.65


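### `group_by` context

The `group_by` context is used to group rows by one or more columns. It's similar to the `groupby` method in pandas, but the aggregations are written as expressions inside `agg`, so different columns can be aggregated in different ways in a single call.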

In [18]:
df = pl.DataFrame({
    "category_1": ["A", "B", "A", "B", "A"],
    "category_2": ["X", "X", "Y", "Z", "X"],
    "B": [5, 4, 3, 2, 1],
    "C": [10, 20, 30, 40, 50],
})

df

category_1,category_2,B,C
str,str,i64,i64
"""A""","""X""",5,10
"""B""","""X""",4,20
"""A""","""Y""",3,30
"""B""","""Z""",2,40
"""A""","""X""",1,50


In [19]:
df.group_by("category_1").agg(pl.col("B").sum())

category_1,B
str,i64
"""A""",9
"""B""",6


In [20]:
# group by multiple columns and aggregate multiple columns using different aggregations
(
    df
    .group_by(["category_1", "category_2"])
    .agg([
        pl.col("B").sum(),
        pl.col("C").mean()
    ])
)

category_1,category_2,B,C
str,str,i64,f64
"""B""","""X""",4,20.0
"""A""","""X""",6,30.0
"""A""","""Y""",3,30.0
"""B""","""Z""",2,40.0


## Parquet files

Parquet is a columnar storage format that is widely used in the big data ecosystem. It's a very efficient format for reading and writing data, and it's supported by many tools. When using Polars with large datasets, it's a good idea to use Parquet files to store your data.

In [21]:
import pandas as pd

In [22]:
%%time

# read a csv file with pandas
df_from_csv_with_pandas = pd.read_csv("../data/indicators.csv")

CPU times: user 14.2 ms, sys: 2.52 ms, total: 16.7 ms
Wall time: 17.7 ms


In [23]:
%%time

# read a csv file with polars
df_from_csv_with_polars = pl.read_csv("../data/indicators.csv")

CPU times: user 9.74 ms, sys: 10.8 ms, total: 20.5 ms
Wall time: 32.5 ms


In [24]:
%%time

# read a parquet file with pandas
df_from_parquet_with_pandas = pd.read_parquet("../data/indicators.parquet")

CPU times: user 16 ms, sys: 12.8 ms, total: 28.8 ms
Wall time: 81.2 ms


In [25]:
%%time

# read a parquet file with polars
df_from_parquet_with_polars = pl.read_parquet("../data/indicators.parquet")

CPU times: user 1.52 ms, sys: 4.05 ms, total: 5.57 ms
Wall time: 15.5 ms


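As you can see, reading the Parquet file with polars was the fastest option in this run (about 15.5 ms wall time), ahead of pandas reading the CSV (17.7 ms), polars reading the CSV (32.5 ms), and pandas reading the Parquet file (81.2 ms). These are single measurements on a small file, so the differences are noisy; the advantage of Parquet tends to become clearer as files grow.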

You can write DataFrames to Parquet files using the `write_parquet` method (the pandas equivalent is `to_parquet`):

```python
df.write_parquet("data.parquet")
```

In [26]:
df = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
})

df.write_parquet("example.parquet")