# Row filtering


In [None]:
import polars as pl

csv_file = 'Titanic.csv'

df = pl.read_csv(csv_file)

Polars doesn't have an explicit index. It does, however, have an implicit integer row number, so individual rows can be selected with the traditional square-bracket syntax.

In [None]:
# Single row

df[0]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""


In [None]:
# Specific rows

df[[2,3]]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


In [None]:
# Slicing

df[:2]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


Unlike pandas, Polars does not accept a list of booleans or a Series of boolean values inside [] for row filtering:

df[[True for _ in range(len(df))]] will raise an error.

### Use case of indexing with [] notation

Unlike in pandas, the filter method is the primary way to filter rows in Polars. The main use case for selecting rows with [] is inspecting data in interactive mode. The filter method is part of the Expression API, which is native to Polars. Using the Expression API is the most important step towards writing high-performance queries in Polars.

In [None]:
# An example of using the filter method on a DataFrame

df.filter(pl.col('Pclass') == 1).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
7,0,1,"""McCarthy, Mr. …","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S"""


In [None]:
# Another example of using the filter method

df.filter(pl.col('Parch') > 1).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
9,1,3,"""Johnson, Mrs. …","""female""",27.0,0,2,"""347742""",11.1333,,"""S"""
14,0,3,"""Andersson, Mr.…","""male""",39.0,1,5,"""347082""",31.275,,"""S"""
26,1,3,"""Asplund, Mrs. …","""female""",38.0,1,5,"""347077""",31.3875,,"""S"""


### Key differences between [] and filter

* [] indexing can only be used in eager mode, while filter can also be used in lazy mode
* filter expressions are optimized in lazy mode by the query optimizer

So the general rule of thumb is to use [] for data inspection in interactive mode and the filter method in all other cases.

In [None]:
# An example of applying the filter method with a condition based on the row number.
# Note: newer Polars releases deprecate with_row_count in favour of with_row_index.

df = df.with_row_count(name='row_nr')
df.head(3)

row_nr,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
2,3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


In [None]:
df.filter(pl.col('row_nr') < 4)

row_nr,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
2,3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
3,4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""


Filtering on multiple conditions can be implemented either by chaining calls to the filter method or by combining the conditions with the & (AND) operator in a single call.

**IMPORTANT!** In eager mode, chaining creates a new DataFrame after each filter call. It is better to combine multiple AND conditions in a single filter call using the & operator. As we will see below, the syntax is very similar to pandas.

In [None]:
# An example of chaining multiple AND conditions in separate filter calls (i.e. a poor implementation in eager mode)

dfFiltered = df.filter(pl.col('Pclass') == 1).filter(pl.col('Age') > 70)
dfFiltered.head(3)

row_nr,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
96,97,0,1,"""Goldschmidt, M…","""male""",71.0,0,0,"""PC 17754""",34.6542,"""A5""","""C"""
493,494,0,1,"""Artagaveytia, …","""male""",71.0,0,0,"""PC 17609""",49.5042,,"""C"""
630,631,1,1,"""Barkworth, Mr.…","""male""",80.0,0,0,"""27042""",30.0,"""A23""","""S"""


In [None]:
# An example of combining multiple AND conditions in a single filter call (i.e. a good implementation in eager mode)

df.filter((pl.col('Age') > 70) & (pl.col('Pclass') == 1)) 

row_nr,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
u32,i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
96,97,0,1,"""Goldschmidt, M…","""male""",71.0,0,0,"""PC 17754""",34.6542,"""A5""","""C"""
493,494,0,1,"""Artagaveytia, …","""male""",71.0,0,0,"""PC 17609""",49.5042,,"""C"""
630,631,1,1,"""Barkworth, Mr.…","""male""",80.0,0,0,"""27042""",30.0,"""A23""","""S"""


### Filtering in Lazy mode

In [None]:
# Read the data again in Lazy mode

dfLazy = pl.scan_csv(csv_file)
dfLazy

When we apply filter in lazy mode, a FILTER line is added to the naive query plan of the Polars LazyFrame. Note that all query plans are read from bottom to top.

In [None]:
dfLazyF = dfLazy.filter(pl.col('Age') > 30)
dfLazyF

In lazy mode, if we chain multiple filter calls, the query optimizer combines them into a single condition in the SELECTION section of the optimized plan. Therefore, it is not essential to combine them manually - Polars takes care of it under the hood.

In [None]:
dfLazyF = dfLazy.filter(pl.col('Pclass') == 1).filter(pl.col('Age') > 70)
print(dfLazyF.describe_optimized_plan())


  CSV SCAN Titanic.csv
  PROJECT */12 COLUMNS
  SELECTION: [([(col("Pclass")) == (1)]) & ([(col("Age")) > (70.0)])]
