Polars is a DataFrame library that is completely written in Rust. In this article, I will walk you through the basics of Polars and how it can be used in place of Pandas. 

#What is Polars?
Polars is best understood as a dataframe library designed to be faster and more memory-efficient than Pandas. Here are some advantages of Polars over Pandas:

- Polars does not use an index for the dataframe. Eliminating the index makes the dataframe much easier to manipulate (the index is mostly redundant in a Pandas dataframe anyway).
- Polars represents data internally using Apache Arrow arrays, while Pandas stores data internally using NumPy arrays. Apache Arrow arrays are much more efficient in areas such as load time, memory usage, and computation.
- Polars supports more parallel operations than Pandas. Because Polars is written in Rust, it can run many operations in parallel.
- Polars supports lazy evaluation. Polars examines your query, optimizes it, and looks for ways to accelerate it or reduce its memory usage. Pandas, on the other hand, supports only eager evaluation, which evaluates an expression as soon as it is encountered.

In [1]:
#Installing Polars
!pip install polars --quiet


In [2]:
import polars as pl

df = pl.DataFrame(
     {
         'Model': ['iPhone X','iPhone XS','iPhone 12',
                   'iPhone 13','Samsung S11','Samsung S12',
                   'Mi A1','Mi A2'],
         'Sales': [80,170,130,205,400,30,14,8],     
         'Company': ['Apple','Apple','Apple','Apple',
                     'Samsung','Samsung','Xiao Mi','Xiao Mi'],
     }
)

df

Model,Sales,Company
str,i64,str
"""iPhone X""",80,"""Apple"""
"""iPhone XS""",170,"""Apple"""
"""iPhone 12""",130,"""Apple"""
"""iPhone 13""",205,"""Apple"""
"""Samsung S11""",400,"""Samsung"""
"""Samsung S12""",30,"""Samsung"""
"""Mi A1""",14,"""Xiao Mi"""
"""Mi A2""",8,"""Xiao Mi"""


In [3]:
import polars as pl

df = pl.read_csv("https://j.mp/iriscsv")

In [4]:
df.head()

sepal_length,sepal_width,petal_length,petal_width,species
f64,f64,f64,f64,str
5.1,3.5,1.4,0.2,"""setosa"""
4.9,3.0,1.4,0.2,"""setosa"""
4.7,3.2,1.3,0.2,"""setosa"""
4.6,3.1,1.5,0.2,"""setosa"""
5.0,3.6,1.4,0.2,"""setosa"""


In [5]:
df.dtypes

[Float64, Float64, Float64, Float64, Utf8]

In [6]:
df.columns 

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

> Unlike Pandas, Polars does not have the concept of an index. The design philosophy of Polars explicitly states that an index is not useful in dataframes.

In [None]:
df.rows()

#Selecting Column(s)
Selecting column(s) in Polars is straightforward: specify the column name using the select() method. The Pandas-style square bracket syntax, shown first, also works:

In [None]:
df[['species']]

> Polars also supports the square bracket indexing method that most Pandas developers are familiar with. However, the Polars documentation specifically describes square bracket indexing as an anti-pattern. You can also write the above as df[:,[0]], but square bracket indexing may be removed in a future version of Polars.

In [None]:
df.select(
    'species'
)

In [None]:
df.select(
    ['species','sepal_length']
)

The first two snippets above return the same single-column dataframe. The third returns the species column together with sepal_length.

If you want to retrieve all the floating-point (specifically Float64) columns in the dataframe, you can use an expression within the select() method:

In [11]:
df.select(
    pl.col(pl.Float64)
)

sepal_length,sepal_width,petal_length,petal_width
f64,f64,f64,f64
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4
4.6,3.4,1.4,0.3
5.0,3.4,1.5,0.2
4.4,2.9,1.4,0.2
4.9,3.1,1.5,0.1


> The statement pl.col(pl.Float64) is known as an expression in Polars. This expression is interpreted as "get me all the columns whose data type is Float64."

In [12]:
df.select(
    pl.col(['species','sepal_length']).sort_by('sepal_length', descending=True)    
)

species,sepal_length
str,f64
"""virginica""",7.9
"""virginica""",7.7
"""virginica""",7.7
"""virginica""",7.7
"""virginica""",7.7
"""virginica""",7.6
"""virginica""",7.4
"""virginica""",7.3
"""virginica""",7.2
"""virginica""",7.2


In [13]:
df.select(
    [pl.col(pl.Float64),'species']
)

sepal_length,sepal_width,petal_length,petal_width,species
f64,f64,f64,f64,str
5.1,3.5,1.4,0.2,"""setosa"""
4.9,3.0,1.4,0.2,"""setosa"""
4.7,3.2,1.3,0.2,"""setosa"""
4.6,3.1,1.5,0.2,"""setosa"""
5.0,3.6,1.4,0.2,"""setosa"""
5.4,3.9,1.7,0.4,"""setosa"""
4.6,3.4,1.4,0.3,"""setosa"""
5.0,3.4,1.5,0.2,"""setosa"""
4.4,2.9,1.4,0.2,"""setosa"""
4.9,3.1,1.5,0.1,"""setosa"""


In [14]:
df.select(
    [pl.col(pl.Utf8)]
)

species
str
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""
"""setosa"""


#Selecting Row(s)
To select a single row in a dataframe, pass in the row number using the row() method:

In [15]:
df.row(0)   # get the first row

(5.1, 3.5, 1.4, 0.2, 'setosa')

If you need to get multiple rows based on row numbers, you have to use the square bracket indexing method, although it is not the recommended approach in Polars. Here are some examples:

> - df[:2] # first 2 rows
> - df[[1,3]] # second and fourth rows

To select multiple rows, Polars recommends using the filter() method. For example, if you want to retrieve all the setosa rows, you can use the following expression:

In [16]:
df.filter(
    pl.col('species') == 'setosa'
)

sepal_length,sepal_width,petal_length,petal_width,species
f64,f64,f64,f64,str
5.1,3.5,1.4,0.2,"""setosa"""
4.9,3.0,1.4,0.2,"""setosa"""
4.7,3.2,1.3,0.2,"""setosa"""
4.6,3.1,1.5,0.2,"""setosa"""
5.0,3.6,1.4,0.2,"""setosa"""
5.4,3.9,1.7,0.4,"""setosa"""
4.6,3.4,1.4,0.3,"""setosa"""
5.0,3.4,1.5,0.2,"""setosa"""
4.4,2.9,1.4,0.2,"""setosa"""
4.9,3.1,1.5,0.1,"""setosa"""


You can also combine multiple conditions using logical operators:


In [17]:
df.filter(
    (pl.col('species') == 'setosa') | 
    (pl.col('species') == 'versicolor')
)

sepal_length,sepal_width,petal_length,petal_width,species
f64,f64,f64,f64,str
5.1,3.5,1.4,0.2,"""setosa"""
4.9,3.0,1.4,0.2,"""setosa"""
4.7,3.2,1.3,0.2,"""setosa"""
4.6,3.1,1.5,0.2,"""setosa"""
5.0,3.6,1.4,0.2,"""setosa"""
5.4,3.9,1.7,0.4,"""setosa"""
4.6,3.4,1.4,0.3,"""setosa"""
5.0,3.4,1.5,0.2,"""setosa"""
4.4,2.9,1.4,0.2,"""setosa"""
4.9,3.1,1.5,0.1,"""setosa"""


You can use the following logical operators in Polars:

> - OR (|)
> - AND (&)
> - NOT (~)

#Selecting Rows and Columns
Very often, you need to select rows and columns at the same time. You can do so by chaining the filter() and select() methods, like this:

In [18]:
df.filter(pl.col('species') == 'versicolor').select(['petal_length', 'petal_width'])

petal_length,petal_width
f64,f64
4.7,1.4
4.5,1.5
4.9,1.5
4.0,1.3
4.6,1.5
4.5,1.3
4.7,1.6
3.3,1.0
4.6,1.3
3.9,1.4


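Filtering can also feed an aggregation. The following keeps the rows whose sepal_length is greater than 5, groups them by species, and sums every remaining column:

In [19]:
# groupby() is the spelling in the Polars version used here; Polars 0.19 and later rename it to group_by()
df.filter(pl.col("sepal_length") > 5).groupby("species", maintain_order=True).agg(pl.all().sum())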

species,sepal_length,sepal_width,petal_length,petal_width
str,f64,f64,f64,f64
"""setosa""",116.9,81.7,33.2,6.1
"""versicolor""",281.9,131.8,202.9,63.3
"""virginica""",324.5,146.2,273.1,99.6


<img src="https://raw.githubusercontent.com/aaubs/ds-master/main/data/Images/Exercise.png" width="600">