### POLARS PACKAGE

https://pypi.org/project/polars/
https://www.linkedin.com/feed/update/urn:li:activity:7027558322694156288/

#### 4 REASONS TO SWITCH FROM PANDAS TO POLARS:

##### 1 - Save time

Polars can be roughly 50 to 100 times faster than Pandas, depending on the workload.

If you use Pandas and wait 3 minutes here, 5 minutes there for a computation to finish, switching to Polars can substantially cut your development time.

This matters most when you iterate on your analysis.

(A 50-fold reduction means going from 10 minutes to 12 seconds.)
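To make the arithmetic above concrete, here is a minimal sketch. The helper name `runtime_after_speedup` is hypothetical, and the 50x and 100x factors are the figures quoted above, not measured benchmarks.

```python
def runtime_after_speedup(seconds: float, speedup: float) -> float:
    """Return the runtime obtained by dividing a duration by a speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return seconds / speedup


ten_minutes = 10 * 60  # 600 seconds

print(runtime_after_speedup(ten_minutes, 50))   # 12.0 seconds
print(runtime_after_speedup(ten_minutes, 100))  # 6.0 seconds
```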

##### 2 - Save your users time (and earn more money)

If you build a user-facing application, your users will wait less and get a better experience.

According to a recent estimate, a site that loads in 1 second has a conversion rate 3x higher than a site that loads in 5 seconds, and 5x higher than a site that loads in 10 seconds!

##### 3 - Save time on your server (and save money)

If you run in the cloud and pay for compute by time used, Polars can reduce compute time and therefore your bill.

##### 4 - A more intuitive and more powerful syntax

There is a learning curve when coming from Pandas, but it is worth it.

The Polars API is expressive and easy to write and understand.

My favorite part is the window functions, which can often replace groupby() and explode() calls.

In [2]:
# %pip install polars

In [11]:
import polars as pl
from datetime import datetime, timedelta 
import numpy as np

In [4]:
# Create a named Series ("a") from a list
series = pl.Series("a", [1, 2, 3, 4, 5])
series

a
i64
1
2
3
4
5


In [6]:
# Create an unnamed Series from a list
series = pl.Series([1, 2, 3, 4, 5])
series

1
2
3
4
5


In [12]:
# Create a DataFrame
df = pl.DataFrame({"a": np.arange(0, 8), 
                   "b": np.random.rand(8), 
                   "c": [datetime(2022, 12, 1) + timedelta(days=idx) for idx in range(8)],
                   # floats throughout so the column is f64; np.nan replaces np.NaN, which NumPy 2.0 removed
                   "d": [1.0, 2.0, np.nan, np.nan, 0.0, -5.0, -42.0, None]
                  })

df

a,b,c,d
i32,f64,datetime[μs],f64
0,0.768031,2022-12-01 00:00:00,1.0
1,0.494708,2022-12-02 00:00:00,2.0
2,0.184233,2022-12-03 00:00:00,
3,0.529369,2022-12-04 00:00:00,
4,0.806704,2022-12-05 00:00:00,0.0
5,0.573304,2022-12-06 00:00:00,-5.0
6,0.661648,2022-12-07 00:00:00,-42.0
7,0.161291,2022-12-08 00:00:00,


In [18]:
# First 5 rows
df.head()

a,b,c,d
i32,f64,datetime[μs],f64
0,0.768031,2022-12-01 00:00:00,1.0
1,0.494708,2022-12-02 00:00:00,2.0
2,0.184233,2022-12-03 00:00:00,
3,0.529369,2022-12-04 00:00:00,
4,0.806704,2022-12-05 00:00:00,0.0


In [19]:
# Last 5 rows
df.tail()

a,b,c,d
i32,f64,datetime[μs],f64
3,0.529369,2022-12-04 00:00:00,
4,0.806704,2022-12-05 00:00:00,0.0
5,0.573304,2022-12-06 00:00:00,-5.0
6,0.661648,2022-12-07 00:00:00,-42.0
7,0.161291,2022-12-08 00:00:00,


In [21]:
# 3 rows picked at random
df.sample(n=3)

a,b,c,d
i32,f64,datetime[μs],f64
0,0.768031,2022-12-01 00:00:00,1.0
3,0.529369,2022-12-04 00:00:00,
7,0.161291,2022-12-08 00:00:00,


In [22]:
# Summary statistics: row count, null count, mean, std, min, max, median
df.describe()

describe,a,b,c,d
str,f64,f64,str,f64
"count",8.0,8.0,"8",8.0
"null_count",0.0,0.0,"0",1.0
"mean",3.5,0.522411,,
"std",2.44949,0.241636,,
"min",0.0,0.161291,"2022-12-01 00:...",-42.0
"max",7.0,0.806704,"2022-12-08 00:...",2.0
"median",3.5,0.551337,,1.0


In [24]:
# Select the columns to keep
df.select(pl.col(['a', 'b']))

a,b
i32,f64
0,0.768031
1,0.494708
2,0.184233
3,0.529369
4,0.806704
5,0.573304
6,0.661648
7,0.161291


In [29]:
# Select every column except the ones listed
df.select([pl.exclude(['a', 'd'])])

b,c
f64,datetime[μs]
0.768031,2022-12-01 00:00:00
0.494708,2022-12-02 00:00:00
0.184233,2022-12-03 00:00:00
0.529369,2022-12-04 00:00:00
0.806704,2022-12-05 00:00:00
0.573304,2022-12-06 00:00:00
0.661648,2022-12-07 00:00:00
0.161291,2022-12-08 00:00:00


In [30]:
# Filter: keep rows whose date lies between 2022-12-02 and 2022-12-08
df.filter(
    pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)))

a,b,c,d
i32,f64,datetime[μs],f64
1,0.494708,2022-12-02 00:00:00,2.0
2,0.184233,2022-12-03 00:00:00,
3,0.529369,2022-12-04 00:00:00,
4,0.806704,2022-12-05 00:00:00,0.0
5,0.573304,2022-12-06 00:00:00,-5.0
6,0.661648,2022-12-07 00:00:00,-42.0
7,0.161291,2022-12-08 00:00:00,


In [31]:
# Multiple filters: a <= 3 and d is not NaN
df.filter(
    (pl.col('a') <= 3) & (pl.col('d').is_not_nan()))

a,b,c,d
i32,f64,datetime[μs],f64
0,0.768031,2022-12-01 00:00:00,1.0
1,0.494708,2022-12-02 00:00:00,2.0


In [32]:
# Add columns
df.with_columns([
    pl.col('b').sum().alias('e'), # column e: sum of column b, repeated on every row
    (pl.col('b') + 42).alias('b+42') # column b+42: each value of column b plus 42
    ])

a,b,c,d,e,b+42
i32,f64,datetime[μs],f64,f64,f64
0,0.768031,2022-12-01 00:00:00,1.0,4.179288,42.768031
1,0.494708,2022-12-02 00:00:00,2.0,4.179288,42.494708
2,0.184233,2022-12-03 00:00:00,,4.179288,42.184233
3,0.529369,2022-12-04 00:00:00,,4.179288,42.529369
4,0.806704,2022-12-05 00:00:00,0.0,4.179288,42.806704
5,0.573304,2022-12-06 00:00:00,-5.0,4.179288,42.573304
6,0.661648,2022-12-07 00:00:00,-42.0,4.179288,42.661648
7,0.161291,2022-12-08 00:00:00,,4.179288,42.161291


In [33]:
# New DataFrame
df2 = pl.DataFrame({
                    "x": np.arange(0, 8), 
                    "y": ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X'],})
df2

x,y
i32,str
0,"A"
1,"A"
2,"A"
3,"B"
4,"B"
5,"C"
6,"X"
7,"X"


In [41]:
# Pivot-table-style summary: number of rows per value of column 'y'
# (group_by was named groupby before Polars 0.19)
(df2.group_by("y", 
             maintain_order=True) # False: group order is not guaranteed
 .count())

y,count
str,u32
"A",3
"B",2
"C",1
"X",2


In [42]:
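# Pivot-table-style summary: row count and sum of x for each value of y
# ("*" expands to every non-key column, here only x; group_by was named groupby before Polars 0.19)
df2.group_by("y", maintain_order=True).agg([
    pl.col("*").count().alias("count"),
    pl.col("*").sum().alias("sum")
])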

y,count,sum
str,u32,i32
"A",3,3
"B",2,7
"C",1,5
"X",2,13


In [44]:
# Combination: add a column 'a * b', then drop columns c and d
df_x = df.with_columns(
    (pl.col("a") * pl.col("b")).alias("a * b")
).select([
    pl.all().exclude(['c', 'd'])
])
df_x

a,b,a * b
i32,f64,f64
0,0.768031,0.0
1,0.494708,0.494708
2,0.184233,0.368465
3,0.529369,1.588107
4,0.806704,3.226817
5,0.573304,2.866521
6,0.661648,3.969889
7,0.161291,1.129037


In [47]:
# Join two DataFrames
df = pl.DataFrame({"a": np.arange(0, 8), 
                   "b": np.random.rand(8), 
                   "c": [datetime(2022, 12, 1) + timedelta(days=idx) for idx in range(8)],
                   # floats throughout so the column is f64; np.nan replaces np.NaN, which NumPy 2.0 removed
                   "d": [1.0, 2.0, np.nan, np.nan, 0.0, -5.0, -42.0, None]
                  })

df2 = pl.DataFrame({
                    "x": np.arange(0, 8), 
                    "y": ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X'],
})
df.join(df2, left_on="a", right_on="x")

a,b,c,d,y
i32,f64,datetime[μs],f64,str
0,0.56043,2022-12-01 00:00:00,1.0,"A"
1,0.46446,2022-12-02 00:00:00,2.0,"A"
2,0.037626,2022-12-03 00:00:00,,"A"
3,0.539072,2022-12-04 00:00:00,,"B"
4,0.438167,2022-12-05 00:00:00,0.0,"B"
5,0.948216,2022-12-06 00:00:00,-5.0,"C"
6,0.253099,2022-12-07 00:00:00,-42.0,"X"
7,0.500195,2022-12-08 00:00:00,,"X"


In [48]:
# Horizontal concatenation: place the two DataFrames side by side
pl.concat([df,df2], how="horizontal")

a,b,c,d,x,y
i32,f64,datetime[μs],f64,i32,str
0,0.56043,2022-12-01 00:00:00,1.0,0,"A"
1,0.46446,2022-12-02 00:00:00,2.0,1,"A"
2,0.037626,2022-12-03 00:00:00,,2,"A"
3,0.539072,2022-12-04 00:00:00,,3,"B"
4,0.438167,2022-12-05 00:00:00,0.0,4,"B"
5,0.948216,2022-12-06 00:00:00,-5.0,5,"C"
6,0.253099,2022-12-07 00:00:00,-42.0,6,"X"
7,0.500195,2022-12-08 00:00:00,,7,"X"
