>Polars has developed its own Domain Specific Language (DSL) for transforming data. \
The language is very easy to use and allows for complex queries that remain human readable. \
The two core components of the language are Contexts and Expressions.

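# Overview of contexts and expressions

## Contexts

As the name suggests, a context is the setting in which an expression is evaluated. The main contexts are:

- Selection: df.select(...), df.with_columns(...)
- Filtering: df.filter(...)
- Group-by aggregation: df.group_by(...).agg(...)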

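## Expressions

Expressions are at the core of many data science operations:

- selecting specific columns
- sampling specific rows from a column
- multiplying a column by a value
- extracting the year from a date column
- converting a column of strings to lowercase
- ...

**In Polars, contexts and expressions must be used together: an expression only produces a result once it is evaluated inside a context.**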

# Runtime environment

In [1]:
import sys

print('Python version:', sys.version.split('|')[0])

Python version: 3.11.5


In [2]:
import polars as pl

print("polars version:", pl.__version__)

polars version: 0.20.22


# Demo data

In [3]:
df = pl.read_csv('./data/iris.csv')

In [4]:
df.head(10)

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
1,5.1,3.5,1.4,0.2,"""setosa"""
2,4.9,3.0,1.4,0.2,"""setosa"""
3,4.7,3.2,1.3,0.2,"""setosa"""
4,4.6,3.1,1.5,0.2,"""setosa"""
5,5.0,3.6,1.4,0.2,"""setosa"""
6,5.4,3.9,1.7,0.4,"""setosa"""
7,4.6,3.4,1.4,0.3,"""setosa"""
8,5.0,3.4,1.5,0.2,"""setosa"""
9,4.4,2.9,1.4,0.2,"""setosa"""
10,4.9,3.1,1.5,0.1,"""setosa"""


In [5]:
print(df.head(10))   # note that the str column is displayed differently from the output above

shape: (10, 6)
┌───────┬──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ index ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---   ┆ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ i64   ┆ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str     │
╞═══════╪══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 1     ┆ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 2     ┆ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 3     ┆ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 4     ┆ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
│ 5     ┆ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 6     ┆ 5.4          ┆ 3.9         ┆ 1.7          ┆ 0.4         ┆ setosa  │
│ 7     ┆ 4.6          ┆ 3.4         ┆ 1.4          ┆ 0.3         ┆ setosa  │
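│ 8     ┆ 5.0          ┆ 3.4         ┆ 1.5          ┆ 0.2         ┆ setosa  │
│ 9     ┆ 4.4          ┆ 2.9         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 10    ┆ 4.9          ┆ 3.1         ┆ 1.5          ┆ 0.1         ┆ setosa  │
└───────┴──────────────┴─────────────┴──────────────┴─────────────┴─────────┘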

In [6]:
df.shape

(150, 6)

# Selecting columns

In [7]:
df.select(pl.col("Sepal.Length"))  # select a specific column

Sepal.Length
f64
5.1
4.9
4.7
4.6
5.0
…
6.7
6.3
6.5
6.2


In [8]:
df.select(pl.col("Sepal.Length","Petal.Length"))

Sepal.Length,Petal.Length
f64,f64
5.1,1.4
4.9,1.4
4.7,1.3
4.6,1.5
5.0,1.4
…,…
6.7,5.2
6.3,5.0
6.5,5.2
6.2,5.4


In [9]:
df.select(pl.col("*"))  # select all columns

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
1,5.1,3.5,1.4,0.2,"""setosa"""
2,4.9,3.0,1.4,0.2,"""setosa"""
3,4.7,3.2,1.3,0.2,"""setosa"""
4,4.6,3.1,1.5,0.2,"""setosa"""
5,5.0,3.6,1.4,0.2,"""setosa"""
…,…,…,…,…,…
146,6.7,3.0,5.2,2.3,"""virginica"""
147,6.3,2.5,5.0,1.9,"""virginica"""
148,6.5,3.0,5.2,2.0,"""virginica"""
149,6.2,3.4,5.4,2.3,"""virginica"""


In [10]:
df.select(pl.all())  # select all columns

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
1,5.1,3.5,1.4,0.2,"""setosa"""
2,4.9,3.0,1.4,0.2,"""setosa"""
3,4.7,3.2,1.3,0.2,"""setosa"""
4,4.6,3.1,1.5,0.2,"""setosa"""
5,5.0,3.6,1.4,0.2,"""setosa"""
…,…,…,…,…,…
146,6.7,3.0,5.2,2.3,"""virginica"""
147,6.3,2.5,5.0,1.9,"""virginica"""
148,6.5,3.0,5.2,2.0,"""virginica"""
149,6.2,3.4,5.4,2.3,"""virginica"""


In [11]:
df.select(pl.col("*").exclude("index", "Species"))  # select all columns except the listed ones

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
f64,f64,f64,f64
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
…,…,…,…
6.7,3.0,5.2,2.3
6.3,2.5,5.0,1.9
6.5,3.0,5.2,2.0
6.2,3.4,5.4,2.3


In [12]:
df.select(pl.col("^.*Length$"))  # regular expressions are supported; the pattern must start with ^ and end with $

Sepal.Length,Petal.Length
f64,f64
5.1,1.4
4.9,1.4
4.7,1.3
4.6,1.5
5.0,1.4
…,…
6.7,5.2
6.3,5.0
6.5,5.2
6.2,5.4


In [13]:
df.select(pl.col(pl.Float64))  # select columns by data type

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
f64,f64,f64,f64
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
…,…,…,…
6.7,3.0,5.2,2.3
6.3,2.5,5.0,1.9
6.5,3.0,5.2,2.0
6.2,3.4,5.4,2.3


# Filtering rows

In [14]:
df.filter(pl.col("Sepal.Length")>5)  

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
1,5.1,3.5,1.4,0.2,"""setosa"""
6,5.4,3.9,1.7,0.4,"""setosa"""
11,5.4,3.7,1.5,0.2,"""setosa"""
15,5.8,4.0,1.2,0.2,"""setosa"""
16,5.7,4.4,1.5,0.4,"""setosa"""
…,…,…,…,…,…
146,6.7,3.0,5.2,2.3,"""virginica"""
147,6.3,2.5,5.0,1.9,"""virginica"""
148,6.5,3.0,5.2,2.0,"""virginica"""
149,6.2,3.4,5.4,2.3,"""virginica"""


In [15]:
df.filter((pl.col("Sepal.Length")>5) & (pl.col("Petal.Length")>5))  # each condition must be wrapped in its own parentheses

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
84,6.0,2.7,5.1,1.6,"""versicolor"""
101,6.3,3.3,6.0,2.5,"""virginica"""
102,5.8,2.7,5.1,1.9,"""virginica"""
103,7.1,3.0,5.9,2.1,"""virginica"""
104,6.3,2.9,5.6,1.8,"""virginica"""
…,…,…,…,…,…
145,6.7,3.3,5.7,2.5,"""virginica"""
146,6.7,3.0,5.2,2.3,"""virginica"""
148,6.5,3.0,5.2,2.0,"""virginica"""
149,6.2,3.4,5.4,2.3,"""virginica"""


In [16]:
df.filter((pl.col("Sepal.Length")>5) | (pl.col("Petal.Length")>5))

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
1,5.1,3.5,1.4,0.2,"""setosa"""
6,5.4,3.9,1.7,0.4,"""setosa"""
11,5.4,3.7,1.5,0.2,"""setosa"""
15,5.8,4.0,1.2,0.2,"""setosa"""
16,5.7,4.4,1.5,0.4,"""setosa"""
…,…,…,…,…,…
146,6.7,3.0,5.2,2.3,"""virginica"""
147,6.3,2.5,5.0,1.9,"""virginica"""
148,6.5,3.0,5.2,2.0,"""virginica"""
149,6.2,3.4,5.4,2.3,"""virginica"""


In [17]:
df.select(pl.col("Sepal.Width","Petal.Width").filter(pl.col("Sepal.Length")>5))  # select specific columns, keeping only the rows that satisfy the condition

Sepal.Width,Petal.Width
f64,f64
3.5,0.2
3.9,0.4
3.7,0.2
4.0,0.2
4.4,0.4
…,…
3.0,2.3
2.5,1.9
3.0,2.0
3.4,2.3


# Adding new columns

In [18]:
df.with_columns(pl.lit(10),pl.lit(2).alias("lit_5"))  # add constant columns and give one of them an alias

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,literal,lit_5
i64,f64,f64,f64,f64,str,i32,i32
1,5.1,3.5,1.4,0.2,"""setosa""",10,2
2,4.9,3.0,1.4,0.2,"""setosa""",10,2
3,4.7,3.2,1.3,0.2,"""setosa""",10,2
4,4.6,3.1,1.5,0.2,"""setosa""",10,2
5,5.0,3.6,1.4,0.2,"""setosa""",10,2
…,…,…,…,…,…,…,…
146,6.7,3.0,5.2,2.3,"""virginica""",10,2
147,6.3,2.5,5.0,1.9,"""virginica""",10,2
148,6.5,3.0,5.2,2.0,"""virginica""",10,2
149,6.2,3.4,5.4,2.3,"""virginica""",10,2


In [19]:
df.with_columns(pl.max("Sepal.Length").alias("max_Sepal.Length"),
                pl.min("Sepal.Length").alias("min_Sepal.Length"),
                pl.mean("Sepal.Length").alias("avg_Sepal.Length"),
                pl.std("Sepal.Length").alias("std_Sepal.Length")
               )  # somewhat like a window function: each aggregate is broadcast to every row

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,max_Sepal.Length,min_Sepal.Length,avg_Sepal.Length,std_Sepal.Length
i64,f64,f64,f64,f64,str,f64,f64,f64,f64
1,5.1,3.5,1.4,0.2,"""setosa""",7.9,4.3,5.843333,0.828066
2,4.9,3.0,1.4,0.2,"""setosa""",7.9,4.3,5.843333,0.828066
3,4.7,3.2,1.3,0.2,"""setosa""",7.9,4.3,5.843333,0.828066
4,4.6,3.1,1.5,0.2,"""setosa""",7.9,4.3,5.843333,0.828066
5,5.0,3.6,1.4,0.2,"""setosa""",7.9,4.3,5.843333,0.828066
…,…,…,…,…,…,…,…,…,…
146,6.7,3.0,5.2,2.3,"""virginica""",7.9,4.3,5.843333,0.828066
147,6.3,2.5,5.0,1.9,"""virginica""",7.9,4.3,5.843333,0.828066
148,6.5,3.0,5.2,2.0,"""virginica""",7.9,4.3,5.843333,0.828066
149,6.2,3.4,5.4,2.3,"""virginica""",7.9,4.3,5.843333,0.828066


# Numeric column operations

In [20]:
df.select(pl.col("Sepal.Length"),
          (pl.col("Sepal.Length")*100).alias("Sepal.Length * 100"),
          (pl.col("Sepal.Length")/100).alias("Sepal.Length / 100"),
          (pl.col("Sepal.Length")/pl.max("Sepal.Length")).alias("Sepal.Length /max_Sepal.Length")
         )

Sepal.Length,Sepal.Length * 100,Sepal.Length / 100,Sepal.Length /max_Sepal.Length
f64,f64,f64,f64
5.1,510.0,0.051,0.64557
4.9,490.0,0.049,0.620253
4.7,470.0,0.047,0.594937
4.6,460.0,0.046,0.582278
5.0,500.0,0.05,0.632911
…,…,…,…
6.7,670.0,0.067,0.848101
6.3,630.0,0.063,0.797468
6.5,650.0,0.065,0.822785
6.2,620.0,0.062,0.78481


# String column operations

In [21]:
df.select(pl.col("Species"),
          pl.col("Species").str.len_bytes().alias("byte_count"),
          pl.col("Species").str.len_chars().alias("chars_count")
         )

Species,byte_count,chars_count
str,u32,u32
"""setosa""",6,6
"""setosa""",6,6
"""setosa""",6,6
"""setosa""",6,6
"""setosa""",6,6
…,…,…
"""virginica""",9,9
"""virginica""",9,9
"""virginica""",9,9
"""virginica""",9,9


In [22]:
df.select(pl.col("Species"),
          pl.col("Species").str.contains("set|vir").alias("regex"),
          pl.col("Species").str.starts_with("set").alias("starts_with"),
          pl.col("Species").str.ends_with("ca").alias("ends_with"),
         )

Species,regex,starts_with,ends_with
str,bool,bool,bool
"""setosa""",true,true,false
"""setosa""",true,true,false
"""setosa""",true,true,false
"""setosa""",true,true,false
"""setosa""",true,true,false
…,…,…,…
"""virginica""",true,false,true
"""virginica""",true,false,true
"""virginica""",true,false,true
"""virginica""",true,false,true


# Counting distinct values

In [23]:
df.select(pl.col("Species").n_unique())

Species
u32
3


# Group-by aggregation

In [24]:
df.group_by("Species").agg(
    pl.len(),
    pl.col("index"),
    pl.count("Sepal.Length").name.suffix("_count_1"),  # .name.suffix() is another way to set the output name
    pl.col("Sepal.Length").count().name.suffix("_count_2"),
    pl.mean("Sepal.Length").name.suffix("_mean"),
    pl.std("Sepal.Length").name.suffix("_std"),
)

Species,len,index,Sepal.Length_count_1,Sepal.Length_count_2,Sepal.Length_mean,Sepal.Length_std
str,u32,list[i64],u32,u32,f64,f64
"""setosa""",50,"[1, 2, … 50]",50,50,5.006,0.35249
"""virginica""",50,"[101, 102, … 150]",50,50,6.588,0.63588
"""versicolor""",50,"[51, 52, … 100]",50,50,5.936,0.516171


In [25]:
df.group_by("Species").agg(
    (pl.col("Sepal.Length")>5).sum().alias("Sepal.Length>5"),
    (pl.col("Petal.Length")>5).sum().alias("Petal.Length>5"),
)

Species,Sepal.Length>5,Petal.Length>5
str,u32,u32
"""virginica""",49,41
"""setosa""",22,0
"""versicolor""",47,1


# Sorting

In [26]:
df.sort("Sepal.Length",descending=True)

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
132,7.9,3.8,6.4,2.0,"""virginica"""
118,7.7,3.8,6.7,2.2,"""virginica"""
119,7.7,2.6,6.9,2.3,"""virginica"""
123,7.7,2.8,6.7,2.0,"""virginica"""
136,7.7,3.0,6.1,2.3,"""virginica"""
…,…,…,…,…,…
42,4.5,2.3,1.3,0.3,"""setosa"""
9,4.4,2.9,1.4,0.2,"""setosa"""
39,4.4,3.0,1.3,0.2,"""setosa"""
43,4.4,3.2,1.3,0.2,"""setosa"""


In [27]:
df.sort(["Sepal.Length","Petal.Length"],descending=[True,False])

index,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
i64,f64,f64,f64,f64,str
132,7.9,3.8,6.4,2.0,"""virginica"""
136,7.7,3.0,6.1,2.3,"""virginica"""
118,7.7,3.8,6.7,2.2,"""virginica"""
123,7.7,2.8,6.7,2.0,"""virginica"""
119,7.7,2.6,6.9,2.3,"""virginica"""
…,…,…,…,…,…
42,4.5,2.3,1.3,0.3,"""setosa"""
39,4.4,3.0,1.3,0.2,"""setosa"""
43,4.4,3.2,1.3,0.2,"""setosa"""
9,4.4,2.9,1.4,0.2,"""setosa"""
