# Expressions in Polars: Introduction and Examples

**Expressions** are at the core of efficient data manipulation in Polars. On their own, expressions are *lazy*: they do nothing until they are applied to a `DataFrame` or `LazyFrame` through specific methods.

In this notebook, we explore how expressions are used in practice through clear, reproducible examples. We will see how to apply expressions to:

- **Select columns** with `df.select()`
- **Create new columns** with `df.with_columns()`
- **Filter rows** with `df.filter()`
- **Group and aggregate** with `df.group_by()`
- **Sort rows** with `df.sort()`

> **Note:** Although each method matters in its own right, the key point is to understand how expressions define the logic for transforming and analyzing data. Each method is covered in detail in later chapters.

---

This notebook follows the project's good practices: the examples are reproducible, the data is lightweight, and Polars is used for manipulation and Plotnine for visualization. Let's get started!

In [2]:
import polars as pl 

In [3]:
fruit = pl.read_csv("data/fruit.csv")
fruit

name,weight,color,is_round,origin
str,i64,str,bool,str
"""Avocado""",200,"""green""",False,"""South America"""
"""Banana""",120,"""yellow""",False,"""Asia"""
"""Blueberry""",1,"""blue""",False,"""North America"""
"""Cantaloupe""",2500,"""orange""",True,"""Africa"""
"""Cranberry""",2,"""red""",False,"""North America"""
"""Elderberry""",1,"""black""",False,"""Europe"""
"""Orange""",130,"""orange""",True,"""Asia"""
"""Papaya""",1000,"""orange""",False,"""South America"""
"""Peach""",150,"""orange""",True,"""Asia"""
"""Watermelon""",5000,"""green""",True,"""Africa"""


In [4]:
fruit.select(
    pl.col("name"),
    pl.col("^.*or.*$"),
    pl.col("weight") / 1000,
    "is_round"
)

name,color,origin,weight,is_round
str,str,str,f64,bool
"""Avocado""","""green""","""South America""",0.2,False
"""Banana""","""yellow""","""Asia""",0.12,False
"""Blueberry""","""blue""","""North America""",0.001,False
"""Cantaloupe""","""orange""","""Africa""",2.5,True
"""Cranberry""","""red""","""North America""",0.002,False
"""Elderberry""","""black""","""Europe""",0.001,False
"""Orange""","""orange""","""Asia""",0.13,True
"""Papaya""","""orange""","""South America""",1.0,False
"""Peach""","""orange""","""Asia""",0.15,True
"""Watermelon""","""green""","""Africa""",5.0,True


## Creating new columns with expressions

In [5]:
fruit.with_columns(
    pl.lit(True).alias("is_fruit"),
    is_berry=pl.col("name").str.ends_with("berry")
)

name,weight,color,is_round,origin,is_fruit,is_berry
str,i64,str,bool,str,bool,bool
"""Avocado""",200,"""green""",False,"""South America""",True,False
"""Banana""",120,"""yellow""",False,"""Asia""",True,False
"""Blueberry""",1,"""blue""",False,"""North America""",True,True
"""Cantaloupe""",2500,"""orange""",True,"""Africa""",True,False
"""Cranberry""",2,"""red""",False,"""North America""",True,True
"""Elderberry""",1,"""black""",False,"""Europe""",True,True
"""Orange""",130,"""orange""",True,"""Asia""",True,False
"""Papaya""",1000,"""orange""",False,"""South America""",True,False
"""Peach""",150,"""orange""",True,"""Asia""",True,False
"""Watermelon""",5000,"""green""",True,"""Africa""",True,False


### Naming columns in Polars

When creating new columns with expressions in Polars, there are two main ways to name them:

- **Using `Expr.alias()`**: The `.alias()` method assigns a name directly to the expression, and that name is kept wherever the expression is used.
- **Using keyword syntax**: When the expression is passed as a named argument (`new_name=expression`), the name is local to that context and takes precedence over `.alias()` if both are used.

**Considerations:**
- Once a keyword argument is used, all following arguments must also be keyword arguments.
- The name must be a valid Python identifier: it cannot start with a digit, contain special characters, or be a reserved word.

**Recommendation:**  
Keyword syntax is preferred for its clarity and brevity.

## Filtering rows with expressions

In [6]:
fruit.filter(
    (pl.col("weight") > 1000)
    & pl.col("is_round")
)

name,weight,color,is_round,origin
str,i64,str,bool,str
"""Cantaloupe""",2500,"""orange""",True,"""Africa"""
"""Watermelon""",5000,"""green""",True,"""Africa"""


## Aggregating with expressions

In [7]:
(
    fruit.group_by(pl.col("origin").str.split(" ").list.last())
    .agg(pl.len(), average_weight=pl.col("weight").mean())
    .sort("average_weight", descending=True)
)

origin,len,average_weight
str,u32,f64
"""Africa""",2,3750.0
"""America""",4,300.75
"""Asia""",3,133.333333
"""Europe""",1,1.0


## Sorting with expressions

In [8]:
fruit.sort(
    pl.col("name").str.len_bytes(),
    descending=True
)

name,weight,color,is_round,origin
str,i64,str,bool,str
"""Cantaloupe""",2500,"""orange""",True,"""Africa"""
"""Elderberry""",1,"""black""",False,"""Europe"""
"""Watermelon""",5000,"""green""",True,"""Africa"""
"""Blueberry""",1,"""blue""",False,"""North America"""
"""Cranberry""",2,"""red""",False,"""North America"""
"""Avocado""",200,"""green""",False,"""South America"""
"""Banana""",120,"""yellow""",False,"""Asia"""
"""Orange""",130,"""orange""",True,"""Asia"""
"""Papaya""",1000,"""orange""",False,"""South America"""
"""Peach""",150,"""orange""",True,"""Asia"""


In [9]:
pl.DataFrame({"a": [1, 2, 3], "b": [0.4, 0.5, 0.6]}).with_columns(
    pl.all().mul(10).name.suffix("_times_10")
)

a,b,a_times_10,b_times_10
i64,f64,i64,f64
1,0.4,10,4.0
2,0.5,20,5.0
3,0.6,30,6.0


In [10]:
pl.all().mul(10).name.suffix("_times_10").meta.has_multiple_outputs()

True

## Properties of expressions

### Key properties of expressions in Polars

- **Laziness:**  
    Expressions are *lazy*: they perform no work on their own. They only run when applied to a `DataFrame` or `LazyFrame` through methods such as `.select()`, `.with_columns()`, `.filter()`, etc.  
    > *Benefit:* This allows automatic optimizations and avoids unnecessary computation.

- **Dependence on function and data:**  
    The result of an expression depends both on the function that executes it and on the DataFrame it is applied to.  
    - The **function** (for example, `.select()` or `.filter()`) defines what is done with the resulting Series.
    - The **DataFrame** determines the type and length of the Series, since expressions operate on its columns and rows.

> **Summary:**  
Expressions in Polars are flexible, efficient tools for defining transformations, but they only come to life when combined with a DataFrame and a function that executes them. This makes it possible to build data pipelines that are clear, reproducible, and optimized.

In [11]:
is_orange = pl.col("color") == "orange"

In [12]:
fruit.with_columns(is_orange)

name,weight,color,is_round,origin
str,i64,bool,bool,str
"""Avocado""",200,False,False,"""South America"""
"""Banana""",120,False,False,"""Asia"""
"""Blueberry""",1,False,False,"""North America"""
"""Cantaloupe""",2500,True,True,"""Africa"""
"""Cranberry""",2,False,False,"""North America"""
"""Elderberry""",1,False,False,"""Europe"""
"""Orange""",130,True,True,"""Asia"""
"""Papaya""",1000,True,False,"""South America"""
"""Peach""",150,True,True,"""Asia"""
"""Watermelon""",5000,False,True,"""Africa"""


In [13]:
fruit.filter(is_orange)

name,weight,color,is_round,origin
str,i64,str,bool,str
"""Cantaloupe""",2500,"""orange""",True,"""Africa"""
"""Orange""",130,"""orange""",True,"""Asia"""
"""Papaya""",1000,"""orange""",False,"""South America"""
"""Peach""",150,"""orange""",True,"""Asia"""


In [14]:
fruit.group_by(is_orange).len()

color,len
bool,u32
False,6
True,4


In [15]:
flowers = pl.DataFrame(
    {
    "name": ["Tiger lily", "Blue flag", "African marigold"],
    "latin": ["Lilium columbianum", "Iris versicolor", "Tagetes erecta"],
    "color": ["orange", "purple", "orange"],
    }
)

In [16]:
flowers.filter(is_orange)

name,latin,color
str,str,str
"""Tiger lily""","""Lilium columbianum""","""orange"""
"""African marigold""","""Tagetes erecta""","""orange"""


## Creando expresiones

In [17]:
fruit.select(pl.col("color")).columns

['color']

In [18]:
fruit.select(pl.col("^.*or.*$")).columns

['color', 'origin']

In [19]:
print(fruit.select(pl.all()).columns)
print(fruit.select(pl.col("*")).columns)

['name', 'weight', 'color', 'is_round', 'origin']
['name', 'weight', 'color', 'is_round', 'origin']


In [20]:
# Select all string columns
fruit.select(pl.col(pl.String)).columns

['name', 'color', 'origin']

In [21]:
# Select all integer and boolean columns
fruit.select(pl.col(pl.Int64, pl.Boolean)).columns

['weight', 'is_round']

In [22]:
# Select specific columns
fruit.select(["name", "color"]).columns

['name', 'color']

## From literal values

In [23]:
pl.select(pl.lit(42).alias("answer"))

answer
i32
42


In [24]:
pl.select(answer=pl.lit(42))

answer
i32
42


In [25]:
fruit.with_columns(
    planet=pl.lit("Earth"),
    sunny=pl.lit("Sunny")
)

name,weight,color,is_round,origin,planet,sunny
str,i64,str,bool,str,str,str
"""Avocado""",200,"""green""",False,"""South America""","""Earth""","""Sunny"""
"""Banana""",120,"""yellow""",False,"""Asia""","""Earth""","""Sunny"""
"""Blueberry""",1,"""blue""",False,"""North America""","""Earth""","""Sunny"""
"""Cantaloupe""",2500,"""orange""",True,"""Africa""","""Earth""","""Sunny"""
"""Cranberry""",2,"""red""",False,"""North America""","""Earth""","""Sunny"""
"""Elderberry""",1,"""black""",False,"""Europe""","""Earth""","""Sunny"""
"""Orange""",130,"""orange""",True,"""Asia""","""Earth""","""Sunny"""
"""Papaya""",1000,"""orange""",False,"""South America""","""Earth""","""Sunny"""
"""Peach""",150,"""orange""",True,"""Asia""","""Earth""","""Sunny"""
"""Watermelon""",5000,"""green""",True,"""Africa""","""Earth""","""Sunny"""


In [26]:
# Raises a shape-mismatch error: a 2-element Series cannot fill 10 rows
# fruit.with_columns(pl.lit(pl.Series([False, True])).alias("row_is_even"))

In [27]:
fruit.with_columns(row_is_even=pl.lit([False, True]))

name,weight,color,is_round,origin,row_is_even
str,i64,str,bool,str,list[bool]
"""Avocado""",200,"""green""",False,"""South America""","[false, true]"
"""Banana""",120,"""yellow""",False,"""Asia""","[false, true]"
"""Blueberry""",1,"""blue""",False,"""North America""","[false, true]"
"""Cantaloupe""",2500,"""orange""",True,"""Africa""","[false, true]"
"""Cranberry""",2,"""red""",False,"""North America""","[false, true]"
"""Elderberry""",1,"""black""",False,"""Europe""","[false, true]"
"""Orange""",130,"""orange""",True,"""Asia""","[false, true]"
"""Papaya""",1000,"""orange""",False,"""South America""","[false, true]"
"""Peach""",150,"""orange""",True,"""Asia""","[false, true]"
"""Watermelon""",5000,"""green""",True,"""Africa""","[false, true]"


In [28]:
pl.select(pl.repeat("Ella", 3).alias("umbrella"), pl.zeros(3), pl.ones(3))

umbrella,zeros,ones
str,f64,f64
"""Ella""",0.0,1.0
"""Ella""",0.0,1.0
"""Ella""",0.0,1.0


In [29]:
# Raises a shape-mismatch error: 9 repeated values cannot fill 10 rows
# fruit.with_columns(planet=pl.repeat("Earth", 9))

## From Ranges

In [30]:
pl.select(
    start=pl.int_range(0, 5), end=pl.arange(0, 10, 2).pow(2)
).with_columns(int_range=pl.int_ranges("start", "end")).with_columns(
    range_length=pl.col("int_range").list.len()
)

start,end,int_range,range_length
i64,i64,list[i64],u32
0,0,[],0
1,4,"[1, 2, 3]",3
2,16,"[2, 3, … 15]",14
3,36,"[3, 4, … 35]",33
4,64,"[4, 5, … 63]",60


In [31]:
pl.select(
    start=pl.date_range(pl.date(1985, 10, 21), pl.date(1985, 10, 26)),
    end=pl.repeat(pl.date(2021, 10, 21), 6),
).with_columns(range=pl.datetime_ranges("start", "end", interval="1h"))

start,end,range
date,date,list[datetime[μs]]
1985-10-21,2021-10-21,"[1985-10-21 00:00:00, 1985-10-21 01:00:00, … 2021-10-21 00:00:00]"
1985-10-22,2021-10-21,"[1985-10-22 00:00:00, 1985-10-22 01:00:00, … 2021-10-21 00:00:00]"
1985-10-23,2021-10-21,"[1985-10-23 00:00:00, 1985-10-23 01:00:00, … 2021-10-21 00:00:00]"
1985-10-24,2021-10-21,"[1985-10-24 00:00:00, 1985-10-24 01:00:00, … 2021-10-21 00:00:00]"
1985-10-25,2021-10-21,"[1985-10-25 00:00:00, 1985-10-25 01:00:00, … 2021-10-21 00:00:00]"
1985-10-26,2021-10-21,"[1985-10-26 00:00:00, 1985-10-26 01:00:00, … 2021-10-21 00:00:00]"


## Renaming Expressions

In [32]:
df = pl.DataFrame({"text": "value", "An integer": 5040, "BOOLEAN": True})
df

text,An integer,BOOLEAN
str,i64,bool
"""value""",5040,True


In [33]:
df.select(
    pl.col("text").name.to_uppercase(),
    pl.col("An integer").alias("int"),
    pl.col("BOOLEAN").name.to_lowercase()
)

TEXT,int,boolean
str,i64,bool
"""value""",5040,True


In [34]:
df.select(
    pl.all().name.map(lambda s: s.lower().replace(" ", "_"))
)

text,an_integer,boolean
str,i64,bool
"""value""",5040,True


## Expressions Are Idiomatic

In [35]:
fruit

name,weight,color,is_round,origin
str,i64,str,bool,str
"""Avocado""",200,"""green""",False,"""South America"""
"""Banana""",120,"""yellow""",False,"""Asia"""
"""Blueberry""",1,"""blue""",False,"""North America"""
"""Cantaloupe""",2500,"""orange""",True,"""Africa"""
"""Cranberry""",2,"""red""",False,"""North America"""
"""Elderberry""",1,"""black""",False,"""Europe"""
"""Orange""",130,"""orange""",True,"""Asia"""
"""Papaya""",1000,"""orange""",False,"""South America"""
"""Peach""",150,"""orange""",True,"""Asia"""
"""Watermelon""",5000,"""green""",True,"""Africa"""


In [36]:
fruit.filter((fruit["weight"] > 1000) & fruit["is_round"])

name,weight,color,is_round,origin
str,i64,str,bool,str
"""Cantaloupe""",2500,"""orange""",True,"""Africa"""
"""Watermelon""",5000,"""green""",True,"""Africa"""


In [None]:
(
    fruit.lazy()
    .filter((pl.col("weight") > 1000) & pl.col("is_round"))
    .with_columns(is_berry=pl.col("name").str.ends_with("berry"))
    .collect()
)

name,weight,color,is_round,origin,is_berry
str,i64,str,bool,str,bool
"""Cantaloupe""",2500,"""orange""",True,"""Africa""",False
"""Watermelon""",5000,"""green""",True,"""Africa""",False


In [38]:
(
    fruit.lazy()
    .filter((fruit["weight"] > 1000) & fruit["is_round"])
    .with_columns(is_berry=fruit["name"].str.ends_with("berry"))
    .collect()
)

ShapeError: unable to add a column of length 10 to a DataFrame of height 2