# [Expressions: User-defined Python functions](https://docs.pola.rs/user-guide/expressions/user-defined-python-functions/)

There are two ways to run a user-defined Python function inside a Polars expression:

- `map_elements()`: Calls the function separately on each value in the Series.
- `map_batches()`: Always passes the full Series to the function.

## Processing individual values with `map_elements()`

In [1]:
import polars as pl
df = pl.DataFrame(
    {
        "keys": ["a", "a", "b", "b"],
        "values": [10, 7, 1, 23],
    }
)
print(df)

shape: (4, 2)
┌──────┬────────┐
│ keys ┆ values │
│ ---  ┆ ---    │
│ str  ┆ i64    │
╞══════╪════════╡
│ a    ┆ 10     │
│ a    ┆ 7      │
│ b    ┆ 1      │
│ b    ┆ 23     │
└──────┴────────┘


In [2]:
import math

def my_log(value):
    return math.log(value)

df.select(pl.col("values").map_elements(my_log, return_dtype=pl.Float64))

Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - pl.col("values").map_elements(my_log)
with this one instead:
  + pl.col("values").log()

  df.select(pl.col("values").map_elements(my_log, return_dtype=pl.Float64))


shape: (4, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 2.302585 │
│ 1.94591  │
│ 0.0      │
│ 3.135494 │
└──────────┘


## Processing a whole Series with `map_batches()`

In [3]:
def diff_from_mean(series):
    total = 0
    for value in series:
        total += value
    mean = total / len(series)
    return pl.Series([value - mean for value in series])

# Apply the function to the full Series
out = df.select(pl.col("values").map_batches(diff_from_mean))
print("== select() with UDF ==")
print(out)

# Apply the function per group
out = df.group_by("keys").agg(pl.col("values").map_batches(diff_from_mean))
print("== group_by() with UDF ==")
print(out)

== select() with UDF ==
shape: (4, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ -0.25  │
│ -3.25  │
│ -9.25  │
│ 12.75  │
└────────┘
== group_by() with UDF ==
shape: (2, 2)
┌──────┬───────────────┐
│ keys ┆ values        │
│ ---  ┆ ---           │
│ str  ┆ list[f64]     │
╞══════╪═══════════════╡
│ b    ┆ [-11.0, 11.0] │
│ a    ┆ [1.5, -1.5]   │
└──────┴───────────────┘


## Fast operations with user-defined functions

To keep performance up, use functions that operate on a whole array at once, such as the universal functions ("ufuncs") provided by NumPy and SciPy.

In [4]:
import numpy as np
df.select(pl.col("values").map_batches(np.log))

shape: (4, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 2.302585 │
│ 1.94591  │
│ 0.0      │
│ 3.135494 │
└──────────┘


## Example: A fast custom function using Numba

In [5]:
from numba import float64, guvectorize, int64

# https://numba.readthedocs.io/en/stable/user/vectorize.html
@guvectorize([(int64[:], float64[:])], "(n)->(n)")
def diff_from_mean_numba(arr, result):
    total = 0
    for value in arr:
        total += value
    mean = total / len(arr)
    for i, value in enumerate(arr):
        result[i] = value - mean

out = df.select(pl.col("values").map_batches(diff_from_mean_numba))
print("== select() with UDF ==")
print(out)

out = df.group_by("keys").agg(pl.col("values").map_batches(diff_from_mean_numba))
print("== group_by() with UDF ==")
print(out)

== select() with UDF ==
shape: (4, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ -0.25  │
│ -3.25  │
│ -9.25  │
│ 12.75  │
└────────┘
== group_by() with UDF ==
shape: (2, 2)
┌──────┬───────────────┐
│ keys ┆ values        │
│ ---  ┆ ---           │
│ str  ┆ list[f64]     │
╞══════╪═══════════════╡
│ a    ┆ [1.5, -1.5]   │
│ b    ┆ [-11.0, 11.0] │
└──────┴───────────────┘


## Missing data is not allowed when calling generalized ufuncs

## Combining multiple column values

In [9]:
@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]


df3 = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})

out = df3.select(
    # Create a struct that has two columns in it:
    pl.struct(["values1", "values2"])
    # Pass the struct to a lambda that then passes the individual columns to
    # the add() function:
    .map_batches(
        lambda combined: add(
            combined.struct.field("values1"), combined.struct.field("values2")
        )
    )
    .alias("add_columns")
)
print(out)

shape: (3, 1)
┌─────────────┐
│ add_columns │
│ ---         │
│ f64         │
╞═════════════╡
│ 11.0        │
│ 22.0        │
│ 33.0        │
└─────────────┘


## Streaming calculations

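If your function uses a lot of memory, you can pass `is_elementwise=True` to `map_batches()` so that Polars can stream the data into the function in pieces instead of handing over the full Series at once. Only do this when each output value depends solely on the corresponding input value; a function such as `diff_from_mean()`, which needs every value to compute the mean, could return wrong results.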

## Return types

The mapping of Python types to Polars data types is as follows:

- `int` -> `Int64`
- `float` -> `Float64`
- `bool` -> `Boolean`
- `str` -> `String`
- `list[tp]` -> `List[tp]` (where the inner type is inferred with the same rules)
- `dict[str, [tp]]` -> `Struct`
- `Any` -> `Object` (avoid this at all times)
