# https://docs.databricks.com/spark/latest/spark-sql/udf-python.html

**The following contains Python UDF examples.**

It shows how to:
1. register UDFs,
2. invoke UDFs, and
3. handle caveats regarding the evaluation order of subexpressions in Spark SQL.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("udfs-yippee").getOrCreate()

# How to register a function as a UDF:

In [3]:
def squared(value):
    return value * value


_ = spark.udf.register(name="squared_with_Python", f=squared)

## According to the ```docstring``` of ```spark.udf.register```:

```spark.udf.register(name, f, returnType=None)```  
**Register a Python function (including lambda function) or a user-defined function
as a SQL function.**
- ```name```:  name of the user-defined function in SQL statements.
- ```f```: a Python function, or a user-defined function.  The udf can be either row-at-a-time or vectorized.  See ```pyspark.sql.functions.udf``` and ```pyspark.sql.functions.pandas_udf```.
- ```returnType```: The return type of the registered user-defined function.  The value can be either a ```pyspark.sql.types.DataType``` object or a DDL-formatted type string.

**```return```**:  a user-defined function.

```returnType``` can be optionally specified when ```f``` is a Python function, but **NOT** when ```f``` is a user-defined function.

### 1.  When ```f``` is a ```Python``` function:
```returnType``` defaults to string type and can be optionally specified.  The produced object must match the specified type.  In this case, this API works as if ```register(name, f, returnType=StringType())```.

```python
>>> strlen = spark.udf.register("stringLength", lambda x: len(x))
>>> spark.sql("SELECT stringLength('test')").collect()
[Row(stringLength(test)='4')]

>>> spark.sql("SELECT 'foo' AS text").select(strlen("text")).collect()
[Row(stringLength(text)='3')]

>>> from pyspark.sql.types import IntegerType
>>> _ = spark.udf.register(name="stringLengthInt", f=lambda x: len(x), returnType=IntegerType())  # Notice that we specify returnType
>>> spark.sql("SELECT stringLengthInt('test')").collect()
[Row(stringLengthInt(test)=4)]
```

### 2.  When ```f``` is a user-defined function:

Spark uses the return type of the given user-defined function as the return type of the registered user-defined function.  ```returnType``` should **not be specified.**  In this case, this API works as if ```register(name, f)```.

```python
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf  
>>> slen = udf(lambda s: len(s), IntegerType())  
>>> _ = spark.udf.register(name="slen", f=slen)  # Notice that we are not specifying returnType.  
>>> spark.sql("SELECT slen('test')").collect()  
[Row(slen(test)=4)]

>>> import random  
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic()
>>> _ = spark.udf.register("random_udf", random_udf)  # Register it so SQL can resolve the name.
>>> spark.sql("SELECT random_udf()").collect()
[Row(random_udf()=82)]

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf("integer", PandasUDFType.SCALAR)  
... def add_one(x):
...     return x + 1

>>> _ = spark.udf.register("add_one", add_one)
>>> spark.sql("SELECT add_one(id) FROM range(3)").collect()  
[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]

>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG)
... def sum_udf(v):
...     return v.sum()

>>> _ = spark.udf.register("sum_udf", sum_udf)  
>>> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"  
>>> spark.sql(q).collect()  
[Row(sum_udf(v1)=1), Row(sum_udf(v1)=5)]
```

Optionally set the return type of your UDF.  The **default** return type is ```StringType```.

In [4]:
from pyspark.sql.types import LongType
# pyspark.sql.types.LongType is a Long data type, i.e., a signed 64-bit integer.
# If the values are beyond the range of [-9223372036854775808, 9223372036854775807], please use DecimalType.


def squared_typed(value):
    return value * value


_ = spark.udf.register(name="squared_with_Python", f=squared_typed, returnType=LongType())
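
The range noted in the comment above can be checked in plain Python before choosing `LongType`. The following is a minimal sketch; the names `LONG_MIN`, `LONG_MAX`, and `squared_fits_in_long` are hypothetical helpers introduced here for illustration and are not part of PySpark.

```python
# Bounds of a signed 64-bit integer, matching the range quoted for LongType.
LONG_MIN = -(2 ** 63)
LONG_MAX = 2 ** 63 - 1


def squared_fits_in_long(value):
    """Return True if value * value fits in a signed 64-bit integer."""
    # Python ints are arbitrary precision, so the product itself never overflows here.
    return LONG_MIN <= value * value <= LONG_MAX


print(LONG_MAX)                          # 9223372036854775807
print(squared_fits_in_long(19))          # True
print(squared_fits_in_long(3037000500))  # False
```

For the `test` data used in this notebook (ids 1 through 19) every square fits comfortably; inputs whose square falls outside the range are the case where the comment suggests `DecimalType`.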

# Call the UDF in Spark SQL

In [5]:
spark.range(1, 20).createOrReplaceTempView("test")
# Creates or replaces a local temporary view with the given name
# (registerTempTable is the older, deprecated spelling of this call).
# The lifetime of this temporary view is tied to the SparkSession that was used to create this DataFrame.

df = spark.sql("SELECT * FROM test")
df.show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+



In [6]:
df2 = spark.sql("SELECT id, squared_with_Python(id) AS id_squared FROM test")
df2.show()

+---+----------+
| id|id_squared|
+---+----------+
|  1|         1|
|  2|         4|
|  3|         9|
|  4|        16|
|  5|        25|
|  6|        36|
|  7|        49|
|  8|        64|
|  9|        81|
| 10|       100|
| 11|       121|
| 12|       144|
| 13|       169|
| 14|       196|
| 15|       225|
| 16|       256|
| 17|       289|
| 18|       324|
| 19|       361|
+---+----------+



# Use ```UDF``` with DataFrames.

According to [this link](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.functions.udf):

```pyspark.sql.functions.udf(f=None, returnType=StringType)```

Creates a user-defined function (UDF).
- ```f```: Python function if used as a standalone function.
- ```returnType```: the return type of the user-defined function.  The value can be either a ```pyspark.sql.types.DataType``` object or a DDL-formatted type string.

**Note**:  User-defined functions are considered deterministic by default.  Due to optimization, duplicate invocations may be eliminated, or the function may even be invoked more times than it appears in the query.  If your function is **not deterministic**, call ```asNondeterministic``` on the user-defined function.  For example:

In [7]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import random

random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()

**Moving on to the Databricks example:**

In [8]:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

In [9]:
squared_udf = udf(squared, LongType())
df = spark.table("test")
df.show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+



In [10]:
df.select("id", squared_udf("id").alias("id_squared")).show()

+---+----------+
| id|id_squared|
+---+----------+
|  1|         1|
|  2|         4|
|  3|         9|
|  4|        16|
|  5|        25|
|  6|        36|
|  7|        49|
|  8|        64|
|  9|        81|
| 10|       100|
| 11|       121|
| 12|       144|
| 13|       169|
| 14|       196|
| 15|       225|
| 16|       256|
| 17|       289|
| 18|       324|
| 19|       361|
+---+----------+



Alternatively, you can declare the same UDF using **decorator (annotation) syntax:**

In [11]:
from pyspark.sql.functions import udf

In [12]:
@udf("long")
def squared_udf(value):
    return value * value

In [13]:
df = spark.table("test")
df.select("id", squared_udf("id").alias("id_squared")).show()

+---+----------+
| id|id_squared|
+---+----------+
|  1|         1|
|  2|         4|
|  3|         9|
|  4|        16|
|  5|        25|
|  6|        36|
|  7|        49|
|  8|        64|
|  9|        81|
| 10|       100|
| 11|       121|
| 12|       144|
| 13|       169|
| 14|       196|
| 15|       225|
| 16|       256|
| 17|       289|
| 18|       324|
| 19|       361|
+---+----------+



# Evaluation order and Null Checking.

Spark SQL does not guarantee the order of evaluation of subexpressions.  In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.  For example, logical ```AND``` and ```OR``` expressions do not have left-to-right "short-circuiting" semantics.

Therefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and on the order of ```WHERE``` and ```HAVING``` clauses, since such expressions and clauses can be reordered during query optimization and planning.  Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there's no guarantee that the null check will happen before invoking the UDF.  For example:

In [14]:
_ = spark.udf.register(name="strlen", f=lambda s: len(s), returnType="int")
# spark.sql("SELECT s FROM test1 WHERE s is not null and strlen(s) > 1")  # no guarantee!

The above ```WHERE``` clause does not guarantee that the ```strlen``` UDF is invoked only after nulls have been filtered out.
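
To see why this matters, here is a plain-Python analogy rather than a model of how Spark actually plans queries. Python's `and` does short-circuit, so the first filter is safe; the second filter swaps the operands to mimic a hypothetical reordering in which the UDF runs before the null check. The function names `python_style_filter` and `reordered_filter` are made up for this sketch.

```python
def strlen(s):
    # Same body as the lambda registered above: no null handling.
    return len(s)


def python_style_filter(rows):
    # Python evaluates left to right and short-circuits, so strlen never sees None.
    return [s for s in rows if s is not None and strlen(s) > 1]


def reordered_filter(rows):
    # Hypothetical reordering: the UDF predicate is evaluated before the null check.
    return [s for s in rows if strlen(s) > 1 and s is not None]


rows = ["a", None, "abc"]
print(python_style_filter(rows))  # ['abc']

try:
    reordered_filter(rows)
except TypeError as exc:
    # len(None) raises TypeError once the null check no longer runs first.
    print("reordered filter failed:", exc)
```

Because Spark SQL gives no ordering guarantee, a UDF without its own null handling has to be treated as if the second ordering could occur.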

To perform **proper null checking**, we recommend that you do either of the following:
- Make the ```UDF``` itself null-aware and **do null checking inside the UDF itself.**
- Use ```IF``` or ```CASE WHEN``` expressions to do the null check and invoke the UDF in a conditional branch.

In [15]:
_ = spark.udf.register(name="strlen_nullsafe", f=lambda s: len(s) if s is not None else -1, returnType="int")

In [16]:
# spark.sql("SELECT s FROM test1 WHERE s is not null and strlen_nullsafe(s) > 1") --> OK, good to go.
# spark.sql("SELECT s from test1 WHERE if(s is not null, strlen(s), null) > 1") --> OK, good to go.
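
As a rough end-to-end sketch of the first recommendation, the cell below builds a small `test1` view containing a null and filters it with a null-aware UDF. The sample rows, the `def` form of `strlen_nullsafe`, and the helper `run_null_safe_demo` are illustrative assumptions; `spark` is expected to be the session created in `In [2]`.

```python
def strlen_nullsafe(s):
    """Return the length of s, or -1 when s is None (SQL NULL)."""
    return len(s) if s is not None else -1


def run_null_safe_demo(spark):
    """Hypothetical helper: register the UDF and query a tiny test1 view."""
    # Illustrative data; the second row becomes a SQL NULL.
    rows = [("a",), (None,), ("abc",)]
    spark.createDataFrame(rows, "s string").createOrReplaceTempView("test1")

    spark.udf.register(name="strlen_nullsafe", f=strlen_nullsafe, returnType="int")

    # The null check lives inside the UDF, so no predicate ordering is assumed.
    result = spark.sql("SELECT s FROM test1 WHERE strlen_nullsafe(s) > 1")
    return [row.s for row in result.collect()]
```

Calling `run_null_safe_demo(spark)` in the notebook should keep only the strings longer than one character; the null row maps to -1 and is filtered out without raising. Keeping the logic in a plain function also lets it be checked without a Spark session.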