# https://changhsinlee.com/pyspark-udf/
# How to Turn Python Functions into PySpark Functions (UDF)

Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?

In other words, how do I turn a Python function into a Spark user-defined function, or _UDF_? I’ll explain my solution here.

# Registering a UDF

PySpark UDFs work much like the pandas ```.map()``` and ```.apply()``` methods for pandas Series and DataFrames. If I have a function that takes values from a row of the dataframe as input, then I can map it over the entire dataframe. The main difference is that with PySpark UDFs I have to specify the output data type.
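For comparison, here is a minimal sketch of the pandas side of that analogy. The dataframe `pdf_demo` and the variable names are made up for illustration; only ```.map()``` and ```.apply()``` come from the text above.

```python
import pandas as pd


def square(value):
    return value**2


# Hypothetical demo data, not part of the original notebook.
pdf_demo = pd.DataFrame({'integers': [1, 2, 3],
                         'floats': [-1.0, 0.5, 2.7]})

# Series.map() applies the function to every element of one column.
int_squared = pdf_demo['integers'].map(square)

# DataFrame.apply(axis=1) passes each row to the function.
float_squared = pdf_demo.apply(lambda row: square(row['floats']), axis=1)

print(int_squared.tolist())
```

Notice that pandas never asks what type `square()` returns. That is the part a PySpark UDF makes explicit.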

As an example, I will create a PySpark dataframe from a pandas dataframe.

In [3]:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-from-chinese-dude").getOrCreate()

In [4]:
# Example Data:
df_pd = pd.DataFrame(data={'integers': [1, 2, 3],
                           'floats': [-1.0, 0.5, 2.7],
                           'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]})

df = spark.createDataFrame(df_pd)
df.printSchema()
print("\n")
df.show()

root
 |-- integers: long (nullable = true)
 |-- floats: double (nullable = true)
 |-- integer_arrays: array (nullable = true)
 |    |-- element: long (containsNull = true)



+--------+------+--------------+
|integers|floats|integer_arrays|
+--------+------+--------------+
|       1|  -1.0|        [1, 2]|
|       2|   0.5|     [3, 4, 5]|
|       3|   2.7|  [6, 7, 8, 9]|
+--------+------+--------------+



# Primitive type outputs

Let's say I have a Python function ```square()``` that squares a number, and I want to register this function as a Spark UDF.

In [5]:
def square(value):
    return value**2

As long as the Python function's output has a corresponding data type in Spark, I can turn it into a UDF. When registering UDFs, I have to specify the data type using the types from ```pyspark.sql.types```. All the types supported by PySpark can be found [here](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=types#module-pyspark.sql.types).

**Here's a small gotcha**: a Spark UDF does not convert integers to floats, unlike the Python function, which works for both integers and floats. A Spark UDF will return a column of NULLs if the type of the value returned by the Python function does not match the declared output data type, as in the following example:

### Registering UDF with integer type output.

In [7]:
# Integer type output.
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

In [8]:
square_udf_int = udf(f=lambda z: square(z), returnType=IntegerType())

In [9]:
df.select("integers",
          "floats",
          square_udf_int("integers").alias("int_squared"),
          square_udf_int("floats").alias("float_squared")).show()

+--------+------+-----------+-------------+
|integers|floats|int_squared|float_squared|
+--------+------+-----------+-------------+
|       1|  -1.0|          1|         null|
|       2|   0.5|          4|         null|
|       3|   2.7|          9|         null|
+--------+------+-----------+-------------+



### Registering UDF with float type output.

In [11]:
# Float type output.
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

In [12]:
square_udf_float = udf(f=lambda z: square(z), returnType=FloatType())

In [14]:
df.select("integers",
          "floats",
          square_udf_float("integers").alias("int_squared"),
          square_udf_float("floats").alias("float_squared")).show()

+--------+------+-----------+-------------+
|integers|floats|int_squared|float_squared|
+--------+------+-----------+-------------+
|       1|  -1.0|       null|          1.0|
|       2|   0.5|       null|         0.25|
|       3|   2.7|       null|         7.29|
+--------+------+-----------+-------------+
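
One way to see where the nulls in the two tables above come from is to check, in plain Python, what ```square()``` actually returns for each kind of input. This is only a sketch outside Spark; the variable names are made up for illustration.

```python
def square(value):
    return value**2


# The same function returns a different Python type depending on its input.
for value in (2, 0.5):
    result = square(value)
    print(f"square({value!r}) -> {result!r} ({type(result).__name__})")

int_result_type = type(square(2)).__name__
float_result_type = type(square(0.5)).__name__
```

An integer input gives back an `int` and a float input gives back a `float`. As far as I can tell, a regular Python UDF does not cast that value to the declared ```returnType```, so the column whose Python type does not match comes back as null.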



### Specifying the float type output in the Python function

Specifying the data type in the Python function output is probably the safer way.  Because I _(the Chinese guy)_ usually load data into Spark from Hive tables whose schemas were made by others, forcing the return value to the declared type inside the function means the UDF should still work as intended even if the Hive schema has changed.

In [15]:
# Forcing the output to be float
def square_float(x):
    return float(x**2)


square_udf_float2 = udf(f=lambda z: square_float(z), returnType=FloatType())

In [16]:
df.select("integers",
          "floats",
          square_udf_float2("integers").alias("int_squared"),
          square_udf_float2("floats").alias("float_squared")).show()

+--------+------+-----------+-------------+
|integers|floats|int_squared|float_squared|
+--------+------+-----------+-------------+
|       1|  -1.0|        1.0|          1.0|
|       2|   0.5|        4.0|         0.25|
|       3|   2.7|        9.0|         7.29|
+--------+------+-----------+-------------+
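
One thing to keep in mind with the forced cast: a null in a Spark column reaches the Python function as `None`, and `None**2` raises a `TypeError`. Below is a minimal sketch of a null-tolerant variant. The name `square_float_safe` is hypothetical and not part of the original notebook.

```python
def square_float_safe(x):
    """Hypothetical variant of square_float() that passes missing values through."""
    if x is None:
        # Return None so Spark can store a null instead of the UDF failing.
        return None
    return float(x**2)


print(square_float_safe(3))     # 9.0
print(square_float_safe(0.5))   # 0.25
print(square_float_safe(None))  # None
```

It should be possible to register this the same way as ```square_float()```, with ```returnType=FloatType()```.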



# Composite type outputs

If the output of the Python function is a list, then the values in the list have to be of the same type, which is specified within ```ArrayType()``` when registering the UDF.

In [17]:
from pyspark.sql.types import ArrayType

In [None]:
def square_list(some_iterable):
    return [float(val)**2 for val in some_iterable]

# ArrayType() needs the element type; every value in the returned list is a float.
square_list_udf = udf(f=lambda y: square_list(y), returnType=ArrayType(FloatType()))
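
Before relying on the registered UDF, a quick check in plain Python can confirm that every element of the returned list has the same type. This is only a sketch outside Spark; the sample input and the `element_types` variable are made up for illustration.

```python
def square_list(some_iterable):
    return [float(val)**2 for val in some_iterable]


result = square_list([3, 4, 5])
print(result)  # [9.0, 16.0, 25.0]

# Collect the distinct Python types found in the output list.
element_types = {type(val).__name__ for val in result}
print(element_types)  # {'float'}
```

A single entry in `element_types` means the list is homogeneous, which is what an array column with a float element type expects.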