
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>

# USER DEFINED FUNCTION

A UDF (user-defined function) in Spark extends Spark SQL by letting you write custom functions and apply them to data in Spark DataFrames or Datasets.

These functions can be written in languages like Python, Scala, or Java, depending on the Spark environment you are using. UDFs allow you to perform complex operations that Spark SQL does not natively support.

## BENEFITS

* Customization: Allows defining custom logic not available in Spark’s built-in functions.
* Flexibility: Provides the ability to implement your own business logic when built-in functions aren’t sufficient.
* Reuse: UDFs can be reused across multiple queries and operations.
* Compatibility: Enables the integration of external libraries or custom algorithms.
* Interoperability: Supports extending Spark’s functionality using languages like Python or Java.

## DISADVANTAGES

* **Performance Overhead**: UDFs run on a per-record basis and bypass Spark’s optimization, leading to slower performance compared to built-in operations.
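* **Limited Parallelization Gains**: UDFs still run in parallel across partitions, but within each task rows are processed one at a time without whole-stage code generation, so throughput on large datasets is lower than with built-in functions.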
* **No Catalyst Optimization**: UDFs cannot leverage the Catalyst query optimizer, missing out on optimizations like predicate pushdown.
* **Serialization Costs**: For Python UDFs, data must be serialized and deserialized between the JVM and the Python worker processes, adding overhead.
* **Error Handling**: Debugging UDFs is harder than debugging built-in functions because the execution plan gives little visibility into what happens inside them.
* **Memory Control**: Python UDFs run in worker processes outside the JVM, so their memory use falls outside Spark’s JVM memory management, potentially leading to inefficiencies.
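
The performance and Catalyst points above can be seen by comparing query plans. The following is a minimal sketch, not a benchmark: `lower_role` and `compare_plans` are hypothetical names, and `df` is assumed to have a string column called `role`. The UDF plan typically shows a separate Python evaluation step, while the built-in version does not.

```python
from typing import Optional


def lower_role(value: Optional[str]) -> Optional[str]:
    """Plain-Python logic that a UDF would run once per row."""
    return None if value is None else value.lower()


def compare_plans(df) -> None:
    """Print the physical plans for a UDF and for the equivalent built-in.

    Assumes `df` is a Spark DataFrame with a string column named "role".
    """
    # Imported lazily so lower_role can be tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    lower_role_udf = F.udf(lower_role, StringType())

    # UDF version: the plan usually contains a Python evaluation node
    # (for example BatchEvalPython), which Catalyst treats as a black box.
    df.select(lower_role_udf(F.col("role")).alias("role")).explain()

    # Built-in version: evaluated inside the JVM and visible to the optimizer.
    df.select(F.lower(F.col("role")).alias("role")).explain()
```

When a built-in function such as `lower` covers the logic, it is usually the better choice, and a UDF is best kept for logic that has no built-in equivalent.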

## CREATE

To create a UDF, you need a basic knowledge of Spark data types and Python programming.


In [0]:
from typing import List
from pyspark.sql.functions import udf, col, lower
# Data types live in pyspark.sql.types, not pyspark.sql.functions
from pyspark.sql.types import StringType, ArrayType

elements = [
    {"id": 1, "name": "July", "age": 34, "salary": 550, "role": "admin"},
    {"id": 1, "name": "July", "age": 34, "salary": 550, "role": "admin"},
    {"id": 2, "name": "Gabriel", "age": 29, "salary": 720, "role": "developer"},
    {"id": 3, "name": "Luis", "age": 42, "salary": 610, "role": "developer"},
    {"id": 4, "name": "John", "age": 51, "salary": 890, "role": "manager"},
    {"id": 5, "name": "Daniel", "age": 27, "salary": 480, "role": "developer"},
    {"id": 6, "name": "Mary", "age": 38, "salary": 700, "role": "admin"},
    {"id": 7, "name": "Monica", "age": 33, "salary": 460, "role": "tester"},
    {"id": 8, "name": "Andrea", "age": 45, "salary": 680, "role": "admin"},
    {"id": 9, "name": "Sebastian", "age": 31, "salary": 530, "role": "developer"},
    {"id": 10, "name": "Johana", "age": 26, "salary": 410, "role": "tester"},
    {"id": 11, "name": None, "age": 26, "salary": None, "role": "tester"},
    {"id": 12, "name": "Juan", "age": 45, "salary": 680, "role": None},
]
df = spark.createDataFrame(elements)
display(df)


### BASIC TEMPORARY UDF

In [0]:
from typing import List

def get_list_letters(value: str) -> List[str]:
    if value is None:
        return []
    return list(value.lower())

get_list_letters("Test")
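
Because the UDF body is an ordinary Python function, its edge cases can be checked without a Spark session before it is wrapped. A minimal sketch, repeating the function so the block is self-contained; the chosen inputs are only illustrative:

```python
from typing import List, Optional


def get_list_letters(value: Optional[str]) -> List[str]:
    # Null cells arrive in Python as None; return an empty list instead of failing.
    if value is None:
        return []
    return list(value.lower())


# Plain-Python checks for a normal value, an empty string, and a null.
assert get_list_letters("Test") == ["t", "e", "s", "t"]
assert get_list_letters("") == []
assert get_list_letters(None) == []
```

Checking the `None` case here matters because the sample data contains rows with a missing `name`.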

In [0]:
# Wrap the Python function as a UDF that returns an array of strings
my_udf = udf(get_list_letters, ArrayType(StringType()))

In [0]:
from pyspark.sql.functions import col, lower
df.select(
    col("name"),
    my_udf(col("name")).alias("letters")
).display()

### DECORATOR

In [0]:
@udf(returnType=ArrayType(StringType()))
def get_list_letters(value):
    if value is None:
        return []
    return list(str(value).lower())

In [0]:
df.select(
    col("name"),
    get_list_letters(
        col("name")
    ).alias("letters")
).display()

### REGISTER

Registering a UDF in Spark means giving your function a name so it can be used directly in SQL queries, just like built-in Spark functions.

Benefits:
* Use in SQL: Call your UDF inside spark.sql() queries.
* BI Integration: Make the UDF available in dashboards and reporting tools connected to Spark.
* Reusable: Easily call the function by name across multiple queries and notebooks.
* Organized: Helps structure your project like a database with user-defined functions.

Caveats:
* Not persistent: Registered UDFs are lost when you stop or restart your Spark session or cluster.
* No performance boost: Registering doesn’t make UDFs faster or optimized.
* Still a black box: Spark’s optimizer (Catalyst) cannot optimize registered UDFs.

#### SET UP

In [0]:

def get_letters(value: str) -> List[str]:
    if value is None:
        return []
    return list(value.lower())

get_letters("Test")

In [0]:
spark.udf.register("get_cert_udf", get_letters, ArrayType(StringType()))
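
`spark.udf.register` also returns the registered UDF, so a single helper can serve both SQL and the DataFrame API. Rerunning such a helper at the start of each session is one way to cope with registrations not persisting. This is a sketch under assumptions: `register_letters_udf` is a hypothetical helper name, and `spark` is assumed to be an active `SparkSession`.

```python
from typing import List, Optional


def get_letters(value: Optional[str]) -> List[str]:
    # Return an empty list for nulls so the UDF never raises on missing names.
    if value is None:
        return []
    return list(value.lower())


def register_letters_udf(spark):
    """Register get_letters under the SQL name "get_cert_udf" and return the UDF."""
    # Imported lazily so get_letters stays testable without Spark installed.
    from pyspark.sql.types import ArrayType, StringType

    return spark.udf.register("get_cert_udf", get_letters, ArrayType(StringType()))
```

In a notebook this could be called as `letters_udf = register_letters_udf(spark)`, after which `df.select(letters_udf(col("name")))` and `spark.sql("SELECT get_cert_udf(name) ...")` run the same function.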

#### USING SQL

In [0]:
df.createOrReplaceTempView("my_test_udf_sl")
spark.sql("SELECT get_cert_udf(name) AS letters_list FROM my_test_udf_sl").display()

#### USING PYSPARK SELECTEXPR

In [0]:
df.selectExpr("get_cert_udf(name) as letters_list").display()