# 🔧 PySpark UDF (User Defined Function) - Notes

## 🧠 What is a UDF?

A **UDF (User Defined Function)** in PySpark allows you to write custom Python functions that can be applied to columns in a DataFrame.

- Useful when built-in PySpark functions are not sufficient.
- Can be used in transformations like `withColumn`, `select`, and `filter`.
- A standard (non-vectorized) UDF is called once per row, much like applying a function with Python’s `map`.

> ⚠️ **Note:** UDFs are generally slower than built-in functions because Spark’s Catalyst optimizer treats them as opaque, and every value has to be serialized between the JVM and a Python worker process.

---

## 🧱 Steps to Use a UDF in PySpark

1. **Import necessary libraries** (like `udf` and data types).
2. **Define your custom Python function.**
3. **Wrap the function as a UDF** with `udf()`, specifying the return type.
4. **Apply the UDF** on DataFrame columns using transformations like `withColumn`.

---

## 🧪 Sample Use Case

Suppose you want to assign the age categories "Young", "Adult", and "Senior" based on an `age` column. A rule this simple can also be written with the built-in `when`/`otherwise` functions, but it makes a compact example for showing how a UDF is defined and applied.

---

## 📝 Best Practices

- **Prefer Spark SQL built-in functions** whenever possible; they run inside the JVM and Catalyst can optimize them.
- Use **UDFs only when**:
  - You need custom logic not supported by built-in functions.
  - Business rules require non-standard computations.
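- Consider **Pandas UDFs (vectorized UDFs)**, available since Spark 2.3, for better performance; they process batches of rows as pandas Series exchanged via Apache Arrow instead of one value at a time.
- Catalyst cannot look inside a Python UDF, so it cannot optimize the logic, and filters that call a UDF are not pushed down to the data source.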

---

## 🧠 Bonus: SQL UDFs

You can also **register UDFs** to use inside **Spark SQL queries**, allowing SQL users to apply the same Python logic in their workflows.

---

✅ UDFs extend the power of Spark by allowing custom logic, but they should be used sparingly in performance-sensitive applications.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


In [0]:
spark = SparkSession.builder.appName("UDF Example").getOrCreate()


In [0]:
data = [("Alice", 21), ("Bob", 25), ("Cathy", 30)]
columns = ["name", "age"]

df = spark.createDataFrame(data, columns)
df.show()


+-----+---+
| name|age|
+-----+---+
|Alice| 21|
|  Bob| 25|
|Cathy| 30|
+-----+---+



In [0]:
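def age_category(age):
    """Map an age in years to a coarse category label."""
    # A SQL NULL arrives as None; return None instead of raising a TypeError.
    if age is None:
        return None
    if age < 25:
        return "Young"
    elif age < 30:
        return "Adult"
    else:
        return "Senior"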

In [0]:
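# Wrap the Python function as a Spark UDF and declare its return type.
age_category_udf = udf(age_category, StringType())
# Add a "category" column by applying the UDF to each value of "age".
df_with_category = df.withColumn("category", age_category_udf(df["age"]))
df_with_category.show()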



+-----+---+--------+
| name|age|category|
+-----+---+--------+
|Alice| 21|   Young|
|  Bob| 25|   Adult|
|Cathy| 30|  Senior|
+-----+---+--------+



### Spark SQL

In [0]:
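# Register the same function under a name that SQL queries can call.
spark.udf.register("age_category_sql", age_category, StringType())
# Expose the DataFrame as a temporary view so it can be queried by name.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age, age_category_sql(age) AS category FROM people").show()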


+-----+---+--------+
| name|age|category|
+-----+---+--------+
|Alice| 21|   Young|
|  Bob| 25|   Adult|
|Cathy| 30|  Senior|
+-----+---+--------+

