
<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/derar-alhussein/Databricks-Certified-Data-Engineer-Associate/main/Includes/images/bookstore_schema.png" alt="Databricks Learning" style="width: 600">
</div>

In [0]:
%run ../Includes/Copy-Datasets

In [0]:
SELECT * FROM orders;


## 🔍 Filtering Arrays

The dataset includes complex fields such as arrays of structs (e.g., a list of books per order).  
**Higher-order functions** like `filter` allow users to selectively retain elements from an array based on custom conditions.

- Enables precise subsetting of array elements.
- Useful for isolating values of interest (e.g., quantities greater than a threshold).
- Rows where no element matches yield an empty array; wrapping the query in a subquery and filtering on `size(...) > 0` drops those rows.
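
To see the behavior in isolation, here is a minimal sketch that runs `filter` on a literal array of structs instead of the `orders` table. The `book_id` and `quantity` values are made up for illustration.

```sql
-- Hypothetical literal data: two book entries with different quantities.
SELECT filter(
  array(
    named_struct('book_id', 'B01', 'quantity', 1),
    named_struct('book_id', 'B02', 'quantity', 3)
  ),
  i -> i.quantity >= 2   -- keep only entries with at least two copies
) AS multiple_copies;
```

Only the second entry satisfies the condition, so the result is a one-element array. If no entry matched, the result would be an empty array rather than `NULL`.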

In [0]:
SELECT
  order_id,
  books,
  FILTER (books, i -> i.quantity >= 2) AS multiple_copies
FROM orders;

In [0]:
SELECT order_id, multiple_copies
FROM (
  SELECT
    order_id,
    FILTER (books, i -> i.quantity >= 2) AS multiple_copies
  FROM orders)
WHERE size(multiple_copies) > 0;
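
As a possible alternative to the subquery, the `exists` higher-order function can test whether any element matches. This is a sketch that assumes a runtime based on Spark 3.0 or later, where `exists` is available.

```sql
-- Same result as the subquery version: keep only orders that contain
-- at least one book with quantity >= 2.
SELECT
  order_id,
  filter(books, i -> i.quantity >= 2) AS multiple_copies
FROM orders
WHERE exists(books, i -> i.quantity >= 2);
```

The condition is written twice here, so the subquery form may still be preferable when the lambda is long.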


## 🔁 Transforming Arrays

The `transform` function allows element-wise operations on arrays.  
It applies a lambda expression to each item and returns a new array with one result per input element.

- Commonly used for applying discounts, formatting, or extracting subfields.
- Keeps the array's length and order; the element type is whatever the lambda returns (`INT` in the query below, not the original struct).
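
A minimal sketch of the element-wise behavior on a literal array; the numbers are arbitrary sample values, not taken from the dataset.

```sql
-- Apply a 20% discount to each element and cast the result to INT.
-- 10 * 0.8 = 8.0, 12 * 0.8 = 9.6, 40 * 0.8 = 32.0
SELECT transform(array(10, 12, 40), x -> CAST(x * 0.8 AS INT)) AS discounted;
```

The cast to `INT` drops the fractional part, so 9.6 becomes 9 rather than being rounded up to 10.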

In [0]:
SELECT
  order_id,
  books,
  TRANSFORM (
    books,
    b -> CAST(b.subtotal * 0.8 AS INT)
  ) AS subtotal_after_discount
FROM orders;
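
If the other fields of each book should be kept next to the discounted value, the lambda can return a struct instead of a scalar. The sketch below assumes each element of `books` has `book_id`, `quantity`, and `subtotal` fields; `book_id` does not appear in the queries above and is an assumption about the schema.

```sql
SELECT
  order_id,
  transform(
    books,
    b -> named_struct(
      'book_id', b.book_id,   -- assumed field name
      'quantity', b.quantity,
      'subtotal_after_discount', CAST(b.subtotal * 0.8 AS INT)
    )
  ) AS books_after_discount
FROM orders;
```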

## 🔧 User Defined Functions (UDFs)

UDFs allow you to register **custom logic** and apply it in Spark SQL queries.

- Ideal for tasks that can’t be handled by built-in SQL functions.
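- The functions in this notebook are SQL UDFs, whose body is a SQL expression; Spark also supports UDFs written in Python and Scala.
- A function created with `CREATE FUNCTION` (without `TEMPORARY`) is stored as a database object and persists across sessions.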

In [0]:
CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat('https://www.', split(email, '@')[1]);

In [0]:
SELECT email, get_url(email) AS domain
FROM customers;
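
To make the function body easier to follow, the same expressions can be evaluated step by step on a literal. The address below is a made-up example.

```sql
-- split() returns an array; element [1] is the part after the '@'
-- because bracket indexing on arrays is zero-based.
SELECT
  split('jane@example.org', '@')                            AS parts,
  split('jane@example.org', '@')[1]                         AS domain,
  concat('https://www.', split('jane@example.org', '@')[1]) AS url;
```

Here `parts` holds `jane` and `example.org`, `domain` is `example.org`, and `url` is `https://www.example.org`.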

## 🗃️ UDF Management and Metadata

- UDFs are registered as permanent database objects.
- They can be documented and inspected using metadata commands.
- `DESCRIBE FUNCTION` reports a function's inputs and return type; `DESCRIBE FUNCTION EXTENDED` adds details such as the function body.

In [0]:
DESCRIBE FUNCTION get_url;

In [0]:
DESCRIBE FUNCTION EXTENDED get_url;
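
A function can also carry its own description. The sketch below assumes the runtime accepts a `COMMENT` clause in `CREATE FUNCTION`, as Databricks SQL does; `email_domain_demo` is a hypothetical name used only for this illustration.

```sql
-- Hypothetical demo function with an attached description.
CREATE OR REPLACE FUNCTION email_domain_demo(email STRING)
RETURNS STRING
COMMENT 'Returns the part of an email address after the @ sign'
RETURN split(email, '@')[1];

-- Inspect the definition, including the comment.
DESCRIBE FUNCTION EXTENDED email_domain_demo;

-- Clean up the demo function.
DROP FUNCTION email_domain_demo;
```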

## 💡 Evaluating Conditions with UDFs

UDFs can encapsulate **conditional logic**, such as a `CASE WHEN` expression, making them flexible for custom data transformations.

- Ideal for mapping or categorizing data based on patterns or string suffixes.
- Simplifies complex `IF-ELSE` logic in SQL queries.

In [0]:
CREATE OR REPLACE FUNCTION site_type(email STRING)
RETURNS STRING
RETURN CASE
         WHEN email LIKE '%.com' THEN 'Commercial business'
         WHEN email LIKE '%.org' THEN 'Non-profit organization'
         WHEN email LIKE '%.edu' THEN 'Educational institution'
         ELSE concat('Unknown extension for domain: ', split(email, '@')[1])
       END;

In [0]:
SELECT email, site_type(email) AS domain_category
FROM customers;
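
Before wrapping a `CASE` expression in a function, it can be tried inline on a few literal rows. This sketch uses made-up addresses supplied through a `VALUES` clause, so it does not depend on the `customers` table or on any registered function.

```sql
SELECT
  email,
  CASE
    WHEN email LIKE '%.com' THEN 'Commercial business'
    WHEN email LIKE '%.org' THEN 'Non-profit organization'
    WHEN email LIKE '%.edu' THEN 'Educational institution'
    ELSE concat('Unknown extension for domain: ', split(email, '@')[1])
  END AS domain_category
FROM VALUES
  ('ann@shop.com'),      -- matches the .com branch
  ('bob@charity.org'),   -- matches the .org branch
  ('eve@startup.io')     -- falls through to ELSE
  AS t(email);
```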

In [0]:
DROP FUNCTION get_url;
DROP FUNCTION site_type;

## 🧼 Best Practices for UDFs

- Reuse existing Spark SQL functions where possible before creating a UDF.
- Remove unused UDFs to keep the catalog clean.
- Document UDF behavior, for example with a `COMMENT` clause, and use `DESCRIBE FUNCTION` to inspect existing definitions.
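
As an illustration of the first point, the URL derived from an email address can be built from built-in functions alone. This is one possible sketch using `substring_index`, which returns the text after the last `@` when called with a count of `-1`.

```sql
-- Built-in alternative to a custom function for extracting the domain.
SELECT
  email,
  concat('https://www.', substring_index(email, '@', -1)) AS domain
FROM customers;
```

A UDF is still worthwhile when the same expression is needed in many queries and should be defined in one place.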