# Lecture 23. Higher Order Functions and SQL UDFs (Hands On)

In this notebook, we will explore higher-order functions and user-defined functions, also known as UDFs.

Throughout this notebook, we will continue using our bookstore dataset, which contains three tables: `customers`, `orders`, and `books`.

<div style="text-align: center;">
<img src="../../assets/images/Presentation-Images/bookstore_schema.png" style="width:640px" >
</div> 

Let us start by copying this dataset.

In [0]:
%run ../Includes/Copy-Datasets

Let us take another look at our `orders` table.

In [0]:
%sql
SELECT * FROM orders

order_id,order_timestamp,customer_id,quantity,total,books
3559,1657722056,C00001,2,48,"List(List(B09, 2, 48))"
4243,1658786901,C00002,2,55,"List(List(B07, 1, 33), List(B06, 1, 22))"
4321,1658934252,C00003,2,40,"List(List(B04, 2, 40))"
4392,1659034513,C00004,2,82,"List(List(B08, 2, 82))"
3673,1657934721,C00005,2,87,"List(List(B01, 1, 49), List(B11, 1, 38))"
4464,1659171834,C00006,2,77,"List(List(B01, 1, 49), List(B02, 1, 28))"
3495,1657541649,C00007,2,52,"List(List(B09, 1, 24), List(B02, 1, 28))"
4105,1658590490,C00008,2,69,"List(List(B08, 1, 41), List(B02, 1, 28))"
3825,1658166059,C00009,2,48,"List(List(B09, 2, 48))"
4062,1658519180,C00010,2,48,"List(List(B09, 2, 48))"


As you can see here, the `books` column has a complex data type.
In fact, it is an array of structs.

## Higher Order Functions

To work directly with such a complex datatype, we need to use higher order functions.

**Higher-order functions** allow you to work directly with hierarchical data such as arrays and maps.

### Filtering Arrays

One of the most common higher-order functions is `FILTER`, which filters an array using a given lambda function and keeps only the elements for which the lambda returns true.

In this example, we are creating a new column called `multiple_copies`, where we filter the `books` column to extract only those books that have a quantity greater than or equal to 2.
(That is, books that were bought in multiple copies.)

In [0]:
%sql
SELECT
  order_id,
  books,
  FILTER (books, i -> i.quantity >= 2) AS multiple_copies
FROM orders

order_id,books,multiple_copies
3559,"List(List(B09, 2, 48))","List(List(B09, 2, 48))"
4243,"List(List(B07, 1, 33), List(B06, 1, 22))",List()
4321,"List(List(B04, 2, 40))","List(List(B04, 2, 40))"
4392,"List(List(B08, 2, 82))","List(List(B08, 2, 82))"
3673,"List(List(B01, 1, 49), List(B11, 1, 38))",List()
4464,"List(List(B01, 1, 49), List(B02, 1, 28))",List()
3495,"List(List(B09, 1, 24), List(B02, 1, 28))",List()
4105,"List(List(B08, 1, 41), List(B02, 1, 28))",List()
3825,"List(List(B09, 2, 48))","List(List(B09, 2, 48))"
4062,"List(List(B09, 2, 48))","List(List(B09, 2, 48))"


So we are creating a new column called `multiple_copies`, which holds an array containing only the filtered data.
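For readers who find a general-purpose language easier to follow, the behaviour of `FILTER` can be mimicked in plain Python. This is only an illustrative sketch and not how Spark evaluates the query. The three rows are copied by hand from the output above, and the `Book` tuple with its `book_id` field name is an assumption made for illustration, since only `quantity` and `subtotal` are named in this notebook.

```python
from collections import namedtuple

# Assumed struct layout; `book_id` is a hypothetical field name.
Book = namedtuple("Book", ["book_id", "quantity", "subtotal"])

# Three sample rows copied by hand from the orders output above.
orders = [
    (3559, [Book("B09", 2, 48)]),
    (4243, [Book("B07", 1, 33), Book("B06", 1, 22)]),
    (4321, [Book("B04", 2, 40)]),
]


def filter_array(items, predicate):
    """Mimic FILTER(items, i -> predicate(i)): keep elements where the predicate is true."""
    return [item for item in items if predicate(item)]


# Python counterpart of FILTER(books, i -> i.quantity >= 2)
multiple_copies = {
    order_id: filter_array(books, lambda i: i.quantity >= 2)
    for order_id, books in orders
}

for order_id, kept in multiple_copies.items():
    print(order_id, kept)
```

Order 4243 ends up with an empty list, which corresponds to the `List()` value shown for that order in the SQL output.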

However, as you can see, this new column contains a lot of empty arrays.

In this case, it is useful to add a `WHERE` clause that keeps only the rows where the returned array is not empty.

We can accomplish that with a ***subquery***, which is a query nested within another query. The inner query computes `multiple_copies`, and the outer query applies the `WHERE` clause to the size of that column. Let us run this query.

In [0]:
%sql
SELECT order_id, multiple_copies
FROM (
  SELECT
    order_id,
    FILTER (books, i -> i.quantity >= 2) AS multiple_copies
  FROM orders)
WHERE size(multiple_copies) > 0;

order_id,multiple_copies
3559,"List(List(B09, 2, 48))"
4321,"List(List(B04, 2, 40))"
4392,"List(List(B08, 2, 82))"
3825,"List(List(B09, 2, 48))"
4062,"List(List(B09, 2, 48))"
3951,"List(List(B09, 2, 48))"
3910,"List(List(B09, 2, 48))"
3848,"List(List(B09, 2, 48))"
3491,"List(List(B04, 2, 40))"
3888,"List(List(B09, 2, 48))"


The empty arrays are gone.
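The two-level structure of that query can be sketched in plain Python as two passes over the data. Again, this is a rough model under assumptions rather than Spark's execution plan: the sample rows are hand-copied from the `orders` output, and the `Book` tuple with its `book_id` field is a hypothetical stand-in for the struct.

```python
from collections import namedtuple

# Assumed struct layout; `book_id` is a hypothetical field name.
Book = namedtuple("Book", ["book_id", "quantity", "subtotal"])

orders = [
    (3559, [Book("B09", 2, 48)]),
    (4243, [Book("B07", 1, 33), Book("B06", 1, 22)]),
    (3673, [Book("B01", 1, 49), Book("B11", 1, 38)]),
    (4392, [Book("B08", 2, 82)]),
]

# Inner query: compute the derived column for every order.
inner = [
    (order_id, [b for b in books if b.quantity >= 2])
    for order_id, books in orders
]

# Outer query: WHERE size(multiple_copies) > 0
result = [(order_id, kept) for order_id, kept in inner if len(kept) > 0]

for row in result:
    print(row)
```

The point of the split is that the derived column exists only once the inner pass has finished, so the size test has to live in the outer pass.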

### Transforming Arrays

Our second higher-order function is `TRANSFORM`, which applies a transformation to every item in an array and returns an array of the transformed values.

In this example, for each book in the `books` array, we apply a 20% discount to the subtotal value by multiplying it by 0.8 and casting the result to an integer.
Let us run this query.

In [0]:
%sql
SELECT
  order_id,
  books,
  TRANSFORM (
    books,
    b -> CAST(b.subtotal * 0.8 AS INT)
  ) AS subtotal_after_discount
FROM orders;

order_id,books,subtotal_after_discount
3559,"List(List(B09, 2, 48))",List(38)
4243,"List(List(B07, 1, 33), List(B06, 1, 22))","List(26, 17)"
4321,"List(List(B04, 2, 40))",List(32)
4392,"List(List(B08, 2, 82))",List(65)
3673,"List(List(B01, 1, 49), List(B11, 1, 38))","List(39, 30)"
4464,"List(List(B01, 1, 49), List(B02, 1, 28))","List(39, 22)"
3495,"List(List(B09, 1, 24), List(B02, 1, 28))","List(19, 22)"
4105,"List(List(B08, 1, 41), List(B02, 1, 28))","List(32, 22)"
3825,"List(List(B09, 2, 48))",List(38)
4062,"List(List(B09, 2, 48))",List(38)


As you can see, we created a new column containing an array of the transformed values for each element in the books array.
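The arithmetic behind these values can be checked with a small Python sketch. It is an approximation of the SQL expression, not Spark's implementation: `Decimal` is used here as an assumption to keep the multiplication exact, and `int()` drops the fractional part, which agrees with the results shown above (for example, 17.6 becomes 17). The sample subtotals are hand-copied from the output.

```python
from decimal import Decimal

# Subtotals copied by hand from the output above, keyed by order_id.
order_subtotals = {
    3559: [48],
    4243: [33, 22],
    4392: [82],
}


def discounted(subtotal):
    """Mimic CAST(subtotal * 0.8 AS INT): exact decimal product, then drop the fraction."""
    return int(Decimal(subtotal) * Decimal("0.8"))


# Python counterpart of TRANSFORM(books, b -> CAST(b.subtotal * 0.8 AS INT))
subtotal_after_discount = {
    order_id: [discounted(s) for s in subtotals]
    for order_id, subtotals in order_subtotals.items()
}

print(subtotal_after_discount)
# {3559: [38], 4243: [26, 17], 4392: [65]}
```

Note that the output array always has the same length as the input array, which is the main difference from `FILTER`.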

## User Defined Functions (UDF)

Let us now talk about ***user-defined functions***, or ***UDFs***.

- They allow you to register a custom combination of SQL logic as a function in a database, making that logic reusable in any SQL query.

- In addition, SQL UDFs are evaluated directly by Spark SQL, so Spark's optimizations are preserved when you apply your custom logic to large datasets.

### An Example of UDF

At minimum, creating a UDF requires a function name, optional parameters, the type to be returned, and some custom logic, of course.

  - Our function here is named `get_url`. It accepts an email address as an argument and returns a value of type string.

  - Here we are splitting the email into two parts at the `@` character, and we are taking the second element, at index 1 (the array returned by `split` is zero-indexed).

  - And finally, we are prepending `https://www.` to the domain name.

Let us run this command to create the function.

In [0]:
%sql
CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING

RETURN concat("https://www.", split(email, "@")[1])

The function has been created.

Let us now start using it.

Here we are applying our UDF on the customer emails to get the URL address.

In [0]:
%sql
SELECT email, get_url(email) domain
FROM customers

email,domain
sgonnely5a@aol.com,https://www.aol.com
rgonningcc@nbcnews.com,https://www.nbcnews.com
rgoode5l@epa.gov,https://www.epa.gov
rgoodier7m@skype.com,https://www.skype.com
igoodlipc4@twitter.com,https://www.twitter.com
kgoodramj6@dagondesign.com,https://www.dagondesign.com
,
,
,
,


It works.
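To make the string handling concrete, here is a plain-Python model of the same logic. It is a sketch for illustration only and does not run inside Spark. Modelling a missing email as `None` is an assumption, chosen to mirror the empty rows in the output above.

```python
from typing import Optional


def get_url(email: Optional[str]) -> Optional[str]:
    """Plain-Python model of the SQL UDF above."""
    if email is None:
        # Assumption: a missing email yields a missing URL, like the empty rows above.
        return None
    # split(email, "@")[1] is the part after the "@" (index 1, zero-based)
    domain = email.split("@")[1]
    return "https://www." + domain


for email in ["sgonnely5a@aol.com", "rgoode5l@epa.gov", None]:
    print(email, get_url(email))
```

The first two calls return `https://www.aol.com` and `https://www.epa.gov`, matching the corresponding rows of the SQL output.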

Note that user defined functions are permanent objects that are persisted to the database, so you can use them between different Spark sessions and notebooks.

With the `DESCRIBE FUNCTION` command, we can see where the function was registered, along with basic information about its expected inputs and return type.

In [0]:
%sql
DESCRIBE FUNCTION get_url

function_desc
Function: hive_metastore.default.get_url
Type: SCALAR
Input: email STRING
Returns: STRING


As you can see, our function belongs to the `default` database, accepts the email address as a string input, and returns a string value.

We can get even more information by running `DESCRIBE FUNCTION EXTENDED`.

For example, the `Body` field at the bottom shows the SQL logic used in the function itself.

### Another Example of UDF

Of course, we can have more complex logic in our function.

For example, here we are using the standard SQL `CASE WHEN` expression to evaluate multiple conditions within our function.

We check the ending of the email address with the `LIKE` operator in order to detect the category of the domain name.
Otherwise, we report it as an unknown extension.

In [0]:
%sql
CREATE FUNCTION site_type(email STRING)
RETURNS STRING
RETURN CASE 
          WHEN email like "%.com" THEN "Commercial business"
          WHEN email like "%.org" THEN "Non-profit organization"
          WHEN email like "%.edu" THEN "Educational institution"
          ELSE concat("Unknown extension for domain: ", split(email, "@")[1])
       END;

Let us now apply this UDF to our `customers` table.

In [0]:
%sql
SELECT email, site_type(email) as domain_category
FROM customers

email,domain_category
sgonnely5a@aol.com,Commercial business
rgonningcc@nbcnews.com,Commercial business
rgoode5l@epa.gov,Unknown extension for domain: epa.gov
rgoodier7m@skype.com,Commercial business
igoodlipc4@twitter.com,Commercial business
kgoodramj6@dagondesign.com,Commercial business
,
,
,
,


As you can see, UDFs are really powerful.

And remember, everything is evaluated natively in Spark, so it is optimized for parallel execution.
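To make the branching in `site_type` easier to follow, the same decision logic can be modelled in plain Python. This is only a sketch under assumptions and has nothing to do with how Spark runs the UDF. Each `LIKE "%.xyz"` pattern is modelled with `str.endswith`, a missing email is modelled as `None`, and the label strings are written out in plain English rather than copied character for character from the SQL.

```python
from typing import Optional


def site_type(email: Optional[str]) -> Optional[str]:
    """Plain-Python model of the CASE WHEN logic in the SQL UDF."""
    if email is None:
        # Assumption: a missing email yields a missing category, like the empty rows above.
        return None
    # Branches are tested in order, and the first match wins, as in CASE WHEN.
    if email.endswith(".com"):
        return "Commercial business"
    if email.endswith(".org"):
        return "Non-profit organization"
    if email.endswith(".edu"):
        return "Educational institution"
    # ELSE branch: report the domain part after the "@"
    return "Unknown extension for domain: " + email.split("@")[1]


for email in ["sgonnely5a@aol.com", "rgoode5l@epa.gov", None]:
    print(email, "->", site_type(email))
```

Only the `.gov` address falls through to the final branch, which is why it is the one row reported with its domain name.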

Let us drop our user defined functions.

In [0]:
%sql
DROP FUNCTION get_url;
DROP FUNCTION site_type;