<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# User-Defined Functions

##### Methods
- UDF Registration (`spark.udf`) (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=udfregistration#pyspark.sql.UDFRegistration" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/UDFRegistration.html" target="_blank">Scala</a>): `register`
- Built-In Functions (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#module-pyspark.sql.functions" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html" target="_blank">Scala</a>): `udf`
- Python UDF Decorator (<a href="https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#use-udf-with-dataframes" target="_blank">Databricks</a>): `@udf`
- Pandas UDF Decorator (<a href="https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html#pandas-user-defined-functions" target="_blank">Databricks</a>): `@pandas_udf`

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
salesDF = spark.read.parquet(salesPath)
display(salesDF.limit(5))

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
257437,kmunoz@powell-duran.com,1592194221828900,1,1995.0,1,"List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))"
282611,bmurillo@hotmail.com,1592504237604072,1,940.5,1,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))"
257448,bradley74@gmail.com,1592200438030141,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
257440,jameshardin@campbell-morris.biz,1592197217716495,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
283949,whardin@hotmail.com,1592510720760323,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Define a function

Define a function in local Python/Scala to get the first letter of a string from the `email` field.

In [0]:
def firstLetterFunction(email):
  return email[0]

firstLetterFunction("annagray@kaufman.com")
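
The function above assumes every value is a non-empty string. Spark passes SQL nulls to a Python UDF as `None`, so `email[0]` would raise an error on a null or empty email. A minimal sketch of a defensive variant follows; the name `safe_first_letter` is hypothetical and not part of the course material.

```python
from typing import Optional

def safe_first_letter(email: Optional[str]) -> Optional[str]:
    """Return the first character of email, or None for null/empty input."""
    # Indexing None raises TypeError and indexing "" raises IndexError,
    # so both cases are mapped to None (which Spark stores as NULL).
    if not email:
        return None
    return email[0]

print(safe_first_letter("annagray@kaufman.com"))
print(safe_first_letter(None))
```

This variant could be wrapped in the same way as `firstLetterFunction`, for example with `udf(safe_first_letter)`.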

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create and apply UDF
Define a UDF that wraps the function. Spark serializes the function and sends it to the executors so that it can be applied to the rows of our DataFrame.

In [0]:
from pyspark.sql.functions import udf

# Wrap the local Python function as a Spark UDF
firstLetterUDF = udf(firstLetterFunction)

Apply the UDF to the `email` column.

In [0]:
from pyspark.sql.functions import col
display(salesDF.select(firstLetterUDF(col("email"))).limit(5))

firstLetterFunction(email)
k
b
b
j
w


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Register UDF to use in SQL
Register the UDF with `spark.udf.register` to make it available in the SQL namespace.

In [0]:
salesDF.createOrReplaceTempView("sales")

spark.udf.register("sql_udf", firstLetterFunction)

In [0]:
%sql
SELECT sql_udf(email) AS firstLetter FROM sales LIMIT 5

firstLetter
k
b
b
j
w


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Decorator Syntax (Python Only)

Alternatively, define a UDF using decorator syntax in Python, passing the data type that the function returns.

The decorated name now refers to the UDF, so you can no longer call it as a local Python function (e.g. `decoratorUDF("annagray@kaufman.com")` will not work).

In [0]:
%python
# The argument to @udf is the return type; here both the input and the output are strings
@udf("string")
def decoratorUDF(email: str) -> str:
  return email[0]

In [0]:
%python
from pyspark.sql.functions import col
salesDF = spark.read.parquet("/mnt/training/ecommerce/sales/sales.parquet")
display(salesDF.select(decoratorUDF(col("email"))).limit(5))

decoratorUDF(email)
k
b
b
j
w
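
The note above about losing the local function follows from how Python decorators rebind names. The sketch below illustrates this with `toy_udf`, a hypothetical stand-in decorator written only for this example (it is not the PySpark implementation), and shows one way to keep the plain function callable: wrap it under a separate name instead of decorating it.

```python
def toy_udf(func):
    """Hypothetical stand-in for a UDF decorator (NOT the PySpark implementation).

    The wrapper describes a column expression instead of running func,
    and keeps the original function on a .func attribute.
    """
    def wrapper(column_name):
        return f"{func.__name__}({column_name})"
    wrapper.func = func
    return wrapper

def first_letter(email):
    return email[0]

# Option 1: wrap under a new name, so first_letter stays a plain function
first_letter_udf = toy_udf(first_letter)

# Option 2: decorate, which rebinds the name to the wrapper
@toy_udf
def decorated_first_letter(email):
    return email[0]

plain_result = first_letter("annagray@kaufman.com")
wrapped_result = decorated_first_letter("email")
print(plain_result)
print(wrapped_result)
```

With the first option, the same function can still be unit-tested locally while the wrapped version is used in DataFrame expressions.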


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Vectorized UDF (Python Only)

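Use a Vectorized UDF (also called a Pandas UDF) to help speed up the computation. Vectorized UDFs use Apache Arrow to transfer data in batches and operate on whole pandas Series rather than on one row at a time.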

In [0]:
%python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# We have a string input/output
@pandas_udf("string")
def vectorizedUDF(email: pd.Series) -> pd.Series:
  return email.str[0]

# Alternatively, wrap a lambda (note that this rebinds the name vectorizedUDF)
vectorizedUDF = pandas_udf(lambda s: s.str[0], "string")

In [0]:
%python
display(salesDF.select(vectorizedUDF(col("email"))).limit(5))

<lambda>(email)
k
b
b
j
w


We can also register these Vectorized UDFs to the SQL namespace.

In [0]:
%python
spark.udf.register("sql_vectorized_udf", vectorizedUDF)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Sort Day Lab

Start with a DataFrame of the average number of active users by day of week.

This was the resulting `df` in a previous lab.

In [0]:
from pyspark.sql.functions import approx_count_distinct, avg, col, date_format, to_date

df = (spark.read.parquet(eventsPath)
  .withColumn("ts", (col("event_timestamp") / 1e6).cast("timestamp"))
  .withColumn("date", to_date("ts"))
  .groupBy("date").agg(approx_count_distinct("user_id").alias("active_users"))
  .withColumn("day", date_format(col("date"), "E"))
  .groupBy("day").agg(avg(col("active_users")).alias("avg_users")))

display(df)

day,avg_users
Sun,282905.5
Mon,238195.5
Thu,264620.0
Sat,278482.0
Wed,227214.0
Fri,247180.66666666663
Tue,260942.5
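
The `ts` and `day` columns above come from dividing a microsecond epoch timestamp by 1e6 and formatting the resulting date with the `E` pattern. The same arithmetic can be sketched in plain Python. This sketch assumes UTC (Spark uses the session time zone), and the helper name `day_of_week` is hypothetical; the sample value is a `transaction_timestamp` from the sales data shown earlier.

```python
from datetime import datetime, timedelta, timezone

DAY_ABBREVIATIONS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_of_week(micros: int) -> str:
    """Convert microseconds since the Unix epoch to a three-letter day name (UTC)."""
    # timedelta keeps integer microsecond precision, avoiding float rounding
    ts = EPOCH + timedelta(microseconds=micros)
    return DAY_ABBREVIATIONS[ts.weekday()]  # weekday(): Monday == 0

sample = 1592194221828900  # 2020-06-15 04:10:21 UTC
print(day_of_week(sample))
```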


### 1. Define UDF to label day of week
- Use the **`labelDayOfWeek`** function provided below to create the UDF **`labelDowUDF`**

In [0]:
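from pyspark.sql.functions import col, udf

def labelDayOfWeek(day):
  # Prefix each abbreviation with its position in the week so the labels sort in weekday order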
  dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
         "Fri": "5", "Sat": "6", "Sun": "7"}
  return dow.get(day) + "-" + day

labelDowUDF = udf(labelDayOfWeek)

### 2. Apply UDF to label and sort by weekday
- Update the **`day`** column by applying the UDF and replacing this column
- Sort by **`day`**
- Plot as bar graph

In [0]:
# TODO
finalDF = (df.withColumn("day", labelDowUDF(col("day")))
           .sort(col("day"))
)

display(finalDF)

day,avg_users
1-Mon,238195.5
2-Tue,260942.5
3-Wed,227214.0
4-Thu,264620.0
5-Fri,247180.66666666663
6-Sat,278482.0
7-Sun,282905.5
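
The numeric prefix matters because string columns sort lexicographically, so the bare abbreviations would come out in alphabetical rather than weekday order. A short plain-Python sketch of the effect, reusing the lab's `labelDayOfWeek` logic on the abbreviations from the earlier output (no Spark is involved here):

```python
def labelDayOfWeek(day):
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow.get(day) + "-" + day

days = ["Sun", "Mon", "Thu", "Sat", "Wed", "Fri", "Tue"]

# Sorting the raw abbreviations gives alphabetical order
alphabetical = sorted(days)

# Sorting the labelled values gives weekday order
labelled = sorted(labelDayOfWeek(d) for d in days)

print(alphabetical)
print(labelled)
```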


### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
