# User Defined Functions

From time to time you hit a wall where you need a simple transformation, but Spark does not offer an appropriate function in the `pyspark.sql.functions` module. Fortunately, you can simply define new functions, so-called *user defined functions*, or *UDFs* for short.
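
As a minimal sketch of the idea, the following wraps an ordinary Python function into a UDF. The function `initials` and the helper `add_initials` are hypothetical names chosen for illustration, and the helper assumes a DataFrame with a string column called `name`. The Spark imports live inside the helper so that the plain Python function can be tried without a Spark session.

```python
def initials(name):
    """Return the upper-cased first letter of every word that starts with a letter or digit."""
    if name is None:
        # Spark passes SQL NULL values to a Python UDF as None
        return None
    return "".join(word[0].upper() for word in name.split() if word[0].isalnum())


def add_initials(df):
    # Hypothetical helper: wraps the plain function into a UDF and applies it
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    initials_udf = udf(initials, StringType())
    return df.withColumn("initials", initials_udf(df["name"]))
```

The UDF itself is nothing more than the Python function plus a declared return type, which Spark needs in order to build the schema of the result.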

In [None]:
from pyspark.sql.functions import *
from pyspark.sql.types import *


df = spark.createDataFrame([('Alice & Bob',12),('Thelma & Louise',17)],['name','age'])
df.toPandas()

In [None]:
import html


html.escape("Thelma & Louise")

In [None]:
import html


html_encode = # YOUR CODE HERE

result = # YOUR CODE HERE
result.toPandas()

As an alternative, you can also use a Python decorator for declaring a UDF:

In [None]:
# YOUR CODE HERE

result = df.select(html_encode('name').alias('html_name'))
result.toPandas()

## Complex return types

PySpark also supports complex return types, for example structs or arrays.

In [None]:
@udf(StructType([
    StructField("org_name", StringType()), 
    StructField("html_name", StringType())
]))
def html_encode(s):
    return (s,html.escape(s))

result = df.select(html_encode('name').alias('both_names'))
result.toPandas()
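
Arrays work in much the same way: the UDF returns a Python list and declares an `ArrayType`. A small sketch, where `split_pair` and `add_members` are hypothetical names and the input is assumed to be a string column `name` whose entries are joined by `&`:

```python
def split_pair(name):
    """Split a string like 'Alice & Bob' into ['Alice', 'Bob']."""
    if name is None:
        return None
    return [part.strip() for part in name.split("&")]


def add_members(df):
    # Hypothetical helper; imports are local so split_pair can be tried without Spark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    split_pair_udf = udf(split_pair, ArrayType(StringType()))
    return df.withColumn("members", split_pair_udf(df["name"]))
```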

## SQL Support

If you want to use the Python UDF inside a SQL query, you also need to register it, so PySpark knows its name.
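
A hedged sketch of the registration step, using a deliberately different function so that it only illustrates the mechanics. The names `name_length`, `run_sql_example` and the view name `people` are assumptions made for this example; `spark.udf.register` takes the SQL name, the Python function and the return type.

```python
def name_length(s):
    """Length of a string, or None for missing values."""
    return None if s is None else len(s)


def run_sql_example(spark, df):
    # Hypothetical helper: registers the function under a SQL name and uses it in a query
    from pyspark.sql.types import IntegerType

    spark.udf.register("name_length", name_length, IntegerType())
    df.createOrReplaceTempView("people")
    return spark.sql("SELECT name, name_length(name) AS name_len FROM people")
```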

In [None]:
html_encode = # YOUR CODE HERE

df.createOrReplaceTempView("famous_pairs")
result = # YOUR CODE HERE
result.toPandas()

# Pandas UDFs

"Normal" Python UDFs are pretty expensive (in terms of execution time), since for every record the following steps need to be performed:
* the record is serialized inside the JVM
* the record is sent to an external Python process
* the record is deserialized inside Python
* the record is processed in Python
* the result is serialized in Python
* the result is sent back to the JVM
* the result is deserialized and stored inside the result DataFrame

This does not only sound like a lot of work, it actually is. Therefore Python UDFs are an order of magnitude slower than native UDFs written in Scala or Java, which run directly inside the JVM.
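
The per-record steps above can be pictured with a toy model in plain Python. It uses `pickle` to stand in for the serialization between the JVM and the Python worker. This is only an illustration of where the work goes, not Spark's actual wire protocol, and `roundtrip_per_record` is a hypothetical name.

```python
import pickle


def roundtrip_per_record(values, func):
    """Toy model: every record is serialized, shipped, processed and shipped back."""
    results = []
    for value in values:
        payload = pickle.dumps(value)          # serialize on the "JVM" side
        argument = pickle.loads(payload)       # deserialize on the "Python" side
        answer = pickle.dumps(func(argument))  # process, then serialize the result
        results.append(pickle.loads(answer))   # deserialize back on the "JVM" side
    return results
```

Each record pays for two serializations and two deserializations around a single, often trivial, function call.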

Since Spark 2.3, *Pandas UDFs* offer an alternative approach for defining Python UDFs. Pandas is a commonly used Python framework which also offers DataFrames (but Pandas DataFrames, not Spark DataFrames). Inside the JVM, Spark converts batches of rows of a Spark DataFrame into a columnar memory buffer by using a library called *Arrow*. Python can then treat this buffer as a Pandas DataFrame or Series and work on a whole batch at once.

This approach has two major advantages:
* Very little serialization and deserialization overhead, since the Arrow buffers are exchanged between the JVM and Python in a format that Pandas can use almost directly
* Pandas has lots of very efficient implementations in C for many functions

Due to these two facts, Pandas UDFs are usually much faster and should be preferred over traditional Python UDFs whenever possible.

In [None]:
r = spark.range(0,100)
df = r.withColumn('v', r.id.cast("double")).withColumn("group", r.id % 5)
df.limit(10).toPandas()

## Classic UDF Approach

As an example, let's create a function which simply increments a numeric column by one. First let us have a look using a traditional Python UDF:

In [None]:
from pyspark.sql.functions import udf


# Use udf to define a row-at-a-time UDF.
# Input and output are both a single double value.
@udf('double')
def plus_one(v):
    return v + 1

result = df.withColumn('v2', plus_one(df.v))
result.limit(10).toPandas()

## Pandas UDF

Increment a value using a Pandas UDF. The Pandas UDF receives a `pandas.Series` object and also has to return a `pandas.Series` object.
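
To illustrate the Series-in, Series-out contract with a different function, here is a small sketch. The function `celsius_to_fahrenheit` and the helper `as_pandas_udf` are hypothetical names. The helper assumes that PyArrow is installed alongside PySpark and relies on `pandas_udf` defaulting to a scalar UDF.

```python
import pandas as pd


def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    """Vectorized conversion: operates on a whole batch of values at once."""
    return c * 9.0 / 5.0 + 32.0


def as_pandas_udf():
    # Hypothetical helper: wraps the function as a scalar Pandas UDF returning doubles
    from pyspark.sql.functions import pandas_udf

    return pandas_udf(celsius_to_fahrenheit, "double")
```

The function body contains no loop. The arithmetic is applied to the complete `pandas.Series` in one step, which is where most of the speed-up over a row-at-a-time UDF comes from.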

In [None]:
from pyspark.sql.functions import PandasUDFType, pandas_udf


# YOUR CODE HERE

result = # YOUR CODE HERE
result.limit(10).toPandas()

## Grouped Pandas Aggregate UDFs

Since version 2.4.0, Spark also supports Pandas aggregation functions. This is the only way to implement custom aggregation functions in Python. Note that this type of UDF does not support partial aggregation and all data for a group or window will be loaded into memory.
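
A grouped aggregate UDF receives a `pandas.Series` per group and returns a single scalar. The following sketch uses a hypothetical aggregation `value_range` (maximum minus minimum) and a hypothetical helper `as_grouped_agg_udf`. The helper uses the `PandasUDFType.GROUPED_AGG` style from Spark 2.4, which newer Spark versions still accept but mark as deprecated in favour of Python type hints.

```python
import pandas as pd


def value_range(v: pd.Series) -> float:
    """Reduce all values of one group to a single number."""
    return float(v.max() - v.min())


def as_grouped_agg_udf():
    # Hypothetical helper: wraps the function as a grouped aggregate Pandas UDF.
    # Possible usage: df.groupBy("group").agg(range_udf(df["v"]))
    from pyspark.sql.functions import PandasUDFType, pandas_udf

    return pandas_udf(value_range, "double", PandasUDFType.GROUPED_AGG)
```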

In [None]:
# YOUR CODE HERE

result = # YOUR CODE HERE
result.toPandas()

## Grouped Pandas Map UDFs
The scalar Pandas UDF shown earlier transforms all records independently and only one column at a time. Spark also offers a so-called *grouped Pandas UDF*, which operates on complete groups of records (as created by a `groupBy` method). This is a good way to make up for the *User Defined Aggregation Functions* (UDAFs) that are missing in PySpark.

For example, let's subtract the mean of a group from all entries of that group. In Spark this could be achieved directly by using windowed aggregations. But let's first have a look at a Python implementation which does not use Pandas grouped UDFs.

In [None]:
import pandas as pd


@udf(ArrayType(DoubleType()))
def subtract_mean(values):
    series = pd.Series(values)
    center = series - series.mean()
    return [x for x in center]

groups = df.groupBy('group').agg(collect_list(df.v).alias('values'))
result = groups.withColumn('center', explode(subtract_mean(groups.values))).drop('values')
result.limit(10).toPandas()

This example is also incomplete, since the `id` column is now missing.

### Using Pandas Grouped UDFs

Now let's try to implement the same function using a Pandas grouped UDF.
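
Before wiring this into Spark, it can help to see what a grouped map function does to the data. The sketch below imitates the behaviour locally in plain Pandas: the data is split by key, the function is called once per group with a complete Pandas DataFrame, and the results are concatenated. The helper `apply_per_group` and the small `local` DataFrame are hypothetical and only stand in for what Spark does across the cluster.

```python
import pandas as pd


def subtract_mean(pdf):
    # Receives all rows of one group and returns a DataFrame of the same shape
    return pdf.assign(v=pdf.v - pdf.v.mean())


def apply_per_group(pdf, key, func):
    """Local stand-in for a grouped map: split, apply per group, combine."""
    parts = [func(group) for _, group in pdf.groupby(key)]
    return pd.concat(parts).sort_index()


local = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "v": [0.0, 1.0, 2.0, 3.0],
    "group": [0, 1, 0, 1],
})
centered = apply_per_group(local, "group", subtract_mean)
```

Because the function returns complete rows, all other columns, including `id`, survive the transformation.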

In [None]:
# YOUR CODE HERE
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

result = # YOUR CODE HERE
result.limit(10).toPandas()

# Example of grouped regressions

In this section, we want to demonstrate a slightly more advanced example: using Pandas grouped transformations to perform many ordinary least squares model fits in parallel. We reuse the weather data and try to predict the temperature of all stations with a very simple model per station.

In [None]:
%matplotlib inline

### Load Data
First we load data of a single year.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

In [None]:
from pyspark.sql.functions import *
from pyspark.sql.types import *


rawWeatherData = spark.read.text(storageLocation + "/2003")
weather_all = rawWeatherData.select(
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    to_timestamp(substring(col("value"),16,12),"yyyyMMddHHmm").alias("timestamp"),
    to_timestamp(substring(col("value"),16,12),"yyyyMMddHHmm").cast("long").alias("ts"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)

## Analysis of one station

First we only analyse a single station, just to check our approach and the expressiveness of our model. It won't be a very good fit, but it will be good enough for our needs to demonstrate the concept.

So first we pick a single station, and we also keep only those records with a valid temperature measurement.

In [None]:
weather_single = weather_all.where("usaf='954920' and wban='99999'").cache()

In [None]:
pdf = # YOUR CODE HERE
pdf

### Create Feature Space

Our model will simply predict the temperature depending on the time of day and the day of the year. We use the sine and cosine of the timestamp, once with a day-wide period and once with a year-wide period, as features for fitting the model.

In [None]:
import math

import numpy as np


seconds_per_day = 24*60*60
seconds_per_year = 365*seconds_per_day

# Add sin and cos as features for fitting
pdf['daily_sin'] = np.sin(pdf['ts']/seconds_per_day*2.0*math.pi)
pdf['daily_cos'] = np.cos(pdf['ts']/seconds_per_day*2.0*math.pi)
pdf['yearly_sin'] = np.sin(pdf['ts']/seconds_per_year*2.0*math.pi)
pdf['yearly_cos'] = np.cos(pdf['ts']/seconds_per_year*2.0*math.pi)

# Make a plot, just to check what it looks like
pdf[0:200].plot(x='timestamp', y=['daily_sin','daily_cos','air_temperature'], figsize=[16,6])

### Fit model

Now that we have the temperature and some features, we fit a simple model.
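
Written out, the model fitted in the next cell has the following form, where $t$ is the timestamp `ts` in seconds, $P_d$ is `seconds_per_day` and $P_y$ is `seconds_per_year`. The symbols $\beta_0, \dots, \beta_5$ are notation introduced here for the fitted coefficients: $\beta_0$ belongs to the constant term and $\beta_1$ to the linear term in `ts`.

$$
T(t) \approx \beta_0 + \beta_1 t
  + \beta_2 \sin\left(\frac{2\pi t}{P_d}\right) + \beta_3 \cos\left(\frac{2\pi t}{P_d}\right)
  + \beta_4 \sin\left(\frac{2\pi t}{P_y}\right) + \beta_5 \cos\left(\frac{2\pi t}{P_y}\right)
$$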

In [None]:
import statsmodels.api as sm


# define target variable y
y = pdf['air_temperature']
# define feature variables X
X = pdf[['ts', 'daily_sin', 'daily_cos', 'yearly_sin', 'yearly_cos']]
X = sm.add_constant(X)
# fit model
model = sm.OLS(y, X).fit()

# perform prediction
pdf['pred'] = model.predict(X)

# Make a plot of real temperature vs predicted temperature
pdf[0:200].plot(x='timestamp', y=['pred','air_temperature'], figsize=[16,6])

### Inspect Model

Now let us inspect the model, in order to find a way to store it in a Pandas DataFrame.

In [None]:
# YOUR CODE HERE

In [None]:
type(model.params)

Create a DataFrame from the model parameters:

In [None]:
x_columns = X.columns
pd.DataFrame([[model.params[i] for i in  x_columns]], columns=x_columns)

## Perform OLS for all stations

Now we want to create a model for all stations. First we filter the data again, such that we only have valid temperature measurements.

In [None]:
valid_weather = weather_all.filter(weather_all.air_temperature_qual == 1)

### Feature extraction

Now we generate the same features, but this time we use Spark instead of Pandas operations. This simplifies later model fitting.

In [None]:
import math


seconds_per_day = 24*60*60
seconds_per_year = 365*seconds_per_day

features = valid_weather.select(
    valid_weather.usaf,
    valid_weather.wban,
    valid_weather.air_temperature,
    valid_weather.ts,
    lit(1.0).alias('const'),
    sin(valid_weather.ts * 2.0 * math.pi / seconds_per_day).alias('daily_sin'),
    cos(valid_weather.ts * 2.0 * math.pi / seconds_per_day).alias('daily_cos'),
    sin(valid_weather.ts * 2.0 * math.pi / seconds_per_year).alias('yearly_sin'),
    cos(valid_weather.ts * 2.0 * math.pi / seconds_per_year).alias('yearly_cos')
)

features.limit(10).toPandas()

### Fit Models

Now we use a Spark Pandas grouped UDF in order to fit models for all weather stations in parallel.
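
One possible shape for the per-station function is sketched below. It is an assumption-laden illustration rather than the intended solution: it repeats the column lists used in this section, uses `numpy.linalg.lstsq` instead of `statsmodels` to keep the dependencies small, and the names `fit_station` and `fit_all` are hypothetical. The Spark wiring assumes Spark 3.0 or later, where `applyInPandas` is available on grouped data.

```python
import numpy as np
import pandas as pd

group_columns = ["usaf", "wban"]
y_column = "air_temperature"
x_columns = ["ts", "const", "daily_sin", "daily_cos", "yearly_sin", "yearly_cos"]


def fit_station(pdf):
    """Fit one least squares model for the rows of a single station."""
    keys = [pdf[c].iloc[0] for c in group_columns]
    X = pdf[x_columns].to_numpy(dtype=float)
    y = pdf[y_column].to_numpy(dtype=float)
    # Least squares solution of X @ coef = y (swapped in for sm.OLS)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # One output row per station: the group keys followed by the coefficients
    return pd.DataFrame([keys + list(coef)], columns=group_columns + x_columns)


def fit_all(features, schema):
    # Hypothetical helper: runs fit_station once per station, in parallel across groups
    return features.groupBy(*group_columns).applyInPandas(fit_station, schema=schema)
```

Returning a single row per group turns the grouped map into a kind of aggregation, which is what allows one model per station to be collected into an ordinary Spark DataFrame.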

In [None]:
group_columns = ['usaf', 'wban']
y_column = 'air_temperature'
x_columns = ['ts', 'const', 'daily_sin', 'daily_cos', 'yearly_sin', 'yearly_cos']
schema = features.select(*group_columns, *x_columns).schema

# YOUR CODE HERE

models = # YOUR CODE HERE

In [None]:
models.limit(10).toPandas()

## Inspect and compare results

Now let's pick the same station again, and compare the model to the original model.

In [None]:
models.where("usaf='954920' and wban='99999'").toPandas()

In [None]:
model.params