# Lesson 17 - Column Functions

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from pyspark.sql.types import StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

## Built-In Functions

Spark provides many built-in functions that can be applied to column objects. These functions are stored in the [`pyspark.sql.functions`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?#module-pyspark.sql.functions) module, which is often imported under the alias `F`, as shown in the import statement below.

In [0]:
import pyspark.sql.functions as F

This module contains both element-wise functions and aggregation functions that can be applied to columns. We will discuss both types of functions in this lesson and provide a few examples of each type. A complete list of available functions can be found in the [Apache Spark Documentation](https://spark.apache.org/docs/3.0.1/api/sql/#lower).

We will create a small DataFrame to use in illustrating the concepts introduced in this lesson.

In [0]:
my_schema = 'name STRING, x1 DOUBLE, x2 INTEGER'

data = [
  ['Emma White', 5.2, 215],
  ['Art Brown', 4.1, 473],
  ['Carly Black', 3.7, 260],
  ['Beth Green', 4.5, 303],
  ['Dan Gray', 2.9, 185]
]

df = spark.createDataFrame(data, schema=my_schema)

df.show()

## Element-Wise Functions


Element-wise functions represent an operation that is to be applied independently to every element of a column in a DataFrame. In the cell below, we provide an example that makes use of the following element-wise functions:

* `F.upper()` - This function can be applied to `string` columns. It returns a column in which all of the strings have been converted to upper case. 
* `F.length()` - This function can be applied to `string` columns. It returns a column containing the lengths of the strings in the original column. 
* `F.exp()` - This function accepts a column of numerical values and applies the natural exponential function to each value in the column.
* `F.log()` - This function accepts a column of numerical values and applies the natural logarithm function to each value in the column.

Note that in the example below, we use `alias()` to name the newly created column. If we did not do this then the name of the new column would be an unsightly string representing the operation used to create the column.

In [0]:
df.select(
    F.upper(col('name')).alias('name_upper'),
    F.length(col('name')).alias('name_length'),
    F.exp(col('x1')).alias('exp_x1'),
    F.log(col('x2')).alias('log_x2')
).show()

### Rounding

The built-in `F.round()` function can be used to round values in a numerical column. It accepts two arguments. The first is the column to which the function is being applied, and the second is the number of decimal digits to which the values should be rounded. We will modify our previous example by rounding the third and fourth columns to 2 and 4 decimal places, respectively.

In [0]:
df.select(
    F.upper(col('name')).alias('name_upper'),
    F.length(col('name')).alias('name_length'),
    F.round(F.exp(col('x1')),2).alias('exp_x1'),
    F.round(F.log(col('x2')),4).alias('log_x2')
).show()

### Using Functions in SQL Expression Strings

We can also use `expr()` and SQL expression strings to apply functions to columns. When doing so, the expression string is sent to Spark, where it is parsed and executed. Since the string gets parsed on the backend by Spark, we do not need to import any modules or functions other than `expr()` itself when using this approach. Most of the functions found in `pyspark.sql.functions` have an SQL equivalent of the same name. Note that SQL function names are not case-sensitive, and you will often see them written in all-caps.

The cell below illustrates how to use `expr()` and SQL expression strings to recreate the previous example that we considered.

In [0]:
df.select(
    expr('UPPER(name) AS name_upper'),
    expr('LENGTH(name) AS name_length'),
    expr('ROUND(EXP(x1), 2) AS exp_x1'),
    expr('ROUND(LOG(x2), 4) AS log_x2')
).show()

## Aggregation Functions

Aggregation functions combine all of the entries in a column into a single value, and can be applied using the `select()` method. In the example below, we will illustrate the use of the following aggregation functions:

* `F.sum()` - Returns the sum of the elements in a column. 
* `F.mean()` - Returns the arithmetic mean of the elements in a column. 
* `F.stddev()` - Returns the standard deviation of the elements in a column. 
* `F.min()` - Returns the minimum of the elements in a column. 
* `F.max()` - Returns the maximum of the elements in a column. 

Again, we will use `alias()` to assign friendly names to the new columns.

In [0]:
df.select(
    F.sum(col('x2')).alias('sum_x2'),
    F.mean(col('x2')).alias('mean_x2'),
    F.stddev(col('x2')).alias('stddev_x2'),
    F.min(col('x2')).alias('min_x2'),
    F.max(col('x2')).alias('max_x2')
).show()

We will now recreate the results above, this time using SQL expression strings.

In [0]:
df.select(
    expr('SUM(x2) AS sum_x2'),
    expr('MEAN(x2) AS mean_x2'),
    expr('STDDEV(x2) AS stddev_x2'),
    expr('MIN(x2) AS min_x2'),
    expr('MAX(x2) AS max_x2')
).show()

### Rules for Applying Multiple Column Functions

You can include several element-wise functions within a single call to `select()`, and you can include several aggregations in a single such call. However, you cannot mix row-level columns (including the results of element-wise functions) and aggregations within the same call to `select()`. The column returned by an element-wise function contains one entry for every row in the original DataFrame, while the column returned by an aggregation contains only a single value, so the two cannot be placed side by side. Applying an element-wise function to the result of an aggregation, as in `F.round(F.mean(col('x2')), 2)`, is allowed because the combined expression still yields a single value.

## User-Defined Functions

There are occasions when you need to apply a function to a column in a DataFrame, and that function is not provided as a built-in function. In this scenario, you can create a **user-defined function**, or **UDF**. The general process for creating a UDF is to write a Python function that performs the desired operation on a single element, and then use Spark tools to create a column function that applies the Python function to each element in a DataFrame column.

The details of the approach used to create a UDF vary depending on whether we plan to apply the function to a column object created by `col()` or to use the UDF inside a SQL expression string provided to `expr()`.

### Creating a UDF in Python

Suppose that we want to create a function in Python that can be applied to a column object created by `col()`. To do so, we start by writing a Python function whose parameters are assumed to be values selected from one or more columns. For the sake of discussion, let's call this function `my_func`. We then use the `F.udf()` function to create a UDF that applies `my_func` to each element of a column. We will pass two arguments to `F.udf()`. The first is the function we intend to apply element-wise (`my_func`, in our current example). The second argument is a Spark type object (such as `IntegerType()` or `StringType()`) used to specify the type of the column returned by the UDF. If the second argument is omitted, Spark assumes `StringType()`.

We will now demonstrate this process by creating a UDF that extracts the last names of individuals in our dataset. We perform the following tasks in the cell below:
1. We create a Python function that performs the desired operation on strings of the type stored in the `name` column. 
2. We use `F.udf()` to create a function that accepts a column object as an input, and then performs the desired operation on individual strings within that column. 
3. We use `select()` to apply the new UDF to the `name` column.

In [0]:
def get_last_name(name):
    tokens = name.split(' ')
    return tokens[-1]

get_last_name_udf = F.udf(get_last_name, StringType())

df.select(
    '*', 
    get_last_name_udf(col('name')).alias('last_name')
).show()

The cell below provides another example of creating a UDF. In this example, we create a UDF that sums the digits of the elements contained within an integer column. We then apply this UDF to the column `x2`.

In [0]:
def sum_digits(n):
    temp = n
    total = 0
    while temp > 0:
        total += temp % 10
        temp //= 10  # floor division drops the last digit and stays exact for large integers
    return total

sum_digits_udf = F.udf(sum_digits, IntegerType())

df.select(
    '*', 
    sum_digits_udf(col('x2')).alias('digit_sum')
).show()
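Because a UDF is built from an ordinary Python function, that function can be checked on plain Python values before it is wrapped for Spark. The sketch below does this for a hypothetical variant, `sum_digits_safe` (a name introduced here, not part of the lesson), which also returns `None` for a missing value and ignores the sign of negative numbers. Both behaviors are assumptions about what would be wanted, not requirements stated in the lesson.

```python
def sum_digits_safe(n):
    """Hypothetical variant of a digit-sum helper: None-safe and sign-insensitive."""
    if n is None:
        return None  # a null cell stays null
    total = 0
    temp = abs(n)
    while temp > 0:
        # divmod returns (quotient, remainder): the remaining digits and the last digit.
        temp, digit = divmod(temp, 10)
        total += digit
    return total

# The x2 values from the lesson's DataFrame.
x2_values = [215, 473, 260, 303, 185]
digit_sums = [sum_digits_safe(v) for v in x2_values]
print(digit_sums)  # [8, 14, 8, 6, 14]

# Edge cases that a real column might contain.
print(sum_digits_safe(None), sum_digits_safe(0), sum_digits_safe(-47))  # None 0 11
```

Once the plain function behaves as expected, it can be passed to `F.udf()` together with `IntegerType()` in the same way as the functions in the cells above.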

### Registering UDFs to use in SQL-based Expression Strings

The UDFs we created in the cells above are Python functions. They accept column objects as input and return column objects as output. When the resulting column objects are provided to `select()` and an action is called, Spark will perform the desired operations. 

We can also create new UDFs directly within the Spark backend using `spark.udf.register()`. The first argument is a string representing the name of the new SQL function we are creating. The second argument is the Python function that defines the element-wise operation represented by our new function. An optional third argument specifies the return type; if it is omitted, the registered function returns a string column. Registering a function does not create a Python variable with the registered name, but the name can be used in SQL-based expression strings provided to `expr()`. The call to `register()` also returns a UDF object that can be assigned to a variable if it is needed in Python.

We illustrate the use of this technique in the cell below. Notice that this approach results in cleaner code, as is often the case when using `expr()`. Also, this technique does not require us to import any special functions or classes.

In [0]:
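# With no return type given, a registered UDF returns a string column.
spark.udf.register('GET_LAST_NAME', get_last_name)
# Pass the return type as a DDL string so that digit_sum is an integer column.
spark.udf.register('SUM_DIGITS', sum_digits, 'int')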

df.select(
    '*', 
    expr('GET_LAST_NAME(name) AS last_name'),
    expr('SUM_DIGITS(x2) AS digit_sum')
).show()