
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

## Higher-order functions

Higher-order functions in Spark SQL allow you to work directly with complex data types. When working with hierarchical data, as we were in the previous lesson and lab, records are frequently stored as array or map type objects. This lesson demonstrates how to use higher-order functions to transform, filter, and flag array data while preserving the original structure. In this notebook, we work strictly with arrays of strings; in the subsequent notebook, you will work with more functions and numerical data. Skilled application of these functions can make your work with this kind of data faster, more powerful, and more reliable.

In this notebook, you will: 
* Apply higher-order functions (`TRANSFORM`, `FILTER`, `EXISTS`) to arrays of strings 

Run the following queries to learn about how to work with higher-order functions.

## Getting Started
Run the cell below to set up your classroom environment.

In [0]:
%run ../Includes/Classroom-Setup

## Working with Text Data

We can use higher-order functions to easily work with arrays of text data. The exercises in this section are meant to demonstrate the `TRANSFORM`, `FILTER`, and `EXISTS` functions to manipulate data and create flags for when a value does or does not exist. 

These examples use data collected about Databricks blog posts. Run the cell below to create the table. Then, run the next cell to view the schema. 

In this data set, the `authors` and `categories` columns are both ArrayType; we'll be using these columns with higher-order functions.

In [0]:
%sql
CREATE TABLE IF NOT EXISTS DatabricksBlog
  USING json
  OPTIONS (
    path "dbfs:/mnt/training/databricks-blog.json",
    inferSchema "true"
  )

In [0]:
%sql
DESCRIBE DatabricksBlog

col_name,data_type,comment
authors,array,
categories,array,
content,string,
creator,string,
dates,struct,
description,string,
id,bigint,
link,string,
slug,string,
status,string,


### Filter

[Filter](https://spark.apache.org/docs/latest/api/sql/#filter) allows us to create a new column that keeps only the values in an array that meet a certain condition. Let's say we want to remove the category `"Engineering Blog"` from all records in our `categories` column. We can use the `FILTER` function to create a new column that excludes that value from each array.

Let's dissect this line of code to better understand the function:

`FILTER (categories, category -> category <> "Engineering Blog") woEngineering`

**`FILTER`** : the name of the higher-order function <br>
**`categories`** : the name of our input array <br>
**`category`** : the name of the iterator variable. You choose this name and then use it in the lambda function. It iterates over the array, cycling each value into the function one at a time.<br>
**`->`** :  Indicates the start of a function <br>
**`category <> "Engineering Blog"`** : This is the function. Each value is checked to see if it **is different** from the value `"Engineering Blog"`. If it is, it is kept in the new column, `woEngineering`
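
For readers who know Python, the same idea can be sketched outside Spark. This is only a plain-Python analogue meant to illustrate what the lambda does to each array; `filter_array` is a hypothetical helper, not a Spark or Databricks API, and the sample arrays are a few values from this lesson's `categories` column.

```python
# Plain-Python analogue of:
#   FILTER(categories, category -> category <> "Engineering Blog")
# This is not Spark code; filter_array is a hypothetical helper.

def filter_array(values, predicate):
    """Return the elements of `values` for which `predicate` is true, in order."""
    return [value for value in values if predicate(value)]

# A few sample arrays taken from this lesson's `categories` column.
rows = [
    ["Company Blog", "Partners"],
    ["Apache Spark", "Engineering Blog", "Machine Learning"],
    ["Apache Spark", "Engineering Blog"],
]

# The lambda plays the role of `category -> category <> "Engineering Blog"`.
wo_engineering = [
    filter_array(categories, lambda category: category != "Engineering Blog")
    for categories in rows
]

for categories, kept in zip(rows, wo_engineering):
    print(categories, "->", kept)
```

Each input array yields exactly one output array, so the number of rows does not change; only the contents of each array do.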

In [0]:
%sql
SELECT
  categories,
  FILTER (categories, category -> category <> "Engineering Blog") woEngineering
FROM DatabricksBlog


categories,woEngineering
"List(Company Blog, Partners)","List(Company Blog, Partners)"
"List(Apache Spark, Engineering Blog, Machine Learning)","List(Apache Spark, Machine Learning)"
"List(Company Blog, Partners)","List(Company Blog, Partners)"
"List(Apache Spark, Engineering Blog)",List(Apache Spark)
"List(Apache Spark, Engineering Blog)",List(Apache Spark)
"List(Apache Spark, Ecosystem, Engineering Blog)","List(Apache Spark, Ecosystem)"
"List(Company Blog, Customers)","List(Company Blog, Customers)"
"List(Apache Spark, Engineering Blog)",List(Apache Spark)
"List(Announcements, Company Blog)","List(Announcements, Company Blog)"
"List(Apache Spark, Engineering Blog)",List(Apache Spark)


### Filter, subqueries, and `WHERE`

You may write a filter that produces a lot of empty arrays in the created column. When that happens, it can be useful to use a `WHERE` clause to show only non-empty array values in the returned column. 

In this example, we accomplish that by using a subquery. A **subquery** in SQL is a query within a query. Subqueries are useful for performing an operation in multiple steps. In this case, we use one to create the named column that we then reference in the `WHERE` clause.

In [0]:
%sql
SELECT
  *
FROM
  (
    SELECT
      authors, title,
      FILTER(categories, category -> category = "Engineering Blog") AS blogType
    FROM
      DatabricksBlog
  )
WHERE
  size(blogType) > 0

authors,title,blogType
List(Tathagata Das),Apache Spark 0.9.1 Released,List(Engineering Blog)
"List(Michael Armbrust, Reynold Xin)",Spark SQL: Manipulating Structured Data Using Apache Spark,List(Engineering Blog)
List(Patrick Wendell),Apache Spark 0.9.0 Released,List(Engineering Blog)
"List(Ali Ghodsi, Ahir Reddy)",Apache Spark In MapReduce (SIMR),List(Engineering Blog)
"List(Jai Ranganathan, Matei Zaharia)",Apache Spark: A Delight for Developers,List(Engineering Blog)
List(Ion Stoica),Apache Spark Now a Top-level Apache Project,List(Engineering Blog)
"List(Ahir Reddy, Reynold Xin)",AMPLab updates the Big Data Benchmark,List(Engineering Blog)
List(Ion Stoica),Apache Spark and Hadoop: Working Together,List(Engineering Blog)
List(Patrick Wendell),Apache Spark 0.8.1 Released,List(Engineering Blog)
List(Pat McDonough),Putting Apache Spark to Use: Fast In-Memory Computing for Your Big Data Applications,List(Engineering Blog)
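
The two-step shape of this query (compute a filtered column first, then keep only rows where it is non-empty) can be sketched in plain Python. This is an illustrative analogue rather than Spark code, and the rows below are hypothetical placeholders, not records from the data set.

```python
# Plain-Python analogue of the subquery + WHERE pattern.
# The rows are hypothetical placeholders, not records from the data set.
posts = [
    {"title": "Post A", "categories": ["Apache Spark", "Engineering Blog"]},
    {"title": "Post B", "categories": ["Company Blog", "Partners"]},
    {"title": "Post C", "categories": ["Apache Spark", "Ecosystem", "Engineering Blog"]},
]

# Step 1 (the subquery): build a named blogType column with a filter.
with_blog_type = [
    {
        "title": post["title"],
        "blogType": [c for c in post["categories"] if c == "Engineering Blog"],
    }
    for post in posts
]

# Step 2 (the outer WHERE): keep only rows whose blogType array is non-empty,
# mirroring `WHERE size(blogType) > 0`.
non_empty = [row for row in with_blog_type if len(row["blogType"]) > 0]

for row in non_empty:
    print(row["title"], row["blogType"])
```

The second step can refer to `blogType` only because the first step has already created it, which is the reason the SQL version wraps the `FILTER` call in a subquery.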


### Exists

[Exists](https://spark.apache.org/docs/latest/api/sql/#exists) tests whether a statement is true for one or more elements in an array. Let's say we want to flag all blog posts with `"Company Blog"` in the categories field. We can use the `EXISTS` function to mark which entries include that category.

Let's dissect this line of code to better understand the function: 

`EXISTS (categories, c -> c = "Company Blog") companyFlag`

**`EXISTS`** : the name of the higher-order function <br>
**`categories`** : the name of our input array <br>
**`c`** : the name of the iterator variable. You choose this name and then use it in the lambda function. It iterates over the array, cycling each value into the function one at a time. Note that we're using the same kind of references as in the previous command, but we name the iterator with a single letter.<br>
**`->`** :  Indicates the start of a function <br>
**`c = "Company Blog"`** : This is the function. Each value is checked to see if it **is the same as** the value `"Company Blog"`. If at least one value in the array matches, the row is flagged as true in the new column, `companyFlag`
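
As a rough plain-Python analogue (not Spark code), the behavior resembles Python's built-in `any()`: the result is a single boolean per array rather than a new array. `exists_in_array` is a hypothetical helper name, and the sample arrays are a few values from this lesson's `categories` column.

```python
# Plain-Python analogue of:
#   EXISTS(categories, c -> c = "Company Blog")
# This is not Spark code; exists_in_array is a hypothetical helper.

def exists_in_array(values, predicate):
    """Return True if `predicate` is true for at least one element of `values`."""
    return any(predicate(value) for value in values)

# A few sample arrays taken from this lesson's `categories` column.
rows = [
    ["Company Blog", "Partners"],
    ["Apache Spark", "Engineering Blog", "Machine Learning"],
    ["Announcements", "Company Blog"],
]

# The lambda plays the role of `c -> c = "Company Blog"`.
company_flag = [
    exists_in_array(categories, lambda c: c == "Company Blog")
    for categories in rows
]

for categories, flag in zip(rows, company_flag):
    print(categories, "->", flag)
```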

In [0]:
%sql
SELECT
  categories,
  EXISTS (categories, c -> c = "Company Blog") companyFlag
FROM DatabricksBlog


categories,companyFlag
"List(Company Blog, Partners)",True
"List(Apache Spark, Engineering Blog, Machine Learning)",False
"List(Company Blog, Partners)",True
"List(Apache Spark, Engineering Blog)",False
"List(Apache Spark, Engineering Blog)",False
"List(Apache Spark, Ecosystem, Engineering Blog)",False
"List(Company Blog, Customers)",True
"List(Apache Spark, Engineering Blog)",False
"List(Announcements, Company Blog)",True
"List(Apache Spark, Engineering Blog)",False


### Transform

[Transform](https://spark.apache.org/docs/latest/api/sql/#transform) uses the provided function to transform all elements of an array. Many of SQL's built-in functions, such as `LOWER`, are designed to operate on a single, simple value within a cell and cannot be applied directly to an array. Transform is particularly useful when you want to apply such a function to each element in an array. In this case, we want to rewrite all of the names in the `categories` column in lowercase.

Let's dissect this line of code to better understand the function: 

`TRANSFORM(categories, cat -> LOWER(cat)) lwrCategories`

**`TRANSFORM`** : the name of the higher-order function <br>
**`categories`** : the name of our input array <br>
**`cat`** : the name of the iterator variable. You choose this name and then use it in the lambda function. It iterates over the array, cycling each value into the function one at a time. Note that we're using the same kind of references as in the previous command, but we name the iterator with a new variable.<br>
**`->`** :  Indicates the start of a function <br>
**`LOWER(cat)`** : This is the function. For each value in the input array, the built-in function `LOWER()` is applied to transform the word to lowercase.
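
In plain Python, the closest analogue is a list comprehension that applies a function to every element. The sketch below is illustrative only and is not Spark code; `transform_array` is a hypothetical helper, and the sample arrays are a few values from this lesson's `categories` column.

```python
# Plain-Python analogue of:
#   TRANSFORM(categories, cat -> LOWER(cat))
# This is not Spark code; transform_array is a hypothetical helper.

def transform_array(values, func):
    """Apply `func` to every element of `values`, preserving order and length."""
    return [func(value) for value in values]

# A few sample arrays taken from this lesson's `categories` column.
rows = [
    ["Company Blog", "Partners"],
    ["Apache Spark", "Engineering Blog", "Machine Learning"],
    ["Apache Spark", "Ecosystem", "Engineering Blog"],
]

# The lambda plays the role of `cat -> LOWER(cat)`.
lwr_categories = [
    transform_array(categories, lambda cat: cat.lower())
    for categories in rows
]

for categories, lowered in zip(rows, lwr_categories):
    print(categories, "->", lowered)
```

Unlike the filter analogue, every output array here has the same length as its input; only the individual values change.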

In [0]:
%sql
SELECT 
  TRANSFORM(categories, cat -> LOWER(cat)) lwrCategories
FROM DatabricksBlog

lwrCategories
"List(company blog, partners)"
"List(apache spark, engineering blog, machine learning)"
"List(company blog, partners)"
"List(apache spark, engineering blog)"
"List(apache spark, engineering blog)"
"List(apache spark, ecosystem, engineering blog)"
"List(company blog, customers)"
"List(apache spark, engineering blog)"
"List(announcements, company blog)"
"List(apache spark, engineering blog)"


In [0]:
%run ../Includes/Classroom-Cleanup


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>