In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local[1]") \
                    .appName('test') \
                    .getOrCreate()

23/10/16 17:36:45 WARN Utils: Your hostname, codespaces-d00206 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
23/10/16 17:36:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/16 17:36:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Where to Look for APIs

DataFrame (Dataset) Methods

A DataFrame is just a Dataset of Row types, so the methods you are looking for are documented under Dataset, at this link:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html

Dataset submodules like

DataFrameStatFunctions https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
and
DataFrameNaFunctions https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html

have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a
variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are relevant when working with null data.


Column Methods

These were introduced for the most part in Chapter 5. They hold a variety of general column-related methods like alias or contains. You can find the API Reference for Column methods here: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html

org.apache.spark.sql.functions contains a variety of functions for a range of different data types. Often, you’ll see the entire package imported because these functions are used so frequently. You can find SQL and DataFrame functions here: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html

In [3]:
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("../data/retail-data/by-day/2010-12-01.csv")

df.printSchema()
df.createOrReplaceTempView('dfTable')


root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



                                                                                

# Converting to Spark Types

The lit function converts a value from another language into its corresponding Spark representation. Here’s how we can convert a few different kinds of Python values to their respective Spark types:

In [4]:
from pyspark.sql.functions import lit

df.select(lit(5), lit('five'), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

# Working with Booleans

In [5]:
from pyspark.sql.functions import col

df.where(col('InvoiceNo') != 536365).select("InvoiceNo", "Description").show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



Another option—and probably the cleanest—is to specify the predicate as an expression in a string. This is valid for Python or Scala. Note that this also gives you access to another way of expressing “does not equal”:

In [6]:
df.where('InvoiceNo <> 536365').show(5, False)

+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                  |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|536366   |22633    |HAND WARMER UNION JACK       |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536366   |22632    |HAND WARMER RED POLKA DOT    |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536367   |84879    |ASSORTED COLOUR BIRD ORNAMENT|32      |2010-12-01 08:34:00|1.69     |13047.0   |United Kingdom|
|536367   |22745    |POPPY'S PLAYHOUSE BEDROOM    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
|536367   |22748    |POPPY'S PLAYHOUSE KITCHEN    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+

You can build Boolean expressions with multiple parts by using and or or. In Spark, and filters are usually best written as a chain of sequential filters. Even when Boolean statements are expressed serially (one after the other), Spark flattens all of these filters into one statement and performs the filter at the same time, creating the and statement for us. You can write the and explicitly if you like, but chained filters are often easier to read. or statements, by contrast, must be specified within the same statement; in Python, | is the or operator on columns and & is the and operator:

In [7]:
# Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
from pyspark.sql.functions import instr

priceFilter = col('UnitPrice') > 600
descripFilter = instr(df.Description, 'POSTAGE') >= 1
df.where(df.StockCode.isin('DOT')).where(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



Boolean expressions are not just reserved to filters. To filter a DataFrame, you can also just specify a Boolean column:

In [8]:
from pyspark.sql.functions import instr

DOTCodeFilter = col('StockCode') == 'DOT'
priceFilter = col('UnitPrice') > 600
descripFilter = instr(df.Description, 'POSTAGE') >= 1
df.withColumn('isExpensive', DOTCodeFilter & (priceFilter | descripFilter)).where('isExpensive').select('unitPrice', 'isExpensive').show()

+---------+-----------+
|unitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In fact, it’s often easier to express filters as SQL statements than to use the programmatic DataFrame interface, and Spark SQL allows us to do this without paying any performance penalty. For example, the following statement is equivalent to building the isExpensive column from col('UnitPrice') > 250:

In [9]:
from pyspark.sql.functions import expr

df.withColumn('isExpensive', expr('NOT UnitPrice <= 250')).where('isExpensive').select("Description", "UnitPrice").show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



One “gotcha” that can come up is if you’re working with null data when creating Boolean expressions. If there is a null in your data, you’ll need to treat things a bit differently. Here’s how you can ensure that you perform a null-safe equivalence test:

In [10]:
df.where(col("Description").eqNullSafe("hello")).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
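The difference between ordinary equality and null-safe equality can be sketched in plain Python, using None to stand in for a SQL null. The helper names `sql_equals` and `eq_null_safe` are made up for this illustration and are not Spark APIs; this is only a model of the semantics, not how Spark implements them.

```python
def sql_equals(left, right):
    """Ordinary SQL '=': the result is unknown (None) if either side is null."""
    if left is None or right is None:
        return None
    return left == right


def eq_null_safe(left, right):
    """Null-safe equality: always True or False, and null equals null."""
    if left is None or right is None:
        return left is None and right is None
    return left == right


# Illustrative values only; None plays the role of a null Description.
descriptions = ["hello", None, "WHITE METAL LANTERN"]
print([sql_equals(d, "hello") for d in descriptions])    # [True, None, False]
print([eq_null_safe(d, "hello") for d in descriptions])  # [True, False, False]
```

A filter keeps only rows whose predicate is true, so in both cases the null row is dropped here; the two forms differ when the comparison value itself may be null.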



# Working with Numbers

To fabricate a contrived example, let’s imagine that we found out that we mis-recorded the quantity in our retail dataset and the true quantity is equal to (the current quantity * the unit price)^2 + 5. This will introduce our first numerical function as well as the pow function that raises a column to the expressed power:


In [11]:
from pyspark.sql.functions import expr, pow, col

fabricatedQuantity = pow(col('Quantity') * col('UnitPrice'), 2) + 5
df.select(expr('CustomerId'), fabricatedQuantity.alias('realQuantity')).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



Notice that we were able to multiply our columns together because they were both numerical. Naturally, we can add and subtract as necessary, too. In fact, we can do all of this as a SQL expression:

In [12]:
df.selectExpr('CustomerId', '(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity').show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



Another common numerical task is rounding. If you’d like to just round to a whole number, oftentimes you can cast the value to an integer and that will work just fine. However, Spark also has more detailed functions for performing this explicitly and to a certain level of precision. In the following example, we round to one decimal place:

In [13]:
from pyspark.sql.functions import lit, round, bround

df.select(round(col("UnitPrice"), 1).alias("rounded"), col("UnitPrice")).show(5)

+-------+---------+
|rounded|UnitPrice|
+-------+---------+
|    2.6|     2.55|
|    3.4|     3.39|
|    2.8|     2.75|
|    3.4|     3.39|
|    3.4|     3.39|
+-------+---------+
only showing top 5 rows



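By default, the round function rounds up if you’re exactly in between two numbers (HALF_UP rounding). The bround function instead rounds a tie to the nearest even number (HALF_EVEN, or banker’s rounding), so 2.5 becomes 2.0: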

In [14]:
from pyspark.sql.functions import lit, round, bround

df.select(round(lit('2.5')), bround(lit("2.5"))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows
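To see how the two tie-breaking rules differ beyond the single value 2.5, here is a small plain-Python sketch using the standard decimal module. It mirrors the HALF_UP and HALF_EVEN modes that round and bround are documented to use; it does not call Spark, and the helper names are made up for this illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value: str) -> Decimal:
    """Ties go away from zero, like Spark's round."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def round_half_even(value: str) -> Decimal:
    """Ties go to the nearest even number, like Spark's bround."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


for v in ["1.5", "2.5", "3.5"]:
    print(v, round_half_up(v), round_half_even(v))
# 1.5 2 2
# 2.5 3 2
# 3.5 4 4
```

So bround does not always round down: it only differs from round on exact ties, and then only when the lower neighbor is even.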



Another numerical task is to compute the correlation of two columns. For example, we can see the Pearson correlation coefficient for two columns to see if cheaper things are typically bought in greater quantities. We can do this through a function as well as through the DataFrame statistic methods:

In [15]:
from pyspark.sql.functions import corr

df.stat.corr('Quantity', 'UnitPrice')
df.select(corr('Quantity', 'UnitPrice')).show()

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|     -0.04112314436835551|
+-------------------------+



Another common task is to compute summary statistics for a column or set of columns. We can use the describe method to achieve exactly this. This will take all numeric and string columns and calculate the count, mean, standard deviation, min, and max (mean and standard deviation are null for columns that cannot be interpreted as numbers). You should use this primarily for viewing in the console because the schema might change in the future:

In [16]:
df.describe().show()

23/10/16 17:37:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

                                                                                

If you need these exact numbers, you can also perform this as an aggregation yourself by importing the functions and applying them to the columns that you need:

In [17]:
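from pyspark.sql.functions import count, mean, stddev_pop, min, max

df.select(count('UnitPrice'), mean('UnitPrice'), stddev_pop('UnitPrice'), min('UnitPrice'), max('UnitPrice')).show()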


There are a number of statistical functions available in the StatFunctions Package (accessible using stat as we see in the code block below). These are DataFrame methods that you can use to calculate a variety of different things. For instance, you can calculate either exact or approximate quantiles of your data using the approxQuantile method:


In [18]:
colName = 'UnitPrice'
quantileProbs = [0.5]
relError = 0.05
df.stat.approxQuantile(colName, quantileProbs, relError)

[2.51]

You also can use this to see a cross-tabulation or frequent item pairs (be careful, this output will be large and is omitted for this reason):

In [19]:
# df.stat.crosstab('StockCode', 'Quantity').show()

In [20]:
# df.stat.freqItems(["StockCode", "Quantity"]).show()

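As a last note, we can also add a unique ID to each row by using the function monotonically_increasing_id. This function generates a unique value for each row, starting with 0. The IDs are guaranteed to be increasing and unique, but they are not consecutive across partitions: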

In [21]:
from pyspark.sql.functions import monotonically_increasing_id

df.select(monotonically_increasing_id()).show(5)

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
|                            4|
+-----------------------------+
only showing top 5 rows
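The IDs above are consecutive only because this session runs with a single partition. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below reproduces that layout in plain Python; `sketch_monotonic_id` is a hypothetical helper for illustration, not a Spark function.

```python
def sketch_monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Model of the documented layout: partition in the high bits, row in the low 33 bits."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three records each (an illustrative layout, not the retail data).
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

If you need gap-free row numbers, this function is therefore not the right tool.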



# Working with Strings

The initcap function will capitalize every word in a given string when that word is separated from another by a space.

In [22]:
from pyspark.sql.functions import initcap, col

df.select(initcap(col('Description'))).show(5)

+--------------------+
|initcap(Description)|
+--------------------+
|White Hanging Hea...|
| White Metal Lantern|
|Cream Cupid Heart...|
|Knitted Union Fla...|
|Red Woolly Hottie...|
+--------------------+
only showing top 5 rows



You can also convert strings to uppercase and lowercase:

In [23]:
from pyspark.sql.functions import lower, upper

df.select(col("Description"), lower(col("Description")), upper(lower(col("Description")))).show(2)

+--------------------+--------------------+-------------------------+
|         Description|  lower(Description)|upper(lower(Description))|
+--------------------+--------------------+-------------------------+
|WHITE HANGING HEA...|white hanging hea...|     WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern|      WHITE METAL LANTERN|
+--------------------+--------------------+-------------------------+
only showing top 2 rows



Another trivial task is adding or removing spaces around a string. You can do this by using lpad, ltrim, rpad, rtrim, and trim:

In [24]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

df.select(
    ltrim(lit('   HELLO   ')).alias('ltrim'),
    rtrim(lit('   HELLO   ')).alias('rtrim'),
    trim(lit('   HELLO   ')).alias('trim'),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")
).show(2)

+--------+--------+-----+---+----------+
|   ltrim|   rtrim| trim| lp|        rp|
+--------+--------+-----+---+----------+
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
+--------+--------+-----+---+----------+
only showing top 2 rows



Note that if lpad or rpad takes a number less than the length of the string, it will always remove values from the right side of the string.
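The padding and truncation behavior can be modeled in a few lines of plain Python. This is a sketch of the semantics seen in the output above, assuming a non-empty pad string and a non-negative target length; the functions are local stand-ins and do not call Spark.

```python
def lpad(text: str, length: int, pad: str) -> str:
    """Pad on the left up to `length`; if text is longer, keep its leftmost characters."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad(text: str, length: int, pad: str) -> str:
    """Pad on the right up to `length`; if text is longer, keep its leftmost characters."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


print(repr(lpad("HELLO", 3, " ")))   # 'HEL'
print(repr(rpad("HELLO", 10, " ")))  # 'HELLO     '
print(repr(lpad("HELLO", 8, " ")))   # '   HELLO'
```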

## Regular Expressions

Spark takes advantage of the complete power of Java regular expressions. The Java regular expression syntax departs slightly from other programming languages, so it is worth reviewing before putting anything into production. There are two key functions in Spark that you’ll need in order to perform regular expression tasks: regexp_extract and regexp_replace. These functions extract values and replace values, respectively.

Let’s explore how to use the regexp_replace function to substitute color names in our description column:

In [25]:
from pyspark.sql.functions import regexp_replace

regexp_string = 'BLACK|WHITE|RED|GREEN|BLUE'
df.select(regexp_replace(col('Description'), regexp_string, 'COLOR').alias('color_clean'), col('Description')).show(2)

+--------------------+--------------------+
|         color_clean|         Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
only showing top 2 rows



Another task might be to replace given characters with other characters. Building this as a regular expression could be tedious, so Spark also provides the translate function to replace these values. This is done at the character level and will replace all instances of a character with the indexed character in the replacement string:

In [26]:
from pyspark.sql.functions import translate

df.select(translate(col('Description'), 'LEET', '1337'), col('Description')).show(2)

+----------------------------------+--------------------+
|translate(Description, LEET, 1337)|         Description|
+----------------------------------+--------------------+
|              WHI73 HANGING H3A...|WHITE HANGING HEA...|
|               WHI73 M37A1 1AN73RN| WHITE METAL LANTERN|
+----------------------------------+--------------------+
only showing top 2 rows
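Python strings have a close analogue of this character-level mapping, which makes the indexing rule easy to check without Spark. In the sketch below, each character of "LEET" maps to the character at the same index of "1337", so L becomes 1, E becomes 3, and T becomes 7.

```python
# Build a character-to-character table: 'L'->'1', 'E'->'3', 'T'->'7'.
leet_table = str.maketrans("LEET", "1337")

print("WHITE METAL LANTERN".translate(leet_table))  # WHI73 M37A1 1AN73RN
```

The result matches the second row of the Spark output above. Characters that do not appear in the matching string, such as W or H, pass through unchanged.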



We can also perform something similar, like pulling out the first mentioned color:

In [27]:
from pyspark.sql.functions import regexp_extract

extract_str = '(BLACK|WHITE|RED|GREEN|BLUE)'
df.select(regexp_extract(col('Description'), extract_str, 1).alias("color_clean"), col('Description')).show(2)

+-----------+--------------------+
|color_clean|         Description|
+-----------+--------------------+
|      WHITE|WHITE HANGING HEA...|
|      WHITE| WHITE METAL LANTERN|
+-----------+--------------------+
only showing top 2 rows



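Sometimes, rather than extracting values, we simply want to check for their existence. We can do this with the instr function, which returns the 1-based position of the first occurrence of a substring in the column’s string, or 0 if the substring is absent. Comparing the result with 1 gives a Boolean declaring whether the value you specify is in the string: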

In [28]:
from pyspark.sql.functions import instr

containsBlack = instr(col('Description'), 'BLACK') >= 1
containsWhite = instr(col('Description'), 'WHITE') >= 1
df.withColumn('hasSimpleColor', containsBlack | containsWhite).where('hasSimpleColor').select('Description').show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows
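As a rough mental model, instr behaves like Python's str.find shifted by one, with null inputs producing null. The function below is a local stand-in written for this illustration, not the PySpark function of the same name.

```python
def instr(text, substring):
    """1-based position of the first match, 0 if absent, None if either input is None."""
    if text is None or substring is None:
        return None
    return text.find(substring) + 1


print(instr("WHITE METAL LANTERN", "WHITE"))             # 1
print(instr("RED WOOLLY HOTTIE WHITE HEART.", "WHITE"))  # 19
print(instr("HAND WARMER UNION JACK", "WHITE"))          # 0
```

This is why the filters above compare against 1: any position of 1 or more means the substring was found.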



This is trivial with just two values, but it becomes more complicated when there are more values.
Let’s work through this in a more rigorous way and take advantage of Spark’s ability to accept a dynamic number of arguments. We can also do this quite easily in Python. In this case, we’re going to use a different function, locate, that returns the 1-based integer position of the match (0 if there is no match). We then cast that to a Boolean before using it as the same basic feature:

In [29]:
from pyspark.sql.functions import locate, expr

simpleColors = ["black", "white", "red", "green", "blue"]

def color_locator(column, color_string):
    return locate(color_string.upper(), column).cast('boolean').alias('is_' + color_string)

selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr('*'))

df.select(*selectedColumns).where(expr('is_white or is_red')).select('Description').show(3, False)


+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows



This simple feature can often help you programmatically generate columns or Boolean filters in a way that is simple to understand and extend. We could extend this to calculating the smallest common denominator for a given input value, or whether a number is a prime.

# Working with Dates and Timestamps

Let’s begin with the basics and get the current date and the current timestamps:

In [30]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(10).withColumn('today', current_date()).withColumn('now', current_timestamp())

dateDF.createOrReplaceTempView('dateTable')
dateDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)



Now that we have a simple DataFrame to work with, let’s add and subtract five days from today. These functions take a column and then the number of days to either add or subtract as the arguments:

In [31]:
from pyspark.sql.functions import date_add, date_sub

dateDF.select(date_sub(col('today'), 5), date_add(col('today'), 5)).show(1)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2023-10-11|        2023-10-21|
+------------------+------------------+
only showing top 1 row



Another common task is to take a look at the difference between two dates. We can do this with the datediff function that will return the number of days in between two dates. Most often we just care about the days, and because the number of days varies from month to month, there also exists a function, months_between, that gives you the number of months between two dates:

In [32]:
from pyspark.sql.functions import datediff, months_between, to_date

dateDF.withColumn('week_ago', date_sub(col('today'), 7)).select(datediff(col('week_ago'), col('today'))).show(1)

+-------------------------+
|datediff(week_ago, today)|
+-------------------------+
|                       -7|
+-------------------------+
only showing top 1 row



In [33]:
dateDF.select(to_date(lit('2016-01-01')).alias('start'), to_date(lit('2017-05-22')).alias('end'))\
.select(months_between(col('start'), col('end'))).show(1)

+--------------------------------+
|months_between(start, end, true)|
+--------------------------------+
|                    -16.67741935|
+--------------------------------+
only showing top 1 row
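The fractional result can be reproduced by hand. Spark documents months_between as returning a whole number when both dates fall on the same day of the month or are both month-ends, and otherwise computing the remainder on a 31-day month, rounded to 8 digits. The plain-Python sketch below follows that rule for date-only inputs; `months_between_sketch` is a hypothetical helper and ignores the time-of-day handling that Spark applies to timestamps.

```python
import calendar
from datetime import date


def is_last_day(d: date) -> bool:
    """True if d is the last day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between_sketch(end: date, start: date) -> float:
    """Months from start to end, using a 31-day month for the fractional part."""
    whole_months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day == start.day or (is_last_day(end) and is_last_day(start)):
        return float(whole_months)
    return round(whole_months + (end.day - start.day) / 31, 8)


# Same argument order as the cell above: first date minus second date.
print(months_between_sketch(date(2016, 1, 1), date(2017, 5, 22)))  # -16.67741935
```

The result is negative because the first argument is the earlier date: 16 whole months plus 21/31 of a month separate the two.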



Notice that we introduced a new function: the to_date function. The to_date function allows you to convert a string to a date, optionally with a specified format. In Spark 2.x the format follows the Java SimpleDateFormat; from Spark 3.0 onward it follows Spark’s own datetime patterns, which will be important to reference if you use this function:

In [34]:
from pyspark.sql.functions import to_date, lit

spark.range(5).withColumn('date', lit('2017-01-01')).select(to_date(col('date'))).show(1)

+-------------+
|to_date(date)|
+-------------+
|   2017-01-01|
+-------------+
only showing top 1 row



Spark will not throw an error if it cannot parse the date; rather, it will just return null. This can be a bit tricky in larger pipelines because you might be expecting your data in one format and getting it in another. To illustrate, let’s take a look at a date format that has switched from year-month-day to year-day-month. Spark will fail to parse this date and silently return null instead:

In [35]:
dateDF.select(
    to_date(lit('2016-20-12')),
    to_date(lit('2017-12-11'))
).show(1)

+-------------------+-------------------+
|to_date(2016-20-12)|to_date(2017-12-11)|
+-------------------+-------------------+
|               null|         2017-12-11|
+-------------------+-------------------+
only showing top 1 row




We find this to be an especially tricky situation for bugs because some dates might match the correct format, whereas others do not. In the previous example, notice how the second date appears as December 11th instead of the correct day, November 12th. Spark doesn’t throw an error because it cannot know whether the days are mixed up or that specific row is incorrect.

Let’s fix this pipeline, step by step, and come up with a robust way to avoid these issues entirely. The first step is to remember that we need to specify our date format explicitly, using Spark’s datetime pattern syntax (the Java SimpleDateFormat standard in Spark 2.x).

We will use two functions to fix this: to_date and to_timestamp. In PySpark, both accept an optional format; we pass one explicitly here:

In [36]:
from pyspark.sql.functions import to_date

dateFormat = 'yyyy-dd-MM'
cleanDateDF = spark.range(1).select(
    to_date(lit('2017-12-11'), dateFormat).alias('date'),
    to_date(lit('2016-20-12'), dateFormat).alias('date2')
)

cleanDateDF.createOrReplaceTempView("dateTable2")

Now let’s use an example of to_timestamp, again passing the format explicitly:

In [37]:
from pyspark.sql.functions import to_timestamp

cleanDateDF.select(to_timestamp(col('date'), dateFormat)).show()

+------------------------------+
|to_timestamp(date, yyyy-dd-MM)|
+------------------------------+
|           2017-11-12 00:00:00|
+------------------------------+
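The same failure mode, and the same fix, can be demonstrated with Python's own datetime parsing. The sketch below is only an analogy: it uses strptime directives (%Y, %m, %d) rather than Spark's pattern letters, and `parse_or_none` is a hypothetical helper that returns None where Spark would return null.

```python
from datetime import date, datetime


def parse_or_none(text: str, pattern: str):
    """Parse text with the given strptime pattern, returning None on failure."""
    try:
        return datetime.strptime(text, pattern).date()
    except ValueError:
        return None


year_month_day = "%Y-%m-%d"  # comparable to yyyy-MM-dd
year_day_month = "%Y-%d-%m"  # comparable to yyyy-dd-MM

print(parse_or_none("2016-20-12", year_month_day))  # None (there is no month 20)
print(parse_or_none("2017-12-11", year_month_day))  # 2017-12-11 (parses, but wrong)
print(parse_or_none("2016-20-12", year_day_month))  # 2016-12-20
print(parse_or_none("2017-12-11", year_day_month))  # 2017-11-12
```

The second line is the dangerous case: the string parses without complaint under the wrong format and yields a plausible but incorrect date.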



After we have our date or timestamp in the correct format and type, comparing between them is actually quite easy. We just need to be sure to either use a date/timestamp type or specify our string according to the right format of yyyy-MM-dd if we’re comparing a date:


In [38]:
cleanDateDF.filter(col('date2') > lit('2016-12-12')).show()

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2016-12-20|
+----------+----------+



# Working with Nulls in Data

As a best practice, you should always use nulls to represent missing or empty data in your DataFrames. Spark can optimize working with null values more than it can if you use empty strings or other values. The primary way of interacting with null values, at DataFrame scale, is to use the .na subpackage on a DataFrame. There are also several functions for performing operations and explicitly specifying how Spark should handle null values.

There are two things you can do with null values: you can explicitly drop nulls or you can fill them with a value (globally or on a per-column basis). Let’s experiment with each of these now.

## Coalesce

Spark includes a function to allow you to select the first non-null value from a set of columns by using the coalesce function. In the rows shown here, Description is never null, so it simply returns the first column:

In [39]:
from pyspark.sql.functions import coalesce

df.select(coalesce(col('Description'), col('CustomerId'))).show()

+---------------------------------+
|coalesce(Description, CustomerId)|
+---------------------------------+
|             WHITE HANGING HEA...|
|              WHITE METAL LANTERN|
|             CREAM CUPID HEART...|
|             KNITTED UNION FLA...|
|             RED WOOLLY HOTTIE...|
|             SET 7 BABUSHKA NE...|
|             GLASS STAR FROSTE...|
|             HAND WARMER UNION...|
|             HAND WARMER RED P...|
|             ASSORTED COLOUR B...|
|             POPPY'S PLAYHOUSE...|
|             POPPY'S PLAYHOUSE...|
|             FELTCRAFT PRINCES...|
|             IVORY KNITTED MUG...|
|             BOX OF 6 ASSORTED...|
|             BOX OF VINTAGE JI...|
|             BOX OF VINTAGE AL...|
|             HOME BUILDING BLO...|
|             LOVE BUILDING BLO...|
|             RECIPE BOX WITH M...|
+---------------------------------+
only showing top 20 rows



## ifnull, nullif, nvl, and nvl2

There are several other SQL functions that you can use to achieve similar things.

- ifnull allows you to select the second value if the first is null, and defaults to the first. 
- Alternatively, you could use nullif, which returns null if the two values are equal or else returns the first value if they are not.
- nvl returns the second value if the first is null, but defaults to the first.
- Finally, nvl2 returns the second value if the first is not null; otherwise, it will return the last specified value (else_value in the following example):

In [40]:
df.selectExpr("ifnull(null, 'return_value')",\
              "nullif('value', 'value')",\
              "nvl(null, 'return_value')",\
              "nvl2('not_null', 'return_value', 'else_value')").show(1)

+--------------------------+--------------------+-----------------------+----------------------------------------+
|ifnull(NULL, return_value)|nullif(value, value)|nvl(NULL, return_value)|nvl2(not_null, return_value, else_value)|
+--------------------------+--------------------+-----------------------+----------------------------------------+
|              return_value|                null|           return_value|                            return_value|
+--------------------------+--------------------+-----------------------+----------------------------------------+
only showing top 1 row
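The rules for these functions, together with coalesce, are compact enough to write down as plain Python, with None standing in for null. These are local stand-ins that model the semantics described above; they are not the Spark SQL functions themselves.

```python
def coalesce(*values):
    """First non-null value, or None if every value is null."""
    return next((v for v in values if v is not None), None)


def ifnull(value, fallback):
    """The fallback if value is null, otherwise value."""
    return fallback if value is None else value


nvl = ifnull  # nvl follows the same rule as ifnull


def nullif(left, right):
    """Null if the two values are equal, otherwise the first value."""
    return None if left == right else left


def nvl2(value, if_not_null, if_null):
    """The second argument if value is not null, otherwise the third."""
    return if_not_null if value is not None else if_null


print(ifnull(None, "return_value"))                       # return_value
print(nullif("value", "value"))                           # None
print(nvl(None, "return_value"))                          # return_value
print(nvl2("not_null", "return_value", "else_value"))     # return_value
```

The four printed values line up with the four columns of the Spark output above.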



## drop

The simplest function is drop, which removes rows that contain nulls. The default is to drop any row in which any value is null:

In [41]:
df.na.drop()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

Specifying "any" as an argument drops a row if any of the values are null. Using "all" drops the row only if all values are null or NaN for that row:

In [42]:
df.na.drop("any")
df.na.drop("all")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

We can also apply this to certain sets of columns by passing in an array of columns:

In [43]:
df.na.drop('all', subset=['StockCode', 'InvoiceNo'])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
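Because the cells above return DataFrames without showing rows, the effect of "any", "all", and subset is easy to miss. The sketch below models the row-dropping rule on a list of dicts in plain Python. The three rows are made up for illustration, `drop_nulls` is a hypothetical helper, and only None is treated as missing (Spark also treats NaN as missing here).

```python
def drop_nulls(rows, how="any", subset=None):
    """Keep rows unless any (or all) of the checked columns are None."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        null_flags = [row[c] is None for c in columns]
        should_drop = any(null_flags) if how == "any" else all(null_flags)
        if not should_drop:
            kept.append(row)
    return kept


# Made-up rows: one with a missing Description, one entirely null, one complete.
rows = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "Description": None},
    {"InvoiceNo": None, "StockCode": None, "Description": None},
    {"InvoiceNo": "536366", "StockCode": "22633", "Description": "HAND WARMER UNION JACK"},
]

print(len(drop_nulls(rows, "any")))                                    # 1
print(len(drop_nulls(rows, "all")))                                    # 2
print(len(drop_nulls(rows, "all", subset=["StockCode", "InvoiceNo"])))  # 2
```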

## fill

Using the fill function, you can fill one or more columns with a set of values. You do this by specifying a value together with the set of columns it applies to.

For example, to fill all null values in columns of type String, you might specify the following:

In [44]:
df.na.fill("All Null values become this string")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

We could do the same for numeric columns by passing a number, for example df.na.fill(5) or df.na.fill(5.0). To specify columns, we just pass in a list of column names as the subset argument, like we did in the previous example:

In [45]:
df.na.fill('all', subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

We can also do this with a Python dictionary (a Map in Scala), where the key is the column name and the value is the
value we would like to use to fill null values:

In [46]:
fill_cols_vals = {"StockCode": 5, "Description" : "No Value"}
df.na.fill(fill_cols_vals)

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

## replace

In addition to replacing null values like we did with drop and fill, there are more flexible options that you can use with more than just null values. Probably the most common use case is to replace all values in a certain column according to their current value. The only requirement is that this value be the same type as the original value:

In [47]:
df.na.replace([''], ['UNKNOWN'], 'Description')

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

# Ordering

You can use asc_nulls_first, desc_nulls_first, asc_nulls_last, or desc_nulls_last to specify where you would like your null values to appear in an ordered DataFrame.

# Working with Complex Types

Complex types can help you organize and structure your data in ways that make more sense for the problem that you are hoping to solve. There are three kinds of complex types: structs, arrays, and maps.

## Structs

You can think of structs as DataFrames within DataFrames. A worked example will illustrate this more clearly. We can create a struct by wrapping a set of columns in parentheses in a query:

In [48]:
df.selectExpr('(Description, InvoiceNo) as complex', '*')
# df.selectExpr('struct(Description, InvoiceNo) as complex', '*')

DataFrame[complex: struct<Description:string,InvoiceNo:string>, InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [49]:
from pyspark.sql.functions import struct

complexDF = df.select(struct('Description', 'InvoiceNo').alias('complex'))
complexDF.createOrReplaceTempView('complexDF')

We now have a DataFrame with a column complex. We can query it just as we might another DataFrame; the only difference is that we use dot syntax to do so, or the column method getField:

In [50]:
complexDF.select('complex.Description')
# complexDF.select(col('complex').getField('Description'))

DataFrame[Description: string]

We can also query all values in the struct by using *. This brings up all the columns to the top-level DataFrame:

In [51]:
complexDF.select('complex.*')

DataFrame[Description: string, InvoiceNo: string]

## Arrays

To define arrays, let’s work through a use case. With our current data, our objective is to take every single word in our Description column and convert that into a row in our DataFrame.
The first task is to turn our Description column into a complex type, an array.

### split

We do this by using the split function and specifying the delimiter:

In [52]:
from pyspark.sql.functions import split

df.select(split(col('Description'), ' ')).show(2)

+-------------------------+
|split(Description,  , -1)|
+-------------------------+
|     [WHITE, HANGING, ...|
|     [WHITE, METAL, LA...|
+-------------------------+
only showing top 2 rows



This is quite powerful because Spark allows us to manipulate this complex type as another column. We can also query the values of the array using Python-like syntax:

In [53]:
df.select(split(col('Description'), ' ').alias('array_col')).selectExpr('array_col[0]').show(2)

+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows



### Array Length

We can determine the array’s length by querying for its size:

In [54]:
from pyspark.sql.functions import size

df.select(size(split(col('Description'), ' '))).show(2)

+-------------------------------+
|size(split(Description,  , -1))|
+-------------------------------+
|                              5|
|                              3|
+-------------------------------+
only showing top 2 rows



### array_contains

We can also see whether this array contains a value:

In [55]:
from pyspark.sql.functions import array_contains

df.select(array_contains(split(col('Description'), ' '), 'WHITE')).show(2)

+------------------------------------------------+
|array_contains(split(Description,  , -1), WHITE)|
+------------------------------------------------+
|                                            true|
|                                            true|
+------------------------------------------------+
only showing top 2 rows



### explode

The explode function takes a column that consists of arrays and creates one row (with the rest of the values duplicated) per value in the array:

In [56]:
from pyspark.sql.functions import split, explode

df.withColumn('splitted', split(col('Description'), ' ')).withColumn('exploded', explode(col('splitted'))).select('Description', 'InvoiceNo', 'exploded').show(2)

+--------------------+---------+--------+
|         Description|InvoiceNo|exploded|
+--------------------+---------+--------+
|WHITE HANGING HEA...|   536365|   WHITE|
|WHITE HANGING HEA...|   536365| HANGING|
+--------------------+---------+--------+
only showing top 2 rows



## Maps

Maps are created by using the create_map function (map in SQL and Scala) and key-value pairs of columns. You then can select them just like you might select from an array:

In [57]:
from pyspark.sql.functions import create_map

df.select(create_map(col('Description'), col('InvoiceNo')).alias('complex_map')).show(2)

+--------------------+
|         complex_map|
+--------------------+
|{WHITE HANGING HE...|
|{WHITE METAL LANT...|
+--------------------+
only showing top 2 rows



You can query them by using the proper key. A missing key returns null:

In [58]:
df.select(create_map(col('Description'), col('InvoiceNo')).alias('complex_map')).selectExpr('complex_map["WHITE METAL LANTERN"]').show(2)

+--------------------------------+
|complex_map[WHITE METAL LANTERN]|
+--------------------------------+
|                            null|
|                          536365|
+--------------------------------+
only showing top 2 rows



You can also explode map types, which turns each entry into a row with separate key and value columns:

In [59]:
df.select(create_map(col('Description'), col('InvoiceNo')).alias('complex_map')).selectExpr('explode(complex_map)').show(2)

+--------------------+------+
|                 key| value|
+--------------------+------+
|WHITE HANGING HEA...|536365|
| WHITE METAL LANTERN|536365|
+--------------------+------+
only showing top 2 rows



# Working with JSON

Spark has some unique support for working with JSON data. You can operate directly on strings of JSON in Spark and parse from JSON or extract JSON objects. Let’s begin by creating a JSON column:

In [60]:
jsonDF = spark.range(1).selectExpr("""
'{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString""")

You can use get_json_object to query a JSON object inline, be it a dictionary or an array.
You can use json_tuple if this object has only one level of nesting:

In [61]:
from pyspark.sql.functions import get_json_object, json_tuple

jsonDF.select(get_json_object(col('jsonString'), "$.myJSONKey.myJSONValue[1]").alias('column'), json_tuple(col('jsonString'), 'myJSONKey')).show(2)

+------+--------------------+
|column|                  c0|
+------+--------------------+
|     2|{"myJSONValue":[1...|
+------+--------------------+



You can also turn a StructType into a JSON string by using the to_json function:

In [62]:
from pyspark.sql.functions import to_json

df.selectExpr('(InvoiceNo, Description) as myStruct').select(to_json(col('myStruct'))).show(2)

+--------------------+
|   to_json(myStruct)|
+--------------------+
|{"InvoiceNo":"536...|
|{"InvoiceNo":"536...|
+--------------------+
only showing top 2 rows



This function also accepts a dictionary (map) of parameters that are the same as the JSON data source. You can use the from_json function to parse this (or other JSON data) back in. This
naturally requires you to specify a schema, and optionally you can specify a map of options, as well:

In [63]:
from pyspark.sql.functions import from_json
from pyspark.sql.types import *

# Schema describing the JSON strings produced by to_json
parseSchema = StructType([
    StructField('InvoiceNo', StringType(), True),
    StructField('Description', StringType(), True)])

df.selectExpr('(InvoiceNo, Description) as myStruct')\
    .select(to_json(col('myStruct')).alias('newJSON'))\
    .select(from_json(col('newJSON'), parseSchema), col('newJSON')).show(2)

+--------------------+--------------------+
|  from_json(newJSON)|             newJSON|
+--------------------+--------------------+
|{536365, WHITE HA...|{"InvoiceNo":"536...|
|{536365, WHITE ME...|{"InvoiceNo":"536...|
+--------------------+--------------------+
only showing top 2 rows



# User-Defined Functions

Although you can write UDFs in Scala, Python, or Java, there are performance considerations that you should be aware of. To illustrate this, we’re going to walk through exactly what happens when you create a UDF, pass that into Spark, and then execute code using that UDF.

The first step is the actual function. We’ll create a simple one for this example. Let’s write a power3 function that takes a number and raises it to the power of three:

In [64]:
udfExampleDF = spark.range(5).toDF('num')

def power3(double_value):
    return double_value ** 3

power3(2.0)

8.0

When you use the function, there are essentially two different things that occur. If the function is written in Scala or Java, you can use it within the Java Virtual Machine (JVM). This means that there will be little performance penalty aside from the fact that you can’t take advantage of code generation capabilities that Spark has for built-in functions. There can be performance issues if you create or use a lot of objects; we cover that in the section on optimization in Chapter 19.

If the function is written in Python, something quite different happens. Spark starts a Python process on the worker, serializes all of the data to a format that Python can understand (remember, it was in the JVM earlier), executes the function row by row on that data in the Python process, and then finally returns the results of the row operations to the JVM and Spark. Figure 6-2 provides an overview of the process.

First, we need to register the function to make it available as a DataFrame function:

In [65]:
from pyspark.sql.functions import udf

power3udf = udf(power3)

Then, we can use it in our DataFrame code:

In [66]:
from pyspark.sql.functions import col

udfExampleDF.select(power3udf(col('num'))).show(4)

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
|          8|
|         27|
+-----------+
only showing top 4 rows



At this juncture, we can use this only as a DataFrame function. That is to say, we can’t use it within a string expression, only on a Column expression. However, we can also register this UDF as a Spark SQL function. This is valuable because it makes it simple to use this function within SQL as well as across languages.

We can also register our Python function to be available as a SQL function and use that in any language, as well.

One thing we can also do to ensure that our functions are working correctly is specify a return type. As we saw in the beginning of this section, Spark manages its own type information, which does not align exactly with Python’s types. Therefore, it’s a best practice to define the return type for your function when you define it, even though doing so is not required. If you omit it, udf defaults to a string return type.

If you specify a type that doesn’t align with the actual type returned by the function, Spark will not throw an error but will just return null to designate a failure. You can see this if you switch the return type in the following function to DoubleType:

In [67]:
from pyspark.sql.types import IntegerType, DoubleType

spark.udf.register('power3py', power3, IntegerType())

udfExampleDF.selectExpr('power3py(num)').show(4)

+-------------+
|power3py(num)|
+-------------+
|            0|
|            1|
|            8|
|           27|
+-------------+
only showing top 4 rows

