In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local[1]") \
                    .appName('test') \
                    .getOrCreate()

23/10/02 14:08:44 WARN Utils: Your hostname, codespaces-d00206 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
23/10/02 14:08:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


# Where to Look for APIs

DataFrame (Dataset) Methods

This is a bit of a trick: a DataFrame is just a Dataset of Row types, so you’ll end up looking at the Dataset methods, which are documented here:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html

Dataset submodules like

DataFrameStatFunctions https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
and
DataFrameNaFunctions https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html

have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a
variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are relevant when working with null data.


Column Methods

These were introduced for the most part in Chapter 5. They hold a variety of general column-related methods like alias or contains. You can find the API reference for Column methods here: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html

org.apache.spark.sql.functions contains a variety of functions for a range of different data types. Often, you’ll see the entire package imported because these functions are used so frequently. You can find SQL and DataFrame functions here: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html

In [None]:
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("../data/retail-data/by-day/2010-12-01.csv")

df.printSchema()
df.createOrReplaceTempView('dfTable')


root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



                                                                                

# Converting to Spark Types

The lit function converts a value from the host language into its corresponding Spark representation. Here’s how we can convert a couple of different kinds of Python values to their respective Spark types:

In [None]:
from pyspark.sql.functions import lit

df.select(lit(5), lit('five'), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

# Working with Booleans

In [None]:
from pyspark.sql.functions import col

df.where(col('InvoiceNo') != 536365).select("InvoiceNo", "Description").show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



Another option, and probably the cleanest, is to specify the predicate as an expression in a string. This is valid in both Python and Scala. Note that this also gives you access to another way of expressing “does not equal”:

In [None]:
df.where('InvoiceNo <> 536365').show(5, False)

+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                  |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|536366   |22633    |HAND WARMER UNION JACK       |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536366   |22632    |HAND WARMER RED POLKA DOT    |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536367   |84879    |ASSORTED COLOUR BIRD ORNAMENT|32      |2010-12-01 08:34:00|1.69     |13047.0   |United Kingdom|
|536367   |22745    |POPPY'S PLAYHOUSE BEDROOM    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
|536367   |22748    |POPPY'S PLAYHOUSE KITCHEN    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+

We mentioned that you can specify Boolean expressions with multiple parts when you use "and" or "or". In Spark, you should chain "and" conditions together as sequential filters.
The reason is that even when Boolean conditions are expressed serially (one after the other), Spark flattens all of these filters into one statement and applies them at the same time, creating the "and" statement for us. You can write the "and" explicitly if you like, but the conditions are often easier to read when you specify them serially. "or" conditions, by contrast, need to be specified within the same statement:

In [None]:
# Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
from pyspark.sql.functions import instr

priceFilter = col('UnitPrice') > 600
descripFilter = instr(df.Description, 'POSTAGE') >= 1
df.where(df.StockCode.isin('DOT')).where(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
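
The equivalence between chained filters and one combined predicate can be illustrated without Spark. The following is only a plain-Python model of the idea, not Spark's optimizer; the three rows are copied from outputs shown in this notebook, and the helper names (is_dot, is_pricey, mentions_postage) are made up for illustration.

```python
# Rows standing in for the retail DataFrame (values taken from the outputs above).
rows = [
    {"StockCode": "DOT", "UnitPrice": 569.77, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "DOT", "UnitPrice": 607.49, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "22633", "UnitPrice": 1.85, "Description": "HAND WARMER UNION JACK"},
]

def is_dot(row):
    return row["StockCode"] == "DOT"

def is_pricey(row):
    return row["UnitPrice"] > 600

def mentions_postage(row):
    return "POSTAGE" in row["Description"]

# Two filters applied one after the other ...
first_pass = [r for r in rows if is_dot(r)]
chained = [r for r in first_pass if is_pricey(r) or mentions_postage(r)]

# ... select the same rows as a single predicate joined with "and".
combined = [r for r in rows if is_dot(r) and (is_pricey(r) or mentions_postage(r))]

print(chained == combined)  # True
print(len(chained))         # 2
```

Note that the "or" has to live inside one predicate: splitting it into two sequential filters would turn it into an "and".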



Boolean expressions are not reserved for filters. You can also define a Boolean column first and then filter the DataFrame on it:

In [None]:
from pyspark.sql.functions import instr

DOTCodeFilter = col('StockCode') == 'DOT'
priceFilter = col('UnitPrice') > 600
descripFilter = instr(df.Description, 'POSTAGE') >= 1
df.withColumn('isExpensive', DOTCodeFilter & (priceFilter | descripFilter)).where('isExpensive').select('unitPrice', 'isExpensive').show()

+---------+-----------+
|unitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In fact, it’s often easier to express filters as SQL statements than to use the programmatic DataFrame interface, and Spark SQL allows us to do this without paying any performance penalty. For example, the following statement writes the condition as a SQL string; it is equivalent to building the same condition from column objects, such as ~(col('UnitPrice') <= 250):

In [None]:
from pyspark.sql.functions import expr

df.withColumn('isExpensive', expr('NOT UnitPrice <= 250')).where('isExpensive').select("Description", "UnitPrice").show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



One “gotcha” that can come up is if you’re working with null data when creating Boolean expressions. If there is a null in your data, you’ll need to treat things a bit differently. Here’s how you can ensure that you perform a null-safe equivalence test:

In [None]:
df.where(col("Description").eqNullSafe("hello")).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
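
The difference between ordinary equality and a null-safe test is easiest to see on single values. The sketch below is a plain-Python model that uses None in place of SQL null; the function names are hypothetical and this is not how Spark implements the comparison.

```python
def sql_equals(left, right):
    """Ordinary SQL '=': the result is unknown (None) if either side is null."""
    if left is None or right is None:
        return None
    return left == right

def null_safe_equals(left, right):
    """Null-safe equality: always True or False, and null equals null."""
    if left is None or right is None:
        return left is None and right is None
    return left == right

print(sql_equals(None, "hello"))        # None
print(null_safe_equals(None, "hello"))  # False
print(null_safe_equals(None, None))     # True
```

Because a filter keeps only rows where the condition is true, a row whose comparison evaluates to null is dropped, just like a row where it is false.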



# Working with Numbers

To fabricate a contrived example, let’s imagine that we found out that we mis-recorded the quantity in our retail dataset and the true quantity is equal to (the current quantity * the unit price)^2 + 5. This will introduce our first numerical function as well as the pow function that raises a column to the expressed power:


In [None]:
from pyspark.sql.functions import expr, pow, col

fabricatedQuantity = pow(col('Quantity') * col('UnitPrice'), 2) + 5
df.select(expr('CustomerId'), fabricatedQuantity.alias('realQuantity')).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows
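
To see where these two values come from, the formula can be evaluated by hand. The sketch assumes the first two rows have a quantity of 6 and unit prices of 2.55 and 3.39; the quantity is inferred from the results above rather than read from the file. Trailing digits such as ...97 are ordinary floating-point rounding, so the comparison uses a tolerance.

```python
from math import isclose

def real_quantity(quantity, unit_price):
    # (quantity * unit price)^2 + 5, the contrived correction from the text
    return (quantity * unit_price) ** 2 + 5

first = real_quantity(6, 2.55)   # assumed inputs for the first row
second = real_quantity(6, 3.39)  # assumed inputs for the second row

print(isclose(first, 239.09))     # True
print(isclose(second, 418.7156))  # True
```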



Notice that we were able to multiply our columns together because they were both numerical. Naturally, we can add and subtract as necessary, too. In fact, we can express all of this as a SQL expression:

In [None]:
df.selectExpr('CustomerId', '(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity').show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



Another common numerical task is rounding. If you’d like to just round to a whole number, oftentimes you can cast the value to an integer and that will work just fine. However, Spark also has more detailed functions for performing this explicitly and to a certain level of precision. In the following example, we round to one decimal place:

In [None]:
from pyspark.sql.functions import lit, round, bround

df.select(round(col("UnitPrice"), 1).alias("rounded"), col("UnitPrice")).show(5)

+-------+---------+
|rounded|UnitPrice|
+-------+---------+
|    2.6|     2.55|
|    3.4|     3.39|
|    2.8|     2.75|
|    3.4|     3.39|
|    3.4|     3.39|
+-------+---------+
only showing top 5 rows



By default, the round function rounds up if you’re exactly in between two numbers. The bround function instead rounds such ties to the nearest even number, which is why 2.5 becomes 2.0 in the following example:

In [None]:
from pyspark.sql.functions import lit, round, bround

df.select(round(lit(2.5)), bround(lit(2.5))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows
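
The two tie-breaking rules can be reproduced with Python's decimal module. This is a sketch of the rounding modes only (half up versus half even), not a call into Spark; the helper names half_up and half_even are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def half_up(value, places=0):
    # Ties go away from zero, like the round column function.
    quantum = Decimal(1).scaleb(-places)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

def half_even(value, places=0):
    # Ties go to the nearest even digit, like the bround column function.
    quantum = Decimal(1).scaleb(-places)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)

print(half_up("2.5"), half_even("2.5"))          # 3 2
print(half_up("3.5"), half_even("3.5"))          # 4 4
print(half_up("2.45", 1), half_even("2.45", 1))  # 2.5 2.4
```

The values are passed as strings so that the ties are exact decimals rather than binary floating-point approximations.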



Another numerical task is to compute the correlation of two columns. For example, we can see the Pearson correlation coefficient for two columns to see if cheaper things are typically bought in greater quantities. We can do this through a function as well as through the DataFrame statistic methods:

In [None]:
from pyspark.sql.functions import corr

df.stat.corr('Quantity', 'UnitPrice')
df.select(corr('Quantity', 'UnitPrice')).show()

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|     -0.04112314436835551|
+-------------------------+
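
For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. The sketch below spells that out in plain Python on made-up numbers (the quantities and prices lists are hypothetical, not taken from the retail data).

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

quantities = [1, 2, 3, 4]       # hypothetical data
prices = [8.0, 6.0, 4.0, 2.0]   # price falls as quantity rises

print(pearson(quantities, prices))  # -1.0
```

A value near -1 would mean cheaper items sell in larger quantities; the value of about -0.04 computed above indicates almost no linear relationship in this day of data.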



Another common task is to compute summary statistics for a column or set of columns. We can use the describe method to achieve exactly this. It takes all numeric and string columns and calculates the count, mean, standard deviation, min, and max. You should use this primarily for viewing in the console because the schema might change in the future:

In [None]:
df.describe().show()


+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

                                                                                

If you need these exact numbers, you can also perform this as an aggregation yourself by importing the functions and applying them to the columns that you need:

In [None]:
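from pyspark.sql.functions import count, mean, stddev_pop, min, max

df.select(count('UnitPrice'), mean('UnitPrice'), stddev_pop('UnitPrice'), min('UnitPrice'), max('UnitPrice')).show()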


There are a number of statistical functions available in the StatFunctions Package (accessible using stat as we see in the code block below). These are DataFrame methods that you can use to calculate a variety of different things. For instance, you can calculate either exact or approximate quantiles of your data using the approxQuantile method:


In [None]:
colName = 'UnitPrice'
quantileProbs = [0.5]
relError = 0.05
df.stat.approxQuantile(colName, quantileProbs, relError)

[2.51]
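
The relative error controls how far the rank of the returned value may be from the requested rank. According to the approxQuantile documentation, for N rows, probability p, and error err, the result x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below checks that bound in plain Python; the data is hypothetical, and defining rank as the number of values less than or equal to x is an assumption of this sketch.

```python
from math import ceil, floor

def rank(values, x):
    # Assumption for this sketch: rank = how many values are <= x.
    return sum(1 for v in values if v <= x)

def within_error(values, x, probability, relative_error):
    n = len(values)
    lowest = floor((probability - relative_error) * n)
    highest = ceil((probability + relative_error) * n)
    return lowest <= rank(values, x) <= highest

prices = list(range(1, 101))  # hypothetical data: 1, 2, ..., 100

print(within_error(prices, 50, 0.5, 0.05))  # True: rank 50 is close to 0.5 * 100
print(within_error(prices, 60, 0.5, 0.05))  # False: rank 60 is too far from 50
```

A relative error of 0 asks for the exact quantile, which can be considerably more expensive to compute.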

You also can use this to see a cross-tabulation or frequent item pairs (be careful, this output will be large and is omitted for this reason):

In [None]:
# df.stat.crosstab('StockCode', 'Quantity').show()

In [None]:
# df.stat.freqItems(["StockCode", "Quantity"]).show()

As a last note, we can also add a unique ID to each row by using the function monotonically_increasing_id. This function generates a unique value for each row, starting with 0:

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

df.select(monotonically_increasing_id()).show(5)

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
|                            4|
+-----------------------------+
only showing top 5 rows
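
The IDs look consecutive here because this session reads a small file into a single partition. In general they are unique and increasing but not consecutive: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below is a plain-Python model of that layout; model_ids is a hypothetical helper, not a Spark function.

```python
def model_ids(rows_per_partition):
    """Model of the documented layout: partition index in the upper bits,
    row position within the partition in the lower 33 bits."""
    ids = []
    for partition_index, row_count in enumerate(rows_per_partition):
        for row_position in range(row_count):
            ids.append((partition_index << 33) + row_position)
    return ids

# Two partitions with three rows each.
print(model_ids([3, 3]))
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```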



# Working with Strings

The initcap function will capitalize every word in a given string when that word is separated from another by a space.

In [None]:
from pyspark.sql.functions import initcap, col

df.select(initcap(col('Description'))).show(5)

+--------------------+
|initcap(Description)|
+--------------------+
|White Hanging Hea...|
| White Metal Lantern|
|Cream Cupid Heart...|
|Knitted Union Fla...|
|Red Woolly Hottie...|
+--------------------+
only showing top 5 rows



You can also convert strings to uppercase and lowercase:

In [None]:
from pyspark.sql.functions import lower, upper

df.select(col("Description"), lower(col("Description")), upper(lower(col("Description")))).show(2)

+--------------------+--------------------+-------------------------+
|         Description|  lower(Description)|upper(lower(Description))|
+--------------------+--------------------+-------------------------+
|WHITE HANGING HEA...|white hanging hea...|     WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern|      WHITE METAL LANTERN|
+--------------------+--------------------+-------------------------+
only showing top 2 rows



Another trivial task is adding or removing spaces around a string. You can do this by using lpad, ltrim, rpad, rtrim, and trim:

In [None]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

df.select(
    ltrim(lit('   HELLO   ')).alias('ltrim'),
    rtrim(lit('   HELLO   ')).alias('rtrim'),
    trim(lit('   HELLO   ')).alias('trim'),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")
).show(2)

+--------+--------+-----+---+----------+
|   ltrim|   rtrim| trim| lp|        rp|
+--------+--------+-----+---+----------+
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
+--------+--------+-----+---+----------+
only showing top 2 rows



Note that if lpad or rpad takes a number less than the length of the string, it will always remove values from the right side of the string.
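
The truncation rule can be modelled in a few lines of plain Python. These are hypothetical stand-ins that mimic the behavior shown above for a non-empty pad string; they are not the Spark functions themselves.

```python
def lpad(text, length, pad):
    if len(text) >= length:
        return text[:length]  # too long: keep the leftmost characters
    fill = (pad * length)[: length - len(text)]
    return fill + text

def rpad(text, length, pad):
    if len(text) >= length:
        return text[:length]  # same truncation, also from the right side
    fill = (pad * length)[: length - len(text)]
    return text + fill

print(repr(lpad("HELLO", 3, " ")))   # 'HEL'
print(repr(rpad("HELLO", 10, " ")))  # 'HELLO     '
print(repr(lpad("HELLO", 8, "*")))   # '***HELLO'
```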

## Regular Expressions

Spark takes advantage of the complete power of Java regular expressions. The Java regular expression syntax departs slightly from other programming languages, so it is worth reviewing before putting anything into production. There are two key functions in Spark that you’ll need in order to perform regular expression tasks: regexp_extract and regexp_replace. These functions extract values and replace values, respectively.

Let’s explore how to use the regexp_replace function to substitute color names in our description column:

In [None]:
from pyspark.sql.functions import regexp_replace

regexp_string = 'BLACK|WHITE|RED|GREEN|BLUE'
df.select(regexp_replace(col('Description'), regexp_string, 'COLOR').alias('color_clean'), col('Description')).show(2)

+--------------------+--------------------+
|         color_clean|         Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
only showing top 2 rows
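
One thing to watch with an alternation like this is that it also matches inside longer words. The sketch below uses Python's re module rather than Java's regex engine (for simple patterns like these the two behave the same way); the string "SHREDDED PAPER" is a made-up description used only to show the effect.

```python
import re

colors = "BLACK|WHITE|RED|GREEN|BLUE"

# Without word boundaries, the pattern also matches inside longer words.
plain_hit = re.sub(colors, "COLOR", "WHITE METAL LANTERN")
inner_hit = re.sub(colors, "COLOR", "SHREDDED PAPER")
print(plain_hit)  # COLOR METAL LANTERN
print(inner_hit)  # SHCOLORDED PAPER

# \b restricts matches to whole words.
whole_words = r"\b(?:" + colors + r")\b"
untouched = re.sub(whole_words, "COLOR", "SHREDDED PAPER")
both = re.sub(whole_words, "COLOR", "RED WOOLLY HOTTIE WHITE HEART.")
print(untouched)  # SHREDDED PAPER
print(both)       # COLOR WOOLLY HOTTIE COLOR HEART.
```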



Another task might be to replace given characters with other characters. Building this as a regular expression could be tedious, so Spark also provides the translate function to replace these values. This is done at the character level and will replace all instances of a character with the indexed character in the replacement string:

In [None]:
from pyspark.sql.functions import translate

df.select(translate(col('Description'), 'LEET', '1337'), col('Description')).show(2)

+----------------------------------+--------------------+
|translate(Description, LEET, 1337)|         Description|
+----------------------------------+--------------------+
|              WHI73 HANGING H3A...|WHITE HANGING HEA...|
|               WHI73 M37A1 1AN73RN| WHITE METAL LANTERN|
+----------------------------------+--------------------+
only showing top 2 rows
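
Python strings have the same kind of character-level mapping built in, which makes the indexing easy to see: the first character of 'LEET' maps to the first character of '1337', the second to the second, and so on. This is a plain-Python analogue, not the Spark function.

```python
# L -> 1, E -> 3, E -> 3, T -> 7
table = str.maketrans("LEET", "1337")

translated = "WHITE METAL LANTERN".translate(table)
print(translated)  # WHI73 M37A1 1AN73RN
```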



We can also perform something similar, like pulling out the first mentioned color:

In [None]:
from pyspark.sql.functions import regexp_extract

extract_str = '(BLACK|WHITE|RED|GREEN|BLUE)'
df.select(regexp_extract(col('Description'), extract_str, 1).alias("color_clean"), col('Description')).show(2)

+-----------+--------------------+
|color_clean|         Description|
+-----------+--------------------+
|      WHITE|WHITE HANGING HEA...|
|      WHITE| WHITE METAL LANTERN|
+-----------+--------------------+
only showing top 2 rows
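
The same extraction can be sketched with Python's re module. The helper name extract_first_color is hypothetical. It returns the first captured group and falls back to an empty string when nothing matches, which mirrors the documented behavior of regexp_extract.

```python
import re

def extract_first_color(description):
    match = re.search(r"(BLACK|WHITE|RED|GREEN|BLUE)", description)
    # Group 1 is the parenthesised alternation; no match gives "".
    return match.group(1) if match else ""

print(extract_first_color("WHITE METAL LANTERN"))             # WHITE
print(extract_first_color("RED WOOLLY HOTTIE WHITE HEART."))  # RED
print(repr(extract_first_color("METAL LANTERN")))             # ''
```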



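Sometimes, rather than extracting values, we simply want to check for their existence. In Python we can do this with the instr function, which returns the 1-based position of the first occurrence of a substring in the column’s string, or 0 if it is absent. Comparing that position with >= 1 yields a Boolean declaring whether the value you specify is in the column’s string: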

In [None]:
from pyspark.sql.functions import instr

containsBlack = instr(col('Description'), 'BLACK') >= 1
containsWhite = instr(col('Description'), 'WHITE') >= 1
df.withColumn('hasSimpleColor', containsBlack | containsWhite).where('hasSimpleColor').select('Description').show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows
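
The position that instr reports is 1-based, with 0 meaning "not found" and null propagating through. The sketch below is a plain-Python model of that convention built on str.find; it is a hypothetical stand-in, not the Spark function.

```python
def instr(text, substring):
    if text is None or substring is None:
        return None  # null in, null out
    # str.find is 0-based and returns -1 when absent, so adding 1
    # gives a 1-based position, or 0 for "not found".
    return text.find(substring) + 1

print(instr("WHITE METAL LANTERN", "WHITE"))             # 1
print(instr("WHITE METAL LANTERN", "BLACK"))             # 0
print(instr("RED WOOLLY HOTTIE WHITE HEART.", "WHITE"))  # 19
print(instr(None, "WHITE"))                              # None
```

This is why the filter compares against >= 1 rather than using the result directly.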



This is trivial with just two values, but it becomes more complicated when there are more values.
Let’s work through this in a more rigorous way and take advantage of Spark’s ability to accept a dynamic number of arguments, which is easy to do in Python by unpacking a list of columns. In this case, we’re going to use a different function, locate, that returns the 1-based integer position of the match. We then cast that to a Boolean before using it as the same basic feature:

In [None]:
from pyspark.sql.functions import locate, expr

simpleColors = ["black", "white", "red", "green", "blue"]

def color_locator(column, color_string):
    return locate(color_string.upper(), column).cast('boolean').alias('is_' + color_string)

selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr('*'))

df.select(*selectedColumns).where(expr('is_white or is_red')).select('Description').show(3, False)


+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows



This simple feature can often help you programmatically generate columns or Boolean filters in a way that is simple to understand and extend. We could extend this to calculating the smallest common denominator for a given input value, or whether a number is a prime.

# Working with Dates and Timestamps

Let’s begin with the basics and get the current date and the current timestamp:

In [None]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(10).withColumn('today', current_date()).withColumn('now', current_timestamp())

dateDF.createOrReplaceTempView('dateTable')