In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local[1]") \
                    .appName('test') \
                    .getOrCreate()



In addition to working with any type of values, Spark also allows us to create the following grouping types:
- The simplest grouping is to just summarize a complete DataFrame by performing an aggregation in a select statement.
- A “group by” allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns.
- A “window” gives you the ability to specify one or more keys as well as one or more aggregation functions to transform the value columns. However, the rows input to the function are somehow related to the current row.
- A “grouping set,” which you can use to aggregate at multiple different levels. Grouping sets are available as a primitive in SQL and via rollups and cubes in DataFrames.
- A “rollup” makes it possible for you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized hierarchically.
- A “cube” allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized across all combinations of columns.

Each grouping returns a RelationalGroupedDataset on which we specify our aggregations.

Let’s begin by reading in our data on purchases, repartitioning the data to have far fewer partitions (because we know it’s a small volume of data stored in a lot of small files), and caching the results for rapid access:

In [3]:
df = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('../data/retail-data/all/*.csv').coalesce(5)
df.cache()
df.createOrReplaceTempView('dfTable')

                                                                                

As mentioned, basic aggregations apply to an entire DataFrame. The simplest example is the count method:

In [4]:
df.count()

                                                                                

541909

# Aggregation Functions

All aggregations are available as functions, in addition to the special cases that can appear on DataFrames or via .stat, like we saw in Chapter 6. You can find most aggregation functions in the org.apache.spark.sql.functions package (pyspark.sql.functions in Python, which is where the imports in this chapter come from).

## count

The first function worth going over is count. Here we use it as a transformation (an aggregation expression inside select) rather than as the action df.count() shown earlier. We can do one of two things: count a specific column, or count every row by using count(*) or count(1), which counts each row as the literal one. The following example counts a single column:

In [5]:
from pyspark.sql.functions import count

df.select(count('StockCode')).show()

+----------------+
|count(StockCode)|
+----------------+
|          541909|
+----------------+



There are a number of gotchas when it comes to null values and counting. For instance, when performing a count(*), Spark will count null values (including rows containing all nulls). However, when counting an individual column, Spark will not count the null values.
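
The null-handling rule can be illustrated without Spark. The following is a plain-Python sketch, not Spark's implementation: the rows and the helper names count_star and count_column are made up for illustration, and None stands in for null.

```python
# Hypothetical rows; None plays the role of a SQL null.
rows = [
    {"StockCode": "85123A", "Quantity": 6},
    {"StockCode": None, "Quantity": 2},
    {"StockCode": None, "Quantity": None},  # a row containing only nulls
]


def count_star(rows):
    """Mimic count(*): every row counts, even one made entirely of nulls."""
    return len(rows)


def count_column(rows, column):
    """Mimic count(column): rows whose value in that column is null are skipped."""
    return sum(1 for row in rows if row[column] is not None)


print(count_star(rows))                 # 3
print(count_column(rows, "StockCode"))  # 1
print(count_column(rows, "Quantity"))   # 2
```

If the two counts differ for a column in your own data, that column contains nulls.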

## countDistinct

In [6]:
from pyspark.sql.functions import countDistinct

df.select(countDistinct('StockCode')).show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+



## approx_count_distinct

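Often we work with datasets so large that an exact distinct count is not worth its cost. When an approximation within a known error bound is good enough, you can use the approx_count_distinct function. Its optional second argument sets the maximum relative standard deviation allowed for the estimate; with the loose value of 0.1 used here, the result (3364) differs noticeably from the exact count of 4070: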

In [7]:
from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct('StockCode', 0.1)).show()

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+



## first and last

In [8]:
from pyspark.sql.functions import first, last

df.select(first('StockCode'), last('StockCode')).show()

+----------------+---------------+
|first(StockCode)|last(StockCode)|
+----------------+---------------+
|          85123A|          22138|
+----------------+---------------+



## min and max

In [9]:
from pyspark.sql.functions import min, max

df.select(min('Quantity'), max('Quantity')).show()

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+



## sum

In [10]:
from pyspark.sql.functions import sum

df.select(sum('Quantity')).show()

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+



## sumDistinct

In addition to summing a total, you can also sum only the distinct values by using the sumDistinct function (deprecated since Spark 3.2 in favor of the equivalent sum_distinct):

In [11]:
from pyspark.sql.functions import sumDistinct

df.select(sumDistinct('Quantity')).show()



+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 29310|
+----------------------+



## avg

In [12]:
from pyspark.sql.functions import sum, count, avg, expr

df.select(
    count('Quantity').alias('total_transactions'),
    sum("Quantity").alias('total_purchases'),
    avg("Quantity").alias('avg_purchases'),
    expr('mean(Quantity)').alias('mean_purchases')
).selectExpr(
    'total_purchases/total_transactions',
    'avg_purchases',
    'mean_purchases'
).show()

+--------------------------------------+----------------+----------------+
|(total_purchases / total_transactions)|   avg_purchases|  mean_purchases|
+--------------------------------------+----------------+----------------+
|                      9.55224954743324|9.55224954743324|9.55224954743324|
+--------------------------------------+----------------+----------------+



## Variance and Standard Deviation

Spark provides formulas for both the sample and the population standard deviation and variance. These are different statistical formulas, and it matters which one you use. By default, the variance and stddev functions compute the sample variance and the sample standard deviation.

You can also specify these explicitly or refer to the population standard deviation or variance:

In [13]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

df.select(var_pop("Quantity"), var_samp("Quantity"),
    stddev_pop("Quantity"), stddev_samp("Quantity")).show()

+-----------------+------------------+--------------------+---------------------+
|var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+-----------------+------------------+--------------------+---------------------+
|47559.30364660879| 47559.39140929848|  218.08095663447733|   218.08115785023355|
+-----------------+------------------+--------------------+---------------------+
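
The only difference between the two formulas is the divisor: n for the population and n - 1 for the sample, which is why the results above are nearly identical on 541,909 rows. The following plain-Python sketch makes the difference visible on a small, made-up list; the function names mirror Spark's but are re-implemented here purely for illustration.

```python
import math


def var_pop(values):
    """Population variance: squared deviations divided by n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def var_samp(values):
    """Sample variance: squared deviations divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(var_pop(values))              # 4.0
print(var_samp(values))             # about 4.571 (32 / 7)
print(math.sqrt(var_pop(values)))   # 2.0, the population standard deviation
```

On small groups the gap between the two can be large, so it is worth naming the variant you want explicitly.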



## skewness and kurtosis

Skewness and kurtosis both describe the extreme points in your data. Skewness measures the asymmetry of the values around the mean, whereas kurtosis measures how heavy the tails of the distribution are. Both are relevant mainly when modeling your data as a probability distribution of a random variable. We won’t go into the math behind these here, but the definitions are easy to look up. You can calculate both by using the following functions:

In [14]:
from pyspark.sql.functions import skewness, kurtosis

df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+------------------+------------------+
|skewness(Quantity)|kurtosis(Quantity)|
+------------------+------------------+
|-0.264075576105298|119768.05495534067|
+------------------+------------------+
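
For readers who want a concrete definition, here is a plain-Python sketch based on central moments. It assumes the usual population-moment definitions and reports kurtosis as excess kurtosis (the fourth standardized moment minus 3), which is our understanding of what Spark returns; treat that as an assumption rather than a statement about Spark's internals. The data is made up.

```python
def central_moment(values, k):
    """Average of the k-th power of the deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third standardized moment: 0 for symmetric data, positive for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Excess kurtosis: fourth standardized moment minus 3 (assumed convention)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]       # hypothetical, evenly spread
right_tailed = [1, 1, 1, 1, 10]   # hypothetical, one large outlier

print(skewness(symmetric))     # 0.0
print(kurtosis(symmetric))     # about -1.3, flatter than a normal distribution
print(skewness(right_tailed))  # positive, because of the outlier on the right
```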



## Covariance and Correlation

So far we have discussed single-column aggregations, but some functions compare the values in two different columns. Two such measures are covariance and correlation, computed with the covar_samp and covar_pop functions and with corr, respectively. corr computes the Pearson correlation coefficient, which is scaled between –1 and +1. The covariance is scaled according to the inputs in the data.

Like the var function, covariance can be calculated either as the sample covariance or the population covariance. Therefore it can be important to specify which formula you want to use. Correlation has no notion of this and therefore does not have calculations for population or sample. Here’s how they work:

In [15]:
from pyspark.sql.functions import corr, covar_pop, covar_samp

df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
      covar_pop("InvoiceNo", "Quantity")).show()


+-------------------------+-------------------------------+------------------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|
+-------------------------+-------------------------------+------------------------------+
|     4.912186085636837E-4|             1052.7280543912716|            1052.7260778751674|
+-------------------------+-------------------------------+------------------------------+
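
To see why covariance depends on the scale of the inputs while correlation does not, consider this plain-Python sketch. The functions re-implement the textbook formulas under the same names Spark uses; they are illustrative only, and the data is made up.

```python
import math


def _co_deviation_sum(xs, ys):
    """Sum of the products of paired deviations from each column's mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def covar_pop(xs, ys):
    return _co_deviation_sum(xs, ys) / len(xs)


def covar_samp(xs, ys):
    return _co_deviation_sum(xs, ys) / (len(xs) - 1)


def corr(xs, ys):
    """Pearson correlation: co-deviation normalized by both columns' spread."""
    return _co_deviation_sum(xs, ys) / math.sqrt(
        _co_deviation_sum(xs, xs) * _co_deviation_sum(ys, ys)
    )


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]             # hypothetical: exactly twice xs
ys_scaled = [20, 40, 60, 80]  # the same relationship on a larger scale

print(covar_pop(xs, ys), covar_pop(xs, ys_scaled))  # 2.5 25.0
print(corr(xs, ys), corr(xs, ys_scaled))            # 1.0 1.0
```

Multiplying one column by 10 multiplies the covariance by 10 but leaves the correlation unchanged, and because the normalization cancels the divisor, correlation has no separate sample and population variants.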



                                                                                

## Aggregating to Complex Types

In Spark, aggregations are not limited to computing numerical values with formulas; you can also aggregate to complex types. For example, we can collect a list of the values present in a given column, or only the unique values by collecting to a set.

You can use this to carry out some more programmatic access later on in the pipeline or pass the entire collection to a user-defined function (UDF):


In [16]:
from pyspark.sql.functions import collect_set, collect_list

df.agg(collect_set('Country'), collect_list('Country')).show()

                                                                                

+--------------------+---------------------+
|collect_set(Country)|collect_list(Country)|
+--------------------+---------------------+
|[Portugal, Italy,...| [United Kingdom, ...|
+--------------------+---------------------+



# Grouping

Thus far, we have performed only DataFrame-level aggregations. A more common task is to perform calculations based on groups in the data. This is typically done on categorical data for which we group our data on one column and perform some calculations on the other columns that end up in that group.

We do this grouping in two phases. First we specify the column(s) on which we would like to group, and then we specify the aggregation(s). The first step returns a RelationalGroupedDataset, and the second step returns a DataFrame.

As mentioned, we can specify any number of columns on which we want to group:

In [17]:
df.groupBy("InvoiceNo", "CustomerId").count().show()

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
|   538800|     16458|   10|
|   538942|     17346|   12|
|  C539947|     13854|    1|
|   540096|     13253|   16|
|   540530|     14755|   27|
|   541225|     14099|   19|
|   541978|     13551|    4|
|   542093|     17677|   16|
|   543188|     12567|   63|
|   543590|     17377|   19|
|  C543757|     13115|    1|
|  C544318|     12989|    1|
|   544578|     12365|    1|
|   545165|     16339|   20|
|   545289|     14732|   30|
+---------+----------+-----+
only showing top 20 rows
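
The two phases can be pictured with a plain-Python sketch. This only mirrors the semantics on made-up rows; Spark aggregates incrementally and in parallel rather than materializing each group as a list.

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, CustomerId, Quantity) rows.
rows = [
    ("536365", 17850, 6),
    ("536365", 17850, 8),
    ("536366", 17850, 2),
    ("536367", 13047, 3),
]

# Phase 1: specify the grouping columns, which buckets the rows by key.
groups = defaultdict(list)
for invoice_no, customer_id, quantity in rows:
    groups[(invoice_no, customer_id)].append(quantity)

# Phase 2: specify the aggregation, which yields one result row per group.
counts = {key: len(quantities) for key, quantities in groups.items()}

print(counts[("536365", 17850)])  # 2
print(len(counts))                # 3 distinct (InvoiceNo, CustomerId) pairs
```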



## Grouping with Expressions

As we saw earlier, counting is a bit of a special case because it exists as a method. Usually, though, we prefer to use the count function. Rather than passing that function as an expression into a select statement, we specify it within agg. This makes it possible to pass in arbitrary expressions, as long as each one specifies some aggregation. You can even alias a column after transforming it for later use in your data flow:

In [18]:
from pyspark.sql.functions import count, expr

df.groupBy('InvoiceNo').agg(
    count('Quantity').alias('quan'),
    expr('count(Quantity)')
).show()

+---------+----+---------------+
|InvoiceNo|quan|count(Quantity)|
+---------+----+---------------+
|   536596|   6|              6|
|   536938|  14|             14|
|   537252|   1|              1|
|   537691|  20|             20|
|   538041|   1|              1|
|   538184|  26|             26|
|   538517|  53|             53|
|   538879|  19|             19|
|   539275|   6|              6|
|   539630|  12|             12|
|   540499|  24|             24|
|   540540|  22|             22|
|  C540850|   1|              1|
|   540976|  48|             48|
|   541432|   4|              4|
|   541518| 101|            101|
|   541783|  35|             35|
|   542026|   9|              9|
|   542375|   6|              6|
|  C542604|   8|              8|
+---------+----+---------------+
only showing top 20 rows



## Grouping with Maps

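Sometimes it can be easier to specify your transformations as a Map in which the key is the column and the value is the aggregation function (as a string) that you would like to perform; in PySpark this is a dict passed to agg, such as {'Quantity': 'avg'}. Because a dict can hold each column only once, specify the aggregations inline as expressions, as in the following example, when you need several aggregations of the same column: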

In [19]:
df.groupBy('InvoiceNo').agg(expr('avg(Quantity)'), expr('stddev_pop(Quantity)')).show()

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536596|               1.5|  1.1180339887498947|
|   536938|33.142857142857146|  20.698023172885524|
|   537252|              31.0|                 0.0|
|   537691|              8.15|   5.597097462078001|
|   538041|              30.0|                 0.0|
|   538184|12.076923076923077|   8.142590198943392|
|   538517|3.0377358490566038|  2.3946659604837897|
|   538879|21.157894736842106|  11.811070444356483|
|   539275|              26.0|  12.806248474865697|
|   539630|20.333333333333332|  10.225241100118645|
|   540499|              3.75|  2.6653642652865788|
|   540540|2.1363636363636362|  1.0572457590557278|
|  C540850|              -1.0|                 0.0|
|   540976|10.520833333333334|   6.496760677872902|
|   541432|             12.25|  10.825317547305483|
|   541518| 23.10891089108911|  20.550782784878713|
|   541783|1

# Window Functions

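Window functions compute an aggregation over a specific “window” of rows defined relative to the current row, returning a value for every input row rather than one value per group. To demonstrate, we will add a date column that contains only the date portion of our invoice date (without the time information):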

In [20]:
from pyspark.sql.functions import col, to_date

dfWithDate = df.withColumn('date', to_date(col('InvoiceDate'), 'MM/d/yyyy H:mm'))
dfWithDate.createOrReplaceTempView('dfWithDate')

The first step to a window function is to create a window specification. Note that the partition by is unrelated to the partitioning scheme concept that we have covered thus far. It’s just a similar concept that describes how we will be breaking up our group. The ordering determines the ordering within a given partition, and, finally, the frame specification (the rowsBetween statement) states which rows will be included in the frame based on its reference to the current input row. In the following example, we look at all previous rows up to the current row:

In [21]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

windowSpec = Window.partitionBy('CustomerId', 'date').orderBy(desc('Quantity')).rowsBetween(Window.unboundedPreceding, Window.currentRow)
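
To make the frame specification concrete, here is a plain-Python sketch of what a frame running from the start of the partition to the current row means for an aggregation such as max. The helper running_max and the quantities are hypothetical; this is not how Spark evaluates windows internally.

```python
def running_max(ordered_values):
    """For each row, aggregate over the frame from the partition start to that row."""
    result = []
    for i in range(len(ordered_values)):
        frame = ordered_values[: i + 1]  # unboundedPreceding .. currentRow
        result.append(max(frame))
    return result


partition = [12, 36, 24, 30]  # hypothetical quantities within one partition

print(running_max(sorted(partition)))                # [12, 24, 30, 36]
print(running_max(sorted(partition, reverse=True)))  # [36, 36, 36, 36]
```

Note how the ordering interacts with the frame: when the partition is ordered by descending quantity, the largest value is already in the frame at the first row, so the running maximum is the partition maximum on every row.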

Now we want to use an aggregation function to learn more about each specific customer. An example might be establishing the maximum purchase quantity over all time. To answer this, we use the same aggregation functions that we saw earlier by passing a column name or expression. In addition, we indicate the window specification that defines to which frames of data this function will apply:

In [22]:
from pyspark.sql.functions import max

maxPurchaseQuantity = max(col('Quantity')).over(windowSpec)

You will notice that this returns a column (or expression). We can now use this in a DataFrame select statement. Before doing so, though, we will create the purchase quantity rank. To do that, we use the dense_rank function to determine which date had the maximum purchase quantity for every customer. We use dense_rank as opposed to rank to avoid gaps in the ranking sequence when there are tied values (or, in our case, duplicate rows):

In [23]:
from pyspark.sql.functions import dense_rank, rank

purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)

This also returns a column that we can use in select statements. Now we can perform a select to view the calculated window values:

In [24]:
from pyspark.sql.functions import col
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

dfWithDate.where('CustomerId IS NOT NULL').orderBy('CustomerId')\
    .select(
        col('CustomerId'),
        col('date'),
        col('Quantity'),
        purchaseRank.alias('quantityRank'),
        purchaseDenseRank.alias('quantityDenseRank'),
        maxPurchaseQuantity.alias('maxPurchaseQuantity')
    ).show()


+----------+----------+--------+------------+-----------------+-------------------+
|CustomerId|      date|Quantity|quantityRank|quantityDenseRank|maxPurchaseQuantity|
+----------+----------+--------+------------+-----------------+-------------------+
|     12346|2011-01-18|   74215|           1|                1|              74215|
|     12346|2011-01-18|  -74215|           2|                2|              74215|
|     12347|2010-12-07|      36|           1|                1|                 36|
|     12347|2010-12-07|      30|           2|                2|                 36|
|     12347|2010-12-07|      24|           3|                3|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|             
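
The difference between rank and dense_rank only shows up after a tie. The following plain-Python sketch re-implements both on an already-sorted, made-up list of quantities; the function names mirror Spark's, but the code is illustrative only.

```python
def rank(sorted_values):
    """Tied rows share a rank; the next distinct value skips ahead, leaving a gap."""
    ranks = []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)  # position in the ordering, counting tied rows
    return ranks


def dense_rank(sorted_values):
    """Tied rows share a rank; the next distinct value gets the next integer."""
    ranks = []
    for i, value in enumerate(sorted_values):
        if i == 0:
            ranks.append(1)
        elif value == sorted_values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


quantities = [36, 30, 24, 12, 12, 12, 6]  # hypothetical, sorted descending

print(rank(quantities))        # [1, 2, 3, 4, 4, 4, 7]
print(dense_rank(quantities))  # [1, 2, 3, 4, 4, 4, 5]
```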

                                                                                

# Grouping Sets

## Rollups

Thus far, we’ve been looking at explicit groupings. When we set our grouping keys of multiple columns, Spark looks at those as well as the actual combinations that are visible in the dataset. A rollup is a multidimensional aggregation that performs a variety of group-by style calculations for us.

Let’s create a rollup that looks across time (with our new Date column) and space (with the Country column) and creates a new DataFrame that includes the grand total over all dates, the grand total for each date in the DataFrame, and the subtotal for each country on each date in the DataFrame:


In [25]:
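# Note: drop() with no arguments removes no columns and no rows, so rows
# containing nulls are kept here. dfWithDate.na.drop() would remove them, but
# the totals shown below match the unfiltered sum of Quantity computed earlier.
dfNoNull = dfWithDate.drop()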
dfNoNull.createOrReplaceTempView("dfNoNull")

In [26]:
rolledUpDF = dfNoNull.rollup('Date', 'Country').agg(sum('Quantity'))\
    .selectExpr('Date', 'Country', '`sum(Quantity)` as total_quantity')\
    .orderBy('Date')
rolledUpDF.show()


+----------+--------------+--------------+
|      Date|       Country|total_quantity|
+----------+--------------+--------------+
|      null|          null|       5176450|
|2010-12-01|     Australia|           107|
|2010-12-01|United Kingdom|         23949|
|2010-12-01|        France|           449|
|2010-12-01|          null|         26814|
|2010-12-01|        Norway|          1852|
|2010-12-01|       Germany|           117|
|2010-12-01|          EIRE|           243|
|2010-12-01|   Netherlands|            97|
|2010-12-02|          null|         21023|
|2010-12-02|United Kingdom|         20873|
|2010-12-02|       Germany|           146|
|2010-12-02|          EIRE|             4|
|2010-12-03|          EIRE|          2575|
|2010-12-03|United Kingdom|         10439|
|2010-12-03|          null|         14830|
|2010-12-03|   Switzerland|           110|
|2010-12-03|         Spain|           400|
|2010-12-03|        France|           239|
|2010-12-03|      Portugal|            65|
+----------

                                                                                

The rows containing null values hold the totals. A null in the Country column alone marks the total for that date across all countries, and a null in both rollup columns marks the grand total across both of those columns:

In [27]:
rolledUpDF.where("Country IS NULL").show()


+----------+-------+--------------+
|      Date|Country|total_quantity|
+----------+-------+--------------+
|      null|   null|       5176450|
|2010-12-01|   null|         26814|
|2010-12-02|   null|         21023|
|2010-12-03|   null|         14830|
|2010-12-05|   null|         16395|
|2010-12-06|   null|         21419|
|2010-12-07|   null|         24995|
|2010-12-08|   null|         22741|
|2010-12-09|   null|         18431|
|2010-12-10|   null|         20297|
|2010-12-12|   null|         10565|
|2010-12-13|   null|         17623|
|2010-12-14|   null|         20098|
|2010-12-15|   null|         18229|
|2010-12-16|   null|         29632|
|2010-12-17|   null|         16069|
|2010-12-19|   null|          3795|
|2010-12-20|   null|         14965|
|2010-12-21|   null|         15467|
|2010-12-22|   null|          3192|
+----------+-------+--------------+
only showing top 20 rows



                                                                                

In [28]:
rolledUpDF.where("Date IS NULL").show()


+----+-------+--------------+
|Date|Country|total_quantity|
+----+-------+--------------+
|null|   null|       5176450|
+----+-------+--------------+



                                                                                

## Cube

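A cube takes the rollup a level deeper. Rather than treating elements hierarchically, a cube does the same thing across all dimensions. This means that it produces not only the total for each date across all countries, but also the total for each country across all dates, in addition to the grand total and the subtotal for every date and country combination.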

In [29]:
dfNoNull.cube('Date', 'Country').agg(sum(col('Quantity')))\
    .select('Date', 'Country', 'sum(Quantity)').orderBy('Date').show()


+----+--------------------+-------------+
|Date|             Country|sum(Quantity)|
+----+--------------------+-------------+
|null|             Germany|       117448|
|null|             Austria|         4827|
|null|  European Community|          497|
|null|             Finland|        10666|
|null|               Italy|         7999|
|null|              Cyprus|         6317|
|null|              Poland|         3653|
|null|             Iceland|         2458|
|null|               Japan|        25218|
|null|               Malta|          944|
|null|      United Kingdom|      4263829|
|null|United Arab Emirates|          982|
|null|      Czech Republic|          592|
|null|           Lithuania|          652|
|null|              Sweden|        35637|
|null|              France|       110480|
|null|             Lebanon|          386|
|null|                null|      5176450|
|null|              Canada|         2763|
|null|           Singapore|         5234|
+----+--------------------+-------
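
One way to compare rollup and cube is to list the sets of grouping keys each one aggregates over. The following plain-Python sketch enumerates them; the helper names are hypothetical, and it describes the semantics rather than how Spark plans the query.

```python
from itertools import combinations


def rollup_grouping_sets(keys):
    """Hierarchical prefixes of the keys: all of them, then one fewer, down to none."""
    return [tuple(keys[:n]) for n in range(len(keys), -1, -1)]


def cube_grouping_sets(keys):
    """Every subset of the keys, from all of them down to none."""
    return [
        subset
        for n in range(len(keys), -1, -1)
        for subset in combinations(keys, n)
    ]


keys = ["Date", "Country"]

print(rollup_grouping_sets(keys))
# [('Date', 'Country'), ('Date',), ()]
print(cube_grouping_sets(keys))
# [('Date', 'Country'), ('Date',), ('Country',), ()]
```

The extra ('Country',) set is why the cube output contains rows with a null Date and a non-null Country, which the rollup never produces. A key that is missing from a grouping set shows up as null in the corresponding result rows.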

                                                                                

## Grouping Metadata

Sometimes when using cubes and rollups, you want to be able to query the aggregation levels so that you can easily filter them down accordingly. We can do this by using grouping_id, which gives us a column specifying the level of aggregation that we have in our result set. The query in the example that follows returns four distinct grouping IDs:

Grouping ID - Description
- 3 - This will appear for the highest-level aggregation, which gives us the total quantity regardless of customerId and stockCode.
- 2 - This will appear for all aggregations of individual stock codes. This gives us the total quantity per stock code, regardless of customer.
- 1 - This will give us the total quantity on a per-customer basis, regardless of item purchased.
- 0 - This will give us the total quantity for individual customerId and stockCode combinations.

In [30]:
from pyspark.sql.functions import col, grouping_id, sum

dfNoNull.cube('customerId', 'stockCode').agg(grouping_id(), sum(col('Quantity'))).show()


+----------+---------+-------------+-------------+
|customerId|stockCode|grouping_id()|sum(Quantity)|
+----------+---------+-------------+-------------+
|     12583|    22728|            0|           82|
|     16218|   85049D|            0|           12|
|     14307|    84375|            0|           12|
|     17908|    22568|            0|            1|
|     16583|    21890|            0|            6|
|     17951|    22807|            0|            6|
|     15525|   85114B|            0|            4|
|     17905|    22364|            0|            1|
|     16456|    20711|            0|           20|
|     16539|    21754|            0|           12|
|     15752|    22866|            0|           17|
|     17855|    22910|            0|            6|
|     13418|    22925|            0|            4|
|     13418|    22927|            0|            6|
|     18041|    21098|            0|            4|
|     17235|    21430|            0|            8|
|     16186|    22985|         
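
The four IDs listed above can be read as a bitmask with one bit per grouping column, where a bit is set when that column has been aggregated away. The sketch below reproduces the numbering in plain Python; the helper is hypothetical and reflects our understanding of the numbering, consistent with the descriptions above, rather than Spark's source code.

```python
def grouping_id(keys, grouped_by):
    """Build a bitmask: leftmost key in the highest bit, 1 if that key is aggregated away."""
    gid = 0
    for key in keys:
        gid = (gid << 1) | (0 if key in grouped_by else 1)
    return gid


keys = ["customerId", "stockCode"]

print(grouping_id(keys, {"customerId", "stockCode"}))  # 0: per customer and stock code
print(grouping_id(keys, {"customerId"}))               # 1: per customer, any item
print(grouping_id(keys, {"stockCode"}))                # 2: per stock code, any customer
print(grouping_id(keys, set()))                        # 3: grand total
```

Filtering the cube result on the grouping ID column therefore selects exactly one level of aggregation.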

                                                                                

## Pivot

Pivots make it possible for you to convert a row into a column. For example, in our current data we have a Country column. With a pivot, we can aggregate according to some function for each of those given countries and display them in an easy-to-query way:

In [31]:
pivoted = dfWithDate.groupBy('date').pivot('Country').sum()

This DataFrame will now have a column specifying the date plus one column for every combination of country and numeric variable. For example, for USA we have the following columns: USA_sum(Quantity), USA_sum(UnitPrice), USA_sum(CustomerID). There is one for each numeric column in our dataset (because we just performed an aggregation over all of them).
Here’s an example query and result from this data:

In [32]:
pivoted.where("date > '2011-12-05'").select("date" ,"`USA_sum(Quantity)`").show()


+----------+-----------------+
|      date|USA_sum(Quantity)|
+----------+-----------------+
|2011-12-06|             null|
|2011-12-09|             null|
|2011-12-08|             -196|
|2011-12-07|             null|
+----------+-----------------+



                                                                                

All of these columns could also be calculated with single groupings, but the value of a pivot comes down to how you would like to explore the data. If a column has low enough cardinality, it can be useful to transform its values into columns so that users can see the schema and immediately know what to query for.
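
The reshaping a pivot performs can be sketched in plain Python. The rows and the helper pivot_sum are made up for illustration; None marks a date and country combination with no input rows, which is how the null cells in the output above arise.

```python
from collections import defaultdict

# Hypothetical (date, Country, Quantity) rows.
rows = [
    ("2011-01-01", "USA", 3),
    ("2011-01-01", "France", 10),
    ("2011-01-02", "France", 5),
    ("2011-01-02", "France", 7),
]


def pivot_sum(rows):
    """Group by date and turn each distinct country into its own summed column."""
    countries = sorted({country for _, country, _ in rows})
    # Every date starts with one empty (None) cell per country.
    totals = defaultdict(lambda: dict.fromkeys(countries))
    for date, country, quantity in rows:
        current = totals[date][country]
        totals[date][country] = quantity if current is None else current + quantity
    return dict(totals)


pivoted = pivot_sum(rows)
print(pivoted["2011-01-01"])  # {'France': 10, 'USA': 3}
print(pivoted["2011-01-02"])  # {'France': 12, 'USA': None}
```

The number of output columns grows with the number of distinct values in the pivoted column, which is why low cardinality matters.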

# User-Defined Aggregation Functions


User-defined aggregation functions (UDAFs) are a way for users to define their own aggregation functions based on custom formulae or business rules. You can use UDAFs to compute custom calculations over groups of input data (as opposed to single rows). Spark maintains a single AggregationBuffer to store intermediate results for every group of input data.

The UDAF API described here is available only in Scala or Java. However, as of Spark 2.3 you can also call Scala or Java UDFs and UDAFs from other languages by registering the function, just as we showed in the UDF section in Chapter 6. For more information, see SPARK-19439.

To create a UDAF, you must inherit from the UserDefinedAggregateFunction base class and implement the following methods:
- **inputSchema** represents input arguments as a StructType
- **bufferSchema** represents intermediate UDAF results as a StructType
- **dataType** represents the return DataType
- **deterministic** is a Boolean value that specifies whether this UDAF will return the same result for a given input
- **initialize** allows you to initialize values of an aggregation buffer
- **update** describes how you should update the internal buffer based on a given row
- **merge** describes how two aggregation buffers should be merged
- **evaluate** will generate the final result of the aggregation

The following example implements a BoolAnd, which will inform us whether all the rows (for a given column) are true; if they’re not, it will return false:

In [None]:

// in Scala
  import org.apache.spark.sql.expressions.MutableAggregationBuffer
  import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._
  class BoolAnd extends UserDefinedAggregateFunction {
    def inputSchema: org.apache.spark.sql.types.StructType =
      StructType(StructField("value", BooleanType) :: Nil)
    def bufferSchema: StructType = StructType(
      StructField("result", BooleanType) :: Nil
    )
    def dataType: DataType = BooleanType
    def deterministic: Boolean = true
    def initialize(buffer: MutableAggregationBuffer): Unit = {
      buffer(0) = true
    }
    def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
      buffer(0) = buffer.getAs[Boolean](0) && input.getAs[Boolean](0)
    }
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
      buffer1(0) = buffer1.getAs[Boolean](0) && buffer2.getAs[Boolean](0)
    }
    def evaluate(buffer: Row): Any = {
      buffer(0)
    } 
}

Now, we simply instantiate our class and/or register it as a function:

In [None]:
// in Scala
  val ba = new BoolAnd
  spark.udf.register("booland", ba)
  import org.apache.spark.sql.functions._
  spark.range(1)
    .selectExpr("explode(array(TRUE, TRUE, TRUE)) as t")
    .selectExpr("explode(array(TRUE, FALSE, TRUE)) as f", "t")
    .select(ba(col("t")), expr("booland(f)"))
    .show()
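
To see how the methods fit together, the following plain-Python sketch simulates the lifecycle for the same BoolAnd logic. It does not use Spark's API: the class BoolAndSketch and the driver run_aggregate are hypothetical, the buffer is a plain value that is returned rather than mutated in place, and the lists stand in for partitions of a single group.

```python
from functools import reduce


class BoolAndSketch:
    """Plain-Python stand-in for the four buffer-handling methods of a UDAF."""

    def initialize(self):
        return True  # starting buffer: "all rows seen so far are true"

    def update(self, buffer, value):
        return buffer and value  # fold one input row into the buffer

    def merge(self, buffer1, buffer2):
        return buffer1 and buffer2  # combine two partial buffers

    def evaluate(self, buffer):
        return buffer  # final result of the aggregation


def run_aggregate(agg, partitions):
    """Update one buffer per partition, then merge the partial buffers and evaluate."""
    partials = []
    for partition in partitions:
        buffer = agg.initialize()
        for value in partition:
            buffer = agg.update(buffer, value)
        partials.append(buffer)
    return agg.evaluate(reduce(agg.merge, partials, agg.initialize()))


print(run_aggregate(BoolAndSketch(), [[True, True], [True]]))   # True
print(run_aggregate(BoolAndSketch(), [[True, False], [True]]))  # False
```

Because update and merge are both a logical AND, the result does not depend on how the rows are split across partitions, which is the property a deterministic UDAF needs.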
