In [1]:
import findspark
findspark.init()

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [6]:
df = spark.read.format('csv')\
.option("header", "true")\
.option("inferSchema", "true")\
.load("Book_Exercises/data/csv/all.txt")\
.coalesce(5)

**What is Coalesce?** \
The coalesce method reduces the number of partitions in a DataFrame. It avoids a full shuffle: instead of redistributing rows with a partitioner, it merges existing partitions into fewer ones. This means it can only decrease the number of partitions; to increase the count or rebalance the data evenly, use repartition, which does perform a full shuffle.
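The merging behaviour can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: `coalesce_partitions` is a hypothetical helper, and the grouping rule (neighbouring partitions go together) is a simplification, since Spark also takes data locality into account.

```python
def coalesce_partitions(partitions, num_partitions):
    """Merge whole partitions into at most num_partitions groups.

    No partition is split and no row is re-hashed, which is why the
    partition count can only go down, never up.
    """
    if num_partitions >= len(partitions):
        # Asking for more partitions than exist is a no-op.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Assign neighbouring source partitions to the same target.
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
print(coalesce_partitions(parts, 2))        # [[1, 2, 3, 4, 5, 6], [7, 8, 9]]
print(len(coalesce_partitions(parts, 10)))  # 5
```

Note that the merged partitions can end up uneven in size, which is the price of skipping the shuffle.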

In [9]:
df.cache()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: string, UnitPrice: double, CustomerID: int, Country: string]

In [10]:
df.show(5)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
only showing top 5 rows



## Aggregation Functions:
* **count:** Counts the non-null values of a given column. To count every row regardless of nulls, use count(*) or count(1).

In [13]:
from pyspark.sql.functions import count
df.select(count("StockCode")).show()

+----------------+
|count(StockCode)|
+----------------+
|           92764|
+----------------+



* **countDistinct:** Sometimes the total number is not relevant; what matters is the number of unique values.

In [15]:
from pyspark.sql.functions import countDistinct
df.select(countDistinct("StockCode")).show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     3116|
+-------------------------+
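How count and countDistinct treat nulls is easy to get wrong, so here is a plain-Python sketch of the semantics. The column values are made up for illustration, with `None` standing in for a SQL null.

```python
# Made-up column values; None plays the role of a SQL NULL.
stock_codes = ["85123A", None, "71053", "85123A", None]

# count(*) / count(1): every row counts, nulls included.
count_star = len(stock_codes)

# count("StockCode"): only non-null values count.
count_col = sum(1 for v in stock_codes if v is not None)

# countDistinct("StockCode"): distinct non-null values.
count_distinct = len({v for v in stock_codes if v is not None})

print(count_star, count_col, count_distinct)  # 5 3 2
```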



* **approx_count_distinct:** With large datasets, an exact distinct count is often unnecessary, and an approximation to a given degree of accuracy works just fine.
The second argument is the maximum relative standard deviation allowed for the estimate; a larger value is cheaper to compute but less accurate.

In [21]:
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("StockCode", 0.1)).show()

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            2631|
+--------------------------------+
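To get a feel for the accuracy trade-off, the two results above can be compared directly. The snippet below only does the arithmetic on the figures printed by the notebook; it does not call Spark.

```python
# Figures copied from the notebook outputs above.
exact_distinct = 3116   # countDistinct("StockCode")
approx_distinct = 2631  # approx_count_distinct("StockCode", 0.1)

relative_error = abs(approx_distinct - exact_distinct) / exact_distinct
print(f"relative error: {relative_error:.4f}")  # relative error: 0.1556
```

The tolerance passed to the function describes the standard deviation of the estimator rather than a hard bound, so a single estimate can miss by more than the value given, as it does here.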



* **first and last:**

In [22]:
from pyspark.sql.functions import first, last
df.select(first("StockCode"), last("StockCode")).show()

+----------------+---------------+
|first(StockCode)|last(StockCode)|
+----------------+---------------+
|          85123A|          20886|
+----------------+---------------+



* **max and min:**

In [23]:
from pyspark.sql.functions import min, max
df.select(min("StockCode"), max("StockCode")).show()

+--------------+--------------+
|min(StockCode)|max(StockCode)|
+--------------+--------------+
|         10002|             m|
+--------------+--------------+



* **sum:**

In [27]:
from pyspark.sql.functions import sum
df.select(sum("Quantity")).show()

+-------------+
|sum(Quantity)|
+-------------+
|       809925|
+-------------+



* **sumDistinct:** (deprecated since Spark 3.2 in favor of `sum_distinct`)

In [26]:
from pyspark.sql.functions import sumDistinct
df.select(sumDistinct("Quantity")).show()

+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 48673|
+----------------------+



* **avg:**

In [35]:
from pyspark.sql.functions import avg, expr
df.select(avg("Quantity"), expr("mean(Quantity)")).show()

+----------------+----------------+
|   avg(Quantity)|  mean(Quantity)|
+----------------+----------------+
|8.73102712259066|8.73102712259066|
+----------------+----------------+



* **Variance and Standard Deviation:**

In [40]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

df.select(var_pop("Quantity"), stddev_pop("Quantity")).show()
df.select(var_samp("Quantity"), stddev_samp("Quantity")).show()

+------------------+--------------------+
| var_pop(Quantity)|stddev_pop(Quantity)|
+------------------+--------------------+
|121867.36033221691|   349.0950591632842|
+------------------+--------------------+

+------------------+---------------------+
|var_samp(Quantity)|stddev_samp(Quantity)|
+------------------+---------------------+
|121868.67408188361|   349.09694080854337|
+------------------+---------------------+
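The population and sample variants differ only in the divisor: `n` for the population formulas and `n - 1` for the sample formulas. A minimal plain-Python sketch follows, using the five Quantity values shown by `df.show(5)` earlier. The helper names mirror the Spark functions but are local definitions, not the Spark API.

```python
import math


def var_pop(xs):
    """Population variance: squared deviations divided by n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def var_samp(xs):
    """Sample variance: squared deviations divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


quantities = [6, 6, 8, 6, 6]  # Quantity column of the first five rows

print(round(var_pop(quantities), 4), round(math.sqrt(var_pop(quantities)), 4))
print(round(var_samp(quantities), 4), round(math.sqrt(var_samp(quantities)), 4))
```

On the full dataset the two results are nearly identical because `n` is large, which matches the small gap between the outputs above.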



* **Skewness and kurtosis:**

In [41]:
from pyspark.sql.functions import skewness, kurtosis
df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+--------------------+------------------+
|  skewness(Quantity)|kurtosis(Quantity)|
+--------------------+------------------+
|-0.18925049467871705|44043.177602281416|
+--------------------+------------------+
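Both measures are built from central moments of the data. The sketch below is plain Python and assumes the conventions Spark uses to my understanding: population moments for skewness, and excess kurtosis (the fourth standardized moment minus 3). The sample values are made up.

```python
import math


def central_moment(xs, k):
    """k-th central moment, divided by n (population convention)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def skewness(xs):
    # Third standardized moment: asymmetry around the mean.
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    # Excess kurtosis (assumed convention): 0 for a normal distribution.
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


values = [1, 2, 3, 4, 10]  # made-up sample with a long right tail
print(round(skewness(values), 4), round(kurtosis(values), 4))
```

A positive skewness indicates a longer right tail, and a large positive kurtosis, like the one reported for Quantity above, indicates a few extreme values far from the mean.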



* **Covariance and Correlation**

In [43]:
from pyspark.sql.functions import corr, covar_pop, covar_samp
df.select(corr("InvoiceNo", "Quantity"), covar_pop("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity")).show()

+-------------------------+------------------------------+-------------------------------+
|corr(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|
+-------------------------+------------------------------+-------------------------------+
|     0.004510386981990214|            2515.7715775622664|              2515.799221884859|
+-------------------------+------------------------------+-------------------------------+
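As a reminder of what these three functions compute, here is a plain-Python sketch with made-up paired values. The helper names echo the Spark functions but are local definitions.

```python
import math


def _co_deviation(xs, ys):
    """Sum of products of deviations from the respective means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def covar_pop(xs, ys):
    return _co_deviation(xs, ys) / len(xs)


def covar_samp(xs, ys):
    return _co_deviation(xs, ys) / (len(xs) - 1)


def corr(xs, ys):
    # Pearson correlation: covariance scaled into the range [-1, 1].
    return _co_deviation(xs, ys) / math.sqrt(
        _co_deviation(xs, xs) * _co_deviation(ys, ys)
    )


xs = [1, 2, 3, 4]  # made-up values
ys = [2, 4, 6, 9]
print(covar_pop(xs, ys), round(covar_samp(xs, ys), 4), round(corr(xs, ys), 4))
```

Covariance depends on the units of both columns, while correlation is scale-free, which is why the correlation above is close to zero even though the covariance is in the thousands.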



* **Aggregating to Complex Types:**
In Spark, aggregations are not limited to numerical formulas; you can also aggregate into complex types. For example, we can collect a list of all values present in a given
column, or only the unique values by collecting to a set.

In [44]:
from pyspark.sql.functions import collect_set, collect_list
df.agg(collect_set("Country"), collect_list("Country")).show()

+--------------------+---------------------+
|collect_set(Country)|collect_list(Country)|
+--------------------+---------------------+
|[Portugal, Italy,...| [United Kingdom, ...|
+--------------------+---------------------+
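The difference between the two collectors can be shown with ordinary Python containers. The country values below are made up, and the element order of a collected set should not be relied on.

```python
# Made-up Country values.
countries = ["United Kingdom", "France", "United Kingdom", "Italy", "France"]

collected_list = list(countries)  # like collect_list: keeps duplicates
collected_set = set(countries)    # like collect_set: unique values only

print(len(collected_list), len(collected_set))  # 5 3
```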



___
## Grouping:
Thus far, we have performed only DataFrame-level aggregations. A more common task is to
perform calculations based on groups in the data. This is typically done on *categorical data*, for which we group our data on one column and perform some calculations on the other columns
that end up in that group.

In [46]:
df.groupBy("InvoiceNo", "CustomerId").count().show(5)

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
+---------+----------+-----+
only showing top 5 rows



In [47]:
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
.show(5)

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536596|               1.5|  1.1180339887498947|
|   536938|33.142857142857146|  20.698023172885524|
|   537252|              31.0|                 0.0|
|   537691|              8.15|   5.597097462078001|
|   538041|              30.0|                 0.0|
+---------+------------------+--------------------+
only showing top 5 rows
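Conceptually, a grouped aggregation first buckets rows by key and then applies the aggregate functions to each bucket. The plain-Python sketch below mirrors the `groupBy("InvoiceNo").agg(...)` call above on a handful of made-up rows; it illustrates the semantics, not how Spark executes the query.

```python
import math
from collections import defaultdict

# Made-up (InvoiceNo, Quantity) rows.
rows = [
    ("536365", 6), ("536365", 6), ("536365", 8),
    ("536366", 6), ("536366", 6),
]

# Step 1: bucket the quantities by invoice number.
groups = defaultdict(list)
for invoice_no, quantity in rows:
    groups[invoice_no].append(quantity)

# Step 2: compute avg and stddev_pop within each bucket.
result = {}
for invoice_no, quantities in groups.items():
    mean = sum(quantities) / len(quantities)
    variance = sum((q - mean) ** 2 for q in quantities) / len(quantities)
    result[invoice_no] = (mean, math.sqrt(variance))

for invoice_no, (avg_q, std_q) in result.items():
    print(invoice_no, round(avg_q, 4), round(std_q, 4))
```

A group whose rows all share the same quantity has a standard deviation of 0.0, as several invoices in the output above do.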

