# Tutorial 5

**This tutorial will cover:**

* GroupBy function
* Aggregate functions

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Practice").getOrCreate()

df1 = spark.read.csv("test-data-5.csv", header=True, inferSchema=True)
df1.show()

+-----+------------+------+
| Name| Departments|Salary|
+-----+------------+------+
|Steve|Data Science| 10000|
|Steve|         IOT|  5000|
| John|    Big Data|  4000|
|Steve|    Big Data|  4000|
| John|Data Science|  3000|
|  Tom|Data Science| 20000|
|  Tom|         IOT| 10000|
|  Tom|    Big Data|  5000|
| Mark|Data Science| 10000|
| Mark|    Big Data|  2000|
+-----+------------+------+



The `groupBy()` function and the aggregate functions work together: you group the rows by a column (or a list of columns), then specify an aggregate function to run on each group.
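To make the group-then-aggregate idea concrete, here is a minimal plain-Python sketch of what such a query computes. It is only an illustration and not how Spark implements it; the helper name `group_and_aggregate` and the tuple layout are assumptions made for this example, and the rows mirror the table shown above.

```python
from collections import defaultdict

# The same ten rows as the table above, as (Name, Departments, Salary) tuples.
rows = [
    ("Steve", "Data Science", 10000),
    ("Steve", "IOT", 5000),
    ("John", "Big Data", 4000),
    ("Steve", "Big Data", 4000),
    ("John", "Data Science", 3000),
    ("Tom", "Data Science", 20000),
    ("Tom", "IOT", 10000),
    ("Tom", "Big Data", 5000),
    ("Mark", "Data Science", 10000),
    ("Mark", "Big Data", 2000),
]


def group_and_aggregate(rows, key_index, value_index, aggregate):
    """Hypothetical helper: group rows by one field, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        # Step 1: collect the values that share the same key.
        groups[row[key_index]].append(row[value_index])
    # Step 2: apply the aggregate function to each group's values.
    return {key: aggregate(values) for key, values in groups.items()}


# Total salary per person (group by Name, sum Salary).
total_by_name = group_and_aggregate(rows, 0, 2, sum)
print(total_by_name)

# Number of rows per department (group by Departments, count).
rows_by_department = group_and_aggregate(rows, 1, 2, len)
print(rows_by_department)
```

Swapping `sum` for `max`, `min`, or `len` changes only the second step, which is the same relationship `groupBy()` has with the aggregate function that follows it.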

Tip: You can see a list of the available aggregate functions by typing `df1.groupBy("Column Name").` and pressing the `Tab` key.

In [2]:
# Find each person's total salary by grouping the rows by the "Name" column and then summing the salaries.
df1.groupBy("Name").sum("Salary").show()

+-----+-----------+
| Name|sum(Salary)|
+-----+-----------+
|Steve|      19000|
|  Tom|      35000|
| Mark|      12000|
| John|       7000|
+-----+-----------+



Note that some aggregate functions only work on columns of certain data types. For example, you cannot apply the `sum()` function to a column of `string` values. So it helps to check the data type of each column:

In [3]:
df1.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Departments: string (nullable = true)
 |-- Salary: integer (nullable = true)
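
If you want to pick out the columns that are safe to pass to `sum()` or `mean()`, one option is to filter the `(column name, type name)` pairs that `df1.dtypes` returns. The sketch below runs on a hard-coded list so it works without a Spark session. The helper `numeric_columns` is hypothetical, and the set of type names is an assumption based on the short names Spark reports (for example `int` rather than `integer`).

```python
# Short type names as reported in DataFrame.dtypes (assumed list, not exhaustive).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Hypothetical helper: return the names of columns with a numeric type."""
    return [
        name
        for name, dtype in dtypes
        # Decimal types carry precision and scale, e.g. "decimal(10,2)".
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


# The pairs we would expect df1.dtypes to return for the schema above.
dtypes = [("Name", "string"), ("Departments", "string"), ("Salary", "int")]
print(numeric_columns(dtypes))
```

With a live session you would pass `df1.dtypes` instead of the hard-coded list.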



In [4]:
# Find out which department is paying the most in total salary.
df1.groupBy("Departments").sum("Salary").show()

+------------+-----------+
| Departments|sum(Salary)|
+------------+-----------+
|         IOT|      15000|
|    Big Data|      15000|
|Data Science|      43000|
+------------+-----------+



In [5]:
# Find out which department is paying the highest average salary.
df1.groupBy("Departments").mean("Salary").show()

+------------+-----------+
| Departments|avg(Salary)|
+------------+-----------+
|         IOT|     7500.0|
|    Big Data|     3750.0|
|Data Science|    10750.0|
+------------+-----------+



In [6]:
# Find out how many employees are working in each department.
df1.groupBy("Departments").count().show()

+------------+-----+
| Departments|count|
+------------+-----+
|         IOT|    2|
|    Big Data|    4|
|Data Science|    4|
+------------+-----+



In [7]:
# Another way to apply an aggregate function is with the `agg()` function.
# Called directly on a DataFrame, `agg()` aggregates over all rows without grouping them first. In other words, `df.agg()` is shorthand for `df.groupBy().agg()`.
df1.agg({"Salary": "sum"}).show()

+-----------+
|sum(Salary)|
+-----------+
|      73000|
+-----------+



In [8]:
# Using `agg()` as in the previous cell is the same as this:
df1.groupBy().sum("Salary").show()

+-----------+
|sum(Salary)|
+-----------+
|      73000|
+-----------+
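
One way to see why `groupBy()` with no columns yields a single row: with no grouping keys, every row has the same (empty) key, so all rows land in one group. The plain-Python sketch below illustrates this under that reading. The helper `group_by` is hypothetical and is not Spark's implementation.

```python
from collections import defaultdict


def group_by(rows, key_indexes):
    """Hypothetical helper: group rows by the fields at the given positions."""
    groups = defaultdict(list)
    for row in rows:
        # The key is a tuple of the chosen fields; with no fields it is ().
        key = tuple(row[i] for i in key_indexes)
        groups[key].append(row)
    return groups


# (Name, Departments, Salary) tuples matching the tutorial's data.
rows = [
    ("Steve", "Data Science", 10000), ("Steve", "IOT", 5000),
    ("John", "Big Data", 4000), ("Steve", "Big Data", 4000),
    ("John", "Data Science", 3000), ("Tom", "Data Science", 20000),
    ("Tom", "IOT", 10000), ("Tom", "Big Data", 5000),
    ("Mark", "Data Science", 10000), ("Mark", "Big Data", 2000),
]

# No grouping keys: a single group that contains every row.
groups = group_by(rows, [])
totals = {key: sum(row[2] for row in group) for key, group in groups.items()}
print(totals)
```

Passing `[0]` groups by name, and `[0, 1]` groups by name and department, which corresponds to handing `groupBy()` a list of columns.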



In [9]:
# Find each person's highest salary from a single department. With no column given, `max()` applies to every numeric column.
df1.groupBy("Name").max().show()

+-----+-----------+
| Name|max(Salary)|
+-----+-----------+
|Steve|      10000|
|  Tom|      20000|
| Mark|      10000|
| John|       4000|
+-----+-----------+



In [10]:
# Find each person's lowest salary from a single department. With no column given, `min()` applies to every numeric column.
df1.groupBy("Name").min().show()

+-----+-----------+
| Name|min(Salary)|
+-----+-----------+
|Steve|       4000|
|  Tom|       5000|
| Mark|       2000|
| John|       3000|
+-----+-----------+

