## GroupBy

Calling `groupBy()` on a PySpark DataFrame returns a `GroupedData` object,
which provides the aggregate functions below.

`count()` - Returns the count of rows for each group.

`mean()` - Returns the mean of values for each group. It is an alias for `avg()`, so its output columns are also named `avg(...)`.

`max()` - Returns the maximum of values for each group.

`min()` - Returns the minimum of values for each group.

`sum()` - Returns the sum of values for each group.

`avg()` - Returns the average of values for each group.

`agg()` - Computes one or more aggregates in a single call.

`pivot()` - Turns the distinct values of a column into separate columns; it is followed by an aggregation.

In [1]:
from pyspark.sql import SparkSession
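# Import only the functions used below. Note that sum and max shadow
# Python's built-ins of the same name for the rest of the session.
from pyspark.sql.functions import avg, col, max, sum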

In [2]:
simpleData = [("James", "Sales", "NY", 90000, 34, 10000),
              ("Michael", "Sales", "NY", 86000, 56, 20000),
              ("Robert", "Sales", "CA", 81000, 30, 23000),
              ("Maria", "Finance", "CA", 90000, 24, 23000),
              ("Raman", "Finance", "CA", 99000, 40, 24000),
              ("Scott", "Finance", "NY", 83000, 36, 19000),
              ("Jen", "Finance", "NY", 79000, 53, 15000),
              ("Jeff", "Marketing", "CA", 80000, 25, 18000),
              ("Kumar", "Marketing", "NY", 91000, 50, 21000)
              ]

schema = ["employee_name", "department", "state", "salary", "age", "bonus"]

spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()

df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+



## GroupBy Functions

### GroupBy & Sum

In [3]:
print('GroupBy & Sum')
df.groupBy('department').sum('salary').show(truncate=False)

GroupBy & Sum
+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|Sales     |257000     |
|Finance   |351000     |
|Marketing |171000     |
+----------+-----------+
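
As a sanity check on what the grouped sum computes, the totals above can be reproduced without Spark. The following is a minimal plain-Python sketch over the same salary figures; the names `rows` and `salary_by_department` are illustrative only, and it makes no claim about how Spark executes the aggregation internally.

```python
from collections import defaultdict

# (department, salary) pairs copied from simpleData above
rows = [
    ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
    ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
    ("Marketing", 80000), ("Marketing", 91000),
]

# Accumulate one running total per department
salary_by_department = defaultdict(int)
for department, salary in rows:
    salary_by_department[department] += salary

for department, total in salary_by_department.items():
    print(department, total)
```

The printed totals should match the `sum(salary)` column in the Spark output.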



### GroupBy & Count

In [4]:
print('GroupBy & Count')
df.groupBy('department').count().show()

GroupBy & Count
+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    4|
| Marketing|    2|
+----------+-----+



### GroupBy & Min

In [5]:
print('GroupBy & Min')
df.groupBy('department').min().show()

GroupBy & Min
+----------+-----------+--------+----------+
|department|min(salary)|min(age)|min(bonus)|
+----------+-----------+--------+----------+
|     Sales|      81000|      30|     10000|
|   Finance|      79000|      24|     15000|
| Marketing|      80000|      25|     18000|
+----------+-----------+--------+----------+



### GroupBy & Max

In [6]:
print('GroupBy & Max')
df.groupBy('department').max().show()

GroupBy & Max
+----------+-----------+--------+----------+
|department|max(salary)|max(age)|max(bonus)|
+----------+-----------+--------+----------+
|     Sales|      90000|      56|     23000|
|   Finance|      99000|      53|     24000|
| Marketing|      91000|      50|     21000|
+----------+-----------+--------+----------+



### GroupBy & Avg

In [7]:
print('GroupBy & Avg')
df.groupBy('department').avg().show()

GroupBy & Avg
+----------+-----------------+--------+------------------+
|department|      avg(salary)|avg(age)|        avg(bonus)|
+----------+-----------------+--------+------------------+
|     Sales|85666.66666666667|    40.0|17666.666666666668|
|   Finance|          87750.0|   38.25|           20250.0|
| Marketing|          85500.0|    37.5|           19500.0|
+----------+-----------------+--------+------------------+



### GroupBy & Mean

In [8]:
print('GroupBy & Mean')
df.groupBy('department').mean().show()

GroupBy & Mean
+----------+-----------------+--------+------------------+
|department|      avg(salary)|avg(age)|        avg(bonus)|
+----------+-----------------+--------+------------------+
|     Sales|85666.66666666667|    40.0|17666.666666666668|
|   Finance|          87750.0|   38.25|           20250.0|
| Marketing|          85500.0|    37.5|           19500.0|
+----------+-----------------+--------+------------------+



### GroupBy and Aggregate on multiple columns

In [9]:
print('GroupBy and Aggregate on multiple columns')
df.groupBy('department', 'state').sum('salary', 'bonus').show(truncate=False)

GroupBy and Aggregate on multiple columns
+----------+-----+-----------+----------+
|department|state|sum(salary)|sum(bonus)|
+----------+-----+-----------+----------+
|Finance   |NY   |162000     |34000     |
|Marketing |NY   |91000      |21000     |
|Sales     |CA   |81000      |23000     |
|Marketing |CA   |80000      |18000     |
|Finance   |CA   |189000     |47000     |
|Sales     |NY   |176000     |30000     |
+----------+-----+-----------+----------+
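
The same department-by-state breakdown can also be laid out with one column per state, which is what `pivot()` is for; in PySpark that would be written as `df.groupBy('department').pivot('state').sum('salary')`. Since that call is not run in this notebook, the sketch below only emulates the resulting wide layout in plain Python, starting from the salary sums shown above. `long_rows` and `pivoted` are hypothetical names, not part of any API.

```python
# (department, state, sum(salary)) triples copied from the output above
long_rows = [
    ("Finance", "NY", 162000), ("Marketing", "NY", 91000),
    ("Sales", "CA", 81000), ("Marketing", "CA", 80000),
    ("Finance", "CA", 189000), ("Sales", "NY", 176000),
]

# One entry per department, one key per state: the "wide" pivoted layout
pivoted = {}
for department, state, salary_sum in long_rows:
    pivoted.setdefault(department, {})[state] = salary_sum

# Column order for the wide table: the distinct states, sorted
states = sorted({state for _, state, _ in long_rows})

print("department", *states)
for department in sorted(pivoted):
    # .get() yields None where a department has no rows for a state
    print(department, *(pivoted[department].get(state) for state in states))
```

Each department now occupies a single row, with its `CA` and `NY` salary totals side by side.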



## Multiple Aggregations

In [10]:
print('Multiple Aggregations')
df.groupBy('department').agg(sum('salary').alias('sum_salary'),
                             avg('salary').alias('avg_salary'),
                             sum('bonus').alias('sum_bonus'),
                             max('bonus').alias('max_bonus')).show(truncate=False)

Multiple Aggregations
+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
|Marketing |171000    |85500.0          |39000    |21000    |
+----------+----------+-----------------+---------+---------+



## Using filter on Aggregate Data

In [11]:
print('Using filter on Aggregate Data')
df.groupBy('department').agg(sum('salary').alias('sum_salary'),
                             avg('salary').alias('avg_salary'),
                             sum('bonus').alias('sum_bonus'),
                             max('bonus').alias('max_bonus')) \
    .where(col('sum_bonus') >= 50000).show(truncate=False)

Using filter on Aggregate Data
+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
+----------+----------+-----------------+---------+---------+
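
Note that the `where()` here runs on the aggregated rows, not on the individual employees, which is comparable to a `HAVING` clause in SQL. A plain-Python sketch of that ordering (aggregate first, then filter) over the same bonus figures is shown below; `bonus_rows`, `sum_bonus` and `kept` are illustrative names only.

```python
from collections import defaultdict

# (department, bonus) pairs copied from simpleData above
bonus_rows = [
    ("Sales", 10000), ("Sales", 20000), ("Sales", 23000),
    ("Finance", 23000), ("Finance", 24000), ("Finance", 19000), ("Finance", 15000),
    ("Marketing", 18000), ("Marketing", 21000),
]

# Step 1: aggregate the bonus per department
sum_bonus = defaultdict(int)
for department, bonus in bonus_rows:
    sum_bonus[department] += bonus

# Step 2: filter on the aggregated value, not on individual rows
kept = {department: total for department, total in sum_bonus.items() if total >= 50000}
print(kept)
```

Applying the same threshold to individual rows before aggregating would keep nothing, since no single bonus reaches 50000.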

