## Overview

This is the third article in the PySpark series. In this article we will look at PySpark's GroupBy and aggregate functions, which are very handy for segmenting data according to the requirements so that each group can be analysed separately.

If you are already following my PySpark series, great. If not, please refer to the earlier articles listed below:

1. Getting started with PySpark using Python
2. Data Preprocessing using PySpark - PySpark's DataFrame
3. Data preprocessing using PySpark - Handling missing values

## Table of contents

1. Why are Aggregate and GroupBy functions needed?
2. Mandatory steps
3. PySpark Aggregate function
4. PySpark's GroupBy function

## Why do we need GroupBy and aggregate functions

Grouping data is one of the most essential skills when working with Big Data. When we are dealing with a huge amount of data, it becomes much harder to analyse it and draw business insights from it if we cannot segment it first.

When it comes to aggregate functions, the golden rule to remember is that GroupBy and aggregate functions go hand in hand: a groupBy on its own only defines the groups, and it needs an aggregate function such as SUM, COUNT, AVG, MAX or MIN to produce a result.

Before moving to the main topic of this article, **let's do the following mandatory steps.**

1. Starting the Spark Session
2. Reading the dataset 

## Mandatory Steps

In this section we will set up our **PySpark connection** to the **Apache Spark** distribution and then **read the dataset** on which we will apply the aggregate and groupBy operations.

### Starting the Spark Session

If you have been following my **PySpark series**, you will recognise this as the **starter template** we use every time we want to get started with PySpark.

In [16]:
!pip install pyspark



In [17]:
from pyspark.sql import SparkSession

spark_aggregate = SparkSession.builder.appName('Aggregate and GroupBy').getOrCreate()
spark_aggregate

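In a nutshell, we imported **SparkSession** from the **`pyspark.sql`** module, gave the application a name with **`appName()`**, and created the session with **`getOrCreate()`**, which returns the existing session if one is already running and creates a new one otherwise.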

### Reading the dataset

Here we will be **reading our dummy dataset**, on which we will perform the **GroupBy** and **aggregate operations**. The reason I have chosen a dummy dataset is to keep the **concepts simple to understand**.

In [18]:
spark_aggregate_data = spark_aggregate.read.csv('/content/part4.csv', header = True, inferSchema = True)
spark_aggregate_data.show()

+------+------------+------+
|  Name| Departments|salary|
+------+------------+------+
|Oliver|Data Science| 10000|
|Oliver|         IOT|  5000|
| Johny|    Big Data|  4000|
|Oliver|    Big Data|  4000|
| Johny|Data Science|  3000|
|Mathew|Data Science| 20000|
|Mathew|         IOT| 10000|
|Mathew|    Big Data|  5000|
| Jacob|Data Science| 10000|
| Jacob|    Big Data|  2000|
+------+------------+------+



Here we have successfully **read our dummy dataset**, and with the help of the **show function** we can see the DataFrame too.

**Note:** I have already covered **starting a Spark session** and **reading the dataset** in the first article of this series, **`Getting started with PySpark using Python`**. If any of the functions used above are unclear, please visit that article, where I have explained each function separately.

Let's check the **`schema`** of our table/dataset to see what kind of data each column holds.

In [19]:
spark_aggregate_data.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Departments: string (nullable = true)
 |-- salary: integer (nullable = true)



**Inference:** In the above output we can see that the **`printSchema()`** function shows the type of each column of our dataset.

1. **Name** column holds the **string** data.
2. **Departments** column holds the **string** data.
3. **salary** column holds the **integer** data.

## GroupBy operations

Now let's dive into the main topic of the blog, where we will start by performing a few GroupBy operations and see how PySpark handles them.

In [20]:
spark_aggregate_data.groupBy('Name')

<pyspark.sql.group.GroupedData at 0x7f7d6927a9d0>

**Inference:** In the above code we have used the **`groupBy()`** function to group the dataset, here specifically by the **Name** column. Looking at the output, we can see that it returns a **`GroupedData`** object rather than a DataFrame.

**Note:** This is expected. Anyone familiar with SQL knows that **GROUP BY is meant to be combined with an aggregate function, and without one it doesn't give a relevant output**. The same applies here: `groupBy()` only defines the groups, so we will pair it with an aggregate function.

Before applying aggregate functions to our grouped dataset, let's first look at **some common aggregate functions and what each of them does**:

1. **`AVG`:** Returns the average of the values in each group.
2. **`COUNT`:** Returns the number of rows in each group.
3. **`MIN`:** Returns the smallest value in each group.
4. **`MAX`:** Works the same way as MIN, except that it returns the largest value in each group.
5. **`SUM`:** Returns the sum of the numeric values in each group.
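To make these definitions concrete, here is a plain-Python sketch (no Spark involved) that computes each aggregate per department for the rows of the dummy dataset. It only illustrates what the aggregates mean; it is not how PySpark computes them internally. The names `groups` and `summary` are illustrative.

```python
from collections import defaultdict

# Rows of the dummy dataset: (Name, Departments, salary).
rows = [
    ("Oliver", "Data Science", 10000),
    ("Oliver", "IOT", 5000),
    ("Johny", "Big Data", 4000),
    ("Oliver", "Big Data", 4000),
    ("Johny", "Data Science", 3000),
    ("Mathew", "Data Science", 20000),
    ("Mathew", "IOT", 10000),
    ("Mathew", "Big Data", 5000),
    ("Jacob", "Data Science", 10000),
    ("Jacob", "Big Data", 2000),
]

# Step 1: split the salaries into one list per department (the "group by").
groups = defaultdict(list)
for _name, dept, salary in rows:
    groups[dept].append(salary)

# Step 2: apply every aggregate to each group's list of salaries.
summary = {
    dept: {
        "count": len(salaries),
        "sum": sum(salaries),
        "avg": sum(salaries) / len(salaries),
        "min": min(salaries),
        "max": max(salaries),
    }
    for dept, salaries in groups.items()
}

for dept, stats in summary.items():
    print(dept, stats)
```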

Now that we have discussed a few of the most commonly used aggregate functions, we will implement some of them and see what kind of results they return.

### GroupBy "Name" column

In this sub-section we will group by the **`"Name"`** column and see how useful that is in combination with the **sum method**.

In [21]:
spark_aggregate_data.groupBy('Name').sum()

DataFrame[Name: string, sum(salary): bigint]

**Inference:** In the above code we have used the sum aggregate function along with the groupBy function, and it has returned a **DataFrame** which holds two columns. Only the column names and types are printed because we have not called `show()` yet.

* **Name:** The grouping column. It holds string data, so sum is not applied to it and it is kept as the group key.
* **sum(salary):** The salary column, **summed within each group**. Things will get clearer when we display the DataFrame that was returned.

In [22]:
spark_aggregate_data.groupBy('Name').sum().show()

+------+-----------+
|  Name|sum(salary)|
+------+-----------+
| Jacob|      12000|
|Mathew|      35000|
|Oliver|      19000|
| Johny|       7000|
+------+-----------+



**Inference:** In the above output it is clear that the rows have been grouped by the **Name column**, with the **salary column summed** for each name.

**Note:** In short we have answered one question: **Who earns the highest total salary?**
**Answer:** It's **Mathew with 35000$**, the highest total among all.

Now let's find out which department pays the highest total salary by using the groupBy function.

### GroupBy "Departments" column

By grouping on the **Departments column** and using the **sum aggregate** function along with it, we can find which department pays the highest total salary.

In [24]:
spark_aggregate_data.groupBy('Departments').sum().show()

+------------+-----------+
| Departments|sum(salary)|
+------------+-----------+
|         IOT|      15000|
|    Big Data|      15000|
|Data Science|      43000|
+------------+-----------+



**Inference:** From the above output it is clearly visible that the **Data Science department** pays the **highest total salary (43000)**, while IOT and Big Data pay an equal total (15000 each).

If we also want to see the mean salary per department, we group by the **Departments column** again, but this time use the **mean aggregate function**.

In [25]:
spark_aggregate_data.groupBy('Departments').mean().show()

+------------+-----------+
| Departments|avg(salary)|
+------------+-----------+
|         IOT|     7500.0|
|    Big Data|     3750.0|
|Data Science|    10750.0|
+------------+-----------+



**Inference:** From the above output we can see the mean salary that employees get in each department.

Let's find out another insight by using groupBy function along with another aggregate function.

**This time we will find the total number of employees in each department**, and for that we will use the **count function** while grouping by the Departments column.

In [26]:
spark_aggregate_data.groupBy('Departments').count().show()

+------------+-----+
| Departments|count|
+------------+-----+
|         IOT|    2|
|    Big Data|    4|
|Data Science|    4|
+------------+-----+



**Inference:** Here we can see that the **Data Science and Big Data departments** have the highest number of employees, **4 each**, while the IOT department has 2.

Similarly, we can use a variety of aggregate functions depending on our requirements. Suppose we need to find out **who is getting the maximum salary**. For that we group by the **`"Name"`** column and use the **`"max"`** aggregate function. If the question is the opposite, we use the **`"min"`** function to find the **lowest salary of each employee**.

## Key takeaways from this article

1. First we performed the key tasks: setting up the Spark session and reading the data on which we performed the operations.

2. Then we moved on to the main topic, GroupBy operations, and learned how PySpark implements them. Along the way we grouped by two columns:
    * Name column
    * Departments column
    
3. Along with GroupBy operations we also discussed aggregate functions, because the two go hand in hand. The functions we went through were:
    * SUM()
    * MEAN()
    * COUNT()