Group By Master class
----------------------
1. How group by works
2. How to implement it in PySpark

[Refer to this page for more](https://sparkbyexamples.com/pyspark/pyspark-groupby-agg-aggregate-explained/)

In [0]:

# Import only what this notebook uses; note that this sum replaces Python's built-in sum
from pyspark.sql.functions import col, sum
data = [(1,'manish',50000,'IT'),
(2,'vikash',60000,'sales'),
(3,'raushan',70000,'marketing'),
(4,'mukesh',80000,'IT'),
(5,'pritam',90000,'sales'),
(6,'nikita',45000,'marketing'),
(7,'ragini',55000,'marketing'),
(8,'rakesh',100000,'IT'),
(9,'aditya',65000,'IT'),
(10,'rahul',50000,'marketing')]

schema = ['id','name','salary','dept']

employee_df = spark.createDataFrame(data = data,schema = schema)

In [0]:
employee_df.show()

+---+-------+------+---------+
| id|   name|salary|     dept|
+---+-------+------+---------+
|  1| manish| 50000|       IT|
|  2| vikash| 60000|    sales|
|  3|raushan| 70000|marketing|
|  4| mukesh| 80000|       IT|
|  5| pritam| 90000|    sales|
|  6| nikita| 45000|marketing|
|  7| ragini| 55000|marketing|
|  8| rakesh|100000|       IT|
|  9| aditya| 65000|       IT|
| 10|  rahul| 50000|marketing|
+---+-------+------+---------+



In [0]:
employee_df.groupBy("dept").agg(sum("salary").alias("total_salary")).show()

+---------+------------+
|     dept|total_salary|
+---------+------------+
|       IT|      295000|
|    sales|      150000|
|marketing|      220000|
+---------+------------+
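To make the mechanics concrete, here is a minimal plain-Python sketch of what a grouped sum does conceptually: put each row into a bucket by its key, then add up each bucket. It is not how Spark executes the query internally (Spark distributes and shuffles the data), and `group_sum` is a hypothetical helper name used only for illustration.

```python
from collections import defaultdict

# Same rows as employee_df: (id, name, salary, dept)
data = [(1, 'manish', 50000, 'IT'),
        (2, 'vikash', 60000, 'sales'),
        (3, 'raushan', 70000, 'marketing'),
        (4, 'mukesh', 80000, 'IT'),
        (5, 'pritam', 90000, 'sales'),
        (6, 'nikita', 45000, 'marketing'),
        (7, 'ragini', 55000, 'marketing'),
        (8, 'rakesh', 100000, 'IT'),
        (9, 'aditya', 65000, 'IT'),
        (10, 'rahul', 50000, 'marketing')]


def group_sum(rows, key_index, value_index):
    """Sum row[value_index] for each distinct row[key_index] (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Group by dept (index 3) and sum salary (index 2)
total_salary = group_sum(data, key_index=3, value_index=2)
print(total_salary)
```

The totals match the `total_salary` column printed by the PySpark query above.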



One interesting thing I learned: if we do not import `sum` from `pyspark.sql.functions` and call `sum` directly,
as we did in the aggregate notebook, we get an error, because Python then uses its built-in `sum`, which cannot work on a column name.

[Click here to learn more about this](https://stackoverflow.com/questions/36719039/sum-operation-on-pyspark-dataframe-giving-typeerror-when-type-is-fine/36719760#36719760)
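The error can be reproduced without Spark. The sketch below calls Python's built-in `sum` on a column name, which is what happens when the PySpark `sum` has not been imported. `describe_builtin_sum` is a hypothetical helper name, used only to capture the error as text.

```python
import builtins


def describe_builtin_sum(column_name):
    """Call Python's built-in sum on a column name and report the outcome."""
    try:
        return str(builtins.sum(column_name))
    except TypeError as exc:
        # The built-in starts from 0 and tries to add the first character
        # of the string to it, which is not a valid int + str operation.
        return f"TypeError: {exc}"


outcome = describe_builtin_sum("salary")
print(outcome)
```

With PySpark, one common way to avoid the name clash altogether is `from pyspark.sql import functions as F` followed by `F.sum("salary")`, which leaves the built-in `sum` untouched.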

In [0]:
# After groupBy, only the grouping column(s) and the aggregated column(s)
# are available; every other column is dropped.
# To keep the other columns we have two options: use a window function,
# or join the original DataFrame with the grouped DataFrame.
# Note: joining on an expression keeps both dept columns in the result.

new_df = employee_df.groupBy("dept").agg(sum("salary").alias("total_salary"))
result_df = employee_df.join(new_df, employee_df.dept == new_df.dept)
result_df.show()

+---+-------+------+---------+---------+------------+
| id|   name|salary|     dept|     dept|total_salary|
+---+-------+------+---------+---------+------------+
|  1| manish| 50000|       IT|       IT|      295000|
|  2| vikash| 60000|    sales|    sales|      150000|
|  3|raushan| 70000|marketing|marketing|      220000|
|  5| pritam| 90000|    sales|    sales|      150000|
|  4| mukesh| 80000|       IT|       IT|      295000|
|  6| nikita| 45000|marketing|marketing|      220000|
|  7| ragini| 55000|marketing|marketing|      220000|
|  8| rakesh|100000|       IT|       IT|      295000|
| 10|  rahul| 50000|marketing|marketing|      220000|
|  9| aditya| 65000|       IT|       IT|      295000|
+---+-------+------+---------+---------+------------+
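The join above attaches each department's total to every employee row. A minimal plain-Python sketch of that idea follows; it is not Spark code, and `attach_group_total` is a hypothetical helper name chosen for illustration.

```python
from collections import defaultdict

# Same rows as employee_df: (id, name, salary, dept)
data = [(1, 'manish', 50000, 'IT'),
        (2, 'vikash', 60000, 'sales'),
        (3, 'raushan', 70000, 'marketing'),
        (4, 'mukesh', 80000, 'IT'),
        (5, 'pritam', 90000, 'sales'),
        (6, 'nikita', 45000, 'marketing'),
        (7, 'ragini', 55000, 'marketing'),
        (8, 'rakesh', 100000, 'IT'),
        (9, 'aditya', 65000, 'IT'),
        (10, 'rahul', 50000, 'marketing')]


def attach_group_total(rows, key_index=3, value_index=2):
    """Return every row extended with the total of its group (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:  # first pass: aggregate per group
        totals[row[key_index]] += row[value_index]
    # second pass: look up the group's total for each original row
    return [row + (totals[row[key_index]],) for row in rows]


result = attach_group_total(data)
for row in result:
    print(row)
```

The window-function option mentioned in the comments should, as far as I know, look roughly like `employee_df.withColumn("total_salary", sum("salary").over(Window.partitionBy("dept")))`, with `Window` imported from `pyspark.sql.window`. It keeps all original columns without a join and without a duplicated `dept` column.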



In [0]:
# Question asked in the video: find the total salary per country and department
question_data = [
    (1, 'manish', 50000, 'IT', 'india'),
    (2, 'vikash', 60000, 'sales', 'us'),
    (3, 'raushan', 70000, 'marketing', 'india'),
    (4, 'mukesh', 80000, 'IT', 'us'),
    (5, 'pritam', 90000, 'sales', 'india'),
    (6, 'nikita', 45000, 'marketing', 'us'),
    (7, 'ragini', 55000, 'marketing', 'india'),
    (8, 'rakesh', 100000, 'IT', 'us'),
    (9, 'aditya', 65000, 'IT', 'india'),
    (10, 'rahul', 50000, 'marketing', 'us')
]

question_data_schema = ['id','name','salary','dept','country']

emp_df = spark.createDataFrame(question_data).toDF(*question_data_schema)


In [0]:
# Solution of the asked question
(emp_df.groupBy("country", "dept")
       .agg(sum("salary").alias("Total_Salary"))
       .sort(col("country"))
       .show())

+-------+---------+------------+
|country|     dept|Total_Salary|
+-------+---------+------------+
|  india|       IT|      115000|
|  india|marketing|      125000|
|  india|    sales|       90000|
|     us|    sales|       60000|
|     us|       IT|      180000|
|     us|marketing|       95000|
+-------+---------+------------+
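Grouping by two columns works the same way as grouping by one; the group key is simply the pair of values. The plain-Python sketch below illustrates this under that assumption and is not Spark code. Since the query above sorts only by `country`, the order of departments inside a country is not guaranteed; the sketch sorts by the full key instead.

```python
from collections import defaultdict

# Same rows as emp_df: (id, name, salary, dept, country)
question_data = [
    (1, 'manish', 50000, 'IT', 'india'),
    (2, 'vikash', 60000, 'sales', 'us'),
    (3, 'raushan', 70000, 'marketing', 'india'),
    (4, 'mukesh', 80000, 'IT', 'us'),
    (5, 'pritam', 90000, 'sales', 'india'),
    (6, 'nikita', 45000, 'marketing', 'us'),
    (7, 'ragini', 55000, 'marketing', 'india'),
    (8, 'rakesh', 100000, 'IT', 'us'),
    (9, 'aditya', 65000, 'IT', 'india'),
    (10, 'rahul', 50000, 'marketing', 'us'),
]

totals = defaultdict(int)
for _id, _name, salary, dept, country in question_data:
    # The group key is the (country, dept) pair
    totals[(country, dept)] += salary

# Sorting by the full key gives a deterministic order within each country too
for (country, dept), total in sorted(totals.items()):
    print(country, dept, total)
```

In PySpark, a fully deterministic order could be requested with `.sort("country", "dept")`.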



In [0]:
# Let's solve it using Spark SQL
emp_df.createOrReplaceTempView("emp_tbl")

spark.sql("""
          SELECT country,dept,sum(salary) as Total_Salary
          FROM emp_tbl
          GROUP BY country,dept
""").show(truncate=False)

+-------+---------+------------+
|country|dept     |Total_Salary|
+-------+---------+------------+
|india  |IT       |115000      |
|us     |sales    |60000       |
|india  |marketing|125000      |
|india  |sales    |90000       |
|us     |IT       |180000      |
|us     |marketing|95000       |
+-------+---------+------------+
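The same `GROUP BY` query is plain SQL, so it can be tried without a Spark cluster. The sketch below runs it against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here, not Spark SQL, and the `ORDER BY` clause is an addition to make the row order deterministic.

```python
import sqlite3

# Same rows as emp_df: (id, name, salary, dept, country)
question_data = [
    (1, 'manish', 50000, 'IT', 'india'),
    (2, 'vikash', 60000, 'sales', 'us'),
    (3, 'raushan', 70000, 'marketing', 'india'),
    (4, 'mukesh', 80000, 'IT', 'us'),
    (5, 'pritam', 90000, 'sales', 'india'),
    (6, 'nikita', 45000, 'marketing', 'us'),
    (7, 'ragini', 55000, 'marketing', 'india'),
    (8, 'rakesh', 100000, 'IT', 'us'),
    (9, 'aditya', 65000, 'IT', 'india'),
    (10, 'rahul', 50000, 'marketing', 'us'),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_tbl "
    "(id INTEGER, name TEXT, salary INTEGER, dept TEXT, country TEXT)"
)
conn.executemany("INSERT INTO emp_tbl VALUES (?, ?, ?, ?, ?)", question_data)

rows = conn.execute("""
    SELECT country, dept, SUM(salary) AS Total_Salary
    FROM emp_tbl
    GROUP BY country, dept
    ORDER BY country, dept
""").fetchall()
conn.close()

for row in rows:
    print(row)
```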

