
## Grouping Data and Types of Joins in Spark

We can group data by one or more columns in a DataFrame and apply aggregations such as `sum` or `avg` to get a summary view of each slice of the data.

```
salary_data.groupby('Department')
```

This returns a `GroupedData` object, not a DataFrame. It can be assigned to a variable, and applying an aggregation to it produces a regular DataFrame again.

Ex:
```
salary_data.groupby('Department').avg().show()
```



### A complex groupBy Statement

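`groupBy` can be combined with other operations in a single chained statement, for example aggregating, rounding, renaming, and sorting the result. Several aggregations can also be computed within one `groupBy` by using `agg()`.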



In [None]:
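# Alias Spark's round so it does not shadow Python's built-in round()
from pyspark.sql.functions import col, round as spark_round

# Total salary per department, rounded to 2 decimals and sorted by department
salary_data.groupBy('Department')\
  .sum('Salary')\
  .withColumn('sum(Salary)', spark_round(col('sum(Salary)'), 2)) \
  .withColumnRenamed('sum(Salary)', 'Salary') \
  .orderBy('Department')\
  .show()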

## Joining DataFrames in Spark

Join operations combine data from two or more DataFrames based on a common column. They are essential for merging datasets, aggregations, and relational operations.

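**.join()** is the method used to join DataFrames. Its parameters are:

- **other**: The other DataFrame to join with.
- **on**: The column name(s) or join expression on which to join the DataFrames.
- **how**: The type of join (defaults to `inner`).

Unlike pandas' `merge`, PySpark's `join` has no **suffixes** parameter. Columns with the same name in both DataFrames have to be disambiguated by aliasing or renaming.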

```
DF1.join(DF2, on, how)
```



In [None]:
# Line continuations are not needed inside brackets
salary_data_with_id = [
    (1, "John", "Field-eng", 3500),
    (2, "Robert", "Sales", 4000),
    (3, "Maria", "Finance", 3500),
    (4, "Michael", "Sales", 3000),
    (5, "Kelly", "Finance", 3500),
    (6, "Kate", "Finance", 3000),
    (7, "Martin", "Finance", 3500),
    (8, "Kiran", "Sales", 2200),
]

columns = ["ID", "Employee", "Department", "Salary"]
salary_data_with_id = spark.createDataFrame(data=salary_data_with_id, schema=columns)
salary_data_with_id.show()

In [None]:
employee_data = [
    (1, "NY", "M"),
    (2, "NC", "M"),
    (3, "NY", "F"),
    (4, "TX", "M"),
    (5, "NY", "F"),
    (6, "AZ", "F"),
]
columns = ["ID", "State", "Gender"]
employee_data = spark.createDataFrame(data=employee_data, schema=columns)
employee_data.show()

#### **INNER JOIN**

Used to join two DataFrames on values that are common to both. A row whose key is missing from either DataFrame is not part of the result.

This is the ***default*** join type in Spark.

**Use Case**:

Inner joins are useful for merging data when you are interested in common elements in both DataFrames – for example, joining sales data with customer data to see which customers made a purchase.

In [None]:
salary_data_with_id.join(employee_data, salary_data_with_id.ID == employee_data.ID, "inner").show()

#### **OUTER JOIN**

Also known as a ***full outer join***, it returns all rows from both DataFrames, filling missing values with ***null***.

Use an outer join when you want to keep every row from both DataFrames, whether or not its key has a match in the other DataFrame. Rows that exist in only one DataFrame still appear in the result, with nulls in the columns that come from the other.

**Use Case**:
Use it when you want to include all records from both DataFrames while accommodating unmatched values, for example merging employee data with project data to see which employees are assigned to which projects, including those who are unassigned.

In [None]:
salary_data_with_id.join(employee_data, salary_data_with_id.ID == employee_data.ID, "outer").show()

#### **LEFT JOIN**

A left join returns all the rows from the left DataFrame and the matched rows from the right DataFrame. Where there is no match in the right DataFrame, the columns that come from it contain **null** values.

**Use Case**:

Useful when you want to keep all the records from the left DataFrame and only the matching records from the right DataFrame. Example: merging customer data with transaction data to list every customer, including those who have not made a purchase.



In [None]:
salary_data_with_id.join(employee_data, salary_data_with_id.ID == employee_data.ID, "left").show()

#### **RIGHT JOIN**

Similar to a left join, but returns all the rows from the right DataFrame and the matched rows from the left DataFrame. Rows from the right DataFrame with no match in the left DataFrame get null values in the left DataFrame's columns.

In [None]:
salary_data_with_id.join(employee_data, salary_data_with_id.ID == employee_data.ID, "right").show()


#### **CROSS JOIN**

Also known as a Cartesian join, it combines each row from the left DataFrame with every row from the right DataFrame, so the result has (rows in left) × (rows in right) rows.

It is typically used when you want to explore all possible combinations of data, such as when generating test data.


#### **UNION**

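Union appends the rows of one DataFrame to another DataFrame with the same schema. `union` matches columns by position, not by name, and keeps duplicate rows.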

In [None]:
salary_data_with_id_2 = [
    (1, "John", "Field-eng", 3500),
    (2, "Robert", "Sales", 4000),
    (3, "Aliya", "Finance", 3500),
    (4, "Nate", "Sales", 3000),
]
columns2 = ["ID", "Employee", "Department", "Salary"]
salary_data_with_id_2 = spark.createDataFrame(data=salary_data_with_id_2, schema=columns2)
salary_data_with_id_2.printSchema()
salary_data_with_id_2.show(truncate=False)

In [None]:
unionDF = salary_data_with_id.union(salary_data_with_id_2)
unionDF.show(truncate=False)