# Module 06 - Joins - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 06.

- Complete each exercise in the provided code cells
- Run the data setup cells first to create the data the exercises need
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help


## Data Setup

Run the cells below to set up the data needed for the exercises.


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from pyspark.sql.functions import col, when, lit
import os

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 06 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)
print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# Create data for join exercises
employees_data = [
    (1, "Alice", "Sales", 50000),
    (2, "Bob", "IT", 60000),
    (3, "Charlie", "Sales", 70000),
    (4, "Diana", "HR", 55000),
    (5, "Eve", "IT", 65000),
    (6, "Frank", "Marketing", 52000),  # Marketing also appears in departments_data
    (7, "Grace", "Sales", 58000),
    (8, "Henry", "IT", 62000),
    (9, "Iris", "HR", 54000),
    (10, "Jack", "Finance", 60000)
]

employees_schema = StructType([
    StructField("EmpID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True)
])

df_employees = spark.createDataFrame(employees_data, employees_schema)

departments_data = [
    ("Sales", "New York", 100000),
    ("IT", "San Francisco", 150000),
    ("HR", "Chicago", 80000),
    ("Finance", "Boston", 90000),  # Finance also appears in employees_data (Jack)
    ("Marketing", "Los Angeles", 70000)
]

departments_schema = StructType([
    StructField("Department", StringType(), True),
    StructField("Location", StringType(), True),
    StructField("Budget", IntegerType(), True)
])

df_departments = spark.createDataFrame(departments_data, departments_schema)

# Create additional DataFrames for complex join exercises
projects_data = [
    (1, "Project Alpha", "Sales"),
    (2, "Project Beta", "IT"),
    (3, "Project Gamma", "Sales"),
    (4, "Project Delta", "HR"),
    (5, "Project Epsilon", "IT")
]

projects_schema = StructType([
    StructField("ProjectID", IntegerType(), True),
    StructField("ProjectName", StringType(), True),
    StructField("Department", StringType(), True)
])

df_projects = spark.createDataFrame(projects_data, projects_schema)

print("Employees DataFrame:")
df_employees.show()

print("\nDepartments DataFrame:")
df_departments.show()

print("\nProjects DataFrame:")
df_projects.show()

## Exercises

Complete the following exercises based on the concepts from Module 06.


### Exercise 1: Inner Join

Perform an inner join between df_employees and df_departments on the 'Department' column.

In [0]:
# Your code here
inner_join = df_employees.join(df_departments,on="Department",how="inner")
inner_join.show()

### Exercise 2: Left Join

Perform a left join to get all employees with their department information (if available).

In [0]:
# Your code here
left_join = df_employees.join(df_departments,on="Department",how="left")
left_join.show()

### Exercise 3: Broadcast Join

Use broadcast join to join df_employees with df_departments (broadcast the smaller table).

In [0]:
from pyspark.sql.functions import broadcast

# Perform a broadcast join between df_employees and df_departments
broadcast_join = df_employees.join(
    broadcast(df_departments),
    on="Department",
    how="inner"
)
broadcast_join.show()



### Exercise 4: Right Join

Perform a right join to get all departments with their employees (if any).


In [0]:
# Your code here
right_join = df_employees.join(df_departments, on="Department", how="right")
right_join.show()


### Exercise 5: Full Outer Join

Perform a full outer join between df_employees and df_departments on the 'Department' column.


In [0]:
# Your code here
full_outer_join = df_employees.join(df_departments, on="Department", how="outer")
full_outer_join.show()

### Exercise 6: Left Semi Join

Perform a left semi join to get employees whose department exists in the departments table.


In [0]:
# Your code here
left_semi_join = df_employees.join(df_departments,on="Department",how="left_semi")
left_semi_join.show()

### Exercise 7: Left Anti Join

Perform a left anti join to get employees whose department does NOT exist in the departments table.


In [0]:
# Your code here
left_anti_join = df_employees.join(df_departments,on="Department",how="left_anti")
left_anti_join.show()

### Exercise 8: Join with Different Column Names

Create a new DataFrame df_employees_alt with column 'Dept' instead of 'Department', then join it with df_departments using the join condition.


In [0]:
# Your code here
# First create df_employees_alt with 'Dept' column
employee_schema = StructType([
    StructField("EmpID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Dept", StringType(), True),
    StructField("Salary", IntegerType(), True)
])
df_employees_alt = spark.createDataFrame(employees_data,employee_schema)

# Then perform the join
join_with_different_column = df_employees_alt.join(df_departments,df_employees_alt.Dept==df_departments.Department,how="inner")
join_with_different_column.show()


### Exercise 9: Multiple Joins

Join df_employees with df_departments, then join the result with df_projects on Department.


In [0]:
# Your code here
multiple_join = df_employees.join(df_departments, on="Department", how="inner") \
                            .join(df_projects, on="Department", how="inner")
multiple_join.show()

### Exercise 10: Join with Select Specific Columns

Perform an inner join and select only Name, Department, and Location columns.


In [0]:
# Your code here
select_specific = df_employees.join(df_departments,on="Department",how="inner").select(
    df_employees.Name,
    df_employees.Department,
    df_departments.Location
)
select_specific.show()

### Exercise 11: Join with Filter

Perform a left join and then filter to show only employees with salary greater than 60000.


In [0]:
# Your code here
join_with_filter = df_employees.join(df_departments, on="Department", how="left").filter(
    df_employees.Salary > 60000
)
join_with_filter.show()

### Exercise 12: Join with Aggregate

Join df_employees with df_departments, then calculate the average salary per department location.


In [0]:
# Your code here
from pyspark.sql.functions import avg

join_with_aggregate = df_employees.join(df_departments, on="Department", how="inner") \
                      .groupBy(df_employees.Department, df_departments.Location) \
                      .agg(avg(df_employees.Salary).alias("AverageSalary"))
join_with_aggregate.show()

### Exercise 13: Self Join

Create a self-join on df_employees to find pairs of employees in the same department.


In [0]:
# Your code here
# Hint: Use aliases for the same DataFrame
self_join = df_employees.alias("e1") \
           .join(df_employees.alias("e2"), on="Department", how="inner") \
           .filter(col("e1.EmpID") != col("e2.EmpID"))

self_join.show()


### Exercise 14: Join with Multiple Conditions

Join df_employees with df_departments on Department, and add an additional condition that Budget > 80000.


In [0]:
# Your code here
multi_join_with_conditions = df_employees \
           .join(df_departments, on="Department", how="inner") \
           .filter(df_departments.Budget > 80000)

multi_join_with_conditions.show()


### Exercise 15: Join and Count

Perform an inner join and count the number of employees in each department.


In [0]:
# Your code here
from pyspark.sql.functions import count
join_with_count = df_employees \
           .join(df_departments, on="Department", how="inner") \
           .groupBy(df_employees.Department) \
           .agg(count("*").alias("NumberOfEmployee"))
join_with_count.show()


### Exercise 16: Join with Order By

Perform a left join and order the result by salary in descending order.


In [0]:
# Your code here
join_with_orderBy = df_employees \
           .join(df_departments, on="Department", how="left") \
           .orderBy(df_employees.Salary.desc())
join_with_orderBy.show()


### Exercise 17: Join with Distinct

Perform a join and get distinct department names from the result.


In [0]:
# Your code here
join_with_distinct = df_employees \
           .join(df_departments, on="Department", how="inner") \
           .select("Department").distinct()
join_with_distinct.show()

### Exercise 18: Join with CASE Statement

Join df_employees with df_departments and add a column 'BudgetCategory' that is 'High' if Budget >= 100000, else 'Low'.


In [0]:
# Your code here
from pyspark.sql.functions import when
join_with_case = df_employees \
           .join(df_departments, on="Department", how="inner") \
           .withColumn(
               "BudgetCategory",
               when(col("Budget") >= 100000, "High").otherwise("Low")
           )
join_with_case.show()

### Exercise 19: Join Performance - Filter Before Join

Filter df_employees to only Sales department employees, then join with df_departments. This demonstrates the best practice of filtering before joining.


In [0]:
# Your code here
filter_before_join = df_employees.filter(df_employees.Department == "Sales") \
                     .join(df_departments, on="Department", how="inner")
filter_before_join.show()


### Exercise 20: Complex Join with Multiple DataFrames

Join df_employees, df_departments, and df_projects together. Show employee name, department, location, and project name.


In [0]:
# Your code here
join_with_multiple_dataframe = df_employees.join(df_departments, on="Department", how="inner") \
                                           .join(df_projects, on="Department", how="inner") \
                                           .select("Name", "Department", "Location", "ProjectName")
join_with_multiple_dataframe.show()

## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
