# Module 05 - Spark SQL - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 05.

- Complete each exercise in the provided code cells
- Run the data setup cells first to create the data the exercises need
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help


## Data Setup

Run the cells below to set up the data needed for the exercises.


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from pyspark.sql.functions import col, when, lit
import os
 
os.environ["JAVA_TOOL_OPTIONS"] = (
    "--add-opens=java.base/java.lang=ALL-UNNAMED "
    "--add-opens=java.base/java.nio=ALL-UNNAMED "
    "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
)
os.environ["PYSPARK_PYTHON"] = "python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python"

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 05 Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)
print("SparkSession created successfully!")
print(f"Data directory: {os.path.abspath(data_dir)}")

# Create sample data for Spark SQL exercises
data = [
    ("Alice", 25, "Sales", 50000, "NYC"),
    ("Bob", 30, "IT", 60000, "LA"),
    ("Charlie", 35, "Sales", 70000, "Chicago"),
    ("Diana", 28, "IT", 55000, "Houston"),
    ("Eve", 32, "HR", 65000, "Phoenix"),
    ("Frank", 27, "Sales", 52000, "NYC"),
    ("Grace", 29, "IT", 58000, "LA"),
    ("Henry", 31, "HR", 62000, "Chicago"),
    ("Iris", 26, "Sales", 48000, "NYC"),
    ("Jack", 33, "IT", 64000, "San Francisco")
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("City", StringType(), True)
])

df_employees = spark.createDataFrame(data, schema)

# Create departments DataFrame
departments_data = [
    ("Sales", "New York", 100000),
    ("IT", "San Francisco", 150000),
    ("HR", "Chicago", 80000),
    ("Finance", "Boston", 90000)
]

departments_schema = StructType([
    StructField("Department", StringType(), True),
    StructField("Location", StringType(), True),
    StructField("Budget", IntegerType(), True)
])

df_departments = spark.createDataFrame(departments_data, departments_schema)

print("Employee DataFrame created:")
df_employees.show()

print("\nDepartments DataFrame created:")
df_departments.show()

SparkSession created successfully!
Data directory: c:\Users\Brijesh.Gupta\Documents\data
Employee DataFrame created:
+-------+---+----------+------+-------------+
|   Name|Age|Department|Salary|         City|
+-------+---+----------+------+-------------+
|  Alice| 25|     Sales| 50000|          NYC|
|    Bob| 30|        IT| 60000|           LA|
|Charlie| 35|     Sales| 70000|      Chicago|
|  Diana| 28|        IT| 55000|      Houston|
|    Eve| 32|        HR| 65000|      Phoenix|
|  Frank| 27|     Sales| 52000|          NYC|
|  Grace| 29|        IT| 58000|           LA|
|  Henry| 31|        HR| 62000|      Chicago|
|   Iris| 26|     Sales| 48000|          NYC|
|   Jack| 33|        IT| 64000|San Francisco|
+-------+---+----------+------+-------------+


Departments DataFrame created:
+----------+-------------+------+
|Department|     Location|Budget|
+----------+-------------+------+
|     Sales|     New York|100000|
|        IT|San Francisco|150000|
|        HR|      Chicago| 80000|
|   Finance|       Boston| 90000|
+----------+-------------+------+

## Exercises

Complete the following exercises based on the concepts from Module 05.


### Exercise 1: Create Temporary View

Create a temporary view named 'employees_view' from df_employees using createOrReplaceTempView.

In [None]:
# Your code here
df_employees.createOrReplaceTempView("employees_view")

### Exercise 2: Basic SQL Query

Write a SQL query to select all employees with salary greater than 55000. Display all columns.

In [None]:
# Your code here
result = spark.sql("""
    SELECT *
    FROM employees_view
    WHERE Salary > 55000
""")
result.show()

+-------+---+----------+------+-------------+
|   Name|Age|Department|Salary|         City|
+-------+---+----------+------+-------------+
|    Bob| 30|        IT| 60000|           LA|
|Charlie| 35|     Sales| 70000|      Chicago|
|    Eve| 32|        HR| 65000|      Phoenix|
|  Grace| 29|        IT| 58000|           LA|
|  Henry| 31|        HR| 62000|      Chicago|
|   Jack| 33|        IT| 64000|San Francisco|
+-------+---+----------+------+-------------+



### Exercise 3: Aggregate SQL Query

Write a SQL query to find the average salary by department, ordered by average salary descending.

In [None]:
# Your code here
result = spark.sql("""
    SELECT Department, AVG(Salary) AS Average_Salary
    FROM employees_view
    GROUP BY Department
    ORDER BY AVG(Salary) DESC
""")
result.show()

+----------+--------------+
|Department|Average_Salary|
+----------+--------------+
|        HR|       63500.0|
|        IT|       59250.0|
|     Sales|       55000.0|
+----------+--------------+



### Exercise 4: Create Global Temporary View

Create a global temporary view named 'global_employees_view' from df_employees using createOrReplaceGlobalTempView.


In [None]:
# Your code here
df_employees.createOrReplaceGlobalTempView("global_employees_view")


### Exercise 5: SQL Query with Multiple Conditions

Write a SQL query to select employees who are either in the Sales department OR have a salary greater than 60000.


In [None]:
# Your code here
result = spark.sql("""
    SELECT *
    FROM global_temp.global_employees_view
    WHERE Department = 'Sales' OR Salary > 60000
""")
result.show()


+-------+---+----------+------+-------------+
|   Name|Age|Department|Salary|         City|
+-------+---+----------+------+-------------+
|  Alice| 25|     Sales| 50000|          NYC|
|Charlie| 35|     Sales| 70000|      Chicago|
|    Eve| 32|        HR| 65000|      Phoenix|
|  Frank| 27|     Sales| 52000|          NYC|
|  Henry| 31|        HR| 62000|      Chicago|
|   Iris| 26|     Sales| 48000|          NYC|
|   Jack| 33|        IT| 64000|San Francisco|
+-------+---+----------+------+-------------+



### Exercise 6: SQL Query with ORDER BY

Write a SQL query to select all employees, ordered by salary in descending order, then by name in ascending order.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 7: SQL Query with COUNT and GROUP BY

Write a SQL query to count the number of employees in each department.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 8: SQL Query with HAVING Clause

Write a SQL query to find departments with an average salary greater than 55000. Show department and average salary.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 9: SQL Query with MIN and MAX

Write a SQL query to find the minimum and maximum salary for each department.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 10: SQL Query with LIKE

Write a SQL query to select employees whose name starts with 'A'.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 11: SQL Query with IN Clause

Write a SQL query to select employees whose city is either 'NYC' or 'LA'.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 12: SQL Query with CASE Statement

Write a SQL query to add a new column 'SalaryCategory' that categorizes salaries: 'High' if salary >= 60000, 'Medium' if salary >= 50000, else 'Low'.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 13: SQL Query with Subquery

First, create a view for departments. Then write a SQL query to find employees whose department has a budget greater than 90000.


In [None]:
# Your code here
# First create departments view
# Then write your query
result = spark.sql("""
    
""")
result.show()


### Exercise 14: SQL Query with JOIN

Create views for both employees and departments. Write a SQL query to join employees with departments on the Department column and show employee name, department, and department location.


In [None]:
# Your code here
# Create views first
# Then write your JOIN query
result = spark.sql("""
    
""")
result.show()


### Exercise 15: SQL Query with SUM

Write a SQL query to calculate the total salary for each department.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 16: Convert SQL Result to DataFrame

Write a SQL query to select all employees and keep the result as a DataFrame. `spark.sql()` already returns a DataFrame, so assigning its result is enough; `spark.table()` is an alternative that loads a view by name.


In [None]:
# Your code here
# Write SQL query and convert to DataFrame
df_result = spark.sql("""
    
""")
print("Type:", type(df_result))
df_result.show()


### Exercise 17: SQL Query with DISTINCT

Write a SQL query to find all distinct cities where employees work.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 18: SQL Query with LIMIT

Write a SQL query to select the top 3 employees with the highest salaries.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 19: SQL Query with Aggregate Functions

Write a SQL query to find the count, average, minimum, and maximum salary for each department.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


### Exercise 20: SQL Query with Window Function (ROW_NUMBER)

Write a SQL query to rank employees within each department by salary using the ROW_NUMBER() window function.


In [None]:
# Your code here
result = spark.sql("""
    
""")
result.show()


## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
