# Databricks Spark Practice - 30 Questions

## Introduction

This notebook contains 30 practice questions covering core PySpark concepts. The questions are designed to be solved on Databricks and cover:

- SparkSession and basic operations
- Reading and writing data
- DataFrame transformations
- Aggregations and GroupBy
- Spark SQL
- Joins
- Window functions
- Complex data types
- Performance optimization
- Databricks-specific features

## Instructions

1. **In Databricks**: SparkSession is automatically available as `spark`
2. **For local testing**: Uncomment the SparkSession creation code in the setup cell
3. Run the data setup cells first to create the sample data
4. Complete each exercise in the provided code cells
5. Test your solutions by running the code and checking outputs
6. Refer back to the PySpark module notebooks if you need help


## Data Setup

Run the cells below to set up all the sample data needed for the exercises.


In [0]:
# In Databricks, SparkSession is already available
# For local testing, uncomment the following:

# from pyspark.sql import SparkSession
# spark = SparkSession.builder \
#     .appName("Databricks Practice") \
#     .master("local[*]") \
#     .getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType, ArrayType
# Note: sum, max, and min imported below shadow Python's built-ins for the rest of the notebook
from pyspark.sql.functions import col, when, lit, expr, sum, avg, count, max, min, row_number, rank, dense_rank, lead, lag, window

print("Setup complete! SparkSession ready.")


Setup complete! SparkSession ready.


In [0]:
# Create employees DataFrame
employees_data = [
    (1, "Alice", 25, "Sales", 50000, "NYC", "2020-01-15"),
    (2, "Bob", 30, "IT", 60000, "LA", "2019-03-20"),
    (3, "Charlie", 35, "Sales", 70000, "Chicago", "2018-06-10"),
    (4, "Diana", 28, "IT", 55000, "NYC", "2021-02-14"),
    (5, "Eve", 32, "HR", 65000, "Houston", "2019-11-05"),
    (6, "Frank", 27, "Sales", 52000, "LA", "2022-01-08"),
    (7, "Grace", 29, "IT", 58000, "Chicago", "2020-09-12"),
    (8, "Henry", 31, "HR", 62000, "NYC", "2018-12-01"),
    (9, "Ivy", 26, "Sales", 51000, "Houston", "2021-07-22"),
    (10, "Jack", 33, "Finance", 75000, "LA", "2017-05-30")
]

employees_schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("hire_date", StringType(), True)
])

df_employees = spark.createDataFrame(employees_data, employees_schema)
print("Employees DataFrame created:")
df_employees.show()
df_employees.display()  # Databricks-only rendering; remove this line for local testing


Employees DataFrame created:
+------+-------+---+----------+------+-------+----------+
|emp_id|   name|age|department|salary|   city| hire_date|
+------+-------+---+----------+------+-------+----------+
|     1|  Alice| 25|     Sales| 50000|    NYC|2020-01-15|
|     2|    Bob| 30|        IT| 60000|     LA|2019-03-20|
|     3|Charlie| 35|     Sales| 70000|Chicago|2018-06-10|
|     4|  Diana| 28|        IT| 55000|    NYC|2021-02-14|
|     5|    Eve| 32|        HR| 65000|Houston|2019-11-05|
|     6|  Frank| 27|     Sales| 52000|     LA|2022-01-08|
|     7|  Grace| 29|        IT| 58000|Chicago|2020-09-12|
|     8|  Henry| 31|        HR| 62000|    NYC|2018-12-01|
|     9|    Ivy| 26|     Sales| 51000|Houston|2021-07-22|
|    10|   Jack| 33|   Finance| 75000|     LA|2017-05-30|
+------+-------+---+----------+------+-------+----------+



emp_id,name,age,department,salary,city,hire_date
1,Alice,25,Sales,50000,NYC,2020-01-15
2,Bob,30,IT,60000,LA,2019-03-20
3,Charlie,35,Sales,70000,Chicago,2018-06-10
4,Diana,28,IT,55000,NYC,2021-02-14
5,Eve,32,HR,65000,Houston,2019-11-05
6,Frank,27,Sales,52000,LA,2022-01-08
7,Grace,29,IT,58000,Chicago,2020-09-12
8,Henry,31,HR,62000,NYC,2018-12-01
9,Ivy,26,Sales,51000,Houston,2021-07-22
10,Jack,33,Finance,75000,LA,2017-05-30


In [0]:
# Create departments DataFrame
departments_data = [
    ("Sales", "John", 1000000),
    ("IT", "Sarah", 1500000),
    ("HR", "Mike", 800000),
    ("Finance", "Lisa", 1200000),
    ("Marketing", "Tom", 900000)
]

departments_schema = StructType([
    StructField("dept_name", StringType(), True),
    StructField("manager", StringType(), True),
    StructField("budget", IntegerType(), True)
])

df_departments = spark.createDataFrame(departments_data, departments_schema)
print("Departments DataFrame created:")
df_departments.show()


Departments DataFrame created:
+---------+-------+-------+
|dept_name|manager| budget|
+---------+-------+-------+
|    Sales|   John|1000000|
|       IT|  Sarah|1500000|
|       HR|   Mike| 800000|
|  Finance|   Lisa|1200000|
|Marketing|    Tom| 900000|
+---------+-------+-------+



In [0]:
# Create sales DataFrame
sales_data = [
    (1, "2024-01-15", 1000, "Product A"),
    (1, "2024-02-20", 1500, "Product B"),
    (2, "2024-01-10", 2000, "Product A"),
    (3, "2024-02-05", 1200, "Product C"),
    (1, "2024-03-12", 1800, "Product A"),
    (4, "2024-01-25", 900, "Product B"),
    (2, "2024-02-28", 2200, "Product C"),
    (5, "2024-03-01", 1100, "Product A"),
    (3, "2024-03-15", 1300, "Product B")
]

sales_schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("sale_date", StringType(), True),
    StructField("amount", IntegerType(), True),
    StructField("product", StringType(), True)
])

df_sales = spark.createDataFrame(sales_data, sales_schema)
print("Sales DataFrame created:")
df_sales.show()


Sales DataFrame created:
+------+----------+------+---------+
|emp_id| sale_date|amount|  product|
+------+----------+------+---------+
|     1|2024-01-15|  1000|Product A|
|     1|2024-02-20|  1500|Product B|
|     2|2024-01-10|  2000|Product A|
|     3|2024-02-05|  1200|Product C|
|     1|2024-03-12|  1800|Product A|
|     4|2024-01-25|   900|Product B|
|     2|2024-02-28|  2200|Product C|
|     5|2024-03-01|  1100|Product A|
|     3|2024-03-15|  1300|Product B|
+------+----------+------+---------+



In [0]:
# Create products DataFrame
products_data = [
    ("Product A", "Electronics", 500),
    ("Product B", "Clothing", 300),
    ("Product C", "Electronics", 800),
    ("Product D", "Food", 50)
]

products_schema = StructType([
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("base_price", IntegerType(), True)
])

df_products = spark.createDataFrame(products_data, products_schema)
print("Products DataFrame created:")
df_products.show()


Products DataFrame created:
+------------+-----------+----------+
|product_name|   category|base_price|
+------------+-----------+----------+
|   Product A|Electronics|       500|
|   Product B|   Clothing|       300|
|   Product C|Electronics|       800|
|   Product D|       Food|        50|
+------------+-----------+----------+



---

## Questions

### Questions 1-5: Basic DataFrame Operations


### Question 1: Filter and Select

Filter `df_employees` to show only employees from the 'Sales' department, and select only the columns: `name`, `age`, and `salary`.


In [0]:
# Your solution here
df_employees.filter(df_employees.department == 'Sales') \
            .select('name', 'age', 'salary') \
            .show()

+-------+---+------+
|   name|age|salary|
+-------+---+------+
|  Alice| 25| 50000|
|Charlie| 35| 70000|
|  Frank| 27| 52000|
|    Ivy| 26| 51000|
+-------+---+------+



### Question 2: Sort Data

Sort `df_employees` by `salary` in descending order and show the top 5 employees.


In [0]:
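# Your solution here
df_employees.orderBy(df_employees.salary.desc()) \
            .limit(5) \
            .show()

+------+-------+---+----------+------+-------+----------+
|emp_id|   name|age|department|salary|   city| hire_date|
+------+-------+---+----------+------+-------+----------+
|    10|   Jack| 33|   Finance| 75000|     LA|2017-05-30|
|     3|Charlie| 35|     Sales| 70000|Chicago|2018-06-10|
|     5|    Eve| 32|        HR| 65000|Houston|2019-11-05|
|     8|  Henry| 31|        HR| 62000|    NYC|2018-12-01|
|     2|    Bob| 30|        IT| 60000|     LA|2019-03-20|
+------+-------+---+----------+------+-------+----------+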



### Question 3: Add Calculated Column

Add a new column `annual_bonus` to `df_employees` that is 10% of the salary. Display the result with columns: `name`, `salary`, and `annual_bonus`.


In [0]:
# Your solution here
df_employees.withColumn('annual_bonus', df_employees.salary * 0.1) \
            .select('name', 'salary', 'annual_bonus') \
            .show()

+-------+------+------------+
|   name|salary|annual_bonus|
+-------+------+------------+
|  Alice| 50000|      5000.0|
|    Bob| 60000|      6000.0|
|Charlie| 70000|      7000.0|
|  Diana| 55000|      5500.0|
|    Eve| 65000|      6500.0|
|  Frank| 52000|      5200.0|
|  Grace| 58000|      5800.0|
|  Henry| 62000|      6200.0|
|    Ivy| 51000|      5100.0|
|   Jack| 75000|      7500.0|
+-------+------+------------+



### Question 4: Conditional Logic

Create a new column `salary_category` in `df_employees` that categorizes salaries as:
- "High" if salary >= 65000
- "Medium" if salary >= 55000 and < 65000
- "Low" if salary < 55000

Show `name`, `salary`, and `salary_category`.


In [0]:
# Your solution here
df_employees.withColumn(
                'salary_category',
                when(df_employees.salary >= 65000, 'High')
                .when((df_employees.salary >= 55000) & (df_employees.salary < 65000), 'Medium')
                .otherwise('Low')
            ) \
            .select('name', 'salary', 'salary_category') \
            .show()

+-------+------+---------------+
|   name|salary|salary_category|
+-------+------+---------------+
|  Alice| 50000|            Low|
|    Bob| 60000|         Medium|
|Charlie| 70000|           High|
|  Diana| 55000|         Medium|
|    Eve| 65000|           High|
|  Frank| 52000|            Low|
|  Grace| 58000|         Medium|
|  Henry| 62000|         Medium|
|    Ivy| 51000|            Low|
|   Jack| 75000|           High|
+-------+------+---------------+



### Question 5: Remove Duplicates and Null Handling

Filter `df_employees` to remove any rows where `age` is null, then remove duplicate rows based on all columns. Count the total number of rows remaining.


In [0]:
# Your solution here
df_employees.dropna(subset=['age']) \
            .dropDuplicates() \
            .count()

10

---

### Questions 6-10: Aggregations and GroupBy


### Question 6: Basic Aggregation

Calculate the average salary for each department in `df_employees`. Show department and average salary, sorted by average salary in descending order.


In [0]:
# Your solution here
df_avg_salary = df_employees.groupBy('department').agg(
    avg('salary').alias('avg_salary')
)


df_avg_salary_sorted = df_avg_salary.orderBy(col('avg_salary'), ascending=False)


df_avg_salary_sorted.show()

+----------+------------------+
|department|        avg_salary|
+----------+------------------+
|   Finance|           75000.0|
|        HR|           63500.0|
|        IT|57666.666666666664|
|     Sales|           55750.0|
+----------+------------------+



### Question 7: Multiple Aggregations

For each department, calculate:
- Total number of employees
- Average salary
- Maximum salary
- Minimum salary

Display the results sorted by department name.


In [0]:
# Your solution here
df_department_stats = df_employees.groupBy('department').agg(
    count('name').alias('total_employees'),
    avg('salary').alias('avg_salary'),     
    max('salary').alias('max_salary'),     
    min('salary').alias('min_salary')      
)


df_sorted_stats = df_department_stats.orderBy('department')


df_sorted_stats.show()

+----------+---------------+------------------+----------+----------+
|department|total_employees|        avg_salary|max_salary|min_salary|
+----------+---------------+------------------+----------+----------+
|   Finance|              1|           75000.0|     75000|     75000|
|        HR|              2|           63500.0|     65000|     62000|
|        IT|              3|57666.666666666664|     60000|     55000|
|     Sales|              4|           55750.0|     70000|     50000|
+----------+---------------+------------------+----------+----------+



### Question 8: GroupBy with Filter

Find the total sales amount (`amount`) for each employee (`emp_id`) in `df_sales`, but only include employees who have total sales greater than 2000. Show `emp_id` and total sales amount.


In [0]:
# Your solution here
df_sales_total = df_sales.groupBy('emp_id').agg(
    sum('amount').alias('total_sales')
)


df_sales_filtered = df_sales_total.filter(df_sales_total.total_sales > 2000)


df_sales_filtered.select('emp_id', 'total_sales').show()

+------+-----------+
|emp_id|total_sales|
+------+-----------+
|     1|       4300|
|     2|       4200|
|     3|       2500|
+------+-----------+



### Question 9: Count Distinct

Count the number of distinct cities where employees work in `df_employees`. Also, for each city, count how many employees work there.


In [0]:
# Your solution here
from pyspark.sql.functions import countDistinct

# Part 1: number of distinct cities across all employees
df_employees.agg(countDistinct('city').alias('distinct_city_count')).show()

# Part 2: number of employees in each city
df_city_count = df_employees.groupBy('city').agg(
    count('emp_id').alias('employee_count')
)

df_city_count.show()

+-------------------+
|distinct_city_count|
+-------------------+
|                  4|
+-------------------+

+-------+--------------+
|   city|employee_count|
+-------+--------------+
|    NYC|             3|
|     LA|             3|
|Chicago|             2|
|Houston|             2|
+-------+--------------+



### Question 10: Aggregation with Conditions

Calculate the average age of employees for each department, but only include employees who are 30 years or older in the calculation.


In [0]:
# Your solution here
df_filtered = df_employees.filter(df_employees.age >= 30)


df_avg_age = df_filtered.groupBy('department').agg(
    avg('age').alias('avg_age')
)


df_avg_age.show()

+----------+-------+
|department|avg_age|
+----------+-------+
|        IT|   30.0|
|     Sales|   35.0|
|        HR|   31.5|
|   Finance|   33.0|
+----------+-------+



---

### Questions 11-15: Spark SQL


### Question 11: Create Temporary View and Query

Create a temporary view from `df_employees` called `employees_view` and write a SQL query to find all employees in the 'IT' department with salary greater than 55000. Show `name`, `age`, and `salary`.


In [0]:
df_employees.createOrReplaceTempView('employees_view')

In [0]:
%sql
SELECT name, age, salary
FROM employees_view
WHERE department = 'IT' AND salary > 55000

name,age,salary
Bob,30,60000
Grace,29,58000


In [0]:
# Your solution here
df_employees.createOrReplaceTempView('employees_view')

query = """
SELECT name, age, salary
FROM employees_view
WHERE department = 'IT' AND salary > 55000
"""

result = spark.sql(query)
result.show()


+-----+---+------+
| name|age|salary|
+-----+---+------+
|  Bob| 30| 60000|
|Grace| 29| 58000|
+-----+---+------+



### Question 12: SQL Aggregation

Using Spark SQL, write a query to find the department with the highest total salary. Show the department name and total salary.


In [0]:
# Your solution here
df_employees.createOrReplaceTempView('employees_view')

query = """
SELECT department, SUM(salary) AS total_salary
FROM employees_view
GROUP BY department
ORDER BY total_salary DESC
LIMIT 1
"""

result = spark.sql(query)
result.show()


+----------+------------+
|department|total_salary|
+----------+------------+
|     Sales|      223000|
+----------+------------+



### Question 13: SQL with CASE Statement

Using Spark SQL, create a query that shows `name`, `salary`, and a new column `salary_band`:
- 'A' for salary >= 70000
- 'B' for salary >= 60000 and < 70000
- 'C' for salary < 60000


In [0]:
# Your solution here
df_employees.createOrReplaceTempView('employees_view')

query = """
SELECT name, salary,
    CASE 
        WHEN salary >= 70000 THEN 'A'
        WHEN salary >= 60000 AND salary < 70000 THEN 'B'
        ELSE 'C'
    END AS salary_band
FROM employees_view
"""

result = spark.sql(query)
result.show()


+-------+------+-----------+
|   name|salary|salary_band|
+-------+------+-----------+
|  Alice| 50000|          C|
|    Bob| 60000|          B|
|Charlie| 70000|          A|
|  Diana| 55000|          C|
|    Eve| 65000|          B|
|  Frank| 52000|          C|
|  Grace| 58000|          C|
|  Henry| 62000|          B|
|    Ivy| 51000|          C|
|   Jack| 75000|          A|
+-------+------+-----------+



### Question 14: SQL Subquery

Using Spark SQL, find all employees whose salary is greater than the average salary of all employees. Show `name`, `department`, and `salary`.


In [0]:
# Your solution here
df_employees.createOrReplaceTempView('employees_view')

query = """
SELECT name, department, salary
FROM employees_view
WHERE salary > (SELECT AVG(salary) FROM employees_view)
"""

result = spark.sql(query)
result.show()


+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|    Bob|        IT| 60000|
|Charlie|     Sales| 70000|
|    Eve|        HR| 65000|
|  Henry|        HR| 62000|
|   Jack|   Finance| 75000|
+-------+----------+------+



### Question 15: SQL Window Function

Using Spark SQL, rank employees within each department by their salary (highest salary gets rank 1). Show `name`, `department`, `salary`, and `rank`.


In [0]:
# Your solution here
# from pyspark.sql.window import Window

df_employees.createOrReplaceTempView('employees_view')

query = """
SELECT name, department, salary,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees_view
"""

result = spark.sql(query)
result.show()


+-------+----------+------+----+
|   name|department|salary|rank|
+-------+----------+------+----+
|   Jack|   Finance| 75000|   1|
|    Eve|        HR| 65000|   1|
|  Henry|        HR| 62000|   2|
|    Bob|        IT| 60000|   1|
|  Grace|        IT| 58000|   2|
|  Diana|        IT| 55000|   3|
|Charlie|     Sales| 70000|   1|
|  Frank|     Sales| 52000|   2|
|    Ivy|     Sales| 51000|   3|
|  Alice|     Sales| 50000|   4|
+-------+----------+------+----+



---

### Questions 16-20: Joins


### Question 16: Inner Join

Perform an inner join between `df_employees` and `df_departments` on `department` = `dept_name`. Show `name`, `department`, `salary`, and `manager`.


In [0]:
# Your solution here
df_employees.join(df_departments, df_employees.department == df_departments.dept_name, "inner") \
    .select(df_employees.name, df_employees.department, df_employees.salary, df_departments.manager) \
    .show()


+-------+----------+------+-------+
|   name|department|salary|manager|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|   John|
|    Bob|        IT| 60000|  Sarah|
|Charlie|     Sales| 70000|   John|
|  Diana|        IT| 55000|  Sarah|
|    Eve|        HR| 65000|   Mike|
|  Frank|     Sales| 52000|   John|
|  Grace|        IT| 58000|  Sarah|
|  Henry|        HR| 62000|   Mike|
|    Ivy|     Sales| 51000|   John|
|   Jack|   Finance| 75000|   Lisa|
+-------+----------+------+-------+



### Question 17: Left Join

Perform a left join between `df_employees` and `df_departments` on `department` = `dept_name`. This will show all employees even if their department doesn't exist in the departments table. Show `name`, `department`, and `manager`.


In [0]:
# Your solution here
df_employees.join(df_departments, df_employees.department == df_departments.dept_name, "left") \
    .select(df_employees.name, df_employees.department, df_departments.manager) \
    .show()


+-------+----------+-------+
|   name|department|manager|
+-------+----------+-------+
|  Alice|     Sales|   John|
|    Bob|        IT|  Sarah|
|Charlie|     Sales|   John|
|  Diana|        IT|  Sarah|
|    Eve|        HR|   Mike|
|  Frank|     Sales|   John|
|  Grace|        IT|  Sarah|
|  Henry|        HR|   Mike|
|    Ivy|     Sales|   John|
|   Jack|   Finance|   Lisa|
+-------+----------+-------+



### Question 18: Multiple Table Join

Join `df_employees`, `df_sales`, and `df_products` to show:
- Employee name
- Sale date
- Sale amount
- Product name
- Product category

Use appropriate join types to include all sales records.


In [0]:
# Left joins starting from df_sales keep every sale, even one without a matching employee or product
df_sales.join(df_employees, df_sales.emp_id == df_employees.emp_id, "left") \
    .join(df_products, df_sales.product == df_products.product_name, "left") \
    .select(df_employees.name, df_sales.sale_date, df_sales.amount.alias("sale_amount"), df_products.product_name, df_products.category.alias("product_category")) \
    .show()


+-------+----------+-----------+------------+----------------+
|   name| sale_date|sale_amount|product_name|product_category|
+-------+----------+-----------+------------+----------------+
|  Alice|2024-01-15|       1000|   Product A|     Electronics|
|  Alice|2024-02-20|       1500|   Product B|        Clothing|
|    Bob|2024-01-10|       2000|   Product A|     Electronics|
|Charlie|2024-02-05|       1200|   Product C|     Electronics|
|  Alice|2024-03-12|       1800|   Product A|     Electronics|
|  Diana|2024-01-25|        900|   Product B|        Clothing|
|    Bob|2024-02-28|       2200|   Product C|     Electronics|
|    Eve|2024-03-01|       1100|   Product A|     Electronics|
|Charlie|2024-03-15|       1300|   Product B|        Clothing|
+-------+----------+-----------+------------+----------------+



### Question 19: Left Semi Join

Use a left semi join to find all employees from `df_employees` who have made at least one sale (exist in `df_sales`). Show only the employee information: `name`, `department`, and `salary`.


In [0]:
# Your solution here
df_employees.join(df_sales, df_employees.emp_id == df_sales.emp_id, "left_semi") \
    .select(df_employees.name, df_employees.department, df_employees.salary) \
    .show()


+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|  Alice|     Sales| 50000|
|    Bob|        IT| 60000|
|Charlie|     Sales| 70000|
|  Diana|        IT| 55000|
|    Eve|        HR| 65000|
+-------+----------+------+



### Question 20: Anti Join

Use an anti join to find all employees from `df_employees` who have NOT made any sales (do not exist in `df_sales`). Show `name`, `department`, and `salary`.


In [0]:
# Your solution here
df_employees.join(df_sales, df_employees.emp_id == df_sales.emp_id, "left_anti") \
    .select(df_employees.name, df_employees.department, df_employees.salary) \
    .show()


+-----+----------+------+
| name|department|salary|
+-----+----------+------+
|Frank|     Sales| 52000|
|Grace|        IT| 58000|
|Henry|        HR| 62000|
|  Ivy|     Sales| 51000|
| Jack|   Finance| 75000|
+-----+----------+------+



---

### Questions 21-25: Window Functions


### Question 21: Row Number

Use a window function to assign row numbers to employees within each department, ordered by salary in descending order. Show `name`, `department`, `salary`, and `row_number`.


In [0]:
from pyspark.sql import functions
from pyspark.sql.window import Window

window_spec = Window.partitionBy(df_employees.department).orderBy(functions.col("salary").desc())

df_employees.withColumn("row_number", functions.row_number().over(window_spec)) \
    .select("name", "department", "salary", "row_number") \
    .show()


+-------+----------+------+----------+
|   name|department|salary|row_number|
+-------+----------+------+----------+
|   Jack|   Finance| 75000|         1|
|    Eve|        HR| 65000|         1|
|  Henry|        HR| 62000|         2|
|    Bob|        IT| 60000|         1|
|  Grace|        IT| 58000|         2|
|  Diana|        IT| 55000|         3|
|Charlie|     Sales| 70000|         1|
|  Frank|     Sales| 52000|         2|
|    Ivy|     Sales| 51000|         3|
|  Alice|     Sales| 50000|         4|
+-------+----------+------+----------+



### Question 22: Rank and Dense Rank

Calculate both `rank` and `dense_rank` for employees within each department based on salary. Show `name`, `department`, `salary`, `rank`, and `dense_rank`. Notice the difference between rank and dense_rank when there are ties.


In [0]:
# Your solution here
from pyspark.sql import functions
from pyspark.sql.window import Window

window_spec = Window.partitionBy(df_employees.department).orderBy(functions.col("salary").desc())

df_employees.withColumn("rank", functions.rank().over(window_spec)) \
    .withColumn("dense_rank", functions.dense_rank().over(window_spec)) \
    .select("name", "department", "salary", "rank", "dense_rank") \
    .show()


+-------+----------+------+----+----------+
|   name|department|salary|rank|dense_rank|
+-------+----------+------+----+----------+
|   Jack|   Finance| 75000|   1|         1|
|    Eve|        HR| 65000|   1|         1|
|  Henry|        HR| 62000|   2|         2|
|    Bob|        IT| 60000|   1|         1|
|  Grace|        IT| 58000|   2|         2|
|  Diana|        IT| 55000|   3|         3|
|Charlie|     Sales| 70000|   1|         1|
|  Frank|     Sales| 52000|   2|         2|
|    Ivy|     Sales| 51000|   3|         3|
|  Alice|     Sales| 50000|   4|         4|
+-------+----------+------+----+----------+



### Question 23: Running Total

Calculate a running total of sales amounts for each employee in `df_sales`, ordered by `sale_date`. Show `emp_id`, `sale_date`, `amount`, and `running_total`.


In [0]:
# Your solution here
from pyspark.sql.window import Window



window_spec = Window.partitionBy("emp_id").orderBy("sale_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)


df_sales.withColumn("running_total", sum("amount").over(window_spec)) \
    .select("emp_id", "sale_date", "amount", "running_total") \
    .show()


+------+----------+------+-------------+
|emp_id| sale_date|amount|running_total|
+------+----------+------+-------------+
|     1|2024-01-15|  1000|         1000|
|     1|2024-02-20|  1500|         2500|
|     1|2024-03-12|  1800|         4300|
|     2|2024-01-10|  2000|         2000|
|     2|2024-02-28|  2200|         4200|
|     3|2024-02-05|  1200|         1200|
|     3|2024-03-15|  1300|         2500|
|     4|2024-01-25|   900|          900|
|     5|2024-03-01|  1100|         1100|
+------+----------+------+-------------+



### Question 24: Lead and Lag

For each employee's sales in `df_sales`, show:
- Current sale amount
- Previous sale amount (lag)
- Next sale amount (lead)

Order by `emp_id` and `sale_date`. Show `emp_id`, `sale_date`, `amount`, `prev_amount`, and `next_amount`.


In [0]:
# Your solution here
from pyspark.sql.window import Window

window_spec = Window.partitionBy("emp_id").orderBy("sale_date")

df_sales.withColumn("prev_amount", lag("amount").over(window_spec)) \
    .withColumn("next_amount", lead("amount").over(window_spec)) \
    .select("emp_id", "sale_date", "amount", "prev_amount", "next_amount") \
    .show()


+------+----------+------+-----------+-----------+
|emp_id| sale_date|amount|prev_amount|next_amount|
+------+----------+------+-----------+-----------+
|     1|2024-01-15|  1000|       NULL|       1500|
|     1|2024-02-20|  1500|       1000|       1800|
|     1|2024-03-12|  1800|       1500|       NULL|
|     2|2024-01-10|  2000|       NULL|       2200|
|     2|2024-02-28|  2200|       2000|       NULL|
|     3|2024-02-05|  1200|       NULL|       1300|
|     3|2024-03-15|  1300|       1200|       NULL|
|     4|2024-01-25|   900|       NULL|       NULL|
|     5|2024-03-01|  1100|       NULL|       NULL|
+------+----------+------+-----------+-----------+



### Question 25: Window Aggregation

For each sale in `df_sales`, calculate:
- Average sale amount for the same employee
- Maximum sale amount for the same employee
- Minimum sale amount for the same employee

Show `emp_id`, `sale_date`, `amount`, `avg_amount`, `max_amount`, and `min_amount`.


In [0]:
# Your solution here
window_spec = Window.partitionBy("emp_id")

df_sales.withColumn("avg_amount", avg("amount").over(window_spec)) \
    .withColumn("max_amount", max("amount").over(window_spec)) \
    .withColumn("min_amount", min("amount").over(window_spec)) \
    .select("emp_id", "sale_date", "amount", "avg_amount", "max_amount", "min_amount") \
    .show()


+------+----------+------+------------------+----------+----------+
|emp_id| sale_date|amount|        avg_amount|max_amount|min_amount|
+------+----------+------+------------------+----------+----------+
|     1|2024-01-15|  1000|1433.3333333333333|      1800|      1000|
|     1|2024-02-20|  1500|1433.3333333333333|      1800|      1000|
|     1|2024-03-12|  1800|1433.3333333333333|      1800|      1000|
|     2|2024-01-10|  2000|            2100.0|      2200|      2000|
|     2|2024-02-28|  2200|            2100.0|      2200|      2000|
|     3|2024-02-05|  1200|            1250.0|      1300|      1200|
|     3|2024-03-15|  1300|            1250.0|      1300|      1200|
|     4|2024-01-25|   900|             900.0|       900|       900|
|     5|2024-03-01|  1100|            1100.0|      1100|      1100|
+------+----------+------+------------------+----------+----------+



---

### Questions 26-30: Advanced Topics


### Question 26: Pivot Operation

Pivot `df_sales` to show total sales amount for each employee (`emp_id`) by product. The result should have columns: `emp_id`, `Product A`, `Product B`, `Product C` (and `Product D` if applicable).


In [0]:
# Your solution here
df_sales.groupBy("emp_id") \
    .pivot("product") \
    .agg(sum("amount")) \
    .show()


+------+---------+---------+---------+
|emp_id|Product A|Product B|Product C|
+------+---------+---------+---------+
|     1|     2800|     1500|     NULL|
|     2|     2000|     NULL|     2200|
|     3|     NULL|     1300|     1200|
|     4|     NULL|      900|     NULL|
|     5|     1100|     NULL|     NULL|
+------+---------+---------+---------+



### Question 27: Union Operation

Create two DataFrames:
1. Employees from 'Sales' department
2. Employees from 'IT' department

Union them together and show the result with columns: `name`, `department`, `salary`.


In [0]:
# Your solution here
df_sales_employees = df_employees.filter(df_employees.department == "Sales")
df_it_employees = df_employees.filter(df_employees.department == "IT")

df_sales_employees.select("name", "department", "salary") \
    .union(df_it_employees.select("name", "department", "salary")) \
    .show()


+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|  Alice|     Sales| 50000|
|Charlie|     Sales| 70000|
|  Frank|     Sales| 52000|
|    Ivy|     Sales| 51000|
|    Bob|        IT| 60000|
|  Diana|        IT| 55000|
|  Grace|        IT| 58000|
+-------+----------+------+



### Question 28: Complex Aggregation with Multiple Conditions

For each department in `df_employees`, calculate:
- Total number of employees
- Number of employees with salary > 60000
- Average salary for employees with salary > 60000
- Average salary for all employees

Show all results in a single query.


In [0]:
# Your solution here
df_employees.groupBy("department") \
    .agg(
        count("*").alias("total_employees"),
        count(when(df_employees.salary > 60000, 1)).alias("high_salary_employees"),
        avg(when(df_employees.salary > 60000, df_employees.salary)).alias("avg_high_salary"),
        avg(df_employees.salary).alias("avg_salary")
    ) \
    .show()


+----------+---------------+---------------------+---------------+------------------+
|department|total_employees|high_salary_employees|avg_high_salary|        avg_salary|
+----------+---------------+---------------------+---------------+------------------+
|     Sales|              4|                    1|        70000.0|           55750.0|
|        IT|              3|                    0|           NULL|57666.666666666664|
|        HR|              2|                    2|        63500.0|           63500.0|
|   Finance|              1|                    1|        75000.0|           75000.0|
+----------+---------------+---------------------+---------------+------------------+



### Question 29: Reading and Writing Data (Databricks)

**In Databricks:**
1. Write `df_employees` to a Parquet file in Volumes at path `/Volumes/workspace/default/databricks_practice/employees/`
2. Read the data back from that path into a new DataFrame
3. Verify by showing the first 5 rows

**Note:**
- In Databricks, use the Volumes path format: `/Volumes/<catalog_name>/<schema_name>/<volume_name>/<path>`
- For local testing, use a local path like `./data/output/employees/`


In [0]:
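# Your solution here
# The volume must already exist; adjust catalog, schema, and volume names to your workspace
output_path = "/Volumes/workspace/default/databricks_practice/employees/"

df_employees.write.mode("overwrite").parquet(output_path)
df_employees_parquet = spark.read.parquet(output_path)
display(df_employees_parquet.limit(5))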

### Question 30: Complete ETL Pipeline

Create a complete ETL pipeline that:
1. **Extract**: Join `df_employees` and `df_sales` to get employee sales data
2. **Transform**: 
   - Calculate total sales per employee
   - Add a column `performance` that is "Excellent" if total sales > 3000, "Good" if > 2000, else "Average"
   - Join with `df_employees` to get employee details
3. **Load**: Select and display the final result with columns: `name`, `department`, `total_sales`, `performance`

Chain all operations together.


In [0]:
# Your solution here
df_employee_sales = df_employees.join(df_sales, df_employees.emp_id == df_sales.emp_id, "inner")

df_sales_total = df_employee_sales.groupBy(df_employees.emp_id, df_employees.name, df_employees.department) \
    .agg(sum("amount").alias("total_sales"))

df_sales_total = df_sales_total.withColumn(
    "performance",
    when(df_sales_total.total_sales > 3000, "Excellent")
    .when(df_sales_total.total_sales > 2000, "Good")
    .otherwise("Average")
)

df_final_result = df_sales_total.select("name", "department", "total_sales", "performance")

df_final_result.show()


+-------+----------+-----------+-----------+
|   name|department|total_sales|performance|
+-------+----------+-----------+-----------+
|  Alice|     Sales|       4300|  Excellent|
|    Bob|        IT|       4200|  Excellent|
|Charlie|     Sales|       2500|       Good|
|  Diana|        IT|        900|    Average|
|    Eve|        HR|       1100|    Average|
+-------+----------+-----------+-----------+



---

## Additional Challenges (Optional)

If you've completed all 30 questions, try these advanced challenges:

1. **Performance Optimization**: Repartition `df_employees` by `department` and cache it. Measure the performance improvement.

2. **Complex Window Function**: Calculate the 3-month moving average of sales for each employee.

3. **Broadcast Join**: Use broadcast join hint for joining `df_employees` with `df_departments` (assuming departments is small).

4. **Date Operations**: Convert `hire_date` in `df_employees` to DateType and calculate the number of years each employee has been with the company.

5. **Array Operations**: Create an array column containing all cities where each department has employees, then explode it.

---

## Summary

Congratulations on completing the practice questions! These exercises covered:

✅ Basic DataFrame operations (filter, select, sort)

✅ Aggregations and GroupBy

✅ Spark SQL queries

✅ Various join types

✅ Window functions

✅ Advanced transformations

✅ ETL pipeline creation

Keep practicing and refer back to the PySpark module notebooks for detailed explanations!
