# Module 07b - Advanced Operations - Window Functions & UDFs - Exercises

## Instructions

This notebook contains exercises based on the concepts learned in Module 07b.

- Complete each exercise in the provided code cells
- Run the data setup cells first to generate/create necessary data
- Test your solutions by running the verification cells (if provided)
- Refer back to the main module notebook if you need help


## Data Setup

Run the cells below to set up the data needed for the exercises.


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col, sum, avg, rank, dense_rank, row_number, lead, lag, max, min, count
from pyspark.sql.window import Window
import os

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 07b Exercises") \
    .master("local[*]") \
    .getOrCreate()

# Set data directory
data_dir = "../data"
os.makedirs(data_dir, exist_ok=True)
print("SparkSession created successfully!")

# Create sample DataFrame with time series data
data = [
    ("Alice", "Sales", 50000, "2024-01"),
    ("Bob", "IT", 60000, "2024-01"),
    ("Charlie", "Sales", 70000, "2024-01"),
    ("Diana", "IT", 55000, "2024-01"),
    ("Eve", "HR", 65000, "2024-01"),
    ("Alice", "Sales", 52000, "2024-02"),
    ("Bob", "IT", 61000, "2024-02"),
    ("Charlie", "Sales", 72000, "2024-02"),
    ("Diana", "IT", 56000, "2024-02"),
    ("Eve", "HR", 66000, "2024-02"),
    ("Alice", "Sales", 53000, "2024-03"),
    ("Bob", "IT", 62000, "2024-03"),
    ("Charlie", "Sales", 73000, "2024-03"),
    ("Diana", "IT", 57000, "2024-03"),
    ("Eve", "HR", 67000, "2024-03")
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Month", StringType(), True)
])

df = spark.createDataFrame(data, schema)

print("Sample DataFrame:")
df.show()

SparkSession created successfully!
Sample DataFrame:
+-------+----------+------+-------+
|   Name|Department|Salary|  Month|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|2024-01|
|    Bob|        IT| 60000|2024-01|
|Charlie|     Sales| 70000|2024-01|
|  Diana|        IT| 55000|2024-01|
|    Eve|        HR| 65000|2024-01|
|  Alice|     Sales| 52000|2024-02|
|    Bob|        IT| 61000|2024-02|
|Charlie|     Sales| 72000|2024-02|
|  Diana|        IT| 56000|2024-02|
|    Eve|        HR| 66000|2024-02|
|  Alice|     Sales| 53000|2024-03|
|    Bob|        IT| 62000|2024-03|
|Charlie|     Sales| 73000|2024-03|
|  Diana|        IT| 57000|2024-03|
|    Eve|        HR| 67000|2024-03|
+-------+----------+------+-------+



### Exercise 1: ROW_NUMBER Window Function

Complete the operation described in the code comments.

In [13]:
from pyspark.sql.functions import rank, dense_rank, row_number
from pyspark.sql.window import Window

# Use ROW_NUMBER() to rank employees within each department by salary
window_aspec = Window.partitionBy("Department").orderBy("Salary")
result_1 = df.withColumn("Rank",row_number().over(window_aspec))
result_1.show()

+-------+----------+------+-------+----+
|   Name|Department|Salary|  Month|Rank|
+-------+----------+------+-------+----+
|    Eve|        HR| 65000|2024-01|   1|
|    Eve|        HR| 66000|2024-02|   2|
|    Eve|        HR| 67000|2024-03|   3|
|  Diana|        IT| 55000|2024-01|   1|
|  Diana|        IT| 56000|2024-02|   2|
|  Diana|        IT| 57000|2024-03|   3|
|    Bob|        IT| 60000|2024-01|   4|
|    Bob|        IT| 61000|2024-02|   5|
|    Bob|        IT| 62000|2024-03|   6|
|  Alice|     Sales| 50000|2024-01|   1|
|  Alice|     Sales| 52000|2024-02|   2|
|  Alice|     Sales| 53000|2024-03|   3|
|Charlie|     Sales| 70000|2024-01|   4|
|Charlie|     Sales| 72000|2024-02|   5|
|Charlie|     Sales| 73000|2024-03|   6|
+-------+----------+------+-------+----+



### Exercise 2: RANK Window Function

Complete the operation described in the code comments.

In [14]:
# Your code here
# Use RANK() to rank employees within each department by salary
window_aspec = Window.partitionBy("Department").orderBy("Salary")
result_2 = df.withColumn("Rank",rank().over(window_aspec))
result_2.show()

+-------+----------+------+-------+----+
|   Name|Department|Salary|  Month|Rank|
+-------+----------+------+-------+----+
|    Eve|        HR| 65000|2024-01|   1|
|    Eve|        HR| 66000|2024-02|   2|
|    Eve|        HR| 67000|2024-03|   3|
|  Diana|        IT| 55000|2024-01|   1|
|  Diana|        IT| 56000|2024-02|   2|
|  Diana|        IT| 57000|2024-03|   3|
|    Bob|        IT| 60000|2024-01|   4|
|    Bob|        IT| 61000|2024-02|   5|
|    Bob|        IT| 62000|2024-03|   6|
|  Alice|     Sales| 50000|2024-01|   1|
|  Alice|     Sales| 52000|2024-02|   2|
|  Alice|     Sales| 53000|2024-03|   3|
|Charlie|     Sales| 70000|2024-01|   4|
|Charlie|     Sales| 72000|2024-02|   5|
|Charlie|     Sales| 73000|2024-03|   6|
+-------+----------+------+-------+----+



### Exercise 3: DENSE_RANK Window Function

Complete the operation described in the code comments.

In [15]:
# Your code here
# Use DENSE_RANK() to rank employees within each department by salary
window_aspec = Window.partitionBy("Department").orderBy("Salary")
result_3 = df.withColumn("Rank",dense_rank().over(window_aspec))
result_3.show()

+-------+----------+------+-------+----+
|   Name|Department|Salary|  Month|Rank|
+-------+----------+------+-------+----+
|    Eve|        HR| 65000|2024-01|   1|
|    Eve|        HR| 66000|2024-02|   2|
|    Eve|        HR| 67000|2024-03|   3|
|  Diana|        IT| 55000|2024-01|   1|
|  Diana|        IT| 56000|2024-02|   2|
|  Diana|        IT| 57000|2024-03|   3|
|    Bob|        IT| 60000|2024-01|   4|
|    Bob|        IT| 61000|2024-02|   5|
|    Bob|        IT| 62000|2024-03|   6|
|  Alice|     Sales| 50000|2024-01|   1|
|  Alice|     Sales| 52000|2024-02|   2|
|  Alice|     Sales| 53000|2024-03|   3|
|Charlie|     Sales| 70000|2024-01|   4|
|Charlie|     Sales| 72000|2024-02|   5|
|Charlie|     Sales| 73000|2024-03|   6|
+-------+----------+------+-------+----+
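
The sample data has no tied salaries within a department, so ROW_NUMBER, RANK, and DENSE_RANK produce identical columns in Exercises 1 to 3. The plain-Python sketch below models how the three functions differ once a tie appears in one ordered partition. It is not Spark code, and the `ranking_columns` helper and the salary list are hypothetical, made up for illustration.

```python
# Plain-Python model of ROW_NUMBER, RANK and DENSE_RANK over one ordered partition.
# The salaries are hypothetical and contain a tie on purpose.
def ranking_columns(values):
    """Return (row_number, rank, dense_rank) lists for values sorted ascending."""
    ordered = sorted(values)
    row_numbers, ranks, dense_ranks = [], [], []
    for position, value in enumerate(ordered, start=1):
        row_numbers.append(position)  # always unique, even for ties
        if position > 1 and value == ordered[position - 2]:
            # Tie with the previous row: both rank flavours repeat the last value
            ranks.append(ranks[-1])
            dense_ranks.append(dense_ranks[-1])
        else:
            ranks.append(position)  # RANK leaves a gap after a tie
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)  # no gap
    return row_numbers, ranks, dense_ranks

salaries = [50000, 60000, 60000, 70000]
row_numbers, ranks, dense_ranks = ranking_columns(salaries)
print(row_numbers)  # [1, 2, 3, 4]
print(ranks)        # [1, 2, 2, 4]
print(dense_ranks)  # [1, 2, 2, 3]
```

To see the same effect in Spark, one option is to add a row with a duplicate salary to the sample data and rerun the three exercises.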



### Exercise 4: Running Total

Complete the operation described in the code comments.

In [16]:
# Your code here
# Calculate running total of salary within each department ordered by month
window_aspec = Window.partitionBy("Department").orderBy("Month")
result_4 = df.withColumn("RunningTotal",sum("salary").over(window_aspec))
result_4.show()

+-------+----------+------+-------+------------+
|   Name|Department|Salary|  Month|RunningTotal|
+-------+----------+------+-------+------------+
|    Eve|        HR| 65000|2024-01|       65000|
|    Eve|        HR| 66000|2024-02|      131000|
|    Eve|        HR| 67000|2024-03|      198000|
|    Bob|        IT| 60000|2024-01|      115000|
|  Diana|        IT| 55000|2024-01|      115000|
|    Bob|        IT| 61000|2024-02|      232000|
|  Diana|        IT| 56000|2024-02|      232000|
|    Bob|        IT| 62000|2024-03|      351000|
|  Diana|        IT| 57000|2024-03|      351000|
|  Alice|     Sales| 50000|2024-01|      120000|
|Charlie|     Sales| 70000|2024-01|      120000|
|  Alice|     Sales| 52000|2024-02|      244000|
|Charlie|     Sales| 72000|2024-02|      244000|
|  Alice|     Sales| 53000|2024-03|      370000|
|Charlie|     Sales| 73000|2024-03|      370000|
+-------+----------+------+-------+------------+
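
In the output above, rows that share a month (for example Bob and Diana in 2024-01) show the same running total. That is consistent with Spark's default frame for an ordered window, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes every row tied on the ordering value. The plain-Python sketch below contrasts that with a row-based frame; the two helper functions are hypothetical and only model the idea.

```python
# (month, salary) rows for the IT department, taken from the sample data
it_rows = [
    ("2024-01", 60000), ("2024-01", 55000),
    ("2024-02", 61000), ("2024-02", 56000),
    ("2024-03", 62000), ("2024-03", 57000),
]

def running_total_rows(rows):
    """ROWS-style frame: each row adds only itself to the total so far."""
    total = 0
    totals = []
    for _month, salary in rows:
        total += salary
        totals.append(total)
    return totals

def running_total_range(rows):
    """RANGE-style frame: each row sees every row whose month is <= its own."""
    totals = []
    for month, _salary in rows:
        total = 0
        for other_month, other_salary in rows:
            if other_month <= month:
                total += other_salary
        totals.append(total)
    return totals

range_totals = running_total_range(it_rows)
rows_totals = running_total_rows(it_rows)
print(range_totals)  # [115000, 115000, 232000, 232000, 351000, 351000]
print(rows_totals)   # [60000, 115000, 176000, 232000, 294000, 351000]
```

In PySpark, appending `.rowsBetween(Window.unboundedPreceding, Window.currentRow)` to the window spec requests the row-based frame. With tied months, the order among the tied rows is still not guaranteed unless a tie-breaking column is added to `orderBy`.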



### Exercise 5: LEAD Function

Complete the operation described in the code comments.

In [18]:
# Your code here
# Use LEAD() to get next month's salary for each employee
# Partition by Name so LEAD reads the same employee's next month,
# not whichever row happens to follow in the global ordering
window_aspec = Window.partitionBy("Name").orderBy("Month")
result_5 = df.withColumn("NextMonthSalary",lead("salary",1).over(window_aspec))
result_5.show()

+-------+----------+------+-------+---------------+
|   Name|Department|Salary|  Month|NextMonthSalary|
+-------+----------+------+-------+---------------+
|  Alice|     Sales| 50000|2024-01|          52000|
|  Alice|     Sales| 52000|2024-02|          53000|
|  Alice|     Sales| 53000|2024-03|           NULL|
|    Bob|        IT| 60000|2024-01|          61000|
|    Bob|        IT| 61000|2024-02|          62000|
|    Bob|        IT| 62000|2024-03|           NULL|
|Charlie|     Sales| 70000|2024-01|          72000|
|Charlie|     Sales| 72000|2024-02|          73000|
|Charlie|     Sales| 73000|2024-03|           NULL|
|  Diana|        IT| 55000|2024-01|          56000|
|  Diana|        IT| 56000|2024-02|          57000|
|  Diana|        IT| 57000|2024-03|           NULL|
|    Eve|        HR| 65000|2024-01|          66000|
|    Eve|        HR| 66000|2024-02|          67000|
|    Eve|        HR| 67000|2024-03|           NULL|
+-------+----------+------+-------+---------------+
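
LEAD and LAG read a neighbouring row inside the current partition, so a per-employee "next month" or "previous month" value needs the rows grouped by employee and ordered by month. The plain-Python sketch below models that grouping on a subset of the sample data. The `lead_lag_by_name` helper is hypothetical and is not how Spark implements these functions.

```python
from collections import defaultdict

# (name, month, salary) rows: a subset of the exercise data
rows = [
    ("Alice", "2024-01", 50000), ("Bob", "2024-01", 60000),
    ("Alice", "2024-02", 52000), ("Bob", "2024-02", 61000),
    ("Alice", "2024-03", 53000), ("Bob", "2024-03", 62000),
]

def lead_lag_by_name(rows):
    """Map (name, month) to (previous salary, next salary) within each employee."""
    by_name = defaultdict(list)
    for name, month, salary in rows:
        by_name[name].append((month, salary))  # "partition by" name

    result = {}
    for name, entries in by_name.items():
        entries.sort()  # "order by" month within the partition
        for i, (month, _salary) in enumerate(entries):
            previous = entries[i - 1][1] if i > 0 else None            # LAG
            following = entries[i + 1][1] if i + 1 < len(entries) else None  # LEAD
            result[(name, month)] = (previous, following)
    return result

neighbours = lead_lag_by_name(rows)
print(neighbours[("Alice", "2024-01")])  # (None, 52000)
print(neighbours[("Bob", "2024-03")])    # (61000, None)
```

Without the grouping step, the "next" row for Alice in 2024-01 would be another employee's row, which is what a window ordered only by month produces.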



### Exercise 6: LAG Function

Complete the operation described in the code comments.

In [19]:
# Your code here
# Use LAG() to get previous month's salary for each employee
# Partition by Name so LAG reads the same employee's previous month
window_aspec = Window.partitionBy("Name").orderBy("Month")
result_6 = df.withColumn("PreviousMonthSalary",lag("Salary").over(window_aspec))
result_6.show()

+-------+----------+------+-------+-------------------+
|   Name|Department|Salary|  Month|PreviousMonthSalary|
+-------+----------+------+-------+-------------------+
|  Alice|     Sales| 50000|2024-01|               NULL|
|  Alice|     Sales| 52000|2024-02|              50000|
|  Alice|     Sales| 53000|2024-03|              52000|
|    Bob|        IT| 60000|2024-01|               NULL|
|    Bob|        IT| 61000|2024-02|              60000|
|    Bob|        IT| 62000|2024-03|              61000|
|Charlie|     Sales| 70000|2024-01|               NULL|
|Charlie|     Sales| 72000|2024-02|              70000|
|Charlie|     Sales| 73000|2024-03|              72000|
|  Diana|        IT| 55000|2024-01|               NULL|
|  Diana|        IT| 56000|2024-02|              55000|
|  Diana|        IT| 57000|2024-03|              56000|
|    Eve|        HR| 65000|2024-01|               NULL|
|    Eve|        HR| 66000|2024-02|              65000|
|    Eve|        HR| 67000|2024-03|              66000|
+-------+----------+------+-------+-------------------+

### Exercise 7: Window with Partition and Order

Complete the operation described in the code comments.

In [24]:
# Your code here
# Create a window partitioned by Department and ordered by Salary descending
window_aspec = Window.partitionBy("Department").orderBy(col("Salary").desc())
# Keep the Rank column so the effect of the window is visible in the result
result_7 = df.withColumn("Rank",row_number().over(window_aspec))
result_7.orderBy("Department", col("Salary").desc()).show()

+-------+----------+------+-------+----+
|   Name|Department|Salary|  Month|Rank|
+-------+----------+------+-------+----+
|    Eve|        HR| 67000|2024-03|   1|
|    Eve|        HR| 66000|2024-02|   2|
|    Eve|        HR| 65000|2024-01|   3|
|    Bob|        IT| 62000|2024-03|   1|
|    Bob|        IT| 61000|2024-02|   2|
|    Bob|        IT| 60000|2024-01|   3|
|  Diana|        IT| 57000|2024-03|   4|
|  Diana|        IT| 56000|2024-02|   5|
|  Diana|        IT| 55000|2024-01|   6|
|Charlie|     Sales| 73000|2024-03|   1|
|Charlie|     Sales| 72000|2024-02|   2|
|Charlie|     Sales| 70000|2024-01|   3|
|  Alice|     Sales| 53000|2024-03|   4|
|  Alice|     Sales| 52000|2024-02|   5|
|  Alice|     Sales| 50000|2024-01|   6|
+-------+----------+------+-------+----+



### Exercise 8: Maximum Salary per Department

Complete the operation described in the code comments.

In [25]:
# Your code here
# Find the maximum salary in each department using window function
window_aspec = Window.partitionBy("Department")
result_8 = df.withColumn("maxSalary",max("salary").over(window_aspec))
result_8.show()

+-------+----------+------+-------+---------+
|   Name|Department|Salary|  Month|maxSalary|
+-------+----------+------+-------+---------+
|    Eve|        HR| 65000|2024-01|    67000|
|    Eve|        HR| 66000|2024-02|    67000|
|    Eve|        HR| 67000|2024-03|    67000|
|    Bob|        IT| 60000|2024-01|    62000|
|  Diana|        IT| 55000|2024-01|    62000|
|    Bob|        IT| 61000|2024-02|    62000|
|  Diana|        IT| 56000|2024-02|    62000|
|    Bob|        IT| 62000|2024-03|    62000|
|  Diana|        IT| 57000|2024-03|    62000|
|  Alice|     Sales| 50000|2024-01|    73000|
|Charlie|     Sales| 70000|2024-01|    73000|
|  Alice|     Sales| 52000|2024-02|    73000|
|Charlie|     Sales| 72000|2024-02|    73000|
|  Alice|     Sales| 53000|2024-03|    73000|
|Charlie|     Sales| 73000|2024-03|    73000|
+-------+----------+------+-------+---------+



### Exercise 9: Average Salary per Department

Complete the operation described in the code comments.

In [27]:
# Your code here
# Calculate average salary per department using window function
window_aspec = Window.partitionBy("Department")
result_9 = df.withColumn("AvgSalary",avg("salary").over(window_aspec))
result_9.show()

+-------+----------+------+-------+------------------+
|   Name|Department|Salary|  Month|         AvgSalary|
+-------+----------+------+-------+------------------+
|    Eve|        HR| 65000|2024-01|           66000.0|
|    Eve|        HR| 66000|2024-02|           66000.0|
|    Eve|        HR| 67000|2024-03|           66000.0|
|    Bob|        IT| 60000|2024-01|           58500.0|
|  Diana|        IT| 55000|2024-01|           58500.0|
|    Bob|        IT| 61000|2024-02|           58500.0|
|  Diana|        IT| 56000|2024-02|           58500.0|
|    Bob|        IT| 62000|2024-03|           58500.0|
|  Diana|        IT| 57000|2024-03|           58500.0|
|  Alice|     Sales| 50000|2024-01|61666.666666666664|
|Charlie|     Sales| 70000|2024-01|61666.666666666664|
|  Alice|     Sales| 52000|2024-02|61666.666666666664|
|Charlie|     Sales| 72000|2024-02|61666.666666666664|
|  Alice|     Sales| 53000|2024-03|61666.666666666664|
|Charlie|     Sales| 73000|2024-03|61666.666666666664|
+-------+----------+------+-------+------------------+

### Exercise 10: Salary Difference from Department Average

Complete the operation described in the code comments.

In [28]:
# Your code here
# Calculate each employee's salary difference from their department average
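window_aspec = Window.partitionBy("Department")
# AvgDiff holds the department average; SalaryDiff is that average minus the salary,
# so a positive value means the salary is below the department average
avg_salary = df.withColumn("AvgDiff",avg("salary").over(window_aspec))
# Assign the DataFrame first: show() returns None, so chaining it would lose the result
result_10 = avg_salary.withColumn("SalaryDiff",col("AvgDiff")-col("salary"))
result_10.show()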

+-------+----------+------+-------+------------------+-------------------+
|   Name|Department|Salary|  Month|           AvgDiff|         SalaryDiff|
+-------+----------+------+-------+------------------+-------------------+
|    Eve|        HR| 65000|2024-01|           66000.0|             1000.0|
|    Eve|        HR| 66000|2024-02|           66000.0|                0.0|
|    Eve|        HR| 67000|2024-03|           66000.0|            -1000.0|
|    Bob|        IT| 60000|2024-01|           58500.0|            -1500.0|
|  Diana|        IT| 55000|2024-01|           58500.0|             3500.0|
|    Bob|        IT| 61000|2024-02|           58500.0|            -2500.0|
|  Diana|        IT| 56000|2024-02|           58500.0|             2500.0|
|    Bob|        IT| 62000|2024-03|           58500.0|            -3500.0|
|  Diana|        IT| 57000|2024-03|           58500.0|             1500.0|
|  Alice|     Sales| 50000|2024-01|61666.666666666664| 11666.666666666664|
|Charlie|     Sales| 7000

### Exercise 11: Percent Rank

Complete the operation described in the code comments.

In [31]:
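# Use PERCENT_RANK() to get each salary's relative rank within its department
from pyspark.sql.functions import percent_rank

window_aspec = Window.partitionBy("Department").orderBy(col("Salary"))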

result_11 = df.withColumn("PercentRank", percent_rank().over(window_aspec))

result_11.show()


+-------+----------+------+-------+-----------+
|   Name|Department|Salary|  Month|PercentRank|
+-------+----------+------+-------+-----------+
|    Eve|        HR| 65000|2024-01|        0.0|
|    Eve|        HR| 66000|2024-02|        0.5|
|    Eve|        HR| 67000|2024-03|        1.0|
|  Diana|        IT| 55000|2024-01|        0.0|
|  Diana|        IT| 56000|2024-02|        0.2|
|  Diana|        IT| 57000|2024-03|        0.4|
|    Bob|        IT| 60000|2024-01|        0.6|
|    Bob|        IT| 61000|2024-02|        0.8|
|    Bob|        IT| 62000|2024-03|        1.0|
|  Alice|     Sales| 50000|2024-01|        0.0|
|  Alice|     Sales| 52000|2024-02|        0.2|
|  Alice|     Sales| 53000|2024-03|        0.4|
|Charlie|     Sales| 70000|2024-01|        0.6|
|Charlie|     Sales| 72000|2024-02|        0.8|
|Charlie|     Sales| 73000|2024-03|        1.0|
+-------+----------+------+-------+-----------+
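
PERCENT_RANK is defined as (rank - 1) / (rows in partition - 1), which is why each department above starts at 0.0 and ends at 1.0. The plain-Python sketch below applies that formula to the HR and IT salaries from the sample data. The `percent_ranks` helper is hypothetical and assumes ascending order.

```python
def percent_ranks(values):
    """Return (rank - 1) / (n - 1) for each value in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    results = []
    for value in ordered:
        # The first position of a value equals its RANK, so ties share a rank
        rank = ordered.index(value) + 1
        # A single-row partition has percent rank 0.0 by convention
        results.append(0.0 if n == 1 else (rank - 1) / (n - 1))
    return results

hr_salaries = [65000, 66000, 67000]
it_salaries = [55000, 56000, 57000, 60000, 61000, 62000]
hr_percent = percent_ranks(hr_salaries)
it_percent = percent_ranks(it_salaries)
print(hr_percent)  # [0.0, 0.5, 1.0]
print(it_percent)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```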



### Exercise 12: Cumulative Sum

Complete the operation described in the code comments.

In [32]:
# Your code here
# Calculate cumulative sum of salary for each employee over months
window_aspec = Window.partitionBy("Name").orderBy("Month")
# Assign the DataFrame first: show() returns None, so chaining it would lose the result
result_12 = df.withColumn("CumulativeSalary",sum("salary").over(window_aspec))
result_12.show()

+-------+----------+------+-------+----------------+
|   Name|Department|Salary|  Month|CumulativeSalary|
+-------+----------+------+-------+----------------+
|  Alice|     Sales| 50000|2024-01|           50000|
|  Alice|     Sales| 52000|2024-02|          102000|
|  Alice|     Sales| 53000|2024-03|          155000|
|    Bob|        IT| 60000|2024-01|           60000|
|    Bob|        IT| 61000|2024-02|          121000|
|    Bob|        IT| 62000|2024-03|          183000|
|Charlie|     Sales| 70000|2024-01|           70000|
|Charlie|     Sales| 72000|2024-02|          142000|
|Charlie|     Sales| 73000|2024-03|          215000|
|  Diana|        IT| 55000|2024-01|           55000|
|  Diana|        IT| 56000|2024-02|          111000|
|  Diana|        IT| 57000|2024-03|          168000|
|    Eve|        HR| 65000|2024-01|           65000|
|    Eve|        HR| 66000|2024-02|          131000|
|    Eve|        HR| 67000|2024-03|          198000|
+-------+----------+------+-------+----------------+

### Exercise 13: Moving Average

Complete the operation described in the code comments.

In [33]:
# Your code here
# Calculate 2-month moving average of salary for each employee
# Partition by Name so the previous row belongs to the same employee
window_aspec = Window.partitionBy("Name").orderBy("Month").rowsBetween(-1,0)
# Assign the DataFrame first: show() returns None, so chaining it would lose the result
result_13 = df.withColumn("MovingAverage",avg("salary").over(window_aspec))
result_13.show()

+-------+----------+------+-------+-------------+
|   Name|Department|Salary|  Month|MovingAverage|
+-------+----------+------+-------+-------------+
|  Alice|     Sales| 50000|2024-01|      50000.0|
|  Alice|     Sales| 52000|2024-02|      51000.0|
|  Alice|     Sales| 53000|2024-03|      52500.0|
|    Bob|        IT| 60000|2024-01|      60000.0|
|    Bob|        IT| 61000|2024-02|      60500.0|
|    Bob|        IT| 62000|2024-03|      61500.0|
|Charlie|     Sales| 70000|2024-01|      70000.0|
|Charlie|     Sales| 72000|2024-02|      71000.0|
|Charlie|     Sales| 73000|2024-03|      72500.0|
|  Diana|        IT| 55000|2024-01|      55000.0|
|  Diana|        IT| 56000|2024-02|      55500.0|
|  Diana|        IT| 57000|2024-03|      56500.0|
|    Eve|        HR| 65000|2024-01|      65000.0|
|    Eve|        HR| 66000|2024-02|      65500.0|
|    Eve|        HR| 67000|2024-03|      66500.0|
+-------+----------+------+-------+-------------+
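
A 2-month moving average for one employee averages each month's salary with the previous month's salary of that same employee, and the first month has only itself to average. The plain-Python sketch below models a `rowsBetween(-1, 0)` frame on Alice's three salaries. The `moving_average` helper and its `preceding` parameter are hypothetical names chosen for illustration.

```python
def moving_average(values, preceding=1):
    """Average each value with up to `preceding` earlier values (a ROWS frame)."""
    averages = []
    for i in range(len(values)):
        # The frame starts `preceding` rows back, clipped at the partition start
        start = i - preceding if i >= preceding else 0
        frame = values[start:i + 1]
        total = 0
        for value in frame:
            total += value
        averages.append(total / len(frame))
    return averages

alice_salaries = [50000, 52000, 53000]  # already ordered by month
alice_average = moving_average(alice_salaries)
print(alice_average)  # [50000.0, 51000.0, 52500.0]
```

Passing `preceding=2` models a 3-row frame such as `rowsBetween(-2, 0)`.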



### Exercise 14: First Value in Window

Complete the operation described in the code comments.

In [35]:
# Your code here
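# Get the first salary value in each department using first() (FIRST_VALUE in SQL)
# Note: Month has ties within IT and Sales, so which tied row comes first is not guaranteed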
from pyspark.sql import Window
from pyspark.sql.functions import first, col


window_aspec = Window.partitionBy("Department").orderBy("Month") \
                     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)


result_14 = df.withColumn(
    "FirstSalary",
    first(col("Salary")).over(window_aspec)
)


result_14.show()


+-------+----------+------+-------+-----------+
|   Name|Department|Salary|  Month|FirstSalary|
+-------+----------+------+-------+-----------+
|    Eve|        HR| 65000|2024-01|      65000|
|    Eve|        HR| 66000|2024-02|      65000|
|    Eve|        HR| 67000|2024-03|      65000|
|    Bob|        IT| 60000|2024-01|      60000|
|  Diana|        IT| 55000|2024-01|      60000|
|    Bob|        IT| 61000|2024-02|      60000|
|  Diana|        IT| 56000|2024-02|      60000|
|    Bob|        IT| 62000|2024-03|      60000|
|  Diana|        IT| 57000|2024-03|      60000|
|  Alice|     Sales| 50000|2024-01|      50000|
|Charlie|     Sales| 70000|2024-01|      50000|
|  Alice|     Sales| 52000|2024-02|      50000|
|Charlie|     Sales| 72000|2024-02|      50000|
|  Alice|     Sales| 53000|2024-03|      50000|
|Charlie|     Sales| 73000|2024-03|      50000|
+-------+----------+------+-------+-----------+



### Exercise 15: Last Value in Window

Complete the operation described in the code comments.

In [36]:
# Your code here
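# Get the last salary value in each department using last() (LAST_VALUE in SQL)
# Note: Month has ties within IT and Sales, so which tied row comes last is not guaranteed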
from pyspark.sql import Window
from pyspark.sql.functions import last, col


window_aspec = Window.partitionBy("Department").orderBy("Month") \
                     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)


result_15 = df.withColumn(
    "LastSalary",
    last(col("Salary")).over(window_aspec)
)


result_15.show()


+-------+----------+------+-------+----------+
|   Name|Department|Salary|  Month|LastSalary|
+-------+----------+------+-------+----------+
|    Eve|        HR| 65000|2024-01|     67000|
|    Eve|        HR| 66000|2024-02|     67000|
|    Eve|        HR| 67000|2024-03|     67000|
|    Bob|        IT| 60000|2024-01|     57000|
|  Diana|        IT| 55000|2024-01|     57000|
|    Bob|        IT| 61000|2024-02|     57000|
|  Diana|        IT| 56000|2024-02|     57000|
|    Bob|        IT| 62000|2024-03|     57000|
|  Diana|        IT| 57000|2024-03|     57000|
|  Alice|     Sales| 50000|2024-01|     73000|
|Charlie|     Sales| 70000|2024-01|     73000|
|  Alice|     Sales| 52000|2024-02|     73000|
|Charlie|     Sales| 72000|2024-02|     73000|
|  Alice|     Sales| 53000|2024-03|     73000|
|Charlie|     Sales| 73000|2024-03|     73000|
+-------+----------+------+-------+----------+
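
Exercises 14 and 15 both set an explicit frame from `Window.unboundedPreceding` to `Window.unboundedFollowing`. For `last()` this matters: with an ordered window and no explicit frame, Spark's default frame ends at the current row, so the "last" value is just the current row (or one of its peers). The plain-Python sketch below models both cases on the HR salaries, where months are unique. The `last_in_frame` helper is hypothetical.

```python
def last_in_frame(values, unbounded_following):
    """Return the last value each row can see inside its frame."""
    results = []
    for i in range(len(values)):
        # Default frame: ends at the current row.
        # Explicit unbounded frame: ends at the final row of the partition.
        end = len(values) - 1 if unbounded_following else i
        results.append(values[end])
    return results

hr_salaries = [65000, 66000, 67000]  # ordered by month
default_frame = last_in_frame(hr_salaries, unbounded_following=False)
full_frame = last_in_frame(hr_salaries, unbounded_following=True)
print(default_frame)  # [65000, 66000, 67000]
print(full_frame)     # [67000, 67000, 67000]
```

`first()` is less sensitive to this, because both frames start at the beginning of the partition.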



### Exercise 16: Count per Department

Complete the operation described in the code comments.

In [37]:
# Your code here
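# Count number of employees in each department using window function
# Note: count("*") counts rows, and each employee has one row per month,
# so IT shows 6 for 2 employees over 3 months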
window_aspec = Window.partitionBy("Department")
result_16 = df.withColumn("EmployeeCount",count("*").over(window_aspec))
result_16.show()

+-------+----------+------+-------+-------------+
|   Name|Department|Salary|  Month|EmployeeCount|
+-------+----------+------+-------+-------------+
|    Eve|        HR| 65000|2024-01|            3|
|    Eve|        HR| 66000|2024-02|            3|
|    Eve|        HR| 67000|2024-03|            3|
|    Bob|        IT| 60000|2024-01|            6|
|  Diana|        IT| 55000|2024-01|            6|
|    Bob|        IT| 61000|2024-02|            6|
|  Diana|        IT| 56000|2024-02|            6|
|    Bob|        IT| 62000|2024-03|            6|
|  Diana|        IT| 57000|2024-03|            6|
|  Alice|     Sales| 50000|2024-01|            6|
|Charlie|     Sales| 70000|2024-01|            6|
|  Alice|     Sales| 52000|2024-02|            6|
|Charlie|     Sales| 72000|2024-02|            6|
|  Alice|     Sales| 53000|2024-03|            6|
|Charlie|     Sales| 73000|2024-03|            6|
+-------+----------+------+-------+-------------+



### Exercise 17: Salary Rank with Ties

Complete the operation described in the code comments.

In [38]:
# Your code here
# Rank employees by salary, handling ties appropriately
window_aspec = Window.orderBy(col("Salary").desc())
result_17 = df.withColumn("Rank",dense_rank().over(window_aspec))
result_17.show()

+-------+----------+------+-------+----+
|   Name|Department|Salary|  Month|Rank|
+-------+----------+------+-------+----+
|Charlie|     Sales| 73000|2024-03|   1|
|Charlie|     Sales| 72000|2024-02|   2|
|Charlie|     Sales| 70000|2024-01|   3|
|    Eve|        HR| 67000|2024-03|   4|
|    Eve|        HR| 66000|2024-02|   5|
|    Eve|        HR| 65000|2024-01|   6|
|    Bob|        IT| 62000|2024-03|   7|
|    Bob|        IT| 61000|2024-02|   8|
|    Bob|        IT| 60000|2024-01|   9|
|  Diana|        IT| 57000|2024-03|  10|
|  Diana|        IT| 56000|2024-02|  11|
|  Diana|        IT| 55000|2024-01|  12|
|  Alice|     Sales| 53000|2024-03|  13|
|  Alice|     Sales| 52000|2024-02|  14|
|  Alice|     Sales| 50000|2024-01|  15|
+-------+----------+------+-------+----+



### Exercise 18: Window with Rows Between

Complete the operation described in the code comments.

In [40]:
# Your code here
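# Create a window with rowsBetween to calculate sum of current and previous row
# Note: this window has no orderBy, so which row counts as "previous" is not guaranteed
window_aspec = Window.rowsBetween(-1,0)
# Assign the DataFrame first: show() returns None, so chaining it would lose the result
result_18 = df.withColumn("SumCurrentAndPrevious",sum("salary").over(window_aspec))
result_18.show()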

+-------+----------+------+-------+---------------------+
|   Name|Department|Salary|  Month|SumCurrentAndPrevious|
+-------+----------+------+-------+---------------------+
|  Alice|     Sales| 50000|2024-01|                50000|
|    Bob|        IT| 60000|2024-01|               110000|
|Charlie|     Sales| 70000|2024-01|               130000|
|  Diana|        IT| 55000|2024-01|               125000|
|    Eve|        HR| 65000|2024-01|               120000|
|  Alice|     Sales| 52000|2024-02|               117000|
|    Bob|        IT| 61000|2024-02|               113000|
|Charlie|     Sales| 72000|2024-02|               133000|
|  Diana|        IT| 56000|2024-02|               128000|
|    Eve|        HR| 66000|2024-02|               122000|
|  Alice|     Sales| 53000|2024-03|               119000|
|    Bob|        IT| 62000|2024-03|               115000|
|Charlie|     Sales| 73000|2024-03|               135000|
|  Diana|        IT| 57000|2024-03|               130000|
|    Eve|        HR| 67000|2024-03|               124000|
+-------+----------+------+-------+---------------------+

### Exercise 19: Multiple Window Functions

Complete the operation described in the code comments.

In [44]:
# Your code here
# Apply multiple window functions (rank, avg, max) in a single query
from pyspark.sql import Window
from pyspark.sql.functions import rank, avg, max, col


window_rank = Window.partitionBy("Department").orderBy("Salary")


window_agg = Window.partitionBy("Department")


result_19 = df.withColumn("Rank", rank().over(window_rank)) \
              .withColumn("AvgSalary", avg(col("Salary")).over(window_agg)) \
              .withColumn("MaxSalary", max(col("Salary")).over(window_agg))


result_19.show()


+-------+----------+------+-------+----+------------------+---------+
|   Name|Department|Salary|  Month|Rank|         AvgSalary|MaxSalary|
+-------+----------+------+-------+----+------------------+---------+
|    Eve|        HR| 65000|2024-01|   1|           66000.0|    67000|
|    Eve|        HR| 66000|2024-02|   2|           66000.0|    67000|
|    Eve|        HR| 67000|2024-03|   3|           66000.0|    67000|
|  Diana|        IT| 55000|2024-01|   1|           58500.0|    62000|
|  Diana|        IT| 56000|2024-02|   2|           58500.0|    62000|
|  Diana|        IT| 57000|2024-03|   3|           58500.0|    62000|
|    Bob|        IT| 60000|2024-01|   4|           58500.0|    62000|
|    Bob|        IT| 61000|2024-02|   5|           58500.0|    62000|
|    Bob|        IT| 62000|2024-03|   6|           58500.0|    62000|
|  Alice|     Sales| 50000|2024-01|   1|61666.666666666664|    73000|
|  Alice|     Sales| 52000|2024-02|   2|61666.666666666664|    73000|
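|  Alice|     Sales| 53000|2024-03|   3|61666.666666666664|    73000|
|Charlie|     Sales| 70000|2024-01|   4|61666.666666666664|    73000|
|Charlie|     Sales| 72000|2024-02|   5|61666.666666666664|    73000|
|Charlie|     Sales| 73000|2024-03|   6|61666.666666666664|    73000|
+-------+----------+------+-------+----+------------------+---------+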

### Exercise 20: Complex Window with Multiple Partitions

Complete the operation described in the code comments.

In [45]:
# Your code here
# Create a complex window partitioned by Department, ordered by Month, with specific range
from pyspark.sql import Window
from pyspark.sql.functions import avg, col


window_spec = Window.partitionBy("Department") \
                    .orderBy("Month") \
                    .rowsBetween(-2, 0)

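# Month has ties within IT and Sales, so the order of tied rows inside the
# 3-row frame is not guaranteed
result_20 = df.withColumn(
    "Salary_MA_3",
    avg(col("Salary")).over(window_spec)
)

result_20.show()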


+-------+----------+------+-------+------------------+
|   Name|Department|Salary|  Month|       Salary_MA_3|
+-------+----------+------+-------+------------------+
|    Eve|        HR| 65000|2024-01|           65000.0|
|    Eve|        HR| 66000|2024-02|           65500.0|
|    Eve|        HR| 67000|2024-03|           66000.0|
|    Bob|        IT| 60000|2024-01|           60000.0|
|  Diana|        IT| 55000|2024-01|           57500.0|
|    Bob|        IT| 61000|2024-02|58666.666666666664|
|  Diana|        IT| 56000|2024-02|57333.333333333336|
|    Bob|        IT| 62000|2024-03|59666.666666666664|
|  Diana|        IT| 57000|2024-03|58333.333333333336|
|  Alice|     Sales| 50000|2024-01|           50000.0|
|Charlie|     Sales| 70000|2024-01|           60000.0|
|  Alice|     Sales| 52000|2024-02|57333.333333333336|
|Charlie|     Sales| 72000|2024-02|64666.666666666664|
|  Alice|     Sales| 53000|2024-03|           59000.0|
|Charlie|     Sales| 73000|2024-03|           66000.0|
+-------+----------+------+-------+------------------+

## Summary

Great job completing the exercises! Review your solutions and compare them with the solutions notebook if needed.
