# Module 7 - Advanced Operations: Window Functions & UDFs

## Introduction

This notebook covers advanced PySpark transformations: window functions, user-defined functions (UDFs), pivots, unions, and duplicate removal.

## What You'll Learn

- Window functions
- User-defined functions (UDFs)
- Pivot and unpivot operations
- Union and distinct operations
- Advanced column operations


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *
from pyspark.sql.window import Window

# Create SparkSession
spark = SparkSession.builder \
    .appName("Advanced Transformations") \
    .master("local[*]") \
    .getOrCreate()

# Create sample DataFrame
data = [
    ("Alice", "Sales", 50000, "2024-01"),
    ("Bob", "IT", 60000, "2024-01"),
    ("Charlie", "Sales", 70000, "2024-01"),
    ("Diana", "IT", 55000, "2024-01"),
    ("Alice", "Sales", 52000, "2024-02"),
    ("Bob", "IT", 61000, "2024-02"),
    ("Charlie", "Sales", 72000, "2024-02"),
    ("Diana", "IT", 56000, "2024-02")
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Month", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()



+-------+----------+------+-------+
|   Name|Department|Salary|  Month|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|2024-01|
|    Bob|        IT| 60000|2024-01|
|Charlie|     Sales| 70000|2024-01|
|  Diana|        IT| 55000|2024-01|
|  Alice|     Sales| 52000|2024-02|
|    Bob|        IT| 61000|2024-02|
|Charlie|     Sales| 72000|2024-02|
|  Diana|        IT| 56000|2024-02|
+-------+----------+------+-------+



## Window Functions

Window functions perform calculations across a set of rows related to the current row. Unlike `groupBy()`, window functions don't collapse rows.

**Common Use Cases:**
- Running totals
- Rankings
- Moving averages
- Comparing rows to group aggregates

### Windowing Aggregations

A window specification has up to three parts, and a calculation only needs to set the ones it uses:

1. **Partitioning** (`partitionBy`): splits the rows into groups by one or more columns, similar to `GROUP BY`
2. **Ordering** (`orderBy`): sorts the rows within each partition
3. **Frame** (`rowsBetween` or `rangeBetween`): sets the start and end rows, relative to the current row, that the function can see

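**Example**: Consider an `orders_df` DataFrame (not defined in this notebook) with `country`, `weeknum`, and `invoicevalue` columns. To find the running total of invoice value:
- Partition by country
- Sort by week number
- Set the frame from the first row of the partition to the current row
- Sum `invoicevalue` over that window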

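```python
# Importing the module as F avoids shadowing Python built-ins such as sum()
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# orders_df is assumed to exist with country, weeknum and invoicevalue columns

# Define window specification: per country, ordered by week, from the first row to the current row
mywindow = Window.partitionBy("country") \
    .orderBy("weeknum") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Apply the aggregate over the window to get a running total
result_df = orders_df.withColumn("running_total", F.sum("invoicevalue").over(mywindow))
result_df.show()
```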

### Common Windowing Functions

PySpark provides several windowing functions for different analytical needs:

1. **`rank()`**: Assigns ranks to rows within a window partition, with gaps in ranking sequence when there are ties
2. **`dense_rank()`**: Assigns ranks to rows within a window partition, without gaps in ranking sequence (consecutive ranks)
3. **`row_number()`**: Assigns a unique sequential number to each row within a window partition
4. **`lead()`**: Accesses data from a subsequent row in the same window partition
5. **`lag()`**: Accesses data from a previous row in the same window partition

**Key Differences:**
- `rank()` vs `dense_rank()`: `rank()` leaves gaps when there are ties, `dense_rank()` doesn't
- `row_number()`: Always assigns unique sequential numbers, even for ties
- `lead()` vs `lag()`: `lead()` looks ahead, `lag()` looks behind


In [2]:
# Define a window partitioned by Department, ordered by Salary
window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())

# Rank employees within each department
df_with_rank = df.withColumn("Rank", rank().over(window_spec))
df_with_rank.show()


+-------+----------+------+-------+----+
|   Name|Department|Salary|  Month|Rank|
+-------+----------+------+-------+----+
|    Bob|        IT| 61000|2024-02|   1|
|    Bob|        IT| 60000|2024-01|   2|
|  Diana|        IT| 56000|2024-02|   3|
|  Diana|        IT| 55000|2024-01|   4|
|Charlie|     Sales| 72000|2024-02|   1|
|Charlie|     Sales| 70000|2024-01|   2|
|  Alice|     Sales| 52000|2024-02|   3|
|  Alice|     Sales| 50000|2024-01|   4|
+-------+----------+------+-------+----+



In [3]:
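# Calculate running total within department
# Month alone has ties within a department, so Name is added as a tie-breaker to keep the row order deterministic
window_sum = Window.partitionBy("Department").orderBy("Month", "Name").rowsBetween(Window.unboundedPreceding, Window.currentRow)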

df_with_running_total = df.withColumn("RunningTotal", sum("Salary").over(window_sum))
df_with_running_total.show()


+-------+----------+------+-------+------------+
|   Name|Department|Salary|  Month|RunningTotal|
+-------+----------+------+-------+------------+
|    Bob|        IT| 60000|2024-01|       60000|
|  Diana|        IT| 55000|2024-01|      115000|
|    Bob|        IT| 61000|2024-02|      176000|
|  Diana|        IT| 56000|2024-02|      232000|
|  Alice|     Sales| 50000|2024-01|       50000|
|Charlie|     Sales| 70000|2024-01|      120000|
|  Alice|     Sales| 52000|2024-02|      172000|
|Charlie|     Sales| 72000|2024-02|      244000|
+-------+----------+------+-------+------------+



In [4]:
# Calculate average salary per department (using window)
window_avg = Window.partitionBy("Department")

df_with_avg = df.withColumn("DeptAvgSalary", avg("Salary").over(window_avg))
df_with_avg.show()


+-------+----------+------+-------+-------------+
|   Name|Department|Salary|  Month|DeptAvgSalary|
+-------+----------+------+-------+-------------+
|    Bob|        IT| 60000|2024-01|      58000.0|
|  Diana|        IT| 55000|2024-01|      58000.0|
|    Bob|        IT| 61000|2024-02|      58000.0|
|  Diana|        IT| 56000|2024-02|      58000.0|
|  Alice|     Sales| 50000|2024-01|      61000.0|
|Charlie|     Sales| 70000|2024-01|      61000.0|
|  Alice|     Sales| 52000|2024-02|      61000.0|
|Charlie|     Sales| 72000|2024-02|      61000.0|
+-------+----------+------+-------+-------------+



### Examples of Windowing Functions: rank, dense_rank, row_number, lead, lag

Let's see examples of the common windowing functions:


In [5]:
# Example: rank(), dense_rank(), and row_number()

from pyspark.sql.functions import rank, dense_rank, row_number
from pyspark.sql.window import Window

# Define window specification - partition by Department, order by Salary descending
window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())

# Apply different ranking functions
df_rankings = df.withColumn("rank", rank().over(window_spec)) \
    .withColumn("dense_rank", dense_rank().over(window_spec)) \
    .withColumn("row_number", row_number().over(window_spec))

print("Rankings within each Department (ordered by Salary descending):")
df_rankings.select("Name", "Department", "Salary", "rank", "dense_rank", "row_number").show()

print("\nKey Differences:")
print("- rank(): Leaves gaps when there are ties (e.g., if two people have same salary)")
print("- dense_rank(): No gaps, consecutive ranks even with ties")
print("- row_number(): Always unique sequential numbers, even for ties")


Rankings within each Department (ordered by Salary descending):
+-------+----------+------+----+----------+----------+
|   Name|Department|Salary|rank|dense_rank|row_number|
+-------+----------+------+----+----------+----------+
|    Bob|        IT| 61000|   1|         1|         1|
|    Bob|        IT| 60000|   2|         2|         2|
|  Diana|        IT| 56000|   3|         3|         3|
|  Diana|        IT| 55000|   4|         4|         4|
|Charlie|     Sales| 72000|   1|         1|         1|
|Charlie|     Sales| 70000|   2|         2|         2|
|  Alice|     Sales| 52000|   3|         3|         3|
|  Alice|     Sales| 50000|   4|         4|         4|
+-------+----------+------+----+----------+----------+


Key Differences:
- rank(): Leaves gaps when there are ties (e.g., if two people have same salary)
- dense_rank(): No gaps, consecutive ranks even with ties
- row_number(): Always unique sequential numbers, even for ties


In [6]:
# Example: lead() and lag()

from pyspark.sql.functions import lead, lag

# Define window specification - partition by Department, order by Month
# Name is added as a tie-breaker because several rows share the same Month
window_spec_lead_lag = Window.partitionBy("Department").orderBy("Month", "Name")

# Apply lead and lag functions
df_lead_lag = df.withColumn("previous_salary", lag("Salary", 1).over(window_spec_lead_lag)) \
    .withColumn("next_salary", lead("Salary", 1).over(window_spec_lead_lag)) \
    .withColumn("salary_change", col("Salary") - col("previous_salary"))

print("Lead and Lag Examples:")
print("lead(): Accesses data from the NEXT row")
print("lag(): Accesses data from the PREVIOUS row")
print()

df_lead_lag.select("Name", "Department", "Month", "Salary", "previous_salary", "next_salary", "salary_change").show()

print("\nNote:")
print("- lag(Salary, 1): Previous row's salary")
print("- lead(Salary, 1): Next row's salary")
print("- The second parameter (1) is the offset (how many rows ahead/behind)")


Lead and Lag Examples:
lead(): Accesses data from the NEXT row
lag(): Accesses data from the PREVIOUS row

+-------+----------+-------+------+---------------+-----------+-------------+
|   Name|Department|  Month|Salary|previous_salary|next_salary|salary_change|
+-------+----------+-------+------+---------------+-----------+-------------+
|    Bob|        IT|2024-01| 60000|           NULL|      55000|         NULL|
|  Diana|        IT|2024-01| 55000|          60000|      61000|        -5000|
|    Bob|        IT|2024-02| 61000|          55000|      56000|         6000|
|  Diana|        IT|2024-02| 56000|          61000|       NULL|        -5000|
|  Alice|     Sales|2024-01| 50000|           NULL|      70000|         NULL|
|Charlie|     Sales|2024-01| 70000|          50000|      52000|        20000|
|  Alice|     Sales|2024-02| 52000|          70000|      72000|       -18000|
|Charlie|     Sales|2024-02| 72000|          52000|       NULL|        20000|
+-------+----------+-------+------+---------------+-----------+-------------+

## User-Defined Functions (UDFs)

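UDFs let you apply custom Python functions to DataFrame columns. **Note**: Python UDFs are usually slower than built-in functions because each row is serialized between the JVM and a Python worker, and the optimizer cannot see inside them. Use them only when no built-in function does the job.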


In [7]:
# Define a simple UDF
from pyspark.sql.types import StringType

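def categorize_salary(salary):
    # Salary is nullable; comparing None with an int raises TypeError in Python 3
    if salary is None:
        return None
    if salary > 65000: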
        return "High"
    elif salary > 55000:
        return "Medium"
    else:
        return "Low"

# Register UDF
categorize_udf = udf(categorize_salary, StringType())

# Apply UDF
df_with_category = df.withColumn("SalaryCategory", categorize_udf(col("Salary")))
df_with_category.show()


                                                                                

+-------+----------+------+-------+--------------+
|   Name|Department|Salary|  Month|SalaryCategory|
+-------+----------+------+-------+--------------+
|  Alice|     Sales| 50000|2024-01|           Low|
|    Bob|        IT| 60000|2024-01|        Medium|
|Charlie|     Sales| 70000|2024-01|          High|
|  Diana|        IT| 55000|2024-01|           Low|
|  Alice|     Sales| 52000|2024-02|           Low|
|    Bob|        IT| 61000|2024-02|        Medium|
|Charlie|     Sales| 72000|2024-02|          High|
|  Diana|        IT| 56000|2024-02|        Medium|
+-------+----------+------+-------+--------------+



## Pivot Operations

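Pivot turns the distinct values of one column into separate columns and aggregates the rows that fall into each cell. It is useful for building summary tables. Passing the expected values as a second argument to `pivot()` fixes the column order and lets Spark skip the extra job that discovers them.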


In [8]:
# Pivot: Transform Department values into columns
# With a single aggregation the new columns are named after the pivot values, so no alias is needed
df_pivot = df.groupBy("Name").pivot("Department").agg(sum("Salary"))
df_pivot.show()


+-------+------+------+
|   Name|    IT| Sales|
+-------+------+------+
|  Diana|111000|  NULL|
|Charlie|  NULL|142000|
|    Bob|121000|  NULL|
|  Alice|  NULL|102000|
+-------+------+------+



## Union Operations

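`union()` appends the rows of one DataFrame to another. Both must have the same number of columns, and columns are matched by position, not by name (use `unionByName()` when the column order may differ). Like SQL `UNION ALL`, `union()` keeps duplicates; chain `.distinct()` to remove them.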


In [9]:
# Create another DataFrame
data2 = [
    ("Eve", "HR", 65000, "2024-01"),
    ("Frank", "Sales", 52000, "2024-01")
]

df2 = spark.createDataFrame(data2, schema)

# Union DataFrames
df_union = df.union(df2)
print("Union of DataFrames:")
df_union.show()

# Union with distinct (removes duplicates)
df_union_distinct = df.union(df2).distinct()
print("\nUnion with distinct:")
df_union_distinct.show()


Union of DataFrames:
+-------+----------+------+-------+
|   Name|Department|Salary|  Month|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|2024-01|
|    Bob|        IT| 60000|2024-01|
|Charlie|     Sales| 70000|2024-01|
|  Diana|        IT| 55000|2024-01|
|  Alice|     Sales| 52000|2024-02|
|    Bob|        IT| 61000|2024-02|
|Charlie|     Sales| 72000|2024-02|
|  Diana|        IT| 56000|2024-02|
|    Eve|        HR| 65000|2024-01|
|  Frank|     Sales| 52000|2024-01|
+-------+----------+------+-------+


Union with distinct:
+-------+----------+------+-------+
|   Name|Department|Salary|  Month|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|2024-01|
|    Bob|        IT| 60000|2024-01|
|Charlie|     Sales| 70000|2024-01|
|  Diana|        IT| 55000|2024-01|
|  Alice|     Sales| 52000|2024-02|
|    Bob|        IT| 61000|2024-02|
|Charlie|     Sales| 72000|2024-02|
|  Diana|        IT| 56000|2024-02|
|    Eve|        HR| 65000|2024-01|
|  Frank|     Sales| 52000|2024-01|
+-------+----------+------+-------+

## Distinct and Drop Duplicates

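`distinct()` removes rows that are identical across all columns. `dropDuplicates()` does the same when called without arguments; given a list of columns, it keeps one row per unique combination of those columns, and which row survives is not guaranteed.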


In [10]:
# Get distinct rows
df_distinct = df.distinct()
print("Distinct rows:")
df_distinct.show()

# Drop duplicates based on specific columns
df_no_duplicates = df.dropDuplicates(["Name", "Department"])
print("\nDrop duplicates based on Name and Department:")
df_no_duplicates.show()


Distinct rows:
+-------+----------+------+-------+
|   Name|Department|Salary|  Month|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|2024-01|
|    Bob|        IT| 60000|2024-01|
|Charlie|     Sales| 70000|2024-01|
|  Diana|        IT| 55000|2024-01|
|  Alice|     Sales| 52000|2024-02|
|    Bob|        IT| 61000|2024-02|
|Charlie|     Sales| 72000|2024-02|
|  Diana|        IT| 56000|2024-02|
+-------+----------+------+-------+


Drop duplicates based on Name and Department:
+-------+----------+------+-------+
|   Name|Department|Salary|  Month|
+-------+----------+------+-------+
|  Alice|     Sales| 50000|2024-01|
|    Bob|        IT| 60000|2024-01|
|Charlie|     Sales| 70000|2024-01|
|  Diana|        IT| 55000|2024-01|
+-------+----------+------+-------+



## Summary

In this notebook, you learned:

1. **Window Functions**: Perform calculations across related rows without collapsing data
2. **UDFs**: Custom functions for complex transformations (use sparingly - they're slower)
3. **Pivot**: Transform rows into columns for summary tables
4. **Union**: Combine DataFrames with the same schema (columns are matched by position)
5. **Distinct/Drop Duplicates**: Remove duplicate rows

**Key Takeaway**: Window functions are powerful for analytical queries. UDFs should be used only when built-in functions aren't sufficient, as they have performance overhead.

**Next Steps**: In Module 8, we'll learn about performance optimization techniques including partitioning, caching, and bucketing.
