
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>

# WINDOWS

In Apache Spark, window functions allow calculations across a set of rows related to the current row, without grouping data. They are similar to aggregate functions, but instead of reducing data to a single value per group, they output a value for each row based on the "window" of surrounding rows. Window functions are useful for tasks like ranking, calculating moving averages, or accessing adjacent rows' data. 


![](https://miro.medium.com/v2/resize:fit:537/0*-TLOWiq8V9-2YVW-.png)

In [0]:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, rank, dense_rank
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
import datetime

data = [
    ("Sales", "Alice", datetime.date(2024, 1, 1), 200),
    ("Sales", "Alice", datetime.date(2024, 1, 1), 300),
    ("Sales", "Alice", datetime.date(2024, 1, 2), 450),
    ("Sales", "Alice", datetime.date(2024, 1, 3), 450),
    ("Sales", "Bob",   datetime.date(2024, 1, 1), 300),
    ("Sales", "Bob",   datetime.date(2024, 1, 2), 500),
    ("IT",    "Carol", datetime.date(2024, 1, 1), 150),
    ("IT",    "Carol", datetime.date(2024, 1, 2), 400),
    ("IT",    "Dave",  datetime.date(2024, 1, 1), 100),
    ("IT",    "Dave",  datetime.date(2024, 1, 2), 300),
    ("IT",    "Dave",  datetime.date(2024, 1, 3), 800),
]

schema = StructType([
    StructField("department", StringType(), True),
    StructField("employee", StringType(), True),
    StructField("date", DateType(), True),
    StructField("sales_amount", IntegerType(), True),
])

df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("test")
df.display()

In [0]:
%sql
SELECT * FROM test

## SELECT

### RANKING

#### RANK

The `RANK()` function assigns a rank to each row within a partition (group) of the data, based on the ordering you specify. Rows that tie on the ordering value receive the same rank.

**IMPORTANT**: `rank` skips numbers after ties. For example, two rows tied at rank 1 are followed by rank 3.
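To make the gap after ties concrete, here is a minimal plain-Python sketch of what `RANK()` computes for one partition. The helper name `sql_rank` is hypothetical and only emulates the semantics; it is not part of PySpark. The input is Alice's `sales_amount` values, already sorted in descending order.

```python
def sql_rank(values_desc):
    """Emulate SQL RANK() for one partition whose values are already sorted."""
    ranks = []
    for position, value in enumerate(values_desc, start=1):
        if position > 1 and value == values_desc[position - 2]:
            # Tie with the previous row: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # Rank equals the 1-based position, which is what creates the gap.
            ranks.append(position)
    return ranks


alice_sales = [450, 450, 300, 200]  # hypothetical partition, sorted descending
alice_ranks = sql_rank(alice_sales)
print(alice_ranks)  # [1, 1, 3, 4]
```

Rank 2 never appears because two rows share rank 1.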

In [0]:
window_spec = Window.partitionBy("employee").orderBy(df["sales_amount"].desc())

In [0]:
df.withColumn("rank", rank().over(window_spec)).display()

In [0]:
%sql
SELECT
  *,
  RANK() OVER (PARTITION BY employee ORDER BY sales_amount DESC) AS rank
FROM test


#### DENSE RANK

* It assigns a rank number to rows within a partition (group), based on some ordering.
* Ties (same value) get the same rank.
* Ranks are never skipped: they form a dense sequence (1, 1, 2, 3, ...).
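The bullets above can be sketched in plain Python. The helper `sql_dense_rank` is a hypothetical name that only emulates `DENSE_RANK()` for a single, already sorted partition; it is not a PySpark function.

```python
def sql_dense_rank(values_desc):
    """Emulate SQL DENSE_RANK() for one partition whose values are already sorted."""
    ranks = []
    for i, value in enumerate(values_desc):
        if i == 0:
            ranks.append(1)
        elif value == values_desc[i - 1]:
            # Tie with the previous row: same rank.
            ranks.append(ranks[-1])
        else:
            # New distinct value: next consecutive rank, no gap.
            ranks.append(ranks[-1] + 1)
    return ranks


alice_sales = [450, 450, 300, 200]  # hypothetical partition, sorted descending
alice_dense_ranks = sql_dense_rank(alice_sales)
print(alice_dense_ranks)  # [1, 1, 2, 3]
```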

In [0]:
df.withColumn("dense_rank", dense_rank().over(window_spec)).display()

In [0]:
%sql
SELECT
  *,
  DENSE_RANK() OVER (PARTITION BY employee ORDER BY sales_amount DESC) AS dense_rank
FROM test

#### ROW NUMBER

* It gives a unique, sequential number to each row within a partition (group), based on the order you define.
* No ties: Even if two rows have the same value, they get different row numbers.
* It always produces consecutive numbers: 1, 2, 3, 4, etc.



In [0]:
df.withColumn("row_number", row_number().over(window_spec)).display()

In [0]:
%sql
SELECT
  ROW_NUMBER() OVER (ORDER BY department) AS overall_row_number,
  *,
  ROW_NUMBER() OVER (PARTITION BY employee ORDER BY sales_amount DESC) AS row_number
FROM test

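### AGGREGATIONS

Create aggregations over a window without grouping: every input row is kept and receives the aggregate computed over its window. You can apply `sum()`, `avg()`, `count()`, `min()`, `max()`, etc. When the window has an `ORDER BY` and no explicit frame, the aggregate is cumulative: it covers the rows from the start of the partition up to the current row, including any rows that tie with it on the ordering value.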
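As a rough illustration of an ordered window aggregate, the plain-Python sketch below emulates `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` for one partition, assuming Spark's default frame for ordered windows (`RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`). The helper name `running_sum_default_frame` is hypothetical. The rows are Alice's, and the two rows that share a date receive the same running total.

```python
import datetime


def running_sum_default_frame(rows):
    """rows: list of (order_key, amount), sorted by order_key ascending.

    For each row, sum every amount whose key is <= the current key, so rows
    that tie on the key (peers) are included together.
    """
    return [sum(amount for key, amount in rows if key <= current_key)
            for current_key, _ in rows]


alice_rows = [
    (datetime.date(2024, 1, 1), 200),
    (datetime.date(2024, 1, 1), 300),
    (datetime.date(2024, 1, 2), 450),
    (datetime.date(2024, 1, 3), 450),
]
alice_running = running_sum_default_frame(alice_rows)
print(alice_running)  # [500, 500, 950, 1400]
```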

In [0]:
from pyspark.sql.functions import col, sum  # note: this shadows Python's built-in sum()

In [0]:
window_agg = Window.partitionBy("employee").orderBy(df["date"].asc())

In [0]:
df.withColumn("aggregations", sum(col("sales_amount")).over(window_agg)).display()


In [0]:
%sql
SELECT
  *,
  SUM(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date ASC
  ) AS aggregations
FROM test;


### ANALYTIC FUNCTIONS

Functions like `LEAD()`, `LAG()`, `FIRST_VALUE()`, and `LAST_VALUE()` look at other rows in relation to the current row, but without collapsing rows (unlike GROUP BY).
They add information from other rows to each row — very powerful in time series, ranking, and change detection.


In [0]:
df.display()

In [0]:
from pyspark.sql.functions import lead, lag, first, last

In [0]:
window_anlt = Window.partitionBy("employee").orderBy("date")

#### LEAD

`LEAD()` looks forward: it returns the value from the next row within the partition (group), based on the order you defined.

If there is no next row (because the current row is the last one in the group), it returns NULL, unless you supply a default value.

Think: "What's coming next?"

Syntax:

`LEAD(column_name, offset, default_value) OVER (PARTITION BY ... ORDER BY ...)`

`offset` is the number of rows to look ahead (1 if omitted), and `default_value` is returned when no such row exists (NULL if omitted).
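The effect of `offset` and `default_value` can be sketched in plain Python. The helper `sql_lead` is a hypothetical name that emulates `LEAD()` over one ordered partition (here, Dave's sales by date), with `None` standing in for NULL.

```python
def sql_lead(values, offset=1, default=None):
    """Emulate LEAD(): the value `offset` rows ahead, or `default` if none exists."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


dave_sales = [100, 300, 800]  # hypothetical partition, ordered by date

lead_null = sql_lead(dave_sales)                    # default is None (NULL)
lead_zero = sql_lead(dave_sales, 1, 0)              # custom default value
lead_two = sql_lead(dave_sales, offset=2)           # look two rows ahead

print(lead_null)  # [300, 800, None]
print(lead_zero)  # [300, 800, 0]
print(lead_two)   # [800, None, None]
```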

##### NULL (DEFAULT)

In [0]:
df.withColumn("lead", lead(col("sales_amount")).over(window_anlt)).display()

In [0]:
%sql
SELECT
  *,
  LEAD(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS lead
FROM test;

##### CUSTOM VALUE

In [0]:
df.withColumn("lead", lead(col("sales_amount"), 1, 0).over(window_anlt)).display()

In [0]:
%sql
SELECT
  *,
  LEAD(sales_amount, 1, 0) OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS lead
FROM test;

#### LAG

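`LAG()` looks backward: it returns the value from the previous row within the partition, based on the order you defined. If there is no previous row, it returns NULL, unless you supply a default value.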
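A common use of looking backward is change detection. The plain-Python sketch below emulates `LAG()` for one ordered partition and then derives the difference from the previous row. The helper `sql_lag` is a hypothetical name, and `None` stands in for NULL.

```python
def sql_lag(values, offset=1, default=None):
    """Emulate LAG(): the value `offset` rows back, or `default` if none exists."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


dave_sales = [100, 300, 800]  # hypothetical partition, ordered by date

previous = sql_lag(dave_sales)
# Difference from the previous row; stays None where there is no previous row.
change = [None if prev is None else cur - prev
          for cur, prev in zip(dave_sales, previous)]

print(previous)  # [None, 100, 300]
print(change)    # [None, 200, 500]
```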

##### NULL (DEFAULT)

In [0]:
df.withColumn("lag", lag(col("sales_amount")).over(window_anlt)).display()

In [0]:
%sql
SELECT
  *,
  LAG(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS lag
FROM test;

##### CUSTOM VALUE

`LAG(column_name, offset, default_value) OVER (PARTITION BY ... ORDER BY ...)`

In [0]:
df.withColumn("lag", lag(col("sales_amount"), 1, 0).over(window_anlt)).display()

In [0]:
%sql
SELECT
  *,
  LAG(sales_amount, 1, 0) OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS lag
FROM test;

#### FIRST

`FIRST()` returns the value from the first row of the current partition (group), based on the ordering.

In [0]:
df.withColumn("first", first(col("sales_amount")).over(window_anlt)).display()

In [0]:
%sql
SELECT
  *,
  FIRST(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS first
FROM test;

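#### LAST

`LAST()` returns the value from the last row of the window frame. Note that with an `ORDER BY` and no explicit frame, the frame ends at the current row, so `LAST()` returns the current row's value (or that of a row tied with it) rather than the last value of the whole partition. To get the partition's last value, use a frame that extends to `UNBOUNDED FOLLOWING`.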

In [0]:
df.withColumn("last", last("sales_amount").over(window_anlt)).display()

In [0]:
%sql
SELECT
  *,
  LAST(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS last
FROM test;
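The result of `LAST()` depends on where the frame ends. The plain-Python sketch below contrasts two cases for one partition: the default frame of an ordered window (assumed here to end at the current row and its peers) and a frame covering the whole partition. Both helper names are hypothetical and only emulate the semantics.

```python
def last_default_frame(rows):
    """rows: list of (order_key, value), sorted by order_key ascending.

    Frame = start of partition up to the current row (plus rows tied on the key),
    so the last value in the frame is usually the current row's own value.
    """
    result = []
    for current_key, _ in rows:
        frame = [value for key, value in rows if key <= current_key]
        result.append(frame[-1])
    return result


def last_full_frame(rows):
    """Frame = whole partition, so every row sees the partition's final value."""
    return [rows[-1][1]] * len(rows)


dave_rows = [("2024-01-01", 100), ("2024-01-02", 300), ("2024-01-03", 800)]

print(last_default_frame(dave_rows))  # [100, 300, 800]
print(last_full_frame(dave_rows))     # [800, 800, 800]
```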

### FRAME SPECIFICATIONS

In window functions, a frame defines the subset of rows within the partition that the function operates on, relative to the current row. The frame can be adjusted, for example by setting how many rows before or after the current row it includes.

In [0]:
window_fs = Window.partitionBy("employee").orderBy("date")
df.withColumn("row_number", row_number().over(window_fs)).display()

In [0]:
%sql
SELECT
  *,
  ROW_NUMBER() OVER (
    PARTITION BY employee
    ORDER BY date
  ) AS row_number
FROM test;

#### ROWS

* This defines the window based on physical rows — a specific number of rows before and after the current row.
* It does not care about the values in the rows, just their position relative to the current row.

To define a row-based frame, add `.rowsBetween(start, end)` to the `Window` definition.

In [0]:
from pyspark.sql.functions import sum

##### ROW FRAMES

Row frames are how window functions in SQL or PySpark define exactly which rows to consider around each current row.

They are based on the physical position of the rows (not the values).

###### ROWS BETWEEN

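`ROWS BETWEEN start AND end` defines the frame as a span of physical rows around the current row. In PySpark, `.rowsBetween(start, end)` takes offsets relative to the current row: negative values are preceding rows, `0` is the current row, and positive values are following rows.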
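As a minimal sketch of a row-based frame, the plain-Python function below emulates `SUM(...) OVER (... ROWS BETWEEN start AND end)` for one ordered partition. The helper name `sum_rows_between` is hypothetical. Offsets are relative to the current row, and the frame is clipped at the partition edges.

```python
def sum_rows_between(values, start, end):
    """Emulate SUM over ROWS BETWEEN start AND end (offsets relative to current row).

    Returns None (NULL) for a row whose frame contains no rows.
    """
    n = len(values)
    result = []
    for i in range(n):
        lo = max(0, i + start)       # clip at the first row of the partition
        hi = min(n - 1, i + end)     # clip at the last row of the partition
        frame = values[lo:hi + 1] if lo <= hi else []
        result.append(sum(frame) if frame else None)
    return result


dave_sales = [100, 300, 800]  # hypothetical partition, ordered by date

# 1 row before, the current row, and 1 row after
print(sum_rows_between(dave_sales, -1, 1))  # [400, 1200, 1100]
```

The first and last rows have smaller frames because there is no row before the first one or after the last one.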




In [0]:

# take 1 row before, current and 1 row after
windows_row = Window.partitionBy("employee").orderBy("date").rowsBetween(-1, 1)

df.withColumn("sum", sum("sales_amount").over(windows_row)).display()


In [0]:
%sql
SELECT
  *,
  SUM(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS sum
FROM test;

###### UNBOUNDED PRECEDING

`UNBOUNDED PRECEDING` means the frame starts at the first row of the partition and runs up to the end bound you define.

In [0]:
# take from the first row of the partition through 2 rows after the current row
windows_row = Window.partitionBy("employee").orderBy("date").rowsBetween(Window.unboundedPreceding, 2)

df.withColumn("sum", sum("sales_amount").over(windows_row)).display()

In [0]:
%sql
SELECT
  *,
  SUM(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND 2 FOLLOWING
  ) AS sum
FROM test;

###### UNBOUNDED FOLLOWING

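`UNBOUNDED FOLLOWING` means the frame extends all the way to the last row of the partition. The example below combines it with `UNBOUNDED PRECEDING`, so the frame covers the whole partition and `last()` returns the partition's final value on every row.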

In [0]:

# separate name so the ranking window_spec defined earlier is not overwritten
window_full = Window.partitionBy("employee").orderBy("date").rowsBetween(
    Window.unboundedPreceding,
    Window.unboundedFollowing
)

df.withColumn("last_sales", last("sales_amount").over(window_full)).display()

###### CURRENT ROW

It's the current row the function is operating on.

Note: we can use `0` instead of `Window.currentRow`.

In [0]:
# take current value and 1 row after
windows_row = Window.partitionBy("employee").orderBy("date").rowsBetween(Window.currentRow, 1)

df.withColumn("sum", sum("sales_amount").over(windows_row)).display()

In [0]:
%sql
SELECT
  *,
  SUM(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY date
    ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
  ) AS sum
FROM test;

#### RANGES

* This defines the window based on value ranges, not row positions.
* The frame looks at rows whose values fall within a specific range of the current row’s value.
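To show how a value-based frame differs from a row-based one, the plain-Python sketch below emulates `SUM(...) OVER (ORDER BY x RANGE BETWEEN start AND end)` for one partition, assuming the sum is taken over the same column used for ordering and that the order is ascending. The helper name `sum_range_between` is hypothetical.

```python
def sum_range_between(values, start, end):
    """Emulate SUM over RANGE BETWEEN start AND end on an ascending ordering column.

    The frame holds every row whose value lies in
    [current + start, current + end]; an empty frame yields None (NULL).
    """
    result = []
    for current in values:
        frame = [v for v in values if current + start <= v <= current + end]
        result.append(sum(frame) if frame else None)
    return result


alice_sales = [200, 300, 450, 450]  # hypothetical partition, sorted ascending

# Rows whose value is between (current + 200) and (current + 300)
print(sum_range_between(alice_sales, 200, 300))  # [900, None, None, None]
```

Only the row with 200 finds values in its range (400 to 500), namely the two 450 rows. The other rows have empty frames.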

In [0]:
# frame: rows whose sales_amount is between (current sales_amount + 200) and (current sales_amount + 300)
windows_row = Window.partitionBy("employee").orderBy("sales_amount").rangeBetween(200, 300)

df.withColumn("sum", sum("sales_amount").over(windows_row)).display()

In [0]:
%sql
SELECT
  *,
  SUM(sales_amount) OVER (
    PARTITION BY employee
    ORDER BY sales_amount
    RANGE BETWEEN 200 FOLLOWING AND 300 FOLLOWING
  ) AS sum
FROM test;

## GROUP BY

The `window()` function in `pyspark.sql.functions` is used to group data based on time windows, which is especially useful for performing aggregations on time series or event sequences. This function divides the dataset into time intervals (e.g., minutes, hours, days) and allows operations like counts, sums, averages, etc., within each time window.


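Supported time units (window durations are fixed lengths of time, so calendar units such as months, quarters, and years are not supported):


* microseconds
* milliseconds
* seconds
* minutes
* hours
* days
* weeks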
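Conceptually, a fixed-length (tumbling) time window assigns each timestamp to exactly one bucket. The plain-Python sketch below shows one way to compute that bucket, assuming windows are aligned to 1970-01-01 00:00:00 UTC with an inclusive start and exclusive end. The helper name `tumbling_window` is hypothetical and only illustrates the idea behind `window()`.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, duration):
    """Return (start, end) of the fixed-length window that contains ts."""
    elapsed = ts - EPOCH
    # Whole number of windows since the epoch, then back to a timestamp.
    start = EPOCH + (elapsed // duration) * duration
    return start, start + duration


event_time = datetime(2024, 1, 2, 15, 30, tzinfo=timezone.utc)  # hypothetical event

day_start, day_end = tumbling_window(event_time, timedelta(days=1))
hour_start, hour_end = tumbling_window(event_time, timedelta(hours=1))

print(day_start, day_end)    # 2024-01-02 00:00:00+00:00 2024-01-03 00:00:00+00:00
print(hour_start, hour_end)  # 2024-01-02 15:00:00+00:00 2024-01-02 16:00:00+00:00
```

Grouping by the returned `(start, end)` pair and aggregating is then an ordinary `GROUP BY`.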

In [0]:
df.display()

In [0]:
from pyspark.sql.functions import window

### SIMPLE UNIT

In [0]:
df.groupBy(window("date", "1 day"), "employee").sum("sales_amount").display()

### WINDOWS + SLIDE

In [0]:
# 1-day window that slides every 12 hours
df.groupBy(window("date", "1 day", "12 hours"), "employee").sum("sales_amount").display()
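With a slide shorter than the window duration, windows overlap and a single row can fall into several of them. The plain-Python sketch below lists every window that contains a given timestamp, assuming window starts are multiples of the slide counted from 1970-01-01 00:00:00 UTC, with an inclusive start and exclusive end. The helper name `sliding_windows` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(ts, duration, slide):
    """Return all (start, end) windows of length `duration` that contain ts."""
    elapsed = ts - EPOCH
    # Latest window start that is not after ts.
    start = EPOCH + (elapsed // slide) * slide
    windows = []
    # Step backwards by one slide while the window still covers ts.
    while start + duration > ts:
        windows.append((start, start + duration))
        start -= slide
    return sorted(windows)


sale_time = datetime(2024, 1, 2, tzinfo=timezone.utc)  # hypothetical: a date at midnight

for start, end in sliding_windows(sale_time, timedelta(days=1), timedelta(hours=12)):
    print(start, "->", end)
# 2024-01-01 12:00:00+00:00 -> 2024-01-02 12:00:00+00:00
# 2024-01-02 00:00:00+00:00 -> 2024-01-03 00:00:00+00:00
```

Because each row lands in two windows here, its `sales_amount` contributes to two of the grouped sums.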