### Window Function

# 🧠 PySpark Window Functions Explained

Window functions in PySpark allow you to perform calculations **across a set of rows related to the current row**, without collapsing them into a single result (unlike aggregate functions).

---

## 🔍 What Is a Window?

A **window** defines a **subset of rows** within a DataFrame over which a window function operates.

A window is defined using:

```
Window.partitionBy(...).orderBy(...)
```

- `partitionBy(...)`: (Optional) Splits rows into partitions, like SQL's `PARTITION BY`. Unlike `GROUP BY`, it keeps every input row; ranking and calculations restart in each partition.
- `orderBy(...)`: (Required for ranking functions and for `lag`/`lead`) Orders rows within each partition. This order determines each row's rank and which row counts as "previous" or "next".

---

## 🪜 Common PySpark Window Functions

### 1. `row_number()`

- Assigns a unique sequential number to each row **within a partition**, ordered by the specified column.
- **No ties**: rows with the same value still get different numbers. The order among tied rows is not guaranteed unless you add a tie-breaking column to `orderBy`.

```
row_number().over(Window.partitionBy("Department").orderBy(col("Salary").desc()))
```

| Employee | Salary | row_number |
|----------|--------|------------|
| Alice    | 5000   | 1          |
| Bob      | 4800   | 2          |
| Charlie  | 4800   | 3          |

---

### 2. `rank()`

- Assigns ranks **with gaps** in case of ties.
- If two rows are tied at rank 2, the next rank will be 4 (not 3).

```
rank().over(Window.partitionBy("Department").orderBy(col("Salary").desc()))
```

| Employee | Salary | rank |
|----------|--------|------|
| Alice    | 5000   | 1    |
| Bob      | 4800   | 2    |
| Charlie  | 4800   | 2    |
| David    | 4700   | 4    |

---

### 3. `dense_rank()`

- Similar to `rank()`, but **no gaps** in ranking.
- Ties share the same rank, and the next rank is incremented by 1.

```
dense_rank().over(Window.partitionBy("Department").orderBy(col("Salary").desc()))
```

| Employee | Salary | dense_rank |
|----------|--------|------------|
| Alice    | 5000   | 1          |
| Bob      | 4800   | 2          |
| Charlie  | 4800   | 2          |
| David    | 4700   | 3          |

---

### 4. `lag(column, offset)`

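- Retrieves the **value from the row `offset` rows before the current one** in the window; returns `null` when no such row exists.
- Useful for comparing the current row with a previous one (e.g., the salary gap to the next-higher earner).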

```
lag("Salary", 1).over(Window.partitionBy("Department").orderBy(col("Salary").desc()))
```

| Employee | Salary | lag_salary |
|----------|--------|------------|
| Alice    | 5000   | null       |
| Bob      | 4800   | 5000       |
| Charlie  | 4700   | 4800       |

---

### 5. `lead(column, offset)`

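- Retrieves the **value from the row `offset` rows after the current one** in the window; returns `null` when no such row exists.
- Useful for forward-looking comparisons (e.g., the difference between the current value and the next one).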

```
lead("Salary", 1).over(Window.partitionBy("Department").orderBy(col("Salary").desc()))
```

| Employee | Salary | lead_salary |
|----------|--------|-------------|
| Alice    | 5000   | 4800        |
| Bob      | 4800   | 4700        |
| Charlie  | 4700   | null        |

---

## 💡 Full Window Spec Example

```
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank, lag, lead

# Highest salary first, matching the tables above
windowSpec = Window.partitionBy("Department").orderBy(col("Salary").desc())

df.withColumn("row_number", row_number().over(windowSpec)) \
  .withColumn("rank", rank().over(windowSpec)) \
  .withColumn("dense_rank", dense_rank().over(windowSpec)) \
  .withColumn("lag_salary", lag("Salary", 1).over(windowSpec)) \
  .withColumn("lead_salary", lead("Salary", 1).over(windowSpec)) \
  .show()
```

---

## 📝 Summary of Differences

| Function       | Tied Rows Share a Number | Gaps After Ties | Use Case                        |
|----------------|--------------------------|-----------------|---------------------------------|
| `row_number()` | ❌                       | N/A             | Always unique row order         |
| `rank()`       | ✅                       | ✅              | Official ranking with gaps      |
| `dense_rank()` | ✅                       | ❌              | Compact ranks without skipping  |
| `lag()`        | N/A                      | N/A             | Look back to previous value     |
| `lead()`       | N/A                      | N/A             | Look forward to next value      |

---

## ✅ Bonus Tip: Without `partitionBy`

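If you omit `partitionBy`, window functions apply **across the entire DataFrame**, treated as one partition. Spark then moves all rows to a single partition to evaluate the window, which can be slow on large datasets.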

```
Window.orderBy("Salary")
```

This ranks rows globally (not grouped by department).

---


```
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, lead

# Start Spark session
spark = SparkSession.builder.appName("WindowFunctionsEnhanced").getOrCreate()

# Sample data
data = [
    ("Sales", "Alice", 5000),
    ("Sales", "Bob", 4800),
    ("Sales", "Charlie", 4800),
    ("HR", "David", 4000),
    ("HR", "Eva", 4000),
    ("HR", "Frank", 3900),
]

# Create DataFrame
df = spark.createDataFrame(data, ["Department", "Employee", "Salary"])

# Define window specification: Partition by Department, Order by Salary descending
windowSpec = Window.partitionBy("Department").orderBy(df["Salary"].desc())

# Apply window functions
df_with_all = df \
    .withColumn("row_number", row_number().over(windowSpec)) \
    .withColumn("rank", rank().over(windowSpec)) \
    .withColumn("dense_rank", dense_rank().over(windowSpec)) \
    .withColumn("lag_salary", lag("Salary", 1).over(windowSpec)) \
    .withColumn("lead_salary", lead("Salary", 1).over(windowSpec))

# Show result
df_with_all.show()
```


```
+----------+--------+------+----------+----+----------+----------+-----------+
|Department|Employee|Salary|row_number|rank|dense_rank|lag_salary|lead_salary|
+----------+--------+------+----------+----+----------+----------+-----------+
|        HR|   David|  4000|         1|   1|         1|      null|       4000|
|        HR|     Eva|  4000|         2|   1|         1|      4000|       3900|
|        HR|   Frank|  3900|         3|   3|         2|      4000|       null|
|     Sales|   Alice|  5000|         1|   1|         1|      null|       4800|
|     Sales|     Bob|  4800|         2|   2|         2|      5000|       4800|
|     Sales| Charlie|  4800|         3|   2|         2|      4800|       null|
+----------+--------+------+----------+----+----------+----------+-----------+
```

