# Window Functions in PySpark

## Table of Contents
1. [Introduction to Window Functions](#1-introduction-to-window-functions)
2. [Key Concepts in Window Functions](#2-key-concepts-in-window-functions)
3. [Rank](#3-rank)
4. [Dense Rank](#4-dense-rank)
5. [Row Number](#5-row-number)
6. [Lead](#6-lead)
7. [Lag](#7-lag)
8. [Summary](#8-summary)

## 1. Introduction to Window Functions

Window functions in PySpark perform calculations across a set of rows that are related to the current row.
Unlike `groupBy` aggregations, which collapse each group into a single row, a window function returns a value for every input row, computed over a "window" of related rows.
This is particularly useful for tasks like ranking, cumulative sums, and comparing a row with its neighbors.

### Why are Window Functions Important?
- **Ranking**: Assign ranks to rows based on specific criteria.
- **Cumulative Calculations**: Compute running totals, averages, etc.
- **Row Comparison**: Compare current rows with previous or next rows using `lead` and `lag`.
- **Data Analysis**: Perform advanced data analysis without reshaping the data.

### Key Components of Window Functions
- **Partitioning**: Divides the data into groups (e.g., by country).
- **Ordering**: Specifies the order of rows within each partition.
- **Window Frame**: Defines the range of rows to include in the calculation (e.g., all previous rows, current row, etc.).

```
┌───────────────┐
│  Partition 1  │
│  Row 1        │
│  Row 2        │
│  Row 3        │
└───────────────┘
┌───────────────┐
│  Partition 2  │
│  Row 1        │
│  Row 2        │
└───────────────┘
```

## 2. Key Concepts in Window Functions

Before diving into specific window functions, let's understand the key concepts:

### Partitioning
Partitioning divides the dataset into groups based on a column (e.g., `country`). Each group is processed independently.

### Ordering
Ordering specifies the order of rows within each partition. For example, you can order rows by `invoicevalue` in descending order.

### Window Frame
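The window frame defines the range of rows to include in the calculation. A frame has a start boundary and an end boundary, set with `rowsBetween` or `rangeBetween`. Common boundaries include:
- **Unbounded Preceding** (`Window.unboundedPreceding`): The first row of the partition.
- **Current Row** (`Window.currentRow`): The row being evaluated.
- **Unbounded Following** (`Window.unboundedFollowing`): The last row of the partition.

For example, a frame from unbounded preceding to the current row covers every row from the start of the partition up to and including the current row.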

```
┌───────────────┐
│  Partition 1  │
│  Row 1        │  ◄── Window Frame
│  Row 2        │
│  Row 3        │
└───────────────┘
```

## 3. Rank

The `rank` function assigns a rank to each row within a partition. Rows with equal values in the ordering column receive the same rank, and the ranks that follow a tie leave a gap (for example 1, 1, 3).

### Example: Rank Customers by Invoice Value

Let's rank customers within each country based on their `invoicevalue`.

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Initialize Spark session
spark = SparkSession.builder.appName("WindowFunctions").getOrCreate()

# Load customer data
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Define window specification
window_spec = Window.partitionBy("country").orderBy(df["invoicevalue"].desc())

# Add rank column
df_with_rank = df.withColumn("rank", rank().over(window_spec))

# Show results
df_with_rank.show()
```

**Output:**

| country | invoicevalue | rank |
|---------|--------------|------|
| USA     | 1000         | 1    |
| USA     | 1000         | 1    |
| USA     | 800          | 3    |
| UK      | 1200         | 1    |
| UK      | 900          | 2    |

### Key Points:
- **Same Value, Same Rank**: Rows with the same `invoicevalue` receive the same rank.
- **Rank Skipping**: After a tie, rank values are skipped so that the next rank matches the row's position in the partition (e.g., rank 1, 1, 3).
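The gap behavior can be reproduced in plain Python. This is only a sketch of the semantics for a single partition, not how Spark implements `rank`; the helper name `rank_desc` is hypothetical.

```python
def rank_desc(values):
    """Return (value, rank) pairs with gaps after ties, ordered descending."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for position, value in enumerate(ordered, start=1):
        if ranks and value == ordered[position - 2]:
            # Tie with the previous row: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New value: the rank is the 1-based row position, which creates the gap.
            ranks.append(position)
    return list(zip(ordered, ranks))


usa_ranks = rank_desc([1000, 800, 1000])
print(usa_ranks)  # [(1000, 1), (1000, 1), (800, 3)]
```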

## 4. Dense Rank

The `dense_rank` function is similar to `rank`: rows with the same value receive the same rank. The difference is that it leaves no gaps, so the row after a tie gets the next consecutive rank.

### Example: Dense Rank Customers by Invoice Value

Let's assign dense ranks to customers within each country based on their `invoicevalue`, reusing `df` and `window_spec` from the rank example.

```python
from pyspark.sql.functions import dense_rank

# Add dense rank column
df_with_dense_rank = df.withColumn("dense_rank", dense_rank().over(window_spec))

# Show results
df_with_dense_rank.show()
```

**Output:**

| country | invoicevalue | dense_rank |
|---------|--------------|------------|
| USA     | 1000         | 1          |
| USA     | 1000         | 1          |
| USA     | 800          | 2          |
| UK      | 1200         | 1          |
| UK      | 900          | 2          |

### Key Points:
- **Same Value, Same Rank**: Rows with the same `invoicevalue` receive the same rank.
- **No Rank Skipping**: The next rank is not skipped after ties (e.g., rank 1, 1, 2).
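One way to picture dense ranking is that each distinct value gets one consecutive rank. The plain-Python sketch below illustrates that idea for a single partition; the helper name `dense_rank_desc` is hypothetical and this is not Spark's implementation.

```python
def dense_rank_desc(values):
    """Return (value, dense_rank) pairs without gaps, ordered descending."""
    # Each distinct value maps to one consecutive rank, starting at 1.
    distinct = sorted(set(values), reverse=True)
    rank_of = {value: index for index, value in enumerate(distinct, start=1)}
    return [(value, rank_of[value]) for value in sorted(values, reverse=True)]


usa_dense = dense_rank_desc([1000, 800, 1000])
print(usa_dense)  # [(1000, 1), (1000, 1), (800, 2)]
```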

## 5. Row Number

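The `row_number` function assigns a unique sequential number to each row within a partition, starting from 1. Unlike `rank` and `dense_rank`, `row_number` gives tied rows different numbers, and the order among tied rows is arbitrary unless the ordering includes a tie-breaking column.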

### Example: Row Number for Customers by Invoice Value

Let's assign row numbers to customers within each country based on their `invoicevalue`.

```python
from pyspark.sql.functions import row_number

# Add row number column
df_with_row_number = df.withColumn("row_number", row_number().over(window_spec))

# Show results
df_with_row_number.show()
```

**Output:**

| country | invoicevalue | row_number |
|---------|--------------|------------|
| USA     | 1000         | 1          |
| USA     | 1000         | 2          |
| USA     | 800          | 3          |
| UK      | 1200         | 1          |
| UK      | 900          | 2          |

### Key Points:
- **Unique Numbers**: Each row gets a unique number, even if values are the same.
- **Arbitrary Order Among Ties**: Rows with equal ordering values are numbered in an unspecified order, which can differ between runs.

## 6. Lead

The `lead` function returns the value of a column from a following row within the same partition (by default, the next row). This is useful for comparing the current row with the row after it.

### Example: Lead for Invoice Value

Let's compare the current `invoicevalue` with the next row's `invoicevalue`.

```python
from pyspark.sql.functions import lead

# Add lead column
df_with_lead = df.withColumn("next_invoicevalue", lead("invoicevalue").over(window_spec))

# Show results
df_with_lead.show()
```

**Output:**

| country | invoicevalue | next_invoicevalue |
|---------|--------------|-------------------|
| USA     | 1000         | 1000              |
| USA     | 1000         | 800               |
| USA     | 800          | null              |
| UK      | 1200         | 900               |
| UK      | 900          | null              |

### Key Points:
- **Next Row Value**: Accesses the value of the next row.
- **Null for Last Row**: The last row in each partition has `null` for the lead value unless a default is supplied.

## 7. Lag

The `lag` function returns the value of a column from a preceding row within the same partition (by default, the previous row). This is useful for comparing the current row with the row before it.

### Example: Lag for Invoice Value

Let's compare the current `invoicevalue` with the previous row's `invoicevalue`.

```python
from pyspark.sql.functions import lag

# Add lag column
df_with_lag = df.withColumn("previous_invoicevalue", lag("invoicevalue").over(window_spec))

# Show results
df_with_lag.show()
```

**Output:**

| country | invoicevalue | previous_invoicevalue |
|---------|--------------|-----------------------|
| USA     | 1000         | null                  |
| USA     | 1000         | 1000                  |
| USA     | 800          | 1000                  |
| UK      | 1200         | null                  |
| UK      | 900          | 1200                  |

### Key Points:
- **Previous Row Value**: Accesses the value of the previous row.
- **Null for First Row**: The first row in each partition has `null` for the lag value unless a default is supplied.

## 8. Summary

In this notebook, we explored **Window Functions** in PySpark, including `rank`, `dense_rank`, `row_number`, `lead`, and `lag`. These functions are essential for advanced data analysis, allowing you to perform calculations across related rows.

### Key Takeaways:
- **Rank**: Assigns ranks with skips after ties.
- **Dense Rank**: Assigns ranks without skips after ties.
- **Row Number**: Assigns unique sequential numbers to rows.
- **Lead**: Accesses the value of the next row.
- **Lag**: Accesses the value of the previous row.

By mastering these window functions, you can perform complex data analysis tasks efficiently in PySpark.