# Aggregations in PySpark

## Table of Contents
1. [Introduction to Aggregations](#1-introduction-to-aggregations)
2. [Simple Aggregations](#2-simple-aggregations)
3. [Grouped Aggregations](#3-grouped-aggregations)
4. [Window Functions](#4-window-functions)
5. [Cumulative Aggregations](#5-cumulative-aggregations)
6. [Pivoting Data](#6-pivoting-data)
7. [Summary](#7-summary)

## 1. Introduction to Aggregations

Aggregations are one of the most common operations in data processing. They allow us to summarize data by computing metrics such as counts, sums, averages, and more. In PySpark, aggregations are essential for transforming large datasets into meaningful insights.

### Why are Aggregations Important?
- **Data Summarization**: Reduce large datasets into meaningful summaries.
- **Insight Generation**: Compute metrics like averages, totals, and counts.
- **Data Transformation**: Prepare data for further analysis or visualization.

### Key Concepts in Aggregations
- **Simple Aggregations**: Basic operations like count, sum, avg, min, and max.
- **Grouped Aggregations**: Aggregations performed on grouped data using `groupBy`.
- **Window Functions**: Calculations over a set of rows related to the current row (a window) that return a value for every input row instead of collapsing the group into one row.
- **Cumulative Aggregations**: Aggregations that accumulate over rows, such as running totals.
- **Pivoting Data**: Transforming data from long to wide format for better readability.

## 2. Simple Aggregations

Simple aggregations are the most basic form of data summarization. They include operations like counting rows, summing values, and computing averages.

### Example: Simple Aggregations on Customer Data

Let's start by loading the customer data and performing some simple aggregations.

```python
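from pyspark.sql import SparkSession
# Note: importing sum, min, and max by name shadows Python's built-ins
# of the same name for the rest of the notebook.
from pyspark.sql.functions import count, sum, avg, min, max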

# Initialize Spark session
spark = SparkSession.builder.appName("Aggregations").getOrCreate()

# Load customer data
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Show the first few rows
df.show(5)
```

**Output** (abridged to rows 0 to 5 and the last row; `df.show(5)` itself prints only the first five rows):

| customer_id | name        | city      | state       | country | registration_date | is_active |
|-------------|-------------|-----------|-------------|---------|-------------------|-----------|
| 0           | Customer_0  | Pune      | West Bengal | India   | 2023-10-10        | True      |
| 1           | Customer_1  | Bangalore | Gujarat     | India   | 2023-10-19        | False     |
| 2           | Customer_2  | Bangalore | Karnataka   | India   | 2023-02-10        | True      |
| 3           | Customer_3  | Bangalore | Telangana   | India   | 2023-03-24        | True      |
| 4           | Customer_4  | Hyderabad | Telangana   | India   | 2023-06-04        | False     |
| 5           | Customer_5  | Hyderabad | Telangana   | India   | 2023-06-04        | False     |
| ...         | ...         | ...       | ...         | ...     | ...               | ...       |
| 10          | Customer_10 | Hyderabad | Telangana   | India   | 2023-06-04        | False     |

### Count
Count the total number of customers.

```python
df.select(count("*")).show()
```

**Output:**

| count(1) |
|----------|
| 11       |

### Sum
Sum the `customer_id` column. Summing an identifier is not meaningful in practice; it is used here only to demonstrate the function.

```python
df.select(sum("customer_id")).show()
```

**Output:**

| sum(customer_id) |
|------------------|
| 55               |

### Average
Compute the average `customer_id`.

```python
df.select(avg("customer_id")).show()
```

**Output:**

| avg(customer_id) |
|------------------|
| 5.0              |

### Min and Max
Find the minimum and maximum `customer_id`.

```python
df.select(min("customer_id"), max("customer_id")).show()
```

**Output:**

| min(customer_id) | max(customer_id) |
|------------------|------------------|
| 0                | 10               |

## 3. Grouped Aggregations

Grouped aggregations allow us to perform aggregations on subsets of data, grouped by one or more columns. This is done using the `groupBy` function.

### Example: Grouped Aggregations on Customer Data

Let's group the customers by `city` and compute the count, average `customer_id`, and the number of active customers.

```python
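from pyspark.sql.functions import col, count, avg, sum as _sum

# Group by city and perform aggregations
grouped_df = df.groupBy("city").agg(
    count("*").alias("total_customers"),
    avg("customer_id").alias("avg_customer_id"),
    # With inferSchema, is_active is read as boolean, which sum() rejects;
    # cast it to int so that True counts as 1 and False as 0
    _sum(col("is_active").cast("int")).alias("active_customers")
)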

grouped_df.show()
```

**Output:**

| city      | total_customers | avg_customer_id | active_customers |
|-----------|-----------------|-----------------|------------------|
| Pune      | 1               | 0.0             | 1                |
| Bangalore | 5               | 4.0             | 3                |
| Hyderabad | 3               | 5.0             | 1                |
| Ahmedabad | 1               | 9.0             | 0                |
| Chennai   | 1               | 10.0            | 0                |

## 4. Window Functions

Window functions allow us to perform calculations across a set of rows that are related to the current row. This is useful for tasks like ranking, cumulative sums, and moving averages.

### Example: Window Functions on Customer Data

Let's rank customers within each city based on their `customer_id`.

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank

# Define a window partitioned by city and ordered by customer_id
window_spec = Window.partitionBy("city").orderBy("customer_id")

# Add row number, rank, and dense rank
df_with_rank = df.withColumn("row_number", row_number().over(window_spec)) \
                 .withColumn("rank", rank().over(window_spec)) \
                 .withColumn("dense_rank", dense_rank().over(window_spec))

df_with_rank.show()
```

**Output:**

| customer_id | name       | city      | state       | country | registration_date | is_active | row_number | rank | dense_rank |
|-------------|------------|-----------|-------------|---------|-------------------|-----------|------------|------|------------|
| 0           | Customer_0 | Pune      | West Bengal | India   | 2023-10-10        | True      | 1          | 1    | 1          |
| 1           | Customer_1 | Bangalore | Gujarat     | India   | 2023-10-19        | False     | 1          | 1    | 1          |
| 2           | Customer_2 | Bangalore | Karnataka   | India   | 2023-02-10        | True      | 2          | 2    | 2          |
| 3           | Customer_3 | Bangalore | Telangana   | India   | 2023-03-24        | True      | 3          | 3    | 3          |
| 4           | Customer_4 | Hyderabad | Telangana   | India   | 2023-06-04        | False     | 1          | 1    | 1          |

## 5. Cumulative Aggregations

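Cumulative aggregations compute running totals or other cumulative metrics over an ordered set of rows. They are window functions whose frame is set with `rowsBetween` (the frame is counted in physical rows) or `rangeBetween` (the frame is defined by the values of the ordering column, so rows that tie on that column share the same frame).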

### Example: Cumulative Sum on Customer Data

Let's compute the cumulative sum of `customer_id` within each city.

```python
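from pyspark.sql.functions import sum as _sum
from pyspark.sql.window import Window

# Window partitioned by city and ordered by customer_id, with a frame that
# runs from the first row of the partition up to the current row
window_spec = (
    Window.partitionBy("city")
    .orderBy("customer_id")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)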

# Add cumulative sum
df_with_cumsum = df.withColumn("cumulative_sum", _sum("customer_id").over(window_spec))

df_with_cumsum.show()
```

**Output:**

| customer_id | name       | city      | state       | country | registration_date | is_active | cumulative_sum |
|-------------|------------|-----------|-------------|---------|-------------------|-----------|----------------|
| 0           | Customer_0 | Pune      | West Bengal | India   | 2023-10-10        | True      | 0              |
| 1           | Customer_1 | Bangalore | Gujarat     | India   | 2023-10-19        | False     | 1              |
| 2           | Customer_2 | Bangalore | Karnataka   | India   | 2023-02-10        | True      | 3              |
| 3           | Customer_3 | Bangalore | Telangana   | India   | 2023-03-24        | True      | 6              |
| 4           | Customer_4 | Hyderabad | Telangana   | India   | 2023-06-04        | False     | 4              |

## 6. Pivoting Data

Pivoting turns the distinct values of one column into separate columns, transforming data from long to wide format so that categories can be compared side by side.

### Example: Pivoting Customer Data by City

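Let's group the customers by `city`, pivot on `state`, and count the customers in each city and state combination. `pivot` leaves `null` where a combination has no rows, so the result is filled with 0.

```python
# One row per city, one column per distinct state; missing combinations become 0
pivoted_df = df.groupBy("city").pivot("state").count().na.fill(0)
pivoted_df.show()
```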

**Output:**

| city      | Gujarat | Karnataka | Maharashtra | Telangana | West Bengal |
|-----------|---------|-----------|-------------|-----------|-------------|
| Pune      | 0       | 0         | 0           | 0         | 1           |
| Bangalore | 1       | 1         | 1           | 2         | 0           |
| Hyderabad | 0       | 1         | 0           | 2         | 0           |
| Ahmedabad | 0       | 0         | 0           | 0         | 0           |
| Chennai   | 1       | 0         | 0           | 0         | 0           |

## 7. Summary

In this notebook, we explored various types of aggregations in PySpark, including simple aggregations, grouped aggregations, window functions, cumulative aggregations, and pivoting data. These operations are essential for summarizing and transforming large datasets in big data processing.

### Key Takeaways:
- **Simple Aggregations**: Use functions like `count`, `sum`, `avg`, `min`, and `max` for basic data summarization.
- **Grouped Aggregations**: Use `groupBy` to perform aggregations on subsets of data.
- **Window Functions**: Use window functions for calculations that keep every input row, such as ranking and cumulative sums.
- **Cumulative Aggregations**: Use `rowsBetween` or `rangeBetween` to compute running totals.
- **Pivoting Data**: Use `pivot` to transform data from long to wide format.

By mastering these aggregation techniques, you can efficiently process and analyze large datasets in PySpark.