# **Aggregation Functions**

**In Spark, an aggregation function computes a single result from a set of values. Aggregations are typically applied to DataFrames to summarize and analyze data, either over the whole DataFrame or per group. The sections below cover common aggregation functions with PySpark examples; all of them reuse the `spark` session and the sample DataFrame `df` created in the first example.**

### 1. `count`:

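`DataFrame.count()` is an action that returns the number of rows in a DataFrame. The `count` function from `pyspark.sql.functions` is the matching aggregate expression, used inside `agg` (see section 5).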

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data
data = [("Alice", 28, "New York"),
        ("Bob", 35, "San Francisco"),
        ("Charlie", 22, "Los Angeles")]

# Define the schema
schema = ["name", "age", "city"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

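# Count the number of rows with the DataFrame action
row_count = df.count()

# The same count written as an aggregate expression
row_count_agg = df.agg(count("*")).collect()[0][0]

# Show the result
print(f"Number of rows: {row_count}")

# Keep the session open: the examples below reuse `spark` and `df`.
# Call spark.stop() only once you have finished running them.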
```

### 2. `sum`:

The `sum` function returns the sum of the values in a column, ignoring nulls.

```python
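# Note: importing sum (and later min, max) by name shadows the Python
# built-ins of the same name in the current namespace
from pyspark.sql.functions import sum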

# Sum the values in the "age" column
total_age = df.agg(sum("age")).collect()[0][0]

# Show the result
print(f"Total age: {total_age}")
```

### 3. `avg`:

The `avg` function calculates the average of values in a column.

```python
from pyspark.sql.functions import avg

# Calculate the average age
average_age = df.agg(avg("age")).collect()[0][0]

# Show the result
print(f"Average age: {average_age}")
```

### 4. `min` and `max`:

The `min` and `max` functions find the minimum and maximum values in a column, respectively.

```python
from pyspark.sql.functions import min, max

# Find the minimum and maximum ages
min_age = df.agg(min("age")).collect()[0][0]
max_age = df.agg(max("age")).collect()[0][0]

# Show the results
print(f"Minimum age: {min_age}")
print(f"Maximum age: {max_age}")
```

### 5. `groupBy` and `agg`:

`groupBy` splits the rows into groups that share the same value in the given column(s), and `agg` computes one aggregate per group, for example the number of rows for each city.

```python
from pyspark.sql.functions import count

# Group by the "city" column and count occurrences
city_counts = df.groupBy("city").agg(count("*").alias("count"))

# Show the result
city_counts.show()
```


### 6. `groupBy` with Multiple Aggregations:

You can compute several aggregations per group, on the same or on different columns, by passing multiple expressions to a single `agg` call.

```python
from pyspark.sql.functions import sum, avg, max

# Group by "city" and calculate sum, average, and maximum age for each city
result_grouped = df.groupBy("city").agg(
    sum("age").alias("total_age"),
    avg("age").alias("average_age"),
    max("age").alias("max_age")
)

# Show the result
result_grouped.show()
```

### 7. `groupBy` with Pivot:

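`pivot` turns the distinct values of one column into separate output columns and fills each cell with an aggregate computed for that group and value. Combinations that have no matching rows are filled with null.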

```python
from pyspark.sql.functions import avg

# One row per name, one column per distinct city, each cell holding the average age
result_pivot = df.groupBy("name").pivot("city").agg(avg("age"))

# Show the result
result_pivot.show()
```

### 8. `approxQuantile`:

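`approxQuantile` is a DataFrame method (not a function in `pyspark.sql.functions`) that approximates the quantiles of a numerical column. Its arguments are the column name, a list of probabilities between 0 and 1, and a relative error; a relative error of 0 yields exact quantiles but can be expensive on large data.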

```python
# approxQuantile is a method on the DataFrame, so no import is needed

# Calculate approximate quantiles for the "age" column
# (the third argument, 0.1, is the allowed relative error)
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.1)

# Show the result
print("Approximate Quantiles:", quantiles)
```

### 9. `corr`:

The `corr` function calculates the Pearson correlation coefficient between two numerical columns.

```python
from pyspark.sql.functions import corr

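# The sample DataFrame has a single numeric column, so this example builds a
# second one; the "other_numeric_column" values are made up for illustration
df_numeric = spark.createDataFrame(
    [(28, 70.0), (35, 85.0), (22, 60.0)],
    ["age", "other_numeric_column"],
)

# Calculate the correlation between "age" and "other_numeric_column"
correlation = df_numeric.agg(corr("age", "other_numeric_column")).collect()[0][0]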

# Show the result
print("Correlation:", correlation)
```

### 10. `collect_list` and `collect_set`:

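`collect_list` and `collect_set` gather the values of each group into an array column. `collect_list` keeps duplicates, `collect_set` removes them, and neither guarantees the order of the elements.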

```python
from pyspark.sql.functions import collect_list, collect_set

# Collect a list of names for each city
result_list = df.groupBy("city").agg(collect_list("name").alias("names_list"))

# Collect a set of unique names for each city
result_set = df.groupBy("city").agg(collect_set("name").alias("names_set"))

# Show the results
result_list.show()
result_set.show()
```

### 11. `percentile`:

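The `percentile` function returns the exact percentile(s) of a numerical column for a given percentage or list of percentages. It is available in `pyspark.sql.functions` from Spark 3.5 onward; on earlier versions the import fails.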

```python
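from pyspark.sql.functions import percentile  # requires Spark 3.5 or later

# Calculate percentiles for the "age" column; the single result cell
# holds a list with one value per requested percentage
percentiles = df.agg(percentile("age", [0.25, 0.5, 0.75])).collect()[0][0]

# Show the result
print("Percentiles:", percentiles)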
```

### 12. `first` and `last`:

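The `first` and `last` functions return the first or last value encountered in each group. Without an explicit ordering, the result depends on the order in which rows arrive after shuffling, so it is not deterministic.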

```python
from pyspark.sql.functions import first, last

# Get the first and the last "name" value encountered in each city group
result_first_last = df.groupBy("city").agg(
    first("name").alias("first_name"),
    last("name").alias("last_name")
)

# Show the result
result_first_last.show()
```

### 13. `pivot` with Aggregation:

`pivot` always needs an aggregation to fill the new columns. This example uses `sum` instead of `avg`, so each cell holds the total age for a given name and city.

```python
from pyspark.sql.functions import sum

# Pivot on the "city" column and calculate the sum of ages for each name and city
result_pivot_aggregate = df.groupBy("name").pivot("city").agg(sum("age"))

# Show the result
result_pivot_aggregate.show()
```

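### 14. `Window` functions:

A window specification (`pyspark.sql.window.Window`) lets an aggregation function such as `sum` be evaluated with `.over(...)`, producing one value per input row rather than one per group. A typical use is a running total. This is distinct from `pyspark.sql.functions.window`, which buckets rows into time windows.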

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

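# Define a window specification ordered by name. Without partitionBy, Spark
# moves all rows into a single partition, which is fine only for small data.
window_spec = Window.orderBy("name")

# Calculate the running total of ages in name order: with an ordered window,
# the default frame runs from the start of the window up to the current row
result_running_total = df.withColumn("running_total", sum("age").over(window_spec))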

# Show the result
result_running_total.show()
```

### 15. Custom Aggregation:

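A PySpark `udf` operates row by row, so it cannot act as an aggregate function by itself. A common workaround is to collect each group's values into an array with `collect_list` inside `agg` and pass that array to a user-defined function (UDF).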

```python
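import builtins

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import LongType

# Define a custom aggregation function (e.g., sum of squared ages).
# builtins.sum is used explicitly because the earlier
# `from pyspark.sql.functions import sum` shadows Python's built-in sum().
@udf(LongType())
def sum_of_squares(values):
    return builtins.sum(x**2 for x in values)

# Collect each city's ages into an array, then apply the UDF to that array
result_custom_aggregation = df.groupBy("city").agg(
    sum_of_squares(collect_list("age")).alias("sum_of_squares")
)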

# Show the result
result_custom_aggregation.show()
```

