# PySpark DataFrames: Advanced Operations

## Overview
This notebook covers advanced DataFrame operations: window functions, pivoting and unpivoting, complex data types (arrays, structs, maps), JSON parsing, UDFs, and string, date, and conditional expressions.

## Learning Objectives
- Master window functions for analytics
- Perform pivoting and unpivoting
- Use advanced column operations
- Handle complex data types (arrays, structs, maps)
- Apply User Defined Functions (UDFs)

---

## 1. Window Functions

Window functions compute a value for each row from a set of related rows (the window) without collapsing those rows into one, unlike a groupBy aggregation.

In [None]:
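from pyspark.sql import SparkSession, Window
# Note: the wildcard import shadows Python built-ins such as sum, min, max, and round
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Reuse the active session (predefined as `spark` on Databricks) or create one
spark = SparkSession.builder.getOrCreate()
# display() is a Databricks notebook helper; outside Databricks use df.show() instead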

# Sample data
sales_data = [
    ("2024-01-01", "Electronics", "Laptop", 1200, 2),
    ("2024-01-02", "Electronics", "Phone", 800, 5),
    ("2024-01-03", "Clothing", "Shirt", 50, 10),
    ("2024-01-04", "Electronics", "Tablet", 400, 3),
    ("2024-01-05", "Clothing", "Pants", 80, 8),
    ("2024-01-06", "Electronics", "Laptop", 1200, 1),
    ("2024-01-07", "Clothing", "Jacket", 150, 4)
]

df = spark.createDataFrame(
    sales_data,
    ["date", "category", "product", "price", "quantity"]
)

df = df.withColumn("date", to_date(col("date")))
df = df.withColumn("revenue", col("price") * col("quantity"))

display(df)

### ROW_NUMBER, RANK, DENSE_RANK

In [None]:
# Define window specification
window_spec = Window.partitionBy("category").orderBy(col("revenue").desc())

# Apply ranking functions
df_ranked = df.select(
    "date",
    "category",
    "product",
    "revenue",
    row_number().over(window_spec).alias("row_num"),
    rank().over(window_spec).alias("rank"),
    dense_rank().over(window_spec).alias("dense_rank")
)

display(df_ranked.orderBy("category", "row_num"))
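
The three ranking functions differ only in how they treat ties. The sketch below is a plain-Python model of that behaviour, not Spark code. The name `ranking_model` and the tied revenue values are hypothetical, chosen so that the tie falls in the middle of the list.

```python
def ranking_model(values_desc):
    """Model row_number, rank and dense_rank for values already sorted descending."""
    rows = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(values_desc, start=1):
        if value != previous:
            rank = position   # rank jumps to the row position after a tie
            dense += 1        # dense_rank only ever grows by one
            previous = value
        rows.append((value, position, rank, dense))
    return rows


# Hypothetical revenues with a tie at 2400
for row in ranking_model([4000, 2400, 2400, 1200]):
    print(row)  # (revenue, row_number, rank, dense_rank)
```

In the sample data the two 1200 revenues in Electronics tie, so rank and dense_rank agree there and only row_number separates them. Spark does not guarantee which tied row receives the lower row_number unless the ordering is made unique, for example by adding date to orderBy.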

### LAG and LEAD Functions

In [None]:
# Window by category, ordered by date
date_window = Window.partitionBy("category").orderBy("date")

df_lag_lead = df.select(
    "date",
    "category",
    "product",
    "revenue",
    lag("revenue", 1).over(date_window).alias("prev_revenue"),
    lead("revenue", 1).over(date_window).alias("next_revenue"),
    (col("revenue") - lag("revenue", 1).over(date_window)).alias("revenue_change")
)

display(df_lag_lead)
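
The first row of each partition has no previous row and the last has no next row, so lag and lead return null there, and any arithmetic on that null is null as well. A minimal plain-Python model of this, using the Clothing revenues from the sample data (the helper names `lag_model` and `lead_model` are hypothetical):

```python
def lag_model(values, offset=1):
    """Value `offset` rows earlier in the partition, or None if there is none."""
    return [values[i - offset] if i >= offset else None for i in range(len(values))]


def lead_model(values, offset=1):
    """Value `offset` rows later in the partition, or None if there is none."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


# Clothing revenues from the sample data, ordered by date
clothing = [500, 640, 600]
prev_revenue = lag_model(clothing)
next_revenue = lead_model(clothing)
# Subtracting a missing previous value stays None, mirroring SQL NULL arithmetic
revenue_change = [None if p is None else r - p for r, p in zip(clothing, prev_revenue)]

print(prev_revenue)    # [None, 500, 640]
print(next_revenue)    # [640, 600, None]
print(revenue_change)  # [None, 140, -40]
```

In Spark, lag and lead also accept a third argument that supplies a default in place of null, for example lag("revenue", 1, 0).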

### Running Totals and Moving Averages

In [None]:
# Running total
running_total_window = Window.partitionBy("category").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Moving average over the current row and the 2 preceding rows
# (a 3-row frame, not 3 calendar days: the sample dates have gaps)
moving_avg_window = Window.partitionBy("category").orderBy("date") \
    .rowsBetween(-2, Window.currentRow)

df_windowed = df.select(
    "date",
    "category",
    "product",
    "revenue",
    sum("revenue").over(running_total_window).alias("running_total"),
    avg("revenue").over(moving_avg_window).alias("moving_avg_3day"),
    count("*").over(running_total_window).alias("cumulative_count")
)

display(df_windowed.orderBy("category", "date"))
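
The frame rowsBetween(-2, Window.currentRow) counts rows, not calendar days, and the two only coincide when every day has exactly one row. The plain-Python sketch below models both interpretations on the Electronics revenues from the sample data; the helper names are hypothetical and this is an illustration of the frame semantics, not Spark code.

```python
import datetime as dt
from itertools import accumulate
from statistics import fmean

# Electronics revenues from the sample data, ordered by date
electronics = [
    (dt.date(2024, 1, 1), 2400),
    (dt.date(2024, 1, 2), 4000),
    (dt.date(2024, 1, 4), 1200),
    (dt.date(2024, 1, 6), 1200),
]

# unboundedPreceding .. currentRow: a cumulative sum
running_total = list(accumulate(revenue for _, revenue in electronics))


def rows_frame_avg(rows, preceding=2):
    """Average over the current row and up to `preceding` earlier rows."""
    result = []
    for i in range(len(rows)):
        start = i - preceding if i >= preceding else 0
        result.append(fmean([revenue for _, revenue in rows[start:i + 1]]))
    return result


def days_frame_avg(rows, days=2):
    """Average over rows dated within `days` calendar days before the current row."""
    result = []
    for current, _ in rows:
        earliest = current - dt.timedelta(days=days)
        result.append(fmean([rev for day, rev in rows if earliest <= day <= current]))
    return result


print(running_total)                # [2400, 6400, 7600, 8800]
print(rows_frame_avg(electronics))  # last two values each average three rows
print(days_frame_avg(electronics))  # [2400.0, 3200.0, 2600.0, 1200.0]
```

The early rows of a partition average fewer values because the frame is cut off at the partition start. In Spark, a calendar-based frame is typically expressed with rangeBetween over a numeric ordering column instead of rowsBetween.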

### NTILE for Quartiles/Percentiles

In [None]:
# Divide into quartiles by revenue
# No partitionBy: Spark moves all rows to a single partition (fine for small data only)
quartile_window = Window.orderBy(col("revenue").desc())

df_quartiles = df.select(
    "product",
    "revenue",
    ntile(4).over(quartile_window).alias("quartile"),
    percent_rank().over(quartile_window).alias("percent_rank")
)

display(df_quartiles)
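
With seven rows and four buckets the quartiles cannot be equal in size. The following plain-Python sketch models how NTILE distributes the remainder and how percent_rank is derived from rank; the function names are hypothetical and the code only illustrates the arithmetic.

```python
def ntile_model(n_rows, buckets):
    """Bucket number for each of n_rows ordered rows, as NTILE assigns them."""
    base, extra = divmod(n_rows, buckets)
    assignment = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)  # earlier buckets absorb the remainder
        assignment.extend([bucket] * size)
    return assignment


def percent_rank_model(ranks):
    """percent_rank = (rank - 1) / (rows - 1); 0.0 when the window has a single row."""
    n = len(ranks)
    return [0.0 if n == 1 else (r - 1) / (n - 1) for r in ranks]


print(ntile_model(7, 4))  # [1, 1, 2, 2, 3, 3, 4]
# rank() values for the seven sample revenues (the two 1200s tie at rank 3)
print(percent_rank_model([1, 2, 3, 3, 5, 6, 7]))
```

Tied rows share a percent_rank but can still land in different ntile buckets, because ntile depends only on row position.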

## 2. Pivot and Unpivot Operations

### Pivot - Convert Rows to Columns

In [None]:
# Pivot: Category revenue by date
df_pivot = df.groupBy("date").pivot("category").sum("revenue")

display(df_pivot.orderBy("date"))

# Pivot with explicit values (more efficient: Spark skips the extra job
# that would otherwise collect the distinct categories)
df_pivot_opt = df.groupBy("date") \
    .pivot("category", ["Electronics", "Clothing"]) \
    .agg(sum("revenue").alias("revenue"))

display(df_pivot_opt.orderBy("date"))

### Unpivot - Convert Columns to Rows

In [None]:
# Create a pivoted dataframe first
pivoted_df = df.groupBy("date").pivot("category").sum("revenue")

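# Unpivot using stack (Spark 3.4+ also offers DataFrame.unpivot);
# date/category pairs with no sales come back with a null revenue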
df_unpivot = pivoted_df.select(
    "date",
    expr("stack(2, 'Electronics', Electronics, 'Clothing', Clothing) as (category, revenue)")
)

display(df_unpivot.orderBy("date", "category"))
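
Pivoting and then unpivoting is not a perfect round trip: every date gets a row for every category, so combinations that never occurred reappear with a null revenue. A plain-Python model of this on the sample data (a sketch of the semantics, not Spark code):

```python
# (date, category, revenue) rows from the sample data
rows = [
    ("2024-01-01", "Electronics", 2400),
    ("2024-01-02", "Electronics", 4000),
    ("2024-01-03", "Clothing", 500),
    ("2024-01-04", "Electronics", 1200),
    ("2024-01-05", "Clothing", 640),
    ("2024-01-06", "Electronics", 1200),
    ("2024-01-07", "Clothing", 600),
]
categories = ["Electronics", "Clothing"]

# Pivot: one entry per date, one slot per category, summing revenue
pivoted = {}
for day, category, revenue in rows:
    slots = pivoted.setdefault(day, dict.fromkeys(categories))  # slots start as None
    slots[category] = (slots[category] or 0) + revenue

# Unpivot (what stack does): emit every category for every date
unpivoted = [(day, category, slots[category])
             for day, slots in sorted(pivoted.items())
             for category in categories]

# Dropping the None rows recovers the original long format
non_null = [row for row in unpivoted if row[2] is not None]

print(len(unpivoted), len(non_null))  # 14 7
```

In Spark the corresponding cleanup would be a filter on revenue being non-null after the stack select.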

## 3. Complex Data Types

### Working with Arrays

In [None]:
# Create array columns
df_arrays = df.select(
    "product",
    "category",
    array("price", "quantity").alias("metrics"),
    split(col("product"), "").alias("product_chars")
)

display(df_arrays)

# Array operations
df_array_ops = df_arrays.select(
    "product",
    "metrics",
    size("metrics").alias("array_size"),
    array_contains("metrics", 1200).alias("has_1200"),
    element_at("metrics", 1).alias("first_element"),  # element_at uses 1-based indexes
    sort_array("metrics").alias("sorted_metrics")
)

display(df_array_ops)

### Explode Arrays

In [None]:
# Explode array into rows
df_exploded = df_arrays.select(
    "product",
    "category",
    explode("metrics").alias("metric_value")
)

display(df_exploded)

# Explode with position
df_posexplode = df_arrays.select(
    "product",
    posexplode("metrics").alias("pos", "value")
)

display(df_posexplode)

### Working with Structs

In [None]:
# Create struct column
df_struct = df.select(
    "product",
    struct(
        col("price").alias("unit_price"),
        col("quantity"),
        col("revenue")
    ).alias("sale_info")
)

display(df_struct)

# Access struct fields
df_struct_access = df_struct.select(
    "product",
    "sale_info",
    col("sale_info.unit_price").alias("price"),
    col("sale_info.quantity").alias("qty")
)

display(df_struct_access)

### Working with Maps

In [None]:
# Create map column
df_map = df.select(
    "product",
    create_map(
        lit("price"), col("price").cast("double"),  # cast so both map values share one type
        lit("quantity"), col("quantity").cast("double")
    ).alias("attributes")
)

display(df_map)

# Access map values
df_map_access = df_map.select(
    "product",
    col("attributes")["price"].alias("price_from_map"),
    map_keys("attributes").alias("keys"),
    map_values("attributes").alias("values")
)

display(df_map_access)

## 4. JSON Operations

In [None]:
# Sample JSON data
json_data = [
    (1, '{"name":"Alice","age":25,"city":"NY"}'),
    (2, '{"name":"Bob","age":30,"city":"LA"}'),
    (3, '{"name":"Charlie","age":35,"city":"SF"}')
]

df_json = spark.createDataFrame(json_data, ["id", "json_str"])

# Parse JSON
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("city", StringType())
])

df_parsed = df_json.select(
    "id",
    from_json(col("json_str"), schema).alias("data")
).select(
    "id",
    "data.*"
)

display(df_parsed)

# Convert to JSON
df_to_json = df_parsed.select(
    "id",
    to_json(struct("name", "age", "city")).alias("json_output")
)

display(df_to_json)

## 5. User Defined Functions (UDFs)

In [None]:
from pyspark.sql.functions import udf

# Simple UDF
def categorize_price(price):
    if price is None:  # Python UDFs receive SQL NULL as None
        return None
    if price < 100:
        return "Low"
    elif price < 500:
        return "Medium"
    else:
        return "High"

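# Wrap the function as a UDF for the DataFrame API
# (Python UDFs run row by row outside the JVM; prefer built-in functions when one exists)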
categorize_udf = udf(categorize_price, StringType())

# Use UDF
df_with_udf = df.select(
    "product",
    "price",
    categorize_udf(col("price")).alias("price_category")
)

display(df_with_udf)
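
Because a UDF wraps an ordinary Python function, its logic can be checked without Spark before it is applied to a DataFrame. The sketch below repeats the thresholds from the cell above under the hypothetical name `categorize_price_safe`; the guard for missing prices is an assumption added for illustration.

```python
def categorize_price_safe(price):
    """Same thresholds as the UDF above, plus a guard for missing prices (an assumption)."""
    if price is None:
        return None
    if price < 100:
        return "Low"
    if price < 500:
        return "Medium"
    return "High"


# Boundary values fall into the upper bucket because the comparisons are strict
for price in (99, 100, 499, 500, None):
    print(price, categorize_price_safe(price))
```

Checking the boundaries 100 and 500 directly makes it clear which tier they belong to, which is harder to see from a displayed DataFrame.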

### Pandas UDF (More Efficient)

In [None]:
from pyspark.sql.functions import pandas_udf
import pandas as pd

# Series-to-Series pandas UDF: receives whole batches as pandas Series (requires PyArrow)
@pandas_udf(DoubleType())
def calculate_discount(price: pd.Series) -> pd.Series:
    return price * 0.9  # 10% discount

df_discount = df.select(
    "product",
    "price",
    calculate_discount(col("price")).alias("discounted_price")
)

display(df_discount)

## 6. Advanced String Operations

In [None]:
# String operations
text_df = spark.createDataFrame(
    [("  Hello World  ",), ("PySpark Tutorial",), ("Data Engineering",)],
    ["text"]
)

df_strings = text_df.select(
    "text",
    upper("text").alias("upper"),
    lower("text").alias("lower"),
    trim("text").alias("trimmed"),
    ltrim("text").alias("ltrimmed"),
    rtrim("text").alias("rtrimmed"),
    length("text").alias("length"),
    substring("text", 1, 5).alias("first_5_chars"),
    regexp_replace("text", "\\s+", "_").alias("replace_spaces"),
    regexp_extract("text", "(\\w+)", 1).alias("first_word")
)

display(df_strings)
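
Two details in the cell above are easy to miss: substring positions in Spark start at 1, and regexp_replace leaves leading and trailing underscores unless the text is trimmed first. The sketch below reproduces the first sample string with Python's `re` module, which behaves like Spark's Java regex engine for these simple patterns (an assumption that does not hold for every pattern).

```python
import re

text = "  Hello World  "

first_5_chars = text[0:5]                   # substring(text, 1, 5): Spark positions are 1-based
replace_spaces = re.sub(r"\s+", "_", text)  # each run of whitespace becomes one underscore
first_word = re.search(r"(\w+)", text).group(1)
trimmed_then_replaced = re.sub(r"\s+", "_", text.strip())

print(repr(first_5_chars))          # '  Hel'
print(repr(replace_spaces))         # '_Hello_World_'
print(repr(first_word))             # 'Hello'
print(repr(trimmed_then_replaced))  # 'Hello_World'
```

One difference worth noting: when nothing matches, Spark's regexp_extract returns an empty string, whereas `re.search` returns None.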

## 7. Date and Time Operations

In [None]:
# Date operations
df_dates = df.select(
    "date",
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day"),
    dayofweek("date").alias("day_of_week"),  # 1 = Sunday ... 7 = Saturday
    dayofyear("date").alias("day_of_year"),
    weekofyear("date").alias("week_of_year"),
    quarter("date").alias("quarter"),
    date_format("date", "yyyy-MM").alias("year_month"),
    date_add("date", 7).alias("plus_7_days"),
    date_sub("date", 7).alias("minus_7_days"),
    datediff(current_date(), "date").alias("days_since")
)

display(df_dates)
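
Spark's dayofweek numbers days from 1 (Sunday) to 7 (Saturday), while weekofyear follows ISO 8601 weeks that start on Monday. The sketch below cross-checks both conventions for the first and last sample dates using Python's datetime module; `spark_dayofweek` is a hypothetical helper, not a Spark function.

```python
import datetime as dt


def spark_dayofweek(day):
    """Map a date to Spark's dayofweek numbering: 1 = Sunday ... 7 = Saturday."""
    return day.isoweekday() % 7 + 1  # isoweekday(): 1 = Monday ... 7 = Sunday


first = dt.date(2024, 1, 1)  # a Monday
last = dt.date(2024, 1, 7)   # a Sunday

print(spark_dayofweek(first), spark_dayofweek(last))  # 2 1
print(first.isocalendar()[1], last.isocalendar()[1])  # 1 1 (ISO week number)
print((first + dt.timedelta(days=7)).isoformat())     # 2024-01-08, like date_add(date, 7)
```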

## 8. Conditional Expressions

In [None]:
# when/otherwise
df_conditional = df.select(
    "product",
    "revenue",
    when(col("revenue") > 2000, "High")
        .when(col("revenue") > 500, "Medium")
        .otherwise("Low")
        .alias("revenue_tier"),
    
    # Multiple conditions
    when((col("revenue") > 1000) & (col("category") == "Electronics"), "Premium Electronics")
        .when(col("revenue") > 1000, "Premium Other")
        .otherwise("Standard")
        .alias("segment")
)

display(df_conditional)
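
A chain of when clauses is evaluated top to bottom and the first matching branch wins, so the order of the conditions matters. The plain-Python sketch below models that rule with the revenue tiers from the cell above; `first_match` is a hypothetical helper, not part of PySpark.

```python
def first_match(value, branches, default):
    """Return the label of the first branch whose predicate holds, like a when chain."""
    for predicate, label in branches:
        if predicate(value):
            return label
    return default


tiers = [(lambda r: r > 2000, "High"), (lambda r: r > 500, "Medium")]
reordered = tiers[::-1]  # broader condition first: "High" becomes unreachable

print(first_match(4000, tiers, "Low"))      # High
print(first_match(4000, reordered, "Low"))  # Medium
print(first_match(500, tiers, "Low"))       # Low
```

Putting the most specific condition first, as the segment column does with the Electronics check, keeps every branch reachable.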

## Practice Exercises

### Exercise 1: Top Products per Category
Find the top 2 products by revenue in each category using window functions.

In [None]:
# Your solution here
# TODO: Use row_number() with window to get top 2 per category

### Exercise 2: Moving Average
Calculate a 3-day moving average of revenue for each category.

In [None]:
# Your solution here
# TODO: Use window with rowsBetween for moving average

## Summary

In this notebook, you learned:

- ✅ Window functions (row_number, rank, lag, lead)
- ✅ Running totals and moving averages
- ✅ Pivot and unpivot operations
- ✅ Complex data types (arrays, structs, maps)
- ✅ JSON operations
- ✅ User Defined Functions (UDFs)
- ✅ Advanced string and date operations
- ✅ Conditional expressions

## Next Steps

1. Complete the practice exercises
2. Explore more PySpark functions
3. Learn about joins and performance optimization

## Additional Resources

- [PySpark Functions API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html)
- [Spark By Examples](https://sparkbyexamples.com/pyspark-tutorial/)