# Lab 7: User-Defined Functions (UDFs) - Solutions

**Objective**: Master creating and optimizing User-Defined Functions in Spark for custom business logic.

**Learning Outcomes**:
- Create and register UDFs for custom transformations
- Understand UDF performance implications and optimization
- Implement vectorized UDFs with pandas
- Apply UDFs to real-world business scenarios
- Debug and troubleshoot UDF issues

**Estimated Time**: 50 minutes

---

In [None]:
from pyspark.sql import SparkSession
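# Alias Spark's sum/avg/count so they do not shadow Python's built-ins
# (the text-analysis UDF later in this lab relies on the built-in sum)
from pyspark.sql.functions import udf, pandas_udf, col, when, regexp_extract, split, lit, datediff, current_date
from pyspark.sql.functions import sum as spark_sum, avg as spark_avg, count as spark_count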
from pyspark.sql.types import *
import pandas as pd
import numpy as np
import time
import re
import json
import os

# Fix for Codespace pandas UDF support - override environment variables
print("🔧 Configuring Spark for pandas UDF support in Codespace")
system_python = "/usr/local/python/3.11.13/bin/python3"
os.environ['PYSPARK_PYTHON'] = system_python
os.environ['PYSPARK_DRIVER_PYTHON'] = system_python

# Create Spark session with pandas UDF support
spark = SparkSession.builder \
    .appName("Lab7-UDFs-Solutions") \
    .config("spark.pyspark.python", system_python) \
    .config("spark.pyspark.driver.python", system_python) \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "1000") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "1g") \
    .config("spark.master", "local[2]") \
    .config("spark.sql.adaptive.logLevel", "ERROR") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")  # Suppress warnings for cleaner output

print(f"🚀 UDF Lab Solutions - Spark {spark.version} with Arrow: {spark.conf.get('spark.sql.execution.arrow.pyspark.enabled')}")
print(f"✅ Pandas available: {pd.__version__}")
print(f"✅ Python configured: {spark.conf.get('spark.pyspark.python')}")

# Enhanced Spark UI URL display
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI: {ui_url}")
print("💡 In GitHub Codespaces: Check the 'PORTS' tab below for forwarded port 4040 to access Spark UI")

## Part 1: Basic UDF Creation and Usage

In [None]:
# Load datasets
customers_df = spark.read.csv("../Datasets/customers.csv", header=True, inferSchema=True)
transactions_df = spark.read.csv("../Datasets/customer_transactions.csv", header=True, inferSchema=True)

print("📊 Datasets loaded for UDF examples")
print(f"  - Customers: {customers_df.count():,} records")
print(f"  - Transactions: {transactions_df.count():,} records")

# Simple UDF examples
def categorize_age(age):
    """Categorize customers by age group"""
    if age is None:
        return "Unknown"
    elif age < 25:
        return "Gen Z"
    elif age < 40:
        return "Millennial"
    elif age < 55:
        return "Gen X"
    else:
        return "Boomer"

# Register as UDF
age_category_udf = udf(categorize_age, StringType())

# Apply UDF to DataFrame
customers_with_generation = customers_df.withColumn(
    "generation", 
    age_category_udf(col("age"))
)

print("👥 Customer generations:")
customers_with_generation.groupBy("generation").count().orderBy("count", ascending=False).show()
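
Because `udf()` only wraps an ordinary Python function, the age cutoffs can be checked without a Spark session. The sketch below repeats `categorize_age` from the cell above so it runs standalone, then probes each boundary. The sample ages are illustrative values chosen for this check, not rows from `customers.csv`.

```python
def categorize_age(age):
    """Same logic as the lab's categorize_age, repeated so this runs standalone."""
    if age is None:
        return "Unknown"
    elif age < 25:
        return "Gen Z"
    elif age < 40:
        return "Millennial"
    elif age < 55:
        return "Gen X"
    else:
        return "Boomer"

# Illustrative ages sitting on each cutoff (assumed values, not lab data)
boundary_cases = {
    None: "Unknown",
    24: "Gen Z",
    25: "Millennial",
    39: "Millennial",
    40: "Gen X",
    54: "Gen X",
    55: "Boomer",
}

for age, expected in boundary_cases.items():
    actual = categorize_age(age)
    assert actual == expected, (age, actual)

print("All boundary cases match")
```

Testing the plain function first keeps failures readable; an exception raised inside a UDF on an executor surfaces as a much longer Spark stack trace.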

**Exercise 1.1**: Create business logic UDFs for data enrichment.

In [None]:
# Solution: Business Logic UDF Challenge

# UDF 1: Customer Risk Scoring
def calculate_risk_score(age, total_spent, transaction_count, state):
    """Calculate customer risk score based on multiple factors"""
    if age is None or total_spent is None or transaction_count is None:
        return "Unknown"
    
    risk_score = 0
    
    # Age factor - younger and very old customers are higher risk
    if age < 25:
        risk_score += 5
    elif age > 65:
        risk_score += 3
    
    # Spending factor - very high or very low average spending is risky
    avg_transaction = total_spent / max(transaction_count, 1)
    if avg_transaction > 1000 or avg_transaction < 10:
        risk_score += 4
    
    # Frequency factor - very few transactions is risky
    if transaction_count < 3:
        risk_score += 6
    elif transaction_count > 50:
        risk_score += 2  # Very high frequency also suspicious
    
    # Geographic factor (example high-risk states)
    high_risk_states = ['FL', 'NV', 'CA']  # Example states with higher fraud rates
    if state in high_risk_states:
        risk_score += 2
    
    # Return risk category
    if risk_score >= 15:
        return "High"
    elif risk_score >= 8:
        return "Medium"
    else:
        return "Low"

# Register risk scoring UDF
risk_score_udf = udf(calculate_risk_score, StringType())

# UDF 2: Email Domain Classification
def classify_email_domain(email):
    """Classify email domains into categories"""
    if email is None or '@' not in email:
        return "Unknown"
    
    # Extract domain from email
    domain = email.split('@')[1].lower() if '@' in email else ""
    
    # Classify domains
    business_domains = ['company.com', 'corp.com', 'business.com', 'enterprise.com']
    consumer_domains = ['gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 'aol.com']
    
    if domain in business_domains or 'corp' in domain or 'company' in domain:
        return "Business"
    elif domain in consumer_domains:
        return "Consumer"
    elif domain.endswith('.edu') or '.edu.' in domain:
        # Match the .edu label itself; a bare substring test would also
        # catch unrelated domains such as "education.com"
        return "Educational"
    elif domain.endswith('.gov'):
        return "Government"
    else:
        return "Other"

# Register email classification UDF
email_domain_udf = udf(classify_email_domain, StringType())

# UDF 3: Transaction Anomaly Detection
def detect_anomaly(amount, customer_avg, customer_std):
    """Detect if transaction is anomalous for customer"""
    if amount is None or customer_avg is None or customer_std is None:
        return False
    
    if customer_std == 0:
        # If no variation in spending, flag very different amounts
        return abs(amount - customer_avg) > customer_avg * 0.5
    
    # Calculate z-score
    z_score = abs(amount - customer_avg) / customer_std
    
    # Return True if anomaly (|z-score| > 2)
    return z_score > 2.0

# Register anomaly detection UDF
anomaly_udf = udf(detect_anomaly, BooleanType())

# Apply all UDFs to create enriched dataset
print("🔧 Applying business logic UDFs...")

# First, prepare customer statistics
from pyspark.sql.functions import sum as spark_sum, avg as spark_avg, count as spark_count, stddev

customer_stats = transactions_df.groupBy("customer_id") \
    .agg(
        spark_sum("amount").alias("total_spent"),
        spark_count("*").alias("transaction_count"),
        spark_avg("amount").alias("avg_amount"),
        stddev("amount").alias("std_amount")
    )

# Apply UDFs to customers
enriched_customers = customers_df \
    .join(customer_stats, "customer_id", "left") \
    .withColumn("risk_category",
        risk_score_udf(col("age"), col("total_spent"), col("transaction_count"), col("state"))
    ) \
    .withColumn("email_domain_type",
        email_domain_udf(col("email"))
    )

# Apply anomaly detection to transactions
enriched_transactions = transactions_df \
    .join(customer_stats, "customer_id", "left") \
    .withColumn("is_anomaly",
        anomaly_udf(col("amount"), col("avg_amount"), col("std_amount"))
    )

print("📊 UDF Results:")

print("\nRisk Distribution:")
enriched_customers.groupBy("risk_category").count().orderBy("count", ascending=False).show()

print("\nEmail Domain Classification:")
enriched_customers.groupBy("email_domain_type").count().orderBy("count", ascending=False).show()

print("\nAnomaly Detection:")
anomaly_stats = enriched_transactions.groupBy("is_anomaly").count()
anomaly_stats.show()

# Show some anomalous transactions
print("Sample anomalous transactions:")
enriched_transactions.filter(col("is_anomaly")) \
    .select("customer_id", "amount", "avg_amount", "std_amount") \
    .show(5)

# Validation
risk_categories = enriched_customers.select("risk_category").distinct().count()
email_types = enriched_customers.select("email_domain_type").distinct().count()
anomaly_count = enriched_transactions.filter(col("is_anomaly")).count()

assert risk_categories > 0, "Should have risk categories"
assert email_types > 0, "Should have email domain types"
assert anomaly_count >= 0, "Should have anomaly detection results"

print("\n✓ Exercise 1.1 completed!")
print(f"📈 Generated {risk_categories} risk categories, {email_types} email types, {anomaly_count} anomalies")
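
The z-score rule in `detect_anomaly` can be traced by hand on a small purchase history. The sketch below repeats the function and uses `statistics.stdev`, which computes the sample standard deviation, the same statistic Spark's `stddev` returns. The ten amounts are made up for illustration.

```python
from statistics import mean, stdev

def detect_anomaly(amount, customer_avg, customer_std):
    """Same logic as the lab's detect_anomaly, repeated so this runs standalone."""
    if amount is None or customer_avg is None or customer_std is None:
        return False
    if customer_std == 0:
        return abs(amount - customer_avg) > customer_avg * 0.5
    z_score = abs(amount - customer_avg) / customer_std
    return z_score > 2.0

# Hypothetical history: nine ordinary purchases and one large one
history = [20.0, 22.0, 19.0, 21.0, 18.0, 20.0, 23.0, 19.0, 21.0, 100.0]
avg = mean(history)   # 28.3
std = stdev(history)  # sample standard deviation, roughly 25.2

flags = [detect_anomaly(amount, avg, std) for amount in history]
for amount, flag in zip(history, flags):
    z = abs(amount - avg) / std
    print(f"amount={amount:6.1f}  z={z:4.2f}  anomaly={flag}")
```

Only the 100.0 purchase exceeds the threshold. Note that the outlier is included in its own mean and standard deviation, which pulls its z-score down; with very short histories a single extreme value may not reach 2.0 at all.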

## Part 2: Performance Optimization and Pandas UDFs

In [None]:
# Compare regular UDF vs Pandas UDF performance
print("⚡ UDF Performance Comparison")

# Regular UDF for complex calculation
def calculate_loyalty_score_regular(transaction_count, total_spent, days_since_signup, avg_amount):
    """Calculate customer loyalty score (regular UDF)"""
    if transaction_count is None or total_spent is None or days_since_signup is None or avg_amount is None:
        return 0.0
    
    # Complex business logic
    frequency_score = min(transaction_count / 10.0, 1.0) * 30
    volume_score = min(total_spent / 5000.0, 1.0) * 40
    tenure_score = min(days_since_signup / 365.0, 1.0) * 20
    consistency_score = (1.0 - abs(avg_amount - 100) / 100.0) * 10 if avg_amount > 0 else 0
    
    # Clamp with a float literal: a DoubleType UDF that returns the int 0 yields null in Spark
    return max(0.0, frequency_score + volume_score + tenure_score + consistency_score)

# Pandas UDF for vectorized calculation
@pandas_udf(returnType=DoubleType())
def calculate_loyalty_score_pandas(transaction_count: pd.Series, total_spent: pd.Series, 
                                 days_since_signup: pd.Series, avg_amount: pd.Series) -> pd.Series:
    """Calculate customer loyalty score (Pandas UDF - vectorized)"""
    # Handle nulls
    transaction_count = transaction_count.fillna(0)
    total_spent = total_spent.fillna(0)
    days_since_signup = days_since_signup.fillna(0)
    avg_amount = avg_amount.fillna(0)
    
    frequency_score = np.minimum(transaction_count / 10.0, 1.0) * 30
    volume_score = np.minimum(total_spent / 5000.0, 1.0) * 40
    tenure_score = np.minimum(days_since_signup / 365.0, 1.0) * 20
    consistency_score = np.where(avg_amount > 0, 
                                (1.0 - np.abs(avg_amount - 100) / 100.0) * 10, 0)
    
    return np.maximum(0, frequency_score + volume_score + tenure_score + consistency_score)

# Register UDFs
loyalty_regular_udf = udf(calculate_loyalty_score_regular, DoubleType())
# Pandas UDF is already decorated

# Prepare test dataset with customer metrics
from pyspark.sql.functions import datediff, current_date

customer_metrics = transactions_df.join(customers_df, "customer_id") \
    .groupBy("customer_id", "signup_date") \
    .agg(
        spark_count("*").alias("transaction_count"),
        spark_sum("amount").alias("total_spent"),
        spark_avg("amount").alias("avg_amount")
    ) \
    .withColumn("days_since_signup", 
                datediff(current_date(), col("signup_date")))

print(f"üìä Testing with {customer_metrics.count():,} customer records")

# Performance test
print("\n⏱️  Performance Testing:")

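# Test regular UDF
# Note: a bare .count() lets Spark prune the unused UDF column, so the UDF would
# never run; writing to the "noop" sink forces every row to be evaluated.
regular_df = customer_metrics.withColumn(
    "loyalty_score_regular",
    loyalty_regular_udf(col("transaction_count"), col("total_spent"), 
                       col("days_since_signup"), col("avg_amount"))
)
start_time = time.time()
regular_df.write.format("noop").mode("overwrite").save()
regular_time = time.time() - start_time
regular_result = regular_df.count()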

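# Test Pandas UDF
# Note: the "noop" write forces the UDF to run; .count() alone would let Spark
# prune the unused UDF column and skip the work being timed.
pandas_df = customer_metrics.withColumn(
    "loyalty_score_pandas",
    calculate_loyalty_score_pandas(col("transaction_count"), col("total_spent"), 
                                  col("days_since_signup"), col("avg_amount"))
)
start_time = time.time()
pandas_df.write.format("noop").mode("overwrite").save()
pandas_time = time.time() - start_time
pandas_result = pandas_df.count()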

print(f"Regular UDF: {regular_result:,} records in {regular_time:.4f}s")
print(f"Pandas UDF: {pandas_result:,} records in {pandas_time:.4f}s")

if pandas_time > 0:
    speedup = regular_time / pandas_time
    print(f"Performance gain: {speedup:.1f}x faster with Pandas UDF")
else:
    print("Pandas UDF completed too quickly to measure accurately")

# Compare results to ensure correctness
sample_comparison = customer_metrics.withColumn(
    "loyalty_regular", loyalty_regular_udf(col("transaction_count"), col("total_spent"), 
                                          col("days_since_signup"), col("avg_amount"))
).withColumn(
    "loyalty_pandas", calculate_loyalty_score_pandas(col("transaction_count"), col("total_spent"), 
                                                    col("days_since_signup"), col("avg_amount"))
).select("customer_id", "loyalty_regular", "loyalty_pandas")

print("\n🔍 Sample loyalty score comparison:")
sample_comparison.show(5)
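
The loyalty formula is easier to sanity-check with one customer worked through by hand. The sketch below restates the scoring arithmetic as a plain function (the name `loyalty_score` and both sample customers are assumptions made for this illustration) and shows why the final clamp matters: the consistency term goes strongly negative when the average amount is far from 100.

```python
def loyalty_score(transaction_count, total_spent, days_since_signup, avg_amount):
    """Plain-Python restatement of the lab's loyalty formula (illustrative name)."""
    frequency_score = min(transaction_count / 10.0, 1.0) * 30   # capped at 30
    volume_score = min(total_spent / 5000.0, 1.0) * 40          # capped at 40
    tenure_score = min(days_since_signup / 365.0, 1.0) * 20     # capped at 20
    # At most 10, but unbounded below when avg_amount is far from 100
    consistency_score = (1.0 - abs(avg_amount - 100) / 100.0) * 10 if avg_amount > 0 else 0.0
    # Clamp with 0.0 so the result is always a float
    return max(0.0, frequency_score + volume_score + tenure_score + consistency_score)

# Hypothetical customer A: 5 purchases, 2500 spent, two years of tenure, average 150
# 15 (frequency) + 20 (volume) + 20 (tenure) + 5 (consistency) = 60
score_a = loyalty_score(5, 2500.0, 730, 150.0)

# Hypothetical customer B: a single 1200 purchase ten days after signup
# The consistency term is (1 - 11) * 10 = -100, so the sum is negative and clamps to 0.0
score_b = loyalty_score(1, 1200.0, 10, 1200.0)

print(score_a, score_b)
```

Keeping the clamp value a float matters for the regular UDF: when a Python UDF is declared with `DoubleType()` and returns an `int`, Spark does not coerce the value and the column ends up null for that row.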

**Exercise 2.1**: Implement and optimize complex business calculations.

In [None]:
# Solution: Advanced UDF Optimization Challenge

# Challenge 1: Customer Segmentation with Machine Learning Features
print("🧠 Advanced Feature Engineering with UDFs")

# Import required functions with proper aliases
from pyspark.sql.functions import max as spark_max, sum as spark_sum, count as spark_count, datediff, current_date

# Feature 1: RFM Score (Recency, Frequency, Monetary)
@pandas_udf(returnType=StructType([
    StructField("recency_score", IntegerType()),
    StructField("frequency_score", IntegerType()),
    StructField("monetary_score", IntegerType()),
    StructField("rfm_segment", StringType())
]))
def calculate_rfm_scores(days_since_last: pd.Series, frequency: pd.Series, 
                        monetary: pd.Series) -> pd.DataFrame:
    """Calculate RFM scores using pandas vectorization"""
    
    # Handle missing values
    days_since_last = days_since_last.fillna(365)  # Default to 1 year
    frequency = frequency.fillna(1)
    monetary = monetary.fillna(0)
    
    # Note: a scalar pandas UDF sees one Arrow batch at a time, so the
    # percentiles below are computed per batch, not over the whole dataset.
    # np.digitize is used instead of pd.cut because pd.cut raises
    # "Bin edges must be unique" when tied percentiles produce duplicate edges.
    
    # Recency: 1-5 (5 = most recent, i.e., fewest days since last purchase)
    recency_percentiles = np.percentile(days_since_last, [20, 40, 60, 80])
    recency_score = 5 - np.digitize(days_since_last, recency_percentiles, right=True)
    
    # Frequency: 1-5 (5 = most frequent)
    frequency_percentiles = np.percentile(frequency, [20, 40, 60, 80])
    frequency_score = np.digitize(frequency, frequency_percentiles, right=True) + 1
    
    # Monetary: 1-5 (5 = highest value)
    monetary_percentiles = np.percentile(monetary, [20, 40, 60, 80])
    monetary_score = np.digitize(monetary, monetary_percentiles, right=True) + 1
    
    # Create segment labels based on RFM scores
    def create_segment(r, f, m):
        if r >= 4 and f >= 4 and m >= 4:
            return "Champions"
        elif r >= 3 and f >= 3 and m >= 3:
            return "Loyal Customers"
        elif r >= 4 and f <= 2:
            return "New Customers"
        elif r <= 2 and f >= 3:
            return "At Risk"
        elif r <= 2 and f <= 2:
            return "Lost Customers"
        else:
            return "Potential Loyalists"
    
    rfm_segment = pd.Series([create_segment(r, f, m) 
                            for r, f, m in zip(recency_score, frequency_score, monetary_score)])
    
    return pd.DataFrame({
        'recency_score': recency_score,
        'frequency_score': frequency_score,
        'monetary_score': monetary_score,
        'rfm_segment': rfm_segment
    })

# Feature 2: Time-based patterns
@pandas_udf(returnType=StructType([
    StructField("seasonality_index", DoubleType()),
    StructField("trend_direction", StringType()),
    StructField("volatility_score", DoubleType())
]))
def analyze_temporal_patterns(monthly_amounts: pd.Series) -> pd.DataFrame:
    """Analyze customer temporal spending patterns"""
    
    def parse_and_analyze(amounts_str):
        if pd.isna(amounts_str) or amounts_str == '' or amounts_str == 'null':
            return 0.0, 'stable', 0.0
        
        try:
            # Parse comma-separated amounts (simulated monthly data)
            amounts = [float(x) for x in str(amounts_str).split(',') if x.strip()]
            
            if len(amounts) < 3:
                return 0.0, 'stable', 0.0
            
            amounts = np.array(amounts)
            
            # Calculate seasonality (coefficient of variation)
            seasonality = np.std(amounts) / np.mean(amounts) if np.mean(amounts) > 0 else 0
            
            # Calculate trend direction using linear regression slope
            x = np.arange(len(amounts))
            slope = np.polyfit(x, amounts, 1)[0] if len(amounts) > 1 else 0
            
            if slope > 5:
                trend = 'increasing'
            elif slope < -5:
                trend = 'decreasing'
            else:
                trend = 'stable'
            
            # Calculate volatility (normalized standard deviation)
            volatility = seasonality  # Same as seasonality for simplicity
            
            return seasonality, trend, volatility
            
        except (ValueError, TypeError):
            return 0.0, 'stable', 0.0
    
    results = monthly_amounts.apply(parse_and_analyze)
    
    return pd.DataFrame({
        'seasonality_index': [r[0] for r in results],
        'trend_direction': [r[1] for r in results],
        'volatility_score': [r[2] for r in results]
    })

# Challenge 2: Text Processing UDFs
@pandas_udf(returnType=StructType([
    StructField("sentiment_score", DoubleType()),
    StructField("entity_count", IntegerType()),
    StructField("complexity_score", DoubleType())
]))
def analyze_text_features(text_data: pd.Series) -> pd.DataFrame:
    """Extract features from text data (simulated for demo)"""
    
    def analyze_text(text):
        if pd.isna(text) or text == '':
            return 0.0, 0, 0.0
        
        text = str(text)
        
        # Mock sentiment analysis (positive words vs negative words)
        positive_words = ['good', 'great', 'excellent', 'amazing', 'love', 'best']
        negative_words = ['bad', 'terrible', 'awful', 'hate', 'worst', 'horrible']
        
        text_lower = text.lower()
        pos_count = sum(1 for word in positive_words if word in text_lower)
        neg_count = sum(1 for word in negative_words if word in text_lower)
        
        sentiment = (pos_count - neg_count) / max(len(text.split()), 1)
        
        # Mock entity count (capitalized words as potential entities)
        entities = len([word for word in text.split() if word[0].isupper()]) if text else 0
        
        # Text complexity (average word length and sentence structure)
        words = text.split()
        avg_word_length = sum(len(word) for word in words) / len(words) if words else 0
        complexity = avg_word_length / 10.0  # Normalize to 0-1 scale
        
        return sentiment, entities, complexity
    
    results = text_data.apply(analyze_text)
    
    return pd.DataFrame({
        'sentiment_score': [r[0] for r in results],
        'entity_count': [r[1] for r in results],
        'complexity_score': [r[2] for r in results]
    })

# Apply advanced UDFs to datasets
print("🔧 Applying Advanced UDFs...")

# Prepare customer data for RFM analysis
rfm_data = transactions_df.join(customers_df, "customer_id") \
    .groupBy("customer_id") \
    .agg(
        datediff(current_date(), spark_max("transaction_date")).alias("days_since_last"),
        spark_count("*").alias("frequency"),
        spark_sum("amount").alias("monetary")
    )

# Apply RFM UDF
customers_with_rfm = rfm_data.withColumn(
    "rfm_analysis", 
    calculate_rfm_scores(col("days_since_last"), col("frequency"), col("monetary"))
).select(
    "customer_id",
    col("rfm_analysis.recency_score").alias("recency_score"),
    col("rfm_analysis.frequency_score").alias("frequency_score"),
    col("rfm_analysis.monetary_score").alias("monetary_score"),
    col("rfm_analysis.rfm_segment").alias("rfm_segment")
)

print("📊 RFM Segmentation Results:")
customers_with_rfm.groupBy("rfm_segment").count().orderBy("count", ascending=False).show()

# Create sample temporal data for temporal analysis
# Simulate monthly spending patterns
sample_temporal_data = spark.createDataFrame([
    ("CUST_000001", "100,150,120,180,200,190,210"),
    ("CUST_000002", "50,60,55,65,70,80,90"),
    ("CUST_000003", "200,180,160,140,120,100,80")
], ["customer_id", "monthly_amounts"])

temporal_analysis = sample_temporal_data.withColumn(
    "temporal_patterns",
    analyze_temporal_patterns(col("monthly_amounts"))
).select(
    "customer_id",
    col("temporal_patterns.seasonality_index").alias("seasonality"),
    col("temporal_patterns.trend_direction").alias("trend"),
    col("temporal_patterns.volatility_score").alias("volatility")
)

print("\n📈 Temporal Pattern Analysis:")
temporal_analysis.show()

# Performance comparison for complex operations
print("\n⚡ Complex UDF Performance Analysis:")

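# Time the pandas UDF RFM calculation; writing to the "noop" sink forces the
# UDF to run (a bare count() would let Spark prune the unused UDF columns)
start_time = time.time()
customers_with_rfm.write.format("noop").mode("overwrite").save()
rfm_pandas_time = time.time() - start_time
rfm_pandas_count = customers_with_rfm.count()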

print(f"Pandas UDF RFM Analysis: {rfm_pandas_count:,} customers in {rfm_pandas_time:.4f}s")

# Validation
rfm_segments = customers_with_rfm.select("rfm_segment").distinct().count()
temporal_records = temporal_analysis.count()

assert rfm_segments > 0, "Should have RFM segments"
assert temporal_records > 0, "Should have temporal analysis"
assert rfm_pandas_count > 0, "Should have RFM analysis results"

print("\n✓ Exercise 2.1 completed!")
print(f"🧠 Generated {rfm_segments} RFM segments for {rfm_pandas_count:,} customers")
print(f"📊 Analyzed temporal patterns for {temporal_records} sample customers")
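
Quintile scoring is the part of the RFM calculation most likely to fail on real data: when many customers share the same value, several percentile cut points coincide, and `pd.cut` rejects duplicate bin edges unless told to drop them. The sketch below shows a tie-tolerant alternative using only the standard library. The helper name `quintile_scores` and both sample lists are assumptions for illustration; `statistics.quantiles(..., method="inclusive")` stands in for `np.percentile`, and `bisect_left` plays the role of right-inclusive binning.

```python
from bisect import bisect_left
from statistics import quantiles

def quintile_scores(values, higher_is_better=True):
    """Score each value 1-5 by quintile; duplicate cut points are tolerated."""
    # Four cut points at the 20th/40th/60th/80th percentiles
    cuts = quantiles(values, n=5, method="inclusive")
    # bisect_left counts the cut points strictly below v, giving a bin index 0-4
    bins = [bisect_left(cuts, v) for v in values]
    return [b + 1 if higher_is_better else 5 - b for b in bins]

# Hypothetical frequency data with heavy ties: every cut point equals 1
frequency = [1, 1, 1, 1, 1, 1, 1, 1, 1, 9]
print(quintile_scores(frequency))

# Hypothetical recency data: fewer days since the last purchase scores higher
days_since_last = [3, 10, 30, 90, 400]
print(quintile_scores(days_since_last, higher_is_better=False))
```

A separate caveat applies to any percentile logic inside a scalar pandas UDF: the function receives one Arrow batch at a time, so the cut points describe that batch rather than the full customer base. For dataset-wide quintiles, a window function such as `ntile(5)` is usually a better fit.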

## Part 3: UDF Best Practices and Troubleshooting

In [None]:
# UDF best practices and common pitfalls
print("🎯 UDF Best Practices and Troubleshooting")

# 1. Error handling in UDFs
def safe_divide_udf(numerator, denominator):
    """Safe division with error handling"""
    try:
        # Check for nulls first, then for division by zero
        if numerator is None or denominator is None:
            return None
        if denominator == 0:
            return None
        return float(numerator) / float(denominator)
    except (TypeError, ValueError, ZeroDivisionError):
        return None

safe_divide = udf(safe_divide_udf, DoubleType())

# 2. Null handling patterns
def handle_nulls_properly(value, default_value=0):
    """Proper null handling in UDFs"""
    if value is None:
        return default_value
    try:
        return int(value) * 2  # Example transformation
    except (TypeError, ValueError):
        return default_value

null_safe_udf = udf(handle_nulls_properly, IntegerType())

# 3. Performance monitoring UDF
def monitored_udf_function(input_value):
    """UDF with performance monitoring"""
    import time
    start_time = time.time()
    
    # Your business logic here
    result = input_value.upper() if input_value else ""
    
    # Log performance (in production, use proper logging)
    execution_time = time.time() - start_time
    if execution_time > 0.001:  # Log slow operations
        print(f"Slow UDF execution: {execution_time:.4f}s for input: {input_value}")
    
    return result

monitored_udf = udf(monitored_udf_function, StringType())

print("✅ UDF Best Practices:")
print("1. Always handle None/null values")
print("2. Use appropriate error handling")
print("3. Prefer Pandas UDFs for numerical operations")
print("4. Avoid complex object serialization")
print("5. Monitor UDF performance")
print("6. Consider built-in functions first")

# Test safe operations
test_data = spark.createDataFrame([
    (10, 2, "test"),
    (15, 0, "hello"),
    (None, 5, None),
    (20, None, "world")
], ["numerator", "denominator", "text_col"])

safe_operations_result = test_data \
    .withColumn("safe_division", safe_divide(col("numerator"), col("denominator"))) \
    .withColumn("null_safe_transform", null_safe_udf(col("numerator"), lit(0))) \
    .withColumn("monitored_text", monitored_udf(col("text_col")))

print("\n🧪 Safe UDF Operations Test:")
safe_operations_result.show()

# Common alternatives to UDFs
print("\n🚀 UDF Alternatives (Often Better Performance):")

# Instead of UDF for simple conditions
print("❌ UDF approach:")
def classify_amount_udf(amount):
    if amount is None:
        return "unknown"
    elif amount > 1000:
        return "high"
    elif amount > 100:
        return "medium"
    else:
        return "low"

amount_classifier = udf(classify_amount_udf, StringType())

print("✅ Built-in functions approach:")
# Use when/otherwise instead
transactions_classified = transactions_df.withColumn(
    "amount_category_builtin",
    when(col("amount").isNull(), "unknown")
    .when(col("amount") > 1000, "high")
    .when(col("amount") > 100, "medium")
    .otherwise("low")
)

# Performance comparison
print("\n⏱️  Performance: Built-in vs UDF")

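# UDF approach timing
# (the "noop" write forces evaluation; count() alone would let Spark prune the unused column)
udf_df = transactions_df.withColumn("category_udf", amount_classifier(col("amount")))
start_time = time.time()
udf_df.write.format("noop").mode("overwrite").save()
udf_time = time.time() - start_time
udf_count = udf_df.count()

# Built-in approach timing
start_time = time.time()
transactions_classified.write.format("noop").mode("overwrite").save()
builtin_time = time.time() - start_time
builtin_count = transactions_classified.count()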

print(f"UDF approach: {udf_count:,} records in {udf_time:.4f}s")
print(f"Built-in approach: {builtin_count:,} records in {builtin_time:.4f}s")

if builtin_time > 0:
    speedup = udf_time / builtin_time
    print(f"Built-in functions are {speedup:.1f}x faster!")

# Show sample results to verify correctness
print("\n📊 Sample categorization results:")
comparison_sample = transactions_df.withColumn("udf_category", amount_classifier(col("amount"))) \
    .withColumn("builtin_category", 
        when(col("amount").isNull(), "unknown")
        .when(col("amount") > 1000, "high")
        .when(col("amount") > 100, "medium")
        .otherwise("low")
    ).select("amount", "udf_category", "builtin_category")

comparison_sample.show(5)
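
The null-handling rules from the start of this cell can also be verified without Spark. The sketch below restates the safe-division logic as a plain function and runs it over the same four numerator/denominator pairs used in `test_data`, plus one non-numeric input added here as an assumed edge case.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator as a float, or None when that is not possible."""
    # Nulls first, then division by zero
    if numerator is None or denominator is None or denominator == 0:
        return None
    try:
        return float(numerator) / float(denominator)
    except (TypeError, ValueError):
        # Non-numeric input such as "abc"
        return None

# Same (numerator, denominator) pairs as the lab's test_data
rows = [(10, 2), (15, 0), (None, 5), (20, None)]
results = [safe_divide(n, d) for n, d in rows]
print(results)

# Assumed extra case: a string that cannot be parsed as a number
print(safe_divide("abc", 2))
```

Returning `None` rather than raising keeps one bad row from failing the whole Spark job; the null then flows through the DataFrame like any other missing value.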

**Exercise 3.1**: Implement robust, production-ready UDFs.

In [None]:
# Solution: Production-Ready UDF Implementation Challenge

# Challenge 1: Data Validation and Cleansing UDF
def create_data_validator():
    """Factory function to create configurable data validation UDF"""
    
    @pandas_udf(returnType=StructType([
        StructField("is_valid", BooleanType()),
        StructField("validation_errors", StringType()),
        StructField("cleansed_value", StringType())
    ]))
    def validate_and_cleanse(input_series: pd.Series, 
                           validation_type: pd.Series) -> pd.DataFrame:
        """Comprehensive data validation and cleansing"""
        
        results = []
        
        for value, v_type in zip(input_series, validation_type):
            is_valid = True
            errors = []
            cleansed = str(value) if value is not None else ""
            
            try:
                if v_type == 'email':
                    # Email validation and cleansing
                    if value is None or '@' not in str(value):
                        is_valid = False
                        errors.append("Invalid email format")
                    else:
                        # Clean email (lowercase, trim)
                        cleansed = str(value).lower().strip()
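                        # Basic email pattern check (uses the module-level `re` import;
                        # a local `import re` here would leave `re` unbound in the other branches)
                        if not re.match(r'^[\w\.-]+@[\w\.-]+\.\w+$', cleansed):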
                            is_valid = False
                            errors.append("Invalid email pattern")
                
                elif v_type == 'phone':
                    # Phone number validation and formatting
                    if value is None:
                        is_valid = False
                        errors.append("Missing phone number")
                    else:
                        # Extract digits only
                        digits = re.sub(r'[^0-9]', '', str(value))
                        if len(digits) == 10:
                            cleansed = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
                        elif len(digits) == 11 and digits[0] == '1':
                            cleansed = f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
                        else:
                            is_valid = False
                            errors.append("Invalid phone number length")
                
                elif v_type == 'currency':
                    # Currency validation and normalization
                    if value is None:
                        is_valid = False
                        errors.append("Missing currency value")
                    else:
                        try:
                            # Remove currency symbols and convert to float
                            clean_value = re.sub(r'[^0-9.-]', '', str(value))
                            amount = float(clean_value)
                            if amount < 0:
                                errors.append("Negative currency value")
                            cleansed = f"${amount:.2f}"
                        except ValueError:
                            is_valid = False
                            errors.append("Invalid currency format")
                
                elif v_type == 'date':
                    # Date validation and standardization
                    if value is None:
                        is_valid = False
                        errors.append("Missing date")
                    else:
                        try:
                            # Try to parse common date formats
                            from datetime import datetime
                            date_formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%m-%Y']
                            parsed_date = None
                            for fmt in date_formats:
                                try:
                                    parsed_date = datetime.strptime(str(value), fmt)
                                    break
                                except ValueError:
                                    continue
                            
                            if parsed_date:
                                cleansed = parsed_date.strftime('%Y-%m-%d')
                            else:
                                is_valid = False
                                errors.append("Invalid date format")
                        except Exception:
                            is_valid = False
                            errors.append("Date parsing error")
                
            except Exception as e:
                is_valid = False
                errors.append(f"Validation error: {str(e)}")
            
            results.append({
                'is_valid': is_valid,
                'validation_errors': '|'.join(errors) if errors else '',
                'cleansed_value': cleansed
            })
        
        return pd.DataFrame(results)
    
    return validate_and_cleanse

# Challenge 2: Advanced Business Rules Engine
class BusinessRulesEngine:
    """Configurable business rules engine using UDFs"""
    
    def __init__(self):
        self.rules = {}
    
    def add_rule(self, rule_name, rule_function):
        """Add a business rule"""
        self.rules[rule_name] = rule_function
    
    def create_rules_udf(self):
        """Create UDF that applies all registered rules"""
        
        @pandas_udf(returnType=StructType([
            StructField("rules_passed", IntegerType()),
            StructField("rules_failed", IntegerType()),
            StructField("failed_rules", StringType()),
            StructField("overall_status", StringType())
        ]))
        def apply_business_rules(customer_data: pd.Series) -> pd.DataFrame:
            """Apply all business rules to customer data"""
            
            results = []
            
            for data_str in customer_data:
                passed = 0
                failed = 0
                failed_rule_names = []
                
                try:
                    # Parse customer data from a pipe-delimited "key=value" string
                    if data_str and data_str != 'null':
                        # Simple parsing for demo; in production prefer a structured format such as JSON
                        data_parts = str(data_str).split('|')
                        data = {}
                        for part in data_parts:
                            if '=' in part:
                                key, value = part.split('=', 1)
                                try:
                                    data[key] = float(value) if '.' in value else int(value)
                                except ValueError:
                                    data[key] = value
                    else:
                        data = {}
                    
                    # Apply rules
                    for rule_name, rule_func in self.rules.items():
                        try:
                            if rule_func(data):
                                passed += 1
                            else:
                                failed += 1
                                failed_rule_names.append(rule_name)
                        except Exception:
                            failed += 1
                            failed_rule_names.append(f"{rule_name}_ERROR")
                
                except Exception:
                    # If parsing fails, mark all rules as failed
                    failed = len(self.rules)
                    failed_rule_names = [f"{name}_PARSE_ERROR" for name in self.rules.keys()]
                
                # Determine overall status
                if failed == 0:
                    status = "APPROVED"
                elif failed <= 2:
                    status = "REVIEW"
                else:
                    status = "REJECTED"
                
                results.append({
                    'rules_passed': passed,
                    'rules_failed': failed,
                    'failed_rules': '|'.join(failed_rule_names),
                    'overall_status': status
                })
            
            return pd.DataFrame(results)
        
        return apply_business_rules

# Test the production-ready UDFs
print("🏭 Production-Ready UDF Testing")

# Test data validation UDF
validator_udf = create_data_validator()

# Create test dataset with various data quality issues
test_validation_data = spark.createDataFrame([
    ("john@example.com", "email"),
    ("invalid-email", "email"),
    ("(555) 123-4567", "phone"),
    ("5551234567", "phone"),
    ("$123.45", "currency"),
    ("-$50.00", "currency"),
    ("2023-12-25", "date"),
    ("12/25/2023", "date"),
    ("invalid-date", "date")
], ["input_value", "validation_type"])

# Apply validation UDF
validated_data = test_validation_data.withColumn(
    "validation_result",
    validator_udf(col("input_value"), col("validation_type"))
).select(
    "input_value",
    "validation_type",
    col("validation_result.is_valid").alias("is_valid"),
    col("validation_result.validation_errors").alias("errors"),
    col("validation_result.cleansed_value").alias("cleansed")
)

print("📋 Data Validation Results:")
validated_data.show(truncate=False)

# Test business rules engine
rules_engine = BusinessRulesEngine()

# Add business rules
rules_engine.add_rule("min_age", lambda data: data.get('age', 0) >= 18)
rules_engine.add_rule("max_transaction", lambda data: data.get('amount', 0) <= 10000)
rules_engine.add_rule("valid_state", lambda data: data.get('state', '') in ['CA', 'NY', 'TX', 'FL', 'WA', 'IL'])
rules_engine.add_rule("positive_balance", lambda data: data.get('balance', 0) >= 0)

# Create test data for business rules
rules_test_data = spark.createDataFrame([
    ("CUST_001", "age=25|amount=500|state=CA|balance=1000"),
    ("CUST_002", "age=16|amount=200|state=NY|balance=500"),
    ("CUST_003", "age=30|amount=15000|state=XX|balance=-100"),
    ("CUST_004", "age=45|amount=750|state=TX|balance=2000")
], ["customer_id", "customer_data"])

# Apply business rules
rules_udf = rules_engine.create_rules_udf()
customer_rules_test = rules_test_data.withColumn(
    "rules_result",
    rules_udf(col("customer_data"))
).select(
    "customer_id",
    col("rules_result.rules_passed").alias("passed"),
    col("rules_result.rules_failed").alias("failed"),
    col("rules_result.failed_rules").alias("failed_rules"),
    col("rules_result.overall_status").alias("status")
)

print("\n⚖️  Business Rules Results:")
customer_rules_test.show(truncate=False)

# Performance and error monitoring
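print("\n📊 UDF Performance Monitoring:")

# Time the validation UDF. Note: count() does not need the UDF output, so Spark
# may prune the UDF column; treat these timings as rough indicators only.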
start_time = time.time()
validation_count = validated_data.count()
validation_time = time.time() - start_time

# Time the business rules UDF
start_time = time.time()
rules_count = customer_rules_test.count()
rules_time = time.time() - start_time

print(f"Data Validation UDF: {validation_count} records in {validation_time:.4f}s")
print(f"Business Rules UDF: {rules_count} records in {rules_time:.4f}s")

# Summary statistics
valid_records = validated_data.filter(col("is_valid")).count()
approved_customers = customer_rules_test.filter(col("status") == "APPROVED").count()

print("\n📈 Quality Metrics:")
print(f"Data Validation: {valid_records}/{validation_count} records valid ({valid_records/validation_count*100:.1f}%)")
print(f"Business Rules: {approved_customers}/{rules_count} customers approved ({approved_customers/rules_count*100:.1f}%)")

# Validation
assert validation_count > 0, "Should have validation results"
assert rules_count > 0, "Should have business rules results"
assert 0 <= valid_records <= validation_count, "Valid record count should not exceed total records"
assert 0 <= approved_customers <= rules_count, "Approved customer count should not exceed total customers"

print("\n✓ Exercise 3.1 completed!")
print("🏭 Production-ready UDF pipeline implemented and tested")
print(f"🔍 Validated {validation_count} data points, processed {rules_count} business rule evaluations")

## Summary: UDF Mastery

### Key Capabilities Mastered:
1. **UDF Creation**: Regular and Pandas UDFs for different use cases
2. **Performance Optimization**: Vectorization with Pandas UDFs, which process Arrow batches instead of single rows
3. **Error Handling**: Robust null handling and exception management
4. **Business Logic**: Complex domain-specific transformations
5. **Production Patterns**: Validation, monitoring, and maintainable UDFs

### Performance Guidelines:
| **Use Case** | **Best Choice** | **Performance** | **When to Use** |
|--------------|-----------------|----------------|----------------|
| Simple conditions | Built-in functions | Fastest | Always prefer when possible |
| Complex string operations | Regular UDF | Good | When built-ins insufficient |
| Numerical computations | Pandas UDF | Very Good | Mathematical operations |
| ML feature engineering | Pandas UDF | Excellent | Vectorizable operations |
| Business rule validation | Pandas UDF | Very Good | Complex logic with multiple conditions |
| External API calls | Regular UDF (with caution) | Varies | When unavoidable, use connection pooling |
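
As a sketch of the first row of the table, the APPROVED/REVIEW/REJECTED thresholds used by the rules engine could be written as a built-in column expression instead of Python code. The helper names `overall_status` and `overall_status_column` are hypothetical and not part of the lab; the second one assumes an active Spark session and a numeric column named `failed`, as in `customer_rules_test`.

```python
def overall_status(failed: int) -> str:
    """Plain-Python reference for the status thresholds (hypothetical helper)."""
    if failed == 0:
        return "APPROVED"
    if failed <= 2:
        return "REVIEW"
    return "REJECTED"


def overall_status_column(failed_col_name: str = "failed"):
    """Same thresholds as a built-in Column expression.

    Assumption: called while a Spark session is active. pyspark is imported
    lazily so the plain-Python reference above can be tested without Spark.
    """
    from pyspark.sql.functions import col, when

    failed = col(failed_col_name)
    return (
        when(failed == 0, "APPROVED")
        .when(failed <= 2, "REVIEW")
        .otherwise("REJECTED")
    )
```

Under these assumptions, `customer_rules_test.withColumn("status_builtin", overall_status_column("failed"))` should produce the same labels as the `status` column, with that step evaluated inside the JVM rather than in a Python worker.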

### Advanced Techniques Demonstrated:
- ✅ **RFM Analysis**: Customer segmentation with vectorized calculations
- ✅ **Temporal Pattern Analysis**: Time-series feature extraction
- ✅ **Data Validation Pipeline**: Production-ready validation with error handling
- ✅ **Business Rules Engine**: Configurable rule evaluation system
- ✅ **Performance Monitoring**: UDF execution time tracking
- ✅ **Error Recovery**: Graceful handling of malformed data

### Best Practices Checklist:
- ✅ Handle null values explicitly in all UDFs
- ✅ Use appropriate return types with proper schema definition
- ✅ Prefer Pandas UDFs for numerical operations (often 3-10x faster than row-at-a-time UDFs, depending on workload)
- ✅ Consider built-in functions first, and always benchmark
- ✅ Implement comprehensive error handling with try/except blocks
- ✅ Monitor UDF performance and log slow operations
- ✅ Test UDFs thoroughly with edge cases and invalid data
- ✅ Document complex business logic with clear examples
- ✅ Use factory patterns for configurable UDFs
- ✅ Implement proper data type validation and conversion
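
Two of the items above, explicit null handling and the factory pattern, can be illustrated together. The following is a minimal sketch, not the lab's own code: `make_tier_classifier`, `to_spark_udf`, and the tier names are hypothetical, chosen only to show a configurable function whose settings are fixed when the factory is called.

```python
from typing import Callable, Dict, Optional


def make_tier_classifier(
    thresholds: Dict[str, float],
) -> Callable[[Optional[float]], Optional[str]]:
    """Factory: return a classifier configured with tier -> minimum amount."""
    # Sort once, highest minimum first, so the first match is the best tier.
    ordered = sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True)

    def classify(amount: Optional[float]) -> Optional[str]:
        # Explicit null handling: a missing input yields a missing output.
        if amount is None:
            return None
        for tier, minimum in ordered:
            if amount >= minimum:
                return tier
        return "UNCLASSIFIED"

    return classify


def to_spark_udf(fn):
    """Wrap a plain function as a row-at-a-time Spark UDF returning a string.

    pyspark is imported lazily so the classifier stays testable without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(fn, StringType())
```

Because the configuration lives in the closure, different pipelines can build differently configured classifiers from the same factory, and the plain function can be checked with edge cases before it is wrapped with `to_spark_udf`.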

### Production Deployment Considerations:
- **Serialization**: Minimize complex object passing to UDFs
- **Resource Management**: Monitor memory usage with large datasets
- **Error Logging**: Implement proper logging for production debugging
- **Version Control**: Track UDF changes and maintain backward compatibility
- **Testing**: Create comprehensive test suites for business logic validation
- **Performance**: Regular benchmarking against built-in alternatives
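
One way to act on the serialization and testing points is to keep the business logic in plain functions and pass only a dictionary of rules to the UDF, rather than a whole engine object. The sketch below mirrors the lab's demo parser and rule set; the function names `parse_customer_data` and `evaluate_rules` are hypothetical and do not appear in the lab code.

```python
from typing import Callable, Dict, List, Tuple


def parse_customer_data(data_str) -> dict:
    """Parse 'key=value|key=value' text into a dict, as the demo parser does."""
    data = {}
    if not data_str or data_str == "null":
        return data
    for part in str(data_str).split("|"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        try:
            # Values containing a dot become floats, other numerics become ints.
            data[key] = float(value) if "." in value else int(value)
        except ValueError:
            data[key] = value  # keep non-numeric values as strings
    return data


def evaluate_rules(
    data: dict, rules: Dict[str, Callable[[dict], bool]]
) -> Tuple[int, List[str]]:
    """Return (number of rules passed, names of rules that failed)."""
    passed = 0
    failed_names = []
    for name, rule in rules.items():
        try:
            if rule(data):
                passed += 1
            else:
                failed_names.append(name)
        except Exception:
            # A rule that raises counts as failed and is tagged for debugging.
            failed_names.append(f"{name}_ERROR")
    return passed, failed_names


# Same rules as the lab's test, held in a plain dict (cheap to serialize).
RULES = {
    "min_age": lambda d: d.get("age", 0) >= 18,
    "max_transaction": lambda d: d.get("amount", 0) <= 10000,
    "valid_state": lambda d: d.get("state", "") in ["CA", "NY", "TX", "FL", "WA", "IL"],
    "positive_balance": lambda d: d.get("balance", 0) >= 0,
}
```

With this split, a unit test can call `evaluate_rules(parse_customer_data(row), RULES)` directly, and a Pandas UDF would only need to loop over its input Series and call the same two functions.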

In [None]:
spark.stop()
print("🎉 Lab 7 completed! UDF expertise achieved.")
print("🏆 Module 1 Lab Series Complete!")
print("\n📚 You've mastered:")
print("  ✅ RDD Fundamentals - Core distributed data structures")
print("  ✅ Transformations vs Actions - Lazy evaluation principles")
print("  ✅ Lazy Evaluation - DAG optimization and Catalyst engine")
print("  ✅ DataFrame API - Structured data processing")
print("  ✅ Spark SQL - Complex analytical queries")
print("  ✅ DataFrame Operations - Advanced joins and aggregations")
print("  ✅ User-Defined Functions - Custom business logic")
print("\n🚀 Ready for Module 2: Advanced Spark Techniques!")
print("💡 Next topics: Streaming, MLlib, Performance Tuning, Production Deployment")