# GPU-Accelerated Spark Connect Demo - ETL and ML Pipeline (Spark 4.0+)

This notebook demonstrates the latest Spark 4.0+ Connect capabilities with GPU acceleration, showcasing:
- **Spark Connect** for remote DataFrame and SQL operations
- **MLlib over Spark Connect** (new in Spark 4.0)
- **NVIDIA RAPIDS GPU acceleration** for up to 9x faster execution on GPU-eligible operations (a claimed figure, not measured in this notebook)
- **End-to-end ETL and ML workflows** that need no application code changes to run on GPU
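
The GPU acceleration listed above is configured on the Spark Connect server rather than in the notebook. As a rough sketch of what that setup might look like, the snippet below assembles a launch command for `start-connect-server.sh`. The `spark.plugins=com.nvidia.spark.SQLPlugin` and `spark.rapids.sql.enabled` settings come from the RAPIDS Accelerator; the jar path, the GPU amount, and the helper name `connect_server_command` are assumptions made for illustration, not details from the session.

```python
import shlex


def connect_server_command(rapids_jar, gpus_per_executor=1):
    """Build a start-connect-server.sh argument list with RAPIDS settings.

    rapids_jar is a hypothetical path to the RAPIDS Accelerator jar.
    """
    conf = {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
    }
    args = ["./sbin/start-connect-server.sh", "--jars", rapids_jar]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical jar location; substitute the jar that matches your Spark build.
cmd = connect_server_command("/opt/jars/rapids-4-spark.jar")
print(shlex.join(cmd))
```

Because these settings live on the server, the client code in the cells below stays the same whether or not a GPU is present.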

*Based on the Data and AI Summit 2025 session: "GPU Accelerated Spark Connect"*


## 1. Connect to Spark via Spark Connect


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd
import numpy as np

# Connect to a Spark Connect 4.0+ server; GPU acceleration is configured on the server side
spark = SparkSession.builder \
    .remote("sc://spark-connect:15002") \
    .appName("GPU-Accelerated-ETL-ML-Demo") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
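print("Spark Connect URL: sc://spark-connect:15002")

# Check whether the RAPIDS SQL plugin is configured on the server.
# The plugin replaces physical operators instead of registering SQL
# functions, so the server's spark.plugins setting is the thing to inspect.
try:
    plugins = spark.conf.get("spark.plugins", "")
except Exception:
    plugins = ""

if "com.nvidia.spark.SQLPlugin" in plugins:
    print("✅ RAPIDS SQL plugin configured on the server")
else:
    print("⚠️  RAPIDS SQL plugin not detected - running on CPU")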

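
A complementary way to check for GPU execution is to look at the physical plan, where the RAPIDS Accelerator shows its operators with a `Gpu` prefix (for example `GpuProject`). The sketch below is one possible approach, not the session's method: `gpu_operators` and `explain_sql` are hypothetical helper names, and the sample plan is illustrative text rather than output captured from a cluster. With adaptive query execution enabled, the plan shown before a query runs may not be final, so an empty result should be treated as inconclusive.

```python
import re


def gpu_operators(plan_text):
    """Return the sorted, de-duplicated Gpu* operator names in a plan string."""
    return sorted(set(re.findall(r"\bGpu[A-Z]\w*", plan_text)))


def explain_sql(spark, query):
    """Fetch the plan text for a SQL query; EXPLAIN returns one row with the plan."""
    return spark.sql(f"EXPLAIN {query}").first()[0]


# Illustrative plan text only, not output captured from a real cluster.
sample_plan = """
GpuColumnarToRow false
+- GpuProject [id#0L]
   +- GpuRange (0, 10, step=1, splits=2)
"""
print(gpu_operators(sample_plan))
# In the notebook: gpu_operators(explain_sql(spark, "SELECT id * 2 FROM range(10)"))
```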

## 2. GPU-Accelerated Data Processing


In [None]:
# Build a synthetic dataset to exercise GPU-eligible DataFrame operations
print("Creating synthetic dataset for GPU acceleration demo...")

# 100,000 rows keeps the demo quick; raise the row count to make any
# GPU speedup visible, since small inputs are dominated by fixed overhead
df = spark.range(100000).toDF("id")
df = df.withColumn("value", (col("id") * 3.14159).cast("double"))
df = df.withColumn("squared", col("value") ** 2)
df = df.withColumn("sqrt_val", sqrt(col("value")))
df = df.withColumn("log_val", log(col("value") + 1))

# Cache for reuse
df.cache()

print(f"Dataset size: {df.count():,} records")
print("Sample data:")
df.show(10)

# Perform aggregations (these run on GPU only when the RAPIDS plugin is active)
print("\nGPU-accelerated aggregations:")
agg_results = df.agg(
    avg("value").alias("avg_value"),
    stddev("value").alias("stddev_value"),
    min("squared").alias("min_squared"),
    max("squared").alias("max_squared")
)
agg_results.show()

print("✅ DataFrame operations completed!")
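
The cell above runs the operations but does not measure them, so any speedup is assumed rather than observed. A minimal timing sketch is shown below; `time_action` is a hypothetical helper, and the stand-in workload exists only so the snippet runs without a cluster.

```python
import statistics
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable several times.

    Returns the last result and the list of wall-clock durations in seconds.
    """
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return result, durations


# Stand-in workload so the sketch runs without a Spark server.
result, durations = time_action(lambda: sum(i * i for i in range(100_000)))
print(f"median: {statistics.median(durations):.4f}s over {len(durations)} runs")

# In the notebook, the callable would wrap a Spark action, for example:
#   time_action(lambda: df.agg(avg("value")).collect())
# Toggling spark.rapids.sql.enabled between runs (assumed to be settable at
# runtime on this server) would give a CPU baseline for comparison.
```

Note that `df` is cached in this notebook, so repeated runs mostly measure the aggregation over cached data rather than the column computations.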


## 3. MLlib over Spark Connect (New in Spark 4.0)


In [None]:
# Demonstrate MLlib over Spark Connect (new in Spark 4.0)
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

print("🚀 Testing MLlib over Spark Connect (Spark 4.0 feature)...")

# Prepare ML dataset
ml_df = df.select("id", "value", "squared", "sqrt_val", "log_val")

# Create features vector
assembler = VectorAssembler(
    inputCols=["value", "sqrt_val", "log_val"],
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create linear regression model (runs on GPU only if the server has a GPU ML plugin installed)
lr = LinearRegression(
    featuresCol="scaled_features",
    labelCol="squared",
    predictionCol="prediction"
)

# Build ML pipeline
pipeline = Pipeline(stages=[assembler, scaler, lr])

# Split data
train_df, test_df = ml_df.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_df.count():,} records")
print(f"Test set: {test_df.count():,} records")

# Train model via Spark Connect
print("Training model via Spark Connect...")
model = pipeline.fit(train_df)

# Make predictions
predictions = model.transform(test_df)
predictions.select("squared", "prediction").show(10)

# Evaluate model
evaluator = RegressionEvaluator(labelCol="squared", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

print(f"Model RMSE: {rmse:.4f}")
print("✅ MLlib over Spark Connect working successfully!")
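
For reference, the RMSE reported by `RegressionEvaluator` is the square root of the mean squared difference between label and prediction, so it is expressed in the units of the label. The labels here (`squared`) reach about 9.9e10, so a large absolute RMSE is expected. The sketch below recomputes the metric in plain Python on made-up numbers; `rmse` is a hypothetical helper, not part of MLlib.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration; the errors are 0, -1 and 2.
print(rmse([1.0, 4.0, 9.0], [1.0, 5.0, 7.0]))
```

Applying the same function to a small collected sample of `predictions` is one way to sanity-check the evaluator's output.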


## 4. Performance Summary & Cleanup


In [None]:
# Performance summary and cleanup
print("🎯 Demo Summary:")
print("✅ Spark Connect 4.0 with GPU acceleration")
print("✅ Large-scale DataFrame operations")
print("✅ MLlib training over Spark Connect")
print("✅ Transparent GPU acceleration via RAPIDS plugin")
print("")
print("📊 Expected Performance Benefits (claimed, not measured in this notebook):")
print("- Up to 9x faster execution on GPU-accelerated operations")
print("- 80% cost reduction through improved efficiency")
print("- No code changes required for acceleration")
print("")
print("🔗 Learn more: https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect")

# Stop the Spark session
spark.stop()
print("\n✅ Spark session stopped successfully.")
