# Advanced Apache Sedona Examples

This notebook demonstrates complex spatial analytics scenarios using Apache Sedona.

## Complex Use Cases Covered:
1. **Spatial ETL Pipeline** - Processing large spatial datasets
2. **Geofencing & Location Intelligence** - Real-time location analytics
3. **Spatial Clustering** - DBSCAN clustering of spatial points
4. **Route Optimization** - Spatial network analysis
5. **Heatmap Generation** - Spatial density analysis
6. **Multi-scale Spatial Joins** - Performance optimization techniques
7. **Spatial Machine Learning** - Predictive spatial modeling

In [1]:
# Advanced imports for complex spatial operations
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import ClusteringEvaluator

from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Polygon as MPLPolygon
import folium
from folium.plugins import HeatMap
import geopandas as gpd
from shapely.geometry import Point, Polygon
import json
import random
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

In [2]:
# Initialize Spark with optimized configuration for spatial operations
spark = SparkSession.builder \
    .appName("AdvancedSedonaExamples") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator") \
    .config("spark.sql.extensions", "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
SedonaRegistrator.registerAll(spark)

print("✅ Advanced Sedona environment initialized!")
print(f"Spark Version: {spark.version}")
print(f"Available cores: {spark.sparkContext.defaultParallelism}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/25 12:09:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  SedonaRegistrator.registerAll(spark)


✅ Advanced Sedona environment initialized!
Spark Version: 3.4.0
Available cores: 14


  cls.register(spark)
25/10/25 12:09:42 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:09:42 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.index.SpatialIndex, which is already registered.
25/10/25 12:09:42 WARN UDTRegistration: Cannot register UDT for org.geotools.coverage.grid.GridCoverage2D, which is already registered.
25/10/25 12:09:42 WARN SimpleFunctionRegistry: The function st_union_aggr replaced a previously registered function.
25/10/25 12:09:42 WARN SimpleFunctionRegistry: The function st_envelope_aggr replaced a previously registered function.
25/10/25 12:09:42 WARN SimpleFunctionRegistry: The function st_intersection_aggr replaced a previously registered function.


## 1. Spatial ETL Pipeline: Processing NYC Taxi Data

Simulating processing of millions of taxi trips with spatial operations.

In [3]:
# Generate large-scale taxi trip data (simulating 1M+ trips)
def generate_nyc_taxi_data(num_trips=100000):
    # NYC bounding box (approximate)
    nyc_bounds = {
        'min_lat': 40.4774, 'max_lat': 40.9176,
        'min_lon': -74.2591, 'max_lon': -73.7004
    }
    
    # Generate trip data
    trips = []
    base_time = datetime(2024, 1, 1)
    
    for i in range(num_trips):
        # Pickup location (slightly clustered around Manhattan)
        pickup_lat = np.random.normal(40.7589, 0.05)  # Centered on Manhattan
        pickup_lon = np.random.normal(-73.9851, 0.05)
        
        # Dropoff location (random within NYC)
        dropoff_lat = np.random.uniform(nyc_bounds['min_lat'], nyc_bounds['max_lat'])
        dropoff_lon = np.random.uniform(nyc_bounds['min_lon'], nyc_bounds['max_lon'])
        
        # Trip details
        trip_time = base_time + timedelta(minutes=np.random.randint(0, 525600))  # Random time in year
        fare = np.random.uniform(5.0, 50.0)
        distance = np.random.uniform(0.1, 20.0)
        
        trips.append({
            'trip_id': f'trip_{i:06d}',
            'pickup_datetime': trip_time.isoformat(),
            'pickup_lat': pickup_lat,
            'pickup_lon': pickup_lon,
            'dropoff_lat': dropoff_lat,
            'dropoff_lon': dropoff_lon,
            'fare_amount': fare,
            'trip_distance': distance,
            'passenger_count': np.random.randint(1, 7)
        })
    
    return trips

print("Generating NYC taxi trip data...")
taxi_data = generate_nyc_taxi_data(50000)  # 50K trips for demo
print(f"Generated {len(taxi_data)} taxi trips")

# Convert to Spark DataFrame
taxi_schema = StructType([
    StructField("trip_id", StringType(), True),
    StructField("pickup_datetime", StringType(), True),
    StructField("pickup_lat", DoubleType(), True),
    StructField("pickup_lon", DoubleType(), True),
    StructField("dropoff_lat", DoubleType(), True),
    StructField("dropoff_lon", DoubleType(), True),
    StructField("fare_amount", DoubleType(), True),
    StructField("trip_distance", DoubleType(), True),
    StructField("passenger_count", IntegerType(), True)
])

taxi_df = spark.createDataFrame(taxi_data, schema=taxi_schema)
print(f"Created Spark DataFrame with {taxi_df.count()} records")

Generating NYC taxi trip data...
Generated 50000 taxi trips
Created Spark DataFrame with 50000 records


                                                                                

In [4]:
# Complex Spatial ETL Operations
taxi_df.createOrReplaceTempView("taxi_trips")

# 1. Create spatial geometries and calculate trip vectors
spatial_trips = spark.sql("""
    SELECT 
        trip_id,
        pickup_datetime,
        ST_Point(pickup_lon, pickup_lat) as pickup_point,
        ST_Point(dropoff_lon, dropoff_lat) as dropoff_point,
        ST_Distance(ST_Point(pickup_lon, pickup_lat), ST_Point(dropoff_lon, dropoff_lat)) as euclidean_distance,
        fare_amount,
        trip_distance,
        passenger_count,
        CASE 
            WHEN HOUR(pickup_datetime) BETWEEN 7 AND 9 THEN 'Morning Rush'
            WHEN HOUR(pickup_datetime) BETWEEN 17 AND 19 THEN 'Evening Rush'
            WHEN HOUR(pickup_datetime) BETWEEN 22 AND 5 THEN 'Night'
            ELSE 'Regular'
        END as time_period
    FROM taxi_trips
    WHERE pickup_lat BETWEEN 40.4 AND 41.0 
      AND pickup_lon BETWEEN -74.5 AND -73.5
      AND dropoff_lat BETWEEN 40.4 AND 41.0 
      AND dropoff_lon BETWEEN -74.5 AND -73.5
""")

spatial_trips.cache()
print(f"Processed {spatial_trips.count()} valid spatial trips")
spatial_trips.show(5)

25/10/25 12:09:54 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.core.ImageSerializableWrapper, which is already registered.
25/10/25 12:09:54 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.utils.Pixel, which is already registered.
25/10/25 12:09:54 WARN SimpleFunctionRegistry: The function st_pixelize replaced a previously registered function.
25/10/25 12:09:54 WARN SimpleFunctionRegistry: The function st_tilename replaced a previously registered function.
25/10/25 12:09:54 WARN SimpleFunctionRegistry: The function st_colorize replaced a previously registered function.
25/10/25 12:09:54 WARN SimpleFunctionRegistry: The function st_encodeimage replaced a previously registered function.
25/10/25 12:09:54 WARN SimpleFunctionRegistry: The function st_render replaced a previously registered function.
25/10/25 12:09:54 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:09:54 WAR

Processed 50000 valid spatial trips
+-----------+-------------------+--------------------+--------------------+--------------------+------------------+------------------+---------------+------------+
|    trip_id|    pickup_datetime|        pickup_point|       dropoff_point|  euclidean_distance|       fare_amount|     trip_distance|passenger_count| time_period|
+-----------+-------------------+--------------------+--------------------+--------------------+------------------+------------------+---------------+------------+
|trip_000000|2024-02-08T02:46:00|POINT (-73.992013...|POINT (-73.924629...| 0.06923145060550419|25.062473878411602|2.0895008247782574|              3|     Regular|
|trip_000001|2024-08-11T12:39:00|POINT (-73.934574...|POINT (-74.247599...|  0.3134462732238153| 42.45991883601898| 4.325548302497695|              4|     Regular|
|trip_000002|2024-12-13T08:26:00|POINT (-74.031304...|POINT (-74.017772...|0.025771611874429636| 32.53338026250708|2.8759278269756323|          

                                                                                

## 2. Advanced Geofencing: Multi-Zone Analysis

Creating complex geofences and analyzing spatial patterns.

In [5]:
# Create NYC borough-like zones using polygons
zones_data = [
    {
        'zone_id': 'manhattan_south', 
        'zone_name': 'Lower Manhattan',
        'polygon': 'POLYGON((-74.0479 40.6829, -73.9441 40.6829, -73.9441 40.7589, -74.0479 40.7589, -74.0479 40.6829))'
    },
    {
        'zone_id': 'manhattan_central', 
        'zone_name': 'Midtown Manhattan',
        'polygon': 'POLYGON((-74.0479 40.7589, -73.9441 40.7589, -73.9441 40.8176, -74.0479 40.8176, -74.0479 40.7589))'
    },
    {
        'zone_id': 'brooklyn_west', 
        'zone_name': 'West Brooklyn',
        'polygon': 'POLYGON((-74.0479 40.6000, -73.9000 40.6000, -73.9000 40.7000, -74.0479 40.7000, -74.0479 40.6000))'
    },
    {
        'zone_id': 'queens_central', 
        'zone_name': 'Central Queens',
        'polygon': 'POLYGON((-73.9000 40.7000, -73.7500 40.7000, -73.7500 40.8000, -73.9000 40.8000, -73.9000 40.7000))'
    }
]

zones_df = spark.createDataFrame(zones_data)
zones_df.createOrReplaceTempView("zones")

# Create spatial zones
spatial_zones = spark.sql("""
    SELECT 
        zone_id,
        zone_name,
        ST_GeomFromWKT(polygon) as zone_geometry,
        ST_Area(ST_GeomFromWKT(polygon)) as zone_area
    FROM zones
""")

spatial_zones.show()

+-----------------+-----------------+--------------------+--------------------+
|          zone_id|        zone_name|       zone_geometry|           zone_area|
+-----------------+-----------------+--------------------+--------------------+
|  manhattan_south|  Lower Manhattan|POLYGON ((-74.047...|0.007888799999999488|
|manhattan_central|Midtown Manhattan|POLYGON ((-74.047...|0.006093059999999745|
|    brooklyn_west|    West Brooklyn|POLYGON ((-74.047...|0.014789999999999491|
|   queens_central|   Central Queens|POLYGON ((-73.9 4...|0.014999999999999715|
+-----------------+-----------------+--------------------+--------------------+



In [6]:
# Complex spatial join: Assign pickup and dropoff zones
spatial_trips.createOrReplaceTempView("spatial_trips")
spatial_zones.createOrReplaceTempView("spatial_zones")

trips_with_zones = spark.sql("""
    SELECT 
        t.trip_id,
        t.pickup_datetime,
        t.pickup_point,
        t.dropoff_point,
        t.euclidean_distance,
        t.fare_amount,
        t.trip_distance,
        t.passenger_count,
        t.time_period,
        pz.zone_id as pickup_zone,
        pz.zone_name as pickup_zone_name,
        dz.zone_id as dropoff_zone,
        dz.zone_name as dropoff_zone_name,
        CASE 
            WHEN pz.zone_id = dz.zone_id THEN 'Intra-zone'
            ELSE 'Inter-zone'
        END as trip_type
    FROM spatial_trips t
    LEFT JOIN spatial_zones pz ON ST_Within(t.pickup_point, pz.zone_geometry)
    LEFT JOIN spatial_zones dz ON ST_Within(t.dropoff_point, dz.zone_geometry)
""")

trips_with_zones.cache()
trips_with_zones.createOrReplaceTempView("trips_with_zones")
print(f"Trips with zone assignments: {trips_with_zones.count()}")

# Analyze zone patterns
zone_analysis = spark.sql("""
    SELECT 
        pickup_zone_name,
        dropoff_zone_name,
        trip_type,
        time_period,
        COUNT(*) as trip_count,
        AVG(fare_amount) as avg_fare,
        AVG(euclidean_distance) as avg_distance,
        SUM(passenger_count) as total_passengers
    FROM trips_with_zones
    WHERE pickup_zone IS NOT NULL AND dropoff_zone IS NOT NULL
    GROUP BY pickup_zone_name, dropoff_zone_name, trip_type, time_period
    ORDER BY trip_count DESC
""")

zone_analysis.show(20)

25/10/25 12:10:06 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.core.ImageSerializableWrapper, which is already registered.
25/10/25 12:10:06 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.utils.Pixel, which is already registered.
25/10/25 12:10:06 WARN SimpleFunctionRegistry: The function st_pixelize replaced a previously registered function.
25/10/25 12:10:06 WARN SimpleFunctionRegistry: The function st_tilename replaced a previously registered function.
25/10/25 12:10:06 WARN SimpleFunctionRegistry: The function st_colorize replaced a previously registered function.
25/10/25 12:10:06 WARN SimpleFunctionRegistry: The function st_encodeimage replaced a previously registered function.
25/10/25 12:10:06 WARN SimpleFunctionRegistry: The function st_render replaced a previously registered function.
25/10/25 12:10:06 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:10:06 WAR

Trips with zone assignments: 52318
+-----------------+-----------------+----------+------------+----------+------------------+--------------------+----------------+
| pickup_zone_name|dropoff_zone_name| trip_type| time_period|trip_count|          avg_fare|        avg_distance|total_passengers|
+-----------------+-----------------+----------+------------+----------+------------------+--------------------+----------------+
|  Lower Manhattan|   Central Queens|Inter-zone|     Regular|       733|27.427335100286825| 0.17449557917840078|            2604|
|  Lower Manhattan|    West Brooklyn|Inter-zone|     Regular|       674|28.154813553742706| 0.09357235376383466|            2340|
|Midtown Manhattan|    West Brooklyn|Inter-zone|     Regular|       586|27.437387431195283| 0.14627218923280066|            2059|
|Midtown Manhattan|   Central Queens|Inter-zone|     Regular|       580|27.882216827449465| 0.17483526991486867|            2079|
|  Lower Manhattan|  Lower Manhattan|Intra-zone|     Re

## 3. Spatial Clustering: DBSCAN-like Analysis

Finding hotspots and clusters in pickup locations.

In [7]:
# Spatial clustering using grid-based approach (DBSCAN alternative for big data)
def create_spatial_grid(df, grid_size=0.001):  # ~100m grid cells
    """
    Create a spatial grid for clustering analysis
    """
    df.createOrReplaceTempView("points")
    
    grid_df = spark.sql(f"""
        SELECT 
            FLOOR(ST_X(pickup_point) / {grid_size}) * {grid_size} as grid_x,
            FLOOR(ST_Y(pickup_point) / {grid_size}) * {grid_size} as grid_y,
            COUNT(*) as point_count,
            AVG(fare_amount) as avg_fare,
            time_period,
            COLLECT_LIST(trip_id) as trip_ids
        FROM points
        WHERE pickup_point IS NOT NULL
        GROUP BY grid_x, grid_y, time_period
        HAVING point_count >= 3
        ORDER BY point_count DESC
    """)
    
    return grid_df

# Create hotspot analysis
hotspots = create_spatial_grid(trips_with_zones)
hotspots.cache()

print(f"Identified {hotspots.count()} spatial hotspots")
hotspots.show(10)

25/10/25 12:10:12 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.core.ImageSerializableWrapper, which is already registered.
25/10/25 12:10:12 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.utils.Pixel, which is already registered.
25/10/25 12:10:12 WARN SimpleFunctionRegistry: The function st_pixelize replaced a previously registered function.
25/10/25 12:10:12 WARN SimpleFunctionRegistry: The function st_tilename replaced a previously registered function.
25/10/25 12:10:12 WARN SimpleFunctionRegistry: The function st_colorize replaced a previously registered function.
25/10/25 12:10:12 WARN SimpleFunctionRegistry: The function st_encodeimage replaced a previously registered function.
25/10/25 12:10:12 WARN SimpleFunctionRegistry: The function st_render replaced a previously registered function.
25/10/25 12:10:12 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:10:12 WAR

Identified 4036 spatial hotspots
+-------+------+-----------+------------------+-----------+--------------------+
| grid_x|grid_y|point_count|          avg_fare|time_period|            trip_ids|
+-------+------+-----------+------------------+-----------+--------------------+
|-73.979|40.699|         10| 28.54039689992977|    Regular|[trip_001123, tri...|
|-73.958|40.742|          9|30.729177281036957|    Regular|[trip_001315, tri...|
|-74.001|40.699|          8|22.227245091971604|    Regular|[trip_004934, tri...|
|-73.938|40.744|          8|28.560674990514304|    Regular|[trip_001734, tri...|
|-73.956|40.736|          8|23.398327907427685|    Regular|[trip_007442, tri...|
|-74.018|40.697|          8| 31.30807819106216|    Regular|[trip_021749, tri...|
|-73.968|40.686|          8| 30.86401760066078|    Regular|[trip_022914, tri...|
|-74.016|40.753|          8|30.124422570106365|    Regular|[trip_019266, tri...|
|-73.998|40.692|          8| 36.54472547899647|    Regular|[trip_001216, tri



In [8]:
# Advanced hotspot analysis with density metrics
hotspots.createOrReplaceTempView("hotspots")

hotspot_analysis = spark.sql("""
    WITH hotspot_stats AS (
        SELECT 
            grid_x,
            grid_y,
            time_period,
            point_count,
            avg_fare,
            ST_Point(grid_x, grid_y) as grid_center,
            -- Calculate local density (points within 500m)
            (
                SELECT SUM(h2.point_count) 
                FROM hotspots h2 
                WHERE h2.time_period = h1.time_period
                  AND ST_Distance(ST_Point(h1.grid_x, h1.grid_y), ST_Point(h2.grid_x, h2.grid_y)) <= 0.005
            ) as neighborhood_density
        FROM hotspots h1
    ),
    ranked_hotspots AS (
        SELECT *,
            ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY neighborhood_density DESC) as density_rank,
            CASE 
                WHEN neighborhood_density >= 100 THEN 'Super Hotspot'
                WHEN neighborhood_density >= 50 THEN 'Major Hotspot'
                WHEN neighborhood_density >= 20 THEN 'Minor Hotspot'
                ELSE 'Regular Area'
            END as hotspot_category
        FROM hotspot_stats
    )
    SELECT 
        time_period,
        hotspot_category,
        COUNT(*) as num_areas,
        AVG(point_count) as avg_pickups_per_area,
        AVG(avg_fare) as avg_fare_in_category,
        MAX(neighborhood_density) as max_density
    FROM ranked_hotspots
    GROUP BY time_period, hotspot_category
    ORDER BY time_period, max_density DESC
""")

print("Hotspot Analysis by Time Period:")
hotspot_analysis.show()

Hotspot Analysis by Time Period:


                                                                                

+------------+----------------+---------+--------------------+--------------------+-----------+
| time_period|hotspot_category|num_areas|avg_pickups_per_area|avg_fare_in_category|max_density|
+------------+----------------+---------+--------------------+--------------------+-----------+
|Evening Rush|    Regular Area|       52|   3.326923076923077|  26.897678890559515|         12|
|Morning Rush|    Regular Area|       71|   3.408450704225352|  28.383944332931463|         12|
|     Regular|   Super Hotspot|      921|   3.800217155266015|  27.554897167210996|        156|
|     Regular|   Major Hotspot|     1783|   3.675266404935502|  27.481191061607486|         99|
|     Regular|   Minor Hotspot|      795|  3.4540880503144655|  27.880101475771042|         49|
|     Regular|    Regular Area|      414|   3.185990338164251|  27.807555434482975|         19|
+------------+----------------+---------+--------------------+--------------------+-----------+



## 4. Route Optimization & Network Analysis

Analyzing optimal routes and identifying inefficient trips.

In [9]:
# Route efficiency analysis
trips_with_zones.createOrReplaceTempView("trips_analysis")

route_efficiency = spark.sql("""
    SELECT 
        trip_id,
        pickup_zone_name,
        dropoff_zone_name,
        euclidean_distance,
        trip_distance,
        fare_amount,
        time_period,
        -- Calculate efficiency metrics
        CASE 
            WHEN euclidean_distance > 0 THEN trip_distance / euclidean_distance 
            ELSE NULL 
        END as detour_ratio,
        
        CASE 
            WHEN trip_distance > 0 THEN fare_amount / trip_distance 
            ELSE NULL 
        END as fare_per_mile,
        
        -- Classify trip efficiency
        CASE 
            WHEN trip_distance / euclidean_distance <= 1.2 THEN 'Efficient'
            WHEN trip_distance / euclidean_distance <= 1.5 THEN 'Moderate'
            ELSE 'Inefficient'
        END as route_efficiency
    FROM trips_analysis
    WHERE euclidean_distance > 0.001  -- Filter out very short trips
      AND trip_distance > 0
      AND pickup_zone_name IS NOT NULL
      AND dropoff_zone_name IS NOT NULL
""")

route_efficiency.cache()
print(f"Route efficiency analysis for {route_efficiency.count()} trips")
route_efficiency.show(10)

25/10/25 12:10:47 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.core.ImageSerializableWrapper, which is already registered.
25/10/25 12:10:47 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.utils.Pixel, which is already registered.
25/10/25 12:10:47 WARN SimpleFunctionRegistry: The function st_pixelize replaced a previously registered function.
25/10/25 12:10:47 WARN SimpleFunctionRegistry: The function st_tilename replaced a previously registered function.
25/10/25 12:10:47 WARN SimpleFunctionRegistry: The function st_colorize replaced a previously registered function.
25/10/25 12:10:47 WARN SimpleFunctionRegistry: The function st_encodeimage replaced a previously registered function.
25/10/25 12:10:47 WARN SimpleFunctionRegistry: The function st_render replaced a previously registered function.
25/10/25 12:10:47 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:10:47 WAR

Route efficiency analysis for 6148 trips
+-----------+----------------+-----------------+--------------------+------------------+------------------+------------+------------------+-------------------+----------------+
|    trip_id|pickup_zone_name|dropoff_zone_name|  euclidean_distance|     trip_distance|       fare_amount| time_period|      detour_ratio|      fare_per_mile|route_efficiency|
+-----------+----------------+-----------------+--------------------+------------------+------------------+------------+------------------+-------------------+----------------+
|trip_000002| Lower Manhattan|  Lower Manhattan|0.025771611874429636|2.8759278269756323| 32.53338026250708|Morning Rush|111.59285810248844| 11.312307616815147|     Inefficient|
|trip_000004| Lower Manhattan|Midtown Manhattan|0.051531646188783295|3.5499566048046645|  42.4937710281274|     Regular| 68.88886475311884| 11.970222669937568|     Inefficient|
|trip_000072| Lower Manhattan|    West Brooklyn|0.052231715267038924| 16.0

In [10]:
# Advanced route analysis with corridor identification
route_efficiency.createOrReplaceTempView("route_efficiency")

corridor_analysis = spark.sql("""
    WITH route_corridors AS (
        SELECT 
            pickup_zone_name,
            dropoff_zone_name,
            time_period,
            COUNT(*) as trip_volume,
            AVG(detour_ratio) as avg_detour,
            AVG(fare_per_mile) as avg_fare_per_mile,
            AVG(euclidean_distance) as avg_distance,
            PERCENTILE_APPROX(detour_ratio, 0.95) as p95_detour,
            -- Efficiency score (lower is better)
            AVG(detour_ratio) * 100 + (1.0 / AVG(fare_per_mile)) as inefficiency_score
        FROM route_efficiency
        WHERE pickup_zone_name != dropoff_zone_name  -- Inter-zone trips only
        GROUP BY pickup_zone_name, dropoff_zone_name, time_period
        HAVING trip_volume >= 5  -- Focus on popular routes
    ),
    ranked_corridors AS (
        SELECT *,
            ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY trip_volume DESC) as volume_rank,
            ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY inefficiency_score DESC) as inefficiency_rank
        FROM route_corridors
    )
    SELECT 
        time_period,
        pickup_zone_name,
        dropoff_zone_name,
        trip_volume,
        ROUND(avg_detour, 2) as avg_detour_ratio,
        ROUND(avg_fare_per_mile, 2) as avg_fare_per_mile,
        ROUND(inefficiency_score, 2) as inefficiency_score,
        volume_rank,
        inefficiency_rank,
        CASE 
            WHEN inefficiency_rank <= 3 THEN 'Optimization Priority'
            WHEN volume_rank <= 5 THEN 'High Volume Corridor'
            ELSE 'Regular Route'
        END as route_category
    FROM ranked_corridors
    WHERE volume_rank <= 10 OR inefficiency_rank <= 5
    ORDER BY time_period, inefficiency_rank
""")

print("Top Route Corridors by Time Period:")
corridor_analysis.show(30, truncate=False)

Top Route Corridors by Time Period:
+------------+-----------------+-----------------+-----------+----------------+-----------------+------------------+-----------+-----------------+---------------------+
|time_period |pickup_zone_name |dropoff_zone_name|trip_volume|avg_detour_ratio|avg_fare_per_mile|inefficiency_score|volume_rank|inefficiency_rank|route_category       |
+------------+-----------------+-----------------+-----------+----------------+-----------------+------------------+-----------+-----------------+---------------------+
|Evening Rush|West Brooklyn    |Lower Manhattan  |27         |214.92          |6.34             |21492.57          |8          |1                |Optimization Priority|
|Evening Rush|Midtown Manhattan|Lower Manhattan  |54         |157.24          |7.84             |15723.95          |5          |2                |Optimization Priority|
|Evening Rush|Lower Manhattan  |West Brooklyn    |108        |142.36          |7.42             |14235.69          |3  

## 5. Spatial Machine Learning: Demand Prediction

Using spatial features for predictive modeling.

In [11]:
# Create features for ML model
ml_features = spark.sql("""
    WITH spatial_features AS (
        SELECT 
            grid_x,
            grid_y,
            time_period,
            point_count as demand,
            avg_fare,
            -- Spatial features
            grid_x * 1000000 as x_scaled,  -- Scale coordinates
            grid_y * 1000000 as y_scaled,
            
            -- Time-based features
            CASE time_period 
                WHEN 'Morning Rush' THEN 1 
                WHEN 'Evening Rush' THEN 2
                WHEN 'Night' THEN 3
                ELSE 0 
            END as time_encoded,
            
            -- Distance from city center (approximate)
            SQRT(POWER(grid_x - (-73.9851), 2) + POWER(grid_y - 40.7589, 2)) as distance_from_center
        FROM hotspots
        WHERE point_count >= 3
    )
    SELECT *,
        -- Categorize demand levels for classification
        CASE 
            WHEN demand >= 20 THEN 2  -- High demand
            WHEN demand >= 10 THEN 1  -- Medium demand  
            ELSE 0                    -- Low demand
        END as demand_category
    FROM spatial_features
""")

ml_features.cache()
print(f"Created ML features for {ml_features.count()} spatial-temporal points")
ml_features.show(10)

Created ML features for 4036 spatial-temporal points
+-------+------+-----------+------+------------------+-------------+------------+------------+--------------------+---------------+
| grid_x|grid_y|time_period|demand|          avg_fare|     x_scaled|    y_scaled|time_encoded|distance_from_center|demand_category|
+-------+------+-----------+------+------------------+-------------+------------+------------+--------------------+---------------+
|-73.979|40.699|    Regular|    10| 28.54039689992977|-73979000.000|40699000.000|           0|  0.0602097998667991|              1|
|-73.958|40.742|    Regular|     9|30.729177281036957|-73958000.000|40742000.000|           0| 0.03193775195595332|              0|
|-74.001|40.699|    Regular|     8|22.227245091971604|-74001000.000|40699000.000|           0| 0.06197434953268974|              0|
|-73.938|40.744|    Regular|     8|28.560674990514304|-73938000.000|40744000.000|           0| 0.04940060728371667|              0|
|-73.956|40.736|    Reg

25/10/25 12:11:01 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.core.ImageSerializableWrapper, which is already registered.
25/10/25 12:11:01 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.utils.Pixel, which is already registered.
25/10/25 12:11:01 WARN SimpleFunctionRegistry: The function st_pixelize replaced a previously registered function.
25/10/25 12:11:01 WARN SimpleFunctionRegistry: The function st_tilename replaced a previously registered function.
25/10/25 12:11:01 WARN SimpleFunctionRegistry: The function st_colorize replaced a previously registered function.
25/10/25 12:11:01 WARN SimpleFunctionRegistry: The function st_encodeimage replaced a previously registered function.
25/10/25 12:11:01 WARN SimpleFunctionRegistry: The function st_render replaced a previously registered function.
25/10/25 12:11:01 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:11:01 WAR

In [13]:
# Analyze clusters
predictions.createOrReplaceTempView("ml_predictions")

cluster_analysis = spark.sql("""
    WITH cluster_time_periods AS (
        SELECT 
            cluster,
            time_period,
            COUNT(*) as period_count,
            ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY COUNT(*) DESC) as rn
        FROM ml_predictions
        GROUP BY cluster, time_period
    ),
    dominant_periods AS (
        SELECT 
            cluster,
            time_period as dominant_time_period
        FROM cluster_time_periods
        WHERE rn = 1
    )
    SELECT 
        c.cluster,
        COUNT(*) as cluster_size,
        AVG(c.demand) as avg_demand,
        AVG(c.avg_fare) as avg_fare_in_cluster,
        AVG(c.distance_from_center) as avg_distance_from_center,
        
        -- Most common time period in cluster
        dp.dominant_time_period,
        
        -- Demand characteristics
        MIN(c.demand) as min_demand,
        MAX(c.demand) as max_demand,
        STDDEV(c.demand) as demand_std
    FROM ml_predictions c
    LEFT JOIN dominant_periods dp ON c.cluster = dp.cluster
    GROUP BY c.cluster, dp.dominant_time_period
    ORDER BY c.cluster
""")

print("Spatial-Temporal Demand Clusters:")
cluster_analysis.show()

Spatial-Temporal Demand Clusters:
+-------+------------+------------------+-------------------+------------------------+--------------------+----------+----------+------------------+
|cluster|cluster_size|        avg_demand|avg_fare_in_cluster|avg_distance_from_center|dominant_time_period|min_demand|max_demand|        demand_std|
+-------+------------+------------------+-------------------+------------------------+--------------------+----------+----------+------------------+
|      0|         774| 3.445736434108527|  27.38362005404498|    0.050221999008898364|             Regular|         3|         8|0.7606262571181329|
|      1|         827| 3.909310761789601| 27.780791184320087|     0.06200423717324027|             Regular|         3|        10|1.0600597926933237|
|      2|         798|3.4423558897243107|  27.55561057754531|     0.04711078200598404|             Regular|         3|         8|0.7504481061400463|
|      3|         892|3.7286995515695067| 27.508281660887874|    0.01945

## 6. Performance Optimization Techniques

Demonstrating advanced Sedona performance optimization.

In [14]:
# Spatial indexing and partitioning strategies
import time

def benchmark_spatial_join(df1, df2, join_condition, description):
    """
    Benchmark different spatial join strategies
    """
    start_time = time.time()
    result_count = join_condition.count()
    end_time = time.time()
    
    print(f"{description}:")
    print(f"  - Result count: {result_count}")
    print(f"  - Execution time: {end_time - start_time:.2f} seconds")
    print(f"  - Partitions: {join_condition.rdd.getNumPartitions()}")
    return result_count, end_time - start_time

# Create test datasets
large_points = spark.sql("""
    SELECT 
        ST_Point(RAND() * 0.1 - 74.0, RAND() * 0.1 + 40.7) as point,
        CAST(RAND() * 1000 AS INT) as point_id
    FROM range(10000)
""")

test_polygons = spark.sql("""
    SELECT 
        ST_Buffer(ST_Point(RAND() * 0.05 - 73.98, RAND() * 0.05 + 40.75), 0.001) as polygon,
        CAST(RAND() * 100 AS INT) as poly_id
    FROM range(100)
""")

large_points.cache()
test_polygons.cache()

print(f"Created {large_points.count()} test points and {test_polygons.count()} test polygons")

25/10/25 12:18:44 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.core.ImageSerializableWrapper, which is already registered.
25/10/25 12:18:44 WARN UDTRegistration: Cannot register UDT for org.apache.sedona.viz.utils.Pixel, which is already registered.
25/10/25 12:18:44 WARN SimpleFunctionRegistry: The function st_pixelize replaced a previously registered function.
25/10/25 12:18:44 WARN SimpleFunctionRegistry: The function st_tilename replaced a previously registered function.
25/10/25 12:18:44 WARN SimpleFunctionRegistry: The function st_colorize replaced a previously registered function.
25/10/25 12:18:44 WARN SimpleFunctionRegistry: The function st_encodeimage replaced a previously registered function.
25/10/25 12:18:44 WARN SimpleFunctionRegistry: The function st_render replaced a previously registered function.
25/10/25 12:18:44 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/10/25 12:18:44 WAR

Created 10000 test points and 100 test polygons


In [15]:
# Benchmark different join strategies
large_points.createOrReplaceTempView("large_points")
test_polygons.createOrReplaceTempView("test_polygons")

# Strategy 1: Basic spatial join
basic_join = spark.sql("""
    SELECT p.point_id, pg.poly_id
    FROM large_points p, test_polygons pg
    WHERE ST_Within(p.point, pg.polygon)
""")

benchmark_spatial_join(large_points, test_polygons, basic_join, "Basic Spatial Join")

# Strategy 2: With spatial indexing hint
indexed_join = spark.sql("""
    SELECT /*+ BROADCAST(pg) */ p.point_id, pg.poly_id
    FROM large_points p, test_polygons pg
    WHERE ST_Within(p.point, pg.polygon)
""")

benchmark_spatial_join(large_points, test_polygons, indexed_join, "Broadcast Join Strategy")

# Performance tips summary
print("\n🚀 Performance Optimization Tips:")
print("1. Use broadcast joins for small polygon datasets (< 200MB)")
print("2. Partition data by spatial regions for large datasets")
print("3. Cache frequently accessed spatial DataFrames")
print("4. Use appropriate spatial predicates (ST_Within vs ST_Intersects)")
print("5. Consider spatial indexing for repeated queries")

Basic Spatial Join:
  - Result count: 307
  - Execution time: 0.15 seconds
  - Partitions: 14
Broadcast Join Strategy:
  - Result count: 307
  - Execution time: 0.08 seconds
  - Partitions: 14

🚀 Performance Optimization Tips:
1. Use broadcast joins for small polygon datasets (< 200MB)
2. Partition data by spatial regions for large datasets
3. Cache frequently accessed spatial DataFrames
4. Use appropriate spatial predicates (ST_Within vs ST_Intersects)
5. Consider spatial indexing for repeated queries


## 7. Advanced Visualization: Interactive Heatmaps

Creating sophisticated spatial visualizations.

In [20]:
# Prepare data for visualization
viz_data = spark.sql("""
    SELECT 
        grid_x as longitude,
        grid_y as latitude,
        point_count as intensity,
        avg_fare,
        time_period
    FROM hotspots
    WHERE point_count >= 5
    ORDER BY point_count DESC
    LIMIT 500
""")

viz_pandas = viz_data.toPandas()
print(f"Prepared {len(viz_pandas)} points for visualization")

# Create multi-layer interactive map
def create_advanced_heatmap(df):
    # Center map on NYC
    center_lat = df['latitude'].mean()
    center_lon = df['longitude'].mean()
    
    m = folium.Map(
        location=[center_lat, center_lon],
        zoom_start=12,
        tiles='OpenStreetMap'
    )
    
    # Add different layers for different time periods
    time_periods = df['time_period'].unique()
    colors = ['red', 'blue', 'green', 'orange']
    
    for i, period in enumerate(time_periods):
        period_data = df[df['time_period'] == period]
        
        # Create heatmap data
        heat_data = [[row['latitude'], row['longitude'], row['intensity']] 
                    for idx, row in period_data.iterrows()]
        
        if heat_data:  # Only create layer if data exists
            HeatMap(
                heat_data,
                name=f'Demand - {period}',
                radius=15,
                blur=10,
                gradient={0.2: colors[i % len(colors)], 1.0: colors[i % len(colors)]}
            ).add_to(m)
    
    # Add layer control
    folium.LayerControl().add_to(m)
    
    return m

if len(viz_pandas) > 0:
    heatmap = create_advanced_heatmap(viz_pandas)
    print("✅ Interactive heatmap created! (Display in Jupyter)")
else:
    print("No data available for visualization")

Prepared 491 points for visualization
✅ Interactive heatmap created! (Display in Jupyter)


In [21]:
heatmap

In [19]:
# Advanced statistical analysis
summary_stats = spark.sql("""
    WITH overall_stats AS (
        SELECT 
            COUNT(DISTINCT trip_id) as total_trips,
            COUNT(DISTINCT pickup_zone_name) as unique_pickup_zones,
            COUNT(DISTINCT dropoff_zone_name) as unique_dropoff_zones,
            AVG(fare_amount) as avg_fare,
            AVG(euclidean_distance) as avg_distance,
            SUM(passenger_count) as total_passengers
        FROM trips_with_zones
        WHERE pickup_zone_name IS NOT NULL
    ),
    efficiency_stats AS (
        SELECT 
            route_efficiency,
            COUNT(*) as trip_count,
            AVG(detour_ratio) as avg_detour,
            AVG(fare_per_mile) as avg_fare_per_mile
        FROM route_efficiency
        GROUP BY route_efficiency
    )
    SELECT 
        'Overall Statistics' as metric_type,
        CAST(total_trips AS STRING) as value,
        'Total processed trips' as description
    FROM overall_stats
    
    UNION ALL
    
    SELECT 
        'Spatial Coverage' as metric_type,
        CAST(unique_pickup_zones AS STRING) as value,
        'Unique pickup zones covered' as description
    FROM overall_stats
    
    UNION ALL
    
    SELECT 
        'Route Efficiency' as metric_type,
        CONCAT(route_efficiency, ': ', CAST(trip_count AS STRING), ' trips') as value,
        CONCAT('Avg detour ratio: ', CAST(ROUND(avg_detour, 2) AS STRING)) as description
    FROM efficiency_stats
    ORDER BY metric_type, value
""")

print("\n📊 Advanced Spatial Analytics Summary:")
summary_stats.show(20, truncate=False)


📊 Advanced Spatial Analytics Summary:
+------------------+-----------------------+---------------------------+
|metric_type       |value                  |description                |
+------------------+-----------------------+---------------------------+
|Overall Statistics|32716                  |Total processed trips      |
|Route Efficiency  |Efficient: 19 trips    |Avg detour ratio: 0.87     |
|Route Efficiency  |Inefficient: 6123 trips|Avg detour ratio: 135.28   |
|Route Efficiency  |Moderate: 6 trips      |Avg detour ratio: 1.35     |
|Spatial Coverage  |4                      |Unique pickup zones covered|
+------------------+-----------------------+---------------------------+



In [None]:
# Cleanup
print("\n🧹 Cleaning up resources...")
spark.catalog.clearCache()
print("Cache cleared.")

print("\n🎯 Complex Spatial Analytics Completed!")
print("\nKey Capabilities Demonstrated:")
print("• Large-scale spatial ETL processing")
print("• Multi-zone geofencing analysis")
print("• Spatial clustering and hotspot detection")
print("• Route optimization analysis")
print("• Spatial machine learning integration")
print("• Performance optimization techniques")
print("• Advanced interactive visualizations")

# Uncomment to stop Spark (keep running for interactive use)
# spark.stop()