# Geospatial Intelligence with H3

## Purpose

This notebook introduces **geospatial intelligence**
into the Crisis Recovery Lakehouse using **H3 hexagonal indexing**.

Its purpose is to transform **latitude–longitude–based events**
into **spatially aggregated risk signals** that can be:
- Analyzed efficiently
- Compared consistently across regions
- Used for crisis monitoring and decision-making

This notebook converts raw geographic coordinates
into **location-aware intelligence**.

## Business Context

During a crisis, issues rarely occur uniformly.

Problems often cluster geographically:
- Certain neighborhoods experience more delays
- Specific zones show hygiene complaints
- Localized congestion overwhelms delivery capacity

Raw latitude and longitude data:
- Is difficult to aggregate, because nearly every point is unique
- Is computationally expensive to group and join at scale
- Does not align with business decision boundaries

Geospatial intelligence enables leadership to answer:
- *Where is the crisis concentrated?*
- *Which areas require intervention first?*
- *Are risks spreading or contained geographically?*


## Inputs and Outputs

### Inputs (Silver / Enriched Data)

| Source Table | Purpose |
|-------------|--------|
| Order or event data with latitude & longitude | Geographic signal source |

Coordinates typically originate from:
- Delivery locations
- Store locations
- Customer drop-off points


### Outputs

| Output | Business Purpose |
|------|------------------|
| H3-indexed dataset | Spatial aggregation foundation |
| Zone-level risk signals | Identify geographic hotspots |
| Geospatial features | Downstream Gold & ML usage |

All outputs use **hexagonal spatial indexing**
to ensure consistency and scalability.

## Why H3 (Hexagonal Indexing)

### Business Problem

Traditional geographic grouping (city, zip code, radius):
- Creates uneven regions
- Breaks at boundaries
- Is difficult to scale globally
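
To make the "uneven regions" point concrete, the sketch below estimates the ground area of a fixed 0.01° × 0.01° latitude/longitude grid cell at a few latitudes. It assumes a spherical Earth with roughly 111.32 km per degree of latitude. The function `grid_cell_area_km2` and the constant name are illustrative and not part of this notebook's pipeline.

```python
import math

# Approximate length of one degree of latitude (spherical-Earth assumption)
KM_PER_DEGREE_LAT = 111.32


def grid_cell_area_km2(lat_deg, cell_size_deg=0.01):
    """Approximate ground area of a square lat/lon grid cell at a given latitude."""
    height_km = KM_PER_DEGREE_LAT * cell_size_deg
    # East-west extent shrinks with the cosine of the latitude
    width_km = KM_PER_DEGREE_LAT * math.cos(math.radians(lat_deg)) * cell_size_deg
    return height_km * width_km


for lat in (0.0, 12.9, 60.0):
    print(f"latitude {lat:5.1f}: {grid_cell_area_km2(lat):.3f} km^2")
```

Under this approximation, the same grid cell covers about half the area at 60° latitude that it covers at the equator, which is why a fixed-degree grid does not compare consistently across regions.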


### Approach

H3 solves this by:
- Dividing the world into **hexagonal cells of roughly equal area** at each resolution
- Allowing hierarchical resolution (coarse → fine)
- Enabling fast spatial aggregation and joins

This makes H3 ideal for:
- Crisis hotspot detection
- Market congestion analysis
- Zone-based risk scoring

## Design Principles

- Use **hexagons, not raw coordinates**
- Spatial resolution must be **explicit and consistent**
- Geospatial logic must be **deterministic**
- Outputs must be safe for aggregation and ML
- No visualization assumptions baked into the data




In [0]:
%pip install h3


In [0]:
dbutils.library.restartPython()

## 1: Store Location Dimension Initialization

### Business Problem

Geospatial analysis requires **stable latitude and longitude references**
for physical entities such as stores.

However, transactional order data often:
- Does not contain store coordinates
- Cannot be reliably joined to external GIS systems
- Lacks a dedicated location dimension

Without consistent store locations, spatial aggregation
and hotspot detection are not possible.

---

### Approach

We create a **store location dimension table** that:
- Assigns each store a fixed latitude and longitude
- Constrains coordinates within a realistic city boundary
- Persists locations for reuse across geospatial analyses

This table is created only once to ensure:
- Deterministic spatial behavior
- Reproducibility across runs
- Clean separation between transactional and spatial data
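
One way to keep the generated coordinates reproducible even if the table is ever rebuilt is to seed the random generator and iterate over store IDs in sorted order. The following is a minimal sketch in plain Python under that assumption. `assign_store_locations`, the seed value, and the sample store IDs are hypothetical; the bounding box mirrors the ranges used in the cell below (12.9 ± 0.05 and 77.5 ± 0.05).

```python
import random

# Bounding box mirroring the ranges used in this notebook (assumed city area)
LAT_CENTER, LON_CENTER, HALF_SPAN = 12.9, 77.5, 0.05


def assign_store_locations(store_ids, seed=42):
    """Return {store_id: (lat, lon)} using a seeded generator for repeatability."""
    rng = random.Random(seed)  # local generator; does not touch global random state
    locations = {}
    # Sorting makes the assignment independent of the input order
    for sid in sorted(store_ids):
        lat = LAT_CENTER + rng.uniform(-HALF_SPAN, HALF_SPAN)
        lon = LON_CENTER + rng.uniform(-HALF_SPAN, HALF_SPAN)
        locations[sid] = (lat, lon)
    return locations


first = assign_store_locations(["S3", "S1", "S2"])
second = assign_store_locations(["S1", "S2", "S3"])
print(first == second)
```

With a fixed seed and sorted IDs, two runs over the same set of stores produce identical coordinates. Adding a new store would still shift the values drawn for stores that sort after it, so persisting the table remains the primary guarantee of stability.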

In [0]:
import random
from pyspark.sql import Row

# Create store location dimension table if it does not exist
if not spark.catalog.tableExists("food_delivery.dim_store_locations"):

    # Generate random latitude within Bengaluru city bounds
    def generate_lat():
        return 12.9 + (random.random() * 0.1) - 0.05

    # Generate random longitude within Bengaluru city bounds
    def generate_lon():
        return 77.5 + (random.random() * 0.1) - 0.05

    # Get distinct store IDs from enriched orders (safe to collect for small dataset)
    stores_df = (
        spark.table("food_delivery.silver_orders_enriched")
        .select("store_id")
        .distinct()
    )

    store_ids = [row.store_id for row in stores_df.collect()]

    # Assign each store a random but stable location within city bounds
    store_locs = [
        Row(
            store_id=sid,
            latitude=generate_lat(),
            longitude=generate_lon()
        )
        for sid in store_ids
    ]

    # Create DataFrame and persist as Delta table for spatial analysis
    df_store_locs = spark.createDataFrame(store_locs)

    df_store_locs.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable("food_delivery.dim_store_locations")
else:
    # Skip creation if table already exists
    print("store locations table already exists → skipping creation")

# Display sample of AI review sentiment data for reference
display(spark.table("food_delivery.ai_review_sentiment").limit(10))


## 2: H3 Hexagonal Index Generation

### Business Problem

Raw latitude and longitude values:
- Are expensive to aggregate
- Do not group naturally
- Make spatial comparisons difficult

Direct coordinate-based grouping
does not scale for large datasets.

---

### Approach

We define a user-defined function (UDF) to convert
latitude–longitude pairs into **H3 hexagonal indexes**.

H3 resolution 9 (cells averaging roughly 0.1 km²):
- Provides neighborhood-level granularity
- Balances spatial precision with aggregation efficiency
- Enables consistent spatial joins and grouping

Each coordinate is mapped to a **single hexagon cell**,
forming the foundation for geospatial intelligence.
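
Valid geographic coordinates fall within latitude [-90, 90] and longitude [-180, 180], so swapped or corrupted values are worth catching before they reach the indexing step. The following is a minimal sketch of such a pre-check. `is_valid_coordinate` is a hypothetical helper, not part of the `h3` library or of this notebook's UDF.

```python
import math


def is_valid_coordinate(lat, lon):
    """Return True only for finite, in-range latitude/longitude pairs."""
    if lat is None or lon is None:
        return False
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        # Non-numeric input such as a malformed string
        return False
    if not (math.isfinite(lat) and math.isfinite(lon)):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


print(is_valid_coordinate(12.9, 77.5))
print(is_valid_coordinate(77.5, 212.9))
```

A check like this could be applied as a filter before the UDF or called at the top of it, returning `None` for rejected rows in the same way missing values are handled.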


In [0]:
# UDF to convert latitude and longitude to H3 hexagonal index (resolution 9)
import h3
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def get_h3_index(lat, lon):
    # Return None if coordinates are missing
    if lat is None or lon is None:
        return None
    # Compute H3 index for given coordinates
    return h3.latlng_to_cell(lat, lon, 9)

## 3: Build Geospatial Safety Intelligence Table

### Business Problem

Crisis-related risks are influenced by the intersection of:
- Location
- Time
- Operational context
- Customer feedback

Without a unified spatial dataset,
these signals remain siloed and underutilized.

### Approach

We construct a **geospatial intelligence table** by:
- Joining orders with AI-enriched sentiment data
- Attaching fixed store locations
- Converting coordinates into H3 indexes
- Normalizing timestamps to date-level granularity

The resulting table provides a **location-aware view**
of safety and hygiene signals suitable for:
- Spatial aggregation
- Trend analysis
- Downstream Gold analytics
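
The join and date-normalization steps can be sketched in plain Python on a few hypothetical records. The column names follow the cell below, while the sample values and the `build_geo_rows` helper are illustrative only. The H3 step is omitted here.

```python
from datetime import datetime

# Hypothetical sample records; real data comes from the Silver tables
orders = [
    {"order_id": 1, "store_id": "S1", "created_at_simulated": datetime(2024, 1, 5, 18, 30)},
    {"order_id": 2, "store_id": "S2", "created_at_simulated": datetime(2024, 1, 5, 23, 59)},
]
sentiment = {1: "Hygiene"}                # order_id -> ai_topic
store_locations = {"S1": (12.91, 77.52)}  # store_id -> (latitude, longitude)


def build_geo_rows(orders, sentiment, store_locations):
    """Left-join orders to sentiment and store locations, keeping unmatched orders."""
    rows = []
    for order in orders:
        lat, lon = store_locations.get(order["store_id"], (None, None))
        rows.append({
            "order_id": order["order_id"],
            # Timestamp truncated to date-level granularity
            "created_at_simulated": order["created_at_simulated"].date(),
            "store_id": order["store_id"],
            "latitude": lat,
            "longitude": lon,
            "ai_topic": sentiment.get(order["order_id"]),
        })
    return rows


rows = build_geo_rows(orders, sentiment, store_locations)
for row in rows:
    print(row)
```

As with a left join, the order without a sentiment record or a store location is kept with empty values rather than dropped. The dictionary lookup assumes at most one sentiment row per order; a Spark left join would instead emit one output row per match.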


In [0]:
from pyspark.sql.functions import col, to_date

# Check if the geospatial intelligence table already exists
if not spark.catalog.tableExists("food_delivery.gold_geo_safety_map"):
    # Load enriched order, AI sentiment, and store location data
    df_orders = spark.table("food_delivery.silver_orders_enriched")
    df_ai = spark.table("food_delivery.ai_review_sentiment")
    df_locs = spark.table("food_delivery.dim_store_locations")

    # Join datasets and generate H3 hexagonal index for each event
    df_geo_intelligence = (
        df_orders
        .join(df_ai, "order_id", "left")
        .join(df_locs, "store_id", "left")
        .withColumn("h3_index", get_h3_index(col("latitude"), col("longitude")))
        .select(
            "order_id",
            "created_at_simulated",
            "store_id",
            "latitude",
            "longitude",
            "h3_index",
            "ai_topic"
        )
    )

    # Normalize timestamp to date-level granularity
    df_geo_intelligence = df_geo_intelligence.withColumn("created_at_simulated", to_date(col("created_at_simulated")))

    # Persist the geospatial intelligence table for downstream analytics
    df_geo_intelligence.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable("food_delivery.gold_geo_safety_map")
else:
    # Skip creation if table already exists
    print("gold geo safety map table already exists → skipping creation")

# Display a sample of the geospatial intelligence table
display(spark.table("food_delivery.gold_geo_safety_map").limit(10))

## 4: Hygiene Risk Hotspot Detection

### Business Problem

Hygiene-related issues pose
**high-severity operational and reputational risk**.

During crises, these issues often:
- Cluster geographically
- Spread locally before becoming widespread
- Require immediate, location-specific intervention

---

### Approach

We identify hygiene hotspots by:
- Filtering for hygiene-related AI topics
- Aggregating events at the H3 hexagon level
- Ranking zones by incident frequency

This highlights **geographic concentrations of risk**,
enabling targeted investigation and rapid response
before issues escalate across markets.
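
The filter, aggregate, and rank pattern can be sketched in plain Python. The cell identifiers below are placeholder strings rather than real H3 indexes, and `rank_hotspots` is a hypothetical helper used only for illustration.

```python
from collections import Counter

# Hypothetical events; "cell_a" and similar stand in for real H3 indexes
events = [
    {"h3_index": "cell_a", "ai_topic": "Hygiene"},
    {"h3_index": "cell_b", "ai_topic": "Hygiene"},
    {"h3_index": "cell_a", "ai_topic": "Hygiene"},
    {"h3_index": "cell_a", "ai_topic": "Delay"},
    {"h3_index": None, "ai_topic": "Hygiene"},
]


def rank_hotspots(events, topic="Hygiene", top_n=20):
    """Count events for one topic per cell and return the most frequent cells."""
    counts = Counter(
        event["h3_index"]
        for event in events
        # Skip other topics and events that could not be mapped to a cell
        if event["ai_topic"] == topic and event["h3_index"] is not None
    )
    return counts.most_common(top_n)


print(rank_hotspots(events))
```

Events without a cell are excluded in this sketch so that unmapped records do not appear as a hotspot of their own; whether to drop or separately report them is a design choice for the pipeline.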


In [0]:
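# Identify hygiene risk hotspots by:
# - Filtering for hygiene-related AI topics
# - Aggregating events at the H3 hexagon level
# - Ranking zones by incident frequency
from pyspark.sql.functions import avg, col, count

hotspots = (
    spark.table("food_delivery.gold_geo_safety_map")
    .filter(col("ai_topic") == "Hygiene")
    # Group by hexagon only, so stores sharing a cell roll up into one zone
    .groupBy("h3_index")
    .agg(
        count("*").alias("count"),
        # Mean store coordinates act as a representative point for mapping
        avg("latitude").alias("latitude"),
        avg("longitude").alias("longitude"),
    )
    .orderBy(col("count").desc())
)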

# Display top 20 hygiene risk hotspots for targeted intervention
display(hotspots.limit(20))


## Summary

This notebook establishes the **geospatial intelligence layer**
of the Crisis Recovery Lakehouse by:

- Converting latitude–longitude data into H3 hexagonal indexes
- Enabling consistent spatial aggregation across markets
- Creating zone-level signals for crisis detection
- Providing scalable geographic features for analytics and ML

It ensures that **location is treated as a first-class signal** —
not an afterthought.


## Downstream Dependencies

The Geospatial Intelligence layer feeds:

### Gold Analytics Tables
- Market congestion analysis
- Regional SLA degradation metrics
- Geographic risk dashboards

---

### Crisis Operations & Planning
- Area-based intervention planning
- Supply rebalancing decisions
- Zone-level escalation workflows

---

### ML Feature Engineering
- Location-based risk indicators
- Spatial exposure features
- Market-level churn sensitivity signals

---

Any error in this layer directly impacts:
- Hotspot detection accuracy
- Regional decision quality
- Crisis response effectiveness

This is why geospatial logic must remain
**precise, consistent, and scalable**.