# H3 Index Analysis of Shortcuts

This notebook analyzes the spatial distribution of shortcuts using H3 geospatial indexing.

## Overview
1. Initialize the Spark session
2. Load shortcuts data
3. Add H3 information to shortcuts
4. Analyze H3 index distribution
5. Visualize spatial patterns
6. Save results
7. Clean up

In [1]:
# Import required libraries
import sys
from pathlib import Path

# Add the project's src directory to the import path
# (assumes the notebook is run from a subdirectory of the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import project utilities
from utilities import add_info_for_shortcuts, read_edges

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully")

Libraries imported successfully


## 1. Initialize Spark Session

In [2]:
# Initialize Spark session with appropriate configuration
spark = SparkSession.builder \
    .appName("H3 Shortcuts Analysis") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/19 14:56:31 WARN Utils: Your hostname, Bamdad-Beast, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/11/19 14:56:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/19 14:56:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark version: 4.0.1
Spark UI: http://10.255.255.254:4040


## 2. Load Shortcuts Data

In [3]:
# Load shortcuts from parquet
shortcuts_path = "../output/shortcuts_final"

shortcuts_df = spark.read.parquet(shortcuts_path)

print(f"Total shortcuts loaded: {shortcuts_df.count():,}")
print("\nSchema:")
shortcuts_df.printSchema()

print("\nSample data:")
shortcuts_df.show(5, truncate=False)

Total shortcuts loaded: 2,218,590

Schema:
root
 |-- incoming_edge: string (nullable = true)
 |-- outgoing_edge: string (nullable = true)
 |-- cost: double (nullable = true)
 |-- via_edge: string (nullable = true)


Sample data:
+--------------------------+--------------------------+-------------------+--------------------------+
|incoming_edge             |outgoing_edge             |cost               |via_edge                  |
+--------------------------+--------------------------+-------------------+--------------------------+
|(13250816284, 13250816290)|(13250816282, 13250816290)|0.19090000000000001|(13250816290, 13250816282)|
|(1930062545, 1114576394)  |(1114576394, 1930062609)  |0.6456000000000001 |(1114576394, 1930062609)  |
|(558687195, 9301172306)   |(9301172306, 336173436)   |1.5695666666666668 |(9301172306, 336173436)   |
|(6020960906, 6020960897)  |(6020960897, 10742571435) |1.981325           |(6020960897, 10742571435) |
|(416109492, 8175562672)   |(8175562672, 416109492
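
The edge columns are stored as strings such as `(13250816284, 13250816290)`. If the two node IDs are needed separately, one possible approach is to parse each string into a pair of integers. The helper name `parse_edge` below is hypothetical and not part of the project utilities; this is only a sketch, assuming every edge string has the `(u, v)` form shown in the sample.

```python
import ast


def parse_edge(edge):
    """Parse an edge string like "(u, v)" into a tuple of two ints."""
    u, v = ast.literal_eval(edge)
    return int(u), int(v)


# First incoming_edge value from the sample output
print(parse_edge("(13250816284, 13250816290)"))
```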

## 3. Add Geographical Information

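Use the `add_info_for_shortcuts` function to enrich each shortcut with H3 information derived from the edges table:
- `lca_res`: resolution of the LCA (lowest common ancestor)
- `via_cell`: the LCA H3 cell for the shortcut
- `via_res`: resolution of `via_cell`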

In [4]:
# Load edges data (required for add_info_for_shortcuts)
# The edges file contains H3 indices (incoming_cell, outgoing_cell) and LCA resolution
edges_path = "../data/burnaby_driving_simplified_edges_with_h3.csv"
edges_df = read_edges(spark, edges_path)

print(f"Edges loaded: {edges_df.count():,}")
edges_df.show(5)

Edges loaded: 34,765
+--------------------+---------------+---------------+-------+
|                  id|  incoming_cell|  outgoing_cell|lca_res|
+--------------------+---------------+---------------+-------+
|(250385795, 49018...|8f28de135509245|8f28de135546632|      8|
|(250385795, 37935...|8f28de135555684|8f28de135546632|      9|
|(250385795, 43138...|8f28de13500551e|8f28de135546632|      7|
|(3793579800, 2628...|8f28de135553940|8f28de135555684|     10|
|(3793579800, 2503...|8f28de135546632|8f28de135555684|      9|
+--------------------+---------------+---------------+-------+
only showing top 5 rows
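
One way to read `lca_res` is as the finest H3 resolution at which `incoming_cell` and `outgoing_cell` still share a parent cell. The sketch below computes that value directly from the index bits (resolution in bits 52-55, base cell in bits 45-51, then one 3-bit digit per resolution level). The function names are hypothetical, and how the edges file was actually produced is not shown in this notebook, so this interpretation is an assumption, although it agrees with the sample rows above.

```python
def h3_parts(cell_hex):
    """Split an H3 index (hex string) into (resolution, base_cell, digits)."""
    value = int(cell_hex, 16)
    resolution = (value >> 52) & 0xF   # bits 52-55
    base_cell = (value >> 45) & 0x7F   # bits 45-51
    # The digit for resolution level r sits in the 3 bits starting at 3 * (15 - r)
    digits = [(value >> (3 * (15 - r))) & 0x7 for r in range(1, resolution + 1)]
    return resolution, base_cell, digits


def lca_resolution(cell_a, cell_b):
    """Finest resolution at which both cells share a parent, or -1 if none."""
    _, base_a, digits_a = h3_parts(cell_a)
    _, base_b, digits_b = h3_parts(cell_b)
    if base_a != base_b:
        return -1  # different base cells: no common ancestor (convention chosen here)
    shared = 0
    for digit_a, digit_b in zip(digits_a, digits_b):
        if digit_a != digit_b:
            break
        shared += 1
    return shared


# First three rows of the edges sample above
print(lca_resolution("8f28de135509245", "8f28de135546632"))  # 8
print(lca_resolution("8f28de135555684", "8f28de135546632"))  # 9
print(lca_resolution("8f28de13500551e", "8f28de135546632"))  # 7
```

The values printed for these three pairs match the `lca_res` column in the sample.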


In [5]:
# Add geographical information to shortcuts
print("Adding geographical information to shortcuts...")

shortcuts_with_info = add_info_for_shortcuts(
    spark=spark,
    shortcuts_df=shortcuts_df,
    edges_df=edges_df
)

print("\nEnriched schema:")
shortcuts_with_info.printSchema()

print("\nSample enriched data:")
shortcuts_with_info.show(5, truncate=False)

Adding geographical information to shortcuts...

Enriched schema:
root
 |-- incoming_edge: string (nullable = true)
 |-- outgoing_edge: string (nullable = true)
 |-- cost: double (nullable = true)
 |-- via_edge: string (nullable = true)
 |-- lca_res: integer (nullable = true)
 |-- via_cell: string (nullable = true)
 |-- via_res: integer (nullable = true)


Sample enriched data:
+--------------------------+--------------------------+-------------------+--------------------------+-------+---------------+-------+
|incoming_edge             |outgoing_edge             |cost               |via_edge                  |lca_res|via_cell       |via_res|
+--------------------------+--------------------------+-------------------+--------------------------+-------+---------------+-------+
|(13250816284, 13250816290)|(13250816282, 13250816290)|0.19090000000000001|(13250816290, 13250816282)|13     |8d28de8884250bf|13     |
|(1930062545, 1114576394)  |(1114576394, 1930062609)  |0.6456000000000001 |(111

## 4. H3 Index Analysis

Analyze the spatial distribution of shortcuts using H3 indices.

In [6]:
# Check available columns after enrichment
print("Available columns:")
for col in shortcuts_with_info.columns:
    print(f"  - {col}")

# The add_info_for_shortcuts function adds:
# - via_cell: The LCA (Lowest Common Ancestor) cell for the shortcut
# - via_res: The resolution of the via_cell
# - lca_res: The resolution of the LCA

Available columns:
  - incoming_edge
  - outgoing_edge
  - cost
  - via_edge
  - lca_res
  - via_cell
  - via_res
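
Since `via_res` is described as the resolution of `via_cell`, the two columns can be cross-checked against each other: an H3 index stores its resolution in bits 52-55, which for the 15-character hex strings used here is the second hex digit. A minimal sketch, with a hypothetical helper name:

```python
def h3_resolution(cell_hex):
    """Read the resolution field (bits 52-55) from an H3 index given as hex."""
    return (int(cell_hex, 16) >> 52) & 0xF


# via_cell from the first enriched sample row, where via_res is 13
print(h3_resolution("8d28de8884250bf"))  # 13
```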


In [7]:
# Analyze H3 resolution distribution
print("Analyzing H3 resolution distribution...")

# Cache the enriched data for faster analysis
shortcuts_with_info.cache()

print(f"Total enriched shortcuts: {shortcuts_with_info.count():,}")

# Count shortcuts by resolution
res_counts = shortcuts_with_info.groupBy("via_res").count().orderBy("via_res")
print("\nShortcuts by H3 Resolution:")
res_counts.show()

Analyzing H3 resolution distribution...

Total enriched shortcuts: 2,218,590

Shortcuts by H3 Resolution:
+-------+------+
|via_res| count|
+-------+------+
|      3| 40755|
|      4|  6393|
|      5| 85084|
|      6|538416|
|      7|493532|
|      8|473907|
|      9|289882|
|     10|143133|
|     11| 40309|
|     12|  7935|
|     13|  1073|
|     14|   167|
|     15| 98004|
+-------+------+
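
Raw counts are easier to compare as shares of the total. The sketch below recomputes them in plain Python from the counts printed above; the variable names are illustrative, and in the notebook the same figures could be derived from `res_counts` directly.

```python
# Counts copied from the res_counts output above (via_res -> count)
res_counts_dict = {
    3: 40755, 4: 6393, 5: 85084, 6: 538416, 7: 493532, 8: 473907,
    9: 289882, 10: 143133, 11: 40309, 12: 7935, 13: 1073, 14: 167, 15: 98004,
}

total = sum(res_counts_dict.values())  # equals the 2,218,590 enriched shortcuts
shares = {res: count / total for res, count in res_counts_dict.items()}

for res, share in shares.items():
    print(f"res {res:>2}: {share:6.2%}")

# Combined share of the three most populated resolutions
mid_share = sum(shares[res] for res in (6, 7, 8))
print(f"Resolutions 6-8 combined: {mid_share:.1%}")
```

Resolutions 6 to 8 together account for roughly two thirds of all shortcuts.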



In [8]:
# Group by H3 cell and calculate statistics
# The enriched schema exposes the cell as via_cell, so it is used as the grouping key.
# Left commented out; uncomment to compute the statistics.

# h3_stats = shortcuts_with_info.groupBy("via_cell") \
#     .agg(
#         F.count("*").alias("shortcut_count"),
#         F.avg("cost").alias("avg_cost"),
#         F.min("cost").alias("min_cost"),
#         F.max("cost").alias("max_cost")
#     ) \
#     .orderBy(F.desc("shortcut_count"))

# h3_stats.show(20)

print("H3 statistics calculation placeholder - adjust based on actual schema")

H3 statistics calculation placeholder - adjust based on actual schema


## 5. Spatial Distribution Visualization

In [9]:
# Visualize H3 distribution
# This section will be populated once we know the exact schema

print("Visualization placeholder - to be implemented based on actual H3 columns")

Visualization placeholder - to be implemented based on actual H3 columns


In [10]:
# Sample data for visualization
sample_size = 10000
sample_df = shortcuts_with_info.sample(False, min(sample_size / shortcuts_with_info.count(), 1.0))
sample_pd = sample_df.toPandas()

print(f"Sample size for visualization: {len(sample_pd):,}")

Sample size for visualization: 10,230
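
The sample contains 10,230 rows rather than exactly 10,000 because `DataFrame.sample` keeps each row independently with the given probability instead of drawing a fixed number of rows. The sketch below works out the fraction and the spread to expect under a binomial model of that sampling; the variable names are illustrative.

```python
import math

total_rows = 2_218_590   # enriched shortcuts, from the count above
sample_size = 10_000     # target sample size used in the notebook

# Per-row keep probability passed to sample()
fraction = min(sample_size / total_rows, 1.0)

# Under independent per-row sampling the row count is binomial(total_rows, fraction)
expected_rows = total_rows * fraction
std_rows = math.sqrt(total_rows * fraction * (1 - fraction))

print(f"fraction = {fraction:.6f}")
print(f"expected rows = {expected_rows:.0f} +/- {std_rows:.0f}")
```

If an exact upper bound matters, one option is to oversample slightly and then apply `limit(sample_size)`; passing a `seed` to `sample` makes the draw repeatable.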


## 6. Save Results

In [11]:
# Save enriched shortcuts with H3 information
output_path = "../output/shortcuts_with_h3.parquet"

shortcuts_with_info.write.mode("overwrite").parquet(output_path)

print(f"✓ Enriched shortcuts saved to: {output_path}")



✓ Enriched shortcuts saved to: ../output/shortcuts_with_h3.parquet

In [12]:
# Save H3 statistics
# h3_stats_path = "../output/h3_statistics.parquet"
# h3_stats.write.mode("overwrite").parquet(h3_stats_path)
# print(f"✓ H3 statistics saved to: {h3_stats_path}")

print("Statistics save placeholder")

Statistics save placeholder


## 7. Cleanup

In [None]:
# Stop Spark session
spark.stop()
print("Spark session stopped.")