# Apply Pre-Trained Routing Disagreement Models

**This is the recommended starting point for most users.**

This notebook demonstrates how to:
1. Load the pre-trained models
2. Generate the 6 required predictor features for your O-D pairs
3. Predict where routing platforms will disagree
4. Interpret and use the results

## What the Models Predict

- **Distance Model**: Predicts whether platforms will disagree by more than 5% on walking distance
- **Time Model**: Predicts whether platforms will disagree by more than 20% on walking time
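
The notebook does not say how the disagreement percentage was measured when the models were trained. Purely as an illustration, the sketch below assumes it is the gap between the largest and smallest platform estimate, relative to the smallest. The function names and sample numbers are hypothetical.

```python
def relative_spread_pct(estimates):
    """Gap between the largest and smallest estimate, as a % of the smallest.

    ASSUMPTION: this definition is illustrative; the trained models may
    have used a different measure of disagreement.
    """
    lo, hi = min(estimates), max(estimates)
    if lo <= 0:
        raise ValueError("estimates must be positive")
    return (hi - lo) / lo * 100


def disagrees(estimates, threshold_pct):
    """True when the relative spread exceeds the threshold."""
    return relative_spread_pct(estimates) > threshold_pct


# Hypothetical walking distances (m) and times (min) from three platforms
distances_m = [1000.0, 1030.0, 1080.0]
times_min = [12.0, 13.0, 14.0]

print(disagrees(distances_m, threshold_pct=5))   # spread is about 8%
print(disagrees(times_min, threshold_pct=20))    # spread is about 16.7%
```

Under this assumed definition, the first route would count as a distance disagreement while the second would not count as a time disagreement.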

## Requirements

- Origin/destination coordinates (latitude, longitude)
- Google Earth Engine account (free) for elevation data
- US Census API key (free) for population data

## Step 0: Setup

In [None]:
# Install dependencies
!pip install osmnx geopandas shapely pyproj geopy earthengine-api census pygris joblib pandas numpy tqdm -q

# Clone the repository to get the pre-trained models
!git clone https://github.com/Malmusidi/routing-prediction.git repo 2>/dev/null || echo "Repository exists or using local files"

print("✓ Setup complete")

In [None]:
#@title Configuration { display-mode: "form" }

# If you cloned the repo, models are in repo/models/
# Otherwise, upload the model files and update these paths
DISTANCE_MODEL_PATH = 'repo/models/distance_model.joblib'  #@param {type:"string"}
TIME_MODEL_PATH = 'repo/models/time_model.joblib'  #@param {type:"string"}

# API Keys (get these for free - links in README)
CENSUS_API_KEY = 'YOUR_CENSUS_API_KEY'  #@param {type:"string"}
GEE_PROJECT_ID = 'YOUR_GEE_PROJECT_ID'  #@param {type:"string"}

# Census geography for your study area
STATE_FIPS = '13'  #@param {type:"string"}
COUNTY_FIPS = '059'  #@param {type:"string"}

print("Configuration set!")
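
Placeholder keys or malformed FIPS codes only surface much later, when the Census or Earth Engine calls fail. A minimal sketch of an early check is shown below. It assumes placeholders start with `YOUR_`, as in the configuration cell, and that state and county FIPS codes are 2-digit and 3-digit strings. The helper name `config_problems` is hypothetical.

```python
import re


def config_problems(census_key, gee_project, state_fips, county_fips):
    """Return a list of readable problems; an empty list means the config looks usable."""
    problems = []
    if not census_key or census_key.startswith("YOUR_"):
        problems.append("CENSUS_API_KEY is still a placeholder")
    if not gee_project or gee_project.startswith("YOUR_"):
        problems.append("GEE_PROJECT_ID is still a placeholder")
    if not re.fullmatch(r"\d{2}", state_fips):
        problems.append("STATE_FIPS should be a 2-digit string, e.g. '13'")
    if not re.fullmatch(r"\d{3}", county_fips):
        problems.append("COUNTY_FIPS should be a 3-digit string, e.g. '059'")
    return problems


# Example with the notebook's default values
for problem in config_problems("YOUR_CENSUS_API_KEY", "YOUR_GEE_PROJECT_ID", "13", "059"):
    print("⚠️", problem)
```

In the notebook, the same function could be called with `CENSUS_API_KEY`, `GEE_PROJECT_ID`, `STATE_FIPS`, and `COUNTY_FIPS`.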

In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd
import osmnx as ox
from shapely.geometry import Point
from geopy.distance import geodesic
from collections import Counter
from tqdm.auto import tqdm
import joblib
import warnings
warnings.filterwarnings('ignore')

# Configure OSMnx
ox.settings.log_console = False
ox.settings.use_cache = True

print("✓ Libraries imported")

## Step 1: Load Pre-Trained Models

In [None]:
# Load the pre-trained models
try:
    distance_model = joblib.load(DISTANCE_MODEL_PATH)
    time_model = joblib.load(TIME_MODEL_PATH)
    print("✓ Models loaded successfully!")
    print(f"  Distance model: {distance_model.n_estimators} trees")
    print(f"  Time model: {time_model.n_estimators} trees")
except FileNotFoundError:
    print("⚠️ Model files not found!")
    print("\nOption 1: Upload the model files manually")
    print("Option 2: Update the model paths in the configuration above")
    print("Option 3: Clone the repository with the correct username")
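
Before predicting, it can help to confirm what the loaded objects expect. The sketch below assumes the models follow scikit-learn conventions (the notebook reads `n_estimators`, but does not name the library), so every attribute is looked up defensively. `describe_model` is a hypothetical helper, demonstrated on a stand-in object.

```python
from types import SimpleNamespace


def describe_model(model):
    """Collect optional metadata; attributes that are absent come back as None."""
    names = getattr(model, "feature_names_in_", None)
    return {
        "type": type(model).__name__,
        "n_features": getattr(model, "n_features_in_", None),
        "feature_names": list(names) if names is not None else None,
        "classes": getattr(model, "classes_", None),
    }


# Stand-in object for demonstration; in the notebook, pass distance_model or time_model
fake_model = SimpleNamespace(n_features_in_=6, classes_=[0, 1])
print(describe_model(fake_model))
```

If `feature_names` is reported, it should match the six feature columns in the same order used for prediction.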

## Step 2: Prepare Your O-D Pairs

You can either:
- **Upload a CSV** with your O-D coordinates
- **Create sample data** to test the workflow
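
An uploaded CSV may be missing columns or contain unusable coordinates. The sketch below is one way to check it before computing features. It assumes the default column names used in this notebook, and `validate_od_pairs` is a hypothetical helper rather than part of the repository.

```python
import pandas as pd

REQUIRED_COLS = ["origin_lat", "origin_lon", "dest_lat", "dest_lon"]


def validate_od_pairs(df):
    """Raise on missing columns; return a boolean mask of rows with usable coordinates."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Non-numeric entries become NaN and fail the range checks below
    coords = df[REQUIRED_COLS].apply(pd.to_numeric, errors="coerce")
    lat_ok = coords[["origin_lat", "dest_lat"]].abs().le(90).all(axis=1)
    lon_ok = coords[["origin_lon", "dest_lon"]].abs().le(180).all(axis=1)
    return lat_ok & lon_ok & coords.notna().all(axis=1)


sample = pd.DataFrame({
    "origin_lat": [33.9519, 133.0],      # second row has an impossible latitude
    "origin_lon": [-83.3576, -83.3780],
    "dest_lat": [33.9545, 33.9580],
    "dest_lon": [-83.3620, None],        # and a missing longitude
})
print(validate_od_pairs(sample).tolist())
```

Rows where the mask is `False` can be dropped or corrected before moving on to Step 3.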

In [None]:
#@title Choose your data source { display-mode: "form" }

data_source = "Create sample data"  #@param ["Upload CSV", "Create sample data"]

if data_source == "Upload CSV":
    from google.colab import files
    print("Please upload your CSV file...")
    uploaded = files.upload()
    filename = list(uploaded.keys())[0]
    df = pd.read_csv(filename)
    print(f"\n✓ Loaded {len(df)} O-D pairs")
    print(f"Columns: {list(df.columns)}")
    
else:
    # Create sample O-D pairs in Athens-Clarke County, Georgia
    # These are example coordinates for demonstration
    sample_data = {
        'origin_lat': [33.9519, 33.9612, 33.9478, 33.9550, 33.9680],
        'origin_lon': [-83.3576, -83.3780, -83.3890, -83.3650, -83.3920],
        'dest_lat': [33.9545, 33.9580, 33.9510, 33.9590, 33.9620],
        'dest_lon': [-83.3620, -83.3750, -83.3850, -83.3700, -83.3880]
    }
    df = pd.DataFrame(sample_data)
    print("✓ Created sample data with 5 O-D pairs in Athens-Clarke County, GA")

# Column names (edit these if your uploaded CSV uses different headers)
COL_ORIGIN_LAT = 'origin_lat'
COL_ORIGIN_LON = 'origin_lon'
COL_DEST_LAT = 'dest_lat'
COL_DEST_LON = 'dest_lon'

df.head()

## Step 3: Generate Predictor Features

The models require these 6 features:
1. `Straight_Line_Distance_m`
2. `Origin_Road_Length_Density_m_km2`
3. `Dest_Intersection_Density_n_km2`
4. `Slope_Pct`
5. `Elevation_Difference_m`
6. `Population`

In [None]:
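# Constants
BUFFER_M = 400      # buffer radius around each point, in CRS_METRIC units
CRS_METRIC = 3857   # Web Mercator; its units only approximate ground meters away from the equator

# Helper functions
def area_km2(geom):
    """Area of a projected geometry in km² (expects coordinates in meters)."""
    return geom.area / 1_000_000

def buffer_from_latlon(lat, lon):
    """Build a circular buffer of radius BUFFER_M around a WGS84 point, in CRS_METRIC."""
    pt = Point(lon, lat)
    return gpd.GeoSeries([pt], crs=4326).to_crs(CRS_METRIC).buffer(BUFFER_M).iloc[0]

def calculate_haversine(row):
    """Straight-line O-D distance in meters.

    Despite the name, this uses geopy's geodesic (ellipsoidal) distance
    rather than the spherical haversine formula.
    """
    origin = (row[COL_ORIGIN_LAT], row[COL_ORIGIN_LON])
    dest = (row[COL_DEST_LAT], row[COL_DEST_LON])
    return geodesic(origin, dest).meters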

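def net_metrics(buf_metric):
    """Return (intersection density per km², road length density in m/km²) within a buffer.

    Both values stay at 0.0 when no network is found or the query fails.
    """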
    intersect_density = road_len_density = 0.0
    area = area_km2(buf_metric)
    buf_wgs = gpd.GeoSeries([buf_metric], crs=CRS_METRIC).to_crs(4326).iloc[0]
    
    try:
        G = ox.graph_from_polygon(buf_wgs, network_type="drive_service", 
                                   simplify=True, retain_all=False)
        if len(G.nodes) and len(G.edges):
            nodes, edges = ox.graph_to_gdfs(G, nodes=True, edges=True)
            edges_m = edges.to_crs(CRS_METRIC)
            road_len_density = edges_m.geometry.length.sum() / area
            
            nodes_m = nodes.to_crs(CRS_METRIC)
            nodes_cl = gpd.clip(nodes_m, buf_metric)
            if "street_count" in nodes_cl.columns:
                inter_cnt = (nodes_cl["street_count"] >= 3).sum()
            else:
                deg = Counter([u for u, v, k in G.edges(keys=True)] +
                            [v for u, v, k in G.edges(keys=True)])
                inter_cnt = sum(1 for n in nodes_cl.index if deg.get(n, 0) >= 3)
            intersect_density = inter_cnt / area
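    except Exception as exc:
        # Leave both densities at 0.0, but report the failure instead of hiding it
        print(f"  ⚠️ Network query failed ({type(exc).__name__}); densities left at 0")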
    return intersect_density, road_len_density

print("✓ Helper functions defined")

In [None]:
# Feature 1: Straight-line distance
print("Calculating straight-line distances...")
df['Straight_Line_Distance_m'] = df.apply(calculate_haversine, axis=1)
print(f"✓ Distances: {df['Straight_Line_Distance_m'].min():.0f} - {df['Straight_Line_Distance_m'].max():.0f} m")
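
The helper named `calculate_haversine` actually calls geopy's `geodesic`, which measures distance on an ellipsoid. For comparison, the sketch below implements the spherical haversine formula with the standard library only. The Earth radius is an assumed constant, and the two methods typically differ by well under 1% at these distances.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# First sample O-D pair from Step 2
d = haversine_m(33.9519, -83.3576, 33.9545, -83.3620)
print(f"{d:.0f} m")
```

Because the models were presumably trained on geodesic distances, this function is meant for sanity checks, not as a replacement for the feature.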

In [None]:
# Features 2-3: Network metrics
print("Calculating network metrics (this may take a moment)...")

df['Origin_Road_Length_Density_m_km2'] = 0.0
df['Dest_Intersection_Density_n_km2'] = 0.0

for i, row in tqdm(df.iterrows(), total=len(df), desc="Processing"):
    # Origin road density
    buf_origin = buffer_from_latlon(row[COL_ORIGIN_LAT], row[COL_ORIGIN_LON])
    _, rd_o = net_metrics(buf_origin)
    df.at[i, 'Origin_Road_Length_Density_m_km2'] = rd_o
    
    # Destination intersection density
    buf_dest = buffer_from_latlon(row[COL_DEST_LAT], row[COL_DEST_LON])
    id_d, _ = net_metrics(buf_dest)
    df.at[i, 'Dest_Intersection_Density_n_km2'] = id_d

print("✓ Network metrics calculated")
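
The loop above runs one network query per origin and one per destination, even when several O-D pairs share a location. OSMnx caches raw HTTP responses when `use_cache` is on, but graph construction and clipping still rerun each time. A minimal memoization sketch is shown below; `cached_by_location` is a hypothetical wrapper, and rounding to 5 decimals (roughly 1 m) is an assumed tolerance.

```python
from functools import lru_cache


def cached_by_location(metric_fn, decimals=5):
    """Wrap metric_fn(lat, lon) so repeated coordinates are computed only once."""
    @lru_cache(maxsize=None)
    def _cached(lat_key, lon_key):
        return metric_fn(lat_key, lon_key)

    def wrapper(lat, lon):
        # Rounded coordinates form the cache key
        return _cached(round(lat, decimals), round(lon, decimals))

    return wrapper


# Demonstration with a stand-in for the slow network query
calls = []

def slow_metrics(lat, lon):
    calls.append((lat, lon))
    return (0.0, 0.0)  # placeholder for (intersection density, road length density)

fast_metrics = cached_by_location(slow_metrics)
fast_metrics(33.9519, -83.3576)
fast_metrics(33.9519, -83.3576)   # served from the cache
print(len(calls))
```

In the notebook, the wrapped function could be `lambda lat, lon: net_metrics(buffer_from_latlon(lat, lon))`.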

In [None]:
# Features 4-5: Topographic data from Google Earth Engine
print("Fetching elevation data from Google Earth Engine...")

import ee
ee.Authenticate()
ee.Initialize(project=GEE_PROJECT_ID)

DEM = ee.Image('USGS/SRTMGL1_003')

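def get_elevation(lat, lon):
    """Get SRTM elevation (m) at a point; each call is one round trip to Earth Engine."""
    point = ee.Geometry.Point(lon, lat)
    elev = DEM.sample(point, 30).first().get('elevation').getInfo()
    # Missing samples fall back to 0
    return elev if elev is not None else 0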

# Get elevations
df['elev_origin'] = df.apply(lambda r: get_elevation(r[COL_ORIGIN_LAT], r[COL_ORIGIN_LON]), axis=1)
df['elev_dest'] = df.apply(lambda r: get_elevation(r[COL_DEST_LAT], r[COL_DEST_LON]), axis=1)

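# Calculate derived features
df['Elevation_Difference_m'] = df['elev_dest'] - df['elev_origin']
# Guard against zero-length pairs (origin == destination), which would otherwise yield NaN or inf
safe_distance = df['Straight_Line_Distance_m'].replace(0, np.nan)
df['Slope_Pct'] = (df['Elevation_Difference_m'] / safe_distance * 100).fillna(0)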

# Clean up temp columns
df = df.drop(columns=['elev_origin', 'elev_dest'])

print("✓ Topographic features calculated")
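
`Slope_Pct` is the signed elevation change divided by the straight-line distance, expressed as a percentage. It is an average grade along the straight line, not along the walked route. The worked example below uses made-up elevations. Treating coincident points as flat is an assumption of this sketch.

```python
def slope_pct(elev_origin_m, elev_dest_m, straight_line_m):
    """Signed average grade between origin and destination, in percent."""
    if straight_line_m == 0:
        return 0.0  # ASSUMPTION: treat coincident points as flat
    return (elev_dest_m - elev_origin_m) / straight_line_m * 100


print(slope_pct(200.0, 212.0, 400.0))   # 12 m rise over 400 m
print(slope_pct(212.0, 200.0, 400.0))   # same pair reversed
```

A 12 m rise over 400 m gives a grade of about 3%, and reversing the direction flips the sign, so uphill and downhill trips between the same two points get different feature values.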

In [None]:
# Feature 6: Population from Census
print("Fetching population data from Census...")

from census import Census
import pygris

# Download block groups
block_groups = pygris.block_groups(state=STATE_FIPS, county=COUNTY_FIPS, year=2020, cache=True)
block_groups = block_groups.to_crs(CRS_METRIC)

# Get population data
c = Census(CENSUS_API_KEY)
pop_data = c.pl.get(('P1_001N',), geo={'for': 'block group:*', 
                    'in': f'state:{STATE_FIPS} county:{COUNTY_FIPS}'}, year=2020)
pop_df = pd.DataFrame(pop_data).rename(columns={'P1_001N': 'Population'})
pop_df['GEOID'] = pop_df['state'] + pop_df['county'] + pop_df['tract'] + pop_df['block group']
pop_df = pop_df[['GEOID', 'Population']].astype({'Population': float})

# Merge and spatial join
block_groups = block_groups.merge(pop_df, on='GEOID', how='left')
block_groups['Population'] = block_groups['Population'].fillna(0)

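# Population is taken from the block group containing the origin point
geometry = [Point(lon, lat) for lon, lat in zip(df[COL_ORIGIN_LON], df[COL_ORIGIN_LAT])]
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs=4326).to_crs(CRS_METRIC)
gdf_joined = gdf.sjoin(block_groups[['GEOID', 'Population', 'geometry']], how='left', predicate='within')
# Keep one match per O-D pair so the result lines up with df row for row
gdf_joined = gdf_joined[~gdf_joined.index.duplicated(keep='first')]

df['Population'] = gdf_joined['Population'].fillna(0).values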

print("✓ Population data added")
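
The merge between block group geometries and population counts depends on `GEOID` strings matching exactly. A block group GEOID is 12 characters: 2 for state, 3 for county, 6 for tract, and 1 for block group. If leading zeros are lost, the merge silently finds no match and the population falls back to 0. The sketch below rebuilds a GEOID with zero padding; the helper name and the tract and block group values are made up for illustration.

```python
def block_group_geoid(state, county, tract, block_group):
    """Concatenate zero-padded FIPS parts into a 12-character block group GEOID."""
    geoid = f"{state:0>2}{county:0>3}{tract:0>6}{block_group}"
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"Unexpected GEOID: {geoid!r}")
    return geoid


# Tract and block group values are hypothetical
print(block_group_geoid("13", "059", "000100", "1"))
```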

In [None]:
# Verify all features are present
FEATURE_COLS = [
    'Straight_Line_Distance_m',
    'Origin_Road_Length_Density_m_km2',
    'Dest_Intersection_Density_n_km2',
    'Slope_Pct',
    'Elevation_Difference_m',
    'Population'
]

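print("\nFeature Summary:")
print("="*50)
missing_cols = [col for col in FEATURE_COLS if col not in df.columns]
for col in FEATURE_COLS:
    if col in df.columns:
        print(f"✓ {col:40} Mean: {df[col].mean():10.2f}")
    else:
        print(f"✗ {col:40} MISSING")

print("="*50)
if missing_cols:
    print(f"\n⚠️ {len(missing_cols)} feature(s) missing; rerun the cells above before predicting")
else:
    print("\n✓ All features ready for prediction!")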
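
Having every column present does not guarantee usable values. Many estimators reject or mishandle NaN and infinite inputs, which can arise from a zero straight-line distance, for example. The sketch below flags such rows; `non_finite_rows` is a hypothetical helper and the demo values are invented.

```python
import numpy as np
import pandas as pd


def non_finite_rows(features):
    """Boolean mask of rows containing NaN or +/-inf in any column."""
    values = features.to_numpy(dtype=float)
    return pd.Series(~np.isfinite(values).all(axis=1), index=features.index)


demo = pd.DataFrame({
    "Straight_Line_Distance_m": [498.0, 0.0],
    "Slope_Pct": [1.2, np.inf],   # e.g. a division by a zero distance
})
print(non_finite_rows(demo).tolist())
```

In the notebook, `non_finite_rows(df[FEATURE_COLS])` would give a mask of rows to inspect before Step 4.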

## Step 4: Make Predictions

In [None]:
# Prepare feature matrix
X = df[FEATURE_COLS]

# Make predictions
df['distance_disagreement'] = distance_model.predict(X)
df['time_disagreement'] = time_model.predict(X)

# Get prediction probabilities
df['distance_disagreement_prob'] = distance_model.predict_proba(X)[:, 1]
df['time_disagreement_prob'] = time_model.predict_proba(X)[:, 1]

print("\n" + "="*60)
print("PREDICTION RESULTS")
print("="*60)
print(f"\nTotal O-D pairs analyzed: {len(df)}")
print(f"\nDistance Disagreement (>5%):")
print(f"  Predicted to agree: {(df['distance_disagreement']==0).sum()}")
print(f"  Predicted to disagree: {(df['distance_disagreement']==1).sum()}")
print(f"\nTime Disagreement (>20%):")
print(f"  Predicted to agree: {(df['time_disagreement']==0).sum()}")
print(f"  Predicted to disagree: {(df['time_disagreement']==1).sum()}")
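
The `*_disagreement` columns are hard labels, while the `*_disagreement_prob` columns allow a different cut-off. The sketch below applies a custom threshold to a list of probabilities. The default of 0.5 is an assumption that approximates what a binary classifier's `predict` usually does, and the probabilities are made up.

```python
def flag_by_threshold(probabilities, threshold=0.5):
    """Return 1 where the disagreement probability reaches the threshold, else 0."""
    return [int(p >= threshold) for p in probabilities]


probs = [0.15, 0.42, 0.58, 0.91]                 # made-up probabilities
print(flag_by_threshold(probs))                  # default cut-off
print(flag_by_threshold(probs, threshold=0.4))   # lower cut-off flags more routes
```

Lowering the threshold sends more routes to multi-platform validation, trading extra queries for fewer missed disagreements.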

In [None]:
# View results
result_cols = [COL_ORIGIN_LAT, COL_ORIGIN_LON, COL_DEST_LAT, COL_DEST_LON,
               'Straight_Line_Distance_m', 'distance_disagreement', 
               'distance_disagreement_prob', 'time_disagreement', 
               'time_disagreement_prob']

print("\nDetailed Results:")
df[result_cols]

## Step 5: Interpret and Use Results

### How to Use These Predictions

**Option 1: Selective Multi-Platform Validation**
- Query all 3 platforms only for routes where `distance_disagreement` or `time_disagreement` is 1
- Use single-platform results for routes where both are 0

**Option 2: Uncertainty Reporting**
- Report the percentage of high-disagreement routes as a measure of analytical uncertainty
- Higher percentages indicate less reliable accessibility estimates
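
When the number of multi-platform queries is limited, selective validation can be ordered by predicted probability. The sketch below ranks routes by the larger of their two disagreement probabilities and keeps the top few. The scoring rule, the function name, and the numbers are assumptions for illustration.

```python
def routes_to_query(dist_probs, time_probs, budget):
    """Indices of the `budget` routes with the highest disagreement probability.

    Each route is scored by the larger of its two probabilities.
    """
    scores = [max(d, t) for d, t in zip(dist_probs, time_probs)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


dist_probs = [0.10, 0.80, 0.35, 0.60]   # made-up values
time_probs = [0.20, 0.30, 0.90, 0.10]
print(routes_to_query(dist_probs, time_probs, budget=2))
```

With the notebook's results, the inputs would be `df['distance_disagreement_prob']` and `df['time_disagreement_prob']`.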

In [None]:
# Calculate uncertainty metrics for your analysis
pct_dist_disagree = (df['distance_disagreement'] == 1).mean() * 100
pct_time_disagree = (df['time_disagreement'] == 1).mean() * 100

print("\n" + "="*60)
print("UNCERTAINTY ASSESSMENT")
print("="*60)
print(f"\nRoutes predicted to have platform disagreement:")
print(f"  Distance (>5% discrepancy): {pct_dist_disagree:.1f}%")
print(f"  Time (>20% discrepancy): {pct_time_disagree:.1f}%")

if pct_dist_disagree > 50 or pct_time_disagree > 50:
    print("\n⚠️ High uncertainty: Consider validating with multiple platforms")
elif pct_dist_disagree > 25 or pct_time_disagree > 25:
    print("\n⚡ Moderate uncertainty: Selective validation recommended")
else:
    print("\n✓ Low uncertainty: Single-platform results likely reliable")

In [None]:
# Identify routes that need validation
routes_to_validate = df[
    (df['distance_disagreement'] == 1) | (df['time_disagreement'] == 1)
].copy()

print(f"\nRoutes requiring multi-platform validation: {len(routes_to_validate)} / {len(df)}")

if len(routes_to_validate) > 0:
    print("\nThese routes are predicted to have significant platform disagreement:")
    display(routes_to_validate[[COL_ORIGIN_LAT, COL_ORIGIN_LON, 
                                 COL_DEST_LAT, COL_DEST_LON,
                                 'distance_disagreement_prob', 
                                 'time_disagreement_prob']])

## Step 6: Export Results

In [None]:
# Save results to CSV
output_file = 'routing_predictions.csv'  # written to the working directory (/content on Colab)
df.to_csv(output_file, index=False)

print(f"\n✓ Results saved to {output_file}")
print(f"\nColumns in output file:")
for col in df.columns:
    print(f"  - {col}")

In [None]:
# Download results (Colab only)
try:
    from google.colab import files
    files.download(output_file)
    print("✓ Download started")
except ImportError:
    print(f"Not running in Colab; the file is at {output_file}")

## Summary

You've successfully:
1. ✓ Loaded pre-trained routing disagreement models
2. ✓ Generated predictor features for your O-D pairs
3. ✓ Predicted where routing platforms will disagree
4. ✓ Identified routes requiring multi-platform validation

### Next Steps
- For routes with predicted disagreement, query Google Maps, ArcGIS, and openrouteservice (ORS)
- Reconcile estimates through averaging or manual validation
- Report uncertainty metrics in your accessibility analysis
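
Reconciling by averaging can be sketched as follows. It assumes the platform estimates have already been collected into a dictionary per route; the keys and values are hypothetical, and failed queries are marked with `None`.

```python
from statistics import mean, median


def reconcile(estimates):
    """Summarise several platform estimates for one route."""
    values = [v for v in estimates.values() if v is not None]
    if not values:
        raise ValueError("no estimates available")
    return {"mean": mean(values), "median": median(values), "n_platforms": len(values)}


# Hypothetical walking distances in meters; None marks a failed query
route = {"google_maps": 1040.0, "arcgis": 1000.0, "ors": None}
print(reconcile(route))
```

Reporting `n_platforms` alongside the summary makes it clear when an average rests on fewer than three sources. With three or more estimates the median is less sensitive to a single outlier than the mean.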

### Questions?
See the README or open an issue on GitHub.