# ML-Accelerated Hosting Capacity Screening

From the [Sisyphean Gridworks ML Playground](https://sgridworks.com/ml-playground/guides/11-advanced-hosting-capacity.html)

## Setup

Clone the repository and install dependencies. Run this cell first.

In [None]:
!git clone https://github.com/SGridworks/Dynamic-Network-Model.git 2>/dev/null || echo 'Already cloned'
%cd Dynamic-Network-Model
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm pyarrow

## Load SP&L Data and Compute Hosting Capacity

We load the SP&L network, transformer, solar, and load profile data using the data loader API. Then we compute hosting capacity per transformer as a simplified estimate: rated kVA minus existing solar capacity minus peak load. These computed values become the labels for supervised learning.
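As a quick worked example of the label formula, the sketch below applies it to three made-up transformers. The values and the helper name `hosting_capacity_kw_toy` are hypothetical and not taken from the SP&L dataset. Like the notebook, it treats kVA as kW (unity power factor) and floors the result at zero.

```python
# Toy illustration of the simplified hosting capacity label.
# All numbers are hypothetical; they are not SP&L data.
def hosting_capacity_kw_toy(kva_rating, existing_solar_kw, allocated_load_kw):
    """Rated capacity minus existing solar minus allocated peak load, floored at 0."""
    return max(kva_rating - existing_solar_kw - allocated_load_kw, 0.0)

toy_transformers = [
    {"id": "T1", "kva_rating": 100.0, "solar_kw": 10.0, "load_kw": 40.0},
    {"id": "T2", "kva_rating": 50.0, "solar_kw": 0.0, "load_kw": 60.0},   # overloaded
    {"id": "T3", "kva_rating": 75.0, "solar_kw": 25.0, "load_kw": 30.0},
]

toy_hc = {
    t["id"]: hosting_capacity_kw_toy(t["kva_rating"], t["solar_kw"], t["load_kw"])
    for t in toy_transformers
}
print(toy_hc)  # {'T1': 50.0, 'T2': 0.0, 'T3': 20.0}
```

T2 shows why the floor matters: without it, an overloaded transformer would get a negative hosting capacity.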

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import time

from demo_data.load_demo_data import (
    load_network_nodes,
    load_network_edges,
    load_transformers,
    load_solar_installations,
    load_load_profiles,
)

# --- Load all datasets via the SP&L data loader ---
nodes = load_network_nodes()         # ~44k nodes with lat/lon
edges = load_network_edges()         # ~44k edges with impedance, rated amps
transformers = load_transformers()   # ~21k transformers with kva_rating
solar = load_solar_installations()   # ~17k solar installs with capacity_kw
load_profiles = load_load_profiles() # 15-min feeder load profiles

print(f"Network nodes:        {len(nodes):,}")
print(f"Network edges:        {len(edges):,}")
print(f"Transformers:         {len(transformers):,}")
print(f"Solar installations:  {len(solar):,}")
print(f"Load profile rows:    {len(load_profiles):,}")

## Build Transformer and Network Features

A good surrogate model needs features that capture the physical factors driving hosting capacity: how far the transformer is from the substation, how much load is nearby, how stiff the local network is. We engineer six groups of features from the SP&L data.
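The distance feature rests on a flat-earth approximation that converts degree differences into kilometres. The sketch below is a slightly more general variant rather than the notebook's exact code: it scales longitude by the cosine of the mean latitude instead of using a fixed factor. The function name `approx_distance_km` and the sample coordinates are hypothetical.

```python
import math

KM_PER_DEG_LAT = 111.0  # approximate length of one degree of latitude

def approx_distance_km(lat1, lon1, lat2, lon2):
    """Equirectangular distance estimate; adequate over a few tens of km."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat1 - lat2) * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    dx = (lon1 - lon2) * KM_PER_DEG_LAT * math.cos(mean_lat)
    return math.hypot(dx, dy)

one_deg_lat = approx_distance_km(40.0, -75.0, 41.0, -75.0)
one_deg_lon = approx_distance_km(40.0, -75.0, 40.0, -74.0)
print(f"{one_deg_lat:.1f} km per degree of latitude")
print(f"{one_deg_lon:.1f} km per degree of longitude at 40 N")
```

At a latitude of about 40° this gives roughly 85 km per degree of longitude, which matches the fixed factor used for `dist_from_sub_km` in this notebook. For a service territory at a very different latitude, the fixed factor would need adjusting.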

In [None]:
# --- Compute hosting capacity per transformer ---
# Existing solar capacity aggregated to each transformer
solar_by_xfmr = solar.groupby("transformer_id")["capacity_kw"].sum()
solar_by_xfmr.name = "existing_solar_kw"

# Peak load per feeder from load profiles (MW -> kW)
feeder_peak = load_profiles.groupby("feeder_id")["load_mw"].max().reset_index()
feeder_peak.columns = ["feeder_id", "peak_load_mw"]

# Count transformers per feeder to allocate feeder load
xfmr_counts = transformers.groupby("feeder_id").size().reset_index(name="n_xfmrs")
feeder_peak = feeder_peak.merge(xfmr_counts, on="feeder_id")
feeder_peak["load_per_xfmr_kw"] = (
    feeder_peak["peak_load_mw"] * 1000 / feeder_peak["n_xfmrs"]
)

# Build the HCA table: capacity minus solar minus load
hca = transformers[["feeder_id", "substation_id",
                     "kva_rating", "age_years",
                     "latitude", "longitude"]].copy()
hca = hca.merge(solar_by_xfmr, left_index=True,
                right_index=True, how="left")
hca["existing_solar_kw"] = hca["existing_solar_kw"].fillna(0)
hca = hca.merge(
    feeder_peak[["feeder_id", "load_per_xfmr_kw"]],
    on="feeder_id", how="left"
)

# Hosting capacity = rated capacity - existing solar - allocated load
hca["hosting_capacity_kw"] = (
    hca["kva_rating"] - hca["existing_solar_kw"] - hca["load_per_xfmr_kw"]
).clip(lower=0)

print(f"\nHosting capacity computed for {len(hca):,} transformers")
print(f"Across {hca['feeder_id'].nunique()} feeders")
print(f"\nHosting capacity summary (kW):")
print(hca["hosting_capacity_kw"].describe().round(1))

## Prepare Training Labels and Split

The target variable is `hosting_capacity_kw`, the estimated remaining capacity at each transformer after accounting for existing solar and allocated peak load. This is a regression problem: we predict a continuous value, not a category.

In [None]:
# --- Feature Group 1: Distance from substation ---
# Compute distance from each transformer to its substation node
sub_nodes = nodes[nodes["node_type"] == "substation_bus"]
sub_coords = sub_nodes[["latitude", "longitude"]].reset_index()
sub_coords.columns = ["substation_id", "sub_lat", "sub_lon"]

hca = hca.merge(sub_coords, on="substation_id", how="left")

# Approximate distance in km using lat/lon
hca["dist_from_sub_km"] = np.sqrt(
    ((hca["latitude"] - hca["sub_lat"]) * 111) ** 2 +
    ((hca["longitude"] - hca["sub_lon"]) * 85) ** 2
)

# --- Feature Group 2: Transformer characteristics ---
# kva_rating and age_years are already in the hca table from Step 1

# --- Feature Group 3: Conductor impedance from network edges ---
# Aggregate edge data per feeder for impedance and ampacity features
feeder_edges = edges.groupby("feeder_id").agg({
    "rated_amps":              "min",    # bottleneck conductor
    "impedance_r_ohm_per_mile": "mean",  # average resistance
    "length_miles":            "sum",    # total line length
}).reset_index()
feeder_edges.columns = ["feeder_id", "min_rated_amps",
                         "avg_r_ohm_per_mi", "total_line_mi"]

print("Feature groups created:")
print(f"  Transformers with distance: {len(hca):,}")
print(f"  Feeder edge features: {len(feeder_edges):,} feeders")

## Train a LightGBM Regression Model

LightGBM uses gradient-boosted decision trees with histogram-based splitting for speed. For hosting capacity prediction, regression is the right objective: we want to predict a continuous kW value, not a category.
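To make "histogram-based splitting" concrete, here is a minimal pure-Python sketch of the idea: bucket a feature into a small number of bins, accumulate per-bin statistics in one pass, then evaluate only the bin edges as split candidates. It is a simplification under stated assumptions (squared-error loss, equal-width bins, a single feature). LightGBM itself works on gradient and Hessian statistics and chooses bin boundaries in its own way. The function name and toy data are hypothetical.

```python
def best_histogram_split(x, y, n_bins=8):
    """Return (threshold, gain) for the best split among bin edges."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant feature: everything lands in bin 0

    # One pass over the data to build the histogram
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)
        sums[b] += yi
        counts[b] += 1

    total_sum, total_n = sum(sums), sum(counts)
    parent_score = total_sum ** 2 / total_n

    # Scan bin edges left to right, keeping running left-side totals
    best_threshold, best_gain = None, 0.0
    left_sum, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_sum += sums[b]
        left_n += counts[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_sum = total_sum - left_sum
        # Reduction in squared error from splitting at this edge
        gain = left_sum ** 2 / left_n + right_sum ** 2 / right_n - parent_score
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain

toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [10, 10, 10, 10, 50, 50, 50, 50]
threshold, gain = best_histogram_split(toy_x, toy_y)
print(threshold, gain)
```

The speed comes from the scan: with 8 bins there are only 7 candidate thresholds to test, however many rows went into the histogram.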

In [None]:
# --- Feature Group 4: Load density ---
# Peak load per feeder from load profiles (already computed in Step 1)
# Count transformers per feeder for load density
feeder_stats = feeder_peak[["feeder_id", "peak_load_mw", "n_xfmrs"]].copy()
feeder_stats["peak_load_kw"] = feeder_stats["peak_load_mw"] * 1000
feeder_stats["load_density_kw_per_xfmr"] = (
    feeder_stats["peak_load_kw"] / feeder_stats["n_xfmrs"]
)

# --- Feature Group 5: Existing DER penetration ---
feeder_solar = solar.groupby("feeder_id")["capacity_kw"].sum().reset_index()
feeder_solar.columns = ["feeder_id", "existing_solar_total_kw"]
feeder_stats = feeder_stats.merge(feeder_solar, on="feeder_id", how="left")
feeder_stats["existing_solar_total_kw"] = feeder_stats["existing_solar_total_kw"].fillna(0)
feeder_stats["pv_penetration_pct"] = (
    feeder_stats["existing_solar_total_kw"] / feeder_stats["peak_load_kw"] * 100
)

# --- Feature Group 6: Merge everything into one table ---
df = hca.copy()
df = df.merge(feeder_edges, on="feeder_id", how="left")
df = df.merge(
    feeder_stats[["feeder_id", "peak_load_kw", "n_xfmrs",
                  "load_density_kw_per_xfmr",
                  "existing_solar_total_kw", "pv_penetration_pct"]],
    on="feeder_id", how="left"
)

# Fill missing values with 0
# NOTE: LightGBM handles NaN natively by routing missing values to the
# better side of each split, so fillna(0) is a simplification here, not
# a requirement. It is reasonable for feeders with no solar records
# (zero installed PV), but a feeder with no edge data does not really
# have zero impedance or ampacity. Be careful with this pattern: where
# 0 is a meaningful non-missing value (e.g., age_years=0 for a
# brand-new transformer), prefer a sentinel such as -1 or keep NaN.
df = df.fillna(0)

print(f"\nFinal dataset: {len(df):,} rows x {len(df.columns)} columns")
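# Everything except the two ID columns and the target is a candidate feature
non_features = ["feeder_id", "substation_id", "hosting_capacity_kw"]
print(f"Candidate features: {len(df.columns.difference(non_features))}")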

## Evaluate with R², MAE, and Scatter Plot

A good surrogate model should show predictions tightly clustered around the 45-degree line (predicted = actual). We also check feature importance to validate the model learned physically meaningful patterns.
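For intuition about the two headline metrics, the sketch below computes MAE and R² by hand on four made-up points. The helper names and numbers are hypothetical; in the notebook these values come from scikit-learn's `mean_absolute_error` and `r2_score`.

```python
def mean_abs_error(actual, predicted):
    """Average absolute miss, in the same units as the target (kW)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variance to total variance."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

toy_actual = [100.0, 150.0, 200.0, 250.0]  # hypothetical kW values
toy_pred = [110.0, 140.0, 210.0, 240.0]    # each prediction misses by 10 kW

toy_mae = mean_abs_error(toy_actual, toy_pred)
toy_r2 = r_squared(toy_actual, toy_pred)
print(f"MAE: {toy_mae:.1f} kW, R-squared: {toy_r2:.3f}")  # MAE: 10.0 kW, R-squared: 0.968
```

MAE tells you the typical miss in kW, while R² tells you how much of the spread in hosting capacity the model explains. A model can score well on one and poorly on the other, which is why we report both.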

In [None]:
# Define features and target
feature_cols = [
    "dist_from_sub_km",
    "kva_rating", "age_years",
    "min_rated_amps", "avg_r_ohm_per_mi", "total_line_mi",
    "peak_load_kw", "n_xfmrs", "load_density_kw_per_xfmr",
    "existing_solar_kw",
    "existing_solar_total_kw", "pv_penetration_pct",
    "load_per_xfmr_kw",
]
target_col = "hosting_capacity_kw"

X = df[feature_cols]
y = df[target_col]

# 80/20 train/test split stratified by feeder to ensure all feeders
# are represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=df["feeder_id"]
)

print(f"Training set: {len(X_train):,} transformers")
print(f"Test set:     {len(X_test):,} transformers")
print(f"\nTarget distribution (training):")
print(f"  Mean: {y_train.mean():.0f} kW")
print(f"  Std:  {y_train.std():.0f} kW")
print(f"  Min:  {y_train.min():.0f} kW")
print(f"  Max:  {y_train.max():.0f} kW")

## Sensitivity Analysis

A surrogate model lets you explore "what-if" scenarios instantly. We vary conductor ampacity (derated for summer temperatures), load growth, and existing PV penetration to see how hosting capacity shifts across the network.

In [None]:
# Create LightGBM datasets
train_data = lgb.Dataset(X_train, label=y_train,
                          feature_name=feature_cols)
test_data = lgb.Dataset(X_test, label=y_test,
                         reference=train_data)

# Model parameters
params = {
    "objective":       "regression",
    "metric":          "mae",
    "boosting_type":   "gbdt",
    "num_leaves":      31,
    "learning_rate":   0.05,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq":    5,
    "min_child_samples": 10,
    "verbose":         -1,
    "seed":            42,
}

# Train with early stopping
callbacks = [
    lgb.early_stopping(stopping_rounds=50),
    lgb.log_evaluation(period=100),
]

model = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[train_data, test_data],
    valid_names=["train", "test"],
    callbacks=callbacks,
)

print(f"\nBest iteration: {model.best_iteration}")
# LightGBM stores the metric under 'l1' internally even when you pass metric='mae'
best_scores = model.best_score.get("test", {})
best_mae = best_scores.get("mae", best_scores.get("l1", None))
print(f"Best test MAE:  {best_mae:.1f} kW")

## Probabilistic Hosting Capacity with Quantile Regression

Point estimates are useful, but planners need ranges. "The hosting capacity is between 80 and 180 kW with 80% confidence" is more actionable than "the hosting capacity is 130 kW." LightGBM supports quantile regression natively.
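Quantile regression works by minimising the pinball (quantile) loss, which penalises under-prediction and over-prediction asymmetrically. The sketch below is a plain-Python illustration of that loss with made-up numbers. The function name is hypothetical; LightGBM computes the loss internally when the objective is set to quantile.

```python
def pinball_loss(actual, predicted, alpha):
    """Mean pinball loss for quantile level alpha (0 < alpha < 1)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        diff = a - p
        # Under-prediction (diff > 0) costs alpha per kW,
        # over-prediction (diff < 0) costs (1 - alpha) per kW.
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(actual)

# Hypothetical transformer with 100 kW actual hosting capacity
under = pinball_loss([100.0], [80.0], alpha=0.90)   # predicted 20 kW too low
over = pinball_loss([100.0], [120.0], alpha=0.90)   # predicted 20 kW too high
print(f"alpha=0.90: under-prediction costs {under:.1f}, over-prediction costs {over:.1f}")
```

At `alpha=0.90` a miss on the low side costs nine times as much as the same miss on the high side, so the fitted model settles near the 90th percentile. At `alpha=0.50` the two costs are equal and the model targets the median.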

In [None]:
# Predict on test set
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

# Metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.3f}")
print(f"MAE:       {mae:.1f} kW")

# Scatter plot: predicted vs actual
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left: predicted vs actual
ax1.scatter(y_test, y_pred, alpha=0.6, c="#5FCCDB", edgecolor="#2D6A7A", s=40)
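# Draw the 45-degree line across the full data range rather than a fixed 0-500 kW
hc_max = max(y_test.max(), y_pred.max())
ax1.plot([0, hc_max], [0, hc_max], "r--", linewidth=1.5, label="Perfect prediction")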
ax1.set_xlabel("Actual Hosting Capacity (kW)")
ax1.set_ylabel("Predicted Hosting Capacity (kW)")
ax1.set_title(f"Predicted vs Actual (R\u00b2={r2:.3f}, MAE={mae:.0f} kW)")
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: feature importance
importance = pd.DataFrame({
    "feature": feature_cols,
    "importance": model.feature_importance(importance_type="gain")
}).sort_values("importance", ascending=True)

ax2.barh(importance["feature"], importance["importance"], color="#2D6A7A")
ax2.set_xlabel("Feature Importance (Gain)")
ax2.set_title("LightGBM Feature Importance")

plt.tight_layout()
plt.show()

## Map Hosting Capacity Spatially

Utility planners think spatially. A hosting capacity "heat map" shows at a glance where the grid can absorb more solar and where it cannot. We use the latitude and longitude from the SP&L transformer data to plot predicted hosting capacity geographically.

In [None]:
# Scenario analysis: vary one feature at a time
baseline = X_test.copy()
scenarios = {}

# Scenario 1: Summer peak (reduce ampacity by 15% due to temperature)
summer_peak = baseline.copy()
summer_peak["min_rated_amps"] *= 0.85
scenarios["Summer derating (-15% ampacity)"] = summer_peak

# Scenario 2: 20% load growth
load_growth = baseline.copy()
load_growth["peak_load_kw"] *= 1.20
load_growth["load_density_kw_per_xfmr"] *= 1.20
load_growth["load_per_xfmr_kw"] *= 1.20
scenarios["20% load growth"] = load_growth

# Scenario 3: High existing DER (double current PV penetration)
high_der = baseline.copy()
high_der["existing_solar_kw"] *= 2.0
high_der["existing_solar_total_kw"] *= 2.0
high_der["pv_penetration_pct"] *= 2.0
scenarios["2x existing PV penetration"] = high_der

# Predict hosting capacity for each scenario
results = {"Baseline": model.predict(baseline, num_iteration=model.best_iteration)}
for name, scenario_X in scenarios.items():
    results[name] = model.predict(scenario_X, num_iteration=model.best_iteration)

# Compare median hosting capacity across scenarios
fig, ax = plt.subplots(figsize=(10, 5))
medians = {k: np.median(v) for k, v in results.items()}
colors = ["#5FCCDB", "#E53E3E", "#D69E2E", "#9F7AEA"]
ax.bar(medians.keys(), medians.values(), color=colors)
ax.set_ylabel("Median Hosting Capacity (kW)")
ax.set_title("Sensitivity Analysis: Impact on Hosting Capacity")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()

# Print percentage changes
baseline_median = medians["Baseline"]
print("Scenario impact vs baseline:")
for name, med in medians.items():
    pct = (med - baseline_median) / baseline_median * 100
    print(f"  {name}: {med:.0f} kW ({pct:+.1f}%)")

## Benchmark: ML Screening vs Full Recomputation

The whole point of a surrogate model is speed. Let us quantify how much faster ML screening is compared to recomputing hosting capacity from scratch for every transformer. The ML surrogate skips the per-transformer aggregation of solar, load allocation, and capacity arithmetic. For the specific task of screening hosting capacity values, the speed advantage is enormous.
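The timing comparison follows a simple pattern: call each routine several times and average the wall-clock time, so that one slow call does not skew the result. A minimal sketch of that pattern is shown here. The helper name `mean_seconds_per_call` and the dummy workload are hypothetical stand-ins for the model prediction and the pandas aggregation.

```python
import time

def mean_seconds_per_call(fn, repeats=10):
    """Average wall-clock seconds per call of fn over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Dummy workload standing in for a real computation
calls = []
elapsed = mean_seconds_per_call(lambda: calls.append(sum(range(1000))), repeats=5)
print(f"{elapsed * 1000:.3f} ms per call over {len(calls)} calls")
```

`time.perf_counter()` is preferred over `time.time()` for this because it uses the highest-resolution clock available and is not affected by system clock adjustments.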

In [None]:
# Train quantile models for 10th, 50th, and 90th percentiles
quantile_models = {}

for alpha in [0.10, 0.50, 0.90]:
    q_params = params.copy()
    q_params["objective"] = "quantile"
    q_params["alpha"] = alpha
    q_params["metric"] = "quantile"

    q_model = lgb.train(
        q_params,
        train_data,
        num_boost_round=500,
        valid_sets=[test_data],
        valid_names=["test"],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)],
    )
    quantile_models[alpha] = q_model
    print(f"  Quantile {alpha:.0%} model trained ({q_model.best_iteration} rounds)")

# Predict quantile ranges
q10 = quantile_models[0.10].predict(X_test)
q50 = quantile_models[0.50].predict(X_test)
q90 = quantile_models[0.90].predict(X_test)

# Plot quantile predictions for a subset of transformers
n_show = 25
idx = np.argsort(y_test.values)[:n_show]  # the n_show lowest actual HC values, ascending

fig, ax = plt.subplots(figsize=(14, 6))
x_pos = np.arange(n_show)

# 80% prediction interval (10th to 90th percentile)
ax.fill_between(x_pos, q10[idx], q90[idx], alpha=0.3,
                 color="#5FCCDB", label="80% prediction interval")
ax.plot(x_pos, q50[idx], "o-", color="#2D6A7A",
        linewidth=2, markersize=5, label="Median prediction")
ax.plot(x_pos, y_test.values[idx], "s", color="#E53E3E",
        markersize=6, label="Actual (computed)")

ax.set_xlabel("Transformer (sorted by actual hosting capacity)")
ax.set_ylabel("Hosting Capacity (kW)")
ax.set_title("Probabilistic Hosting Capacity: 80% Prediction Interval")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Coverage: what fraction of actuals fall within the 80% interval?
coverage = np.mean((y_test.values >= q10) & (y_test.values <= q90))
avg_width = np.mean(q90 - q10)
print(f"\n80% interval coverage: {coverage:.1%} (target: 80%)")
print(f"Average interval width: {avg_width:.0f} kW")

## What You Built and Next Steps

In [None]:
# Predict hosting capacity for ALL transformers (not just test set)
y_all_pred = model.predict(X, num_iteration=model.best_iteration)

# Build map dataframe with predictions and coordinates
map_df = df[["feeder_id", "latitude", "longitude"]].copy()
map_df["hc_predicted_kw"] = y_all_pred

# Spatial hosting capacity map
fig, ax = plt.subplots(figsize=(12, 10))

scatter = ax.scatter(
    map_df["longitude"], map_df["latitude"],
    c=map_df["hc_predicted_kw"],
    cmap="RdYlGn",
    s=20,
    edgecolor="#333",
    linewidth=0.3,
    alpha=0.7,
    vmin=0,
    vmax=500,
)

cbar = plt.colorbar(scatter, ax=ax, shrink=0.8)
cbar.set_label("Predicted Hosting Capacity (kW)", fontsize=12)

# Label substations from the nodes table
sub_locs = nodes[nodes["node_type"] == "substation_bus"]
ax.scatter(sub_locs["longitude"], sub_locs["latitude"],
           marker="^", s=200, c="black", zorder=5,
           label="Substation")

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("SP&L Service Territory: ML-Predicted Hosting Capacity")
ax.legend(loc="upper left", fontsize=11)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()

# Summary by feeder
feeder_summary = map_df.groupby("feeder_id")["hc_predicted_kw"].agg(
    ["mean", "min", "max"]
).round(0)
feeder_summary.columns = ["Mean HC (kW)", "Min HC (kW)", "Max HC (kW)"]
print("\nHosting capacity by feeder (first 10):")
print(feeder_summary.head(10))

In [None]:
# Time the ML prediction for all transformers
start_ml = time.perf_counter()
for _ in range(100):  # run 100 times for stable measurement
    ml_pred = model.predict(X, num_iteration=model.best_iteration)
end_ml = time.perf_counter()
ml_time_per_run = (end_ml - start_ml) / 100

# Time the pandas-based full computation for comparison
start_pd = time.perf_counter()
for _ in range(10):
    _solar_agg = solar.groupby("transformer_id")["capacity_kw"].sum()
    _peak = load_profiles.groupby("feeder_id")["load_mw"].max()
end_pd = time.perf_counter()
pd_time_per_run = (end_pd - start_pd) / 10

print("=== Speed Comparison ===")
print(f"ML surrogate (all {len(X):,} transformers): {ml_time_per_run*1000:.1f} ms")
print(f"Pandas recomputation:  {pd_time_per_run*1000:.0f} ms")
print(f"Speedup: {pd_time_per_run / ml_time_per_run:,.0f}x")

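# Scaling comparison (what-if for larger utilities)
# Both timings are extrapolated linearly from the measured run, so the
# speedup column is constant by construction.
scale_points = [1000, 5000, 21000, 50000, 100000]  # avoids reusing the xfmr_counts name
scale = len(X)
print(f"\n{'Transformers':>14}  {'Full Recompute':>15}  {'ML Model':>12}  {'Speedup':>10}")
print("-" * 58)
for n in scale_points: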
    pd_t = pd_time_per_run * (n / scale)
    ml_t = ml_time_per_run * (n / scale)
    if pd_t >= 1:
        pd_str = f"{pd_t:.1f} sec"
    else:
        pd_str = f"{pd_t*1000:.0f} ms"
    print(f"{n:>14,}  {pd_str:>15}  {ml_t*1000:>9.1f} ms  {pd_t/ml_t:>10,.0f}x")

In [None]:
# Save the trained LightGBM model
model.save_model("hosting_capacity_surrogate.lgb")

# Load it back
model = lgb.Booster(model_file="hosting_capacity_surrogate.lgb")