# Outage Prediction with Random Forest

From the [Sisyphean Gridworks ML Playground](https://sgridworks.com/ml-playground/guides/01-outage-prediction.html)

## Setup

Clone the repository and install dependencies. Run this cell first.

In [None]:
!git clone https://github.com/SGridworks/Dynamic-Network-Model.git 2>/dev/null || echo 'Already cloned'
%cd Dynamic-Network-Model
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm pyarrow

## Load the Data

Run the following cell to import the libraries and load the three SP&L datasets. Run it from the repository root (where the `demo_data/` folder lives); the setup cell above already changes into that directory.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load SP&L datasets using the data loader API
from demo_data.load_demo_data import (
    load_outage_history, load_weather_data, load_transformers
)

outages      = load_outage_history()
weather      = load_weather_data()
transformers = load_transformers()

print(f"Outage events loaded: {len(outages):,}")
print(f"Weather rows loaded:  {len(weather):,}")
print(f"Transformers loaded:  {len(transformers):,}")
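
Later cells use the `.dt` accessor on `weather["timestamp"]` and `outages["fault_detected"]`, which only works on datetime columns. Whether the loader already parses these columns is not stated here, so a quick check is a reasonable safeguard. The sketch below runs on a small made-up frame; `ensure_datetime` is a hypothetical helper for illustration, not part of the SP&L loader API.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def ensure_datetime(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `frame` with `column` parsed as datetime if it is not already."""
    out = frame.copy()
    if not is_datetime64_any_dtype(out[column]):
        out[column] = pd.to_datetime(out[column])
    return out


# Toy stand-in for the weather table (values are made up for illustration)
toy_weather = pd.DataFrame({
    "timestamp": ["2024-01-01 00:00", "2024-01-01 01:00"],
    "wind_speed_mph": [12.0, 15.5],
})

toy_weather = ensure_datetime(toy_weather, "timestamp")
print(toy_weather.dtypes)
```

If the real tables already hold datetime columns, the helper leaves them unchanged.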

## Explore the Data

Before building a model, look at what you have. Run each exploration snippet in its own cell so you can see its output.

In [None]:
# See the first few outage rows
outages.head()

## Build Daily Features

Outages happen on a specific day, but weather is recorded every hour. To combine them, we need to summarize the hourly weather into daily statistics (max wind, max and min temperature, mean humidity, whether a storm occurred, etc.).
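
To make the hourly-to-daily step concrete, here is a minimal sketch on made-up readings. It uses pandas named aggregation, which is one possible way to write the summary and not necessarily the form used elsewhere in this guide. The column names mirror the weather table, but the values are invented.

```python
import pandas as pd

# Two days of made-up readings (three rows per day to keep it short)
hourly = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-01 00:00", "2024-03-01 12:00", "2024-03-01 23:00",
        "2024-03-02 00:00", "2024-03-02 12:00", "2024-03-02 23:00",
    ]),
    "temperature_f":  [40.0, 55.0, 45.0, 38.0, 50.0, 41.0],
    "wind_speed_mph": [5.0, 20.0, 10.0, 8.0, 35.0, 12.0],
})

# Collapse each timestamp to its calendar date, then summarize per date
hourly["date"] = hourly["timestamp"].dt.date
daily = hourly.groupby("date").agg(
    temp_max=("temperature_f", "max"),
    temp_min=("temperature_f", "min"),
    wind_max=("wind_speed_mph", "max"),
).reset_index()

print(daily)
```

Named aggregation produces flat column names directly, so no separate renaming step is needed.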

In [None]:
# How many outages per cause code?
outages["cause_code"].value_counts().plot(kind="bar", title="Outages by Cause")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

## Create the Target Variable

A classification model needs a target: the thing you are predicting. Our target is "Did at least one outage happen on this day?" (yes = 1, no = 0).
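
The idea can be sketched on a tiny made-up calendar: count events per day, left-join the counts onto the full list of days, and treat missing counts as zero. The frames `days` and `events` are illustrative stand-ins, not the SP&L tables.

```python
import pandas as pd

# One row per calendar day (stand-in for the daily weather table)
days = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"])
})

# One row per outage event; 2024-03-02 has no events
events = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-03"])
})

# Count events per day, then left-join so outage-free days are kept
counts = events.groupby("date").size().reset_index(name="outage_count")
labeled = days.merge(counts, on="date", how="left")

# Days missing from `counts` come back as NaN; treat them as zero outages
labeled["outage_count"] = labeled["outage_count"].fillna(0).astype(int)
labeled["outage_flag"] = (labeled["outage_count"] > 0).astype(int)

print(labeled)
```

The left join matters: an inner join would silently drop every day without an outage, leaving the model with no negative examples.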

In [None]:
# Check the weather columns
weather.describe()

## Add Time-Based Features

Outages follow seasonal patterns. Let's add month-of-year and day-of-week as features, plus a summer flag, so the model can learn these cycles.
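
As a small illustration, the sketch below derives these calendar fields from two hand-picked dates using the pandas `.dt` accessor. The summer definition (June through August) follows the one used in this guide; the dates themselves are arbitrary.

```python
import pandas as pd

dates = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-07-06"])})

dates["month"] = dates["date"].dt.month
dates["day_of_week"] = dates["date"].dt.dayofweek  # 0 = Monday, 6 = Sunday
dates["is_summer"] = dates["month"].isin([6, 7, 8]).astype(int)

print(dates)
```

Here 2024-01-01 is a Monday in winter and 2024-07-06 is a Saturday in summer, so the two rows differ in all three derived columns.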

In [None]:
# Create a date column from the weather timestamp
# (pd.to_datetime is a no-op if the column is already datetime)
weather["date"] = pd.to_datetime(weather["timestamp"]).dt.date

# Aggregate weather to daily summaries
daily_weather = weather.groupby("date").agg({
    "temperature_f":   ["max", "min", "mean"],
    "wind_speed_mph":  ["max", "mean"],
    "humidity_pct":    "mean",
    "is_storm":        "max",
}).reset_index()

# Flatten the multi-level column names
daily_weather.columns = [
    "date", "temp_max", "temp_min", "temp_mean",
    "wind_max", "wind_mean", "humidity_mean", "is_storm"
]

print(daily_weather.head())
print(f"\nDaily weather rows: {len(daily_weather)}")

## Split into Training and Test Sets

We need to hold back some data the model has never seen, so we can honestly evaluate it later. A common choice is an 80/20 split: 80% for training, 20% for testing. Stratifying on the target keeps the share of outage days the same in both sets.
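
The following sketch shows what a stratified 80/20 split does to the class balance. The labels are made up (20 outage days out of 100), and the single feature is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up days: 20 with an outage (1), 80 without (0)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Train size: {len(y_tr)}, outage rate: {y_tr.mean():.2f}")
print(f"Test size:  {len(y_te)}, outage rate: {y_te.mean():.2f}")
```

With `stratify`, both subsets keep the overall 20% outage rate; without it, a small test set could end up with noticeably more or fewer outage days by chance.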

In [None]:
# Extract the date from each outage event
# (pd.to_datetime is a no-op if the column is already datetime)
outages["date"] = pd.to_datetime(outages["fault_detected"]).dt.date

# Count outages per day
outage_days = outages.groupby("date").size().reset_index(name="outage_count")

# Merge with daily weather
df = daily_weather.merge(outage_days, on="date", how="left")

# Fill days with no outages as 0
df["outage_count"] = df["outage_count"].fillna(0).astype(int)

# Create the binary target: 1 if any outage, 0 if none
df["outage_flag"] = (df["outage_count"] > 0).astype(int)

print(f"Total days: {len(df)}")
print(f"Days with outages: {df['outage_flag'].sum()}")
print(f"Days without outages: {(df['outage_flag'] == 0).sum()}")

## Train the Random Forest

Now the exciting part. We create a Random Forest classifier and fit it on the training data. "Fitting" means the model examines all the training rows and learns patterns that connect weather features to outage outcomes.
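
To see what fitting means in the simplest case, the sketch below trains a small forest on a made-up rule (an outage whenever max wind exceeds 40 mph) and then asks it about two new days. The rule and the data are invented for illustration and say nothing about the real SP&L data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up rule for illustration only: outage when max wind exceeds 40 mph
wind_max = np.arange(0, 80).reshape(-1, 1)
had_outage = (wind_max.ravel() > 40).astype(int)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(wind_max, had_outage)

# Ask the fitted model about a calm day and a very windy day
calm_and_windy = np.array([[10], [70]])
toy_pred = toy_model.predict(calm_and_windy)
print(toy_pred)
```

The model is never told the 40 mph threshold; each tree recovers an approximate cut point from the labeled examples.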

In [None]:
# Convert date column to datetime for feature extraction
df["date"] = pd.to_datetime(df["date"])

# Add calendar features
df["month"]       = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek    # 0 = Monday, 6 = Sunday
df["is_summer"]   = df["month"].isin([6, 7, 8]).astype(int)

print(df[["date", "temp_max", "wind_max", "month", "outage_flag"]].head(10))

## Test the Model

Now we use the held-out test data, which the model has never seen, to estimate how well it generalizes to new days.
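
Before reading a full report, it helps to see how the headline numbers come out of a confusion matrix. The sketch below uses ten made-up labels, not model output from this guide.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = outage day, 0 = no outage
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")

# Precision: of the days flagged as outages, how many really were
precision = precision_score(y_true_toy, y_pred_toy)
# Recall: of the real outage days, how many were flagged
recall = recall_score(y_true_toy, y_pred_toy)
print(f"Precision={precision:.2f}  Recall={recall:.2f}")
```

For outage prediction, a false negative (a missed outage day) and a false positive (a false alarm) usually carry different costs, so it is worth looking at precision and recall separately instead of accuracy alone.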

In [None]:
# Define features (X) and target (y)
feature_cols = [
    "temp_max", "temp_min", "temp_mean",
    "wind_max", "wind_mean",
    "humidity_mean", "is_storm",
    "month", "day_of_week", "is_summer"
]

X = df[feature_cols]
y = df["outage_flag"]

# Split: 80% train, 20% test (random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples:     {len(X_test)}")

## Understand Feature Importance

One of the best things about Random Forests: they report how much each feature contributed to the trees' splits. This is valuable for utility engineers because it gives a first indication of which weather variables are most strongly associated with outage risk.
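
As a sanity check on how to read these scores, the sketch below fits a forest on made-up data where one feature fully determines the label and the other is pure noise. The feature names and the 40 mph rule are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "wind_max": rng.uniform(0, 80, size=300),  # drives the label
    "noise": rng.normal(size=300),             # unrelated to the label
})
toy_y = (toy_X["wind_max"] > 40).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X, toy_y)

# Impurity-based importances are normalized to sum to 1
toy_importance = pd.Series(forest.feature_importances_, index=toy_X.columns)
print(toy_importance.sort_values(ascending=False))
```

The informative feature should rank well above the noise column. Note that impurity-based importances describe what the model used, not causal effects, and they can be skewed by correlated features; `sklearn.inspection.permutation_importance` is a common cross-check.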

In [None]:
# Create the model with 200 decision trees
model = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_depth=10,           # limit tree depth to reduce overfitting
    random_state=42,        # for reproducible results
    class_weight="balanced" # adjust for imbalanced classes
)

# Train the model
model.fit(X_train, y_train)

print("Model training complete.")
print(f"Number of trees: {model.n_estimators}")
print(f"Features used:   {model.n_features_in_}")
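
The `class_weight="balanced"` option weights each class inversely to its frequency, using `n_samples / (n_classes * count_of_class)`. The sketch below works this out for made-up labels with 80 non-outage days and 20 outage days, using scikit-learn's `compute_class_weight` utility.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 80 days without an outage, 20 with one
y_imbalanced = np.array([0] * 80 + [1] * 20)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imbalanced
)

# 100 / (2 * 80) = 0.625 for class 0, 100 / (2 * 20) = 2.5 for class 1
print(dict(zip([0, 1], weights)))
```

Each outage day therefore counts four times as much as a non-outage day during training, which discourages the model from simply predicting the majority class.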

## What You Built and Next Steps

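Congratulations. You just loaded the SP&L data, built daily weather and calendar features, trained a Random Forest classifier, evaluated it on held-out data, and inspected which features it relies on.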

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Print a classification report
print(classification_report(y_test, y_pred,
      target_names=["No Outage", "Outage"]))

In [None]:
# Plot a confusion matrix
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["No Outage", "Outage"],
            yticklabels=["No Outage", "Outage"], ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Confusion Matrix: Outage Prediction")
plt.tight_layout()
plt.show()

In [None]:
# Get feature importances
importances = model.feature_importances_
feat_imp = pd.Series(importances, index=feature_cols).sort_values(ascending=True)

# Plot
fig, ax = plt.subplots(figsize=(8, 5))
feat_imp.plot(kind="barh", color="#5FCCDB", ax=ax)
ax.set_title("Feature Importance: What Drives Outages?")
ax.set_xlabel("Importance Score")
plt.tight_layout()
plt.show()