In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from grid_analytics_helper import *

# 1.0 Data

All relevant methods to grab data are located in pjm_retrieve_data.py.

Following the rules for DataMiner2, "Note that information and data contained in Data Miner is for internal use only and redistribution of information and or
data contained in or derived from Data Miner is strictly prohibited without a PJM membership."

In [None]:
dataframe_file_path = "./dataframes/" # Change to your folder
zonal_lmp_file_name = "jul_2020_jul_2024_zonal_lmps"
daily_gen_cap_file_name = "jul_2020_jul_2024_daily_capacity_generation"
gen_outage_file_name = "jul_2020_jul_2024_generation_outages"
zone_to_region_name = "zone_to_region"
output_file_name = "jul_2020_jul_2024_historical_grid_data"

# Read in data
lmp_data = pd.read_parquet(f"{dataframe_file_path}{zonal_lmp_file_name}.parquet", engine="pyarrow")
generation_capacity = pd.read_parquet(f"{dataframe_file_path}{daily_gen_cap_file_name}.parquet", engine="pyarrow")
outage_seven_days = pd.read_parquet(f"{dataframe_file_path}{gen_outage_file_name}.parquet", engine="pyarrow")
zone_to_region = pd.read_parquet(f"{dataframe_file_path}{zone_to_region_name}.parquet", engine="pyarrow")

# 2.0 Merge Data

The purpose is to understand supply-demand dynamics and operational constraints vs. congestion.

**First:** Merge LMP and Daily Generation Capacity Data Together

**Rationale:** Generation Capacity Influences Congestion and LMPs.

By merging these datasets, we can analyze how variations in capacity impact LMPs (e.g., high LMPs during periods of tight capacity margins).

Capacity margins can then be incorporated into congestion spread forecasting models to predict future grid stress and congestion events.

**Handling Potential Missing Data From Daily Generation Capacity:**

To handle missing values in the daily generation capacity data, linear interpolation is applied. Generation capacity tends to change gradually, so interpolating linearly between known values is a reasonable way to fill gaps. Forward-fill or backward-fill would instead assume that capacity stays constant across the gap.
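As a rough illustration of the difference between the two fill strategies, the sketch below compares linear interpolation with forward-fill on a made-up daily series. The column name `economic_max_mw` and the values are assumptions for demonstration only; this is not the code used inside `merge_historical_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily capacity series with a two-day gap (values are made up).
capacity = pd.Series(
    [100.0, np.nan, np.nan, 130.0],
    index=pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(days=1)),
    name="economic_max_mw",
)

linear = capacity.interpolate(method="linear")  # ramps evenly across the gap
ffilled = capacity.ffill()                      # holds the last known value

comparison = pd.DataFrame({"linear": linear, "ffill": ffilled})
print(comparison)
```

The linear version steps through the gap in equal increments, while the forward-filled version stays flat until the next observed value.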

**Second:** Merge on Generation Outage for Seven Days Data

**Rationale:** Outages directly impact grid reliability and congestion. Incorporating near-term outage risks (i.e., the seven-day outage data) helps identify near-term congestion caused by planned or unplanned outages, which can be important in real-time market analysis or bidding strategies.

Outages reduce the available generation capacity, which:
- Lowers the grid’s ability to meet demand, especially during peak hours.
- Forces reliance on less efficient or more expensive generators, leading to higher LMPs and increased congestion risk.

Adds Predictive Power to Congestion Forecasts
- Total Outages (MW): The overall reduction in capacity, which directly correlates with congestion risk.
- A new feature will be generated 

**Handling Potential Missing Data from the Generation Outage for Seven Days Data:**

Without adding too much complexity, missing values are forward-filled at the region level. Forward-fill was selected because outages are discrete events that do not vary continuously.
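A minimal sketch of a region-level forward-fill is shown below, assuming a long-format table with hypothetical columns `region`, `date`, and `total_outages_mw`. The region labels and values are illustrative, and the actual helper may be implemented differently.

```python
import numpy as np
import pandas as pd

# Hypothetical outage table with gaps in two regions.
outages = pd.DataFrame({
    "region": ["West", "West", "West", "Mid-Atlantic", "Mid-Atlantic", "Mid-Atlantic"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "total_outages_mw": [500.0, np.nan, 650.0, 900.0, np.nan, np.nan],
})

# Sort so that "forward" means forward in time, then fill within each region
# so that one region's values never leak into another's.
outages = outages.sort_values(["region", "date"])
outages["total_outages_mw"] = outages.groupby("region")["total_outages_mw"].ffill()
print(outages)
```

Grouping before filling matters: a plain `ffill()` on the whole column could carry the last value of one region into the first rows of the next.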

In [None]:
data = merge_historical_data(lmp_data, generation_capacity, outage_seven_days, zone_to_region)

# 3.0 Feature Engineering

Let's introduce congestion-related features/metrics into the merged dataset.

**Note:** Not all of these features will be incorporated into the modelling phases; this section also serves as a "trial and error" space for me to better understand the market.

## 3.1 Locational Marginal Pricing (LMP) Trends

These features will provide information for understanding LMP trends and volatility at each node on an hourly basis in each region.

### 3.1.1 LMP Delta

**LMP Delta:** Tracks the hourly change in LMP for each pricing node.

**Formula:** $\text{LMP Delta} = \text{LMP}_{t} - \text{LMP}_{(t-1)}$

**LMP Absolute Delta:** Tracks the magnitude of the changes in LMP.

**Formula:** $\text{LMP Absolute Delta} = |\text{LMP}_{t} - \text{LMP}_{(t-1)}|$
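The delta and its absolute value defined above could be computed per pricing node roughly as follows. This is a sketch under assumed column names (`zone`, `datetime`, `lmp`) with made-up prices; it is not necessarily what `create_lmp_delta` does internally.

```python
import pandas as pd

# Hypothetical hourly LMPs for two zones.
lmp = pd.DataFrame({
    "zone": ["A", "A", "A", "B", "B", "B"],
    "datetime": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"] * 2
    ),
    "lmp": [30.0, 35.0, 28.0, 40.0, 40.0, 55.0],
})

lmp = lmp.sort_values(["zone", "datetime"])
# diff() within each zone gives LMP_t - LMP_(t-1); the first hour of each zone is NaN.
lmp["lmp_delta"] = lmp.groupby("zone")["lmp"].diff()
lmp["lmp_abs_delta"] = lmp["lmp_delta"].abs()
print(lmp)
```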

In [None]:
data = create_lmp_delta(data)

### 3.1.2 LMP Volatility 

A rolling standard deviation over a 24-hour window provides a measure of LMP variability and can help identify price instability at specific nodes or regions.
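One way to sketch a 24-hour rolling standard deviation per node is shown below. The column names (`zone`, `datetime`, `lmp`), the synthetic prices, and the choice to require a full 24-row window are assumptions; `create_lmp_volatility` may use different settings.

```python
import pandas as pd

# Hypothetical data: 24 flat hours followed by 24 hours alternating between 20 and 40.
hours = pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1))
lmp = pd.DataFrame({
    "zone": "A",
    "datetime": hours,
    "lmp": [30.0] * 24 + [20.0, 40.0] * 12,
})

lmp = lmp.sort_values(["zone", "datetime"])
# Rolling sample standard deviation over the trailing 24 hourly rows of each zone.
lmp["lmp_volatility"] = lmp.groupby("zone")["lmp"].transform(
    lambda s: s.rolling(window=24, min_periods=24).std()
)
print(lmp.tail())
```

With hourly rows and no gaps, a 24-row window corresponds to 24 hours; if timestamps can be missing, a time-based window would be worth considering instead.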

In [None]:
data = create_lmp_volatility(data)

## 3.2 Outage Metrics 

These metrics are critical for understanding grid performance, as they reflect the capacity and reliability of the system.

### 3.2.1 Forced Outage Percentage

This feature measures the share of outages due to unplanned (forced) events.

A **higher Forced Outage Percentage** indicates greater system stress or unexpected maintenance issues.

**Formula:** $\text{Forced Outage Percentage} = \frac{\text{Forced Outages (MW)}}{\text{Total Outages (MW)}} \times 100$
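The formula above can be sketched as follows, assuming hypothetical columns `forced_outages_mw` and `total_outages_mw`. Treating a zero total as undefined (NaN) is a choice made for this example, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical outage totals, including a day with no outages at all.
outages = pd.DataFrame({
    "forced_outages_mw": [200.0, 0.0, 0.0],
    "total_outages_mw": [800.0, 500.0, 0.0],
})

# Replace a zero denominator with NaN so the ratio is undefined rather than inf.
total = outages["total_outages_mw"].replace(0, np.nan)
outages["forced_outage_pct"] = outages["forced_outages_mw"] / total * 100
print(outages)
```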


In [None]:
data = create_forced_outage_pct(data)

### 3.2.2 Outage Intensity

This feature measures how much of the available generation capacity is affected by outages at a node, zone, or region.

**Formula:** $\text{Outage Intensity} = \frac{\text{Total Outages (MW)}}{\text{Economic Max (MW)}} \times 100$


Note:
- Total Outages is daily data.
- Economic Max is hourly data.

This feature is computed at daily regional granularity, with the hourly Economic Max values converted to daily averages. This avoids introducing artificial hourly variability and preserves interpretability.
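The daily aggregation described above might look like the sketch below. The table layouts and column names (`region`, `datetime`, `date`, `economic_max_mw`, `total_outages_mw`) are assumptions, and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical hourly Economic Max for one region over two days.
hourly = pd.DataFrame({
    "region": "West",
    "datetime": pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1)),
    "economic_max_mw": [10000.0] * 24 + [8000.0] * 24,
})
# Hypothetical daily outage totals for the same region.
daily_outages = pd.DataFrame({
    "region": ["West", "West"],
    "date": pd.date_range("2024-01-01", periods=2, freq=pd.Timedelta(days=1)),
    "total_outages_mw": [500.0, 1000.0],
})

# Average the hourly Economic Max to one value per region and day.
hourly["date"] = hourly["datetime"].dt.normalize()
daily_econ_max = hourly.groupby(["region", "date"], as_index=False)["economic_max_mw"].mean()

# Join on the shared daily key and apply the Outage Intensity formula.
daily = daily_econ_max.merge(daily_outages, on=["region", "date"], how="left")
daily["outage_intensity"] = daily["total_outages_mw"] / daily["economic_max_mw"] * 100
print(daily)
```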

In [None]:
data = create_outage_intensity(data)

## 3.3 Stress Indicators

Stress indicators are equally crucial for capturing grid stability and identifying congestion risks. The following features are intended to provide insight into how different regions, and the system as a whole, are coping with outages, capacity constraints, and demand surges.

### 3.3.1 Capacity Margin

The available generation capacity (represented by Economic Max, Emergency Max, and Total Committed) directly impacts grid stress and congestion:
- Low capacity margins: A small buffer between Economic Max and Total Committed leaves the grid vulnerable to congestion and price spikes.
- High capacity margins: Ample available generation allows the system to respond flexibly to unexpected demand or transmission constraints, reducing congestion and stabilizing prices.

This feature will provide information on the buffer available to meet unexpected demand or supply fluctuations.

**Formula:** $\text{Capacity Margin} \left( \% \right) = \frac{\text{Economic Max} - \text{Total Committed}}{\text{Economic Max}} \times 100$
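A small worked example of the capacity margin is sketched below, with assumed column names `economic_max_mw` and `total_committed_mw` and invented values. It shows a comfortable margin, a tight margin, and a negative margin where commitments exceed Economic Max.

```python
import pandas as pd

# Hypothetical hourly rows: ample, tight, and over-committed capacity.
capacity = pd.DataFrame({
    "economic_max_mw": [10000.0, 10000.0, 10000.0],
    "total_committed_mw": [7000.0, 9800.0, 10500.0],
})

capacity["capacity_margin_pct"] = (
    (capacity["economic_max_mw"] - capacity["total_committed_mw"])
    / capacity["economic_max_mw"]
    * 100
)
print(capacity)
```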

In [None]:
data = create_capacity_margin(data)

### 3.3.2 Region Stress Ratio

This feature will aid in comparing regional outages to system-wide outages to identify disproportionately stressed regions.

**Formula:** $\text{Region Stress Ratio} = \frac{\text{Total Outages (MW)}}{\text{RTO Total Outages (MW)}} \times 100$
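The ratio above can be sketched as follows. This example assumes the RTO total is obtained by summing regional totals for each date, and it uses hypothetical columns `date`, `region`, and `total_outages_mw`; if the data already carries an RTO-level total, that column could be used as the denominator instead.

```python
import pandas as pd

# Hypothetical daily outage totals for two regions.
outages = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01"] * 2 + ["2024-01-02"] * 2),
    "region": ["West", "Mid-Atlantic"] * 2,
    "total_outages_mw": [300.0, 700.0, 500.0, 500.0],
})

# Assumed here: system-wide outages are the sum over regions on each date.
rto_total = outages.groupby("date")["total_outages_mw"].transform("sum")
outages["region_stress_ratio"] = outages["total_outages_mw"] / rto_total * 100
print(outages)
```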

In [None]:
data = create_region_stress_ratio(data)

### 3.3.3 Emergency Trigger

**Emergency Triggered:** This feature flags hours in which Total Committed exceeds the optimal limit (Economic Max).

**Formula:** $\text{Emergency Triggered} = \text{Total Committed} \gt \text{Economic Max}$

In [None]:
data = create_emergency_triggered(data)

### 3.3.4 Near Emergency Threshold

**Near Emergency Threshold:** This feature is meant to signal an early warning for grid stress before reaching full capacity under normal operating conditions. The threshold here is set at 95% of the Economic Max.

**Formula:** $\text{Near Emergency Threshold} = \text{Total Committed} \gt 0.95 \times \text{Economic Max}$
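Both the emergency flag and the near-emergency flag compare Total Committed against a fraction of Economic Max, so they can be sketched with one small function. The function name `committed_exceeds` and the input column names are hypothetical, and the 0/1 encoding is an assumption chosen to match a binary target.

```python
import pandas as pd

def committed_exceeds(df, threshold=1.0):
    """Return 1 where Total Committed > threshold * Economic Max, else 0."""
    return (df["total_committed_mw"] > threshold * df["economic_max_mw"]).astype(int)

# Hypothetical rows: normal, near the limit, and over the limit.
grid = pd.DataFrame({
    "economic_max_mw": [10000.0, 10000.0, 10000.0],
    "total_committed_mw": [9000.0, 9600.0, 10200.0],
})

grid["emergency_triggered"] = committed_exceeds(grid)             # 100% of Economic Max
grid["near_emergency"] = committed_exceeds(grid, threshold=0.95)  # 95% of Economic Max
print(grid)
```

Because the comparison is strict, every row flagged as an emergency is also flagged as near-emergency.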

In [None]:
data = create_near_emergency(data)

## Output Data

In [None]:
data.to_parquet(f"{dataframe_file_path}{output_file_name}.parquet", index=False, engine="pyarrow")

# 4.0 EDA

In [None]:
mdata = pd.read_parquet(f"{dataframe_file_path}{output_file_name}.parquet", engine="pyarrow") # the merged data

## 4.1 Check for Missing Data

This should not be an issue, as missing data has been dealt with at every step in the data creation process. However, for the sake of completeness, we will check.

In [None]:
missing_data = mdata.isnull().sum()
missing_columns = missing_data[missing_data > 0]
print("Columns with Missing Values:\n", missing_columns)

## 4.2 Feature Exploration

**I will add as I refine the plots!**

# 5.0 Predictions

## 5.1 Emergency Triggers

- **Target Variable**: Binary 0 (Not Triggered) vs. 1 (Triggered)
- **Features:**
    - **Temporal Features:** hour, day_of_week, month, is_weekend, and season capture time-based patterns in the grid's operation.
    - **Lagged Features:** Lag by 1 hour, 3 hours, 6 hours, and 1 day
        - near_emergency: If the grid is near its emergency state, the likelihood of triggering an emergency increases.
        - capacity_margin: Shows how close the system has been to resource limits.
        - lmp_volatility: Reflects pricing stress, which could precede emergency conditions.
    - **Rolling Windows:** Rolling windows are 3 hours, 6 hours, and 1 day
        - lmp_volatility and region_stress_ratio: A spike in stress or volatility might last for hours/days before triggering an emergency. 
        - **Note:** *Rolling averages smooth noisy data and capture broader trends*
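The feature groups listed above could be built along the lines of the sketch below. It is not the implementation of `emergency_trigger_set_up`; the function name, the `region` and `datetime` columns, the output column naming, and the month-to-season mapping are all assumptions, and the rows are assumed to be hourly with no gaps so that a shift of 24 rows equals one day.

```python
import pandas as pd

# Assumed mapping from calendar month to meteorological season.
SEASONS = {12: "winter", 1: "winter", 2: "winter", 3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer", 9: "fall", 10: "fall", 11: "fall"}

def add_time_lag_rolling(df):
    """Add temporal, lagged, and rolling-mean features per region (hypothetical helper)."""
    out = df.sort_values(["region", "datetime"]).copy()
    ts = out["datetime"]
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0, Sunday=6
    out["month"] = ts.dt.month
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["season"] = out["month"].map(SEASONS)
    # Lags of 1, 3, 6, and 24 hours, computed within each region.
    for col in ["near_emergency", "capacity_margin", "lmp_volatility"]:
        for h in (1, 3, 6, 24):
            out[f"{col}_lag_{h}h"] = out.groupby("region")[col].shift(h)
    # Rolling means over 3, 6, and 24 hourly rows, computed within each region.
    for col in ["lmp_volatility", "region_stress_ratio"]:
        for w in (3, 6, 24):
            out[f"{col}_roll_mean_{w}h"] = out.groupby("region")[col].transform(
                lambda s, w=w: s.rolling(w).mean()
            )
    return out

# Hypothetical demo data: 30 hourly rows starting on a Saturday.
n = 30
demo = pd.DataFrame({
    "region": "West",
    "datetime": pd.date_range("2024-01-06", periods=n, freq=pd.Timedelta(hours=1)),
    "near_emergency": [0] * n,
    "capacity_margin": [float(i) for i in range(n)],
    "lmp_volatility": [float(i) for i in range(n)],
    "region_stress_ratio": [float(i) for i in range(n)],
})
features = add_time_lag_rolling(demo)
print(features.filter(like="lmp_volatility").tail())
```

The first rows of each lagged or rolling column are NaN by construction and would need to be dropped or filled before model fitting.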

In [None]:
mdata1 = emergency_trigger_set_up(mdata)
mdata1.head()

In [None]:
x_train, x_test = walk_forward_validation_et(mdata1, target_column="emergency_triggered", models_to_use=["decision_tree"], gap=24)

In [None]:
results = walk_forward_validation_et(mdata1, target_column="emergency_triggered", models_to_use=["decision_tree"])

In [None]:
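# Collect the fold index and F1-score reported for each walk-forward fold
fold_results = results["decision_tree"]
folds = [fold_result["fold"] for fold_result in fold_results]
f1_scores = [fold_result["f1_score"] for fold_result in fold_results]
print(f1_scores)  # a bare expression mid-cell would not be displayed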

plt.figure(figsize=(8, 5))
plt.plot(folds, f1_scores, marker='o', linestyle='-', color='blue')
plt.title("Decision Tree F1-Scores Across Folds")
plt.xlabel("Fold")
plt.ylabel("F1-Score")
plt.ylim(0.9, 1.1)
plt.grid()
plt.show()

In [None]:
results["decision_tree"][0]