In [None]:
import pandas as pd
import os

import grid_analytics_helper as gah

---
# 1.0 Data

All methods used to retrieve the data are located in `pjm_retrieve_data.py`.

Following the rules for DataMiner2, "Note that information and data contained in Data Miner is for internal use only and redistribution of information and or data contained in or derived from Data Miner is strictly prohibited without a PJM membership."

**Note:** The following data ranges from the start of July 2020 to the end of July 2024.

In [None]:
dataframe_file_path = "./dataframes/" # Change to your folder
model_output_path = "./models/" # model saved path
zonal_lmp_file_name = "jul_2020_jul_2024_zonal_lmps"
daily_gen_cap_file_name = "jul_2020_jul_2024_daily_capacity_generation"
gen_outage_file_name = "jul_2020_jul_2024_generation_outages"
zone_to_region_name = "zone_to_region"
output_file_name = "20200701_000000_20240801_000000_grid_data"

# Read in data
lmp_data = pd.read_parquet(os.path.join(dataframe_file_path, f"{zonal_lmp_file_name}.parquet"), engine="pyarrow")
generation_capacity = pd.read_parquet(os.path.join(dataframe_file_path, f"{daily_gen_cap_file_name}.parquet"), engine="pyarrow")
outage_seven_days = pd.read_parquet(os.path.join(dataframe_file_path, f"{gen_outage_file_name}.parquet"), engine="pyarrow")
zone_to_region = pd.read_parquet(os.path.join(dataframe_file_path, f"{zone_to_region_name}.parquet"), engine="pyarrow")

---
# 2.0 Merge Data

The purpose is to understand how supply-demand dynamics and operational constraints relate to congestion.

**First:** Merge LMP and Daily Generation Capacity Data Together

**Rationale:** Generation Capacity Influences Congestion and LMPs.

By merging these datasets, we can analyze how variations in capacity impact LMPs (e.g., high LMPs during periods of tight capacity margins).

Capacity margins can then be incorporated into congestion spread forecasting models to predict future grid stress and congestion events.

**Handling Potential Missing Data From Daily Generation Capacity:**

To deal with any missing values in the daily generation capacity data, linear interpolation is applied. Generation capacity tends to change gradually, so linear interpolation fills gaps without introducing abrupt jumps, whereas forward-fill or backward-fill would assume that capacity stays constant across the gap.
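
As a rough illustration of the difference between the two gap-filling strategies, here is a minimal sketch on a made-up daily series. The column name `economic_max_mw` and the values are assumptions for illustration only; the actual handling lives in `gah.merge_historical_data` and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical daily capacity series (MW) with a two-day gap; values are illustrative only.
capacity = pd.Series(
    [100.0, np.nan, np.nan, 130.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
    name="economic_max_mw",
)

linear = capacity.interpolate(method="linear")  # straight line between the known points
ffilled = capacity.ffill()                      # holds the last known value constant

print(pd.DataFrame({"linear": linear, "ffill": ffilled}))
```

The linear version ramps through the gap, while the forward-filled version stays flat and then jumps on the day the next observation arrives.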

**Second:** Merge on Generation Outage for Seven Days Data

**Rationale:** Outages directly impact grid reliability and congestion. Incorporating near-term outage risks (i.e., the seven-day outage data) helps identify near-term congestion caused by planned or unplanned outages, which can be important in real-time market analysis or bidding strategies.

Outages reduce the available generation capacity, which:
- Lowers the grid’s ability to meet demand, especially during peak hours.
- Forces reliance on less efficient or more expensive generators, leading to higher LMPs and increased congestion risk.

Adds Predictive Power to Congestion Forecasts
- Total Outages (MW): The overall reduction in capacity, which directly correlates with congestion risk.
- A new feature will be generated 

**Handling Potential Missing Data from the Generation Outage for Seven Days Data:**

To avoid adding too much complexity, a forward fill is applied at the region level. Forward fill was selected because outages are discrete events that do not vary continuously.
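
A region-level forward fill could look roughly like the sketch below. The table layout and the column names (`date`, `region`, `total_outages_mw`) are assumptions, not the actual schema used by `gah.merge_historical_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format outage table; column names and values are assumptions.
outages = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "region": ["Western"] * 3 + ["Mid Atlantic - Dominion"] * 3,
    "total_outages_mw": [500.0, np.nan, 650.0, np.nan, 300.0, np.nan],
})

outages = outages.sort_values(["region", "date"])
# Forward fill within each region so values never leak from one region into another.
outages["total_outages_mw_filled"] = (
    outages.groupby("region")["total_outages_mw"].ffill()
)

print(outages)
```

Note that a gap at the very start of a region's history stays missing, since there is no earlier value to carry forward; that case would need separate handling.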

In [None]:
data = gah.merge_historical_data(lmp_data, generation_capacity, outage_seven_days, zone_to_region)

---
# 3.0 Feature Engineering

Let's introduce congestion-related features/metrics into the merged dataset.

**Note:** Not all will be incorporated into the modelling phases, but rather as a "trial and error" place for me to better understand the market.

## 3.1  Locational Marginal Pricing (LMP) Trends

These features will provide information for understanding LMP trends and volatility at each node on an hourly basis in each region.

### 3.1.1 LMP Delta

**LMP Delta:** Tracks the hourly change in LMP for each pricing node.

**Formula:** $\text{LMP Delta} = \text{LMP}_{t} - \text{LMP}_{(t-1)}$

**LMP Absolute Delta:** Tracks the magnitude of the changes in LMP.

**Formula:** $\text{LMP Absolute Delta} = |\text{LMP}_{t} - \text{LMP}_{(t-1)}|$
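
For reference, both formulas can be sketched with a per-node `diff`. The column names (`datetime`, `pnode_name`, `lmp`) are assumptions and this is not necessarily how `gah.create_lmp_delta` is implemented.

```python
import pandas as pd

# Hypothetical hourly LMPs for two pricing nodes; names and values are illustrative.
lmp = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"] * 2
    ),
    "pnode_name": ["NODE_A"] * 3 + ["NODE_B"] * 3,
    "lmp": [30.0, 45.0, 25.0, 20.0, 20.0, 35.0],
})

lmp = lmp.sort_values(["pnode_name", "datetime"])
# LMP_t - LMP_(t-1), computed separately for each node.
lmp["lmp_delta"] = lmp.groupby("pnode_name")["lmp"].diff()
lmp["lmp_abs_delta"] = lmp["lmp_delta"].abs()

print(lmp)
```

Grouping by node before differencing keeps the first hour of each node as missing instead of subtracting the last price of the previous node.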

In [None]:
data = gah.create_lmp_delta(data)

### 3.1.2 LMP Volatility 

A rolling standard deviation over a 24-hour window provides a measure of LMP variability and can help identify price instability at specific nodes or regions.
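
A minimal sketch of a 24-hour rolling standard deviation per node follows. The column names and the choice to require a full 24 observations before emitting a value are assumptions; `gah.create_lmp_volatility` may use different settings.

```python
import numpy as np
import pandas as pd

# 48 hourly timestamps; a flat price series for NODE_A and an oscillating one for NODE_B.
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h")
prices = pd.DataFrame({
    "datetime": list(hours) * 2,
    "pnode_name": ["NODE_A"] * 48 + ["NODE_B"] * 48,
    "lmp": np.concatenate([np.full(48, 30.0), 30.0 + 5.0 * np.sin(np.arange(48))]),
})

prices = prices.sort_values(["pnode_name", "datetime"])
# Rolling 24-row (24-hour) standard deviation, computed within each node.
prices["lmp_volatility"] = prices.groupby("pnode_name")["lmp"].transform(
    lambda s: s.rolling(window=24, min_periods=24).std()
)

print(prices.groupby("pnode_name")["lmp_volatility"].describe())
```

The flat node ends up with zero volatility once the window is full, while the oscillating node shows a positive value.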

In [None]:
data = gah.create_lmp_volatility(data)

## 3.2 Outage Metrics 

These metrics are critical for understanding grid performance, as they reflect the capacity and reliability of the system.

### 3.2.1 Forced Outage Percentage

This feature measures the share of outages due to unplanned (forced) events.

**Higher Forced Outage Percentage** indicates greater system stress or unexpected maintenance issues.

**Formula:** $\text{Forced Outage Percentage} = \frac{\text{Forced Outages (MW)}}{\text{Total Outages (MW)}} \times 100$
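
The formula can be sketched as below. The column names are assumptions, and returning 0 when there are no outages at all is an illustrative choice, not something confirmed by `gah.create_forced_outage_pct`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily outage values (MW).
outages = pd.DataFrame({
    "forced_outages_mw": [200.0, 0.0, 50.0],
    "total_outages_mw": [800.0, 0.0, 500.0],
})

# Replace a zero denominator with NaN to avoid division by zero,
# then treat "no outages at all" as a 0% forced share (an assumption).
total = outages["total_outages_mw"].replace(0, np.nan)
outages["forced_outage_pct"] = (
    outages["forced_outages_mw"] / total * 100
).fillna(0.0)

print(outages)
```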


In [None]:
data = gah.create_forced_outage_pct(data)

### 3.2.2 Outage Intensity

This feature measures how much of the available generation capacity is affected by outages at a node, zone, or region.

**Formula:** $\text{Outage Intensity} = \frac{\text{Total Outages (MW)}}{\text{Economic Max (MW)}} \times 100$


Note:
- Total Outages is daily data.
- Economic Max is hourly data

This feature is computed at daily regional granularity, with Economic Max converted to daily averages. This avoids introducing artificial hourly variability and preserves interpretability.
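
The daily alignment described above might be sketched as follows. The frames and column names are hypothetical, and the region dimension is omitted for brevity; `gah.create_outage_intensity` may be structured differently.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly Economic Max (MW) over two days.
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h")
hourly = pd.DataFrame({
    "datetime": hours,
    "economic_max_mw": np.repeat([100_000.0, 80_000.0], 24),
})

# Hypothetical daily total outages (MW).
daily_outages = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "total_outages_mw": [10_000.0, 12_000.0],
})

# Collapse the hourly series to one average value per day, then join on the date.
hourly["date"] = hourly["datetime"].dt.normalize()
daily_econ_max = hourly.groupby("date", as_index=False)["economic_max_mw"].mean()
daily = daily_outages.merge(daily_econ_max, on="date", how="left")
daily["outage_intensity"] = daily["total_outages_mw"] / daily["economic_max_mw"] * 100

print(daily)
```

Averaging first means each day gets a single intensity value, instead of a daily outage figure divided by 24 different hourly denominators.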

In [None]:
data = gah.create_outage_intensity(data)

## 3.3 Stress Indicators

Stress indicators are equally crucial for capturing grid stability and identifying congestion risks. The following features should provide insight into how different regions, and the system as a whole, are coping with outages, capacity constraints, and demand surges.

### 3.3.1 Capacity Margin

The available generation capacity (represented by Economic Max, Emergency Max, and Total Committed) directly impacts grid stress and congestion:
- Low capacity margins: A small buffer between Economic Max and Total Committed leaves the grid vulnerable to congestion and price spikes.
- High capacity margins: Ample available generation allows the system to respond flexibly to unexpected demand or transmission constraints, reducing congestion and stabilizing prices.

This feature will provide information on the buffer available to meet unexpected demand or supply fluctuations.

**Formula:** $\text{Capacity Margin} \left( \% \right) = \frac{\text{Economic Max} - \text{Total Committed}}{\text{Economic Max}} \times 100$
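
A small worked example of the formula, using assumed column names (`economic_max_mw`, `total_committed_mw`) and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly values (MW): comfortable, exactly at the limit, and over-committed.
gen = pd.DataFrame({
    "economic_max_mw": [100_000.0, 100_000.0, 100_000.0],
    "total_committed_mw": [90_000.0, 100_000.0, 105_000.0],
})

gen["capacity_margin"] = (
    (gen["economic_max_mw"] - gen["total_committed_mw"]) / gen["economic_max_mw"] * 100
)

print(gen)
```

The margin turns negative whenever Total Committed exceeds Economic Max, which is how negative values can appear in the distribution of this feature.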

In [None]:
data = gah.create_capacity_margin(data)

### 3.3.2 Region Stress Ratio

This feature will aid in comparing regional outages to system-wide outages to identify disproportionately stressed regions.

**Formula:** $\text{Region Stress Ratio} = \frac{\text{Total Outages (MW)}}{\text{RTO Total Outages (MW)}} \times 100$
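
As a sketch, assuming each regional row already carries the matching RTO-wide total in a column named `rto_total_outages_mw` (a hypothetical name), the ratio is a simple row-wise division:

```python
import numpy as np
import pandas as pd

# Hypothetical regional outages alongside the RTO-wide total for the same day (MW).
outages = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01"] * 2 + ["2024-01-02"] * 2),
    "region": ["Western", "Mid Atlantic - Dominion"] * 2,
    "total_outages_mw": [6_000.0, 4_000.0, 3_000.0, 9_000.0],
    "rto_total_outages_mw": [10_000.0, 10_000.0, 12_000.0, 12_000.0],
})

outages["region_stress_ratio"] = (
    outages["total_outages_mw"] / outages["rto_total_outages_mw"] * 100
)

print(outages)
```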

In [None]:
data = gah.create_region_stress_ratio(data)

### 3.3.3 Emergency Trigger

**Emergency Triggered:** This feature flags when the current demand exceeds the optimal limit (Economic Max).

**Formula:** $\text{Emergency Triggered} = \text{Total Committed} \gt \text{Economic Max}$

In [None]:
data = gah.create_emergency_triggered(data)

### 3.3.4 Near Emergency Threshold

**Near Emergency Threshold:** This feature is meant to signal an early warning for grid stress before reaching full capacity under normal operating conditions. The threshold here is set at 95% of the Economic Max.

**Formula:** $\text{Near Emergency Threshold} = \text{Total Committed} \gt 0.95 \times \text{Economic Max}$
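
Both binary flags (Emergency Triggered and Near Emergency Threshold) reduce to simple comparisons. The sketch below uses assumed column names and illustrative values; the constant `NEAR_EMERGENCY_SHARE` is a hypothetical name for the 95% threshold stated above.

```python
import pandas as pd

# Hypothetical hourly values (MW).
gen = pd.DataFrame({
    "economic_max_mw": [100_000.0] * 4,
    "total_committed_mw": [90_000.0, 96_000.0, 100_000.0, 101_000.0],
})

NEAR_EMERGENCY_SHARE = 0.95  # threshold from the text: 95% of Economic Max

gen["emergency_triggered"] = gen["total_committed_mw"] > gen["economic_max_mw"]
gen["near_emergency"] = (
    gen["total_committed_mw"] > NEAR_EMERGENCY_SHARE * gen["economic_max_mw"]
)

print(gen)
```

With the formulas as written, every hour flagged as an emergency is also flagged as a near emergency, so the two indicators are nested rather than mutually exclusive.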

In [None]:
data = gah.create_near_emergency(data)

### 3.3.5 Congestion Stress

**Congestion Stress:** This feature is designed to potentially identify congestion opportunities by capturing both the severity and unpredictability of generator outages. It combines two key outage indicators: Forced Outage Percentage (measuring unexpected failures) and Outage Intensity (measuring total outages relative to available capacity). 

**Formula:** $\text{Congestion Stress} = \text{Forced Outage (\%)} \times \text{Outage Intensity (\%)}$

**Note:** Higher values indicate a greater likelihood of congestion due to unexpected generation shortfalls.
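
A literal reading of the formula multiplies two percentages, which gives a value on a percent-squared scale. The sketch below shows that raw product alongside a version divided by 100 to bring it back to a percent scale. The notebook does not state which scale `gah.create_congestion_risk` uses, so both columns and their names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily inputs, both already expressed in percent.
stress = pd.DataFrame({
    "forced_outage_pct": [25.0, 50.0],
    "outage_intensity": [8.0, 10.0],
})

# Raw product of the two percentages, exactly as the formula is written.
stress["congestion_stress_raw"] = stress["forced_outage_pct"] * stress["outage_intensity"]
# Optional rescaling back to a percent scale (an assumption, not confirmed by the source).
stress["congestion_stress_pct"] = stress["congestion_stress_raw"] / 100

print(stress)
```

On the rescaled version, 25% forced outages combined with 8% outage intensity gives 2%, which can be read as forced outages amounting to 2% of Economic Max.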

In [None]:
data = gah.create_congestion_risk(data)

## Output Data

In [None]:
data.to_parquet(os.path.join(dataframe_file_path, f"{output_file_name}.parquet"), index=False, engine="pyarrow")

---
# 4.0 EDA

In [None]:
dataframe_file_path = "./dataframes/" # Change to your folder
output_file_name = "20200701_000000_20240801_000000_grid_data"
####################################################################################################
data = pd.read_parquet(os.path.join(dataframe_file_path, f"{output_file_name}.parquet"), engine="pyarrow")

## 4.1 Check for Missing Data

This should not be an issue, since missing data has been handled at every step of the data creation process. However, for the sake of completeness, we will check.

In [None]:
missing_data = data.isnull().sum()
missing_columns = missing_data[missing_data > 0]
print("Columns with Missing Values:\n", missing_columns)

## 4.2 Feature Exploration

### 4.2.1 LMP Delta

*Note:* This feature has hourly granularity and is region specific.

Let's check for extreme values or unexpected trends.

In [None]:
gah.plot_interactive_histogram(data=data, feature="lmp_delta", feature_units="($/MWh)", bins=100)

The distribution shows most price changes are near zero but there are some extreme outliers.

In [None]:
gah.plot_interactive_boxplot(data=data, feature="lmp_delta", feature_units="($/MWh)")

Reducing the lower and upper outlier thresholds to $\pm 200$, the distributions across regions appear similar, but the significant outliers may be an indication of localized price volatility.

In [None]:
gah.plot_interactive_timeseries(data=data, feature="lmp_delta", feature_units="($/MWh)")

The "Mid Atlantic - Dominion" Region may have frequent outliers on an hourly basis, signaling that specific nodes or congestion points are potentially influencing prices there.

Let's further investigate the outliers using IQR.
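
For context, the IQR rule flags values that fall far outside the middle 50% of the data. The sketch below uses the conventional 1.5 × IQR fences on made-up values; the multiplier actually used inside `gah.plot_interactive_timeseries_scatter_outliers` is not shown in this notebook, so treat it as an assumption.

```python
import pandas as pd

# Hypothetical hourly LMP deltas ($/MWh) with one extreme value.
deltas = pd.Series([-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 250.0], name="lmp_delta")

q1, q3 = deltas.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional Tukey fences

outliers = deltas[(deltas < lower) | (deltas > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```

Because the fences are derived from quartiles, a handful of extreme spikes does not widen them the way it would widen a mean and standard deviation based rule.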

In [None]:
gah.plot_interactive_timeseries_scatter_outliers(data=data, feature="lmp_delta", feature_units="($/MWh)")

It is not surprising that the winter months exhibit LMP Delta outliers. This data could be useful alongside fuel prices: if fuel price data were available, one could determine whether the Winter 2021 outliers are associated with extreme weather or fuel price spikes.

The top 10 pricing nodes in the Western region with absolute hourly LMP changes over $100/MWh:

In [None]:
gah.interactive_pnode_lmp_outliers(data=data, lmp_feature="lmp_delta")

### 4.2.2 LMP Volatility

*Note:* This feature has hourly granularity and is region specific.

Let's check for extreme values or unexpected trends.

In [None]:
gah.plot_interactive_histogram(data=data, feature="lmp_volatility", feature_units="($/MWh)", bins=250)

The distribution of LMP volatility is right-skewed, with most values concentrated near zero. This suggests that price swings are typically minor and that extreme events, while they do occur, are rare.

In [None]:
gah.plot_interactive_boxplot(data=data, feature="lmp_volatility", feature_units="($/MWh)")

The data shows high volatility outliers in the "Mid Atlantic - Dominion" region compared to "Western". With a \$50/MWh upper threshold applied, "Mid Atlantic - Dominion" still has a broader range of volatility than "Western", with both regions having their volatility concentrated below \$10/MWh.

In [None]:
gah.plot_interactive_rolling_timeseries(data=data, feature="lmp_volatility", feature_units="($/MWh)")

"Mid Atlantic - Dominion" exhibits consistently higher volatility spikes compared to the "Western" region, with pronounced peaks around early 2021, early 2023, and early 2024. The "Western" region displays a generally stable volatility range, but smaller, less frequent spikes occur around similar timeframes. Both regions show periods of elevated volatility during winter months, which aligns with seasonal factors like increased heating demand and weather-related disruptions.

### 4.2.3 Forced Outage Percentage

*Note:* This feature has daily granularity and is region specific.

In [None]:
forced_out_pct_data = gah.handle_forced_outage_pct_data(df=data)

Verify whether the percentage values are within a reasonable range and see how forced outages vary across regions and time.

In [None]:
gah.plot_interactive_timeseries(data=forced_out_pct_data, feature="forced_outage_pct", feature_units="(%)")

This plot effectively shows the trend of forced outages across the two regions and the whole PJM RTO. Let's take a look at a 14-day rolling average.

In [None]:
gah.plot_interactive_rolling_timeseries(data=forced_out_pct_data, feature="forced_outage_pct", feature_units="(%)", hourly_roll=False)

The rolling windows provide a smoother view of how forced outages behave in each region over time (i.e., seasonal patterns are clearly visible). This could be useful for understanding which regions underperform or overperform during critical periods.

In [None]:
gah.plot_interactive_boxplot(data=forced_out_pct_data, feature="forced_outage_pct", feature_units="(%)")

The regional distribution is well-represented. You can see that "Western" has a wider spread and slightly higher median compared to "Mid Atlantic - Dominion."

In [None]:
gah.plot_interactive_calendar_heatmap(data=forced_out_pct_data, feature="forced_outage_pct", feature_units="(%)")

The calendar heatmap serves as a system-wide visualization that focuses on all regions combined. This is best for spotting overall trends, seasonality, and anomalies at a glance. Furthermore, the region filter allows the data to be separated to a regional level to better understand region-specific events.

### 4.2.4 Outage Intensity

*Note:* This feature has daily granularity and is region specific.

In [None]:
outage_intensity_data = gah.handle_outage_intensity_data(data)

Analyze whether the percentage values are within a reasonable range and see how outage intensity varies across regions and time.

In [None]:
gah.plot_interactive_timeseries(data=outage_intensity_data, feature="outage_intensity", feature_units="(%)")

The outage intensity shows a cyclical pattern, with peaks in the fall and spring periods. Lower outage intensity occurs during summer and winter, aligning with periods of higher energy demand (e.g., heating and cooling seasons). This seems to be a result of strategic planning by grid operators to ensure grid reliability during periods of high electricity demand.

In [None]:
gah.plot_interactive_boxplot(data=outage_intensity_data, feature="outage_intensity", feature_units="(%)")

Overall, it appears both regions display a similar distribution pattern. This suggests that outage intensities might be driven by common underlying factors across regions, such as weather or grid stress.

### 4.2.5 Capacity Margin

*Note:* This feature has hourly granularity but is NOT region specific.

In [None]:
capacity_margin_data = gah.handle_capacity_margin_data(data)

Let's take a look at how capacity margin is distributed.

In [None]:
gah.plot_interactive_histogram(data=capacity_margin_data, feature="capacity_margin", feature_units="(%)", region_filter=False)

The distribution appears left-skewed, with high frequency near zero and some extreme negative margins. The long left tail (the negative values) could reflect rare but critical situations where demand far exceeds available supply, creating potential grid instability.

Over time:

In [None]:
gah.plot_interactive_timeseries(data=capacity_margin_data, feature="capacity_margin", feature_units="(%)", region_filter=False)

In [None]:
gah.plot_interactive_boxplot(data=capacity_margin_data, feature="capacity_margin", feature_units="(%)", region_filter=False)

Capacity Margin fluctuates consistently around 0%, with occasional dips into strongly negative territory. There does not appear to be a strong upward or downward trend in the overall capacity margin over time. The Winter months (December, January, February) show a tendency for higher variability in capacity margin, with sharp drops observed during some periods. This could correspond to increased stress on the grid during colder months due to heating demand. The Summer months (June, July, August) show less dramatic variability but still exhibit dips, likely due to cooling demand (air conditioning).

Let's consider whether there is any relationship between Outage Intensity and Capacity Margin. This could provide meaningful insights because it explores how the available capacity buffer (margin) influences or correlates with the system's outage levels.

From a market perspective, if lower capacity margins consistently coincide with higher outage intensity, it may reflect underlying issues in system planning or operational flexibility, influencing market pricing (e.g., LMP volatility).

In [None]:
cm_oi_df = gah.handle_capacity_margin_outage_intensity_data(data)
gah.plot_interactive_scatter_two_features(data=cm_oi_df, feature_x="daily_capacity_margin", feature_y="daily_outage_intensity", x_units="(%)", y_units="(%)", region_filter=False, add_best_fit=True)

There is a clear negative linear relationship between capacity margin and outage intensity. As capacity margin increases (indicating more generation capacity is available), outage intensity decreases. This is expected as systems with more capacity buffer are less prone to stress.

### 4.2.6 Region Stress Ratio

*Note:* This feature has daily granularity and is region specific.

In [None]:
rsr_data = gah.handle_region_stress_ratio_data(df=data)

Evaluate the Region Stress Ratio over time to pinpoint regions that are disproportionately stressed compared to the overall grid.

In [None]:
gah.plot_interactive_timeseries(data=rsr_data, feature="region_stress_ratio", feature_units="(%)")

Note: Total outages (MW), which region stress ratio is derived from, is the expected outages for the current day as projected at the start of that day.

The mirror effect in the plot can be explained by the fact that there are only two regions (Western and Mid-Atlantic Dominion) in the dataset and the total RTO outages are a fixed amount on any given day. The stress ratios for the individual regions are therefore interdependent: when one region's share rises, the other's falls.
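
The mirror effect can be reproduced with a toy example. The sketch assumes the RTO total is exactly the sum of the two regional values, which is a simplification, and uses made-up numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical daily outages (MW) for the two regions.
western_mw = pd.Series([6_000.0, 3_000.0, 5_000.0, 7_500.0])
mid_atlantic_mw = pd.Series([4_000.0, 9_000.0, 5_000.0, 2_500.0])
rto_mw = western_mw + mid_atlantic_mw  # assumption: RTO total = sum of both regions

western_share = western_mw / rto_mw * 100
mid_atlantic_share = mid_atlantic_mw / rto_mw * 100

# The two shares always add up to 100, so they move in exactly opposite directions.
print((western_share + mid_atlantic_share).round(6).tolist())
print(western_share.corr(mid_atlantic_share))
```

Under that assumption the two shares are perfectly negatively correlated, so plotting both mostly shows the same information twice.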

In [None]:
lmp_sr_df = gah.handle_region_stress_ratio_lmp_vol_data(df=data)
gah.plot_interactive_scatter_two_features(data=lmp_sr_df, feature_x="daily_region_stress_ratio", feature_y="daily_lmp_volatility", x_units="(%)", y_units="($/MWh)")

Surprisingly, the lack of a clear trend or slope indicates that higher or lower RSR values do not strongly correlate with daily LMP volatility. Most LMP volatility values are below \$4/MWh, regardless of the RSR value. This clustering suggests that most of the market operates under stable conditions even as regional stress varies. There are outliers with very high volatility (up to \$16/MWh), but these are rare and not systematically linked to RSR increases. Potentially, RSR reflects longer-term system stress, while LMP volatility responds to immediate market pressures, which would explain the weak correlation. Moreover, the grid likely has the flexibility in other regions to absorb local outages, resulting in a low impact on LMP volatility.

Next, let's explore the relationship between region stress ratio and capacity margin. This could help explain how system-wide stress interacts with system resilience, which is essential for identifying grid vulnerabilities in different seasons and for predicting potential emergencies when stress ratios rise and capacity margins fall. The two features have different levels of granularity: region stress ratio is daily and region specific, while capacity margin is hourly and not region specific. Both features are therefore averaged across all regions for each day. Although regional granularity is lost due to averaging, this relationship could still provide key insights into grid performance.

In [None]:
cm_sr_df = gah.handle_region_stress_ratio_capacity_margin_data(df=data)
gah.plot_interactive_scatter_two_features(data=cm_sr_df, feature_x="daily_region_stress_ratio", feature_y="daily_capacity_margin", x_units="(%)", y_units="(%)", region_filter=False)

Across all seasons, although the data is quite dispersed, there is a negative trend indicating higher region stress ratios are associated with lower capacity margins. Summer shows more consistent and higher stress ratios, while winter has a broader range of capacity margins, highlighting seasonal differences. Lastly, Spring & Fall are transitional, with less extreme conditions overall.

Lastly, let's explore the relationship between region stress ratio and outage intensity. Outage Intensity reflects grid reliability, while Region Stress Ratio measures the strain or utilization of the grid. Similar to the analysis above, as outage intensity data is daily granular with no region specific data, region stress ratio data is again averaged across all regions for each day. Although regional granularity is lost, understanding this relationship could help quantify outage probabilities by season or stress intensity.

In [None]:
oi_sr_df = gah.handle_region_stress_ratio_outage_intensity_data(df=data)
gah.plot_interactive_scatter_two_features(data=oi_sr_df, feature_x="daily_region_stress_ratio", feature_y="daily_outage_intensity", x_units="(%)", y_units="(%)", region_filter=False)

When examining the scatterplot across all seasons, we observe a positive correlation between the daily region stress ratio and daily outage intensity. The overall trendline suggests that as grid stress increases, outage intensity also rises, although the variability in this relationship differs across seasons.

In spring and fall, the points are more tightly clustered, reflecting less volatility in the relationship between stress ratio and outage intensity. This likely corresponds to stable grid operations during these transitional periods when demand is moderate and predictable.

In winter, the trendline appears steeper, indicating a more pronounced relationship between stress ratio and outage intensity. This is likely driven by the increased grid stress caused by extreme cold weather and peak heating demand, which may heighten the risk of outages.

Summer also shows a positive correlation, but with slightly more spread compared to winter. This could reflect the variability introduced by extreme heat, air-conditioning loads, and occasional severe weather events like storms.

Overall, winter and summer emerge as critical seasons for analyzing outage probabilities due to the higher grid stress and more pronounced trends, while spring and fall offer a baseline of stability in comparison.

### 4.2.7 Congestion Risk

*Note:* This feature has daily granularity and is region specific.

In [None]:
gah.plot_interactive_rolling_timeseries(data=data, feature="congestion_risk_by_region", feature_units="(%)")

From the time series plot, there are clear "peaks" occurring in both winter (Dec–Feb) and summer (June–Aug). This is expected, as it aligns with high heating and cooling demand periods, which typically stress the grid. The Western region (red line) exhibits greater variability compared to Mid-Atlantic Dominion (blue line), suggesting that transmission constraints or generation mix differences might be at play. Notably, major congestion spikes between Dec 2022 and Jan 2023 stand out, likely driven by extreme winter conditions or unexpected generator outages (i.e., Winter Storm Elliott). These peaks suggest that winter congestion risk is not just high but also volatile, requiring further investigation into system stressors during these periods. One thing to note is that variability is lower in the shoulder months, but not by a significant amount. This is due to the higher outage intensity in the shoulder months, which aligns with strategic planning by grid operators.

In [None]:
gah.plot_interactive_calendar_heatmap(data=data, feature="congestion_risk_PJM", feature_units="(%)")

The heatmap provides a broader temporal view of potential congestion risk, showing how it fluctuates across different months and days. The most notable pattern is the persistent increase in congestion during winter (Dec–Feb) and summer (July–Aug), reinforcing the findings from the time series analysis. Additionally, there are isolated extreme congestion days (represented by darker blue spikes), particularly around Jun 2022 to Aug 2022 and near the end of Dec 2022, which may indicate major system disturbances, transmission constraints, or sudden demand surges. In contrast, spring and fall exhibit much lower congestion risk, reflecting the generally stable grid conditions during these months. The visualization suggests that grid stress is highly seasonal, with extreme congestion days warranting further investigation to determine whether they were caused by unexpected forced outages, generator shortfalls, or transmission limitations.

In [None]:
gah.plot_interactive_histogram(data=data, feature="congestion_risk_PJM", feature_units="(%)", bins=100)


The histogram distributions further emphasize the seasonal variations in congestion risk. Spring and fall exhibit the lowest congestion risk, with values clustering between 1.5%–2.5% and a relatively low standard deviation, indicating stable grid conditions. Summer congestion risk is higher on average (2.74%) and more variable, likely due to increased cooling demand and transmission constraints. Notably, there are some extreme congestion events reaching up to 6%, highlighting occasional grid stress. Winter congestion risk has the most extreme spikes, with a wide spread of values and some instances exceeding 8%–10%. This suggests that while winter congestion is less frequent than summer, when it does occur, it tends to be much more severe, possibly driven by storms, fuel shortages, or extreme cold weather. The findings confirm that summer and winter require the most attention for congestion risk mitigation, whereas spring and fall remain the most stable periods for the grid.

**Note:** On later reflection, gathering more data on specific paths (i.e., more granularity in the data) and applying the same analysis may be a good way to spot potential recurring congestion risk.

### 4.2.8 Emergency Trigger & Near Emergency Trigger

*Note:* Both features have hourly granularity but are NOT region specific.

In [None]:
trigger_data = gah.handle_daily_gen_capacity_data(df=data)

Let's evaluate these two binary indicators by season.

In [None]:
gah.plot_interactive_feature_count_breakdown_by_season(data=trigger_data, features=["emergency_triggered", "near_emergency"], region_filter=False)

Most events fall under the "Spring & Fall" season, with relatively fewer in "Summer" and "Winter". Near-emergencies occur significantly more frequently than full emergencies in all seasons. Possible rationale: These are transitional periods where generation capacity may be temporarily offline for maintenance or upgrades. This can lead to more stress events.

In [None]:
gah.plot_interactive_frequency_timeseries(data=trigger_data, features=["emergency_triggered", "near_emergency"], region_filter=False)

By Monthly (M) aggregation, Near-Emergencies are more frequent than Emergency-Triggers, and both fluctuate seasonally. Full emergencies are rare but spike dramatically during a few months. Peaks align with higher grid stress periods, like extreme temperatures in summer (high cooling demand) and winter (high heating demand).

By Hourly aggregation, Emergency-Triggers are more common during the early morning, while near-emergencies are relatively stable throughout the day but decline slightly in the late afternoon/evening. There seems to be morning ramp ups which may be directly related to demand surges as people wake up. After work hours, Near-Emergencies appear to slightly kick up as demand spikes again due to cooling/heating, cooking, and lighting.

### 4.2.9 Other Analysis

Here we will look at some relationships between the raw data from PJM vs. the engineered features.

Let's evaluate the relationship between forced outages vs. capacity margin. This could help uncover the grid's vulnerability to disruptions and identify stress periods.

In [None]:
fo_cm_data = gah.handle_forced_outages_capacity_margin_data(df=data)
gah.plot_interactive_scatter_two_features(data=fo_cm_data, feature_x="daily_capacity_margin", feature_y="daily_forced_outages_mw", x_units="(%)", y_units="(MW)", region_filter=False, add_best_fit=True)

The scatter plots reveal varying relationships between capacity margin and forced outages MW across seasons. Summer shows a clear negative relationship, where lower capacity margins correspond to higher forced outages, likely due to peak demand and grid stress. Spring and fall display a mostly flat relationship, indicating stable conditions with minimal correlation. Winter exhibits a slightly positive trend, where higher capacity margins are associated with slightly higher forced outages, potentially reflecting challenges like extreme weather or maintenance issues. These seasonal differences highlight the importance of tailored grid management strategies for each period, especially in summer when the grid is under the most stress.

In [None]:
gah.plot_interactive_timeseries_two_features(data=fo_cm_data, feature1="daily_capacity_margin", feature2="daily_forced_outages_mw", feature1_units="(%)", feature2_units="(MW)", region_filter=False)

The time series plot shows distinct seasonal patterns in daily capacity margin (blue) and daily forced outages MW (red). Capacity margin fluctuates cyclically, with peaks in summer and winter, reflecting increased demand during extreme weather periods, while valleys occur in milder spring and fall seasons. Forced outages exhibit occasional sharp spikes, often coinciding with low capacity margins, particularly during winter. The seasonal bands further highlight these trends, emphasizing the inverse and volatile relationship between capacity margin and forced outages, especially during high-demand periods. This underscores the importance of seasonal planning for grid reliability.

Let's evaluate the relationship between forced outages vs. LMP volatility. This is essential because LMP volatility reflects market stress and supply-demand imbalances, which may be driven by unexpected forced outages.

In [None]:
fo_lv_data = gah.handle_forced_outages_lmp_vol_data(data)
gah.plot_interactive_scatter_two_features(data=fo_lv_data, feature_x="daily_lmp_volatility", feature_y="daily_forced_outages_mw", x_units="($/MWh)", y_units="(MW)", add_best_fit=True)

Across all regions and seasons, there is a slight positive trend, suggesting that higher LMP volatility is loosely associated with increased forced outages. However, the relationship is weak, with significant scatter in the data.

For the "Mid Atlantic - Dominion" region, the trend strengthens slightly in summer, where outages increase more sharply with rising volatility. This may highlight seasonal strain on the grid due to peak demand. In spring and fall, the data shows clustering with a moderate positive trend, reflecting steadier market behavior. Winter shows higher variability, with some extreme values indicating isolated grid stress events.

For the "Western" region, a consistent positive trend emerges across seasons, with sharper increases during summer and winter, likely driven by seasonal temperature extremes. Spring and fall exhibit tighter clustering and lower forced outages overall, signaling more stable grid conditions.

In [None]:
gah.plot_interactive_timeseries_two_features(data=fo_lv_data, feature1="daily_lmp_volatility", feature2="daily_forced_outages_mw", feature1_units="($/MWh)", feature2_units="(MW)")

The time series analysis of daily LMP volatility and forced outages reveals distinct seasonal patterns across regions. Both metrics peak during winter and summer, coinciding with periods of extreme demand, though their relationships vary by region. In the All Regions view, LMP volatility exhibits sharp spikes during seasonal extremes, while forced outages fluctuate more consistently, suggesting a mix of operational and external factors driving outages. In the "Mid Atlantic - Dominion" region, LMP volatility shows clear seasonal spikes, particularly in winter, while forced outages remain steady with slight increases during colder months, indicating a decoupling of market dynamics and operational stress. Conversely, the "Western" region shows a stronger alignment between LMP volatility and forced outages, with both peaking in winter and summer, reflecting more synchronized grid and market pressures.

---
# 5.0 Models



For simplicity, only three different types of models will be considered for each target variable.

In [None]:
data = pd.read_parquet(os.path.join(dataframe_file_path, f"{output_file_name}.parquet"), engine="pyarrow")

## 5.1 Emergency Triggers

- **Grid Reliability:** Predicting emergency triggers helps identify potential grid stress events, enabling operators and market participants to proactively manage supply-demand imbalances.
- **Operational Preparedness:** Early predictions can assist in ensuring sufficient reserves, avoiding outages, and maintaining grid reliability during periods of high stress.
- **Market Impacts:** Emergency conditions often lead to volatile market prices, making this prediction valuable for traders and policymakers to mitigate financial risks.

**Model Specifications:**

- **Target Variable**: 'emergency_triggered' - Binary 0 (Not Triggered) vs. 1 (Triggered)
- **Features:** 
    - **Temporal Features:** 'hour, day_of_week, month, is_weekend, and season' capture time-based patterns in the grid's operation.
    - **Lagged Features:**
        - near_emergency: If the grid is near its emergency state, the likelihood of triggering an emergency increases.
        - capacity_margin: Shows how close the system has been to resource limits.
        - lmp_volatility: Reflects pricing stress, which could precede emergency conditions.
        - region_stress_ratio: Captures historical stress levels in a region.
    - **Rolling Averages:** 
        - lmp_volatility and region_stress_ratio: A spike in stress or volatility might last for hours/days before triggering an emergency. 
        - **Note:** *Rolling averages smooth noisy data and capture broader trends*
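
The lagged and rolling features listed above can be sketched with pandas as follows. This is an illustration under assumptions, not the logic inside `gah.emergency_trigger_set_up`: the helper name, the `region` and `datetime` column names, and the lag and window sizes are all hypothetical.

```python
import pandas as pd


def add_lag_and_rolling(df, cols, group_col="region", time_col="datetime", lag=1, window=3):
    """Add per-group lagged values and rolling means built only from past rows."""
    out = df.sort_values([group_col, time_col]).copy()
    for col in cols:
        out[f"{col}_lag_{lag}"] = out.groupby(group_col)[col].shift(lag)
        # Shift before rolling so the window never includes the current row
        out[f"{col}_roll_mean_{window}"] = out.groupby(group_col)[col].transform(
            lambda s: s.shift(1).rolling(window, min_periods=1).mean()
        )
    return out


# Synthetic hourly values for two regions (illustration only)
hours = pd.date_range("2024-07-01", periods=5, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "region": ["A"] * 5 + ["B"] * 3,
    "datetime": list(hours) + list(hours[:3]),
    "lmp_volatility": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0],
})
features = add_lag_and_rolling(demo, ["lmp_volatility"])
print(features)
```

Grouping by region keeps one region's history from leaking into another's lag, and shifting before the rolling mean keeps the current (not yet observed) value out of its own feature.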

In [None]:
modelling_data = gah.emergency_trigger_set_up(data)
gah.walk_forward_validation_classification(data=modelling_data, 
                                            target_column="emergency_triggered", 
                                            model_save_path=model_output_path,
                                            models_to_use=["decision_tree", "random_forest", "light_gbm"])
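
How `gah.walk_forward_validation_classification` splits the data is not shown in this notebook. As a generic illustration of walk-forward validation, the sketch below yields expanding-window train/test indices in which every test block lies strictly after its training block. The function name and its parameters are hypothetical.

```python
import numpy as np


def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    if min_train >= n_samples:
        raise ValueError("min_train must be smaller than n_samples")
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        # The last fold absorbs any remainder so no rows are dropped
        test_end = train_end + fold_size if k < n_folds - 1 else n_samples
        yield np.arange(train_end), np.arange(train_end, test_end)


splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

The rows are assumed to be sorted by time before indexing. scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window scheme.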

## 5.2 LMP Volatility

- **Financial Risk Management:** High volatility in Locational Marginal Prices (LMPs) exposes market participants to price risks. Forecasting volatility enables traders to hedge their positions effectively.
- **Congestion Awareness:** LMP volatility often correlates with transmission congestion. Predicting volatility provides insights into grid bottlenecks and opportunities for congestion management.
- **Resource Optimization:** Accurate forecasts of price swings can guide resource dispatch, improve load forecasting, and optimize demand response strategies.

**Model Specifications:**

- **Target Variable**: 'lmp_volatility' - Continuous
- **Features:** 
    - **Temporal Features:** 'hour, day_of_week, month, is_weekend, and season' capture time-based patterns in the grid's operation.
    - **Lagged Features:** 
        - lmp_volatility: To capture any persistence in price volatility over short to medium timeframes
        - lmp_abs_delta: To capture the effects large changes in LMP prices have on volatility.
        - capacity_margin: To capture hourly fluctuations and how they can impact price volatility.
        - near_emergency: Measures how recent near-emergency events relate to future volatility. This was selected over 'emergency_triggered' because it is a "softer" signal of grid stress: actual emergencies may have an immediate effect but do not provide information about the lead-up to a stressed grid.
    - **Rolling Averages:**
        - lmp_volatility, outage_intensity and region_stress_ratio
    - **Interaction Effects**
        - capacity_margin x region_stress_ratio, capacity_margin x near_emergency, region_stress x outage_intensity, capacity_margin x outage_intensity
        - **Note:** *Lagged values* will be used to create these interaction terms to reflect data availability in real-time.
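
To illustrate the note above, the sketch below builds an interaction term from lagged values so that each row only uses information available before that timestamp. It is an assumption-laden example rather than the code in `gah.lmp_volatility_set_up`: the helper name and the `region` and `datetime` columns are hypothetical.

```python
import pandas as pd


def add_lagged_interactions(df, pairs, group_col="region", time_col="datetime", lag=1):
    """Multiply the lagged values of each feature pair within each group."""
    out = df.sort_values([group_col, time_col]).copy()
    for a, b in pairs:
        lag_a = out.groupby(group_col)[a].shift(lag)
        lag_b = out.groupby(group_col)[b].shift(lag)
        out[f"{a}_x_{b}_lag_{lag}"] = lag_a * lag_b
    return out


# Synthetic values for one region (illustration only)
demo = pd.DataFrame({
    "region": ["A"] * 3,
    "datetime": pd.date_range("2024-07-01", periods=3, freq=pd.Timedelta(hours=1)),
    "capacity_margin": [0.2, 0.1, 0.05],
    "region_stress_ratio": [1.0, 2.0, 4.0],
})
interactions = add_lagged_interactions(demo, [("capacity_margin", "region_stress_ratio")])
print(interactions)
```

The first row of each group is NaN because no earlier observation exists; such rows would need to be dropped or imputed before modelling.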

**Note:** Due to the granularity (and hence the size) of the data, a random forest model will not be used to predict LMP volatility.

In [None]:
modelling_data = gah.lmp_volatility_set_up(data)
gah.walk_forward_validation_regression(data=modelling_data, 
                                            target_column="lmp_volatility", 
                                            model_save_path=model_output_path,
                                            models_to_use=["decision_tree", "light_gbm"])

## 5.3 Forced Outages

- **Enhancing Grid Stability:** Proactively identifying potential forced outages allows operators to mitigate risks of cascading failures, ensuring reliable electricity supply.
- **Managing Market Impacts:** Outage predictions help market participants anticipate price spikes and congestion, optimizing resource allocation and trading strategies.
- **Supporting Infrastructure Planning:** Patterns in forced outages provide insights into aging infrastructure, guiding investments in maintenance and system upgrades.


For predicting forced outages (MW), we need to select features that can help capture relationships between grid performance, stress, and outage patterns.

**Model Specifications:**
- **Target Variable:** 'forced_outages_mw' - Continuous
- **Features:**
    - **Temporal Features:** 'month, day_of_week, is_weekend, and season' capture time-based patterns in the grid's operation.
    - **Lagged Features:**
        - forced_outages_mw: Captures persistence in forced outages.
        - outage_intensity: Measures the severity of the grid's outages over time. Past intensities could indicate stress accumulation, leading to future forced outages.
        - region_stress_ratio: Reflects how stressed a specific region was historically.
    - **Rolling Averages:** 
        - forced_outages_mw, outage_intensity, region_stress_ratio, capacity_margin
    - **Interaction Effects:** outage_intensity x region_stress_ratio, capacity_margin x region_stress

In [None]:
modelling_data = gah.forced_outages_set_up(data)
gah.walk_forward_validation_regression(data=modelling_data, 
                                            target_column="forced_outages_mw", 
                                            model_save_path=model_output_path,
                                            models_to_use=["decision_tree", "random_forest", "light_gbm"])

# 6.0 Comments:

**Note:** The results of this analysis should be interpreted with caution, as the differing granularities of the data (e.g., daily vs. hourly metrics and region-specific vs. system-wide measures) may introduce some inaccuracies or limit the precision of the findings. The ideal dataset would ensure that every feature has both regional and hourly granularity for a more robust and accurate analysis.
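
The granularity caveat can be made concrete with a small sketch. When a daily metric is joined onto hourly rows, every hour of a day receives the same value, so the feature carries no within-day variation. The frames below are synthetic, and the `datetime`, `date`, and `lmp` column names are assumptions for illustration.

```python
import pandas as pd

# Hourly frame: 48 rows covering two days (synthetic values)
hourly = pd.DataFrame({
    "datetime": pd.date_range("2024-07-01", periods=48, freq=pd.Timedelta(hours=1)),
    "lmp": [float(i) for i in range(48)],
})
hourly["date"] = hourly["datetime"].dt.normalize()

# Daily frame: one row per day (synthetic values)
daily = pd.DataFrame({
    "date": pd.date_range("2024-07-01", periods=2, freq=pd.Timedelta(days=1)),
    "daily_forced_outages_mw": [500.0, 800.0],
})

# validate="many_to_one" raises if the daily frame has duplicate dates
merged = hourly.merge(daily, on="date", how="left", validate="many_to_one")

# Each day shows exactly one distinct daily value across its 24 hourly rows
distinct_per_day = merged.groupby("date")["daily_forced_outages_mw"].nunique()
print(distinct_per_day)
```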

Onto the dashboard!