# Electricity Price Prediction Project

## Overview
This project aims to predict future electricity prices using historical weather data and electricity price trends. We will analyze the relationship between weather patterns (temperature, rainfall, sunshine) and electricity costs, using data from multiple UK weather stations and historical electricity price records.

## Dataset
- **Electricity Data**: Historical time series of electricity prices (1920-2024).
- **Weather Data**: Historical weather data from Aberporth, Armagh, Chivenor, Manston, and Wick Airport.

## Methodology
1. **Data Loading & Cleaning**: Load datasets, handle missing values, and standardize formats.
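2. **Feature Engineering**: Aggregate daily/monthly weather data to yearly values (means for temperatures; totals for rainfall, sunshine hours, and air frost days), then average across stations to match the yearly granularity of the electricity prices.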
3. **Exploratory Data Analysis (EDA)**: Visualize trends and correlations between weather and prices.
4. **Machine Learning**: Train a Random Forest Regressor to predict electricity prices.
5. **Evaluation**: Assess model performance using RMSE, MAE, and the R² score.
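
For reference, the evaluation metrics in step 5 can be written out from their definitions. The following is a minimal sketch with made-up numbers; the helper names `rmse` and `r2` are hypothetical, and the notebook itself relies on scikit-learn's implementations.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return float(1.0 - ss_res / ss_tot)

# Illustrative values only (not project data)
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"R2:   {r2(actual, predicted):.4f}")
```

Note that R² is negative whenever the model does worse than predicting the mean of the test targets, which can happen on a short test window.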

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set visualization style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Data Loading and Cleaning

In [None]:
# Load Electricity Data
elec_path = 'Electric_data/Electricity_values_cleaned (1).csv'
df_elec = pd.read_csv(elec_path)

# Select relevant columns
# The file has columns: Year, GDP_Index, Electricity_Price_p_per_kWh, Electricity_CPI_2010_100
df_elec_clean = df_elec[['Year', 'Electricity_Price_p_per_kWh']].copy()
df_elec_clean.columns = ['Year', 'Price_pence_kWh']

# Clean numeric columns
df_elec_clean['Year'] = pd.to_numeric(df_elec_clean['Year'], errors='coerce')
df_elec_clean['Price_pence_kWh'] = pd.to_numeric(df_elec_clean['Price_pence_kWh'], errors='coerce')

# Drop rows with missing values
df_elec_clean = df_elec_clean.dropna()

# Sort by Year
df_elec_clean = df_elec_clean.sort_values('Year')

print(f"Electricity Data Range: {int(df_elec_clean['Year'].min())} - {int(df_elec_clean['Year'].max())}")
df_elec_clean.head()
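
Because `errors='coerce'` turns unparseable entries into NaN and `dropna()` then removes those rows silently, it can help to report how many rows were lost. A minimal sketch on made-up rows (the values below are illustrative, not taken from the project's CSV):

```python
import pandas as pd

# Made-up rows mimicking a messy export
raw = pd.DataFrame({
    "Year": ["2019", "2020", "n/a", "2022"],
    "Price_pence_kWh": ["16.3", "--", "18.9", "28.1"],
})

# Coerce every column to numeric; unparseable cells become NaN
cleaned = raw.apply(pd.to_numeric, errors="coerce")

# Drop incomplete rows and report how many were removed
n_before = len(cleaned)
cleaned = cleaned.dropna()
print(f"Dropped {n_before - len(cleaned)} of {n_before} rows")
```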

In [None]:
# Load and Aggregate Weather Data
weather_files = [
    'Data/cleaned_data/Aberporth_cleaned.csv',
    'Data/cleaned_data/Armagh_cleaned.csv',
    'Data/cleaned_data/Chivenor_cleaned.csv',
    'Data/cleaned_data/Manston_cleaned.csv',
    'Data/cleaned_data/Wick-Airport_cleaned.csv'
]

weather_dfs = []

for file in weather_files:
    # Read file
    df = pd.read_csv(file)
    
    # Convert columns to numeric, coercing errors
    cols_to_clean = ['tmax', 'tmin', 'air_frost_days', 'rain_mm', 'sun_hours']
    for col in cols_to_clean:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
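    # Aggregate by Year
    # Sums use min_count=1 so that a year with no valid readings yields NaN
    # instead of a misleading total of 0
    agg_rules = {
        'tmax': 'mean',
        'tmin': 'mean',
        'air_frost_days': lambda s: s.sum(min_count=1),
        'rain_mm': lambda s: s.sum(min_count=1),
        'sun_hours': lambda s: s.sum(min_count=1)
    }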
    
    # Only apply aggregation to columns that exist in the specific file
    actual_agg_rules = {k: v for k, v in agg_rules.items() if k in df.columns}
    
    df_yearly = df.groupby('year').agg(actual_agg_rules).reset_index()
    weather_dfs.append(df_yearly)

# Concatenate all stations and calculate the National Average
all_weather = pd.concat(weather_dfs)
df_weather_national = all_weather.groupby('year').mean().reset_index()
df_weather_national = df_weather_national.rename(columns={'year': 'Year'})

print(f"Weather Data Range: {int(df_weather_national['Year'].min())} - {int(df_weather_national['Year'].max())}")
df_weather_national.head()
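
The national average for a given year is the mean over whichever stations report that year, so early years may rest on fewer stations than later ones. The sketch below shows one way to check this on two made-up station tables; the `station` column and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Two made-up yearly station tables standing in for the per-station results
station_a = pd.DataFrame({"year": [2000, 2001, 2002], "tmax": [13.0, 13.5, 14.0]})
station_b = pd.DataFrame({"year": [2001, 2002], "tmax": [12.0, 12.5]})

# Tag each table with a station label before stacking
stacked = pd.concat(
    [station_a.assign(station="A"), station_b.assign(station="B")],
    ignore_index=True,
)

# Number of contributing stations and the cross-station mean, per year
coverage = stacked.groupby("year")["station"].nunique()
national = stacked.groupby("year")["tmax"].mean()
summary = pd.DataFrame({"n_stations": coverage, "tmax_mean": national})
print(summary)
```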

In [None]:
# Merge Electricity and Weather Datasets
df_merged = pd.merge(df_elec_clean, df_weather_national, on='Year', how='inner')

print(f"Merged Dataset Shape: {df_merged.shape}")
df_merged.tail()
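
An inner join drops every year that is missing from either table. One way to see which years are lost is an outer join with `indicator=True`, sketched here on toy frames (the numbers are made up):

```python
import pandas as pd

# Toy stand-ins for the price and weather tables
prices = pd.DataFrame({"Year": [1998, 1999, 2000], "Price_pence_kWh": [7.0, 7.1, 7.3]})
weather = pd.DataFrame({"Year": [1999, 2000, 2001], "tmax": [13.2, 13.0, 13.4]})

# The _merge column records whether each year came from one table or both
audit = pd.merge(prices, weather, on="Year", how="outer", indicator=True)

only_prices = audit.loc[audit["_merge"] == "left_only", "Year"].tolist()
only_weather = audit.loc[audit["_merge"] == "right_only", "Year"].tolist()
print("Years with prices but no weather:", only_prices)
print("Years with weather but no prices:", only_weather)
```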

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Correlation Matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df_merged.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix: Electricity Price vs Weather Features')
plt.show()
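
A caveat when reading the heatmap: two series that both trend upward over time can show a high correlation even when their year-to-year changes are unrelated or opposed. A possible cross-check is to correlate first differences as well as levels. The sketch below uses synthetic series constructed to make the effect visible; it is not project data.

```python
import numpy as np
import pandas as pd

years = np.arange(2000, 2020)
t = years - 2000
wiggle = np.where(years % 2 == 0, 1.0, -1.0)  # alternating +1 / -1

# Both series trend upward, but their year-to-year wiggles move in opposite directions
df = pd.DataFrame({
    "price": 10.0 + 0.5 * t + 0.3 * wiggle,
    "tmax": 13.0 + 0.05 * t - 0.03 * wiggle,
})

level_corr = df["price"].corr(df["tmax"])               # dominated by the shared trend
diff_corr = df["price"].diff().corr(df["tmax"].diff())  # year-over-year changes only
print(f"Correlation of levels:      {level_corr:.2f}")
print(f"Correlation of differences: {diff_corr:.2f}")
```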

In [None]:
# Time Series Visualization
fig, ax1 = plt.subplots(figsize=(14, 7))

color = 'tab:red'
ax1.set_xlabel('Year', fontsize=12)
ax1.set_ylabel('Electricity Price (pence/kWh)', color=color, fontsize=12)
ax1.plot(df_merged['Year'], df_merged['Price_pence_kWh'], color=color, linewidth=2, label='Price')
ax1.tick_params(axis='y', labelcolor=color)
ax1.grid(True, alpha=0.3)

ax2 = ax1.twinx()  # Instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Avg Max Temp (°C)', color=color, fontsize=12)
ax2.plot(df_merged['Year'], df_merged['tmax'], color=color, linestyle='--', label='Tmax')
ax2.tick_params(axis='y', labelcolor=color)

plt.title('Historical Trend: Electricity Price vs Temperature', fontsize=16)
fig.tight_layout()
plt.show()

## 3. Machine Learning: Price Prediction

In [None]:
# Feature Selection
# We use Year as a feature to capture the temporal trend, along with weather variables.
features = ['Year', 'tmax', 'tmin', 'rain_mm', 'sun_hours', 'air_frost_days']
target = 'Price_pence_kWh'

X = df_merged[features]
y = df_merged[target]

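# Train-Test Split
# Because this is time-series data, a random split would leak future information into training.
# We train on data up to 2018 and test on the later years (2019-2024).
split_year = 2018

train_mask = X['Year'] <= split_year
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

# Handle any remaining NaNs: impute with training-set means only,
# so that test-period values do not influence the imputation
train_means = X_train.mean()
X_train = X_train.fillna(train_means)
X_test = X_test.fillna(train_means)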

print(f"Training Set: {X_train.shape[0]} samples (Up to {split_year})")
print(f"Testing Set: {X_test.shape[0]} samples (After {split_year})")
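
A single cut-off leaves only a handful of test years. One alternative is an expanding-window evaluation, where each fold trains on all years before a short test block. The generator below is a hypothetical helper (`expanding_window_splits` is not part of the notebook or of any library), sketched with NumPy only.

```python
import numpy as np

def expanding_window_splits(years, first_test_year, horizon=1):
    """Yield (train_mask, test_mask) pairs.

    Each fold tests on `horizon` consecutive years and trains on every
    year strictly before that test block.
    """
    years = np.asarray(years)
    start = first_test_year
    while start <= years.max():
        train_mask = years < start
        test_mask = (years >= start) & (years < start + horizon)
        if train_mask.any() and test_mask.any():
            yield train_mask, test_mask
        start += horizon

# Illustrative year range
years = np.arange(2010, 2025)
folds = list(expanding_window_splits(years, first_test_year=2019, horizon=2))
for train_mask, test_mask in folds:
    print(f"train up to {years[train_mask].max()}, test on {years[test_mask].tolist()}")
```

Averaging the metric over the folds gives a less noisy estimate than a single split, at the cost of refitting the model once per fold.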

In [None]:
# Model Training (Random Forest Regressor)
rf_model = RandomForestRegressor(n_estimators=200, random_state=42, max_depth=10)
rf_model.fit(X_train, y_train)

# Prediction
y_pred = rf_model.predict(X_test)

# Evaluation Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Random Forest Model Performance:")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R2 Score: {r2:.4f}")
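
These scores are easier to interpret next to a naive baseline. The sketch below uses a persistence forecast, which repeats the last training value for every test year; `persistence_forecast` is a hypothetical helper, the numbers are made up, and the training targets are assumed to be sorted by year.

```python
import numpy as np

def persistence_forecast(y_train, n_test):
    """Repeat the last observed training value for every test period."""
    last_value = float(np.asarray(y_train)[-1])
    return np.full(n_test, last_value)

# Made-up series (not project data), ordered by year
y_train_demo = np.array([10.0, 11.0, 12.0])
y_test_demo = np.array([13.0, 15.0])

baseline = persistence_forecast(y_train_demo, len(y_test_demo))
baseline_rmse = float(np.sqrt(np.mean((y_test_demo - baseline) ** 2)))
print(f"Persistence baseline RMSE: {baseline_rmse:.4f}")
```

If the Random Forest does not beat this baseline on the same test years, the weather features are unlikely to be adding predictive value.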

In [None]:
# Visualizing Predictions
plt.figure(figsize=(12, 6))
plt.plot(df_merged['Year'], df_merged['Price_pence_kWh'], label='Actual Price', color='blue', alpha=0.7)
plt.plot(X_test['Year'], y_pred, label='Predicted Price', color='red', linestyle='--', marker='o')
plt.axvline(x=split_year, color='green', linestyle=':', label='Train/Test Split')

plt.title('Electricity Price Prediction: Actual vs Predicted')
plt.xlabel('Year')
plt.ylabel('Price (pence/kWh)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
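
When reading this plot, keep in mind that a Random Forest averages training targets stored in its leaves, so its predictions cannot exceed the highest price seen during training. If prices after the split rise above the training range, the model will under-predict. A small sketch on a synthetic, steadily rising series (illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, strictly increasing "price" series
years = np.arange(2000, 2025)
price = 5.0 + 0.8 * (years - 2000)

# Train on years up to 2018, predict the remaining years
train = years <= 2018
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(years[train].reshape(-1, 1), price[train])
preds = model.predict(years[~train].reshape(-1, 1))

print("Max training price:", price[train].max())
print("Max prediction:    ", preds.max())
print("Actual final price:", price[-1])
```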

In [None]:
# Feature Importance Analysis
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X.columns

plt.figure(figsize=(10, 6))
plt.title("Feature Importance: What drives the price prediction?")
plt.bar(range(X.shape[1]), importances[indices], align="center", color='teal')
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=45)
plt.tight_layout()
plt.show()
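
The importances above are impurity-based and computed on the training data, which can overstate features the model has merely memorised. A possible complement is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data in which only the first feature carries signal; the feature names `signal` and `noise` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only column 0 influences the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

X_fit, X_hold = X_demo[:150], X_demo[150:]
y_fit, y_hold = y_demo[:150], y_demo[150:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each feature in turn on the held-out set and measure the drop in score
result = permutation_importance(model, X_hold, y_hold, n_repeats=10, random_state=0)
for name, mean, std in zip(["signal", "noise"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```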

## Conclusion
This notebook demonstrates an end-to-end machine learning workflow: we aggregated weather data, merged it with electricity prices, and trained a Random Forest Regressor to predict prices for the held-out years after 2018. The feature importance plot indicates whether weather patterns or the simple temporal trend (Year) contributes more to the model's predictions.