<h2 align="center" style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Time Series Forecast of Energy Data: An Advanced Data Analytics Approach</h2>

### **Table of Contents**

- [Introduction](#Introduction)  
  - [Assessment Overview](#Assessment-Overview)  
  - [Key Findings](#Key-Findings)  
- [Install and Import Required Libraries](#Install-and-Import-Required-Libraries)  
- [Load Dataset](#Load-Dataset)
- [Data Exploration](#Data-Exploration)  
  - [Viewing First 5 Rows of Each DataFrame](#Viewing-First-5-Rows-of-Each-DataFrame)  
  - [Statistical Summary of Each DataFrame](#Statistical-Summary-of-Each-DataFrame)  
  - [Check for Missing Values](#Check-for-Missing-Values)
  - [Check for Duplicates](#Check-for-Duplicates)
  - [Data Exploration Summary](#Data-Exploration-Summary)
- [Data Wrangling](#Data-Wrangling)  

<h2 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Introduction</h2>

### **Assessment Overview**

### **Key Findings**

<h2 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Install and Import Required Libraries</h2>

The first step is to install and import the Python libraries needed for data loading, preprocessing, analysis, modeling, and dashboard visualization. Libraries such as `pandas`, `numpy`, `matplotlib`, `seaborn`, `statsmodels`, and `tensorflow`/`keras` support the various stages of this time series forecasting project.

In [1]:
!pip install -q pmdarima numpy pandas matplotlib seaborn plotly statsmodels tensorflow

In [2]:
import os
import time
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
pd.set_option('display.max_rows', 50) # Display 50 rows 
pd.set_option('display.max_columns', None) # Display all columns

<h2 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Load Dataset</h2>

Here, I load the cleaned energy and weather datasets provided for the forecasting task. `energy_df` contains hourly power generation by source along with the actual total load (`total load actual`), while `weather_df` holds hourly weather observations for several cities. Together they form the basis for an integrated forecasting model that draws on both energy consumption and meteorological factors.

In [4]:
energy_df = pd.read_csv('/kaggle/input/cct3-energy-forecast-datasets/datasets/energy (1).csv')
weather_df = pd.read_csv('/kaggle/input/cct3-energy-forecast-datasets/datasets/weather.csv')

<h2 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Data Exploration</h2>

In this section, I perform preliminary data exploration to gain an initial understanding of both datasets. This includes:
- Viewing sample records to understand structure and format
- Summarizing statistical distributions
- Checking for missing values and data completeness
- Identifying any duplicate records

These steps help assess data quality and shape decisions around wrangling.

### **Viewing First 5 Rows of Each DataFrame**

In [5]:
energy_df.head()

Unnamed: 0,time,generation biomass,generation fossil brown coal/lignite,generation fossil coal-derived gas,generation fossil gas,generation fossil hard coal,generation fossil oil,generation fossil oil shale,generation fossil peat,generation geothermal,generation hydro pumped storage aggregated,generation hydro pumped storage consumption,generation hydro run-of-river and poundage,generation hydro water reservoir,generation marine,generation nuclear,generation other,generation other renewable,generation solar,generation waste,generation wind offshore,generation wind onshore,forecast solar day ahead,forecast wind offshore eday ahead,forecast wind onshore day ahead,total load forecast,total load actual,price day ahead,price actual
0,2015-01-01 00:00:00+01:00,447.0,329.0,0.0,4844.0,4821.0,162.0,0.0,0.0,0.0,,863.0,1051.0,1899.0,0.0,7096.0,43.0,73.0,49.0,196.0,0.0,6378.0,17,,6436,26118,25385.0,50.1,65.41
1,2015-01-01 01:00:00+01:00,449.0,328.0,0.0,5196.0,4755.0,158.0,0.0,0.0,0.0,,920.0,1009.0,1658.0,0.0,7096.0,43.0,71.0,50.0,195.0,0.0,5890.0,16,,5856,24934,24382.0,48.1,64.92
2,2015-01-01 02:00:00+01:00,448.0,323.0,0.0,4857.0,4581.0,157.0,0.0,0.0,0.0,,1164.0,973.0,1371.0,0.0,7099.0,43.0,73.0,50.0,196.0,0.0,5461.0,8,,5454,23515,22734.0,47.33,64.48
3,2015-01-01 03:00:00+01:00,438.0,254.0,0.0,4314.0,4131.0,160.0,0.0,0.0,0.0,,1503.0,949.0,779.0,0.0,7098.0,43.0,75.0,50.0,191.0,0.0,5238.0,2,,5151,22642,21286.0,42.27,59.32
4,2015-01-01 04:00:00+01:00,428.0,187.0,0.0,4130.0,3840.0,156.0,0.0,0.0,0.0,,1826.0,953.0,720.0,0.0,7097.0,43.0,74.0,42.0,189.0,0.0,4935.0,9,,4861,21785,20264.0,38.41,56.04


In [6]:
weather_df.head()

Unnamed: 0,dt_iso,city_name,temp,temp_min,temp_max,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,2015-01-01 00:00:00+01:00,Valencia,270.475,270.475,270.475,1001,77,1,62,0.0,0.0,0.0,0,800,clear,sky is clear,01n
1,2015-01-01 01:00:00+01:00,Valencia,270.475,270.475,270.475,1001,77,1,62,0.0,0.0,0.0,0,800,clear,sky is clear,01n
2,2015-01-01 02:00:00+01:00,Valencia,269.686,269.686,269.686,1002,78,0,23,0.0,0.0,0.0,0,800,clear,sky is clear,01n
3,2015-01-01 03:00:00+01:00,Valencia,269.686,269.686,269.686,1002,78,0,23,0.0,0.0,0.0,0,800,clear,sky is clear,01n
4,2015-01-01 04:00:00+01:00,Valencia,269.686,269.686,269.686,1002,78,0,23,0.0,0.0,0.0,0,800,clear,sky is clear,01n


### **Statistical Summary of Each DataFrame**

In [7]:
energy_df.describe()

Unnamed: 0,generation biomass,generation fossil brown coal/lignite,generation fossil coal-derived gas,generation fossil gas,generation fossil hard coal,generation fossil oil,generation fossil oil shale,generation fossil peat,generation geothermal,generation hydro pumped storage aggregated,generation hydro pumped storage consumption,generation hydro run-of-river and poundage,generation hydro water reservoir,generation marine,generation nuclear,generation other,generation other renewable,generation solar,generation waste,generation wind offshore,generation wind onshore,forecast solar day ahead,forecast wind offshore eday ahead,forecast wind onshore day ahead,total load forecast,total load actual,price day ahead,price actual
count,35045.0,35046.0,35046.0,35046.0,35046.0,35045.0,35046.0,35046.0,35046.0,0.0,35045.0,35045.0,35046.0,35045.0,35047.0,35046.0,35046.0,35046.0,35045.0,35046.0,35046.0,35064.0,0.0,35064.0,35064.0,35028.0,35064.0,35064.0
mean,383.51354,448.059208,0.0,5622.737488,4256.065742,298.319789,0.0,0.0,0.0,,475.577343,972.116108,2605.114735,0.0,6263.907039,60.228585,85.639702,1432.665925,269.452133,0.0,5464.479769,1439.066735,,5471.216689,28712.129962,28696.939905,49.874341,57.884023
std,85.353943,354.56859,0.0,2201.830478,1961.601013,52.520673,0.0,0.0,0.0,,792.406614,400.777536,1835.199745,0.0,839.667958,20.238381,14.077554,1680.119887,50.195536,0.0,3213.691587,1677.703355,,3176.312853,4594.100854,4574.98795,14.6189,14.204083
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,237.0,18105.0,18041.0,2.06,9.33
25%,333.0,0.0,0.0,4126.0,2527.0,263.0,0.0,0.0,0.0,,0.0,637.0,1077.25,0.0,5760.0,53.0,73.0,71.0,240.0,0.0,2933.0,69.0,,2979.0,24793.75,24807.75,41.49,49.3475
50%,367.0,509.0,0.0,4969.0,4474.0,300.0,0.0,0.0,0.0,,68.0,906.0,2164.0,0.0,6566.0,57.0,88.0,616.0,279.0,0.0,4849.0,576.0,,4855.0,28906.0,28901.0,50.52,58.02
75%,433.0,757.0,0.0,6429.0,5838.75,330.0,0.0,0.0,0.0,,616.0,1250.0,3757.0,0.0,7025.0,80.0,97.0,2578.0,310.0,0.0,7398.0,2636.0,,7353.0,32263.25,32192.0,60.53,68.01
max,592.0,999.0,0.0,20034.0,8359.0,449.0,0.0,0.0,0.0,,4523.0,2000.0,9728.0,0.0,7117.0,106.0,119.0,5792.0,357.0,0.0,17436.0,5836.0,,17430.0,41390.0,41015.0,101.99,116.8


In [8]:
weather_df.describe()

Unnamed: 0,temp,temp_min,temp_max,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_3h,clouds_all,weather_id
count,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0,178396.0
mean,289.618605,288.330442,291.091267,1069.261,68.423457,2.47056,166.59119,0.075492,0.00038,0.004763,25.073292,759.831902
std,8.026199,7.955491,8.612454,5969.632,21.902888,2.09591,116.611927,0.398847,0.007288,0.222604,30.774129,108.733223
min,262.24,262.24,262.24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,200.0
25%,283.67,282.483602,284.65,1013.0,53.0,1.0,55.0,0.0,0.0,0.0,0.0,800.0
50%,289.15,288.15,290.15,1018.0,72.0,2.0,177.0,0.0,0.0,0.0,20.0,800.0
75%,295.15,293.730125,297.15,1022.0,87.0,4.0,270.0,0.0,0.0,0.0,40.0,801.0
max,315.6,315.15,321.15,1008371.0,100.0,133.0,360.0,12.0,2.315,21.5,100.0,804.0


In [9]:
weather_df.duplicated(keep='first').sum()

21

### **Check for Missing Values**

In [10]:
def print_missing_values(df, df_name="DataFrame"):
    missing = df.isnull().sum()
    missing = missing[missing > 0]

    if not missing.empty:
        print(f"Columns with missing values in {df_name}: ")
        print(missing)
    else:
        print(f"No missing values found in {df_name}.")

In [11]:
print_missing_values(energy_df, "energy_df")

Columns with missing values in energy_df: 
generation biomass                                19
generation fossil brown coal/lignite              18
generation fossil coal-derived gas                18
generation fossil gas                             18
generation fossil hard coal                       18
generation fossil oil                             19
generation fossil oil shale                       18
generation fossil peat                            18
generation geothermal                             18
generation hydro pumped storage aggregated     35064
generation hydro pumped storage consumption       19
generation hydro run-of-river and poundage        19
generation hydro water reservoir                  18
generation marine                                 19
generation nuclear                                17
generation other                                  18
generation other renewable                        18
generation solar                                  18
gen

In [12]:
print_missing_values(weather_df, "weather_df")

No missing values found in weather_df.
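For a wide frame like `energy_df`, a table of counts and percentages can be easier to scan than raw counts alone. The following is a minimal sketch on toy data; `missing_summary` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for energy_df (values are illustrative only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1, 2, 3, 4],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

Sorting by count puts fully empty columns at the top, which makes candidates for removal easy to spot.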


### **Check for Duplicates**

In [13]:
def print_duplicate_info(df, df_name="DataFrame"):
    duplicate_count = df.duplicated().sum()

    if duplicate_count > 0:
        print(f"\n{df_name} has {duplicate_count} duplicate row(s).")
    else:
        print(f"\nNo duplicate rows found in {df_name}.")

In [14]:
print_duplicate_info(energy_df, "energy_df")


No duplicate rows found in energy_df.


In [15]:
print_duplicate_info(weather_df, "weather_df")


weather_df has 21 duplicate row(s).
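Called without arguments, `duplicated()` flags only rows that match in every column. For the weather data it can also be useful to ask whether any (timestamp, city) pair appears more than once, even with differing readings. The sketch below uses toy data and assumes that `dt_iso` and `city_name` together are meant to identify one observation, which the notebook does not state explicitly.

```python
import pandas as pd

# Toy rows (made-up values): rows 0 and 1 share timestamp and city but differ in temp
toy_weather = pd.DataFrame({
    "dt_iso": ["2015-01-01 00:00", "2015-01-01 00:00",
               "2015-01-01 00:00", "2015-01-01 01:00"],
    "city_name": ["Valencia", "Valencia", "Madrid", "Valencia"],
    "temp": [270.5, 271.0, 268.0, 270.5],
})

# Exact duplicates: every column must match
full_dupes = toy_weather.duplicated().sum()

# Key duplicates: same timestamp and city, readings may differ
key_dupes = toy_weather.duplicated(subset=["dt_iso", "city_name"]).sum()

print(f"Full-row duplicates: {full_dupes}")
print(f"(dt_iso, city_name) duplicates: {key_dupes}")
```

If the two counts differ on the real data, some timestamps carry conflicting readings for the same city, and those would need a rule other than "keep first".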


<h2 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Data Wrangling</h2>

#### Overview

This section focuses on preparing the data for analysis and modeling. Key tasks include:
- Converting timestamp columns to datetime format
- Handling missing and duplicate values
- Dropping irrelevant or forecast-based columns
- Aggregating weather data across cities by timestamp
- Merging energy and weather datasets into a unified time-indexed DataFrame

These steps ensure a clean and consistent dataset that accurately reflects real-world conditions and is suitable for time series modeling.

In [16]:
# Create a copy to avoid modifying original data
energy_clean = energy_df.copy()
weather_clean = weather_df.copy()

### **Datetime Conversion**

In [17]:
# Convert time column to datetime
energy_clean['time'] = pd.to_datetime(energy_clean['time'])
energy_clean.set_index('time', inplace=True)

# Sort by datetime index
energy_clean.sort_index(inplace=True)

print(f"Original energy data shape: {energy_df.shape}")
print(f"Date range: {energy_clean.index.min()} to {energy_clean.index.max()}")

weather_clean['dt_iso'] = pd.to_datetime(weather_clean['dt_iso'])

print(f"Original weather data shape: {weather_df.shape}")
print(f"Date range: {weather_clean['dt_iso'].min()} to {weather_clean['dt_iso'].max()}")

# Set datetime as index
weather_clean.set_index('dt_iso', inplace=True)
weather_clean.sort_index(inplace=True)

Original energy data shape: (35064, 29)
Date range: 2015-01-01 00:00:00+01:00 to 2018-12-31 23:00:00+01:00
Original weather data shape: (178396, 17)
Date range: 2015-01-01 00:00:00+01:00 to 2018-12-31 23:00:00+01:00


- **Decision**: Convert string timestamps to datetime objects and set them as the index  
- **Justification**: Datetime objects enable native time series operations (resampling, rolling windows) and ensure proper temporal alignment when merging the energy and weather datasets. They are also essential for keeping observations in chronological order for the forecasting models.  
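
The timestamps carry a local UTC offset (`+01:00` in the rows shown), which presumably switches to `+02:00` during daylight saving time. One option is to parse them straight to UTC so the column gets a single timezone-aware dtype. This is a minimal sketch on made-up timestamps, not the notebook's actual cell.

```python
import pandas as pd

# Illustrative strings: one winter (+01:00) and one summer (+02:00) timestamp
raw = pd.Series(["2015-01-01 00:00:00+01:00", "2015-07-01 00:00:00+02:00"])

# utc=True normalizes mixed offsets into one timezone-aware column
parsed = pd.to_datetime(raw, utc=True)
print(parsed)

# Dropping the timezone leaves naive timestamps expressed in UTC
naive_utc = parsed.dt.tz_convert(None)
print(naive_utc)
```

This also explains why a local midnight such as `2015-01-01 00:00:00+01:00` appears as `2014-12-31 23:00:00` once the timezone is removed.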

### **Handle Missing Values**

In [18]:
# Identify columns with high missing values (>50% missing)
missing_pct = (energy_clean.isnull().sum() / len(energy_clean)) * 100
high_missing_cols = missing_pct[missing_pct > 50].index.tolist()

print(f"Columns with >50% missing values: {high_missing_cols}", end="\n\n")

# Drop columns with excessive missing values (>50%)
energy_clean.drop(columns=high_missing_cols, inplace=True)
print(f"Dropped {len(high_missing_cols)} columns with >50% missing values")

Columns with >50% missing values: ['generation hydro pumped storage aggregated', 'forecast wind offshore eday ahead']

Dropped 2 columns with >50% missing values


In [19]:
# Handle remaining missing values
numeric_cols = energy_clean.select_dtypes(include=[np.number]).columns

# For energy generation columns
generation_cols = [col for col in numeric_cols if 'generation' in col.lower()]
for col in generation_cols:
    if energy_clean[col].isnull().sum() > 0:
        # .ffill()/.bfill() replace the deprecated fillna(method=...) form
        energy_clean[col] = energy_clean[col].ffill().bfill()

# For forecast and actual load columns
forecast_actual_cols = [col for col in numeric_cols if any(keyword in col.lower() 
                       for keyword in ['forecast', 'actual', 'load', 'price'])]

for col in forecast_actual_cols:
    if energy_clean[col].isnull().sum() > 0:
        energy_clean[col] = energy_clean[col].interpolate(method='linear')
        # Fallback for gaps at the edges of the series that interpolation leaves open
        energy_clean[col] = energy_clean[col].ffill().bfill()

print(f"- Applied interpolation and forward/backward fill for missing values")
print(f"- Remaining missing values: {energy_clean.isnull().sum().sum()}")

- Applied interpolation and forward/backward fill for missing values
- Remaining missing values: 0


**High Missing Values (>50%)**
- **Decision**: Dropped the columns with more than 50% missing values (`generation hydro pumped storage aggregated` and `forecast wind offshore eday ahead`).
- **Justification**: Both columns are empty in all 35,064 rows, so they carry no information for a forecasting model and cannot be meaningfully imputed.

**Generation Variables**
- **Decision**: Forward fill, then backward fill for any remaining gaps
- **Justification**: I assume that physical constraints keep generation output from changing instantaneously, so plants hold relatively stable levels over short periods. Forward fill reflects this by carrying the last observed value forward, while backward fill covers any gap at the very start of the series. This approach preserves the temporal patterns that energy forecasting depends on.

**Load & Price Variables**
- **Decision**: Linear interpolation, with forward/backward fill as a fallback for gaps at the edges of the series
- **Justification**: I also assume that electricity demand and prices move smoothly from hour to hour because of market mechanisms and consumer behavior. Linear interpolation keeps the series continuous and fills each gap with intermediate values that follow the local trend, which matters for accurate load forecasting.
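
To make the difference between the two strategies concrete, here is a small sketch on a made-up series with one leading gap and one interior gap. The values are illustrative and unrelated to the real data.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap and a two-step interior gap
s = pd.Series([np.nan, 100.0, np.nan, np.nan, 130.0, 140.0])

# Forward fill repeats the last known value; backward fill covers the leading gap
filled = s.ffill().bfill()

# Linear interpolation draws a straight line between known neighbours;
# the leading NaN has no left neighbour, so it still needs a backward fill
interpolated = s.interpolate(method="linear").bfill()

print(pd.DataFrame({"original": s, "ffill_bfill": filled, "linear": interpolated}))
```

Forward fill produces a flat step across the interior gap, while interpolation produces a ramp, which is the behavioral assumption each variable group encodes.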

### **Handle Duplicates**

In [20]:
print(f"Initial Data Shape: {weather_clean.shape} \n Duplicate rows found: {weather_clean.duplicated().sum()}")

# Remove duplicates - keep first occurrence
weather_clean.drop_duplicates(inplace=True)

print(f"Data shape after removing duplicates: {weather_clean.shape}")

Initial Data Shape: (178396, 16) 
 Duplicate rows found: 8622
Data shape after removing duplicates: (169774, 16)


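- **Decision**: Remove duplicate rows, keeping the first occurrence
- **Note**: The cell above reports 8,622 duplicates, not the 21 found during exploration. By this point `dt_iso` has been moved to the index, and `duplicated()` compares column values only, so rows that repeat the same city and readings at different timestamps are also flagged and dropped. Removing only the 21 exact duplicates requires the timestamp to be part of the comparison.
- **Justification**: Duplicate weather records create artificial patterns that can bias model training and lead to overfitting. Removing them keeps repeated records from skewing the data at any timestamp, which improves model generalization and avoids inflated performance metrics.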
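
Because `DataFrame.duplicated()` ignores the index, deduplicating a frame indexed by timestamp can remove legitimate rows whose readings simply did not change from one hour to the next. The sketch below shows the effect on toy data, along with one possible way to include the index in the comparison. It is an illustration under those assumptions, not the notebook's own cell.

```python
import pandas as pd

# Toy frame: identical readings at 00:00 and 01:00, plus one true repeat at 01:00
toy = pd.DataFrame(
    {"city_name": ["Valencia", "Valencia", "Valencia"],
     "temp": [270.5, 270.5, 270.5]},
    index=pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 01:00"]),
)
toy.index.name = "dt_iso"

# Column-only comparison: both later rows look like copies of the first
column_only = toy.duplicated().sum()

# Bring the index back as a column so the timestamp takes part in the comparison
is_dupe = toy.reset_index().duplicated().to_numpy()
index_aware = is_dupe.sum()

deduped = toy[~is_dupe]
print(f"Column-only duplicates: {column_only}")
print(f"Index-aware duplicates: {index_aware}")
print(f"Shape after index-aware removal: {deduped.shape}")
```

With the timestamp included, only the repeated 01:00 row is dropped and the unchanged reading at a new hour is kept.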

### **Drop Forecast Columns**

In [21]:
forecast_columns_to_drop = ['forecast wind onshore day ahead', 'forecast solar day ahead', 'total load forecast', 'price day ahead']

In [22]:
energy_clean.drop(columns=forecast_columns_to_drop, inplace=True)

- **Decision**: Drop forecast-related columns (`forecast wind onshore day ahead`, `forecast solar day ahead`, `total load forecast`, `price day ahead`).
- **Justification**: These features represent external predictive signals, which may not be available in real-world inference scenarios or during backtesting. Including them risks **data leakage**, especially when my goal is to build a model that forecasts energy demand/load independently. Removing these columns ensures the model only relies on actual historical observations and preserves the integrity of the forecasting task.

### **Merge DataFrames**

In [23]:
# Convert both indices to UTC first, then remove timezone info
energy_clean.index = pd.to_datetime(energy_clean.index, utc=True).tz_convert(None)
weather_clean.index = pd.to_datetime(weather_clean.index, utc=True).tz_convert(None)

In [24]:
# Aggregate weather data using mean across all cities per timestamp
weather_agg = (
    weather_clean.groupby(weather_clean.index).agg({
        'temp': 'mean',
        'temp_min': 'mean',
        'temp_max': 'mean',
        'pressure': 'mean',
        'humidity': 'mean',
        'wind_speed': 'mean',
        'wind_deg': 'mean',
        'rain_1h': 'mean',
        'rain_3h': 'mean',
        'snow_3h': 'mean',
        'clouds_all': 'mean'
    })
)

# Merge using datetime index
merged_df = energy_clean.join(weather_agg, how='inner')

In [25]:
merged_df.head(2)

Unnamed: 0,generation biomass,generation fossil brown coal/lignite,generation fossil coal-derived gas,generation fossil gas,generation fossil hard coal,generation fossil oil,generation fossil oil shale,generation fossil peat,generation geothermal,generation hydro pumped storage consumption,generation hydro run-of-river and poundage,generation hydro water reservoir,generation marine,generation nuclear,generation other,generation other renewable,generation solar,generation waste,generation wind offshore,generation wind onshore,total load actual,price actual,temp,temp_min,temp_max,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_3h,clouds_all
2014-12-31 23:00:00,447.0,329.0,0.0,4844.0,4821.0,162.0,0.0,0.0,0.0,863.0,1051.0,1899.0,0.0,7096.0,43.0,73.0,49.0,196.0,0.0,6378.0,25385.0,65.41,272.491463,272.491463,272.491463,1016.4,82.4,2.0,135.2,0.0,0.0,0.0,0.0
2015-01-01 00:00:00,449.0,328.0,0.0,5196.0,4755.0,158.0,0.0,0.0,0.0,920.0,1009.0,1658.0,0.0,7096.0,43.0,71.0,50.0,195.0,0.0,5890.0,24382.0,64.92,269.7635,269.7635,269.7635,1035.0,97.0,0.0,229.0,0.0,0.0,0.0,0.0


- **Decision**: Aggregate weather data across all five Spanish cities using the **mean** of each numerical weather feature, then merge it with the national energy data by aligning on their **shared datetime index**.

- **Justification**: The energy dataset is recorded at a national level and indexed by timestamp, while the weather data includes multiple cities also indexed by timestamp. To create a consistent and comparable dataset, I computed the **mean weather conditions** across all cities per hour. 

- **Hypothesis**: I’m using the **mean** of weather variables (e.g., temperature, humidity, wind speed) across cities because I hypothesize that **national energy load reflects average comfort levels and environmental conditions** across the country, rather than being driven by local extremes in any single city.
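
The aggregate-then-join pattern can be checked on a small example. The sketch below uses made-up readings for two cities and two hours; the frame names prefixed with `toy_` are hypothetical.

```python
import pandas as pd

# Made-up readings for two cities over two hours
idx = pd.to_datetime(["2015-01-01 00:00", "2015-01-01 00:00",
                      "2015-01-01 01:00", "2015-01-01 01:00"])
toy_weather = pd.DataFrame(
    {"city_name": ["Valencia", "Madrid", "Valencia", "Madrid"],
     "temp": [270.0, 268.0, 271.0, 267.0]},
    index=idx,
)

# National energy rows, one per hour (values are illustrative)
toy_energy = pd.DataFrame(
    {"total load actual": [25385.0, 24382.0]},
    index=pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00"]),
)

# One row per timestamp: mean of the numeric column across cities
toy_agg = toy_weather.groupby(level=0)[["temp"]].mean()

# Inner join keeps only timestamps present in both frames
toy_merged = toy_energy.join(toy_agg, how="inner")
print(toy_merged)

# Sanity check: the join should neither duplicate nor drop energy rows here
assert len(toy_merged) == len(toy_energy)
```

One caveat worth checking on the real data: `wind_deg` is an angle, so an arithmetic mean of 350° and 10° gives 180° rather than 0°, and the averaged wind direction may need separate treatment.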

<h2 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Exploratory Data Analysis</h2>