 📄 clean_energy_data.py — Step-by-Step Explanation

This script cleans Ireland's wind and solar energy data from SEAI (the Sustainable Energy Authority of Ireland). It prepares the raw Excel file for modeling by filling in missing year values, building proper timestamps, and saving a clean CSV.

## What the Script Does (in Simple Words)

### 1. **Import Required Libraries**

In [1]:
import pandas as pd
import os

The cleaning function takes two arguments:

- input_path: location of the raw Excel file
- output_path: where the cleaned CSV will be saved

Inside the function, the script:

- Loads the data into a DataFrame.
- Fills in the year for rows where it is missing, then converts the years to integers.
- Drops the Unit column, which always says 'GWh' and so carries no information.
- Combines month and year into a clean date like 2018-01-01.
- Keeps only the columns needed for your model.
- Standardizes column names for readability and code clarity.
- Sorts the data by date so the time series is in the correct order.
- Makes sure the save folder exists, then saves the cleaned data to a .csv file.

Run the Script (Only When Called Directly)

The cleaning function runs only when the script is executed directly, not when it is imported as a module.
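The year-filling and date-building steps described above can be tried on a toy table before running the full script. This is a minimal sketch: the column names match the script, but the rows and values are made up for illustration, and the assumption that the year appears only on the first row of each block is a guess about how the raw sheet is laid out.

```python
import pandas as pd

# Toy rows shaped like the raw sheet (values are invented).
# Assumption: the year is present only on the first row of each block.
raw = pd.DataFrame({
    "Year of Period": [2018, None, None, 2019, None],
    "Month of Period": ["January", "February", "March", "January", "February"],
    "Wind": [800.0, 750.0, 900.0, 950.0, 700.0],
})

# Forward-fill carries the last seen year down, then cast float -> int.
raw["Year of Period"] = raw["Year of Period"].ffill().astype(int)

# "January" + " " + "2018" is parsed with %B %Y; the day defaults to the 1st.
raw["Date"] = pd.to_datetime(
    raw["Month of Period"] + " " + raw["Year of Period"].astype(str),
    format="%B %Y",
)

print(raw[["Year of Period", "Month of Period", "Date"]])
```

The cast to int matters because a column containing missing values is stored as floats, so without it the year would be rendered as "2018.0" and the date format would not match.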

In [None]:

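def clean_energy_data(input_path, output_path):
    """Clean the raw SEAI Excel file and save the result as a CSV.

    Returns the cleaned DataFrame with columns Date, Wind_GWh, Solar_GWh.
    """
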
    # Load the Excel file
    df = pd.read_excel(input_path)

    # Fill missing years
    df["Year of Period"] = df["Year of Period"].ffill().astype(int)

    # Drop the 'Unit' column (it's always 'GWh')
    df.drop(columns=["Unit"], inplace=True, errors='ignore')

    # Create a proper 'Date' column
    df["Date"] = pd.to_datetime(df["Month of Period"] + " " + df["Year of Period"].astype(str), format='%B %Y')

    # Keep only required columns
    df_cleaned = df[["Date", "Wind", "Solar Farms"]].copy()

    # Rename columns to follow modeling conventions
    df_cleaned.columns = ["Date", "Wind_GWh", "Solar_GWh"]

    # Sort by date
    df_cleaned.sort_values("Date", inplace=True)
    df_cleaned.reset_index(drop=True, inplace=True)

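    # Save cleaned data (create the output folder first, if the path has one;
    # os.makedirs("") raises an error when output_path is a bare filename)
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    df_cleaned.to_csv(output_path, index=False)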
    print(f"Cleaned energy data saved to: {output_path}")

    return df_cleaned


if __name__ == "__main__":
    # File paths
    input_file = "../data/raw/Energy_Data.xlsx"
    output_file = "../data/processed/Cleaned_Energy_Data.csv"

    # Run cleaning
    clean_energy_data(input_file, output_file)


Cleaned energy data saved to: ../data/processed/Cleaned_Energy_Data.csv
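
Once the CSV is saved, a quick check can confirm it is ready for modeling. The sketch below is an assumption about what such a check might look like, not part of the original script. It uses an in-memory stand-in with made-up numbers instead of the real file, and the three checks chosen (sorted dates, no duplicate dates, no missing values) are illustrative.

```python
import io
import pandas as pd

# Stand-in for Cleaned_Energy_Data.csv; the numbers are made up.
csv_text = """Date,Wind_GWh,Solar_GWh
2018-01-01,800.0,1.2
2018-02-01,750.0,2.5
2018-03-01,900.0,4.1
"""

# Dates are stored as text in a CSV, so parse_dates converts them back.
cleaned = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Checks a time-series model would typically rely on.
is_sorted = cleaned["Date"].is_monotonic_increasing
has_duplicates = cleaned["Date"].duplicated().any()
missing_values = int(cleaned[["Wind_GWh", "Solar_GWh"]].isna().sum().sum())

print("sorted:", is_sorted)
print("duplicate dates:", has_duplicates)
print("missing values:", missing_values)
```

To run the same checks on the real output, replace `io.StringIO(csv_text)` with the path passed as output_path.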
