# NESO Generation Mix Data Transformation
*Transforming Raw Energy Data into Analysis-Ready Format*

## Notebook Objectives

### Primary Goals
- **Transform raw NESO generation mix data** into clean, analysis-ready format
- **Standardise datetime handling** across all energy datasets
- **Create consistent fuel type categories** for cross-analysis compatibility
- **Generate summary statistics** and data quality reports

### User Stories
> **As a data analyst**, I want clear documentation and explanations for each NESO dataset we extract so that I and other team members can understand the source, structure, meaning and caveats of the data without digging into code.

> **As a renewable energy researcher**, I want standardised fuel categories (renewable vs fossil) so that I can easily calculate clean energy percentages over time.

> **As a policy maker**, I want reliable data transformation pipelines so that I can trust the insights derived from UK electricity generation data.

## About This Dataset

### Source: NESO Historic GB Generation Mix
- **Update Frequency**: Daily
- **Time Resolution**: 30-minute intervals
- **Geographic Coverage**: Great Britain transmission system
- **Attribution**: "Supported by National Energy SO Open Data"
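
The 30-minute resolution can be checked directly once timestamps are parsed. The sketch below is a minimal illustration rather than part of the pipeline: it assumes timestamps arrive as ISO 8601 strings in UTC (as in this dataset's `DATETIME` column) and uses a handful of sample values instead of the full file. The variable names `raw`, `timestamps`, `gaps` and `is_half_hourly` are illustrative only.

```python
import pandas as pd

# Sample timestamps in the same ISO 8601 format the dataset uses
raw = pd.Series([
    "2009-01-01T00:00:00",
    "2009-01-01T00:30:00",
    "2009-01-01T01:00:00",
    "2009-01-01T01:30:00",
])

# Parse the strings and mark them explicitly as UTC
timestamps = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S", utc=True)

# Gaps between consecutive rows; the first row has no predecessor, so drop it
gaps = timestamps.diff().dropna()

# True only when every gap is exactly 30 minutes
is_half_hourly = (gaps == pd.Timedelta(minutes=30)).all()
print(is_half_hourly)
```

On the full dataset, any gap that is not 30 minutes would point to missing or duplicated settlement periods worth investigating before analysis.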

### Stage 1: Environment Setup

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

### Stage 2: Load Raw Data

In [4]:
# Load the raw NESO generation mix data
df = pd.read_csv(r'../data/raw/generation_mix.csv')

# Check the row and column count and preview the first 5 rows
print(f"Loaded generation mix data: {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head()

Loaded generation mix data: 291,184 rows × 34 columns


Unnamed: 0,DATETIME,GAS,COAL,NUCLEAR,WIND,WIND_EMB,HYDRO,IMPORTS,BIOMASS,OTHER,...,IMPORTS_perc,BIOMASS_perc,OTHER_perc,SOLAR_perc,STORAGE_perc,GENERATION_perc,LOW_CARBON_perc,ZERO_CARBON_perc,RENEWABLE_perc,FOSSIL_perc
0,2009-01-01T00:00:00,8369.0,15037.0,7099.0,244.0,61.0,246,2519.0,0.0,0.0,...,7.5,0.0,0.0,0.0,0.0,100.0,22.8,24.5,1.6,69.7
1,2009-01-01T00:30:00,8498.0,15095.0,7087.0,225.0,56.0,245,2497.0,0.0,0.0,...,7.4,0.0,0.0,0.0,0.0,100.0,22.6,24.3,1.6,70.0
2,2009-01-01T01:00:00,8474.0,15088.0,7074.0,203.0,51.0,246,2466.0,0.0,0.0,...,7.3,0.0,0.0,0.0,0.0,100.0,22.5,24.2,1.5,70.1
3,2009-01-01T01:30:00,8319.0,15034.0,7064.0,188.0,47.0,246,2440.0,0.0,0.0,...,7.3,0.0,0.0,0.0,0.0,100.0,22.6,24.3,1.4,70.0
4,2009-01-01T02:00:00,8296.0,15004.0,7052.0,173.0,43.0,246,2364.0,0.0,0.0,...,7.1,0.0,0.0,0.0,0.0,100.0,22.6,24.3,1.4,70.2
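
As a quick sanity check on how the `_perc` columns relate to the MW columns, the fossil share can be recomputed from the preview above. This is a sketch under stated assumptions, not NESO's documented method: it uses only the first two preview rows, omits `BIOMASS` and `OTHER` because they are 0.0 in those rows, and assumes the columns hidden behind `...` also contribute 0 MW there (their percentages are shown as 0.0).

```python
import pandas as pd

# First two rows of the preview, limited to the non-zero MW columns
sample = pd.DataFrame({
    "GAS": [8369.0, 8498.0],
    "COAL": [15037.0, 15095.0],
    "NUCLEAR": [7099.0, 7087.0],
    "WIND": [244.0, 225.0],
    "WIND_EMB": [61.0, 56.0],
    "HYDRO": [246, 245],
    "IMPORTS": [2519.0, 2497.0],
})

# Assumption: hidden columns (e.g. SOLAR, STORAGE) are 0 MW in these rows,
# so the row total is the sum of the columns listed above
total_mw = sample.sum(axis=1)

# Fossil share = (gas + coal) / total, as a percentage rounded to 1 d.p.
fossil_perc = ((sample["GAS"] + sample["COAL"]) / total_mw * 100).round(1)
print(fossil_perc.tolist())
```

The recomputed values agree with the `FOSSIL_perc` figures in the preview (69.7 and 70.0), which suggests the percentages are taken against a total that includes embedded wind and imports.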


### Stage 3: Custom Function to Create a Data Dictionary
I use a reusable function to profile the NESO dataset so that its state can be compared before and after each alteration.
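
One way to make the before/after comparison concrete is to join two data dictionaries on the column name and flag any data type changes. The sketch below assumes each dictionary has `Column` and `Data Type` fields, as produced by the function in this stage. `compare_data_dictionaries` is a hypothetical helper and the two small frames are toy stand-ins, not real outputs from the notebook.

```python
import pandas as pd

def compare_data_dictionaries(before, after):
    """Join two data dictionaries on 'Column' and flag data type changes."""
    merged = before[["Column", "Data Type"]].merge(
        after[["Column", "Data Type"]],
        on="Column",
        how="outer",  # keep columns that were added or dropped
        suffixes=(" (before)", " (after)"),
    )
    merged["Type Changed"] = (
        merged["Data Type (before)"] != merged["Data Type (after)"]
    )
    return merged

# Toy dictionaries standing in for the real before/after outputs
before = pd.DataFrame({
    "Column": ["DATETIME", "GAS"],
    "Data Type": ["object", "float64"],
})
after = pd.DataFrame({
    "Column": ["DATETIME", "GAS"],
    "Data Type": ["datetime64[ns, UTC]", "float64"],
})

comparison = compare_data_dictionaries(before, after)
print(comparison)
```

The outer join means a column present in only one dictionary still appears, with a missing value on the other side and `Type Changed` set to `True`.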

In [3]:
# Custom Function to create a comprehensive data dictionary for NESO Generation Mix dataset
# Takes a DataFrame and returns a data dictionary with NESO-specific column descriptions
def create_data_dictionary(df):
    # Official descriptions from NESO Historic GB Generation Mix dataset
    # Source: https://www.neso.energy/data-portal/historic-generation-mix/historic_gb_generation_mix
    descriptions = {
        'DATETIME': 'Date and time of historic generation mix and carbon intensity, given in UTC (ISO 8601 format)',
        'GAS': 'Amount of generation delivered by gas fuel type (MW)',
        'COAL': 'Amount of generation delivered by coal fuel type (MW)',
        'NUCLEAR': 'Amount of generation delivered by nuclear fuel type (MW)',
        'WIND': 'Amount of generation delivered by wind fuel type (MW)',
        'WIND_EMB': 'Amount of generation delivered by embedded wind (MW)',
        'HYDRO': 'Amount of generation delivered by hydro fuel type (MW)',
        'IMPORTS': 'Interconnector imports (MW)',
        'BIOMASS': 'Amount of generation delivered by biomass fuel type (MW)',
        'OTHER': 'Amount of generation delivered by other fuel types (MW)',
        'SOLAR': 'Amount of generation delivered by solar fuel type (MW)',
        'STORAGE': 'Amount of generation delivered by storage (MW)',
        'GENERATION': 'Sum of gas, coal, nuclear, wind, hydro and imports (MW)',
        'CARBON_INTENSITY': 'Carbon intensity of electricity - CO2 emissions per kWh of electricity consumed (gCO2/kWh)',
        'LOW_CARBON': 'Low carbon generation - wind, solar, hydro, nuclear, biomass (MW)',
        'ZERO_CARBON': 'Zero carbon generation - wind, solar, hydro, nuclear (MW)',
        'RENEWABLE': 'Renewable generation - wind, hydro, solar (MW)',
        'FOSSIL': 'Fossil generation - coal, natural gas (MW)',
        'GAS_perc': 'Gas generation as percentage of total generation (%)',
        'COAL_perc': 'Coal generation as percentage of total generation (%)',
        'NUCLEAR_perc': 'Nuclear generation as percentage of total generation (%)',
        'WIND_perc': 'Wind generation as percentage of total generation (%)',
        'WIND_EMB_perc': 'Embedded wind generation as percentage of total generation (%)',
        'HYDRO_perc': 'Hydro generation as percentage of total generation (%)',
        'IMPORTS_perc': 'Interconnector imports as percentage of total generation (%)',
        'BIOMASS_perc': 'Biomass generation as percentage of total generation (%)',
        'OTHER_perc': 'Other generation as percentage of total generation (%)',
        'SOLAR_perc': 'Solar generation as percentage of total generation (%)',
        'STORAGE_perc': 'Storage as percentage of total generation (%)',
        'GENERATION_perc': 'Total generation (gas, coal, nuclear, wind, hydro, imports) as percentage (%)',
        'LOW_CARBON_perc': 'Low carbon generation as percentage of total generation (%)',
        'ZERO_CARBON_perc': 'Zero carbon generation as percentage of total generation (%)',
        'RENEWABLE_perc': 'Renewable generation as percentage of total generation (%)',
        'FOSSIL_perc': 'Fossil generation as percentage of total generation (%)'
    }
    
    dictionary_data = []
    for column in df.columns:
        # Get 3 sample values (non-null)
        sample_values = df[column].dropna().head(3).tolist()
        sample_str = ', '.join([str(x) for x in sample_values])
        
        dictionary_data.append({
            'Column': column,
            'Data Type': str(df[column].dtype),
            'Missing Values': df[column].isnull().sum(),
            'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
            'Unique Values': df[column].nunique(),
            'Sample Values': sample_str,
            'Description': descriptions.get(column, 'Additional column - description needed (may be new interconnector or generation type)')
        })
    return pd.DataFrame(dictionary_data)

# Store the dictionary in a variable
raw_data_dictionary = create_data_dictionary(df)

# Display data dictionary
print("📊 NESO Generation Mix Dataset - Data Dictionary")
raw_data_dictionary

📊 NESO Generation Mix Dataset - Data Dictionary


Unnamed: 0,Column,Data Type,Missing Values,Missing %,Unique Values,Sample Values,Description
0,DATETIME,object,0,0.0,291184,"2009-01-01T00:00:00, 2009-01-01T00:30:00, 2009...",Date and time of historic generation mix and c...
1,GAS,float64,0,0.0,23775,"8369.0, 8498.0, 8474.0",Amount of generation delivered by gas fuel typ...
2,COAL,float64,0,0.0,23813,"15037.0, 15095.0, 15088.0",Amount of generation delivered by coal fuel ty...
3,NUCLEAR,float64,0,0.0,6787,"7099.0, 7087.0, 7074.0",Amount of generation delivered by nuclear fuel...
4,WIND,float64,0,0.0,16086,"244.0, 225.0, 203.0",Amount of generation delivered by wind fuel ty...
5,WIND_EMB,float64,0,0.0,5407,"61.0, 56.0, 51.0",Amount of generation delivered by embedded win...
6,HYDRO,int64,0,0.0,1323,"246, 245, 246",Amount of generation delivered by hydro fuel t...
7,IMPORTS,float64,0,0.0,7601,"2519.0, 2497.0, 2466.0",Interconnector imports (MW)
8,BIOMASS,float64,0,0.0,3200,"0.0, 0.0, 0.0",Amount of generation delivered by biomass fuel...
9,OTHER,float64,0,0.0,2424,"0.0, 0.0, 0.0",Amount of generation delivered by other fuel t...
