# NESO Generation Mix Data Transformation
*Transforming Raw Energy Data into Analysis-Ready Format*

## Notebook Objectives

### Primary Goals
- **Transform raw NESO generation mix data** into clean, analysis-ready format
- **Standardise datetime handling** across all energy datasets
- **Create consistent fuel type categories** for cross-analysis compatibility
- **Generate summary statistics** and data quality reports

### User Stories
> **As a data analyst**, I want clear documentation and explanations for each NESO dataset we extract so that I and other team members can understand the source, structure, meaning and caveats of the data without digging into code.

> **As a renewable energy researcher**, I want standardised fuel categories (renewable vs fossil) so that I can easily calculate clean energy percentages over time.

> **As a policy maker**, I want reliable data transformation pipelines so that I can trust the insights derived from UK electricity generation data.

## About This Dataset

### Source: NESO Historic GB Generation Mix
- **Update Frequency**: Daily
- **Time Resolution**: 30-minute intervals
- **Geographic Coverage**: Great Britain transmission system
- **Attribution**: "Supported by National Energy SO Open Data"

### Stage 1: Environment Setup
Here we import the Python libraries needed to validate the integrity of the dataset and to transform it where necessary.

In [109]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import os
from pathlib import Path

In [110]:
# Define output directories for processed data, data dictionaries and descriptive statistics
processed_dir = "../data/processed"
dict_dir = '../data/data_dictionary'
stats_dir = '../data/descriptive_statistics'

print(f"{processed_dir}/ - Clean datasets for analysis")
print(f"{dict_dir}/ - Data dictionaries and documentation")
print(f"{stats_dir}/ - Descriptive statistics and summaries")

../data/processed/ - Clean datasets for analysis
../data/data_dictionary/ - Data dictionaries and documentation
../data/descriptive_statistics/ - Descriptive statistics and summaries
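The cell above only defines the output paths as strings. It does not check that the folders exist, and `to_csv` raises an error when its target folder is missing. A minimal sketch of one way to guard against that is shown here; `ensure_directories` is a hypothetical helper, not part of the original notebook, and the demonstration uses a temporary folder rather than the project's `../data` tree.

```python
import tempfile
from pathlib import Path


def ensure_directories(*dirs):
    """Create each directory (and any missing parents) if it does not already exist."""
    created = []
    for d in dirs:
        path = Path(d)
        # exist_ok=True makes the call safe to repeat on every notebook run
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway location so nothing is written to the project tree
with tempfile.TemporaryDirectory() as tmp:
    demo_dirs = ensure_directories(
        Path(tmp) / "data" / "processed",
        Path(tmp) / "data" / "data_dictionary",
    )
    all_exist = all(p.is_dir() for p in demo_dirs)

print(all_exist)
```

In the notebook itself the equivalent call would be `ensure_directories(processed_dir, dict_dir, stats_dir)`, placed before the export stage.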


### Stage 2: Load raw data
Next we load the data, then check its shape and preview the first few rows to confirm that it has been read correctly.

In [111]:
# Load the raw generation mix data
df = pd.read_csv(r'../data/raw/generation_mix.csv')

# Check the row and column count and preview the first 5 rows
print(f"Loaded generation mix data: {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head()

Loaded generation mix data: 291,184 rows × 34 columns


,DATETIME,GAS,COAL,NUCLEAR,WIND,WIND_EMB,HYDRO,IMPORTS,BIOMASS,OTHER,...,IMPORTS_perc,BIOMASS_perc,OTHER_perc,SOLAR_perc,STORAGE_perc,GENERATION_perc,LOW_CARBON_perc,ZERO_CARBON_perc,RENEWABLE_perc,FOSSIL_perc
0,2009-01-01T00:00:00,8369.0,15037.0,7099.0,244.0,61.0,246,2519.0,0.0,0.0,...,7.5,0.0,0.0,0.0,0.0,100.0,22.8,24.5,1.6,69.7
1,2009-01-01T00:30:00,8498.0,15095.0,7087.0,225.0,56.0,245,2497.0,0.0,0.0,...,7.4,0.0,0.0,0.0,0.0,100.0,22.6,24.3,1.6,70.0
2,2009-01-01T01:00:00,8474.0,15088.0,7074.0,203.0,51.0,246,2466.0,0.0,0.0,...,7.3,0.0,0.0,0.0,0.0,100.0,22.5,24.2,1.5,70.1
3,2009-01-01T01:30:00,8319.0,15034.0,7064.0,188.0,47.0,246,2440.0,0.0,0.0,...,7.3,0.0,0.0,0.0,0.0,100.0,22.6,24.3,1.4,70.0
4,2009-01-01T02:00:00,8296.0,15004.0,7052.0,173.0,43.0,246,2364.0,0.0,0.0,...,7.1,0.0,0.0,0.0,0.0,100.0,22.6,24.3,1.4,70.2
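Beyond eyeballing the preview, the load can also be checked programmatically by confirming that the columns the later stages rely on are present. The sketch below is an illustration on a toy frame; `missing_columns` is a hypothetical helper, and the list of expected names is an assumption to be adjusted to whichever columns the analysis needs.

```python
import pandas as pd


def missing_columns(df, expected):
    """Return the expected column names that are absent from df, in the order given."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for the raw file (values copied from the preview above)
sample = pd.DataFrame(
    {"DATETIME": ["2009-01-01T00:00:00"], "GAS": [8369.0], "COAL": [15037.0]}
)

absent = missing_columns(sample, ["DATETIME", "GAS", "COAL", "NUCLEAR"])
print(absent)  # ['NUCLEAR']
```

An empty list means every expected column was found. A non-empty list is a signal to stop and inspect the source file before transforming anything.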


### Stage 3: Custom Function to create Data Dictionary
I am reusing a function written for previous projects to compare the NESO dataset before and after transformation. It also reports how much data is missing from each column and helps identify any data types that need converting before the analysis.

In [112]:
# Custom Function to create a comprehensive data dictionary for NESO Generation Mix dataset
# Takes a DataFrame and returns a data dictionary with NESO-specific column descriptions
def create_data_dictionary(df):
    # Official descriptions from NESO Historic GB Generation Mix dataset
    # Source: https://www.neso.energy/data-portal/historic-generation-mix/historic_gb_generation_mix
    descriptions = {
        'DATETIME': 'Date and time of historic generation mix and carbon intensity, given in UTC (ISO 8601 format)',
        'GAS': 'Amount of generation delivered by gas fuel type (MW)',
        'COAL': 'Amount of generation delivered by coal fuel type (MW)',
        'NUCLEAR': 'Amount of generation delivered by nuclear fuel type (MW)',
        'WIND': 'Amount of generation delivered by wind fuel type (MW)',
        'WIND_EMB': 'Amount of generation delivered by embedded wind (MW)',
        'HYDRO': 'Amount of generation delivered by hydro fuel type (MW)',
        'IMPORTS': 'Interconnector imports (MW)',
        'BIOMASS': 'Amount of generation delivered by biomass fuel type (MW)',
        'OTHER': 'Amount of generation delivered by other fuel types (MW)',
        'SOLAR': 'Amount of generation delivered by solar fuel type (MW)',
        'STORAGE': 'Amount of generation delivered by storage (MW)',
        'GENERATION': 'Sum of gas, coal, nuclear, wind, hydro and imports (MW)',
        'CARBON_INTENSITY': 'Carbon intensity of electricity - CO2 emissions per kWh of electricity consumed (gCO2/kWh)',
        'LOW_CARBON': 'Low carbon generation - wind, solar, hydro, nuclear, biomass (MW)',
        'ZERO_CARBON': 'Zero carbon generation - wind, solar, hydro, nuclear (MW)',
        'RENEWABLE': 'Renewable generation - wind, hydro, solar (MW)',
        'FOSSIL': 'Fossil generation - coal, natural gas (MW)',
        'GAS_perc': 'Gas generation as percentage of total generation (%)',
        'COAL_perc': 'Coal generation as percentage of total generation (%)',
        'NUCLEAR_perc': 'Nuclear generation as percentage of total generation (%)',
        'WIND_perc': 'Wind generation as percentage of total generation (%)',
        'WIND_EMB_perc': 'Embedded wind generation as percentage of total generation (%)',
        'HYDRO_perc': 'Hydro generation as percentage of total generation (%)',
        'IMPORTS_perc': 'Interconnector imports as percentage of total generation (%)',
        'BIOMASS_perc': 'Biomass generation as percentage of total generation (%)',
        'OTHER_perc': 'Other generation as percentage of total generation (%)',
        'SOLAR_perc': 'Solar generation as percentage of total generation (%)',
        'STORAGE_perc': 'Storage as percentage of total generation (%)',
        'GENERATION_perc': 'Total generation (gas, coal, nuclear, wind, hydro, imports) as percentage (%)',
        'LOW_CARBON_perc': 'Low carbon generation as percentage of total generation (%)',
        'ZERO_CARBON_perc': 'Zero carbon generation as percentage of total generation (%)',
        'RENEWABLE_perc': 'Renewable generation as percentage of total generation (%)',
        'FOSSIL_perc': 'Fossil generation as percentage of total generation (%)'
    }
    
    dictionary_data = []
    for column in df.columns:
        # Get 3 sample values (non-null)
        sample_values = df[column].dropna().head(3).tolist()
        sample_str = ', '.join([str(x) for x in sample_values])
        
        dictionary_data.append({
            'Column': column,
            'Data Type': str(df[column].dtype),
            'Missing Values': df[column].isnull().sum(),
            'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
            'Unique Values': df[column].nunique(),
            'Sample Values': sample_str,
            'Description': descriptions.get(column, 'Additional column - description needed (may be new interconnector or generation type)')
        })
    return pd.DataFrame(dictionary_data)

# Store the dictionary in a variable
raw_data_dictionary = create_data_dictionary(df)

# Display data dictionary
print("NESO Generation Mix Dataset - Data Dictionary")
raw_data_dictionary

NESO Generation Mix Dataset - Data Dictionary


,Column,Data Type,Missing Values,Missing %,Unique Values,Sample Values,Description
0,DATETIME,object,0,0.0,291184,"2009-01-01T00:00:00, 2009-01-01T00:30:00, 2009...",Date and time of historic generation mix and c...
1,GAS,float64,0,0.0,23775,"8369.0, 8498.0, 8474.0",Amount of generation delivered by gas fuel typ...
2,COAL,float64,0,0.0,23813,"15037.0, 15095.0, 15088.0",Amount of generation delivered by coal fuel ty...
3,NUCLEAR,float64,0,0.0,6787,"7099.0, 7087.0, 7074.0",Amount of generation delivered by nuclear fuel...
4,WIND,float64,0,0.0,16086,"244.0, 225.0, 203.0",Amount of generation delivered by wind fuel ty...
5,WIND_EMB,float64,0,0.0,5407,"61.0, 56.0, 51.0",Amount of generation delivered by embedded win...
6,HYDRO,int64,0,0.0,1323,"246, 245, 246",Amount of generation delivered by hydro fuel t...
7,IMPORTS,float64,0,0.0,7601,"2519.0, 2497.0, 2466.0",Interconnector imports (MW)
8,BIOMASS,float64,0,0.0,3200,"0.0, 0.0, 0.0",Amount of generation delivered by biomass fuel...
9,OTHER,float64,0,0.0,2424,"0.0, 0.0, 0.0",Amount of generation delivered by other fuel t...


### Stage 4: Data type correction

In [113]:
# Step 1: Convert DATETIME column into datetime format
try:
    df['DATETIME'] = pd.to_datetime(df['DATETIME'], utc=True)
    print(f"DATETIME converted to: {df['DATETIME'].dtype}")
except Exception as e:
    print(f"DATETIME conversion failed: {e}")


DATETIME converted to: datetime64[ns, UTC]
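The conversion above either succeeds for the whole column or fails as a whole. When a file contains a few malformed timestamps, it can be more informative to see which values are at fault. The sketch below shows one way to do that on a toy series, using the `errors="coerce"` option of `pd.to_datetime`, which turns unparseable values into `NaT`. The sample values are invented for illustration.

```python
import pandas as pd

# Toy input: two valid ISO 8601 timestamps and one invalid value
raw = pd.Series(["2009-01-01T00:00:00", "2009-01-01T00:30:00", "not a timestamp"])

# errors="coerce" converts anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

# Values that were present in the input but could not be parsed
failed_mask = parsed.isna() & raw.notna()
n_failed = int(failed_mask.sum())
bad_values = raw[failed_mask].tolist()

print(f"Unparseable timestamps: {n_failed}")
print(bad_values)
```

A count of zero here gives the same result as the strict conversion. A non-zero count points directly at the rows that need attention.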


In [114]:
# Step 2: Check what has changed and whether we have introduced any missing values

# Generate new data dictionary after transformations
transformed_data_dictionary = create_data_dictionary(df)

transformed_data_dictionary

,Column,Data Type,Missing Values,Missing %,Unique Values,Sample Values,Description
0,DATETIME,"datetime64[ns, UTC]",0,0.0,291184,"2009-01-01 00:00:00+00:00, 2009-01-01 00:30:00...",Date and time of historic generation mix and c...
1,GAS,float64,0,0.0,23775,"8369.0, 8498.0, 8474.0",Amount of generation delivered by gas fuel typ...
2,COAL,float64,0,0.0,23813,"15037.0, 15095.0, 15088.0",Amount of generation delivered by coal fuel ty...
3,NUCLEAR,float64,0,0.0,6787,"7099.0, 7087.0, 7074.0",Amount of generation delivered by nuclear fuel...
4,WIND,float64,0,0.0,16086,"244.0, 225.0, 203.0",Amount of generation delivered by wind fuel ty...
5,WIND_EMB,float64,0,0.0,5407,"61.0, 56.0, 51.0",Amount of generation delivered by embedded win...
6,HYDRO,int64,0,0.0,1323,"246, 245, 246",Amount of generation delivered by hydro fuel t...
7,IMPORTS,float64,0,0.0,7601,"2519.0, 2497.0, 2466.0",Interconnector imports (MW)
8,BIOMASS,float64,0,0.0,3200,"0.0, 0.0, 0.0",Amount of generation delivered by biomass fuel...
9,OTHER,float64,0,0.0,2424,"0.0, 0.0, 0.0",Amount of generation delivered by other fuel t...


#### Statistical Overview of numeric columns

In [115]:
# Get all numeric columns (everything except DATETIME)
cols = [col for col in df.columns if col != 'DATETIME']

# Collect describe statistics for each column
describe_stats = []

for col in cols:
    stats = df[col].describe()
    
    # Create a dictionary with column name and all describe statistics
    stats_dict = {
        'Column': col,
        'Count': stats['count'],
        'Mean': round(stats['mean'], 2),
        'Std': round(stats['std'], 2),
        'Min': round(stats['min'], 2),
        '25%': round(stats['25%'], 2),
        '50% (Median)': round(stats['50%'], 2),
        '75%': round(stats['75%'], 2),
        'Max': round(stats['max'], 2),
        'Range': round(stats['max'] - stats['min'], 2),
        'Missing_Values': df[col].isnull().sum(),
        'Missing_Percentage': round((df[col].isnull().sum() / len(df)) * 100, 2)
    }
    
    describe_stats.append(stats_dict)

# Convert to DataFrame
describe_df = pd.DataFrame(describe_stats)

# Set column name as index for easier reading
describe_df.set_index('Column', inplace=True)

print("Descriptive Statistics Summary:")
print(f"Dataset contains {len(cols)} columns")
print(f"Time period: {df['DATETIME'].min()} to {df['DATETIME'].max()}")
print("\nDetailed statistics by column:")

# Display the DataFrame
describe_df

Descriptive Statistics Summary:
Dataset contains 33 columns
Time period: 2009-01-01 00:00:00+00:00 to 2025-08-11 07:30:00+00:00

Detailed statistics by column:


Column,Count,Mean,Std,Min,25%,50% (Median),75%,Max,Range,Missing_Values,Missing_Percentage
GAS,291184.0,12104.21,5489.73,23.0,7599.0,12156.0,16382.0,27868.0,27845.0,0,0.0
COAL,291184.0,5680.71,6549.63,0.0,229.0,2173.0,10838.0,26044.0,26044.0,0,0.0
NUCLEAR,291184.0,6361.87,1473.07,2065.0,5146.0,6528.0,7601.0,9342.0,7277.0,0,0.0
WIND,291184.0,3870.75,3657.9,0.0,961.0,2672.0,5631.0,17406.0,17406.0,0,0.0
WIND_EMB,291184.0,1036.4,954.42,0.0,284.0,769.0,1529.0,5947.0,5947.0,0,0.0
HYDRO,291184.0,392.53,245.31,0.0,187.0,360.0,559.0,1403.0,1403.0,0,0.0
IMPORTS,291184.0,2474.73,1632.6,0.0,1382.0,2435.0,3106.0,9148.0,9148.0,0,0.0
BIOMASS,291184.0,901.92,1076.93,0.0,0.0,0.0,1902.0,3328.0,3328.0,0,0.0
OTHER,291184.0,461.01,592.87,0.0,44.0,170.0,714.0,3036.0,3036.0,0,0.0
SOLAR,291184.0,927.35,1880.84,0.0,0.0,0.0,867.0,14035.0,14035.0,0,0.0


#### Data Type Correction Summary
The data dictionary shows that every column now has the data type needed for the analysis, and that neither the raw nor the transformed data contains missing values.
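The same conclusion can be checked in code rather than by reading the table. The sketch below assumes that `DATETIME` should be a datetime column and every other column should be numeric, which matches the data dictionary above. `unexpected_dtypes` is a hypothetical helper, and the toy frame deliberately includes one text column to show what a failure looks like.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def unexpected_dtypes(df, datetime_col="DATETIME"):
    """Map each column whose dtype is not the expected kind to its actual dtype."""
    problems = {}
    for col in df.columns:
        if col == datetime_col:
            ok = is_datetime64_any_dtype(df[col])
        else:
            ok = is_numeric_dtype(df[col])
        if not ok:
            problems[col] = str(df[col].dtype)
    return problems


# Toy frame: HYDRO has been read as text, so it should be flagged
toy = pd.DataFrame(
    {
        "DATETIME": pd.to_datetime(["2009-01-01T00:00:00"], utc=True),
        "GAS": [8369.0],
        "HYDRO": ["246"],
    }
)

problems = unexpected_dtypes(toy)
print(list(problems))  # ['HYDRO']
```

An empty result on the real DataFrame would confirm the summary above without relying on a visual scan of the dictionary.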

### Stage 5: Data Completeness Checks
Now let's check for missing timestamps between the start and end of the dataset. At 30-minute resolution there should be exactly one record for every half-hour interval.

In [116]:
# Get min and max datetime
min_datetime = df['DATETIME'].min()
max_datetime = df['DATETIME'].max()
actual_records = len(df)
print(f'Min: {min_datetime}')
print(f'Max: {max_datetime}')

Min: 2009-01-01 00:00:00+00:00
Max: 2025-08-11 07:30:00+00:00


In [117]:
# Create complete 30-minute array
expected_range = pd.date_range(
    start=min_datetime, 
    end=max_datetime, 
    freq='30min',  # 30-minute intervals
    tz='UTC'
)
expected_records = len(expected_range)

print(f"Coverage: {(actual_records/expected_records)*100:.2f}%")

Coverage: 100.00%
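A coverage ratio compares counts only, so on its own it can read 100% even when a timestamp is missing, provided a duplicate elsewhere makes up the count. This is why the duplicate check in the next cell matters. As a complementary sketch, the missing timestamps can also be listed explicitly with an index difference; `find_missing_timestamps` is a hypothetical helper and the toy series is invented to show the failure mode.

```python
import pandas as pd


def find_missing_timestamps(timestamps, freq="30min"):
    """Return the expected timestamps that do not appear in `timestamps`."""
    observed = pd.DatetimeIndex(timestamps).unique()
    expected = pd.date_range(start=observed.min(), end=observed.max(), freq=freq)
    return expected.difference(observed)


# Toy series: 01:00 is missing and 00:30 appears twice
toy = pd.Series(
    pd.to_datetime(
        ["2009-01-01 00:00", "2009-01-01 00:30", "2009-01-01 00:30", "2009-01-01 01:30"],
        utc=True,
    )
)

expected_slots = len(pd.date_range(toy.min(), toy.max(), freq="30min"))
coverage = len(toy) / expected_slots * 100
missing = find_missing_timestamps(toy)

print(f"Coverage: {coverage:.2f}%")
print(f"Missing timestamps: {list(missing)}")
```

In the toy case the ratio reports full coverage while the difference still finds the gap at 01:00. An empty result from `find_missing_timestamps(df['DATETIME'])` would confirm completeness directly.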


In [118]:
# Check for duplicate timestamps
duplicates = df[df.duplicated(subset=['DATETIME'])]
if len(duplicates) > 0:
    print(f"\n{len(duplicates)} duplicate timestamps found")
else:
    print("\nNo duplicate timestamps found")


No duplicate timestamps found


#### Datetime completeness summary
Every expected timestamp is present in the dataset, with complete information and no duplicates. We can now move on to evaluating the measures in the data.

### Next Steps
Before imputing the missing values in the regional_carbon_intensity dataset, we will combine the regional_carbon_intensity prediction dataset with the generation_mix data so that the values can be imputed more accurately.

### Stage 6: Export Processed Dataset, updated dictionary and statistics

In [119]:
# Save the main processed dataset
generation_mix_file = os.path.join(processed_dir, 'generation_mix.csv')
df.to_csv(generation_mix_file, index=False)
print(f"Generation mix data saved: {generation_mix_file}")

# Save data dictionary
dict_file = os.path.join(dict_dir, 'generation_mix_data_dictionary.csv')
transformed_data_dictionary.to_csv(dict_file, index=False)
print(f"Data dictionary saved: {dict_file}")

# Save descriptive statistics
stats_file = os.path.join(stats_dir, 'generation_mix_descriptive_stats.csv')
describe_df.to_csv(stats_file)
print(f"Descriptive statistics saved: {stats_file}")

Generation mix data saved: ../data/processed/generation_mix.csv
Data dictionary saved: ../data/data_dictionary/generation_mix_data_dictionary.csv
Descriptive statistics saved: ../data/descriptive_statistics/generation_mix_descriptive_stats.csv
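One caveat when reusing the exported file: CSV stores no data types, so the `DATETIME` column comes back as text unless it is parsed again on load. The sketch below illustrates the round trip on a two-row toy frame written to a temporary folder. It is an illustration only and does not touch the files saved above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a timezone-aware DATETIME column, mirroring the processed data
original = pd.DataFrame(
    {
        "DATETIME": pd.to_datetime(
            ["2009-01-01T00:00:00", "2009-01-01T00:30:00"], utc=True
        ),
        "GAS": [8369.0, 8498.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "generation_mix.csv"
    original.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Straight after reading, DATETIME is plain text rather than a datetime dtype
dtype_after_read = str(reloaded["DATETIME"].dtype)

# Re-apply the same conversion used in Stage 4
reloaded["DATETIME"] = pd.to_datetime(reloaded["DATETIME"], utc=True)

round_trip_ok = bool(
    (reloaded["DATETIME"] == original["DATETIME"]).all()
    and (reloaded["GAS"] == original["GAS"]).all()
)
print(dtype_after_read, round_trip_ok)
```

Downstream notebooks that load `generation_mix.csv` would need the same `pd.to_datetime(..., utc=True)` step, or the `parse_dates` argument of `pd.read_csv`, to recover the UTC timestamps.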
