# NESO Regional Carbon Intensity Data Transformation

### Notebook Objectives

### Primary Goals
- **Transform raw NESO regional carbon intensity data** into clean, analysis-ready format
- **Standardize datetime handling** across all energy datasets
- **Create consistent regional mapping** for geographic analysis
- **Generate summary statistics** and data quality reports for carbon intensity trends

### User Stories
> **As a data analyst**, I want clear documentation and explanations for each NESO dataset we extract so that I and other team members can understand the source, structure, meaning and caveats of the data without digging into code.

> **As a climate researcher**, I want standardized regional carbon intensity data so that I can easily identify the cleanest and dirtiest electricity by time and location across Great Britain.

> **As an energy consumer**, I want reliable carbon intensity transformation pipelines so that I can trust insights about when and where to use electricity most sustainably.

### About This Dataset

### Source: NESO Regional Carbon Intensity
- **Update Frequency**: Daily
- **Time Resolution**: 30-minute intervals
- **Attribution**: "Supported by National Energy SO Open Data"

### Stage 1: Environment Setup
Here we import the Python libraries needed to validate the integrity of the dataset and transform it where necessary.

In [6]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import os
from pathlib import Path

In [7]:
# Assign output directories for processed data, data dictionaries and statistics
processed_dir = "../data/processed"
dict_dir = '../data/data_dictionary'
stats_dir = '../data/descriptive_statistics'

print(f"{processed_dir}/ - Clean datasets for analysis")
print(f"{dict_dir}/ - Data dictionaries and documentation")
print(f"{stats_dir}/ - Descriptive statistics and summaries")

../data/processed/ - Clean datasets for analysis
../data/data_dictionary/ - Data dictionaries and documentation
../data/descriptive_statistics/ - Descriptive statistics and summaries
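The export cell at the end of the notebook writes into these folders, and `to_csv` raises an error if the target folder does not exist. A minimal sketch of creating the folders up front, assuming a temporary base folder as a stand-in for `../data` so the example can run anywhere:

```python
from pathlib import Path
import tempfile

# Assumption: a temporary folder stands in for the notebook's ../data directory
base = Path(tempfile.mkdtemp())

output_dirs = [
    base / "processed",
    base / "data_dictionary",
    base / "descriptive_statistics",
]

for directory in output_dirs:
    # parents=True creates intermediate folders, exist_ok=True makes reruns safe
    directory.mkdir(parents=True, exist_ok=True)
    print(f"Ready: {directory}")
```

In the notebook itself the same loop could run over `processed_dir`, `dict_dir` and `stats_dir` wrapped in `Path(...)`.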


### Stage 2: Load raw data
Next we load the data, then check its shape and preview the first few rows to verify that it has been read correctly.

In [None]:
# Load the raw regional carbon intensity data
df = pd.read_csv('../data/raw/regional_carbon_intensity.csv')

# Check the row and column count and preview the first 5 rows
print(f"Loaded regional carbon intensity data: {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head()

Loaded regional carbon intensity data: 119,629 rows × 15 columns


Unnamed: 0,datetime,North Scotland,South Scotland,North West England,North East England,Yorkshire,North Wales and Merseyside,South Wales,West Midlands,East Midlands,East England,South West England,South England,London,South East England
0,2018-09-17T23:00:00,30.0,11.0,38.0,24.0,319.0,157.0,271.0,40.0,325.0,58.0,72.0,116.0,49.0,73.0
1,2018-09-17T23:30:00,44.0,7.0,41.0,30.0,298.0,183.0,157.0,39.0,340.0,61.0,90.0,152.0,69.0,70.0
2,2018-09-18T00:00:00,44.0,8.0,39.0,30.0,295.0,180.0,155.0,39.0,337.0,60.0,89.0,152.0,68.0,70.0
3,2018-09-18T00:30:00,44.0,11.0,37.0,31.0,288.0,175.0,151.0,38.0,330.0,58.0,87.0,150.0,69.0,70.0
4,2018-09-18T01:00:00,45.0,13.0,34.0,32.0,278.0,171.0,148.0,37.0,326.0,57.0,85.0,151.0,70.0,69.0


### Stage 3: Custom Function to create Data Dictionary
I am using a reusable function written for previous projects to show the NESO dataset before and after alterations. It also shows how much data is missing from each column and helps identify any data types that need to be transformed to work correctly in our analysis.

In [9]:
# Custom Function to create a comprehensive data dictionary for NESO Regional Carbon Intensity dataset
# Takes a DataFrame and returns a data dictionary with NESO-specific column descriptions
def create_data_dictionary(df):
    # Official descriptions from NESO Regional Carbon Intensity dataset
    # Source: https://www.neso.energy/data-portal/regional-carbon-intensity-forecast
    descriptions = {
        'datetime': 'Timestamp of record, given in UTC (Coordinated Universal Time)',
        'North Scotland': 'North Scotland carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'South Scotland': 'South Scotland carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'North West England': 'North West England carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'North East England': 'North East England carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'Yorkshire': 'Yorkshire carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'North Wales and Merseyside': 'North Wales and Merseyside carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'South Wales': 'South Wales carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'West Midlands': 'West Midlands carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'East Midlands': 'East Midlands carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'East England': 'East England carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'South West England': 'South West England carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'South England': 'South England carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'London': 'London carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)',
        'South East England': 'South East England carbon intensity forecast, predicted using machine learning models and metered generation (gCO2/kWh)'
    }
    
    dictionary_data = []
    for column in df.columns:
        # Get 3 sample values (non-null)
        sample_values = df[column].dropna().head(3).tolist()
        sample_str = ', '.join([str(x) for x in sample_values])
        
        # Truncate very long sample strings so the dictionary table stays readable
        if len(sample_str) > 100:
            sample_str = sample_str[:100] + "..."
        
        dictionary_data.append({
            'Column': column,
            'Data Type': str(df[column].dtype),
            'Missing Values': df[column].isnull().sum(),
            'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
            'Unique Values': df[column].nunique(),
            'Sample Values': sample_str,
            'Description': descriptions.get(column, 'Additional column - description needed (may be new regional metric)')
        })
    return pd.DataFrame(dictionary_data)

# Store the dictionary in a variable
raw_data_dictionary = create_data_dictionary(df)

# Display data dictionary
print("NESO Regional Carbon Intensity Dataset - Data Dictionary")
raw_data_dictionary

NESO Regional Carbon Intensity Dataset - Data Dictionary


Unnamed: 0,Column,Data Type,Missing Values,Missing %,Unique Values,Sample Values,Description
0,datetime,object,0,0.0,119629,"2018-09-17T23:00:00, 2018-09-17T23:30:00, 2018...","Timestamp of record, given in UTC (Coordinated..."
1,North Scotland,float64,1311,1.1,382,"30.0, 44.0, 44.0","North Scotland carbon intensity forecast, pred..."
2,South Scotland,float64,1311,1.1,241,"11.0, 7.0, 8.0","South Scotland carbon intensity forecast, pred..."
3,North West England,float64,1311,1.1,289,"38.0, 41.0, 39.0","North West England carbon intensity forecast, ..."
4,North East England,float64,1311,1.1,262,"24.0, 30.0, 30.0","North East England carbon intensity forecast, ..."
5,Yorkshire,float64,1311,1.1,461,"319.0, 298.0, 295.0","Yorkshire carbon intensity forecast, predicted..."
6,North Wales and Merseyside,float64,1311,1.1,670,"157.0, 183.0, 180.0",North Wales and Merseyside carbon intensity fo...
7,South Wales,float64,1311,1.1,641,"271.0, 157.0, 155.0","South Wales carbon intensity forecast, predict..."
8,West Midlands,float64,1311,1.1,526,"40.0, 39.0, 39.0","West Midlands carbon intensity forecast, predi..."
9,East Midlands,float64,1311,1.1,749,"325.0, 340.0, 337.0","East Midlands carbon intensity forecast, predi..."


#### Statistical Overview of numeric columns

In [10]:
# Get regional columns (excluding datetime)
regional_cols = [col for col in df.columns if col != 'datetime']

# Collect describe statistics for all regional columns
describe_stats = []

for col in regional_cols:
    stats = df[col].describe()
    
    # Create a dictionary with column name and all describe statistics
    stats_dict = {
        'Column': col,
        'Count': stats['count'],
        'Mean': round(stats['mean'], 2),
        'Std': round(stats['std'], 2),
        'Min': round(stats['min'], 2),
        '25%': round(stats['25%'], 2),
        '50% (Median)': round(stats['50%'], 2),
        '75%': round(stats['75%'], 2),
        'Max': round(stats['max'], 2),
        'Range': round(stats['max'] - stats['min'], 2),
        'Missing_Values': df[col].isnull().sum(),
        'Missing_Percentage': round((df[col].isnull().sum() / len(df)) * 100, 2)
    }
    
    describe_stats.append(stats_dict)

# Convert to DataFrame
describe_df = pd.DataFrame(describe_stats)

# Set column name as index for easier reading
describe_df.set_index('Column', inplace=True)

print("Descriptive Statistics Summary:")
print(f"Dataset contains {len(regional_cols)} regional carbon intensity columns")
print(f"Time period: {df['datetime'].min()} to {df['datetime'].max()}")
print("\nDetailed statistics by region:")

# Display the DataFrame
describe_df

Descriptive Statistics Summary:
Dataset contains 14 regional carbon intensity columns
Time period: 2018-09-17T23:00:00 to 2025-08-13T11:30:00

Detailed statistics by region:


Column,Count,Mean,Std,Min,25%,50% (Median),75%,Max,Range,Missing_Values,Missing_Percentage
North Scotland,118318.0,69.29,92.25,0.0,0.0,16.0,123.0,383.0,383.0,1311,1.1
South Scotland,118318.0,37.37,37.77,0.0,11.0,23.0,51.0,245.0,245.0,1311,1.1
North West England,118318.0,64.08,46.59,0.0,27.0,54.0,92.0,289.0,289.0,1311,1.1
North East England,118318.0,38.85,40.4,0.0,13.0,24.0,47.0,271.0,271.0,1311,1.1
Yorkshire,118318.0,208.18,81.1,0.0,143.0,216.0,269.0,491.0,491.0,1311,1.1
North Wales and Merseyside,118318.0,138.4,105.62,0.0,52.0,116.0,202.0,717.0,717.0,1311,1.1
South Wales,118318.0,302.06,96.96,0.0,253.0,338.0,371.0,748.0,748.0,1311,1.1
West Midlands,118318.0,169.03,101.77,-13.0,77.0,164.0,249.0,554.0,567.0,1311,1.1
East Midlands,118318.0,284.88,117.59,2.0,196.0,291.0,362.0,774.0,772.0,1311,1.1
East England,118318.0,147.81,78.73,0.0,85.0,142.0,206.0,402.0,402.0,1311,1.1


#### Assessment of Statistical values
We can see that some regions occasionally record a carbon intensity below 0. This is possible if a region generates enough zero-carbon energy to export the excess to other regions, offsetting its neighbours' carbon intensity. For this reason we will leave the negative carbon intensity values in our analysis.
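To see how often negative values occur before deciding to keep them, a count per region is a quick check. The sketch below uses a small made-up frame rather than the real data; in the notebook the same expression could be applied to `df[regional_cols]`.

```python
import pandas as pd

# Made-up values for illustration only, not taken from the NESO dataset
toy = pd.DataFrame({
    "West Midlands": [-13.0, 40.0, -2.0],
    "London": [49.0, 69.0, 68.0],
})

# Count values below zero in each region, then keep only regions that have any
negative_counts = (toy < 0).sum()
negative_counts = negative_counts[negative_counts > 0]

print(negative_counts)
```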

### Stage 4: Data type correction

In [11]:
# Step 1: Convert DATETIME column into datetime format
try:
    df['datetime'] = pd.to_datetime(df['datetime'], utc=True)
    print(f"DATETIME converted to: {df['datetime'].dtype}")
except Exception as e:
    print(f"DATETIME conversion failed: {e}")

DATETIME converted to: datetime64[ns, UTC]
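If a single malformed timestamp were present, the conversion above would fail as a whole. One possible alternative, shown here as a sketch on made-up strings, is to pass `errors="coerce"` so unparseable values become `NaT` and can be counted and inspected:

```python
import pandas as pd

# Made-up sample: two valid ISO timestamps and one malformed value
raw = pd.Series(["2018-09-17T23:00:00", "2018-09-17T23:30:00", "not a timestamp"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

# Values that were present in the raw data but failed to parse
unparseable = raw[parsed.isna() & raw.notna()]
print(f"{len(unparseable)} value(s) could not be parsed")
```

The trade-off is that coercion hides failures unless the count is checked, so it suits an exploratory pass more than a silent production step.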


In [12]:
# Step 2: Check what has changed and whether we have introduced any missing values

# Generate new data dictionary after transformations
transformed_data_dictionary = create_data_dictionary(df)

transformed_data_dictionary

Unnamed: 0,Column,Data Type,Missing Values,Missing %,Unique Values,Sample Values,Description
0,datetime,"datetime64[ns, UTC]",0,0.0,119629,"2018-09-17 23:00:00+00:00, 2018-09-17 23:30:00...","Timestamp of record, given in UTC (Coordinated..."
1,North Scotland,float64,1311,1.1,382,"30.0, 44.0, 44.0","North Scotland carbon intensity forecast, pred..."
2,South Scotland,float64,1311,1.1,241,"11.0, 7.0, 8.0","South Scotland carbon intensity forecast, pred..."
3,North West England,float64,1311,1.1,289,"38.0, 41.0, 39.0","North West England carbon intensity forecast, ..."
4,North East England,float64,1311,1.1,262,"24.0, 30.0, 30.0","North East England carbon intensity forecast, ..."
5,Yorkshire,float64,1311,1.1,461,"319.0, 298.0, 295.0","Yorkshire carbon intensity forecast, predicted..."
6,North Wales and Merseyside,float64,1311,1.1,670,"157.0, 183.0, 180.0",North Wales and Merseyside carbon intensity fo...
7,South Wales,float64,1311,1.1,641,"271.0, 157.0, 155.0","South Wales carbon intensity forecast, predict..."
8,West Midlands,float64,1311,1.1,526,"40.0, 39.0, 39.0","West Midlands carbon intensity forecast, predi..."
9,East Midlands,float64,1311,1.1,749,"325.0, 340.0, 337.0","East Midlands carbon intensity forecast, predi..."


#### Data Type Correction Summary
In this data dictionary we can see that the data types now match the formats expected for the analysis. Records are missing across all columns other than datetime, which we will investigate further below.

### Stage 5: Data Completeness Checks
Now let's investigate whether any timestamps are missing between the start and end of our dataset. We know that there should be a record for every 30-minute interval.

In [13]:
# Get min and max datetime
min_datetime = df['datetime'].min()
max_datetime = df['datetime'].max()
actual_records = len(df)
print(f'Min: {min_datetime}')
print(f'Max: {max_datetime}')

Min: 2018-09-17 23:00:00+00:00
Max: 2025-08-13 11:30:00+00:00


In [14]:
# Create complete 30-minute array
expected_range = pd.date_range(
    start=min_datetime, 
    end=max_datetime, 
    freq='30min',  # 30-minute intervals
    tz='UTC'
)
expected_records = len(expected_range)

print(f"Coverage: {(actual_records/expected_records)*100:.2f}%")

Coverage: 98.84%
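The coverage figure can be cross-checked with plain arithmetic, without building the full date range. This sketch reuses the start, end and row count printed earlier in the notebook:

```python
import pandas as pd

start = pd.Timestamp("2018-09-17 23:00", tz="UTC")
end = pd.Timestamp("2025-08-13 11:30", tz="UTC")
actual_records = 119_629

# Number of 30-minute steps between the endpoints, plus 1 because both ends are included
expected_records = (end - start) // pd.Timedelta("30min") + 1
missing_records = expected_records - actual_records
coverage = actual_records / expected_records * 100

print(expected_records, missing_records, f"{coverage:.2f}%")
# prints: 121034 1405 98.84%
```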


In [15]:
# Find missing timestamps using sets
existing_timestamps = set(df['datetime'])
expected_timestamps = set(expected_range)
missing_timestamps = expected_timestamps - existing_timestamps

if len(missing_timestamps) == 0:
    print("✅ All timestamps present")
else:
    print(f"{len(missing_timestamps)} missing timestamps:")
    # Sort and show the missing values
    missing_sorted = sorted(list(missing_timestamps))
    for ts in missing_sorted:
        print(f"  {ts}")

1405 missing timestamps:
  2024-01-25 10:30:00+00:00
  2024-01-25 11:00:00+00:00
  2024-01-25 11:30:00+00:00
  2024-01-25 12:00:00+00:00
  2024-01-25 12:30:00+00:00
  2024-01-25 13:00:00+00:00
  2024-01-25 13:30:00+00:00
  2024-01-25 14:00:00+00:00
  2024-01-25 14:30:00+00:00
  2024-01-25 15:00:00+00:00
  2024-01-25 15:30:00+00:00
  2024-01-25 16:00:00+00:00
  2024-01-25 16:30:00+00:00
  2024-01-25 17:00:00+00:00
  2024-01-25 17:30:00+00:00
  2024-01-25 18:00:00+00:00
  2024-01-25 18:30:00+00:00
  2024-01-25 19:00:00+00:00
  2024-01-25 19:30:00+00:00
  2024-01-25 20:00:00+00:00
  2024-01-25 20:30:00+00:00
  2024-01-25 21:00:00+00:00
  2024-01-25 21:30:00+00:00
  2024-01-25 22:00:00+00:00
  2024-01-25 22:30:00+00:00
  2024-01-25 23:00:00+00:00
  2024-01-25 23:30:00+00:00
  2024-01-26 00:00:00+00:00
  2024-01-26 00:30:00+00:00
  2024-01-26 01:00:00+00:00
  2024-01-26 01:30:00+00:00
  2024-01-26 02:00:00+00:00
  2024-01-26 02:30:00+00:00
  2024-01-26 03:00:00+00:00
  2024-01-26 03:30:00+0
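Printing every missing timestamp produces a long list that is hard to read. A possible alternative is to collapse consecutive timestamps into gaps with a start, an end and a count. `summarise_gaps` below is a hypothetical helper, not part of the original notebook, demonstrated on a few made-up timestamps:

```python
import pandas as pd

def summarise_gaps(missing, freq="30min"):
    """Group missing timestamps into contiguous runs (hypothetical helper)."""
    columns = ["gap_start", "gap_end", "n_missing"]
    ts = pd.Series(sorted(missing))
    if ts.empty:
        return pd.DataFrame(columns=columns)
    # A new run starts wherever the step from the previous timestamp is not one interval
    run_id = (ts.diff() != pd.Timedelta(freq)).cumsum()
    gaps = ts.groupby(run_id).agg(["min", "max", "count"])
    gaps.columns = columns
    return gaps.reset_index(drop=True)

# Made-up example: one run of three timestamps and one isolated timestamp
demo_missing = pd.to_datetime(
    ["2024-01-25 10:30", "2024-01-25 11:00", "2024-01-25 11:30", "2024-03-01 00:00"],
    utc=True,
)
gap_summary = summarise_gaps(demo_missing)
print(gap_summary)
```

In the notebook the helper could be called as `summarise_gaps(missing_timestamps)`.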

In [16]:
# Check for duplicate timestamps
duplicates = df[df.duplicated(subset=['datetime'])]
if len(duplicates) > 0:
    print(f"\n{len(duplicates)} duplicate timestamps found")
else:
    print("\nNo duplicate timestamps found")


No duplicate timestamps found


In [17]:
# Rows with NaN/null values
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"\nRows with missing data: {len(rows_with_missing):,} ({(len(rows_with_missing)/len(df))*100:.1f}%)")


Rows with missing data: 1,311 (1.1%)


In [18]:
# Rows with all regions missing
if len(regional_cols) > 0:
    rows_all_missing = df[df[regional_cols].isnull().all(axis=1)]
    print(f"Rows with all regions missing: {len(rows_all_missing):,}")

Rows with all regions missing: 1,311


#### Datetime completeness summary
We have two distinct types of missing data in this dataset: missing rows, where a timestamp was never recorded in the database, and recorded rows, where the timestamp is present but there is no data across the regions.
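One way to handle both kinds of gap together is to reindex the data onto a complete 30-minute grid, so absent timestamps become explicit rows of NaN, while keeping a label that records which kind of gap each row is. A sketch on a made-up frame, where the `recorded` flag and `gap_type` column are illustrative names rather than part of the dataset:

```python
import numpy as np
import pandas as pd

# Made-up frame: 09:30 is recorded but empty, 10:00 is absent altogether
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-25 09:00", "2024-01-25 09:30", "2024-01-25 10:30"], utc=True
    ),
    "London": [49.0, np.nan, 68.0],
})
toy["recorded"] = True  # marks rows that exist in the source data

# Reindex onto the full 30-minute grid; absent timestamps become NaN rows
full_index = pd.date_range(toy["datetime"].min(), toy["datetime"].max(), freq="30min")
regular = toy.set_index("datetime").reindex(full_index)

# Rows added by the reindex have NaN in "recorded", which compares as not True
was_recorded = regular["recorded"].eq(True)
regular["gap_type"] = np.select(
    [~was_recorded, regular["London"].isna()],
    ["timestamp absent", "values missing"],
    default="present",
)
regular = regular.drop(columns="recorded")
print(regular)
```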

#### Missing data Challenges

Data considerations: The missing data in the regional carbon intensity dataset appears to be systematic: whole timestamps are absent, and where NaN values are present the prediction data is missing for all regions at once. The options considered for handling it are:

**a. Linear Weighting (Time Efficient)**
Use historical proportional relationships between each region's carbon intensity and the national GB carbon intensity. Calculate each region's typical proportion of total GB carbon intensity, then multiply the national value by these proportions to estimate missing regional values. This preserves regional patterns while being computationally fast.

**b. Machine Learning (Most Accurate)**
Train sophisticated models (Random Forest, XGBoost, Neural Networks) to learn complex non-linear relationships between generation mix, temporal patterns, weather conditions, and regional carbon intensity. This approach can identify feature importance per region and capture intricate energy system dynamics but requires significant computational resources and model tuning.

**c. K-Nearest Neighbors Imputation (Balanced Approach)**
Find timestamps with similar generation mix patterns (fuel types, total generation, renewable percentages) and impute missing regional carbon intensity values based on these similar energy conditions. KNN captures non-linear relationships between generation mix and carbon intensity while being computationally efficient and preserving realistic energy patterns from actual historical data.

**d. Temporal Pattern Analysis**
Analyse cyclical patterns (hourly, daily, weekly, seasonal) in each region's carbon intensity and impute missing values based on temporal similarity. Uses techniques like seasonal decomposition or Fourier analysis to identify recurring patterns. Limited by its inability to account for variable renewable generation or fuel mix changes.

**e. Simple Statistical Imputation**
Replace missing values with statistical measures (mean, median, mode) calculated from available regional data. While computationally simple, this approach ignores temporal patterns, generation mix influences, and can introduce bias by flattening natural variations in carbon intensity.

**f. Case Deletion**
Remove all records containing missing values from the analysis. This preserves data integrity but removes the 1,311 rows with missing values (about 1.1% of records), potentially losing valuable information and introducing selection bias if missing data patterns are not random.

**Current Plan:** KNN imputation provides the best balance of accuracy and computational efficiency for this energy dataset, leveraging the strong relationship between generation mix and carbon intensity while preserving realistic temporal patterns.
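To make the KNN idea concrete, here is a minimal sketch written with NumPy and pandas only. It is not the notebook's implementation: `knn_impute` is a hypothetical helper, and the `wind_share` and `gas_share` feature columns and all values are made up to stand in for generation mix data. Rows with missing regional values are filled with the mean of the `k` complete rows whose features are closest.

```python
import numpy as np
import pandas as pd

def knn_impute(df, feature_cols, target_cols, k=2):
    """Fill NaN targets with the mean of the k nearest complete rows (hypothetical helper)."""
    out = df.copy()
    known = out.dropna(subset=target_cols)
    # Standardise features so that no single feature dominates the distance
    mu = known[feature_cols].mean()
    sigma = known[feature_cols].std(ddof=0).replace(0, 1)
    known_x = ((known[feature_cols] - mu) / sigma).to_numpy()
    for idx in out.index[out[target_cols].isna().any(axis=1)]:
        x = ((out.loc[idx, feature_cols] - mu) / sigma).to_numpy(dtype=float)
        # Euclidean distance from this row to every complete row
        dist = np.sqrt(((known_x - x) ** 2).sum(axis=1))
        nearest = known.iloc[np.argsort(dist)[:k]]
        # Only the NaN targets are replaced; existing values are kept
        out.loc[idx, target_cols] = out.loc[idx, target_cols].fillna(
            nearest[target_cols].mean()
        )
    return out

# Made-up data: the last row has generation mix features but no regional values
toy = pd.DataFrame({
    "wind_share": [0.10, 0.20, 0.80, 0.90, 0.85],
    "gas_share": [0.80, 0.70, 0.10, 0.05, 0.10],
    "London": [300.0, 280.0, 60.0, 40.0, np.nan],
    "Yorkshire": [350.0, 330.0, 90.0, 70.0, np.nan],
})
filled = knn_impute(toy, ["wind_share", "gas_share"], ["London", "Yorkshire"], k=2)
print(filled.loc[4, ["London", "Yorkshire"]])
```

The windy last row is matched to the two other windy rows, so it receives their averages. For real use, scikit-learn's `KNNImputer` provides a maintained implementation of the same idea.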

### Next Steps
Before we attempt to use KNN to impute the missing values, we will combine the regional_carbon_intensity prediction dataset with the generation_mix data, so that the additional features give us a better chance of imputing the values accurately.

### Stage 6: Export Processed Dataset, updated dictionary and statistics

In [19]:
# Save the processed dataset
regional_ci_file = os.path.join(processed_dir, 'regional_carbon_intensity.csv')
df.to_csv(regional_ci_file, index=False)
print(f"Regional carbon intensity data saved: {regional_ci_file}")

# Save data dictionary to dedicated folder
dict_file = os.path.join(dict_dir, 'regional_carbon_intensity_dictionary.csv')
transformed_data_dictionary.to_csv(dict_file, index=False)
print(f"Data dictionary saved: {dict_file}")

# Save descriptive statistics to dedicated folder
stats_file = os.path.join(stats_dir, 'regional_carbon_intensity_descriptive_stats.csv')
describe_df.to_csv(stats_file)
print(f"Descriptive statistics saved: {stats_file}")


Regional carbon intensity data saved: ../data/processed/regional_carbon_intensity.csv
Data dictionary saved: ../data/data_dictionary/regional_carbon_intensity_dictionary.csv
Descriptive statistics saved: ../data/descriptive_statistics/regional_carbon_intensity_descriptive_stats.csv
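CSV files store timestamps as text, so the `datetime64[ns, UTC]` type set in Stage 4 is not preserved on disk and has to be restored when the processed file is loaded again. A small round-trip sketch, using an in-memory buffer and made-up rows instead of the real file:

```python
import io

import numpy as np
import pandas as pd

# Made-up rows standing in for the processed dataset
original = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-09-17T23:00:00", "2018-09-17T23:30:00"], utc=True),
    "London": [49.0, np.nan],
})

# Write to an in-memory buffer rather than disk so the example is self-contained
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# Restore the timezone-aware datetime type after loading
reloaded["datetime"] = pd.to_datetime(reloaded["datetime"], utc=True)

print(reloaded.dtypes)
```

Downstream notebooks that read `regional_carbon_intensity.csv` would need the same conversion step after `pd.read_csv`.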
