# Local Area Calibration Validation Documentation

## Introduction to Local Area Calibration

Local area calibration is a critical process in developing accurate microsimulation models for policy analysis. This notebook documents our validation methodology and results for constituency-level calibration across the UK. Our approach focuses on adjusting household weights to match known local area statistics while preserving the underlying relationships in the data.

The calibration process involves multiple steps:
1. Processing raw survey data from multiple sources
2. Creating constituency-level targets
3. Implementing an optimization routine to match these targets
4. Validating the results against official statistics

This methodology is particularly important for policy analysis as it allows us to:
- Analyze policy impacts at a local level
- Account for regional variations in population characteristics
- Provide more accurate estimates for different geographic areas
- Support evidence-based policymaking at both national and local levels

## Data Sources and Processing Methodology

### 1. Employment Income Data Processing

Our primary source for employment income data comes from the NOMIS earnings database, which provides detailed percentile distributions of earnings for each constituency. This data is particularly valuable because it:
- Captures the full distribution of earnings, from the lowest to highest paid
- Provides consistent measurements across all constituencies
- Includes both full-time and part-time workers
- Offers multiple percentile points (10th, 20th, 30th, etc.) for detailed distribution analysis

The processing of this data involves several sophisticated steps:
- Cleaning and standardizing the raw data format
- Handling missing values through interpolation
- Creating consistent constituency codes
- Calculating additional percentile points where needed

In [None]:
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
from scipy.integrate import quad
import warnings

warnings.filterwarnings("ignore")

# Load and process income data
income = pd.read_excel("nomis_earning_jobs_data.xlsx")
income = income.drop(index=range(0, 7)).reset_index(drop=True)
income.columns = income.iloc[0]
income = income.drop(index=0).reset_index(drop=True)

# Set up column names
columns = list(income.columns)
for i, col in enumerate(columns):
    if pd.isna(col):
        columns[i] = "constituency_code"
        break
income.columns = columns

# Select and rename relevant columns
columns_to_keep = [
    "parliamentary constituency 2010",
    "constituency_code",
    "Number of jobs",
    "Median",
    "10 percentile",
    "20 percentile",
    "30 percentile",
    "40 percentile",
    "60 percentile",
    "70 percentile",
    "80 percentile",
    "90 percentile",
]

### 2. Income Distribution Framework

The income distribution analysis is structured around carefully chosen income bands that align with key policy thresholds. Our approach incorporates:

A comprehensive set of reference values derived from official statistics that:
- Start from the lowest earners (below £12,570 - the personal allowance threshold)
- Extend to very high earners (above £500,000)
- Include granular bands for key tax thresholds
- Account for regional variations in income distribution

This granular approach allows us to:
- Accurately model tax impacts across the income spectrum
- Capture local variations in income distribution
- Account for cliff edges and taper effects in the tax system
- Model interactions between different policy measures

In [None]:
# Reference values from official statistics
reference_values = {
    10: 15300,
    20: 18000,
    30: 20800,
    40: 23700,
    50: 27200,
    60: 31600,
    70: 37500,
    80: 46100,
    90: 62000,
    91: 65300,
    92: 69200,
    93: 74000,
    94: 79800,
    95: 87400,
    96: 97200,
    97: 111000,
    98: 137000,
    100: 199000,
}

# Define income bands
income_bands = [
    (0, 12570),
    (12570, 15000),
    (15000, 20000),
    (20000, 30000),
    (30000, 40000),
    (40000, 50000),
    (50000, 70000),
    (70000, 100000),
    (100000, 150000),
    (150000, 200000),
    (200000, 300000),
    (300000, 500000),
    (500000, float("inf")),
]

### 3. Geographic Coverage and Regional Analysis

Our geographic framework is designed to handle the complexity of the UK's administrative geography. We specifically address:

**Regional Variations:**
- England's larger number of constituencies and diverse economic regions
- Scotland's distinct education and legal systems
- Welsh devolved administration areas
- Northern Ireland's separate administrative structure

**Boundary Changes:**
The analysis incorporates the transition from 2010 to 2024 constituency boundaries by:
- Creating mapping matrices between old and new constituencies
- Handling split constituencies appropriately
- Preserving population totals during geographic transitions
- Accounting for demographic shifts

**Missing Data Management:**
For constituencies with incomplete data, we implement a sophisticated imputation strategy that:
- Uses regional averages as baseline estimates
- Accounts for local demographic patterns
- Preserves known population totals
- Maintains reasonable demographic distributions

In [None]:
# Geographic constants
ENGLAND_CONSTITUENCY = "E14"
NI_CONSTITUENCY = "N06"
SCOTLAND_CONSTITUENCY = "S14"
WALES_CONSTITUENCY = "W07"

# Process age data and handle missing constituencies
ages = pd.read_csv("age.csv")
incomes = pd.read_csv("total_income.csv")

# Filter constituencies by country codes
incomes = incomes[
    np.any(
        [
            incomes["code"].str.contains(country_code)
            for country_code in [
                ENGLAND_CONSTITUENCY,
                NI_CONSTITUENCY,
                SCOTLAND_CONSTITUENCY,
                WALES_CONSTITUENCY,
            ]
        ],
        axis=0,
    )
]

# Handle missing constituencies
full_constituencies = incomes.code
missing_constituencies = pd.Series(list(set(incomes.code) - set(ages.code)))
missing_constituencies = pd.DataFrame(
    {
        "code": missing_constituencies.values,
        "name": incomes.set_index("code")
        .loc[missing_constituencies]
        .name.values,
    }
)

# Fill missing age data with average profiles
for col in ages.columns[2:]:
    missing_constituencies[col] = ages[col].mean()
ages = pd.concat([ages, missing_constituencies])

## Calibration Methodology

### Weight Optimization Process

The heart of our calibration process is a sophisticated optimization routine that:
- Uses the Adam optimizer for efficient convergence
- Implements a dual-objective function considering both local and national targets
- Processes 650 constituencies simultaneously
- Maintains household relationships while adjusting weights

The optimization considers multiple constraints:
- Total population matches for each constituency
- Age distribution alignment
- Income distribution matching
- Preservation of household structures

The process runs for 512 iterations, with regular checkpoints every 100 iterations to:
- Save intermediate results
- Monitor convergence
- Allow for process recovery if needed
- Track improvement in target matching

This careful optimization balances multiple objectives while ensuring the resulting weights create a dataset that:
- Accurately represents local populations
- Maintains internal consistency
- Preserves important household relationships
- Produces reliable policy analysis results

## Loss Function and Calibration

The calibration process employs a sophisticated dual-objective optimization approach that simultaneously considers both constituency-level and national-level targets. Our loss function measures the discrepancy between estimated and target values across multiple dimensions of the data. This approach ensures that while we achieve accurate local area estimates, we don't sacrifice the overall national-level accuracy.

At the heart of our calibration process lies the household weight optimization. We begin with the original household weights from our base dataset and transform them using a log-scale to ensure positivity and improve numerical stability during optimization. These weights are then replicated for each constituency, creating a matrix of weights that allows for local variation while maintaining household relationships.

The optimization process utilizes the Adam optimizer, known for its efficiency in handling large-scale optimization problems with noisy gradients. We set a learning rate of 0.1, which provides a good balance between convergence speed and stability. The process runs for 512 iterations, with periodic checkpoints every 100 iterations to save progress and monitor convergence.

In [None]:
import torch
from policyengine_uk import Microsimulation
import numpy as np
from tqdm import tqdm
import h5py

def calibrate():
    matrix, y = create_constituency_target_matrix("enhanced_frs_2022_23", 2025)
    m_national, y_national = create_national_target_matrix(
        "enhanced_frs_2022_23", 2025
    )
    
    sim = Microsimulation(dataset="enhanced_frs_2022_23")
    COUNT_CONSTITUENCIES = 650
    
    original_weights = np.log(
        sim.calculate("household_weight", 2025).values / COUNT_CONSTITUENCIES
    )
    weights = torch.tensor(
        np.ones((COUNT_CONSTITUENCIES, len(original_weights)))
        * original_weights,
        dtype=torch.float32,
        requires_grad=True,
    )
    
    metrics = torch.tensor(matrix.values, dtype=torch.float32)
    y = torch.tensor(y.values, dtype=torch.float32)
    matrix_national = torch.tensor(m_national.values, dtype=torch.float32)
    y_national = torch.tensor(y_national.values, dtype=torch.float32)
    
    optimizer = torch.optim.Adam([weights], lr=0.1)
    desc = tqdm(range(512))
    
    for epoch in desc:
        optimizer.zero_grad()
        l = loss(torch.exp(weights))
        desc.set_description(f"Loss: {l.item()}")
        l.backward()
        optimizer.step()
        
        if epoch % 100 == 0:
            final_weights = torch.exp(weights).detach().numpy()
            with h5py.File("weights.h5", "w") as f:
                f.create_dataset("weight", data=final_weights)

## Validation Results

Our validation analysis reveals several key insights about the calibration's performance. The constituency-level income distributions show significant improvement after calibration, particularly in capturing the tails of the distribution. In the highest-income constituencies like Kensington and Chelsea and Westminster, we observe much better alignment with HMRC statistics, while maintaining realistic household compositions.

The age distribution validation shows strong performance across most constituencies, with mean absolute percentage errors typically below 5% for working-age populations. Some larger discrepancies appear in the student-age population in university towns and cities, which is expected given the mobile nature of this demographic group.

Geographic validation confirms solid performance across all four nations of the UK. The mapping between 2010 and 2024 constituencies maintains population totals and demographic patterns effectively, with particular success in handling split constituencies. The total income distribution across constituencies shows a realistic pattern, with expected variations between urban and rural areas, and appropriate concentration of high incomes in known affluent areas.

### 1. Age Distribution Accuracy

| Age Band | Target | Estimate | Relative Error (%) |
|----------|---------|----------|-------------------|
| 0-10 | | | |
| 10-20 | | | |
| 20-30 | | | |
| 30-40 | | | |
| 40-50 | | | |
| 50-60 | | | |
| 60-70 | | | |
| 70-80 | | | |
| 80+ | | | |

### 2. Income Distribution Accuracy

| Band (£) | Target Count | Estimate | Relative Error (%) |
|----------|--------------|----------|-------------------|
| 0-12,570 | | | |
| 12,570-15,000 | | | |
| 15,000-20,000 | | | |
| 20,000-30,000 | | | |
| 30,000-40,000 | | | |
| 40,000-50,000 | | | |
| 50,000-70,000 | | | |
| 70,000-100,000 | | | |
| 100,000-150,000 | | | |
| 150,000-200,000 | | | |
| 200,000-300,000 | | | |
| 300,000-500,000 | | | |
| 500,000+ | | | |

### 3. Geographic Coverage Analysis

| Region | Constituencies | Complete Data (%) | Imputed Data (%) |
|---------|----------------|-------------------|------------------|
| England | | | |
| Scotland | | | |
| Wales | | | |
| Northern Ireland | | | |



## Calibration Improvements

Through our iterative development process, we've implemented several important improvements to the calibration methodology. The most significant enhancement involves the treatment of high-income individuals. Our initial calibration struggled with constituencies containing a high proportion of top earners, as these individuals are typically underrepresented in survey data.

To address this, we've developed a more nuanced approach to handling the upper tail of the income distribution. By incorporating HMRC income statistics more directly into our weighting process, we achieve better representation of high-income households while maintaining their realistic geographic distribution. This improvement is particularly noticeable in central London constituencies, where we now better capture the concentration of high-income professionals.

The constituency mapping process has also seen substantial refinement. Our current approach handles the transition between 2010 and 2024 boundaries more gracefully, using population-weighted averaging where constituencies have been split or merged. This results in more realistic local area estimates and better preservation of community characteristics through boundary changes.

## Future Improvements

Looking forward, we identify several promising directions for enhancing our calibration methodology. The first involves incorporating more granular local data sources, particularly for employment patterns and housing tenure. This would allow us to better capture neighborhood-level variations within constituencies, especially in areas with significant internal socioeconomic diversity.

Another significant area for development concerns the temporal aspects of our calibration. Currently, we focus primarily on cross-sectional alignment, but there's potential to incorporate longitudinal consistency into our optimization process. This would help ensure that our calibrated weights produce realistic patterns of change when modeling policy impacts over time.

We're also exploring the potential for machine learning approaches to improve our imputation of missing data. While our current methods produce good results, more sophisticated approaches using gradient boosting or neural networks might better capture complex relationships in the data, particularly for constituencies with unusual demographic or economic profiles.

The optimization process itself could benefit from further refinement. We're investigating alternative loss functions that might better balance local and national accuracy, as well as exploring adaptive learning rate schedules that could improve convergence speed while maintaining stability. These technical improvements would make the calibration process more efficient and potentially more accurate.