PolicyEngine's free and open-source microsimulation model estimates the budgetary, distributional and poverty impacts of UK tax and benefit reforms by simulating the full details of policy over a large representative dataset of UK households. In this post, we'll provide a brief overview of how PolicyEngine UK's microsimulation model works, and an update on how we maintain and validate the model's accuracy.

## Model overview

PolicyEngine UK is a *static* microsimulation model: it does not (yet) incorporate behavioural responses, such as labour supply reactions to policy changes. Instead, we assume that households do not change their behaviour in response to a reform, so a policy change affects households only through its direct effect on their incomes.

To estimate the direct effects of policy changes, we apply the actual policy rules, as specified in legislation, to each household in a large survey (tens of thousands of UK households). We can then change the rules and see how the totals of different variables change. For example, we could raise the personal tax allowance from £12,570 to £15,000 and aggregate the tax payments before and after the change to estimate how much tax revenue changes across the households in our survey.
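
The sketch below illustrates this before-and-after aggregation on a toy dataset. It is not PolicyEngine's implementation: the four households, their weights, and the single flat 20% rate above the allowance are all assumptions made for illustration, and the real model applies the full Income Tax rules.

```python
import numpy as np

# Toy survey: each entry is one sampled household (hypothetical values).
income = np.array([9_000.0, 14_000.0, 30_000.0, 45_000.0])  # taxable income, £/year
weight = np.array([1_500.0, 2_000.0, 1_200.0, 800.0])  # households each row represents


def income_tax(income, personal_allowance, rate=0.20):
    """Simplified Income Tax: one flat rate on income above the allowance."""
    return rate * np.maximum(income - personal_allowance, 0.0)


def weighted_total(values, weight):
    """Scale household-level amounts up to a population total."""
    return float(np.sum(values * weight))


# Apply the baseline and reformed rules to the same, unchanged households.
baseline = weighted_total(income_tax(income, 12_570), weight)
reform = weighted_total(income_tax(income, 15_000), weight)
revenue_change = reform - baseline

print(f"Baseline revenue: £{baseline:,.0f}")
print(f"Reform revenue:   £{reform:,.0f}")
print(f"Change:           £{revenue_change:+,.0f}")
```

Because the model is static, the incomes are identical in both runs; only the rule changes, so the difference in weighted totals is the direct revenue effect.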

The model is written in Python, and you can follow all of our real-time development [on GitHub](https://github.com/PolicyEngine/PolicyEngine-UK). Other organisations maintain models that use the same microsimulation approach: the IFS' TAXBEN, UKMOD at the University of Essex, the IPPR model, and internal models at HMRC and DWP. However, only PolicyEngine UK and UKMOD publish their policy implementation details and validation statistics.

## How PolicyEngine differs from other models

PolicyEngine's core approach to estimating policy impacts is the same as that of other static microsimulation models. However, we use a novel data science-based approach that significantly improves the accuracy of the model's outputs compared to other models (where we have been able to compare).

Microsimulation models are widely used by researchers to estimate policy impacts (questions for which we don't know the answer). But when we attempt to validate the models by asking them questions for which we do know the answer (for example, total Income Tax revenue in 2021), we often find that the model answers are significantly different from the ground truth. This problem is large and *exists in every microsimulation model that publishes details of attempts to measure it*.

Assuming that the policy implementations in the model are correct (the law is complex and we cannot test every possible household, but we publish hundreds of automated tests and pass them on every version update), the most likely explanation is that the model's survey data is not representative of the population: the model's outputs are only as good as the data that we feed into it.

To reduce this problem, we use machine learning techniques to improve the survey's accuracy, drawing on data from other trusted sources: OBR, HMRC, DWP, ONS and others. We essentially do the following:

1. Take the initial survey data
2. Add synthetic households (using other microdata) and previous-year households with zero weight
3. Collect trusted external statistics describing tax-benefit and demographic properties of the UK
4. Train a machine learning model to adjust the weights of the survey to best fit those external statistics

The resulting weighted survey powers PolicyEngine's impact estimates.
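
As a rough illustration of step 4, the sketch below reweights a toy survey so that its weighted totals approach two external targets. The loss (mean squared relative error), the plain gradient descent optimiser, the function name `calibrate_weights`, and all of the numbers are assumptions for illustration; this post does not specify the model's actual loss or optimiser. The sketch also assumes strictly positive starting weights, so the zero-weight households from step 2 would need a different parameterisation.

```python
import numpy as np


def calibrate_weights(weights, contributions, targets, epochs=2_000, lr=0.5):
    """Adjust household weights so weighted totals approach the targets.

    contributions[i, j] is household i's contribution to statistic j.
    Weights are parameterised as exp(log_w), which keeps them positive.
    """
    log_w = np.log(weights)
    for _ in range(epochs):
        w = np.exp(log_w)
        rel_err = (w @ contributions - targets) / targets
        # Gradient of the mean squared relative error with respect to w
        grad_w = contributions @ (2 * rel_err / targets) / len(targets)
        # Chain rule: d(loss)/d(log_w) = d(loss)/d(w) * w
        log_w -= lr * grad_w * w
    return np.exp(log_w)


# Hypothetical survey: three households, two statistics per household
# (a head count of 1, and income in £ thousands).
contributions = np.array([[1.0, 10.0], [1.0, 30.0], [1.0, 60.0]])
initial_weights = np.array([100.0, 100.0, 100.0])
targets = np.array([330.0, 12_000.0])  # hypothetical official totals

new_weights = calibrate_weights(initial_weights, contributions, targets)
before = np.abs(initial_weights @ contributions - targets) / targets
after = np.abs(new_weights @ contributions - targets) / targets
print("Relative error before:", before)
print("Relative error after: ", after)
```

Using relative rather than absolute errors keeps a target measured in billions of pounds from dominating one measured in thousands of people.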

## Validation

PolicyEngine meets tax-benefit and demographic totals closely, and estimates program impacts over a five-year horizon. For example, the chart below shows our projections for three key benefits: Child Benefit, Housing Benefit and Universal Credit.

*Figure 1: PolicyEngine UK's projections for three key benefits*

import numpy as np  # used below as np.linspace and np.abs
import pandas as pd
import plotly.express as px
from policyengine_core.charts import *
from plotly.express.colors import sample_colorscale

training_log_cps = pd.read_csv(
    "/Users/nikhil/policyengine/policyengine-uk/calibration_log_cps.csv.gz"
)

chosen_metrics = [
    "Child Benefit budgetary impact (UK)",
    "Housing Benefit budgetary impact (GB)",
    "Universal Credit budgetary impact (GB)",
]

training_log_cps["Source dataset"] = "Enhanced FRS"

training_log = training_log_cps
training_log_targets = training_log.copy()
training_log_targets["value"] = training_log_targets["target"]
training_log_targets["Source dataset"] = "Official"
training_log = pd.concat([training_log, training_log_targets])

last_value_df = (
    training_log[training_log.name.isin(chosen_metrics)]
    .groupby(["Source dataset", "time_period", "name"])
    .last()
    .reset_index()[["name", "value", "Source dataset", "time_period"]]
)

x = np.linspace(0.2, 1, 5)
c = sample_colorscale("Blues", list(x))

last_value_df["time_period"] = last_value_df.time_period.astype(str)
last_value_df["text"] = last_value_df["value"].apply(lambda x: f"{x/1e9:,.0f}")
fig = px.bar(
    last_value_df[last_value_df["Source dataset"] == "Enhanced FRS"],
    y="value",
    color="time_period",
    x="name",
    barmode="group",
    text="text",
)

fig = format_fig(fig)

for i in range(len(fig.data)):
    fig.data[i].marker.color = c[i]

fig.update_layout(
    legend_title="Calendar year",
    xaxis_tickvals=chosen_metrics,
    xaxis_ticktext=["Child Benefit", "Housing Benefit", "Universal Credit"],
    xaxis_title="",
    yaxis_title="Budgetary impact (£)",
)

But how well does PolicyEngine align with the best estimates of the ground truth? To see how our data enhancement approach performs, we compare PolicyEngine's estimates with two other sources: the original survey data, and official statistics and projections from government. The chart below shows, for each calendar year in the budget horizon, how the relative error against each tax-benefit-related statistical target changes after enhancement. Over 80% of these targets improve.
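
To make "change in relative error" concrete, here is a minimal worked example with hypothetical figures (a £100bn official target, £80bn from the original survey and £95bn from the enhanced survey). The helper name `relative_error_change` is ours; the calculation follows the same definition as the chart code, where a negative value means the enhanced survey is closer to the target.

```python
def relative_error_change(original, enhanced, target):
    """Proportional change in absolute relative error after enhancement."""
    original_rel_error = abs(target - original) / abs(target)
    enhanced_rel_error = abs(target - enhanced) / abs(target)
    return enhanced_rel_error / original_rel_error - 1


# Hypothetical values: the relative error falls from 20% to 5%.
change = relative_error_change(original=80e9, enhanced=95e9, target=100e9)
print(f"{change:+.0%}")  # prints -75%
```

The ratio is undefined when the original survey already matches a target exactly, and unstable when it is very close. The chart code restricts the comparison to targets whose original relative error exceeds 3%.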

*Figure 2: Relative errors in tax-benefit-related statistical targets over the budget horizon*

# Convert to [name, value, time_period, source]
# source: (value at epoch=0 -> "Original FRS", value at epoch=last -> "Enhanced FRS", target at epoch=0 -> "Official")
import warnings

warnings.filterwarnings("ignore")

original_frs_values = (
    training_log_cps[training_log_cps.epoch == 0]
    .copy()[["name", "value", "time_period"]]
    .sort_values(["name", "time_period"])
)
original_frs_values["source"] = "Original FRS"

enhanced_frs_values = (
    training_log_cps[training_log_cps.epoch == training_log_cps.epoch.max()]
    .copy()[["name", "value", "time_period"]]
    .sort_values(["name", "time_period"])
)
enhanced_frs_values["source"] = "Enhanced FRS"

official_values = (
    training_log_cps[training_log_cps.epoch == 0]
    .copy()[["name", "target", "time_period"]]
    .sort_values(["name", "time_period"])
)
official_values["value"] = official_values["target"]
del official_values["target"]
official_values["source"] = "Official"

enhanced_frs_values["error"] = (
    official_values["value"].values - enhanced_frs_values["value"].values
)
original_frs_values["error"] = (
    official_values["value"].values - original_frs_values["value"].values
)
enhanced_frs_values["abs_error"] = np.abs(enhanced_frs_values["error"].values)
original_frs_values["abs_error"] = np.abs(original_frs_values["error"].values)
enhanced_frs_values["rel_error"] = (
    enhanced_frs_values["error"].values / official_values["value"].values
)
original_frs_values["rel_error"] = (
    original_frs_values["error"].values / official_values["value"].values
)
enhanced_frs_values["rel_error_abs"] = np.abs(
    enhanced_frs_values["rel_error"].values
)
original_frs_values["rel_error_abs"] = np.abs(
    original_frs_values["rel_error"].values
)

enhanced_frs_values["rel_error_original"] = original_frs_values[
    "rel_error_abs"
].values
enhanced_frs_values["rel_error_abs_change"] = (
    enhanced_frs_values["rel_error_abs"].values
    / original_frs_values["rel_error_abs"].values
    - 1
)
enhanced_frs_values["error_abs_change"] = (
    enhanced_frs_values["abs_error"].values
    / original_frs_values["abs_error"].values
    - 1
)

df = enhanced_frs_values

a = (
    df[df.rel_error_original > 0.03]
    .groupby("time_period")
    .rel_error_abs_change.quantile(
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    )
    .reset_index()
)
a.columns = ["time_period", "decile", "rel_error_abs_change"]
a["decile"] = a.decile.astype(str)
a["text"] = a["rel_error_abs_change"].apply(lambda x: f"{x:+.0%}")
fig = (
    px.bar(
        a,
        x="time_period",
        y="rel_error_abs_change",
        color="decile",
        barmode="group",
        text="text",
    )
    .update_traces(
        # text should always be horizontal
        # textangle=0,
    )
    .update_layout(
        uniformtext_minsize=8,
    )
)

x = np.linspace(0.2, 1, 10)
c = sample_colorscale("Blues", list(x))

fig = format_fig(fig)

for i in range(len(fig.data)):
    fig.data[i].marker.color = c[i]

fig.update_layout(
    legend_title="Quantile",
    xaxis_title="Calendar year",
    yaxis_title="Change in relative error",
    yaxis_tickformat="+.0%",
)

We've also published all of our calibration validation results in an interactive dashboard, available [on GitHub](https://github.com/nikhilwoodruff/policyengine-uk-validation) (screenshot below). We welcome feedback and comments on our approach: feel free to [get in touch](https://policyengine.org/uk/contact).

![Figure 3: PolicyEngine UK's calibration validation dashboard](https://github.com/PolicyEngine/policyengine-app/assets/35577657/c4d0e71e-cc6b-4191-aaaa-5970f4ac3cc9)