# Student loan repayment validation

This notebook compares PolicyEngine UK's calculated student loan repayments against reported repayments from the Family Resources Survey (FRS) microdata. Understanding the alignment between modelled and reported values helps assess model accuracy and identify areas for improvement.

## Background

Student loan repayments in the UK are calculated as a percentage of income above a threshold, varying by loan plan:

- **Plan 1** (pre-2012 England/Wales, Scotland, NI): 9% of income above £24,990 (2024-25)
- **Plan 2** (post-2012 England/Wales): 9% of income above £27,295 (2024-25)
- **Plan 4** (Scotland post-2017): 9% of income above £31,395 (2024-25)
- **Plan 5** (England post-2023): 9% of income above £25,000 (2024-25)
- **Postgraduate**: 6% of income above £21,000 (2024-25)

The FRS captures reported student loan repayments, while PolicyEngine calculates repayments based on income and loan plan type.
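
As a rough illustration of the formula described above, the sketch below applies a rate to the income above a threshold. The helper `annual_repayment` is a hypothetical name introduced here for illustration, not PolicyEngine's implementation; the Plan 2 and Postgraduate parameters are the ones listed above, and the £35,000 and £20,000 incomes are made-up examples.

```python
def annual_repayment(income: float, threshold: float, rate: float) -> float:
    """Repayment on annual income: rate x (income above threshold), never negative."""
    return rate * max(0.0, income - threshold)


# Illustrative incomes only; thresholds and rates as listed above (2024-25).
plan_2 = annual_repayment(35_000, threshold=27_295, rate=0.09)
postgraduate = annual_repayment(35_000, threshold=21_000, rate=0.06)
below_threshold = annual_repayment(20_000, threshold=27_295, rate=0.09)

print(f"Plan 2 on £35,000: £{plan_2:,.2f}")
print(f"Postgraduate on £35,000: £{postgraduate:,.2f}")
print(f"Plan 2 on £20,000: £{below_threshold:,.2f}")
```

Income below the threshold yields a repayment of zero rather than a negative amount, which is why the `max` is needed.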

In [None]:
from policyengine_uk import Microsimulation
import numpy as np
import pandas as pd

sim = Microsimulation()
year = 2025

In [None]:
# Pull person-level arrays for the comparison
reported = sim.calculate("student_loan_repayments", year).values  # as reported in the FRS
modelled = sim.calculate("student_loan_repayment", year).values  # as calculated by the model
plan = sim.calculate("student_loan_plan", year).values
income = sim.calculate("adjusted_net_income", year).values
weight = sim.calculate("person_weight", year).values

## Student loan plan distribution

First, let's examine the distribution of student loan plans in the weighted population:

In [None]:
# Plan distribution (weighted)
plan_names = {0: "None", 1: "Plan 1", 2: "Plan 2", 3: "Postgraduate", 4: "Plan 4", 5: "Plan 5"}
for plan_id, name in plan_names.items():
    count = weight[plan == plan_id].sum() / 1e6
    print(f"{name}: {count:.2f}m people")

## Aggregate comparison

Comparing total reported vs modelled repayments:

In [None]:
# Weighted population totals, converted to £bn
total_reported = (reported * weight).sum() / 1e9
total_modelled = (modelled * weight).sum() / 1e9

print(f"Total reported repayments: £{total_reported:.2f}bn")
print(f"Total modelled repayments: £{total_modelled:.2f}bn")
print(f"Ratio (modelled/reported): {total_modelled/total_reported:.2f}")
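
A single aggregate ratio can hide offsetting errors across loan plans, so it may help to split the same weighted totals by plan. The sketch below is one possible way to do that; `totals_by_plan` is a hypothetical helper, and the small arrays are made-up stand-ins for `reported`, `modelled`, `plan` and `weight`, not FRS data.

```python
import numpy as np


def totals_by_plan(reported, modelled, plan, weight):
    """Weighted reported and modelled totals for each distinct plan value."""
    reported, modelled, plan, weight = map(
        np.asarray, (reported, modelled, plan, weight)
    )
    results = {}
    for plan_id in np.unique(plan):
        mask = plan == plan_id
        rep = float((reported[mask] * weight[mask]).sum())
        mod = float((modelled[mask] * weight[mask]).sum())
        # Avoid dividing by zero for groups with no reported repayments.
        ratio = mod / rep if rep > 0 else float("nan")
        results[plan_id.item()] = {"reported": rep, "modelled": mod, "ratio": ratio}
    return results


# Made-up example data (five people, three plan groups).
by_plan = totals_by_plan(
    reported=[0, 500, 1000, 0, 300],
    modelled=[0, 600, 900, 100, 300],
    plan=[0, 1, 2, 2, 1],
    weight=[1000, 2000, 1500, 500, 1000],
)
for plan_id, row in by_plan.items():
    print(plan_id, row)
```

With the notebook's own arrays, the call would be `totals_by_plan(reported, modelled, plan, weight)`, assuming `plan` holds one value per person in whatever encoding the model returns.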

## Individual-level alignment

For people who report making student loan repayments, how well do our calculations align?

In [None]:
# Filter to people with reported repayments > 0
has_reported = reported > 0

# Note: the statistics in this cell are unweighted sample statistics,
# unlike the weighted aggregate totals above.
if has_reported.any():
    # Correlation
    correlation = np.corrcoef(reported[has_reported], modelled[has_reported])[0, 1]
    print(f"Correlation (people with reported > 0): {correlation:.3f}")

    # Match rate
    both_positive = (reported > 0) & (modelled > 0)
    match_rate = both_positive.sum() / has_reported.sum() * 100
    print(f"People with both reported & modelled > 0: {match_rate:.1f}% of reporters")

    # Mean values
    print(f"\nMean reported (reporters): £{reported[has_reported].mean():,.0f}")
    print(f"Mean modelled (reporters): £{modelled[has_reported].mean():,.0f}")
    print(f"Mean income (reporters): £{income[has_reported].mean():,.0f}")
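
The correlation and means above treat every survey respondent equally, whereas the aggregate totals use survey weights. A weighted Pearson correlation is one way to put the individual-level comparison on the same footing. The sketch below is a minimal version under that assumption; `weighted_corr` is a hypothetical helper and the demo arrays are made up.

```python
import numpy as np


def weighted_corr(x, y, w):
    """Pearson correlation of x and y, with observations weighted by w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    mean_x = np.average(x, weights=w)
    mean_y = np.average(y, weights=w)
    cov_xy = np.average((x - mean_x) * (y - mean_y), weights=w)
    var_x = np.average((x - mean_x) ** 2, weights=w)
    var_y = np.average((y - mean_y) ** 2, weights=w)
    return cov_xy / np.sqrt(var_x * var_y)


# Made-up example: three people with different survey weights.
demo_reported = [1.0, 2.0, 3.0]
demo_modelled = [2.0, 1.0, 4.0]
demo_weight = [1.0, 2.0, 1.0]
print(f"Weighted correlation: {weighted_corr(demo_reported, demo_modelled, demo_weight):.3f}")
```

In the notebook this could be called as `weighted_corr(reported[has_reported], modelled[has_reported], weight[has_reported])`. With equal weights it reduces to the ordinary correlation returned by `np.corrcoef`.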

## Analysis of discrepancies

Several factors may explain the relatively low individual-level correlation:

1. **Timing differences**: Reported repayments reflect actual payments made during the tax year, which may include voluntary overpayments or vary based on pay frequency and employment changes.

2. **Employment variation**: Someone may have had periods below or above the repayment threshold during the year, while our model assumes constant annual income.

3. **Multiple loan plans**: Some individuals may have both Plan 1 and Plan 2 loans, complicating the calculation.

4. **Study status**: Current students may have different repayment patterns not fully captured in the model.

5. **Plan misclassification**: The loan plan imputation in the microdata may not perfectly match individuals' actual loan types.

Despite individual-level variation, the aggregate totals are reasonably aligned, suggesting the model captures the overall scale of student loan repayments in the UK economy.
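
To make the employment-variation point concrete, the sketch below contrasts a calculation on total annual income with one applied separately to each month of pay. It assumes, as a simplification, that each monthly deduction is assessed against one twelfth of the annual threshold and ignores payroll details such as rounding. The function names and the income pattern are hypothetical, and the Plan 2 parameters are the ones listed in the Background section.

```python
def annual_basis(monthly_incomes, threshold, rate=0.09):
    """Repayment when the threshold is applied once to total annual income."""
    return rate * max(0.0, sum(monthly_incomes) - threshold)


def per_period_basis(monthly_incomes, threshold, rate=0.09):
    """Repayment when each month is assessed against 1/12 of the threshold."""
    monthly_threshold = threshold / 12
    return sum(rate * max(0.0, m - monthly_threshold) for m in monthly_incomes)


# Hypothetical person: £4,000 a month for six months, then no earnings.
incomes = [4_000] * 6 + [0] * 6
plan_2_threshold = 27_295

on_annual = annual_basis(incomes, plan_2_threshold)
on_monthly = per_period_basis(incomes, plan_2_threshold)

print(f"Annual-income basis: £{on_annual:,.2f}")
print(f"Per-period basis: £{on_monthly:,.2f}")
```

In this example the annual total of £24,000 sits below the threshold, so an annual-income calculation gives zero, while the month-by-month calculation gives a positive amount. A person like this would appear as a reporter with no modelled repayment.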

## Conclusion

PolicyEngine UK's student loan repayment model produces aggregate totals within a reasonable range of reported values. The individual-level correlation is lower than for income tax calculations, reflecting the complexity of student loan timing and the limitations of annual income-based calculations. For microsimulation purposes, the model provides a reasonable approximation of student loan repayment flows, while users should be aware of these limitations when analysing individual-level impacts.