# Student loan repayment validation

This notebook compares PolicyEngine UK's calculated student loan repayments against reported repayments from the Family Resources Survey (FRS) microdata. Understanding the alignment between modelled and reported values helps assess model accuracy and identify areas for improvement.

## Background

Student loan repayments in the UK are calculated as a percentage of income above a threshold, varying by loan plan:

- **Plan 1** (pre-2012 England/Wales, Scotland, NI): 9% of income above threshold
- **Plan 2** (post-2012 England/Wales): 9% of income above threshold
- **Plan 4** (Scotland post-2017): 9% of income above threshold
- **Plan 5** (England post-2023): 9% of income above threshold
- **Postgraduate**: 6% of income above threshold

The FRS captures reported student loan repayments (`student_loan_repayments`), while PolicyEngine calculates repayments based on income and loan plan type (`student_loan_repayment`).

**Note:** The FRS `student_loans` variable (from `tuborr`) represents the amount borrowed THIS YEAR by current students, not total outstanding balance. Outstanding balance data would need to be imputed from the Wealth and Assets Survey (WAS).

In [None]:
from policyengine_uk import Microsimulation
import numpy as np
import pandas as pd

sim = Microsimulation()
year = 2025

In [None]:
# Get student loan data
reported = sim.calculate("student_loan_repayments", year).values
modelled = sim.calculate("student_loan_repayment", year).values
plan = sim.calculate("student_loan_plan", year).values
income = sim.calculate("adjusted_net_income", year).values
weight = sim.calculate("person_weight", year).values

## Student loan plan distribution

First, let's examine the distribution of student loan plans in the weighted population:

In [None]:
# Plan distribution (weighted) - plan values are strings
from policyengine_uk.variables.gov.hmrc.student_loans.student_loan_plan import StudentLoanPlan

for p in StudentLoanPlan:
    mask = plan == p.value
    count = weight[mask].sum() / 1e6
    print(f"{p.name}: {count:.2f}m people")

## Aggregate comparison

Comparing total reported vs modelled repayments:

In [None]:
total_reported = (reported * weight).sum() / 1e9
total_modelled = (modelled * weight).sum() / 1e9

print(f"Total reported repayments: \u00a3{total_reported:.2f}bn")
print(f"Total modelled repayments: \u00a3{total_modelled:.2f}bn")
print(f"Ratio (modelled/reported): {total_modelled/total_reported:.2f}")

## Individual-level alignment

For people who report making student loan repayments, how well do our calculations align?

In [None]:
# Filter to people with reported repayments > 0
has_reported = reported > 0

if has_reported.sum() > 0:
    # Correlation
    correlation = np.corrcoef(reported[has_reported], modelled[has_reported])[0, 1]
    print(f"Correlation (people with reported > 0): {correlation:.3f}")
    
    # Match rate
    both_positive = (reported > 0) & (modelled > 0)
    match_rate = both_positive.sum() / has_reported.sum() * 100
    print(f"People with both reported & modelled > 0: {match_rate:.1f}% of reporters")
    
    # Mean values
    print(f"\nMean reported (reporters): \u00a3{reported[has_reported].mean():,.0f}")
    print(f"Mean modelled (reporters): \u00a3{modelled[has_reported].mean():,.0f}")
    print(f"Mean income (reporters): \u00a3{income[has_reported].mean():,.0f}")

## Correlation among those required to pay

The overall correlation is low because many reporters have incomes below the repayment threshold. Let's look at correlation only among those required to make payments:

In [None]:
# Get thresholds
params = sim.tax_benefit_system.parameters
thresholds = params.gov.hmrc.student_loans.thresholds
plan_1_threshold = thresholds.plan_1(f'{year}-01-01')
plan_2_threshold = thresholds.plan_2(f'{year}-01-01')

# People above threshold for their plan
above_threshold = (
    ((plan == 'PLAN_1') & (income > plan_1_threshold)) |
    ((plan == 'PLAN_2') & (income > plan_2_threshold))
)

print(f'People above repayment threshold: {above_threshold.sum():,}')
print(f'Weighted: {weight[above_threshold].sum()/1e6:.2f}m')

if above_threshold.sum() > 0:
    # Unweighted correlation
    corr_unweighted = np.corrcoef(reported[above_threshold], modelled[above_threshold])[0, 1]
    
    # Weighted correlation
    w = weight[above_threshold]
    r = reported[above_threshold]
    m = modelled[above_threshold]
    
    mean_r = np.average(r, weights=w)
    mean_m = np.average(m, weights=w)
    
    cov = np.average((r - mean_r) * (m - mean_m), weights=w)
    std_r = np.sqrt(np.average((r - mean_r)**2, weights=w))
    std_m = np.sqrt(np.average((m - mean_m)**2, weights=w))
    corr_weighted = cov / (std_r * std_m)
    
    print(f'\nCorrelation (unweighted): {corr_unweighted:.3f}')
    print(f'Correlation (weighted): {corr_weighted:.3f}')
    print(f'\nMean reported: \u00a3{np.average(r, weights=w):,.0f}')
    print(f'Mean modelled: \u00a3{np.average(m, weights=w):,.0f}')

## Deep dive: Why do some reporters have zero modelled repayments?

A significant fraction of people who report making repayments have zero modelled repayments. Let's investigate why.

In [None]:
# People who report repayments but we model zero
has_reported = reported > 0
modelled_zero = modelled == 0
problem = has_reported & modelled_zero

print(f"People reporting repayments: {has_reported.sum():,}")
print(f"Of those, modelled = 0: {problem.sum():,} ({problem.sum()/has_reported.sum()*100:.1f}%)")
print()

# Why modelled = 0? Check plan distribution
print("Plan distribution for problem cases:")
for p in ['NONE', 'PLAN_1', 'PLAN_2', 'PLAN_4', 'PLAN_5']:
    count = (problem & (plan == p)).sum()
    print(f"  {p}: {count:,}")

In [None]:
# Check income levels for problem cases
print("Income stats for problem cases (reported > 0, modelled = 0):")
print(f"  Mean income: \u00a3{income[problem].mean():,.0f}")
print(f"  Median income: \u00a3{np.median(income[problem]):,.0f}")
print(f"  Min/Max: \u00a3{income[problem].min():,.0f} / \u00a3{income[problem].max():,.0f}")
print()

print(f"Repayment thresholds for {year}:")
print(f"  Plan 1: \u00a3{plan_1_threshold:,.0f}")
print(f"  Plan 2: \u00a3{plan_2_threshold:,.0f}")
print()

# How many are below threshold?
problem_plan1 = problem & (plan == 'PLAN_1')
problem_plan2 = problem & (plan == 'PLAN_2')

print(f"Plan 1 problem cases below threshold: {(income[problem_plan1] < plan_1_threshold).sum():,} of {problem_plan1.sum():,}")
print(f"Plan 2 problem cases below threshold: {(income[problem_plan2] < plan_2_threshold).sum():,} of {problem_plan2.sum():,}")

## Analysis of discrepancies

The analysis reveals that **all** problem cases (people reporting repayments but with zero modelled) have incomes below the repayment threshold. The model is correctly applying the threshold logic - these people should not owe mandatory repayments based on their annual income.

So why are they reporting repayments? Several factors explain this:

1. **Voluntary repayments**: People can choose to pay more than the minimum required, or make direct payments to reduce their loan balance faster.

2. **Intra-year income variation**: The FRS captures annual income, but someone may have had higher-paying employment for part of the year (triggering PAYE deductions) before their income dropped.

3. **Self-employed direct repayments**: Self-employed individuals make direct repayments based on prior year income, which may differ from current year income.

4. **Employment timing**: Someone starting or ending employment mid-year may have PAYE deductions based on annualised pay that differs from their actual annual income.

These factors represent a fundamental limitation of annual microsimulation models - we cannot capture voluntary overpayments or intra-year income dynamics.

Among those **required to pay** (income above threshold), the weighted correlation is much higher (~0.68), indicating the model works well for mandatory repayments.

## Known limitations

### No cap at outstanding balance

The model currently calculates repayments as 9% of income above threshold with no cap. For high earners, this can produce unrealistically high repayments that exceed their actual loan balance.

Example from the data:
- Person with \u00a3420k income
- Modelled repayment: \u00a335,470
- Reported repayment: \u00a31,903
- Explanation: They likely paid off their loan during the year

### FRS `student_loans` is not outstanding balance

The FRS variable `student_loans` (from `tuborr`) represents the amount borrowed THIS YEAR by current students, not total outstanding balance. This is why 98.7% of people reporting repayments have `student_loans = 0`.

### Potential improvement: WAS imputation

The Wealth and Assets Survey (WAS) contains student loan balance data:
- `Tot_LosR7_aggr` - total loans
- `Tot_los_exc_SLCR7_aggr` - total loans excluding SLC
- Difference = SLC debt

This could be imputed to the FRS (similar to how other wealth variables are imputed) to enable capping repayments at outstanding balance.

## Calibration status

Student loan repayments are **not currently calibrated** to external aggregate targets in policyengine-uk-data. The reported values come directly from the FRS without reweighting to match official statistics.

Potential calibration targets from SLC 2024-25 statistics:
- England total repayments: \u00a35.0bn
- Scotland: \u00a3203m
- Wales: \u00a3229m  
- Northern Ireland: \u00a3182m
- **UK Total: ~\u00a35.6bn**

See [GitHub issue](https://github.com/PolicyEngine/policyengine-uk-data/issues/237) for tracking.

## Conclusion

PolicyEngine UK's student loan repayment model produces aggregate totals within ~5% of reported FRS values. Key findings:

| Metric | Value |
|--------|-------|
| Aggregate ratio (modelled/reported) | ~0.95 |
| Weighted correlation (above threshold) | ~0.68 |
| Weighted correlation (all reporters) | ~0.16 |

The lower overall correlation reflects:
1. Many FRS respondents report repayments despite having incomes below the threshold
2. This includes voluntary repayments, intra-year income changes, and employment timing effects
3. These factors cannot be captured in an annual microsimulation model

The model correctly applies the statutory repayment formula (9% of income above threshold), but users should be aware that:
- Individual-level repayment predictions have significant uncertainty
- Aggregate totals are more reliable than individual predictions
- The model does not capture voluntary overpayments
- High earners may have overstated repayments (no cap at balance)