# Diagnosing UK Country Filtering Issue in policyengine.py

This notebook tests whether `policyengine.py` properly filters simulations by UK country (e.g., Wales).

## The Issue
When running a simulation filtered to a specific UK country (e.g., `country/wales`), we get:
```
ValueError: Unable to set value "[ True  True  True ... False False False]" for variable 
"would_evade_tv_licence_fee", as its length is 8470 while there are 4108 households in the simulation.
```

## Hypothesis
The `to_input_dataframe()` method doesn't export `person_household_id`, so a simulation rebuilt from the
filtered DataFrame loses the person-to-household linkage and ends up with the wrong household count.
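
To make the hypothesis concrete, here is a minimal toy sketch in plain Python. It is not policyengine code: the helper `count_households`, the sample rows, and the fallback rule (one household per person when the linkage column is absent) are all assumptions made for illustration.

```python
# Toy illustration only: the names and the fallback rule are assumptions,
# not the actual policyengine / policyengine_core implementation.

def count_households(person_rows, link_column="person_household_id"):
    """Count the households implied by a list of person-level row dicts."""
    if person_rows and all(link_column in row for row in person_rows):
        # Linkage present: households are the distinct linked IDs.
        return len({row[link_column] for row in person_rows})
    # Linkage missing: assumed fallback of one household per person.
    return len(person_rows)


people = [
    {"person_id": 1, "person_household_id": 10, "country": "WALES"},
    {"person_id": 2, "person_household_id": 10, "country": "WALES"},
    {"person_id": 3, "person_household_id": 11, "country": "WALES"},
    {"person_id": 4, "person_household_id": 12, "country": "ENGLAND"},
]

# Person-level filter, as in the country filtering path.
welsh = [row for row in people if row["country"] == "WALES"]
with_link = count_households(welsh)

# Simulate an export that drops the linkage column.
welsh_no_link = [
    {key: value for key, value in row.items() if key != "person_household_id"}
    for row in welsh
]
without_link = count_households(welsh_no_link)

print(f"Households with linkage: {with_link}")
print(f"Households without linkage: {without_link}")
```

Under these assumptions the two counts differ, which is the kind of mismatch that would make a household-level array have the wrong length.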

## Step 1: Setup and Imports

In [None]:
import numpy as np
import pandas as pd
from policyengine import Simulation

# Check policyengine version
import policyengine
print(f"policyengine version: {policyengine.__version__}")

## Step 2: Create a Baseline UK Simulation

First, let's create a standard UK-wide simulation and examine its structure.

In [None]:
# Create a UK-wide simulation (no region filter)
print("Creating UK-wide simulation...")
sim_uk = Simulation(country="uk", scope="macro")

# Access the underlying country simulation
underlying_sim = sim_uk.baseline_simulation

print("\n=== UK-Wide Simulation Structure ===")
print(f"Person count: {underlying_sim.persons.count}")
print(f"Household count: {underlying_sim.household.count}")
print(f"BenUnit count: {underlying_sim.benunit.count}")

In [None]:
# Check the country distribution in the UK simulation
# (calculated on the underlying simulation, as in the person-level check)
country_values = underlying_sim.calculate("country")
print("\n=== Country Distribution (Household Level) ===")
print(country_values.value_counts())

In [None]:
# Check person-level country distribution
country_person = underlying_sim.calculate("country", map_to="person")
unique, counts = np.unique(country_person, return_counts=True)
print("\n=== Country Distribution (Person Level) ===")
for u, c in zip(unique, counts):
    print(f"  {u}: {c} persons")

## Step 3: Test `to_input_dataframe()` Export

Let's examine which columns `to_input_dataframe()` exports and check whether the entity linkage variables are among them.

In [None]:
# Export the simulation to a dataframe
print("Exporting simulation to DataFrame...")
df = underlying_sim.to_input_dataframe()

print("\n=== Exported DataFrame ===")
print(f"Shape: {df.shape}")
print(f"Number of columns: {len(df.columns)}")

In [None]:
# Check for entity ID and linkage columns
print("\n=== Entity-Related Columns ===")

id_columns = [c for c in df.columns if '_id' in c.lower()]
print(f"\nColumns containing '_id': {len(id_columns)}")
for col in sorted(id_columns):
    print(f"  - {col}")

# Specifically check for critical columns
critical_cols = ['person_id', 'household_id', 'person_household_id', 'benunit_id', 'person_benunit_id']
print("\n=== Critical Entity Linkage Columns ===")
for col_base in critical_cols:
    matching = [c for c in df.columns if c.startswith(col_base)]
    if matching:
        print(f"  {col_base}: FOUND -> {matching}")
    else:
        print(f"  {col_base}: MISSING!")

In [None]:
# Check which entity ID and linkage variables have known periods in the simulation
print("\n=== Checking Known Periods for Entity Linkage Variables ===")

for var_name in ['person_id', 'household_id', 'person_household_id', 'person_benunit_id']:
    try:
        holder = underlying_sim.get_holder(var_name)
        known_periods = holder.get_known_periods()
        print(f"  {var_name}: known_periods = {list(known_periods)}")
    except Exception as e:
        print(f"  {var_name}: ERROR - {e}")
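
The column check above can also be packaged as a small reusable helper. The sketch below is hypothetical: `missing_linkage_columns` is not part of any library, and it assumes exported columns are named either exactly after the variable or as `<variable>__<period>`, which should be verified against the real DataFrame.

```python
# Hypothetical helper; the "<variable>__<period>" column naming is an assumption.

def missing_linkage_columns(columns, required):
    """Return the required variable names that have no matching exported column."""
    missing = []
    for name in required:
        found = any(
            col == name or col.startswith(name + "__") for col in columns
        )
        if not found:
            missing.append(name)
    return missing


# Made-up column names for illustration.
example_columns = ["person_id__2025", "household_id__2025", "age__2025"]
required = ["person_id", "household_id", "person_household_id"]

missing = missing_linkage_columns(example_columns, required)
print(f"Missing linkage columns: {missing}")
```

With the notebook's own objects this would be called as `missing_linkage_columns(df.columns, critical_cols)`. Matching on the full variable name keeps `household_id` from being counted as present just because `person_household_id` is.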

## Step 4: Simulate Country Filtering (Wales)

Now let's create a Wales-filtered simulation and see what happens.

In [None]:
# Create a Wales simulation
print("Creating Wales simulation...")
print("(This may trigger the error we're diagnosing)")
print()

try:
    sim_wales = Simulation(country="uk", scope="macro", region="country/wales")
    wales_underlying = sim_wales.baseline_simulation
    
    print("\n=== Wales Simulation Structure ===")
    print(f"Person count: {wales_underlying.persons.count}")
    print(f"Household count: {wales_underlying.household.count}")
    print(f"BenUnit count: {wales_underlying.benunit.count}")
    
    # Check if counts make sense
    if wales_underlying.household.count == wales_underlying.persons.count:
        print("\n*** WARNING: Household count equals person count! ***")
        print("This suggests entity linkage was lost during filtering.")
        
except Exception as e:
    print("\n*** ERROR creating Wales simulation ***")
    print(f"Error type: {type(e).__name__}")
    print(f"Error message: {e}")
    import traceback
    traceback.print_exc()

## Step 5: Manual Reproduction of the Filtering Process

Let's manually reproduce what `_apply_region_to_simulation` does to understand where it breaks.

In [None]:
# Step-by-step reproduction of the filtering logic
print("=== Manual Reproduction of Country Filtering ===")

# Step 1: Export to DataFrame
print("\n[Step 1] Exporting to DataFrame...")
df = underlying_sim.to_input_dataframe()
print(f"  DataFrame shape: {df.shape}")
print(f"  Columns with 'household': {[c for c in df.columns if 'household' in c.lower()][:10]}...")

In [None]:
# Step 2: Calculate country at person level
print("\n[Step 2] Calculating country at person level...")
country_person_level = underlying_sim.calculate("country", map_to="person").values
print(f"  Country array shape: {country_person_level.shape}")
print(f"  Unique values: {np.unique(country_person_level)}")

# Count Welsh persons
wales_mask = country_person_level == "WALES"
print(f"  Welsh persons: {wales_mask.sum()}")
print(f"  Non-Welsh persons: {(~wales_mask).sum()}")

In [None]:
# Step 3: Filter DataFrame to Wales
print("\n[Step 3] Filtering DataFrame to Wales...")
df_wales = df[wales_mask]
print(f"  Filtered DataFrame shape: {df_wales.shape}")

# Check what person_household_id looks like in filtered data
phh_cols = [c for c in df_wales.columns if 'person_household_id' in c]
if phh_cols:
    print(f"  person_household_id columns: {phh_cols}")
    for col in phh_cols:
        vals = df_wales[col].values
        print(f"    {col}: {len(np.unique(vals))} unique values")
else:
    print("  person_household_id: NOT IN DATAFRAME!")
    print("  This is likely the root cause of the issue.")

In [None]:
# Step 4: Try to create a new simulation from filtered DataFrame
print("\n[Step 4] Creating new simulation from filtered DataFrame...")

from policyengine_uk import Microsimulation

try:
    new_sim = Microsimulation(dataset=df_wales)
    
    print("  New simulation created!")
    print(f"  Person count: {new_sim.persons.count}")
    print(f"  Household count: {new_sim.household.count}")
    
    # Critical check
    if new_sim.household.count == new_sim.persons.count:
        print("\n  *** CONFIRMED: Household count equals person count! ***")
        print("  The entity linkage was lost because person_household_id is missing.")
    elif new_sim.household.count == len(np.unique(df_wales.iloc[:, 0])):
        print("\n  *** Household count matches first column's unique values ***")
        print("  This confirms the fallback behavior in build_from_dataset()")
        
except Exception as e:
    print(f"  Error creating simulation: {e}")
    import traceback
    traceback.print_exc()

In [None]:
# Step 5: Try to calculate would_evade_tv_licence_fee (this should trigger the error)
print("\n[Step 5] Attempting to calculate would_evade_tv_licence_fee...")

try:
    # This calculation uses random(household), which will fail if household count is wrong
    result = new_sim.calculate("would_evade_tv_licence_fee")
    print("  Calculation succeeded!")
    print(f"  Result shape: {result.shape}")
    print(f"  Result dtype: {result.dtype}")
except ValueError as e:
    print("  *** ValueError (expected): ***")
    print(f"  {e}")
    
    # Check whether the message reports an array length mismatch
    error_str = str(e)
    if "length is" in error_str and "while there are" in error_str:
        print("\n  This confirms the array size mismatch issue.")
except Exception as e:
    print(f"  Unexpected error: {type(e).__name__}: {e}")
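
If the two numbers in the message are needed for further checks, they can be extracted rather than just detected. The following is a sketch: `parse_length_mismatch` is a hypothetical helper, and its pattern is written against the exact wording of the error quoted in "The Issue", so other library versions may need a different pattern.

```python
import re

# Pattern based on the wording of the error message quoted in "The Issue";
# this wording is an assumption and may differ between library versions.
MISMATCH_PATTERN = re.compile(
    r'for variable\s+"(?P<variable>[^"]+)",\s+as its length is (?P<length>\d+) '
    r"while there are (?P<count>\d+) (?P<entity>\w+) in the simulation"
)


def parse_length_mismatch(message):
    """Return the variable, array length, entity and entity count, or None."""
    match = MISMATCH_PATTERN.search(message)
    if match is None:
        return None
    return {
        "variable": match["variable"],
        "array_length": int(match["length"]),
        "entity": match["entity"],
        "entity_count": int(match["count"]),
    }


message = (
    'Unable to set value "[ True  True  True ... False False False]" for variable '
    '"would_evade_tv_licence_fee", as its length is 8470 while there are 4108 '
    "households in the simulation."
)
parsed = parse_length_mismatch(message)
print(parsed)
```

Inside the `except ValueError as e` branch above, this could be called as `parse_length_mismatch(str(e))` and the two counts compared with the person and household counts printed earlier.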

## Step 6: Deeper Investigation - What Does household_id Return?

Let's check what `household_id` returns in the broken simulation.

In [None]:
# Check household_id in the new (potentially broken) simulation
print("=== Investigating household_id in Filtered Simulation ===")

try:
    # This is what random() calls internally
    hh_ids = new_sim.calculate("household_id", 2025)
    print(f"household_id result length: {len(hh_ids)}")
    print(f"household_id unique count: {len(np.unique(hh_ids))}")
    print(f"Expected household count: {new_sim.household.count}")
    
    if len(hh_ids) != new_sim.household.count:
        print(f"\n*** MISMATCH: household_id has {len(hh_ids)} values but simulation has {new_sim.household.count} households ***")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Check the holder for household_id
print("\n=== Checking household_id Holder ===")
try:
    holder = new_sim.get_holder("household_id")
    known_periods = holder.get_known_periods()
    print(f"Known periods: {list(known_periods)}")
    
    for period in known_periods:
        arr = holder.get_array(period)
        print(f"  Period {period}: array shape = {arr.shape if arr is not None else 'None'}")
except Exception as e:
    print(f"Error: {e}")

## Step 7: Compare with Working Approaches (Constituency/LA)

Constituency and local authority (LA) filtering adjust household weights on the existing simulation instead of subsetting the DataFrame, so the entity structure is preserved. Let's verify that this works.

In [None]:
# Test constituency filtering (should work)
print("=== Testing Constituency Filtering (Should Work) ===")

try:
    sim_constituency = Simulation(country="uk", scope="macro", region="constituency/Cardiff South and Penarth")
    const_underlying = sim_constituency.baseline_simulation
    
    print("Constituency simulation created successfully!")
    print(f"  Person count: {const_underlying.persons.count}")
    print(f"  Household count: {const_underlying.household.count}")
    
    # Try the problematic calculation
    result = const_underlying.calculate("would_evade_tv_licence_fee")
    print("  would_evade_tv_licence_fee calculated successfully!")
    print(f"  Result length: {len(result)}")
    
except Exception as e:
    print(f"Error: {type(e).__name__}: {e}")

In [None]:
# Test local authority filtering (should work)
print("\n=== Testing Local Authority Filtering (Should Work) ===")

try:
    sim_la = Simulation(country="uk", scope="macro", region="local_authority/Cardiff")
    la_underlying = sim_la.baseline_simulation
    
    print("LA simulation created successfully!")
    print(f"  Person count: {la_underlying.persons.count}")
    print(f"  Household count: {la_underlying.household.count}")
    
    # Try the problematic calculation
    result = la_underlying.calculate("would_evade_tv_licence_fee")
    print("  would_evade_tv_licence_fee calculated successfully!")
    print(f"  Result length: {len(result)}")
    
except Exception as e:
    print(f"Error: {type(e).__name__}: {e}")
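
The difference between weight adjustment and DataFrame subsetting can be shown with a toy example. This is a sketch in plain Python with made-up numbers; `zero_weights_outside` and `weighted_total` are hypothetical helpers and do not reflect how policyengine stores or adjusts weights internally.

```python
# Toy data only; not how policyengine stores or adjusts weights internally.
household_country = ["WALES", "ENGLAND", "WALES", "SCOTLAND"]
household_weight = [100.0, 250.0, 150.0, 80.0]
household_income = [20_000.0, 30_000.0, 25_000.0, 28_000.0]


def zero_weights_outside(countries, weights, target):
    """Keep every household but give zero weight to those outside `target`."""
    return [w if c == target else 0.0 for c, w in zip(countries, weights)]


def weighted_total(values, weights):
    """Weighted sum of per-household values."""
    return sum(v * w for v, w in zip(values, weights))


# Weight adjustment: the array length is unchanged, so every
# per-household array still lines up with the simulation's households.
wales_weights = zero_weights_outside(household_country, household_weight, "WALES")
same_length = len(wales_weights) == len(household_income)
zeroed_total = weighted_total(household_income, wales_weights)

# Subsetting: the same aggregate, but the arrays are now shorter, so any
# array still sized for the full dataset no longer fits.
subset = [
    (v, w)
    for c, v, w in zip(household_country, household_income, household_weight)
    if c == "WALES"
]
subset_total = sum(v * w for v, w in subset)

print(f"Lengths preserved with zeroed weights: {same_length}")
print(f"Weighted total (zeroed weights): {zeroed_total}")
print(f"Weighted total (subset): {subset_total}, rows kept: {len(subset)}")
```

Both approaches give the same weighted aggregate here. Only subsetting changes the number of rows, and that is where a length mismatch can arise if the entity structure is not rebuilt consistently.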

## Summary and Conclusions

In [None]:
print("="*70)
print("DIAGNOSIS SUMMARY")
print("="*70)

print("""
Based on the tests above:

1. COUNTRY FILTERING (country/wales):
   - Uses to_input_dataframe() + DataFrame subsetting + new Microsimulation()
   - FAILS because person_household_id is not exported
   - Results in household count = person count (entity linkage lost)

2. CONSTITUENCY FILTERING (constituency/...):
   - Uses weight adjustment on existing simulation
   - WORKS because entity structure is preserved

3. LOCAL AUTHORITY FILTERING (local_authority/...):
   - Uses weight adjustment on existing simulation  
   - WORKS because entity structure is preserved

ROOT CAUSE:
- to_input_dataframe() only exports variables with known periods
- person_household_id doesn't have known periods (it's derived from dataset structure)
- When building from filtered DataFrame, the fallback creates 1 household per person

RECOMMENDED FIX:
- Option A: Fix to_input_dataframe() to always export entity linkage variables
- Option B: Use weight-zeroing for country filtering (like constituency/LA)
""")