# 06_LSOA_Demand_Forecasting_Models.ipynb

### **Objective**
To develop, test, and compare several modeling approaches for forecasting diagnostic demand (CT, MRI, Endoscopy) at the Lower Layer Super Output Area (LSOA) level. The final output will be a recommended model or equation that can be applied to LSOA-level population data to generate granular, localized demand predictions for strategic planning and resource allocation.

### **Models Implemented:**
1.  **Approach 1: Standardized Rate Model (Baseline)**
2.  **Approach 3: Parametric Distribution Modeling**
3.  **Approach 4: Generalized Linear Model (GLM)**

## 1. Setup and Data Loading

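This section imports the necessary libraries and loads the input datasets:
1.  **Activity Data:** The aggregated demand table exported from notebook `05` (procedure counts by age, modality, and referral type).
2.  **Population Data:** ONS population estimates at the LSOA level, in five-year age bands (`age_0_4` to `age_85_plus`).
3.  **Deprivation Data:** Index of Multiple Deprivation (IMD) scores for each LSOA. These are not loaded in the cells below; the IMD merge in Section 3.3 is left commented out.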

In [2]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from pathlib import Path
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error, r2_score

import warnings
warnings.filterwarnings('ignore')
sns.set(style="whitegrid")

In [9]:
# --------------------------------------------------------------
# 1.1 Load Aggregated Demand Table (from Notebook 05)
# --------------------------------------------------------------
from pathlib import Path
import pandas as pd

# Path to exported demand table from Notebook 05
demand_path = Path(
    "/Users/rosstaylor/Downloads/Research Project/Code Folder/nhs-diagnostics-dids-eda/nhs-dids-explorer/data/processed/demand_distributions/modality_demand_by_age_and_source.csv"
)

# Expected structure: raw counts by age, modality, referral type
required_columns = {"age", "modality", "referral_type", "procedure_count"}

try:
    demand_df = pd.read_csv(demand_path)

    # Check for required columns
    missing = required_columns - set(demand_df.columns)
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")

    print(f"Loaded aggregated demand table: {len(demand_df):,} rows.")
    display(demand_df.head())

except (FileNotFoundError, KeyError) as e:
    print(f"Error loading demand table: {e}")
    demand_df = pd.DataFrame()  # Safe fallback to avoid pipeline crash


Loaded aggregated demand table: 1,293 rows.


,age,modality,referral_type,procedure_count
0,0.0,CT,Emergency,212
1,0.0,CT,GP,2
2,0.0,CT,Inpatient,443
3,0.0,CT,Other/Unknown,25
4,0.0,CT,Outpatient,103


In [10]:
# --------------------------------------------------------------
# 1.2 Load LSOA-Level Population Data (2024) – All ICBs (Wide Format)
# --------------------------------------------------------------

# Set full path to the raw 2024 population data
pop_path = Path(
    "/Users/rosstaylor/Downloads/Research Project/Code Folder/nhs-diagnostics-dids-eda/nhs-dids-explorer/data/raw/all_icbs_2024.csv"
)

# Define expected base columns and age segment columns
expected_base_cols = {"lsoa21cd", "ICB23NM", "total_population"}
expected_age_prefix = "age_"

# Load and validate
if pop_path.exists():
    try:
        raw_df = pd.read_csv(pop_path)

        print(f"Loaded population data: {raw_df.shape[0]:,} rows × {raw_df.shape[1]} columns.")
        print("Columns:", list(raw_df.columns))

        # Check base columns
        missing_base = expected_base_cols - set(raw_df.columns)
        if missing_base:
            print("Warning: Missing expected base columns:", missing_base)

        # Identify age segment columns
        age_cols = [col for col in raw_df.columns if col.startswith(expected_age_prefix)]
        if not age_cols:
            print("Error: No age segment columns found with prefix 'age_'.")
        else:
            print(f"Found {len(age_cols)} age segment columns.")

    except Exception as e:
        print("Error reading population file:", e)
        raw_df = pd.DataFrame()
else:
    print("Population file not found at path:", pop_path)
    raw_df = pd.DataFrame()


Loaded population data: 3,475 rows × 22 columns.
Columns: ['lsoa21cd', 'lsoa21nm', 'ICB23NM', 'total_population', 'age_0_4', 'age_5_9', 'age_10_14', 'age_15_19', 'age_20_24', 'age_25_29', 'age_30_34', 'age_35_39', 'age_40_44', 'age_45_49', 'age_50_54', 'age_55_59', 'age_60_64', 'age_65_69', 'age_70_74', 'age_75_79', 'age_80_84', 'age_85_plus']
Found 18 age segment columns.


In [11]:
# --------------------------------------------------------------
# 1.2 (continued) Reshape LSOA Population: Wide → Long Format
# --------------------------------------------------------------

# The wide table was loaded as `raw_df` in the previous cell; alias it under
# the name used in this cell.
lsoa_pop_df = raw_df

# Identify all age band columns
age_columns = [col for col in lsoa_pop_df.columns if col.startswith("age_")]

# Melt the age bands into long format
pop_long_df = lsoa_pop_df.melt(
    id_vars=["lsoa21cd"],
    value_vars=age_columns,
    var_name="age_band",
    value_name="population"
)

# Extract lower bound of age band (e.g., 'age_10_14' → 10)
pop_long_df["age"] = pop_long_df["age_band"].str.extract(r"age_(\d+)_?")[0].astype(int)

# Clean up
pop_long_df = pop_long_df[["lsoa21cd", "age", "population"]]

print(f"Reshaped LSOA population table: {len(pop_long_df):,} rows.")
display(pop_long_df.head())


Reshaped LSOA population table: 62,550 rows.


,lsoa21cd,age,population
0,E01020484,0,43.76
1,E01020481,0,39.69
2,E01020482,0,70.21
3,E01020479,0,33.58
4,E01020478,0,37.65


In [12]:
# --------------------------------------------------------------
# 1.3 Calculate National Population by Age Band
# --------------------------------------------------------------

national_pop_by_age = pop_long_df.groupby("age")["population"].sum().reset_index()
national_pop_by_age.rename(columns={"population": "national_population"}, inplace=True)

print("Computed national population totals by age.")
display(national_pop_by_age.head())


Computed national population totals by age.


,age,national_population
0,0,283792.24
1,5,322018.43
2,10,331415.58
3,15,324765.19
4,20,342424.75


In [13]:
# --------------------------------------------------------------
# 1.4 Merge Demand Table with National Population Totals
# --------------------------------------------------------------

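# Merge demand table with national population by age.
# Caution: `demand_df["age"]` appears to hold single years of age, while
# `national_pop_by_age["age"]` is the lower bound of a five-year band, so only
# ages 0, 5, 10, ..., 85 find a population match; other rows get NaN.
demand_rates_df = pd.merge(demand_df, national_pop_by_age, on="age", how="left")

# Calculate per-1000 rate
demand_rates_df["rate_per_1000"] = (demand_rates_df["procedure_count"] / demand_rates_df["national_population"]) * 1000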

print("Calculated national demand rates per 1,000 population.")
display(demand_rates_df.head())


Calculated national demand rates per 1,000 population.


,age,modality,referral_type,procedure_count,national_population,rate_per_1000
0,0.0,CT,Emergency,212,283792.24,0.747025
1,0.0,CT,GP,2,283792.24,0.007047
2,0.0,CT,Inpatient,443,283792.24,1.561001
3,0.0,CT,Other/Unknown,25,283792.24,0.088093
4,0.0,CT,Outpatient,103,283792.24,0.362942
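
The join in step 1.4 uses `age` as its key, but the two tables seem to define it differently: the demand table looks like single years of age (1,293 rows), while the population table uses the lower bound of each five-year band. One possible way to align them is to map each single year to its band start and sum the counts before merging. The sketch below assumes single-year ages with a top band of 85+; `to_band_start` is a hypothetical helper and the toy counts are made up.

```python
import pandas as pd


def to_band_start(age: pd.Series, width: int = 5, top: int = 85) -> pd.Series:
    """Map single years of age to the lower bound of their band (hypothetical helper)."""
    return ((age // width) * width).clip(upper=top).astype(int)


# Toy rows with made-up counts, standing in for demand_df
toy_demand = pd.DataFrame({
    "age": [0.0, 3.0, 7.0, 86.0, 99.0],
    "modality": ["CT"] * 5,
    "referral_type": ["GP"] * 5,
    "procedure_count": [2, 4, 6, 8, 10],
})

# Replace single-year ages with band lower bounds, then sum counts per band
toy_demand["age"] = to_band_start(toy_demand["age"])
banded = (
    toy_demand
    .groupby(["age", "modality", "referral_type"], as_index=False)["procedure_count"]
    .sum()
)
print(banded)
```

With the keys aligned this way, every demand row would find a population denominator covering the same ages. Rows with a missing age would need to be dropped or handled before the integer cast.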


In [14]:
# --------------------------------------------------------------
# 1.5 Apply Demand Rates to LSOA Population
# --------------------------------------------------------------

# Merge rates onto the reshaped LSOA population table
pop_with_demand = pd.merge(
    pop_long_df, 
    demand_rates_df[["age", "modality", "referral_type", "rate_per_1000"]], 
    on="age", 
    how="left"
)

# Calculate expected demand = (rate/1000) × population
pop_with_demand["expected_demand"] = (pop_with_demand["rate_per_1000"] / 1000) * pop_with_demand["population"]
pop_with_demand["expected_demand"] = pop_with_demand["expected_demand"].fillna(0)

print("Estimated expected demand per LSOA, age, modality, and referral type.")
display(pop_with_demand.head())


Estimated expected demand per LSOA, age, modality, and referral type.


,lsoa21cd,age,population,modality,referral_type,rate_per_1000,expected_demand
0,E01020484,0,43.76,CT,Emergency,0.747025,0.03269
1,E01020484,0,43.76,CT,GP,0.007047,0.000308
2,E01020484,0,43.76,CT,Inpatient,1.561001,0.068309
3,E01020484,0,43.76,CT,Other/Unknown,0.088093,0.003855
4,E01020484,0,43.76,CT,Outpatient,0.362942,0.015882
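
As a check on the arithmetic, the first row of the table above can be reproduced by hand from the values printed in steps 1.3 to 1.5. This is only a worked example using numbers copied from the outputs, not part of the pipeline.

```python
# Values copied from the first rows of the output tables above
procedure_count = 212            # CT, Emergency, age key 0
national_population = 283792.24  # population total for age key 0
lsoa_population = 43.76          # LSOA E01020484, age key 0

# Rate per 1,000 population, then expected procedures in this LSOA
rate_per_1000 = procedure_count / national_population * 1000
expected_demand = rate_per_1000 / 1000 * lsoa_population

print(f"rate_per_1000   = {rate_per_1000:.6f}")
print(f"expected_demand = {expected_demand:.6f}")
```

Both figures agree with the `rate_per_1000` and `expected_demand` columns shown for that row.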


In [15]:
# --------------------------------------------------------------
# 1.6 Aggregate LSOA Demand by Modality and Referral Type
# --------------------------------------------------------------

lsoa_demand = (
    pop_with_demand
    .groupby(["lsoa21cd", "modality", "referral_type"])["expected_demand"]
    .sum()
    .reset_index()
)

lsoa_demand.rename(columns={"lsoa21cd": "lsoa_code"}, inplace=True)

print(f"Generated final LSOA-level demand estimates: {len(lsoa_demand):,} rows.")
display(lsoa_demand.head())


Generated final LSOA-level demand estimates: 48,608 rows.


,lsoa_code,modality,referral_type,expected_demand
0,E01014014,CT,Emergency,11.229803
1,E01014014,CT,GP,3.920863
2,E01014014,CT,Inpatient,8.955509
3,E01014014,CT,Other/Unknown,0.45955
4,E01014014,CT,Outpatient,19.318491


In [16]:
# Optional: Pivot if you want 1 row per LSOA with columns for each modality/referral
pivoted = (
    lsoa_demand
    .pivot_table(index="lsoa_code", columns=["modality", "referral_type"], values="expected_demand", fill_value=0)
)

print("Pivoted final table (optional):")
display(pivoted.head())


Pivoted final table (optional):


modality,CT,CT,CT,CT,CT,Endoscopy,Endoscopy,Endoscopy,Endoscopy,MRI,MRI,MRI,MRI,MRI
referral_type,Emergency,GP,Inpatient,Other/Unknown,Outpatient,Emergency,Inpatient,Other/Unknown,Outpatient,Emergency,GP,Inpatient,Other/Unknown,Outpatient
lsoa_code,,,,,,,,,,,,,,
E01014014,11.229803,3.920863,8.955509,0.45955,19.318491,0.001837,0.180027,0.219921,0.079711,0.717785,2.935159,2.971813,0.388182,15.763421
E01014031,10.44653,3.587487,8.255651,0.413167,17.269201,0.001806,0.167783,0.197649,0.073787,0.624864,2.536208,2.651971,0.334052,13.604236
E01014032,10.387125,3.622745,8.32474,0.416288,17.69761,0.001858,0.171591,0.202716,0.075863,0.61274,2.493578,2.650065,0.329938,13.372849
E01014036,12.904122,4.284896,10.137542,0.487362,20.097921,0.002289,0.212815,0.234779,0.090963,0.680035,2.705209,3.114768,0.358972,14.547515
E01014053,11.636555,4.031506,9.28583,0.468356,19.697059,0.001929,0.189378,0.226877,0.084163,0.704516,2.866324,3.008249,0.38691,15.38321


## 2. Data Integration and Preparation

Here, we create a master DataFrame at the LSOA level. This will be our main dataset for fitting and evaluating the models.
For this exercise, we focus on **CT scans** as the example modality.

In [None]:
# This cell needs `activity_df` (record-level activity with `lsoa_code` and
# `modality` columns) and `lsoa_imd_df` (`lsoa_code`, `imd_score`). Neither is
# loaded in Section 1, so the cell has not been run.
required_frames = ("activity_df", "lsoa_imd_df", "pop_long_df")
inputs_ready = all(
    name in globals() and not globals()[name].empty for name in required_frames
)

if inputs_ready:
    MODALITY_TO_MODEL = 'CT'
    
    # --- 2.1 Aggregate Activity Data to LSOA level ---
    activity_subset = activity_df[activity_df['modality'] == MODALITY_TO_MODEL]
    lsoa_activity_counts = activity_subset.groupby('lsoa_code').size().reset_index(name='actual_demand')
    print(f"Aggregated activity for {MODALITY_TO_MODEL} into {len(lsoa_activity_counts)} LSOAs.")

    # --- 2.2 Create Master LSOA DataFrame ---
    # Start with all LSOAs from the long-format population table
    master_lsoa_df = pd.DataFrame({'lsoa_code': pop_long_df['lsoa21cd'].unique()})
    
    # Merge activity counts; LSOAs with no recorded activity get 0
    master_lsoa_df = master_lsoa_df.merge(lsoa_activity_counts, on='lsoa_code', how='left')
    master_lsoa_df['actual_demand'] = master_lsoa_df['actual_demand'].fillna(0).astype(int)

    # Merge IMD data
    master_lsoa_df = master_lsoa_df.merge(lsoa_imd_df[['lsoa_code', 'imd_score']], on='lsoa_code', how='left')
    
    # Aggregate total population for each LSOA
    lsoa_total_pop = (
        pop_long_df.groupby('lsoa21cd')['population'].sum()
        .reset_index(name='total_population')
        .rename(columns={'lsoa21cd': 'lsoa_code'})
    )
    master_lsoa_df = master_lsoa_df.merge(lsoa_total_pop, on='lsoa_code', how='left')

    # Clean up any LSOAs that might be missing from population/IMD files
    master_lsoa_df.dropna(inplace=True)
    
    print("\nMaster LSOA DataFrame created:")
    display(master_lsoa_df.head())
    master_lsoa_df.info()
else:
    print("One or more data files failed to load. Cannot proceed with integration.")

## 3. Model Implementation

### 3.1 Approach 1: Standardized Rate Model (Baseline)

The baseline is the rate calculation already carried out in steps 1.4 to 1.6: age-specific rates per 1,000 population are applied to each LSOA's population, and the results are summed by LSOA, modality, and referral type. The resulting `lsoa_demand` table is the Approach 1 output.

### 3.2 Approach 3: Parametric Distribution Modeling

In [19]:
# --------------------------------------------------------------
# 3.2 Approach 3: Parametric Age-Based Demand Model (Gamma Fit)
# --------------------------------------------------------------
from scipy import stats
import numpy as np

# --- Step 1: Prepare expanded national age data ---
age_counts = demand_df.groupby('age')['procedure_count'].sum().reset_index()

# Drop ages ≤ 0 (Gamma distribution only defined for positive x)
age_counts = age_counts[age_counts['age'] > 0]

# Expand into a synthetic national demand vector
expanded_ages = np.repeat(age_counts['age'], age_counts['procedure_count'].astype(int))

# --- Step 2: Fit Gamma distribution (with floc=0) ---
try:
    gamma_params = stats.gamma.fit(expanded_ages, floc=0)
    print("Fitted Gamma distribution parameters:", gamma_params)
except Exception as e:
    print(f"Gamma fit failed: {e}")
    gamma_params = None


Fitted Gamma distribution parameters: (6.689410918234246, 0, np.float64(9.212141974525851))
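
To read the fitted parameters, it can help to convert them into summary statistics. With `floc=0` the tuple is `(shape, loc, scale)`, and the standard Gamma formulas give the mean, mode, and standard deviation of the age-at-procedure distribution. The snippet below is a small illustration using the parameters printed above; the variable names are ours.

```python
# Parameters as printed by the fit above: (shape, loc, scale)
shape, loc, scale = 6.689410918234246, 0.0, 9.212141974525851

mean_age = loc + shape * scale        # Gamma mean = shape * scale
mode_age = loc + (shape - 1) * scale  # Gamma mode, valid when shape >= 1
sd_age = (shape ** 0.5) * scale       # Gamma standard deviation = sqrt(shape) * scale

print(f"mean age: {mean_age:.1f}, mode: {mode_age:.1f}, sd: {sd_age:.1f}")
```

The fit therefore places the centre of demand at roughly 62 years of age, with the peak of the density a little over 52.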


In [20]:
# --- Step 3: Create PDF and scale to national demand ---
ages = np.arange(0, 100)
pdf_values = stats.gamma.pdf(ages, *gamma_params)

total_demand = age_counts['procedure_count'].sum()
pdf_scaled = (pdf_values / pdf_values.sum()) * total_demand

# National population lookup from earlier steps
national_pop_dict = national_pop_by_age.set_index('age')['national_population'].to_dict()

# Create scaled parametric rate per 1000.
# The lookup is keyed by band lower bounds (0, 5, ..., 85), so the rate is
# non-zero only at those ages and is set to 0 everywhere else.
parametric_rate = [
    (pdf_scaled[i] / national_pop_dict[i]) * 1000 if i in national_pop_dict else 0
    for i in ages
]

parametric_rate_df = pd.DataFrame({
    'age': ages,
    'parametric_rate_per_1000': parametric_rate
})

# --- Step 4: Merge with LSOA population long format and estimate demand ---
lsoa_parametric = pd.merge(pop_long_df, parametric_rate_df, on='age', how='left')
lsoa_parametric['expected_demand'] = (lsoa_parametric['population'] * lsoa_parametric['parametric_rate_per_1000']) / 1000

# Aggregate to LSOA level
model3_output = (
    lsoa_parametric.groupby("lsoa21cd")["expected_demand"]
    .sum()
    .reset_index()
    .rename(columns={'lsoa21cd': 'lsoa_code', 'expected_demand': 'pred_model_3'})
)

print("Generated parametric distribution-based demand estimates.")
display(model3_output.head())


Generated parametric distribution-based demand estimates.


,lsoa_code,pred_model_3
0,E01014014,63.824857
1,E01014031,55.944334
2,E01014032,54.593078
3,E01014036,60.687746
4,E01014053,62.69899
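
Because the population table is in five-year bands, an alternative to evaluating the density at single ages is to take the probability mass of each band from the fitted CDF. The sketch below assumes the band edges implied by the population columns (0-4 through 85+) and uses a hypothetical national total purely for illustration; it is not the calculation performed in the cell above.

```python
import numpy as np
from scipy import stats

# Parameters as printed by the Gamma fit: (shape, loc, scale)
shape, loc, scale = 6.689410918234246, 0.0, 9.212141974525851

# Band edges matching the population columns: 0-4, 5-9, ..., 80-84, 85+
lower = np.arange(0, 90, 5)            # 0, 5, ..., 85
upper = np.append(lower[1:], np.inf)   # 5, 10, ..., 85, inf

# Probability mass of each band = CDF(upper edge) - CDF(lower edge)
band_mass = (
    stats.gamma.cdf(upper, shape, loc=loc, scale=scale)
    - stats.gamma.cdf(lower, shape, loc=loc, scale=scale)
)

total_demand = 1_000  # hypothetical national total, for illustration only
band_demand = band_mass * total_demand

for lo_edge, demand in zip(lower, band_demand):
    print(f"band starting at {lo_edge:>2}: {demand:7.2f}")
```

The band masses sum to one, so the whole national total is distributed across the 18 bands and each band's share can be divided by that band's population to get a rate.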


### 3.3 Approach 4: Generalized Linear Model (GLM)

In [18]:
# --------------------------------------------------------------
# 3.3 Approach 4: Generalized Linear Model (Poisson)
# --------------------------------------------------------------
import statsmodels.api as sm

# --- Step 1: Engineer LSOA-level features from age bands ---
# Bin age into broad bands
age_bands = {
    'pop_0_19': range(0, 20),
    'pop_20_39': range(20, 40),
    'pop_40_59': range(40, 60),
    'pop_60_79': range(60, 80),
    'pop_80_plus': range(80, 150),
}

# Create LSOA features by summing relevant age groups
glm_features = []
for label, band in age_bands.items():
    band_df = pop_long_df[pop_long_df['age'].isin(band)]
    band_sums = band_df.groupby('lsoa21cd')['population'].sum().reset_index(name=label)
    glm_features.append(band_sums)

# Merge all features into single DataFrame
glm_df = glm_features[0]
for df in glm_features[1:]:
    glm_df = glm_df.merge(df, on='lsoa21cd', how='outer')

glm_df.rename(columns={'lsoa21cd': 'lsoa_code'}, inplace=True)
glm_df.fillna(0, inplace=True)

# --- Step 2: Add the target variable ---
# Note: this is the Approach 1 expected demand summed over all modalities and
# referral types, not observed activity. The column keeps the name
# `actual_demand` for consistency with the evaluation cell.
observed_total = lsoa_demand.groupby('lsoa_code')['expected_demand'].sum().reset_index()
glm_df = glm_df.merge(observed_total, on='lsoa_code', how='left')
glm_df.rename(columns={'expected_demand': 'actual_demand'}, inplace=True)
glm_df['actual_demand'] = glm_df['actual_demand'].fillna(0)

# --- Step 3: Add IMD (optional, else skip this) ---
# If IMD data is available:
# glm_df = glm_df.merge(imd_df, on='lsoa_code', how='left')
# glm_df['imd_score'] = glm_df['imd_score'].fillna(glm_df['imd_score'].mean())
# Otherwise the model uses the age-band populations only

# --- Step 4: Fit Poisson GLM ---
X_cols = [col for col in glm_df.columns if col.startswith("pop_")]
X = glm_df[X_cols]
X = sm.add_constant(X)
y = glm_df['actual_demand']

glm_poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(glm_poisson.summary())

# --- Step 5: Predict and export ---
glm_df['pred_model_4'] = glm_poisson.predict(X)

model4_output = glm_df[['lsoa_code', 'pred_model_4']]
print("Generated GLM-based demand predictions.")
display(model4_output.head())


                 Generalized Linear Model Regression Results                  
Dep. Variable:          actual_demand   No. Observations:                 3472
Model:                            GLM   Df Residuals:                     3466
Model Family:                 Poisson   Df Model:                            5
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10447.
Date:                Tue, 01 Jul 2025   Deviance:                       775.50
Time:                        11:13:45   Pearson chi2:                     665.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.9719
Covariance Type:            nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const           3.1261      0.010    307.229      

,lsoa_code,pred_model_4
0,E01014014,64.912896
1,E01014031,58.617621
2,E01014032,58.597726
3,E01014036,68.508389
4,E01014053,66.074953


## 4. Model Evaluation and Comparison

In [None]:
if 'master_lsoa_df' in globals():
    # --- Attach each model's predictions to the master table ---
    # pred_model_1 is the Section 1.6 baseline summed per LSOA.
    # Scope check: as written, models 1, 3 and 4 cover all modalities, whereas
    # master_lsoa_df['actual_demand'] covers MODALITY_TO_MODEL only.
    model1_output = (
        lsoa_demand.groupby('lsoa_code')['expected_demand'].sum()
        .reset_index(name='pred_model_1')
    )
    models_to_eval = ['pred_model_1', 'pred_model_3', 'pred_model_4']
    eval_df = master_lsoa_df
    for output in (model1_output, model3_output, model4_output):
        eval_df = eval_df.merge(output, on='lsoa_code', how='left')
    eval_df = eval_df.dropna(subset=models_to_eval)

    # --- Calculate Evaluation Metrics ---
    actuals = eval_df['actual_demand']
    
    results = []
    for model_pred_col in models_to_eval:
        predictions = eval_df[model_pred_col]
        mae = mean_absolute_error(actuals, predictions)
        r2 = r2_score(actuals, predictions)
        results.append({'Model': model_pred_col, 'MAE': mae, 'R-squared': r2})

    results_df = pd.DataFrame(results).set_index('Model')
    print("--- Model Performance Summary ---")
    display(results_df)
    
    # --- Visualize Comparisons ---
    fig, axes = plt.subplots(1, 3, figsize=(21, 6), sharey=True)
    fig.suptitle('Model Predictions vs. Actual Demand', fontsize=16)

    for i, model_pred_col in enumerate(models_to_eval):
        ax = axes[i]
        sns.scatterplot(x=eval_df['actual_demand'], y=eval_df[model_pred_col], ax=ax, alpha=0.5)
        ax.set_title(f'{model_pred_col} (R²: {results_df.loc[model_pred_col, "R-squared"]:.3f})')
        ax.set_xlabel('Actual Demand')
        ax.set_ylabel('Predicted Demand')
        ax.plot([0, actuals.max()], [0, actuals.max()], 'r--', label='Perfect Fit')  # Reference line y = x
        ax.legend()
        
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
else:
    print("Master LSOA DataFrame not available. Cannot evaluate models.")
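
For reference, the two metrics used in this section can be written out by hand. The sketch below mirrors the standard definitions behind `mean_absolute_error` and `r2_score` in plain Python; the function names and the four data points are made up for illustration.

```python
def mae_manual(actual, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_manual(actual, predicted):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up demand figures for four LSOAs
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

print(f"MAE: {mae_manual(actual, predicted)}")
print(f"R-squared: {r2_manual(actual, predicted):.3f}")
```

MAE is in the same units as demand (procedures per LSOA), while R-squared is unitless and can be negative when a model does worse than predicting the mean.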

## 5. Conclusion and Recommendations

The evaluation cell in Section 4 has not been run in this version of the notebook, so the points below describe what each approach is expected to offer rather than measured performance.

**Summary of Findings:**

* **Model 1 (Standardized Rate):** Applies a single set of age-specific rates to every LSOA, so it reflects differences in age structure but cannot capture other local variation. Its R-squared serves as the baseline to beat.

* **Model 3 (Parametric Distribution):** Performance depends heavily on how well the chosen theoretical distribution (here, a Gamma) represents age-based demand. It offers a smooth alternative to noisy empirical rates but may be less accurate than a multivariable model.

* **Model 4 (GLM):** Expected to perform best once it is fitted to observed LSOA activity with IMD included, because it can then explain variance in demand that age structure alone does not. As fitted above, it uses age-band populations only and its target is the Approach 1 output, so the pseudo R-squared of 0.9719 shows how closely it reproduces the baseline, not how well it predicts observed demand.

**Final Recommendation:**

The **Generalized Linear Model (GLM - Approach 4)** is the recommended approach for forecasting LSOA-level demand, subject to refitting it against observed activity and adding IMD. Its ability to integrate multiple drivers of demand (age structure, deprivation) is what should make its predictions more accurate and nuanced than the single-driver alternatives.

The final equation from the GLM summary (`log(Expected_Demand) = ...`) can be applied directly to LSOA population and IMD forecast data to predict future demand, giving an evidence-based tool for strategic resource allocation and service planning.
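
To make the form of that equation concrete: a Poisson GLM with a log link predicts `exp(const + Σ coef × feature)`. The sketch below applies that formula by hand. The intercept is the value printed in the truncated summary; the slope values are placeholders, not the fitted coefficients (those are cut off in the output above), and the LSOA populations are made up. In the notebook, the real coefficients would come from `glm_poisson.params`.

```python
import math

# Intercept as printed in the truncated summary; the slopes are PLACEHOLDERS,
# not fitted values, and exist only to show the shape of the calculation.
coefficients = {
    "const": 3.1261,
    "pop_0_19": 0.0002,
    "pop_20_39": 0.0002,
    "pop_40_59": 0.0003,
    "pop_60_79": 0.0005,
    "pop_80_plus": 0.0010,
}


def predict_demand(features: dict) -> float:
    """Expected demand under a log link: exp(const + sum of coef * feature)."""
    linear_predictor = coefficients["const"] + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return math.exp(linear_predictor)


# Made-up age-band populations for a single LSOA
example_lsoa = {
    "pop_0_19": 400,
    "pop_20_39": 450,
    "pop_40_59": 420,
    "pop_60_79": 300,
    "pop_80_plus": 80,
}

print(f"Predicted demand: {predict_demand(example_lsoa):.2f}")
```

Because of the log link, each coefficient acts multiplicatively: adding one person to a band scales expected demand by `exp(coef)` rather than adding a fixed amount.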