# 06_LSOA_Demand_Forecasting_Models.ipynb

### **Objective**
To develop, test, and compare several modeling approaches for forecasting diagnostic demand (CT, MRI, Endoscopy) at the Lower Layer Super Output Area (LSOA) level. The final output will be a recommended model or equation that can be applied to LSOA-level population data to generate granular, localized demand predictions for strategic planning and resource allocation.

### **Models Implemented:**
1.  **Approach 1: Standardized Rate Model (Baseline)**
2.  **Approach 3: Parametric Distribution Modeling**
3.  **Approach 4: Generalized Linear Model (GLM)**

## 1. Setup and Data Loading

This section imports necessary libraries and loads the three required datasets:
1.  **Activity Data:** The cleaned dataset of all procedures from notebook `05`.
2.  **Population Data:** ONS population estimates at the LSOA level, by single year of age.
3.  **Deprivation Data:** Index of Multiple Deprivation (IMD) scores for each LSOA.

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error, r2_score

import warnings
warnings.filterwarnings('ignore')
sns.set(style="whitegrid")

In [None]:
# --- 1.1 Load Activity Data ---
# This assumes you have a cleaned CSV from the previous notebook.
# If not, you can adapt the loading code from notebook 05 here.
try:
    activity_df = pd.read_csv('cleaned_activity_data.csv', parse_dates=['test_date', 'dob'])
    # Ensure a 'lsoa_code' column exists. If not, you'll need to join it based on patient postcode.
    # For this example, we'll assume it exists.
    if 'lsoa_code' not in activity_df.columns:
        raise ValueError("Cleaned activity data found, but 'lsoa_code' column is missing.")
    print(f"Successfully loaded cleaned activity data: {len(activity_df)} records.")
except FileNotFoundError:
    print("Error: 'cleaned_activity_data.csv' not found. Please run notebook 05 first or provide the correct path.")
    activity_df = pd.DataFrame() # Create empty df to avoid errors
except ValueError as err:
    # Raised above for a missing 'lsoa_code' column, and by read_csv for missing date columns.
    print(f"Error: {err}")
    activity_df = pd.DataFrame()

# --- 1.2 Load LSOA Population Data ---
# This should be a tidy CSV with columns: 'lsoa_code', 'age', 'population'
try:
    lsoa_pop_df = pd.read_csv('path/to/your/lsoa_population_by_age.csv')
    print(f"Successfully loaded LSOA population data: {len(lsoa_pop_df)} records.")
except FileNotFoundError:
    print("Error: LSOA population data not found. Please provide the correct path.")
    lsoa_pop_df = pd.DataFrame()

# --- 1.3 Load LSOA Deprivation Data ---
# This should be a tidy CSV with columns: 'lsoa_code', 'imd_score', 'imd_decile'
try:
    lsoa_imd_df = pd.read_csv('path/to/your/lsoa_imd_scores.csv')
    print(f"Successfully loaded LSOA IMD data: {len(lsoa_imd_df)} records.")
except FileNotFoundError:
    print("Error: LSOA IMD data not found. Please provide the correct path.")
    lsoa_imd_df = pd.DataFrame()
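
The loaders above assume particular column names (`lsoa_code`, `age`, `population`, `imd_score`, `imd_decile`, as listed in the cell comments). A minimal sketch of a check that reports missing columns before integration is shown below. The helper name `missing_columns` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy stand-in for the population file (hypothetical values).
toy_pop = pd.DataFrame({"lsoa_code": ["E01000001"], "age": [42]})

# Any names returned here would need to be fixed before Section 2 is run.
print(missing_columns(toy_pop, ["lsoa_code", "age", "population"]))
```

The same call could be applied to `activity_df`, `lsoa_pop_df` and `lsoa_imd_df` with their respective column lists.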

## 2. Data Integration and Preparation

Here, we'll create a master DataFrame at the LSOA level. This will be our main dataset for fitting the models.
For this exercise, we will focus on **CT scans** as the example modality.

In [None]:
if not any([activity_df.empty, lsoa_pop_df.empty, lsoa_imd_df.empty]):
    MODALITY_TO_MODEL = 'CT'
    
    # --- 2.1 Aggregate Activity Data to LSOA level ---
    # .copy() so that columns added later do not trigger SettingWithCopyWarning
    activity_subset = activity_df[activity_df['modality'] == MODALITY_TO_MODEL].copy()
    lsoa_activity_counts = activity_subset.groupby('lsoa_code').size().reset_index(name='actual_demand')
    print(f"Aggregated activity for {MODALITY_TO_MODEL} into {len(lsoa_activity_counts)} LSOAs.")

    # --- 2.2 Create Master LSOA DataFrame ---
    # Start with all LSOAs from the population file
    master_lsoa_df = pd.DataFrame({'lsoa_code': lsoa_pop_df['lsoa_code'].unique()})
    
    # Merge activity counts
    master_lsoa_df = master_lsoa_df.merge(lsoa_activity_counts, on='lsoa_code', how='left')
    master_lsoa_df['actual_demand'] = master_lsoa_df['actual_demand'].fillna(0).astype(int)

    # Merge IMD data
    master_lsoa_df = master_lsoa_df.merge(lsoa_imd_df[['lsoa_code', 'imd_score']], on='lsoa_code', how='left')
    
    # Aggregate total population for each LSOA
    lsoa_total_pop = lsoa_pop_df.groupby('lsoa_code')['population'].sum().reset_index(name='total_population')
    master_lsoa_df = master_lsoa_df.merge(lsoa_total_pop, on='lsoa_code', how='left')

    # Clean up any LSOAs that might be missing from population/IMD files
    master_lsoa_df.dropna(inplace=True)
    
    print("\nMaster LSOA DataFrame created:")
    display(master_lsoa_df.head())
    master_lsoa_df.info()
else:
    print("One or more data files failed to load. Cannot proceed with integration.")
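
Because the master table starts from the population file and uses left merges, activity recorded against an `lsoa_code` that is absent from that file is silently excluded. The sketch below, on toy data with hypothetical codes, shows one way to audit this with the `indicator` option of `DataFrame.merge`. It is an optional diagnostic, not a step in the original notebook.

```python
import pandas as pd

# Toy inputs (hypothetical LSOA codes and counts).
pop_lsoas = pd.DataFrame({"lsoa_code": ["A", "B", "C"]})
activity_counts = pd.DataFrame(
    {"lsoa_code": ["B", "C", "Z"], "actual_demand": [5, 2, 7]}
)

# An outer merge with indicator=True labels each row by where it was found.
audit = pop_lsoas.merge(activity_counts, on="lsoa_code", how="outer", indicator=True)

# Activity recorded against codes that the population file does not contain.
unmatched = audit.loc[audit["_merge"] == "right_only", "lsoa_code"].tolist()
# LSOAs with population but no recorded activity (filled with 0 in the notebook).
zero_activity = audit.loc[audit["_merge"] == "left_only", "lsoa_code"].tolist()

print("Activity with no matching LSOA:", unmatched)
print("LSOAs with no activity:", zero_activity)
```

A large `unmatched` list would suggest a mismatch between the code sets used by the two files, which is worth resolving before fitting any model.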

## 3. Model Implementation

### 3.1 Approach 1: Standardized Rate Model (Baseline)

In [None]:
if 'master_lsoa_df' in locals():
    # --- Calculate National Age-Specific Demand Rates ---
    # Get national population per age
    national_pop_by_age = lsoa_pop_df.groupby('age')['population'].sum().reset_index()

    # Get national activity per age for the chosen modality
    # Drop missing ages first: astype(int) raises on NaN.
    age_int = activity_subset['age'].dropna().astype(int).rename('age')
    national_activity_by_age = age_int.groupby(age_int).size().reset_index(name='procedure_count')

    # Merge to calculate rates
    rate_df = pd.merge(national_pop_by_age, national_activity_by_age, on='age', how='left')
    rate_df['procedure_count'] = rate_df['procedure_count'].fillna(0)
    rate_df['demand_rate'] = rate_df['procedure_count'] / rate_df['population']
    
    # --- Apply Rates to Each LSOA ---
    # Merge the national rates onto the LSOA population data
    lsoa_pop_with_rates = pd.merge(lsoa_pop_df, rate_df[['age', 'demand_rate']], on='age', how='left')
    lsoa_pop_with_rates['demand_rate'] = lsoa_pop_with_rates['demand_rate'].fillna(0)
    
    # Calculate expected demand for each age group in each LSOA
    lsoa_pop_with_rates['expected_demand'] = lsoa_pop_with_rates['population'] * lsoa_pop_with_rates['demand_rate']
    
    # Sum to get total expected demand per LSOA
    model1_predictions = lsoa_pop_with_rates.groupby('lsoa_code')['expected_demand'].sum().reset_index()
    model1_predictions.rename(columns={'expected_demand': 'pred_model_1'}, inplace=True)

    # Add predictions to our master dataframe
    master_lsoa_df = pd.merge(master_lsoa_df, model1_predictions, on='lsoa_code', how='left')
    master_lsoa_df['pred_model_1'] = master_lsoa_df['pred_model_1'].fillna(0)

    print("Approach 1 (Standardized Rate) predictions generated.")
    display(master_lsoa_df[['lsoa_code', 'actual_demand', 'pred_model_1']].head())
else:
    print("Master LSOA DataFrame not available.")
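
The calculation in the cell above can be followed on a toy example. The sketch below uses two hypothetical LSOAs and two ages with made-up counts; it mirrors the same steps (national rate per age, then population times rate, summed per LSOA).

```python
import pandas as pd

# Toy population: two LSOAs, two single-year ages (hypothetical numbers).
pop = pd.DataFrame({
    "lsoa_code": ["A", "A", "B", "B"],
    "age": [30, 70, 30, 70],
    "population": [100, 50, 200, 10],
})
# Toy national procedure counts by age.
activity_by_age = pd.DataFrame({"age": [30, 70], "procedure_count": [6, 12]})

# National age-specific rate = procedures / population at that age.
national_pop = pop.groupby("age", as_index=False)["population"].sum()
rates = national_pop.merge(activity_by_age, on="age", how="left")
rates["demand_rate"] = rates["procedure_count"] / rates["population"]

# Expected demand per LSOA = sum over ages of population * national rate.
expected = pop.merge(rates[["age", "demand_rate"]], on="age", how="left")
expected["expected_demand"] = expected["population"] * expected["demand_rate"]
pred = expected.groupby("lsoa_code")["expected_demand"].sum()
print(pred)
```

Here the rates are 0.02 at age 30 and 0.2 at age 70, giving 12 expected procedures for LSOA A and 6 for LSOA B. The predictions sum to the national total of 18 by construction, so this model only redistributes observed demand according to age structure.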

### 3.2 Approach 3: Parametric Distribution Modeling

In [None]:
if 'master_lsoa_df' in locals():
    # --- Fit a distribution to the national age data ---
    # Based on notebook 05, let's assume Gamma was the best fit for total CT demand.
    age_data = activity_subset['age'].dropna()
    # With floc=0, scipy's Gamma fit requires strictly positive values, so exclude age 0.
    age_data = age_data[age_data > 0]
    
    # Fit the Gamma distribution
    params_gamma = stats.gamma.fit(age_data, floc=0) # floc=0 fixes location at 0 for age
    
    # --- Generate a Continuous Rate Function from the PDF ---
    # Create a dataframe with all possible integer ages
    ages = np.arange(0, 101)
    parametric_rate_df = pd.DataFrame({'age': ages})
    
    # Calculate the PDF for each age
    pdf_values = stats.gamma.pdf(ages, *params_gamma)
    
    # The PDF integrates to 1, so the values sampled at integer ages are normalised
    # to sum to 1 and then scaled to the total national demand.
    total_national_demand = activity_subset.shape[0]
    expected_count_by_age = (pdf_values / np.sum(pdf_values)) * total_national_demand
    
    # Divide by the national population at each age to get a rate.
    # reindex() tolerates ages absent from the population file (unlike .loc, which raises).
    pop_by_age = national_pop_by_age.set_index('age')['population'].reindex(ages)
    scaled_pdf_rate = (expected_count_by_age / pop_by_age).replace([np.inf, -np.inf], np.nan)
    parametric_rate_df['parametric_rate'] = scaled_pdf_rate.fillna(0).to_numpy()
    
    # --- Apply Parametric Rates to Each LSOA ---
    lsoa_pop_with_parametric_rates = pd.merge(lsoa_pop_df, parametric_rate_df, on='age', how='left')
    lsoa_pop_with_parametric_rates['parametric_rate'] = lsoa_pop_with_parametric_rates['parametric_rate'].fillna(0)
    
    lsoa_pop_with_parametric_rates['expected_demand'] = lsoa_pop_with_parametric_rates['population'] * lsoa_pop_with_parametric_rates['parametric_rate']
    
    model3_predictions = lsoa_pop_with_parametric_rates.groupby('lsoa_code')['expected_demand'].sum().reset_index()
    model3_predictions.rename(columns={'expected_demand': 'pred_model_3'}, inplace=True)

    master_lsoa_df = pd.merge(master_lsoa_df, model3_predictions, on='lsoa_code', how='left')
    master_lsoa_df['pred_model_3'] = master_lsoa_df['pred_model_3'].fillna(0)
    
    print("Approach 3 (Parametric Distribution) predictions generated.")
    display(master_lsoa_df[['lsoa_code', 'actual_demand', 'pred_model_3']].head())
else:
    print("Master LSOA DataFrame not available.")
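
The scaling step in the cell above turns a fitted density into age-specific rates. The sketch below reproduces that idea on synthetic data; the ages are simulated and the flat population of 1,000 per age is a hypothetical placeholder, so the numbers carry no real-world meaning.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic patient ages (hypothetical; drawn from a Gamma purely for illustration).
patient_ages = rng.gamma(shape=6.0, scale=10.0, size=5000)
patient_ages = patient_ages[(patient_ages > 0) & (patient_ages <= 100)]

# Fit a Gamma with the location fixed at 0.
shape, loc, scale = stats.gamma.fit(patient_ages, floc=0)

ages = np.arange(0, 101)
# Hypothetical flat national population of 1,000 people at each age.
pop_by_age = np.full(ages.shape, 1000.0)

# Normalise the sampled density so the age shares sum to 1,
# then spread the observed total across ages and divide by population.
pdf_values = stats.gamma.pdf(ages, shape, loc=loc, scale=scale)
age_share = pdf_values / pdf_values.sum()
total_demand = patient_ages.size
rate_by_age = age_share * total_demand / pop_by_age

# Applying the rates back to the population recovers the observed total.
recovered_total = float((rate_by_age * pop_by_age).sum())
print(recovered_total, total_demand)
```

As with the baseline, the national total is preserved; what changes is that the age profile of demand is smoothed by the fitted distribution rather than taken directly from the empirical counts.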

### 3.3 Approach 4: Generalized Linear Model (GLM)

In [None]:
if 'master_lsoa_df' in locals():
    # --- Prepare Data for GLM ---
    # We need population counts in age bands as features for each LSOA
    age_bands = {
        'pop_0_19': (0, 19),
        'pop_20_39': (20, 39),
        'pop_40_59': (40, 59),
        'pop_60_79': (60, 79),
        'pop_80_plus': (80, 150)
    }
    
    glm_df = master_lsoa_df.copy()
    
    for band_name, (age_min, age_max) in age_bands.items():
        pop_in_band = lsoa_pop_df[(lsoa_pop_df['age'] >= age_min) & (lsoa_pop_df['age'] <= age_max)]
        lsoa_band_pop = pop_in_band.groupby('lsoa_code')['population'].sum().reset_index(name=band_name)
        glm_df = pd.merge(glm_df, lsoa_band_pop, on='lsoa_code', how='left')
        glm_df[band_name] = glm_df[band_name].fillna(0)
        
    # --- Fit a Poisson GLM ---
    # Define predictors (X) and target (y)
    y = glm_df['actual_demand']
    X_cols = list(age_bands.keys()) + ['imd_score']
    X = glm_df[X_cols]
    X = sm.add_constant(X) # Add an intercept
    
    # Fit the model
    # Poisson is good for count data. NegativeBinomial is an alternative if data is overdispersed.
    poisson_glm = sm.Poisson(y, X).fit()
    print(poisson_glm.summary())
    
    # --- Generate Predictions ---
    glm_df['pred_model_4'] = poisson_glm.predict(X)
    
    # Add predictions to master dataframe
    master_lsoa_df = pd.merge(master_lsoa_df, glm_df[['lsoa_code', 'pred_model_4']], on='lsoa_code', how='left')
    master_lsoa_df['pred_model_4'] = master_lsoa_df['pred_model_4'].fillna(0)
    
    print("\nApproach 4 (GLM) predictions generated.")
    display(master_lsoa_df[['lsoa_code', 'actual_demand', 'pred_model_4']].head())
else:
    print("Master LSOA DataFrame not available.")

## 4. Model Evaluation and Comparison

In [None]:
if 'master_lsoa_df' in locals() and 'pred_model_4' in master_lsoa_df.columns:
    # --- Calculate Evaluation Metrics ---
    actuals = master_lsoa_df['actual_demand']
    models_to_eval = ['pred_model_1', 'pred_model_3', 'pred_model_4']
    
    results = []
    for model_pred_col in models_to_eval:
        predictions = master_lsoa_df[model_pred_col]
        mae = mean_absolute_error(actuals, predictions)
        r2 = r2_score(actuals, predictions)
        results.append({'Model': model_pred_col, 'MAE': mae, 'R-squared': r2})

    results_df = pd.DataFrame(results).set_index('Model')
    print("--- Model Performance Summary ---")
    display(results_df)
    
    # --- Visualize Comparisons ---
    fig, axes = plt.subplots(1, 3, figsize=(21, 6), sharey=True)
    fig.suptitle('Model Predictions vs. Actual Demand', fontsize=16)

    for i, model_pred_col in enumerate(models_to_eval):
        ax = axes[i]
        sns.scatterplot(x=master_lsoa_df['actual_demand'], y=master_lsoa_df[model_pred_col], ax=ax, alpha=0.5)
        ax.set_title(f'{model_pred_col} (R²: {results_df.loc[model_pred_col, "R-squared"]:.3f})')
        ax.set_xlabel('Actual Demand')
        ax.set_ylabel('Predicted Demand')
        ax.plot([0, actuals.max()], [0, actuals.max()], 'r--', label='Perfect Fit') # Add a reference line
        ax.legend()
        
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
else:
    print("Predictions not generated. Cannot evaluate models.")
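
For reference, the two metrics reported above can be written out directly. The sketch below uses the standard definitions (the same ones `mean_absolute_error` and `r2_score` implement) on a tiny made-up example; the helper names and values are illustrative.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average size of the prediction miss."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

actual = [10, 20, 30]
close_pred = [12, 18, 33]   # hypothetical predictions near the actuals
flat_pred = [30, 30, 30]    # hypothetical predictions that ignore variation

print(mae(actual, close_pred), r_squared(actual, close_pred))
print(mae(actual, flat_pred), r_squared(actual, flat_pred))
```

The second case gives an R-squared of -1.5: the metric is negative whenever a model predicts worse than simply using the mean of the actual values, so a negative value in the summary table is a meaningful result rather than an error.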

## 5. Conclusion and Recommendations

Based on the evaluation metrics and visualizations, we can now make a recommendation.

**Summary of Findings:**

* **Model 1 (Standardized Rate):** This model applies national age-specific rates to each LSOA's age structure, so it reflects demographic differences but no other local drivers of demand. Its R-squared and MAE serve as the baseline the other models must beat.

* **Model 3 (Parametric Distribution):** The performance of this model depends heavily on how well the chosen theoretical distribution (e.g., Gamma) truly represents the age-based demand. It offers a smooth, robust alternative to noisy empirical rates but might not be as accurate as a multivariable model.

* **Model 4 (GLM):** This model is expected to perform best, which should be confirmed by the highest R-squared and lowest MAE in the Section 4 performance summary. By incorporating both age-band populations and a key socioeconomic factor (IMD score), it can explain more of the variance in demand between LSOAs.

**Final Recommendation:**

The **Generalized Linear Model (GLM - Approach 4)** is the recommended approach for forecasting LSOA-level demand, provided the Section 4 metrics confirm its advantage. It is the only one of the three models that combines multiple drivers of demand (age structure and deprivation).

The final equation from the GLM summary (`log(Expected_Demand) = ...`) can be applied directly to LSOA population and IMD forecast data to predict future demand, giving an evidence-based input to strategic resource allocation and service planning.
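
Applying the equation amounts to exponentiating the linear predictor: the intercept plus each coefficient times its feature. The sketch below shows the mechanics with the feature names used in Section 3.3. The coefficient values and forecast rows are hypothetical placeholders, not fitted results; in practice the coefficients would come from `poisson_glm.params`.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, NOT fitted values.
params = pd.Series({
    "const": 1.0,
    "pop_0_19": 0.0002,
    "pop_20_39": 0.0001,
    "pop_40_59": 0.0003,
    "pop_60_79": 0.0008,
    "pop_80_plus": 0.0015,
    "imd_score": 0.004,
})

# Hypothetical forecast inputs for two LSOAs.
forecast = pd.DataFrame({
    "lsoa_code": ["A", "B"],
    "pop_0_19": [400, 350],
    "pop_20_39": [500, 420],
    "pop_40_59": [450, 480],
    "pop_60_79": [300, 390],
    "pop_80_plus": [80, 120],
    "imd_score": [25.0, 12.5],
})

def predict_demand(forecast_df, coefs):
    """Return exp(const + sum(coef * feature)) for each row."""
    feature_names = coefs.index.drop("const")
    features = forecast_df[feature_names].to_numpy(dtype=float)
    linear_predictor = coefs["const"] + features @ coefs[feature_names].to_numpy()
    return np.exp(linear_predictor)

forecast["predicted_demand"] = predict_demand(forecast, params)
print(forecast[["lsoa_code", "predicted_demand"]])
```

Selecting the columns by coefficient name, rather than by position, keeps the features and coefficients aligned even if the forecast file orders its columns differently.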