# Computing variable survival rates by demographic group

## Goal

Given a five-year CRC survival rate, compute the mean of the exponential distribution with that survival rate.

When time to death has an exponential distribution, there is a mathematical relationship between the survival rate and the mean of the distribution:
```
μ = -n / ln(P(t > n))
```
where `μ` is the mean of the exponential distribution, `n` is the time at which survival is assessed (e.g., 5 for a 5-year survival rate), and `P(t > n)` is the survival rate expressed as a proportion between 0 and 1.

We will take advantage of this relationship to calculate model parameters `mean_duration_clin*_dead` directly, rather than calibrating them to survival rates as in the past. This will enable us to quickly calculate survival rates separately for various demographic groups, and parameterize the model to account for cohort demographics in its simulation of CRC survival.
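As a quick sanity check on this relationship, the sketch below derives `μ` from a survival rate and then recovers the survival rate from `μ` using the exponential survival function `P(t > n) = exp(-n / μ)`. The function names are illustrative and not part of the model code; 0.931 is used only as an example rate.

```python
import math


def mean_from_survival(surv_rate: float, at_year: float) -> float:
    # Invert P(t > n) = exp(-n / mu) to get mu = -n / ln(P(t > n))
    return -at_year / math.log(surv_rate)


def survival_from_mean(mean: float, at_year: float) -> float:
    # Exponential survival function: P(t > n) = exp(-n / mu)
    return math.exp(-at_year / mean)


mu = mean_from_survival(0.931, 5)
recovered = survival_from_mean(mu, 5)
print(f"mean = {mu:.2f} years, recovered 5-year survival = {recovered:.3f}")
```

If the two functions are true inverses, the recovered rate matches the input rate up to floating-point error.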

In [None]:
import json
import shutil
from copy import deepcopy

import numpy as np

In [2]:
def mean_dist_from_surv_rate(surv_rate: float, at_year: int) -> float:
    """
    Calculates the mean of an exponential distribution with the survival rate surv_rate at year at_year.

    When time to death has an exponential distribution, there is a mathematical relationship between the
    survival rate and the mean of the distribution:

    μ = -n / ln(P(t > n))

    Where `μ` is the mean of the exponential distribution, `n` is the time at which survival is assessed
    (e.g., 5 for a 5-year survival rate), and `P(t > n)` is the survival rate. 

    Parameters
    ----------
    surv_rate : float
        The survival rate at year `at_year`.
    at_year : int
        The year at which the survival rate is assessed.
   
    Returns
    -------
    float
        The mean of the exponential distribution.
    """
    return -1 * at_year / np.log(surv_rate)

## Approach

This notebook implements the final approach from `survival_rate_adjustment.ipynb`, which explored several alternatives: calculating the time-to-CRC-death distribution means directly from relative survival rates. This avoids the complexity of working with overall survival rates and death tables.

## Step 1: population-level

Before we calculate time-to-CRC-death distribution means by demographic group, we'll use the same approach to calculate them for the overall population.

From the exploration in Step 6 of `survival_rate_adjustment.ipynb`, we know that this approach will lead to smaller distribution means, and therefore more CRC deaths, than the calibration approach. This is probably due to the dynamic explored in Step 5 of that notebook, which shows that our calibration approach underestimates CRC death.

So first, we'll run an experiment comparing the values calculated here to the calibrated values. Isolating this change shows how much it alone affects CRC death rates under various screening scenarios.

In [3]:
# 5-year survival
# Source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov)
# SEER*Stat Database: Incidence - SEER Research Data, 17 Registries, Nov 2023 Sub (2000-2021) -
# Linked To County Attributes - Time Dependent (1990-2022) Income/Rurality, 1969-2022 Counties,
# National Cancer Institute, DCCPS, Surveillance Research Program, released April 2024,
# based on the November 2023 submission.
# Includes cases diagnosed in 2013-2019

relative_survival_rates = {
    "stage1": 0.931,
    "stage2": 0.851,
    "stage3": 0.698,
    "stage4": 0.137,
}

In [4]:
overall_dist_means = {k: mean_dist_from_surv_rate(v, 5) for k, v in relative_survival_rates.items()}
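To get a feel for the resulting values, the sketch below recomputes the means with only the standard library (`math.log` in place of `np.log`) and checks that they shrink as stage advances. It is a standalone illustration, not a cell from the original analysis.

```python
import math

relative_survival_rates = {
    "stage1": 0.931,
    "stage2": 0.851,
    "stage3": 0.698,
    "stage4": 0.137,
}

# mu = -n / ln(P(t > n)) with n = 5 years
means = {stage: -5 / math.log(rate) for stage, rate in relative_survival_rates.items()}

for stage, mean in means.items():
    print(f"{stage}: {mean:.1f} years")

# Lower survival at later stages should give shorter mean time to CRC death
ordered = [means[f"stage{i}"] for i in range(1, 5)]
assert ordered == sorted(ordered, reverse=True)
```

The stage 1 mean comes out at roughly 70 years and the stage 4 mean at roughly 2.5 years, which reflects how sharply the exponential mean grows as the survival rate approaches 1.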

In [5]:
with open("crcsim/experiment/parameters.json") as f:
    default_params = json.load(f)

In [6]:
params_relative_survival = deepcopy(default_params)

for stage in range(1, 5):
    params_relative_survival[f'mean_duration_clin{stage}_dead'] = overall_dist_means[f'stage{stage}']

In [7]:
with open("crcsim/experiment/parameters_relative_survival.json", "w") as f:
    json.dump(params_relative_survival, f, indent=2)

## Step 2: by demographic group

In [8]:
# 5-year survival
# Source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov)
# SEER*Stat Database: Incidence - SEER Research Data, 17 Registries, Nov 2023 Sub (2000-2021) -
# Linked To County Attributes - Time Dependent (1990-2022) Income/Rurality, 1969-2022 Counties,
# National Cancer Institute, DCCPS, Surveillance Research Program, released April 2024,
# based on the November 2023 submission.
# Includes cases diagnosed in 2013-2019

relative_survival_rates_by_group = {
    "black_female": {
        "stage1": 0.911,
        "stage2": 0.846,
        "stage3": 0.678,
        "stage4": 0.103,
    },
    "black_male": {
        "stage1": 0.893,
        "stage2": 0.800,
        "stage3": 0.658,
        "stage4": 0.102,
    },
    "hispanic_female": {
        "stage1": 0.937,
        "stage2": 0.845,
        "stage3": 0.723,
        "stage4": 0.162,
    },
    "hispanic_male": {
        "stage1": 0.913,
        "stage2": 0.829,
        "stage3": 0.679,
        "stage4": 0.147,
    },
    "white_female": {
        "stage1": 0.931,
        "stage2": 0.863,
        "stage3": 0.695,
        "stage4": 0.149,
    },
    "white_male": {
        "stage1": 0.932,
        "stage2": 0.848,
        "stage3": 0.703,
        "stage4": 0.130,
    }
}

In [9]:
dist_means_by_group = {}

for group, rates in relative_survival_rates_by_group.items():
    dist_means_by_group[group] = {k: mean_dist_from_surv_rate(v, 5) for k, v in rates.items()}

In [10]:
params_by_demog = deepcopy(default_params)

In [11]:
# Remove the mean_duration_clin*_dead parameters for the overall population
for stage in range(1, 5):
    params_by_demog.pop(f"mean_duration_clin{stage}_dead")

# Then add the demographic-specific parameters
for group, dist_means in dist_means_by_group.items():
    for stage in range(1, 5):
        param_name = f"mean_duration_clin{stage}_dead_{group}"
        params_by_demog[param_name] = dist_means[f'stage{stage}']

In [13]:
with open("crcsim/experiment/parameters_by_demog.json", "w") as f:
    json.dump(params_by_demog, f, indent=2)
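The key swap in Step 2 can be checked on a toy dictionary. In the sketch below, `default_params` and the distribution means are hypothetical stand-ins with placeholder values, not the contents of `parameters.json` or the SEER-derived means.

```python
from copy import deepcopy

# Hypothetical stand-in for parameters.json; all values are placeholders
default_params = {f"mean_duration_clin{s}_dead": 1.0 for s in range(1, 5)}
default_params["other_param"] = 0.5

groups = [
    "black_female", "black_male",
    "hispanic_female", "hispanic_male",
    "white_female", "white_male",
]
# Placeholder means: one value per group and stage
dist_means_by_group = {g: {f"stage{s}": float(s) for s in range(1, 5)} for g in groups}

params_by_demog = deepcopy(default_params)
# Drop the four population-level parameters
for s in range(1, 5):
    params_by_demog.pop(f"mean_duration_clin{s}_dead")
# Add one parameter per group and stage
for g, means in dist_means_by_group.items():
    for s in range(1, 5):
        params_by_demog[f"mean_duration_clin{s}_dead_{g}"] = means[f"stage{s}"]

new_keys = [k for k in params_by_demog if k.startswith("mean_duration_clin")]
print(len(new_keys))
```

With six groups and four stages, the four population-level keys are replaced by 24 group-specific keys, and unrelated parameters are left untouched.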

## Preserving exact formatting

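The `json.dump` calls above rewrite each file from scratch, which can change the formatting relative to `parameters.json`. To keep diffs against `parameters.json` minimal, let's create functions that copy it and make targeted, line-level replacements instead.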
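To illustrate why a JSON round trip does not preserve formatting, the sketch below parses a small hand-formatted document and serializes it again. The template text is a made-up example, not an excerpt from `parameters.json`.

```python
import json

# Hypothetical hand-formatted text: an inline array and a trailing zero
template_text = '{\n  "a": [1, 2, 3],\n  "b": 0.50\n}\n'

params = json.loads(template_text)
rewritten = json.dumps(params, indent=2) + "\n"

print(rewritten)

# The parsed content is identical, but the text on disk would differ
same_content = json.loads(rewritten) == params
same_text = rewritten == template_text
print(f"same content: {same_content}, same text: {same_text}")
```

With `indent=2`, `json.dumps` puts each array element on its own line and writes `0.50` as `0.5`, so the content survives but the layout does not.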

In [None]:
def create_relative_survival_params_file(values, output_file, template_file="crcsim/experiment/parameters.json"):
    """
    Create parameters_relative_survival.json by simply replacing the 4 mean_duration_clin*_dead values
    while preserving the exact formatting.
    
    Parameters
    ----------
    values : dict
        Dictionary containing the new mean_duration_clin*_dead values
    output_file : str
        Output file path
    template_file : str, optional
        Template file path, by default parameters.json
    """
    # First, copy the template file to the output file
    shutil.copyfile(template_file, output_file)
    
    # Read the output file
    with open(output_file, 'r') as f:
        lines = f.readlines()
    
    # Replace the mean_duration_clin*_dead values
    for i, line in enumerate(lines):
        for stage in range(1, 5):
            pattern = f'"mean_duration_clin{stage}_dead":'
            if pattern in line:
                # Extract the parts before and after the value
                before_value = line.split(':', 1)[0] + ': '
                after_value = ',' if line.strip().endswith(',') else ''
                
                # Replace with the new value
                new_value = values[f'stage{stage}']
                lines[i] = f"{before_value}{new_value}{after_value}\n"
                break
    
    # Write the modified file
    with open(output_file, 'w') as f:
        f.writelines(lines)
    
    print(f"Relative survival parameters written to {output_file} with formatting preserved")

In [28]:
def create_demographic_params_file(demo_values, output_file, template_file="crcsim/experiment/parameters.json"):
    """
    Create parameters_by_demog.json by replacing the mean_duration_clin*_dead parameters
    with demographic-specific ones while preserving the exact formatting.
    
    Parameters
    ----------
    demo_values : dict
        Dictionary of demographic group -> stage -> value
    output_file : str
        Output file path
    template_file : str, optional
        Template file path, by default parameters.json
    """
    # First, copy the template file to the output file
    shutil.copyfile(template_file, output_file)
    
    # Read the output file
    with open(output_file, 'r') as f:
        lines = f.readlines()
    
    # Step 1: Find the mean_duration_clin*_dead lines and their indentation
    stage_lines = {}
    indent = "  "
    for i, line in enumerate(lines):
        for stage in range(1, 5):
            pattern = f'"mean_duration_clin{stage}_dead":'
            if pattern in line:
                stage_lines[stage] = i
                # Extract indentation
                indent = line[:len(line) - len(line.lstrip())]
                break
    
    # Step 2: Replace the first stage with demographic parameters for all stages
    if 1 in stage_lines:
        # Get the index of the first stage line
        first_line = stage_lines[1]
        
        # Create lines for the demographic parameters for all stages
        demo_lines = []
        groups = sorted(demo_values.keys())
        
        for stage in range(1, 5):
            for group in groups:
                param_name = f"mean_duration_clin{stage}_dead_{group}"
                value = demo_values[group][f'stage{stage}']
                # Add comma to all lines
                demo_lines.append(f"{indent}\"{param_name}\": {value},\n")
        
        # Remove the last comma from the last line if needed
        if not lines[stage_lines[max(stage_lines.keys())]].strip().endswith(','):
            demo_lines[-1] = demo_lines[-1].replace(',\n', '\n')
        
        # Replace all 4 stage lines with the demographic parameters
        # Use the first line's position and remove the other 3
        lines[first_line:first_line+1] = demo_lines
        
        # Remove all the old stage parameters (the indices will change after inserting new lines)
        # So we need to rebuild the lines list and search again
        output_text = ''.join(lines)
        lines = output_text.split('\n')
        
        # Find and remove each original mean_duration_clin*_dead parameter
        # (working backward to avoid index issues)
        for i in range(len(lines)-1, -1, -1):
            line = lines[i]
            if any(f'"mean_duration_clin{stage}_dead":' in line for stage in range(1, 5)):
                lines.pop(i)
        
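        output_text = '\n'.join(lines)
    else:
        # Without a mean_duration_clin1_dead line there is no anchor for the new parameters
        raise ValueError(f"mean_duration_clin1_dead not found in {template_file}")

    # Write the modified file
    with open(output_file, 'w') as f:
        f.write(output_text)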
    
    print(f"Demographic parameters written to {output_file} with formatting preserved")

In [29]:
# Create the relative survival parameters file
create_relative_survival_params_file(overall_dist_means, "crcsim/experiment/parameters_relative_survival.json")

Relative survival parameters written to crcsim/experiment/parameters_relative_survival.json with formatting preserved


In [30]:
# Create the demographic parameters file
create_demographic_params_file(dist_means_by_group, "crcsim/experiment/parameters_by_demog.json")

Demographic parameters written to crcsim/experiment/parameters_by_demog.json with formatting preserved
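Because these functions edit the file as text, it is worth confirming that the result is still valid JSON and that only the intended keys differ from the template. The sketch below shows one way to do that; `diff_params` is a hypothetical helper, and the demonstration uses throwaway files with placeholder values rather than the real parameter files.

```python
import json
import tempfile
from pathlib import Path


def diff_params(template_file, output_file):
    """Return keys removed, added, and changed between two JSON parameter files."""
    # json.load raises an error if either file is not valid JSON
    with open(template_file) as f:
        before = json.load(f)
    with open(output_file) as f:
        after = json.load(f)
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(k for k in set(before) & set(after) if before[k] != after[k])
    return removed, added, changed


# Demonstration on throwaway files with placeholder values
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "template.json"
    output = Path(tmp) / "output.json"
    template.write_text('{"mean_duration_clin1_dead": 1.0, "other_param": 2}')
    output.write_text('{"mean_duration_clin1_dead": 3.5, "other_param": 2}')
    removed, added, changed = diff_params(template, output)

print(removed, added, changed)
```

Applied to the files written above, the relative survival file would be expected to show only the four `mean_duration_clin*_dead` keys as changed, and the demographic file would show those four keys removed and the group-specific keys added.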
