### **01 - Incidence of MLTC**
#### **01C02 - Manuscript outputs - main table progression rates - PRRs**

This notebook takes the outputs of 01C01 as its inputs, calculates Progression Rate Ratios (PRRs), and adds confidence intervals to both the rates and the PRRs.

**Imports**

In [1]:
%%pyspark
# required imports
from scipy import stats
import pyspark.sql.functions as F  # noqa: N812 F401

# requires blank line after last import


In [2]:
if (!requireNamespace("PHEindicatormethods", quietly = TRUE)) {
  install.packages("PHEindicatormethods")
}
library(PHEindicatormethods)

**Parameters**

In [4]:
%%pyspark
# parameter cell
incidence_schema = ""  # "mltc_incidence_outputs_v40_20230331"
analysis_year = ""  # "2022/23"
segmentation_schema = ""  # "obh_segmentation_v40_20230331"

# optional, can be blank


In [5]:
%%pyspark
# Set parameters in Spark configuration with 'param.' prefix (for use in SQL cells)
spark.conf.set("param.incidence_schema", incidence_schema)
spark.conf.set("param.analysis_year", analysis_year)
spark.conf.set("param.segmentation_schema", segmentation_schema)


---

#### **01C02 - Manuscript outputs - main table progression rates - PRRs**


**2022/23 main descriptive table - incidence of 1+, 2+, 3+ conditions overall and by socio-demographic breakdowns**

Separate methods are used below to calculate:
- Rates and associated confidence intervals
- PRRs and associated confidence intervals

**Unpivot crude rates and calculate confidence intervals using Byar's method**

Unpivoting reshapes the results into one row per breakdown and transition, which is the format required by the `phe_rate` function from the `PHEindicatormethods` package.
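As a minimal illustration of the reshaping, the sketch below unpivots one wide row into three long rows in plain Python. The column names mirror the source table; the row values, `wide_rows`, `TRANSITIONS`, `KEYS` and `unpivot` are invented for this example and are not part of the notebook.

```python
# Toy wide-format row: column names mirror the source table, values are invented.
wide_rows = [
    {
        "gender_description": "Female",
        "breakdown_type": "Age band",
        "socio_demographic_breakdown": "20-29",
        "person_years_0": 1000.0, "incidence_0_1_plus": 50,
        "person_years_1": 400.0, "incidence_1_2_plus": 40,
        "person_years_2": 100.0, "incidence_2_3_plus": 20,
    }
]

# (transition label, person-years column, incidence column)
TRANSITIONS = [
    ("0 to 1+", "person_years_0", "incidence_0_1_plus"),
    ("1 to 2+", "person_years_1", "incidence_1_2_plus"),
    ("2 to 3+", "person_years_2", "incidence_2_3_plus"),
]
KEYS = ["gender_description", "breakdown_type", "socio_demographic_breakdown"]


def unpivot(rows):
    """Return one long-format row per (breakdown, transition)."""
    long_rows = []
    for row in rows:
        for label, py_col, inc_col in TRANSITIONS:
            long_row = {key: row[key] for key in KEYS}
            long_row["transition"] = label
            long_row["person_years"] = row[py_col]
            long_row["incidence"] = row[inc_col]
            long_rows.append(long_row)
    return long_rows


long_rows = unpivot(wide_rows)
```

Each wide row produces exactly three long rows, matching the three `UNION ALL` branches of the SQL view.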

In [5]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW unpivoted AS
SELECT    gender_description,
          breakdown_type,
          socio_demographic_breakdown,
          '0 to 1+' AS transition,
          person_years_0 AS person_years,
          incidence_0_1_plus AS incidence
FROM      ${param.incidence_schema}.mm_incidence_transitions_main_results
UNION ALL
SELECT    gender_description,
          breakdown_type,
          socio_demographic_breakdown,
          '1 to 2+' AS transition,
          person_years_1 AS person_years,
          incidence_1_2_plus AS incidence
FROM      ${param.incidence_schema}.mm_incidence_transitions_main_results
UNION ALL
SELECT    gender_description,
          breakdown_type,
          socio_demographic_breakdown,
          '2 to 3+' AS transition,
          person_years_2 AS person_years,
          incidence_2_3_plus AS incidence
FROM      ${param.incidence_schema}.mm_incidence_transitions_main_results

Convert to a SparkR DataFrame, then to an R data frame

In [6]:
df_unpivoted_r <- sql("SELECT * FROM unpivoted")


In [7]:
r_unpivoted <- collect(df_unpivoted_r)


Calculate crude rates and confidence intervals
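For readers unfamiliar with Byar's method, the following Python sketch shows the approximation for a crude rate per 100 person-years. It is an illustration only, not the `PHEindicatormethods` implementation: the function name `byars_rate_ci` is hypothetical, it assumes at least one event, and it does not reproduce any special handling the package may apply to small counts.

```python
from math import sqrt
from statistics import NormalDist


def byars_rate_ci(events, person_years, multiplier=100, confidence=0.95):
    """Crude rate with Byar's approximate confidence limits (events must be > 0)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    rate = events / person_years * multiplier
    # Byar's approximation to the Poisson limits for the observed event count
    lower_count = events * (1 - 1 / (9 * events) - z / (3 * sqrt(events))) ** 3
    upper_count = (events + 1) * (
        1 - 1 / (9 * (events + 1)) + z / (3 * sqrt(events + 1))
    ) ** 3
    lower = lower_count / person_years * multiplier
    upper = upper_count / person_years * multiplier
    return rate, lower, upper


# Invented example: 100 events over 1,000 person-years
rate, lower, upper = byars_rate_ci(events=100, person_years=1000)
```

The limits are computed on the event count and then scaled by the same denominator and multiplier as the rate, so the interval is asymmetric around the point estimate.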

In [8]:
# Calculate rates with confidence intervals
r_crude_rate_output <- phe_rate(
  data = r_unpivoted,
  x = incidence,
  n = person_years,
  multiplier = 100,
  confidence = 0.95
)

# Convert the R DataFrame to a SparkR DataFrame
df_r_crude_rate_output <- createDataFrame(r_crude_rate_output)

# Save as a temporary view
createOrReplaceTempView(df_r_crude_rate_output, "r_crude_rate_output_view")


**Crude Progression Rate Ratio (PRR) Confidence Interval Calculation**

The Progression Rate Ratio (PRR) is used to compare progression rates between two populations. It is expressed as the ratio of incidence or progression rates between two distinct groups. This section calculates PRRs and their corresponding confidence intervals.

**Methodology**

1. **Data Inputs**:
    - `events_col_1` and `events_col_2`: Number of progression events for Population 1 and Population 2, respectively.
    - `person_years_col_1` and `person_years_col_2`: Person-years of observation for Population 1 and Population 2, respectively.
    - `z_value`: Z-value for the desired confidence level, pre-calculated for efficiency (e.g. 1.96 for 95% confidence).

2. **Calculate Crude Rates (CR)**:
    - Crude rates are calculated as the number of events divided by the person-years of observation for each population.

3. **Calculate Progression Rate Ratio (PRR)**:
    - PRR is computed as the crude rate of Population 2 divided by the crude rate of Population 1, so Population 1 acts as the denominator.

4. **Estimate Variance of log(PRR)**:
    - Variance of the logarithm of PRR is estimated using:
      - *var_log_prr = 1 / events_col_1 + 1 / events_col_2*
    - This variance determines the confidence interval width.

5. **Confidence Interval Calculation**:
    - Calculate lower and upper bounds for the PRR confidence interval using:
      - *ci_lower_prr = exp(log(prr) - z_value * sqrt(var_log_prr))*
      - *ci_upper_prr = exp(log(prr) + z_value * sqrt(var_log_prr))*

**Output**

The output includes:

- Crude rates for both populations.
- Calculated Progression Rate Ratio.
- Lower and upper bounds of the PRR confidence interval.
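The methodology above can be sketched in a few lines of Python. This is an illustration on invented numbers, not the notebook's implementation (which uses SQL views); the function name `prr_with_ci` is hypothetical, and the ratio direction (Population 2 over Population 1) follows the `crude_rates` view defined later in the notebook.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def prr_with_ci(events_1, person_years_1, events_2, person_years_2, confidence=0.95):
    """Return (prr, ci_lower_prr, ci_upper_prr) using the log-normal approximation."""
    z_value = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    cr_1 = events_1 / person_years_1  # crude rate, Population 1
    cr_2 = events_2 / person_years_2  # crude rate, Population 2
    prr = cr_2 / cr_1
    var_log_prr = 1 / events_1 + 1 / events_2
    half_width = z_value * sqrt(var_log_prr)
    return prr, exp(log(prr) - half_width), exp(log(prr) + half_width)


# Invented example: 50 events / 1,000 person-years versus 40 events / 400 person-years
prr, ci_lower_prr, ci_upper_prr = prr_with_ci(50, 1000, 40, 400)
```

Because the interval is built on the log scale, it is symmetric around `log(prr)` rather than around `prr`, and both event counts must be greater than zero for the variance term to be defined.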


Restructure input data in required format

In [9]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW prr_calculation_input AS
SELECT    u0.gender_description,
          u0.breakdown_type,
          u0.socio_demographic_breakdown,
          '1 to 2+' AS transition,
          u0.person_years AS person_years_col_1,
          u0.incidence AS events_col_1,
          u1.person_years AS person_years_col_2,
          u1.incidence AS events_col_2
FROM      unpivoted u0
INNER     JOIN unpivoted u1 ON u1.gender_description = u0.gender_description
AND       u1.breakdown_type = u0.breakdown_type
AND       u1.socio_demographic_breakdown = u0.socio_demographic_breakdown
AND       u1.transition = '1 to 2+'
WHERE     u0.transition = '0 to 1+'
UNION ALL
SELECT    u0.gender_description,
          u0.breakdown_type,
          u0.socio_demographic_breakdown,
          '2 to 3+' AS transition,
          u0.person_years AS person_years_col_1,
          u0.incidence AS events_col_1,
          u2.person_years AS person_years_col_2,
          u2.incidence AS events_col_2
FROM      unpivoted u0
INNER     JOIN unpivoted u2 ON u2.gender_description = u0.gender_description
AND       u2.breakdown_type = u0.breakdown_type
AND       u2.socio_demographic_breakdown = u0.socio_demographic_breakdown
AND       u2.transition = '2 to 3+'
WHERE     u0.transition = '0 to 1+'

Set the confidence level, calculate the corresponding Z-value, and store it in the Spark configuration so that subsequent SQL cells can use it

In [10]:
%%pyspark
# Define the confidence level
confidence_level = 0.95

# Calculate the Z-value for the given confidence level
z_value = stats.norm.ppf((1 + confidence_level) / 2)

spark.conf.set("param.z_value", str(z_value))


Create a temporary view for the crude rates and variance calculation

In [11]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW crude_rates AS
SELECT    gender_description,
          breakdown_type,
          socio_demographic_breakdown,
          transition,
          events_col_1,
          person_years_col_1,
          events_col_2,
          person_years_col_2,
          (events_col_1 / person_years_col_1) AS cr_population_1,
          (events_col_2 / person_years_col_2) AS cr_population_2,
          (events_col_2 / person_years_col_2) / (events_col_1 / person_years_col_1) AS progression_rate_ratio,
          (1 / events_col_1) + (1 / events_col_2) AS var_log_prr
FROM      prr_calculation_input

Create a temporary view to calculate confidence intervals


In [12]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW prr_confidence_intervals AS
SELECT    *,
          EXP(LOG(progression_rate_ratio) - ${param.z_value} * SQRT(var_log_prr)) AS lower_confidence_interval_prr,
          EXP(LOG(progression_rate_ratio) + ${param.z_value} * SQRT(var_log_prr)) AS upper_confidence_interval_prr
FROM      crude_rates

Create final view

In [13]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW prr_results_view AS
SELECT    gender_description,
          breakdown_type,
          socio_demographic_breakdown,
          transition,
          cr_population_1,
          cr_population_2,
          progression_rate_ratio,
          lower_confidence_interval_prr,
          upper_confidence_interval_prr
FROM      prr_confidence_intervals

**Create single combined output table with small number suppression applied**
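The suppression rule used for the count columns in this table (values from 1 to 7 inclusive are replaced with `***`, everything else is cast to a string) can be expressed as a small Python helper. The function name `suppress_small_number` and its parameters are hypothetical and are shown only to make the rule explicit.

```python
def suppress_small_number(value, lower=1, upper=7, mask="***"):
    """Mask counts in [lower, upper]; otherwise return the value as a string."""
    if value is None:
        return None  # mirrors SQL, where CAST(NULL AS STRING) stays NULL
    if lower <= value <= upper:
        return mask
    return str(value)


# Invented counts showing both sides of each boundary
examples = [suppress_small_number(v) for v in (0, 1, 7, 8)]
```

Note that zero is not suppressed under this rule, because the `BETWEEN 1 AND 7` range starts at 1.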

<blockquote style="color: #D8000C; background-color: #FFD2D2; padding: 10px; border-left: 6px solid #D8000C;">
  <strong>⚠️ Warning:</strong> DROP TABLE is currently commented out, as this table does not need to be recreated each time the incidence analysis is run.
</blockquote>

In [14]:
%%sql

--DROP TABLE IF EXISTS ${param.incidence_schema}.output_01C02_incidence_results_main_table_PRRs

In [15]:
%%sql

CREATE    TABLE ${param.incidence_schema}.output_01C02_incidence_results_main_table_PRRs USING PARQUET AS
SELECT    m.gender_description,
          m.breakdown_type,
          m.socio_demographic_breakdown,
          CASE
                    WHEN m.unique_people_fy_period BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.unique_people_fy_period AS STRING)
          END AS unique_people_fy_period,
          CASE
                    WHEN m.unique_people_fy_end BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.unique_people_fy_end AS STRING)
          END AS unique_people_fy_end,
          CASE
                    WHEN m.person_years_0 BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.person_years_0 AS STRING)
          END AS person_years_0,
          CASE
                    WHEN m.incidence_0_1_plus BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.incidence_0_1_plus AS STRING)
          END AS incidence_0_1_plus,
          m.progression_rate_0_1_plus,
          r0.lowercl AS lower_cl_0_1,
          r0.uppercl AS upper_cl_0_1,
          CASE
                    WHEN m.person_years_1 BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.person_years_1 AS STRING)
          END AS person_years_1,
          CASE
                    WHEN m.incidence_1_2_plus BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.incidence_1_2_plus AS STRING)
          END AS incidence_1_2_plus,
          m.progression_rate_1_2_plus,
          r1.lowercl AS lower_cl_1_2,
          r1.uppercl AS upper_cl_1_2,
          prr1.progression_rate_ratio AS prr_1_2,
          prr1.lower_confidence_interval_prr AS lower_cl_prr_1_2,
          prr1.upper_confidence_interval_prr AS upper_cl_prr_1_2,
          CASE
                    WHEN m.person_years_2 BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.person_years_2 AS STRING)
          END AS person_years_2,
          CASE
                    WHEN m.incidence_2_3_plus BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(m.incidence_2_3_plus AS STRING)
          END AS incidence_2_3_plus,
          m.progression_rate_2_3_plus,
          r2.lowercl AS lower_cl_2_3,
          r2.uppercl AS upper_cl_2_3,
          prr2.progression_rate_ratio AS prr_2_3,
          prr2.lower_confidence_interval_prr AS lower_cl_prr_2_3,
          prr2.upper_confidence_interval_prr AS upper_cl_prr_2_3
FROM      ${param.incidence_schema}.mm_incidence_transitions_main_results m
INNER     JOIN r_crude_rate_output_view r0 ON r0.gender_description = m.gender_description
AND       r0.breakdown_type = m.breakdown_type
AND       r0.socio_demographic_breakdown = m.socio_demographic_breakdown
AND       r0.transition = '0 to 1+'
INNER     JOIN r_crude_rate_output_view r1 ON r1.gender_description = m.gender_description
AND       r1.breakdown_type = m.breakdown_type
AND       r1.socio_demographic_breakdown = m.socio_demographic_breakdown
AND       r1.transition = '1 to 2+'
INNER     JOIN r_crude_rate_output_view r2 ON r2.gender_description = m.gender_description
AND       r2.breakdown_type = m.breakdown_type
AND       r2.socio_demographic_breakdown = m.socio_demographic_breakdown
AND       r2.transition = '2 to 3+'
INNER     JOIN prr_results_view prr1 ON prr1.gender_description = m.gender_description
AND       prr1.breakdown_type = m.breakdown_type
AND       prr1.socio_demographic_breakdown = m.socio_demographic_breakdown
AND       prr1.transition = '1 to 2+'
INNER     JOIN prr_results_view prr2 ON prr2.gender_description = m.gender_description
AND       prr2.breakdown_type = m.breakdown_type
AND       prr2.socio_demographic_breakdown = m.socio_demographic_breakdown
AND       prr2.transition = '2 to 3+'
ORDER BY  CASE
                    WHEN m.gender_description = 'NA' THEN '0'
                    ELSE m.gender_description
          END,
          CASE
                    WHEN m.breakdown_type = 'NA' THEN '0'
                    ELSE m.breakdown_type
          END,
          CASE
                    WHEN m.socio_demographic_breakdown = 'NA' THEN '0'
                    ELSE m.socio_demographic_breakdown
          END

In [16]:
%%sql

SELECT    *
FROM      ${param.incidence_schema}.output_01C02_incidence_results_main_table_PRRs

Also extract the unique count of people across the whole 6-year study period, to reference in the Methods

In [5]:
%%sql

SELECT    *
FROM      ${param.incidence_schema}.mm_incidence_period_population_6_years

Finally, extract the mean, median and SD of age for 2022/23 to reference in the Results

**1. Mean and median 2022/23 period population age** (person time approach)

In [7]:
%%sql

SELECT    COUNT(DISTINCT pseudo_nhs_number) AS people,
          SUM(month_financial_year_fraction) AS person_time,
          MIN(perc_50) AS median_age,
          AVG(age_id) AS mean_age,
          STD(age_id) AS standard_deviation_age
FROM      (
          SELECT    pseudo_nhs_number,
                    f.date_id,
                    month_financial_year_fraction,
                    age_id,
                    PERCENTILE_CONT (0.5) within GROUP (
                    ORDER BY  age_id * 1.0
                    ) OVER (
                    PARTITION BY 1
                    ) AS perc_50
          FROM      ${param.segmentation_schema}.fact_model f
          INNER     JOIN ${param.segmentation_schema}.dim_date d ON d.date_id = f.date_id
          INNER     JOIN ${param.segmentation_schema}.dim_person p ON p.person_id = f.person_id
          WHERE     f.gp_id IS NOT NULL
          AND       f.age_id >= 20
          AND       d.financial_year = '${param.analysis_year}'
          ) x

**2. Mean and median 2022/23 period population age at entry to study (in selected year)** - i.e. min age for each person in 2022/23
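The difference between approaches 1 and 2 can be seen on a toy example. In the sketch below, all data and names (`person_months`, `entry_age`, `summary`) are invented: approach 1 summarises every person-month row, so people with more observed months carry more weight, while approach 2 keeps one row per person using their minimum age in the year. Sample standard deviation is used here on the assumption that it corresponds to the SQL `STD` aggregate.

```python
from statistics import mean, median, stdev

# Invented person-month rows: (person, age in that month).
# Person "a" has a birthday part-way through the year.
person_months = [("a", 40), ("a", 40), ("a", 41), ("b", 60), ("b", 60)]

# Approach 1 (person time): every person-month row contributes.
ages_person_time = [age for _, age in person_months]

# Approach 2 (age at entry): one row per person, using their minimum age.
entry_age = {}
for person, age in person_months:
    entry_age[person] = min(age, entry_age.get(person, age))
ages_at_entry = list(entry_age.values())

# (mean, median, sample standard deviation) for each approach
summary = {
    "person_time": (mean(ages_person_time), median(ages_person_time), stdev(ages_person_time)),
    "at_entry": (mean(ages_at_entry), median(ages_at_entry), stdev(ages_at_entry)),
}
```

On these toy rows the two approaches give different means and medians, which is why the notebook reports them separately.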

In [8]:
%%sql

SELECT    COUNT(DISTINCT pseudo_nhs_number) AS people,
          'N/A' AS person_time,
          MIN(perc_50) AS median_age,
          AVG(age_id) AS mean_age,
          STD(age_id) AS standard_deviation_age
FROM      (
          SELECT    pseudo_nhs_number,
                    age_id,
                    PERCENTILE_CONT (0.5) within GROUP (
                    ORDER BY  age_id * 1.0
                    ) OVER (
                    PARTITION BY 1
                    ) AS perc_50
          FROM      (
                    SELECT    pseudo_nhs_number,
                              MIN(age_id) AS age_id
                    FROM      ${param.segmentation_schema}.fact_model f
                    INNER     JOIN ${param.segmentation_schema}.dim_date d ON d.date_id = f.date_id
                    INNER     JOIN ${param.segmentation_schema}.dim_person p ON p.person_id = f.person_id
                    WHERE     f.gp_id IS NOT NULL
                    AND       f.age_id >= 20
                    AND       d.financial_year = '${param.analysis_year}'
                    GROUP BY  pseudo_nhs_number
                    ) x
          ) y

**3. Mean and median 2022/23 snapshot population at start of year** - i.e. as of 30/04/2022 (the Segmentation Dataset is a monthly dataset, so this is the first monthly entry for 2022/23)

In [9]:
%%sql

SELECT    COUNT(DISTINCT pseudo_nhs_number) AS people,
          'N/A' AS person_time,
          MIN(perc_50) AS median_age,
          AVG(age_id) AS mean_age,
          STD(age_id) AS standard_deviation_age
FROM      (
          SELECT    pseudo_nhs_number,
                    f.date_id,
                    age_id,
                    PERCENTILE_CONT (0.5) within GROUP (
                    ORDER BY  age_id * 1.0
                    ) OVER (
                    PARTITION BY 1
                    ) AS perc_50
          FROM      ${param.segmentation_schema}.fact_model f
          INNER     JOIN (
                    -- Earliest end of month snapshot within selected financial year
                    SELECT    MIN(date_id) AS date_id
                    FROM      ${param.segmentation_schema}.dim_date
                    WHERE     financial_year = '${param.analysis_year}'
                    AND       end_of_month IS TRUE
                    ) d ON d.date_id = f.date_id
          INNER     JOIN ${param.segmentation_schema}.dim_person p ON p.person_id = f.person_id
          WHERE     f.gp_id IS NOT NULL
          AND       f.age_id >= 20
          ) x