### **01 - Incidence of MLTC**
#### **01H01 - Manuscript outputs - all condition incidence by age**

**Imports**

In [1]:
# required imports

# requires blank line after last import


In [2]:
%%sparkr

if (!requireNamespace("PHEindicatormethods", quietly = TRUE)) {
  install.packages("PHEindicatormethods")
}
library(PHEindicatormethods)

**Parameter cell**

In [3]:
# parameter cell
incidence_schema = ""  # "mltc_incidence_outputs_v40_20230331"
analysis_year = ""  # "2022/23"
segmentation_schema = ""  # "obh_segmentation_v40_20230331"
pipeline_schema = ""  # "pipeline_v40_20230331"

# optional, can be blank


In [4]:
# Set parameters in Spark configuration with 'param.' prefix (for use in SQL cells)
spark.conf.set("param.incidence_schema", incidence_schema)
spark.conf.set("param.analysis_year", analysis_year)
spark.conf.set("param.segmentation_schema", segmentation_schema)
spark.conf.set("param.pipeline_schema", pipeline_schema)
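
Every parameter defaults to an empty string, so a blank value would be substituted into the SQL cells and leave a malformed table reference. The following is a minimal sketch of a check that could run before the `spark.conf.set` calls. `find_blank_params` is a hypothetical helper, not part of this notebook, and the example values are taken from the comments in the parameter cell.

```python
def find_blank_params(params):
    """Return the names of parameters whose value is empty or only whitespace."""
    return [name for name, value in params.items() if not str(value).strip()]


# Illustrative values copied from the comments in the parameter cell.
example_params = {
    "incidence_schema": "mltc_incidence_outputs_v40_20230331",
    "analysis_year": "2022/23",
    "segmentation_schema": "obh_segmentation_v40_20230331",
    "pipeline_schema": "",  # left blank on purpose to show the check
}

missing = find_blank_params(example_params)
print(missing)
```

A caller could raise an error when `missing` is non-empty, so that the notebook stops before any SQL cell runs.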


---

#### **01H - Manuscript outputs - all condition incidence by age**

This notebook extracts the overall incidence rate for each condition, split by gender and age group.

**a - Create incidence base table**

**Run time** ~13 mins

This section uses the pipeline healthy well insertions table, which contains one row per person per subsegment entry. It is used instead of the pipeline subsegment combinations table (used for other analyses in this project) because this analysis concerns new subsegment diagnoses, rather than the condition_count associated with a specific combination of subsegments.

Subsegments are restricted to those that appear in the previously defined configuration table (`config_subsegments`).

<blockquote style="color: #333333; background-color: #FFBF00; padding: 10px; border-left: 6px solid #C48800;">
  <strong>💡 TODO (EBT):</strong> Update table name and column names when refreshing to versions post-v4.0_20230331.
</blockquote>

In [5]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW all_incidence_transitions AS
SELECT    NHS_Number,
          person_id,
          subsegment,
          MIN(psc.start_date) AS start_date
FROM      ${param.pipeline_schema}.py_7_pipeline_pre_subsegment_combination psc
INNER     JOIN ${param.segmentation_schema}.dim_person p ON p.pseudo_nhs_number = psc.NHS_Number
INNER     JOIN ${param.incidence_schema}.config_subsegments cs ON cs.subsegment_name = psc.subsegment -- restrict to subsegments in config
GROUP BY  NHS_Number,
          person_id,
          subsegment
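
The `MIN(psc.start_date)` with `GROUP BY` in the query above keeps the earliest entry date for each person and subsegment. The same logic can be sketched in plain Python. The rows and subsegment names here are made up for illustration, and `first_entries` is a hypothetical helper.

```python
from datetime import date


def first_entries(rows):
    """Keep the earliest start_date for each (person_id, subsegment) pair."""
    earliest = {}
    for person_id, subsegment, start_date in rows:
        key = (person_id, subsegment)
        if key not in earliest or start_date < earliest[key]:
            earliest[key] = start_date
    return earliest


# Illustrative rows: person 1 enters the same subsegment twice.
rows = [
    (1, "Diabetes", date(2022, 6, 14)),
    (1, "Diabetes", date(2022, 9, 2)),
    (1, "Asthma", date(2022, 5, 1)),
    (2, "Diabetes", date(2023, 1, 20)),
]
result = first_entries(rows)
print(result[(1, "Diabetes")])
```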


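Join these transitions to `fact_model` to exclude rows where the person was not registered with a GP in the transition month (`gp_id IS NULL`), restrict to ages 20 and over within the analysis year, and apply age bands (ages 90 and over are grouped as `90+`).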

<blockquote style="color: #D8000C; background-color: #FFD2D2; padding: 10px; border-left: 6px solid #D8000C;">
  <strong>⚠️ Warning:</strong> DROP TABLE is currently commented out, as this table does not need to be recreated each time the incidence analysis is run.
</blockquote>

In [6]:
%%sql

--DROP TABLE IF EXISTS ${param.incidence_schema}.mm_all_incidence_transitions_age

In [7]:
%%sql

CREATE    TABLE ${param.incidence_schema}.mm_all_incidence_transitions_age USING PARQUET AS
SELECT    f.date_id AS transition_month_date_id,
          t.subsegment,
          p.person_id,
          p.pseudo_nhs_number,
          f.age_id AS age,
          CASE WHEN a.ten_year IN ('90-99','100-109','110-119') THEN '90+' ELSE a.ten_year END AS age_band,
          p.gender_description
FROM      all_incidence_transitions t
INNER     JOIN ${param.segmentation_schema}.dim_person p ON p.person_id = t.person_id
INNER     JOIN ${param.segmentation_schema}.dim_date d ON d.date = LAST_DAY(t.start_date)
INNER     JOIN ${param.segmentation_schema}.fact_model f ON f.person_id = p.person_id
AND       d.date_id = f.date_id
INNER     JOIN ${param.segmentation_schema}.dim_age a ON a.age_id = f.age_id
WHERE     f.gp_id IS NOT NULL
AND       f.age_id >= 20
AND       d.financial_year = '${param.analysis_year}'
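
Two small transformations in the query above are easy to check in isolation: the `CASE` expression that groups the oldest ten-year bands into `90+`, and the `LAST_DAY` call that maps each transition to its month-end date. This is a minimal sketch of both. The helper names are hypothetical, and the `20-29` label is assumed to follow the same format as the bands named in the query.

```python
import calendar
from datetime import date

# Ten-year bands that the query groups into a single '90+' band.
OLDEST_BANDS = {"90-99", "100-109", "110-119"}


def collapse_age_band(ten_year):
    """Mirror the CASE expression: group the oldest bands, keep the rest."""
    return "90+" if ten_year in OLDEST_BANDS else ten_year


def last_day_of_month(d):
    """Mirror LAST_DAY: return the final calendar day of d's month."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


print(collapse_age_band("100-109"))
print(last_day_of_month(date(2023, 2, 10)))
```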


**b - Create denominator person time base table**

Denominators for all condition incidence are person years spent **without** the condition.

These are calculated in two steps:
- First whole population person years are calculated
- Then condition-specific person years are calculated

Condition-specific person years are then subtracted from whole population person years for each condition, to obtain person years **without** each condition.
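
The subtraction described above can be sketched for a single age band and gender. All figures and condition names are illustrative, not results from this analysis, and `person_years_without` is a hypothetical helper.

```python
def person_years_without(whole_population, with_condition):
    """Person years at risk: whole population minus time spent with each condition."""
    return {
        condition: whole_population - years
        for condition, years in with_condition.items()
    }


# Illustrative person years for one age band and gender.
whole_population = 50_000.0
with_condition = {"Diabetes": 4_200.0, "Asthma": 6_150.0}

at_risk = person_years_without(whole_population, with_condition)
print(at_risk)
```

Each condition gets its own denominator, because time already spent with a condition cannot contribute to new diagnoses of that condition.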

Calculate person years for whole population:


**Run time** ~1 min

<blockquote style="color: #D8000C; background-color: #FFD2D2; padding: 10px; border-left: 6px solid #D8000C;">
  <strong>⚠️ Warning:</strong> DROP TABLE is currently commented out, as this table does not need to be recreated each time the incidence analysis is run.
</blockquote>

In [None]:
%%sql

--DROP TABLE IF EXISTS ${param.incidence_schema}.mm_person_years_whole_pop

In [12]:
%%sql

CREATE    TABLE ${param.incidence_schema}.mm_person_years_whole_pop USING PARQUET AS
SELECT    age_band,
          gender_description,
          SUM(person_years) AS person_years
FROM      ${param.incidence_schema}.mm_incidence_person_time
WHERE     financial_year = '${param.analysis_year}'
GROUP BY  age_band,
          gender_description
ORDER BY  age_band,
          gender_description

Calculate person years with each condition:

**Run time** ~4 mins

<blockquote style="color: #D8000C; background-color: #FFD2D2; padding: 10px; border-left: 6px solid #D8000C;">
  <strong>⚠️ Warning:</strong> DROP TABLE is currently commented out, as this table does not need to be recreated each time the incidence analysis is run.
</blockquote>

In [13]:
%%sql

--DROP TABLE IF EXISTS ${param.incidence_schema}.mm_person_years_by_subsegment

In [14]:
%%sql

CREATE    TABLE ${param.incidence_schema}.mm_person_years_by_subsegment USING PARQUET AS
SELECT    financial_year,
          subsegment_name,
          CASE WHEN a.ten_year IN ('90-99','100-109','110-119') THEN '90+' ELSE a.ten_year END AS age_band,
          p.gender_description,
          SUM(d.month_financial_year_fraction) AS person_years
FROM      ${param.segmentation_schema}.fact_model f
INNER     JOIN ${param.segmentation_schema}.dim_person p ON p.person_id = f.person_id
INNER     JOIN ${param.incidence_schema}.breakdown_subsegment_combinations_config bsc ON bsc.old_subsegment_combination_id = f.subsegment_combination_id
INNER     JOIN ${param.segmentation_schema}.dim_date d ON d.date_id = f.date_id
INNER     JOIN ${param.segmentation_schema}.dim_age a ON a.age_id = f.age_id
WHERE     f.gp_id IS NOT NULL
AND       d.financial_year = '${param.analysis_year}'
AND       f.age_id >= 20
GROUP BY  financial_year,
          subsegment_name,      
          CASE WHEN a.ten_year IN ('90-99','100-109','110-119') THEN '90+' ELSE a.ten_year END,
          p.gender_description
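
The query above builds person years by summing `month_financial_year_fraction` over person-month rows. This is a minimal sketch of that accumulation, assuming each month contributes an equal 1/12 of a year. The real column may weight months differently (for example by number of days), and the rows are illustrative.

```python
def person_years(monthly_rows):
    """Sum the year fraction carried by each person-month row."""
    return sum(fraction for _person_id, fraction in monthly_rows)


# Assumption: every month counts as one twelfth of the financial year.
MONTH_FRACTION = 1 / 12

# Person 1 is present for all 12 months, person 2 for 9 months.
rows = [(1, MONTH_FRACTION)] * 12 + [(2, MONTH_FRACTION)] * 9

total = person_years(rows)
print(round(total, 2))
```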

Subtract person years with each condition from whole population person years:

In [15]:
%%sql

CREATE    OR REPLACE TEMPORARY VIEW person_time_without_each_subsegment AS
SELECT    s.subsegment_name,
          s.age_band,
          s.gender_description,
          s.person_years AS person_years_with_subsegment,
          p.person_years AS person_years_whole_population,
          p.person_years - s.person_years AS person_years_without_subsegment
FROM      ${param.incidence_schema}.mm_person_years_by_subsegment s
INNER     JOIN ${param.incidence_schema}.mm_person_years_whole_pop p ON p.age_band = s.age_band
AND       s.gender_description = p.gender_description

**c - Combine results to obtain age and gender-specific incidence rates for each condition**

<blockquote style="color: #D8000C; background-color: #FFD2D2; padding: 10px; border-left: 6px solid #D8000C;">
  <strong>⚠️ Warning:</strong> DROP TABLE is currently commented out, as this table does not need to be recreated each time the incidence analysis is run.
</blockquote>

In [16]:
%%sql

--DROP TABLE IF EXISTS ${param.incidence_schema}.mm_all_incidence_by_subsegment_and_age

In [17]:
%%sql

CREATE    TABLE ${param.incidence_schema}.mm_all_incidence_by_subsegment_and_age USING PARQUET AS
SELECT    py.age_band,
          py.gender_description,
          py.subsegment_name AS subsegment,
          incidence,
          person_years_without_subsegment,
          (incidence * 1.0) / (person_years_without_subsegment * 1.0) * 100000 AS incidence_rate
FROM      person_time_without_each_subsegment py
LEFT      OUTER JOIN (
          SELECT    gender_description,
                    age_band,
                    subsegment,
                    COUNT(*) AS incidence 
          FROM      ${param.incidence_schema}.mm_all_incidence_transitions_age
          WHERE     gender_description NOT IN ('NOT KNOWN', 'NOT SPECIFIED')
          GROUP BY  gender_description,
                    age_band,
                    subsegment
          ) num ON py.age_band = num.age_band
AND       py.gender_description = num.gender_description
AND       py.subsegment_name = num.subsegment
WHERE     py.gender_description NOT IN ('NOT KNOWN', 'NOT SPECIFIED')

**d - Calculate confidence intervals using Byar's method**

This section uses the `phe_rate` function from the `PHEindicatormethods` package.
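
For reference, Byar's approximation for the confidence limits of a Poisson count can be written out directly. This is a sketch of the textbook formula only, not of the package's code. `phe_rate` may behave differently, for example in how it treats small counts. The function names and example figures are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def byars_limits(x, confidence=0.95):
    """Byar's approximate confidence limits for a Poisson count x (x > 0)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = x * (1 - 1 / (9 * x) - z / (3 * sqrt(x))) ** 3
    upper = (x + 1) * (1 - 1 / (9 * (x + 1)) + z / (3 * sqrt(x + 1))) ** 3
    return lower, upper


def crude_rate_with_ci(x, n, multiplier=100_000, confidence=0.95):
    """Crude rate per `multiplier` person years, with Byar's limits scaled the same way."""
    lower, upper = byars_limits(x, confidence)
    return x / n * multiplier, lower / n * multiplier, upper / n * multiplier


# Illustrative figures: 100 new diagnoses over 45,800 person years at risk.
rate, lower_cl, upper_cl = crude_rate_with_ci(x=100, n=45_800.0)
print(round(rate, 1), round(lower_cl, 1), round(upper_cl, 1))
```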

Convert to a SparkR DataFrame, then to an R DataFrame

In [5]:
%%sparkr

df_combined_r <- sql("SELECT * FROM ${param.incidence_schema}.mm_all_incidence_by_subsegment_and_age")

In [6]:
%%sparkr

r_combined <- collect(df_combined_r)

Calculate crude rate and confidence intervals

In [7]:
%%sparkr

# Calculate rates with confidence intervals
r_crude_rate_output <- phe_rate(
  data = r_combined,
  x = incidence,
  n = person_years_without_subsegment,
  multiplier = 100000,
  confidence = 0.95
)

# Convert the R DataFrame to a SparkR DataFrame
df_r_crude_rate_output <- createDataFrame(r_crude_rate_output)

# Save as a temporary view
createOrReplaceTempView(df_r_crude_rate_output, "r_crude_rate_output_view")


**Extract output with small number suppression**

<blockquote style="color: #D8000C; background-color: #FFD2D2; padding: 10px; border-left: 6px solid #D8000C;">
  <strong>⚠️ Warning:</strong> DROP TABLE is currently commented out, as this table does not need to be recreated each time the incidence analysis is run.
</blockquote>

In [None]:
%%sql

--DROP TABLE IF EXISTS ${param.incidence_schema}.output_01H01_all_incidence_by_subsegment_and_age

In [8]:
%%sql

CREATE    TABLE ${param.incidence_schema}.output_01H01_all_incidence_by_subsegment_and_age USING PARQUET AS
SELECT    age_band,
          gender_description,
          subsegment,
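          -- Suppress small numbers (1 to 7); cast so that both CASE branches are strings
          CASE
                    WHEN incidence BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(incidence AS STRING)
          END AS incidence,
          CASE
                    WHEN person_years_without_subsegment BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(person_years_without_subsegment AS STRING)
          END AS person_years_without_subsegment,
          -- The rate is suppressed whenever its numerator is suppressed
          CASE
                    WHEN incidence BETWEEN 1 AND 7  THEN '***'
                    ELSE CAST(value AS STRING)
          END AS incidence_rate,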
          lowercl AS lower_cl,
          uppercl AS upper_cl
FROM      r_crude_rate_output_view
ORDER BY  subsegment,
          gender_description,
          age_band
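
The suppression rule in the query above replaces counts from 1 to 7 with `***` and masks the rate whenever its numerator is masked. A minimal sketch of the same rule follows. The helper names are hypothetical, and zero and missing counts pass through unchanged, as they do in the SQL `CASE`.

```python
SUPPRESSED = "***"


def suppress(count, low=1, high=7):
    """Mask counts in the inclusive range [low, high]; leave others unchanged."""
    if count is not None and low <= count <= high:
        return SUPPRESSED
    return count


def suppress_row(incidence, rate):
    """Mask the rate whenever the incidence count itself is masked."""
    masked = suppress(incidence)
    return masked, (SUPPRESSED if masked == SUPPRESSED else rate)


print(suppress_row(3, 12.5))
print(suppress_row(20, 43.7))
```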

In [9]:
%%sql

SELECT    *
FROM      ${param.incidence_schema}.output_01H01_all_incidence_by_subsegment_and_age